Removing duplicate lines with awk

I often need to manipulate text files in various ways, and the powerful awk/gawk program has been invaluable, saving me many hours of repetitive work. The latest one-liner I used, though, left me stumped as to how it actually works. Very simply, I had an unsorted list of several hundred email addresses and needed to remove the duplicates from the file. This was the solution I found on a number of websites…

awk '!x[$0]++' mail-list.txt

It looked good and worked great, but nowhere could I find an explanation of what it’s actually doing until I came across this thread on unix.com.

Sorting is not necessary. All it does is create an (associative) array element with the entire line as the index, without a value (or 0 if you will). The exclamation mark negates that value, so the outcome is 1 (true). A value of 1 in awk means perform the default action, which is {print $0}, so the entire line gets printed. Afterwards the ++ comes into action and 1 is added to the array value, which now becomes 1. The next time the same line is encountered, the value returned by the array is 1, which is then negated to 0 by the exclamation mark, so nothing gets printed.
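To check my understanding of the explanation above, here is a long-form sketch of what I believe the one-liner is doing. The function name dedupe, the array name seen and the sample addresses are my own illustrative choices and not part of the original command.

```shell
# Long-form sketch of: awk '!x[$0]++'
# "dedupe" and "seen" are hypothetical names chosen for readability.
dedupe() {
  awk '{
    if (!seen[$0]) {   # first occurrence: seen[$0] is unset, which awk treats as 0
      print $0         # the default action the one-liner relies on
    }
    seen[$0]++         # count this line so later copies are skipped
  }' "$@"
}

# Small demonstration with made-up addresses
printf '%s\n' bob@example.com alice@example.com bob@example.com | dedupe
```

Each address should be printed only the first time it appears, in its original order, which is why no sorting is needed. The function also accepts a file name, for example dedupe mail-list.txt.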

Thanks to Scrutinizer on unix.com

One thought on “Removing duplicate lines with awk”

  1. I think my brain just melted. However I think I do get it, and have bookmarked this very useful command for future reference. I can think of quite a few times where this could have saved a bit of time AND prevented me from having to learn about pivot tables…
