
Removing duplicate lines with awk

I often need to manipulate text files in various ways, and the powerful awk/gawk program has proved invaluable, saving me many hours of repetitive work. The latest one-liner I used, though, left me stumped as to how it actually works. I had an unsorted list of several hundred email addresses and needed to remove the duplicates from the file. This was the solution I found on a number of websites:

awk '!x[$0]++' mail-list.txt

It looked good and worked great, but nowhere could I find an explanation of what it’s actually doing until I came across this thread on unix.com:

Sorting is not necessary. All it does is create an (associative) array element with the entire line as the index and no value (or 0, if you will). The exclamation mark negates that value, so the outcome is 1 (true). A true pattern with no action in awk means perform the default action, which is {print $0}, so the entire line gets printed. Afterwards the ++ comes into action and 1 is added to the array value, which now becomes 1. The next time the same line is encountered, the value returned by the array is 1, which is negated to 0 by the exclamation mark, so nothing gets printed.

Thanks to Scrutinizer on unix.com
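
To convince myself, it helped to write the same logic out long-hand. What follows is only a sketch of the explanation above, not a replacement for the one-liner: the function name dedupe, the array name seen and the sample addresses are all made up for illustration.

```shell
# Hypothetical helper: the one-liner '!x[$0]++' spelled out step by step.
dedupe() {
    awk '{
        # An array element that has never been set compares equal to 0,
        # so this test is true only the first time a given line appears.
        if (seen[$0] == 0) {
            print $0
        }
        # Count the occurrence so later copies of the line are skipped.
        seen[$0]++
    }' "$@"
}

# Made-up sample input: the second a@example.com is a duplicate.
printf '%s\n' a@example.com b@example.com a@example.com | dedupe
```

Run as is, this prints a@example.com followed by b@example.com, in the order they first appeared. Like the one-liner, it accepts a file name too (dedupe mail-list.txt), and because it remembers every distinct line in memory, the input never has to be sorted.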
