Wednesday, July 18, 2007

Awk-ing A Lot of Email Addresses

What statistics can you gather from thousands of email addresses? You may want to find the top 5 domains, and you can do it all in UNIX with these one-liners.

All these one-liners were tested on Solaris 10. One thing I want to point out is that awk's field separator (FS) does not behave as documented in the man page: "FS: input field separator regular expression (default blank and tab)". With awk, a regular expression in FS is not honoured and the whole line comes back as a single field. However, nawk works perfectly well.

$ echo "chihungchan@somewhere.com.sg" | awk -F"[@.]" '{print NF}'
1
$ echo "chihungchan@somewhere.com.sg" | awk 'BEGIN{FS="[@.]"}{print NF}'
1
$ echo "chihungchan@somewhere.com.sg" | nawk -F"[@.]" '{print NF}'
4
$ echo "chihungchan@somewhere.com.sg" | nawk 'BEGIN{FS="[@.]"}{print NF}'
4
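If a script has to run on machines where you are not sure which awk you will get, one option is to probe for an implementation that accepts a regular expression in FS before using it. The following is only a sketch: the helper name pick_awk and the list of candidates are my own assumptions (including /usr/xpg4/bin/awk as the location of the XPG4 awk on Solaris), so adjust them to your system.

```shell
# pick_awk: hypothetical helper that prints the first awk-like command
# whose -F option treats "[@.]" as a regular expression.
pick_awk() {
    # Candidate list is an assumption; reorder or extend as needed.
    for candidate in nawk /usr/xpg4/bin/awk gawk mawk awk; do
        if command -v "$candidate" >/dev/null 2>&1; then
            # "a@b.c" has 3 fields only if both @ and . act as separators.
            n=$(echo "a@b.c" | "$candidate" -F'[@.]' '{print NF}' 2>/dev/null)
            if [ "$n" = "3" ]; then
                echo "$candidate"
                return 0
            fi
        fi
    done
    return 1
}

if AWK=$(pick_awk); then
    echo "chihungchan@somewhere.com.sg" | "$AWK" -F'[@.]' '{print NF}'
else
    echo "no awk with regular-expression FS found" >&2
fi
```

The probe costs one extra process per candidate, which is negligible next to the time spent reading thousands of addresses.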

Back to the subject of finding the top 5 domain names:

$ nawk -F"@" '{++s[$2]}END{for(i in s){print i,s[i]}}' lots_of_emails.txt | sort -n -k 2 | tail -5
singnet.com.sg 83
yahoo.com.sg 137
yahoo.com 148
gmail.com 197
hotmail.com 221
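The same count can be wrapped in a small function that also folds the domain to lower case (so Gmail.com and gmail.com are counted together) and skips lines that do not contain exactly one "@". This is a sketch under my own assumptions, not the command used for the numbers above: top_domains is a hypothetical name, the sample addresses are made up, and it assumes an awk that provides tolower() (nawk or any POSIX awk). It sorts in descending order and uses head instead of tail.

```shell
# top_domains FILE [N]: print the N most frequent domains (default 5),
# most frequent first. Hypothetical helper for illustration only.
top_domains() {
    awk -F'@' '
        NF == 2 { ++count[tolower($2)] }          # skip malformed lines
        END { for (d in count) print d, count[d] }
    ' "$1" | sort -k2,2nr -k1,1 | head -n "${2:-5}"
}

# Made-up sample data, one address per line.
sample=$(mktemp)
printf '%s\n' a@gmail.com b@Gmail.com c@hotmail.com not-an-address \
    d@yahoo.com e@gmail.com f@hotmail.com > "$sample"

top_domains "$sample" 2
# gmail.com 3
# hotmail.com 2

rm -f "$sample"
```

The second sort key (-k1,1) only breaks ties by name, so the output order is stable from run to run.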

$ nawk -F"[@.]" '{domain=sprintf("%s.%s",$(NF-1),$NF);++s[domain]}END{for(i in s){print i,s[i]}}' lots_of_emails.txt | sort -n -k 2 | tail -5
net.sg 50
yahoo.com 148
gmail.com 197
hotmail.com 221
com.sg 265

$ nawk -F"[@.]" '{++s[$NF]}END{for(i in s){print i,s[i]}}' lots_of_emails.txt | sort -n -k 2 | tail -5
id 6
net 7
my 8
sg 345
com 702
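The last two one-liners differ only in how many trailing labels of the domain they keep, so they can be folded into one function that takes that number as a parameter. Again this is just a sketch: suffix_counts is a hypothetical name, the sample addresses are made up, and it assumes an awk with tolower() and -v (nawk or any POSIX awk).

```shell
# suffix_counts FILE N: count addresses by the last N labels of the domain
# (N=1 gives com, sg, ...; N=2 gives com.sg, yahoo.com, ...).
# Hypothetical helper for illustration only.
suffix_counts() {
    awk -F'@' -v n="$2" '
        NF == 2 {
            k = split(tolower($2), label, ".")    # k = number of labels
            start = (k > n + 0) ? k - n + 1 : 1   # keep at most n labels
            key = label[start]
            for (i = start + 1; i <= k; i++) key = key "." label[i]
            ++count[key]
        }
        END { for (key in count) print key, count[key] }
    ' "$1" | sort -k2,2n -k1,1
}

# Made-up sample data, one address per line.
sample=$(mktemp)
printf '%s\n' a@yahoo.com.sg b@yahoo.com c@singnet.com.sg d@gmail.com > "$sample"

suffix_counts "$sample" 2 | tail -5
# gmail.com 1
# yahoo.com 1
# com.sg 2

rm -f "$sample"
```

Splitting on "@" first and then on "." means dots in the local part of the address never take part in the split.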

AWK is really powerful. You may want to read sed & awk from O'Reilly to start with.
