Awk-ing A Lot of Email Addresses
What statistics can you gather from thousands of email addresses? You may want to find the top 5 domains, and you can do it all in UNIX with a few one-liners.
All these one-liners were tested on Solaris 10. One thing I want to point out is that the awk field separator (FS) does not honour what is documented in the man page: "FS: input field separator regular expression (default blank and tab)". As the output below shows, awk does not split on the regular expression and reports a single field, whereas nawk works perfectly well.
$ echo "chihungchan@somewhere.com.sg" | awk -F"[@.]" '{print NF}'
1
$ echo "chihungchan@somewhere.com.sg" | awk 'BEGIN{FS="[@.]"}{print NF}'
1
$ echo "chihungchan@somewhere.com.sg" | nawk -F"[@.]" '{print NF}'
4
$ echo "chihungchan@somewhere.com.sg" | nawk 'BEGIN{FS="[@.]"}{print NF}'
4
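To see what those four fields actually are, here is a small sketch that prints each one. It picks nawk when it is installed and falls back to awk otherwise, assuming that fallback awk accepts a regular expression as FS (POSIX awk does). The AWK variable and the show_fields function are names I made up for illustration.

```shell
# Prefer nawk where it exists (e.g. Solaris 10); otherwise assume the
# system awk accepts a regular expression as FS, as POSIX awk does.
if command -v nawk >/dev/null 2>&1; then
    AWK=nawk
else
    AWK=awk
fi

# show_fields (hypothetical helper): print each field of one address,
# splitting on "@" and "." just like the examples above.
show_fields() {
    printf '%s\n' "$1" | "$AWK" -F'[@.]' '{ for (i = 1; i <= NF; i++) print i ": " $i }'
}

show_fields "chihungchan@somewhere.com.sg"
```

The last field is the top-level domain and the second-to-last is the level below it, which is what the domain counts rely on.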
Back to the subject of finding the top 5 domain names:
$ nawk -F"@" '{++s[$2]}END{for(i in s){print i,s[i]}}' lots_of_emails.txt | sort -n -k 2 | tail -5
singnet.com.sg 83
yahoo.com.sg 137
yahoo.com 148
gmail.com 197
hotmail.com 221

$ nawk -F"[@.]" '{domain=sprintf("%s.%s",$(NF-1),$NF);++s[domain]}END{for(i in s){print i,s[i]}}' lots_of_emails.txt | sort -n -k 2 | tail -5
net.sg 50
yahoo.com 148
gmail.com 197
hotmail.com 221
com.sg 265

$ nawk -F"[@.]" '{++s[$NF]}END{for(i in s){print i,s[i]}}' lots_of_emails.txt | sort -n -k 2 | tail -5
id 6
net 7
my 8
sg 345
com 702
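If you run this kind of count often, the first one-liner can be wrapped in a shell function. The following is only a sketch: top_domains is a hypothetical name, it assumes a POSIX-style awk (on Solaris 10 substitute nawk or /usr/xpg4/bin/awk), and unlike the one-liners above it lowercases the domain and skips lines that do not contain exactly one "@". Both of those are my own additions.

```shell
# top_domains FILE [N] (hypothetical helper): print the N most common
# domains in FILE (one address per line), most frequent first. N defaults to 5.
# Assumes a POSIX-style awk; on Solaris 10 use nawk or /usr/xpg4/bin/awk instead.
top_domains() {
    awk -F'@' '
        # Count only lines with exactly one "@"; fold case so that
        # Gmail.com and gmail.com land in the same bucket (my assumption).
        NF == 2 { ++count[tolower($2)] }
        END     { for (d in count) print d, count[d] }
    ' "$1" | sort -k2,2nr -k1,1 | head -n "${2:-5}"
}
```

Calling it as top_domains lots_of_emails.txt 5 lists the five busiest domains in descending order, so there is no need for tail.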
AWK is really powerful. A good place to start is the book sed & awk from O'Reilly.