Thursday, November 19, 2009

The AWK Way

Today I was given the task of converting a few hundred files (743 to be exact) into CSV format. Each filename consists of a hostname followed by a fixed suffix, and the content lists all the local user names, one per line. The task is to put them in rows, with the hostname in the 1st column and the usernames from the 2nd column onwards. One more requirement is to exclude a few users from the output.

My initial solution was very much Unix shell script-based. Although this is a one-off 'throw-away' solution, it was pretty inefficient because there is a lot of process creation within a for loop. It took 1 min 39.453 sec.

After some thought, I reckoned it should be possible to do it efficiently in just AWK. With the help of built-in variables such as FILENAME, NR and FNR, we can process all the input files within a single AWK script. The a.awk script in this post works in Cygwin. Its runtime is 2.797 sec, which is about 35 times faster!

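
To give a feel for where the time goes in a loop-based approach, here is a rough sketch of what such a script might look like. This is not my original script and not the a.awk listing; the scratch directory, the sample files and the names work_dir and loop_csv are made up purely for illustration.

```shell
# Hypothetical sample input in a scratch directory (not the real data).
work_dir=$(mktemp -d)
printf 'usera\nuserx\nuserb\n' > "$work_dir/host1_root.txt"
printf 'usery\nuserc\n'        > "$work_dir/host2_root.txt"

# Loop-based conversion: every file costs several new processes
# (basename, grep, paste), which adds up over hundreds of files.
loop_csv=$(
    cd "$work_dir" || exit 1
    for f in *_root.txt; do
        # strip the fixed suffix to get the hostname
        host=$(basename "$f" _root.txt)
        # drop the excluded users, then join the rest with commas
        users=$(grep -Ev '^(userx|usery|userz)$' "$f" | paste -sd, -)
        printf '%s,%s\n' "$host" "$users"
    done
)
printf '%s\n' "$loop_csv"
rm -rf "$work_dir"
```

With three commands started per file, 743 files means a couple of thousand short-lived processes, and process creation is comparatively slow under Cygwin.
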
$ ls *txt
host1_root.txt  host2_root.txt  host3_root.txt  host4_root.txt

$ paste *txt
usera   usere   userm   userx
userb   userx   userx   userw
userc   userf   usern   usery
userd   userg   usero   userz
userdx  usery   userp
userdy  userh   userx
        userz   userq
        useri   userqx
        userj   userr
        userk   userz
        userl   users
        userx   usert
                usery

$ cat a.awk
#! /usr/bin/awk -f


BEGIN {
        suffix="_root.txt"
        len=length(suffix)
}
#
# print a newline before the first line of each input file, except the first file
FNR==1 && NR>1 {
        printf("\n")
}
#
# print hostname
FNR==1 {
        # awk strings are 1-indexed, so start at 1 (a start of 0 is not portable)
        host=substr(FILENAME, 1, length(FILENAME)-len)
        printf("%s", host)
}
#
# print users, but exclude certain users
$0 !~ /^(userx|usery|userz)$/ {
        printf(",%s", $0)
}
#
# terminate the last row with a newline
END {
        if (NR > 0) printf("\n")
}


$ ./a.awk *.txt
host1,usera,userb,userc,userd,userdx,userdy
host2,usere,userf,userg,userh,useri,userj,userk,userl
host3,userm,usern,usero,userp,userq,userqx,userr,users,usert
host4,userw
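
To see how the built-in variables behave across several input files, a small trace like the one below can help. It is only a sketch: the scratch directory, the two sample files and the names demo_dir and trace are made up for the demonstration.

```shell
# Hypothetical demo files, created in a scratch directory (not the real data).
demo_dir=$(mktemp -d)
printf 'usera\nuserb\n' > "$demo_dir/host1_root.txt"
printf 'userc\n'        > "$demo_dir/host2_root.txt"

# NR counts records across all files; FNR restarts at 1 for each new file;
# FILENAME is the name of the file currently being read.
trace=$(cd "$demo_dir" && awk '{ print FILENAME, NR, FNR, $0 }' host1_root.txt host2_root.txt)
printf '%s\n' "$trace"
rm -rf "$demo_dir"
```

The three columns after the filename make the pattern visible: NR keeps climbing (1, 2, 3) while FNR drops back to 1 at the start of the second file, which is why FNR==1 marks the start of a new host and NR>1 rules out the very first line.
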

1 Comment:

Blogger Raymond Tay said...

Sounds like a job for Python!

6:14 PM  
