Monday, February 25, 2008

Paste, My Way

In UNIX, the paste command joins files horizontally, line by line. In certain cases, however, you may want to join them so that the contents of both files are aligned on a particular field (keyword).

For example, we have two files, a.txt and b.txt (they could just as well be the output of two sets of commands):

$ cat a.txt
2008-01-02 12
2008-01-06 31
2008-01-08 9
2008-01-09 41
2008-01-10 48
2008-01-12 28

$ cat b.txt
2008-01-02 43
2008-01-05 78
2008-01-09 23
2008-01-11 33
2008-01-12 39
2008-01-13 11

What you want is an output like this:

2008-01-02      12      43
2008-01-05              78
2008-01-06      31
2008-01-08      9
2008-01-09      41      23
2008-01-10      48
2008-01-11              33
2008-01-12      28      39
2008-01-13              11

but paste gives you this:

$ paste a.txt b.txt
2008-01-02 12   2008-01-02 43
2008-01-06 31   2008-01-05 78
2008-01-08 9    2008-01-09 23
2008-01-09 41   2008-01-11 33
2008-01-10 48   2008-01-12 39
2008-01-12 28   2008-01-13 11

What you can do is run the two outputs in a sub-shell and prefix each line with a unique tag: f1 for the output of a.txt and f2 for the output of b.txt. Once both tagged outputs are fed to a single awk, the tag is the first field, the key (in this case the date) is the second, and the value is the third, so awk can tell the two outputs apart.

In the final awk, I use 3 associative arrays: f0 to store the key field, f1 for the a.txt output, and f2 for the b.txt output. In the END block, I loop through all the keys in f0 and print the matching entries of f1 and f2; a key that is missing from one file simply prints as an empty column. To get tab-separated output, I set OFS (output field separator) to tab. Bear in mind that awk's for (i in array) loop visits the keys in no particular order, which is why we need to pipe the output to sort.

$ cat ab.sh
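#!/bin/sh

# Tag every line with its source (f1 = a.txt, f2 = b.txt),
# then merge both tagged streams in a single awk.
(
 awk '{print "f1", $0}' a.txt;
 awk '{print "f2", $0}' b.txt;
) | awk '
BEGIN {
        OFS="\t"                # tab-separated output
}
{
        # $1 = tag, $2 = key (date), $3 = value
        f0[$2]=1                # remember every key seen
        if ( $1 == "f1" ) {
                f1[$2]+=$3
        }
        if ( $1 == "f2" ) {
                f2[$2]+=$3
        }
}
END {
        # Keys come out in no particular order; sort afterwards.
        for ( i in f0 ) {
                print i, f1[i], f2[i]
        }
}'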

$ ./ab.sh | sort -k 1,1
2008-01-02      12      43
2008-01-05              78
2008-01-06      31
2008-01-08      9
2008-01-09      41      23
2008-01-10      48
2008-01-11              33
2008-01-12      28      39
2008-01-13              11
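
As a variation on the same idea, the tagging sub-shell can be dropped by letting awk read both files itself and using its built-in FILENAME variable as the tag. This is only a sketch under a few assumptions: the function name merge_by_key and the array names keys, left and right are made up for illustration, each input line is "key value" separated by whitespace, and the two file names differ (with identical names every line would land in the first column).

```shell
#!/bin/sh

# merge_by_key FILE1 FILE2   (hypothetical helper name)
# Prints "key<TAB>value-from-FILE1<TAB>value-from-FILE2", sorted by key.
merge_by_key() {
        awk '
        BEGIN {
                OFS = "\t"              # tab-separated output
        }
        {
                # $1 = key (date), $2 = value; FILENAME replaces the f1/f2 tag
                keys[$1] = 1
                if ( FILENAME == ARGV[1] ) {
                        left[$1] += $2
                } else {
                        right[$1] += $2
                }
        }
        END {
                # Keys come out in no particular order; sort afterwards.
                for ( k in keys ) {
                        print k, left[k], right[k]
                }
        }' "$1" "$2" | sort -k 1,1
}

# Usage: merge_by_key a.txt b.txt
```

The sub-shell version is still the one to use when the two inputs are command outputs rather than files, since FILENAME is only meaningful for named files.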
