Saturday, September 27, 2008

Paste a Few Files Into One and Plot ... Take 2

My previous blog may help you to merge a few files together, but it only works based on the assumption that all files have to have the same amount of records and each record has to be sampled at the time stamp. As we all know, the world is not perfect and you are bound to have various types of data set. Some data set may have a different start and end time, some may have missing data during certain interval, some may have an extra few data points within the sampling interval. Certainly my previous script will break.

I was trying to work on this new requirement based on my previous work, but I was not getting anywhere and the code started to become 'spaghetti'. My new approach will be to work with epoch time as integer instead of YYYYmmddHHMMSs time format as string. Hey, this is another chance for me to brush up my Python skill. In this exercise, I tried to tap onto the wealth of Python modules (eg. sets, datetime, time) and they proved to be very handy.

Below is my data set and I am trying to put them side by side for easy comparison. As you can see they are all having different samping time, start time, end time. My python program not only will merge the 3 files together, it also round-off the timestamp based on the defined resolution. Average value will be calculated if more than one sampling data within the sampling time resolution (see last 2 lines in t1.txt).

$ paste t1.txt t2.txt t3.txt
20080725101223:14       20080725083225:23       20080725083225:32
20080725102225:15       20080725084230:21       20080725084230:32
20080725103227:15       20080725085228:29       20080725085228:34
20080725104233:19       20080725090235:29       20080725090235:36
20080725105235:18       20080725091233:28       20080725091233:37
20080725110236:11       20080725092232:27       20080725092232:37
20080725111237:12       20080725093225:27       20080725093225:37
20080725112231:12       20080725094229:28       20080725094229:33
20080725113241:13       20080725095225:26       20080725095225:32
20080725114236:14       20080725100221:21       20080725100221:34
20080725115241:14       20080725115241:21       20080725101223:33
20080725120235:17       20080725120235:21       20080725102225:33
20080725121231:19       20080725121231:22       20080725103227:33
20080725122232:12       20080725122232:23       20080725104233:37
20080725123238:13       20080725123238:23       20080725105235:37
20080725124237:14       20080725124237:23       20080725110236:37
20080725125237:15       20080725125237:22       20080725111237:39
20080725130239:16       20080725130239:22       20080725112231:34
20080725131233:16       20080725131233:22
20080725132226:14       20080725132226:22
20080725132236:99

$ ./merge2.py t1.txt t2.txt t3.txt
20080725T083200:0:23:32
20080725T084200:0:21:32
20080725T085200:0:29:34
20080725T090200:0:29:36
20080725T091200:0:28:37
20080725T092200:0:27:37
20080725T093200:0:27:37
20080725T094200:0:28:33
20080725T095200:0:26:32
20080725T100200:0:21:34
20080725T101200:14:0:33
20080725T102200:15:0:33
20080725T103200:15:0:33
20080725T104200:19:0:37
20080725T105200:18:0:37
20080725T110200:11:0:37
20080725T111200:12:0:39
20080725T112200:12:0:34
20080725T113200:13:0:0
20080725T114200:14:0:0
20080725T115200:14:21:0
20080725T120200:17:21:0
20080725T121200:19:22:0
20080725T122200:12:23:0
20080725T123200:13:23:0
20080725T124200:14:23:0
20080725T125200:15:22:0
20080725T130200:16:22:0
20080725T131200:16:22:0
20080725T132200:56.5:22:0

Here is my merge2.py code:

#! /usr/bin/python
#
# merge files together based on round-off timestamp
# no assumption taken regarding having same timestamp for each corresponding row
# data file format: yyyymmddHHMMSS:value
#


import sys, os
import datetime, time
import sets


if len(sys.argv) < 2:
    sys.stderr.write("Usage: %s file1 file2 [file ...]" % (sys.argv[0]))
    sys.exit(1)



def avg(n):
    if len(n)==0:
        return 0
    else:
        return float(sum(n,0.0))/len(n)


resolution = 60


#
# tset - stores the time (round off based on resolution) in epoch
# data - dict with key=[filename,epoch], value=list of values
#
tset = sets.Set()
data = {}
outfmt = '%s'
for f in sys.argv[1:]:
    outfmt = '%s:%s' % (outfmt, '%g')
    for line in open(f):
        ts,value = line.strip().split(':')

        yr  = int(ts[0:4])
        mth = int(ts[4:6])
        day = int(ts[6:8])
        hr  = int(ts[8:10])
        min = int(ts[10:12])
        sec = int(ts[12:14])

        t = datetime.datetime(yr, mth, day, hr, min, sec)
        epoch = int(time.mktime(t.timetuple()))
        epoch = epoch/resolution*resolution
        tset.add(epoch)
        if (f,epoch) not in data:
            data[f,epoch]=[]
        data[f,epoch].append(int(value))


for t in sorted(tset):
    tt = time.localtime(t)
    ts = "%04d%02d%02dT%02d%02d%02d" % (tt[0:6])
    lvalue = [ts]

    for f in sys.argv[1:]:
        if (f,t) in data:
            lvalue.append( avg(data[f,t]) )
        else:
            lvalue.append(0)
    print outfmt % tuple(lvalue)

Below plot is created based on my generic plotting tool. BTW, if you do not need the merged output as text, you can simply populate the data into the RRDtool. IBM developerWorks has a very nice article on RRDtool - Expose Web performance problems with the RRDtool

Labels: ,

0 Comments:

Post a Comment

<< Home