An Old Task in Python
Couple of months ago, my colleague passed me IIS web access log from a rather busy web server. I managed to extract the session concurrency information and visualised the result using
Gnuplot. This trick was derived some years ago when I was doing a performance testing project. Basically I extracted all the timestamps for individual session ID, assuming that the session IDs are unique. It is possible to 'stack up' all the sessions by increment the per second counter between the start and end of the session duration. The end result will be the session concurrency.
The input data is a 170MB web access log with 227K lines of log based on a single day web traffic. My previous solution was based on AWK script and the run time was 3min 6sec. With Python 2.5.2, the run time is 20 sec, almost 10 folds in performance gain.
Here is my python script:
#! /usr/bin/python import datetime import time import sys if len(sys.argv) != 2: print "Usage:", sys.argv[0], "<web-log>" exit(1) # convert string to integer, leading zeros are stripped # '00'->0, '08'->8, '11'->11 def str2int(s): t=s.lstrip('0') if t=='': t=0 return int(t) # determine epoch from web access log timestamp # 2008-07-21 00:00:07 myserver 1.2.3.4 GET / ..... def findEpoch(line): yr=str2int(line[0:4]) mth=str2int(line[5:7]) day=str2int(line[8:10]) hh=str2int(line[11:13]) mm=str2int(line[14:16]) ss=str2int(line[17:19]) t=datetime.datetime(yr,mth,day,hh,mm,ss) epoch=int(time.mktime(t.timetuple())) return epoch sessions={} concurrency={} # start time of log # read first line (no need to determine first line in for loop) fp=open(sys.argv[1],"r") line=fp.readline() starttime=findEpoch(line) fp.close() sess="PROD_JSESSION_UID" sessN=len(sess) for line in open(sys.argv[1],"r"): cookie=line.rstrip().split(' ')[12] if sess in cookie: epoch=findEpoch(line) # get session id i1=cookie.index(sess) try: i2=cookie.index(";",i1) except: i2=len(cookie) uid=cookie[i1+sessN+2:i2] # store sessions try: sessions[uid]="%s,%s" % (sessions[uid],str(epoch)) except: sessions[uid]="%s" % str(epoch) endtime=epoch # initialise to zero count=starttime while count<=endtime: concurrency[count]=0 count+=1 # add up concurrency for all sessions for key in sessions.keys(): ltime=sessions[key].split(',') t0=int(ltime[0]) t1=int(ltime[-1]) count=t0 while count<=t1: concurrency[count]+=1 count+=1 count=starttime while count<=endtime: print count,concurrency[count] count+=1
The plot from Python looked the same as the AWK-based program:
0 Comments:
Post a Comment
<< Home