Monday, July 28, 2008

An Old Task in Python

Finding a real problem to brush up my skill on Python is not an easy task. Instead of waiting for a new problem to come, I look for old problem that I still have the input data set. This also enables me to compare Python with my previous solution.

Couple of months ago, my colleague passed me IIS web access log from a rather busy web server. I managed to extract the session concurrency information and visualised the result using Gnuplot. This trick was derived some years ago when I was doing a performance testing project. Basically I extracted all the timestamps for individual session ID, assuming that the session IDs are unique. It is possible to 'stack up' all the sessions by increment the per second counter between the start and end of the session duration. The end result will be the session concurrency.

The input data is a 170MB web access log with 227K lines of log based on a single day web traffic. My previous solution was based on AWK script and the run time was 3min 6sec. With Python 2.5.2, the run time is 20 sec, almost 10 folds in performance gain.

Here is my python script:

#! /usr/bin/python

import datetime
import time
import sys


if len(sys.argv) != 2:
 print "Usage:", sys.argv[0], "<web-log>"
 exit(1)


# convert string to integer, leading zeros are stripped
# '00'->0, '08'->8, '11'->11
def str2int(s):
 t=s.lstrip('0')
 if t=='': t=0
 return int(t)


# determine epoch from web access log timestamp
# 2008-07-21 00:00:07 myserver 1.2.3.4 GET / .....
def findEpoch(line):
 yr=str2int(line[0:4])
 mth=str2int(line[5:7])
 day=str2int(line[8:10])
 hh=str2int(line[11:13])
 mm=str2int(line[14:16])
 ss=str2int(line[17:19])
 t=datetime.datetime(yr,mth,day,hh,mm,ss)
 epoch=int(time.mktime(t.timetuple()))
 return epoch


sessions={}
concurrency={}


# start time of log
# read first line (no need to determine first line in for loop)
fp=open(sys.argv[1],"r")
line=fp.readline()
starttime=findEpoch(line)
fp.close()


sess="PROD_JSESSION_UID"
sessN=len(sess)
for line in open(sys.argv[1],"r"):
  
 cookie=line.rstrip().split(' ')[12]
 if sess in cookie:

  epoch=findEpoch(line)


  # get session id
  i1=cookie.index(sess)
  try:
   i2=cookie.index(";",i1)
  except:
   i2=len(cookie)
  uid=cookie[i1+sessN+2:i2]


  # store sessions
  try:
   sessions[uid]="%s,%s" % (sessions[uid],str(epoch))
  except:
   sessions[uid]="%s" % str(epoch)

endtime=epoch


# initialise to zero
count=starttime
while count<=endtime:
 concurrency[count]=0
 count+=1


# add up concurrency for all sessions
for key in sessions.keys():
 ltime=sessions[key].split(',')
 t0=int(ltime[0])
 t1=int(ltime[-1])
 count=t0
 while count<=t1:
  concurrency[count]+=1
  count+=1


count=starttime
while count<=endtime:
 print count,concurrency[count]
 count+=1

The plot from Python looked the same as the AWK-based program:

Labels: , ,

0 Comments:

Post a Comment

<< Home