Chi Hung Chan: July 2008

Tuesday, July 29, 2008

How To Parse Timestamp

Whenever I am 'free' (am I always free? I am not going to tell you :-), I visit the UNIX.com shell programming forum and find some challenging tasks to solve. One of the questions posted today was: how do you calculate the time difference between two timestamps?

Very often people "echo" the date to a log file at the beginning and end of a shell script. Suppose you are given the output below; how do you work out the elapsed time?
Oracle Q03 Begin Hot BACKUP Time: 07/23/08 18:35:46
Oracle Q03 End Hot BACKUP Time: 07/24/08 14:18:15

I know it is trivial to convert the timestamps to epoch in Tcl. How about other scripting languages such as Perl and Python?

Let's find out.
Perl way:

#! /usr/bin/perl

use Time::Local;

while (<>) {
 if ( /(\d\d)\/(\d\d)\/(\d\d) (\d\d):(\d\d):(\d\d)$/ ) {
  # timelocal expects a 0-based month (0 = January)
  $yr=$3+2000; $mth=$1-1; $day=$2;
  $hh=$4; $mm=$5; $ss=$6;

  if (/Begin Hot BACKUP/) {
   $t0=timelocal($ss,$mm,$hh,$day,$mth,$yr);
  }
  if (/End Hot BACKUP/) {
   $t1=timelocal($ss,$mm,$hh,$day,$mth,$yr);
  }
 }
}
print $t1-$t0,"\n";

Python way:

#! /usr/bin/python

import sys
import re
import datetime
import time

for newline in sys.stdin.readlines():
 line=newline.rstrip()
 t=re.findall(r'(\d\d)/(\d\d)/(\d\d) (\d\d):(\d\d):(\d\d)$',line)
 # only index into the result when the line actually matched
 if t:
  ts=t[0]
  if 'Begin Hot BACKUP' in line:
   _t=datetime.datetime(int(ts[2])+2000, int(ts[0]), int(ts[1]), int(ts[3]), int(ts[4]), int(ts[5]))
   t0=int(time.mktime(_t.timetuple()))
  if 'End Hot BACKUP' in line:
   _t=datetime.datetime(int(ts[2])+2000, int(ts[0]), int(ts[1]), int(ts[3]), int(ts[4]), int(ts[5]))
   t1=int(time.mktime(_t.timetuple()))

print t1-t0
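For comparison, here is a minimal sketch of the same calculation using datetime.strptime, assuming Python 2.7 or later rather than the 2.5 used in this post. The function name elapsed_seconds and the FMT constant are made up for illustration. It subtracts naive datetimes, so unlike the mktime approach above it ignores daylight-saving changes.

```python
from datetime import datetime

# Timestamp layout used in the log lines above: MM/DD/YY HH:MM:SS
FMT = "%m/%d/%y %H:%M:%S"

def elapsed_seconds(begin, end):
    """Return whole seconds between two timestamps written in FMT."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(begin, FMT)
    return int(delta.total_seconds())

print(elapsed_seconds("07/23/08 18:35:46", "07/24/08 14:18:15"))  # 70949
```

The sample backup therefore ran for 70949 seconds, or 19 hours 42 minutes 29 seconds.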

Tcl:

#! /usr/bin/tclsh

while { [gets stdin line] >= 0 } {
 if { [regexp {(\d\d/\d\d/\d\d \d\d:\d\d:\d\d)$} $line x ts] } {
  if { [string match {*Begin Hot BACKUP*} $line] } {
   set t0 [clock scan $ts]
  }
  if { [string match {*End Hot BACKUP*} $line] } {
   set t1 [clock scan $ts]
  }
 }
}
puts [expr $t1-$t0]

UNIX shell script way:

#! /bin/sh

: NO WAY!! Let me know if you have the solution simpler than above

In fact, you may want to consider working out the elapsed time in your original shell script. Here is the skeleton you may want to adopt:

#! /bin/sh
...
starttime=`perl -e 'print time()'`
...
# do your stuff
...
endtime=`perl -e 'print time()'`
...
echo "Elapsed time in seconds = `expr $endtime - $starttime`"

BTW, if you want sample code as a reference, you should visit PLEAC (Programming Language Examples Alike Cookbook).

Labels: Perl, python, Tcl

Monday, July 28, 2008

My First OO in Python

This job is associated with my previous blogs, this, this and this. The difference is that I am converting a portion of it to OO in Python.

The input files are two mapping files namely the sitename mapping to Address and sitename mapping to ID. The main loop will go through every record in the input file and substitute the sitename with the mapping.

#! /usr/bin/python


import sys
if len(sys.argv) != 2:
 print "Usage: %s <file>" % sys.argv[0]
 exit(1)

class Map:
 def __init__(self,mapfile,sep):
  import os
  self.mapfile=mapfile
  self.sep=sep
  self.mapping={}

  for line in open(self.mapfile,'r'):
   line=line.rstrip()
   [k,v]=line.split(self.sep,1)
   self.mapping[k]=v

 def getValue(self,k):
  try:
   v=self.mapping[k]
  except:
   v=''
  return v



key2add=Map('sitename-address-mapping.txt',':')
key2id=Map('sitename-id-mapping.txt',':')

for line in open(sys.argv[1],'r'):
 key=line.rstrip()
 print "%s\t%s\t%s" % (key, key2id.getValue(key), key2add.getValue(key))

It is a lot cleaner in OO than the procedural way in Tcl, but it requires more planning upfront. I am definitely looking for more opportunities to do things in OO.
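A slightly more defensive variant of the class above could look like the sketch below. It is written for Python 3, takes any iterable of lines instead of a file name so that it can be tried without the mapping files, and the names Mapping and get_value are hypothetical.

```python
class Mapping:
    """Key-to-value lookup built from 'key<sep>value' lines."""

    def __init__(self, lines, sep=":"):
        self.table = {}
        for line in lines:
            line = line.rstrip("\n")
            if sep not in line:
                continue  # skip malformed records instead of raising
            key, value = line.split(sep, 1)  # split on the first separator only
            self.table[key] = value

    def get_value(self, key):
        # dict.get returns the default when the key is missing
        return self.table.get(key, "")

# Made-up sample records
key2id = Mapping(["siteA:1001", "siteB:1002"])
print(key2id.get_value("siteA"))
```

Passing an open file object works as well, since iterating over a file yields its lines.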

Labels: python

An Old Task in Python

Finding a real problem to brush up my Python skills is not an easy task. Instead of waiting for a new problem to come, I look for an old problem for which I still have the input data set. This also enables me to compare Python with my previous solution.

A couple of months ago, my colleague passed me an IIS web access log from a rather busy web server. I managed to extract the session concurrency information and visualised the result using Gnuplot. This trick was derived some years ago when I was doing a performance testing project. Basically I extract all the timestamps for each individual session ID, assuming that the session IDs are unique. It is then possible to 'stack up' all the sessions by incrementing a per-second counter between the start and end of each session. The end result is the session concurrency.
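The stacking idea can be sketched on toy data as follows. This is an illustration in Python 3 only; the function name concurrency and the three sample sessions are made up, and the timestamps are small integers standing in for epoch seconds.

```python
def concurrency(sessions, start, end):
    """Count active sessions for every second in [start, end].

    sessions maps a session ID to its (first_seen, last_seen) seconds.
    """
    counts = {t: 0 for t in range(start, end + 1)}
    for first, last in sessions.values():
        # a session counts as active for every second it spans
        for t in range(first, last + 1):
            counts[t] += 1
    return counts

toy = {"A": (0, 3), "B": (2, 5), "C": (3, 3)}
for second, active in sorted(concurrency(toy, 0, 5).items()):
    print(second, active)
```

Second 3 is where all three toy sessions overlap, so the counter peaks at 3 there.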

The input data is a 170MB web access log with 227K lines, covering a single day of web traffic. My previous solution was an AWK script with a run time of 3 min 6 sec. With Python 2.5.2, the run time is 20 sec, a speed-up of about nine times.

Here is my python script:

#! /usr/bin/python

import datetime
import time
import sys


if len(sys.argv) != 2:
 print "Usage:", sys.argv[0], "<web-log>"
 exit(1)


# convert string to integer, leading zeros are stripped
# '00'->0, '08'->8, '11'->11
def str2int(s):
 t=s.lstrip('0')
 if t=='': t=0
 return int(t)


# determine epoch from web access log timestamp
# 2008-07-21 00:00:07 myserver 1.2.3.4 GET / .....
def findEpoch(line):
 yr=str2int(line[0:4])
 mth=str2int(line[5:7])
 day=str2int(line[8:10])
 hh=str2int(line[11:13])
 mm=str2int(line[14:16])
 ss=str2int(line[17:19])
 t=datetime.datetime(yr,mth,day,hh,mm,ss)
 epoch=int(time.mktime(t.timetuple()))
 return epoch


sessions={}
concurrency={}


# start time of log
# read first line (no need to determine first line in for loop)
fp=open(sys.argv[1],"r")
line=fp.readline()
starttime=findEpoch(line)
fp.close()


sess="PROD_JSESSION_UID"
sessN=len(sess)
for line in open(sys.argv[1],"r"):
  
 cookie=line.rstrip().split(' ')[12]
 if sess in cookie:

  epoch=findEpoch(line)


  # get session id
  i1=cookie.index(sess)
  try:
   i2=cookie.index(";",i1)
  except:
   i2=len(cookie)
  uid=cookie[i1+sessN+2:i2]


  # store sessions
  try:
   sessions[uid]="%s,%s" % (sessions[uid],str(epoch))
  except:
   sessions[uid]="%s" % str(epoch)

endtime=epoch


# initialise to zero
count=starttime
while count<=endtime:
 concurrency[count]=0
 count+=1


# add up concurrency for all sessions
for key in sessions.keys():
 ltime=sessions[key].split(',')
 t0=int(ltime[0])
 t1=int(ltime[-1])
 count=t0
 while count<=t1:
  concurrency[count]+=1
  count+=1


count=starttime
while count<=endtime:
 print count,concurrency[count]
 count+=1

The plot from Python looked the same as the AWK-based program:

Labels: awk, gnuplot, python

Wednesday, July 23, 2008

I Am Naive In ...

Yesterday my colleague was telling me that my program did not work for a specific mapping in converting BMC Remedy CSV data. After understanding the input mapping file, I realised that what he had told me was incorrect. He told me that each record in the mapping file is colon separated and consists of two fields. With this information in mind, I took a short cut in programming and did not check the integrity of the mapping file.

I admit I was naive in this task because I thought this was just a throwaway solution and therefore there was no need to bother about rigorous testing. Obviously I was wrong. Anyway, since I am learning Python, I just wondered whether Python can catch this type of exception. The code snippet below shows how I did it in Tcl. If I were to use the "id" after the foreach loop, the "id" is no longer "CCH" as it should be for the bad data set.

$ tclsh
% set goodData {CCH:Chan Chi Hung Pte Ltd}
CCH:Chan Chi Hung Pte Ltd

% set badData {CCH:Chan Chi Hung Pte Ltd, Tel:1234567}
CCH:Chan Chi Hung Pte Ltd, Tel:1234567

% foreach { id value } [split $goodData :] {}

% puts "id=$id, value=$value"
id=CCH, value=Chan Chi Hung Pte Ltd

% foreach { id value } [split $badData :] {}

% puts "id=$id, value=$value"
id=1234567, value=

%

In order to split into just two fields based on the first colon, I need to use a non-greedy quantifier in the regular expression, which matches the same possibilities but prefers the smallest number of matches rather than the largest.

% regexp {^(.*?):(.*)$} $badData x id value
1

% puts "id=$id, value=$value"
id=CCH, value=Chan Chi Hung Pte Ltd, Tel:1234567
%

With Python, an exception is thrown if the returned list does not contain exactly 2 strings. Did you know that the string split method can limit the number of splits performed? This gives you the flexibility to choose how you want to split the string.

$ python
Python 2.5.1 (r251:54863, May 18 2007, 16:56:43)
[GCC 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> goodData='CCH:Chan Chi Hung Pte Ltd'

>>> badData='CCH:Chan Chi Hung Pte Ltd, Tel: 1234567'

>>> [id,value]=goodData.split(':')

>>> print 'id=%s, value=%s' % (id,value)
id=CCH, value=Chan Chi Hung Pte Ltd

>>> [id,value]=badData.split(':')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: too many values to unpack

>>> [id,value]=badData.split(':',1)

>>> print 'id=%s, value=%s' % (id,value)
id=CCH, value=Chan Chi Hung Pte Ltd, Tel: 1234567

>>>
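Another way to get the same "first colon only" behaviour is str.partition, which always returns three parts and so never raises on unpacking. The sketch below is only an illustration; the helper name parse_record and its choice to return None for malformed input are assumptions, not what the original program did.

```python
def parse_record(line, sep=":"):
    """Split on the first separator only; return None for malformed input."""
    key, found, value = line.partition(sep)
    # 'found' is empty when the separator does not occur in the line
    if not found or not key:
        return None
    return key, value

print(parse_record("CCH:Chan Chi Hung Pte Ltd, Tel: 1234567"))
print(parse_record("no separator here"))
```

Returning None lets the caller decide whether a bad record should be skipped, logged or treated as fatal.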

Moral of the story is:
Do not trust your users totally.
Make sure you test your program with all kinds of data sets to ensure it can handle various situations.
Program in a defensive manner, even if it is a throwaway solution.

Labels: python, Tcl

Saturday, July 19, 2008

Different ways to skin a /etc/passwd file

Suppose we need to summarise how many users are using various shells. You can run a one-liner, AWK script or Python program. This exercise is just another opportunity for me to think in Python, that's all.

$ wc -l /etc/passwd
      74 /etc/passwd

$ cut -d: -f7 /etc/passwd | sort | uniq -c
  13
  14 /bin/bash
  38 /bin/sh
   1 /sbin/sh
   7 /usr/lib/rsh
   1 /usr/lib/uucp/uucico


$ awk -F: '
{
 ++s[$NF]
}
END {
 for ( i in s ) {
  printf("%4d %s\n",s[i],i)
 }
}' /etc/passwd
  13
  14 /bin/bash
  38 /bin/sh
   7 /usr/lib/rsh
   1 /usr/lib/uucp/uucico
   1 /sbin/sh


$ ./skin.py /etc/passwd
  13
   1 /sbin/sh
   7 /usr/lib/rsh
  14 /bin/bash
   1 /usr/lib/uucp/uucico
  38 /bin/sh


$ cat skin.py
#! /usr/sfw/bin/python

import sys

if len(sys.argv) != 2:
        print "Usage: ", sys.argv[0], "<colon_separated_file>"
        exit(1)


sdict={}
for line in open(sys.argv[1],"r"):
        shell = line.rstrip().split(":")[-1]
        try:
                sdict[shell] += 1
        except:
                sdict[shell] = 1

for i in sdict.keys():
        print "%4d %s" % (sdict[i],i)
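With a newer Python (2.7 or 3.x, so not the interpreter used in this post), the try/except counting can be replaced by collections.Counter. The sketch below assumes Python 3; the function name count_shells and the sample passwd lines are made up.

```python
from collections import Counter

def count_shells(lines):
    """Tally the last colon-separated field of each passwd-style line."""
    return Counter(line.rstrip("\n").split(":")[-1] for line in lines)

# Made-up sample entries; the last one has an empty shell field
sample = [
    "root:x:0:0:Super-User:/:/sbin/sh",
    "alice:x:100:10::/home/alice:/bin/bash",
    "bob:x:101:10::/home/bob:/bin/bash",
    "nobody:x:60001:60001:Nobody:/:",
]
for shell, n in count_shells(sample).most_common():
    print("%4d %s" % (n, shell))
```

most_common() also gives the output sorted by count, which the dictionary loop above does not.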

I am about to start the OOP chapter in Learning Python, 3rd Edition and hope that I can program in OO in the future :-).

Labels: python, shell script

Thursday, July 17, 2008

OpenSolaris Curriculum Development Resources

I stumbled upon the OpenSolaris Curriculum Development Resources while I was trying to find other stuff on Solaris. Tonnes of material on almost every aspect of OpenSolaris.

Labels: opensolaris

Tuesday, July 15, 2008

SQLite

If your standalone application requires a database, you may want to consider using SQLite instead of a flat file or a full-fledged database like MySQL, PostgreSQL, MS-SQL, Oracle, etc. As the official site says, SQLite is an in-process library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. So you don't need a database administrator to help you install and configure it.

So, who is using it anyway? Some well-known users are deploying it in their applications. BTW, it is included in the Python 2.5 distribution: simply "import sqlite3" in the Python shell to load the SQLite module. Also, Solaris 10 uses it to manage its SMF (Service Management Facility).

In this blog, I will try to show you what SQLite can offer.

I am going to create a new database and populate it with a few records. After that I will query the database. As you can see, all the commands are typical SQL statements. You can even turn on the timer command to benchmark your SQL query. If you are a shell script guy like me, you can run "sqlite3" as command line to fetch the records.

$ ls

$ sqlite3 newdb.sqlite3
SQLite version 3.5.2
Enter ".help" for instructions
sqlite> .help
.bail ON|OFF           Stop after hitting an error.  Default OFF
.databases             List names and files of attached databases
.dump ?TABLE? ...      Dump the database in an SQL text format
.echo ON|OFF           Turn command echo on or off
.exit                  Exit this program
.explain ON|OFF        Turn output mode suitable for EXPLAIN on or off.
.header(s) ON|OFF      Turn display of headers on or off
.help                  Show this message
.import FILE TABLE     Import data from FILE into TABLE
.indices TABLE         Show names of all indices on TABLE
.mode MODE ?TABLE?     Set output mode where MODE is one of:
                         csv      Comma-separated values
                         column   Left-aligned columns.  (See .width)
                         html     HTML <table> code
                         insert   SQL insert statements for TABLE
                         line     One value per line
                         list     Values delimited by .separator string
                         tabs     Tab-separated values
                         tcl      TCL list elements
.nullvalue STRING      Print STRING in place of NULL values
.output FILENAME       Send output to FILENAME
.output stdout         Send output to the screen
.prompt MAIN CONTINUE  Replace the standard prompts
.quit                  Exit this program
.read FILENAME         Execute SQL in FILENAME
.schema ?TABLE?        Show the CREATE statements
.separator STRING      Change separator used by output mode and .import
.show                  Show the current values for various settings
.tables ?PATTERN?      List names of tables matching a LIKE pattern
.timeout MS            Try opening locked tables for MS milliseconds
.timer ON|OFF          Turn the CPU timer measurement on or off
.width NUM NUM ...     Set column widths for "column" mode

sqlite> CREATE TABLE staff (lastname TEXT, firstname TEXT, age INT);

sqlite> INSERT INTO staff VALUES ('Chan','Chi Hung',45);

sqlite> INSERT INTO staff VALUES ('Chi','Chan Hung',21);

sqlite> INSERT INTO staff VALUES ('Hung','Chi Chan',11);

sqlite> SELECT * FROM staff;
Chan|Chi Hung|45
Chi|Chan Hung|21
Hung|Chi Chan|11

sqlite> .schema
CREATE TABLE staff (lastname text, firstname text, age int);

sqlite> .timer ON

sqlite> SELECT * FROM staff WHERE lastname like 'C%' ORDER BY age ASC;
Chi|Chan Hung|21
Chan|Chi Hung|45
CPU Time: user 0.000225 sys 0.000116

sqlite> .quit

$ ls -l
total 4
-rw-r--r--   1 chihung  Chihung      2048 Jul 15 20:21 newdb.sqlite3

$ sqlite3 newdb.sqlite3 "SELECT * FROM staff where lastname='Chan'"
Chan|Chi Hung|45
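Since the sqlite3 module ships with Python, the same session can be sketched from a script. This is a minimal illustration assuming Python 3 and an in-memory database, so no newdb.sqlite3 file is created; the table and rows are the ones used above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, nothing written to disk
conn.execute("CREATE TABLE staff (lastname TEXT, firstname TEXT, age INT)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [("Chan", "Chi Hung", 45), ("Chi", "Chan Hung", 21), ("Hung", "Chi Chan", 11)],
)
# '?' placeholders keep values out of the SQL text
rows = conn.execute(
    "SELECT lastname, firstname, age FROM staff "
    "WHERE lastname LIKE ? ORDER BY age ASC",
    ("C%",),
).fetchall()
for row in rows:
    print("|".join(str(col) for col in row))
conn.close()
```

The rows come back as tuples, so the join on "|" is only there to mimic the default output of the sqlite3 command line.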

If you have been reading my blog, you will realise that I started using SQLite when I tried to tackle the Netflix Prize competition. Although I did not make it to the leaderboard, I learned how to use SQLite to work with a massive dataset (100+ million records). If you design your database schema properly with indices, your queries (even inner joins) can be extremely fast. Below is SQLite driven from the Tcl shell. BTW, it has bindings for most programming languages.

$ ls -lh movie-64bit.db
-r--r--r--   2 chihung  chihung       6.5G Nov 20  2007 movie-64bit.db

$ tclsh
% package require sqlite3
3.5.2

% sqlite3 dbHandler movie-64bit.db -readonly 1

% set movieN [dbHandler eval {SELECT COUNT(*) FROM movie}]
100480507

% set m1m2 [dbHandler eval {
SELECT t1.customer_id FROM movie t1
INNER JOIN movie t2 ON t1.customer_id=t2.customer_id WHERE
t1.movie_id=1 AND t2.movie_id=2}]
305344 387418 515436 636262 1374216 1398626 1664010 1806515 2118461 2439493

% time "dbHandler eval {
SELECT t1.customer_id FROM movie t1
INNER JOIN movie t2 ON t1.customer_id=t2.customer_id WHERE
t1.movie_id=1 AND t2.movie_id=2}"
4688 microseconds per iteration
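The real schema and indices of movie-64bit.db are not shown in this post, so the sketch below uses a hypothetical two-column movie table and a hypothetical composite index named movie_idx, just to illustrate the shape of the self-join on toy data (Python 3).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per (movie, customer) rating
conn.execute("CREATE TABLE movie (movie_id INT, customer_id INT)")
conn.executemany(
    "INSERT INTO movie VALUES (?, ?)",
    [(1, 10), (1, 20), (1, 30), (2, 20), (2, 30), (2, 40), (3, 10)],
)
# Composite index intended to let the join use index lookups
conn.execute("CREATE INDEX movie_idx ON movie (movie_id, customer_id)")

# Customers who rated both movie 1 and movie 2
both = [
    row[0]
    for row in conn.execute(
        "SELECT t1.customer_id FROM movie t1 "
        "INNER JOIN movie t2 ON t1.customer_id = t2.customer_id "
        "WHERE t1.movie_id = 1 AND t2.movie_id = 2 "
        "ORDER BY t1.customer_id"
    )
]
print(both)
conn.close()
```

On a toy table the index makes no measurable difference; it only matters at the scale of the 100-million-row table above.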

Solaris 10 is using SQLite, if you are not aware of that.

$ uname -a
SunOS thumper 5.10 Generic_118855-36 i86pc i386 i86pc

$ cat /etc/release
                        Solaris 10 11/06 s10x_u3wos_10 X86
           Copyright 2006 Sun Microsystems, Inc.  All Rights Reserved.
                        Use is subject to license terms.
                           Assembled 14 November 2006


$ /lib/svc/bin/sqlite /lib/svc/seed/global.db
SQLite version 2.8.15-repcached-Generic Patch
Enter ".help" for instructions
sqlite> .schema
CREATE TABLE id_tbl (id_name         STRING NOT NULL,id_next         INTEGER NOT NULL);
CREATE TABLE instance_tbl (instance_id     INTEGER PRIMARY KEY,instance_name   CHAR(256) NOT NULL,instance_svc    INTEGER NOT NULL);
CREATE TABLE pg_tbl (pg_id           INTEGER PRIMARY KEY,pg_parent_id    INTEGER NOT NULL,pg_name         CHAR(256) NOT NULL,pg_type         CHAR(256) NOT NULL,pg_flags        INTEGER NOT NULL,pg_gen_id       INTEGER NOT NULL);
CREATE TABLE prop_lnk_tbl (lnk_prop_id     INTEGER PRIMARY KEY,lnk_pg_id       INTEGER NOT NULL,lnk_gen_id      INTEGER NOT NULL,lnk_prop_name   CHAR(256) NOT NULL,lnk_prop_type   CHAR(2) NOT NULL,lnk_val_id      INTEGER);
CREATE TABLE schema_version (schema_version  INTEGER);
CREATE TABLE service_tbl (svc_id          INTEGER PRIMARY KEY,svc_name        CHAR(256) NOT NULL);
CREATE TABLE snaplevel_lnk_tbl (snaplvl_level_id INTEGER NOT NULL,snaplvl_pg_id    INTEGER NOT NULL,snaplvl_pg_name  CHAR(256) NOT NULL,snaplvl_pg_type  CHAR(256) NOT NULL,snaplvl_pg_flags INTEGER NOT NULL,snaplvl_gen_id   INTEGER NOT NULL);
CREATE TABLE snaplevel_tbl (snap_id                 INTEGER NOT NULL,snap_level_num          INTEGER NOT NULL,snap_level_id           INTEGER NOT NULL,snap_level_service_id   INTEGER NOT NULL,snap_level_service      CHAR(256) NOT NULL,snap_level_instance_id  INTEGER NULL,snap_level_instance     CHAR(256) NULL);
CREATE TABLE snapshot_lnk_tbl (lnk_id          INTEGER PRIMARY KEY,lnk_inst_id     INTEGER NOT NULL,lnk_snap_name   CHAR(256) NOT NULL,lnk_snap_id     INTEGER NOT NULL);
CREATE TABLE value_tbl (value_id        INTEGER NOT NULL,value_type      CHAR(1) NOT NULL,value_value     VARCHAR NOT NULL);
CREATE INDEX id_tbl_id ON id_tbl (id_name);
CREATE INDEX instance_tbl_name ON instance_tbl (instance_svc, instance_name);
CREATE INDEX pg_tbl_name ON pg_tbl (pg_parent_id, pg_name);
CREATE INDEX pg_tbl_parent ON pg_tbl (pg_parent_id);
CREATE INDEX pg_tbl_type ON pg_tbl (pg_parent_id, pg_type);
CREATE INDEX prop_lnk_tbl_base ON prop_lnk_tbl (lnk_pg_id, lnk_gen_id);
CREATE INDEX prop_lnk_tbl_val ON prop_lnk_tbl (lnk_val_id);
CREATE INDEX service_tbl_name ON service_tbl (svc_name);
CREATE INDEX snaplevel_lnk_tbl_id ON snaplevel_lnk_tbl (snaplvl_pg_id);
CREATE INDEX snaplevel_lnk_tbl_level ON snaplevel_lnk_tbl (snaplvl_level_id);
CREATE INDEX snaplevel_tbl_id ON snaplevel_tbl (snap_id);
CREATE INDEX snapshot_lnk_tbl_name ON snapshot_lnk_tbl (lnk_inst_id, lnk_snap_name);
CREATE INDEX snapshot_lnk_tbl_snapid ON snapshot_lnk_tbl (lnk_snap_id);
CREATE INDEX value_tbl_id ON value_tbl (value_id);

sqlite> select * from service_tbl;
2|system/boot-archive
6|system/device/local
10|milestone/devices
14|system/identity
20|system/filesystem/local
24|system/manifest-import
28|system/filesystem/minimal
32|milestone/multi-user
36|milestone/name-services
40|network/initial
44|network/loopback
48|network/physical
52|system/svc/restarter
56|system/filesystem/root
60|milestone/single-user
64|system/filesystem/usr
68|network/rpc/bind
72|system/console-login
76|milestone/multi-user-server
80|network/inetd-upgrade
84|system/utmp
88|system/metainit
92|network/pfil
93|system/sysidtool

sqlite> .q

If you want to know more about SQLite, our National Library carries a copy of The Definitive Guide to SQLite. Alternatively, watch An Introduction to SQLite by SQLite's author, Richard Hipp, who delivered the talk at Google TechTalks on May 31, 2006.

SQLite is a small C library that implements a self-contained, embeddable, zero-configuration SQL database engine. SQLite implements a large subset of SQL-92 and stores a complete database in a single disk file. The library footprint is less than 250 KB making it suitable for use in embedded devices and applications where memory space is scarce. This talk provides a quick overview of SQLite, its history, its strengths and weaknesses, and describes situations where it is much more useful than a traditional client/server database. The talk concludes with a discussion of the lessons learned from the development of SQLite and how those lessons can be applied to other projects.

Labels: SQLite, Tcl

Monday, July 14, 2008

Solaris Zones Survived A Year

One of my servers has survived a year without shutdown/reboot. The server is running a virtualised grid environment on Solaris 10 container. As you can see in the screen dump, it has 3 SGE (Sun Grid Engine) execution hosts (virtual) with 4 slots for each host. All users (54 of them) are given a Solaris zone as their development environment for grid computing and they are able to submit jobs (including MPI) to the virtualised grid.

The whole setup runs on a Sun Fire X4600.

Labels: Solaris

Friday, July 11, 2008

Python Runs A Lot Faster Than Tcl

As I mentioned in this morning's blog, I decided to take on this task in Python. Guess what, Python runs 7 times faster than Tcl on the same small subset of the CSV dataset. My colleague was so impressed when he ran the program against the real dataset, a 190+MB CSV file. It took just under 15 seconds to do all the conversions and create a new CSV file. BTW, it took 15-20 seconds or more just to load that CSV file into Excel.

Because lists are mutable in Python, there is very little penalty in changing the content of an individual item in the list. Python does not have to re-create a separate list object when the content is changed, whereas Tcl has to do that. In Tcl, performance will definitely deteriorate when we have to deal with long lists.

Here is my second Python code snippet, not very fantastic but it works and runs fast. I am very impressed with Python's performance. BTW, the exception handling (try: except:) in Python has less CPU overhead and finer control than Tcl's catch.

import csv, sys

if len(sys.argv) != 4:
 print "Usage:", sys.argv[0], "csv(in)", "mapping", "csv(out)"
 exit(1)


#
# mapping as dict object
#
map={}
for line in open(sys.argv[2],"r"):
 k,v=line.rstrip().split(":")
 map[k]=v


reader=csv.reader(open(sys.argv[1],"rb"))
writer=csv.writer(open(sys.argv[3],"wb"))


#
# find header indices
#
header=reader.next()
i_email    =header.index("Email Address")
i_telephone=header.index("Telephone")
i_assigned =header.index("* Assigned-to (Person)")
....

writer.writerow(header)
for row in reader:
 if row[i_email] == "":
  row[i_email]="default@somewhere"
 if row[i_telephone] == "":
  row[i_telephone]="123456789"

 try:
  row[i_assigned]=map[row[i_assigned]]
 except:
  row[i_assigned]=""

....

 writer.writerow(row)
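For readers on Python 3, a rough equivalent of the snippet above could look like this. It is a sketch only: the function name convert, the in-memory io.StringIO files and the sample ID-to-name mapping are made up for illustration, and only two of the columns are handled.

```python
import csv
import io

def convert(src, dst, mapping, default_email="default@somewhere"):
    """Copy CSV rows from src to dst, filling blanks and mapping user IDs."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)
    i_email = header.index("Email Address")
    i_assigned = header.index("* Assigned-to (Person)")
    writer.writerow(header)
    for row in reader:
        if row[i_email] == "":
            row[i_email] = default_email
        # unknown IDs become blank, as in the try/except above
        row[i_assigned] = mapping.get(row[i_assigned], "")
        writer.writerow(row)

# Made-up input: one row with a blank email, one with an unknown user ID
src = io.StringIO("Email Address,* Assigned-to (Person)\n,u01\na@b.c,u99\n")
dst = io.StringIO()
convert(src, dst, {"u01": "Alice Tan"})
print(dst.getvalue())
```

With real files in Python 3, the csv documentation recommends opening them with newline="" instead of the "rb"/"wb" modes used in Python 2.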

While I am still trying to finish the Learning Python, 3rd Ed book (only managed to finish half of it), I am always looking for opportunities to apply what I learn.

Labels: performance, python, Tcl

Tcl Code Refactoring, Part 2

Yesterday I talked about refactoring my Tcl code for manipulating CSV data. The problem was circling around my head while I was driving home last night. It is pretty inefficient to loop through 285 columns for each row just to test and change a few columns, in my case 6 columns. So after finishing all my "routine duty" at home, I managed to find time to modify my code so that I only have to replace the 6 columns in the list. Although lreplace has to re-create another list after the replacement, it is still better than going through the whole list.

With this modification, I managed to squeeze another 0.7 second out of the run time. OK, I think I am kind of hitting the performance limit, so what will be the next step? Since I am learning Python, maybe the snake has something to offer.

Bingo! The default Python installation comes with CSV module. Hey that's a good opportunity for me to practise my Python skill. Below is a simple skeleton to read and write in CSV format.

#! /usr/bin/python


import csv, sys

if len(sys.argv) != 3:
        print "Usage:", sys.argv[0], "csv(in) csv(out)"
        exit(1)

reader=csv.reader(open(sys.argv[1],"r"))
writer=csv.writer(open(sys.argv[2],"w"))

for row in reader:
        writer.writerow(row)

Comparing the above functionality with Tcl, Python is 6 times faster! With the characteristics of list object in Python being mutable, it is very efficient to replace values in Python list. However, in Tcl you will have to recreate another list object.
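The in-place behaviour of Python lists can be checked with a tiny sketch (Python 3, made-up row values). It only demonstrates the Python side and makes no measurement of Tcl.

```python
row = ["id01", "", "open"]
same_object = id(row)

row[1] = "default@somewhere"   # item assignment mutates the list in place
assert id(row) == same_object  # still the very same list object

copy = row[:1] + ["other"] + row[2:]  # building a new list, for contrast
print(copy is row)  # False
```

Item assignment touches one slot, while the slice-and-concatenate form allocates a whole new list.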

For this exercise, I will definitely go the Python way. So stay tuned for more performance news. To be continued ...

Labels: performance, python, Tcl

Thursday, July 10, 2008

Tcl Code Refactoring

My colleague is doing the BMC Remedy migration and has dumped out the data from the old version in CSV format. Some of the migration requirements are that certain fields used to be optional and now have to be mandatory. Also, the user IDs have to be replaced by real user name. Just to name a few.

It is not difficult to parse the mapping file and store it in a Tcl associative array so that it can be used for dynamic user name substitution.

The CSV module from Tcllib proved to be extremely useful for parsing CSV output. To ensure the mapping works properly, I need to dynamically generate the switch body to find out whether I need to substitute the user ID with the real user name or set the default email address / telephone if it is blank. I need to do that dynamically because the switch patterns cannot work with variable substitution. However, it is very inefficient to build the Tcl code dynamically every time within the while loop.

It took 10 seconds to manipulate a 554 rows x 285 columns CSV file. I am definitely not satisfied with the run time and I am sure Tcl can do better than that. It is code refactoring time. By taking the switch body out of the loop and having it dynamically generated once using subst, we can avoid a lot of computation in building that part of the code over and over again. Also, we can collapse all the matching cases into a single command body using the "-" trick in switch to avoid repeating code. Below is a code snippet:

set switchBody [subst -nocommands {
 $index(email) { 
  if { [string length \$cell] == 0 } {
   set cell $defaultEmail
  }
 }
 $index(telephone) { 
  if { [string length \$cell] == 0 } {
   set cell $defaultTelephone
  }
 }
 $index(assigned) -
 $index(closed) -
 $index(fixed) -
 $index(response) { 
  if { [info exists map(\$cell)] == 1 } {
   set cell \$map(\$cell)
  }
 }
}]

...

while { [gets $fp line] >= 0 } {
 set lcsv [::csv::split -alternate $line]
 set lcsvN [llength $lcsv]
 set new {}
 for { set i 0 } { $i < $lcsvN } { incr i } {
  set cell [lindex $lcsv $i]
  switch $i $switchBody
  lappend new $cell
 }
 puts [::csv::join $new]
}

Now the run time is down to 3.8 seconds, which is about 2.6 times faster. I may have to tune this code further when my colleague provides me with the real data source with a few hundred thousand records.
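The same "build the dispatch once, outside the loop" idea can be sketched in Python with a dictionary that maps column indices to handler functions. Everything here is hypothetical (the helper names, the column positions and the sample mapping); sharing one handler between several columns plays the role of the "-" fall-through in the Tcl switch.

```python
def fill_default(default):
    """Return a handler that replaces an empty cell with default."""
    return lambda cell: cell if cell else default

def lookup(mapping):
    """Return a handler that maps known user IDs and leaves others as-is."""
    return lambda cell: mapping.get(cell, cell)

names = {"u01": "Alice Tan"}          # hypothetical ID-to-name mapping
to_name = lookup(names)
# Built once, outside the row loop: column index -> handler
handlers = {1: fill_default("default@somewhere"), 2: to_name, 3: to_name}

def fix_row(row):
    """Apply each handler to its own column only, in place."""
    for i, handler in handlers.items():
        row[i] = handler(row[i])
    return row

print(fix_row(["T-1", "", "u01", "u99"]))
```

Only the columns listed in handlers are visited, so the cost per row does not grow with the total number of columns.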

Labels: Tcl

Monday, July 07, 2008

Three Hundred Thousand Files In A Single Directory

A colleague of mine is managing a set of Redhat Enterprise Linux servers. On one of the web servers, the / partition is running low on disk space. The obvious way to find out which directory is the culprit is to run
find / -mount -type d -exec du -sk {} \; | sort -n -k 1
This will show you which directory occupies the most disk space.

My colleague managed to locate the /var/spool/mqueue directory, which stores all the temporary files for sendmail. Apparently it contained 314,000+ files in that single directory. These are mails that could not be delivered for some reason and got stuck in the temp directory. If you were to do a "ls -l", you would have to wait for ages to get the directory listing. (See this blog to understand how to minimise the number of system calls in a directory listing.)

By reading one of the 314,000+ files, we understood that the content of the file was actually generated by the rsync command in the crontab. The rsync is supposed to synchronise the web content between two servers and is carried out every minute. The crontab entry looks like this:
* * * * * rsync ....
Since the rsync command does not redirect standard output and error, cron will automatically help user to send that via email. Since the sendmail is not running, the mail has to be queued in the system (/var/spool/mqueue/*). For rsync to run every minute, you are talking about having 1440 mails (24*60) in the queue every day. If the server is up for 218 days, you will have 314,000+ files in /var/spool/mqueue.
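The arithmetic is easy to check. A quick sketch in Python 3, taking 314,000 as the approximate file count:

```python
runs_per_day = 24 * 60            # one cron run, hence one mail, per minute
queued = 314000                   # approximate file count from the post
print(runs_per_day)               # 1440
print(round(queued / runs_per_day))  # 218
```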

Appending >/dev/null 2>&1 to the rsync entry in crontab simply gets rid of all the output from the rsync command. Therefore, no mail will be sent out by cron.

If you intend to clean up all the files in /var/spool/mqueue, you may want to follow what my colleague did: remove the directory /var/spool/mqueue and re-create it. If you were to remove the files using a wild card (*), bash would fail with an "Argument list too long" error.
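Another way around the argument-list limit is to delete the files one directory entry at a time, so that no long command line is ever built. The sketch below assumes Python 3.6 or later; the function name empty_directory is made up, and the demonstration runs on a throwaway temporary directory rather than on a real mail queue.

```python
import os
import tempfile

def empty_directory(path):
    """Unlink every regular file in path, one directory entry at a time."""
    removed = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                os.unlink(entry.path)
                removed += 1
    return removed

# Demonstration on a throwaway directory, not on a real mail queue
with tempfile.TemporaryDirectory() as demo:
    for i in range(1000):
        open(os.path.join(demo, "qf%04d" % i), "w").close()
    print(empty_directory(demo))
```

Unlike removing and re-creating the directory, this keeps the directory's existing ownership and permissions untouched.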

BTW, do you think Windows can function properly with 314K files in a single directory?

Labels: unix

Friday, July 04, 2008

Flash Storage Memory, Interview with Donald Knuth, ... from Communications of ACM

The July 2008 issue of Communications of the ACM, from the Association for Computing Machinery, has a lot of interesting articles. Here is the list of articles that I will have to find time to read, and I hope you share the same interest:

Cloud Computing
Beautiful Code Exists, If You Know Where to Look
Interview, The ‘Art’ of Being Donald Knuth
XML Fever
Flash Storage Memory
Web Science: An Interdisciplinary Approach to Understanding the Web

The PDF version of this issue is available for download. Look for the download link at the top of the page to get an offline version before they take it away.

How did I come to know about it? It was mentioned in Jonathan Schwartz's blog post on Solaris on Wall Street - Faster and Faster. He mentioned that Adam Leventhal (one of the DTrace developers) wrote an article in Communications of the ACM:

our own Adam Leventhal has added a far more fulfilling technical perspective in Communications of the ACM: Flash Storage Memory. Worth the read...

Did you know that Sun Microsystems is working on a Flash SSD product line? See this official press release; for technical information you can download the BluePrint - Optimizing Systems to Use Flash Memory as a Hard Drive Replacement.

Labels: storage, Sun

Wednesday, July 02, 2008

Understand Linux Memory

My colleague is managing a set of Redhat Linux servers for a customer. An ex-colleague of ours gave him a script to run as a cron job to monitor free memory; an email is sent out if it falls below a certain threshold. What the script does is basically work out the percentage of free memory from /proc/meminfo based on this equation - 100*MemFree/MemTotal. Guess what happens: tonnes of email alerts flood your Inbox and you start to worry about not having enough memory.

However, the author of the script basically does not have a deep understanding of how memory is allocated, used, buffered, cached and released.
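To see how misleading the naive equation can be, here is a small sketch in Python 3 that computes it alongside a figure that also counts Buffers and Cached as usable. Treating those two fields as reclaimable is a common rule of thumb and an assumption here, not something taken from the original script; the function names and the sample values are made up.

```python
def parse_meminfo(text):
    """Turn /proc/meminfo-style text into a dict of kB values."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if rest.strip():
            info[name] = int(rest.split()[0])
    return info

def free_percentages(info):
    """Return (naive, buffers-and-cache-aware) free-memory percentages."""
    total = info["MemTotal"]
    naive = 100.0 * info["MemFree"] / total
    usable = info["MemFree"] + info.get("Buffers", 0) + info.get("Cached", 0)
    return naive, 100.0 * usable / total

# Made-up sample values in the /proc/meminfo layout
sample = """MemTotal:  4000000 kB
MemFree:    200000 kB
Buffers:    300000 kB
Cached:    2500000 kB
"""
naive, adjusted = free_percentages(parse_meminfo(sample))
print("%.1f%% %.1f%%" % (naive, adjusted))
```

With these sample numbers the naive figure is 5% free while the adjusted figure is 75%, which is the kind of gap that triggers needless alerts.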

If you have an hour to spare, I strongly recommend you to watch the video in this blog - The Answer to Free Memory, Swap, Oracle, and Everything. If you want to convince your customer that the server has enough memory to run whatever application, just ask them to watch that video.

Labels: Linux, performance
