Monday, June 16, 2008

My First Python Program

This is my second attempt at learning Python since 2000. You may be wondering what the motivation is and whether I will "dump" my favourite scripting language, Tcl, to go full steam ahead with Python. Tcl is still my "mother tongue", and there is definitely no harm in learning another "foreign language".

The motivation comes from "The Zen of Python" and the way Python does multi-precision integer calculation out of the box. Below is Python in action, compared with Perl (with the Bignum module) and UNIX bc:

$ /cygdrive/c/Python25/python
Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
>>> 2**1000
10715086071862673209484250490600018105614048117055336074437503883703510511249361
22493198378815695858127594672917553146825187145285692314043598457757469857480393
45677748242309854210746050623711418779541821530464749835819412673987675591655439
46077062914571196477686542167660429831652624386837205668069376L
>>> exit()

$ perl -v

This is perl, v5.8.8 built for cygwin-thread-multi-64int
(with 8 registered patches, see perl -V for more detail)

Copyright 1987-2006, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl".  If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.

$ echo "use Bignum; print 2**1000" | perl
1.07150860718627e+301

(Note: Perl most likely falls back to floating point here because the pragma is spelled "bignum", all lowercase. On a case-insensitive file system "use Bignum" loads the file but silently imports nothing; with "use bignum;" Perl also computes 2**1000 in arbitrary precision.)

$ echo "2^1000" | bc
10715086071862673209484250490600018105614048117055336074437503883703\
51051124936122493198378815695858127594672917553146825187145285692314\
04359845775746985748039345677748242309854210746050623711418779541821\
53046474983581941267398767559165543946077062914571196477686542167660\
429831652624386837205668069376
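
The same check can be repeated in a current Python 3 interpreter, where plain and long integers are unified and the trailing L from the 2.5 session above is no longer printed. This is only a minimal sketch; the variable names are illustrative:

```python
# Python integers are arbitrary precision, so no extra module is needed.
big = 2 ** 1000
digits = str(big)

print(len(digits))   # number of decimal digits
print(digits[:20])   # leading digits, for comparison with the bc output
print(digits[-6:])   # trailing digits
```

The leading and trailing digits can be compared by eye with the first and last lines of the bc output.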

Recently a colleague passed me a fairly big IIS log file (170MB, 227K lines), and I thought it was a good opportunity to practise my Python skills. This is my first non-trivial Python program. Its objective is to work out the hourly bytes sent, bytes received and hits. I also wanted to compare Python with AWK and Tcl (8.4.12).

Here is the "battlefield" for Python vs Tcl vs AWK in my Cygwin environment. To be fair, each program is executed 3 times.

$ time ./sum.py iis.log > a

real    0m14.672s
user    0m0.015s
sys     0m0.015s

$ time ./sum.py iis.log > a

real    0m15.391s
user    0m0.031s
sys     0m0.031s

$ time ./sum.py iis.log > a

real    0m15.094s
user    0m0.015s
sys     0m0.031s

$ time ./sum.sh iis.log > b

real    0m18.704s
user    0m15.170s
sys     0m0.373s

$ time ./sum.sh iis.log > b

real    0m18.219s
user    0m14.951s
sys     0m0.233s

$ time ./sum.sh iis.log > b

real    0m18.390s
user    0m14.873s
sys     0m0.483s

$ time ./sum.tcl iis.log > c

real    0m15.781s
user    0m0.015s
sys     0m0.015s

$ time ./sum.tcl iis.log > c

real    0m14.641s
user    0m0.015s
sys     0m0.000s

$ time ./sum.tcl iis.log > c

real    0m15.031s
user    0m0.015s
sys     0m0.000s
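
To compare the three contenders on a single number, the runs above can be averaged. The sketch below uses only the "real" (wall-clock) column. The user and sys figures for sum.py and sum.tcl are close to zero, presumably because Cygwin's time does not see the CPU time of the native Windows interpreters; that is my assumption, not something the output confirms.

```python
# Wall-clock ("real") seconds copied from the three runs of each script above.
real_times = {
    "sum.py": [14.672, 15.391, 15.094],
    "sum.sh": [18.704, 18.219, 18.390],
    "sum.tcl": [15.781, 14.641, 15.031],
}

# Arithmetic mean of the three runs for each script.
means = {name: sum(runs) / len(runs) for name, runs in real_times.items()}

for name, mean in means.items():
    print("%-8s %.2f s" % (name, mean))
```

Three runs is a small sample, so the averages are only a rough guide.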


# verify that the outputs are the same
# btw, Python and Tcl write CRLF as the default end of line (the native platform is Windows)
$ for i in a b c
do
dos2unix < "$i" | md5sum
done
83211bf4faa32495ca9eb52c6b520974 *-
83211bf4faa32495ca9eb52c6b520974 *-
83211bf4faa32495ca9eb52c6b520974 *-
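
The same "normalise the line endings, then hash" check can also be done from Python. The following is a rough sketch rather than what I actually ran; normalized_md5 and the two sample byte strings are made up for illustration:

```python
import hashlib


def normalized_md5(data):
    """Return the MD5 hex digest of data after converting CRLF to LF."""
    return hashlib.md5(data.replace(b"\r\n", b"\n")).hexdigest()


# Hypothetical sample outputs: same content, different line endings.
windows_output = b"00 10 20 1\r\n01 30 40 2\r\n"
unix_output = b"00 10 20 1\n01 30 40 2\n"

print(normalized_md5(windows_output) == normalized_md5(unix_output))
```

For the real files, the bytes would come from reading a, b and c in binary mode, for example open("a", "rb").read().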

It is clear that Python and Tcl are neck and neck, at about 15 seconds of real time each against about 18 seconds for AWK. A comprehensive scripting language like Python or Tcl is definitely more versatile than a specific tool such as AWK. Below is the source code of the three programs, in case you are interested in the details:

$ cat sum.py
#! /cygdrive/c/Python25/python

import sys

if len(sys.argv) != 2:
        print "Usage:", sys.argv[0], "<input-log>"
        sys.exit(1)


sc={}
cs={}
cnt={}
for i in range(24):
        index='%02d' % i
        sc[index]=0
        cs[index]=0
        cnt[index]=0


file=open(sys.argv[1],'r')
line=file.readline()
while line:
        fields=line.split()
        times=fields[1].split(':')
        hour=times[0]
        sc[hour] += int(fields[18])
        cs[hour] += int(fields[19])
        cnt[hour] += 1
        line=file.readline()
file.close()


k=sc.keys()
k.sort()
for i in k:
        print i,sc[i],cs[i],cnt[i]




$ cat sum.sh
#! /bin/sh

if [ $# -ne 1 ]; then
        echo "Usage: $0 <input-log>"
        exit 1
fi

awk '
{
        split($2,t,":")
        hr=t[1]
        sc[hr]+=$19
        cs[hr]+=$20
        hit[hr]++
}
END {
        for ( h=0 ; h<24 ; ++h ) {
                hh=sprintf("%02d",h)
                print hh, sc[hh], cs[hh], hit[hh]
        }
}' "$1"




$ cat sum.tcl
#! /cygdrive/c/ActiveTcl/8.4.12.0/bin/tclsh

if { $argc != 1 } {
        puts stderr "Usage: $argv0 <input-log>"
        exit 1
}
set logfile [lindex $argv 0]
if { ![file exists $logfile] } {
        puts stderr "Error. $logfile does not exist"
        exit 2
}


# initialise to 0
set hours {}
for { set h 0 } { $h < 24 } { incr h } {
        lappend hours [format {%02d} $h]
}
foreach hr $hours {
        set sc($hr) 0
        set cs($hr) 0
        set hit($hr) 0
}


set fp [open $logfile r]
while { [gets $fp line] >= 0 } {
        set time [lindex $line 1]
        set hr [lindex [split $time :] 0]
        incr sc($hr) [lindex $line 18]
        incr cs($hr) [lindex $line 19]
        incr hit($hr)
}
close $fp


foreach hr $hours {
        puts "$hr $sc($hr) $cs($hr) $hit($hr)"
}
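
For comparison, here is how the aggregation in sum.py above might look in Python 3. It is a sketch under the same assumptions as the original (field 1 is the time, fields 18 and 19 are the byte counters). The names summarize and make_line, the sample lines, and the skipping of '#' header lines are my own additions and not part of the benchmarked script.

```python
from collections import defaultdict


def summarize(lines):
    """Sum bytes sent, bytes received and hits per hour.

    Assumes, as sum.py does, that field 1 is the time (HH:MM:SS) and
    fields 18 and 19 are the two byte counters.
    """
    totals = defaultdict(lambda: [0, 0, 0])
    for line in lines:
        fields = line.split()
        # Assumption: skip blank lines and '#' header lines.
        if not fields or fields[0].startswith("#"):
            continue
        hour = fields[1].split(":")[0]
        totals[hour][0] += int(fields[18])
        totals[hour][1] += int(fields[19])
        totals[hour][2] += 1
    return totals


def make_line(time, sent, received):
    """Build a hypothetical 20-field log line for the demo below."""
    return " ".join(["2008-06-16", time] + ["-"] * 16 + [str(sent), str(received)])


sample = [
    "#Fields: made-up header line",
    make_line("09:15:02", 100, 20),
    make_line("09:59:59", 50, 5),
    make_line("23:00:00", 7, 3),
]
totals = summarize(sample)

# One line per hour, 00 to 23, like the three scripts above.
for h in range(24):
    hour = "%02d" % h
    sc, cs, hits = totals.get(hour, (0, 0, 0))
    print(hour, sc, cs, hits)
```

To run it on a real log, pass an open file object to summarize instead of the sample list, since iterating over a file yields its lines.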

I have only covered 200 of the 746 pages of Learning Python, 3rd Edition, and hope to explore more features as I go into the details. So far, I particularly like the feature-rich methods available on the core objects. However, I still have not figured out how to differentiate between an attribute and a method of an object.
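
One possible way to tell the two apart, sketched here rather than taken from the book: a method is just an attribute whose value can be called, so the built-ins dir, getattr and callable are enough to sort the names. The helper name split_members is my own.

```python
def split_members(obj):
    """Return (data_attributes, methods) as two lists of public names."""
    data, methods = [], []
    for name in dir(obj):
        if name.startswith("_"):
            continue  # skip private and special names to keep the output short
        if callable(getattr(obj, name)):
            methods.append(name)
        else:
            data.append(name)
    return data, methods


data, methods = split_members(3 + 4j)
print(data)     # names that hold plain values
print(methods)  # names that can be called
```

For a complex number, real and imag land in the first list and conjugate in the second.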
