Chi Hung Chan: May 2008

Saturday, May 31, 2008

AWK Can Do Lookup

AWK can do direct lookups against an input file. All you have to do is build an associative array (in my case, array L) in the BEGIN block.

I chose a web access log as the example, and the lookup is based on the Hypertext Transfer Protocol -- HTTP/1.1 Status Code Definitions, e.g. 200 -> OK.

My initial version was based on some shell tricks that were very inefficient and error-prone. After browsing through "The AWK Programming Language" (written by the AWK authors, Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan), I was able to come up with this clean and readable code. Although the book was written in 1988, IMHO it is still the best book on AWK.

#! /bin/sh

if [ $# -ne 2 ]; then
 echo "Usage: $0 <lookup-file> <data-file>"
 exit 1
fi


if [ ! -f "$1" ]; then
 echo "Error. \"$1\" lookup file does not exist"
 exit 2
fi
if [ ! -f "$2" ]; then
 echo "Error. \"$2\" data file does not exist"
 exit 3
fi


gawk -v lookup="$1" '
BEGIN {
 # establish lookup: field 1 is the code, the remaining fields are the description
 while ( (getline < lookup) > 0 ) {
  V=$2
  for ( i=3 ; i<=NF ; ++i ) {
   V=sprintf("%s %s",V,$i)
  }
  L[$1]=V
 }

}
{
 # HTTP status code summary
 ++s[$9]
}
END {
 for ( i in s ) {
  printf("\"%s\" has %d counts\n", L[i], s[i])
 }
}' "$2"

Here are the lookup file, a sample of the access log, and the output of the script, which builds the lookup dynamically:

$ cat lookup.txt
200 OK
201 Created
202 Accepted
203 Non Authoritative Information
204 No Content
205 Reset Content
206 Partial Content
300 Multiple Choices
301 Moved Permanently
302 Found
303 See Other
304 Not Modified
305 Use Proxy
306 Unused
307 Temporary Redirect
400 Bad Request
401 Unauthorized
402 Payment Required
403 Forbidden
404 Not Found
405 Method Not Allowed
406 Not Acceptable
407 Proxy Authentication Required
408 Request Timeout
409 Conflict
410 Gone
411 Length Required
412 Precondition Failed
413 Request Entity Too Large
414 Request URI Too Long
415 Unsupported Media Type
416 Request Range Not Satisfiable
417 Expectation Failed
500 Internal Server Error
501 Not Implemented
502 Bad Gateway
503 Service Unavailable
504 Gateway Timeout
505 HTTP Version Not Supported

$ head access_log
127.0.0.1 - - [01/Mar/2006:15:30:26 +0800] "GET / HTTP/1.1" 200 1456
127.0.0.1 - - [01/Mar/2006:15:30:26 +0800] "GET /apache_pb.gif HTTP/1.1" 200 2326
127.0.0.1 - - [01/Mar/2006:15:30:30 +0800] "GET /manual/ HTTP/1.1" 200 9187
127.0.0.1 - - [01/Mar/2006:15:30:30 +0800] "GET /manual/images/pixel.gif HTTP/1.1" 200 61
127.0.0.1 - - [01/Mar/2006:15:30:30 +0800] "GET /manual/images/apache_header.gif HTTP/1.1" 200 4084
127.0.0.1 - - [01/Mar/2006:15:30:30 +0800] "GET /manual/images/index.gif HTTP/1.1" 200 1540
127.0.0.1 - - [01/Mar/2006:15:30:38 +0800] "GET /manual/howto/cgi.html HTTP/1.1" 200 22388
127.0.0.1 - - [01/Mar/2006:15:30:38 +0800] "GET /manual/images/home.gif HTTP/1.1" 200 1465
127.0.0.1 - - [01/Mar/2006:15:30:38 +0800] "GET /manual/images/sub.gif HTTP/1.1" 200 6083
127.0.0.1 - - [01/Mar/2006:15:33:15 +0800] "GET /manual/howto/cgi.html HTTP/1.1" 200 22388

$ ./lookup.sh lookup.txt access_log
"Not Modified" has 239 counts
"Bad Request" has 1 counts
"Unauthorized" has 18 counts
"Forbidden" has 23 counts
"OK" has 11378 counts
"Not Found" has 3257 counts
"Internal Server Error" has 4 counts
"Bad Gateway" has 2 counts
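
For comparison, here is a minimal sketch of the same lookup using the common two-file NR==FNR idiom instead of getline in the BEGIN block. The function name status_summary and the "Unknown" fallback are my own assumptions for illustration, and the idiom misbehaves if the lookup file is empty, so treat it as a sketch rather than a drop-in replacement. The output is piped through sort only to make the order predictable.

```shell
#!/bin/sh
# status_summary is a hypothetical helper, not part of the original script.
# $1 = lookup file ("code description..."), $2 = access log (status in field 9)
status_summary() {
    awk '
    NR == FNR {                       # first file: build the lookup table
        code = $1
        sub(/^[^ \t]+[ \t]+/, "")     # drop the code, keep the description
        L[code] = $0
        next
    }
    { ++s[$9] }                       # second file: count status codes
    END {
        for (i in s) {
            # assumed fallback for codes missing from the lookup file
            desc = (i in L) ? L[i] : ("Unknown (" i ")")
            printf("\"%s\" has %d counts\n", desc, s[i])
        }
    }' "$1" "$2" | sort
}
```

It would be called as: status_summary lookup.txt access_log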

Labels: awk, http, shell script

Swapping Row to Column, Code Refactoring

While I was mopping the floor this morning (I am not kidding), I realised that I could improve on my previous post, Swapping Row to Column, by defining a user function in AWK. Previously I tried to use a shell function to generalise the special separator code instead of repeating it for both the input and output separators. However, it was not as easy as I thought because these separators upset the shell syntax.

In the function, I also include a few extra separators: COLON, COMMA, DOUBLEQUOTE and SINGLEQUOTE. The single quote (') is not as straightforward as the others because the shell treats it as the end of the quoted awk program. To get a single quote inside AWK, I use sprintf to convert its ASCII code (39) into a character and store it in a local variable (sq).

Here is the revised code:

#! /bin/sh


usage()
{
 echo "Usage: $0 [-h] [-i sep] [-o sep] [input-file]"
 echo "       -h : to print this help message"
 echo "       -i : input field separator  [default: whitespace]"
 echo "       -o : output field separator [default: space]"
 echo "Note: special field separator"
 echo "      NULL, SPACE, PIPE, COLON, COMMA, SINGLEQUOTE, DOUBLEQUOTE"
}


set -- `getopt i:o:h $* 2>/dev/null`
if [ $? -ne 0 ]; then
 usage
 exit 1
fi


isep="[ \t]+"
osep=" "
while [ $# -gt 0 ]; do
 case $1 in
  -i)
   isep=$2
   shift 2
   ;;
  -o)
   osep=$2
   shift 2
   ;;
  -h)
   usage
   exit 0
   ;;
  --)
   # end of options; what remains is the optional input file
   shift
   break
   ;;
  *)
   break
   ;;
 esac
done



gawk -v isep="$isep" -v osep="$osep" '
function separator(sep, sq)
{
 if ( sep == "NULL" ) {
  return ""
 }
 if ( sep == "SPACE" ) {
  return " "
 }
 if ( sep == "PIPE" ) {
  return "|"
 }
 if ( sep == "COMMA" ) {
  return ","
 }
 if ( sep == "COLON" ) {
  return ":"
 }
 if ( sep == "DOUBLEQUOTE" ) {
  return "\""
 }
 if ( sep == "SINGLEQUOTE" ) {
  # you cannot return a single quote because the shell will
  # think that you are trying to close the awk command
  sq=sprintf("%c",39)
  return sq
 }
 return sep
}
BEGIN {
 FS=separator(isep)
 max=0
}
{
 for ( i=1 ; i<=NF ; ++i ) {
  a[i,NR]=$i
 }
 if ( NF > max ) { max=NF }
}
END {
 for ( i=1 ; i<=max ; ++i ) {
  for ( j=1 ; j<NR ; ++j ) {
   printf("%s%s", a[i,j], separator(osep))
  }
  print a[i,j]

 }
}' $1
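
The single quote problem has a few common workarounds besides sprintf("%c",39). The sketch below shows three of them side by side; quote_word is a hypothetical helper name used only for this illustration, and which form reads best is a matter of taste.

```shell
#!/bin/sh
# quote_word is a hypothetical helper used only for this illustration.
# It prints its argument wrapped in single quotes, three different ways.
quote_word() {
    awk -v sq="'" -v w="$1" '
    BEGIN {
        print sq w sq                    # 1. quote passed in with -v
        printf("\047%s\047\n", w)        # 2. octal escape \047 inside an awk string
        printf("%c%s%c\n", 39, w, 39)    # 3. %c with the ASCII code, as in the script above
    }'
}
```

Calling quote_word abc prints 'abc' three times, once per technique.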

Labels: awk, Cygwin, shell script

Friday, May 30, 2008

Swapping Row to Column

If you need to swap rows and columns, you may want to take a look at this script. Basically I store the data in matrix form, a[1,1], a[1,2], ... inside AWK. While processing each row, I also track the maximum number of fields seen in any row. The rows and columns are swapped and printed in AWK's END block. In linear algebra this is called the transpose, i.e. A_ij becomes A_ji.

The script below works on my Cygwin under Windows XP.

#! /bin/sh


usage()
{
 echo "Usage: $0 [-h] [-i sep] [-o sep] [input-file]"
 echo "       -h : to print this help message"
 echo "       -i : input field separator  [default: whitespace]"
 echo "       -o : output field separator [default: space]"
 echo "Note: special field separator - NULL, SPACE, PIPE"
}


set -- `getopt i:o:h $* 2>/dev/null`
if [ $? -ne 0 ]; then
 usage
 exit 1
fi


isep="[ \t]+"
osep=" "
while [ $# -gt 0 ]; do
 case $1 in
  -i)
   isep=$2
   if [ "$isep" = "NULL" ]; then
    isep=""
   elif [ "$isep" = "SPACE" ]; then
    isep=" "
   elif [ "$isep" = "PIPE" ]; then
    isep="|"
   fi
   shift 2
   ;;
  -o)
   osep=$2
   if [ "$osep" = "NULL" ]; then
    osep=""
   elif [ "$osep" = "SPACE" ]; then
    osep=" "
   elif [ "$osep" = "PIPE" ]; then
    osep="|"
   fi
   shift 2
   ;;
  -h)
   usage
   exit 0
   ;;
  --)
   # end of options; what remains is the optional input file
   shift
   break
   ;;
  *)
   break
   ;;
 esac
done



gawk -v isep="$isep" -v osep="$osep" '
BEGIN {
 FS=isep
 max=0
}
{
 for ( i=1 ; i<=NF ; ++i ) {
  a[i,NR]=$i
 }
 if ( NF > max ) { max=NF }
}
END {
 for ( i=1 ; i<=max ; ++i ) {
  for ( j=1 ; j<NR ; ++j ) {
   printf("%s%s", a[i,j], osep)
  }
  print a[i,j]

 }
}' $1
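
As an aside on the option parsing in the script above: the external getopt command combined with an unquoted $* splits separators that contain spaces. A minimal sketch of the same options using the shell's built-in getopts is shown here. The function name parse_seps and the variable infile are assumptions made for this example, not part of the original script.

```shell
#!/bin/sh
# parse_seps is a hypothetical helper; it sets isep, osep and infile.
parse_seps() {
    isep="[ \t]+"          # same defaults as the script above
    osep=" "
    OPTIND=1               # reset so the function can be called more than once
    while getopts i:o:h opt; do
        case $opt in
            i) isep=$OPTARG ;;
            o) osep=$OPTARG ;;
            h) echo "Usage: [-h] [-i sep] [-o sep] [input-file]"; return 0 ;;
            *) return 1 ;;   # unknown option; getopts has already complained
        esac
    done
    shift $((OPTIND - 1))
    infile=${1-}           # optional input file, empty means stdin
}
```

For example, parse_seps -i , -o "a b" data.txt leaves isep set to a comma, osep set to "a b" with the space intact, and infile set to data.txt.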

The script in action:

$ uname -a
CYGWIN_NT-5.1 chihung 1.5.25(0.156/4/2) 2007-12-14 19:21 i686 Cygwin

$ ./rowcol.sh -h
Usage: ./rowcol.sh [-h] [-i sep] [-o sep] [input-file]
       -h : to print this help message
       -i : input field separator  [default: whitespace]
       -o : output field separator [default: space]
Note: special field separator - NULL, SPACE, PIPE

$ echo a b c | ./rowcol.sh
a
b
c

$ cat rowcol-1.txt
a b c d e f A B C D E F
g h i j k l G H I J K L
m n o p q r M N O P Q R s h o r t
s t u v w x S T U V W X l o n g e r
y z 1 2 3 4 Y Z 1 2 3 4
5 6 7 8 9 0 5 6 7 8 9 0

$ ./rowcol.sh -o PIPE rowcol-1.txt
a|g|m|s|y|5
b|h|n|t|z|6
c|i|o|u|1|7
d|j|p|v|2|8
e|k|q|w|3|9
f|l|r|x|4|0
A|G|M|S|Y|5
B|H|N|T|Z|6
C|I|O|U|1|7
D|J|P|V|2|8
E|K|Q|W|3|9
F|L|R|X|4|0
||s|l||
||h|o||
||o|n||
||r|g||
||t|e||
|||r||

$ ./rowcol.sh -o , rowcol-1.txt
a,g,m,s,y,5
b,h,n,t,z,6
c,i,o,u,1,7
d,j,p,v,2,8
e,k,q,w,3,9
f,l,r,x,4,0
A,G,M,S,Y,5
B,H,N,T,Z,6
C,I,O,U,1,7
D,J,P,V,2,8
E,K,Q,W,3,9
F,L,R,X,4,0
,,s,l,,
,,h,o,,
,,o,n,,
,,r,g,,
,,t,e,,
,,,r,,

$ cat rowcol-2.txt
chanchihung
chihungchan
hungchichan
chanhungchi

$ ./rowcol.sh -i NULL -o : rowcol-2.txt
c:c:h:c
h:h:u:h
a:i:n:a
n:h:g:n
c:u:c:h
h:n:h:u
i:g:i:n
h:c:c:g
u:h:h:c
n:a:a:h
g:n:n:i
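
As the PIPE and comma examples show, short rows leave empty cells in the transposed output. Here is a small sketch of a variation that fills those holes with a placeholder. The function name transpose and its filler argument are assumptions for illustration only, and the output separator is fixed to a single space to keep the example short.

```shell
#!/bin/sh
# transpose is a hypothetical variation of the script above.
# $1 = filler printed where a short row has no value; input is read from stdin.
transpose() {
    awk -v fill="$1" '
    {
        for (i = 1; i <= NF; ++i) a[i, NR] = $i
        if (NF > max) max = NF
    }
    END {
        for (i = 1; i <= max; ++i) {
            line = ""
            for (j = 1; j <= NR; ++j) {
                # (i, j) in a tests for a stored cell without creating one
                cell = ((i, j) in a) ? a[i, j] : fill
                line = line (j > 1 ? " " : "") cell
            }
            print line
        }
    }'
}
```

For the two-row input "1 2 3" and "4 5", running transpose - gives three lines: "1 4", "2 5" and "3 -".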

Labels: awk, Cygwin, shell script

Saturday, May 24, 2008

Computer History

Not much to blog about regarding scripting, but there are tonnes of videos on YouTube.com about computer history that I would like to share with you.

I just finished watching "The Origins of Linux - Linus Torvalds" on YouTube's Computer History channel, brought to you by the Computer History Museum.

Do you want to know why a brilliant guy like Linus Torvalds took 8.5 years to finish his master's degree?

Another video I watched 2 years ago was Odysseys in Technology, Sun Microsystems Founders Panel

Sunday, May 18, 2008

SGE qstat XML Stylesheet

You can query the Sun Grid Engine status with XML output using qstat -xml

<?xml version="1.0"?>
<job_info xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <queue_info>
    <Queue-List>
      <name>all.q@l1</name>
      <qtype>BIP</qtype>
      <slots_used>1</slots_used>
      <slots_total>2</slots_total>
      <arch>lx24-amd64</arch>
      <job_list state="running">
        <JB_job_number>652</JB_job_number>
        <JAT_prio>0.55500</JAT_prio>
        <JB_name>Scene149_V01</JB_name>
        <JB_owner>renderer</JB_owner>
        <state>r</state>
        <JAT_start_time>2008-04-29T22:21:41</JAT_start_time>
        <slots>1</slots>
        <tasks>38</tasks>
      </job_list>
    </Queue-List>
    <Queue-List>
      <name>all.q@l10</name>
      <qtype>BIP</qtype>
      <slots_used>1</slots_used>
      <slots_total>2</slots_total>
      <arch>lx24-amd64</arch>
      <job_list state="running">
        <JB_job_number>652</JB_job_number>
        <JAT_prio>0.55500</JAT_prio>
        <JB_name>Scene149_V01</JB_name>
        <JB_owner>renderer</JB_owner>
        <state>r</state>
        <JAT_start_time>2008-04-29T22:21:56</JAT_start_time>
        <slots>1</slots>
        <tasks>45</tasks>
      </job_list>
    </Queue-List>
    ...
  </queue_info>
  <job_info>
    <job_list state="pending">
      <JB_job_number>652</JB_job_number>
      <JAT_prio>0.55500</JAT_prio>
      <JB_name>Scene149_V01</JB_name>
      <JB_owner>renderer</JB_owner>
      <state>qw</state>
      <JB_submission_time>2008-04-29T09:13:09</JB_submission_time>
      <slots>1</slots>
      <tasks>65-84:1</tasks>
    </job_list>
    <job_list state="pending">
      <JB_job_number>653</JB_job_number>
      <JAT_prio>0.55500</JAT_prio>
      <JB_name>Scene150_V01</JB_name>
      <JB_owner>renderer</JB_owner>
      <state>qw</state>
      <JB_submission_time>2008-04-29T09:13:11</JB_submission_time>
      <slots>1</slots>
    </job_list>
    ...
  </job_info>
</job_info>

However, there is no corresponding XML stylesheet. It is trivial to wrap qstat -xml in a CGI program that adds a customised stylesheet so that the output can be visualised in a browser. A shell script with AWK will do the job.

#! /bin/sh

echo "Content-type: text/xml"
echo
SGE_ROOT=/gridware/sge /gridware/sge/bin/lx24-amd64/qstat -f -xml | \
awk '
{
        print $0
        if ( NR==1 ) {
                printf("<?xml-stylesheet type=\"text/xsl\" href=\"qstat.xsl\"?>\n")
        }
}'
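
To try the injection step without a running grid, the awk part can be pulled into a small filter. This is only a sketch: add_stylesheet is a hypothetical name, and it assumes the XML declaration sits alone on the first line of the input, as it does in the qstat output shown above.

```shell
#!/bin/sh
# add_stylesheet is a hypothetical filter: it copies stdin to stdout and
# inserts an xml-stylesheet processing instruction after the first line.
# $1 = href of the stylesheet
add_stylesheet() {
    awk -v href="$1" '
    { print }
    NR == 1 { printf("<?xml-stylesheet type=\"text/xsl\" href=\"%s\"?>\n", href) }'
}
```

A saved copy of the qstat output could then be checked with: add_stylesheet qstat.xsl < saved-qstat.xml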

Below are my XML stylesheet (qstat.xsl) and its corresponding CSS (qstat.css).

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/job_info">
<html>
<head>
<title>SGE qstat</title>
<link rel="stylesheet" href="qstat.css" type="text/css"/>
</head>


<body bgcolor="#000000">

<table cellspacing="10" border="0">
<tr><td align="left" valign="top">

<!-- Running -->
<h1>Running</h1>
<table border="1" cellpadding="4" cellspacing="0">
<tr>
<th>Queue Name</th>
<th>Job ID</th>
<th>Job Name</th>
<th>Task ID</th>
<th>State</th>
<th>Start Time</th>
</tr>
<xsl:for-each select="queue_info/Queue-List">
<tr>
<td class="value"><xsl:value-of select="name/text()" /></td>
<td class="value"><xsl:value-of select="job_list/JB_job_number/text()" /></td>
<td class="value"><xsl:value-of select="job_list/JB_name/text()" /></td>
<td class="value"><xsl:value-of select="job_list/tasks/text()" /></td>
<td class="value"><span class="running"><xsl:value-of select="job_list/@state" /></span></td>
<td class="value"><xsl:value-of select="job_list/JAT_start_time/text()" /></td>
</tr>
</xsl:for-each>
</table>


</td>
<td align="left" valign="top">


<!-- Pending -->
<h1>Pending</h1>
<table border="1" cellpadding="4" cellspacing="0">
<tr>
<th>Job ID</th>
<th>Job Name</th>
<th>Task ID</th>
<th>State</th>
<th>Submission Time</th>
</tr>
<xsl:for-each select="job_info/job_list">
<tr>
<td class="value"><xsl:value-of select="JB_job_number/text()" /></td>
<td class="value"><xsl:value-of select="JB_name/text()" /></td>
<td class="value"><xsl:value-of select="tasks/text()" /></td>
<td class="value"><span class="pending"><xsl:value-of select="@state" /></span></td>
<td class="value"><xsl:value-of select="JB_submission_time/text()" /></td>
</tr>
</xsl:for-each>
</table>

</td>
</tr>
</table>


</body>
</html>
</xsl:template>
</xsl:stylesheet>

body { 
 background: #000;
 color: #fff;
 font-family: Arial;
 font-size: 12px;
}
h1 { 
 color: #ff7800;
 font-size: 32px;
 font-family: Times;
 font-weight: bold;
}
.value { 
 color: #aaa; 
 font-size: 12px;
 font-family: Arial;
}
.running { 
 color: #0b0; 
 font-size: 12px;
 font-family: Arial;
 font-weight: bold;
}
.pending { 
 color: #b00; 
 font-size: 12px;
 font-family: Arial;
 font-weight: bold;
}

The output on the browser will look like this:

Labels: SGE, XML

Sun Open Storage

Sun Microsystems is going to change the landscape of storage from proprietary to open standard and they call it Sun Open Storage. As mentioned in the OpenSolaris's Storage Community web page:

The importance of OpenSolaris as a storage operating system has really emerged with the new and updated storage features, such as ZFS, NFS, pNFS, Shared QFS, Storage Archive Manager, Honeycomb fixed content management, Availability Suite, iSCSI, etc. OpenSolaris is now being embedded in storage appliances or used on hybrid server/storage devices to manage very large collections of data.

A White Paper: What is Open Storage describes the concept, cost savings and value proposition of having an open standard for the storage software stack. Sun claims that an open storage architecture costs 90 percent less per GB ($/GB) than a closed storage architecture.

With the advanced hardware and software features now available, such as OpenSolaris, ZFS, the Common Internet File System (CIFS) server, other storage projects and the Sun Fire X4500 (a.k.a. Thumper), it is not hard to envisage how all these pieces fit together to realise open standard storage.

There is another article in Sun Developers Network describing how to Set Up an OpenSolaris Storage Server in 10 Minutes or Less. BTW, a new startup company, Nexenta Systems, has been developing open source based storage software solutions.

There is also a video in which Sun's Andy Bechtolsheim, Matt Baier, and Jeff Bonwick join John Fowler to discuss advancements in Open Storage.

Labels: opensolaris, storage, ZFS

Thursday, May 15, 2008

OpenSolaris 200805, Missing Header Files

As you know, I am now running OpenSolaris 2008.05 under VirtualBox on Windows Vista. Today I had time to install Sun Studio 12 and Sun HPC ClusterTools 7.1 in my virtual OpenSolaris so that I can compile all my favourite open source and HPC tools.

I realised that this OpenSolaris 2008.05 does not come with the OS header files (e.g. stdio.h, ...) or the X Window include files (e.g. X11/Xlib.h, ...), probably because the LiveCD version cannot afford to put too many things in 700MB. Anyway, you need to install the SUNWhea and SUNWxwinc packages using the "packagemanager" CLI. BTW, I was not able to launch the "Package Manager" from "System" -> "Administration". The GUI Package Manager screwed up my keyboard input for no obvious reason.

See screen dumps for packagemanager in action.

Labels: opensolaris, virtualbox

Wednesday, May 14, 2008

SGE Accounting

Sun Grid Engine accounting (5) provides a very nice utility to summarise the accounting information of all your grid jobs. In my recent blog regarding SGE for rendering, I recommended including the "-A" flag (for accounting string) and the "-P" flag (for project name) in qsub. This helps you extract the accounting information for a particular job. Now it is time to reap the benefits. Below shows how you can extract the total accounting information per project as well as per accounting string. In my case, the accounting string and job name are the same as the scene file name without the file extension.

$ qacct -P myproject
PROJECT     WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
========================================================================================================================
myproject    33970465      47273362       1178602      48638827       56454795.153              0.000              0.000

$ qacct -A Scene1
Total System Usage
    WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
================================================================================================================
       101114        169445          4951        174490         216975.065              0.000              0.000

In my recent rendering project, I had to deal with 1000+ scene files. To be exact, 1224 unique scene files. We could find the run time information of every single job by looping through them and running qacct on each one. However, that is very inefficient 'cos we would have to read the accounting file 1224 times. Also, the output format cannot be imported into a spreadsheet program for further analysis.

$ awk -F: '$32~/^myproject$/{print $7}' accounting | sort | uniq | wc -l
1224

$ for i in `awk -F: '$32~/^myproject$/{print $7}' accounting | sort | uniq`
do
 echo $i
 qacct -A $i
done

accounting (5) clearly documents every field in the accounting file. We can write an awk program that reads the file once and prints all this information as CSV (comma-separated values).

#! /bin/sh


awk '
BEGIN {
        FS=":"
        OFS=","
}
$32 ~ /^myproject$/ {
        scene=$7
        wallclock[scene]+=$14
        utime[scene]+=$15
        stime[scene]+=$16
        cpu[scene]+=$37
        mem[scene]+=$38
        io[scene]+=$39
        iow[scene]+=$41

        # no of jobs per scene
        ++job[scene]
}
END {
        print "SCENE_NAME","NO_OF_JOBS","WALLCLOCK","UTIME","STIME","CPU","MEMORY","IO","IOW"
        for ( i in wallclock ) {
                print i,job[i],wallclock[i],utime[i],stime[i],cpu[i],mem[i],io[i],iow[i]
        }
}' accounting
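
If you need the same report for more than one project, the project name can be passed in with -v instead of being hard-coded in the pattern. The sketch below is a trimmed-down variation under that assumption: scene_csv is a hypothetical name, and it only sums the wallclock and CPU columns to stay short. The field numbers are the same ones used in the script above.

```shell
#!/bin/sh
# scene_csv is a hypothetical, trimmed-down variation of the script above.
# $1 = project name, $2 = path to the SGE accounting file
scene_csv() {
    awk -F: -v OFS=, -v proj="$1" '
    $32 == proj {                 # field 32 = project, field 7 = accounting string
        ++job[$7]
        wallclock[$7] += $14
        cpu[$7] += $37
    }
    END {
        print "SCENE_NAME", "NO_OF_JOBS", "WALLCLOCK", "CPU"
        for (s in job) print s, job[s], wallclock[s], cpu[s]
    }' "$2"
}
```

It would be called as: scene_csv myproject accounting > myproject.csv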

However, we are still dealing with a lot of data. How about visualising the jobs using Gnuplot? With the raw accounting data, you can find the start and end time of each job. Tcl has an excellent utility for converting epoch time to other date/time formats. Although gnuplot can plot epoch time directly, I realised that it always assumes GMT+0, which messes up the x-axis labels. Anyway, here is the Tcl program that extracts the start times and counts them per day per hour.

set fp [open accounting r]
while { [gets $fp line] >= 0 } {
        set lline [split $line :]

        set project [lindex $lline 31]

        if { $project != "myproject" } {
                continue
        }

        set jobname [lindex $lline 4]
        set starttime [lindex $lline 9]

        set ymdh [clock format $starttime -format {%Y-%m-%d %H:00:00}]

        if { [info exists stats($ymdh)] == 0 } {
                set stats($ymdh) 1
        } else { 
                incr stats($ymdh)
        }
}
close $fp


foreach i [lsort [array names stats]] {
        puts "$i $stats($i)"
}

And the corresponding output (stats.txt) is like this:

...
2008-04-25 22:00:00 137
2008-04-25 23:00:00 83
2008-04-26 00:00:00 83
2008-04-26 01:00:00 86
2008-04-26 02:00:00 84
...

The gnuplot file below visualises the number of jobs started per hour over the entire project.

set terminal png
set output 'stats.png'
set xdata time
set size 1,0.5
set timefmt '%Y-%m-%d %H:%M:%S'
set xrange ['2008-04-23 00:00:00':'2008-05-14 23:59:00']
set yrange [0:]
set title 'Rendering - myproject'
set ylabel 'Jobs per hour'
set xtics 86400 offset 2
set format x "%d\n%b\n%a"
set grid
plot 'stats.txt' using 1:3 with impulses title '#Jobs Started'
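
As an aside to the Tcl step above, the hourly bucketing itself can be done with plain integer arithmetic in awk, with no date formatting at all. This is a different technique from the Tcl clock format call, and it is only a sketch: hourly_counts is a hypothetical name, the output is the epoch second at the start of each hour rather than a formatted date, and the optional third argument assumes a fixed UTC offset in seconds with no daylight saving.

```shell
#!/bin/sh
# hourly_counts is a hypothetical helper, not the Tcl program above.
# $1 = project name, $2 = accounting file,
# $3 = fixed UTC offset in seconds (assumption; defaults to 0)
hourly_counts() {
    awk -F: -v proj="$1" -v off="${3:-0}" '
    $32 == proj {
        t = $10 + off              # field 10 = job start time (epoch seconds)
        ++n[t - (t % 3600)]        # round down to the start of the hour
    }
    END { for (h in n) print h, n[h] }' "$2" | sort -n
}
```

For an +0800 timezone the call would be hourly_counts myproject accounting 28800. Shifting the timestamps this way is one possible way around the GMT+0 labels, but I have not checked it against the plot above.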

Labels: awk, SGE, shell script

Tuesday, May 06, 2008

VirtualBox with the new OpenSolaris

I downloaded the latest OpenSolaris from the new OpenSolaris.com. It is build 86, with ZFS as the root filesystem. Everything went smoothly except that I could not resolve hostnames; I got the error message "node name or service name not known". nslookup is fine and wget with an IP address works, but not with a fully qualified name. This workaround resolved the issue.

There seems to be a fair bit of difference between this build and Solaris 10 U4. I need to find time to explore further.

Labels: Solaris, virtualbox
