Wednesday, March 19, 2008

HTTP Status Codes Summary, The AWK Way

A colleague of mine is writing an awk program to get a monthly summary of HTTP status codes. This enables him to use it as a reference to cross-check against a commercial web log tool.

His code is something like this:

$ awk '{++s[$(NF-1)]}END{for(i in s){print i,s[i]}}' access_log | sort
200 916952
302 10031
304 265012
400 22
401 323
404 253
500 3048
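
To see why this works, here is a minimal sketch on three made-up log lines. It assumes the log is in Common Log Format, where the last field is the response size and the second-to-last field, $(NF-1), is the status code; the sample_log and count_codes helpers are hypothetical names used only for illustration. If your log uses a different format (for example one that appends the referrer and user agent), $(NF-1) will not be the status code.

```shell
# Hypothetical sample lines in Common Log Format; the last field is the
# response size, so $(NF-1) is the status code.
sample_log() {
  cat <<'EOF'
192.0.2.1 - - [01/Jan/2008:00:00:01 +0800] "GET /index.html HTTP/1.0" 200 1043
192.0.2.2 - - [01/Jan/2008:00:00:02 +0800] "GET /missing.html HTTP/1.0" 404 209
192.0.2.1 - - [02/Jan/2008:10:15:00 +0800] "GET /index.html HTTP/1.0" 200 1043
EOF
}

# Count each status code, then sort the "code count" pairs by code.
count_codes() {
  awk '{ ++s[$(NF-1)] } END { for (i in s) print i, s[i] }' | sort
}

sample_log | count_codes
```

On the three sample lines this prints "200 2" followed by "404 1". The trailing sort is needed because "for (i in s)" visits the array in no particular order.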

This may serve his purpose for checking. However, I think it is possible to write a complete HTTP status code summary in gawk that presents the counts on a per-day basis. The "asort" function in gawk (Sorting Array Values and Indices) is very handy for sorting arrays so that rows and columns can be displayed in order. Below is my implementation:

$ cat ncode.sh
#! /bin/bash


if [ $# -ne 1 ]; then
 echo "Usage: $0 <access_log>"
 echo "       <access_log> can be either plain text or gzip compressed"
 exit 1
fi
log=$1


if [ ! -f "$log" ]; then
 echo "Error. $log does not exist"
 exit 2
fi


# pick a reader depending on whether the log is gzip compressed
if file "$log" | grep -q "gzip compressed data"; then
 cmd="zcat"
else
 cmd="cat"
fi


$cmd "$log" | gawk '
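# print a line of n dashes; i is declared as a local variable
function separator(n,    i)
{
 for ( i=1 ; i<=n ; ++i ) {
  printf("-")
 }
 printf("\n")
}
# keep only lines whose second-to-last field is in the HTTP status code range
$(NF-1)>=100 && $(NF-1)<=505 {
 date=substr($4,2,11)   # "[01/Jan/2008:..." -> "01/Jan/2008"
 code=$(NF-1)
 a_code[code]=code
 a_date[date]=date
 a_cd[date,code]++
}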
END {
 nc=asort(a_code)
 nd=asort(a_date)

 separator(80)

 # header for http code
 printf("HTTP Codes:")
 for ( c=1 ; c<=nc ; ++c ) {
  printf("%8d", a_code[c])
 }
 printf("   Total\n")

 separator(80)

 # result per date
 for ( d=1 ; d<=nd ; ++d ) {
  printf("%s", a_date[d])
  total=0
  for ( c=1 ; c<=nc ; ++c ) {
   value=a_cd[a_date[d],a_code[c]]
   printf("%8d", value)
   total+=value
  }
  printf("%8d\n", total)
 }

 separator(80)

 # total by code
 printf("Total:     ")
 all=0
 for ( c=1 ; c<=nc ; ++c ) {
  total=0
  for ( d=1 ; d<=nd ; ++d ) {
   value=a_cd[a_date[d],a_code[c]]
   total+=value
  }
  all+=total
  printf("%8d", total)
 }
 printf("%8d\n", all)

 separator(80)
}'
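
The script relies on asort to turn the associative arrays a_code and a_date into ordered lists. The following is a minimal sketch of that one step in isolation, assuming gawk is installed (asort is a gawk extension, not part of POSIX awk); sorted_values is a hypothetical helper name used only for illustration.

```shell
# asort() sorts an array by value, replaces the original indices with
# 1..n, and returns n.
sorted_values() {
  gawk '
  { seen[$1] = $1 }     # index and value are the same, as in a_code/a_date
  END {
    n = asort(seen)     # seen[1..n] now holds the distinct values in order
    for (k = 1; k <= n; ++k) {
      printf("%s%s", seen[k], (k < n ? " " : "\n"))
    }
  }'
}

# Duplicate codes collapse into one entry before sorting.
printf '%s\n' 404 200 500 200 304 | sorted_values

# Dates are compared as plain strings, not as calendar dates.
printf '%s\n' 02/Jan/2008 01/Feb/2008 01/Jan/2008 | sorted_values
```

The first call prints the codes in ascending order. The second shows a limitation worth knowing: because the dates are compared as strings, 01/Feb/2008 sorts before 01/Jan/2008, so the per-day rows come out in calendar order only when the log covers a single month, as the January log in this post does.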

It took just under 15 seconds on my notebook (Intel Celeron 1.4GHz, 512 MB memory) to summarise 1,192,178 lines of web access log with gawk 3.1.6 in Cygwin.

$ ./ncode.sh
Usage: ./ncode.sh <access_log>
       <access_log> can be either plain text or gzip compressed

$ ./ncode.sh access_log.gz
--------------------------------------------------------------------------------
HTTP Codes:     200     302     304     400     401     404     500   Total
--------------------------------------------------------------------------------
01/Jan/2008   22038       8     290       0       2       0       0   22338
02/Jan/2008   30732     499   11427       0      14      10     100   42782
03/Jan/2008   31988     529   11718       0      14       6     203   44458
04/Jan/2008   23525      81    2199       0       3       2       1   25811
05/Jan/2008   21865       1     246       1       0       1       0   22114
06/Jan/2008   29891     184    7874       2       0       5      60   38016
07/Jan/2008   30866     370   10107       4      11       8      23   41389
08/Jan/2008   32001     608   12380       1      24      22      67   45103
09/Jan/2008   33043     586   14069       0      42      11     151   47902
10/Jan/2008   34076     438   12374       0      28      14     128   47058
11/Jan/2008   23703      63    2604       0       0       1       5   26376
12/Jan/2008   21811      17     393       0       1       1       3   22226
13/Jan/2008   30458     341    7867       1      18      12      89   38786
14/Jan/2008   32659     348   10302       0       9      11      65   43394
15/Jan/2008   34758     539   13515       2      15      13      79   48921
16/Jan/2008   32477     457   13728       0      22      11     924   47619
17/Jan/2008   33215     406   10919       0      15       8      75   44638
18/Jan/2008   23717      90    1275       0       0       3      80   25165
19/Jan/2008   21947       0      42       0       0       0      54   22043
20/Jan/2008   33129     378   11618       0      15      24     102   45266
21/Jan/2008   32149     493   14163       0       8      18      78   46909
22/Jan/2008   34153     477   13045       1       9       9      82   47776
23/Jan/2008   32234     312   10560       0       5       6      77   43194
24/Jan/2008   34076     533   12402      10      12       4      70   47107
25/Jan/2008   23917      98    1724       0       4       2     106   25851
26/Jan/2008   22046       0       6       0       0       1      46   22099
27/Jan/2008   32851     329   12652       0      17       8      58   45915
28/Jan/2008   36528     447   14036       0       6       8      83   51108
29/Jan/2008   36627     664   14179       0       9      18      97   51594
30/Jan/2008   33731     522   10825       0       8      10      87   45183
31/Jan/2008   20741     213    6473       0      12       6      55   27500
--------------------------------------------------------------------------------
Total:       916952   10031  265012      22     323     253    3048 1195641
--------------------------------------------------------------------------------

You must be wondering why I need to 'reinvent the wheel' when there are free open source tools (e.g. Analog, AWStats, Webalizer, ...) that can do a much better job. It is because I believe:

"I hear and I forget; I see and I remember; I do and I understand"
- Chinese Proverb
"Willing is not enough; we must do. Knowing is not enough; we must apply."
- Bruce Lee
The equivalent in the IT world would be:
"I install and I am just an Installer; I use and I am just a User; I write and I am proud to call myself a Software Engineer"
- Chihung's proverb, hopefully someone will quote it in the future :-)
