Saturday, May 31, 2008

AWK Can Do Lookup

AWK can do a direct lookup driven by an input file. All you have to do is build an associative array (in my case, array L) in the BEGIN block.
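To make the idea concrete, here is a minimal sketch of loading a lookup file into an associative array in the BEGIN block. The two-line lookup file, the variable names (lookup, line, code, text) and the use of plain awk instead of gawk are my assumptions for illustration only; it is not the full script.

```shell
# Hypothetical two-line lookup file, created only for this sketch
lookup=$(mktemp)
printf '200 OK\n404 Not Found\n' > "$lookup"

# Load the lookup into array L in BEGIN, then translate codes read from stdin
result=$(printf '404\n200\n' | awk -v lookup="$lookup" '
BEGIN {
  # Read the lookup file line by line; the code comes first, the text after it
  while ((getline line < lookup) > 0) {
    code = line
    sub(/ .*/, "", code)      # keep everything before the first space
    text = line
    sub(/^[^ ]+ /, "", text)  # drop the code and the space after it
    L[code] = text
  }
  close(lookup)
}
{ print $1, "->", L[$1] }')

echo "$result"
rm -f "$lookup"
```

Because the array is filled before the first input record is read, every record can be translated with a single L[...] subscript.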

I chose the web access log as an example; the lookup is based on the Hypertext Transfer Protocol -- HTTP/1.1 Status Code Definitions, e.g., 200 -> OK.

My initial version relied on shell tricks that were inefficient and error-prone. After browsing through "The AWK Programming Language" (written by the AWK authors: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan), I was able to come up with this clean and readable code. Although the book was written in 1988, IMHO it is still the best book on AWK.

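#! /bin/sh

if [ $# -ne 2 ]; then
 echo "Usage: $0 <lookup-file> <data-file>"
 exit 1
fi

if [ ! -f "$1" ]; then
 echo "Error. \"$1\" lookup file does not exist"
 exit 2
fi
if [ ! -f "$2" ]; then
 echo "Error. \"$2\" data file does not exist"
 exit 3
fi

gawk -v lookup="$1" '
BEGIN {
 # establish lookup: L[code] = description
 while ( (getline < lookup) > 0 ) {
  V=$2
  for ( i=3 ; i<=NF ; ++i ) {
   V=sprintf("%s %s",V,$i)
  }
  L[$1]=V
 }
 close(lookup)
}
{
 # count each status code; assumes the code is the second-to-last
 # field, as in the sample access log
 ++s[$(NF-1)]
}
END {
 # HTTP status code summary
 for ( i in s ) {
  printf("\"%s\" has %d counts\n", L[i], s[i])
 }
}' "$2"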
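The script above reads the lookup file with getline inside gawk. A different idiom that is also common is to pass both files as arguments and use NR == FNR to recognise the first one. The sketch below shows that alternative, not the method used in my script. The temporary files, the variable names and the 404 log line are made up for illustration, and it assumes the lookup file is not empty (otherwise NR == FNR would also match the log).

```shell
# Hypothetical sample files, created only for this sketch
lookup=$(mktemp); log=$(mktemp)
printf '200 OK\n404 Not Found\n' > "$lookup"
printf '%s\n' \
  '- - [01/Mar/2006:15:30:26 +0800] "GET / HTTP/1.1" 200 1456' \
  '- - [01/Mar/2006:15:30:27 +0800] "GET /missing HTTP/1.1" 404 210' \
  '- - [01/Mar/2006:15:30:30 +0800] "GET /manual/ HTTP/1.1" 200 9187' > "$log"

summary=$(awk '
NR == FNR {            # true only while reading the first file (the lookup)
  code = $1
  $1 = ""              # blank the code; $0 is rebuilt with a leading space
  sub(/^ /, "")        # remove that leading space
  L[code] = $0
  next
}
{ ++s[$(NF-1)] }       # second file: count by status code
END {
  for (c in s) printf("\"%s\" has %d counts\n", L[c], s[c])
}' "$lookup" "$log" | sort)

echo "$summary"
rm -f "$lookup" "$log"
```

The output is piped through sort only because the order of "for (c in s)" is unspecified in AWK.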

Below are the lookup file, a sample of the access log, and a run of the script, which builds the lookup dynamically from the lookup file.

$ cat lookup.txt
200 OK
201 Created
202 Accepted
203 Non Authoritative Information
204 No Content
205 Reset Content
206 Partial Content
300 Multiple Choices
301 Moved Permanently
302 Found
303 See Other
304 Not Modified
305 Use Proxy
306 Unused
307 Temporary Redirect
400 Bad Request
401 Unauthorized
402 Payment Required
403 Forbidden
404 Not Found
405 Method Not Allowed
406 Not Acceptable
407 Proxy Authentication Required
408 Request Timeout
409 Conflict
410 Gone
411 Length Required
412 Precondition Failed
413 Request Entity Too Large
414 Request URI Too Long
415 Unsupported Media Type
416 Request Range Not Satisfiable
417 Expectation Failed
500 Internal Server Error
501 Not Implemented
502 Bad Gateway
503 Service Unavailable
504 Gateway Timeout
505 HTTP Version Not Supported

$ head access_log
- - [01/Mar/2006:15:30:26 +0800] "GET / HTTP/1.1" 200 1456
- - [01/Mar/2006:15:30:26 +0800] "GET /apache_pb.gif HTTP/1.1" 200 2326
- - [01/Mar/2006:15:30:30 +0800] "GET /manual/ HTTP/1.1" 200 9187
- - [01/Mar/2006:15:30:30 +0800] "GET /manual/images/pixel.gif HTTP/1.1" 200 61
- - [01/Mar/2006:15:30:30 +0800] "GET /manual/images/apache_header.gif HTTP/1.1" 200 4084
- - [01/Mar/2006:15:30:30 +0800] "GET /manual/images/index.gif HTTP/1.1" 200 1540
- - [01/Mar/2006:15:30:38 +0800] "GET /manual/howto/cgi.html HTTP/1.1" 200 22388
- - [01/Mar/2006:15:30:38 +0800] "GET /manual/images/home.gif HTTP/1.1" 200 1465
- - [01/Mar/2006:15:30:38 +0800] "GET /manual/images/sub.gif HTTP/1.1" 200 6083
- - [01/Mar/2006:15:33:15 +0800] "GET /manual/howto/cgi.html HTTP/1.1" 200 22388

$ ./ lookup.txt access_log
"Not Modified" has 239 counts
"Bad Request" has 1 counts
"Unauthorized" has 18 counts
"Forbidden" has 23 counts
"OK" has 11378 counts
"Not Found" has 3257 counts
"Internal Server Error" has 4 counts
"Bad Gateway" has 2 counts

