Wednesday, March 26, 2008

Tcl Is As Fast As Perl

The Netflix Prize comes with a utility, check_format.pl, that checks the submission format before you upload your predictions.

The Perl utility took 11.437 seconds to check my submission file, which has a total of 2,834,601 lines. So how does Tcl compare with Perl in this "race"? It took me a couple of minutes to crank out a similar utility with nearly the same functionality. Guess what: Tcl is slightly faster than Perl, with a run time of 11.128 seconds. That is 0.309 seconds (about 2.7%) faster.

My version of check_format:

#! /usr/local/bin/tclsh
if { $argc != 1 } {
    puts stderr "Usage: $argv0 <file>"
    exit 1
}
array set m {
1 53
10 10
...
...
9998 5
9999 3
}
set f [lindex $argv 0]
if { ![file exists $f] } {
    puts stderr "Error. $f does not exist"
    exit 2
}
# Open the single file name, not the whole argument list.
set fp [open $f r]
set lineno 1
while { [gets $fp line] >= 0 } {
    if { [string match {*:} $line] } {
        set movie_id [string range $line 0 end-1]
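        # -strict rejects an empty id (a line containing only ":").
        if { ![string is integer -strict $movie_id] } {
            puts "Error. Movie_id=$movie_id not integer"
            exit 3
        }
        # A new header ends the previous movie, so verify its prediction count.
        if { [info exists count] && $count != $m($mid) } {
            puts "Error. Movie_id=$mid, count=$count, should be $m($mid)"
            exit 4
        }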

        set mid $movie_id
        set count 0
        set required $m($mid)
    } else {
        incr count
        if { $line < 1.0 || $line > 5.0 } {
            puts "Error. LINE#=$lineno, Movie id=$mid, count=$count"
            exit 5
        }
    }
    incr lineno
}
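close $fp
# The loop only checks a movie's count when the next header appears,
# so the last movie in the file has to be checked here.
if { [info exists required] && $count != $required } {
    puts "Error. Movie_id=$mid, count=$count, should be $required"
    exit 4
}
puts "OK!"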

See Tcl and Perl in action:

$ perl -v

This is perl, v5.8.4 built for i86pc-solaris-64int
(with 28 registered patches, see perl -V for more detail)

Copyright 1987-2004, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using `man perl' or `perldoc perl'.  If you have access to the
Internet, point your browser at http://www.perl.com/, the Perl Home Page.

$ tclsh
% set tcl_patchLevel
8.4.16
% parray tcl_platform
tcl_platform(byteOrder) = littleEndian
tcl_platform(machine)   = i86pc
tcl_platform(os)        = SunOS
tcl_platform(osVersion) = 5.10
tcl_platform(platform)  = unix
tcl_platform(threaded)  = 1
tcl_platform(user)      = chihung
tcl_platform(wordSize)  = 8
% set tcl_patchLevel
8.4.16
% exit

$ wc -l netflix-submission.txt
  2834601 2834601 11379859 netflix-submission.txt

$ time ../../check_format.pl netflix-submission.txt
OK!

real    0m11.437s
user    0m11.417s
sys     0m0.017s

$ time ./check_format.tcl netflix-submission.txt
OK!

real    0m11.128s
user    0m11.093s
sys     0m0.029s

Tcl has changed a lot, and the statement "everything is a string" is no longer really true. Internally, a value can have more than one representation. As Donal Fellows once mentioned, "Tcl_Obj's are like storks. They have two legs, the internal representation and the string representation. They can stand on either leg, or on both." See the typedef struct Tcl_Obj in tcl.h (8.4):

typedef struct Tcl_Obj {
    int refCount;               /* When 0 the object will be freed. */
    char *bytes;                /* This points to the first byte of the
                                 * object's string representation. The array
                                 * must be followed by a null byte (i.e., at
                                 * offset length) but may also contain
                                 * embedded null characters. The array's
                                 * storage is allocated by ckalloc. NULL
                                 * means the string rep is invalid and must
                                 * be regenerated from the internal rep.
                                 * Clients should use Tcl_GetStringFromObj
                                 * or Tcl_GetString to get a pointer to the
                                 * byte array as a readonly value. */
    int length;                 /* The number of bytes at *bytes, not
                                 * including the terminating null. */
    Tcl_ObjType *typePtr;       /* Denotes the object's type. Always
                                 * corresponds to the type of the object's
                                 * internal rep. NULL indicates the object
                                 * has no internal rep (has no type). */
    union {                     /* The internal representation: */
        long longValue;         /*   - an long integer value */
        double doubleValue;     /*   - a double-precision floating value */
        VOID *otherValuePtr;    /*   - another, type-specific value */
        Tcl_WideInt wideValue;  /*   - a long long value */
        struct {                /*   - internal rep as two pointers */
            VOID *ptr1;
            VOID *ptr2;
        } twoPtrValue;
    } internalRep;
} Tcl_Obj;
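
To make the "two legs" idea concrete, here is a minimal sketch in plain C. It does not use the Tcl library at all: MiniObj and the mini_* functions are hypothetical names invented for this illustration, and the struct is a drastically simplified stand-in for Tcl_Obj (one string rep, one long rep, no refCount or typePtr). It only shows the general pattern of converting lazily and caching, not how Tcl actually implements it.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for Tcl_Obj: a string "leg" and an
 * integer "leg". Either one may be missing and is rebuilt from the other. */
typedef struct MiniObj {
    char *bytes;      /* string rep, or NULL if it must be regenerated */
    int   hasLong;    /* nonzero when longValue holds a valid internal rep */
    long  longValue;  /* internal rep */
} MiniObj;

/* Create an object that starts out standing on the string leg only. */
static MiniObj *mini_new_string(const char *s) {
    MiniObj *o = malloc(sizeof *o);
    if (o == NULL) return NULL;
    o->bytes = malloc(strlen(s) + 1);
    if (o->bytes == NULL) { free(o); return NULL; }
    strcpy(o->bytes, s);
    o->hasLong = 0;
    o->longValue = 0;
    return o;
}

/* Fetch the integer value. The string is parsed only on the first call;
 * afterwards the cached internal rep is reused.
 * Returns 0 on success, -1 if the string is not a decimal integer. */
static int mini_get_long(MiniObj *o, long *out) {
    if (!o->hasLong) {
        char *end;
        long v;
        assert(o->bytes != NULL);          /* at least one leg must exist */
        v = strtol(o->bytes, &end, 10);
        if (end == o->bytes || *end != '\0') return -1;
        o->longValue = v;
        o->hasLong = 1;                    /* now standing on both legs */
    }
    *out = o->longValue;
    return 0;
}

/* Store a new integer. The old string rep is stale, so it is dropped. */
static void mini_set_long(MiniObj *o, long v) {
    free(o->bytes);
    o->bytes = NULL;
    o->longValue = v;
    o->hasLong = 1;
}

/* Fetch the string value, regenerating it from the integer if needed. */
static const char *mini_get_string(MiniObj *o) {
    if (o->bytes == NULL) {
        char buf[32];
        assert(o->hasLong);                /* at least one leg must exist */
        snprintf(buf, sizeof buf, "%ld", o->longValue);
        o->bytes = malloc(strlen(buf) + 1);
        if (o->bytes == NULL) return NULL;
        strcpy(o->bytes, buf);
    }
    return o->bytes;
}

static void mini_free(MiniObj *o) {
    if (o != NULL) { free(o->bytes); free(o); }
}
```

In this sketch, an object created with mini_new_string("42") is parsed once by mini_get_long and the result is cached, so repeated numeric use does not re-parse the string. After mini_set_long the string leg is gone until mini_get_string asks for it again. That is presumably why a loop doing lots of numeric comparisons and counter increments does not have to pay for a string conversion on every operation.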
