Shell Script Performance, Take 2
Suppose you need to change the extension of 1000 files (eg, from .txt to .log). The natural thing to do is to write a script. Let's start with a straightforward approach like this:
$ ls -1 *.txt | wc -l
1000
$ cat ext1.sh
#! /bin/bash

if [ $# -ne 2 ]; then
    echo "Usage: $0 <original-extension> <new-extension>"
    exit 1
fi

ext_o=$1
ext_n=$2

for i in *.$ext_o
do
    mv $i `basename $i $ext_o`$ext_n
done
$ time ./ext1.sh txt log

real    4m45.135s
user    0m10.040s
sys     0m58.621s
$ ls -1 *.log | wc -l
1000
As you can see, the script has to run the "basename" command 1000 times just to strip the extension before it can rename the files, and that is the overhead of forking 1000 extra processes. As I mentioned in my previous post, even an empty shell makes 160+ system calls and takes about 0.13 seconds to run; all of this adds up and slows down your script when you work with a large data set (a quick way to measure this overhead yourself is sketched right after the list below). To increase your script's performance, you basically need to follow these principles:
- Use the most appropriate tools for the job (more efficient)
- Try to reduce the number of commands to run (less forking)
- Explore all the options and capabilities of every command (may reduce the number of commands)
- Explore the functionality of built-in shell constructs (no forking at all)
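If you want to measure this forking overhead for yourself, here is a minimal sketch. It assumes a Linux box with strace installed (on Solaris, truss -c gives a similar per-call summary); the idea is simply to compare the number of system calls an "empty" shell makes against a single external command like basename.

# strace -c prints a summary (and total count) of the system calls a process makes.
# On Solaris, "truss -c" produces a comparable summary.
$ strace -c /bin/bash -c ':'            # an "empty" shell doing nothing
$ strace -c basename abc.txt .txt       # a single basename invocation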
In my second attempt, I will explore Bash's built-in pattern matching.
${variable%pattern} - If the pattern matches the end of the variable's value, delete the shortest part that matches and return the rest.
Example:
$ f=abc.txt
$ echo ${f%.txt}.log
abc.log
Here is the run time:
$ ls -1 *log | wc -l
1000
$ cat ext2.sh
#! /bin/bash

if [ $# -ne 2 ]; then
    echo "Usage: $0 <original-extension> <new-extension>"
    exit 1
fi

ext_o=$1
ext_n=$2

for i in *.$ext_o
do
    mv $i ${i%.$ext_o}.$ext_n
done
$ time ./ext2.sh log txt

real    2m6.684s
user    0m4.765s
sys     0m28.804s
$ ls -1 *.txt | wc -l
1000
That's 2.2 times faster than the first version. Is this good enough? Let's explore other scripting languages. Modern scripting languages (Perl, Python, Tcl, ... just to name a few) have a lot of capabilities built in, so the script does not have to fork extra processes. Below are my Tcl and Python implementations.
In Tcl:
$ ls -1 *.txt | wc -l
1000
$ cat ext.tcl
#! /cygdrive/c/Tcl8.4.19/bin/tclsh

if { $argc != 2 } {
    puts stderr "Usage: $argv0 <original-ext> <new-ext>"
    exit 1
}

set ext_o [lindex $argv 0]
set ext_n [lindex $argv 1]

foreach ofile [glob -nocomplain *.$ext_o] {
    file rename $ofile "[file rootname $ofile].$ext_n"
}
$ time ./ext.tcl txt log

real    0m12.149s
user    0m0.000s
sys     0m0.015s
$ ls -1 *log | wc -l
1000
In Python:
$ ls -1 *log | wc -l
1000
$ cat ext.py
#! /usr/bin/python

import os, sys

if len(sys.argv) != 3:
    print "Usage:", sys.argv[0], "<original-ext> <new-ext>"
    sys.exit(1)

ext_o = os.path.extsep + sys.argv[1]
ext_n = os.path.extsep + sys.argv[2]

for f in os.listdir('.'):
    [fname,ext] = os.path.splitext(f)
    if ext == ext_o:
        f_new = fname + ext_n
        os.rename(f,f_new)
$ time ./ext.py log txt

real    0m4.307s
user    0m0.062s
sys     0m1.653s
$ ls -1 *.txt | wc -l
1000
Wow, we were able to reduce the runtime from nearly 5 minutes to just over 4 seconds. That's over 66 times faster than the first version:
- 4m45.135s (1st version of shell script)
- 2m6.684s (2nd version, using built-in Bash parameter expansion)
- 0m12.149s (Tcl version)
- 0m4.307s (Python version)
Next time you write a script, make sure you test it with a large data set for performance reasons. BTW, can you come up with the fastest script to create 1000 files (eg, file-0001.txt ... file-1000.txt) in a running sequence? Stay tuned for "Shell Script Performance, Take 3" 'cos I will show you how to take advantage of built-in functions in the shell so that your script can run as fast as Python.
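As a teaser, here is one minimal sketch of the builtin-only style that "Take 3" will explore. It is not necessarily the exact solution I will present, and it assumes Bash 3.1+ for printf -v; the point is that brace expansion, printf and output redirection are all handled inside the shell, so not a single external command is forked.

for i in {1..1000}
do
    printf -v name 'file-%04d.txt' "$i"   # printf -v (a Bash built-in) formats into a variable
    > "$name"                             # redirection alone creates the empty file
done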
Labels: performance, python, shell script, Tcl
6 Comments:
How does this compare to the rename command?
rename .oldext .newext *
touch `seq -f file-%g.txt`
bah. of course I meant:
touch `seq -f file-%05g.txt 1 10000`
Thanks for all the comments.
Mathieu:
The rename command is only available in Linux, not Solaris. Apparently rename is more efficient than mv: on my CentOS box, rename took 26 system calls to rename a single file while mv took 106. mv is definitely more powerful than rename and therefore has to link with more shared objects, which results in more open/close/... system calls. (The pre tag is not available in comments, so I cannot show you the details.)
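For anyone who wants to reproduce the comparison, here is a rough sketch (assuming a Linux box with strace and the util-linux rename; the file names are just placeholders):

# strace -c summarises the system calls each command makes, including a total count.
$ strace -c rename .txt .log file-0001.txt
$ strace -c mv file-0002.txt file-0002.log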
Hi Pjz,
Thanks for highlighting the "-f" formatting flag. It is indeed very useful. Sadly, seq was introduced in OpenSolaris and is not yet in mainstream Solaris. However, I think we can write a shell function based on AWK to achieve the same effect.
Very nice read, two notes.
First, if you are iterating over a huge number of files you can also use the following "idiom":
ls -1 *pattern | while read f; do ...; done
This way you have the file listing happening in parallel with the rest of the script, and you also prevent the argument list of the for loop from overflowing. For a small number of files this is actually slower, as you pay the cost of spawning an additional process (ls) and setting up the pipeline.
Second, the seq command can also be implemented (input error checking aside) using Bash arithmetic:
myseq () { c=$1; while [ $c -le $2 ]; do echo $c; c=$(( c + 1 )); done; }
Thanks for the notes, Marco. I was preparing my "Take 3" when your comment came. I implemented my own "seq" using awk, so I only have to execute one command to create N file names.
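A possible sketch of that awk-based approach (not necessarily the exact code that will appear in "Take 3") generates all the file names in a single awk process and hands them to a single touch:

# One awk invocation prints every file name; one touch call creates them all.
touch $(awk 'BEGIN { for (i = 1; i <= 1000; i++) printf "file-%04d.txt ", i }')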