Tuesday, February 24, 2009

Sed is Slow for Very Long Stream in Solaris

Today I noticed that a customer's sed job had been running for more than an hour, which is strange because the input file is no more than a few MB and the sed script is just a straightforward global substitution. BTW, it was running on Solaris 10.

Is such a long run time inherent in the nature of the problem, or is sed inefficient in certain circumstances?

To find out, I created a file with 2000 lines. The first line has 12 characters and each subsequent line is 12 characters longer than the one before, so the last line has 24000 characters.
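As a quick sanity check on the size of that file, the arithmetic can be sketched in the shell. The variable names are only for illustration; the sum is 12 characters times 1 + 2 + ... + 2000, plus one newline per line.

```shell
# Expected size in bytes of the test file described above.
lines=2000      # number of lines in the file
unit=12         # characters added per line

# unit * (1 + 2 + ... + lines) characters, plus one newline per line
total=$(( unit * lines * (lines + 1) / 2 + lines ))
echo "$total"
```

This comes to 24014000, which agrees with the byte count that wc reports for run2000.txt in the transcript.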

sed 's/\\\\/@/g;s/\\/@/g' took 35+ minutes on my Sun Fire V440. That's really inefficient. Okay, sed is definitely not the right tool for this job. Let's take a look at an alternative.

Perl has a "-p" flag that wraps your in-line code in a while (<>) { ... # your script } loop and prints each line after your code has run, so that you can write a one-liner. Guess what, Perl took only 5 seconds to finish that substitution. Hey, that's a lot of CPU cycles saved!

Here is the code and the run time info:

$ cat run.sh
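#! /bin/bash

# Print the 12-character unit c:\\a\\b\\c, repeated $1 times (default 1) on one line
comma()
{
        perl -e 'print "c:\\\\a\\\\b\\\\c,"x'"${1:-1}"
        echo ""
}

# Emit $1 lines; line n holds n copies of the unit
n=1
while [ "$n" -le "$1" ]
do
        comma "$n"
        ((++n))
done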


$ ./run.sh 2000 > run2000.txt


$ wc run2000.txt
    2000    2000 24014000 run2000.txt


$ time sed 's/\\\\/@/g;s/\\/@/g' run2000.txt > run1.txt

real    35m6.692s
user    35m5.559s
sys     0m0.430s



$ time perl -pe 's/\\\\/@/g;s/\\/@/g' run2000.txt > run2.txt

real    0m4.948s
user    0m4.491s
sys     0m0.145s


$ digest -a md5 run1.txt run2.txt
(run1.txt) = 8820c914e0e038cec9da6f0883b6d964
(run2.txt) = 8820c914e0e038cec9da6f0883b6d964


$ uname -a
SunOS chihung 5.10 Generic_118822-11 sun4u sparc SUNW,Sun-Fire-V440


$ psrinfo -v
Status of virtual processor 0 as of: 02/25/2009 00:14:28
  on-line since 12/13/2008 00:37:43.
  The sparcv9 processor operates at 1281 MHz,
        and has a sparcv9 floating point processor.
Status of virtual processor 1 as of: 02/25/2009 00:14:28
  on-line since 12/13/2008 00:37:43.
  The sparcv9 processor operates at 1281 MHz,
        and has a sparcv9 floating point processor.
Status of virtual processor 2 as of: 02/25/2009 00:14:28
  on-line since 12/13/2008 00:37:43.
  The sparcv9 processor operates at 1281 MHz,
        and has a sparcv9 floating point processor.
Status of virtual processor 3 as of: 02/25/2009 00:14:28
  on-line since 12/13/2008 00:37:41.
  The sparcv9 processor operates at 1281 MHz,
        and has a sparcv9 floating point processor.


2 Comments:

Blogger balazs.deak said...

On my laptop running Linux, there is no such big difference between the performance of sed and perl in your experiment. See:

b@pet014204:~$ uname -a
Linux pet014204 2.6.27-11-generic #1 SMP Thu Jan 29 19:24:39 UTC 2009 i686 GNU/Linux

b@pet014204:~$ time ./run.sh 2000 >run2000.txt

real 0m15.492s
user 0m7.552s
sys 0m7.576s

b@pet014204:~$ wc run2000.txt
2000 2000 24014000 run2000.txt

b@pet014204:~$ time sed 's/\\\\/@/g;s/\\/@/g' run2000.txt > run1.txt

real 0m5.293s
user 0m5.120s
sys 0m0.088s

b@pet014204:~$ time perl -pe 's/\\\\/@/g;s/\\/@/g' run2000.txt > run2.txt

real 0m2.224s
user 0m2.156s
sys 0m0.060s

---
Could you repeat your test on some different systems?

5:06 AM  
Blogger chihungchan said...

Thanks for the info. It happened on my Solaris servers. I think Linux has better-tuned utilities than Solaris.

8:06 AM  
