Chi Hung Chan: February 2008

Monday, February 25, 2008

Paste, My Way

In UNIX, the paste command joins files horizontally, line by line. In some cases, however, you may want the contents of both files aligned on a particular key field.

For example, take two files, a.txt and b.txt (they could equally be the output of two sets of commands):

$ cat a.txt
2008-01-02 12
2008-01-06 31
2008-01-08 9
2008-01-09 41
2008-01-10 48
2008-01-12 28

$ cat b.txt
2008-01-02 43
2008-01-05 78
2008-01-09 23
2008-01-11 33
2008-01-12 39
2008-01-13 11

What you want is an output like this:

2008-01-02      12      43
2008-01-05              78
2008-01-06      31
2008-01-08      9
2008-01-09      41      23
2008-01-10      48
2008-01-11              33
2008-01-12      28      39
2008-01-13              11

but "paste" gives you this:

$ paste a.txt b.txt
2008-01-02 12   2008-01-02 43
2008-01-06 31   2008-01-05 78
2008-01-08 9    2008-01-09 23
2008-01-09 41   2008-01-11 33
2008-01-10 48   2008-01-12 39
2008-01-12 28   2008-01-13 11

One way to do this is to run both outputs in a sub-shell and prefix each line with a unique tag: f1 for the a.txt output and f2 for the b.txt output. With both tagged outputs fed into a single awk, the tag tells awk which source each line came from.

That final awk uses three associative arrays: f0 stores the key field (in this case the date), f1 the values from a.txt and f2 the values from b.txt. In the END block, I loop over the keys in f0 and print the matching entries of f1 and f2. Setting OFS (the output field separator) to a tab makes the output tab-separated. Bear in mind that awk's for (i in f0) loop visits the keys in no guaranteed order, which is why the output is piped to sort.

$ cat ab.sh
#! /bin/sh


(
 awk '{print "f1", $0}' a.txt;
 awk '{print "f2", $0}' b.txt;
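) | awk '
BEGIN {
        OFS="\t"                # tab-separated output
}
{
        f0[$2]=1                # remember every key (date) seen
        if ( $1 == "f1" ) {
                f1[$2]+=$3      # value from a.txt
        }
        if ( $1 == "f2" ) {
                f2[$2]+=$3      # value from b.txt
        }
}
END {
        # for-in order is unspecified, so the caller sorts the output
        for ( i in f0 ) {
                print i, f1[i], f2[i]
        }
}'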

$ ./ab.sh | sort -n -k 1
2008-01-02      12      43
2008-01-05              78
2008-01-06      31
2008-01-08      9
2008-01-09      41      23
2008-01-10      48
2008-01-11              33
2008-01-12      28      39
2008-01-13              11
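
When both inputs are plain files, the same merge can be done without the tagging sub-shell. The sketch below is a variation on the script above, not a replacement for it: merge_by_key is a hypothetical helper name, it assumes the key is in the first field and the value in the second, and it counts input files with FNR == 1 (so it assumes the first file is not empty). It sorts the result itself.

```shell
# merge_by_key FILE1 FILE2 (hypothetical helper): align two "key value"
# files on the key and print key, value1, value2 separated by tabs.
merge_by_key() {
    awk '
    BEGIN { OFS = "\t" }
    FNR == 1 { nfile++ }            # a new input file has started
    {
        keys[$1] = 1                # remember every key seen
        if (nfile == 1) a[$1] += $2 # value from FILE1
        else            b[$1] += $2 # value from FILE2
    }
    END {
        # for-in order is unspecified, hence the sort below
        for (k in keys) print k, a[k], b[k]
    }' "$1" "$2" | sort
}

# Example: merge_by_key a.txt b.txt
```

Keys that appear in only one file get an empty column, just as in the output above.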

Labels: awk, shell script

Web Page Response Time

From the viewpoint of an end user, the total response time of a web page consists of one or more of the sequences below (depending on how many dependencies the main request has):

Host name resolution (the web server host name resolves to an IP address), if not already in the system DNS cache
Connect time (the browser completes the 3-way TCP handshake with the web server)
Server response time (once the TCP connection is established, the browser sends the HTTP request to the server and waits for the response)
Delivery time (the time between the first and the last bytes of the requested content being sent by the server)

Two tools that you can use to measure these timings:

IBM Page Detailer for web page response time breakdown
cURL to monitor a specific URL. IBM developerWorks has an excellent article on applying cURL to monitor URL response time; see Expose Web performance problems with the RRDtool
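
As a minimal sketch of the cURL approach, the --write-out option can report the four timings listed above. The function names url_timings and phase_breakdown are hypothetical, and http://www.example.com/ is a placeholder URL. curl reports cumulative times, so each phase is the difference between consecutive values; for an https URL the "server" figure would also include the TLS handshake.

```shell
# url_timings URL (hypothetical helper): print the cumulative times, in
# seconds, for DNS lookup, TCP connect, first byte and total.
url_timings() {
    curl -s -o /dev/null \
         -w '%{time_namelookup} %{time_connect} %{time_starttransfer} %{time_total}\n' \
         "$1"
}

# phase_breakdown (hypothetical helper): read one line of cumulative
# times on stdin and print the duration of each phase.
phase_breakdown() {
    awk '{
        printf "dns=%.3f connect=%.3f server=%.3f delivery=%.3f\n",
               $1, $2 - $1, $3 - $2, $4 - $3
    }'
}

# Example (placeholder URL):
#   url_timings http://www.example.com/ | phase_breakdown
```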

The diagram below shows the correlation between the times recorded by these tools.

To deepen your knowledge of the HTTP protocol, EventHelix.com has documented a detailed HTTP Sequence Diagram that shows how the browser interacts with the web server. It even goes into the details of how the browser spawns a second thread to parallelise the download of other dependencies. Did you know that the HTTP/1.1 specification states that "A single-user client SHOULD NOT maintain more than 2 connections with any server or proxy"? That is why modern browsers default to 2 connections per server. Of course, you can always modify the default settings to gain parallelism. Alternatively, you can introduce an alias host name to 'fool' the browser into opening another two connections for that alias host.

Click on this Google search link to find out all the other protocol sequence diagrams from EventHelix.com.

Now you can have a deeper understanding of what's under the hood.

Labels: curl, performance

Wednesday, February 20, 2008

Web Performance Analogy

The help file of IBM Page Detailer Basic Version 5.2.6 has a very good analogy to help you understand why one page can load faster than another:

This is best explained with an analogy. It is a simplification of the argument, but should help you understand why one page can load faster than another. You are a member of a large family. Your relatives love to come visit you for a week at a time. There are several strange traditions (some would call them rules or protocols) that must be followed by all members of your family when moving your relative's belongings into the guest room.

The hostess or host (you) must carry all of the belongings into the guest room.
The visitor must put all of the things they will need for this visit in the trunk of their car.
People in your family may carry only one item at a time. (I warned you the traditions were strange.)
Your guest room is up on the second floor.
Larger items move more slowly, and have a higher incidence of being dropped (delaying your mission even more), although you have been known to drop even the smallest item at random.

You receive five visitors:

You invite Uncle Frank to come visit. Uncle Frank loads his trunk with eight changes of clothes, a bathing suit, toiletries, a picture album, camera, and film. He pulls into the driveway and you go out to greet him. He opens the trunk and tells you which item to take to the guest room first. Uncle Frank feels a little uneasy in your neighborhood, so he closes the trunk each time you leave with an item. Because uncle Frank doesn't believe in suitcases, he has sixty items in his trunk. Uncle Frank must stay by the trunk to lock and unlock it each time you get back for the next item. When you have fetched the last item, he follows you in and you can begin the more important parts of the visit. It will take you at least 60 trips up and down the stairs to get Uncle Frank's things into the guest room.
You invite Aunt Lorraine to come visit. She loads her trunk up with enough items for two weeks. Like Uncle Frank, Aunt Lorraine doesn't believe in suitcases either. She does feel a little better about leaving the trunk open for you though. Once she has opened the trunk she will leave it open for you. If someone else comes up to get things, she will close the trunk until she is sure who they are. She waits until the trunk is empty before coming into the house.
You invite Uncle Dan to come visit. He believes in suitcases, so he has twenty or so items in his trunk. He doesn't feel like leaving his trunk open and he will wait until the last item is on the way before he comes inside.
You invite Uncle Ian to come visit. He believes in suitcases, and is comfortable with your neighborhood. He has twenty something items in the trunk and keeps it open for you but waits until the trunk is empty to come inside.
You invite Aunt Joan to come visit. She believes in suitcases, and is comfortable with your neighborhood. She asks you to take in 6 out of the twenty something items in for her, and tells you the rest are not so important and you can get them later. She comes in with the sixth item.

If you care about how long it takes to get the visitor in the house, which relative would you prefer to come for a visit? This is a somewhat unfair narrow focus because it is the visitor's personality and what you can do with, and learn from them that most influences a lasting relationship, but for now we will concentrate on this first part of the visit.
Can you predict which visitor will come inside the soonest? How would the times change if your guest room was on the first floor and the visitor backed their trunk right up to the guest room door so you didn't have to walk at all to unload the trunk? What if your apartment was on the 5th floor with no elevator?
Would having your spouse and children help unload the visitor's trunk, (four at a time) using the same rules, change which visitor would take less time?
We could generalize that in most cases Aunt Joan followed by Uncle Ian would be available to come inside before anyone else, the exception being if Aunt Joan packed extremely heavy suitcases.
We can also say that the closer the car trunk is to the guest room, the less it matters how the trunk is packed. Said another way; the care taken in packing the trunk can have a great influence on how long it takes to get things inside, and matters more as the guest room moves further from the trunk.
OK, so what does this story have to do with IBM Page Detailer? In the analogy above, you are playing the web browser. The visitors are web pages. The car trunk is the web server. And the things in the trunk are the items on a web page. (e.g., gifs or jpeg images, html text, ...). The family traditions are the protocols the web browser must follow to obtain items from the web server. The visitors who close their trunk all of the time are sites that do not keep connections open (see Too Many Connections). And the visitor who comes in after only part of the baggage is loaded, is a page that is useable before the last item is completed.
Many factors go into how a web page should be packaged. It is not possible to generalize about the correct number of items on any page, but if designers limit the number of separate items on a page when possible, the end user will see better performance from your site. An example of a common page improvement is to send a menu as a browser or client-side map instead of a table with individual graphic elements. While the use of mouse rollovers, which dynamically change the displayed GIF, looks interesting, it also means that additional GIFs must be downloaded for the effect to operate differently over each menu item. Eliminating the rollover GIF action can reduce the number of objects required to load the page. There are other similar tradeoffs that can be taken to reduce the number of objects on a page. Most of them, like the example above, trade off some amount of interface function for a reduction in the number of items on the page.
The "side effects" of combining page contents for delivery may be significant too. Your server will require fewer machine cycles to retrieve and deliver your content to the end user. Think of the same number of users hitting your site over a period of time, if each repackaged page requires one half the number of objects, the hit rate your site needs to support is also halved, or the capacity you have for more users in that period increases dramatically, assuming your server and infrastructure can handle the increased content delivery bandwidth required.
The repackaged page also will use a little less network bandwidth. The resources required for eliminated objects are saved. For objects that have been combined, the new object may be smaller than the sum of the parts. The new combined object, if smaller, will require fewer resources to deliver and may take better advantage of TCP/IP delivery windows. The small packets sent back and forth along with the overhead of each item are eliminated. The biggest difference can be seen when the server doesn't keep connections open, when a socks server is involved, or when SSL is being used.
Web designers and developers tend to gravitate close to the web server they are working with for efficiency. Most of them try to be on the same LAN. Real web end users tend to be further away and may be connected via dial-up at considerably slower speeds. The web designer may not see much difference in response time for some of these changes from their viewpoint, but the real end user will see the benefits of a thoughtful repackaging with fewer items for delivery. It is a good Web site development policy that developers should regularly view pages under development by using connections that are typical for the target users.
If the link between the Web server and the browser is saturated all of the time while delivering data (relatively rare with large numbers of items, but possible with slow dial-up links), repackaging may not benefit the end user experience unless the repackaged page becomes useable sooner.
These arguments could be extended to include proxy servers, SOCKS enabled firewalls and SSL, but the lessons remain the same. These extra steps taken in item delivery only amplify the importance of using fewer items on the page.
A closer analogy to item delivery over the Internet may be "move-in day" at the dorm. Hundreds of individuals, with the same goal, get every item from the car into their room in the least amount of time using common elevators, stairs and hallways. One can even envision the lost "packets" in the halls. Even in this chaos, we can understand that having fewer items can help minimize the inevitable delays due to infrastructure saturation.

Labels: performance

Wednesday, February 13, 2008

CMG Past Proceedings Are Available Online

I got an email today from the Computer Measurement Group (CMG) saying that they are putting all their past proceedings (1997 to 2005) online. You need to be a registered user (registration is free) to view the PDF files.

If you have an interest in performance, you should start browsing.

Labels: performance

Tuesday, February 05, 2008

Table^11

Last October I blogged about a Singapore site with a deep HTML <table> hierarchy (10 levels deep).

Today's Digital Life (a weekly supplement from The Straits Times) featured an article on page 11 that caught my attention. The title is "Slow Loading Speeds, Unstable Site Irk Sistic Users". After visiting their home page and analysing it with YSlow for Firebug under Firefox and with IBM Page Detailer, I am surprised at how inefficient the HTML code is.

Running the same code that I posted previously, I realised that the Sistic site has an 11-level-deep table hierarchy!
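
The script from the earlier post is not reproduced here. As a rough stand-in, the sketch below estimates the maximum <table> nesting depth by tracking opening and closing tags. The name table_depth is hypothetical, and this is not necessarily how the original code worked. It does no real HTML parsing, so tags inside comments or scripts are counted too.

```shell
# table_depth (hypothetical helper): read HTML on stdin and print the
# deepest <table> nesting level found.
table_depth() {
    # Put every tag at the start of its own line, then track the depth.
    tr '<' '\n' | awk '
    { t = tolower($0) }
    t ~ /^table[ \t>]/ || t == "table" { if (++depth > max) max = depth }
    t ~ /^\/table[ \t>]/               { depth-- }
    END { print max + 0 }'
}

# Example: table_depth < index.html
```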

A couple of inefficiencies that I spotted:

60+KB of white space in the home page (a simple one-liner can remove all of it)
bullet lists are rendered using a table with • as the dot (CSS applied to a <ul> can do the job: 'list-style-type: disc' or 'list-style-image: url()')
images are generated via an HTTP GET with a query string (IMO, these are static images that can be linked directly to the source)
no Expires HTTP header for images
lots of commented-out HTML code
images that are supposed to be JPEG are served with the image/gif content type
lots and lots of <table>
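
The missing Expires header is easy to check from the command line. A minimal sketch, assuming curl is available; has_expires is a hypothetical helper, and the URL in the comment is a placeholder rather than one of the site's real image URLs:

```shell
# has_expires (hypothetical helper): read HTTP response headers on stdin.
# Exit status is 0 if an Expires header is present, non-zero otherwise.
has_expires() {
    grep -i -q '^expires:'
}

# Example (placeholder URL):
#   curl -sI http://www.example.com/logo.gif | has_expires || echo "no Expires header"
```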

A few simple recommendations can ease the load on the server, which translates into a better user experience:

use a separate URL host name (it can point to the same host as the main site) to serve static content (e.g., http://i.sistic.com.sg/). This helps to parallelise the downloads, because the browser default is two connections per server host name (see YSlow rules for details)
convert all image URLs (HTTP GET + query string) to static links and have them served from the above-mentioned host
develop a simple script to remove all comments and white space from the HTML page before uploading it to the production server
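
As a starting point for such a script, the sketch below strips only leading and trailing white space and drops blank lines; strip_whitespace is a hypothetical name. It deliberately leaves comments alone, because removing them safely (multi-line comments, inline scripts) takes more than a one-liner, and it assumes the page has no <pre> blocks whose layout matters.

```shell
# strip_whitespace (hypothetical helper): read HTML on stdin, trim each
# line and drop lines that end up empty.
strip_whitespace() {
    awk '{
        sub(/^[ \t]+/, "")          # leading spaces and tabs
        sub(/[ \t]+$/, "")          # trailing spaces and tabs
        if (length($0) > 0) print   # skip blank lines
    }'
}

# Example: strip_whitespace < index.html > index.min.html
```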

If you have more time to spare, you should also consider this:

convert all the <table> layouts to <div>
use CSS, not <table>, to control the padding and borders!

BTW, IBM Page Detailer is an extremely useful tool for profiling your HTML page loading time. Unlike other profiling tools, it does not work as an HTTP proxy to capture the web activity. As mentioned on the IBM Page Detailer page, it ... "places a probe in the client's Windows Socket Stack to capture data about timing, size, and data flow, and then it presents the collected data in both graphical and tabular views, clearly showing how the page was delivered to the browser." See the screen capture below.

Labels: html, http, performance

Monday, February 04, 2008

Print a Sequence of Dates

Every now and then my boss asks me to generate a log summary between certain dates. What I normally do is select the log files and manually put them in a 'for' loop to process. Most of the time I can shorten the input to the 'for' loop by using wildcards or bracket patterns and letting the shell expand the file selection.

For example, say I need to process the gzipped log files from 2008-01-23 to 2008-02-03. This is what I did in the past:

for i in access_log-2008012[3-9].gz access_log-2008013*.gz access_log-2008020[1-3].gz
do
gunzip < "$i"
done | awk ' ...

That can be quite tedious and error-prone. Did you know that in Linux you can print a sequence of numbers using seq (see other implementations)? It would be nice to have a similar command for dates.

Below is a shell function (dateseq) that does just that (using Tcl):

dateseq() {
echo "set s [clock scan $1];set e [clock scan $2];for {set i \$s} {\$i<=\$e} {incr i 86400} {\
puts [clock format \$i -format ${3:-%Y%m%d}]}" | tclsh
}

Here is how to use it within a 'for' loop and how to specify your own format:

$ for i in `dateseq 20080123 20080203`
do
echo $i
done
20080123
20080124
20080125
20080126
20080127
20080128
20080129
20080130
20080131
20080201
20080202
20080203

$ for i in `dateseq 20080123 20080203 %Y-%b-%d`
do
echo $i
done
2008-Jan-23
2008-Jan-24
2008-Jan-25
2008-Jan-26
2008-Jan-27
2008-Jan-28
2008-Jan-29
2008-Jan-30
2008-Jan-31
2008-Feb-01
2008-Feb-02
2008-Feb-03

$ for i in `dateseq 20080123 20080203`
do
f="access_log-$i.gz"
[ -f "$f" ] && gunzip < "$f"
done | wc -l
12892723
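
If Tcl is not installed, GNU date can do the same job. The sketch below assumes GNU date (its -d option is not portable to BSD or Solaris date), and dateseq_gnu is a hypothetical name. It advances by calendar day rather than by 86400 seconds, which side-steps the repeated or skipped date that a fixed 24-hour step can produce across a daylight-saving change.

```shell
# dateseq_gnu START END [FORMAT] (hypothetical helper): print every date
# from START to END inclusive. START and END are YYYYMMDD; FORMAT is a
# date(1) format string without the leading "+". Requires GNU date.
dateseq_gnu() {
    d=$1
    while [ "$d" -le "$2" ]; do
        date -d "$d" "+${3:-%Y%m%d}"        # print in the requested format
        d=$(date -d "$d + 1 day" +%Y%m%d)   # step to the next calendar day
    done
}

# Example: for i in `dateseq_gnu 20080123 20080203`; do echo $i; done
```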

Labels: shell script, Tcl
