Thursday, May 03, 2007

Web Scraping, the Tcl way

My office's resource booking system does not provide a bird's-eye view of the availability of resources (meeting rooms). This seemed like another good opportunity to scrape it with Tcl (using tDOM) and reformat the data into something more user-friendly.

The HTML data is read, a DOM tree is built (set doc [dom parse -html $data]) and its root element is fetched (set root [$doc documentElement]). Once we have the DOM, it is pretty easy to locate the information using XPath.
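As a rough sketch of that parse-then-query step, the following builds a DOM from an inline string and runs one XPath query against it. The inline markup is a made-up stand-in for the real page, which would normally be fetched from the booking system (for example with the http package); only the class name is taken from the actual page.

```tcl
package require tdom

# Stand-in for the downloaded page (an assumption, not the real markup)
set data {<html><body><table>
<tr class='DashNavCurDashArea'><td><b>3</b></td></tr>
</table></body></html>}

# -html lets tDOM cope with HTML that is not well-formed XML
set doc  [dom parse -html $data]
set root [$doc documentElement]

# XPath: every <tr> carrying the class of the current dashboard area
set rows [$root selectNodes {//tr[@class='DashNavCurDashArea']}]
puts "matched [llength $rows] row(s)"

# Free the DOM tree once it is no longer needed
$doc delete
```

Calling delete on the document matters in a long-running script, since tDOM trees are not freed automatically when the variable goes out of scope.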

This is a snippet of the HTML code:

<tr class='DashNavCurDashArea'>
<td align=left valign=top height=60><b>3</b>
<br><font size=1>
<a Title='boardroom: user1' href="display_event.asp?Pkey=13931"><img src=image/R.gif border=0 alt='boardroom: user1'>
<font size=1>
<a Title='room3: user2' href="display_event.asp?Pkey=13948"><img src=image/R.gif border=0 alt='room3: user2'>
<font size=1>
<a Title='boardroom: user3' href="display_event.asp?Pkey=13938"><img src=image/R.gif border=0 alt='boardroom: user3'>
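
With tDOM, you can locate all the resources (nodes) booked today and then loop through them to extract the room, username and timeslot:

# Find today's cell: the <td> whose bold day number equals $today
set todayNode [$root selectNodes "//table/tr\[@class='DashNavCurDashArea'\]/td\[b/text()=\"$today\"\]"]
# Each booking is an <img> whose alt text has the form "room: user"
foreach node [$todayNode selectNodes {font/a/img[@src="image/R.gif"]}] {
 # Split "room: user" on the colon into the variables r and u
 foreach { r u } [split [$node getAttribute alt] {:}] {}
 set room [string trim $r]
 set user [string trim $u]
 # The node following the image holds the timeslot text
 set time [string trim [[$node nextSibling] nodeValue]]
}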

Since the booking timeslot interval is 10 minutes, I create an HTML table with 144 (24*6) columns, one for each interval in a day. If a particular resource is taken at a particular interval, the table cell is filled with a 1x1 pixel red image (resized to 5x10). Each image is also hyperlinked to the page itself (#) with the title attribute set to the username, so that the username is displayed on mouse-over.
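
A minimal sketch of how one such table row could be generated is shown below. It is not the original script: the proc names, the "HH:MM" time format, the {user start end} booking layout and the red.gif file name are all assumptions made for illustration.

```tcl
# Number of 10-minute slots in a day: 24 hours * 6 slots per hour
set SLOTS 144

# Convert "HH:MM" (assumed format) to a slot index in 0..143
proc slotIndex {hhmm} {
    scan $hhmm "%d:%d" h m
    return [expr {$h * 6 + $m / 10}]
}

# Build one <tr> for a room. bookings is a list of {user start end}
# triples (hypothetical layout); the end time is exclusive.
proc bookingRow {room bookings} {
    global SLOTS
    # Start with every slot free (empty owner)
    set owner [lrepeat $SLOTS ""]
    foreach booking $bookings {
        lassign $booking user start end
        for {set i [slotIndex $start]} {$i < [slotIndex $end]} {incr i} {
            lset owner $i $user
        }
    }
    set html "<tr><td>$room</td>"
    foreach user $owner {
        if {$user eq ""} {
            append html "<td></td>"
        } else {
            # red.gif stands for the 1x1 red pixel, stretched to 5x10
            append html "<td><a href=\"#\" title=\"$user\">" \
                "<img src=\"red.gif\" width=\"5\" height=\"10\" border=\"0\"></a></td>"
        }
    }
    append html "</tr>"
    return $html
}

set row [bookingRow boardroom {{user1 09:00 09:30} {user3 14:00 15:00}}]
puts $row
```

The sketch inserts usernames into the title attribute verbatim, so names containing quotes or angle brackets would need HTML escaping first.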


