Web Scraping, the Tcl way
The HTML data is read and a DOM tree is built from it (set doc [dom parse -html $data]). Once we have the DOM, it is pretty easy to locate the information using XPath.
This is a snippet of the HTML code:
    <table>
    <tr class='DashNavCurDashArea'>
    <td align=left valign=top height=60><b>3</b> <br>
    <font size=1> <a Title='boardroom: user1' href="display_event.asp?Pkey=13931"><img src=image/R.gif border=0 alt='boardroom: user1'> 10:00-12:00 </font></a><br>
    <font size=1> <a Title='room3: user2' href="display_event.asp?Pkey=13948"><img src=image/R.gif border=0 alt='room3: user2'> 13:30-15:00 </font></a><br>
    <font size=1> <a Title='boardroom: user3' href="display_event.asp?Pkey=13938"><img src=image/R.gif border=0 alt='boardroom: user3'> 16:30-18:00 </font></a><br>
    ...

With tDOM, you can locate all the resources (or nodes) booked today and then loop through them to extract the username and timeslot.
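    set root [$doc documentElement]
    # $today holds the day of the month as shown in the calendar cell, e.g. 3
    set todayNode [$root selectNodes "//table/tr\[@class='DashNavCurDashArea'\]/td\[b/text()=\"$today\"\]"]
    # Each booking is an <img> whose alt text reads "room: user"
    foreach node [$todayNode selectNodes {font/a/img[@src="image/R.gif"]}] {
        lassign [split [$node getAttribute alt] :] r u
        set room [string trim $r]
        set user [string trim $u]
        # The timeslot is the text node that follows the <img>
        set time [string trim [[$node nextSibling] nodeValue]]
    }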
Since the resource booking timeslot interval is 10 minutes, I create an HTML table with 144 (24*6) columns to represent every interval in a day. If a particular resource is taken at a particular interval, the table cell is filled with a 1x1 pixel red image (resized to 5x10). Each image is also hyperlinked to the page itself (#) with the title attribute set to the username, so that the username is displayed when the mouse hovers over it.
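The mapping from a scraped timeslot to table columns can be sketched roughly as follows. This is not the original code, only a minimal illustration assuming Tcl 8.5 or later, timeslots of the form HH:MM-HH:MM aligned to 10 minutes, and an end time that is exclusive. The proc names (slotIndex, slotsFor, bookingRow) and the image file red.gif are made up for the example, and the username is written into the title attribute without any HTML escaping.

```tcl
# Convert "HH:MM" to a 10-minute slot index (0 for 00:00, 143 for 23:50).
proc slotIndex {hhmm} {
    lassign [split $hhmm :] h m
    # scan with %d avoids "08" and "09" being rejected as octal numbers
    scan $h %d h
    scan $m %d m
    return [expr {$h * 6 + $m / 10}]
}

# Return the slot indices covered by a "HH:MM-HH:MM" range.
# The end time is treated as exclusive (assumption).
proc slotsFor {range} {
    lassign [split [string trim $range] -] start end
    set last [slotIndex $end]
    set slots {}
    for {set i [slotIndex $start]} {$i < $last} {incr i} {
        lappend slots $i
    }
    return $slots
}

# Build one 144-cell table row for a resource.
# bookings is a flat list: user1 range1 user2 range2 ...
proc bookingRow {room bookings} {
    # Map each taken slot index to the user who booked it
    set taken [dict create]
    foreach {user range} $bookings {
        foreach i [slotsFor $range] {
            dict set taken $i $user
        }
    }
    set html "<tr><th>$room</th>"
    for {set i 0} {$i < 144} {incr i} {
        if {[dict exists $taken $i]} {
            set user [dict get $taken $i]
            # red.gif is a hypothetical 1x1 red image, stretched to 5x10
            append html "<td><a href=\"#\" title=\"$user\"><img src=\"red.gif\" width=\"5\" height=\"10\" border=\"0\"></a></td>"
        } else {
            append html "<td></td>"
        }
    }
    append html "</tr>"
    return $html
}
```

With these procs, slotsFor "10:00-12:00" yields slots 60 through 71, and bookingRow boardroom {user1 10:00-12:00 user3 16:30-18:00} produces a row in which those cells, plus the ones for 16:30-18:00, carry the red image and the username as tooltip.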
Labels: DOM, Tcl, tDOM, Web Scraping, XML