I went to the recent
Sun's
Project Blackbox launch the other day and met my marcom colleague on my way to the train station. We had a nice chat and exchanged a lot of ideas regarding how our internal MIS can be improved. One of the things that bothers them for a long time is email. As marcom, they need a system to blast off emails to our customers regarding events, news, and anything that we want them to know about us. MIS told them they do not have such a system available and therefore marcom has to do this using their existing Outlook.
The current customer database is in a Excel spreadsheet and the marcom folks have to literally copy and paste them one by one into the marcom Outlook a/c in the desktop pc. I am sure there is a better way in doing this, automation is the way to go. I explored both using Outlook as well as the web-based email. With a bit of Googling, it is possible to export the MS Excel to CSV (comma separated variable) and have the CSV file imported into the Outlook. It would be easier to introduce a header in the first row so that the mapping of your Excel data to Outlook contact can be done exactly.
What happen if the person is on leave and the other marcom folks do not have access to the desktop. AFAIK, the Outlook contacts are stored in your local desktop, not in the server. BTW, we are running Sun Messaging Server, not MS Exchange. It would be nice to import all these email addresses into the system so that it can be access anywhere by anyone. However, such functionality is not available in the system.
With many years of Web Scraping experience, it will not be too hard to do that, right? First, I installed
LiveHTTPheaders extension in
FireFox. Second, I login to the mail server and add a contact to the system. I picked a particular syntax (like this _XXXXX_) for all the values in the fields. This syntax enables me to easily identify the value in the encoded string in the POST data.
I break them into lines for ease of reading
sid=&security=false&ldapurl=pab&filter=cmd%3DPAB_CMD_ADD_ENTRY%7Cnew%3Dgivenname%3A_FIRSTNAME_%0D%0A
sn%3A_LASTNAME_%0D%0Acn%3A_DISPLAY_%0D%0Aobjectclass%3Apabperson%0D%0Amemberofpab%3AAddressBookc6e5131%0D%0A
mail%3A_EMAIL_%0D%0Atelephonenumber%3A_OFFICE_%0D%0Ahomephone%3A_HOME_%0D%0Amobile%3A_MOBILE_%0D%0A
pager%3A_PAGER_%0D%0Afacsimiletelephonenumber%3A_FAX_%0D%0A%0D%0A%7C
If you do a "man ascii" in solaris, you will see the hex character encoding
Hexadecimal - Character
00 NUL 01 SOH 02 STX 03 ETX 04 EOT 05 ENQ 06 ACK 07 BEL
08 BS 09 HT 0A NL 0B VT 0C NP 0D CR 0E SO 0F SI
10 DLE 11 DC1 12 DC2 13 DC3 14 DC4 15 NAK 16 SYN 17 ETB
18 CAN 19 EM 1A SUB 1B ESC 1C FS 1D GS 1E RS 1F US
20 SP 21 ! 22 " 23 # 24 $ 25 % 26 & 27 '
28 ( 29 ) 2A * 2B + 2C , 2D - 2E . 2F /
30 0 31 1 32 2 33 3 34 4 35 5 36 6 37 7
38 8 39 9 3A : 3B ; 3C < 3D = 3E > 3F ?
40 @ 41 A 42 B 43 C 44 D 45 E 46 F 47 G
48 H 49 I 4A J 4B K 4C L 4D M 4E N 4F O
50 P 51 Q 52 R 53 S 54 T 55 U 56 V 57 W
58 X 59 Y 5A Z 5B [ 5C \ 5D ] 5E ^ 5F _
60 ` 61 a 62 b 63 c 64 d 65 e 66 f 67 g
68 h 69 i 6A j 6B k 6C l 6D m 6E n 6F o
70 p 71 q 72 r 73 s 74 t 75 u 76 v 77 w
78 x 79 y 7A z 7B { 7C | 7D } 7E ~ 7F DEL
Somehow the characters in the encoded POST data include non-printable characters, eg %0D, %0A.... So shall I do decoding to find out the name value pairs. The answser is no. What I did was to simply prepend a "$" (dollar) sign in front of the value so that I can dynamically get
Tcl to do the substitution.
sid=&security=false&ldapurl=pab&filter=cmd%3DPAB_CMD_ADD_ENTRY%7Cnew%3Dgivenname%3A$_FIRSTNAME_%0D%0A
sn%3A$_LASTNAME_%0D%0Acn%3A$_DISPLAY_%0D%0Aobjectclass%3Apabperson%0D%0Amemberofpab%3AAddressBookc6e5131%0D%0A
mail%3A$_EMAIL_%0D%0Atelephonenumber%3A$_OFFICE_%0D%0Ahomephone%3A$_HOME_%0D%0Amobile%3A$_MOBILE_%0D%0A
pager%3A$_PAGER_%0D%0Afacsimiletelephonenumber%3A$_FAX_%0D%0A%0D%0A%7C
All I have to is to export the Excel data into CSV (I choose tab separated to avoid comma character in the data) and get the program to loop through every line. For each line, all I have to do is to extract the email address, name, ... and set the corresponding variables. The correct POST data will immediately fall into places. BTW, the login process requires to capture the Cookie, which is the session key for the entire process.
Below shows the script and the screen dumps (add contact, livehttpheaders)
#! /usr/bin/tclsh
if { $argc != 1 } {
puts stderr "Usage: $argv0 <tab-separated-csv>"
exit 1
}
set tabcsv [lindex $argv 0]
proc getCookieHeader { s } {
set cookie {}
foreach {n v} [set ${s}(meta)] {
if { [string equal -nocase {Set-Cookie} $n] } {
lappend cookie $v
}
}
set set_cookie {}
foreach c $cookie {
append set_cookie "[lindex [split $c {;}] 0]; "
}
return [string range $set_cookie 0 end-2]
}
set u username; set p password
package require http
set url "http://some.mail.server"
set url_login "$url/login.msc"
set url_addcontact "$url/ldap.msc"
set url_logout "$url/cmd.msc?sid=&mbox=&cmd=logout&security=false&lang=en"
# login
set s [http::geturl $url_login -query [http::formatQuery user $u password $p]]
set header [getCookieHeader $s]
http::cleanup $s
set fp [open $tabcsv r]
# skip first line
gets $fp line
while { [gets $fp line] >= 0 } {
set linelist [split $line "\t"]
set company [lindex $linelist 0]
set custname [lindex $linelist 1]
set custemail [lindex $linelist 2]
if { [string length $custemail]==0 || [llength $linelist]<=3 } {
continue
}
puts $custemail
set _FIRSTNAME_ ""
set _LASTNAME_ "$custname"
set _DISPLAY_ "$company - $custname"
set _EMAIL_ "$custemail"
set _OFFICE_ ""
set _HOME_ ""
set _MOBILE_ ""
set _PAGER_ ""
set _FAX_ ""
# add contact
set contact_query "sid=&security=false&ldapurl=pab&filter=cmd%3DPAB_CMD_ADD_ENTRY%7Cnew%3Dgivenname%3A$_FIRSTNAME_%0D%0Asn%3A$_LASTNAME_%0D%0Acn%3A$_DISPLAY_%0D%0Aobjectclass%3Apabperson%0D%0Amemberofpab%3AAddressBookc6e5131%0D%0Amail%3A$_EMAIL_%0D%0Atelephonenumber%3A$_OFFICE_%0D%0Ahomephone%3A$_HOME_%0D%0Amobile%3A$_MOBILE_%0D%0Apager%3A$_PAGER_%0D%0Afacsimiletelephonenumber%3A$_FAX_%0D%0A%0D%0A%7C"
set s [http::geturl $url_addcontact -headers [list Cookie $header] -query $contact_query]
http::cleanup $s
}
close $fp
# logout
set s [http::geturl $url_logout -headers [list Cookie $header]]
http::cleanup $s
Labels: Tcl, Web Scraping