$ cat ns.xml
<?xml version="1.0"?>
<Tests xmlns="http://www.adatum.com">
<Test TestId="0001" TestType="CMD">
<Name>Convert number to string</Name>
<CommandLine>Examp1.EXE</CommandLine>
<Input>1</Input>
<Output>One</Output>
</Test>
<Test TestId="0002" TestType="CMD">
<Name>Find succeeding characters</Name>
<CommandLine>Examp2.EXE</CommandLine>
<Input>abc</Input>
<Output>def</Output>
</Test>
<Test TestId="0003" TestType="GUI">
<Name>Convert multiple numbers to strings</Name>
<CommandLine>Examp2.EXE /Verbose</CommandLine>
<Input>123</Input>
<Output>One Two Three</Output>
</Test>
<Test TestId="0004" TestType="GUI">
<Name>Find correlated key</Name>
<CommandLine>Examp3.EXE</CommandLine>
<Input>a1</Input>
<Output>b1</Output>
</Test>
<Test TestId="0005" TestType="GUI">
<Name>Count characters</Name>
<CommandLine>FinalExamp.EXE</CommandLine>
<Input>This is a test</Input>
<Output>14</Output>
</Test>
<Test TestId="0006" TestType="GUI">
<Name>Another Test</Name>
<CommandLine>Examp2.EXE</CommandLine>
<Input>Test Input</Input>
<Output>10</Output>
</Test>
</Tests>
$ xmllint --shell ns.xml
/ > cd Tests
Tests is a 0 Node Set
/ >
To traverse an XML file that declares a default namespace, you first need to register that namespace under a prefix with setns, then use the prefix in every path step.
$ head -2 ns.xml
<?xml version="1.0"?>
<Tests xmlns="http://www.adatum.com">
$ xmllint --shell ns.xml
/ > setns a=http://www.adatum.com
/ > cd a:Tests
Tests > cd a:Test
a:Test is a 6 Node Set
Tests > cd a:Test[3]
Test > dir
ELEMENT Test
ATTRIBUTE TestId
TEXT
content=0003
ATTRIBUTE TestType
TEXT
content=GUI
Test > cat
<Test TestId="0003" TestType="GUI">
<Name>Convert multiple numbers to strings</Name>
<CommandLine>Examp2.EXE /Verbose</CommandLine>
<Input>123</Input>
<Output>One Two Three</Output>
</Test>
Test >
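The same rule applies outside the xmllint shell. As a minimal sketch (Python and its standard xml.etree.ElementTree module are not part of the original session, and the prefix a and the trimmed-down document are chosen here purely for illustration), an unprefixed path matches nothing, while a path that uses a registered prefix finds the elements:

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of ns.xml: two of the six tests, same default namespace.
NS_XML = """<?xml version="1.0"?>
<Tests xmlns="http://www.adatum.com">
  <Test TestId="0001" TestType="CMD">
    <Name>Convert number to string</Name>
  </Test>
  <Test TestId="0003" TestType="GUI">
    <Name>Convert multiple numbers to strings</Name>
  </Test>
</Tests>"""

root = ET.fromstring(NS_XML)

# The prefix is arbitrary; only the namespace URI has to match the document.
ns = {"a": "http://www.adatum.com"}

unqualified = root.findall("Test")       # no namespace: matches nothing
qualified = root.findall("a:Test", ns)   # namespace-aware: matches both tests

print(len(unqualified), len(qualified))

# Select one test by attribute and read its details.
third = root.find("a:Test[@TestId='0003']", ns)
print(third.get("TestType"), third.find("a:Name", ns).text)
```

The empty result for the unqualified path corresponds to the "Tests is a 0 Node Set" message in the first session.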
If you have more than one namespace to work with, register each one under its own prefix. The prefixes you choose do not have to match the ones declared in the document; only the namespace URIs must match.
$ cat ns2.xml
<h:html xmlns:xdc="http://www.xml.com/books"
xmlns:h="http://www.w3.org/HTML/1998/html4">
<h:head><h:title>Book Review</h:title></h:head>
<h:body>
<xdc:bookreview>
<xdc:title>XML: A Primer</xdc:title>
<h:table>
<h:tr align="center">
<h:td>Author</h:td><h:td>Price</h:td>
<h:td>Pages</h:td><h:td>Date</h:td></h:tr>
<h:tr align="left">
<h:td><xdc:author>Simon St. Laurent</xdc:author></h:td>
<h:td><xdc:price>31.98</xdc:price></h:td>
<h:td><xdc:pages>352</xdc:pages></h:td>
<h:td><xdc:date>1998/01</xdc:date></h:td>
</h:tr>
</h:table>
</xdc:bookreview>
</h:body>
</h:html>
$ xmllint --shell ns2.xml
/ > cd h:html
h:html is a 0 Node Set
/ > setns h=http://www.w3.org/HTML/1998/html4
/ > setns xdc=http://www.xml.com/books
/ > cd h:html/h:body/xdc:bookreview/xdc:title
title > cat
<xdc:title>XML: A Primer</xdc:title>
title >
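To illustrate that the prefix names really are arbitrary, here is a small sketch using Python's standard xml.etree.ElementTree (again an assumption of this example, not something used in the original session). The prefixes page and book are hypothetical and deliberately differ from the h and xdc prefixes declared in the document:

```python
import xml.etree.ElementTree as ET

# Shortened copy of ns2.xml, keeping its two namespace declarations.
NS2_XML = """<h:html xmlns:xdc="http://www.xml.com/books"
        xmlns:h="http://www.w3.org/HTML/1998/html4">
  <h:head><h:title>Book Review</h:title></h:head>
  <h:body>
    <xdc:bookreview>
      <xdc:title>XML: A Primer</xdc:title>
    </xdc:bookreview>
  </h:body>
</h:html>"""

root = ET.fromstring(NS2_XML)

# Deliberately different prefixes from the document's own h: and xdc:.
ns = {
    "page": "http://www.w3.org/HTML/1998/html4",
    "book": "http://www.xml.com/books",
}

# Paths are relative to the root element (h:html).
book_title = root.find("page:body/book:bookreview/book:title", ns)
page_title = root.find("page:head/page:title", ns)

print(book_title.text)
print(page_title.text)
```

Both lookups succeed because the match is made on the namespace URI, not on the prefix text.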
Labels: XML
After some thought, I decided I should validate my answer. I downloaded a fairly sizeable XML file from the Mondial project for the test and deliberately removed one closing tag so that it was no longer well-formed. Neither tdom nor Firefox identifies where the closing tag is missing; both only report the line where the mismatch is finally detected. Only xmllint also reports the line of the unclosed opening tag (16795), which leads to the missing tag at line 16819.
$ diff mondial.xml mondial-malformed.xml
16819d16818
< </country>
$ firefox mondial-malformed.xml
Firefox
XML Parsing Error: mismatched tag. Expected: </country>.
Location: file:///home/chihung/Projects/xmllint/mondial-malformed.xml
Line Number 39564, Column 3:</mondial>
--^
$ tclsh
% package require tdom
0.8.3
% set doc [dom parse [tDOM::xmlReadFile mondial-malformed.xml]]
error "mismatched tag" at line 39564 character 2
"ude>
</desert>
</m <--Error-- ondial>
"
$ xmllint --shell mondial-malformed.xml
mondial-malformed.xml:39564: parser error : Opening and ending tag mismatch: country line 16795 and mondial
</mondial>
^
mondial-malformed.xml:39565: parser error : Premature end of data in tag mondial line 3
^
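The difference between "where the mismatch is detected" and "where the unclosed tag was opened" can be reproduced on a tiny file. The sketch below is an illustration only: it uses Python's standard xml.etree.ElementTree (built on the expat parser and not part of the original test), and the seven-line document is a hypothetical miniature of the malformed file. Like the Firefox and tdom messages above, the reported position is the final closing tag rather than the unclosed element:

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the malformed file: the first <country>
# (opened on line 2) is never closed.
lines = [
    "<mondial>",            # line 1
    "  <country>",          # line 2: its closing tag is missing
    "    <name>A</name>",   # line 3
    "  <country>",          # line 4: parsed as a child of the first country
    "    <name>B</name>",   # line 5
    "  </country>",         # line 6: closes the country from line 4
    "</mondial>",           # line 7: mismatch is detected here
]

try:
    ET.fromstring("\n".join(lines))
    error = None
except ET.ParseError as exc:
    error = exc

# The parser reports where the mismatch was detected (line 7),
# not where the unclosed tag was opened (line 2).
error_line, error_column = error.position
print(error)
print("reported line:", error_line)
```

xmllint's extra "country line 16795" detail in the message above is what makes it possible to jump straight to the element that was left open.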
OK, xmllint is clearly the winner in this exercise. Here is xmllint in action on the original, well-formed file:
$ xmllint --shell mondial.xml
/ > help
base display XML base of the node
setbase URI change the XML base of the node
bye leave shell
cat [node] display node or current node
cd [path] change directory to path or to root
dir [path] dumps informations about the node (namespace, attributes, content)
du [path] show the structure of the subtree under path or the current node
exit leave shell
help display this help
free display memory usage
load [name] load a new document with name
ls [path] list contents of path or the current directory
set xml_fragment replace the current node content with the fragment parsed in context
xpath expr evaluate the XPath expression in that context and print the result
setns nsreg register a namespace to a prefix in the XPath evaluation context
format for nsreg is: prefix=[nsuri] (i.e. prefix= unsets a prefix)
setrootns register all namespace found on the root element
the default namespace if any uses 'defaultns' prefix
pwd display current working directory
quit leave shell
save [name] save this document to name or the original name
write [name] write the current node to the filename
validate check the document for errors
relaxng rng validate the document agaisnt the Relax-NG schemas
grep string search for a string in the subtree
/ > validate
mondial.xml:35144: element island: validity error : Syntax of value for attribute sea of island is not valid
validity error : attribute sea line 35144 references an unknown ID ""
/ > base
mondial.xml
/ > dir
DOCUMENT
version=1.0
encoding=UTF-8
URL=mondial.xml
standalone=true
/ > grep Singapore
/mondial/country[105]/name : t-- 9 Singapore
/mondial/country[105]/city/name : t-- 9 Singapore
/mondial/island[163]/name : t-- 9 Singapore
/ > cd /mondial/country[105]
country > cat
<country car_code="SGP" area="632.6" capital="cty-Singapore-Singapore" memberships="org-AsDB org-ASEAN org-Mekong-Group org-CP org-C org-CCC org-ESCAP org-G-77 org-IAEA org-IBRD org-ICC org-ICAO org-ICFTU org-Interpol org-IFRCS org-IFC org-ILO org-IMO org-Inmarsat org-IMF org-IOC org-ISO org-ICRM org-ITU org-Intelsat org-NAM org-PCA org-UN org-UNIKOM org-UPU org-WHO org-WIPO org-WMO org-WTrO">
<name>Singapore</name>
<population>3396924</population>
<population_growth>1.9</population_growth>
<infant_mortality>4.7</infant_mortality>
<gdp_total>66100</gdp_total>
<gdp_ind>28</gdp_ind>
<gdp_serv>72</gdp_serv>
<inflation>1.7</inflation>
<indep_date>1965-08-09</indep_date>
<government>republic within Commonwealth</government>
<encompassed continent="asia" percentage="100"/>
<ethnicgroups percentage="6.4">Indian</ethnicgroups>
<ethnicgroups percentage="76.4">Chinese</ethnicgroups>
<ethnicgroups percentage="14.9">Malay</ethnicgroups>
<city id="cty-Singapore-Singapore" is_country_cap="yes" country="SGP">
<name>Singapore</name>
<longitude>103.833</longitude>
<latitude>1.3</latitude>
<population year="87">2558000</population>
<located_at watertype="sea" sea="sea-SouthChinaSea"/>
<located_on island="island-Singapore"/>
</city>
</country>
Find the countries with a lower infant mortality than Singapore's (4.7):
country > xpath //country[infant_mortality<4.7]/name/text()
Object is a Node Set :
Set contains 9 nodes:
1 TEXT
content=Andorra
2 TEXT
content=Sweden
3 TEXT
content=Iceland
4 TEXT
content=Jersey
5 TEXT
content=Man
6 TEXT
content=Hong Kong
7 TEXT
content=Japan
8 TEXT
content=Anguilla
9 TEXT
content=Bermuda
country > quit
The same query can be run directly from the command line:
$ time xmllint --xpath '//country[infant_mortality<4.7]/name' --format mondial.xml
<name>Andorra</name><name>Sweden</name><name>Iceland</name><name>Jersey</name><name>Man</name><name>Hong Kong</name><name>Japan</name><name>Anguilla</name><name>Bermuda</name>
real 0m0.219s
user 0m0.192s
sys 0m0.020s
Alternatively, you can look up Singapore's value inside the XPath expression itself instead of hard-coding 4.7:
$ xmllint --shell mondial.xml
/ > xpath //country[infant_mortality<//country[name="Singapore"]/infant_mortality]/name/text()
Object is a Node Set :
Set contains 9 nodes:
1 TEXT
content=Andorra
2 TEXT
content=Sweden
3 TEXT
content=Iceland
4 TEXT
content=Jersey
5 TEXT
content=Man
6 TEXT
content=Hong Kong
7 TEXT
content=Japan
8 TEXT
content=Anguilla
9 TEXT
content=Bermuda
$ time xmllint --xpath '//country[infant_mortality<//country[name="Singapore"]/infant_mortality]/name' --format mondial.xml
<name>Andorra</name><name>Sweden</name><name>Iceland</name><name>Jersey</name><name>Man</name><name>Hong Kong</name><name>Japan</name><name>Anguilla</name><name>Bermuda</name>
real 0m2.074s
user 0m2.052s
sys 0m0.016s
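The nested query took about 2.07 seconds against 0.22 seconds for the hard-coded one. A plausible explanation, which I have not verified, is that the inner //country[name="Singapore"] lookup is evaluated again for every candidate country. The sketch below shows the alternative of looking the reference value up once and reusing it. It assumes Python's standard xml.etree.ElementTree rather than xmllint, and apart from Singapore's 4.7 the sample data is invented (the other country names and values are placeholders):

```python
import xml.etree.ElementTree as ET

# Invented sample data: only Singapore's 4.7 comes from the session above;
# the other country names and values are placeholders.
SAMPLE = """<mondial>
  <country><name>Singapore</name><infant_mortality>4.7</infant_mortality></country>
  <country><name>Lowland</name><infant_mortality>3.9</infant_mortality></country>
  <country><name>Midland</name><infant_mortality>4.7</infant_mortality></country>
  <country><name>Highland</name><infant_mortality>12.5</infant_mortality></country>
  <country><name>Nodata</name></country>
</mondial>"""

root = ET.fromstring(SAMPLE)


def mortality(country):
    """Return infant_mortality as a float, or None when the element is absent."""
    text = country.findtext("infant_mortality")
    return float(text) if text is not None else None


# Look the reference value up once, then reuse it for every comparison.
singapore = root.find("country[name='Singapore']")
threshold = mortality(singapore)

# Strictly lower than the threshold; countries without a value are skipped.
lower = [
    c.findtext("name")
    for c in root.iter("country")
    if mortality(c) is not None and mortality(c) < threshold
]

print(threshold, lower)
```

The comparison is done in Python because ElementTree's XPath subset has no numeric comparison operators; with xmllint the equivalent is to read the value first and then substitute it into the hard-coded form of the query.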
xmllint is definitely my preferred XML companion. It is fast (the hard-coded query over mondial.xml ran in about 0.2 seconds), and in this test it located the malformed element more precisely than Firefox and tdom.