Friday, January 07, 2011

xmllint - Answer to an XML Question

Today I was asked how to validate, or at least check the well-formedness of, an XML file. My immediate answers were to open the file in a browser and see where it breaks, or to parse it with tdom. xmllint did not come to mind at the time because I seldom use it.

After some thought, I decided to verify my answer. I downloaded a fairly sizeable XML file from the Mondial project for the test and deliberately removed one closing tag to make it not well-formed. Neither tdom nor Firefox could identify where the closing tag was missing; both report the error only at the last line of the file. Only xmllint points back to the unclosed element (the country opened on line 16795), which is close to where the tag was removed.

$ diff mondial.xml mondial-malformed.xml 
16819d16818
<    </country>


$ firefox mondial-malformed.xml

Firefox
XML Parsing Error: mismatched tag. Expected: </country>.
Location: file:///home/chihung/Projects/xmllint/mondial-malformed.xml
Line Number 39564, Column 3:</mondial>
--^


$ tclsh
% package require tdom
0.8.3
% set doc [dom parse [tDOM::xmlReadFile mondial-malformed.xml]]
error "mismatched tag" at line 39564 character 2
"ude>
   </desert>
</m <--Error-- ondial>
"


$ xmllint --shell mondial-malformed.xml 
mondial-malformed.xml:39564: parser error : Opening and ending tag mismatch: country line 16795 and mondial
</mondial>
          ^
mondial-malformed.xml:39565: parser error : Premature end of data in tag mondial line 3

^
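
For a quick, non-interactive well-formedness check, xmllint can also be run with --noout, which suppresses the serialized document and leaves only the diagnostics and the exit status. The following is a minimal sketch, assuming a small made-up file (bad.xml, not the Mondial data) with a missing </country> closing tag:

```shell
# Hypothetical tiny document with a missing </country> closing tag
tmp=$(mktemp -d)
cat > "$tmp/bad.xml" <<'EOF'
<?xml version="1.0"?>
<mondial>
  <country>
    <name>Singapore</name>
</mondial>
EOF

# --noout prints nothing on success; parser errors go to stderr
# and the exit status is non-zero when the file is not well-formed
if xmllint --noout "$tmp/bad.xml" 2>"$tmp/err.txt"; then
  result="well-formed"
else
  result="not well-formed"
fi
echo "$result"
cat "$tmp/err.txt"
```

Because the result is carried by the exit status, the same check drops easily into a script or a Makefile.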

OK, xmllint is clearly the winner in this exercise. Here is xmllint's interactive shell in action on the original, unmodified file:

$ xmllint --shell mondial.xml 
/ > help
 base         display XML base of the node
 setbase URI  change the XML base of the node
 bye          leave shell
 cat [node]   display node or current node
 cd [path]    change directory to path or to root
 dir [path]   dumps informations about the node (namespace, attributes, content)
 du [path]    show the structure of the subtree under path or the current node
 exit         leave shell
 help         display this help
 free         display memory usage
 load [name]  load a new document with name
 ls [path]    list contents of path or the current directory
 set xml_fragment replace the current node content with the fragment parsed in context
 xpath expr   evaluate the XPath expression in that context and print the result
 setns nsreg  register a namespace to a prefix in the XPath evaluation context
              format for nsreg is: prefix=[nsuri] (i.e. prefix= unsets a prefix)
 setrootns    register all namespace found on the root element
              the default namespace if any uses 'defaultns' prefix
 pwd          display current working directory
 quit         leave shell
 save [name]  save this document to name or the original name
 write [name] write the current node to the filename
 validate     check the document for errors
 relaxng rng  validate the document agaisnt the Relax-NG schemas
 grep string  search for a string in the subtree

/ > validate
mondial.xml:35144: element island: validity error : Syntax of value for attribute sea of island is not valid
validity error : attribute sea line 35144 references an unknown ID ""

/ > base
mondial.xml

/ > dir
DOCUMENT
version=1.0
encoding=UTF-8
URL=mondial.xml
standalone=true

/ > grep Singapore
/mondial/country[105]/name : t--        9 Singapore
/mondial/country[105]/city/name : t--        9 Singapore
/mondial/island[163]/name : t--        9 Singapore

/ > cd /mondial/country[105]

country > cat
<country car_code="SGP" area="632.6" capital="cty-Singapore-Singapore" memberships="org-AsDB org-ASEAN org-Mekong-Group org-CP org-C org-CCC org-ESCAP org-G-77 org-IAEA org-IBRD org-ICC org-ICAO org-ICFTU org-Interpol org-IFRCS org-IFC org-ILO org-IMO org-Inmarsat org-IMF org-IOC org-ISO org-ICRM org-ITU org-Intelsat org-NAM org-PCA org-UN org-UNIKOM org-UPU org-WHO org-WIPO org-WMO org-WTrO">
      <name>Singapore</name>
      <population>3396924</population>
      <population_growth>1.9</population_growth>
      <infant_mortality>4.7</infant_mortality>
      <gdp_total>66100</gdp_total>
      <gdp_ind>28</gdp_ind>
      <gdp_serv>72</gdp_serv>
      <inflation>1.7</inflation>
      <indep_date>1965-08-09</indep_date>
      <government>republic within Commonwealth</government>
      <encompassed continent="asia" percentage="100"/>
      <ethnicgroups percentage="6.4">Indian</ethnicgroups>
      <ethnicgroups percentage="76.4">Chinese</ethnicgroups>
      <ethnicgroups percentage="14.9">Malay</ethnicgroups>
      <city id="cty-Singapore-Singapore" is_country_cap="yes" country="SGP">
         <name>Singapore</name>
         <longitude>103.833</longitude>
         <latitude>1.3</latitude>
         <population year="87">2558000</population>
         <located_at watertype="sea" sea="sea-SouthChinaSea"/>
         <located_on island="island-Singapore"/>
      </city>
   </country>
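
The validate command used in the shell session above checks the document against its DTD, which is a stricter test than well-formedness. The same check can be done from the command line with --valid. The sketch below assumes a made-up document with an internal DTD (the element and attribute declarations are my own illustration, not the real Mondial DTD) in which an IDREFS attribute is left empty:

```shell
# Hypothetical document: well-formed, but island/@sea is an empty IDREFS value
tmp=$(mktemp -d)
cat > "$tmp/island.xml" <<'EOF'
<?xml version="1.0"?>
<!DOCTYPE mondial [
  <!ELEMENT mondial (sea*, island*)>
  <!ELEMENT sea EMPTY>
  <!ATTLIST sea id ID #REQUIRED>
  <!ELEMENT island EMPTY>
  <!ATTLIST island sea IDREFS #IMPLIED>
]>
<mondial>
  <sea id="sea-SouthChinaSea"/>
  <island sea=""/>
</mondial>
EOF

# Well-formedness only: this passes
xmllint --noout "$tmp/island.xml" && wellformed="yes"
echo "well-formed: $wellformed"

# DTD validation: this reports a validity error and exits non-zero
xmllint --valid --noout "$tmp/island.xml" 2>"$tmp/valid-err.txt"
status=$?
echo "validation exit status: $status"
cat "$tmp/valid-err.txt"
```

A file can therefore pass the well-formedness check and still fail validation, which is what happens with mondial.xml in the session above.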

Finding countries with an infant_mortality lower than Singapore's (4.7):

country > xpath //country[infant_mortality<4.7]/name/text()
Object is a Node Set :
Set contains 9 nodes:
1  TEXT
    content=Andorra
2  TEXT
    content=Sweden
3  TEXT
    content=Iceland
4  TEXT
    content=Jersey
5  TEXT
    content=Man
6  TEXT
    content=Hong Kong
7  TEXT
    content=Japan
8  TEXT
    content=Anguilla
9  TEXT
    content=Bermuda

country > quit

The same query can be run straight from the command line:

$ time xmllint --xpath '//country[infant_mortality<4.7]/name' --format mondial.xml 
<name>Andorra</name><name>Sweden</name><name>Iceland</name><name>Jersey</name><name>Man</name><name>Hong Kong</name><name>Japan</name><name>Anguilla</name><name>Bermuda</name>

real 0m0.219s
user 0m0.192s
sys 0m0.020s
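
The --xpath option is not limited to node sets. An expression that evaluates to a number or a string is printed as plain text, which is handy in scripts. A minimal sketch, assuming a made-up three-country file (countries.xml with invented figures apart from Singapore's 4.7):

```shell
# Hypothetical sample data, much smaller than mondial.xml
tmp=$(mktemp -d)
cat > "$tmp/countries.xml" <<'EOF'
<?xml version="1.0"?>
<mondial>
  <country><name>Low</name><infant_mortality>3.2</infant_mortality></country>
  <country><name>Singapore</name><infant_mortality>4.7</infant_mortality></country>
  <country><name>High</name><infant_mortality>60.1</infant_mortality></country>
</mondial>
EOF

# count() returns a number, so xmllint prints just that number
count=$(xmllint --xpath 'count(//country[infant_mortality<4.7])' "$tmp/countries.xml")
echo "countries below 4.7: $count"
```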

Alternatively, you can look up Singapore's value inside the XPath expression itself instead of hard-coding 4.7:

$ xmllint --shell mondial.xml
/ > xpath //country[infant_mortality<//country[name="Singapore"]/infant_mortality]/name/text()
Object is a Node Set :
Set contains 9 nodes:
1  TEXT
    content=Andorra
2  TEXT
    content=Sweden
3  TEXT
    content=Iceland
4  TEXT
    content=Jersey
5  TEXT
    content=Man
6  TEXT
    content=Hong Kong
7  TEXT
    content=Japan
8  TEXT
    content=Anguilla
9  TEXT
    content=Bermuda

$ time xmllint --xpath '//country[infant_mortality<//country[name="Singapore"]/infant_mortality]/name' --format mondial.xml 
<name>Andorra</name><name>Sweden</name><name>Iceland</name><name>Jersey</name><name>Man</name><name>Hong Kong</name><name>Japan</name><name>Anguilla</name><name>Bermuda</name>
real 0m2.074s
user 0m2.052s
sys 0m0.016s
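
The nested version is roughly ten times slower in the timings above, presumably because the inner //country[name="Singapore"] lookup is evaluated again for every candidate country (that is my guess, I have not profiled it). One way around this is to fetch the threshold once into a shell variable and then reuse it. A minimal sketch, assuming a made-up three-country file rather than mondial.xml:

```shell
# Hypothetical sample data standing in for mondial.xml
tmp=$(mktemp -d)
file="$tmp/countries.xml"
cat > "$file" <<'EOF'
<?xml version="1.0"?>
<mondial>
  <country><name>Low</name><infant_mortality>3.2</infant_mortality></country>
  <country><name>Singapore</name><infant_mortality>4.7</infant_mortality></country>
  <country><name>High</name><infant_mortality>60.1</infant_mortality></country>
</mondial>
EOF

# Step 1: look up Singapore's value once; string() yields plain text
threshold=$(xmllint --xpath 'string(//country[name="Singapore"]/infant_mortality)' "$file")
echo "threshold: $threshold"

# Step 2: reuse the value as a constant in the second query
names=$(xmllint --xpath "//country[infant_mortality<$threshold]/name" "$file")
echo "$names"
```

This costs two xmllint runs, so whether it is actually faster on a given file is worth measuring with time.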

xmllint is definitely my preferred XML companion. Compared with Firefox and tdom, it is fast, efficient, and much better at showing where an error comes from.


2 Comments:

Blogger Unknown said...

Hi

I got an error, "xpath: command not found", when I'm outside the shell. Please help me.

9:04 PM  
Blogger chihungchan said...

My xpath is a command within the xmllint shell.

If you install libxml-xpath-perl, you will have a /usr/bin/xpath Perl script.

9:30 PM  
