Introduction to several common methods of parsing XML with Python


First, Introduction

XML (Extensible Markup Language) is designed to transmit and store data. It has become the core of many new technologies and is applied differently in different fields. It is a natural product of the web's development: it combines the core features of SGML with the simplicity of HTML, and adds features of its own, such as a clear, well-defined structure.
There are three common ways to parse XML in Python. The first is the xml.dom.* modules, an implementation of the W3C DOM API; use them if you need the DOM API, and note that the xml.dom package contains several modules that you need to distinguish between. The second is the xml.sax.* modules, an implementation of the SAX API, which trades convenience for speed and lower memory usage. SAX is an event-based API, so it can process very large documents "on the fly" without loading them fully into memory. The third is the xml.etree.ElementTree module (ET for short), which provides a lightweight, Pythonic API that is much faster than DOM and pleasant to use. ET.iterparse also offers "on the fly" processing, as SAX does, without loading the whole document into memory. ET's average performance is similar to SAX's, but its API is higher-level and easier to use.
Second, Details

The XML file to be parsed (country.xml):

  <?xml version="1.0"?>
  <data>
    <country name="Singapore">
      <rank>4</rank>
      <year>2011</year>
      <gdppc>59900</gdppc>
      <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
      <rank>68</rank>
      <year>2011</year>
      <gdppc>13600</gdppc>
      <neighbor name="Costa Rica" direction="W"/>
      <neighbor name="Colombia" direction="E"/>
    </country>
  </data>

1, xml.etree.ElementTree

ElementTree was designed for processing XML. It has two implementations in the Python standard library: a pure-Python one, xml.etree.ElementTree, and a faster C one, xml.etree.cElementTree. Note: prefer the C implementation where it is available, because it is faster and uses less memory.

  try:
      import xml.etree.cElementTree as ET
  except ImportError:
      import xml.etree.ElementTree as ET

This is a common Python idiom for using the same API across different libraries. From Python 3.3 on, the ElementTree module automatically uses the C accelerator when it is available, so importing xml.etree.ElementTree is enough. xml.etree.cElementTree was deprecated in Python 3.3 and removed in Python 3.9.

  #!/usr/bin/env python
  # coding: utf-8

  try:
      import xml.etree.cElementTree as ET
  except ImportError:
      import xml.etree.ElementTree as ET
  import sys

  try:
      tree = ET.parse("country.xml")          # open the XML document
      # root = ET.fromstring(country_string)  # or parse XML from a string
      root = tree.getroot()                   # get the root node
  except Exception:
      print("Error: cannot parse file: country.xml.")
      sys.exit(1)

  print(root.tag, "---", root.attrib)
  for child in root:
      print(child.tag, "---", child.attrib)

  print("*" * 10)
  print(root[0][1].text)                      # access nodes by index
  print(root[0].tag, root[0].text)
  print("*" * 10)

  for country in root.findall('country'):     # find all country nodes under root
      rank = country.find('rank').text        # text of the child node rank
      name = country.get('name')              # value of the attribute name
      print(name, rank)

  # modify the XML file
  for country in root.findall('country'):
      rank = int(country.find('rank').text)
      if rank > 50:
          root.remove(country)

  tree.write('output.xml')


Reference: https://docs.python.org/2/library/xml.etree.elementtree.html
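
The introduction mentions that ET.iterparse can process a document without loading all of it into memory. A minimal sketch of that idea follows, assuming Python 3 and an inline copy of country.xml; the helper name iter_country_ranks is made up for illustration and is not part of the original article.

```python
import io
import xml.etree.ElementTree as ET

# Inline copy of country.xml so the sketch is self-contained.
COUNTRY_XML = b"""<?xml version="1.0"?>
<data>
  <country name="Singapore">
    <rank>4</rank>
    <year>2011</year>
    <gdppc>59900</gdppc>
    <neighbor name="Malaysia" direction="N"/>
  </country>
  <country name="Panama">
    <rank>68</rank>
    <year>2011</year>
    <gdppc>13600</gdppc>
    <neighbor name="Costa Rica" direction="W"/>
    <neighbor name="Colombia" direction="E"/>
  </country>
</data>
"""


def iter_country_ranks(source):
    """Yield (name, rank) pairs one country at a time (hypothetical helper)."""
    # An "end" event fires once an element and all of its children are parsed.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "country":
            yield elem.get("name"), int(elem.find("rank").text)
            elem.clear()  # discard the finished element's children and attributes


ranks = list(iter_country_ranks(io.BytesIO(COUNTRY_XML)))
print(ranks)
```

For a real file, a path such as "country.xml" can be passed to ET.iterparse instead of the BytesIO object. Calling elem.clear() after each country is what keeps memory usage roughly flat on large inputs.
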
2, xml.dom.*

The Document Object Model (DOM) is a standard programming interface recommended by the W3C for processing Extensible Markup Language documents. A DOM parser reads the entire document at once and keeps all of its elements in a tree structure in memory. You can then use the functions provided by the DOM to read or modify the document's content and structure, and write the modified content back to an XML file. In Python, xml.dom.minidom can be used to parse XML files, for example:

  #!/usr/bin/python
  # coding=utf-8

  import xml.dom.minidom

  # open the XML document with the minidom parser
  DOMTree = xml.dom.minidom.parse("country.xml")
  data = DOMTree.documentElement
  if data.hasAttribute("name"):
      print("name element: %s" % data.getAttribute("name"))

  # get all countries in the collection
  countries = data.getElementsByTagName("country")

  # print the details for each country
  for country in countries:
      print("*****country*****")
      if country.hasAttribute("name"):
          print("name: %s" % country.getAttribute("name"))
      rank = country.getElementsByTagName('rank')[0]
      print("rank: %s" % rank.childNodes[0].data)
      year = country.getElementsByTagName('year')[0]
      print("year: %s" % year.childNodes[0].data)
      gdppc = country.getElementsByTagName('gdppc')[0]
      print("gdppc: %s" % gdppc.childNodes[0].data)
      for neighbor in country.getElementsByTagName("neighbor"):
          print(neighbor.tagName, ":", neighbor.getAttribute("name"),
                neighbor.getAttribute("direction"))


Reference: https://docs.python.org/2/library/xml.dom.html
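
The text above notes that the DOM can also modify a document and write it back, which the read-only example does not show. The following is a rough sketch of that, assuming Python 3 and a shortened inline version of country.xml; the new attribute value "SG" is an arbitrary placeholder.

```python
import xml.dom.minidom

# Shortened inline version of country.xml to keep the sketch self-contained.
COUNTRY_XML = """<data>
  <country name="Singapore"><rank>4</rank></country>
  <country name="Panama"><rank>68</rank></country>
</data>"""

dom = xml.dom.minidom.parseString(COUNTRY_XML)
root = dom.documentElement

# Remove every country whose rank is greater than 50
# (the same rule as in the ElementTree example).
for country in root.getElementsByTagName("country"):
    rank = int(country.getElementsByTagName("rank")[0].childNodes[0].data)
    if rank > 50:
        root.removeChild(country)

# Change an attribute on the remaining element.
remaining = root.getElementsByTagName("country")
remaining[0].setAttribute("name", "SG")  # hypothetical new value

names = [c.getAttribute("name") for c in remaining]
print(names)
print(root.toxml())
```

To save the result, the string returned by toxml() can be written to a file, or dom.writexml() can be called with an open file object.
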

3, xml.sax.*

SAX is an event-driven API. Parsing XML with SAX involves two parts: the parser and the event handler. The parser reads the XML document and sends events to the event handler, such as element-start and element-end events; the event handler responds to those events and processes the XML data passed to it. To process XML with SAX in Python, first import the parse function from xml.sax and ContentHandler from xml.sax.handler. SAX is typically used in these cases: first, processing large files; second, when only part of a file, or only specific information from it, is needed; third, when you want to build your own object model.
Introduction to the ContentHandler class methods
(1) characters(content) method
Called when:
From the start of a line until a tag is encountered, if there are characters, content is that string.
From one tag until the next tag is encountered, if there are characters, content is that string.
From a tag until the line terminator is encountered, if there are characters, content is that string.
The tag can be a start tag or an end tag.
(2) startDocument() method
Called when the document starts.
(3) endDocument() method
Called when the parser reaches the end of the document.
(4) startElement(name, attrs) method
Called when an XML start tag is encountered; name is the tag's name and attrs is a dictionary-like object holding the tag's attributes.
(5) endElement(name) method
Called when an XML end tag is encountered; name is the tag's name.
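
To see when each of these callbacks fires, the sketch below records the events for a one-line document. It assumes Python 3; EventLogger is a hypothetical class written for this illustration and is not part of xml.sax.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records each callback as it fires (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, name, attrs):
        self.events.append(("startElement", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("endElement", name))

    def characters(self, content):
        # Skip whitespace-only chunks to keep the log short.
        if content.strip():
            self.events.append(("characters", content))


logger = EventLogger()
xml.sax.parseString(b'<country name="Panama"><rank>68</rank></country>', logger)
for event in logger.events:
    print(event)
```

The log starts with startDocument and ends with endDocument, with the start, characters, and end events of each element nested in document order between them.
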

  #!/usr/bin/python
  # coding=utf-8

  import xml.sax
  import xml.sax.handler

  class CountryHandler(xml.sax.ContentHandler):
      def __init__(self):
          xml.sax.ContentHandler.__init__(self)
          self.CurrentData = ""
          self.rank = ""
          self.year = ""
          self.gdppc = ""
          self.neighborname = ""
          self.neighbordirection = ""

      # element start event handling
      def startElement(self, tag, attributes):
          self.CurrentData = tag
          if tag == "country":
              print("*****country*****")
              name = attributes["name"]
              print("name:", name)
          elif tag == "neighbor":
              name = attributes["name"]
              direction = attributes["direction"]
              print(name, "->", direction)

      # element end event handling
      def endElement(self, tag):
          if self.CurrentData == "rank":
              print("rank:", self.rank)
          elif self.CurrentData == "year":
              print("year:", self.year)
          elif self.CurrentData == "gdppc":
              print("gdppc:", self.gdppc)
          self.CurrentData = ""

      # content event handling
      def characters(self, content):
          if self.CurrentData == "rank":
              self.rank = content
          elif self.CurrentData == "year":
              self.year = content
          elif self.CurrentData == "gdppc":
              self.gdppc = content

  if __name__ == "__main__":
      # create an XMLReader
      parser = xml.sax.make_parser()
      # turn off namespaces
      parser.setFeature(xml.sax.handler.feature_namespaces, 0)

      # override the default ContentHandler
      handler = CountryHandler()
      parser.setContentHandler(handler)
      parser.parse("country.xml")
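
One of the use cases listed above is building your own object model. The sketch below does that by collecting each country into a plain dict, assuming Python 3 and a shortened inline copy of country.xml; CountryCollector is a hypothetical class name. It also appends the text chunks instead of overwriting them, because a SAX parser is allowed to deliver one run of character data in several characters() calls.

```python
import xml.sax

# Shortened inline copy of country.xml so the sketch is self-contained.
COUNTRY_XML = b"""<data>
  <country name="Singapore">
    <rank>4</rank>
    <neighbor name="Malaysia" direction="N"/>
  </country>
  <country name="Panama">
    <rank>68</rank>
    <neighbor name="Costa Rica" direction="W"/>
    <neighbor name="Colombia" direction="E"/>
  </country>
</data>"""


class CountryCollector(xml.sax.ContentHandler):
    """Builds a list of dicts, one per country (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.countries = []
        self._text = []  # chunks of character data for the current element

    def startElement(self, name, attrs):
        self._text = []
        if name == "country":
            self.countries.append({"name": attrs["name"], "neighbors": []})
        elif name == "neighbor":
            self.countries[-1]["neighbors"].append(attrs["name"])

    def characters(self, content):
        # The parser may split one text node into several chunks, so append.
        self._text.append(content)

    def endElement(self, name):
        if name == "rank":
            self.countries[-1]["rank"] = int("".join(self._text))
        self._text = []


collector = CountryCollector()
xml.sax.parseString(COUNTRY_XML, collector)
print(collector.countries)
```

Only the fields the handler asks for are kept, so memory usage depends on the size of the collected model rather than on the size of the document.
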


4, Parsing XML with libxml2 and lxml

libxml2 is an XML parser written in C. It is free, open-source software released under the MIT license, and bindings for many programming languages are built on it. Python's libxml2 module has one minor shortcoming: the xpathEval() interface does not support template-like usage (variables in the XPath expression), although this does not prevent you from using it. Because libxml2 is written in C, its API inevitably feels somewhat unnatural in Python.

  #!/usr/bin/python
  # coding=utf-8

  import libxml2

  doc = libxml2.parseFile("country.xml")
  for book in doc.xpathEval('//country'):
      if book.content != "":
          print("----------------------")
          print(book.content)
  for node in doc.xpathEval("//country/neighbor[@name = 'Colombia']"):
      print(node.name, (node.properties.name, node.properties.content))
  doc.freeDoc()

lxml is a Python library built on libxml2. It is better suited to Python developers than the libxml2 module, and its xpath() interface supports template-like usage through variables in the expression.

  #!/usr/bin/python
  # coding=utf-8

  import lxml.etree

  doc = lxml.etree.parse("country.xml")
  for node in doc.xpath("//country/neighbor[@name = $name]", name="Colombia"):
      print(node.tag, node.items())
  for node in doc.xpath("//country[@name = $name]", name="Singapore"):
      print(node.tag, node.items())
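
When neither libxml2 nor lxml is installed, the standard library's ElementTree supports a limited subset of XPath that covers simple queries like the ones above. The sketch below is a rough stand-in rather than an equivalent of lxml's xpath(): it assumes Python 3, uses an inline copy of the data, and the helper neighbors_named is made up for illustration. ElementTree has no XPath variables, so the value is formatted into the expression.

```python
import xml.etree.ElementTree as ET

# Shortened inline copy of country.xml so the sketch is self-contained.
COUNTRY_XML = """<data>
  <country name="Singapore">
    <neighbor name="Malaysia" direction="N"/>
  </country>
  <country name="Panama">
    <neighbor name="Costa Rica" direction="W"/>
    <neighbor name="Colombia" direction="E"/>
  </country>
</data>"""

root = ET.fromstring(COUNTRY_XML)


def neighbors_named(root, name):
    """Return neighbor elements with the given name (hypothetical helper)."""
    # No XPath variables here, so the value is placed in the expression
    # directly; this simple quoting assumes `name` contains no quote characters.
    return root.findall(".//country/neighbor[@name='%s']" % name)


for node in neighbors_named(root, "Colombia"):
    print(node.tag, node.items())
```

For anything beyond simple paths and attribute predicates, such as XPath functions or variables, lxml remains the more capable choice.
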

Third, Summary
(1) The libraries and modules available for XML parsing in Python include xml, libxml2, lxml, xpath, and others; consult the corresponding documentation for details.
(2) Each parsing method has its own advantages and disadvantages; weigh performance together with the other factors before choosing one.
(3) If anything is missing or wrong, please leave a comment. Thanks in advance!

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
