Introduction to several common methods of parsing XML with Python

Source: Internet
Author: User
I. Introduction

XML (Extensible Markup Language) is designed to transmit and store data. It has become the core of many new technologies and is applied differently in different fields. It emerged naturally as the web matured: it keeps the core features of SGML and the simplicity of HTML, and adds features of its own, such as a clear, well-formed structure.
Python offers three common ways to parse XML. The first is the xml.dom.* modules, an implementation of the W3C DOM API; they are the right choice when you need the DOM API itself. Note that the xml.dom package contains several modules, so take care to distinguish between them. The second is the xml.sax.* modules, an implementation of the SAX API, which trades convenience for speed and low memory use. SAX is event-based, so it can process very large documents on the fly without loading them fully into memory. The third is the xml.etree.ElementTree module (ET for short), which provides a lightweight, Pythonic API. ET is much faster than DOM and has many pleasant APIs. Like SAX, ET can process a document on the fly through ET.iterparse, without loading the whole document into memory. ET's average performance is similar to SAX's, but its API is higher-level and more convenient to use.
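As a rough illustration of the on-the-fly handling mentioned above, the sketch below uses ET.iterparse to visit one country element at a time. The sample data and the helper name iter_ranks are made up for this example and are not part of the original article.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for this sketch.
SAMPLE = b"""<data>
  <country name="A"><rank>1</rank></country>
  <country name="B"><rank>2</rank></country>
  <country name="C"><rank>3</rank></country>
</data>"""


def iter_ranks(source):
    """Yield (name, rank) pairs one country at a time."""
    # "end" events fire once an element has been read completely.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "country":
            yield elem.get("name"), int(elem.find("rank").text)
            # Release the element's children and text once it has been used.
            elem.clear()


ranks = list(iter_ranks(io.BytesIO(SAMPLE)))
print(ranks)
```

ET.iterparse also accepts a file name, so the same loop should work on a large file on disk; clearing each element after use is what keeps the memory footprint small.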
II. Details

The XML file parsed in the examples (country.xml):

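  <?xml version="1.0"?>
  <!-- Markup reconstructed from the sample file in the Python ElementTree
       documentation cited below; only the values 4, 59900 and 13600
       survive in the source. -->
  <data>
      <country name="Singapore">
          <rank>4</rank>
          <year>2011</year>
          <gdppc>59900</gdppc>
          <neighbor name="Malaysia" direction="N"/>
      </country>
      <country name="Panama">
          <rank>68</rank>
          <year>2011</year>
          <gdppc>13600</gdppc>
          <neighbor name="Costa Rica" direction="W"/>
          <neighbor name="Colombia" direction="E"/>
      </country>
  </data>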

1. xml.etree.ElementTree

ElementTree was designed for working with XML. The Python standard library has two implementations: a pure-Python one, xml.etree.ElementTree, and a faster C one, xml.etree.cElementTree. Note: prefer the C implementation whenever it is available, because it is faster and uses less memory.

  try:
      import xml.etree.cElementTree as ET
  except ImportError:
      import xml.etree.ElementTree as ET

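This is a common idiom for using whichever of several libraries exposing the same API is available. Starting with Python 3.3, the ElementTree module automatically uses the C accelerator when it is available, so importing xml.etree.ElementTree is all that is needed. (xml.etree.cElementTree was deprecated in Python 3.3 and removed in Python 3.9; on those versions the fallback branch above is the one that runs.)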

  #!/usr/bin/env python
  # coding: utf-8

  try:
      import xml.etree.cElementTree as ET
  except ImportError:
      import xml.etree.ElementTree as ET
  import sys

  try:
      tree = ET.parse("country.xml")  # open the XML document
      # root = ET.fromstring(country_string)  # or parse XML from a string
      root = tree.getroot()  # get the root node
  except Exception as e:
      print("Error: cannot parse file: country.xml.")
      sys.exit(1)

  print(root.tag, "---", root.attrib)
  for child in root:
      print(child.tag, "---", child.attrib)

  print("*" * 10)
  print(root[0][1].text)  # access nodes by index
  print(root[0].tag, root[0].text)
  print("*" * 10)

  for country in root.findall('country'):  # all country nodes directly under root
      rank = country.find('rank').text  # text of the rank child node
      name = country.get('name')  # value of the name attribute
      print(name, rank)

  # modify the XML: drop every country ranked above 50, then save
  for country in root.findall('country'):
      rank = int(country.find('rank').text)
      if rank > 50:
          root.remove(country)

  tree.write('output.xml')

Reference: https://docs.python.org/2/library/xml.etree.elementtree.html
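The script above mentions ET.fromstring only in a comment. As a minimal sketch, assuming a small made-up document in place of country.xml (the names Lowland and Highland are invented), the same find, get and remove calls can be tried without any file on disk:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for country.xml, invented for this sketch.
COUNTRY_STRING = """<data>
  <country name="Lowland"><rank>3</rank></country>
  <country name="Highland"><rank>70</rank></country>
</data>"""

root = ET.fromstring(COUNTRY_STRING)  # returns the root Element directly

# Collect (name, rank) for every country directly under the root.
before = [(c.get("name"), int(c.find("rank").text)) for c in root.findall("country")]

# Remove countries ranked above 50, as the script above does.
for country in root.findall("country"):
    if int(country.find("rank").text) > 50:
        root.remove(country)

after = [c.get("name") for c in root.findall("country")]
print(before)
print(after)
```

Because findall returns a list, removing elements from root inside the loop does not disturb the iteration.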
2. xml.dom.*

The Document Object Model (DOM) is the standard programming interface recommended by the W3C for processing XML. A DOM parser reads the entire document at once and stores all of its elements in a tree structure in memory. You can then use the functions the DOM provides to read or modify the document's content and structure, or to write the modified content back to an XML file. In Python, xml.dom.minidom can be used to parse an XML file, as in the following example:

  #!/usr/bin/python
  # coding=utf-8

  import xml.dom.minidom

  # open the XML document with the minidom parser
  DOMTree = xml.dom.minidom.parse("country.xml")
  data = DOMTree.documentElement
  if data.hasAttribute("name"):
      print("name element: %s" % data.getAttribute("name"))

  # get all countries in the collection
  countries = data.getElementsByTagName("country")

  # print detailed information for each country
  for country in countries:
      print("*****country*****")
      if country.hasAttribute("name"):
          print("name: %s" % country.getAttribute("name"))

      rank = country.getElementsByTagName('rank')[0]
      print("rank: %s" % rank.childNodes[0].data)
      year = country.getElementsByTagName('year')[0]
      print("year: %s" % year.childNodes[0].data)
      gdppc = country.getElementsByTagName('gdppc')[0]
      print("gdppc: %s" % gdppc.childNodes[0].data)

      for neighbor in country.getElementsByTagName("neighbor"):
          print(neighbor.tagName, ":", neighbor.getAttribute("name"), neighbor.getAttribute("direction"))

Reference: https://docs.python.org/2/library/xml.dom.html
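
The DOM description above also covers modifying a document and writing it back out, which the example does not show. The following is a minimal sketch of that step with xml.dom.minidom; the sample document and the values written into it are assumptions made up for illustration.

```python
from xml.dom.minidom import parseString

# Hypothetical sample document, invented for this sketch.
doc = parseString('<data><country name="A"><rank>1</rank></country></data>')
country = doc.getElementsByTagName("country")[0]

# Modify existing content: change an attribute and a text node.
country.setAttribute("name", "B")
country.getElementsByTagName("rank")[0].firstChild.data = "2"

# Modify the structure: append a new <year> element with a text child.
year = doc.createElement("year")
year.appendChild(doc.createTextNode("2011"))
country.appendChild(year)

# Serialize only the root element, which leaves out the XML declaration.
result = doc.documentElement.toxml()
print(result)
```

Calling toxml() on the document itself instead of on its root element would include the XML declaration; the resulting string can then be written to a file in the usual way.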

3. xml.sax.*

SAX is an event-driven API. Parsing XML with SAX involves two parts: the parser and the event handler. The parser reads the XML document and sends events, such as element-start and element-end events, to the event handler; the event handler responds to those events and processes the XML data passed to it. To process XML with SAX in Python, first import the parse function from xml.sax and ContentHandler from xml.sax.handler. SAX is typically used in three situations: first, when processing large files; second, when only part of a file is needed, or only specific information is to be extracted from it; third, when you want to build your own object model.
Introduction to the ContentHandler class methods
(1) characters(content) method
Called in these cases:
From the start of a line up to the first tag, if there are characters, content holds that string.
From one tag up to the next tag, if there are characters, content holds that string.
From a tag up to a line terminator, if there are characters, content holds that string.
The tag can be either a start tag or an end tag.
(2) startDocument() method
Called when the document starts.
(3) endDocument() method
Called when the parser reaches the end of the document.
(4) startElement(name, attrs) method
Called when an XML start tag is encountered; name is the name of the tag, and attrs is a dictionary-like object holding the tag's attributes.
(5) endElement(name) method
Called when an XML end tag is encountered.
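To make the order of these callbacks concrete, here is a small sketch of a handler that simply records each event it receives. The class name TraceHandler and the one-line sample document are assumptions for illustration only.

```python
import xml.sax


class TraceHandler(xml.sax.ContentHandler):
    """Hypothetical handler that records every callback it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, name, attrs):
        # attrs is dictionary-like; copy it into a plain dict for recording.
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Skip whitespace-only text; record everything else as received.
        if content.strip():
            self.events.append(("chars", content))


handler = TraceHandler()
xml.sax.parseString(b'<country name="A"><rank>1</rank></country>', handler)
for event in handler.events:
    print(event)
```

Note that a parser is allowed to deliver one run of text through several characters() calls, so a handler that needs the complete text of an element should accumulate the pieces rather than rely on a single call.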

  #!/usr/bin/python
  # coding=utf-8

  import xml.sax


  class CountryHandler(xml.sax.ContentHandler):
      def __init__(self):
          self.CurrentData = ""
          self.rank = ""
          self.year = ""
          self.gdppc = ""
          self.neighborname = ""
          self.neighbordirection = ""

      # element start event handling
      def startElement(self, tag, attributes):
          self.CurrentData = tag
          if tag == "country":
              print("*****country*****")
              name = attributes["name"]
              print("name:", name)
          elif tag == "neighbor":
              name = attributes["name"]
              direction = attributes["direction"]
              print(name, "--", direction)

      # element end event handling
      def endElement(self, tag):
          if self.CurrentData == "rank":
              print("rank:", self.rank)
          elif self.CurrentData == "year":
              print("year:", self.year)
          elif self.CurrentData == "gdppc":
              print("gdppc:", self.gdppc)
          self.CurrentData = ""

      # content event handling
      def characters(self, content):
          if self.CurrentData == "rank":
              self.rank = content
          elif self.CurrentData == "year":
              self.year = content
          elif self.CurrentData == "gdppc":
              self.gdppc = content


  if __name__ == "__main__":
      # create an XMLReader
      parser = xml.sax.make_parser()
      # turn off namespaces
      parser.setFeature(xml.sax.handler.feature_namespaces, 0)

      # install our ContentHandler subclass
      handler = CountryHandler()
      parser.setContentHandler(handler)
      parser.parse("country.xml")

4. Parsing XML with libxml2 and lxml

libxml2 is an XML parser written in C. It is free, open-source software under the MIT license, and bindings for it exist in many programming languages. The libxml2 module for Python has one small shortcoming: its xpathEval() interface does not support template-like (parameterized) queries. This does not get in the way of using it, but because libxml2 is written in C, its API inevitably feels somewhat un-Pythonic.

  #!/usr/bin/python
  # coding=utf-8

  import libxml2

  doc = libxml2.parseFile("country.xml")

  # print the text content of every country node
  for book in doc.xpathEval('//country'):
      if book.content != "":
          print("----------------------")
          print(book.content)

  # the attribute value has to be written into the XPath string itself
  for node in doc.xpathEval("//country/neighbor[@name = 'Colombia']"):
      print(node.name, (node.properties.name, node.properties.content))

lxml is a Python library built on libxml2. From a usability standpoint it suits Python developers better than the libxml2 module, and its xpath() interface supports template-like queries with variables.

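  #!/usr/bin/python
  # coding=utf-8

  import lxml.etree

  doc = lxml.etree.parse("country.xml")

  # $name is an XPath variable, filled in from the keyword argument
  for node in doc.xpath("//country/neighbor[@name = $name]", name="Colombia"):
      print(node.tag, node.items())

  for node in doc.xpath("//country[@name = $name]", name="Singapore"):
      # loop body is cut off in the source; assumed to mirror the loop above
      print(node.tag, node.items())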

III. Summary
(1) The libraries and modules available for parsing XML in Python include xml, libxml2, lxml, and xpath; consult their documentation for a deeper understanding.
(2) Each parsing method has its own strengths and weaknesses; weigh performance together with the other factors before choosing one.
(3) If anything here is lacking, please leave a comment. Thanks in advance!
