Introduction to several common methods for parsing XML using Python

Source: Internet
Author: User

I. Introduction

XML (eXtensible Markup Language) is a markup language designed for transmitting and storing data. It underpins many newer technologies and is used across a wide range of fields. XML emerged as the web matured: it keeps the core features of SGML while retaining the simplicity of HTML, and adds benefits of its own, such as a clear, well-defined structure.
The Python standard library offers three common ways to parse XML. The first is the xml.dom.* modules, an implementation of the W3C DOM API; they are a good fit if you need to work with the DOM API. Note that the xml.dom package contains several modules, so take care to tell them apart. The second is the xml.sax.* modules, an implementation of the SAX API, which trades convenience for speed and low memory usage. SAX is an event-based API, which means it can process very large documents "on the fly" without loading them fully into memory. The third is xml.etree.ElementTree (ET for short), which provides a lightweight, Pythonic API. Compared with DOM, ET is much faster and has a more pleasant API. ET.iterparse also offers on-the-fly processing, so the whole document need not be loaded into memory. On average ET performs roughly on par with SAX, but its API is more convenient to use.
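As a rough illustration of the on-the-fly processing mentioned above, the following sketch uses ET.iterparse to yield each country as soon as its end tag has been read. The inline XML_DATA string and the helper name iter_country_ranks are assumptions made for this example, not part of the original article; with a real file you would pass the file name instead of the in-memory buffer.

```python
import io
import xml.etree.ElementTree as ET

# Inline copy of a small document so the sketch runs without a file on disk
# (assumption for illustration; normally you would pass a file name).
XML_DATA = b"""<?xml version="1.0"?>
<data>
  <country name="Singapore"><rank>4</rank></country>
  <country name="Panama"><rank>68</rank></country>
</data>"""


def iter_country_ranks(source):
    """Yield (name, rank) pairs while the document is still being read."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "country":
            yield elem.get("name"), int(elem.find("rank").text)
            elem.clear()  # drop the children we no longer need


ranks = list(iter_country_ranks(io.BytesIO(XML_DATA)))
print(ranks)
```

Calling elem.clear() after each country is handled keeps memory usage low on large inputs, at the cost of not being able to revisit that element's children later.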
II. Details

The XML file to be parsed (country.xml):

  <?xml version="1.0"?>
  <data>
    <country name="Singapore">
      <rank>4</rank>
      <year>2011</year>
      <gdppc>59900</gdppc>
      <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
      <rank>68</rank>
      <year>2011</year>
      <gdppc>13600</gdppc>
      <neighbor name="Costa Rica" direction="W"/>
      <neighbor name="Colombia" direction="E"/>
    </country>
  </data>

1. xml.etree.ElementTree

ElementTree was designed for processing XML. The Python standard library has shipped two implementations: a pure-Python one, xml.etree.ElementTree, and a faster C one, xml.etree.cElementTree. Prefer the C implementation where it is available, because it is faster and uses less memory.

  try:
      import xml.etree.cElementTree as ET
  except ImportError:
      import xml.etree.ElementTree as ET

This is a common idiom that lets code use the same API regardless of which implementation is installed. Since Python 3.3, the ElementTree module automatically uses the C accelerator when it is available, so importing xml.etree.ElementTree is enough. (xml.etree.cElementTree was deprecated in Python 3.3 and removed in Python 3.9.)

  #!/usr/bin/env python
  # coding: utf-8

  try:
      import xml.etree.cElementTree as ET
  except ImportError:
      import xml.etree.ElementTree as ET
  import sys

  try:
      tree = ET.parse("country.xml")          # open the XML document
      # root = ET.fromstring(country_string)  # or parse XML from a string
      root = tree.getroot()                   # get the root node
  except Exception:
      print("Error: cannot parse file: country.xml.")
      sys.exit(1)

  print(root.tag, "---", root.attrib)
  for child in root:
      print(child.tag, "---", child.attrib)

  print("*" * 10)
  print(root[0][1].text)              # access nodes by index
  print(root[0].tag, root[0].text)
  print("*" * 10)

  for country in root.findall('country'):   # all country nodes under the root
      rank = country.find('rank').text      # text of the rank child node
      name = country.get('name')            # value of the name attribute
      print(name, rank)

  # Modify the XML: drop countries ranked below 50, then save
  for country in root.findall('country'):
      rank = int(country.find('rank').text)
      if rank > 50:
          root.remove(country)

  tree.write('output.xml')
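The script above mentions ET.fromstring as an alternative to reading a file. The following is a minimal sketch of that path, assuming a shortened in-memory copy of the data (COUNTRY_STRING is a name chosen here for illustration). It repeats the same lookup and removal steps without touching the disk.

```python
import xml.etree.ElementTree as ET

# Shortened in-memory copy of country.xml (assumption for illustration).
COUNTRY_STRING = """<data>
  <country name="Singapore"><rank>4</rank><year>2011</year></country>
  <country name="Panama"><rank>68</rank><year>2011</year></country>
</data>"""

root = ET.fromstring(COUNTRY_STRING)  # returns the root Element directly

# Collect the name attribute and rank text of every country.
ranks = {c.get("name"): int(c.find("rank").text) for c in root.findall("country")}

# Remove countries whose rank value is above 50, as in the script above.
for country in root.findall("country"):
    if int(country.find("rank").text) > 50:
        root.remove(country)

remaining = [c.get("name") for c in root.findall("country")]
print(ranks, remaining)
```

Because findall returns a list, removing elements from root inside the loop does not disturb the iteration.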


Reference: https://docs.python.org/2/library/xml.etree.elementtree.html
2. xml.dom.*

The Document Object Model (DOM) is a standard programming interface recommended by the W3C for processing Extensible Markup Language documents. When a DOM parser parses an XML document, it reads the entire document at once and stores all of its elements in memory as a tree. You can then use the functions DOM provides to read or modify the content and structure of the document, or to write the modified content back to an XML file. In Python, xml.dom.minidom is used to parse XML files. An example follows:

  #!/usr/bin/python
  # coding=utf-8

  import xml.dom.minidom

  # Open the XML document with the minidom parser
  dom_tree = xml.dom.minidom.parse("country.xml")
  data = dom_tree.documentElement
  if data.hasAttribute("name"):
      print("name element: %s" % data.getAttribute("name"))

  # Get all country elements in the document
  countries = data.getElementsByTagName("country")

  # Print the details of each country
  for country in countries:
      print("*****Country*****")
      if country.hasAttribute("name"):
          print("name: %s" % country.getAttribute("name"))

      rank = country.getElementsByTagName('rank')[0]
      print("rank: %s" % rank.childNodes[0].data)
      year = country.getElementsByTagName('year')[0]
      print("year: %s" % year.childNodes[0].data)
      gdppc = country.getElementsByTagName('gdppc')[0]
      print("gdppc: %s" % gdppc.childNodes[0].data)
      for neighbor in country.getElementsByTagName("neighbor"):
          print(neighbor.tagName, ":", neighbor.getAttribute("name"),
                neighbor.getAttribute("direction"))
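The example above only reads the document. DOM also allows the tree to be modified and serialized again, which might look like the following sketch. The inline XML_DATA string and the "region" attribute are assumptions made for illustration and do not appear in country.xml.

```python
from xml.dom.minidom import parseString

# One-country document kept inline so the sketch is self-contained
# (assumption for illustration).
XML_DATA = """<data><country name="Singapore"><rank>4</rank></country></data>"""

dom = parseString(XML_DATA)  # the whole document is now in memory
country = dom.getElementsByTagName("country")[0]

# Modify an attribute and the text of an existing element.
country.setAttribute("region", "Asia")  # "region" is a made-up attribute
rank = country.getElementsByTagName("rank")[0]
rank.firstChild.data = "5"

# Add a new child element with a text node.
year = dom.createElement("year")
year.appendChild(dom.createTextNode("2011"))
country.appendChild(year)

# Serialize the modified tree back to a string.
result = dom.documentElement.toxml()
print(result)
```

To save the result to disk instead, the document's writexml method can be given an open file object.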


Reference: https://docs.python.org/2/library/xml.dom.html

3. xml.sax.*

SAX is an event-driven API. Parsing XML with SAX involves two parts: a parser and an event handler. The parser reads the XML document and sends events, such as the start and end of an element, to the event handler; the handler responds to those events and processes the XML data passed to it. To process XML with SAX in Python, first import the parse function from xml.sax and the ContentHandler class from xml.sax.handler. SAX is typically used in the following situations: (1) processing large files; (2) when only part of a file's content, or only specific information from it, is needed; (3) when you want to build your own object model.
ContentHandler class methods
(1) characters(content)
Called when:
From the start of a line, if there are characters before the first tag, content holds those characters.
From one tag to the next tag, if there are characters in between, content holds those characters.
From a tag to the line terminator, if there are characters in between, content holds those characters.
Here a tag can be either a start tag or an end tag.
(2) startDocument()
Called when the document starts.
(3) endDocument()
Called when the parser reaches the end of the document.
(4) startElement(name, attrs)
Called when a start tag is encountered; name is the tag name and attrs is a dictionary-like object holding the tag's attributes.
(5) endElement(name)
Called when an end tag is encountered.

  #!/usr/bin/python
  # coding=utf-8

  import xml.sax
  import xml.sax.handler


  class CountryHandler(xml.sax.ContentHandler):
      def __init__(self):
          xml.sax.ContentHandler.__init__(self)
          self.current_data = ""
          self.rank = ""
          self.year = ""
          self.gdppc = ""
          self.neighborname = ""
          self.neighbordirection = ""

      # Handle element start events
      def startElement(self, tag, attributes):
          self.current_data = tag
          if tag == "country":
              print("*****Country*****")
              name = attributes["name"]
              print("name:", name)
          elif tag == "neighbor":
              name = attributes["name"]
              direction = attributes["direction"]
              print(name, "->", direction)

      # Handle element end events
      def endElement(self, tag):
          if self.current_data == "rank":
              print("rank:", self.rank)
          elif self.current_data == "year":
              print("year:", self.year)
          elif self.current_data == "gdppc":
              print("gdppc:", self.gdppc)
          self.current_data = ""

      # Handle character data events
      def characters(self, content):
          if self.current_data == "rank":
              self.rank = content
          elif self.current_data == "year":
              self.year = content
          elif self.current_data == "gdppc":
              self.gdppc = content


  if __name__ == "__main__":
      # Create an XMLReader
      parser = xml.sax.make_parser()
      # Turn off namespace processing
      parser.setFeature(xml.sax.handler.feature_namespaces, 0)

      # Install our ContentHandler subclass
      handler = CountryHandler()
      parser.setContentHandler(handler)

      parser.parse("country.xml")
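One caveat with the handler above: a SAX parser is allowed to deliver the text of a single element in several characters() calls, so assigning content directly can lose part of a longer value. A more defensive variant buffers the chunks and joins them at the end tag. The sketch below assumes an inline copy of the data and a hypothetical RankHandler class that only collects ranks. It also illustrates the "only part of the file is needed" use case.

```python
import xml.sax

# Shortened in-memory copy of country.xml (assumption for illustration).
XML_DATA = b"""<data>
  <country name="Singapore"><rank>4</rank><gdppc>59900</gdppc></country>
  <country name="Panama"><rank>68</rank><gdppc>13600</gdppc></country>
</data>"""


class RankHandler(xml.sax.ContentHandler):
    """Collect {country name: rank}, buffering text until the end tag."""

    def __init__(self):
        super().__init__()
        self.ranks = {}
        self._country = None
        self._buffer = None  # list of text chunks while inside <rank>

    def startElement(self, name, attrs):
        if name == "country":
            self._country = attrs["name"]
        elif name == "rank":
            self._buffer = []

    def characters(self, content):
        # May be called more than once per text node, so append chunks.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "rank":
            self.ranks[self._country] = int("".join(self._buffer))
            self._buffer = None


handler = RankHandler()
xml.sax.parseString(XML_DATA, handler)
print(handler.ranks)
```

Everything other than country names and rank values is ignored as it streams past, so nothing else is kept in memory.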


4. Parsing XML with libxml2 and lxml

libxml2 is an XML parser written in C. It is free, open-source software released under the MIT License, and bindings exist for many programming languages. The libxml2 module for Python has one shortcoming: its xpathEval() interface does not support template-style (parameterized) XPath expressions, although this does not prevent you from using it. Because libxml2 is written in C, its Python API follows the C API closely and can be awkward to use.

  #!/usr/bin/python
  # coding=utf-8

  import libxml2

  doc = libxml2.parseFile("country.xml")
  # Print the text content of every country element
  for country in doc.xpathEval('//country'):
      if country.content != "":
          print("----------------------")
          print(country.content)
  # The attribute value has to be written into the XPath string itself
  for node in doc.xpathEval("//country/neighbor[@name = 'Colombia']"):
      print(node.name, (node.properties.name, node.properties.content))
  doc.freeDoc()  # the document must be freed explicitly

lxml is a Python library built on top of libxml2. It is better suited to Python developers than libxml2, and its xpath() interface supports template-style usage, in which variables such as $name are filled in from keyword arguments.

  #!/usr/bin/python
  # coding=utf-8

  import lxml.etree

  doc = lxml.etree.parse("country.xml")
  # $name in the expression is bound to the keyword argument name
  for node in doc.xpath("//country/neighbor[@name = $name]", name="Colombia"):
      print(node.tag, node.items())
  for node in doc.xpath("//country[@name = $name]", name="Singapore"):
      print(node.tag, node.items())

III. Summary
(1) The libraries and modules available for parsing XML in Python include xml, libxml2, lxml, and xpath. For more information, see their documentation.
(2) Each parsing method has its own advantages and disadvantages. Weigh factors such as speed, memory usage, and ease of use before choosing one.
(3) If you spot any shortcomings, please leave a comment. Thanks in advance!
