Use Python to access XML for instance analysis

Last Update:2017-05-14 Source: Internet

Author: User

Tags xml attribute

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

This article mainly introduces common methods for accessing XML in Python, and analyzes common methods, advantages and disadvantages, and precautions for accessing xml in Python based on specific instance forms, for more information about how to use Python to access XML, the common methods, advantages and disadvantages of accessing xml in Python are analyzed in detail based on the specific instance form. For more information, see

This document describes how to access XML using Python. We will share this with you for your reference. The details are as follows:

Currently, Python 3.2 has the following four methods to access XML:

1. Expat
2. DOM
3. SAX
4. ElementTree

Use the following xml as the basis for discussion

Expat

Expat is a stream-oriented parser. You register the parser callback (or handler) function, and then start searching for its documentation. When the parser identifies the specified location of the file, it will call the corresponding handler (if you have registered one ). The file is delivered to the parser and split into multiple fragments and loaded into memory in segments. Therefore, expat can parse those huge files.

SAX

SAX is a Parser API for sequential access to XML. it acts as a stream Parser and has an event-driven API. The user-defined callback function is called when an event occurs during parsing. Events are triggered when any XML feature is encountered, and when they are ended, they are triggered again. XML attributes are also part of the data transmitted to element events. The direction of data processing in SAX is single; the parsed data cannot be read again without restarting.

DOM

Before any processing starts, the DOM parser must place the entire tree in the memory. Therefore, the memory usage of the DOM parser depends entirely on the size of the input data (relatively speaking, the memory content of the SAX parser is based only on the maximum depth of the XML file (the maximum depth of the XML tree) and the maximum data stored by the XML attribute on a single XML project ).

DOM has two implementation methods in python3.2:

1. xml. minidom is a basic implementation.
2. xml. pulldom builds the Accessed subtree only when needed.

'''Created on 2012-5-25 @ author: salomon' '''import xml. dom. minidom as minidomdom = minidom. parse ("E: \ test. xml ") root = dom. getElementsByTagName ("Schools") # The function getElementsByTagName returns NodeList. print (root. length) for node in root: print ("Root element is % s. "% Node. tagName) # format the output, which is quite different from the C-Series language. Schools = node. getElementsByTagName ("School") for school in schools: print (school. nodeName) print (school. tagName) print (school. getAttribute ("Name") print (school. attributes ["Name"]. value) classes = school. getElementsByTagName ("Class") print ("There are % d classes in school % s" % (classes. length, school. getAttribute ("Name") for mclass in classes: print (mclass. getAttribute ("Id") for student in MCIA Ss. getElementsByTagName ("Student"): print (student. attributes ["Name"]. value) print (student. getElementsByTagName ("English") [0]. nodeValue) # Why? Print (student. getElementsByTagName ("English") [0]. childNodes [0]. nodeValue) student. getElementsByTagName ("English") [0]. childNodes [0]. nodeValue = 75f = open ('new. xml ', 'W', encoding = 'utf-8') dom. writexml (f, encoding = 'utf-8') f. close ()

ElementTree

Currently, ElementTree is found to have less information and does not know its working mechanism. Some materials show that ElementTree is almost a lightweight DOM, but all Element nodes of ElementTree work in the same way. It is similar to XpathNavigator in C.

'''Created on 2012-5-25@author: salomon'''from xml.etree.ElementTree import ElementTreetree = ElementTree()tree.parse("E:\\test.xml")root = tree.getroot()print(root.tag)print(root[0].tag)print(root[0].attrib)schools = root.getchildren() for school in schools:  print(school.get("Name"))  classes = school.findall("Class")  for mclass in classes:    print(mclass.items())    print(mclass.keys())    print(mclass.attrib["Id"])    math = mclass.find("Student").find("Scores").find("Math")    print(math.text)    math.set("teacher", "bada")tree.write("new.xml")

Compare:

As for the above, the Expat and SAX parse XML methods are the same, that is, they do not know what the performance is. Compared with the above two resolvers, DOM consumes memory, and the processing of files is relatively slow due to time-consuming access. If the file is too large to be loaded into the memory, the DOM parser cannot be used, but for some types of XML verification, you need to access the entire file, or when some XML processing only requires the access to the entire file, DOM is the only choice.

Note:

It should be noted that the technologies used to access XML are not original to Python, and Python is also introduced by referring to other languages or directly from other languages. For example, Expat is a development Library developed in C language to parse XML documents. SAX was initially developed by David Megginson using the java language. DOM can access and modify the content and structure of a document in a way independent of platform and language. It can be applied to any programming language.

For comparison, I also want to list how C # accesses XML documents:

1. DOM-based XmlDocument
2. XmlReader and XmlWriter based on stream files (unlike the SAX stream file implementation, SAX is an event-driven model ).
3. Linq to Xml

Two stream file models: XmlReader/XMLWriter VS SAX

The stream model iterates one node in the XML document each time. it is suitable for processing large documents and consumes less memory. The stream model has two variants: the "push" model and the "pull" model.

The push model is also commonly referred to as SAX. SAX is an event-driven model. that is to say, every time it discovers a node, it uses the push model to trigger an event, however, we must write the processing programs for these events, which is very inflexible and troublesome.

. NET uses the implementation scheme based on the "pull" model. when traversing a document, the "pull" model will pull the part of the document that is interested from the reader without triggering events, allow us to access documents in a programmatic manner, which greatly improves flexibility. in terms of performance, the "pull" model can be used to process nodes selectively. every time a node is found in SAX, it notifies the client, the "pull" model can improve the overall efficiency of the Application.

The above is the details of the instance analysis using the Python access XML method. For more information, see other related articles in the first PHP community!

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More