Learn more about the XML tools in Python

Last Update:2016-06-06 Source: Internet

Author: User



Module: xmllib

xmllib is a non-validating, low-level parser. An application programmer using xmllib overrides the XMLParser class and provides methods for handling document elements such as specific or generic tags, or character entities. The way xmllib is used has not changed from Python 1.5x to Python 2.0+. In most cases a better choice is SAX, which is also stream-oriented but is more standard across languages and developers.

The examples in this article are the same as in the original column: they include a DTD called quotations.dtd and a document for this DTD, sample.xml (see Resources for the files mentioned in this article). The following code shows the first few lines of each quotation in sample.xml and generates a very simple ASCII indicator of unknown tags and entities. The parsed text is processed as a continuous stream, and any accumulator used is the responsibility of the programmer (such as the string of character data (#PCDATA) inside a tag, or a list or dictionary of the tags encountered).
Listing 1: try_xmllib.py

    import xmllib, string

    class QuotationParser(xmllib.XMLParser):
        """Crude xmllib extractor for quotations.dtd document"""

        def __init__(self):
            xmllib.XMLParser.__init__(self)
            self.thisquote = ''             # quotation accumulator

        def handle_data(self, data):
            self.thisquote = self.thisquote + data

        def syntax_error(self, message):
            pass

        def start_quotations(self, attrs):  # top level tag
            print '--- Begin Document ---'

        def start_quotation(self, attrs):
            print 'quotation:'

        def end_quotation(self):
            # show the first 230 characters with whitespace normalized
            print string.join(string.split(self.thisquote[:230])) + '...',
            print '(' + str(len(self.thisquote)) + ' bytes)\n'
            self.thisquote = ''

        def unknown_starttag(self, tag, attrs):
            self.thisquote = self.thisquote + '{'

        def unknown_endtag(self, tag):
            self.thisquote = self.thisquote + '}'

        def unknown_charref(self, ref):
            self.thisquote = self.thisquote + '?'

        def unknown_entityref(self, ref):
            self.thisquote = self.thisquote + '#'

    if __name__ == '__main__':
        parser = QuotationParser()
        # feed the document to the parser one character at a time
        for c in open("sample.xml").read():
            parser.feed(c)
        parser.close()

Validation

You may need to look beyond the standard XML support if you require validation while parsing. Unfortunately, the standard Python 2.0 XML package does not include a validating parser.

xmlproc is a parser written in native Python that performs almost complete validation. If you need a validating parser, xmlproc is currently the only choice for Python. In addition, xmlproc provides a variety of high-level and test interfaces that other parsers do not offer.

Choosing a parser

If you decide to use the Simple API for XML (SAX), which you should for complex tasks because most of the other tools are built on top of it, much of the work of choosing a parser is done for you. The xml.sax module contains a facility that automatically selects the "best" parser. In a standard Python 2.0 installation, the only parser available is expat, a fast extension written in C. However, you can also install another parser under $PYTHONLIB/xml/parsers to make it available for selection. Setting up the parser is simple:
Listing 2: Python statement for choosing the best parser

    import xml.sax
    parser = xml.sax.make_parser()

You can also choose a particular parser by passing arguments, but for portability, and for upward compatibility with better parsers in the future, the best approach is to let make_parser() make the choice.
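As a minimal sketch of passing an argument, assuming a current Python 3 standard library rather than the Python 2.0 discussed here: make_parser() accepts a list of module names to try before its built-in defaults. The module name xml.sax.expatreader is the stock expat driver; the variable names are only illustrative.

```python
import xml.sax

# Ask for the expat driver explicitly. make_parser() tries the listed
# modules first and then falls back to its built-in default list.
explicit = xml.sax.make_parser(["xml.sax.expatreader"])

# Let the library pick whatever it considers best.
default = xml.sax.make_parser()

# Show which module each parser object came from.
print(type(explicit).__module__, type(default).__module__)
```

On an installation where expat is the only driver, both calls should end up with the same kind of parser, which is why the argument-free form is usually enough.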

You can import xml.parsers.expat directly. If you do, you get some special tricks that the SAX interface does not provide. In this sense, xml.parsers.expat is somewhat "lower level" than SAX. But SAX is well standardized and handles stream-oriented processing very well; in most cases, SAX is the right level. The difference in raw speed is usually minimal, because make_parser() already obtains the performance that expat provides.
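For comparison, here is a small sketch of driving expat directly, assuming Python 3. The inline document and the handler function names are hypothetical stand-ins; the position attribute CurrentLineNumber is one example of information that the parser object exposes at this lower level.

```python
import xml.parsers.expat

# Hypothetical sample data standing in for sample.xml.
XML = b"<quotations><quotation>Hello <i>world</i></quotation></quotations>"

events = []

def start(name, attrs):
    events.append(("start", name))

def end(name):
    events.append(("end", name))

def chars(data):
    events.append(("text", data))

p = xml.parsers.expat.ParserCreate()
p.buffer_text = True              # deliver adjacent text as one chunk
p.StartElementHandler = start     # handlers are plain attributes, not a class
p.EndElementHandler = end
p.CharacterDataHandler = chars
p.Parse(XML, True)                # True marks this as the final chunk

# The parser object also reports its current position in the input.
print(events[0], p.CurrentLineNumber)
```

Note that handlers are assigned as attributes on the parser object, whereas SAX groups them into a handler class.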

What is SAX

Given that background, a good answer to what SAX is:

SAX (the Simple API for XML) is a common parser interface for XML parsers. It allows application authors to write applications that use XML parsers but are independent of which parser is actually used. (Think of it as JDBC for XML.) (Lars Marius Garshol, SAX for Python)

SAX, like the API of the parser modules it supports, is basically a sequential processor of an XML document. It is used in much the same way as in the xmllib example, but at a more abstract level. Instead of a parser class, the application programmer defines a handler class, which can be registered with whichever parser is used. Four SAX interfaces must be defined (each with several methods): DocumentHandler, DTDHandler, EntityResolver, and ErrorHandler. In the SAX2 interface of xml.sax.handler, the document handler is named ContentHandler, which is the class the listing below subclasses. Creating a parser also attaches default interfaces unless they are overridden. This code performs the same task as the xmllib example:
Listing 3: try_sax.py

    "Simple SAX example, updated for Python 2.0+"
    import string
    import xml.sax
    from xml.sax.handler import *

    class QuotationHandler(ContentHandler):
        """Crude extractor for quotations.dtd compliant XML document"""

        def __init__(self):
            self.in_quote = 0
            self.thisquote = ''

        def startDocument(self):
            print '--- Begin Document ---'

        def startElement(self, name, attrs):
            if name == 'quotation':
                print 'quotation:'
                self.in_quote = 1
            else:
                self.thisquote = self.thisquote + '{'

        def endElement(self, name):
            if name == 'quotation':
                # show the first 230 characters with whitespace normalized
                print string.join(string.split(self.thisquote[:230])) + '...',
                print '(' + str(len(self.thisquote)) + ' bytes)\n'
                self.thisquote = ''
                self.in_quote = 0
            else:
                self.thisquote = self.thisquote + '}'

        def characters(self, ch):
            if self.in_quote:
                self.thisquote = self.thisquote + ch

    if __name__ == '__main__':
        parser = xml.sax.make_parser()
        handler = QuotationHandler()
        parser.setContentHandler(handler)
        parser.parse("sample.xml")

Compared with the xmllib example, two things are worth noting in the example above: the .parse() method handles the entire stream or string, so you do not have to write a loop that feeds the parser; and .parse() is flexible enough to accept a filename, a file object, or a number of file-like objects (some with .read() style methods).
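The flexibility of .parse() can be sketched as follows, assuming Python 3 and an inline document in place of sample.xml. QuoteCollector is a hypothetical handler written for this illustration; unlike the listing above it collects results in a list instead of printing, and it only marks tags that appear inside a quotation.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler

# Hypothetical stand-in for sample.xml.
XML = b"<quotations><quotation>To be, or <i>not</i> to be</quotation></quotations>"

class QuoteCollector(ContentHandler):
    """Collect the text of each <quotation>, marking nested tags with {}."""

    def __init__(self):
        super().__init__()
        self.in_quote = False
        self.parts = []     # accumulator for the current quotation
        self.quotes = []    # finished quotations

    def startElement(self, name, attrs):
        if name == "quotation":
            self.in_quote = True
        elif self.in_quote:
            self.parts.append("{")

    def endElement(self, name):
        if name == "quotation":
            self.quotes.append("".join(self.parts))
            self.parts = []
            self.in_quote = False
        elif self.in_quote:
            self.parts.append("}")

    def characters(self, content):
        if self.in_quote:
            self.parts.append(content)

# 1) parse() with a file-like object (anything with a read() method)
h1 = QuoteCollector()
parser = xml.sax.make_parser()
parser.setContentHandler(h1)
parser.parse(io.BytesIO(XML))

# 2) parseString() when the document is already in memory
h2 = QuoteCollector()
xml.sax.parseString(XML, h2)

print(h1.quotes)
```

Passing a filename string to parser.parse() works the same way; the in-memory forms are used here only so the sketch needs no external file.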

Package: DOM

The DOM is a high-level tree representation of an XML document. The model is not specific to Python; it is a general XML model (see Resources for further information). Python's DOM package is built on SAX and is included in the standard XML support of Python 2.0. Extended code samples are omitted from this article for space reasons, but an excellent overall description is given in the XML-SIG's "Python/XML HOWTO":

The Document Object Model specifies a tree representation of an XML document. The top-level Document instance is the root of the tree and has a single child, the top-level Element instance. This Element has child nodes that represent content and sub-elements, which may have further children, and so on. Functions are defined that let you traverse the resulting tree in any way you like, access element and attribute values, insert and delete nodes, and convert the tree back into XML.

The DOM is useful for modifying XML documents, because you can create a DOM tree, modify it by adding new nodes and moving subtrees around, and then produce a new XML document as output. You can also construct a DOM tree yourself and convert it to XML; this is often a more flexible way of producing XML output than simply writing the markup to a file.

The syntax for using the xml.dom module has changed somewhat since the earlier articles. The DOM implementation that ships with Python 2.0 is called xml.dom.minidom and provides a small, lightweight version of the DOM. Some experimental features of the full XML-SIG DOM are not included in xml.dom.minidom, but you are unlikely to miss them.
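Both uses described above, building a tree from scratch and modifying a parsed one, can be sketched with the standard library's xml.dom.minidom. This assumes Python 3, and the element names, attribute, and text are hypothetical.

```python
from xml.dom.minidom import Document, parseString

# Build a small tree from scratch.
doc = Document()
root = doc.createElement("quotations")
doc.appendChild(root)

quote = doc.createElement("quotation")
quote.setAttribute("source", "Hamlet")
quote.appendChild(doc.createTextNode("To be, or not to be"))
root.appendChild(quote)

# Modify an existing tree: parse, add a node, then serialize again.
dom = parseString("<quotations><quotation>First</quotation></quotations>")
extra = dom.createElement("quotation")
extra.appendChild(dom.createTextNode("Second"))
dom.documentElement.appendChild(extra)

print(root.toxml())
print(dom.documentElement.toxml())
```

Because the serializer handles quoting and nesting, the output stays well-formed even when the tree is rearranged.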

Building a DOM object is simple:
Listing 4: Creating a Python DOM object from an XML file

    from xml.dom.minidom import parse, parseString
    dom1 = parse('mydata.xml')   # parse an XML file by name

Working with DOM objects is fairly straightforward object-oriented programming. However, you often encounter many list-like attributes at various levels of the hierarchy that are not easy to tell apart at a glance (other than by looping over them). For example, here is a common snippet of DOM Python code:
Listing 5: Iterating through Python DOM node objects

    for node in dom_node.childNodes:
        if node.nodeName == '#text':        # PCDATA is a kind of node,
            PCDATA = node.nodeValue         # but not a new subtag
        elif node.nodeName == 'spam':
            spam_node_list.append(node)     # create list of <spam> nodes
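A self-contained version of this kind of loop might look like the following, assuming Python 3 and a small hypothetical document (the menu, spam, and eggs element names are invented for the illustration).

```python
from xml.dom.minidom import parseString

# Hypothetical document: a parent with text and <spam> children.
dom = parseString("<menu>today: <spam/><eggs/><spam/></menu>")
dom_node = dom.documentElement

pcdata = []
spam_nodes = []
for node in dom_node.childNodes:
    if node.nodeType == node.TEXT_NODE:   # same test as nodeName == '#text'
        pcdata.append(node.nodeValue)
    elif node.nodeName == "spam":
        spam_nodes.append(node)

# getElementsByTagName() searches all descendants in a single call.
print(len(spam_nodes), len(dom_node.getElementsByTagName("spam")), pcdata)
```

The explicit loop only sees direct children, while getElementsByTagName() also descends into nested elements, so the two counts agree here only because the sample document is flat.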

The standard Python documentation contains some more detailed DOM examples. The examples of using DOM objects in my earlier articles (see Resources) still point in the right direction, but some method and attribute names have changed since they were published, so check the Python documentation.

Module: Pyxie

The Pyxie module, built on top of Python's standard XML support, provides an additional high-level interface to XML documents. Pyxie does two basic things: it transforms an XML document into a line-based format that is easier to parse, and it provides a way to treat an XML document as an operable tree. The line-based PYX format used by Pyxie is language-neutral, and tools for it are available in several languages. In short, the PYX representation of a document is easier to handle than its XML representation with common line-oriented text processing tools such as grep, sed, awk, bash, Perl, or standard Python modules such as string and re. Depending on the task, converting from XML to PYX can save a lot of work.

Pyxie's idea of treating XML documents as trees is similar to the DOM's. Because the DOM standard is widely supported across many programming languages, most programmers who need a tree representation of an XML document use the DOM rather than Pyxie.
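To give a feel for the line-based idea, here is a rough sketch that emits PYX-style lines from SAX events. It assumes Python 3 and is not the Pyxie module itself. The line prefixes follow the PYX notation as commonly described ("(" for a start tag, "A" for an attribute, "-" for character data, ")" for an end tag); the escaping shown is a simplification, and PyxWriter and xml_to_pyx are hypothetical names.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class PyxWriter(ContentHandler):
    """Emit one PYX-style line per parsing event (illustration only)."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def startElement(self, name, attrs):
        self.lines.append("(" + name)
        for key in attrs.getNames():
            self.lines.append("A%s %s" % (key, attrs.getValue(key)))

    def endElement(self, name):
        self.lines.append(")" + name)

    def characters(self, content):
        # Keep each event on one line by escaping backslashes and newlines.
        escaped = content.replace("\\", "\\\\").replace("\n", "\\n")
        self.lines.append("-" + escaped)

def xml_to_pyx(xml_bytes):
    handler = PyxWriter()
    xml.sax.parseString(xml_bytes, handler)
    return handler.lines

# Hypothetical input document.
pyx = xml_to_pyx(b'<quotation source="Hamlet">To be</quotation>')
print("\n".join(pyx))

# Line-oriented filtering in the spirit of grep: keep only the text lines.
text_only = [line[1:] for line in pyx if line.startswith("-")]
```

Once the document is in this form, extracting all character data is a one-line filter on the first character of each line, which is the kind of saving the paragraph above refers to.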

More modules: xml_pickle and xml_objectify

I have developed my own high-level modules for working with XML, called xml_pickle and xml_objectify. I have written about them at length elsewhere (see Resources), so there is no need to go into much detail here. These modules are useful when you "think in Python" rather than "think in XML." In particular, xml_objectify hides almost all traces of XML from the programmer, letting you work entirely with "native" Python objects in your program. The actual XML data format is abstracted almost completely out of view. Similarly, xml_pickle lets a Python programmer start with "native" Python objects, whose data may come from any source, and then serialize them into an XML format that other users might need later.
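The "think in Python" idea can be illustrated with a toy converter built on xml.dom.minidom. This is emphatically not the xml_objectify API; objectify() below is a hypothetical helper written only for this sketch, assuming Python 3, and the sample document is invented.

```python
from types import SimpleNamespace
from xml.dom.minidom import parseString

def objectify(element):
    """Turn a DOM element into a plain Python object (illustration only)."""
    obj = SimpleNamespace()
    # XML attributes become Python attributes.
    for name, value in element.attributes.items():
        setattr(obj, name, value)
    text = []
    for child in element.childNodes:
        if child.nodeType == child.TEXT_NODE:
            text.append(child.nodeValue)
        elif child.nodeType == child.ELEMENT_NODE:
            # Child elements are collected in lists named after their tag.
            obj.__dict__.setdefault(child.tagName, []).append(objectify(child))
    obj.text = "".join(text).strip()
    return obj

doc = parseString(
    '<quotations><quotation source="Hamlet">To be</quotation>'
    '<quotation source="Macbeth">Out, out</quotation></quotations>'
)
quotes = objectify(doc.documentElement)
print([q.source for q in quotes.quotation])
```

After conversion, the calling code uses only attribute access and list iteration, with no XML-specific calls. A real module has to deal with details this sketch ignores, such as name clashes between attributes, child tags, and the text field.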



This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
