Learn more about the XML tools in Python

Last Update:2017-01-19 Source: Internet

Author: User

Module: xmllib

xmllib is a low-level, non-validating parser. An application programmer who uses xmllib subclasses the XMLParser class and supplies methods that handle document elements such as specific or generic tags, or character entities. The way xmllib is used has not changed from Python 1.5x to Python 2.0+. In most cases the better option is SAX, which is also stream-oriented but is more standard across languages and among developers.

The examples in this article are the same as in the original column: a DTD called quotations.dtd and a document, sample.xml, that conforms to it (see Resources for an archive of the files mentioned in this article). The following code displays the first few lines of each quotation in sample.xml and produces very simple ASCII indicators for unknown tags and entities. The parsed text is handled as a continuous stream, and any accumulators are the programmer's responsibility (such as a string for the text inside a tag (#PCDATA), or a list or dictionary of the tags encountered).
Listing 1: try_xmllib.py

import xmllib, string

class QuotationParser(xmllib.XMLParser):
    """Crude xmllib extractor for quotations.dtd document"""

    def __init__(self):
        xmllib.XMLParser.__init__(self)
        self.thisquote = ''             # quotation accumulator

    def handle_data(self, data):
        self.thisquote = self.thisquote + data

    def syntax_error(self, message):
        pass

    def start_quotations(self, attrs):  # top level tag
        print '--- Begin Document ---'

    def start_quotation(self, attrs):
        print 'QUOTATION:'

    def end_quotation(self):
        print string.join(string.split(self.thisquote[:230])) + '...',
        print '(' + str(len(self.thisquote)) + ' bytes)\n'
        self.thisquote = ''

    def unknown_starttag(self, tag, attrs):
        self.thisquote = self.thisquote + '{'

    def unknown_endtag(self, tag):
        self.thisquote = self.thisquote + '}'

    def unknown_charref(self, ref):
        self.thisquote = self.thisquote + '?'

    def unknown_entityref(self, ref):
        self.thisquote = self.thisquote + '#'

if __name__ == '__main__':
    parser = QuotationParser()
    for c in open("sample.xml").read():
        parser.feed(c)
    parser.close()

Validation

One reason you might need to look beyond the standard XML support is that you need validation while parsing. Unfortunately, the standard Python 2.0 XML package does not include a validating parser.

xmlproc is a native Python parser that performs nearly complete validation. If you need a validating parser, xmlproc is currently the only choice for Python. xmlproc also provides a variety of high-level and test interfaces that other parsers lack.

Choosing a parser

If you decide to use the Simple API for XML (SAX), which is what you should use for anything sophisticated because most of the other tools are built on top of it, much of the work of sorting out parsers is done for you. The xml.sax module contains a facility that automatically selects the "best" parser. In a standard Python 2.0 installation the only parser available is expat, a fast extension written in C. However, you can also install other parsers under $PYTHONLIB/xml/parsers to make them available for selection. Setting up a parser is simple:
Listing 2: Python statement that selects the best parser

import xml.sax
parser = xml.sax.make_parser()

You can also select a specific parser by passing arguments, but for portability, and for upward compatibility with better parsers in the future, the best approach is to let make_parser() do the work.
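
As a minimal sketch of both calls, written in present-day Python 3 rather than the Python 2.0 syntax used in this article's listings: make_parser() accepts an optional list of module names to try before its defaults. The module name xml.sax.expatreader is the driver shipped with the standard library; passing it explicitly here is only an illustration, not a recommendation.

```python
import xml.sax

# Let xml.sax pick the "best" available parser (the portable choice).
default_parser = xml.sax.make_parser()

# Ask for a specific driver module first; xml.sax falls back to its
# default list if the named module cannot be imported.
explicit_parser = xml.sax.make_parser(["xml.sax.expatreader"])

print(type(default_parser).__module__)
print(type(explicit_parser).__module__)
```

Either object exposes the same SAX reader interface, which is why code written against make_parser() does not need to change when a different parser is installed.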

You can import xml.parsers.expat directly. If you do, you get a few special tricks that the SAX interface does not offer. In that sense xml.parsers.expat is somewhat "lower-level" than SAX. But SAX is quite standard and well suited to stream-oriented processing, and in most cases the SAX level is the right one. Because make_parser() already gives you the performance of expat, the difference in raw speed is usually very small.
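
To make the "lower-level" point concrete, here is a small sketch of using xml.parsers.expat directly, in Python 3 syntax. The input document and the handler function names are hypothetical. The expat-specific details shown are the buffer_text switch and the CurrentLineNumber attribute on the parser object, which handlers can read while parsing.

```python
import xml.parsers.expat

events = []  # records what the parser reports, in order

def start_element(name, attrs):
    # CurrentLineNumber is available on the expat parser object itself.
    events.append(("start", name, parser.CurrentLineNumber))

def end_element(name):
    events.append(("end", name))

def char_data(data):
    if data.strip():
        events.append(("text", data))

parser = xml.parsers.expat.ParserCreate()
parser.buffer_text = True            # deliver adjacent text as one chunk
parser.StartElementHandler = start_element
parser.EndElementHandler = end_element
parser.CharacterDataHandler = char_data

# Hypothetical one-line document, not the article's sample.xml.
parser.Parse("<quotations><quotation>Hello</quotation></quotations>", True)

for event in events:
    print(event)
```

Callbacks are plain functions assigned to attributes here, whereas SAX groups them into a handler class that can be registered with any conforming parser.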

What is SAX

Given that background, a good answer to "What is SAX?" is:

SAX (the Simple API for XML) is a common parser interface for XML parsers. It allows application writers to write applications that use XML parsers but are independent of which parser is actually used. (Think of it as JDBC for XML.) (Lars Marius Garshol, SAX for Python)

SAX, like the API of the parser modules it provides, is essentially a sequential processor of an XML document. It is used in much the same way as the xmllib example, but at a more abstract level. The application programmer defines a handler class rather than a parser class, and that handler class can be registered with whatever parser is used. Four SAX interfaces must be defined (each with several methods): DocumentHandler (named ContentHandler in the xml.sax.handler module of Python 2.0, as the listing shows), DTDHandler, EntityResolver, and ErrorHandler. Creating a parser also attaches default interfaces unless they are overridden. This code performs the same task as the xmllib example:
Listing 3: try_sax.py

"Simple SAX example, updated for Python 2.0+"
import string
import xml.sax
from xml.sax.handler import *

class QuotationHandler(ContentHandler):
    """Crude extractor for quotations.dtd compliant XML document"""

    def __init__(self):
        self.in_quote = 0
        self.thisquote = ''

    def startDocument(self):
        print '--- Begin Document ---'

    def startElement(self, name, attrs):
        if name == 'quotation':
            print 'QUOTATION:'
            self.in_quote = 1
        else:
            self.thisquote = self.thisquote + '{'

    def endElement(self, name):
        if name == 'quotation':
            print string.join(string.split(self.thisquote[:230])) + '...',
            print '(' + str(len(self.thisquote)) + ' bytes)\n'
            self.thisquote = ''
            self.in_quote = 0
        else:
            self.thisquote = self.thisquote + '}'

    def characters(self, ch):
        if self.in_quote:
            self.thisquote = self.thisquote + ch

if __name__ == '__main__':
    parser = xml.sax.make_parser()
    handler = QuotationHandler()
    parser.setContentHandler(handler)
    parser.parse("sample.xml")

Compared with xmllib, there are two things to note in the example above: the .parse() method handles an entire stream or string, so you do not have to write a loop that feeds the parser; and .parse() is flexible enough to accept a filename, a file object, or various file-like objects (some of which have a .read() method).
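
The following sketch shows the same handler idea in Python 3 syntax and passes a file-like object to .parse() instead of a filename. The class name QuotationCollector and the inline document are hypothetical stand-ins for the article's handler and sample.xml. Unlike the listing above, this version collects results in a list instead of printing them, and only adds brace markers for tags inside a quotation.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler

class QuotationCollector(ContentHandler):
    """Collects the text of each <quotation>, marking nested tags with braces."""

    def __init__(self):
        super().__init__()
        self.in_quote = False
        self.thisquote = ""
        self.quotes = []

    def startElement(self, name, attrs):
        if name == "quotation":
            self.in_quote = True
        elif self.in_quote:
            self.thisquote += "{"

    def endElement(self, name):
        if name == "quotation":
            # Normalize whitespace, then store the finished quotation.
            self.quotes.append(" ".join(self.thisquote.split()))
            self.thisquote = ""
            self.in_quote = False
        elif self.in_quote:
            self.thisquote += "}"

    def characters(self, content):
        if self.in_quote:
            self.thisquote += content

# Hypothetical stand-in for the article's sample.xml.
SAMPLE = b"""<quotations>
  <quotation>One <i>small</i> step.</quotation>
  <quotation>Second quote.</quotation>
</quotations>"""

handler = QuotationCollector()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.parse(io.BytesIO(SAMPLE))   # a file-like object instead of a filename
print(handler.quotes)
```

No feed loop appears anywhere: the single .parse() call drives the handler through the whole document.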

Package: DOM

The DOM is a high-level tree representation of an XML document. The model is not specific to Python; it is a general XML model (see Resources for further information). Python's DOM package is built on SAX and is included in the standard XML support of Python 2.0. For reasons of space a full code example is not included in this article, but an excellent overall description is given in the XML-SIG's "Python/XML HOWTO":

The Document Object Model specifies a tree-based representation for an XML document. A top-level Document instance is the root of the tree and has a single child, the top-level Element instance. This Element has child nodes representing the content and any sub-elements, which may in turn have children of their own, and so on. Functions are defined that let you traverse the resulting tree in any way you like, access element and attribute values, insert and delete nodes, and convert the tree back into XML.

The DOM is useful for modifying XML documents, because you can create a DOM tree, modify it by adding new nodes and moving subtrees around, and then produce a new XML document as output. You can also construct a DOM tree yourself and convert it to XML, which is a more flexible way to produce XML output than simply writing <tag1>...</tag1> to a file.
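
As a rough illustration of both uses described above, this Python 3 sketch first builds a small tree by hand with xml.dom.minidom and serializes it, then parses an existing document, moves a subtree, and writes the result back out. All element and attribute names are hypothetical.

```python
from xml.dom.minidom import Document, parseString

# Build a small tree by hand and convert it to XML.
doc = Document()
root = doc.createElement("quotations")
doc.appendChild(root)

quote = doc.createElement("quotation")
quote.setAttribute("source", "demo")
quote.appendChild(doc.createTextNode("Hello, DOM"))
root.appendChild(quote)

built_xml = root.toxml()
print(built_xml)

# Parse an existing document, move a subtree, and serialize it again.
dom = parseString("<a><b><c/></b><d/></a>")
c_node = dom.getElementsByTagName("c")[0]
d_node = dom.getElementsByTagName("d")[0]
d_node.appendChild(c_node)   # appending detaches c from b and re-parents it under d

modified_xml = dom.documentElement.toxml()
print(modified_xml)
```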

The syntax for using the xml.dom module has changed somewhat since the earlier articles. The DOM implementation that ships with Python 2.0 is called xml.dom.minidom and provides a small, lightweight version of the DOM. Some experimental features of the full XML-SIG DOM were evidently left out of xml.dom.minidom, but you are unlikely to miss them.

Building a DOM object is simple; just do:
Listing 4: Creating a Python DOM object from an XML file

from xml.dom.minidom import parse, parseString
dom1 = parse('mydata.xml')   # parse an XML file by name
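
The listing imports parseString without using it. The sketch below, in Python 3 syntax, exercises both entry points on the same hypothetical content: parseString() for XML held in a string, and parse() for a file opened by name. The temporary directory exists only so that the example is self-contained.

```python
import os
import tempfile
from xml.dom.minidom import parse, parseString

XML_TEXT = "<root><item>first</item><item>second</item></root>"

# Parse XML held in a string.
dom_from_string = parseString(XML_TEXT)

# Parse the same XML from a file, by name.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "mydata.xml")
    with open(path, "w", encoding="utf-8") as f:
        f.write(XML_TEXT)
    dom_from_file = parse(path)

root_name = dom_from_string.documentElement.tagName
item_count = len(dom_from_file.getElementsByTagName("item"))
print(root_name, item_count)
```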

Working with DOM objects is fairly straightforward object-oriented programming. However, you frequently run into list-like attributes at various levels of the hierarchy that are not easy to tell apart at a glance (other than by looping over them). For example, here is a typical snippet of DOM Python code:
Listing 5: Iterating through a Python DOM node object

for node in dom_node.childNodes:
    if node.nodeName == '#text':       # PCDATA is a kind of node,
        PCDATA = node.nodeValue        # but not a new subtag
    elif node.nodeName == 'spam':
        spam_node_list.append(node)    # create list of <spam> nodes
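
The snippet above assumes that dom_node and spam_node_list already exist. A self-contained variant in Python 3 syntax might look like the following. The document content is hypothetical, and the text test uses nodeType, which is an alternative to comparing nodeName with '#text'.

```python
from xml.dom.minidom import parseString

# Hypothetical document: a mix of text and element children.
dom = parseString("<eggs>some text<spam>1</spam><ham/><spam>2</spam></eggs>")
dom_node = dom.documentElement

pcdata_chunks = []     # text found directly under dom_node
spam_node_list = []    # <spam> child elements

for node in dom_node.childNodes:
    if node.nodeType == node.TEXT_NODE:     # PCDATA is a kind of node
        pcdata_chunks.append(node.nodeValue)
    elif node.nodeName == "spam":
        spam_node_list.append(node)

# Each <spam> element holds its text in a child text node.
spam_values = [n.firstChild.nodeValue for n in spam_node_list]
print(pcdata_chunks, spam_values)
```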

The standard Python documentation has some more detailed DOM examples. The guidance in my earlier articles about using DOM objects (see Resources) still points in the right direction, but some method and attribute names have changed since those articles were published, so consult the Python documentation.

Module: pyxie

The pyxie module is built on top of Python's standard XML support and provides additional high-level interfaces to XML documents. pyxie does two basic things: it transforms XML documents into a line-oriented format that is easier to parse, and it provides methods for treating XML documents as manipulable trees. The line-oriented PYX format that pyxie uses is language-neutral, and tools for it are available in several languages. In short, the PYX representation of a document is much easier than its XML representation to process with common line-oriented text tools such as grep, sed, awk, bash, and Perl, or with standard Python modules such as string and re. Depending on the task, converting from XML to PYX can save a lot of work.
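
To give a feel for the line-oriented idea, here is a rough Python 3 sketch that emits PYX-style lines from SAX events: "(" for a start tag, "A" for an attribute, "-" for character data, and ")" for an end tag. It does not use pyxie itself. The class PyxWriter and the function xml_to_pyx are hypothetical names, the newline escaping is an assumption of this sketch, and processing instructions are not handled.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class PyxWriter(ContentHandler):
    """Collects PYX-style lines: '(' start, 'A' attribute, '-' text, ')' end."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._text = []

    def _flush_text(self):
        # Emit buffered character data as a single '-' line.
        text = "".join(self._text)
        self._text = []
        if text:
            # Assumption: newlines are escaped so each event stays on one line.
            self.lines.append("-" + text.replace("\n", "\\n"))

    def startElement(self, name, attrs):
        self._flush_text()
        self.lines.append("(" + name)
        for key in attrs.getNames():
            self.lines.append("A" + key + " " + attrs.getValue(key))

    def endElement(self, name):
        self._flush_text()
        self.lines.append(")" + name)

    def characters(self, content):
        self._text.append(content)

def xml_to_pyx(xml_text):
    """Return the PYX-style lines for an XML string."""
    handler = PyxWriter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.lines

pyx_lines = xml_to_pyx('<menu day="mon"><item>spam</item><item>eggs</item></menu>')
print("\n".join(pyx_lines))

# Line-oriented filtering in the spirit of grep: keep only the text lines.
text_only = [line[1:] for line in pyx_lines if line.startswith("-")]
print(text_only)
```

Because every event sits on its own line with a one-character prefix, the final filter needs nothing more than a string comparison, which is the kind of saving the paragraph above describes.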

pyxie's notion of an XML document as a tree is similar to the one in the DOM. Because the DOM standard is widely supported across programming languages, most programmers who need a tree representation of XML documents will use the DOM standard rather than pyxie.

More modules: xml_pickle and xml_objectify

I have developed my own high-level modules for working with XML, called xml_pickle and xml_objectify. I have written about these modules at length elsewhere (see Resources), so there is no need to go into much detail here. These modules are useful when you "think in Python" rather than "think in XML." In particular, xml_objectify hides almost every trace of XML from the programmer, letting you work entirely with "native" Python objects in your program. The actual XML data format is abstracted almost out of sight. Similarly, xml_pickle lets a Python programmer start with "native" Python objects, whose data may come from any source, and then serialize them into an XML format that other users may need later.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
