This article takes a closer look at the XML tools available in Python. It comes from IBM's official developer technical documentation and is provided here for reference.
Module: xmllib
xmllib is a non-validating, low-level parser. An application programmer using xmllib overrides the XMLParser class and provides methods to handle document elements (such as specific or generic tags, or character entities). The usage of xmllib has not changed from Python 1.5x to Python 2.0+. In most cases the better choice is SAX, which is also stream-oriented but is more standard across languages and developers.
The examples in this article are the same as those in the original column: they include a DTD named quotations.dtd and a document for that DTD, sample.xml (see the references to obtain the files mentioned in this article). The following code displays the first few lines of each quotation in sample.xml and generates simple ASCII indicators for unknown tags and entities. The parsed text is handled as a continuous stream, and any accumulators used are the programmer's responsibility (such as the string inside a tag (#PCDATA), or a list or dictionary of the tags encountered).
Listing 1: try_xmllib.py
import xmllib, string

class QuotationParser(xmllib.XMLParser):
    """Crude xmllib extractor for quotations.dtd document"""
    def __init__(self):
        xmllib.XMLParser.__init__(self)
        self.thisquote = ''             # quotation accumulator
    def handle_data(self, data):
        self.thisquote = self.thisquote + data
    def syntax_error(self, message):
        pass
    def start_quotations(self, attrs):  # top level tag
        print '--- Begin Document ---'
    def start_quotation(self, attrs):
        print 'QUOTATION:'
    def end_quotation(self):
        print string.join(string.split(self.thisquote[:230]))+'...',
        print '('+str(len(self.thisquote))+' bytes)\n'
        self.thisquote = ''
    def unknown_starttag(self, tag, attrs):
        self.thisquote = self.thisquote + '{'
    def unknown_endtag(self, tag):
        self.thisquote = self.thisquote + '}'
    def unknown_charref(self, ref):
        self.thisquote = self.thisquote + '?'
    def unknown_entityref(self, ref):
        self.thisquote = self.thisquote + '#'

if __name__ == '__main__':
    parser = QuotationParser()
    for c in open("sample.xml").read():
        parser.feed(c)
    parser.close()
Validation
One reason you may need to look beyond the standard XML support is that you need validation along with parsing. Unfortunately, the standard Python 2.0 XML package does not include a validating parser.
xmlproc is a parser written in pure Python that performs nearly complete validation. If you need a validating parser, xmlproc is currently the only option for Python. In addition, xmlproc provides a variety of high-level and test interfaces that other parsers do not offer.
Choosing a parser
If you decide to use the Simple API for XML (SAX), which you should for anything sophisticated because most of the other tools are built on top of it, much of the work of sorting out parsers is done for you. The xml.sax module contains a facility that automatically selects the "best" parser. In a standard Python 2.0 installation, the only parser available is expat, a fast extension written in C. However, you can also install another parser under $PYTHONLIB/xml/parsers to make it available for selection. Setting up a parser is simple:
Listing 2: Python statements for selecting the best parser
import xml.sax
parser = xml.sax.make_parser()
You can also select a specific parser by passing arguments, but for portability, and for better forward compatibility with future parsers, the best approach is to let make_parser() do the work.
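As a minimal sketch of passing arguments, the snippet below asks make_parser() for a specific driver by module name. It is written for a current Python 3 interpreter rather than Python 2.0, and it assumes the expat driver module xml.sax.expatreader that ships with the standard library.

```python
import xml.sax
from xml.sax.xmlreader import XMLReader

# Let xml.sax choose whatever parser it considers best (the portable choice).
default_parser = xml.sax.make_parser()

# Name a driver module explicitly; it is tried before the built-in defaults.
# "xml.sax.expatreader" is the expat driver in the standard library.
expat_parser = xml.sax.make_parser(["xml.sax.expatreader"])

print(type(default_parser).__name__)
print(type(expat_parser).__name__)
```

Either call returns an object with the same SAX reader interface, which is why the no-argument form is usually enough.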
You can also import xml.parsers.expat directly. If you do, you gain access to some special tricks that the SAX interface does not provide. In that sense, xml.parsers.expat is somewhat "lower level" than SAX. However, SAX is quite standard and works well for stream-oriented processing; in most cases the SAX level is the appropriate one. Because make_parser() generally gets the performance that expat provides, the raw speed difference is very small.
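To show what working at the expat level looks like, here is a small sketch for Python 3. The handler function names and the sample document are made up for illustration; ParserCreate() and the three handler attributes are part of the standard xml.parsers.expat module.

```python
import xml.parsers.expat

events = []  # accumulator; the bookkeeping is up to the caller

def start_element(name, attrs):
    events.append(("start", name, dict(attrs)))

def end_element(name):
    events.append(("end", name))

def char_data(data):
    if data.strip():
        events.append(("text", data))

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.EndElementHandler = end_element
parser.CharacterDataHandler = char_data

# Made-up sample document, not the article's sample.xml
document = '<quotations><quotation source="anon">Less is more.</quotation></quotations>'
parser.Parse(document, True)  # True marks the final chunk of input

for event in events:
    print(event)
```

Callbacks are plain functions attached to the parser object, so there is no handler class to subclass as there is with SAX.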
What is SAX?
For background, a good answer to the question of what SAX is:
SAX (Simple API for XML) is a common parser interface for XML parsers. It allows application authors to write applications that use XML parsers but are independent of which parser is actually used. (Think of it as JDBC for XML.) (Lars Marius Garshol, SAX for Python)
SAX, like the APIs of the parser modules it wraps, is essentially a sequential processor of an XML document. It is used in much the same way as in the xmllib example, but at a more abstract level. The application programmer defines a handler class rather than a parser class, and that handler can be registered with whatever parser is used. Four SAX interfaces must be defined (each with several methods): DocumentHandler, DTDHandler, EntityResolver, and ErrorHandler. (In Python 2.0+, the xml.sax.handler module names the first of these ContentHandler, which is the class used in the listing.) Creating a parser attaches default implementations of these interfaces unless they are overridden. The following code performs the same task as the xmllib example:
Listing 3: try_sax.py
"Simple SAX example, updated for Python 2.0+"
import string
import xml.sax
from xml.sax.handler import *

class QuotationHandler(ContentHandler):
    """Crude extractor for quotations.dtd compliant XML document"""
    def __init__(self):
        self.in_quote = 0
        self.thisquote = ''
    def startDocument(self):
        print '--- Begin Document ---'
    def startElement(self, name, attrs):
        if name == 'quotation':
            print 'QUOTATION:'
            self.in_quote = 1
        else:
            self.thisquote = self.thisquote + '{'
    def endElement(self, name):
        if name == 'quotation':
            print string.join(string.split(self.thisquote[:230]))+'...',
            print '('+str(len(self.thisquote))+' bytes)\n'
            self.thisquote = ''
            self.in_quote = 0
        else:
            self.thisquote = self.thisquote + '}'
    def characters(self, ch):
        if self.in_quote:
            self.thisquote = self.thisquote + ch

if __name__ == '__main__':
    parser = xml.sax.make_parser()
    handler = QuotationHandler()
    parser.setContentHandler(handler)
    parser.parse("sample.xml")
Two small things are worth noting about the example above compared with xmllib: the .parse() method processes the entire stream, so you do not need to write a loop to feed the parser; and .parse() is flexible about what it accepts, taking a filename, a file object, or any of various file-like objects (those that have a .read() method).
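The flexibility of .parse() can be sketched with an in-memory file-like object. The snippet below is a Python 3 variation on the SAX handler idea, not the article's listing: the QuotationCollector class and the sample document are made up for illustration, and, unlike the listing, it only marks tags that occur inside a quotation.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler

class QuotationCollector(ContentHandler):
    """Collect the text of each <quotation>, marking nested tags with braces."""
    def __init__(self):
        super().__init__()
        self.in_quote = False
        self.current = []   # pieces of the quotation being read
        self.quotes = []    # finished quotations

    def startElement(self, name, attrs):
        if name == "quotation":
            self.in_quote = True
        elif self.in_quote:
            self.current.append("{")

    def endElement(self, name):
        if name == "quotation":
            # Collapse runs of whitespace, like split/join in the listing
            self.quotes.append(" ".join("".join(self.current).split()))
            self.current = []
            self.in_quote = False
        elif self.in_quote:
            self.current.append("}")

    def characters(self, content):
        if self.in_quote:
            self.current.append(content)

# Made-up stand-in for sample.xml
SAMPLE = b"""<quotations>
  <quotation>Simple is <em>better</em> than complex.</quotation>
  <quotation>Readability counts.</quotation>
</quotations>"""

handler = QuotationCollector()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.parse(io.BytesIO(SAMPLE))  # a file-like object instead of a filename

for quote in handler.quotes:
    print(quote)
```

Swapping io.BytesIO(SAMPLE) for a filename such as "sample.xml" would leave the rest of the code unchanged.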
Package: DOM
DOM is a high-level tree representation of an XML document. The model is not specific to Python; it is a general XML model (see the references for further information). Python's DOM package is built on SAX and is included in the standard XML support of Python 2.0. For reasons of space, extended sample code is not included in this article, but an excellent general description is given in the XML-SIG's "Python/XML HOWTO":
The Document Object Model specifies a tree-based representation for an XML document. A top-level Document instance is the root of the tree and has a single child, which is the top-level Element instance. This element has child nodes representing the content and any sub-elements, which may in turn have further children, and so on. Functions are defined that let you traverse the resulting tree, access element and attribute values, insert and delete nodes, and convert the tree back into XML.
The DOM is useful for modifying XML documents, because you can create a DOM tree, modify it by adding new nodes and moving subtrees around, and then produce a new XML document as output. You can also construct a DOM tree yourself and convert it to XML; this is a more flexible way of producing XML output than simply writing ... to a file.
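The workflow described in that passage (build a tree, add nodes, move a subtree, write XML back out) can be sketched with xml.dom.minidom in Python 3. The document content and attribute values below are made up for illustration.

```python
from xml.dom.minidom import parseString

# Made-up document for illustration
dom = parseString("<quotations><quotation>First</quotation></quotations>")
root = dom.documentElement

# Add a new node to the tree
new_quote = dom.createElement("quotation")
new_quote.setAttribute("source", "anon")
new_quote.appendChild(dom.createTextNode("Second"))
root.appendChild(new_quote)

# Move a subtree: make the new node the first child instead of the last
root.insertBefore(new_quote, root.firstChild)

# Convert the tree back to XML
print(root.toxml())
```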
The syntax for using the xml.dom module has changed somewhat since the earlier articles. The DOM implementation built into Python 2.0 is called xml.dom.minidom and provides a lightweight, small version of the DOM. Some of the experimental features in the full XML-SIG DOM have evidently not been put into xml.dom.minidom, but most users are unlikely to notice their absence.
Generating a DOM object is easy; you only need:
Listing 4: Creating a Python DOM object from an XML file
from xml.dom.minidom import parse, parseString
dom1 = parse('mydata.xml')  # parse an XML file by name
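Both imported names can be exercised without a file on disk. The following Python 3 sketch assumes a made-up document held in memory: parse() is given a file-like object, and parseString() is given the text directly.

```python
import io
from xml.dom.minidom import parse, parseString

# Made-up sample document
XML_TEXT = "<quotations><quotation>Less is more.</quotation></quotations>"

# parse() also accepts an open file-like object, not only a filename
dom_from_file = parse(io.BytesIO(XML_TEXT.encode("utf-8")))

# parseString() builds the same kind of tree straight from a string
dom_from_string = parseString(XML_TEXT)

print(dom_from_file.documentElement.tagName)
print(dom_from_string.documentElement.toxml())
```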
Working with a DOM object is fairly straightforward OOP. However, you often end up with many list-like attributes whose place in the hierarchy is not immediately apparent (other than by looping over the lists). For example, the following is a common snippet of DOM Python code:
Listing 5: Iterating through a Python DOM node object
for node in dom_node.childNodes:
    if node.nodeName == '#text':        # PCDATA is a kind of node,
        PCDATA = node.nodeValue         # but not a new subtag
    elif node.nodeName == 'spam':
        spam_node_list.append(node)     # Create list of <spam> nodes
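For a version of this loop that can be run as is, the Python 3 sketch below supplies a made-up document and the two accumulators the snippet presupposes. It tests nodeType rather than nodeName for text nodes, which is an alternative spelling of the same check.

```python
from xml.dom.minidom import parseString

# Made-up document; "menu", "spam" and "eggs" are illustrative tag names
dom = parseString("<menu>today: <spam>fried</spam><eggs/><spam>baked</spam></menu>")
dom_node = dom.documentElement

pcdata = []          # text found directly under dom_node
spam_node_list = []  # child elements named "spam"
for node in dom_node.childNodes:
    if node.nodeType == node.TEXT_NODE:  # these nodes report nodeName '#text'
        pcdata.append(node.nodeValue)
    elif node.nodeType == node.ELEMENT_NODE and node.nodeName == "spam":
        spam_node_list.append(node)

print(pcdata)
print([n.firstChild.nodeValue for n in spam_node_list])
```

When only elements with a given name are wanted, dom_node.getElementsByTagName("spam") avoids the manual loop, with the difference that it also searches deeper descendants.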
The standard Python documentation provides some more detailed DOM examples. The examples of working with DOM objects in my earlier article (see the references) still point in the right direction, but some method and attribute names have changed since that article was published; see the Python documentation for details.
Module: pyxie
The pyxie module is built on top of Python's standard XML support and provides additional high-level interfaces to XML documents. pyxie does two basic things: it converts XML documents into a line-oriented format that is easier to parse, and it provides methods for treating an XML document as a tree that can be manipulated. The line-oriented PYX format that pyxie uses is language-neutral, and tools for it are available in several languages. In short, the PYX representation of a document is easier to process than the XML representation with common line-oriented text processing tools such as grep, sed, awk, bash, perl, or standard Python modules such as string and re. Depending on the result you are after, converting from XML to PYX may save a lot of work.
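To give a feel for why a line-oriented form suits such tools, here is a rough Python 3 sketch that produces PYX-style lines with SAX. It assumes the commonly described PYX conventions (a leading "(" for a start tag, "A" for an attribute, "-" for character data, ")" for an end tag); the PyxWriter class is a hypothetical helper and not part of pyxie, and the sample document is made up.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class PyxWriter(ContentHandler):
    """Record one PYX-style line per parsing event (hypothetical helper)."""
    def __init__(self):
        super().__init__()
        self.lines = []

    def startElement(self, name, attrs):
        self.lines.append("(" + name)
        for attr_name in attrs.getNames():
            self.lines.append("A%s %s" % (attr_name, attrs.getValue(attr_name)))

    def endElement(self, name):
        self.lines.append(")" + name)

    def characters(self, content):
        # Keep each event on a single line by escaping embedded newlines
        self.lines.append("-" + content.replace("\n", "\\n"))

# Made-up sample document
SAMPLE = b'<quotations><quotation source="anon">Less is more.</quotation></quotations>'

writer = PyxWriter()
xml.sax.parseString(SAMPLE, writer)
print("\n".join(writer.lines))

# Line-oriented filtering, in the spirit of grep '^-'
text_only = [line[1:] for line in writer.lines if line.startswith("-")]
print(text_only)
```

Once every event sits on its own line, picking out all character data or all attributes is a one-line filter rather than a parsing task.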
pyxie's treatment of an XML document as a tree is similar in concept to the DOM. Because the DOM standard is widely supported across many programming languages, most programmers who need a tree representation of an XML document will use the DOM standard rather than pyxie.
More modules: xml_pickle and xml_objectify
I have developed two high-level modules for working with XML, called xml_pickle and xml_objectify. I have written about them elsewhere (see the references), so I will not say much about them here. These modules are useful when you "think in Python" rather than "think in XML." In particular, xml_objectify hides almost every trace of XML from the programmer, letting you work entirely with "native" Python objects in your programs. The actual XML data format is almost entirely abstracted out of sight. Similarly, xml_pickle lets the Python programmer start with "native" Python objects, whose data may come from any source, and then put them (serialized) into an XML format that other users may need later.