Module: xmllib
xmllib is a non-validating, low-level parser. An application programmer who uses xmllib subclasses the XMLParser class and provides methods that handle document elements such as specific or generic tags, or character entities. The way xmllib is used has not changed from Python 1.5x to Python 2.0+. In most cases the better option is SAX, which is also stream-oriented but is more standardized across languages and developers.
The example in this article is the same as in the original column: a DTD called QUOTATIONS.DTD and a document Sample.xml that conforms to it (see Resources for an archive of the files mentioned in this article). The following code displays the first few lines of each quotation in Sample.xml and produces very simple ASCII indicators for unknown tags and entities. The parsed text is handled as a continuous stream, and any accumulators are the programmer's responsibility (for example a string holding the #PCDATA inside a tag, or a list or dictionary of the tags encountered).
Listing 1: try_xmllib.py
    import xmllib, string

    class QuotationParser(xmllib.XMLParser):
        """Crude xmllib extractor for QUOTATIONS.DTD document"""

        def __init__(self):
            xmllib.XMLParser.__init__(self)
            self.thisquote = ''               # quotation accumulator

        def handle_data(self, data):
            self.thisquote = self.thisquote + data

        def syntax_error(self, message):
            pass                              # ignore syntax errors

        def start_quotations(self, attrs):    # top level tag
            print '--- Begin Document ---'

        def start_quotation(self, attrs):
            print 'quotation:'

        def end_quotation(self):
            # Show the first 230 characters with whitespace collapsed
            print string.join(string.split(self.thisquote[:230])) + '...',
            print '(' + str(len(self.thisquote)) + ' bytes)\n'
            self.thisquote = ''

        def unknown_starttag(self, tag, attrs):
            self.thisquote = self.thisquote + '{'

        def unknown_endtag(self, tag):
            self.thisquote = self.thisquote + '}'

        def unknown_charref(self, ref):
            self.thisquote = self.thisquote + '?'

        def unknown_entityref(self, ref):
            self.thisquote = self.thisquote + '#'

    if __name__ == '__main__':
        parser = QuotationParser()
        for c in open("Sample.xml").read():
            parser.feed(c)
        parser.close()
Validation
One reason you may need to look beyond the standard XML support is that you need to validate a document while parsing it. Unfortunately, the standard Python 2.0 XML package does not include a validating parser.
xmlproc is a parser written in Python that performs nearly complete validation. If you need a validating parser, xmlproc is currently the only choice for Python. xmlproc also provides a variety of high-level and experimental interfaces that other parsers do not have.
Choosing a parser
If you decide to use the Simple API for XML (SAX), which you should for anything complex because most of the other tools are built on it, much of the work of sorting out parsers is done for you. The xml.sax module contains a facility that automatically selects the "best" parser. In a standard Python 2.0 installation, the only parser available for selection is expat, a fast extension written in C. However, you can also install another parser under $PYTHONLIB/xml/parsers to make it available for selection. Setting up a parser is simple:
Listing 2: Python statement that selects the best parser
    import xml.sax
    parser = xml.sax.make_parser()
You can also select a specific parser by passing an argument, but for portability, and for upward compatibility with better parsers in the future, the best approach is to let make_parser() do the work.
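As a rough illustration of passing an argument, the sketch below asks make_parser() to try a named driver module first. It is written for current Python 3, which is an assumption since the article targets Python 2.0, and the module name xml.sax.expatreader is simply the stock expat driver in the standard library, chosen here for illustration.

```python
import xml.sax
from xml.sax.xmlreader import XMLReader

# Let the library choose whichever parser it considers best.
default_parser = xml.sax.make_parser()

# Name a preferred driver module; make_parser() tries it first and
# falls back to its built-in default list if the import fails.
explicit_parser = xml.sax.make_parser(["xml.sax.expatreader"])

print(type(default_parser).__name__)
print(isinstance(explicit_parser, XMLReader))
```

Either way the result is used through the same reader interface, which is why leaving the choice to make_parser() costs nothing in the calling code.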
You can import xml.parsers.expat directly. If you do, you gain access to a few special tricks that the SAX interface does not offer. In this sense, xml.parsers.expat is somewhat "low-level" compared to SAX. But SAX is well standardized and well suited to stream processing, and in most cases SAX is the right level. The difference in raw speed is generally small, because a parser created by make_parser() already gets the performance that expat provides.
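To give a feel for the lower-level interface, here is a minimal sketch in current Python 3 (an assumption; the article targets Python 2.0). The helper collect_events and the sample string are hypothetical names made up for illustration. It shows expat's callback attributes and one extra that plain SAX content handlers do not receive directly: the current line number.

```python
import xml.parsers.expat

def collect_events(xml_text):
    """Return the (event, name_or_text, line) tuples that expat reports."""
    events = []
    parser = xml.parsers.expat.ParserCreate()

    # expat takes plain callables assigned as attributes, not a handler class.
    def on_start(name, attrs):
        events.append(("start", name, parser.CurrentLineNumber))

    def on_end(name):
        events.append(("end", name, parser.CurrentLineNumber))

    def on_text(data):
        events.append(("text", data, parser.CurrentLineNumber))

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_text
    parser.Parse(xml_text, True)    # True: this is the final chunk of input
    return events

# Hypothetical sample data
SAMPLE = "<quotations>\n<quotation>Be brief.</quotation>\n</quotations>"
events = collect_events(SAMPLE)
for event in events:
    print(event)
```

Note that expat may deliver character data in several pieces, so code that needs the full text of an element should accumulate it, just as the SAX and xmllib examples do.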
What is SAX
By way of background, a good answer to the question of what SAX is:
SAX (Simple API for XML) is a common parser interface for XML parsers. It allows application writers to write applications that use XML parsers, but are independent of which parser is actually used. (Think of it as JDBC for XML.) (Lars Marius Garshol, SAX for Python)
SAX, like the APIs of the parser modules it wraps, is essentially a sequential processor of an XML document. It is used much like the xmllib example, but with more abstraction. Instead of a parser class, the application programmer defines a handler class, which can be registered with whatever parser is used. Four SAX interfaces must be defined (each with several methods): DocumentHandler (named ContentHandler in Python's xml.sax.handler), DTDHandler, EntityResolver, and ErrorHandler. Creating a parser also attaches default interfaces unless they are overridden. The following code performs the same task as the xmllib example:
Listing 3: try_sax.py
    "Simple SAX example, updated for Python 2.0+"
    import string
    import xml.sax
    from xml.sax.handler import *

    class QuotationHandler(ContentHandler):
        """Crude extractor for QUOTATIONS.DTD compliant XML document"""

        def __init__(self):
            self.in_quote = 0
            self.thisquote = ''

        def startDocument(self):
            print '--- Begin Document ---'

        def startElement(self, name, attrs):
            if name == 'quotation':
                print 'quotation:'
                self.in_quote = 1
            else:
                self.thisquote = self.thisquote + '{'

        def endElement(self, name):
            if name == 'quotation':
                # Show the first 230 characters with whitespace collapsed
                print string.join(string.split(self.thisquote[:230])) + '...',
                print '(' + str(len(self.thisquote)) + ' bytes)\n'
                self.thisquote = ''
                self.in_quote = 0
            else:
                self.thisquote = self.thisquote + '}'

        def characters(self, ch):
            if self.in_quote:
                self.thisquote = self.thisquote + ch

    if __name__ == '__main__':
        parser = xml.sax.make_parser()
        handler = QuotationHandler()
        parser.setContentHandler(handler)
        parser.parse("Sample.xml")
Compared with xmllib, there are two things to note in the example above. First, the .parse() method processes an entire stream in one call, so you do not need to write a loop that feeds the parser. Second, .parse() is flexible about its input: it accepts a filename, a file object, or various file-like objects (anything with a .read() method).
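The flexibility of .parse() can be sketched as follows, in current Python 3 (an assumption; the article targets Python 2.0). The class name QuoteCollector and the inline sample document are hypothetical stand-ins for the handler and for Sample.xml. Unlike the listing above, this variant stores each quotation in a list rather than printing it, and only marks tags nested inside a quotation.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler

class QuoteCollector(ContentHandler):
    """Accumulate the text of each <quotation>, marking nested tags with braces."""

    def __init__(self):
        super().__init__()
        self.in_quote = False
        self.thisquote = ""
        self.quotes = []

    def startElement(self, name, attrs):
        if name == "quotation":
            self.in_quote = True
        elif self.in_quote:
            self.thisquote += "{"

    def endElement(self, name):
        if name == "quotation":
            # Collapse runs of whitespace, like the split/join in the listing.
            self.quotes.append(" ".join(self.thisquote.split()))
            self.thisquote = ""
            self.in_quote = False
        elif self.in_quote:
            self.thisquote += "}"

    def characters(self, content):
        if self.in_quote:
            self.thisquote += content

# Hypothetical stand-in for Sample.xml
SAMPLE = b"""<quotations>
  <quotation>Simple things <em>should</em> be simple.</quotation>
  <quotation>Errors should never
     pass silently.</quotation>
</quotations>"""

handler = QuoteCollector()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.parse(io.BytesIO(SAMPLE))    # a file-like object instead of a filename
print(handler.quotes)
```

Because the handler knows nothing about where the bytes come from, the same class works unchanged whether .parse() is given a filename or an open stream.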
Package: DOM
DOM is a high-level, tree-shaped representation of an XML document. The model is not specific to Python; it is a general XML model (see Resources for further information). Python's DOM package is built on top of SAX and is included in the standard XML support of Python 2.0. For reasons of space, full code examples are left out of this article, but an excellent overall description is given in the XML-SIG's "Python/XML HOWTO":
The Document Object Model specifies a tree-based representation for an XML document. A top-level Document instance is the root of the tree, and has a single child which is the top-level Element instance; this Element has child nodes representing the content and any sub-elements, which may have further children, and so forth. Functions are defined which let you traverse the resulting tree any way you like, access element and attribute values, insert and delete nodes, and convert the tree back into XML.
The DOM is useful for modifying XML documents, because you can create a DOM tree, modify it by adding new nodes and moving subtrees around, and then produce a new XML document as output. You can also construct a DOM tree yourself and convert it to XML, which is often a more flexible way of producing XML output than simply writing <tag1>...</tag1> to a file.
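The build, modify, and serialize cycle described in the quotation can be sketched with xml.dom.minidom in current Python 3 (an assumption; the article targets Python 2.0). The element and attribute names are hypothetical and chosen only for illustration.

```python
from xml.dom.minidom import getDOMImplementation, parseString

# Build a small tree by hand; the element names are hypothetical.
doc = getDOMImplementation().createDocument(None, "quotations", None)
root = doc.documentElement

for text in ("first", "second"):
    quote = doc.createElement("quotation")
    quote.appendChild(doc.createTextNode(text))
    root.appendChild(quote)

# Move a subtree: inserting a node that is already in the tree relocates it.
first, second = root.childNodes
root.insertBefore(second, first)

# Add an attribute, then serialize the modified tree back to XML.
second.setAttribute("source", "example")
xml_text = doc.toxml()
print(xml_text)

# Round trip: the output parses again into an equivalent tree.
reparsed = parseString(xml_text)
order = [q.firstChild.nodeValue for q in reparsed.getElementsByTagName("quotation")]
print(order)
```

Producing output through toxml() means escaping and tag balancing are handled by the library rather than by hand-written string concatenation.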
The syntax for using the xml.dom module has changed somewhat since the earlier articles. The DOM implementation that comes with Python 2.0 is called xml.dom.minidom and provides a small, lightweight version of the DOM. Some experimental features of the full XML-SIG DOM are not included in xml.dom.minidom, but you are unlikely to miss them.
Creating a DOM object is simple:
Listing 4: Creating a Python DOM object from an XML file
    from xml.dom.minidom import parse, parseString
    dom1 = parse('mydata.xml')    # parse an XML file by name
Working with DOM objects is fairly straightforward object-oriented programming. However, you will often encounter many list-like attributes at levels of the hierarchy that are not easy to tell apart at a glance (other than by enumerating them in a loop). For example, here is a typical snippet of DOM Python code:
Listing 5: Iterating through Python DOM node objects
    for node in dom_node.childNodes:
        if node.nodeName == '#text':         # PCDATA is a kind of node,
            pcdata = node.nodeValue          # but not a new subtag
        elif node.nodeName == 'spam':
            spam_node_list.append(node)      # create list of <spam> nodes
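A self-contained variant of this loop, written for current Python 3 (an assumption; the article targets Python 2.0), might look like the sketch below. The sample document and the <menu> and <eggs> tag names are hypothetical, and the text test uses nodeType, which selects the same nodes whose nodeName is '#text'.

```python
from xml.dom.minidom import parseString

# Hypothetical document; <menu>, <spam> and <eggs> are illustrative tag names.
dom = parseString("<menu>plain text<spam>a</spam><eggs/><spam>b</spam></menu>")
dom_node = dom.documentElement

pcdata = []
spam_node_list = []
for node in dom_node.childNodes:
    if node.nodeType == node.TEXT_NODE:     # the nodes named '#text'
        pcdata.append(node.nodeValue)
    elif node.nodeName == "spam":
        spam_node_list.append(node)         # collect the <spam> elements

spam_values = [n.firstChild.nodeValue for n in spam_node_list]
print(pcdata)
print(spam_values)
```

The <eggs/> element is visited by the loop but matches neither branch, which illustrates why the children of a node usually have to be sorted by type or name before use.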
The Python standard documentation has some more detailed DOM examples. The guidance in my earlier article about working with DOM objects (see Resources) still points in the right direction, but some method and attribute names have changed since that article was published, so consult the Python documentation.
Module: pyxie
The pyxie module is built on top of Python's standard XML support and provides additional high-level interfaces to XML documents. pyxie does two basic things: it transforms an XML document into a line-oriented format that is easier to parse, and it provides methods for treating an XML document as an operable tree. The line-oriented PYX format that pyxie uses is not language-specific, and tools for it are available in several languages. In short, the PYX representation of a document is much easier than its XML representation to process with common line-oriented text tools such as grep, sed, awk, bash, and Perl, or with standard Python modules such as string and re. Depending on what you are doing, converting from XML to PYX can save a lot of work.
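To make the line-oriented idea concrete, the sketch below produces PYX-style lines with a SAX handler in current Python 3. It does not use the pyxie module itself. The class PyxWriter, the function xml_to_pyx, and the sample data are hypothetical, and the prefix conventions ("(" for a start tag, "A" for an attribute, "-" for character data, ")" for an end tag) are assumed from the usual description of PYX. Only newlines are escaped, which is a simplification.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler

class PyxWriter(ContentHandler):
    """Collect PYX-style lines: '(' start tag, 'A' attribute, '-' text, ')' end tag."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def startElement(self, name, attrs):
        self.lines.append("(" + name)
        for key in attrs.getNames():
            self.lines.append("A%s %s" % (key, attrs.getValue(key)))

    def endElement(self, name):
        self.lines.append(")" + name)

    def characters(self, content):
        # Keep one record per line by escaping embedded newlines (simplified).
        self.lines.append("-" + content.replace("\n", "\\n"))

def xml_to_pyx(xml_bytes):
    """Parse XML bytes and return the list of PYX-style lines."""
    handler = PyxWriter()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.lines

# Hypothetical sample data
lines = xml_to_pyx(b'<quotation source="example">Be brief.</quotation>')
print("\n".join(lines))

# A grep-like filter: keep only the character data records.
text_only = "".join(line[1:] for line in lines if line.startswith("-"))
print(text_only)
```

Once a document is in this form, selecting all text or all start tags is a one-line prefix test, which is the kind of job grep or the re module handles well.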
pyxie's notion of an XML document as a tree is similar to the idea behind DOM. Because the DOM standard is widely supported across programming languages, most programmers who need a tree representation of an XML document will use DOM rather than pyxie.
More modules: xml_pickle and xml_objectify
I have developed my own high-level modules for processing XML, called xml_pickle and xml_objectify. I have written about these modules elsewhere (see Resources), so there is no need to go into much detail here. They are useful when you "think in Python" rather than "think in XML." In particular, xml_objectify hides almost every trace of XML from the programmer, letting you work entirely with "native" Python objects in your program. The actual XML data format is almost invisible. Similarly, xml_pickle lets the Python programmer start with "native" Python objects, whose data can come from any source, and then serialize them into an XML format that other users may need later.