A major part of getting started with XML in Python is sorting out the comparative capabilities of all the available modules. In this first installment of his new Python column, "Charming Python", David Mertz briefly describes the most popular and useful XML-related Python modules, and points out where you can download the individual modules and read more about them. This article helps you determine which modules are most appropriate for a particular task.
In many ways, Python is an ideal language for working with XML documents. Like Perl, REBOL, REXX, and TCL, it is a flexible scripting language with powerful text manipulation capabilities. Moreover, more than most types of text files (or streams), XML documents typically encode rich and complex data structures. The familiar "read some lines and compare them to some regular expressions" style of text processing is generally not well suited to thoroughly parsing and processing XML. Fortunately, Python (more so than most other languages) has both straightforward ways of dealing with complex data structures (usually with classes and attributes) and a range of XML-related modules that help with parsing, processing, and generating XML.
One general concept to keep in mind about XML is that documents can be processed in either a validating or a non-validating fashion. In the former type of processing, a document type definition (DTD) must be read before the XML document it applies to. The processor then checks not just the simple syntactic rules that apply to XML documents in general, but also the specific grammatical constraints of the DTD. In many cases, non-validating processing is adequate (and it is generally both faster to run and easier to program); we simply trust the document creator to have followed the rules of the document domain. Most of the modules discussed below are non-validating; where validation options exist, the descriptions say so.
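To make the distinction concrete, here is a minimal sketch of non-validating processing. It assumes the `xml.parsers.expat` module found in current Python standard libraries (not one of the modules described in this section), and the document, the `<note>` element, and the helper name `is_well_formed` are hypothetical. A non-validating parser reads the DTD declarations but only enforces well-formedness, so a document that breaks the DTD still parses, while one with mismatched tags does not.

```python
import xml.parsers.expat


def is_well_formed(text):
    """Return True if `text` is well-formed XML; the DTD is NOT enforced."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError:
        return False
    return True


# Hypothetical document: the internal DTD permits only <quotation> children,
# yet the body also contains an undeclared <note> element.
DOC = """<?xml version="1.0"?>
<!DOCTYPE quotations [
  <!ELEMENT quotations (quotation*)>
  <!ELEMENT quotation (#PCDATA)>
]>
<quotations>
  <quotation>Well-formed and valid.</quotation>
  <note>Well-formed, but not allowed by the DTD.</note>
</quotations>
"""

# Mismatched tags: this violates XML syntax itself, not just a DTD.
BROKEN = "<quotations><quotation>unclosed</quotations>"

print(is_well_formed(DOC))     # accepted despite the DTD violation
print(is_well_formed(BROKEN))  # rejected as not well-formed
```

A validating processor would reject the first document as well, because `<note>` is not declared in the DTD.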
The Vaults of Parnassus (see Resources) has recently become the standard place to look for Python resources. You can find all of the modules discussed below at that site, which links to the sites of the respective module owners. In particular, the PyXML distribution is available from the Vaults as both a tar file and a Win32 installer.
Python's XML Special Interest Group (XML-SIG)
Much, or most, of the work of maintaining the range of XML tools for Python is done by members of the XML-SIG. Like other Python SIGs, the XML-SIG maintains a mailing list, a list archive, useful references, documentation, a standard package, and other resources. After reading the overview in this article, the XML-SIG Web page is a good place to start.
The XML-SIG maintains the PyXML distribution, which is of particular relevance to this article. This package contains many of the modules discussed here, some "getting started" documentation, some demo code, and a few other items that the XML-SIG has decided to put into the distribution. A given package may not always contain the latest version of each stand-alone module or tool, but downloading the PyXML distribution is a good idea. You can always add modules that are not included, or newer versions of those that are, at a later time (and many of the modules that are not included themselves depend on services provided by the PyXML distribution).
Module: xmllib (standard)
Out of the box, Python 1.5.* comes with the module [xmllib]. Python 1.6 may incorporate more of the XML-SIG's work, but it is still in beta. [xmllib] is a non-validating, low-level parser. To use [xmllib], an application subclasses the XMLParser class and overrides its methods to handle document elements, such as specific or generic tags, or character entities.
As an example of [xmllib] in use, the PyXML distribution includes a DTD called 'quotations.dtd' and a document for that DTD, 'sample.xml' (see Resources for an archive of the files mentioned in this article). The following code displays the first few lines of each quotation in 'sample.xml', and produces very simple ASCII indicators for unknown tags and entities. The parsed text is handled as a continuous stream, and any accumulators used are the responsibility of the programmer (such as the string (#PCDATA) within a tag, or a list/dictionary of the tags encountered).
Trying out xmllib
    #-------------------- try_xmllib.py --------------------#
    # Written for Python 1.5 (print statements, string module functions)
    import xmllib, string

    class QuotationParser(xmllib.XMLParser):
        """Crude xmllib extractor for quotations.dtd document"""

        def __init__(self):
            xmllib.XMLParser.__init__(self)
            self.thisquote = ''             # quotation accumulator

        def handle_data(self, data):
            self.thisquote = self.thisquote + data

        def syntax_error(self, message):
            pass                            # ignore syntax errors

        def start_quotations(self, attrs):  # top level tag
            print '--- Begin Document ---'

        def start_quotation(self, attrs):
            print 'QUOTATION:'

        def end_quotation(self):
            # show the first 230 characters with whitespace normalized
            print string.join(string.split(self.thisquote[:230])) + '...',
            print '(' + str(len(self.thisquote)) + ' bytes)\n'
            self.thisquote = ''

        def unknown_starttag(self, tag, attrs):
            self.thisquote = self.thisquote + '{'

        def unknown_endtag(self, tag):
            self.thisquote = self.thisquote + '}'

        def unknown_charref(self, ref):
            self.thisquote = self.thisquote + '?'

        def unknown_entityref(self, ref):
            self.thisquote = self.thisquote + '#'

    if __name__ == '__main__':
        parser = QuotationParser()
        for c in open("sample.xml").read():
            parser.feed(c)
        parser.close()
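The xmllib module is no longer part of the standard library in Python 3, so the listing above runs only on older interpreters. For readers who want to try the same accumulator idea today, the following is a rough sketch that assumes `xml.parsers.expat` as a stand-in; it is not the article's code. The class name `QuotationExtractor`, its method names, and the inline `SAMPLE` document are hypothetical (it does not read the 'sample.xml' shipped with PyXML), and the '?' and '#' markers for unknown references are not reproduced.

```python
import xml.parsers.expat


class QuotationExtractor:
    """Collect the text of each <quotation>, marking other tags with { }."""

    def __init__(self):
        self.thisquote = ''   # quotation accumulator, owned by the caller
        self.quotes = []      # finished quotations
        self.parser = xml.parsers.expat.ParserCreate()
        self.parser.StartElementHandler = self.start
        self.parser.EndElementHandler = self.end
        self.parser.CharacterDataHandler = self.data

    def start(self, tag, attrs):
        if tag == 'quotation':
            self.thisquote = ''
        elif tag != 'quotations':      # any tag without a specific handler
            self.thisquote += '{'

    def end(self, tag):
        if tag == 'quotation':
            # normalize whitespace, as the string.split/join pair does above
            self.quotes.append(' '.join(self.thisquote.split()))
            self.thisquote = ''
        elif tag != 'quotations':
            self.thisquote += '}'

    def data(self, text):
        self.thisquote += text

    def feed(self, text):
        """Parse a complete document and return the collected quotations."""
        self.parser.Parse(text, True)
        return self.quotes


# Hypothetical stand-in for sample.xml
SAMPLE = ("<quotations><quotation>To be, <em>or not</em> to be"
          "</quotation></quotations>")

for quote in QuotationExtractor().feed(SAMPLE):
    print('QUOTATION:', quote)
```

As in the xmllib version, the parser only reports events; deciding what to accumulate, and when to reset it, is left to the application.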