Document Object Model
The xml.dom module is probably the most powerful tool a Python programmer has for working with XML documents. Unfortunately, the documentation provided by the XML-SIG is still rather sparse. The language-neutral DOM specification fills in some of the gaps, but Python programmers would be better served by a quick-start guide to the Python-specific DOM. This article aims to provide such a guide. As in the previous column, the sample quotations.dtd file is used in some of the examples, and it is available in the code sample archive for this article.
It helps to be clear about exactly what the DOM is. The official description is a good one:
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents. The document can be further processed and the results of that processing can be incorporated back into the presented page. (World Wide Web Consortium DOM Working Group)
The DOM converts an XML document into a tree (or forest) representation. The World Wide Web Consortium specification gives an example of the DOM version of an HTML table.
As that example shows, the DOM defines a set of methods for traversing, pruning, reorganizing, outputting, and otherwise manipulating the tree at a more abstract level, which is more convenient than working with the linear representation of an XML document.
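The tree view can be sketched with a few lines of code. This is only an illustration, written for Python 3 and using the standard library's xml.dom.minidom rather than the xml.dom package discussed in this article; the table content and the outline() helper are hypothetical names introduced for the example.

```python
from xml.dom import minidom

# A small hypothetical table, parsed here as XML.
SOURCE = (
    "<table><tbody>"
    "<tr><td>Shady Grove</td><td>Aeolian</td></tr>"
    "<tr><td>Over the River, Charlie</td><td>Dorian</td></tr>"
    "</tbody></table>"
)

def outline(node, depth=0):
    """Return an indented outline (a list of lines) of the tree under node."""
    if node.nodeType == node.TEXT_NODE:
        label = "#text " + repr(node.data)
    else:
        label = node.nodeName
    lines = ["  " * depth + label]
    for child in node.childNodes:
        # Each child is itself the root of a subtree.
        lines.extend(outline(child, depth + 1))
    return lines

doc = minidom.parseString(SOURCE)
tree_lines = outline(doc.documentElement)
print("\n".join(tree_lines))
```

Each element becomes a node whose children are its nested elements and text, so the indentation of the printed outline mirrors the nesting of the tags.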
Convert HTML to XML
Valid HTML is almost, but not quite, valid XML. There are two main differences: XML tags are case-sensitive, and every XML tag requires an explicit close (a closing tag, which is optional for some HTML tags, or the self-closing empty-element form). A simple example of using xml.dom is converting HTML to XML with the HtmlBuilder() class.
try_dom1.py
    """Convert a valid HTML document to XML
       USAGE: python try_dom1.py < infile.html > outfile.xml
    """
    import sys
    from xml.dom import core
    from xml.dom.html_builder import HtmlBuilder

    # Construct an HtmlBuilder object and feed the data to it
    b = HtmlBuilder()
    b.feed(sys.stdin.read())

    # Get the newly-constructed document object
    doc = b.document

    # Output it as XML
    print doc.toxml()
The HtmlBuilder() class has little to do beyond implementing a few of the functions underlying the xml.dom.builder template it inherits from, and its source code is worth studying. But even if we implemented the template functions ourselves, the outline of a DOM program would be similar. In general, we build a DOM instance by some means and then operate on that instance. The .toxml() method of a DOM instance is an easy way to produce a string representation of it (in the case above, we simply print it once it is built).
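The same build-then-operate outline can be sketched for Python 3 with only the standard library. This is not the HtmlBuilder class described above; MiniHtmlBuilder and VOID_TAGS are hypothetical names, and html.parser plus xml.dom.minidom are swapped in as an assumption about what a present-day reader has available.

```python
from html.parser import HTMLParser
from xml.dom import minidom

# HTML elements that never take an end tag (a partial, illustrative list).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class MiniHtmlBuilder(HTMLParser):
    """Hypothetical stand-in for HtmlBuilder: builds a minidom Document."""

    def __init__(self):
        super().__init__()
        self.document = minidom.Document()
        self._stack = [self.document]    # open nodes, innermost last

    def handle_starttag(self, tag, attrs):
        # html.parser hands us the tag name already lowercased.
        elem = self.document.createElement(tag)
        for name, value in attrs:
            elem.setAttribute(name, value or "")
        self._stack[-1].appendChild(elem)
        if tag not in VOID_TAGS:
            self._stack.append(elem)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element, if there is one.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tagName == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        if len(self._stack) > 1:         # ignore text outside the root element
            self._stack[-1].appendChild(self.document.createTextNode(data))

b = MiniHtmlBuilder()
b.feed("<HTML><BODY><P>Hello<BR>world</P></BODY></HTML>")
b.close()
xml_text = b.document.documentElement.toxml()
print(xml_text)
```

The sketch addresses only the two differences named earlier: tag names come out in a single case, and the unclosed BR is written as an empty element. It does not try to infer the other end tags that HTML allows an author to omit, and it assumes a single root element.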
Convert a Python object to XML
Python programmers can achieve a great deal of power and generality by exporting arbitrary Python objects as XML instances. This lets us handle Python objects in the way we are used to, and choose whether instance attributes should ultimately become tags in the generated XML. With just a few lines (derived from the building.py example), we can convert a "native" Python object to a DOM object, recursing into any attributes that themselves contain objects.
try_dom2.py
    """Build a DOM instance from scratch, write it to XML
       USAGE: python try_dom2.py > outfile.xml
    """
    import types
    from xml.dom import core
    from xml.dom.builder import Builder

    # Recursive function to build DOM instance from Python instance
    def object_convert(builder, inst):
        # Put entire object inside an elem w/ same name as the class.
        builder.startElement(inst.__class__.__name__)
        for attr in inst.__dict__.keys():
            if attr[0] == '_':      # Skip internal attributes
                continue
            value = getattr(inst, attr)
            if type(value) == types.InstanceType:
                # Recursively process subobjects
                object_convert(builder, value)
            else:
                # Convert anything else to string, put it in an element
                builder.startElement(attr)
                builder.text(str(value))
                builder.endElement(attr)
        builder.endElement(inst.__class__.__name__)

    if __name__ == '__main__':
        # Create container classes
        class quotations: pass
        class quotation: pass

        # Create an instance, fill it with hierarchy of attributes
        inst = quotations()
        inst.title = "Quotations file (not quotations.dtd conformant)"
        inst.quot1 = quot1 = quotation()
        quot1.text = """'"is not a quine" is not a quine' is a quine"""
        quot1.source = "Joshua Shagam, kuro5hin.org"
        inst.quot2 = quot2 = quotation()
        quot2.text = "Python is not a democracy. Voting doesn't help. "+\
                     "Crying may..."
        quot2.source = "Guido van Rossum, comp.lang.python"

        # Create the DOM Builder
        builder = Builder()
        object_convert(builder, inst)
        print builder.document.toxml()
The function object_convert() has some limitations. For example, it cannot produce an XML document that conforms to quotations.dtd: #PCDATA text cannot be placed directly inside the quotation class, only inside an attribute of the class (such as .text). A simple workaround would be to have object_convert() treat an attribute with a special name, such as .PCDATA, in a special way. There are various ways to make the conversion to DOM more sophisticated, but the beauty of this approach is that we can start with whole Python objects and convert them to XML documents in a concise way.
Note also that elements at the same level of the generated XML document come in no meaningful order. For example, on the author's system, with one particular version of Python, the second quotation defined in the source appears first in the output. But this order could change between versions and systems. The attributes of a Python object are kept in a dictionary, which in the Python versions discussed here has no fixed order, so this behavior makes sense. It is the behavior we want for data related to database systems, but it is not desirable for an article marked up in XML (unless we want to update William Burroughs's "cut-up" method).
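The special-name workaround mentioned above might look like the following. This is a minimal sketch for Python 3 that assumes xml.dom.minidom in place of the Builder class; object_to_dom is a hypothetical name, and treating PCDATA as bare text is simply the convention the workaround proposes, not an existing feature.

```python
from xml.dom import minidom

def object_to_dom(doc, inst):
    """Return an element named after inst's class, built from its attributes."""
    elem = doc.createElement(type(inst).__name__)
    for attr, value in vars(inst).items():
        if attr.startswith("_"):            # skip internal attributes
            continue
        if attr == "PCDATA":                # assumed convention: bare text
            elem.appendChild(doc.createTextNode(str(value)))
        elif hasattr(value, "__dict__"):    # sub-object: recurse
            elem.appendChild(object_to_dom(doc, value))
        else:                               # anything else: wrap in an element
            child = doc.createElement(attr)
            child.appendChild(doc.createTextNode(str(value)))
            elem.appendChild(child)
    return elem

# Container classes, as in try_dom2.py
class quotations: pass
class quotation: pass

inst = quotations()
inst.quot1 = quot1 = quotation()
quot1.PCDATA = "Python is not a democracy."
quot1.source = "Guido van Rossum, comp.lang.python"

doc = minidom.Document()
doc.appendChild(object_to_dom(doc, inst))
xml_text = doc.documentElement.toxml()
print(xml_text)
```

With this convention the quotation text sits directly inside the quotation element instead of inside a nested text element. In Python 3.7 and later an instance's attribute dictionary keeps insertion order, so under that assumption the output order follows the order of assignment.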
Convert an XML document into a Python object
Generating Python objects from an XML document is just as easy as the reverse process. In many cases the xml.dom methods themselves are all that is needed. In some cases, however, it is preferable to handle objects generated from XML documents with the same techniques used for all "generic" Python objects. In the code below, for example, the function pyobj_printer() might be one already in use for handling arbitrary Python objects.
try_dom3.py
    """Read in a DOM instance, convert it to a Python object
    """
    from xml.dom.utils import FileReader

    class PyObject: pass

    def pyobj_printer(py_obj, level=0):
        """Return a "deep" string description of a Python object"""
        from string import join, split
        import types
        descript = ''
        for membname in dir(py_obj):
            member = getattr(py_obj, membname)
            if type(member) == types.InstanceType:
                descript = descript + (' '*level) + '{'+membname+'}\n'
                descript = descript + pyobj_printer(member, level+3)
            elif type(member) == types.ListType:
                descript = descript + (' '*level) + '['+membname+']\n'
                for i in range(len(member)):
                    descript = descript+(' '*level)+str(i+1)+': '+ \
                               pyobj_printer(member[i], level+3)
            else:
                descript = descript + membname+'='
                descript = descript + join(split(str(member)[:]))+'...\n'
        return descript

    def pyobj_from_dom(dom_node):
        """Converts a DOM tree to a "native" Python object"""
        py_obj = PyObject()
        py_obj.PCDATA = ''
        for node in dom_node.get_childNodes():
            if node.name == '#text':
                py_obj.PCDATA = py_obj.PCDATA + node.value
            elif hasattr(py_obj, node.name):
                getattr(py_obj, node.name).append(pyobj_from_dom(node))
            else:
                setattr(py_obj, node.name, [pyobj_from_dom(node)])
        return py_obj

    # Main test
    dom_obj = FileReader("quotes.xml").document
    py_obj = pyobj_from_dom(dom_obj)
    if __name__ == "__main__":
        print pyobj_printer(py_obj)
The focus here should be the function pyobj_from_dom(), and in particular the xml.dom method that does the real work, .get_childNodes(). In pyobj_from_dom(), we pull out all the text between tags directly and put it in the reserved attribute .PCDATA. For any nested tag encountered, we create a new attribute whose name matches the tag and assign a list to it, so that it can hold tags that occur more than once within the parent block. Using a list also preserves the order in which the tags were encountered in the XML document.
Besides using our old generic pyobj_printer() function (or a more complex and robust one), we can access the elements of py_obj with ordinary attribute notation.
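The text-and-list scheme just described can be tried without the FileReader class. The following is a sketch for Python 3 that assumes xml.dom.minidom as a substitute, with childNodes, nodeType, and tagName standing in for get_childNodes() and node.name; the SAMPLE document is a hypothetical one assembled from the quotations used earlier.

```python
from xml.dom import minidom

class PyObject:
    pass

def pyobj_from_dom(dom_node):
    """Convert a DOM node's children into attributes of a plain object."""
    py_obj = PyObject()
    py_obj.PCDATA = ""
    for node in dom_node.childNodes:
        if node.nodeType == node.TEXT_NODE:
            # Text directly inside this node accumulates in .PCDATA
            py_obj.PCDATA += node.data
        elif node.nodeType == node.ELEMENT_NODE:
            # One list per tag name, so repeated tags keep document order
            vars(py_obj).setdefault(node.tagName, []).append(pyobj_from_dom(node))
    return py_obj

# Hypothetical sample document
SAMPLE = """<quotations>
<quotation>'"is not a quine" is not a quine' is a quine
<source>Joshua Shagam, kuro5hin.org</source></quotation>
<quotation>Python is not a democracy. Voting doesn't help. Crying may...
<source>Guido van Rossum, comp.lang.python</source></quotation>
</quotations>"""

py_obj = pyobj_from_dom(minidom.parseString(SAMPLE))
second_source = py_obj.quotations[0].quotation[1].source[0].PCDATA
print(second_source)
```

Because every tag maps to a list, even a tag that occurs once is reached with an index, as in source[0].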
Python Interactive Session
    >>> from try_dom3 import *
    >>> py_obj.quotations[0].quotation[3].source[0].PCDATA
    'Guido van Rossum, '
Rearrange the DOM tree
One of the great advantages of DOM is that it lets programmers manipulate XML documents in a non-linear fashion. Each block enclosed in matching opening and closing tags is simply a "node" in the DOM tree. Although the nodes are kept in a list-like structure to preserve order information, there is nothing special or immutable about that order. We can easily snip off a node and graft it somewhere else in the DOM tree (even at another level, if the DTD allows it). Or we can add new nodes, delete existing ones, and so on.
try_dom4.py
    """Manipulate the arrangement of nodes in a DOM object
    """
    from try_dom3 import *

    #-- Var 'doc' will hold the single <quotations> "trunk"
    doc = dom_obj.get_childNodes()[0]

    #-- Pull off all the nodes into a Python list
    # (each node is a <quotation> block, or a whitespace text node)
    nodes = []
    while 1:
        try: node = doc.removeChild(doc.get_childNodes()[0])
        except: break
        nodes.append(node)

    #-- Reverse the order of the quotations using a list method
    # (we could also perform more complicated operations on the list:
    # delete elements, add new ones, sort on complex criteria, etc.)
    nodes.reverse()

    #-- Fill 'doc' back up with our rearranged nodes
    for node in nodes:
        # If second arg is None, insert is to end of list
        doc.insertBefore(node, None)

    #-- Output the manipulated DOM
    print dom_obj.toxml()
If we treated the XML document merely as a text file, or used a sequence-oriented module such as xmllib or xml.sax, rearranging the quotation nodes as in the lines above would be a considerable problem. With DOM, however, the task is as simple as any other operation performed on a Python list.
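For comparison, the same cut-and-graft steps can be sketched for Python 3 with xml.dom.minidom, again an assumed substitute for the classes used in this article. The three-item SAMPLE document is hypothetical.

```python
from xml.dom import minidom

# Hypothetical sample: three quotation blocks under a single trunk
SAMPLE = (
    "<quotations>"
    "<quotation>one</quotation>"
    "<quotation>two</quotation>"
    "<quotation>three</quotation>"
    "</quotations>"
)

dom_obj = minidom.parseString(SAMPLE)
doc = dom_obj.documentElement        # the single <quotations> trunk

# Pull every child off the trunk into an ordinary Python list
nodes = []
while doc.firstChild is not None:
    nodes.append(doc.removeChild(doc.firstChild))

# Any list operation will do here; reversing is the simplest
nodes.reverse()

# Graft the nodes back on; a reference node of None appends at the end
for node in nodes:
    doc.insertBefore(node, None)

xml_text = doc.toxml()
print(xml_text)
```

Testing firstChild against None replaces the try/except loop of try_dom4.py, but the idea is the same: once the nodes are in a list, rearranging the document is ordinary list manipulation.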