The Dynamics of DOM Methods in Python
Document Object Model
The xml.dom module is probably the most powerful tool available to Python programmers working with XML documents. Unfortunately, the documentation currently provided by the XML-SIG is sparse. The W3C's language-independent DOM specification fills part of this gap, but Python programmers would be better served by a quick-start guide to the DOM that is specific to Python. This article aims to provide such a guide. As in the previous column, some of the samples use the sample quotations.dtd file, which is available together with the code samples for this article.
It helps to understand exactly what the DOM is. The formal definition is a good one:
The Document Object Model is a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure, and style of documents. The document can be further processed, and the results of that processing can be incorporated back into the presented page. (World Wide Web Consortium DOM Working Group)
The DOM converts an XML document into a tree, or forest, representation. The W3C specification gives an example of the DOM version of an HTML table.
The DOM defines a set of methods for traversing, pruning, reorganizing, outputting, and otherwise manipulating this tree. Working with the tree from this more abstract point of view is more convenient than working with the linear representation of an XML document.
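To make the tree view concrete, here is a minimal sketch of a recursive traversal. It assumes Python 3 and the standard library's xml.dom.minidom rather than the xml.dom package discussed in this article. The sample document and the outline() helper are hypothetical and exist only for illustration.

```python
from xml.dom import minidom

# Hypothetical sample document, loosely modelled on the quotations example.
SAMPLE = (
    "<quotations>"
    "<quotation><text>First</text><source>A</source></quotation>"
    "<quotation><text>Second</text><source>B</source></quotation>"
    "</quotations>"
)


def outline(node, level=0):
    """Return a list of indented element names, one per element in the tree."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * level + node.tagName)
        level += 1
    # Text nodes have no children, so the recursion stops at the leaves.
    for child in node.childNodes:
        lines.extend(outline(child, level))
    return lines


doc = minidom.parseString(SAMPLE)
tree_outline = outline(doc)
print("\n".join(tree_outline))
```

Each level of indentation in the printed outline corresponds to one level of nesting in the DOM tree.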
Convert HTML to XML
Valid HTML is almost, but not quite, valid XML. There are two main differences: XML tags are case-sensitive, and every XML tag requires an explicit close (a closing tag is optional for some HTML tags). A simple example of using xml.dom is converting HTML to XML with the HtmlBuilder() class.
try_dom1.py

    """Convert a valid HTML document to XML
       USAGE: python try_dom1.py < infile.html > outfile.xml
    """
    import sys
    from xml.dom import core
    from xml.dom.html_builder import HtmlBuilder

    # Construct an HtmlBuilder object and feed the data to it
    b = HtmlBuilder()
    b.feed(sys.stdin.read())

    # Get the newly-constructed document object
    doc = b.document

    # Output it as XML
    print doc.toxml()
The HtmlBuilder() class needs to implement only a few functions of the underlying xml.dom.builder template that it inherits from, and its source code is worth studying. Even if we implemented the template functions ourselves, however, the outline of a DOM program would be similar. In the general case, we build a DOM instance by some means and then operate on that instance. The .toxml() method of a DOM instance is a simple way to produce a string representation of it (in the example above, we just print the string once it is produced).
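The same outline (build a DOM instance, then serialize it with .toxml()) can be sketched with the Python 3 standard library. This is not the HtmlBuilder() class described above. It is a hypothetical DomBuilder helper built on html.parser and xml.dom.minidom, and it assumes simple, properly nested HTML with a single root element. The VOID_TAGS set is a partial list chosen for illustration.

```python
from html.parser import HTMLParser
from xml.dom import minidom

# HTML elements that never take a closing tag (a partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class DomBuilder(HTMLParser):
    """Hypothetical helper: build a minidom tree from simple, well-nested HTML."""

    def __init__(self):
        super().__init__()
        self.document = minidom.Document()
        self.stack = [self.document]  # the innermost open node is last

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        elem = self.document.createElement(tag)
        for name, value in attrs:
            elem.setAttribute(name, value or "")
        self.stack[-1].appendChild(elem)
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if len(self.stack) > 1:  # ignore text outside the root element
            self.stack[-1].appendChild(self.document.createTextNode(data))


b = DomBuilder()
b.feed('<HTML><BODY><P>Hi<BR>there</P><IMG SRC="x.png"></BODY></HTML>')
b.close()
xml_text = b.document.documentElement.toxml()
print(xml_text)
```

The uppercase tags come out in lowercase and the unclosed BR and IMG tags come out as explicitly closed empty elements, which covers the two differences between HTML and XML noted above. The sketch does not attempt to repair other omitted end tags.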
Convert a Python object to XML
Python programmers can gain a good deal of power and generality by exporting arbitrary Python objects as XML instances. This lets us work with Python objects in the way we are used to, with the option of eventually using instance attributes as tags in the generated XML. With just a few lines (adapted from the building.py example), we can convert a "native" Python object to a DOM object, recursing into those attributes that are themselves objects.
try_dom2.py

    """Build a DOM instance from scratch, write it to XML
       USAGE: python try_dom2.py > outfile.xml
    """
    import types
    from xml.dom import core
    from xml.dom.builder import Builder

    # Recursive function to build DOM instance from Python instance
    def object_convert(builder, inst):
        # Put entire object inside an elem w/ same name as the class.
        builder.startElement(inst.__class__.__name__)
        for attr in inst.__dict__.keys():
            if attr[0] == '_':      # Skip internal attributes
                continue
            value = getattr(inst, attr)
            if type(value) == types.InstanceType:
                # Recursively process subobjects
                object_convert(builder, value)
            else:
                # Convert anything else to string, put it in an element
                builder.startElement(attr)
                builder.text(str(value))
                builder.endElement(attr)
        builder.endElement(inst.__class__.__name__)

    if __name__ == '__main__':
        # Create container classes
        class quotations: pass
        class quotation: pass

        # Create an instance, fill it with hierarchy of attributes
        inst = quotations()
        inst.title = "Quotations file (not quotations.dtd conformant)"
        inst.quot1 = quot1 = quotation()
        quot1.text = """'"is not a quine" is not a quine' is a quine"""
        quot1.source = "Joshua Shagam, kuro5hin.org"
        inst.quot2 = quot2 = quotation()
        quot2.text = "Python is not a democracy. Voting doesn't help. "+\
                     "Crying may..."
        quot2.source = "Guido van Rossum, comp.lang.python"

        # Create the DOM Builder
        builder = Builder()
        object_convert(builder, inst)
        print builder.document.toxml()
The object_convert() function has some limitations. For example, the procedure above cannot produce an XML document that conforms to quotations.dtd: #PCDATA text cannot be placed directly inside a quotation class, only inside an attribute of the class (such as .text). A simple workaround is to have object_convert() treat an attribute with a special name, such as .PCDATA, differently. The conversion to DOM can be made more sophisticated in various ways, but the beauty of this approach is that we can start with entirely Pythonic objects and convert them to XML documents in a straightforward way.
Note also that elements at the same level of the generated XML document appear in no obvious order. For example, on the author's system, with the particular version of Python used, the second quotation defined in the source code appears first in the output. This order may change between versions and systems. The attributes of a Python object are not inherently ordered, so the behavior makes sense. It is what we expect of data associated with a database system, but it is clearly not what we want for an article marked up as XML (unless we want to update William Burroughs's "cut-up" method).
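The .PCDATA workaround can be sketched as follows. This is an illustrative variant, not the article's object_convert(): it assumes Python 3 and the standard library's xml.dom.minidom, and object_to_dom() is a hypothetical name. On Python 3.7 and later, instance attributes are stored in assignment order, so this sketch emits elements in the order the attributes were set.

```python
from xml.dom import minidom


def object_to_dom(doc, inst):
    """Return an element for `inst`, recursing into attributes that are objects."""
    elem = doc.createElement(type(inst).__name__)
    for attr, value in vars(inst).items():
        if attr.startswith("_"):
            continue  # skip internal attributes
        if attr == "PCDATA":
            # Reserved name (an assumption of this sketch): the text goes
            # directly inside the element instead of into a child element.
            elem.appendChild(doc.createTextNode(str(value)))
        elif hasattr(value, "__dict__"):
            # Sub-object: recurse, using its class name as the tag.
            elem.appendChild(object_to_dom(doc, value))
        else:
            # Anything else becomes a child element holding its string form.
            child = doc.createElement(attr)
            child.appendChild(doc.createTextNode(str(value)))
            elem.appendChild(child)
    return elem


class quotations:
    pass


class quotation:
    pass


inst = quotations()
inst.quot1 = quotation()
inst.quot1.PCDATA = "Python is not a democracy."
inst.quot1.source = "Guido van Rossum, comp.lang.python"

doc = minidom.Document()
doc.appendChild(object_to_dom(doc, inst))
xml_text = doc.documentElement.toxml()
print(xml_text)
```

Here the quotation text sits directly inside the quotation element, with the source as a child element after it.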
Convert an XML document to a Python object
Generating a Python object from an XML document is just as simple as the reverse process. For many purposes, the xml.dom access methods are sufficient. In some cases, however, it is preferable to handle objects generated from XML documents with the same techniques we use for all our "generic" Python objects. In the following code, for example, pyobj_printer() might be a function we already use to handle arbitrary Python objects.
try_dom3.py

    """Read in a DOM instance, convert it to a Python object"""
    from xml.dom.utils import FileReader

    class PyObject: pass

    def pyobj_printer(py_obj, level=0):
        """Return a "deep" string description of a Python object"""
        from string import join, split
        import types
        descript = ''
        for membname in dir(py_obj):
            member = getattr(py_obj,membname)
            if type(member) == types.InstanceType:
                descript = descript + (' '*level) + '{'+membname+'}\n'
                descript = descript + pyobj_printer(member, level+3)
            elif type(member) == types.ListType:
                descript = descript + (' '*level) + '['+membname+']\n'
                for i in range(len(member)):
                    descript = descript+(' '*level)+str(i+1)+': '+ \
                               pyobj_printer(member[i],level+3)
            else:
                descript = descript + membname+'='
                descript = descript + join(split(str(member)[:50]))+'...\n'
        return descript

    def pyobj_from_dom(dom_node):
        """Converts a DOM tree to a "native" Python object"""
        py_obj = PyObject()
        py_obj.PCDATA = ''
        for node in dom_node.get_childNodes():
            if node.name == '#text':
                py_obj.PCDATA = py_obj.PCDATA + node.value
            elif hasattr(py_obj, node.name):
                getattr(py_obj, node.name).append(pyobj_from_dom(node))
            else:
                setattr(py_obj, node.name, [pyobj_from_dom(node)])
        return py_obj

    # Main test
    dom_obj = FileReader("quotes.xml").document
    py_obj = pyobj_from_dom(dom_obj)
    if __name__ == "__main__":
        print pyobj_printer(py_obj)
The focus here should be on the function pyobj_from_dom(), and in particular on the xml.dom method .get_childNodes(), which does the real work. In pyobj_from_dom(), we pull out all the text between tags and put it in the reserved attribute .PCDATA. For each nested tag encountered, we create a new attribute whose name matches the tag and assign a list to it, so that it can hold tags that occur multiple times within the parent block. Using a list also preserves the order in which tags are encountered in the XML document.
Besides using our old generic pyobj_printer() function (or a more complex and robust one), we can access the elements of py_obj using normal attribute notation.
Python interactive session

    >>> from try_dom3 import *
    >>> py_obj.quotations[0].quotation[3].source[0].PCDATA
    'Guido van Rossum, '
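For readers without the xml.dom package used above, the same idea can be sketched with the Python 3 standard library. This version assumes xml.dom.minidom, where child nodes are read from the .childNodes attribute instead of a .get_childNodes() method. The sample document is hypothetical and stands in for quotes.xml.

```python
from xml.dom import minidom


class PyObject:
    """Empty container whose attributes mirror the XML structure."""


def pyobj_from_dom(dom_node):
    """Convert a DOM node and its descendants to a "native" Python object."""
    py_obj = PyObject()
    py_obj.PCDATA = ""
    for node in dom_node.childNodes:
        if node.nodeType == node.TEXT_NODE:
            py_obj.PCDATA += node.data
        elif node.nodeType == node.ELEMENT_NODE:
            # One list per tag name keeps repeated tags in document order.
            children = py_obj.__dict__.setdefault(node.tagName, [])
            children.append(pyobj_from_dom(node))
    return py_obj


# Hypothetical stand-in for quotes.xml.
SAMPLE = (
    "<quotations>"
    "<quotation>Text one<source>Source A</source></quotation>"
    "<quotation>Text two<source>Source B</source></quotation>"
    "</quotations>"
)

py_obj = pyobj_from_dom(minidom.parseString(SAMPLE))
first_text = py_obj.quotations[0].quotation[0].PCDATA
second_source = py_obj.quotations[0].quotation[1].source[0].PCDATA
print(first_text, "/", second_source)
```

As in the interactive session, every nested tag is reached through a list index followed by ordinary attribute access.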
Rearranging a DOM tree
One advantage of the DOM is that it lets programmers manipulate XML documents in a non-linear manner. Each block enclosed in matching open/close tags is simply a "node" in the DOM tree. Although nodes are maintained in a list-like manner to retain ordering information, there is nothing special or immutable about the order. We can easily prune a node and graft it elsewhere in the DOM tree (even at a different level, if the DTD allows it), or add new nodes, delete existing ones, and so on.
try_dom4.py

    """Manipulate the arrangement of nodes in a DOM object"""
    from try_dom3 import *

    #-- Var 'doc' will hold the single <quotations> "trunk"
    doc = dom_obj.get_childNodes()[0]

    #-- Pull off all the nodes into a Python list
    # (each node is a <quotation> block, or a whitespace text node)
    nodes = []
    while 1:
        try: node = doc.removeChild(doc.get_childNodes()[0])
        except: break
        nodes.append(node)

    #-- Reverse the order of the quotations using a list method
    # (we could also perform more complicated operations on the list:
    # delete elements, add new ones, sort on complex criteria, etc.)
    nodes.reverse()

    #-- Fill 'doc' back up with our rearranged nodes
    for node in nodes:
        # if second arg is None, insert is to end of list
        doc.insertBefore(node, None)

    #-- Output the manipulated DOM
    print dom_obj.toxml()
Rearranging the quotation nodes as in the lines above would be a considerable problem if we treated the XML document as merely a text file, or if we used a sequence-oriented module (such as xmllib). With the DOM, however, the problem is hardly more difficult than any other operation on a Python list.
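The same prune-and-graft pattern can be sketched with the Python 3 standard library. This version assumes xml.dom.minidom and a hypothetical three-quotation document in place of quotes.xml. The detached nodes are held in an ordinary Python list, so any list operation could be used in place of reverse().

```python
from xml.dom import minidom

# Hypothetical stand-in for quotes.xml.
SAMPLE = (
    "<quotations>"
    "<quotation>one</quotation>"
    "<quotation>two</quotation>"
    "<quotation>three</quotation>"
    "</quotations>"
)

doc = minidom.parseString(SAMPLE)
trunk = doc.documentElement  # the single <quotations> element

# Detach every child node into an ordinary Python list.
nodes = []
while trunk.hasChildNodes():
    nodes.append(trunk.removeChild(trunk.firstChild))

# Any list operation works here: reverse, sort, filter, and so on.
nodes.reverse()

# Graft the nodes back onto the trunk in their new order.
for node in nodes:
    trunk.appendChild(node)

result = trunk.toxml()
print(result)
```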