Example of using minidom to process XML (Python learning) (reprint)


http://www.cnblogs.com/xuxm2007/archive/2011/01/16/1936610.html

http://blog.csdn.net/ywchen2000/archive/2006/07/04/876742.aspx

http://blog.csdn.net/zhangj1012003_2007/archive/2010/04/23/5514807.aspx

http://blog.csdn.net/zhangj1012003_2007/archive/2010/04/22/5514929.aspx

http://blog.csdn.net/zhangj1012003_2007/archive/2010/04/22/5514935.aspx

One. Reading XML

NewEdit has a code snippet feature, in which snippets are organized into snippet categories and snippet contents. By default, both are saved in XML format. Below I describe how to use minidom to read and save XML files.

The following is a sample snippet-category file, catalog.xml:

<?xml version="1.0" encoding="utf-8"?>
<catalog>
<maxid>4</maxid>
<item id="1">
<caption>Python</caption>
<item id="4">
<caption>Test</caption>
</item>
</item>
<item id="2">
<caption>Zope</caption>
</item>
</catalog>

The categories form a tree, which might be displayed as:

Python
    Test
Zope

Let's briefly go over some XML basics first; skip ahead if you already know them.

1. Encoding of XML documents

This XML document is encoded as UTF-8, so the "Test" caption you see is actually stored UTF-8 encoded. XML documents are generally processed as UTF-8, so if no encoding is specified, the file is assumed to be UTF-8 encoded. Python's XML parser seems to support only a few encodings, and our commonly used GB2312 is not among them, so I recommend using UTF-8 when working with XML.
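To make the encoding point concrete, here is a minimal sketch, assuming Python 3. The one-element document and its non-ASCII caption are hypothetical sample data, not part of catalog.xml as shown.

```python
import xml.dom.minidom

# Hypothetical sample: a caption with non-ASCII text, stored as UTF-8 bytes.
xml_bytes = '<?xml version="1.0" encoding="utf-8"?><caption>测试</caption>'.encode("utf-8")

# parseString accepts the raw bytes and honors the declared encoding.
dom = xml.dom.minidom.parseString(xml_bytes)
caption_text = dom.documentElement.firstChild.data
print(caption_text)

# Passing encoding= to toxml() returns bytes with a matching XML declaration.
out = dom.toxml(encoding="utf-8")
print(type(out).__name__)
```

Inside the DOM the text is an ordinary Python string; the encoding only matters when bytes are read or written.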

2. Structure of XML document

An XML document has an XML header and an XML body. The header looks like this:

<?xml version="1.0" encoding="utf-8"?>

It states the XML version used by the document and how the document is encoded. More complex documents also carry a document type declaration (DOCTYPE), which defines the DTD or schema used by the document as well as some entities. It is not used here, and I am no expert, so I will not elaborate.

The XML body is a tree of elements. Every XML document has a document element, which is the root of the tree; all other elements and content are contained within it.

3. DOM

DOM stands for Document Object Model, a way of representing an XML document as a tree of objects. Its advantage is that you can easily traverse the objects.

4. Elements and nodes

An element is a tag, and tags come in pairs. An XML document is made up of elements, but there can be text between elements, and the content of an element is also text. minidom has many kinds of nodes. An element is one kind of node, but it is not a leaf node, because it can have child nodes. There are also leaf nodes, such as text nodes, which have no children.

In catalog.xml, the document element is catalog, and under it are two kinds of elements: maxid and item. maxid holds the largest item id currently in use. Every item has an id attribute. The id is unique, because NewEdit uses it to generate the name of the XML document that stores the snippets of each category, so it must not repeat; it is an incrementing value. An item element has a caption child element that gives the category's name, and it can also contain further item elements. This defines a tree-shaped XML structure; now let's see how to read it.


First, get the DOM object

>>> import xml.dom.minidom
>>> dom = xml.dom.minidom.parse('d:/catalog.xml')

This gives us a DOM object whose first element should be catalog.

Second, get the document element object

>>> root = dom.documentElement

This gives us the root element (catalog).
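The two steps so far can be tried without a file on disk. This is a minimal sketch, assuming Python 3; it embeds the sample catalog as a string (the name CATALOG_XML is only for illustration) and uses parseString in place of parse.

```python
import xml.dom.minidom

# The sample catalog, embedded as a string so the sketch is self-contained.
CATALOG_XML = """<?xml version="1.0" encoding="utf-8"?>
<catalog>
<maxid>4</maxid>
<item id="1">
<caption>Python</caption>
<item id="4">
<caption>Test</caption>
</item>
</item>
<item id="2">
<caption>Zope</caption>
</item>
</catalog>"""

dom = xml.dom.minidom.parseString(CATALOG_XML)   # the DOM (document) object
root = dom.documentElement                       # the root element

print(dom.nodeType == dom.DOCUMENT_NODE)
print(root.tagName)
```

With a real file, xml.dom.minidom.parse() takes a file name or an open file object instead.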

Third, node attributes

Every node has nodeName, nodeValue, and nodeType attributes. nodeName is the name of the node.

>>> root.nodeName
'catalog'

nodeValue is the value of the node and is meaningful only for text nodes. nodeType is the type of the node; the following types currently exist:

'ATTRIBUTE_NODE'
'CDATA_SECTION_NODE'
'COMMENT_NODE'
'DOCUMENT_FRAGMENT_NODE'
'DOCUMENT_NODE'
'DOCUMENT_TYPE_NODE'
'ELEMENT_NODE'
'ENTITY_NODE'
'ENTITY_REFERENCE_NODE'
'NOTATION_NODE'
'PROCESSING_INSTRUCTION_NODE'
'TEXT_NODE'

The names are self-explanatory. catalog is of type ELEMENT_NODE.

>>> root.nodeType
1
>>> root.ELEMENT_NODE
1
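Because nodeType is just a number, a reverse lookup table can make debugging output easier to read. The sketch below, assuming Python 3, builds one from the constants on xml.dom.Node; NODE_TYPE_NAMES is a hypothetical helper name, not part of minidom.

```python
import xml.dom.minidom
from xml.dom import Node

# Map each numeric node type back to its constant name, e.g. 1 -> 'ELEMENT_NODE'.
NODE_TYPE_NAMES = {
    getattr(Node, name): name for name in dir(Node) if name.endswith("_NODE")
}

dom = xml.dom.minidom.parseString("<catalog><maxid>4</maxid></catalog>")
root = dom.documentElement
maxid_text = root.firstChild.firstChild   # the text node inside <maxid>

# Print name, type, and value for a document, an element, and a text node.
for node in (dom, root, maxid_text):
    print(node.nodeName, NODE_TYPE_NAMES[node.nodeType], repr(node.nodeValue))
```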

Fourth, accessing child elements and child nodes

There are several ways to access child elements and child nodes. When you know the name of a child element, you can use the getElementsByTagName method, for example to read the maxid child element:

>>> root.getElementsByTagName('maxid')
[<DOM Element: maxid at 0xb6d0a8>]

This returns a list; because our example has only one maxid, the list contains a single item.

If you want all the child nodes (including elements) under an element, use the childNodes attribute:

>>> root.childNodes
[<DOM Text node "'\n'">, <DOM Element: maxid at 0xb6d0a8>, <DOM Text node "'\n'">, <DOM Element: item at 0xb6d918>, <DOM Text node "'\n'">, <DOM Element: item at 0xb6de40>, <DOM Text node "'\n'">]

You can see that whatever lies between two tags is treated as a text node, including the line break at the end of each line. The result shows the type of each node (here, text nodes and element nodes), the node name (for element nodes), and the node value (for text nodes). Every node is an object, and different kinds of node objects have different attributes and methods; see the documentation for details. Because this example is fairly simple, it involves only text nodes and element nodes.

getElementsByTagName searches all descendant elements of the current element, at every level. childNodes holds only the first level of child nodes of the current element.
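The difference shows up with the nested item in the sample catalog. This is a minimal sketch, assuming Python 3 and the catalog embedded as a string (CATALOG_XML is an illustrative name).

```python
import xml.dom.minidom

CATALOG_XML = """<?xml version="1.0" encoding="utf-8"?>
<catalog>
<maxid>4</maxid>
<item id="1">
<caption>Python</caption>
<item id="4">
<caption>Test</caption>
</item>
</item>
<item id="2">
<caption>Zope</caption>
</item>
</catalog>"""

root = xml.dom.minidom.parseString(CATALOG_XML).documentElement

# Every <item> at any depth below the root.
all_items = root.getElementsByTagName("item")

# Only the <item> elements that are direct children of the root.
direct_items = [
    node for node in root.childNodes
    if node.nodeType == node.ELEMENT_NODE and node.tagName == "item"
]

print(len(all_items), len(direct_items))
```

The nested item is found by getElementsByTagName but is not a direct child of the root, so the two counts differ.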

So we can iterate over childNodes to visit each node and check its nodeType to decide what to read from it. For example, to print the names of all the elements:

>>> for node in root.childNodes:
...     if node.nodeType == node.ELEMENT_NODE:
...         print(node.nodeName)
...
maxid
item
item

For a text node, use its .data attribute to get the text content.

For simple elements such as <caption>Python</caption>, we can write a function to get the content (here, Python).

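def getTagText(root, tag):
    # Use only the first matching descendant element.
    node = root.getElementsByTagName(tag)[0]
    rc = ""
    # Concatenate the data of all text and CDATA children.
    for child in node.childNodes:
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            rc = rc + child.data
    return rc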

This function handles only the first matching child element it finds, and concatenates all the text nodes inside it. When nodeType is a text-like node, node.data is the text content. If we inspect the childNodes of the caption element, we may see:

[<DOM Text node "'Python'">]

This shows that the caption element has a single text node.
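A usage sketch of such a helper, assuming Python 3. The name get_tag_text and the shortened sample document are only for illustration; this version uses str.join, which is equivalent to the loop.

```python
import xml.dom.minidom


def get_tag_text(root, tag):
    """Return the concatenated text of the first descendant element named tag."""
    element = root.getElementsByTagName(tag)[0]
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
    )


root = xml.dom.minidom.parseString(
    '<catalog><maxid>4</maxid>'
    '<item id="1"><caption>Python</caption></item>'
    '<item id="2"><caption>Zope</caption></item></catalog>'
).documentElement

max_id = get_tag_text(root, "maxid")
first_caption = get_tag_text(root, "caption")   # only the first match is used
print(max_id, first_caption)
```

Note that an IndexError is raised when no element with the given name exists, since the list returned by getElementsByTagName is then empty.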

If an element has attributes, you can use the getAttribute method, for example:

>>> itemlist = root.getElementsByTagName('item')
>>> item = itemlist[0]
>>> item.getAttribute('id')
'1'

This gets the id attribute value of the first item element.

To summarize briefly, here is how to read information from XML with minidom:

1. Import the xml.dom.minidom module and create the DOM object
2. Get the document element object (the root object)
3. Find the elements to process using the getElementsByTagName() method and the childNodes attribute (plus some other methods and attributes)
4. Get the content of the text nodes under the element
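The four steps can be combined into a small reader for the category tree. This is a sketch under stated assumptions: Python 3, the catalog embedded as a string, and hypothetical helper names (direct_children, node_text, walk) that are not part of minidom or NewEdit.

```python
import xml.dom.minidom

CATALOG_XML = """<?xml version="1.0" encoding="utf-8"?>
<catalog>
<maxid>4</maxid>
<item id="1">
<caption>Python</caption>
<item id="4">
<caption>Test</caption>
</item>
</item>
<item id="2">
<caption>Zope</caption>
</item>
</catalog>"""


def direct_children(parent, tag):
    """Return the direct child elements of parent named tag."""
    return [
        node for node in parent.childNodes
        if node.nodeType == node.ELEMENT_NODE and node.tagName == tag
    ]


def node_text(element):
    """Concatenate the text and CDATA children of element."""
    return "".join(
        child.data for child in element.childNodes
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
    )


def walk(parent, depth=0, out=None):
    """Collect (depth, id, caption) for every item, in document order."""
    if out is None:
        out = []
    for item in direct_children(parent, "item"):
        caption = direct_children(item, "caption")[0]
        out.append((depth, item.getAttribute("id"), node_text(caption).strip()))
        walk(item, depth + 1, out)   # recurse into nested items
    return out


tree = walk(xml.dom.minidom.parseString(CATALOG_XML).documentElement)
for depth, item_id, caption in tree:
    print("    " * depth + caption + " (id=" + item_id + ")")
```

The walk uses childNodes rather than getElementsByTagName for the caption, so that an item never picks up the caption of a nested item.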

Two. Writing XML

Now let me show how to generate an XML file like catalog.xml from scratch.

First, create the DOM object

>>> import xml.dom.minidom
>>> impl = xml.dom.minidom.getDOMImplementation()
>>> dom = impl.createDocument(None, 'catalog', None)

This creates an empty DOM object, where catalog is the name of the document element, that is, the root element.
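A quick way to confirm what createDocument produced, as a minimal sketch assuming Python 3:

```python
import xml.dom.minidom

impl = xml.dom.minidom.getDOMImplementation()
# Arguments: namespace URI, qualified name of the root element, doctype.
dom = impl.createDocument(None, "catalog", None)
root = dom.documentElement

print(root.tagName)
print(root.hasChildNodes())   # the new root element has no children yet
print(root.toxml())
```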

Second, display the generated XML content

Every DOM node object (including the DOM object itself) has methods for outputting XML content, such as toxml() and toprettyxml().

toxml() outputs XML text in a compact format, such as:

<catalog><item>test</item><item>test</item></catalog>

toprettyxml() outputs prettified XML text, such as:

<catalog>
    <item>
        test
    </item>
    <item>
        test
    </item>
</catalog>

As you can see, it adds a line break after each node and handles the indentation automatically. However, when an element contains only text, I would like the element's tags to stay on the same line as the text, such as:

<item>test</item>

rather than being split across lines; minidom itself does not support this. As for how to produce XML shaped like this:

<catalog>
    <item>test</item>
    <item>test</item>
</catalog>

we will come back to that later.
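The two output methods can be compared directly. This is a minimal sketch, assuming Python 3. As far as I know, recent Python 3 versions already keep an element whose only child is text on a single line in toprettyxml(), so the limitation described above may apply only to older versions; check the output on your own interpreter.

```python
import xml.dom.minidom

dom = xml.dom.minidom.getDOMImplementation().createDocument(None, "catalog", None)
root = dom.documentElement

# Add two <item>test</item> children to the root.
for _ in range(2):
    item = dom.createElement("item")
    item.appendChild(dom.createTextNode("test"))
    root.appendChild(item)

compact = root.toxml()                     # no added whitespace
pretty = root.toprettyxml(indent="    ")   # one node per line, indented

print(compact)
print(pretty)
```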

Third, creating various node objects

The DOM object has methods for creating each kind of node. Below I cover creating text nodes, CDATA nodes, and element nodes.

1. Creating text nodes

>>> text = dom.createTextNode('test')
>>> text.toxml()
'test'

Note that minidom does not check the text when the node is created. Characters such as '<' and '&' in the text should be converted to the corresponding entities '&lt;' and '&amp;', but no such conversion is done at this point.
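A minimal sketch, assuming Python 3, of where the conversion does happen: the node stores the text unchanged, and the special characters are escaped when the node is serialized.

```python
import xml.dom.minidom

dom = xml.dom.minidom.getDOMImplementation().createDocument(None, "catalog", None)
text = dom.createTextNode("a < b & c")

stored = text.data          # kept exactly as given
serialized = text.toxml()   # '<' and '&' are escaped on output

print(stored)
print(serialized)
```

So there is no need to escape the text by hand before passing it to createTextNode; doing so would escape it twice on output.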

2. Creating CDATA nodes

>>> data = dom.createCDATASection('aaaaaa\nbbbbbb')
>>> data.toxml()
'<![CDATA[aaaaaa\nbbbbbb]]>'

CDATA is used to hold large chunks of text without having to escape characters such as '<' and '&'; the text is wrapped as <![CDATA[text]]>. The text itself, however, must not contain the string "]]>". minidom does not check for this when the node is created; the error may only show up on output.
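Both behaviors can be seen in a short sketch, assuming Python 3: the CDATA content is written out unescaped, and a "]]>" inside the text is rejected only at output time.

```python
import xml.dom.minidom

dom = xml.dom.minidom.getDOMImplementation().createDocument(None, "catalog", None)

# Special characters are written as-is inside a CDATA section.
ok = dom.createCDATASection("a < b & c")
ok_xml = ok.toxml()
print(ok_xml)

# Creating the node succeeds even though the text contains "]]>".
bad = dom.createCDATASection("x ]]> y")
try:
    bad.toxml()
    error = None
except ValueError as exc:   # raised while serializing, not while creating
    error = exc
print(error)
```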

3. Creating element nodes

>>> item = dom.createElement('caption')
>>> item.toxml()
'<caption/>'
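Putting the node-creation methods together, a document with the same structure as catalog.xml could be assembled roughly as follows. This is only a sketch, assuming Python 3; add_item is a hypothetical helper, and setAttribute and appendChild are standard DOM methods that the text above has not covered.

```python
import xml.dom.minidom


def add_item(dom, parent, item_id, caption):
    """Append <item id="..."><caption>...</caption></item> to parent."""
    item = dom.createElement("item")
    item.setAttribute("id", str(item_id))
    caption_element = dom.createElement("caption")
    caption_element.appendChild(dom.createTextNode(caption))
    item.appendChild(caption_element)
    parent.appendChild(item)
    return item


dom = xml.dom.minidom.getDOMImplementation().createDocument(None, "catalog", None)
root = dom.documentElement

# <maxid>4</maxid>
maxid = dom.createElement("maxid")
maxid.appendChild(dom.createTextNode("4"))
root.appendChild(maxid)

# The category tree: Python (with nested Test) and Zope.
python_item = add_item(dom, root, 1, "Python")
add_item(dom, python_item, 4, "Test")
add_item(dom, root, 2, "Zope")

xml_text = root.toxml()
print(xml_text)
```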
