Use minidom to process XML (2): writing XML


Original article address:

http://blog.donews.com/limodou/archive/2004/07/15/43609.aspx

http://blog.donews.com/limodou/archive/2004/07/15/43755.aspx

 

NewEdit has a code snippet feature. Snippets are organized into snippet categories and snippet contents, and both are saved as XML files by default. The following describes how to use minidom to read and save these XML files.

The following is an example snippet category file, catalog.xml.

<?xml version="1.0" encoding="UTF-8"?>
<catalog>
<maxid>4</maxid>
<item id="1">
<caption>Python</caption>
<item id="4">
<caption>test</caption>
</item>
</item>
<item id="2">
<caption>Zope</caption>
</item>
</catalog>

The categories form a tree structure, which here looks like:

Python
    test
Zope

First, a brief introduction to XML. If you already know it, you can skip this part.

1. XML document encoding

This XML document is encoded in UTF-8, so the "test" caption you see is actually stored as UTF-8. XML processing uses UTF-8 by default: if no encoding is specified, the file is assumed to be UTF-8. Python's XML parser seems to support only a few encodings and does not support the commonly used gb2312, so I recommend using UTF-8 when processing XML.
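As a minimal sketch of the encoding point, assuming Python 3 (where text is already Unicode): the document below declares UTF-8 in its header and is passed to the parser as UTF-8 bytes. The caption text is a made-up example, not taken from the real catalog.xml.

```python
from xml.dom import minidom

# Hypothetical caption with non-ASCII characters, written as escapes.
caption = "\u6d4b\u8bd5"

# The header declares UTF-8, and the bytes really are UTF-8.
xml_bytes = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<catalog><caption>" + caption + "</caption></catalog>"
).encode("utf-8")

dom = minidom.parseString(xml_bytes)
parsed = dom.getElementsByTagName("caption")[0].firstChild.data

# The parser hands back decoded text, not raw bytes.
print(type(parsed).__name__, parsed == caption)
```

The declared encoding and the actual bytes have to agree; the parser does the decoding for you.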

2. XML document structure

An XML document consists of XML header information and an XML body. The header looks like this:

<?xml version="1.0" encoding="UTF-8"?>

It gives the XML version and the encoding used by the document. More complex documents also carry a document type definition (DOCTYPE), which declares the DTD, schema, and entity definitions used by the document. It is not used here, and I am not an expert, so I will not elaborate on it.

The XML body is a tree of elements. Each XML document has one document element, which is the root of the tree. All other elements and content are contained in the root element.

3. DOM

DOM is short for Document Object Model. It represents an XML document as a tree of objects. The advantage of using the DOM is that you can traverse the objects flexibly.

4. Elements and nodes

An element corresponds to a tag, and tags appear in pairs. XML documents are composed of elements, but there can also be text between elements, and the content of an element is text as well. minidom has many kinds of nodes, and an element is one of them. An element is not a leaf node, because it can have child nodes; there are also leaf nodes, such as text nodes, which have no child nodes.

For example, in catalog.xml the document element is catalog, which has two kinds of child elements: maxid and item. maxid holds the largest item id currently in use. Each item has an id attribute, which is unique. NewEdit uses it to generate the name of the XML file that stores the snippets of each category, so it must not repeat, and it is an increasing value. An item element has a caption child element that gives the category name, and it can also contain other item elements. In this way, a tree-like XML structure is defined. Let's take a look at how to read it.
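To make the tree structure concrete, here is a minimal sketch, assuming Python 3, that walks the nested item elements recursively. The helper names child_elements and category_tree are my own and are not part of NewEdit or minidom.

```python
from xml.dom import minidom

CATALOG_XML = """<catalog>
<maxid>4</maxid>
<item id="1">
<caption>Python</caption>
<item id="4">
<caption>test</caption>
</item>
</item>
<item id="2">
<caption>Zope</caption>
</item>
</catalog>"""

def child_elements(node, name):
    """Return the direct child elements of `node` with the given tag name."""
    return [c for c in node.childNodes
            if c.nodeType == c.ELEMENT_NODE and c.tagName == name]

def category_tree(node, depth=0):
    """Collect (depth, id, caption) for every item below `node`, depth-first."""
    rows = []
    for item in child_elements(node, "item"):
        caption = child_elements(item, "caption")[0]
        rows.append((depth, item.getAttribute("id"), caption.firstChild.data))
        rows.extend(category_tree(item, depth + 1))
    return rows

dom = minidom.parseString(CATALOG_XML)
tree = category_tree(dom.documentElement)
for depth, item_id, caption in tree:
    print("    " * depth + caption + " (id=" + item_id + ")")
```

Only direct children are inspected at each level, so a nested item is reported under its parent rather than at the top level.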

1. Get the DOM object

>>> import xml.dom.minidom
>>> dom = xml.dom.minidom.parse('d:/catalog.xml')

In this way, we get a DOM object, and its first element should be catalog.

2. Get the document element object

>>> root = dom.documentElement

In this way, the root element (catalog) is obtained.

3. Node attributes

Each node has nodeName, nodeValue, and nodeType attributes. nodeName is the node name.

>>> root.nodeName
u'catalog'

nodeValue is the value of a node; it is meaningful for nodes such as text nodes and is None for element nodes. nodeType is the node type. The following types are available:

'ATTRIBUTE_NODE'
'CDATA_SECTION_NODE'
'COMMENT_NODE'
'DOCUMENT_FRAGMENT_NODE'
'DOCUMENT_NODE'
'DOCUMENT_TYPE_NODE'
'ELEMENT_NODE'
'ENTITY_NODE'
'ENTITY_REFERENCE_NODE'
'NOTATION_NODE'
'PROCESSING_INSTRUCTION_NODE'
'TEXT_NODE'

The names are largely self-explanatory. catalog is of type ELEMENT_NODE.

>>> root.nodeType
1
>>> root.ELEMENT_NODE
1
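Because nodeType is only a number, a lookup table from number to name can help when exploring a document. This is a small sketch, assuming Python 3; the table NODE_TYPE_NAMES is my own helper, built from the constants defined on xml.dom.Node.

```python
from xml.dom import Node, minidom

# Build a lookup table from the numeric constants on xml.dom.Node.
NODE_TYPE_NAMES = {
    getattr(Node, name): name for name in dir(Node) if name.endswith("_NODE")
}

dom = minidom.parseString("<catalog><maxid>4</maxid></catalog>")
root = dom.documentElement
maxid_text = root.firstChild.firstChild  # the text node inside <maxid>

names = [NODE_TYPE_NAMES[n.nodeType] for n in (dom, root, maxid_text)]
print(names)
```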

4. Accessing child elements and child nodes

There are several ways to access child elements and child nodes. If you know the element name, you can use the getElementsByTagName method, for example to read the maxid child element:

>>> root.getElementsByTagName('maxid')
[<DOM Element: maxid at 0xb6d0a8>]

A list is returned. Because our example contains only one maxid element, the list has only one item.

If you want all the child nodes (including elements) under an element, use the childNodes attribute:

>>> root.childNodes
[<DOM Text node "\n">, <DOM Element: maxid at 0xb6d0a8>,
<DOM Text node "\n">, <DOM Element: item at 0xb6d918>,
<DOM Text node "\n">, <DOM Element: item at 0xb6de40>,
<DOM Text node "\n">]

As you can see, all content between two tags is treated as a text node, including the newline at the end of each line. The result also shows the type of each node (in this example, text nodes and element nodes), the node name for element nodes, and the node value for text nodes. Each kind of node is a different Node object with its own attributes and methods. Because this example is simple, it only involves text nodes and element nodes.

getElementsByTagName searches all descendant elements of the current element, at every level. childNodes holds only the direct (first-level) child nodes of the current element.
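The difference between the two can be sketched as follows, assuming Python 3 and the same catalog structure written on one line so that there are no whitespace text nodes:

```python
from xml.dom import minidom

dom = minidom.parseString(
    '<catalog><maxid>4</maxid>'
    '<item id="1"><caption>Python</caption>'
    '<item id="4"><caption>test</caption></item></item>'
    '<item id="2"><caption>Zope</caption></item></catalog>'
)
root = dom.documentElement

# All descendants named "item", at any depth, in document order.
all_items = root.getElementsByTagName("item")

# Only the first-level children named "item".
direct_items = [n for n in root.childNodes
                if n.nodeType == n.ELEMENT_NODE and n.tagName == "item"]

all_ids = [i.getAttribute("id") for i in all_items]
direct_ids = [i.getAttribute("id") for i in direct_items]
print(all_ids, direct_ids)
```

The nested item with id 4 appears in the first list but not in the second.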

In this way, we can traverse childNodes to visit each node and check its nodeType to handle different content. For example, to print the names of all child elements:

>>> for node in root.childNodes:
...     if node.nodeType == node.ELEMENT_NODE:
...         print(node.nodeName)
...
maxid
item
item

For a text node, you can use the .data attribute to get the text content.

For simple elements such as <caption>Python</caption>, we can write a function like this to get the content (here, Python).

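def getTagText(root, tag):
    # Take the first matching element among the descendants of root.
    node = root.getElementsByTagName(tag)[0]
    rc = ""
    # Concatenate the data of its text and CDATA child nodes.
    for child in node.childNodes:
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            rc = rc + child.data
    return rc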

This function only processes the first matching element it finds, and concatenates all the text nodes in it. When nodeType is a text node, node.data is the text content. If we examine the childNodes of the caption element, we may see:

[<DOM Text node "Python">]

This shows that the caption element has only one child, a text node.
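One variation that may be useful, sketched here for Python 3: return a default value when the tag does not exist instead of failing on the [0] index. The name get_tag_text and the default parameter are my own additions and not part of the original function.

```python
from xml.dom import minidom

def get_tag_text(root, tag, default=""):
    """Return the joined text/CDATA content of the first <tag> under root.

    `default` is a hypothetical extra parameter, returned when no such
    element exists (indexing an empty result would raise IndexError).
    """
    found = root.getElementsByTagName(tag)
    if not found:
        return default
    return "".join(
        child.data
        for child in found[0].childNodes
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
    )

dom = minidom.parseString("<item><caption>Py<![CDATA[th]]>on</caption></item>")
root = dom.documentElement
caption = get_tag_text(root, "caption")
missing = get_tag_text(root, "nosuchtag", default="(none)")
print(caption, missing)
```

The example caption mixes plain text and a CDATA section to show that both kinds of node contribute to the result.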

If an element has an attribute, you can read it with the getAttribute method, for example:

>>> itemlist = root.getElementsByTagName('item')
>>> item = itemlist[0]
>>> item.getAttribute('id')
u'1'

In this way, the id attribute value of the first item element is obtained.
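One detail worth knowing, shown in a small Python 3 sketch with a made-up two-item document: for an attribute that is not present, minidom's getAttribute returns an empty string rather than raising an error, so hasAttribute is the way to tell "missing" from "empty".

```python
from xml.dom import minidom

dom = minidom.parseString('<catalog><item id="1"/><item/></catalog>')
first, second = dom.documentElement.getElementsByTagName("item")

# getAttribute gives "" for the item without an id.
ids = [first.getAttribute("id"), second.getAttribute("id")]

# hasAttribute distinguishes a missing attribute from an empty one.
has_id = [first.hasAttribute("id"), second.hasAttribute("id")]
print(ids, has_id)
```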

To summarize, reading information from XML with minidom takes these steps:

1. Import the xml.dom.minidom module and generate a DOM object.
2. Get the document element (root element).
3. Use the getElementsByTagName() method and the childNodes attribute (there are other methods and attributes as well) to find the elements to process.
4. Get the content of the text nodes under those elements.
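The four steps above can be sketched end to end as follows, assuming Python 3 and parsing from a string instead of d:/catalog.xml so that the example is self-contained:

```python
from xml.dom import minidom

CATALOG_XML = """<catalog>
<maxid>4</maxid>
<item id="1">
<caption>Python</caption>
<item id="4">
<caption>test</caption>
</item>
</item>
<item id="2">
<caption>Zope</caption>
</item>
</catalog>"""

# Step 1: generate the DOM object.
dom = minidom.parseString(CATALOG_XML)

# Step 2: get the document element.
root = dom.documentElement

# Step 3: find the elements to process.
maxid_element = root.getElementsByTagName("maxid")[0]
caption_elements = root.getElementsByTagName("caption")

# Step 4: read the text nodes under those elements.
maxid = int(maxid_element.firstChild.data)
captions = [c.firstChild.data for c in caption_elements]
print(maxid, captions)
```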

 

Next I will demonstrate how to create an XML file like catalog.xml from scratch.

1. Generate a DOM object

>>> import xml.dom.minidom
>>> impl = xml.dom.minidom.getDOMImplementation()
>>> dom = impl.createDocument(None, 'catalog', None)

This generates an empty DOM object. catalog is the name of the document element, that is, the root element.

2. Display the generated XML content

Each DOM node object (including the DOM object itself) has methods for outputting XML content, such as toxml() and toprettyxml().

toxml() outputs XML text in compact form, for example:

<catalog><item>test</item></catalog>

toprettyxml() outputs prettified XML text, for example:

<catalog>
    <item>
        test
    </item>
    <item>
        test
    </item>
</catalog>

As you can see, it adds a newline after each node and handles indentation automatically. However, when an element contains only text, I want the element's tags to stay on the same line as the text, for example:

<item>test</item>

rather than being split across several lines, but minidom itself does not support this. How to produce output like the following will be discussed later:

<catalog>
    <item>test</item>
    <item>test</item>
</catalog>
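For comparison, here is a minimal Python 3 sketch of the two output methods. As far as I know, recent Python 3 releases already keep an element that contains only text on a single line in toprettyxml(), so the split layout described above reflects the older minidom used when this was written. The four-space indent is my own choice; the default is a tab.

```python
from xml.dom import minidom

impl = minidom.getDOMImplementation()
dom = impl.createDocument(None, "catalog", None)
root = dom.documentElement
for _ in range(2):
    item = dom.createElement("item")
    item.appendChild(dom.createTextNode("test"))
    root.appendChild(item)

compact = root.toxml()                    # everything on one line
pretty = root.toprettyxml(indent="    ")  # one node per line, indented
print(compact)
print(pretty)
```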

3. Generate the various node objects

The DOM object has methods for generating each kind of node. Text nodes, CDATA nodes, and element nodes are covered below.

1. Text node generation

>>> text = dom.createTextNode('test')
>>> print(text.toxml())
test

Note that minidom does not check or convert the text when the node is created: if characters such as '<' or '&' appear in the text, they are stored as-is rather than converted to the corresponding entities '&lt;' and '&amp;'. In current Python versions, that conversion is done when the node is serialized with toxml() or writexml().
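A minimal sketch of this behaviour, assuming Python 3: the node keeps the raw text in .data, while the serialized form has the special characters replaced by entities.

```python
from xml.dom import minidom

dom = minidom.getDOMImplementation().createDocument(None, "catalog", None)
text = dom.createTextNode("a < b & c")

stored = text.data         # kept exactly as passed in
serialized = text.toxml()  # entities are substituted on output
print(stored)
print(serialized)
```

This means that replacing the characters by hand before creating the node would escape them a second time on output under Python 3.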

2. CDATA node generation

>>> data = dom.createCDATASection('aaaaaa\nbbbbbbbb')
>>> data.toxml()
'<![CDATA[aaaaaa\nbbbbbbbb]]>'

CDATA is used to hold large blocks of text in which characters such as '<' and '&' do not need to be converted. It is written as <![CDATA[text]]>. However, the text must not contain the string "]]>". minidom does not check for this when the node is created; the error may only be discovered when you output it.
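The late check can be seen in a short Python 3 sketch. Creating the node succeeds even with "]]>" in the text, and the problem only surfaces at output time, where the minidom versions I know of raise ValueError.

```python
from xml.dom import minidom

dom = minidom.getDOMImplementation().createDocument(None, "catalog", None)
ok = dom.createCDATASection("a < b & c")
bad = dom.createCDATASection("x ]]> y")  # accepted at creation time

ok_xml = ok.toxml()  # special characters are left untouched inside CDATA
try:
    bad.toxml()
    error = None
except ValueError as exc:
    error = str(exc)

print(ok_xml)
print(error)
```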

3. Generation of element nodes

>>> item = dom.createElement('caption')
>>> item.toxml()
'<caption/>'

A newly generated element node is an empty element: it contains no text. To include text or other elements, use methods such as appendChild() or insertBefore() to add child nodes to the element node. For example, to add the text node generated above to the caption element node:

>>> item.appendChild(text)
<DOM Text node "test">
>>> item.toxml()
'<caption>test</caption>'

You can use the setAttribute() method of the element object to add attributes to the element, for example:

>>> item.setAttribute('id', 'idvalue')
>>> item.toxml()
'<caption id="idvalue">test</caption>'

4. Generate a DOM object tree

We now have a DOM object and know how to generate the various nodes, both leaf nodes (nodes that contain no other nodes, such as text nodes) and non-leaf nodes (nodes that contain other nodes, such as element nodes). Next, use the appendChild() or insertBefore() method of the node objects to link the nodes into a tree according to their positions, and finally attach them to the document node, that is, the root node. A complete example:

>>> import xml.dom.minidom
>>> impl = xml.dom.minidom.getDOMImplementation()
>>> dom = impl.createDocument(None, 'catalog', None)
>>> root = dom.documentElement
>>> item = dom.createElement('item')
>>> text = dom.createTextNode('test')
>>> item.appendChild(text)
<DOM Text node "test">
>>> root.appendChild(item)
<DOM Element: item at 0xb9cf80>
>>> print(root.toxml())
<catalog><item>test</item></catalog>
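The example above only uses appendChild(). As a sketch of insertBefore(), assuming Python 3, the following builds an item first and then places a maxid element in front of it. The values are made up for illustration.

```python
from xml.dom import minidom

dom = minidom.getDOMImplementation().createDocument(None, "catalog", None)
root = dom.documentElement

# Build <item id="1"><caption>Python</caption></item> and append it.
item = dom.createElement("item")
item.setAttribute("id", "1")
caption = dom.createElement("caption")
caption.appendChild(dom.createTextNode("Python"))
item.appendChild(caption)
root.appendChild(item)

# insertBefore(new, ref) places `new` in front of the existing child `ref`.
maxid = dom.createElement("maxid")
maxid.appendChild(dom.createTextNode("1"))
root.insertBefore(maxid, item)

result = root.toxml()
print(result)
```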

5. A function for easily generating element nodes

The following is a small function I wrote to easily generate element nodes such as:

<caption>test</caption>

or:

<item><![CDATA[test]]></item>

def makeEasyTag(dom, tagname, value, type='text'):
    tag = dom.createElement(tagname)
    # ']]>' cannot appear inside a CDATA section, so fall back to a text node.
    if value.find(']]>') > -1:
        type = 'text'
    if type == 'text':
        # No manual escaping here: minidom converts '<' and '&' to
        # '&lt;' and '&amp;' when the text node is serialized.
        text = dom.createTextNode(value)
    elif type == 'cdata':
        text = dom.createCDATASection(value)
    tag.appendChild(text)
    return tag

Parameter description:

  • dom is the DOM object.
  • tagname is the name of the element to generate, such as 'item'.
  • value is the text content of the element, which can span multiple lines.
  • type is the kind of text node: 'text' is an ordinary text node and 'cdata' is a CDATA node.

Function processing description:

  • First, create the element node.
  • Check whether the text content contains ']]>'. If it does, the content can only be stored in an ordinary text node.
  • If the node type is 'text', generate a text node. The characters '<' and '&' are left as they are, because minidom converts them to '&lt;' and '&amp;' when the node is output; replacing them here as well would escape them twice.
  • If the node type is 'cdata', generate a CDATA node.
  • Append the generated node to the element node.

Therefore, this small function keeps the string ']]>' out of CDATA nodes, and character conversion is handled automatically on output.

The statements that generated the 'item' node above can be changed to:

>>> item = makeEasyTag(dom, 'item', 'test')
>>> item.toxml()
'<item>test</item>'

6. Write the data to an XML file

Once the DOM object tree has been generated, we can call the writexml() method of the DOM object to write the content to a file. The syntax of the writexml() method is:

writexml(writer, indent, addindent, newl, encoding)

  • writer is a file object.
  • indent is the string written before each tag. For example, '  ' means two spaces before each tag.
  • addindent is the additional indentation applied to each level of child nodes.
  • newl is the string written after each tag. For example, '\n' means each tag is followed by a newline.
  • encoding is the value of the encoding attribute in the generated XML header. minidom does not actually encode the output itself; if the text you save contains Chinese characters, you need to handle the encoding conversion yourself.
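The effect of the layout parameters can be sketched as follows, assuming Python 3 and an in-memory io.StringIO writer instead of a real file. The call is made on the root element, whose writexml() takes the same layout parameters but no encoding.

```python
import io
from xml.dom import minidom

dom = minidom.parseString(
    "<catalog><item><caption>test</caption></item></catalog>"
)

buf = io.StringIO()
# indent: prefix for the starting level; addindent: added per nesting level;
# newl: written after each tag.
dom.documentElement.writexml(buf, indent="", addindent="  ", newl="\n")
lines = buf.getvalue().splitlines()
print(lines[:2])
```

Each nesting level is indented by one more copy of addindent, starting from indent.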

All parameters of writexml except writer can be omitted. The following is an example with text that contains Chinese characters:

 1 >>> import xml.dom.minidom
 2 >>> impl = xml.dom.minidom.getDOMImplementation()
 3 >>> dom = impl.createDocument(None, 'catalog', None)
 4 >>> root = dom.documentElement
 5 >>> text = unicode('汉字示例', 'cp936')
 6 >>> item = makeEasyTag(dom, 'item', text)
 7 >>> root.appendChild(item)
 8 <DOM Element: item at 0xb9ceb8>
 9 >>> root.toxml()
10 u'<catalog><item>\u6c49\u5b57\u793a\u4f8b</item></catalog>'
11 >>> f = open('d:/test.xml', 'w')
12 >>> import codecs
13 >>> writer = codecs.lookup('utf-8')[3](f)
14 >>> dom.writexml(writer, encoding='utf-8')
15 >>> writer.close()

Line 5: XML processing uses Unicode internally, so text such as Chinese characters must first be converted to Unicode. If you skip this step, minidom does not check, and saving may succeed without any error, but reading the file back may fail.
Lines 12-13 create a stream writer for UTF-8, so that Unicode is automatically encoded as UTF-8 when saved.
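The session above is from Python 2. A rough Python 3 equivalent might look like the sketch below, where strings are already Unicode and the file object opened with encoding='utf-8' takes the place of the codecs stream writer. The temporary directory is only there to keep the example self-contained; it is not the d:/test.xml path used above.

```python
import os
import tempfile
from xml.dom import minidom

dom = minidom.getDOMImplementation().createDocument(None, "catalog", None)
item = dom.createElement("item")
item.appendChild(dom.createTextNode("\u6c49\u5b57\u793a\u4f8b"))
dom.documentElement.appendChild(item)

# Hypothetical output location for this sketch.
path = os.path.join(tempfile.mkdtemp(), "test.xml")

# The file object does the encoding; writexml() only records it in the header.
with open(path, "w", encoding="utf-8") as f:
    dom.writexml(f, indent="", addindent="  ", newl="\n", encoding="utf-8")

with open(path, encoding="utf-8") as f:
    written = f.read()

# Read the file back to confirm that the text survived the round trip.
reread = minidom.parse(path)
text = reread.getElementsByTagName("item")[0].firstChild.data
print(written)
```

The encoding passed to open() and the encoding passed to writexml() should match, since the second only fills in the XML header.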

This completes writing the XML file.
