Many articles about reading XML from Python simply post an XML file followed by the code that processes it, which is not very helpful for beginners. I hope this article is easier to follow and teaches you how to read XML files with Python.
1. What is XML?
XML can be used to mark up data and define data types. It is a meta-language that lets you define your own markup language.
abc.xml
<?xml version="1.0" encoding="UTF-8"?>
<catalog>
    <maxid>4</maxid>
    <login username="pytest" passwd='000000'>
        <caption>Python</caption>
        <item id="4">
            <caption>test</caption>
        </item>
    </login>
    <item id="2">
        <caption>Zope</caption>
    </item>
</catalog>
In terms of structure, XML is similar to the familiar HTML (hypertext markup language). However, the two are designed for different purposes: HTML is designed to display data, and its focus is on the appearance of the data, while XML is designed to transmit and store data, and its focus is on the content of the data.
XML has the following features:
It is composed of tag pairs: <aa></aa>
A tag can have attributes: <aa id='000000'></aa>
A tag pair can enclose data: <aa>abc</aa>
A tag can contain sub-tags (forming a hierarchy): <aa><bb></bb></aa>
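The four features above can be seen together in one tiny document. The following is only a minimal sketch: the one-line snippet and the variable names are made up for illustration, and the calls it uses (parseString, getAttribute, firstChild, getElementsByTagName) are explained in the sections that follow.

```python
import xml.dom.minidom

# Hypothetical one-line document combining all four features:
# a tag pair, an attribute, enclosed data, and a sub-tag.
SNIPPET = "<aa id='000000'>abc<bb></bb></aa>"

dom = xml.dom.minidom.parseString(SNIPPET)
aa = dom.documentElement

print(aa.tagName)                                # name of the tag pair
print(aa.getAttribute("id"))                     # value of the id attribute
print(aa.firstChild.data)                        # data enclosed by the tag pair
print(aa.getElementsByTagName("bb")[0].tagName)  # name of the sub-tag
```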
2. Obtain tag attributes
The following describes how to use Python to read files of this type.
# coding=utf-8
import xml.dom.minidom
# Open the XML document
dom = xml.dom.minidom.parse('abc.xml')
# Get the document element (the root node)
root = dom.documentElement
print(root.nodeName)
print(root.nodeValue)
print(root.nodeType)
print(root.ELEMENT_NODE)
The xml.dom.minidom module is used to process XML files.
xml.dom.minidom.parse() opens an XML file and returns a document object, which is assigned to the variable dom.
documentElement returns the document element (the root element) of the dom object, which is assigned to root.
Each node has nodeName, nodeValue, and nodeType attributes.
nodeName is the name of the node.
nodeValue is the value of the node. It is meaningful mainly for text nodes; for an element node such as catalog it is None.
nodeType is the type of the node. catalog is of type ELEMENT_NODE.
The following node types are currently defined:
'ATTRIBUTE_NODE'
'CDATA_SECTION_NODE'
'COMMENT_NODE'
'DOCUMENT_FRAGMENT_NODE'
'DOCUMENT_NODE'
'DOCUMENT_TYPE_NODE'
'ELEMENT_NODE'
'ENTITY_NODE'
'ENTITY_REFERENCE_NODE'
'NOTATION_NODE'
'PROCESSING_INSTRUCTION_NODE'
'TEXT_NODE'
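Because nodeType is printed as a number, it can help to translate it back into one of the names listed above. The sketch below is one possible way to do that, assuming the constants are looked up on the standard xml.dom.Node class; the helper node_type_name and the small sample document (with an added comment) are made up for illustration.

```python
import xml.dom.minidom
from xml.dom import Node

# Hypothetical sample: a cut-down catalog with a comment added for variety.
SAMPLE = "<catalog><maxid>4</maxid><!-- a comment --></catalog>"


def node_type_name(node):
    """Return the name of the xml.dom.Node constant matching node.nodeType."""
    for name in dir(Node):
        if name.endswith("_NODE") and getattr(Node, name) == node.nodeType:
            return name
    return "UNKNOWN"


dom = xml.dom.minidom.parseString(SAMPLE)
root = dom.documentElement
maxid = root.firstChild

# Document, element, element, text and comment nodes, in that order.
for node in (dom, root, maxid, maxid.firstChild, root.lastChild):
    print(node.nodeName, node.nodeType, node_type_name(node))
```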
3. Obtain sub-tags
Next, we want to get the names of the sub-tags of catalog.
If you know the tag name of a sub-element, you can get it with the getElementsByTagName method:
# coding=utf-8
import xml.dom.minidom
# Open the XML document
dom = xml.dom.minidom.parse('abc.xml')
# Get the document element (the root node)
root = dom.documentElement
bb = root.getElementsByTagName('maxid')
b = bb[0]
print(b.nodeName)
bb = root.getElementsByTagName('login')
b = bb[0]
print(b.nodeName)
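getElementsByTagName returns a list-like object, so indexing it with [0] raises IndexError when no tag matches. The sketch below shows one way to guard against that. The helper first_element and the tag name 'nosuch' are hypothetical, and a shortened XML document is inlined so the example runs without abc.xml on disk.

```python
import xml.dom.minidom

# Shortened inline copy of abc.xml, used only to keep the sketch self-contained.
XML_TEXT = """<catalog>
    <maxid>4</maxid>
    <login username="pytest" passwd='000000'>
        <caption>Python</caption>
    </login>
</catalog>"""


def first_element(parent, tag_name):
    """Return the first matching element under parent, or None if there is none."""
    matches = parent.getElementsByTagName(tag_name)
    return matches[0] if matches else None


root = xml.dom.minidom.parseString(XML_TEXT).documentElement

print(first_element(root, 'maxid').nodeName)
print(first_element(root, 'nosuch'))  # no such tag in the document
```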
How do we tell apart tags that have the same tag name?
There are several <caption> tags and several <item> tags. How do we distinguish between them?
# coding=utf-8
import xml.dom.minidom
# Open the XML document
dom = xml.dom.minidom.parse('abc.xml')
# Get the document element (the root node)
root = dom.documentElement
bb = root.getElementsByTagName('caption')
b = bb[2]
print(b.nodeName)
bb = root.getElementsByTagName('item')
b = bb[1]
print(b.nodeName)
root.getElementsByTagName('caption') returns the group of all caption tags under root, at any depth, in document order. bb[0] is the first tag in the group and bb[2] is the third.
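Indexing into the group works, but it depends on knowing the order of the tags. Another way to tell same-named tags apart is by their parent. The sketch below keeps only the direct children of a given element; the helper child_elements is made up for illustration and is not part of minidom, and the XML is inlined so the example runs on its own.

```python
import xml.dom.minidom

# Inline copy of abc.xml so the sketch runs without the file on disk.
XML_TEXT = """<catalog>
    <maxid>4</maxid>
    <login username="pytest" passwd='000000'>
        <caption>Python</caption>
        <item id="4">
            <caption>test</caption>
        </item>
    </login>
    <item id="2">
        <caption>Zope</caption>
    </item>
</catalog>"""


def child_elements(parent, tag_name):
    """Return only the direct child elements of parent named tag_name."""
    return [
        node
        for node in parent.childNodes
        if node.nodeType == node.ELEMENT_NODE and node.tagName == tag_name
    ]


root = xml.dom.minidom.parseString(XML_TEXT).documentElement
login = root.getElementsByTagName('login')[0]

# getElementsByTagName sees both items; child_elements sees one per parent.
print(len(root.getElementsByTagName('item')))
print([item.getAttribute('id') for item in child_elements(root, 'item')])
print([item.getAttribute('id') for item in child_elements(login, 'item')])
```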
4. Obtain tag attribute values
The <login> and <item> tags have attributes. How do we get their values?
# coding=utf-8
import xml.dom.minidom
# Open the XML document
dom = xml.dom.minidom.parse('abc.xml')
# Get the document element (the root node)
root = dom.documentElement
itemlist = root.getElementsByTagName('login')
item = itemlist[0]
un = item.getAttribute("username")
print(un)
pd = item.getAttribute("passwd")
print(pd)
ii = root.getElementsByTagName('item')
i1 = ii[0]
i = i1.getAttribute("id")
print(i)
i2 = ii[1]
i = i2.getAttribute("id")
print(i)
The getAttribute method returns the value of the named attribute of an element. If the element has no such attribute, it returns an empty string.
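When the attribute names are not known in advance, the element's attributes mapping can be walked instead of calling getAttribute name by name. This is a minimal sketch under that assumption; the attribute name 'email' is hypothetical and is used only to show what happens when an attribute is missing.

```python
import xml.dom.minidom

# Only the login tag from abc.xml, inlined to keep the sketch self-contained.
XML_TEXT = """<catalog>
    <login username="pytest" passwd='000000'>
        <caption>Python</caption>
    </login>
</catalog>"""

dom = xml.dom.minidom.parseString(XML_TEXT)
login = dom.getElementsByTagName('login')[0]

# attributes.items() yields (name, value) pairs for every attribute.
for name, value in login.attributes.items():
    print(name, '=', value)

# 'email' is a hypothetical attribute that the tag does not have.
print(login.hasAttribute('email'))
print(repr(login.getAttribute('email')))
```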
5. Obtain data between tag pairs
There is data between the <caption> tag pairs. How do we get it?
There are several ways to get the data between tag pairs.
Method 1:
# coding=utf-8
import xml.dom.minidom
# Open the XML document
dom = xml.dom.minidom.parse('abc.xml')
# Get the document element (the root node)
root = dom.documentElement
cc = dom.getElementsByTagName('caption')
c1 = cc[0]
print(c1.firstChild.data)
c2 = cc[1]
print(c2.firstChild.data)
c3 = cc[2]
print(c3.firstChild.data)
The firstChild attribute returns the first child node of the selected node, and .data gets the data of that node (here, the text between the tags).
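Note that firstChild is None for an empty tag pair, so firstChild.data would fail there. A slightly more defensive variant of method 1 is sketched below; the helper get_text is made up for illustration, and the XML is inlined so the example runs without abc.xml.

```python
import xml.dom.minidom

# Inline copy of abc.xml so the sketch runs without the file on disk.
XML_TEXT = """<catalog>
    <maxid>4</maxid>
    <login username="pytest" passwd='000000'>
        <caption>Python</caption>
        <item id="4">
            <caption>test</caption>
        </item>
    </login>
    <item id="2">
        <caption>Zope</caption>
    </item>
</catalog>"""


def get_text(element):
    """Join the data of all text-node children; return '' if there are none."""
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
    )


dom = xml.dom.minidom.parseString(XML_TEXT)
for caption in dom.getElementsByTagName('caption'):
    print(get_text(caption))

# An empty tag pair has no child nodes at all, so firstChild is None.
empty = xml.dom.minidom.parseString("<caption></caption>").documentElement
print(empty.firstChild, repr(get_text(empty)))
```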
Method 2:
# coding=utf-8
from xml.etree import ElementTree as ET
per = ET.parse('abc.xml')
p = per.findall('./login/item')
for oneper in p:
    # Iterating over an element yields its child elements
    for child in oneper:
        print(child.tag, ':', child.text)
p = per.findall('./item')
for oneper in p:
    for child in oneper:
        print(child.tag, ':', child.text)
Method 2 is a bit more complicated, and it uses a different module from the previous examples. findall specifies the level of tags at which to start traversing.
Iterating over an element yields all of its child tags in document order (older code called the getchildren method for this, but it was removed in Python 3.9). The tag name (child.tag) and the tag data (child.text) are then printed.
Strictly speaking, getting the text is not what method 2 is mainly for: its core function is to traverse all the sub-tags under a tag at a given level.
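To make the traversal idea concrete, the sketch below contrasts two ElementTree calls: iter, which walks every matching tag at any depth, and findall with a path, which matches only at the given level. The variable names are made up for illustration, and the XML is inlined so the example runs without abc.xml.

```python
from xml.etree import ElementTree as ET

# Inline copy of abc.xml so the sketch runs without the file on disk.
XML_TEXT = """<catalog>
    <maxid>4</maxid>
    <login username="pytest" passwd='000000'>
        <caption>Python</caption>
        <item id="4">
            <caption>test</caption>
        </item>
    </login>
    <item id="2">
        <caption>Zope</caption>
    </item>
</catalog>"""

root = ET.fromstring(XML_TEXT)

# iter() walks the element and all of its descendants in document order.
captions = [element.text for element in root.iter('caption')]
print(captions)

# findall() with a path only matches elements at that exact level.
nested_ids = [element.get('id') for element in root.findall('./login/item')]
top_ids = [element.get('id') for element in root.findall('./item')]
print(nested_ids, top_ids)
```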