Author: Shane
Source: http://bluescorpio.cnblogs.com
lxml takes all the pain out of XML.
Stephan Richter
lxml is the most feature-rich and easy-to-use library for working with XML and HTML in Python. It is a Pythonic binding for the C libraries libxml2 and libxslt. What sets it apart is that it combines the speed and feature completeness of these libraries with the simplicity of a native Python API that is mostly compatible with, but superior to, the well-known ElementTree API.
Installing lxml:
Requirement: Python 2.3 or later.
Run the following command as superuser or administrator, using the easy_install tool:
easy_install lxml
On Windows, it is best to pin the version: easy_install lxml==2.2.6
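After installing, a quick check from Python can confirm that the binding is importable and show which library versions are in use. This is only a sketch; the printed values depend on your installation.

```python
from lxml import etree

# Version tuples of lxml itself and of the C libraries it is using.
print("lxml:   ", etree.LXML_VERSION)
print("libxml2:", etree.LIBXML_VERSION)
print("libxslt:", etree.LIBXSLT_VERSION)
```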
Developing with lxml
The lxml.etree tutorial
lxml.etree is usually imported like this:
>>> from lxml import etree
The Element class. An Element is the main container class of the ElementTree API. Most of the XML tree functionality is accessed through this class. Elements are easily created with the Element factory:
>>> root = etree.Element("root")
The XML tag name of an element is accessed through its tag attribute:
>>> print(root.tag)
root
Elements are organized in an XML tree structure. To create child elements and add them to a parent element, you can use the append() method:
>>> root.append(etree.Element("child1"))
There is also a shorter and more efficient way to do this: the SubElement factory. It takes the same arguments as the Element factory, but requires the parent as its first argument:
>>> child2 = etree.SubElement(root, "child2")
>>> child3 = etree.SubElement(root, "child3")
To look at the resulting XML, serialize the tree with tostring():
>>> print(etree.tostring(root, pretty_print=True).decode())
<root>
  <child1/>
  <child2/>
  <child3/>
</root>
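The two factories can be combined into a small recursive helper. The following is a sketch only: build_tree and its nested-tuple input format are hypothetical names introduced for illustration and are not part of lxml.

```python
from lxml import etree


def build_tree(spec, parent=None):
    """Build an element tree from a (tag, [child specs]) tuple.

    build_tree and the tuple format are illustrative assumptions,
    not part of the lxml API.
    """
    tag, children = spec
    if parent is None:
        node = etree.Element(tag)             # root: plain Element factory
    else:
        node = etree.SubElement(parent, tag)  # child: created and attached in one step
    for child_spec in children:
        build_tree(child_spec, node)
    return node


root = build_tree(
    ("root", [("child1", []), ("child2", [("leaf", [])]), ("child3", [])])
)
print(etree.tostring(root, pretty_print=True).decode())
```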
Elements are lists
>>> child = root[0]
>>> print(child.tag)
child1
>>> print(len(root))
3
>>> root.index(root[1])  # lxml.etree only!
1
Print all child nodes:
>>> children = list(root)
>>> for child in root:
...     print(child.tag)
child1
child2
child3
You can use the insert() method to insert a new child node:
>>> root.insert(0, etree.Element("child0"))
Assigning an existing element to another position moves it there and thereby removes the child it replaces:
>>> root[0] = root[-1]  # this moves the element!
>>> for child in root:
...     print(child.tag)
child3
child1
child2
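Because an lxml element has exactly one parent, appending it somewhere else moves it rather than copying it. To actually delete a child, the list-style del statement or remove() can be used. A minimal sketch with made-up tag names:

```python
from lxml import etree

source = etree.XML("<source><item/><other/></source>")
target = etree.Element("target")

# Appending an element that already has a parent moves it (lxml.etree behaviour).
item = source[0]
target.append(item)
print(len(source), len(target))

# Deleting children: by position with del, or by reference with remove().
root = etree.XML("<root><a/><b/><c/></root>")
del root[0]            # drops <a/>
root.remove(root[-1])  # drops <c/>
print([child.tag for child in root])
```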
If you want to copy an element to a different place, you need to create an independent deep copy:
>>> from copy import deepcopy
>>> element = etree.Element("neu")
>>> element.append(deepcopy(root[1]))
>>> print(element[0].tag)
child1
>>> print([c.tag for c in root])
['child3', 'child1', 'child2']
getparent() returns the parent node:
>>> root is root[0].getparent()  # lxml.etree only!
True
The siblings (or neighbours) of an element are accessed with the getnext() and getprevious() methods:
>>> root[0] is root[1].getprevious()  # lxml.etree only!
True
>>> root[1] is root[0].getnext()  # lxml.etree only!
True
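These navigation methods return None when there is nothing to return, which makes simple walking loops possible. As a sketch, the hypothetical helper following_siblings (not part of lxml) collects everything to the right of an element:

```python
from lxml import etree

root = etree.XML("<root><a/><b/><c/><d/></root>")


def following_siblings(element):
    """Collect the tags of all siblings after `element` using getnext()."""
    tags = []
    sibling = element.getnext()
    while sibling is not None:   # getnext() returns None after the last sibling
        tags.append(sibling.tag)
        sibling = sibling.getnext()
    return tags


print(following_siblings(root[1]))
print(root[0].getprevious())
```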
Elements carry attributes
XML elements support attributes, which can be created directly in the Element factory:
>>> root = etree.Element("root", interesting="totally")
>>> etree.tostring(root)
b'<root interesting="totally"/>'
You can use the set() and get() methods to access these attributes:
>>> print(root.get("interesting"))
totally
>>> root.set("interesting", "somewhat")
>>> print(root.get("interesting"))
somewhat
You can also use the dictionary interface of the attrib property:
>>> attributes = root.attrib
>>> print(attributes["interesting"])
somewhat
>>> print(attributes.get("hello"))
None
>>> attributes["hello"] = "Guten Tag"
>>> print(attributes.get("hello"))
Guten Tag
>>> print(root.get("hello"))
Guten Tag
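Since attrib is a live view of the element, it can be worth taking a plain dict copy when a snapshot is needed. A short sketch of this, together with keys(), items() and the default argument of get():

```python
from lxml import etree

root = etree.Element("root", interesting="totally")
root.set("hello", "Guten Tag")

# keys() and items() mirror the dictionary view of the attributes.
print(sorted(root.keys()))
print(sorted(root.items()))

# attrib is a live view: take a dict() copy if you need a snapshot
# that does not change when the element is modified later.
snapshot = dict(root.attrib)
root.set("hello", "changed")
print(snapshot["hello"], root.attrib["hello"])

# get() accepts a default for missing attributes.
print(root.get("missing", "n/a"))
```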
Elements can contain text:
>>> root = etree.Element("root")
>>> root.text = "text"
>>> print(root.text)
text
>>> etree.tostring(root)
b'<root>text</root>'
In document-style XML such as (X)HTML, however, text can also appear between different elements, right in the middle of the tree:
<html><body>Hello<br/>World</body></html>
For this, elements have a tail attribute, which contains the text that directly follows the element, up to the next element in the XML tree.
>>> html = etree.Element("html")
>>> body = etree.SubElement(html, "body")
>>> body.text = "text"
>>> etree.tostring(html)
b'<html><body>text</body></html>'
>>> br = etree.SubElement(body, "br")
>>> etree.tostring(html)
b'<html><body>text<br/></body></html>'
>>> br.tail = "tail"
>>> etree.tostring(html)
b'<html><body>text<br/>tail</body></html>'
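To see how text and tail fit together, the text content of an element can be rebuilt by hand. The helper below is a sketch under that reading of the model; text_content is a hypothetical name, not an lxml function.

```python
from lxml import etree


def text_content(element):
    """Concatenate text and tail strings in document order.

    text_content is a hypothetical helper; it covers the text inside
    `element` but not the tail that follows the element itself.
    """
    parts = [element.text or ""]
    for child in element:
        parts.append(text_content(child))  # text inside the child
        parts.append(child.tail or "")     # text after the child's end tag
    return "".join(parts)


body = etree.XML("<body>Hello<br/>World <b>and</b> more</body>")
print(text_content(body))
```

For trees that contain only elements, this should agree with "".join(body.itertext()), which lxml provides out of the box.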
Using XPath to find text
Another way to extract the text content of a tree is XPath, which can also return the separate text chunks as a list:
>>> print(html.xpath("string()"))  # lxml.etree only!
texttail
>>> print(html.xpath("//text()"))  # lxml.etree only!
['text', 'tail']
If you use this often, you can wrap it in a function:
>>> build_text_list = etree.XPath("//text()")  # lxml.etree only!
>>> print(build_text_list(html))
['text', 'tail']
The strings returned by XPath remember where they came from, so you can ask them for their parent element with getparent():
>>> texts = build_text_list(html)
>>> print(texts[0])
text
>>> parent = texts[0].getparent()
>>> print(parent.tag)
body
>>> print(texts[1])
tail
>>> print(texts[1].getparent().tag)
br
You can also find out whether a string is normal text content or tail text:
>>> print(texts[0].is_text)
True
>>> print(texts[1].is_text)
False
>>> print(texts[1].is_tail)
True
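Put together, is_text, is_tail and getparent() allow every text chunk in a document to be labelled with its origin. A small sketch on a made-up document:

```python
from lxml import etree

html = etree.XML("<html><body>one<br/>two<p>three</p>four</body></html>")
find_text = etree.XPath("//text()")  # compile once, call many times

# Label each string with its origin and the tag of the element it hangs on.
for chunk in find_text(html):
    kind = "text" if chunk.is_text else "tail"
    print("%s: %r (on <%s>)" % (kind, str(chunk), chunk.getparent().tag))

labelled = [(str(c), c.is_tail, c.getparent().tag) for c in find_text(html)]
```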
Tree iteration:
Elements provide a tree iterator that visits the elements of a tree in document order.
>>> root = etree.Element("root")
>>> etree.SubElement(root, "child").text = "Child 1"
>>> etree.SubElement(root, "child").text = "Child 2"
>>> etree.SubElement(root, "another").text = "Child 3"
>>> print(etree.tostring(root, pretty_print=True).decode())
<root>
  <child>Child 1</child>
  <child>Child 2</child>
  <another>Child 3</another>
</root>
>>> for element in root.iter():
...     print("%s - %s" % (element.tag, element.text))
root - None
child - Child 1
child - Child 2
another - Child 3
If you know you are only interested in a single tag, you can pass its name to iter() to filter for it:
>>> for element in root.iter("child"):
...     print("%s - %s" % (element.tag, element.text))
child - Child 1
child - Child 2
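Because iter() walks the whole subtree and not only the direct children, it works well for quick statistics over a document. A sketch, using a made-up document with one nested child:

```python
from collections import Counter

from lxml import etree

root = etree.XML(
    "<root><child>Child 1</child><child>Child 2</child>"
    "<another><child>nested</child></another></root>"
)

# iter() walks the whole subtree in document order, starting with root itself.
tag_counts = Counter(element.tag for element in root.iter())
print(tag_counts)

# With a tag name, only matching elements at any depth are yielded.
child_texts = [element.text for element in root.iter("child")]
print(child_texts)
```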
By default, iteration yields all nodes in the tree, including ProcessingInstructions, Comments and Entity instances. If you want to make sure that only Element objects are returned, you can pass the Element factory as the tag parameter:
>>> root.append(etree.Entity("#234"))
>>> root.append(etree.Comment("some comment"))
>>> for element in root.iter():
...     if isinstance(element.tag, str):  # basestring on Python 2
...         print("%s - %s" % (element.tag, element.text))
...     else:
...         print("SPECIAL: %s - %s" % (element, element.text))
root - None
child - Child 1
child - Child 2
another - Child 3
SPECIAL: &#234; - &#234;
SPECIAL: <!--some comment--> - some comment
>>> for element in root.iter(tag=etree.Element):
...     print("%s - %s" % (element.tag, element.text))
root - None
child - Child 1
child - Child 2
another - Child 3
>>> for element in root.iter(tag=etree.Entity):
...     print(element.text)
&#234;
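The same tag filter can be used the other way round, for example to find and strip all comments from a tree. This is a sketch on a made-up document; note that removing a node also removes its tail text, which is harmless here because the comments have none.

```python
from lxml import etree

root = etree.XML("<root><!-- first --><a/><b><!-- nested --></b></root>")

# Materialise the matches first so the tree is not modified while iterating.
comments = list(root.iter(tag=etree.Comment))
for comment in comments:
    comment.getparent().remove(comment)  # note: this also drops the comment's tail text

print(etree.tostring(root))
```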
Serialization:
Serialization commonly uses the tostring() function, which returns a string, or the ElementTree.write() method, which writes to a file, a file-like object, or a URL (via FTP PUT or HTTP POST). Both accept the same keyword arguments, such as pretty_print for formatted output or encoding to select an output encoding other than plain ASCII.
>>> root = etree.XML("<root><a><b/></a></root>")
>>> etree.tostring(root)
b'<root><a><b/></a></root>'
>>> print(etree.tostring(root, xml_declaration=True).decode())
<?xml version='1.0' encoding='ASCII'?>
<root><a><b/></a></root>
>>> print(etree.tostring(root, encoding="iso-8859-1").decode("iso-8859-1"))
<?xml version='1.0' encoding='iso-8859-1'?>
<root><a><b/></a></root>
>>> print(etree.tostring(root, pretty_print=True).decode())
<root>
  <a>
    <b/>
  </a>
</root>
Note that pretty printing appends a newline at the end.
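The write() method takes the same keyword arguments as tostring(). The sketch below writes to an in-memory BytesIO buffer instead of a real file, purely to keep the example self-contained:

```python
from io import BytesIO

from lxml import etree

root = etree.XML("<root><a><b/></a></root>")
tree = etree.ElementTree(root)

# Write the whole document to an in-memory file-like object.
buffer = BytesIO()
tree.write(buffer, encoding="UTF-8", xml_declaration=True, pretty_print=True)
data = buffer.getvalue()
print(data.decode("UTF-8"))

# Parsing the written bytes again gives back a tree with the same elements.
reparsed = etree.parse(BytesIO(data)).getroot()
```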
Since lxml 2.0, the serialization functions can do more than XML serialization: you can also serialize to HTML or extract the text content by passing the method keyword.
>>> root = etree.XML("<html><head/><body><p>Hello<br/>World</p></body></html>")
>>> etree.tostring(root)  # default: method="xml"
b'<html><head/><body><p>Hello<br/>World</p></body></html>'
>>> etree.tostring(root, method="xml")  # same as above
b'<html><head/><body><p>Hello<br/>World</p></body></html>'
>>> etree.tostring(root, method="html")
b'<html><head></head><body><p>Hello<br>World</p></body></html>'
>>> print(etree.tostring(root, method="html", pretty_print=True).decode())
<html>
<head></head>
<body><p>Hello<br>World</p></body>
</html>
>>> etree.tostring(root, method="text")
b'HelloWorld'
As with XML serialization, the default encoding for plain text serialization is ASCII:
>>> br = root.find(".//br")
>>> br.tail = u"W\xf6rld"
>>> etree.tostring(root, method="text")  # doctest: +ELLIPSIS
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' ...
>>> etree.tostring(root, method="text", encoding="UTF-8")
b'HelloW\xc3\xb6rld'
>>> etree.tostring(root, encoding="unicode", method="text")  # encoding=unicode on Python 2
'HelloWörld'
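The three serialization methods can be compared side by side on one document. A minimal sketch, assuming a current lxml on Python 3, where encoding="unicode" returns a text string instead of bytes:

```python
from lxml import etree

root = etree.XML("<html><head/><body><p>Hello<br/>W\xf6rld</p></body></html>")

as_xml = etree.tostring(root)                  # default: method="xml", ASCII bytes
as_html = etree.tostring(root, method="html")  # HTML rules for empty tags
as_text = etree.tostring(root, method="text", encoding="unicode")

print(as_xml)
print(as_html)
print(as_text)
```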
The ElementTree class:
An ElementTree is mainly a document wrapper around a tree with a root node. It provides a number of methods for parsing, serialization and general document handling. One of the bigger differences is that it serializes as a complete document, whereas an Element serializes only itself.
>>> from io import BytesIO
>>> tree = etree.parse(BytesIO(b'''\
... <?xml version="1.0"?>
... <!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "eggs"> ]>
... <root>
...   <a>&tasty;</a>
... </root>
... '''))
>>> print(tree.docinfo.doctype)
<!DOCTYPE root SYSTEM "test">
>>> # lxml 1.3.4 and later
>>> print(etree.tostring(tree).decode())
<!DOCTYPE root SYSTEM "test" [
<!ENTITY tasty "eggs">
]>
<root>
  <a>eggs</a>
</root>
>>> # lxml 1.3.4 and later
>>> print(etree.tostring(etree.ElementTree(tree.getroot())).decode())
<!DOCTYPE root SYSTEM "test" [
<!ENTITY tasty "eggs">
]>
<root>
  <a>eggs</a>
</root>
>>> # ElementTree and lxml <= 1.3.3
>>> print(etree.tostring(tree.getroot()).decode())
<root>
  <a>eggs</a>
</root>
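The docinfo object holds the document-level information that a bare Element cannot carry. The sketch below reads a few of its properties and contrasts serializing the tree with serializing only its root element, assuming a reasonably recent lxml:

```python
from io import BytesIO

from lxml import etree

tree = etree.parse(BytesIO(
    b'<?xml version="1.0"?>\n'
    b'<!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "eggs"> ]>\n'
    b'<root><a>&tasty;</a></root>'
))

# The tree keeps document-level information that a bare Element cannot hold.
info = tree.docinfo
print(info.doctype)
print(info.root_name, info.system_url, info.xml_version)

# Serialising the tree keeps the DOCTYPE; serialising the root element does not.
whole = etree.tostring(tree)
root_only = etree.tostring(tree.getroot())
```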
Parsing from strings and files:
fromstring() is the easiest way to parse a string:
>>> some_xml_data = "<root>data</root>"
>>> root = etree.fromstring(some_xml_data)
>>> print(root.tag)
root
>>> etree.tostring(root)
b'<root>data</root>'
The XML() function behaves like fromstring(), but is commonly used to write XML literals right into the source code:
>>> root = etree.XML("<root>data</root>")
>>> print(root.tag)
root
>>> etree.tostring(root)
b'<root>data</root>'
The parse() function is used to parse from files and file-like objects:
>>> from io import BytesIO
>>> some_file_like = BytesIO(b"<root>data</root>")
>>> tree = etree.parse(some_file_like)
>>> etree.tostring(tree)
b'<root>data</root>'
Note that parse() returns an ElementTree object, not an Element object as the string parsing functions do:
>>> root = tree.getroot()
>>> print(root.tag)
root
>>> etree.tostring(root)
b'<root>data</root>'
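parse() also accepts a file name. The following sketch writes a throwaway document to a temporary file (the file name is arbitrary and chosen by tempfile) and parses it from disk:

```python
import os
import tempfile

from lxml import etree

# Write a small document to a temporary file (the file name is arbitrary).
with tempfile.NamedTemporaryFile("wb", suffix=".xml", delete=False) as handle:
    handle.write(b"<root>data</root>")
    path = handle.name

try:
    tree = etree.parse(path)   # parse() also accepts a file name
    root = tree.getroot()      # the ElementTree wraps the root Element
    print(root.tag, root.text)
finally:
    os.remove(path)
```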
Parser objects: by default, lxml.etree uses a standard parser with a default setup. If you want to configure the parser, you can create your own instance:
>>> parser = etree.XMLParser(remove_blank_text=True)  # lxml.etree only!
This creates a parser that removes whitespace-only text between tags while parsing. This can reduce the size of the tree and avoids dangling tail text, if you know that whitespace-only content is not meaningful for your data.
>>> root = etree.XML("<root>  <a/>   <b>  </b>     </root>", parser)
>>> etree.tostring(root)
b'<root><a/><b>  </b></root>'
The whitespace inside the <b> tag was not removed, because content at leaf elements tends to be data. You can remove it in an additional step by traversing the tree:
>>> for element in root.iter("*"):
...     if element.text is not None and not element.text.strip():
...         element.text = None
>>> etree.tostring(root)
b'<root><a/><b/></root>'
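To see what the option changes, the same input can be parsed with the default parser and with a configured one. A small sketch; the variable names are made up for the comparison:

```python
from lxml import etree

xml = "<root>  <a/>   <b>  </b>     </root>"

default_root = etree.XML(xml)
compact_parser = etree.XMLParser(remove_blank_text=True)
compact_root = etree.XML(xml, compact_parser)

# With the default parser the whitespace ends up in text and tail.
print(repr(default_root.text), repr(default_root[0].tail))

# With remove_blank_text=True the whitespace between tags is dropped.
print(repr(compact_root.text), repr(compact_root[0].tail))
```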
Incremental parsing:
lxml.etree provides two ways of parsing incrementally. The first is through a file-like object, whose read() method is called repeatedly:
>>> class DataSource:
...     data = [b"<roo", b"t><", b"a/", b"><", b"/root>"]
...     def read(self, requested_size):
...         try:
...             return self.data.pop(0)
...         except IndexError:
...             return b''
>>> tree = etree.parse(DataSource())
>>> etree.tostring(tree)
b'<root><a/></root>'
The second way is the feed parser interface, provided by the feed(data) and close() methods:
>>> parser = etree.XMLParser()
>>> parser.feed("<roo")
>>> parser.feed("t><")
>>> parser.feed("a/")
>>> parser.feed("><")
>>> parser.feed("/root>")
>>> root = parser.close()
>>> etree.tostring(root)
b'<root><a/></root>'
After calling close() (or after the parser raised an exception), you can reuse the parser by calling its feed() method again:
>>> parser.feed("<root/>")
>>> root = parser.close()
>>> etree.tostring(root)
b'<root/>'
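In practice the chunks usually come from a loop, for example over network reads. The sketch below wraps the feed interface in a hypothetical helper, parse_chunks, and simulates the reads by slicing a byte string into 5-byte pieces:

```python
from lxml import etree


def parse_chunks(chunks):
    """Feed an iterable of byte chunks to a fresh parser and return the root.

    parse_chunks is a hypothetical helper name used for illustration.
    """
    parser = etree.XMLParser()
    for chunk in chunks:
        parser.feed(chunk)
    return parser.close()   # close() finishes parsing and returns the root element


data = b"<root><a/><b>text</b></root>"
# Split the document into 5-byte pieces, as a network read might deliver it.
pieces = [data[i:i + 5] for i in range(0, len(data), 5)]
root = parse_chunks(pieces)
print(etree.tostring(root))
```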