Lxml of Python

Last Update:2018-12-05 Source: Internet

Author: User


Author: Shane
Source: http://bluescorpio.cnblogs.com

lxml takes all the pain out of XML.
Stephan Richter

lxml is the most feature-rich and easy-to-use library for working with XML and HTML in Python. It is a Pythonic binding for the C libraries libxml2 and libxslt. What sets it apart is that it combines the speed and feature completeness of these libraries with the simplicity of a native Python API that is mostly compatible with, but superior to, the well-known ElementTree API.

Install lxml:

Requirement: Python 2.3 or later.

Run the following command as superuser or administrator using the easy_install tool:

easy_install lxml

On Windows, it is best to specify the version: easy_install lxml==2.2.6

Developing with lxml

The lxml.etree tutorial

Most work with lxml is done through lxml.etree.

>>> from lxml import etree

The Element class. An Element is the main container class of the ElementTree API. Most of the XML tree functionality is accessed through this class. Elements are easily created with the Element factory:

>>> root = etree.Element("root")

The XML tag name of an element is accessed through the tag attribute.

>>> print(root.tag)
root

Elements are organized in an XML tree structure. To create child elements and add them to a parent element, you can use the append() method.

>>> root.append(etree.Element("child1"))

A more efficient and more common way is the SubElement factory. It accepts the same arguments as the Element factory, but additionally requires the parent as its first argument:

>>> child2 = etree.SubElement(root, "child2")
>>> child3 = etree.SubElement(root, "child3")

You can use the tostring() function to view the resulting XML:

>>> print(etree.tostring(root, pretty_print=True).decode())
<root>
  <child1/>
  <child2/>
  <child3/>
</root>
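The steps above can be put together in one small script. This is a minimal sketch that builds the same three-child tree with both creation styles; the variable name xml_text is chosen here for illustration only.

```python
from lxml import etree

# Build <root><child1/><child2/><child3/></root> with both creation styles.
root = etree.Element("root")
root.append(etree.Element("child1"))   # create first, then attach
etree.SubElement(root, "child2")       # create and attach in one call
etree.SubElement(root, "child3")

# tostring() returns bytes; decode them before printing.
xml_text = etree.tostring(root, pretty_print=True).decode()
print(xml_text)
```

SubElement() saves the separate append() call, which is why it tends to be the more common choice when the parent is already known.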

Elements are lists

>>> child = root[0]
>>> print(child.tag)
child1

>>> print(len(root))
3

>>> root.index(root[1])  # lxml.etree only!
1

Print all child nodes:

>>> children = list(root)

>>> for child in root:
...     print(child.tag)
child1
child2
child3

You can use the insert() method to insert a new child node:

>>> root.insert(0, etree.Element("child0"))

Assigning an existing element to another position moves it. Here the last child replaces child0:

>>> root[0] = root[-1]  # this moves the element!
>>> for child in root:
...     print(child.tag)
child3
child1
child2

If you want to copy an element to a different place, you need to create an independent deep copy.

>>> from copy import deepcopy
>>> element = etree.Element("neu")
>>> element.append(deepcopy(root[1]))
>>> print(element[0].tag)
child1
>>> print([c.tag for c in root])
['child3', 'child1', 'child2']
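The difference between moving and copying is easy to miss, so here is a small self-contained sketch that shows both side by side. The variable names (tags_after_move, tags_after_copy, other) are made up for this illustration.

```python
from copy import deepcopy
from lxml import etree

root = etree.XML("<root><child0/><child1/><child2/><child3/></root>")

# Assigning an existing element to another slot moves it:
# child3 replaces child0 and disappears from its old position.
root[0] = root[-1]
tags_after_move = [c.tag for c in root]

# A deep copy is an independent element, so the original stays in place.
other = etree.Element("neu")
other.append(deepcopy(root[1]))
tags_after_copy = [c.tag for c in root]

print(tags_after_move)
print(tags_after_copy)
```

An element in lxml.etree has exactly one parent, which is why plain assignment or append() relocates it instead of duplicating it.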

getparent() returns the parent node:

>>> root is root[0].getparent()  # lxml.etree only!
True

The siblings (or neighbors) of an element are accessed through the getnext() and getprevious() methods:

>>> root[0] is root[1].getprevious()  # lxml.etree only!
True
>>> root[1] is root[0].getnext()  # lxml.etree only!
True

Elements carry attributes

XML elements support attributes, which can be set directly in the Element factory.

>>> root = etree.Element("root", interesting="totally")
>>> etree.tostring(root)
b'<root interesting="totally"/>'

You can use the set() and get() methods to access these attributes:

>>> print(root.get("interesting"))
totally
>>> root.set("interesting", "somewhat")
>>> print(root.get("interesting"))
somewhat

You can also use the dictionary interface of the attrib property.

>>> attributes = root.attrib
>>> print(attributes["interesting"])
somewhat
>>> print(attributes.get("hello"))
None
>>> attributes["hello"] = "Guten Tag"
>>> print(attributes.get("hello"))
Guten Tag
>>> print(root.get("hello"))
Guten Tag
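Both access styles operate on the same underlying data. The sketch below illustrates that, and also shows one way to keep an independent snapshot of the attributes; the name snapshot is only used for this example.

```python
from lxml import etree

root = etree.Element("root", interesting="totally")

root.set("interesting", "somewhat")     # method-style access
root.attrib["hello"] = "Guten Tag"      # dict-style access, same data

# attrib is a live view of the element; dict() makes an independent copy.
snapshot = dict(root.attrib)
root.set("hello", "changed")

print(sorted(root.items()))
print(snapshot["hello"])
```

Taking a dict() copy is useful when the attributes should be inspected later without being affected by further changes to the element.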

Elements can contain text:

>>> root = etree.Element("root")
>>> root.text = "TEXT"
>>> print(root.text)
TEXT
>>> etree.tostring(root)
b'<root>TEXT</root>'

In document-style XML such as (X)HTML, text can also appear between different elements, right in the middle of the tree:

<html><body>Hello<br/>World</body></html>

Here the <br/> tag is surrounded by text. Elements support this through their tail attribute, which contains the text that directly follows the element, up to the next element in the XML tree.

>>> html = etree.Element("html")
>>> body = etree.SubElement(html, "body")
>>> body.text = "TEXT"
>>> etree.tostring(html)
b'<html><body>TEXT</body></html>'
>>> br = etree.SubElement(body, "br")
>>> etree.tostring(html)
b'<html><body>TEXT<br/></body></html>'
>>> br.tail = "TAIL"
>>> etree.tostring(html)
b'<html><body>TEXT<br/>TAIL</body></html>'
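As a compact illustration of how text and tail divide up mixed content, the following sketch rebuilds the "Hello / World" document mentioned above. The with_tail keyword of tostring() is part of lxml.etree; the variable names are chosen for this example.

```python
from lxml import etree

html = etree.Element("html")
body = etree.SubElement(html, "body")
body.text = "Hello"             # text before the first child of <body>
br = etree.SubElement(body, "br")
br.tail = "World"               # text after <br/>, up to the next element

serialized = etree.tostring(html)

# Serializing a single element includes its tail unless with_tail=False.
br_with_tail = etree.tostring(br)
br_only = etree.tostring(br, with_tail=False)

print(serialized)
```

The text "World" belongs to the br element, not to body, which is why it travels with br when that element is serialized or moved.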

Using XPath to find text

Another way to extract the text content of a tree is XPath, which can also return the separate text chunks as a list:

>>> print(html.xpath("string()"))  # lxml.etree only!
TEXTTAIL
>>> print(html.xpath("//text()"))  # lxml.etree only!
['TEXT', 'TAIL']

If you use this frequently, you can wrap it in a function:

>>> build_text_list = etree.XPath("//text()")  # lxml.etree only!
>>> print(build_text_list(html))
['TEXT', 'TAIL']

The strings returned by XPath remember where they came from, so you can ask them for their parent element with getparent():

>>> texts = build_text_list(html)
>>> print(texts[0])
TEXT
>>> parent = texts[0].getparent()
>>> print(parent.tag)
body
>>> print(texts[1])
TAIL
>>> print(texts[1].getparent().tag)
br

You can also find out whether a result is normal text content or tail text:

>>> print(texts[0].is_text)
True
>>> print(texts[1].is_text)
False
>>> print(texts[1].is_tail)
True
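The pieces above combine naturally into a single pass that records where each text chunk lives. This is only a sketch; find_text and chunks are names invented for the example.

```python
from lxml import etree

html = etree.XML("<html><body>TEXT<br/>TAIL</body></html>")

find_text = etree.XPath("//text()")    # compile once, call many times

# For each chunk record: (text, tag of owning element, "text" or "tail").
chunks = [
    (str(t), t.getparent().tag, "tail" if t.is_tail else "text")
    for t in find_text(html)
]
whole_text = html.xpath("string()")

print(chunks)
print(whole_text)
```

Compiling the expression with etree.XPath() avoids re-parsing the XPath string on every call, which matters mostly when the same query runs against many documents.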

Tree iteration:

Elements provide a tree iterator that walks through all elements of a tree in document order.

>>> root = etree.Element("root")
>>> etree.SubElement(root, "child").text = "Child 1"
>>> etree.SubElement(root, "child").text = "Child 2"
>>> etree.SubElement(root, "another").text = "Child 3"
>>> print(etree.tostring(root, pretty_print=True).decode())
<root>
  <child>Child 1</child>
  <child>Child 2</child>
  <another>Child 3</another>
</root>

>>> for element in root.iter():
...     print("%s - %s" % (element.tag, element.text))
root - None
child - Child 1
child - Child 2
another - Child 3

If you know which tag you are interested in, you can pass its name to the iter() method to filter the results.

>>> for element in root.iter("child"):
...     print("%s - %s" % (element.tag, element.text))
child - Child 1
child - Child 2
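Because iter() returns an ordinary Python iterator, it works well with list comprehensions. A short sketch, using the same sample tree parsed from a string for brevity:

```python
from lxml import etree

root = etree.XML(
    "<root><child>Child 1</child><child>Child 2</child>"
    "<another>Child 3</another></root>"
)

# iter() walks the whole subtree in document order, root included.
all_tags = [el.tag for el in root.iter()]

# Passing a tag name keeps only the matching elements.
child_texts = [el.text for el in root.iter("child")]

print(all_tags)
print(child_texts)
```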

By default, the iterator yields all nodes of a tree, including ProcessingInstruction, Comment, and Entity instances. To make sure that only Element objects are returned, you can pass the Element factory as the tag parameter.

>>> root.append(etree.Entity("#234"))
>>> root.append(etree.Comment("some comment"))
>>> for element in root.iter():
...     if isinstance(element.tag, str):
...         print("%s - %s" % (element.tag, element.text))
...     else:
...         print("SPECIAL: %s - %s" % (element, element.text))
root - None
child - Child 1
child - Child 2
another - Child 3
SPECIAL: &#234; - &#234;
SPECIAL: <!--some comment--> - some comment

>>> for element in root.iter(tag=etree.Element):
...     print("%s - %s" % (element.tag, element.text))
root - None
child - Child 1
child - Child 2
another - Child 3
>>> for element in root.iter(tag=etree.Entity):
...     print(element.text)
&#234;
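The same filtering idea can be used to separate the node types of a tree in one go. A minimal sketch, with a small sample document made up for this example:

```python
from lxml import etree

root = etree.XML("<root><child>Child 1</child><!--some comment--></root>")
root.append(etree.Entity("#234"))

# The factory functions double as filters for iter().
elements = [el.tag for el in root.iter(tag=etree.Element)]
comments = [el.text for el in root.iter(tag=etree.Comment)]
entities = [el.text for el in root.iter(tag=etree.Entity)]

print(elements)
print(comments)
print(entities)
```

Filtering by etree.Element is a simple way to protect code that reads element.tag as a string from tripping over comments or entity references.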

Serialization:

Serialization usually uses the tostring() function, which returns a string, or the ElementTree.write() method, which writes to a file, a file-like object, or a URL (via FTP PUT or HTTP POST). Both accept the same keyword arguments, such as pretty_print for formatted output or encoding to select an output encoding other than plain ASCII.

>>> root = etree.XML("<root><a><b/></a></root>")
>>> etree.tostring(root)
b'<root><a><b/></a></root>'

>>> print(etree.tostring(root, xml_declaration=True).decode())
<?xml version='1.0' encoding='ASCII'?>
<root><a><b/></a></root>

>>> print(etree.tostring(root, encoding="iso-8859-1").decode())
<?xml version='1.0' encoding='iso-8859-1'?>
<root><a><b/></a></root>

>>> print(etree.tostring(root, pretty_print=True).decode())
<root>
  <a>
    <b/>
  </a>
</root>

Note that pretty printing appends a newline at the end.
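To illustrate that tostring() and ElementTree.write() share their keyword arguments, the sketch below serializes the same tree both ways. An in-memory BytesIO buffer stands in for a real file here; that substitution is an assumption made to keep the example self-contained.

```python
from io import BytesIO
from lxml import etree

root = etree.XML("<root><a><b/></a></root>")

# tostring() with an explicit declaration and encoding.
with_decl = etree.tostring(root, xml_declaration=True, encoding="UTF-8")

# ElementTree.write() takes the same keyword arguments as tostring().
buffer = BytesIO()
etree.ElementTree(root).write(buffer, pretty_print=True)
pretty = buffer.getvalue()

print(with_decl.decode())
print(pretty.decode())
```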

Since lxml 2.0, the serialization functions can do more than XML serialization. You can serialize to HTML or extract the text content by passing the method keyword.

>>> root = etree.XML(
...     "<html><head/><body><p>Hello<br/>World</p></body></html>")

>>> etree.tostring(root)  # default: method="xml"
b'<html><head/><body><p>Hello<br/>World</p></body></html>'

>>> etree.tostring(root, method="xml")  # same as above
b'<html><head/><body><p>Hello<br/>World</p></body></html>'

>>> etree.tostring(root, method="html")
b'<html><head></head><body><p>Hello<br>World</p></body></html>'

>>> print(etree.tostring(root, method="html", pretty_print=True).decode())
<html>
<head></head>
<body><p>Hello<br>World</p></body>
</html>

>>> etree.tostring(root, method="text")
b'HelloWorld'

As with XML serialization, the default encoding for plain-text serialization is ASCII:

>>> br = root.find(".//br")
>>> br.tail = "W\xf6rld"
>>> etree.tostring(root, method="text")  # doctest: +ELLIPSIS
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' ...
>>> etree.tostring(root, method="text", encoding="UTF-8")
b'HelloW\xc3\xb6rld'

>>> etree.tostring(root, encoding="unicode", method="text")
'HelloW\xf6rld'
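The effect of the encoding argument can be seen by serializing one non-ASCII document in several ways. This is a sketch under the assumption of Python 3, where encoding="unicode" yields a str instead of bytes.

```python
from lxml import etree

root = etree.XML("<html><body><p>Hello<br/>W\u00f6rld</p></body></html>")

# Text serialization needs an encoding that can represent the characters.
utf8_bytes = etree.tostring(root, method="text", encoding="UTF-8")
as_str = etree.tostring(root, method="text", encoding="unicode")

# XML output in the default ASCII encoding still works, because
# non-ASCII characters are written as character references.
ascii_xml = etree.tostring(root)

print(utf8_bytes)
print(as_str)
print(ascii_xml)
```

Plain text has no escape mechanism comparable to XML character references, which is why only the text method fails with the ASCII default.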

The ElementTree class:

An ElementTree is mainly a document wrapper around a tree with a root node. It provides a number of methods for parsing, serialization, and general document handling. One of the bigger differences is that it serializes as a complete document, whereas an Element serializes as a single element only.

>>> from io import StringIO
>>> tree = etree.parse(StringIO("""\
... <?xml version="1.0"?>
... <!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "eggs"> ]>
... <root>
...   <a>&tasty;</a>
... </root>
... """))
>>> print(tree.docinfo.doctype)
<!DOCTYPE root SYSTEM "test">

>>> # lxml 1.3.4 and later
>>> print(etree.tostring(tree).decode())
<!DOCTYPE root SYSTEM "test" [
<!ENTITY tasty "eggs">
]>
<root>
  <a>eggs</a>
</root>

>>> # lxml 1.3.4 and later
>>> print(etree.tostring(etree.ElementTree(tree.getroot())).decode())
<!DOCTYPE root SYSTEM "test" [
<!ENTITY tasty "eggs">
]>
<root>
  <a>eggs</a>
</root>

>>> # ElementTree and lxml <= 1.3.3
>>> print(etree.tostring(tree.getroot()).decode())
<root>
  <a>eggs</a>
</root>
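The document-versus-element distinction can be checked directly by comparing the two serializations. A minimal sketch, assuming a current lxml release and using a BytesIO buffer in place of a file:

```python
from io import BytesIO
from lxml import etree

document = b"""<?xml version="1.0"?>
<!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "eggs"> ]>
<root>
  <a>&tasty;</a>
</root>
"""
tree = etree.parse(BytesIO(document))

whole_document = etree.tostring(tree)          # serialized as a document
root_only = etree.tostring(tree.getroot())     # serialized as one element

print(tree.docinfo.doctype)
print(b"<!DOCTYPE" in whole_document, b"<!DOCTYPE" in root_only)
```

Serializing the ElementTree is the safer choice when the DOCTYPE or other document-level content has to survive a round trip.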

Parsing from strings and files:

fromstring() is the easiest way to parse a string:

>>> some_xml_data = "<root>data</root>"
>>> root = etree.fromstring(some_xml_data)
>>> print(root.tag)
root
>>> etree.tostring(root)
b'<root>data</root>'

The XML() function behaves like fromstring(), but is typically used to embed XML literals directly in source code.

>>> root = etree.XML("<root>data</root>")
>>> print(root.tag)
root
>>> etree.tostring(root)
b'<root>data</root>'

The parse() function is used to parse from files and file-like objects.

>>> from io import StringIO
>>> some_file_like = StringIO("<root>data</root>")
>>> tree = etree.parse(some_file_like)
>>> etree.tostring(tree)
b'<root>data</root>'

Note that parse() returns an ElementTree object, not an Element object as the string parsing functions do.

>>> root = tree.getroot()
>>> print(root.tag)
root
>>> etree.tostring(root)
b'<root>data</root>'
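Since the string functions return an Element while parse() returns an ElementTree, code that accepts both kinds of input usually normalizes the result. The helper below, load_root, is hypothetical and not part of lxml; it assumes that any str or bytes argument holds XML content rather than a file name.

```python
from io import BytesIO
from lxml import etree

def load_root(source):
    """Hypothetical helper: return the root Element for XML data or a file-like object."""
    if isinstance(source, (bytes, str)):
        # Assumption: strings and bytes contain XML, not a path.
        return etree.fromstring(source)
    return etree.parse(source).getroot()

data = b"<root>data</root>"
root_from_string = load_root(data)
root_from_file = load_root(BytesIO(data))

print(root_from_string.tag, root_from_file.text)
```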

Parser objects: By default, lxml.etree uses a standard parser with a default configuration. To configure the parser, create your own instance:

>>> parser = etree.XMLParser(remove_blank_text=True)  # lxml.etree only!

This creates a parser that removes whitespace-only text between tags while parsing. That reduces the size of the tree and avoids dangling tail text, provided you know that whitespace-only content is not meaningful for your data.

>>> root = etree.XML("<root>  <a/>   <b>  </b>     </root>", parser)
>>> etree.tostring(root)
b'<root><a/><b>  </b></root>'

The whitespace inside the <b> tag was not removed, because content in leaf elements tends to be data, even when it is blank. You can remove it in an additional step by traversing the tree:

>>> for element in root.iter("*"):
...     if element.text is not None and not element.text.strip():
...         element.text = None
>>> etree.tostring(root)
b'<root><a/><b/></root>'
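The two cleanup steps can be bundled into one function. parse_compact is a hypothetical name used only for this sketch; remove_blank_text itself is a real XMLParser option.

```python
from lxml import etree

def parse_compact(xml_text):
    """Hypothetical helper: parse XML and drop all whitespace-only text."""
    parser = etree.XMLParser(remove_blank_text=True)
    root = etree.XML(xml_text, parser)
    # The parser keeps blank text inside leaf elements, so clear it here.
    for element in root.iter("*"):
        if element.text is not None and not element.text.strip():
            element.text = None
    return root

source = "<root>  <a/>   <b>  </b>     </root>"
default_output = etree.tostring(etree.XML(source))
compact_output = etree.tostring(parse_compact(source))

print(default_output)
print(compact_output)
```

This should only be applied to data-style XML; in mixed-content documents, whitespace between elements can be significant.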

Incremental parsing:

lxml.etree provides two ways to parse incrementally, step by step. The first is through a file-like object whose read() method is called repeatedly.

>>> class DataSource:
...     data = [b"<roo", b"t><", b"a/", b"><", b"/root>"]
...     def read(self, requested_size):
...         try:
...             return self.data.pop(0)
...         except IndexError:
...             return b''
>>> tree = etree.parse(DataSource())
>>> etree.tostring(tree)
b'<root><a/></root>'

The second way is the feed parser interface, provided by the feed(data) and close() methods.

>>> parser = etree.XMLParser()
>>> parser.feed("<roo")
>>> parser.feed("t><")
>>> parser.feed("a/")
>>> parser.feed("><")
>>> parser.feed("/root>")

>>> root = parser.close()
>>> etree.tostring(root)
b'<root><a/></root>'

After calling close() (or after the parser has raised an exception), you can reuse the parser by calling feed() again:

>>> parser.feed("<root/>")
>>> root = parser.close()
>>> etree.tostring(root)
b'<root/>'
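In practice the chunks usually come from a loop over a network stream or a large file. The sketch below wraps the feed interface in a small function; parse_chunks is a hypothetical name, and the list of byte strings stands in for a real data source.

```python
from lxml import etree

def parse_chunks(chunks):
    """Hypothetical helper: feed byte chunks to a fresh parser, return the root."""
    parser = etree.XMLParser()
    for chunk in chunks:
        parser.feed(chunk)
    # close() finishes parsing and hands back the root element.
    return parser.close()

root = parse_chunks([b"<roo", b"t><", b"a/", b"><", b"/root>"])
print(etree.tostring(root))
```

The chunk boundaries do not need to line up with tag boundaries; the parser buffers incomplete input until the next feed() call.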

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
