lxml in Python

Author: Shane
Source: http://bluescorpio.cnblogs.com

 

"lxml takes all the pain out of XML."
Stephan Richter

lxml is the most feature-rich and easy-to-use library for processing XML and HTML in Python. It is a Pythonic binding for the C libraries libxml2 and libxslt. What sets it apart is that it combines the speed and feature completeness of these libraries with the simplicity of a native Python API. That API is mostly compatible with the well-known ElementTree API, but extends it.

Install lxml:

Requirement: Python 2.3 or later.

Run the following command as superuser or administrator, using the easy_install tool:

easy_install lxml

On Windows, it is best to pin the version: easy_install lxml==2.2.6

Developing with lxml

The lxml.etree tutorial

lxml.etree is usually imported like this:

>>> from lxml import etree

 

The Element class

An Element is the main container object of the ElementTree API. Most of the XML tree functionality is accessed through this class. Elements are easily created with the Element factory:

>>> root = etree.Element("root")

The XML tag name of an element is accessed through its tag property:

>>> print(root.tag)
root

Elements are organized in an XML tree structure. To create child elements and add them to a parent element, you can use the append() method:

>>> root.append(etree.Element("child1"))

There is also a shorter and more efficient way to do this: the SubElement factory. It accepts the same arguments as the Element factory, but additionally requires the parent as its first argument:

>>> child2 = etree.SubElement(root, "child2")
>>> child3 = etree.SubElement(root, "child3")

To see the resulting XML, serialize the tree with tostring():

>>> print(etree.tostring(root, pretty_print=True).decode())
<root>
  <child1/>
  <child2/>
  <child3/>
</root>
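The steps above can be collected into a small helper. This is only a sketch: build_tree is a hypothetical name, not part of lxml, and it assumes a reasonably recent lxml where tostring() accepts encoding="unicode".

```python
from lxml import etree


def build_tree(child_tags):
    """Return a <root> element with one empty child per tag name."""
    root = etree.Element("root")
    for tag in child_tags:
        # SubElement creates the child and attaches it to root in one step.
        etree.SubElement(root, tag)
    return root


root = build_tree(["child1", "child2", "child3"])
# encoding="unicode" makes tostring() return str instead of bytes.
xml_text = etree.tostring(root, pretty_print=True, encoding="unicode")
print(xml_text)
```

Returning the root element rather than the serialized text keeps the helper reusable: the caller decides when and how to serialize.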

Elements are lists

Elements behave like normal Python lists of their children:

>>> child = root[0]
>>> print(child.tag)
child1

>>> print(len(root))
3

>>> root.index(root[1])  # lxml.etree only!
1

Print all child nodes:

>>> children = list(root)

>>> for child in root:
...     print(child.tag)
child1
child2
child3

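You can use the insert() method to insert a new child node at a given position:

>>> root.insert(0, etree.Element("child0"))

Assigning an existing element to another position moves it rather than copying it. Here child3 replaces child0 and disappears from its old position:

>>> root[0] = root[-1]  # this moves the element in lxml.etree!
>>> for child in root:
...     print(child.tag)
child3
child1
child2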

If you want to copy an element to a different place, create an independent deep copy with the copy module:

>>> from copy import deepcopy
>>> element = etree.Element("neu")
>>> element.append(deepcopy(root[1]))
>>> print(element[0].tag)
child1
>>> print([c.tag for c in root])
['child3', 'child1', 'child2']
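The difference between moving and copying is easy to miss, so here is a small self-contained sketch. The tag names a, b, c and other are made up for illustration.

```python
from copy import deepcopy
from lxml import etree

root = etree.XML("<root><a/><b/><c/></root>")
other = etree.Element("other")

# Appending an element that already has a parent moves it.
other.append(root[0])
after_move = [c.tag for c in root]

# Appending a deep copy leaves the original where it was.
other.append(deepcopy(root[0]))
after_copy = [c.tag for c in root]

other_tags = [c.tag for c in other]
print(after_move, after_copy, other_tags)
```

An element in lxml.etree has exactly one parent, which is why append() has to take it away from the old one.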

getparent() returns the parent node:

>>> root is root[0].getparent()  # lxml.etree only!
True

The siblings (or neighbours) of an element are accessed with getnext() and getprevious():

>>> root[0] is root[1].getprevious()  # lxml.etree only!
True
>>> root[1] is root[0].getnext()  # lxml.etree only!
True
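As an illustration of these navigation methods, the following sketch walks along the siblings of an element. following_siblings is a hypothetical helper written for this example; lxml also offers itersiblings(), which should give the same result.

```python
from lxml import etree


def following_siblings(element):
    """Yield every sibling that comes after element, in document order."""
    sibling = element.getnext()
    while sibling is not None:
        yield sibling
        sibling = sibling.getnext()


root = etree.XML("<root><a/><b/><c/></root>")
tags = [e.tag for e in following_siblings(root[0])]
print(tags)
```

getnext(), getprevious() and getparent() return None when there is nothing to return, so the loop above ends at the last child.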

Elements carry attributes

XML elements support attributes, which can be set directly in the Element factory:

>>> root = etree.Element("root", interesting="totally")
>>> etree.tostring(root)
b'<root interesting="totally"/>'

You can use the get() and set() methods to read and write these attributes:

>>> print(root.get("interesting"))
totally
>>> root.set("interesting", "somewhat")
>>> print(root.get("interesting"))
somewhat

You can also use the dictionary interface of the attrib property:

>>> attributes = root.attrib
>>> print(attributes["interesting"])
somewhat
>>> print(attributes.get("hello"))
None
>>> attributes["hello"] = "Guten Tag"
>>> print(attributes.get("hello"))
Guten Tag
>>> print(root.get("hello"))
Guten Tag
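One detail worth spelling out is that attrib is a live view of the element, not a copy. A short sketch, reusing the attribute names from the session above:

```python
from lxml import etree

root = etree.Element("root", interesting="totally")
root.set("hello", "Guten Tag")

# attrib is a live view: writing to it changes the element itself.
root.attrib["interesting"] = "somewhat"

# dict() takes an independent snapshot that later changes do not affect.
snapshot = dict(root.attrib)
del root.attrib["hello"]

remaining = sorted(root.items())
print(snapshot, remaining)
```

Taking a dict() snapshot is useful when the attributes need to outlive later changes to the tree.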

 

Elements contain text:

>>> root = etree.Element("root")
>>> root.text = "TEXT"
>>> print(root.text)
TEXT
>>> etree.tostring(root)
b'<root>TEXT</root>'

In document-style XML such as (X)HTML, text can also appear between the child elements of a parent:

<html><body>Hello<br/>World</body></html>

Here the <br/> tag is surrounded by text. Elements support this through their tail property, which holds the text that directly follows the element, up to the next element in the XML tree.

>>> html = etree.Element("html")
>>> body = etree.SubElement(html, "body")
>>> body.text = "TEXT"
>>> etree.tostring(html)
b'<html><body>TEXT</body></html>'
>>> br = etree.SubElement(body, "br")
>>> etree.tostring(html)
b'<html><body>TEXT<br/></body></html>'
>>> br.tail = "TAIL"
>>> etree.tostring(html)
b'<html><body>TEXT<br/>TAIL</body></html>'
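To make the text and tail split concrete, this sketch parses the Hello/World snippet and looks at where each piece of text ends up. It assumes lxml's itertext() method and the with_tail keyword of tostring().

```python
from lxml import etree

body = etree.XML("<body>Hello<br/>World</body>")
br = body[0]

# "Hello" belongs to body.text; "World" is the tail of <br/>.
parts = (body.text, br.text, br.tail)

# itertext() yields text and tail content in document order.
joined = "".join(body.itertext())

# Serializing an element includes its tail unless with_tail=False.
with_tail = etree.tostring(br)
without_tail = etree.tostring(br, with_tail=False)
print(parts, joined, with_tail, without_tail)
```

Because the tail travels with the element it follows, moving or removing <br/> also moves or removes "World".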

 

Using XPath to find text

Another way to extract the text content of a tree is XPath, which can also return the separate text chunks as a list:

>>> print(html.xpath("string()"))  # lxml.etree only!
TEXTTAIL
>>> print(html.xpath("//text()"))  # lxml.etree only!
['TEXT', 'TAIL']

If you need this often, you can wrap it in a function:

>>> build_text_list = etree.XPath("//text()")  # lxml.etree only!
>>> print(build_text_list(html))
['TEXT', 'TAIL']

A string result returned by XPath remembers where it came from, so you can ask it for its parent element with getparent():

>>> texts = build_text_list(html)
>>> print(texts[0])
TEXT
>>> parent = texts[0].getparent()
>>> print(parent.tag)
body
>>> print(texts[1])
TAIL
>>> print(texts[1].getparent().tag)
br

You can also find out whether it is normal text content or tail text:

>>> print(texts[0].is_text)
True
>>> print(texts[1].is_text)
False
>>> print(texts[1].is_tail)
True
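Putting getparent(), is_text and is_tail together, the following sketch labels each text chunk with the element it belongs to. The variable names are illustrative only.

```python
from lxml import etree

html = etree.XML("<html><body>TEXT<br/>TAIL</body></html>")
find_text = etree.XPath("//text()")

labelled = []
for chunk in find_text(html):
    # Each result is a string that also knows its origin in the tree.
    kind = "tail" if chunk.is_tail else "text"
    labelled.append((chunk.getparent().tag, kind, str(chunk)))

print(labelled)
```

Note that the parent of a tail string is the element the text follows (here <br/>), not the enclosing <body>.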

 

Tree iteration:

Elements provide a tree iterator that visits the elements of a tree in document order, that is, in the order their tags would appear if the tree were serialized to XML.

>>> root = etree.Element("root")
>>> etree.SubElement(root, "child").text = "Child 1"
>>> etree.SubElement(root, "child").text = "Child 2"
>>> etree.SubElement(root, "another").text = "Child 3"
>>> print(etree.tostring(root, pretty_print=True).decode())
<root>
  <child>Child 1</child>
  <child>Child 2</child>
  <another>Child 3</another>
</root>

>>> for element in root.iter():
...     print("%s - %s" % (element.tag, element.text))
root - None
child - Child 1
child - Child 2
another - Child 3

If you know the tag you are interested in, you can pass its name to iter() to filter the results:

>>> for element in root.iter("child"):
...     print("%s - %s" % (element.tag, element.text))
child - Child 1
child - Child 2

By default, iteration yields all nodes in the tree, including ProcessingInstruction, Comment and Entity instances. To make sure that only Element objects are returned, pass the Element factory as the tag parameter:

>>> root.append(etree.Entity("#234"))
>>> root.append(etree.Comment("some comment"))
>>> for element in root.iter():
...     if isinstance(element.tag, str):
...         print("%s - %s" % (element.tag, element.text))
...     else:
...         print("SPECIAL: %s - %s" % (element, element.text))
root - None
child - Child 1
child - Child 2
another - Child 3
SPECIAL: &#234; - &#234;
SPECIAL: <!--some comment--> - some comment

>>> for element in root.iter(tag=etree.Element):
...     print("%s - %s" % (element.tag, element.text))
root - None
child - Child 1
child - Child 2
another - Child 3
>>> for element in root.iter(tag=etree.Entity):
...     print(element.text)
&#234;
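The filtering options of iter() can be combined in ordinary Python code. The sketch below counts tags with collections.Counter. It assumes a recent lxml in which iter() accepts several tag names at once; the sample document is made up.

```python
from collections import Counter
from lxml import etree

root = etree.XML(
    "<root><child>1</child><!-- note --><child>2</child>"
    "<another>3</another></root>"
)

# Passing the Element factory skips comments and other special nodes.
tag_counts = Counter(el.tag for el in root.iter(etree.Element))

# Several tag names can be given to match any of them.
texts = [el.text for el in root.iter("child", "another")]

# Passing the Comment factory selects only comments.
comments = [c.text for c in root.iter(etree.Comment)]
print(tag_counts, texts, comments)
```

Without the Element filter, the comment would also be yielded and its tag would not be a string, which is why the loop in the session above checks the tag type first.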

 

Serialization:

Serialization commonly uses the tostring() function, which returns a string, or the ElementTree.write() method, which writes to a file, a file-like object, or a URL (via FTP PUT or HTTP POST). Both accept the same keyword arguments, such as pretty_print for formatted output or encoding to select a specific output encoding other than plain ASCII:

>>> root = etree.XML("<root><a><b/></a></root>")
>>> etree.tostring(root)
b'<root><a><b/></a></root>'

>>> print(etree.tostring(root, xml_declaration=True).decode())
<?xml version='1.0' encoding='ASCII'?>
<root><a><b/></a></root>

>>> print(etree.tostring(root, encoding="iso-8859-1").decode("iso-8859-1"))
<?xml version='1.0' encoding='iso-8859-1'?>
<root><a><b/></a></root>

>>> print(etree.tostring(root, pretty_print=True).decode())
<root>
  <a>
    <b/>
  </a>
</root>

Note that pretty printing appends a newline at the end.

Since lxml 2.0, the serialization functions can do more than XML serialization. You can serialize to HTML or extract the text content by passing the method keyword:

>>> root = etree.XML(
...     "<html><head/><body><p>Hello<br/>World</p></body></html>")

>>> etree.tostring(root)  # default: method="xml"
b'<html><head/><body><p>Hello<br/>World</p></body></html>'

>>> etree.tostring(root, method="xml")  # same as above
b'<html><head/><body><p>Hello<br/>World</p></body></html>'

>>> etree.tostring(root, method="html")
b'<html><head></head><body><p>Hello<br>World</p></body></html>'

>>> print(etree.tostring(root, method="html", pretty_print=True).decode())
<html>
<head></head>
<body><p>Hello<br>World</p></body>
</html>

>>> etree.tostring(root, method="text")
b'HelloWorld'

As with XML serialization, the default encoding for plain text serialization is ASCII:

>>> br = root.find(".//br")
>>> br.tail = "W\xf6rld"
>>> etree.tostring(root, method="text")  # doctest: +ELLIPSIS
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' ...
>>> etree.tostring(root, method="text", encoding="UTF-8")
b'HelloW\xc3\xb6rld'

To get a Python text string instead of a byte string, pass "unicode" as the encoding:

>>> etree.tostring(root, encoding="unicode", method="text")
'HelloWörld'
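The effect of the encoding argument on XML output (as opposed to text output) can be compared side by side. This is a minimal sketch with a made-up document; the comments describe the behaviour I would expect from current lxml versions.

```python
from lxml import etree

root = etree.XML("<root><a>W\u00f6rld</a></root>")

# Default ASCII output: non-ASCII characters become character references.
as_ascii = etree.tostring(root)

# UTF-8 output: the character is written as encoded bytes.
as_utf8 = etree.tostring(root, encoding="UTF-8")

# "unicode" output: a str is returned instead of bytes.
as_text = etree.tostring(root, encoding="unicode")

print(as_ascii)
print(as_utf8)
print(as_text)
```

Unlike method="text", XML serialization does not fail on non-ASCII content with the default encoding, because XML can escape such characters.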

The ElementTree class:

An ElementTree is mainly a document wrapper around a tree with a root node. It provides a number of methods for parsing, serialization and general document handling. One of the bigger differences from an Element is that an ElementTree serializes as a complete document, including the DOCTYPE, whereas an Element serializes only itself and its content.

>>> from io import StringIO
>>> tree = etree.parse(StringIO("""\
... <?xml version="1.0"?>
... <!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "eggs"> ]>
... <root>
...   <a>&tasty;</a>
... </root>
... """))
>>> print(tree.docinfo.doctype)
<!DOCTYPE root SYSTEM "test">

>>> # lxml 1.3.4 and later
>>> print(etree.tostring(tree).decode())
<!DOCTYPE root SYSTEM "test" [
<!ENTITY tasty "eggs">
]>
<root>
  <a>eggs</a>
</root>

>>> # lxml 1.3.4 and later
>>> print(etree.tostring(etree.ElementTree(tree.getroot())).decode())
<!DOCTYPE root SYSTEM "test" [
<!ENTITY tasty "eggs">
]>
<root>
  <a>eggs</a>
</root>

>>> # ElementTree and lxml <= 1.3.3
>>> print(etree.tostring(tree.getroot()).decode())
<root>
  <a>eggs</a>
</root>
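The document-level information that an ElementTree carries is available through its docinfo property. The following sketch uses a simplified document without the internal entity; the attribute names root_name, system_url and xml_version are taken from lxml's DocInfo class as I understand it.

```python
from io import BytesIO
from lxml import etree

document = (
    b'<?xml version="1.0"?>\n'
    b'<!DOCTYPE root SYSTEM "test">\n'
    b"<root><a/></root>"
)
tree = etree.parse(BytesIO(document))
info = tree.docinfo

# The tree serializes as a whole document, the root element does not.
whole_document = etree.tostring(tree)
root_only = etree.tostring(tree.getroot())

print(info.root_name, info.system_url, info.xml_version)
print(whole_document)
print(root_only)
```

The DTD named in the DOCTYPE is not loaded here, since the default parser does not fetch external DTDs.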

Parsing from strings and files:

The fromstring() function is the easiest way to parse a string:

>>> some_xml_data = "<root>data</root>"
>>> root = etree.fromstring(some_xml_data)
>>> print(root.tag)
root
>>> etree.tostring(root)
b'<root>data</root>'

The XML() function behaves like fromstring(), but is commonly used to write XML literals directly into the source code:

>>> root = etree.XML("<root>data</root>")
>>> print(root.tag)
root
>>> etree.tostring(root)
b'<root>data</root>'

The parse() function is used to parse from files and file-like objects:

>>> from io import StringIO
>>> some_file_like = StringIO("<root>data</root>")
>>> tree = etree.parse(some_file_like)
>>> etree.tostring(tree)
b'<root>data</root>'

Note that parse() returns an ElementTree object, not an Element object as the string parser functions do:

>>> root = tree.getroot()
>>> print(root.tag)
root
>>> etree.tostring(root)
b'<root>data</root>'
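The different return types, and what happens with malformed input, can be checked directly. A small sketch, assuming etree.iselement() and the XMLSyntaxError exception class from lxml.etree:

```python
from io import BytesIO
from lxml import etree

data = b"<root>data</root>"

element = etree.fromstring(data)    # an Element
tree = etree.parse(BytesIO(data))   # an ElementTree wrapping the root

is_element = (etree.iselement(element), etree.iselement(tree))

# Malformed input raises XMLSyntaxError instead of returning a tree.
try:
    etree.fromstring(b"<root>")     # the tag is never closed
    error = None
except etree.XMLSyntaxError as exc:
    error = exc

print(is_element, type(error).__name__)
```

Catching XMLSyntaxError is usually the right place to report broken input to the user.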

 

Parser objects:

By default, lxml.etree uses a standard parser with a default setup. If you want to configure the parser, you can create your own instance:

>>> parser = etree.XMLParser(remove_blank_text=True)  # lxml.etree only!

This creates a parser that removes empty text between tags while parsing. That can reduce the size of the tree and avoid dangling tail text, provided you know that whitespace-only content is not meaningful for your data.

>>> root = etree.XML("<root>  <a/>   <b>  </b>     </root>", parser)
>>> etree.tostring(root)
b'<root><a/><b>  </b></root>'

The whitespace inside the leaf element <b> is kept, because content of leaf elements tends to be data, even when blank. You can remove it in an additional step:

>>> for element in root.iter("*"):
...     if element.text is not None and not element.text.strip():
...         element.text = None
>>> etree.tostring(root)
b'<root><a/><b/></root>'
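To see what remove_blank_text changes, the same input can be parsed with the default parser and with a configured one. A minimal sketch; the names kept and stripped are illustrative.

```python
from lxml import etree

source = "<root>  <a/>   <b>  </b>     </root>"

# Default parser: whitespace between tags is kept as text and tail.
kept = etree.XML(source)

# Configured parser: whitespace-only text between tags is dropped.
parser = etree.XMLParser(remove_blank_text=True)
stripped = etree.XML(source, parser)

print(etree.tostring(kept))
print(etree.tostring(stripped))
print(repr(kept[0].tail), repr(stripped[0].tail))
```

Stripping blank text at parse time also lets pretty_print re-indent the tree later, since existing whitespace no longer gets in the way.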

Incremental parsing:

lxml.etree provides two ways of parsing incrementally, step by step. One is through file-like objects, where it calls the read() method repeatedly:

>>> class DataSource:
...     data = [b"<roo", b"t><", b"a/", b"><", b"/root>"]
...     def read(self, requested_size):
...         try:
...             return self.data.pop(0)
...         except IndexError:
...             return b''

>>> tree = etree.parse(DataSource())
>>> etree.tostring(tree)
b'<root><a/></root>'

The second way is the feed parser interface, given by the feed(data) and close() methods:

>>> parser = etree.XMLParser()
>>> parser.feed("<roo")
>>> parser.feed("t><")
>>> parser.feed("a/")
>>> parser.feed("><")
>>> parser.feed("/root>")

>>> root = parser.close()
>>> etree.tostring(root)
b'<root><a/></root>'

After calling the close() method (or when the parser has raised an exception), you can reuse the parser by calling its feed() method again:

>>> parser.feed("<root/>")
>>> root = parser.close()
>>> etree.tostring(root)
b'<root/>'
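The feed interface fits naturally into a loop over incoming chunks, for example data read from a socket. The sketch below wraps it in a hypothetical helper named parse_chunks, which is not part of lxml.

```python
from lxml import etree


def parse_chunks(chunks):
    """Feed byte chunks to a fresh parser and return the root element."""
    parser = etree.XMLParser()
    for chunk in chunks:
        parser.feed(chunk)
    # close() finishes parsing and hands back the root element.
    return parser.close()


root = parse_chunks([b"<roo", b"t><", b"a/", b"><", b"/root>"])
print(etree.tostring(root))
```

Creating a new parser per call keeps the helper free of shared state, at the cost of not reusing the parser object.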

 

 
