Basic use of XPath in Python

Last Update:2018-08-23 Source: Internet

Author: User

Tags xpath

Preface: in the previous article we used requests to carry out some small crawler experiments. To move further into crawler development, it is essential to understand some methods of parsing web pages, so in this article we will learn the basic use of the lxml.etree module together.

Tip: the author's system is Windows 10, and the Python version is 3.6.5.

First, Introduction to XPath

To understand XPath, we first need to know what an XML document is. Put simply, an XML document is a tree made up of a series of nodes.

Common node types in an XML document, using a simple HTML page as the example, are:

   Root node: html
   Element nodes: html, body, div, p, a
   Attribute node: href
   Text nodes: Hello World, Click here
   
 
Common relationships between nodes in an XML document are:

   Parent-child: p and a are child nodes of div, and div is the parent node of p and a
   Sibling: p and a are sibling nodes
   Ancestor/descendant: body, div, p, and a are descendants of html, and html is the ancestor of body, div, p, and a
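
The relationships listed above can be checked directly with lxml. The following is only a minimal sketch: the string `doc` is an assumed reconstruction built from the node names in the lists above, and the `href` value `/home` is made up for illustration.

```python
from lxml import etree

# Assumed sample document, reconstructed from the node names listed above.
# The href value "/home" is a placeholder, not taken from the article.
doc = """
<html>
  <body>
    <div>
      <p>Hello World</p>
      <a href="/home">Click here</a>
    </div>
  </body>
</html>
"""

root = etree.fromstring(doc)          # root element: html
div = root.find("body/div")
p, a = div                            # iterating an element yields its child elements

# Parent-child: div is the parent of both p and a
print(p.getparent().tag, a.getparent().tag)      # div div

# Sibling: a is the next sibling of p
print(p.getnext().tag)                           # a

# Ancestor/descendant: walk upward from a to the root
print([e.tag for e in a.iterancestors()])        # ['div', 'body', 'html']

# Attribute node and text nodes
print(a.get("href"), p.text, a.text)             # /home Hello World Click here
```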
   
 
XPath is a language for locating parts of an XML document; its full name is XML Path Language. For parsing web pages, XPath is more convenient and concise than regular expressions. In Python, the etree module of the third-party lxml library can be used to evaluate XPath expressions, and we can install it with the following command:
$ pip install lxml
Second, the basic use of XPath
1. Importing the module
>>> from lxml import etree
Here, for the sake of brevity, we construct a simple XML document ourselves.
>>> sc = '''
2. Constructing _Element objects
# The HTML() function builds an _Element object and automatically completes incomplete markup
>>> html = etree.HTML(sc)
# Check the constructed object
>>> type(html)
<class 'lxml.etree._Element'>
# Check the completed markup. Note that tostring() serializes the _Element object to a bytes string, and decode('utf-8') converts the bytes string to a str string
>>> print(etree.tostring(html).decode('utf-8'))
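
To see the automatic completion in isolation, here is a minimal sketch. The `fragment` string is a hypothetical, deliberately incomplete snippet and is not the article's `sc` document.

```python
from lxml import etree

# Hypothetical incomplete fragment: no <html>/<body>, and the tags are never closed.
fragment = '<div><a href="image1.html">Image1'

html = etree.HTML(fragment)

print(type(html))                 # <class 'lxml.etree._Element'>
print(html.tag)                   # html

# tostring() returns bytes; decode() turns it into str
completed = etree.tostring(html).decode('utf-8')
print(completed)
```

The printed markup should show the fragment wrapped in `html` and `body` elements, with the `a` and `div` tags closed by the parser.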
3. Matching Results
You can use the xpath() method to match. Note that the method returns a list of matches: when the expression selects elements, each item in the list is an _Element object, and when it selects text or attribute values, each item is a string.
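
The type of the result depends on the expression. The sketch below illustrates this with an assumed sample document (shortened to two links, chosen to resemble the outputs shown in this article); `count()` and `boolean()` are standard XPath 1.0 functions.

```python
from lxml import etree

# Assumed sample document; not the article's original sc string.
sc = '''
<html>
  <head><title>Example website</title></head>
  <body>
    <div>
      <a href="image1.html">Image1</a>
      <a href="image2.html">Image2</a>
    </div>
  </body>
</html>
'''
html = etree.HTML(sc)

elements = html.xpath('//a')            # list of _Element objects
strings = html.xpath('//a/text()')      # list of str-like objects
number = html.xpath('count(//a)')       # float
flag = html.xpath('boolean(//img)')     # bool

print(elements)
print(strings)   # ['Image1', 'Image2']
print(number)    # 2.0
print(flag)      # False
```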
(1) / selects children: E1/E2 selects the E2 nodes among the children of E1, and /E selects the E node among the children of the document root
>>> test = html.xpath('/html/body/div/a')
>>> print(test)
[<Element a at 0x3843bc0>, <Element a at 0x3843c10>, <Element a at 0x3843c38>, <Element a at 0x3843c60>, <Element a at 0x3843c88>]
(2) // selects descendants: E1//E2 selects the E2 nodes among the descendants of E1, and //E selects the E nodes anywhere in the document
>>> test = html.xpath('//a')
>>> print(test)
[<Element a at 0x3843bc0>, <Element a at 0x3843c10>, <Element a at 0x3843c38>, <Element a at 0x3843c60>, <Element a at 0x3843c88>]
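
In the article's document both expressions return the same five elements, so the difference between / and // is easy to miss. The following sketch uses a hypothetical document in which one `a` is nested one level deeper, and returns the `href` values to make the results easy to compare.

```python
from lxml import etree

# Hypothetical document in which one <a> is nested deeper than the other.
sc = '''
<html><body>
  <div>
    <a href="image1.html">Image1</a>
    <p><a href="image2.html">Image2</a></p>
  </div>
</body></html>
'''
html = etree.HTML(sc)

children = html.xpath('/html/body/div/a/@href')      # direct children of div only
descendants = html.xpath('/html/body/div//a/@href')  # any depth below div

print(children)      # ['image1.html']
print(descendants)   # ['image1.html', 'image2.html']
```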
(3) * matches any element node: E/* selects all element nodes among the children of E
>>> test = html.xpath('/html/*')
>>> print(test)
[<Element head at 0x3843be8>, <Element body at 0x3843c10>]
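
As a small illustration of what * does and does not match, the sketch below compares it with `node()` and `@*`, both standard XPath 1.0. The fragment is hypothetical.

```python
from lxml import etree

# Hypothetical fragment with text, an element child, and an attribute.
html = etree.HTML('<div id="box">before<a href="image1.html">Image1</a>after</div>')

elements = html.xpath('//div/*')        # element children only
all_nodes = html.xpath('//div/node()')  # text nodes and element children
attributes = html.xpath('//div/@*')     # attributes are reached with @, not *

print([e.tag for e in elements])   # ['a']
print(len(all_nodes))              # 3
print(attributes)                  # ['box']
```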
(4) text() selects text nodes: E/text() selects the text nodes among the children of E
>>> test = html.xpath('/html/head/title/text()')
>>> print(test)
['Example website']
(5) @ATTR selects an attribute node: E/@ATTR selects the ATTR attribute of the E node
>>> test = html.xpath('//a/@href')
>>> print(test)
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
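
When both the link target and the link text are needed, one option is to evaluate text() and @href relative to each element rather than over the whole document. This is a sketch under assumptions: the document is a made-up variant with one link lacking an `href`, and the name `links` is purely illustrative.

```python
from lxml import etree

# Assumed sample document; the third <a> deliberately has no href.
sc = '''
<html><body><div>
  <a href="image1.html">Image1</a>
  <a href="image2.html">Image2</a>
  <a>No link</a>
</div></body></html>
'''
html = etree.HTML(sc)

links = []
for a in html.xpath('//a'):
    # Expressions that do not start with / are evaluated relative to this element
    hrefs = a.xpath('@href')
    texts = a.xpath('text()')
    links.append((hrefs[0] if hrefs else None, texts[0] if texts else None))

print(links)
# [('image1.html', 'Image1'), ('image2.html', 'Image2'), (None, 'No link')]
```

Pairing the values per element keeps them aligned even when an attribute is missing, which two separate document-wide queries would not guarantee.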
(6) Predicates select specific tags
# Select the second a tag
>>> test = html.xpath('//a[2]')
>>> print(test)
[<Element a at 0x3843c88>]
# Select the first two a tags
>>> test = html.xpath('//a[position()<=2]')
>>> print(test)
[<Element a at 0x3843c60>, <Element a at 0x3843c88>]
# Select a tags that have an href attribute
>>> test = html.xpath('//a[@href]')
>>> print(test)
[<Element a at 0x3843c38>, <Element a at 0x385c300>, <Element a at 0x385c2d8>, <Element a at 0x385c350>, <Element a at 0x385c328>]
# Select a tags whose href attribute equals image1.html
>>> test = html.xpath('//a[@href="image1.html"]')
>>> print(test)
[<Element a at 0x3843c38>]
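
One detail worth knowing is that a positional predicate such as [2] counts among the children of each parent, not across the whole document. The sketch below shows this with a hypothetical document whose links are split over two `div` elements; `last()` and `starts-with()` are standard XPath 1.0 functions that the article does not cover.

```python
from lxml import etree

# Hypothetical document with links spread over two <div> elements.
sc = '''
<html><body>
  <div><a href="image1.html">Image1</a><a href="image2.html">Image2</a></div>
  <div><a href="image3.html">Image3</a><a href="page4.html">Page4</a></div>
</body></html>
'''
html = etree.HTML(sc)

# //a[2]: every <a> that is the second <a> child of its own parent
second_in_parent = html.xpath('//a[2]/@href')
# (//a)[2]: the second <a> in the whole document
second_overall = html.xpath('(//a)[2]/@href')
# last(): the last <a> under each parent
last_in_parent = html.xpath('//a[last()]/@href')
# starts-with(): test an attribute value by prefix
images = html.xpath('//a[starts-with(@href, "image")]/@href')

print(second_in_parent)  # ['image2.html', 'page4.html']
print(second_overall)    # ['image2.html']
print(last_in_parent)    # ['image2.html', 'page4.html']
print(images)            # ['image1.html', 'image2.html', 'image3.html']
```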
4. Common attributes and methods of _Element objects
First we use the xpath() method to get the list of matches, tests; each item in tests is an _Element object
>>> tests = html.xpath('//a')
(1) The tag attribute returns the tag name
>>> for test in tests:
        test.tag
'a'
'a'
'a'
'a'
'a'
(2) The attrib attribute returns a dictionary of attribute names and values
>>> for test in tests:
        test.attrib
{'href': 'image1.html'}
{'href': 'image2.html'}
{'href': 'image3.html'}
{'href': 'image4.html'}
{'href': 'image5.html'}
(3) The get() method returns the value of the specified attribute
>>> for test in tests:
        test.get('href')
'image1.html'
'image2.html'
'image3.html'
'image4.html'
'image5.html'
(4) The text attribute returns the text content
>>> for test in tests:
        test.text
'Image1'
'Image2'
'Image3'
'Image4'
'Image5'
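
Two edge cases are worth a quick look: get() on a missing attribute, and text on an element that contains nested tags. The following sketch uses a hypothetical fragment; `itertext()` is an lxml method not covered in the article.

```python
from lxml import etree

# Hypothetical fragment: the first <a> contains a nested <b>, the second has no href.
html = etree.HTML(
    '<div><a href="image1.html">Name: <b>Image1</b> (thumb)</a><a>No link</a></div>'
)
first, second = html.xpath('//a')

print(first.tag)                    # a
print(dict(first.attrib))           # {'href': 'image1.html'}
print(second.get('href'))           # None, because the attribute is missing
print(second.get('href', '#'))      # #, the fallback passed as second argument

# text holds only the text before the first child element
print(repr(first.text))             # 'Name: '
# itertext() walks all text inside the element, including nested tags and tails
full_text = ''.join(first.itertext())
print(full_text)                    # Name: Image1 (thumb)
```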
Closing words: we have now learned the basic use of the requests and lxml.etree modules. In the next article we will use them for some basic crawler practice. Thank you.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
