Basic use of XPath in Python

Source: Internet
Author: User
Tags xpath

Foreword: in the previous article we used requests for a few small crawler experiments. To move further into crawler development, we also need to understand some ways of parsing web pages, so this article covers the basic use of the lxml.etree module.

Tip: the author uses Windows 10 and Python 3.6.5.

First, Introduction to XPath

To understand XPath, we first need to know what an XML document is. An XML document is simply a tree made up of a series of nodes, for example a small HTML page whose body contains a div, which in turn holds a p element and an a element.

Common node types in an XML document are

    • Root node: html
    • Element nodes: html, body, div, p, a
    • Attribute node: href
    • Text nodes: Hello World, Click here

Common relationships between nodes in an XML document are

    • Parent/child: p and a are child nodes of div, and div is the parent of p and a
    • Sibling: p and a are sibling nodes
    • Ancestor/descendant: body, div, p and a are descendants of html, and html is an ancestor of body, div, p and a
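
The relationships listed above can be checked directly with lxml. The following is a minimal sketch: the markup is a hypothetical document built only from the node names in the lists (the `/home` link target is an assumption), and it uses lxml's element navigation methods rather than XPath.

```python
from lxml import etree

# Hypothetical document that mirrors the node names listed above;
# the href value "/home" is assumed for illustration.
doc = etree.fromstring(
    '<html><body><div>'
    '<p>Hello World</p>'
    '<a href="/home">Click here</a>'
    '</div></body></html>'
)

div = doc.find('body/div')
p, a = div  # iterating an element yields its child elements in order

print(p.getparent().tag)                        # div
print(p.getnext().tag)                          # a
print([e.tag for e in a.iterancestors()])       # ['div', 'body', 'html']
print([e.tag for e in doc.iterdescendants()])   # ['body', 'div', 'p', 'a']
print(a.get('href'), '|', a.text)               # /home | Click here
```

`getparent()`, `getnext()`, `iterancestors()` and `iterdescendants()` correspond to the parent, sibling, ancestor and descendant relationships respectively.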

XPath, whose full name is XML Path Language, is a language for locating parts of an XML document. For web page parsing, XPath is often more convenient and concise than regular expressions. In Python, the etree module of the third-party lxml library can evaluate XPath expressions, and we can install the library with the following command

$ pip install lxml

Second, the basic use of XPath

1. Import the module

>>> from lxml import etree

Here, for the sake of brevity, we construct a simple XML document ourselves.

>>> sc = '''
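
The listing above breaks off before the document body. As a stand-in, the sketch below defines a hypothetical `sc` whose structure is inferred only from the results shown later in this article (a title, and five links nested under `/html/body/div`). The author's original markup may have contained more than this.

```python
from lxml import etree

# Hypothetical stand-in for the sample document: only the title, the five
# links and their nesting are inferred from the outputs shown in this article.
sc = '''
<html>
    <head>
        <title>Example website</title>
    </head>
    <body>
        <div>
            <a href="image1.html">Image1</a>
            <a href="image2.html">Image2</a>
            <a href="image3.html">Image3</a>
            <a href="image4.html">Image4</a>
            <a href="image5.html">Image5</a>
        </div>
    </body>
</html>
'''

html = etree.HTML(sc)
print(html.xpath('/html/head/title/text()'))  # ['Example website']
print(len(html.xpath('/html/body/div/a')))    # 5
```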

2. Constructing _Element objects

# The HTML() function builds an _Element object and automatically completes incomplete markup
>>> html = etree.HTML(sc)
# Check the constructed object
>>> type(html)
<class 'lxml.etree._Element'>
# Check the completed markup. Note that tostring() converts the _Element object to a bytes string, and decode('utf-8') converts that bytes string to a str
>>> print(etree.tostring(html).decode('utf-8'))

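
To make the automatic completion concrete, here is a small sketch that parses a deliberately incomplete fragment. The fragment itself is hypothetical: it has no html or body wrapper, and its p tag is never closed.

```python
from lxml import etree

# Hypothetical fragment: no <html>/<body> wrapper and an unclosed <p>.
fragment = '<div><a href="image1.html">Image1</a><p>unclosed paragraph</div>'

root = etree.HTML(fragment)
print(type(root))   # <class 'lxml.etree._Element'>
print(root.tag)     # html

# tostring() returns bytes by default; decode them to get a str.
markup = etree.tostring(root).decode('utf-8')
print(markup)

# Passing encoding='unicode' returns a str directly.
print(etree.tostring(root, encoding='unicode', pretty_print=True))
```

The parser wraps the fragment in html and body elements and closes the p element, so paths such as `/html/body/div/p` work even though the input never spelled those tags out.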
3. Matching Results

Use the xpath() method to match. Note that the method returns a list of matches: for expressions that select elements, each item in the list is an _Element object, while expressions that select text or attribute values return strings.

(1) / selects child nodes: E1/E2 selects the E2 nodes among the children of E1, and /E selects the E node directly under the document root

>>> test = html.xpath('/html/body/div/a')
>>> print(test)
[<Element a at 0x3843bc0>, <Element a at 0x3843c10>, <Element a at 0x3843c38>, <Element a at 0x3843c60>, <Element a at 0x3843c88>]

(2) // selects descendant nodes: E1//E2 selects the E2 nodes among the descendants of E1, and //E selects the E nodes anywhere in the document

>>> test = html.xpath('//a')
>>> print(test)
[<Element a at 0x3843bc0>, <Element a at 0x3843c10>, <Element a at 0x3843c38>, <Element a at 0x3843c60>, <Element a at 0x3843c88>]
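
The difference between / and // is easiest to see side by side. The sketch below uses a hypothetical sample page (assumed for illustration, not the author's original markup) and compares how many nodes each expression matches.

```python
from lxml import etree

# Hypothetical sample page, assumed for illustration only.
SAMPLE = '''
<html>
  <head><title>Example website</title></head>
  <body>
    <div>
      <a href="image1.html">Image1</a>
      <a href="image2.html">Image2</a>
      <a href="image3.html">Image3</a>
      <a href="image4.html">Image4</a>
      <a href="image5.html">Image5</a>
    </div>
  </body>
</html>
'''
html = etree.HTML(SAMPLE)

by_steps = html.xpath('/html/body/div/a')  # each / moves down exactly one level
anywhere = html.xpath('//a')               # // searches all descendants
skipped = html.xpath('/html/a')            # a is not a direct child of html
mixed = html.xpath('/html/body//a')        # the two can be combined

print(len(by_steps), len(anywhere), len(skipped), len(mixed))  # 5 5 0 5

# xpath() can also be called on any element; a leading . makes the path
# relative to that element instead of the document root.
div = html.xpath('//div')[0]
print(len(div.xpath('./a')))  # 5
```

Because / only steps to direct children, leaving out an intermediate level (as in `/html/a`) gives an empty list rather than an error.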

(3) * is a wildcard that matches any element node: E/* selects all element nodes among the children of E

>>> test = html.xpath('/html/*')
>>> print(test)
[<Element head at 0x3843be8>, <Element body at 0x3843c10>]
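
As a brief sketch of the wildcard, the example below applies * at two levels of a hypothetical page (the markup is assumed for illustration). It also shows `@*`, the analogous XPath wildcard for attributes, which this article does not otherwise cover.

```python
from lxml import etree

# Hypothetical sample page, assumed for illustration only.
html = etree.HTML('''
<html>
  <head><title>Example website</title></head>
  <body>
    <div>
      <a href="image1.html">Image1</a>
      <a href="image2.html">Image2</a>
    </div>
  </body>
</html>
''')

# * matches element children only; the whitespace text between tags is skipped.
print([e.tag for e in html.xpath('/html/*')])   # ['head', 'body']
print([e.tag for e in html.xpath('//div/*')])   # ['a', 'a']

# @* is the attribute counterpart: every attribute of the matched elements.
print(html.xpath('//a/@*'))   # ['image1.html', 'image2.html']
```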

(4) text() selects text nodes: E/text() selects the text nodes among the children of E

>>> test = html.xpath('/html/head/title/text()')
>>> print(test)
['Example website']

(5) @ATTR selects an attribute node: E/@ATTR selects the ATTR attribute of E

>>> test = html.xpath('//a/@href')
>>> print(test)
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
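
Unlike element matches, the results of text() and @ATTR are strings, so they can be used directly without going through _Element objects. A minimal sketch on a hypothetical page (markup assumed for illustration) that pairs each link text with its target:

```python
from lxml import etree

# Hypothetical sample page, assumed for illustration only.
html = etree.HTML('''
<html>
  <head><title>Example website</title></head>
  <body>
    <div>
      <a href="image1.html">Image1</a>
      <a href="image2.html">Image2</a>
      <a href="image3.html">Image3</a>
    </div>
  </body>
</html>
''')

texts = html.xpath('//a/text()')
hrefs = html.xpath('//a/@href')

# Both result lists contain str objects, not _Element objects.
print(all(isinstance(t, str) for t in texts + hrefs))  # True

# Pair each link text with its target.
links = dict(zip(texts, hrefs))
print(links)
# {'Image1': 'image1.html', 'Image2': 'image2.html', 'Image3': 'image3.html'}
```

Pairing two separate result lists with zip() assumes every a element has both text and an href. When that may not hold, it is safer to select the elements first and read `.text` and `.get('href')` from each one.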

(6) Predicates, written in square brackets, narrow a match to specific tags

# Select the second a tag
>>> test = html.xpath('//a[2]')
>>> print(test)
[<Element a at 0x3843c88>]
# Select the first two a tags
>>> test = html.xpath('//a[position()<=2]')
>>> print(test)
[<Element a at 0x3843c60>, <Element a at 0x3843c88>]
# Select the a tags that have an href attribute
>>> test = html.xpath('//a[@href]')
>>> print(test)
[<Element a at 0x3843c38>, <Element a at 0x385c300>, <Element a at 0x385c2d8>, <Element a at 0x385c350>, <Element a at 0x385c328>]
# Select the a tags whose href attribute equals image1.html
>>> test = html.xpath('//a[@href="image1.html"]')
>>> print(test)
[<Element a at 0x3843c38>]

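
One detail of positional predicates is worth a sketch: in XPath, `//a[2]` means "every a that is the second a child of its own parent", not "the second a in the whole document". With a single div the two readings agree, but they differ once links are spread over several parents. The markup below is hypothetical, and `texts` is a helper introduced only for this example.

```python
from lxml import etree

# Hypothetical markup with two groups of links, assumed for illustration.
doc = etree.HTML('''
<div id="first">
  <a href="image1.html">Image1</a>
  <a href="image2.html">Image2</a>
  <a>No link</a>
</div>
<div id="second">
  <a href="image3.html">Image3</a>
  <a href="image4.html">Image4</a>
</div>
''')

def texts(expr):
    """Hypothetical helper: return the text of every element matched by expr."""
    return [a.text for a in doc.xpath(expr)]

print(texts('//a[2]'))                     # ['Image2', 'Image4']
print(texts('(//a)[2]'))                   # ['Image2']
print(texts('//a[position()<=2]'))         # ['Image1', 'Image2', 'Image3', 'Image4']
print(texts('//a[@href]'))                 # ['Image1', 'Image2', 'Image3', 'Image4']
print(texts('//a[@href="image1.html"]'))   # ['Image1']
```

Wrapping the path in parentheses, as in `(//a)[2]`, applies the position to the whole result set instead of to each parent.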

4. Common properties and methods of _Element objects

First we use the xpath() method to get a list of matches, tests; each item in tests is an _Element object

>>> tests = html.xpath('//a')

(1) The tag property returns the tag name

>>> for test in tests:
...     test.tag
...
'a'
'a'
'a'
'a'
'a'

(2) The attrib property returns a dictionary of attribute names and values

>>> for test in tests:
...     test.attrib
...
{'href': 'image1.html'}
{'href': 'image2.html'}
{'href': 'image3.html'}
{'href': 'image4.html'}
{'href': 'image5.html'}

(3) The get() method returns the value of the specified attribute

>>> for test in tests:
...     test.get('href')
...
'image1.html'
'image2.html'
'image3.html'
'image4.html'
'image5.html'

(4) The text property returns the text value, that is, the text that appears before the element's first child element

>>> for test in tests:
...     test.text
...
'Image1'
'Image2'
'Image3'
'Image4'
'Image5'
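
The four members above can be tried together on a single element. The sketch below uses a hypothetical link with a second attribute and a nested br element (both assumed for illustration). It shows two edge cases: get() on a missing attribute, and text that follows a child element.

```python
from lxml import etree

# Hypothetical link with an extra attribute and a nested <br>, for illustration.
link = etree.HTML(
    '<a href="image1.html" class="thumb">Image1<br/>caption</a>'
).xpath('//a')[0]

print(link.tag)                   # a
print(dict(link.attrib))          # {'href': 'image1.html', 'class': 'thumb'}
print(link.get('href'))           # image1.html
print(link.get('title'))          # None: the attribute is missing
print(link.get('title', 'n/a'))   # n/a: get() accepts a default value

# .text only covers the text before the first child element ...
print(link.text)                  # Image1
# ... text after a child is stored on that child's .tail,
print(link[0].tail)               # caption
# and itertext() walks all text inside the element in document order.
print(''.join(link.itertext()))   # Image1caption
```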

Afterword: we have now learned the basic use of the requests and lxml.etree modules. In the next article we will use them together for some basic crawler practice. Thank you for reading.
