XPath Explained: A Parsing Library for Python 3 Crawlers

Source: Internet
Author: User
Tags: xpath

This article introduces XPath, a parsing tool used when writing crawlers in Python 3. We hope it is helpful to everyone learning Python crawlers.

XPath:

Its full name is XML Path Language. It is a language for finding information in XML documents, and it works equally well on HTML documents.

1. XPath common rules

Expression    Description

nodename      Select all child nodes with this name
/             Select a direct child node from the current node
//            Select descendant nodes from the current node
.             Select the current node
..            Select the parent node of the current node
@             Select attributes
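
The rules above can be tried out on a small document. The following is a minimal sketch, assuming lxml is installed; the markup and the variable names are hypothetical and exist only for this illustration.

```python
from lxml import etree

# Hypothetical sample document, used only for this illustration.
doc = etree.HTML('<div><ul><li class="ex1"><a href="ex1.html">ex1</a></li></ul></div>')

li = doc.xpath('//li')[0]           # //  : any descendant named li
links = li.xpath('./a')             # .   : current node, /  : direct child named a
parent_tag = li.xpath('..')[0].tag  # ..  : parent of the current node
classes = li.xpath('@class')        # @   : attribute of the current node

print(len(links), parent_tag, classes)
```

Calling xpath() on an element rather than on the whole tree makes that element the current node, which is why . and .. are resolved relative to the li here.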

2. Preparatory work: Installing the lxml library
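
lxml is published on PyPI and is usually installed with pip (pip install lxml). A quick way to confirm the installation worked is to import the package and print its version. This is only a sketch; the variable name is an assumption made for the illustration.

```python
from lxml import etree

# LXML_VERSION is a tuple of integers describing the installed lxml release.
version = etree.LXML_VERSION
print("lxml version:", ".".join(str(part) for part in version))
```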

3. Example:

from lxml import etree

text = """
<div>
<ul>
<li class="ex1"><a href="ex1.html">ex1</a></li>
<li class="ex2"><a href="ex2.html">ex2</a>
</ul>
</div>
"""
html = etree.HTML(text)  # parse the string with lxml's HTML parser
r = etree.tostring(html)  # serialize the repaired tree: the missing </li> and the html/body nodes are filled in
print(r.decode('utf-8'))  # tostring() returns bytes, so decode them to a str

4. All nodes

Select all nodes:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
res = html.xpath('//*')  # select every node in the document
print(res)
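
The examples that call etree.parse read a file named ./test.html, which the article does not show. The following is a minimal sketch of a self-contained run, assuming the file holds the same markup as the example in section 3. The temporary directory and the variable names are assumptions made for the illustration.

```python
import os
import tempfile

from lxml import etree

# Assumed content of test.html: the markup from the section 3 example.
sample = """<div>
<ul>
<li class="ex1"><a href="ex1.html">ex1</a></li>
<li class="ex2"><a href="ex2.html">ex2</a>
</ul>
</div>
"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.html')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(sample)
    html = etree.parse(path, etree.HTMLParser())
    # Collect tag names so the result is easier to read than a list of Element objects
    tags = [el.tag for el in html.xpath('//*')]

print(tags)
```

The list includes html and body even though the file contains neither, because the HTML parser adds them while repairing the document.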

5. Child nodes

Select all a nodes that are direct children of li nodes:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
res = html.xpath('//li/a')  # a nodes directly under any li node
print(res)

6. Parent node

Use .. or the parent:: axis to select the parent node.
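
A parent node can be reached either with .. or with the parent:: axis. The following is a minimal sketch, assuming the same markup as the earlier examples: it finds an a node by its href, steps up to the enclosing li, and reads that li's class. The variable names are hypothetical.

```python
from lxml import etree

text = """
<div>
<ul>
<li class="ex1"><a href="ex1.html">ex1</a></li>
<li class="ex2"><a href="ex2.html">ex2</a></li>
</ul>
</div>
"""
html = etree.HTML(text)

# Step from the matching a node up to its parent li, then read the class attribute
via_dots = html.xpath('//a[@href="ex2.html"]/../@class')
via_axis = html.xpath('//a[@href="ex2.html"]/parent::*/@class')

print(via_dots, via_axis)
```

Both expressions select the same node; parent::* is simply the long form of the .. shorthand.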

7. Attribute matching

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
res = html.xpath('//li[@class="ex1"]')  # li nodes whose class attribute equals ex1
print(res)

8. Text retrieval

Select the text inside the li node. There are two methods; the second is recommended.

A.

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
res = html.xpath('//li[@class="ex1"]/a/text()')  # text directly inside the a node
print(res)

B. Recommended, because it also returns text from every nested node

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
res = html.xpath('//li[@class="ex1"]//text()')  # text of the li node and all of its descendants
print(res)
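
The difference between the two methods shows up when the li node contains text of its own in addition to the a node. The following is a minimal sketch under that assumption; the markup and variable names are hypothetical and were chosen to make the difference visible.

```python
from lxml import etree

# Hypothetical markup: the li holds its own text as well as an a node.
text = '<ul><li class="ex1">intro <a href="ex1.html">ex1</a> tail</li></ul>'
html = etree.HTML(text)

direct = html.xpath('//li[@class="ex1"]/a/text()')     # text of the a node only
everything = html.xpath('//li[@class="ex1"]//text()')  # text of the li and all descendants

print(direct)
print([t.strip() for t in everything])
```

Method A misses the text that sits outside the a node, while method B returns every text fragment in document order. Method B can also return whitespace-only strings when tags are separated by line breaks, so the results often need stripping or filtering.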

9. Property Acquisition

Get the href attribute of every a node directly under an li node:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
res = html.xpath('//li/a/@href')  # returns the attribute values as strings
print(res)

10. Attribute multi-value matching

from lxml import etree

text = """
<div>
<ul>
<li class="li li-first"><a href="ex1.html">li1</a></li>
</ul>
</div>
"""
html = etree.HTML(text)
res = html.xpath('//li[contains(@class, "li")]/a/text()')  # class has two values, so test with contains()
print(res)

Note:

contains() takes the attribute (here @class) as its first argument and the value to look for as its second. It is a substring test, so "li" would also match a class such as "list".
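
To see why contains() is needed when an attribute holds several values, compare it with an exact match on the same markup. This is a minimal sketch; the variable names are hypothetical.

```python
from lxml import etree

text = '<ul><li class="li li-first"><a href="ex1.html">li1</a></li></ul>'
html = etree.HTML(text)

# An exact comparison fails because the attribute value is "li li-first", not "li"
exact = html.xpath('//li[@class="li"]/a/text()')
# contains() succeeds because "li" occurs inside the attribute value
partial = html.xpath('//li[contains(@class, "li")]/a/text()')

print(exact, partial)
```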

11. Multi-Attribute matching

Determine a node based on multiple attributes:

from lxml import etree

text = """
<div>
<ul>
<li class="li" name="123"><a href="ex1.html">ex1</a></li>
</ul>
</div>
"""
html = etree.HTML(text)
# Combine both conditions with the and operator inside one predicate
res = html.xpath('//li[contains(@class, "li") and @name="123"]/a/text()')
print(res)

12. Sequential selection (multiple nodes)

from lxml import etree

text = """
<div>
<ul>
<li class="ex1"><a href="ex1.html">ex1</a></li>
<li class="ex2"><a href="ex2.html">ex2</a></li>
<li class="ex3"><a href="ex3.html">ex3</a></li>
</ul>
</div>
"""
html = etree.HTML(text)
res = html.xpath('//li[1]/a/text()')  # first li
res = html.xpath('//li[last()]/a/text()')  # last li
res = html.xpath('//li[position() < 3]/a/text()')  # first two li nodes
res = html.xpath('//li[last() - 2]/a/text()')  # third li from the end, which is the first li here

Note:

Positions start from 1, not 0.
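
Because positions are 1-based, an index of 0 never matches anything. The following is a minimal sketch on the same three-item list, with the results kept in separate, hypothetically named variables so they can be compared.

```python
from lxml import etree

text = """
<ul>
<li class="ex1"><a href="ex1.html">ex1</a></li>
<li class="ex2"><a href="ex2.html">ex2</a></li>
<li class="ex3"><a href="ex3.html">ex3</a></li>
</ul>
"""
html = etree.HTML(text)

zero = html.xpath('//li[0]/a/text()')                     # no node has position 0
first = html.xpath('//li[1]/a/text()')                    # first li
first_two = html.xpath('//li[position() < 3]/a/text()')   # positions 1 and 2
last = html.xpath('//li[last()]/a/text()')                # last li

print(zero, first, first_two, last)
```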

13. Node Axis selection

from lxml import etree

text = """
<div>
<ul>
<li class="ex1"><a href="ex1.html">ex1</a></li>
<li class="ex2"><a href="ex2.html">ex2</a></li>
<li class="ex3"><a href="ex3.html">ex3</a></li>
</ul>
</div>
"""
html = etree.HTML(text)
res = html.xpath('//li[1]/ancestor::*')  # all ancestor nodes
res = html.xpath('//li[1]/ancestor::div')  # only the ancestor div node
res = html.xpath('//li[1]/attribute::*')  # all attribute values
res = html.xpath('//li[1]/child::a[@href="ex1.html"]')  # direct a child whose href is ex1.html
res = html.xpath('//li[1]/descendant::span')  # descendant span nodes (none in this sample, so the list is empty)
res = html.xpath('//li[1]/following::*[2]')  # second node after the current node in document order
res = html.xpath('//li[1]/following-sibling::*')  # all sibling nodes after the current node

Note: these are all axes:

ancestor, attribute, child, descendant, following, following-sibling
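
Printing element objects directly is hard to read, so it can help to reduce each axis result to tag names or attribute values. The following is a minimal sketch on the same three-item list; the variable names are hypothetical.

```python
from lxml import etree

text = """
<div>
<ul>
<li class="ex1"><a href="ex1.html">ex1</a></li>
<li class="ex2"><a href="ex2.html">ex2</a></li>
<li class="ex3"><a href="ex3.html">ex3</a></li>
</ul>
</div>
"""
html = etree.HTML(text)

# Ancestors come back in document order, outermost first
ancestors = [el.tag for el in html.xpath('//li[1]/ancestor::*')]
# attribute::* yields attribute values as strings
attributes = html.xpath('//li[1]/attribute::*')
# Later siblings of the first li, identified by their class attribute
siblings = [el.get('class') for el in html.xpath('//li[1]/following-sibling::*')]

print(ancestors, attributes, siblings)
```

The html and body entries appear among the ancestors because etree.HTML wraps the fragment in a complete document.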

