An Explanation of XPath, a Parsing Tool for Python 3 Crawlers

Last Update:2018-05-04 Source: Internet

Author: User

Tags xpath


This article covers XPath as a parsing tool for Python 3 web crawlers. I hope it is helpful to everyone learning to write crawlers in Python.

XPath:

XPath stands for XML Path Language. It is a language for locating information in XML documents, and it works equally well on HTML documents.

1. Common XPath rules

Expression   Description
nodename     Selects the child elements named nodename of the current node
/            Selects direct children of the current node
//           Selects descendants of the current node
.            Selects the current node
..           Selects the parent of the current node
@            Selects attributes
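
The rules above can be tried out on a small document. This is a minimal sketch, assuming lxml is installed; the sample markup and the variable names are made up for illustration and do not come from the original article.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
doc = etree.HTML(
    '<div id="box"><ul><li class="ex1"><a href="ex1.html">ex1</a></li></ul></div>'
)

ul = doc.xpath('//ul')[0]            # //  : search all descendants of the root
li_children = ul.xpath('li')         # nodename : child elements named li
direct = ul.xpath('./li/a')          # . and / : start at the current node, step to direct children
parent = ul.xpath('..')[0]           # .. : parent of the current node
classes = doc.xpath('//li/@class')   # @  : attribute values

print(len(li_children), direct[0].tag, parent.tag, classes)
# 1 a div ['ex1']
```

Note that an expression without a leading slash, such as 'li' or './li/a', is evaluated relative to the element on which xpath() is called.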

2. Preparation: install the lxml library (for example, with pip install lxml)

3. Example:

from lxml import etree

text = '''
<div>
<ul>
<li class="ex1"><a href="ex1.html">ex1</a></li>
<li class="ex2"><a href="ex2.html">ex2</a>
</ul>
</div>
'''
html = etree.HTML(text)  # parse the string with lxml's HTML parser
r = etree.tostring(html)  # serialize the tree; the parser has repaired the markup, e.g. closed the second li
print(r.decode('utf-8'))  # tostring() returns bytes, so decode it to a str

4. All nodes

Select all nodes:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())  # parse a local HTML file
res = html.xpath('//*')  # * matches every element
print(res)
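
The same selection can be run without a file on disk. The sketch below assumes an inline string standing in for the contents of test.html (the article does not show that file), and prints tag names instead of element objects so the result is easier to read.

```python
from lxml import etree

# Hypothetical stand-in for the contents of test.html.
text = (
    '<div><ul>'
    '<li class="ex1"><a href="ex1.html">ex1</a></li>'
    '<li class="ex2"><a href="ex2.html">ex2</a></li>'
    '</ul></div>'
)
html = etree.HTML(text)

all_tags = [el.tag for el in html.xpath('//*')]  # every element, in document order
li_nodes = html.xpath('//li')                    # only the li elements

print(all_tags)
# ['html', 'body', 'div', 'ul', 'li', 'a', 'li', 'a']
print(len(li_nodes))
# 2
```

The html and body elements appear in the result because etree.HTML() wraps a fragment in a complete document.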

5. Child nodes

Select all a elements that are direct children of li elements:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
res = html.xpath('//li/a')  # / steps to direct children only
print(res)
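
The difference between / and // is easiest to see side by side. This is a small sketch on an assumed inline document (not the article's test.html): the a elements are children of li, not of ul, so a single slash after ul finds nothing.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
text = (
    '<div><ul>'
    '<li class="ex1"><a href="ex1.html">ex1</a></li>'
    '<li class="ex2"><a href="ex2.html">ex2</a></li>'
    '</ul></div>'
)
html = etree.HTML(text)

direct_a = html.xpath('//ul/a')       # a as a direct child of ul: no match
descendant_a = html.xpath('//ul//a')  # a anywhere below ul
child_a = html.xpath('//li/a')        # a as a direct child of li

print(len(direct_a), len(descendant_a), len(child_a))
# 0 2 2
```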

6. Parent node

Use .. to select the parent of the current node; the parent:: axis is equivalent.
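
As an illustration, the sketch below finds an a element by its href and then reads the class of its parent li, once with .. and once with the parent:: axis. The sample markup is an assumption made for this example.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
text = (
    '<div><ul>'
    '<li class="ex1"><a href="ex1.html">ex1</a></li>'
    '<li class="ex2"><a href="ex2.html">ex2</a></li>'
    '</ul></div>'
)
html = etree.HTML(text)

# Locate the a element whose href is ex2.html, then step up to its parent li.
via_dots = html.xpath('//a[@href="ex2.html"]/../@class')
via_axis = html.xpath('//a[@href="ex2.html"]/parent::*/@class')

print(via_dots, via_axis)
# ['ex2'] ['ex2']
```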

7. Attribute matching

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
res = html.xpath('//li[@class="ex1"]')  # li elements whose class attribute is exactly "ex1"
print(res)

8. Getting text

There are two ways to get the text inside an li node; the second is recommended.

A. Text of the a child only

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
res = html.xpath('//li[@class="ex1"]/a/text()')  # text directly inside the a child
print(res)

B. Recommended: returns more complete information

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
res = html.xpath('//li[@class="ex1"]//text()')  # every text node anywhere under the li
print(res)
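
To see why the second form returns more, consider an li that has text of its own next to the link. The markup below is an assumed example, not taken from test.html.

```python
from lxml import etree

# Hypothetical sample: the li holds a link followed by some text of its own.
text = '<ul><li class="ex1"><a href="ex1.html">ex1</a> (first)</li></ul>'
html = etree.HTML(text)

a_text = html.xpath('//li[@class="ex1"]/a/text()')   # text inside the a element only
li_text = html.xpath('//li[@class="ex1"]/text()')    # text directly inside the li only
all_text = html.xpath('//li[@class="ex1"]//text()')  # all text under the li, in document order
joined = ''.join(all_text)

print(a_text, li_text, all_text)
# ['ex1'] [' (first)'] ['ex1', ' (first)']
print(joined)
# ex1 (first)
```

With //text() the result may also include whitespace-only strings when the source markup contains line breaks between tags, so stripping or filtering the pieces before joining them is often useful.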

9. Getting attributes

Get the href attribute of every a node that is a direct child of an li node:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
res = html.xpath('//li/a/@href')  # /@href returns the attribute values themselves
print(res)
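
Note the difference from attribute matching: [@href="..."] inside square brackets filters nodes, while a trailing /@href returns the values. The sketch below, on an assumed inline document, shows the XPath form and an alternative that pairs each link text with its href through lxml's element API (the text property and the get() method).

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
text = (
    '<div><ul>'
    '<li class="ex1"><a href="ex1.html">ex1</a></li>'
    '<li class="ex2"><a href="ex2.html">ex2</a></li>'
    '</ul></div>'
)
html = etree.HTML(text)

hrefs = html.xpath('//li/a/@href')  # attribute values as strings
links = [(a.text, a.get('href')) for a in html.xpath('//li/a')]  # (text, href) pairs

print(hrefs)
# ['ex1.html', 'ex2.html']
print(links)
# [('ex1', 'ex1.html'), ('ex2', 'ex2.html')]
```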

10. Attribute multi-value matching

from lxml import etree

text = '''
<div>
<ul>
<li class="li li-first"><a href="ex1.html">li1</a></li>
</ul>
</div>
'''
html = etree.HTML(text)
# The class attribute holds two values, so @class="li" would not match; use contains() instead
res = html.xpath('//li[contains(@class, "li")]/a/text()')
print(res)

"Note"

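contains() takes two arguments: the first is the attribute (here @class) and the second is the value to look for. It is a substring test, so it matches whenever the attribute value contains that string anywhere.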
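
Because contains() tests for a substring, it can match more than intended. The sketch below uses an assumed document in which one class value is "list"; the last expression is a common XPath 1.0 idiom for matching a whole class token and is offered as one possible approach, not as something the original article uses.

```python
from lxml import etree

# Hypothetical sample: "list" contains the substring "li" but is a different class.
text = '<ul><li class="li li-first">a</li><li class="list">b</li></ul>'
html = etree.HTML(text)

# Exact comparison: no li has the full class value "li".
exact = html.xpath('//li[@class="li"]/text()')

# Substring test: matches "li li-first" and also "list".
substring = html.xpath('//li[contains(@class, "li")]/text()')

# Whole-token test: pad the value with spaces and look for " li ".
token = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/text()'
)

print(exact, substring, token)
# [] ['a', 'b'] ['a']
```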

11. Multi-Attribute matching

Identify a node by several attributes at once, combining the conditions with the and operator:

from lxml import etree

text = '''
<div>
<ul>
<li class="li" name="123"><a href="ex1.html">ex1</a></li>
</ul>
</div>
'''
html = etree.HTML(text)
# contains() is a function, so it takes no @ prefix; both conditions must hold
res = html.xpath('//li[contains(@class, "li") and @name="123"]/a/text()')
print(res)
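
Besides and, XPath predicates also accept or and the not() function. The following sketch shows the three side by side on an assumed document with two li elements; the attribute values are made up for illustration.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
text = (
    '<ul>'
    '<li class="li li-first" name="123"><a href="ex1.html">ex1</a></li>'
    '<li class="li" name="456"><a href="ex2.html">ex2</a></li>'
    '</ul>'
)
html = etree.HTML(text)

both = html.xpath('//li[contains(@class, "li") and @name="123"]/a/text()')
either = html.xpath('//li[@name="123" or @name="456"]/a/text()')
negated = html.xpath('//li[contains(@class, "li") and not(@name="123")]/a/text()')

print(both, either, negated)
# ['ex1'] ['ex1', 'ex2'] ['ex2']
```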

12. Selecting by position (multiple nodes)

from lxml import etree

text = '''
<div>
<ul>
<li class="ex1"><a href="ex1.html">ex1</a></li>
<li class="ex2"><a href="ex2.html">ex2</a></li>
<li class="ex3"><a href="ex3.html">ex3</a></li>
</ul>
</div>
'''
html = etree.HTML(text)
res = html.xpath('//li[1]/a/text()')  # first li
res = html.xpath('//li[last()]/a/text()')  # last li
res = html.xpath('//li[position()<3]/a/text()')  # first two li
res = html.xpath('//li[last()-2]/a/text()')  # third li from the end (here, the first li)

"Note"

Positions are numbered starting from 1, not 0.
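
A short runnable sketch of position-based selection, printing each result so the 1-based numbering is visible. The three-item list mirrors the example above; the variable names are chosen for this illustration.

```python
from lxml import etree

text = (
    '<div><ul>'
    '<li class="ex1"><a href="ex1.html">ex1</a></li>'
    '<li class="ex2"><a href="ex2.html">ex2</a></li>'
    '<li class="ex3"><a href="ex3.html">ex3</a></li>'
    '</ul></div>'
)
html = etree.HTML(text)

first = html.xpath('//li[1]/a/text()')
last = html.xpath('//li[last()]/a/text()')
first_two = html.xpath('//li[position()<3]/a/text()')
third_from_last = html.xpath('//li[last()-2]/a/text()')

print(first, last, first_two, third_from_last)
# ['ex1'] ['ex3'] ['ex1', 'ex2'] ['ex1']
```

The position is counted among the li children of each parent, so in a document with several lists //li[1] would return the first li of every list.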

13. Selecting nodes along an axis

from lxml import etree

text = '''
<div>
<ul>
<li class="ex1"><a href="ex1.html">ex1</a></li>
<li class="ex2"><a href="ex2.html">ex2</a></li>
<li class="ex3"><a href="ex3.html">ex3</a></li>
</ul>
</div>
'''
html = etree.HTML(text)
res = html.xpath('//li[1]/ancestor::*')  # all ancestors of the first li
res = html.xpath('//li[1]/ancestor::div')  # only the ancestor div
res = html.xpath('//li[1]/attribute::*')  # all attribute values of the first li
res = html.xpath('//li[1]/child::a[@href="ex1.html"]')  # direct a children whose href is ex1.html
res = html.xpath('//li[1]/descendant::span')  # descendant span elements (none here, so the list is empty)
res = html.xpath('//li[1]/following::*[2]')  # the second element after the first li in document order
res = html.xpath('//li[1]/following-sibling::*')  # all siblings after the first li

"Note" These are all axes

ancestor, attribute, child, descendant, following, following-sibling
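
To make the axis results easier to inspect, the sketch below prints tag names and attribute values instead of element objects. It assumes the same three-item list as the example above; the helper function tags() is hypothetical and exists only for this illustration.

```python
from lxml import etree

text = (
    '<div><ul>'
    '<li class="ex1"><a href="ex1.html">ex1</a></li>'
    '<li class="ex2"><a href="ex2.html">ex2</a></li>'
    '<li class="ex3"><a href="ex3.html">ex3</a></li>'
    '</ul></div>'
)
html = etree.HTML(text)


def tags(nodes):
    """Return the tag name of each element in a result list."""
    return [node.tag for node in nodes]


ancestors = tags(html.xpath('//li[1]/ancestor::*'))
attrs = html.xpath('//li[1]/attribute::*')
children = tags(html.xpath('//li[1]/child::a[@href="ex1.html"]'))
spans = html.xpath('//li[1]/descendant::span')
following_second = html.xpath('//li[1]/following::*[2]')
sibling_classes = html.xpath('//li[1]/following-sibling::*/@class')

print(ancestors)                           # ['html', 'body', 'div', 'ul']
print(attrs, children, spans)              # ['ex1'] ['a'] []
print(following_second[0].get('href'))     # ex2.html
print(sibling_classes)                     # ['ex2', 'ex3']
```

The following axis covers every element after the first li in document order, so its second entry is the a element inside the second li, while following-sibling is limited to the li elements at the same level.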
