This article introduces XPath as used with lxml, a parsing library for Python 3 crawlers. Let's walk through it together; I hope it helps everyone who is learning to write Python crawlers.
XPath:
The full name is XML Path Language. It is a language for finding information in XML documents, and it works on HTML documents as well.
1. XPath common rules
Expression   Description
nodename     Select the child nodes of the current node that are named nodename
/            Select direct child nodes of the current node
//           Select descendant nodes of the current node
.            Select the current node
..           Select the parent node of the current node
@            Select attributes
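To make the table concrete, here is a small sketch that applies each rule once. The sample markup and the variable names are made up for illustration; they are not part of the original examples.

```python
from lxml import etree

# Hypothetical sample markup, used only to illustrate the rules above.
doc = etree.HTML(
    '<div><ul><li class="ex1"><a href="ex1.html">ex1</a></li></ul></div>'
)

li = doc.xpath('//li')[0]            # //  : li nodes anywhere in the document
child_links = li.xpath('a')          # nodename : child nodes of li named a
same_node = li.xpath('.')            # .   : the current node (li itself)
parent_tag = li.xpath('..')[0].tag   # ..  : the parent node of li
class_value = li.xpath('@class')     # @   : the value of the class attribute
direct = doc.xpath('//ul/li')        # /   : li nodes that are direct children of ul

print(parent_tag, class_value, [a.tag for a in child_links])
```

Note that xpath() can be called on any element, in which case relative expressions such as a, . and .. are evaluated from that element.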
2. Preparation: install the lxml library (for example, with pip install lxml)
3. Example:
from lxml import etree

# Note: the second li is deliberately left unclosed
text = '''
<div>
<ul>
<li class="ex1"><a href="ex1.html">ex1</a></li>
<li class="ex2"><a href="ex2.html">ex2</a>
</ul>
</div>
'''
html = etree.HTML(text)  # Parse the string with the HTML parser
r = etree.tostring(html)  # Serialize; the parser repairs the HTML and adds the missing tags
print(r.decode('utf-8'))  # tostring() returns bytes, so decode it to a str
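The repair step can be checked directly. The sketch below uses a shorter, made-up snippet in which no li is closed, and asks tostring() for a str instead of bytes via encoding='unicode'.

```python
from lxml import etree

# Hypothetical broken markup: neither li has a closing tag.
broken = '<ul><li><a href="ex1.html">ex1</a><li><a href="ex2.html">ex2</a></ul>'

root = etree.HTML(broken)
fixed = etree.tostring(root, encoding='unicode')  # str, so no decode() is needed

print(fixed)
print(root.tag)              # the parser wraps the fragment in html and body
print(fixed.count('</li>'))  # number of closing li tags after repair
```

Passing encoding='unicode' is simply an alternative to calling decode('utf-8') on the bytes result.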
4. All nodes
Select all nodes:
from lxml import etree

# test.html is assumed to be a local file holding the sample HTML from section 3
html = etree.parse('./test.html', etree.HTMLParser())
res = html.xpath('//*')  # * matches every node
print(res)
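If you do not have a test.html file at hand, the same query can be tried on a string. This is a minimal sketch with made-up markup; it prints tag names instead of element objects so the result is easier to read.

```python
from lxml import etree

# Hypothetical sample markup standing in for test.html.
text = '<div><ul><li class="ex1"><a href="ex1.html">ex1</a></li></ul></div>'
html = etree.HTML(text)

tags = [el.tag for el in html.xpath('//*')]
print(tags)  # ['html', 'body', 'div', 'ul', 'li', 'a']

# Replace * with a node name to select only nodes of that type.
li_nodes = html.xpath('//li')
print(len(li_nodes))  # 1
```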
5. Child nodes
Select all a nodes that are direct children of li nodes:
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
res = html.xpath('//li/a')
print(res)
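The difference between / and // is easiest to see when the target is not a direct child. A short sketch, using made-up markup in which each a node sits inside an li rather than directly under ul:

```python
from lxml import etree

# Hypothetical sample markup: a is a grandchild of ul, not a child.
html = etree.HTML(
    '<div><ul>'
    '<li><a href="ex1.html">ex1</a></li>'
    '<li><a href="ex2.html">ex2</a></li>'
    '</ul></div>'
)

direct = html.xpath('//ul/a')    # / looks only at direct children of ul
nested = html.xpath('//ul//a')   # // looks at all descendants of ul

print(direct)       # []
print(len(nested))  # 2
```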
6. Parent node
Use .. to select the parent node; the parent:: axis does the same thing.
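As a minimal sketch of both forms, the following starts from an a node and reads the class attribute of its parent li. The markup is made up for illustration.

```python
from lxml import etree

# Hypothetical sample markup.
html = etree.HTML(
    '<div><ul><li class="ex1"><a href="ex1.html">ex1</a></li></ul></div>'
)

# Go from the a node up to its parent, then read the parent's class attribute.
via_dots = html.xpath('//a[@href="ex1.html"]/../@class')
via_axis = html.xpath('//a[@href="ex1.html"]/parent::*/@class')

print(via_dots)  # ['ex1']
print(via_axis)  # ['ex1']
```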
7. Attribute matching
Select the li nodes whose class attribute is ex1:
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
res = html.xpath('//li[@class="ex1"]')
print(res)
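One detail worth knowing is that the comparison in [@class="..."] is an exact, case-sensitive string match. A small sketch with made-up markup:

```python
from lxml import etree

# Hypothetical sample markup.
html = etree.HTML(
    '<ul>'
    '<li class="ex1"><a href="ex1.html">ex1</a></li>'
    '<li class="ex2"><a href="ex2.html">ex2</a></li>'
    '</ul>'
)

matched = html.xpath('//li[@class="ex1"]')
wrong_case = html.xpath('//li[@class="Ex1"]')  # attribute values are case-sensitive

print([li.get('class') for li in matched])  # ['ex1']
print(wrong_case)                           # []
```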
8. Getting text
Select the text inside the li node. There are two methods; the second is recommended.
A. Step down to the a node, then take its text
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
res = html.xpath('//li[@class="ex1"]/a/text()')
print(res)
B. Recommended: //text() collects the text of all descendant nodes, so the result is more complete
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
res = html.xpath('//li[@class="ex1"]//text()')
print(res)
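The two methods give different results as soon as the li node holds text outside the a node. The sketch below adds some made-up trailing text to show this.

```python
from lxml import etree

# Hypothetical sample markup: " tail" belongs to li, not to a.
html = etree.HTML(
    '<ul><li class="ex1"><a href="ex1.html">ex1</a> tail</li></ul>'
)

only_a = html.xpath('//li[@class="ex1"]/a/text()')    # text of the a node only
direct = html.xpath('//li[@class="ex1"]/text()')      # text directly inside li
all_text = html.xpath('//li[@class="ex1"]//text()')   # text of li and all descendants

print(only_a)    # ['ex1']
print(direct)    # [' tail']
print(all_text)  # ['ex1', ' tail']
```

Because //text() also returns whitespace-only pieces such as line breaks between tags, it is common to strip and filter the result afterwards.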
9. Getting attributes
Get the href attribute of every a node under every li node:
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
res = html.xpath('//li/a/@href')
print(res)
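Note the difference from attribute matching: [@href="..."] filters nodes, while a trailing /@href returns the attribute values themselves. The sketch below, on made-up markup, also shows one possible way to pair each link's text with its href by working on the element objects.

```python
from lxml import etree

# Hypothetical sample markup.
html = etree.HTML(
    '<ul>'
    '<li><a href="ex1.html">ex1</a></li>'
    '<li><a href="ex2.html">ex2</a></li>'
    '</ul>'
)

hrefs = html.xpath('//li/a/@href')  # a list of attribute values (strings)
pairs = [(a.text, a.get('href')) for a in html.xpath('//li/a')]

print(hrefs)  # ['ex1.html', 'ex2.html']
print(pairs)  # [('ex1', 'ex1.html'), ('ex2', 'ex2.html')]
```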
10. Attribute multi-value matching
When an attribute holds several values, an exact match with = no longer works; use contains() instead.
from lxml import etree

text = '''
<div>
<ul>
<li class="li li-first"><a href="ex1.html">li1</a></li>
</ul>
</div>
'''
html = etree.HTML(text)
res = html.xpath('//li[contains(@class, "li")]/a/text()')
print(res)
Note:
contains():
The first argument is the attribute (for example @class); the second argument is the value to look for inside it.
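The sketch below compares an exact match with contains() on made-up markup. Since contains() is a plain substring test, it also matches a class such as list; the third query is a commonly used stricter pattern built from the standard XPath functions concat() and normalize-space(), shown here as one possible approach rather than something from the original article.

```python
from lxml import etree

# Hypothetical sample markup: the second li has class "list", which contains "li".
html = etree.HTML(
    '<ul>'
    '<li class="li li-first"><a href="ex1.html">li1</a></li>'
    '<li class="list"><a href="ex2.html">other</a></li>'
    '</ul>'
)

exact = html.xpath('//li[@class="li"]/a/text()')              # no class equals "li" exactly
partial = html.xpath('//li[contains(@class, "li")]/a/text()')  # substring match
token = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)  # matches "li" only as a whole space-separated value

print(exact)    # []
print(partial)  # ['li1', 'other']
print(token)    # ['li1']
```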
11. Multi-attribute matching
Identify a node by several attributes at once, joined with the and operator:
from lxml import etree

text = '''
<div>
<ul>
<li class="li" name="123"><a href="ex1.html">ex1</a></li>
</ul>
</div>
'''
html = etree.HTML(text)
# contains() is a function, so it takes no @ prefix
res = html.xpath('//li[contains(@class, "li") and @name="123"]/a/text()')
print(res)
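Besides and, XPath also has an or operator. A minimal sketch on made-up markup with two li nodes that differ in their name attribute:

```python
from lxml import etree

# Hypothetical sample markup.
html = etree.HTML(
    '<ul>'
    '<li class="li li-first" name="123"><a href="ex1.html">ex1</a></li>'
    '<li class="li" name="456"><a href="ex2.html">ex2</a></li>'
    '</ul>'
)

both = html.xpath('//li[contains(@class, "li") and @name="123"]/a/text()')
either = html.xpath('//li[@name="123" or @name="456"]/a/text()')

print(both)    # ['ex1']
print(either)  # ['ex1', 'ex2']
```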
12. Selecting by position (when several nodes match)
from lxml import etree

text = '''
<div>
<ul>
<li class="ex1"><a href="ex1.html">ex1</a></li>
<li class="ex2"><a href="ex2.html">ex2</a></li>
<li class="ex3"><a href="ex3.html">ex3</a></li>
</ul>
</div>
'''
html = etree.HTML(text)
res = html.xpath('//li[1]/a/text()')  # the first li
res = html.xpath('//li[last()]/a/text()')  # the last li
res = html.xpath('//li[position()<3]/a/text()')  # the first two li
res = html.xpath('//li[last()-2]/a/text()')  # the third li from the end (here, the first li)
Note:
Positions start from 1, not 0.
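For reference, here is a sketch that runs the four position queries on the same three-item list and keeps each result in its own variable, so the outputs can be compared side by side. The variable names are made up for illustration.

```python
from lxml import etree

html = etree.HTML(
    '<div><ul>'
    '<li class="ex1"><a href="ex1.html">ex1</a></li>'
    '<li class="ex2"><a href="ex2.html">ex2</a></li>'
    '<li class="ex3"><a href="ex3.html">ex3</a></li>'
    '</ul></div>'
)

first = html.xpath('//li[1]/a/text()')
last = html.xpath('//li[last()]/a/text()')
first_two = html.xpath('//li[position()<3]/a/text()')
third_from_end = html.xpath('//li[last()-2]/a/text()')

print(first)           # ['ex1']
print(last)            # ['ex3']
print(first_two)       # ['ex1', 'ex2']
print(third_from_end)  # ['ex1']
```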
13. Node axis selection
from lxml import etree

text = '''
<div>
<ul>
<li class="ex1"><a href="ex1.html">ex1</a></li>
<li class="ex2"><a href="ex2.html">ex2</a></li>
<li class="ex3"><a href="ex3.html">ex3</a></li>
</ul>
</div>
'''
html = etree.HTML(text)
res = html.xpath('//li[1]/ancestor::*')  # all ancestor nodes
res = html.xpath('//li[1]/ancestor::div')  # only the div ancestor
res = html.xpath('//li[1]/attribute::*')  # all attribute values
res = html.xpath('//li[1]/child::a[@href="ex1.html"]')  # direct child a nodes whose href is ex1.html
res = html.xpath('//li[1]/descendant::span')  # descendant span nodes (none in this sample, so the result is empty)
res = html.xpath('//li[1]/following::*[2]')  # the second node after the current node
res = html.xpath('//li[1]/following-sibling::*')  # all sibling nodes after the current node
Note: these are all axes:
ancestor, attribute, child, descendant, following, following-sibling
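To see what each axis returns, the sketch below runs several of the queries on the same markup and prints tag names, attribute values, or text instead of element objects. The variable names are made up for illustration.

```python
from lxml import etree

html = etree.HTML(
    '<div><ul>'
    '<li class="ex1"><a href="ex1.html">ex1</a></li>'
    '<li class="ex2"><a href="ex2.html">ex2</a></li>'
    '<li class="ex3"><a href="ex3.html">ex3</a></li>'
    '</ul></div>'
)

ancestors = [e.tag for e in html.xpath('//li[1]/ancestor::*')]
attrs = html.xpath('//li[1]/attribute::*')
children = html.xpath('//li[1]/child::a[@href="ex1.html"]')
second_following = html.xpath('//li[1]/following::*[2]')
siblings = [e.get('class') for e in html.xpath('//li[1]/following-sibling::*')]

print(ancestors)                 # ['html', 'body', 'div', 'ul']
print(attrs)                     # ['ex1']
print(children[0].text)          # ex1
print(second_following[0].text)  # ex2
print(siblings)                  # ['ex2', 'ex3']
```

The following axis walks the rest of the document in order (li, a, li, a), which is why the second following node is the a node inside the second li.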
Source: the Internet