XPath plays a pivotal role in learning to write web crawlers in Python. It can do the same work as regular expressions (the re module) and achieve similar results, but for parsing web pages XPath is considerably more convenient, which makes re the second choice there.
XPath Introduction:
What is it? Its full name is XML Path Language, a small query language.
Since XPath is a language, its advantages are worth listing:
1) It can find information in XML documents
2) It supports lookups in HTML
3) It navigates through elements and attributes
Requirements for using XPath in Python development:
Because XPath support is provided by the lxml library, you first need to install lxml. The blog covers the specific installation process, including the easy_install and pip methods.
Simple invocation method of XPath:
from lxml import etree
selector = etree.HTML(source)   # source: the page source; converts it into a form XPath can match against
selector.xpath(expression)      # expression: the XPath query; returns a list
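As a minimal sketch of that call pattern, the snippet below parses a small page and runs one expression against it. The `source` string and its contents are made up for illustration; only `etree.HTML` and `xpath` come from the text above.

```python
from lxml import etree

# Hypothetical page source, invented for this illustration.
source = "<html><body><p>Hello</p><p>World</p></body></html>"

selector = etree.HTML(source)          # parse the HTML string into an element tree
result = selector.xpath("//p/text()")  # xpath() returns a list of matches

print(result)
```

Because the result is always a list, an expression that matches nothing gives an empty list rather than raising an error, so it is worth checking the length before indexing.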
How to use XPath:
First, let's talk about the basic syntax of XPath:
The four basic notations, plus a few extra symbols:
1) // A double slash starts from the root node and scans the whole document, selecting every matching node at any depth; the result is returned as a list.
2) / A single slash selects the next level down from the current path (a direct child), or introduces an operation on the content of the current path.
3) /text() Gets the text content under the current path
4) /@xxxx Extracts the value of attribute xxxx from the tag at the current path
5) | The union operator selects several paths at once. For example, //p | //div selects all matching p tags and div tags.
6) . A single dot selects the current node
7) .. Two dots select the parent node of the current node
There are also two important special functions, starts-with(@attribute-name, shared prefix of the attribute value) and string(.), which are covered later.
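The symbols listed above can be tried out one at a time. The following is only a sketch on invented markup (the `page` string, its ids and class names are assumptions, not part of the original example):

```python
from lxml import etree

# Hypothetical markup, invented for this illustration.
page = """
<div id="box">
  <p class="intro">first</p>
  <span>second</span>
  <p>third</p>
</div>
"""
root = etree.HTML(page)

all_p = root.xpath("//p/text()")           # //  : every p anywhere in the document
child = root.xpath("//div/span/text()")    # /   : span directly under a div
classes = root.xpath("//p/@class")         # /@  : attribute values
union = root.xpath("//p | //span")         # |   : nodes matching either path
tags = [el.tag for el in union]            # results come back in document order

span = root.xpath("//span")[0]
own_text = span.xpath("./text()")          # .   : start from the current node
parent_id = span.xpath("../@id")           # ..  : step up to the parent node

print(all_p, child, classes, tags, own_text, parent_id)
```

Note that `.` and `..` only make sense when `xpath` is called on an element already selected, as with `span` here.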
Use an example to explain how XPath is used:
from lxml import etree

html = """
<!DOCTYPE html>
<html>
<head lang="en">
    <title>Test</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<div id="Content">
    <ul id="ul">
        <li>The</li>
        <li>No.2</li>
        <li>No.3</li>
    </ul>
    <ul id="Ul2">
        <li>One</li>
        <li>Both</li>
    </ul>
</div>
<div id="url">
    <a href="Http:www.58.com" title="58">58</a>
    <a href="Http:www.icnlogs.com" title="Cnblog">Cnblog</a>
</div>
</body>
</html>
"""

selector = etree.HTML(html)
# The id attributes pin down which div and ul to match; text() extracts the text content
content = selector.xpath('//div[@id="Content"]/ul[@id="ul"]/li/text()')
for each in content:
    print(each)
# Output:
The
No.2
No.3
# // locates every matching a tag in the whole document; "@attribute" gets the value of its href attribute
con = selector.xpath('//a/@href')
for each in con:
    print(each)
# Output:
Http:www.58.com
Http:www.icnlogs.com
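Selecting only `@href` returns the attribute values without the link text they belong to. When both are needed together, one option is to select the a elements first and then query each one with a relative path. This is a sketch on invented markup (the `page` string and its paths are assumptions):

```python
from lxml import etree

# Hypothetical link list, invented for this illustration.
page = (
    '<div>'
    '<a href="/first" title="one">First</a>'
    '<a href="/second" title="two">Second</a>'
    '</div>'
)
root = etree.HTML(page)

links = []
for a in root.xpath("//a"):            # each item is an element, not a string
    text = a.xpath("./text()")[0]      # relative query: text of this a only
    href = a.xpath("./@href")[0]       # relative query: href of this a only
    links.append((text, href))

print(links)
```

Keeping the pair in one tuple avoids having to line up two separate result lists afterwards.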
con = selector.xpath('/html/body/div/a/@title')   # locate the title of each a tag with an absolute path
con = selector.xpath('//a/@title')                # locate it with a relative path; both give the same result here
print(len(con))
print(con[0])
print(con[1])
# Output:
2
58
Cnblog
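The two expressions agree in this example because every a tag sits directly under a div that is a child of body. On pages where links appear at other depths, the results can differ. A small sketch on invented markup (the `page` string is an assumption) shows the difference:

```python
from lxml import etree

# Hypothetical markup with links at two different depths.
page = """
<html><body>
  <div><a title="top">x</a></div>
  <div><p><a title="nested">y</a></p></div>
</body></html>
"""
root = etree.HTML(page)

absolute = root.xpath("/html/body/div/a/@title")  # only a tags directly under body/div
anywhere = root.xpath("//a/@title")               # a tags at any depth

print(absolute)
print(anywhere)
```

The absolute path is stricter and breaks when the page layout changes, while `//` is more tolerant but may pick up tags that were not intended.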
Advanced uses of XPath in Python