Original title: "Python web crawler: the Scrapy selector XPath". The original text has been modified and annotated.
Advantage
XPath is more convenient than CSS selectors for selecting:
- Tags with no id, class, or name attribute
- Tags with no distinctive attributes or text
- Tags buried in very deep or complex nesting
XPath path
Positioning method
/   absolute path: selection starts from the root node
//  relative path: selection starts from any node
Basic node positioning
# find all input nodes under form, under body, under html
/html/body/form/input
# find all input nodes
//input
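As a rough illustration, the two expressions can be tried with Python's standard-library xml.etree.ElementTree, chosen here only because it needs no installation; in a Scrapy spider the same expressions would be passed to response.xpath(). ElementTree implements a subset of XPath and evaluates paths relative to the element it is called on, so this sketch writes /html/body/form/input as body/form/input (starting from the html root) and //input as .//input. The sample markup and field names are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, written as well-formed XML so ElementTree can parse it.
PAGE = """
<html>
  <body>
    <form>
      <input name="user"/>
      <input name="password"/>
      <div><input name="nested"/></div>
    </form>
    <p><input name="outside"/></p>
  </body>
</html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Counterpart of /html/body/form/input: only inputs directly under the form.
direct = [node.get("name") for node in root.findall("body/form/input")]

# Counterpart of //input: every input, at any depth.
everywhere = [node.get("name") for node in root.findall(".//input")]

print(direct)
print(everywhere)
```

The first list leaves out the input wrapped in a div and the one outside the form, which is the practical difference between a full path and a // search.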
Using wildcard characters
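The wildcard * matches any element node.
# find all child nodes of form nodes
//form/*
# find all nodes
//*
# find all input nodes that have a parent element (the input cannot be the root element)
//*/input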
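The wildcard expressions can be sketched the same way with the standard-library xml.etree.ElementTree (an assumption made for portability; Scrapy's response.xpath() accepts the expressions as written). The markup is invented, and the paths start with a dot because ElementTree only evaluates paths relative to an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical form used only for this illustration.
PAGE = """
<html>
  <body>
    <form>
      <label>Name</label>
      <input name="user"/>
      <button>Send</button>
    </form>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# //form/* : every element directly under a form, whatever its tag.
form_children = [node.tag for node in root.findall(".//form/*")]

# //* : every element. ElementTree's .//* starts below the root,
# so the <html> element itself is not in this list.
all_tags = [node.tag for node in root.findall(".//*")]

# //*/input : inputs that sit under some parent element.
inputs = [node.get("name") for node in root.findall(".//*/input")]

print(form_children)
print(all_tags)
print(inputs)
```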
Using index positioning
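XPath indices start at 1, not 0.
# select the 1st a node under the 7th td
//*/td[7]/a[1]
# select the 2nd span node under the 7th td
//*/td[7]/span[2]
# select the last a node under the last td
//*/td[last()]/a[last()]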
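Because XPath positions count from 1, td[7] is the seventh cell and a[1] is the first link inside it. The sketch below checks this on a made-up table row, using the standard-library xml.etree.ElementTree (an assumption for portability; it supports numeric positions and last()).

```python
import xml.etree.ElementTree as ET

# Hypothetical row with eight cells; each cell holds two links
# labelled with the cell number and the link number.
cells = "".join(
    f"<td><a>cell{i}-link1</a><a>cell{i}-link2</a></td>" for i in range(1, 9)
)
root = ET.fromstring(f"<table><tr>{cells}</tr></table>")

# First link of the seventh cell.
seventh_first = [a.text for a in root.findall(".//td[7]/a[1]")]

# Last link of the last cell.
last_last = [a.text for a in root.findall(".//td[last()]/a[last()]")]

print(seventh_first)
print(last_last)
```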
Using Attributes
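# select all input nodes that have a name attribute
//input[@name]
# select all input nodes that have at least one attribute
//input[@*]
# select all input nodes with value='2'
//input[@value='2']
# select using several attributes
//input[@value='2'][@id='3']
//input[@value='2' and @id='3']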
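A minimal sketch of attribute predicates, again assuming the standard-library xml.etree.ElementTree and an invented form. ElementTree's documented XPath subset covers [@attr], [@attr='value'], and chained predicates; it does not list @* or the and operator, so the "any attribute" case is emulated with a plain Python check here. A full XPath engine such as the one behind Scrapy's response.xpath() accepts all five expressions directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical form; the id values are only labels for the output.
FORM = """
<form>
  <input name="a" value="1" id="1"/>
  <input name="b" value="2" id="3"/>
  <input value="2" id="4"/>
  <input/>
</form>
"""

root = ET.fromstring(FORM)

# //input[@name] : inputs that carry a name attribute.
with_name = [n.get("id") for n in root.findall(".//input[@name]")]

# //input[@value='2'] : inputs whose value attribute equals '2'.
value_two = [n.get("id") for n in root.findall(".//input[@value='2']")]

# //input[@value='2'][@id='3'] : both conditions must hold.
both = [n.get("id") for n in root.findall(".//input[@value='2'][@id='3']")]

# //input[@*] emulated in Python: inputs with at least one attribute.
any_attribute = [n for n in root.iter("input") if n.attrib]

print(with_name, value_two, both, len(any_attribute))
```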
Using function positioning
Function | Meaning
contains(a, b) | True if string a contains string b
text() | Selects the text inside the node
starts-with(a, b) | True if string a starts with string b
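<a class="menu_hot" href="/ads/auth/promote.html">App promotion</a>
# select all a nodes whose href attribute contains 'promote.html'
//a[contains(@href,'promote.html')]
# select all a nodes whose text is 'App promotion'
//a[text()='App promotion']
# select all a nodes whose href value starts with '/ads'
//a[starts-with(@href,'/ads')]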
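To make the meaning of the three functions concrete, the sketch below applies the equivalent Python string tests to a link like the sample above. This is an emulation, not an XPath evaluation: the standard-library xml.etree.ElementTree (assumed here so the example runs without extra packages) does not implement contains() or starts-with(). With Scrapy, the XPath expressions themselves would be passed to response.xpath(). The second link is made up so that each test has something to reject.

```python
import xml.etree.ElementTree as ET

# The sample link with English text, plus one hypothetical non-matching link.
NAV = """
<div>
  <a class="menu_hot" href="/ads/auth/promote.html">App promotion</a>
  <a class="menu" href="/help/index.html">Help</a>
</div>
"""

root = ET.fromstring(NAV)
links = root.findall(".//a")

# Same condition as //a[contains(@href, 'promote.html')]
contains_hits = [a.get("href") for a in links if "promote.html" in a.get("href", "")]

# Same condition as //a[text()='App promotion']
text_hits = [a.get("href") for a in links if a.text == "App promotion"]

# Same condition as //a[starts-with(@href, '/ads')]
starts_hits = [a.get("href") for a in links if a.get("href", "").startswith("/ads")]

print(contains_hits, text_hits, starts_hits)
```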
Using the XPath axis
Axes play a role similar to the sibling, parent, and children navigation in BeautifulSoup.
Axis name | Meaning
ancestor | Selects all ancestor nodes of the current node
ancestor-or-self | Selects all ancestor nodes of the current node and the current node itself
attribute | Selects all attributes of the current node
child | Selects all child nodes of the current node
descendant | Selects all descendant nodes of the current node
descendant-or-self | Selects all descendant nodes of the current node and the current node itself
following | Selects all nodes in the document after the closing tag of the current node
parent | Selects the parent node of the current node
preceding-sibling | Selects all sibling nodes before the current node
self | Selects the current node itself
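The sketch below shows what a few of these axes select by walking a small tree by hand. It is an emulation under stated assumptions: the standard-library xml.etree.ElementTree has no axis syntax, so each axis is reproduced with ordinary Python, and the markup and id values are invented. With a full XPath engine, such as the one behind Scrapy's response.xpath(), the axis would be written directly in the expression, for example //li[@id='second']/preceding-sibling::li.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the "current node" is the li with id="second".
DOC = """
<html>
  <body>
    <ul>
      <li id="first"/>
      <li id="second"><span id="inner"/></li>
      <li id="third"/>
    </ul>
  </body>
</html>
"""

root = ET.fromstring(DOC)

# ElementTree nodes do not know their parent, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

current = root.find(".//li[@id='second']")

# child: the direct children of the current node.
children = [node.get("id") for node in current]

# descendant-or-self: the current node followed by everything below it.
descendant_or_self = [node.tag for node in current.iter()]

# parent: one step up.
parent = parent_of[current].tag

# ancestor: keep stepping up until the root is reached (nearest first).
ancestors = []
node = current
while node in parent_of:
    node = parent_of[node]
    ancestors.append(node.tag)

# preceding-sibling: children of the same parent that come before the current node.
siblings = list(parent_of[current])
preceding = [node.get("id") for node in siblings[: siblings.index(current)]]

print(children, descendant_or_self, parent, ancestors, preceding)
```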
Original source: http://mp.weixin.qq.com/s/UT4UFDpgo2ER300zq_uqsQ