XPath for the Python parsing library

Last Update:2018-08-23 Source: Internet

Author: User

Tags xpath

1. XPath stands for XML Path Language, a language for locating nodes in XML documents. It works on HTML documents as well.

2. Common XPath rules:

nodename  Select all child nodes of the current node that have this name

/  Select a direct child node from the current node

//  Select descendant nodes from the current node

.  Select the current node

..  Select the parent node of the current node

@  Select attributes
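
The rules above can be tried one by one against a small tree. The following is a minimal sketch, assuming lxml is installed; the sample markup and variable names are made up for illustration and are not part of the original article.

```python
from lxml import etree

# Hypothetical sample markup, used only for this illustration.
doc = etree.HTML('<div><ul><li class="a"><a href="x.html">one</a></li></ul></div>')

ul = doc.xpath('//ul')[0]          # // : search all descendants of the document
direct = ul.xpath('li')            # nodename : child elements named li
links_direct = ul.xpath('a')       # a is not a direct child of ul, so nothing matches
links_deep = ul.xpath('.//a')      # .// : descendants of the current node
parent = ul.xpath('..')[0]         # .. : parent of the current node
classes = ul.xpath('li/@class')    # @ : attribute values

print([e.tag for e in direct])     # ['li']
print(links_direct)                # []
print([e.text for e in links_deep])  # ['one']
print(parent.tag)                  # div
print(classes)                     # ['a']
```

Note that an expression without a leading slash is evaluated relative to the element on which xpath() is called, which is why 'a' finds nothing while './/a' does.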

3. Example

    from lxml import etree

    text = '''
    <div>
        <ul>
            <li class="item-0"><a href="link1.html">first item</a></li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a>
        </ul>
    </div>
    '''

    # Initialize: parse the text into an element tree that supports XPath.
    html = etree.HTML(text)
    # The parser repairs the HTML automatically (the last <li> is not closed).
    # tostring() serializes the repaired HTML; the return value is of type bytes.
    result = etree.tostring(html)
    print(result.decode('utf-8'))
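
Because tostring() returns bytes by default, the result has to be decoded before printing. As a small sketch (the one-line markup is a made-up example), passing encoding='unicode' returns a str directly, and method='html' serializes with HTML rules instead of XML rules:

```python
from lxml import etree

# Hypothetical input with unclosed <li> tags.
root = etree.HTML('<ul><li>one<li>two</ul>')

as_bytes = etree.tostring(root)                                    # bytes by default
as_text = etree.tostring(root, encoding='unicode', method='html')  # str

print(type(as_bytes).__name__, type(as_text).__name__)
print(as_text)
```

In the printed text both li elements are closed, which shows the repair done by the parser.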

You can also read a file and parse it directly:

    from lxml import etree

    html = etree.parse(r'C:\Users\Administrator\Desktop\test.txt', etree.HTMLParser())
    result = etree.tostring(html)
    print(result.decode('utf-8'))
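
To try file parsing without depending on a fixed path, the sketch below writes a temporary file first; the file content and names are assumptions made for the illustration. One difference from etree.HTML() is that etree.parse() returns an ElementTree object rather than the root element, although both support xpath().

```python
import os
import tempfile
from lxml import etree

# Write a small, deliberately unclosed HTML fragment to a temporary file.
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False,
                                 encoding='utf-8') as f:
    f.write('<ul><li>first item</li><li>second item</ul>')
    path = f.name

try:
    tree = etree.parse(path, etree.HTMLParser())   # returns an _ElementTree
    items = tree.xpath('//li/text()')
    root_tag = tree.getroot().tag
finally:
    os.remove(path)

print(items)     # ['first item', 'second item']
print(root_tag)  # html
```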

4. Use XPath rules that start with // to select all nodes that meet the requirements

    from lxml import etree

    text = '''
    <div>
        <ul>
            <li class="item-0"><a href="link1.html">first item</a></li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">Love me China</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a>
        </ul>
    </div>
    '''

    # --- Matching nodes ---
    html = etree.HTML(text)
    result1 = html.xpath('//*')  # use * to match all nodes
    print(result1)
    result2 = html.xpath('//li')  # get all li nodes
    print(result2)
    print(result2[0])
    result3 = html.xpath('//li/a')  # get the direct a child nodes of all li nodes
    print(result3)

    # First select the a node whose href attribute is link3.html, then go to
    # its parent node and read the value of the parent's class attribute.
    # result4 is ['item-inactive'], a list with only one element.
    result4 = html.xpath('//a[@href="link3.html"]/../@class')
    print(result4[0])
    # The parent:: axis can also be used to get the parent node:
    result5 = html.xpath('//a[@href="link3.html"]/parent::*/@class')

    # --- Attribute matching (filter nodes by attribute with the @ symbol) ---
    # The li node whose class attribute equals "item-inactive"
    result6 = html.xpath('//li[@class="item-inactive"]')
    print(result6)

    # --- Text extraction (text() returns the text inside a node) ---
    result7 = html.xpath('//li[@class="item-inactive"]/a[@href="link3.html"]/text()')
    print(result7)  # prints ['Love me China']

    # --- Attribute extraction (use @ to get an attribute value) ---
    # The class attribute of the parent of the a node whose href is "link3.html"
    result8 = html.xpath('//a[@href="link3.html"]/../@class')
    print(result8)  # prints ['item-inactive']

    # --- Multi-valued attribute matching ---
    html_test = '''
    <li class="li item-inactive"><a href="link3.html">Love me China</a></li>
    '''
    # Here the class attribute of the li tag has two values, so the exact
    # attribute match used above no longer matches; use contains() instead.
    html_test = etree.HTML(html_test)
    # contains() takes the attribute as its first argument and a value as its
    # second; the node matches if the attribute contains that value.
    result9 = html_test.xpath('//li[contains(@class, "li")]/a/text()')
    print(result9)

    # --- Multi-attribute matching (identify a node by several attributes) ---
    html_test2 = '''
    <li class="li item-inactive" name="item"><a href="link3.html">hello world</a></li>
    '''
    # Combine the contains() test on class with a test on name using "and".
    html_test = etree.HTML(html_test2)
    result10 = html_test.xpath('//li[contains(@class, "li") and @name="item"]/a[@href="link3.html"]/text()')
    print(result10)  # prints ['hello world']
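
One caveat about contains(): it is a plain substring test, so contains(@class, "li") also matches class values such as "link-list". A common XPath 1.0 idiom for matching a whole class token pads the attribute with spaces before testing. The sketch below uses made-up markup to show the difference; it is an illustration, not part of the original article.

```python
from lxml import etree

# Hypothetical markup: only the first li carries the class token "li".
doc = etree.HTML('''
<ul>
  <li class="li item-inactive"><a href="a.html">exact token</a></li>
  <li class="link-list"><a href="b.html">substring only</a></li>
</ul>
''')

# Substring test: matches both li elements.
loose = doc.xpath('//li[contains(@class, "li")]/a/text()')

# Token test: pad the normalized class value with spaces, then look for " li ".
strict = doc.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(loose)   # ['exact token', 'substring only']
print(strict)  # ['exact token']
```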

5. XPath Operators
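
XPath 1.0 defines operators such as and, or, =, !=, <, >, +, -, div, mod, and | (union of node sets). The following is a minimal sketch of a few of them in lxml; the sample markup, including the name attribute, is assumed for the illustration.

```python
from lxml import etree

# Hypothetical sample markup.
doc = etree.HTML('''
<ul>
  <li class="item-0" name="first"><a href="link1.html">first item</a></li>
  <li class="item-1"><a href="link2.html">second item</a></li>
  <li class="item-0"><a href="link3.html">third item</a></li>
  <li class="item-1"><a href="link4.html">fourth item</a></li>
</ul>
''')

# and: both conditions must hold
both = doc.xpath('//li[@class="item-0" and @name="first"]/a/text()')
# or: either condition may hold
either = doc.xpath('//li[@class="item-1" or @name="first"]/a/text()')
# !=: attribute value differs
not_equal = doc.xpath('//li[@class != "item-0"]/a/text()')
# mod: keep the li elements at even positions
even = doc.xpath('//li[position() mod 2 = 0]/a/text()')
# |: union of two node sets, returned in document order
union = [a.text for a in doc.xpath('//li[1]/a | //li[last()]/a')]
# +: arithmetic on numbers; count() returns a float in lxml
total = doc.xpath('count(//li) + 1')

print(both)       # ['first item']
print(either)     # ['first item', 'second item', 'fourth item']
print(not_equal)  # ['second item', 'fourth item']
print(even)       # ['second item', 'fourth item']
print(union)      # ['first item', 'fourth item']
print(total)      # 5.0
```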

6. Sequential selection (when several nodes match but only one of them is wanted)

    from lxml import etree

    text = '''
    <div>
        <ul>
            <li class="item-0"><a href="link1.html">first item</a></li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">Love me China</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a>
        </ul>
    </div>
    '''

    # --- Select by position after matching nodes ---
    html = etree.HTML(text)
    result1 = html.xpath('//li[1]/a/text()')  # the first of the matched li nodes
    print(result1)
    result2 = html.xpath('//li[last()]/a/text()')  # the last of the matched li nodes
    print(result2)
    result3 = html.xpath('//li[position()<3]/a/text()')  # position less than 3, i.e. the 1st and 2nd
    print(result3)
    result4 = html.xpath('//li[last()-2]/a/text()')  # the third from last
    print(result4)

    # --- Node axis selection ---
    html = etree.HTML(text)
    result5 = html.xpath('//li[1]/ancestor::*')  # all ancestor nodes of the first li node
    print(result5)
    result6 = html.xpath('//li[1]/attribute::*')  # all attribute values of the first li node
    print(result6)
    result7 = html.xpath('//li[1]/child::a')  # the direct a child nodes of the first li node
    print(result7)
    result8 = html.xpath('//li[1]/descendant::a')  # the a descendant nodes of the first li node
    print(result8)
    result9 = html.xpath('//li[1]/following::*')  # all nodes that come after the current node
    print(result9)
    result10 = html.xpath('//li[1]/following-sibling::*')  # all sibling nodes after the current node
    print(result10)
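
A detail worth keeping in mind with positional predicates: in //li[1], the index is counted among the li children of each parent, so a document with several lists yields one match per list. Wrapping the path in parentheses, as in (//li)[1], indexes the combined result instead. A small sketch with made-up markup:

```python
from lxml import etree

# Hypothetical markup with two separate lists.
doc = etree.HTML('''
<div>
  <ul><li>a1</li><li>a2</li></ul>
  <ul><li>b1</li><li>b2</li></ul>
</div>
''')

per_parent = doc.xpath('//li[1]/text()')           # first li within each ul
overall = doc.xpath('(//li)[1]/text()')            # first li in the whole document
last_overall = doc.xpath('(//li)[last()]/text()')  # last li in the whole document

print(per_parent)    # ['a1', 'b1']
print(overall)       # ['a1']
print(last_overall)  # ['b2']
```

In the article's example there is only one ul, so both forms return the same node.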
