1. XPath (XML Path Language) is a language for locating nodes in XML documents. It works on HTML documents as well.
2. Common XPath rules:
nodename  Select the child nodes named nodename of the current node
/         Select a direct child node from the current node
//        Select descendant nodes from the current node
.         Select the current node
..        Select the parent node of the current node
@         Select attributes
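As a quick illustration of the rules above, here is a minimal sketch that applies each one with lxml. The markup and the variable names are made up for this example and are not part of the original post.

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
doc = etree.HTML(
    '<div id="box"><ul><li class="a"><a href="x.html">x</a></li></ul></div>'
)

ul = doc.xpath('//ul')[0]            # //       : descendants, anywhere in the document
li_children = ul.xpath('li')         # nodename : child nodes named li
a_direct = doc.xpath('//li/a')       # /        : direct child of each li
same_node = ul.xpath('.')[0]         # .        : the current node
parent = ul.xpath('..')[0]           # ..       : the parent node
classes = doc.xpath('//li/@class')   # @        : attribute values

print(len(li_children), a_direct[0].text, same_node.tag, parent.tag, classes)
```

Note that xpath() called on an element evaluates relative paths from that element, which is why `li`, `.` and `..` are run on `ul` rather than on the whole document.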
3. Example
from lxml import etree

text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''

# Initialize: build a parsed tree that can be queried with XPath.
html = etree.HTML(text)
# etree.HTML() repairs the markup automatically (the last <li> above is not closed).
# etree.tostring() outputs the completed HTML; the result is of type bytes.
result = etree.tostring(html)
print(result.decode('utf-8'))
You can also read the markup from a file and parse it:
from lxml import etree

html = etree.parse(r'C:\Users\Administrator\Desktop\test.txt', etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))
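The file example depends on a path that exists only on the author's machine. The sketch below shows the same idea in a form that runs anywhere, assuming a small sample page written to a temporary file. The file content and the names `path`, `tree`, `items` and `markup` are illustrative assumptions.

```python
import os
import tempfile
from lxml import etree

# Write a small sample page to a temporary file (a stand-in for test.txt).
with tempfile.NamedTemporaryFile(
    'w', suffix='.html', delete=False, encoding='utf-8'
) as f:
    f.write('<ul><li>first item</li><li>second item')
    path = f.name

try:
    # etree.parse() with an HTMLParser returns an element tree for the file.
    tree = etree.parse(path, etree.HTMLParser())
    items = tree.xpath('//li/text()')
    markup = etree.tostring(tree).decode('utf-8')
finally:
    os.remove(path)  # clean up the temporary file

print(items)
print(markup)
```

As with etree.HTML(), the HTML parser closes the unclosed tags, so the serialized markup contains a closing tag for both li nodes.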
4. Use an XPath rule that starts with // to select all nodes that meet the requirements
from lxml import etree

text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">Love me China</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''

# --- Matching nodes ---
html = etree.HTML(text)
result1 = html.xpath('//*')  # use * to match all nodes
print(result1)
result2 = html.xpath('//li')  # get all li nodes
print(result2)
print(result2[0])
result3 = html.xpath('//li/a')  # get the direct a child nodes of all li nodes
print(result3)
# First select the a node whose href attribute is link3.html, then get its
# parent node and read the value of its class attribute.
# result4 is ['item-inactive'], a list with only one element.
result4 = html.xpath('//a[@href="link3.html"]/../@class')
print(result4[0])
# The parent node can also be reached with the parent:: axis:
result5 = html.xpath('//a[@href="link3.html"]/parent::*/@class')

# --- Attribute matching (filter nodes by attribute with the @ symbol) ---
# The li node whose attribute is class="item-inactive"
result6 = html.xpath('//li[@class="item-inactive"]')
print(result6)

# --- Text fetching (use text() in XPath to get the text inside a node) ---
result7 = html.xpath('//li[@class="item-inactive"]/a[@href="link3.html"]/text()')
print(result7)  # prints ['Love me China']

# --- Attribute fetching (use @ to get an attribute value) ---
# The class attribute of the parent of the a node with href="link3.html"
result8 = html.xpath('//a[@href="link3.html"]/../@class')
print(result8)  # prints ['item-inactive']

# --- Attribute multi-value matching ---
html_test = '''
<li class="li item-inactive"><a href="link3.html">Love me China</a></li>
'''
# Here the class attribute of the li tag has two values, so the exact match
# used above (@class="...") no longer works; use the contains() function.
html_test = etree.HTML(html_test)
# The first argument of contains() is the attribute, the second is a value
# that the attribute must contain; any one of the values will match.
result9 = html_test.xpath('//li[contains(@class, "li")]/a/text()')
print(result9)

# --- Multi-attribute matching (identify a node by several attributes) ---
html_test2 = '''
<li class="li item-inactive" name="item"><a href="link3.html">hello world</a></li>
'''
# Here the li node is identified by two attributes, class and name;
# combine the two conditions with the and operator.
html_test = etree.HTML(html_test2)
result10 = html_test.xpath('//li[contains(@class, "li") and @name="item"]/a[@href="link3.html"]/text()')
print(result10)  # prints ['hello world']
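One caveat about contains() is that it tests for a substring, not for a whole class name. The sketch below shows the difference, along with a commonly used workaround that pads the attribute with spaces. The markup, including the `list-header` class, is a made-up example and does not come from the original post.

```python
from lxml import etree

# Hypothetical markup: the second li has a class that merely contains "li".
snippet = etree.HTML('''
<ul>
  <li class="li item-inactive"><a href="link3.html">third</a></li>
  <li class="list-header"><a href="other.html">header</a></li>
</ul>
''')

# Substring test: "li" also occurs inside "list-header", so both nodes match.
loose = snippet.xpath('//li[contains(@class, "li")]/a/text()')

# Token test: surround the normalized attribute with spaces and look for " li ".
strict = snippet.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(loose)
print(strict)
```

The stricter form is longer, but it only matches when `li` appears as a complete, space-separated class value.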
5. XPath Operators
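The post lists no operators under this heading, so the following is only a sketch of a few standard XPath 1.0 operators (or, and, !=, >=, >, | and *) used through lxml. The markup and the `data-rank` attribute are hypothetical and were added just to have something numeric to compare.

```python
from lxml import etree

# Hypothetical markup; data-rank is an invented attribute for the comparisons.
page = etree.HTML('''
<ul>
  <li class="item-0" data-rank="1"><a href="link1.html">first item</a></li>
  <li class="item-1" data-rank="2"><a href="link2.html">second item</a></li>
  <li class="item-inactive" data-rank="3"><a href="link3.html">third item</a></li>
</ul>
''')

# or: either condition may hold
either = page.xpath('//li[@class="item-0" or @class="item-1"]/a/text()')
# !=: not equal
not_equal = page.xpath('//li[@class!="item-0"]/a/text()')
# >=: numeric comparison (the attribute value is converted to a number)
numeric = page.xpath('//li[@data-rank >= 2]/a/text()')
# and: both conditions must hold
both = page.xpath('//li[@data-rank > 1 and @class="item-1"]/a/text()')
# |: union of two node sets
union = page.xpath('//li[1]/a/@href | //li[3]/a/@href')
# *: arithmetic on a numeric result; lxml returns a float
doubled = page.xpath('count(//li) * 2')

print(either, not_equal, numeric, both, union, doubled)
```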
6. Sequential selection (when multiple nodes are matched but only one of them is wanted)
from lxml import etree

text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">Love me China</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''

# --- Select by order after matching nodes ---
html = etree.HTML(text)
result1 = html.xpath('//li[1]/a/text()')  # select the first of the matched li nodes
print(result1)
result2 = html.xpath('//li[last()]/a/text()')  # select the last of the matched li nodes
print(result2)
result3 = html.xpath('//li[position()<3]/a/text()')  # select the li nodes whose position is less than 3, that is the 1st and 2nd
print(result3)
result4 = html.xpath('//li[last()-2]/a/text()')  # select the third from last of the matched li nodes
print(result4)

# --- Node axis selection ---
html = etree.HTML(text)
result5 = html.xpath('//li[1]/ancestor::*')  # select all ancestor nodes of the first li node
print(result5)
result6 = html.xpath('//li[1]/attribute::*')  # select all attribute values of the first li node
print(result6)
result7 = html.xpath('//li[1]/child::a')  # select the direct a child nodes of the first li node
print(result7)
result8 = html.xpath('//li[1]/descendant::a')  # select the a descendant nodes of the first li node
print(result8)
result9 = html.xpath('//li[1]/following::*')  # select all nodes that come after the first li node
print(result9)
result10 = html.xpath('//li[1]/following-sibling::*')  # select all sibling nodes that come after the first li node
print(result10)
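One detail of sequential selection is easy to miss when the document has more than one list: a predicate such as [1] is evaluated relative to each parent, not over the whole result. The sketch below uses a hypothetical two-list document to show the difference between `//li[1]` and `(//li)[1]`.

```python
from lxml import etree

# Hypothetical markup with two separate lists.
doc = etree.HTML('''
<div>
  <ul><li>a1</li><li>a2</li></ul>
  <ul><li>b1</li><li>b2</li></ul>
</div>
''')

per_parent = doc.xpath('//li[1]/text()')           # first li inside each ul
overall = doc.xpath('(//li)[1]/text()')            # first li in the whole document
last_overall = doc.xpath('(//li)[last()]/text()')  # last li in the whole document
siblings = doc.xpath('//li[1]/following-sibling::li/text()')

print(per_parent, overall, last_overall, siblings)
```

With a single list, as in the example above, both forms give the same result. Wrapping the path in parentheses only matters once the matched nodes have different parents.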