This article introduces lxml's etree module and XPath for Python crawlers, with worked examples. The content is fairly detailed, and I hope it helps.
lxml: Python's HTML/XML parser
Official documentation: https://lxml.de/
Before using it, you need to install the lxml package.
Functions:
1. Parse HTML: etree.HTML(text) parses an HTML fragment given as a string into an HTML document
2. Read XML files
3. Use etree together with XPath
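To make the first point concrete, here is a minimal sketch of what "parsing a fragment into a document" looks like in practice. The fragment and its class names are made up for illustration. It is deliberately incomplete: it has no html or body tags, and the last li is never closed.

```python
from lxml import etree

# Hypothetical, deliberately incomplete fragment:
# no <html>/<body> wrapper, and the second <li> is not closed.
fragment = '<ul><li class="item-0">first</li><li class="item-1">second</ul>'

root = etree.HTML(fragment)               # returns the root element of the parsed document
repaired = etree.tostring(root).decode()  # serialize back to a str

print(root.tag)   # html
print(repaired)
```

The serialized output should show the fragment wrapped in html and body elements with the missing closing tag filled in, which is why etree.HTML is convenient for messy pages fetched by a crawler.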
Installation of lxml
"PyCharm" > "File" > "Settings" > "Project Interpreter" > "+" > search for "lxml" > "Install"
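Outside PyCharm, the package can also be installed from the command line with pip install lxml. Either way, a quick check like the following sketch can confirm that the interpreter your project uses actually sees the package. It only prints version information and assumes nothing about your project.

```python
# Quick check that lxml is importable from the current interpreter.
from lxml import etree

# LXML_VERSION and LIBXML_VERSION are version tuples exposed by lxml.etree.
print("lxml version:", etree.LXML_VERSION)
print("libxml2 version:", etree.LIBXML_VERSION)
```

If the import fails with ModuleNotFoundError, the package was most likely installed into a different interpreter than the one selected for the project.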
Examples:
Parsing HTML with lxml etree
# Install lxml first
# Use lxml to parse HTML code
from lxml import etree

text = '''
<p>
    <ul>
        <li class="item-0"><a href="0.html">item 0</a></li>
        <li class="item-1"><a href="1.html">item 1</a></li>
        <li class="item-2"><a href="2.html">item 2</a></li>
        <li class="item-3"><a href="3.html">item 3</a></li>
        <li class="item-4"><a href="4.html">item 4</a></li>
        <li class="item-5"><a href="5.html">item 5</a></li>
    </ul>
</p>
'''

# etree.HTML parses the string into an HTML document
html = etree.HTML(text)
# tostring returns bytes, so decode it to get a str
s = etree.tostring(html).decode()
print(s)
Run results
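Once the fragment is parsed, the resulting element can be queried with XPath directly. The sketch below uses a shortened copy of the list above (three items instead of six, purely for brevity) and shows three common query shapes: attribute values, text nodes, and filtering by an attribute.

```python
from lxml import etree

text = '''
<ul>
    <li class="item-0"><a href="0.html">item 0</a></li>
    <li class="item-1"><a href="1.html">item 1</a></li>
    <li class="item-2"><a href="2.html">item 2</a></li>
</ul>
'''
html = etree.HTML(text)

hrefs = html.xpath('//li/a/@href')                    # attribute values
labels = html.xpath('//li/a/text()')                  # text nodes
first = html.xpath('//li[@class="item-0"]/a/text()')  # filter by attribute value

print(hrefs)   # ['0.html', '1.html', '2.html']
print(labels)  # ['item 0', 'item 1', 'item 2']
print(first)   # ['item 0']
```

Note that xpath() returns a list in all three cases, even when only one node matches.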
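Reading an XML file with lxml etree
# Read an XML file with lxml etree
from lxml import etree

xml = etree.parse("./py24.xml")
# tostring returns bytes; call .decode() on the result if you need a str
sxml = etree.tostring(xml, pretty_print=True)
print(sxml)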
Run results
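The contents of py24.xml are not shown here, so the following is a minimal sketch that uses a made-up stand-in file. The file name sample_books.xml, the bookstore/book/title/year layout, and the book data are all assumptions for illustration. Only the category attribute and the book and year element names come from the queries used in this article.

```python
import os
import tempfile
from lxml import etree

# Hypothetical stand-in for py24.xml; the real file's contents are not known.
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<bookstore>
    <book category="sport">
        <title>Running Basics</title>
        <year>2018</year>
    </book>
    <book category="cooking">
        <title>Everyday Soup</title>
        <year>2005</year>
    </book>
</bookstore>
"""

# Write the sample to a temporary directory so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample_books.xml")
with open(path, "w", encoding="utf-8") as f:
    f.write(SAMPLE)

tree = etree.parse(path)   # parse() returns an element tree, not an element
root = tree.getroot()      # the <bookstore> element
# encoding="unicode" makes tostring return a str instead of bytes
pretty = etree.tostring(tree, pretty_print=True, encoding="unicode")

print(root.tag)    # bookstore
print(len(root))   # 2
print(pretty)
```

Passing encoding="unicode" is one way to avoid the b'...' prefix that appears when the bytes returned by tostring are printed directly.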
Using etree and XPath together
# Read an XML file with lxml etree and query it with XPath
from lxml import etree

xml = etree.parse("./py24.xml")
print(type(xml))

# Find all book nodes
rst = xml.xpath('//book')
print(rst)

# Find the book elements whose category attribute is "sport"
rst2 = xml.xpath('//book[@category="sport"]')
print(type(rst2))
print(rst2)

# Find the year element under the book element whose category attribute is "sport"
rst3 = xml.xpath('//book[@category="sport"]/year')
rst3 = rst3[0]
print('-------------\n', type(rst3))
print(rst3.tag)
print(rst3.text)
Run results
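Because xpath() always returns a list, indexing with [0] raises an IndexError when nothing matches. The sketch below illustrates this on a small in-memory document. The XML here is hypothetical (the real py24.xml is not shown), and only the book, year, and category names are taken from the queries above.

```python
from lxml import etree

# Hypothetical data standing in for py24.xml.
xml = etree.fromstring(
    '<bookstore>'
    '<book category="sport"><title>Running Basics</title><year>2018</year></book>'
    '<book category="cooking"><title>Everyday Soup</title><year>2005</year></book>'
    '</bookstore>'
)

books = xml.xpath('//book')                                       # list of elements
sport_years = xml.xpath('//book[@category="sport"]/year/text()')  # list of strings
categories = xml.xpath('//book/@category')                        # list of attribute values
missing = xml.xpath('//book[@category="travel"]')                 # no match gives an empty list

print(len(books))    # 2
print(sport_years)   # ['2018']
print(categories)    # ['sport', 'cooking']
print(missing)       # []

# Guard before indexing so an empty result does not raise IndexError.
year = sport_years[0] if sport_years else None
print(year)          # 2018
```

Appending /text() or /@category to the path returns plain strings, which is often more convenient in a crawler than reading .text from each element afterwards.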