Environment: Python 2.7
Installing the lxml module
pip install lxml
Example:
from lxml import etree

text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
</div>
'''

# etree.HTML parses the string and returns an Element object
# (printing it directly only shows its memory address)
html = etree.HTML(text)

# Serialize the tree back to source; lxml completes the markup,
# for example by adding the "body" tag
result = etree.tostring(html)
print(result)
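Output: the serialized source, with the markup completed by lxml (the unclosed li is closed and the html and body tags are added).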
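To make the repair concrete, here is a minimal sketch that checks what the HTML parser adds. The shortened `snippet` string is an assumption made for illustration, not the author's data, and the code is written to run on both Python 2.7 and Python 3.

```python
from lxml import etree

# Hypothetical shortened input: the <li> is never closed and there is
# no <html> or <body> wrapper.
snippet = '<div><ul><li class="item-0"><a href="link1.html">first item</a></ul></div>'

# etree.HTML uses lxml's lenient HTML parser and returns the root element.
root = etree.HTML(snippet)

# encoding="unicode" asks tostring for a text string instead of bytes.
result = etree.tostring(root, encoding="unicode")

print(root.tag)   # tag name of the root element the parser created
print(result)     # the repaired, serialized markup
```

Serializing to a text string rather than bytes keeps the printed result readable on Python 3, where the default return value of `etree.tostring` is a bytes object.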
# Read the content from a file
from lxml import etree

# etree.parse uses the XML parser by default, so hello.html must be well-formed
html = etree.parse('hello.html')
result = etree.tostring(html, pretty_print=True)
print(result)
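If the file is not well-formed (an unclosed tag, for instance), the default XML parser raises an error. A possible workaround, sketched here under assumptions, is to pass an `etree.HTMLParser()` to `etree.parse`. The file content and the temporary directory below are made up so the example is self-contained.

```python
import os
import tempfile

from lxml import etree

# Assumed file content: the last <li> is left unclosed on purpose.
content = (
    '<div><ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a>'
    '</ul></div>'
)

# Write a throwaway hello.html so the sketch does not depend on an existing file.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "hello.html")
with open(path, "w") as f:
    f.write(content)

# The HTML parser tolerates the broken markup and repairs it while parsing.
parser = etree.HTMLParser()
tree = etree.parse(path, parser)

pretty = etree.tostring(tree, pretty_print=True, encoding="unicode")
print(pretty)
```

Here `tree` is an ElementTree; `tree.getroot()` returns the `html` element that the parser built around the fragment.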
Get all the li tags:
from lxml import etree

html = etree.parse('hello.html')
print(type(html))         # the parsed document (an ElementTree)

result = html.xpath('//li')
print(result)             # list of matching li elements
print(len(result))        # number of li elements found
print(type(result))       # xpath returns a list
print(type(result[0]))    # each entry is an Element
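The same `xpath` call can also return attribute values and text instead of elements. The sketch below assumes a two-item subset of the sample markup, parsed from a string rather than from hello.html so that it runs on its own.

```python
from lxml import etree

# Assumed sample: a two-item subset of the list used earlier.
text = '''
<ul>
  <li class="item-0"><a href="link1.html">first item</a></li>
  <li class="item-1"><a href="link2.html">second item</a></li>
</ul>
'''
html = etree.HTML(text)

items = html.xpath('//li')             # list of Element objects
classes = html.xpath('//li/@class')    # attribute values as strings
texts = html.xpath('//li/a/text()')    # text nodes as strings

# A relative path ('./...') searches from one element instead of the whole tree.
first_href = items[0].xpath('./a/@href')[0]

print(len(items))
print(classes)
print(texts)
print(first_href)
```

Selecting `@class` or `text()` in the path avoids looping over the elements just to read one attribute or the link text.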
Reference article: http://cuiqingcai.com/2621.html
Note: This post is only my own notes from learning the lxml module, so it is not well written.
This article is from the "Tiandaochouqin" blog; please do not reprint.
Python's lxml module