pyquery is another powerful and flexible library for parsing web pages. If you have front-end development experience and have worked with jQuery, pyquery is a very good choice: it is a Python library modeled closely on jQuery, and its syntax is almost identical, so there is no unfamiliar API to memorize.
Official documentation: http://pyquery.readthedocs.io/en/latest/
jQuery reference documentation: http://jquery.cuishifeng.cn/
1. Initializing from a string
    from pyquery import PyQuery as pq

    html = '''
    <div>
        <ul>
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''

    doc = pq(html)      # build a PyQuery object from the HTML string
    print(doc)          # the whole document
    print(type(doc))    # the type of the parsed object
    print(doc('li'))    # every li element
Run results:
    <div>
        <ul>
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    <class 'pyquery.pyquery.PyQuery'>
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
2. Initializing from an HTML file
Pay attention to the file path: a relative filename is resolved against the current working directory.
    from pyquery import PyQuery as pq

    doc = pq(filename='index.html')   # parse a local HTML file
    print(doc)
    print(doc('head'))
Run results:
    <title>Title</title>
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul></div>" "</body>
3. Initializing from a URL
    from pyquery import PyQuery as pq

    doc = pq('https://www.baidu.com')
    # doc1 = pq(url='https://www.baidu.com')   # equivalent, with the url keyword
    print(doc)
    print(doc('head'))
4. Selecting elements with CSS selectors
    from pyquery import PyQuery as pq

    html = '''
    <div>
        <ul id="haha">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''

    doc = pq(html)
    print(doc)
    # Select the span inside the a tag inside an element with class item-0,
    # under the element whose id is haha (levels are separated by spaces)
    print(doc('#haha .item-0 a span'))
Run results:
    <div>
        <ul id="haha">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    <span class="bold">third item</span>
Basic usage of pyquery in Python crawlers