Most of this article is reproduced from https://www.jianshu.com/p/c07f7cd1b548
First, here is my own code that extracts the image URLs from a TechWeb page:
import requests
from pyquery import PyQuery as pq

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'}

def get_info(url):
    # Download the page; verify=False skips TLS certificate checks
    html = requests.get(url, headers=headers, verify=False)
    d = pq(html.content)
    # Narrow down to the containers that hold the images
    doc = d("div").filter(".list_con")
    doc = doc("div").filter(".picture_text")
    for tr in doc.items():
        temp = tr.find("img")
        print(temp.attr("src"))

if __name__ == "__main__":
    get_info("http://mi.techweb.com.cn/")
Objective
Python has many crawler libraries, each with its own strengths. Anyone who knows front-end development also knows that jQuery can precisely locate and manipulate objects in the DOM tree through selectors, so I thought it would be cool to crawl the web with jQuery-style selectors.
I went looking for a DOM-related library in Python, and I really did find one: pyquery!
Pyquery Introduction
pyquery is essentially a Python implementation of jQuery that can be used to parse HTML pages. Its syntax is almost identical to jQuery's, so it feels familiar to anyone who has used jQuery.
In the author's own words:
"The API is as much as possible similar to jquery."
Installation
You can use pip or easy_install.
Note: pyquery depends on lxml, so the installation will report a failure if lxml has not been installed first.
- Install lxml: https://pypi.python.org/pypi/lxml/2.3/ (downloading the installation package directly is recommended; it is convenient and quick);
- Install pyquery: easy_install pyquery or pip install pyquery;
- Verify: start Python and enter import pyquery; if no error is reported, the installation succeeded.
Initialization
There are 4 ways to initialize:
You can use pyquery by passing in a string, an lxml element, a file, or a URL.
from pyquery import PyQuery as pq
from lxml import etree
d = pq("<html></html>")  # pass in a string
d = pq(etree.fromstring("<html></html>"))  # pass in an lxml element
d = pq(url='http://google.com/')  # pass in a URL
d = pq(filename=path_to_html_file)  # pass in a file
Now, d is like the $ in jQuery.
Example
A simple example gives a quick feel for how pyquery is used. The file example.html that is passed in contains the following:
<div><tr class="item-0"><td>first section</td><td>1111</td><td>17-01-28 22:51</td></tr><tr class="item-1"><td>second section</td><td>2222</td><td>17-01-28 22:53</td></tr></div>
Python program:
# -*- coding: utf-8 -*-
from pyquery import PyQuery as pq  # import PyQuery
doc = pq(filename='example.html')  # pass in the file example.html
print(doc.html())  # html() returns the HTML of the current selection
print(doc('.item-1'))  # class selector: select the HTML block with class item-1
data = doc('tr')  # select the <tr> elements
for tr in data.items():  # iterate over the <tr> elements in data
    temp = tr('td').eq(2).text()  # text of the 3rd <td> element
    print(temp)
Output:
# print(doc.html())
<tr class="item-0"><td>first section</td><td>1111</td><td>17-01-28 22:51</td></tr><tr class="item-1"><td>second section</td><td>2222</td><td>17-01-28 22:53</td></tr>
# print(doc('.item-1'))
<tr class="item-1"><td>second section</td><td>2222</td><td>17-01-28 22:53</td></tr>
# print(tr('td').eq(2).text())
17-01-28 22:51
# print(tr('td').eq(2).text())
17-01-28 22:53
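Operations
1. .html() and .text(): get the corresponding HTML block or text content.
p = pq("<head><title>Hello World!</title></head>")
print(p('head').html())  # get the corresponding HTML block
print(p('head').text())  # get the corresponding text content
'''Output:
<title>Hello World!</title>
Hello World!
'''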
2. ('selector'): get the target content by selector.
d = pq("<div><p id='item-0'>test 1</p><p class='item-1'>test 2</p></div>")
print(d('div').html())  # get the HTML block inside the <div> element
print(d('#item-0').text())  # get the text of the element with id item-0
print(d('.item-1').text())  # get the text of the element with class item-1
'''Output:
<p id="item-0">test 1</p><p class="item-1">test 2</p>
test 1
test 2
'''
3. .eq(index): get the element at the given index (indexes start at 0).
d = pq("<div><p id='item-0'>test 1</p><p class='item-1'>test 2</p></div>")
print(d('p').eq(1).text())  # get the text of the second p element
'''Output:
test 2
'''
4. .find(): find nested elements.
d = pq("<div><p id='item-0'>test 1</p><p class='item-1'>test 2</p></div>")
print(d('div').find('p'))  # find the p elements inside <div>
print(d('div').find('p').eq(0))  # find the p elements inside <div> and output the first one
'''Output:
<p id="item-0">test 1</p><p class="item-1">test 2</p>
<p id="item-0">test 1</p>
'''
5. .filter(): filter the selected elements by class or id.
d = pq("<div><p id='item-0'>test 1</p><p class='item-1'>test 2</p></div>")
print(d('p').filter('.item-1'))  # keep the p element with class item-1
print(d('p').filter('#item-0'))  # keep the p element with id item-0
'''Output:
<p class="item-1">test 2</p>
<p id="item-0">test 1</p>
'''
6. .attr(): get or modify an attribute value.
d = pq("<div><p id='item-0'>test 1</p><a class='item-1'>test 2</a></div>")
print(d('p').attr('id'))  # get the id attribute of the <p> tag
print(d('a').attr('class', 'new'))  # change the class attribute of the <a> tag to new
7. Other operations:
.addClass(value): add a class;
.hasClass(name): check whether the element has the given class; returns True or False;
.children(): get the child elements;
.parents(): get the ancestor elements;
.next(): get the next sibling element;
.nextAll(): get all the sibling elements that follow;
.not_('selector'): get all the elements that do not match the selector;
for i in d.items('li'): print(i.text()): iterate over the li elements in d;
Conclusion
The operations above are basically enough for everyday crawling of small amounts of data. pyquery can do much more than is covered here, so if you want to learn more, check the official documentation.
The official documentation is in English, but it is fairly easy to read and understand. I also found a Chinese-language tutorial site, which is listed here as well.
Official Document: https://pythonhosted.org/pyquery/index.html#
Chinese tutorial: http://www.geoinformatics.cn/lab/pyquery/
Python parsing HTML: Introduction and use of the Pyquery library