This article presents an example of forging random request headers for a Pyspider crawler. It should be a useful reference for anyone who needs to do the same.
Pyspider uses the Tornado library to make HTTP requests, and various parameters can be attached to each request, such as the request timeout, the connect timeout, and the request headers. In the stock Pyspider framework, however, the only way to set these parameters for a whole project is through the crawl_config Python dictionary (see below); the framework converts the entries of this dictionary into task data when issuing HTTP requests. The drawback of this mechanism is that it is inconvenient to give each request a different, randomly chosen request header.
crawl_config = {
    "user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
    "timeout": 120,         # this value was garbled in the source; 120 is an assumed example
    "connect_timeout": 60,
    "retries": 5,
    "fetch_type": 'js',
    "auto_recrawl": True,
}
Here's how to add a random request header to the crawler:
1. Write the following script and save it in Pyspider's libs folder as headers_switch.py (the filename must match the import in step 2):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Created on 2017-10-18 11:52:26
import random


class HeadersSelector(object):
    """The headers omit two fields: Host and Cookie."""

    headers_1 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "DNT": "1",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4",
        "Referer": "https://www.baidu.com/s?wd=%bc%96%e7%a0%81&rsv_spt=1&rsv_iqid=0x9fcbc99a0000b5d7&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&rqlang=cn&tn=baiduhome_pg&rsv_enter=0&oq=if-none-match&inputt=7282&rsv_t",
        "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7",
    }  # a browser header found online

    headers_2 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0",
        "Accept": "image/gif,image/x-xbitmap,image/jpeg,application/x-shockwave-flash,application/vnd.ms-excel,application/vnd.ms-powerpoint,application/msword,*/*",
        "DNT": "1",
        "Referer": "https://www.baidu.com/link?url=c-fmhf06-zphorm4twduhrakxhnsm_rzjxz-ztfnpavzn",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4",
    }  # Windows 7 browser

    headers_3 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
        "Accept": "image/x-xbitmap,image/jpeg,application/x-shockwave-flash,application/vnd.ms-excel,application/vnd.ms-powerpoint,application/msword,*/*",
        "DNT": "1",
        "Referer": "https://www.baidu.com/s?wd=http%b4%20pragma&rsf=1&rsp=4&f=1&oq=pragma&tn=baiduhome_pg&ie=utf-8&usm=3&rsv_idx=2&rsv_pq=e9bd5e5000010",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.7,en;q=0.6",
    }  # Linux Firefox

    headers_4 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0",
        "Accept": "*/*",
        "DNT": "1",
        "Referer": "https://www.baidu.com/link?url=c-fmhf06-zphorm4twduhrakxhnsm_rzjxz-ztfnp",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.7,en;q=0.6",
    }  # Windows 10 Firefox

    headers_5 = {
        "Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Referer": "https://www.baidu.com/link?url=c-fmhf06-zphorm4twduhrakxhnsm_rzjxz-",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.7,en;q=0.6",
        "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7",
    }  # Windows 10 Edge (the UA string carries Edge/15.15063)

    headers_6 = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "DNT": "1",
        "Referer": "https://www.baidu.com/s?wd=if-none-match&rsv_spt=1&rsv_iqid=0x9fcbc99a0000b5d7&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&rq",
        "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0",
    }  # Windows 10 browser

    def select_header(self):
        n = random.randint(1, 6)
        switch = {
            1: self.headers_1,
            2: self.headers_2,
            3: self.headers_3,
            4: self.headers_4,
            5: self.headers_5,
            6: self.headers_6,
        }
        return switch[n]
Here I wrote only six request headers. If the crawl volume is very large, you can write many more, even hundreds, and then widen the random range to choose from; a more scalable variant is sketched below.
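As a minimal sketch of that idea (this variant is my own, not part of the original article): keep the header dictionaries in a list and draw one with random.choice, so adding a header means appending to the list rather than editing the selection logic.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Hypothetical scalable variant of HeadersSelector: headers live in a
# list, so new entries require no change to the selection code.
import random

class HeadersSelector(object):
    HEADERS = [
        # fill in as many header dicts as you like; the two below are placeholders
        {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 "
                       "(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"},
        {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) "
                       "Gecko/20100101 Firefox/52.0"},
    ]

    def select_header(self):
        # random.choice scales to any list length, unlike a hand-written
        # {1: ..., 2: ...} switch dictionary
        return random.choice(self.HEADERS)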
2. Write the following code in the Pyspider script:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2017-08-18 11:52:26
from pyspider.libs.base_handler import *
from pyspider.libs.headers_switch import HeadersSelector
import sys

defaultencoding = 'utf-8'
if sys.getdefaultencoding() != defaultencoding:
    reload(sys)
    sys.setdefaultencoding(defaultencoding)


class Handler(BaseHandler):
    crawl_config = {
        "user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
        "timeout": 120,         # values garbled in the source; 120/60 are assumed examples
        "connect_timeout": 60,
        "retries": 5,
        "fetch_type": 'js',
        "auto_recrawl": True,
    }

    @every(minutes=24)
    def on_start(self):
        header_slt = HeadersSelector()
        header = header_slt.select_header()  # get a new header
        # header["X-Requested-With"] = "XMLHttpRequest"
        orig_href = 'http://sww.bjxch.gov.cn/gggs.html'
        self.crawl(orig_href,
                   callback=self.index_page,
                   headers=header)  # the headers must be passed to crawl(); cookies come back in response.cookies

    @config(age=24 * 60 * 60)
    def index_page(self, response):
        header_slt = HeadersSelector()
        header = header_slt.select_header()  # get a new header
        # header["X-Requested-With"] = "XMLHttpRequest"
        if response.cookies:
            header["Cookie"] = response.cookies
The most important point is that in every callback function (on_start, index_page, and so on), a header selector is instantiated on each invocation, so that every request carries a different header. Be sure to add the following code:
header_slt = HeadersSelector()
header = header_slt.select_header()  # get a new header
# header["X-Requested-With"] = "XMLHttpRequest"
header["Host"] = "www.baidu.com"
if response.cookies:
    header["Cookie"] = response.cookies
When a site issues AJAX requests via XHR, the server often checks the X-Requested-With header to decide whether a request is an AJAX request, so you usually need to add {'X-Requested-With': 'XMLHttpRequest'} to the headers to fetch that content.
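For instance, a minimal sketch (the URL and the detail_page callback here are hypothetical, not from the original project):

header_slt = HeadersSelector()
header = header_slt.select_header()
header["X-Requested-With"] = "XMLHttpRequest"  # mark the request as XHR/AJAX
self.crawl('http://example.com/ajax/list', callback=self.detail_page, headers=header)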
The Host field of the request header is determined by the URL being crawled, so add it as needed. The urlparse module provides functions that parse a URL into its parts; the host is simply the netloc attribute of the result, as sketched below.
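A minimal sketch (Python 2 import, matching the scripts above; in Python 3 the equivalent is from urllib.parse import urlparse):

from urlparse import urlparse  # Python 2; in Python 3: from urllib.parse import urlparse

url = 'http://sww.bjxch.gov.cn/gggs.html'
header["Host"] = urlparse(url).netloc  # -> 'sww.bjxch.gov.cn'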
If the response contains cookies, you need to add them to the request header of the follow-up request.
If you have other disguise requirements, add the corresponding fields yourself.
That is all it takes to implement random request headers. The end.