An example of forging random request headers in a Pyspider crawler

This article introduces, by example, how to forge a random request header for each request in a Pyspider crawler. It may be a useful reference for anyone who needs to do the same.

Pyspider uses the Tornado library to make HTTP requests. Various parameters can be attached to a request, such as the request timeout, the connection timeout, and the request headers. However, in Pyspider's original framework, the only way to set these parameters for a crawler is through the crawl_config Python dictionary (see below); the framework converts the entries of this dictionary into task data and issues the HTTP request. The drawback of this mechanism is that it is inconvenient to construct a different, random request header for each request.

crawl_config = {
    "user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
    "timeout": 120,           # request timeout in seconds (assumed; the value is illegible in the source)
    "connect_timeout": 60,
    "retries": 5,
    "fetch_type": 'js',
    "auto_recrawl": True,
}

Here's how to add a random request header to the crawler:

1. Write the script below, put it in pyspider's libs folder, and name it headers_switch.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Created on 2017-10-18 11:52:26

import random


class HeadersSelector(object):
    """Headers deliberately omit the Host and Cookie fields."""

    headers_1 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "DNT": "1",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4",
        "Referer": "https://www.baidu.com/s?wd=%bc%96%e7%a0%81&rsv_spt=1&rsv_iqid=0x9fcbc99a0000b5d7&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&rqlang=cn&tn=baiduhome_pg&rsv_enter=0&oq=if-none-match&inputt=7282&rsv_t",
        "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7",
    }  # browser headers found online
    headers_2 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0",
        "Accept": "image/gif,image/x-xbitmap,image/jpeg,application/x-shockwave-flash,application/vnd.ms-excel,application/vnd.ms-powerpoint,application/msword,*/*",
        "DNT": "1",
        "Referer": "https://www.baidu.com/link?url=c-fmhf06-zphorm4twduhrakxhnsm_rzjxz-ztfnpavzn",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4",
    }  # Windows 7 browser
    headers_3 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
        "Accept": "image/x-xbitmap,image/jpeg,application/x-shockwave-flash,application/vnd.ms-excel,application/vnd.ms-powerpoint,application/msword,*/*",
        "DNT": "1",
        "Referer": "https://www.baidu.com/s?wd=http%b4%20pragma&rsf=1&rsp=4&f=1&oq=pragma&tn=baiduhome_pg&ie=utf-8&usm=3&rsv_idx=2&rsv_pq=e9bd5e5000010",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.7,en;q=0.6",
    }  # Firefox browser on Linux
    headers_4 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0",
        "Accept": "*/*",
        "DNT": "1",
        "Referer": "https://www.baidu.com/link?url=c-fmhf06-zphorm4twduhrakxhnsm_rzjxz-ztfnp",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.7,en;q=0.6",
    }  # Firefox browser on Windows 10
    headers_5 = {
        "Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Referer": "https://www.baidu.com/link?url=c-fmhf06-zphorm4twduhrakxhnsm_rzjxz-",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.7,en;q=0.6",
        "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7",
    }  # Chrome browser on Windows 10
    headers_6 = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "DNT": "1",
        "Referer": "https://www.baidu.com/s?wd=if-none-match&rsv_spt=1&rsv_iqid=0x9fcbc99a0000b5d7&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&rq",
        "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0",
    }  # Windows 10 browser

    def __init__(self):
        pass

    def select_header(self):
        n = random.randint(1, 6)
        switch = {
            1: self.headers_1,
            2: self.headers_2,
            3: self.headers_3,
            4: self.headers_4,
            5: self.headers_5,
            6: self.headers_6,
        }
        headers = switch[n]
        return headers
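
To sanity-check the selector before wiring it into a crawler, a quick standalone run (assuming the module sits on the import path described in step 1) might look like this:

from pyspider.libs.headers_switch import HeadersSelector

selector = HeadersSelector()
header = selector.select_header()
print(header["User-Agent"])  # one of the six User-Agent strings, chosen at random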

I wrote only six request headers here. If the crawl volume is very large, you can write many more request headers, even hundreds, and widen the range of the random selection accordingly; one way to do that is sketched below.
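
A sketch of a list-based variant (not from the original article): keep the header dictionaries in a list and let random.choice pick one, so adding more headers never requires touching the selection logic. The User-Agent strings below are abbreviated placeholders; use full dictionaries like those above.

import random

# A pool of header dicts; append as many as you like.
HEADER_POOL = [
    {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) ..."},       # abbreviated placeholder
    {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) ..."},  # abbreviated placeholder
    # ... hundreds more
]

def select_header():
    # random.choice works for any pool size, so no randint range to maintain
    return random.choice(HEADER_POOL)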

2. Write the following code in the Pyspider script:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2017-08-18 11:52:26

from pyspider.libs.base_handler import *
from pyspider.libs.headers_switch import HeadersSelector
import sys

defaultencoding = 'utf-8'
if sys.getdefaultencoding() != defaultencoding:
    reload(sys)
    sys.setdefaultencoding(defaultencoding)


class Handler(BaseHandler):
    crawl_config = {
        "user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
        "timeout": 120,           # assumed; the timeout values are illegible in the source
        "connect_timeout": 60,
        "retries": 5,
        "fetch_type": 'js',
        "auto_recrawl": True,
    }

    @every(minutes=24)
    def on_start(self):
        header_slt = HeadersSelector()
        header = header_slt.select_header()  # get a new header
        # header["X-Requested-With"] = "XMLHttpRequest"
        orig_href = 'http://sww.bjxch.gov.cn/gggs.html'
        self.crawl(orig_href,
                   callback=self.index_page,
                   headers=header)  # the headers must be passed to self.crawl(); cookies are found in response.cookies

    @config(age=24 * 60 * 60)
    def index_page(self, response):
        header_slt = HeadersSelector()
        header = header_slt.select_header()  # get a new header
        # header["X-Requested-With"] = "XMLHttpRequest"
        if response.cookies:
            header["Cookie"] = response.cookies

The most important point is that in each callback function, such as on_start and index_page, a header selector is instantiated on every invocation, so a different header is attached to each request. Note that the following code is added:

header_slt = HeadersSelector()
header = header_slt.select_header()  # get a new header
# header["X-Requested-With"] = "XMLHttpRequest"
header["Host"] = "www.baidu.com"
if response.cookies:
    header["Cookie"] = response.cookies

When a site fetches content with XHR (AJAX), servers often check the X-Requested-With header to decide whether a request is an AJAX request, so add {'X-Requested-With': 'XMLHttpRequest'} to the headers when fetching that content.
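
For instance, inside a handler callback (a hypothetical sketch; ajax_url and self.parse_ajax are placeholder names, not from the original article):

header = HeadersSelector().select_header()
header["X-Requested-With"] = "XMLHttpRequest"  # mark the request as AJAX
self.crawl(ajax_url, callback=self.parse_ajax, headers=header)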

The URL also determines the Host field of the request header; add it as needed. The urlparse module provides a function that parses the host out of a URL; just take the netloc attribute of the result.
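
A minimal sketch of that, assuming Python 2 as in the scripts above (on Python 3 the function lives in urllib.parse):

from urlparse import urlparse  # Python 3: from urllib.parse import urlparse

url = "http://sww.bjxch.gov.cn/gggs.html"
header["Host"] = urlparse(url).netloc  # "sww.bjxch.gov.cn"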

If there is a cookie in the response, you need to add the cookie to the request header.

If there are other camouflage requirements, add them yourself.

With that, random request headers are implemented.
