An example of forging random request headers in a Pyspider crawler

This article introduces, by example, how to forge a random request header for each request in a Pyspider crawler. It may be a useful reference for anyone who needs to do the same.

Pyspider uses the Tornado library to make HTTP requests. Various parameters can be attached to a request, such as the request timeout, the connection timeout, and the request headers. However, in Pyspider's original framework, the only way to set these parameters for a crawler is through the crawl_config Python dictionary (see below); the framework converts the entries of this dictionary into task data and issues the HTTP request. The drawback of this mechanism is that it is inconvenient to construct a different, random request header for each request.

crawl_config = {
    "user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
    "timeout": 120,           # request timeout in seconds (assumed; the value is illegible in the source)
    "connect_timeout": 60,
    "retries": 5,
    "fetch_type": 'js',
    "auto_recrawl": True,
}

Here's how to add a random request header to the crawler:

1. Write the script below, put it in pyspider's libs folder, and name it headers_switch.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Created on 2017-10-18 11:52:26

import random


class HeadersSelector(object):
    """Headers deliberately omit the Host and Cookie fields."""

    headers_1 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "DNT": "1",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4",
        "Referer": "https://www.baidu.com/s?wd=%bc%96%e7%a0%81&rsv_spt=1&rsv_iqid=0x9fcbc99a0000b5d7&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&rqlang=cn&tn=baiduhome_pg&rsv_enter=0&oq=if-none-match&inputt=7282&rsv_t",
        "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7",
    }  # browser headers found online
    headers_2 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0",
        "Accept": "image/gif,image/x-xbitmap,image/jpeg,application/x-shockwave-flash,application/vnd.ms-excel,application/vnd.ms-powerpoint,application/msword,*/*",
        "DNT": "1",
        "Referer": "https://www.baidu.com/link?url=c-fmhf06-zphorm4twduhrakxhnsm_rzjxz-ztfnpavzn",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4",
    }  # Windows 7 browser
    headers_3 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
        "Accept": "image/x-xbitmap,image/jpeg,application/x-shockwave-flash,application/vnd.ms-excel,application/vnd.ms-powerpoint,application/msword,*/*",
        "DNT": "1",
        "Referer": "https://www.baidu.com/s?wd=http%b4%20pragma&rsf=1&rsp=4&f=1&oq=pragma&tn=baiduhome_pg&ie=utf-8&usm=3&rsv_idx=2&rsv_pq=e9bd5e5000010",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.7,en;q=0.6",
    }  # Firefox browser on Linux
    headers_4 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0",
        "Accept": "*/*",
        "DNT": "1",
        "Referer": "https://www.baidu.com/link?url=c-fmhf06-zphorm4twduhrakxhnsm_rzjxz-ztfnp",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.7,en;q=0.6",
    }  # Firefox browser on Windows 10
    headers_5 = {
        "Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Referer": "https://www.baidu.com/link?url=c-fmhf06-zphorm4twduhrakxhnsm_rzjxz-",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.7,en;q=0.6",
        "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7",
    }  # Chrome browser on Windows 10
    headers_6 = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "DNT": "1",
        "Referer": "https://www.baidu.com/s?wd=if-none-match&rsv_spt=1&rsv_iqid=0x9fcbc99a0000b5d7&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&rq",
        "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0",
    }  # Windows 10 browser

    def __init__(self):
        pass

    def select_header(self):
        n = random.randint(1, 6)
        switch = {
            1: self.headers_1,
            2: self.headers_2,
            3: self.headers_3,
            4: self.headers_4,
            5: self.headers_5,
            6: self.headers_6,
        }
        headers = switch[n]
        return headers
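
To sanity-check the selector before wiring it into a crawler, a quick standalone run (assuming the module sits on the import path described in step 1) might look like this:

from pyspider.libs.headers_switch import HeadersSelector

selector = HeadersSelector()
header = selector.select_header()
print(header["User-Agent"])  # one of the six User-Agent strings, chosen at random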

I wrote only six request headers here. If the crawl volume is very large, you can write many more request headers, even hundreds, and widen the range of the random selection accordingly; one way to do that is sketched below.
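
A sketch of a list-based variant (not from the original article): keep the header dictionaries in a list and let random.choice pick one, so adding more headers never requires touching the selection logic. The User-Agent strings below are abbreviated placeholders; use full dictionaries like those above.

import random

# A pool of header dicts; append as many as you like.
HEADER_POOL = [
    {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) ..."},       # abbreviated placeholder
    {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) ..."},  # abbreviated placeholder
    # ... hundreds more
]

def select_header():
    # random.choice works for any pool size, so no randint range to maintain
    return random.choice(HEADER_POOL)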

2. Write the following code in the Pyspider script:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2017-08-18 11:52:26

from pyspider.libs.base_handler import *
from pyspider.libs.headers_switch import HeadersSelector
import sys

defaultencoding = 'utf-8'
if sys.getdefaultencoding() != defaultencoding:
    reload(sys)
    sys.setdefaultencoding(defaultencoding)


class Handler(BaseHandler):
    crawl_config = {
        "user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
        "timeout": 120,           # assumed; the timeout values are illegible in the source
        "connect_timeout": 60,
        "retries": 5,
        "fetch_type": 'js',
        "auto_recrawl": True,
    }

    @every(minutes=24)
    def on_start(self):
        header_slt = HeadersSelector()
        header = header_slt.select_header()  # get a new header
        # header["X-Requested-With"] = "XMLHttpRequest"
        orig_href = 'http://sww.bjxch.gov.cn/gggs.html'
        self.crawl(orig_href,
                   callback=self.index_page,
                   headers=header)  # the headers must be passed to self.crawl(); cookies are found in response.cookies

    @config(age=24 * 60 * 60)
    def index_page(self, response):
        header_slt = HeadersSelector()
        header = header_slt.select_header()  # get a new header
        # header["X-Requested-With"] = "XMLHttpRequest"
        if response.cookies:
            header["Cookie"] = response.cookies

The most important point is that in each callback function, such as on_start and index_page, a header selector is instantiated on every invocation, so a different header is attached to each request. Note that the following code is added:

header_slt = HeadersSelector()
header = header_slt.select_header()  # get a new header
# header["X-Requested-With"] = "XMLHttpRequest"
header["Host"] = "www.baidu.com"
if response.cookies:
    header["Cookie"] = response.cookies

When a site fetches content with XHR (AJAX), servers often check the X-Requested-With header to decide whether a request is an AJAX request, so add {'X-Requested-With': 'XMLHttpRequest'} to the headers when fetching that content.
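
For instance, inside a handler callback (a hypothetical sketch; ajax_url and self.parse_ajax are placeholder names, not from the original article):

header = HeadersSelector().select_header()
header["X-Requested-With"] = "XMLHttpRequest"  # mark the request as AJAX
self.crawl(ajax_url, callback=self.parse_ajax, headers=header)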

The URL also determines the Host field of the request header; add it as needed. The urlparse module provides a function that parses the host out of a URL; just take the netloc attribute of the result.
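
A minimal sketch of that, assuming Python 2 as in the scripts above (on Python 3 the function lives in urllib.parse):

from urlparse import urlparse  # Python 3: from urllib.parse import urlparse

url = "http://sww.bjxch.gov.cn/gggs.html"
header["Host"] = urlparse(url).netloc  # "sww.bjxch.gov.cn"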

If there is a cookie in the response, you need to add the cookie to the request header.

If there are other camouflage requirements, add them yourself.

With that, random request headers are implemented.
