Scrapy research and exploration (7) -- a collection of policies to prevent being banned


After setting download_delay to less than 1 with no other anti-ban policy in place, I finally got myself banned. As follows:



The enemy stepped in and attacked me.

This post focuses on several policies for avoiding a ban and on how to apply them in Scrapy.


1. Policy 1: Set download_delay

This setting was already used in a previous tutorial (http://blog.csdn.net/u012150179/article/details/34913315). Its main role is to set the wait time between downloads: large-scale, concentrated access has the greatest impact on the target server, sharply increasing its load over a short period, and a download delay spreads the requests out.

If the wait time is too long, you cannot crawl at scale within a reasonable period; if it is too short, the probability of being banned rises sharply.

Note:

DOWNLOAD_DELAY can be set in settings.py or on the spider itself, as done in the previous post (http://blog.csdn.net/u012150179/article/details/34913315), so it is not described in detail here.
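As a minimal sketch (the delay value is illustrative, not from the original post), the settings.py form looks like this; RANDOMIZE_DOWNLOAD_DELAY additionally jitters the wait so the access pattern looks less mechanical:

# settings.py -- a minimal sketch; the values are illustrative
DOWNLOAD_DELAY = 2               # seconds to wait between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # wait a random 0.5x-1.5x multiple of DOWNLOAD_DELAY

Alternatively, set a download_delay attribute directly on the spider class, as mentioned above.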


2. Policy 2: Disable cookies

Cookies are data (often encrypted) that some websites store on the client side to identify users. Disabling cookies prevents sites that use them from recognizing the crawler's footprint.

Usage:

Set COOKIES_ENABLED to False in settings.py. That is, do not enable the cookies middleware, so cookies are neither stored nor sent back to the web server.
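The corresponding entry in settings.py is a single line:

# settings.py -- disable the cookies downloader middleware
COOKIES_ENABLED = False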


3. Policy 3: Use the user agent pool

A user agent is a string, sent in the HTTP request headers, that contains browser information, operating system information, and so on. The server uses it to determine whether the current visitor is a browser, a mail client, or a web crawler. You can view the user agent in request.headers. Run the scrapy shell command to see the details:

scrapy shell http://blog.csdn.net/u012150179/article/details/34486677

Inside the shell, inspect request.headers to obtain the user agent:
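A sketch of what that session shows (other default headers trimmed; the exact output format varies with the Scrapy version):

>>> request.headers
{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
 'Accept-Language': 'en',
 'User-Agent': 'Scrapy/0.22.2 (+http://scrapy.org)'}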


As the output shows, Scrapy identifies itself as Scrapy/0.22.2 by default, which plainly exposes the fact that the requests come from a crawler.


Usage:

First, write your own user agent middleware: create rotate_useragent.py with the following code:

# -*- coding: utf-8 -*-
"""One of the anti-ban policies: use a user agent pool.
Note: the middleware must be enabled in settings.py."""

import random

from scrapy import log
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):

    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            # Display the user agent currently in use
            print "********Current UserAgent: %s************" % ua
            # Also record it in the log
            log.msg('Current UserAgent: ' + ua, level=log.INFO)
            request.headers.setdefault('User-Agent', ua)

    # The default user_agent_list is composed of Chrome, IE, Firefox, Mozilla, Opera and Netscape strings.
    # More user agent strings can be found at http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
        "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]


The idea is to build a user agent pool (user_agent_list) and, before each request is sent, randomly pick one user agent from the pool and set it on the request. The middleware inherits from the UserAgentMiddleware base class.

In addition, in settings.py (the configuration file), disable the default user agent middleware and enable the re-implemented one. The configuration is as follows:

# Cancel the default UserAgentMiddleware and enable the re-implemented one
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    # path to the RotateUserAgentMiddleware created above; adjust to your project layout
    'CSDNBlogCrawlSpider.spiders.rotate_useragent.RotateUserAgentMiddleware': 400,
}

The configuration is now complete. Run the crawler to see the effect.


In the log output you can see the user agent changing from request to request.
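A sketch of what that output looks like, assuming the print format from rotate_useragent.py above (which entries appear depends on the random choice):

********Current UserAgent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1************
********Current UserAgent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24************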


4. Policy 4: Use the IP address pool

One anti-crawler policy on the server side is to block your IP address, or even the whole IP segment, and refuse access outright. When that happens, switching to another IP address lets you keep accessing the site.

Scrapy + Tor + polipo can be used for this.

For the configuration method and tutorial, refer to the linked article; I will translate it if I have time.
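As a rough sketch of the idea only (not the Tor + polipo setup from that tutorial), Scrapy's built-in HttpProxyMiddleware honours request.meta['proxy'], so a small downloader middleware can rotate requests through a pool of proxy addresses; the addresses below are placeholders:

# rotate_proxy.py -- a rough sketch; the proxy addresses are placeholders, not real endpoints
import random

class RandomProxyMiddleware(object):
    # hypothetical pool; fill this with your own proxies
    # (a local polipo instance forwarding to Tor usually listens on 127.0.0.1:8123)
    proxy_list = [
        'http://127.0.0.1:8123',
        'http://10.0.0.2:3128',
    ]

    def process_request(self, request, spider):
        # Scrapy's HttpProxyMiddleware will route the request through this proxy
        request.meta['proxy'] = random.choice(self.proxy_list)

Enable it in DOWNLOADER_MIDDLEWARES just like the user agent middleware above.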


5. Policy 5: distributed crawling

There is more to this topic than fits here. For Scrapy there are related GitHub repos for distributed crawling; you can search for them.
