After setting download_delay to less than 1 with no other anti-ban policy in place, I finally managed to get myself banned: the target site struck back and blocked my crawler.
This post covers several policies for avoiding bans and how to apply them in Scrapy.
1. Policy 1: Set download_delay
This was already used in the previous tutorial (http://blog.csdn.net/u012150179/article/details/34913315). Its main role is to set the wait time between downloads: large-scale, concentrated access has the greatest impact on the target server, sharply increasing its load over a short period.
If the wait time is too long, large-scale crawling cannot finish within the required period; if it is too short, the probability of being banned rises sharply.
Note:
download_delay can be set in settings.py or in the spider itself; this was covered in a previous post (http://blog.csdn.net/u012150179/article/details/34913315), so it is not repeated here.
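For reference, a minimal sketch of both places it can go (the spider name and the two-second value are just placeholders):

# settings.py -- project-wide delay (in seconds) between consecutive requests
DOWNLOAD_DELAY = 2

# or inside a single spider, overriding the project-wide setting
from scrapy.spider import Spider

class CSDNBlogSpider(Spider):
    name = 'csdnblog'       # hypothetical spider name
    download_delay = 2      # delay applied only to this spider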
2. Policy 2: Disable cookies
Cookies are data (often encrypted) that some websites store on the client side to identify users. Disabling cookies prevents sites that rely on them from recognizing the crawler's tracks.
Usage:
Set COOKIES_ENABLED to False in settings.py. This disables the cookies middleware, so the crawler neither sends cookies nor stores the ones the web server returns.
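It is a single line in settings.py:

# settings.py -- disable the cookies middleware; no cookies are sent or stored
COOKIES_ENABLED = False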
3. Policy 3: Use the user agent pool
A user agent is a string carried in the HTTP request headers that contains browser, operating system, and other information. The server uses it to decide whether the current visitor is a browser, a mail client, or a web crawler. You can inspect it in request.headers. Run scrapy shell to see for yourself:
scrapy shell http://blog.csdn.net/u012150179/article/details/34486677
Then enter the following to inspect the user agent:
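For instance, inside the shell the default user agent can be read straight from the request headers (the value shown here is the Scrapy 0.22.2 default mentioned below):

>>> request.headers['User-Agent']
'Scrapy/0.22.2 (+http://scrapy.org)'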
As you can see, Scrapy identifies itself as Scrapy/0.22.2 by default, which immediately exposes the crawler.
Usage:
First, write your own user agent middleware: create rotate_useragent.py with the following code:
# -*- coding: utf-8 -*-

"""Anti-ban policy: use a user agent pool.

Note: the middleware must also be enabled in settings.py.
"""

import random

from scrapy import log
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):

    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            # display the currently used user agent
            print "********* Current UserAgent: %s *************" % ua
            # record it in the log
            log.msg('Current UserAgent: ' + ua, level=log.INFO)
            request.headers.setdefault('User-Agent', ua)

    # the default user_agent_list covers Chrome, IE, Firefox, Mozilla, Opera and Netscape
    # for more user agent strings, see http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
The middleware builds a user agent pool (user_agent_list) and, before each request is sent, randomly picks one entry from the pool and sets it on the request. Its base class is Scrapy's UserAgentMiddleware.
In addition, in settings.py (the project configuration file) disable the default user agent middleware and enable the reimplemented one. The configuration is as follows:
# disable the default useragent middleware and enable the new one
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'CSDNBlogCrawlSpider.spiders.rotate_useragent.RotateUserAgentMiddleware': 400,
}
The configuration is now complete; run the crawler to see the effect.
You will find that the user agent keeps changing from request to request.
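If the middleware is working, the line printed in process_request above shows up with a different pool entry on most requests, for example:

********* Current UserAgent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1 *************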
4. Policy 4: Use the IP address pool
One common server-side anti-crawler measure is to block your IP address, or even the whole IP range, outright. When that happens, switching to another IP address lets you keep accessing the site.
One option is Scrapy + Tor + Polipo.
A configuration tutorial is available in English; I will translate it if I have time.
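As a rough sketch of how the proxy side can be wired in (assuming Polipo listens on its default port 8123 and forwards to Tor; the module path and class name here are hypothetical), a small downloader middleware can point every request at the local proxy:

# middlewares.py -- hypothetical sketch: route every request through a local HTTP proxy
# (e.g. Polipo on 127.0.0.1:8123 forwarding to Tor)
class ProxyMiddleware(object):

    def process_request(self, request, spider):
        # meta['proxy'] is picked up by Scrapy's built-in HttpProxyMiddleware
        request.meta['proxy'] = 'http://127.0.0.1:8123'

# settings.py -- enable the middleware (the project name is hypothetical)
DOWNLOADER_MIDDLEWARES = {
    'CSDNBlogCrawlSpider.middlewares.ProxyMiddleware': 100,
}

Rotating over a list of proxies inside process_request, instead of always using the single local one, is what turns this into a real IP pool.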
5. Policy 5: distributed crawling
There is a lot more to say about this one. For Scrapy there are also related GitHub repos for distributed crawling, which you can search for.