Ways to avoid getting banned (from http://doc.scrapy.org/en/1.0/topics/practices.html#bans):
1. User Agent Rotation
2. Disable cookies
3. Set DOWNLOAD_DELAY to more than 2 seconds
4. Use Google cache (I don't understand this one yet)
5. Use rotating IPs (not tried yet)
6. Use a distributed downloader (I'm not sure whether scrapy-redis counts)
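Items 2 and 3 map onto ordinary Scrapy settings. As a minimal sketch, assuming a 2-second delay suits the target site (the exact value is a judgment call), the relevant lines in settings.py could look like this:

```python
# settings.py (fragment)

# Item 2: do not send or store cookies, so the site has no session to track.
COOKIES_ENABLED = False

# Item 3: wait about 2 seconds between consecutive requests to the same site.
DOWNLOAD_DELAY = 2

# Scrapy's default; the actual wait is randomized between 0.5x and 1.5x
# of DOWNLOAD_DELAY, which makes the crawl pattern less regular.
RANDOMIZE_DOWNLOAD_DELAY = True
```
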
User Agent Rotation Example
1) Create a new file middlewares.py with the following contents and place it in the same folder as items.py and settings.py.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    """Downloader middleware that picks a random User-Agent for each request."""

    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            # Show which user agent was picked for this request.
            print('%s -----------------yyyyyyyyyyyyyyyyyyyyyyyyy' % ua)
            # setdefault only sets the header if the request has none yet.
            request.headers.setdefault('User-Agent', ua)

    # Default list of user agent strings. For more, see
    # http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
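One detail of the middleware above is easy to miss: it calls setdefault on the headers, so a request that already carries a User-Agent keeps it. The sketch below illustrates that behavior without needing Scrapy installed. FakeRequest and rotate_user_agent are hypothetical names used only for illustration, and a plain dict stands in for Scrapy's headers object.

```python
import random

# Shortened, illustrative list; the middleware above uses a longer one.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]


class FakeRequest(object):
    """Hypothetical stand-in for a Scrapy request: it only has a headers dict."""

    def __init__(self, headers=None):
        self.headers = dict(headers or {})


def rotate_user_agent(request, user_agents=USER_AGENTS):
    """Mimic process_request: pick a random UA and set it only if none is present."""
    ua = random.choice(user_agents)
    request.headers.setdefault("User-Agent", ua)
    return request


# A request without a User-Agent receives a random one from the list.
plain = rotate_user_agent(FakeRequest())
# A request that already has a User-Agent is left untouched.
preset = rotate_user_agent(FakeRequest({"User-Agent": "my-custom-agent"}))

print(plain.headers["User-Agent"])
print(preset.headers["User-Agent"])
```

If every request should be rotated regardless of existing headers, assigning the header directly instead of using setdefault would be the alternative; the middleware above does not do that.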
2) Then add the following to settings.py:
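DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in user agent middleware.
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # 'example' is the project name. The order value was lost in the source;
    # 400 is an assumption.
    'example.middlewares.RotateUserAgentMiddleware': 400,
}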
Run the crawler with scrapy crawl spider1 -L WARNING so that debug information is not printed. You can then clearly see that the printed user agent differs from request to request.