In the previous two posts I used Scrapy to write a crawler that crawled my own blog content, saved the data in JSON format (Scrapy crawler growth diary: creating a project, extracting data, and saving it as JSON), and wrote it to a database (Scrapy crawler growth diary: writing the crawled content to a MySQL database). However, that crawler is still too weak: once the target site enables anti-crawler restrictions, it stops working. So this post looks at how to keep a Scrapy crawler from getting banned. Everything here builds on the previous two articles; if you missed them, you can go back to them first: Scrapy crawler growth diary: creating a project, extracting data, and saving it as JSON; Scrapy crawler growth diary: writing the crawled content to a MySQL database.
According to the Scrapy official documentation (http://doc.scrapy.org/en/master/topics/practices.html#avoiding-getting-banned), the main strategies for preventing a Scrapy crawler from being banned are:
- Dynamically set the user agent
- Disable cookies
- Set download delays
- Use Google cache
- Use an IP pool (Tor project, VPN, and proxy IPs)
- Use Crawlera
Google cache is affected by the domestic network situation (you know how it is), and Crawlera is a distributed download service that deserves a dedicated article next time. So this post focuses on the other approaches: dynamically randomizing the user agent, disabling cookies, setting a download delay, and using proxy IPs. Now, on to the main topic:
1. Create middlewares.py
In Scrapy, proxy IPs and user agent switching are controlled through DOWNLOADER_MIDDLEWARES. Below we create the middlewares.py file:
[[email protected] cnblogs]# vi cnblogs/middlewares.py

```python
import random
import base64
from settings import PROXIES


class RandomUserAgent(object):
    """Randomly rotate user agents based on a list of predefined ones"""

    def __init__(self, agents):
        self.agents = agents

    @classmethod
    def from_crawler(cls, crawler):
        # read the USER_AGENTS list from settings.py
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        # pick a random user agent for every outgoing request
        request.headers.setdefault('User-Agent', random.choice(self.agents))


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        if proxy['user_pass']:
            # proxy requires authentication: set the Proxy-Authorization header.
            # b64encode (rather than encodestring) avoids a trailing newline
            # that would corrupt the header value.
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
            encoded_user_pass = base64.b64encode(proxy['user_pass'])
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
            print "**************ProxyMiddleware have pass************" + proxy['ip_port']
        else:
            # anonymous proxy: no credentials needed
            print "**************ProxyMiddleware no pass************" + proxy['ip_port']
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
```
The RandomUserAgent class dynamically picks a user agent at random from the USER_AGENTS list configured in settings.py.
The ProxyMiddleware class switches proxies; the proxy list PROXIES is likewise configured in settings.py. An empty user_pass means the proxy needs no authentication, so the middleware only sets the Proxy-Authorization header when credentials are present.
2. Modify settings.py to configure USER_AGENTS and PROXIES
a): Add USER_AGENTS
```python
USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]
```
b): Add the proxy IP setting PROXIES
```python
PROXIES = [
    {'ip_port': '111.11.228.75:80', 'user_pass': ''},
    {'ip_port': '120.198.243.22:80', 'user_pass': ''},
    {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
    {'ip_port': '101.71.27.120:80', 'user_pass': ''},
    {'ip_port': '122.96.59.104:80', 'user_pass': ''},
    {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
]
```
Proxy IPs can be found by searching online; the ones above were taken from http://www.xici.net.co/.
c): Disable cookies
```python
COOKIES_ENABLED = False
```
d): Set download delay
```python
DOWNLOAD_DELAY = 3
```
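As a side note (my own addition, not part of the original setup): Scrapy also randomizes the actual wait between requests by default, drawing it from 0.5x to 1.5x of DOWNLOAD_DELAY, which makes the request pattern look less mechanical. A minimal sketch making that behavior explicit:

```python
# settings.py -- sketch only; RANDOMIZE_DOWNLOAD_DELAY already defaults
# to True, it is shown here just to make the behavior visible.
DOWNLOAD_DELAY = 3               # base delay in seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # actual delay varies between 0.5x and 1.5x
```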
e): Finally, set DOWNLOADER_MIDDLEWARES
```python
DOWNLOADER_MIDDLEWARES = {
    # 'cnblogs.middlewares.MyCustomDownloaderMiddleware': 543,
    'cnblogs.middlewares.RandomUserAgent': 1,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    # in newer Scrapy versions the built-in proxy middleware lives at:
    # 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'cnblogs.middlewares.ProxyMiddleware': 100,
}
```

The numbers are priorities: middlewares with lower values have their process_request called first, so each request passes through RandomUserAgent (1), then ProxyMiddleware (100), which sets request.meta['proxy'], and finally the built-in HttpProxyMiddleware (110), which applies it.
Save settings.py
3. Testing
[[email protected] cnblogs]# scrapy crawl CnblogsSpider
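If you want to confirm that the user agent really changes per request, a quick way (my own addition, not from the original article; the spider name and file are hypothetical) is a throwaway spider against httpbin.org, which echoes back the User-Agent header it receives:

```python
# uacheck.py -- a minimal sketch for verifying user-agent rotation;
# assumes httpbin.org is reachable from your network.
import scrapy


class UACheckSpider(scrapy.Spider):
    name = "uacheck"

    def start_requests(self):
        # fire several requests at the same URL; dont_filter=True bypasses
        # the duplicate filter so every request actually goes out
        for _ in range(5):
            yield scrapy.Request("http://httpbin.org/user-agent",
                                 dont_filter=True, callback=self.parse)

    def parse(self, response):
        # each response body should show a different entry from USER_AGENTS
        self.log(response.body)
```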
The source code has been updated at: https://github.com/jackgitgz/CnblogsSpider
One more aside: in this article the user agent and proxy lists are configured in settings.py. In real production use, user agents and especially proxies are likely to be updated frequently, and editing the configuration file every time is clumsy and hard to manage. If needed, they can instead be kept in a MySQL database.
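A minimal sketch of that idea, assuming a hypothetical proxies table with ip_port and user_pass columns (the table, database name, and credentials are all placeholders; MySQLdb is used as in the earlier MySQL article):

```python
# middlewares.py -- sketch of loading the proxy list from MySQL instead
# of settings.py. Table name, columns, and credentials are hypothetical.
import random
import MySQLdb


def load_proxies():
    conn = MySQLdb.connect(host='localhost', user='root', passwd='root',
                           db='cnblogsdb', charset='utf8')
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT ip_port, user_pass FROM proxies")
        return [{'ip_port': row[0], 'user_pass': row[1]}
                for row in cursor.fetchall()]
    finally:
        conn.close()


class MySQLProxyMiddleware(object):
    def __init__(self):
        # load once at startup; re-query periodically if the pool changes often
        self.proxies = load_proxies()

    def process_request(self, request, spider):
        proxy = random.choice(self.proxies)
        request.meta['proxy'] = "http://%s" % proxy['ip_port']
```

Loading once at startup keeps things simple; if the proxy pool changes frequently, you could re-query on a timer or every Nth request instead.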