In the previous two posts I used Scrapy to write a crawler that crawled my own blog content, saved the data in JSON format (Scrapy crawler growth diary: creating a project, extracting data, and saving it as JSON), and wrote it to a database (Scrapy crawler growth diary: writing the crawled content to a MySQL database). However, that crawler is still too weak: once the target site enables anti-crawler restrictions, it stops working. So this post looks at how to keep a Scrapy crawler from getting banned. Everything here builds on the previous two articles; if you missed them, you can go back to them first: Scrapy crawler growth diary: creating a project, extracting data, and saving it as JSON; Scrapy crawler growth diary: writing the crawled content to a MySQL database.
According to the Scrapy official documentation (http://doc.scrapy.org/en/master/topics/practices.html#avoiding-getting-banned), the main strategies for preventing a Scrapy crawler from being banned are:
- Dynamically set the user agent
- Disable cookies
- Set download delays
- Use Google cache
- Use an IP pool (Tor project, VPN, and proxy IPs)
- Use Crawlera
Google cache is affected by the domestic network situation (you know how it is), and Crawlera is a distributed download service that deserves a dedicated article next time. So this post focuses on the other approaches: dynamically randomizing the user agent, disabling cookies, setting a download delay, and using proxy IPs. Now, on to the main topic:
1. Create middlewares.py
In Scrapy, proxy IPs and user agent switching are controlled through DOWNLOADER_MIDDLEWARES. Below we create the middlewares.py file:
[[email protected] cnblogs]# vi cnblogs/middlewares.py

```python
import random
import base64
from settings import PROXIES


class RandomUserAgent(object):
    """Randomly rotate user agents based on a list of predefined ones"""

    def __init__(self, agents):
        self.agents = agents

    @classmethod
    def from_crawler(cls, crawler):
        # read the USER_AGENTS list from settings.py
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        # pick a random user agent for every outgoing request
        request.headers.setdefault('User-Agent', random.choice(self.agents))


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        if proxy['user_pass']:
            # proxy requires authentication: set the Proxy-Authorization header.
            # b64encode (rather than encodestring) avoids a trailing newline
            # that would corrupt the header value.
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
            encoded_user_pass = base64.b64encode(proxy['user_pass'])
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
            print "**************ProxyMiddleware have pass************" + proxy['ip_port']
        else:
            # anonymous proxy: no credentials needed
            print "**************ProxyMiddleware no pass************" + proxy['ip_port']
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
```
The RandomUserAgent class dynamically picks a user agent at random from the USER_AGENTS list configured in settings.py.
The ProxyMiddleware class switches proxies; the proxy list PROXIES is likewise configured in settings.py. An empty user_pass means the proxy needs no authentication, so the middleware only sets the Proxy-Authorization header when credentials are present.
2. Modify settings.py to configure USER_AGENTS and PROXIES
a): Add USER_AGENTS
```python
USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]
```
b): Add the proxy IP setting PROXIES
```python
PROXIES = [
    {'ip_port': '111.11.228.75:80', 'user_pass': ''},
    {'ip_port': '120.198.243.22:80', 'user_pass': ''},
    {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
    {'ip_port': '101.71.27.120:80', 'user_pass': ''},
    {'ip_port': '122.96.59.104:80', 'user_pass': ''},
    {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
]
```
Proxy IPs can be found by searching online; the ones above were taken from http://www.xici.net.co/.
c): Disable cookies
```python
COOKIES_ENABLED = False
```
d): Set download delay
```python
DOWNLOAD_DELAY = 3
```
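As a side note (my own addition, not part of the original setup): Scrapy also randomizes the actual wait between requests by default, drawing it from 0.5x to 1.5x of DOWNLOAD_DELAY, which makes the request pattern look less mechanical. A minimal sketch making that behavior explicit:

```python
# settings.py -- sketch only; RANDOMIZE_DOWNLOAD_DELAY already defaults
# to True, it is shown here just to make the behavior visible.
DOWNLOAD_DELAY = 3               # base delay in seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # actual delay varies between 0.5x and 1.5x
```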
e): Finally, set DOWNLOADER_MIDDLEWARES
```python
DOWNLOADER_MIDDLEWARES = {
    # 'cnblogs.middlewares.MyCustomDownloaderMiddleware': 543,
    'cnblogs.middlewares.RandomUserAgent': 1,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    # in newer Scrapy versions the built-in proxy middleware lives at:
    # 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'cnblogs.middlewares.ProxyMiddleware': 100,
}
```

The numbers are priorities: middlewares with lower values have their process_request called first, so each request passes through RandomUserAgent (1), then ProxyMiddleware (100), which sets request.meta['proxy'], and finally the built-in HttpProxyMiddleware (110), which applies it.
Save settings.py
3. Testing
[[email protected] cnblogs]# scrapy crawl CnblogsSpider
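If you want to confirm that the user agent really changes per request, a quick way (my own addition, not from the original article; the spider name and file are hypothetical) is a throwaway spider against httpbin.org, which echoes back the User-Agent header it receives:

```python
# uacheck.py -- a minimal sketch for verifying user-agent rotation;
# assumes httpbin.org is reachable from your network.
import scrapy


class UACheckSpider(scrapy.Spider):
    name = "uacheck"

    def start_requests(self):
        # fire several requests at the same URL; dont_filter=True bypasses
        # the duplicate filter so every request actually goes out
        for _ in range(5):
            yield scrapy.Request("http://httpbin.org/user-agent",
                                 dont_filter=True, callback=self.parse)

    def parse(self, response):
        # each response body should show a different entry from USER_AGENTS
        self.log(response.body)
```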
The source code has been updated at: https://github.com/jackgitgz/CnblogsSpider
One more aside: in this article the user agent and proxy lists are configured in settings.py. In real production use, user agents and especially proxies are likely to be updated frequently, and editing the configuration file every time is clumsy and hard to manage. If needed, they can instead be kept in a MySQL database.
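A minimal sketch of that idea, assuming a hypothetical proxies table with ip_port and user_pass columns (the table, database name, and credentials are all placeholders; MySQLdb is used as in the earlier MySQL article):

```python
# middlewares.py -- sketch of loading the proxy list from MySQL instead
# of settings.py. Table name, columns, and credentials are hypothetical.
import random
import MySQLdb


def load_proxies():
    conn = MySQLdb.connect(host='localhost', user='root', passwd='root',
                           db='cnblogsdb', charset='utf8')
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT ip_port, user_pass FROM proxies")
        return [{'ip_port': row[0], 'user_pass': row[1]}
                for row in cursor.fetchall()]
    finally:
        conn.close()


class MySQLProxyMiddleware(object):
    def __init__(self):
        # load once at startup; re-query periodically if the pool changes often
        self.proxies = load_proxies()

    def process_request(self, request, spider):
        proxy = random.choice(self.proxies)
        request.meta['proxy'] = "http://%s" % proxy['ip_port']
```

Loading once at startup keeps things simple; if the proxy pool changes frequently, you could re-query on a timer or every Nth request instead.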