How to keep your Scrapy crawler from getting banned


In earlier posts I used Scrapy to crawl my own blog content and save it as JSON (Scrapy crawler growth diary: creating a project, extracting data, and saving it as JSON), and then to write the crawled content to a database (Scrapy crawler growth diary: writing crawled content to a MySQL database). That crawler, however, is still too weak: as soon as the target site imposes anti-crawler restrictions, it stops working. So this article looks at how to keep a Scrapy crawler from being banned. Everything here builds on the previous two articles; if you missed them, you can go back to them first: Scrapy crawler growth diary: creating a project, extracting data, and saving as JSON; Scrapy crawler growth diary: writing crawled content to a MySQL database.

According to the official Scrapy documentation (http://doc.scrapy.org/en/master/topics/practices.html#avoiding-getting-banned), there are several main strategies to keep Scrapy from being banned:

    • Dynamically set the user agent
    • Disable cookies
    • Set download delays
    • Use Google cache
    • Use an IP address pool (Tor project, VPN, and proxy IPs)
    • Use Crawlera

Google cache is affected by network conditions in mainland China (you know why), and Crawlera, a distributed download service, deserves a dedicated article next time. So this post focuses on dynamically and randomly setting the user agent, disabling cookies, setting download delays, and using proxy IPs. Now, on to the subject:

  1. Create middlewares.py

In Scrapy, proxy IP and user agent switching are controlled through DOWNLOADER_MIDDLEWARES. Below we create the middlewares.py file.

vi cnblogs/middlewares.py

import random
import base64
from settings import PROXIES


class RandomUserAgent(object):
    """Randomly rotate user agents based on a list of predefined ones."""

    def __init__(self, agents):
        self.agents = agents

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        # print "**************************" + random.choice(self.agents)
        request.headers.setdefault('User-Agent', random.choice(self.agents))


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        if proxy['user_pass']:
            # Authenticated proxy: set the proxy and the Proxy-Authorization
            # header. base64.encodestring() appends a trailing newline,
            # which must be stripped before use in a header.
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
            encoded_user_pass = base64.encodestring(proxy['user_pass']).strip()
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
            print "**************ProxyMiddleware with pass************ " + proxy['ip_port']
        else:
            # Anonymous proxy: no credentials needed.
            print "**************ProxyMiddleware no pass************ " + proxy['ip_port']
            request.meta['proxy'] = "http://%s" % proxy['ip_port']

The RandomUserAgent class dynamically picks a user agent from the USER_AGENTS list configured in settings.py.
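To confirm the rotation is actually working, you can log the chosen user agent from inside the spider, since each response carries a reference to the request that produced it. This parse snippet is only an illustration, not part of the project's actual spider code:

    def parse(self, response):
        # response.request is the Request that produced this response, so its
        # headers show which User-Agent the middleware picked for it.
        self.log("User-Agent used: %s" % response.request.headers.get('User-Agent'))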

The ProxyMiddleware class switches proxies; the PROXIES list is likewise configured in settings.py.

  2. Modify settings.py to configure USER_AGENTS and PROXIES

a): Add USER_AGENTS

USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]

b): Add the proxy IP setting PROXIES

PROXIES = [
    {'ip_port': '111.11.228.75:80', 'user_pass': ''},
    {'ip_port': '120.198.243.22:80', 'user_pass': ''},
    {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
    {'ip_port': '101.71.27.120:80', 'user_pass': ''},
    {'ip_port': '122.96.59.104:80', 'user_pass': ''},
    {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
]

Proxy IPs can be found online; the ones above were taken from: http://www.xici.net.co/.
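Free proxies like these go stale quickly, so it can help to weed out dead entries before putting them into PROXIES. A minimal liveness check might look like the following Python 2 sketch (the test URL and timeout are arbitrary choices, not part of the original project):

import urllib2

def proxy_alive(ip_port, timeout=5):
    """Return True if an HTTP request through the proxy succeeds."""
    try:
        # Route a single request through the candidate proxy.
        opener = urllib2.build_opener(urllib2.ProxyHandler({'http': ip_port}))
        opener.open('http://www.baidu.com', timeout=timeout)
        return True
    except Exception:
        return False

Running each candidate through proxy_alive() before adding it to PROXIES keeps the middleware from repeatedly picking proxies that just time out.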

c): Disable cookies

COOKIES_ENABLED = False

d): Set download delay

DOWNLOAD_DELAY = 3
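Note that a perfectly regular request interval is itself easy to fingerprint. Scrapy's built-in RANDOMIZE_DOWNLOAD_DELAY setting (on by default) spreads each wait between 0.5 and 1.5 times DOWNLOAD_DELAY, so it is worth leaving enabled:

# On by default; each wait becomes 0.5x to 1.5x DOWNLOAD_DELAY.
RANDOMIZE_DOWNLOAD_DELAY = True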

e): Finally, set DOWNLOADER_MIDDLEWARES

DOWNLOADER_MIDDLEWARES = {
    # 'cnblogs.middlewares.MyCustomDownloaderMiddleware': 543,
    'cnblogs.middlewares.RandomUserAgent': 1,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    # 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,  # path in newer Scrapy versions
    'cnblogs.middlewares.ProxyMiddleware': 100,
}

The priorities matter here: process_request is called in ascending order, so ProxyMiddleware (100) runs before the built-in HttpProxyMiddleware (110), and request.meta['proxy'] is already set when the latter processes the request.

Save settings.py

3. Testing

scrapy crawl CnblogsSpider

The source code has been updated to: https://github.com/jackgitgz/CnblogsSpider

An aside: in this article the user agent and proxy lists are configured in settings.py, but in real production use both are likely to be updated frequently, and changing the configuration file every time is awkward and hard to manage. If needed, they can be stored in a MySQL database instead.
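A minimal sketch of what that could look like, assuming a hypothetical proxy_pool table with ip_port and user_pass columns (the connection parameters are placeholders, not part of the published project):

import MySQLdb

def load_proxies():
    """Load the proxy list from MySQL instead of settings.py."""
    conn = MySQLdb.connect(host='localhost', user='root', passwd='root',
                           db='cnblogsdb', charset='utf8')
    cursor = conn.cursor()
    # Assumed schema: proxy_pool(ip_port VARCHAR, user_pass VARCHAR)
    cursor.execute("SELECT ip_port, user_pass FROM proxy_pool")
    proxies = [{'ip_port': row[0], 'user_pass': row[1]}
               for row in cursor.fetchall()]
    cursor.close()
    conn.close()
    return proxies

ProxyMiddleware could then call load_proxies() when it is instantiated (or refresh the list periodically) instead of importing PROXIES from settings, so new proxies take effect without editing the configuration file.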
