Scrapy: disguising requests with fake User-Agent headers and fake_useragent
When crawling web pages it often pays to masquerade as a browser: some servers filter requests only loosely, so even without hiding your IP behind a proxy, simply disguising the User-Agent of the request is enough to get through.
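For a single request you can set the header by hand; the two methods below just automate this for every request. A minimal sketch (the spider name and URL are placeholders of my own, not part of any project here):

import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'

    def start_requests(self):
        # Override the User-Agent for this one request only.
        yield scrapy.Request(
            'http://example.com',
            headers={'User-Agent': 'Speedy Spider (http://www.entireweb.com)'},
        )

    def parse(self, response):
        self.logger.info('got %s bytes' % len(response.body))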
The first method: a hand-maintained User-Agent list
1. Add the following to the settings.py file; it is a list of User-Agent strings from various browsers and crawlers:
USER_AGENT_LIST = [
    'Zspider/0.9-dev http://feedback.redkolibri.com/',
    'xaldon_WebSpider/2.0.b1',
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Speedy Spider (http://www.entireweb.com/about/search_tech/speedy_spider/)',
    'Mozilla/5.0 (compatible; Speedy Spider; http://www.entireweb.com/about/search_tech/speedy_spider/)',
    'Speedy Spider (Entireweb; Beta/1.3; http://www.entireweb.com/about/search_tech/speedyspider/)',
    'Speedy Spider (Entireweb; Beta/1.2; http://www.entireweb.com/about/search_tech/speedyspider/)',
    'Speedy Spider (Entireweb; Beta/1.1; http://www.entireweb.com/about/search_tech/speedyspider/)',
    'Speedy Spider (Entireweb; Beta/1.0; http://www.entireweb.com/about/search_tech/speedyspider/)',
    'Speedy Spider (Beta/1.0; www.entireweb.com)',
    'Speedy Spider (http://www.entireweb.com/about/search_tech/speedy_spider/)',
    'Speedy Spider (http://www.entireweb.com/about/search_tech/speedyspider/)',
    'Speedy Spider (http://www.entireweb.com)',
    'Sosospider+(+http://help.soso.com/webspider.htm)',
    'Sogou spider',
    'Nusearch Spider (www.nusearch.com)',
    'nuSearch Spider (compatible; MSIE 4.01; Windows NT)',
    'lmspider ([email protected])',
    'lmspider [email protected]',
    'ldspider (http://code.google.com/p/ldspider/wiki/Robots)',
    'iaskspider/2.0 (+http://iask.com/help/help_index.html)',
    'iaskspider',
    'hl_ftien_spider_v1.1',
    'hl_ftien_spider',
    'FyberSpider (+http://www.fybersearch.com/fyberspider.php)',
    'FyberSpider',
    'everyfeed-spider/2.0 (http://www.everyfeed.com)',
    'envolk[ITS]spider/1.6 (+http://www.envolk.com/envolkspider.html)',
    'envolk[ITS]spider/1.6 (http://www.envolk.com/envolkspider.html)',
    'Baiduspider+(+http://www.baidu.com/search/spider_jp.html)',
    'Baiduspider+(+http://www.baidu.com/search/spider.htm)',
    'BaiDuSpider',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0) AddSugarSpiderBot www.idealobserver.com',
]
2. Create a MidWare package in the same directory as your spiders, and inside it write a HeaderMidWare.py file with the following contents:
# encoding: utf-8
from scrapy.utils.project import get_project_settings
import random

settings = get_project_settings()


class ProcessHeaderMidware(object):
    """Process each request and add header info."""

    def process_request(self, request, spider):
        """Pick a random header from the list and use it as the User-Agent."""
        ua = random.choice(settings.get('USER_AGENT_LIST'))
        spider.logger.info(msg='now entering download midware')
        if ua:
            request.headers['User-Agent'] = ua
            # Add desired logging message here.
            spider.logger.info(u'User-Agent is : {} {}'.format(request.headers.get('User-Agent'), request))
3. Add the following to the settings.py file:

DOWNLOADER_MIDDLEWARES = {
    'ProjectName.MidWare.HeaderMidWare.ProcessHeaderMidware': 543,
}
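To check that the rotation works, a throwaway spider can hit httpbin.org, which echoes back the User-Agent it received (the spider name and URL are my own choices for the test, not part of the setup above):

import scrapy


class UACheckSpider(scrapy.Spider):
    name = 'ua_check'
    # Duplicate start URLs are not filtered, so each one triggers a request
    # and should be logged with a different random User-Agent.
    start_urls = ['http://httpbin.org/user-agent'] * 5

    def parse(self, response):
        self.logger.info(response.text)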
The second method: using fake_useragent
fake-useragent is an open-source project on GitHub.
1. Install fake-useragent:

pip install fake-useragent
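Before wiring it into Scrapy you can try the library on its own in a Python shell; random, chrome, firefox and so on are attributes the UserAgent object exposes:

from fake_useragent import UserAgent

ua = UserAgent()
print(ua.random)   # a random real-world User-Agent string
print(ua.chrome)   # a random Chrome User-Agent
print(ua.firefox)  # a random Firefox User-Agent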
2. In the same MidWare package next to your spiders, write a user_agent_middlewares.py file with the following contents:
# -*- coding: utf-8 -*-
from fake_useragent import UserAgent


class RandomUserAgentMiddlware(object):
    """Randomly rotate the User-Agent using fake-useragent."""

    def __init__(self, crawler):
        super(RandomUserAgentMiddlware, self).__init__()
        self.ua = UserAgent()
        # Read the RANDOM_UA_TYPE value from the settings file.
        self.ua_type = crawler.settings.get('RANDOM_UA_TYPE', 'random')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        # Called by Scrapy for every outgoing request.
        def get_ua():
            return getattr(self.ua, self.ua_type)

        request.headers.setdefault('User-Agent', get_ua())
3. Add the following to settings.py:

RANDOM_UA_TYPE = 'random'  # e.g. random, chrome

DOWNLOADER_MIDDLEWARES = {
    'projectName.MidWare.user_agent_middlewares.RandomUserAgentMiddlware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
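Note that the built-in scrapy.downloadermiddlewares.useragent.UserAgentMiddleware is mapped to None: this disables it so it cannot overwrite the header our middleware sets. Because the middleware looks the RANDOM_UA_TYPE value up with getattr(), any attribute fake-useragent exposes (random, chrome, firefox, ...) is a valid setting.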
That completes the fake_useragent disguise. Compared with the first method you no longer have to maintain a long list of browser headers yourself; fake-useragent pulls those headers from https://fake-useragent.herokuapp.com/browsers/0.1.7.
You may see some errors the first time fake_useragent is enabled; as far as I can tell this is because the library has to fetch and cache the browser data over the network on first use.
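If that first run keeps failing, the 0.1.x releases of fake-useragent accept a fallback string that is returned whenever the browser data cannot be fetched; this is a hedged workaround sketch, so check the keyword against the version you have installed:

from fake_useragent import UserAgent

# If the cached browser data cannot be fetched, return this fixed
# User-Agent instead of raising an error.
ua = UserAgent(fallback='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')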
GitHub address: https://github.com/sea1234/fake-useragent