Python Scrapy: using a disguised User-Agent and fake_useragent

Tags: python, scrapy


Masquerading as a browser when crawling web pages is often enough: many servers do not filter requests very strictly, so even if you cannot disguise the request's source IP with a proxy, disguising your crawler as a browser on its own can get requests through.
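Before the two rotation methods below, note that the simplest form of this disguise is a single fixed User-Agent set through Scrapy's built-in USER_AGENT setting; a minimal sketch (the browser string here is only an example):

# settings.py -- the simplest disguise: one fixed browser User-Agent (example string)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'

The drawback is that every request then carries the same header, which is easy to fingerprint; hence the random-rotation approaches that follow.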

The first method: a hand-maintained User-Agent list

1. Add the following to the settings.py file; it is a list of User-Agent strings for various browsers and crawlers:

USER_AGENT_LIST = [
    'Zspider/0.9-dev http://feedback.redkolibri.com/',
    'xaldon_WebSpider/2.0.b1',
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Speedy Spider (http://www.entireweb.com/about/search_tech/speedy_spider/)',
    'Mozilla/5.0 (compatible; Speedy Spider; http://www.entireweb.com/about/search_tech/speedy_spider/)',
    'Speedy Spider (Entireweb; Beta/1.3; http://www.entireweb.com/about/search_tech/speedyspider/)',
    'Speedy Spider (Entireweb; Beta/1.2; http://www.entireweb.com/about/search_tech/speedyspider/)',
    'Speedy Spider (Entireweb; Beta/1.1; http://www.entireweb.com/about/search_tech/speedyspider/)',
    'Speedy Spider (Entireweb; Beta/1.0; http://www.entireweb.com/about/search_tech/speedyspider/)',
    'Speedy Spider (Beta/1.0; www.entireweb.com)',
    'Speedy Spider (http://www.entireweb.com/about/search_tech/speedy_spider/)',
    'Speedy Spider (http://www.entireweb.com/about/search_tech/speedyspider/)',
    'Speedy Spider (http://www.entireweb.com)',
    'Sosospider+(+http://help.soso.com/webspider.htm)',
    'Sogou spider',
    'Nusearch Spider (www.nusearch.com)',
    'nuSearch Spider (compatible; MSIE 4.01; Windows NT)',
    'lmspider ([email protected])',
    'lmspider [email protected]',
    'ldspider (http://code.google.com/p/ldspider/wiki/Robots)',
    'iaskspider/2.0 (+http://iask.com/help/help_index.html)',
    'iaskspider',
    'hl_ftien_spider_v1.1',
    'hl_ftien_spider',
    'FyberSpider (+http://www.fybersearch.com/fyberspider.php)',
    'FyberSpider',
    'everyfeed-spider/2.0 (http://www.everyfeed.com)',
    'envolk[ITS]spider/1.6 (+http://www.envolk.com/envolkspider.html)',
    'envolk[ITS]spider/1.6 (http://www.envolk.com/envolkspider.html)',
    'Baiduspider+(+http://www.baidu.com/search/spider_jp.html)',
    'Baiduspider+(+http://www.baidu.com/search/spider.htm)',
    'BaiDuSpider',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0) AddSugarSpiderBot www.idealobserver.com',
]

2. Create a MidWare package at the same level as the spiders directory, and in it write a HeaderMidWare.py file with the following contents:

# encoding: utf-8
import random

from scrapy.utils.project import get_project_settings

settings = get_project_settings()


class ProcessHeaderMidware():
    """process request add request info"""

    def process_request(self, request, spider):
        """Pick a random entry from the list and use it as the User-Agent."""
        ua = random.choice(settings.get('USER_AGENT_LIST'))
        spider.logger.info(msg='now entering download midware')
        if ua:
            request.headers['User-Agent'] = ua
            # Add desired logging message here.
            spider.logger.info(u'User-Agent is : {} {}'.format(request.headers.get('User-Agent'), request))

3. Add the following to the settings.py file:

DOWNLOADER_MIDDLEWARES = {
    'ProjectName.MidWare.HeaderMidWare.ProcessHeaderMidware': 543,
}
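To confirm the middleware really rotates headers, one quick check is a throwaway spider pointed at an echo service; this sketch assumes httpbin.org/user-agent (a public endpoint that returns the User-Agent header it received) rather than anything from the original project:

import scrapy

class UACheckSpider(scrapy.Spider):
    name = 'ua_check'
    # httpbin.org/user-agent echoes back the User-Agent header it received
    start_urls = ['https://httpbin.org/user-agent']

    def parse(self, response):
        self.logger.info('server saw: %s', response.text)

Running scrapy crawl ua_check a few times should show a different entry from USER_AGENT_LIST on each run, alongside the middleware's own log lines.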

The second method: using fake_useragent

fake-useragent is an open-source project on GitHub.
1. Install fake-useragent:
pip install fake-useragent
2. In the same MidWare directory, write a user_agent_middlewares.py file with the following contents:

# -*- coding: utf-8 -*-
from fake_useragent import UserAgent


class RandomUserAgentMiddlware(object):
    # Rotate the User-Agent at random.
    def __init__(self, crawler):
        super(RandomUserAgentMiddlware, self).__init__()
        self.ua = UserAgent()
        # Read the RANDOM_UA_TYPE value from the settings file.
        self.ua_type = crawler.settings.get('RANDOM_UA_TYPE', 'random')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):  # the hook Scrapy calls for every outgoing request
        def get_ua():
            return getattr(self.ua, self.ua_type)

        request.headers.setdefault('User-Agent', get_ua())

3. Add the following to settings.py:

RANDOM_UA_TYPE = 'random'  # 'random', or a specific family such as 'chrome'
DOWNLOADER_MIDDLEWARES = {
    'projectName.MidWare.user_agent_middlewares.RandomUserAgentMiddlware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
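The RANDOM_UA_TYPE value works because each User-Agent family is exposed as an attribute on the UserAgent object, which is exactly what the getattr(self.ua, self.ua_type) call in the middleware looks up; a quick interactive sketch:

from fake_useragent import UserAgent

ua = UserAgent()
print(ua.random)   # any random User-Agent -- what RANDOM_UA_TYPE = 'random' selects
print(ua.chrome)   # a random Chrome User-Agent -- what RANDOM_UA_TYPE = 'chrome' selects
print(ua.firefox)  # likewise for Firefox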

With this, the fake_useragent disguise is configured. Compared with the first method, you do not have to write out a large list of browser headers yourself; the User-Agent strings are fetched from https://fake-useragent.herokuapp.com/browsers/0.1.7.

You may see some errors the first time you enable fake_useragent; I believe this is because the library has to fetch and cache its User-Agent data over the network on first use.
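If those first-run errors are a problem, one workaround is to give the library a fallback string so it degrades gracefully instead of raising; this is a sketch assuming the fallback keyword available in the 0.1.x releases of fake-useragent (check your installed version), and the browser string is only an example:

from fake_useragent import UserAgent

# If fetching/caching the online data fails, use this fixed string instead of raising.
# (fallback is a 0.1.x-era keyword; verify it exists in your installed version.)
ua = UserAgent(fallback='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36')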

GitHub address: https://github.com/sea1234/fake-useragent
