Scrapy: disguising requests with fake User-Agent headers and fake_useragent
When crawling web pages it often pays to masquerade as a browser: some servers filter requests only loosely, so even without hiding your IP behind a proxy, simply disguising the User-Agent of the request is enough to get through.
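For a single request you can set the header by hand; the two methods below just automate this for every request. A minimal sketch (the spider name and URL are placeholders of my own, not part of any project here):

import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'

    def start_requests(self):
        # Override the User-Agent for this one request only.
        yield scrapy.Request(
            'http://example.com',
            headers={'User-Agent': 'Speedy Spider (http://www.entireweb.com)'},
        )

    def parse(self, response):
        self.logger.info('got %s bytes' % len(response.body))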
The first method: a hand-maintained User-Agent list
1. Add the following to the settings.py file; it is a list of User-Agent strings from various browsers and crawlers:
USER_AGENT_LIST = [
    'Zspider/0.9-dev http://feedback.redkolibri.com/',
    'xaldon_WebSpider/2.0.b1',
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Speedy Spider (http://www.entireweb.com/about/search_tech/speedy_spider/)',
    'Mozilla/5.0 (compatible; Speedy Spider; http://www.entireweb.com/about/search_tech/speedy_spider/)',
    'Speedy Spider (Entireweb; Beta/1.3; http://www.entireweb.com/about/search_tech/speedyspider/)',
    'Speedy Spider (Entireweb; Beta/1.2; http://www.entireweb.com/about/search_tech/speedyspider/)',
    'Speedy Spider (Entireweb; Beta/1.1; http://www.entireweb.com/about/search_tech/speedyspider/)',
    'Speedy Spider (Entireweb; Beta/1.0; http://www.entireweb.com/about/search_tech/speedyspider/)',
    'Speedy Spider (Beta/1.0; www.entireweb.com)',
    'Speedy Spider (http://www.entireweb.com/about/search_tech/speedy_spider/)',
    'Speedy Spider (http://www.entireweb.com/about/search_tech/speedyspider/)',
    'Speedy Spider (http://www.entireweb.com)',
    'Sosospider+(+http://help.soso.com/webspider.htm)',
    'Sogou spider',
    'Nusearch Spider (www.nusearch.com)',
    'nuSearch Spider (compatible; MSIE 4.01; Windows NT)',
    'lmspider ([email protected])',
    'lmspider [email protected]',
    'ldspider (http://code.google.com/p/ldspider/wiki/Robots)',
    'iaskspider/2.0 (+http://iask.com/help/help_index.html)',
    'iaskspider',
    'hl_ftien_spider_v1.1',
    'hl_ftien_spider',
    'FyberSpider (+http://www.fybersearch.com/fyberspider.php)',
    'FyberSpider',
    'everyfeed-spider/2.0 (http://www.everyfeed.com)',
    'envolk[ITS]spider/1.6 (+http://www.envolk.com/envolkspider.html)',
    'envolk[ITS]spider/1.6 (http://www.envolk.com/envolkspider.html)',
    'Baiduspider+(+http://www.baidu.com/search/spider_jp.html)',
    'Baiduspider+(+http://www.baidu.com/search/spider.htm)',
    'BaiDuSpider',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0) AddSugarSpiderBot www.idealobserver.com',
]
2. Create a MidWare package in the same directory as your spiders, and inside it write a HeaderMidWare.py file with the following contents:
# encoding: utf-8
from scrapy.utils.project import get_project_settings
import random

settings = get_project_settings()


class ProcessHeaderMidware(object):
    """Process each request and add header info."""

    def process_request(self, request, spider):
        """Pick a random header from the list and use it as the User-Agent."""
        ua = random.choice(settings.get('USER_AGENT_LIST'))
        spider.logger.info(msg='now entering download midware')
        if ua:
            request.headers['User-Agent'] = ua
            # Add desired logging message here.
            spider.logger.info(u'User-Agent is : {} {}'.format(request.headers.get('User-Agent'), request))
3. Add the following to the settings.py file:

DOWNLOADER_MIDDLEWARES = {
    'ProjectName.MidWare.HeaderMidWare.ProcessHeaderMidware': 543,
}
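To check that the rotation works, a throwaway spider can hit httpbin.org, which echoes back the User-Agent it received (the spider name and URL are my own choices for the test, not part of the setup above):

import scrapy


class UACheckSpider(scrapy.Spider):
    name = 'ua_check'
    # Duplicate start URLs are not filtered, so each one triggers a request
    # and should be logged with a different random User-Agent.
    start_urls = ['http://httpbin.org/user-agent'] * 5

    def parse(self, response):
        self.logger.info(response.text)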
The second method: using fake_useragent
fake-useragent is an open-source project on GitHub.
1. Install fake-useragent:

pip install fake-useragent
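Before wiring it into Scrapy you can try the library on its own in a Python shell; random, chrome, firefox and so on are attributes the UserAgent object exposes:

from fake_useragent import UserAgent

ua = UserAgent()
print(ua.random)   # a random real-world User-Agent string
print(ua.chrome)   # a random Chrome User-Agent
print(ua.firefox)  # a random Firefox User-Agent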
2. In the same MidWare package next to your spiders, write a user_agent_middlewares.py file with the following contents:
# -*- coding: utf-8 -*-
from fake_useragent import UserAgent


class RandomUserAgentMiddlware(object):
    """Randomly rotate the User-Agent using fake-useragent."""

    def __init__(self, crawler):
        super(RandomUserAgentMiddlware, self).__init__()
        self.ua = UserAgent()
        # Read the RANDOM_UA_TYPE value from the settings file.
        self.ua_type = crawler.settings.get('RANDOM_UA_TYPE', 'random')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        # Called by Scrapy for every outgoing request.
        def get_ua():
            return getattr(self.ua, self.ua_type)

        request.headers.setdefault('User-Agent', get_ua())
3. Add the following to settings.py:

RANDOM_UA_TYPE = 'random'  # e.g. random, chrome

DOWNLOADER_MIDDLEWARES = {
    'projectName.MidWare.user_agent_middlewares.RandomUserAgentMiddlware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
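Note that the built-in scrapy.downloadermiddlewares.useragent.UserAgentMiddleware is mapped to None: this disables it so it cannot overwrite the header our middleware sets. Because the middleware looks the RANDOM_UA_TYPE value up with getattr(), any attribute fake-useragent exposes (random, chrome, firefox, ...) is a valid setting.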
That completes the fake_useragent disguise. Compared with the first method you no longer have to maintain a long list of browser headers yourself; fake-useragent pulls those headers from https://fake-useragent.herokuapp.com/browsers/0.1.7.
You may see some errors the first time fake_useragent is enabled; as far as I can tell this is because the library has to fetch and cache the browser data over the network on first use.
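If that first run keeps failing, the 0.1.x releases of fake-useragent accept a fallback string that is returned whenever the browser data cannot be fetched; this is a hedged workaround sketch, so check the keyword against the version you have installed:

from fake_useragent import UserAgent

# If the cached browser data cannot be fetched, return this fixed
# User-Agent instead of raising an error.
ua = UserAgent(fallback='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')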
GitHub address: https://github.com/sea1234/fake-useragent