1 Proxy middleware
The core of the code is to randomly select a proxy IP and port for each request. As for where the proxies come from, you can either purchase them from a proxy provider or crawl them from the Internet.
```python
# -*- coding: utf-8 -*-
"""
Created on June 14, 2017
@author: Dzm
"""
from eie.middlewares import udf_config
from eie.service.EieIpService import EieIpService

logger = udf_config.logger
eieIpService = EieIpService()


class ProxyMiddleware(object):
    """IP proxy middleware"""

    def process_request(self, request, spider):
        """Attach an IP proxy to the outgoing request"""
        if request.url.startswith('http://'):
            proxy = eieIpService.select_rand('HTTP')
            if proxy:
                proxies = 'http://{}:{}'.format(proxy['ip'], proxy['port'])
            else:
                raise Exception('No suitable IP proxy found')
        else:
            proxy = eieIpService.select_rand('HTTPS')
            if proxy:
                proxies = 'https://{}:{}'.format(proxy['ip'], proxy['port'])
            else:
                raise Exception('No suitable IP proxy found')
        request.meta['proxy'] = proxies
```
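Stripped of the project-specific eie services, the work that process_request does is just formatting a proxy record into a scheme-matched proxy URL. A minimal standalone sketch (build_proxy_url is a hypothetical helper name, not part of the project):

```python
def build_proxy_url(url, proxy):
    """Format a proxy record (a dict with 'ip' and 'port') into the
    proxy URL Scrapy expects in request.meta['proxy'], matching the
    scheme of the target URL."""
    if not proxy:
        raise Exception('No suitable IP proxy found')
    scheme = 'http' if url.startswith('http://') else 'https'
    return '{}://{}:{}'.format(scheme, proxy['ip'], proxy['port'])

# An http:// request gets an http:// proxy URL
print(build_proxy_url('http://example.com/page', {'ip': '1.2.3.4', 'port': 8080}))
# -> http://1.2.3.4:8080
```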
2 Settings Configuration
Pay attention here to the relative order of RetryMiddleware and ProxyMiddleware; get it wrong and Scrapy will not retry after a connection timeout, it will just keep timing out and reconnecting. I asked about this in a QQ group and nobody could answer my question.
The order of the middlewares matters; for a detailed explanation see: the configuration order of the DOWNLOADER_MIDDLEWARES setting in Scrapy.
```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'eie.middlewares.random_user_agent.RandomUserAgent': 400,
    # ProxyMiddleware needs a number below RetryMiddleware's 500 so the
    # proxy is attached before any retry; the exact priority values here
    # are illustrative, as the originals did not survive.
    'eie.middlewares.proxy_middleware.ProxyMiddleware': 499,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 500,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}
```
For reference, the default order is:
```python
DOWNLOADER_MIDDLEWARES_BASE = {
    'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
    'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
    'scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware': 550,
    'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580,
    'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware': 830,
    'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850,
    'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
}
```
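For process_request, Scrapy merges the custom dict over the base dict, drops entries set to None, and calls the middlewares in ascending priority order. A toy simulation of that merge (illustrative only, not the real Scrapy API) shows why ProxyMiddleware must get a number below RetryMiddleware's 500:

```python
# Toy model: merge custom settings over the defaults, drop disabled
# (None) middlewares, and sort by priority to get the call order.
base = {'UserAgentMiddleware': 400, 'RetryMiddleware': 500, 'HttpProxyMiddleware': 750}
custom = {'UserAgentMiddleware': None, 'RandomUserAgent': 400, 'ProxyMiddleware': 499}

merged = dict(base, **custom)
active = {name: prio for name, prio in merged.items() if prio is not None}
order = sorted(active, key=active.get)
print(order)
# -> ['RandomUserAgent', 'ProxyMiddleware', 'RetryMiddleware', 'HttpProxyMiddleware']
```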
3 Check that the IP proxy is successful
From the response in the figure below, you can see that the request really did go out through the proxy address.
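A programmatic way to do the same check is to request an echo service such as httpbin.org/ip through the proxy and compare the reported origin with the proxy's IP; the check itself boils down to JSON parsing (the response body below is a simulated example, not a real capture):

```python
import json

def proxy_in_effect(response_body, proxy_ip):
    """True if the IP echoed back by httpbin.org/ip matches the proxy's IP."""
    origin = json.loads(response_body)['origin']
    # httpbin may report a comma-separated chain of forwarding IPs
    return proxy_ip in origin.split(', ')

# Simulated body, as if the request had gone out through 1.2.3.4
body = '{"origin": "1.2.3.4"}'
print(proxy_in_effect(body, '1.2.3.4'))  # True
```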
4 Proxies
For background, see: IP proxies, their role and usage; and the pros and cons of using proxy IPs.
5 Get random proxy
```python
def select_rand_zhimadaili(self):
    """Get an IP proxy list from the Zhima proxy service
    @see: http://http.zhimadaili.com/index/api/newapi.html
    """
    try:
        proxy_keys_list = self.redis.keys()
        if proxy_keys_list:
            # Serve a random cached proxy from Redis
            strs = self.redis.get(random.choice(proxy_keys_list))
            return eval(strs)
        else:
            # Cache is empty: fetch a fresh batch of 20 proxies from the API
            socket = urllib2.urlopen('http://http.zhimadaili.com/index/api/new_get_use_ips.html?num=20&type=2&pro=0&city=0&yys=0&port=1&time=1')
            result = json.loads(socket.read())
            socket.close()
            proxy_list = result['data']
            for proxy in proxy_list:
                name = '{}:{}'.format(proxy['ip'], proxy['port'])
                # Cache each proxy with a TTL; the 60s here is illustrative,
                # as the original timeout value is unclear
                self.redis.setex(name, proxy, 60)
            return random.choice(proxy_list)
    except Exception, e:
        logger.error('Request to zhimadaili.com for proxies failed: %s' % e)
```
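The structure above is cache-then-pick: serve from Redis when keys exist, otherwise fetch from the API, cache each proxy, and pick one at random. With a plain dict standing in for Redis and a stubbed API call (both hypothetical stand-ins, not the real clients), the pattern looks like:

```python
import random

cache = {}  # stand-in for Redis

def fetch_from_api():
    # Stand-in for the zhimadaili HTTP call; returns a parsed 'data' list
    return [{'ip': '1.2.3.4', 'port': 80}, {'ip': '5.6.7.8', 'port': 8080}]

def select_rand():
    if cache:
        # Cache hit: pick a random cached proxy
        return random.choice(list(cache.values()))
    # Cache miss: fetch, cache each proxy (real code uses setex with a TTL),
    # then pick one at random
    proxy_list = fetch_from_api()
    for proxy in proxy_list:
        cache['{}:{}'.format(proxy['ip'], proxy['port'])] = proxy
    return random.choice(proxy_list)

proxy = select_rand()
print(proxy in fetch_from_api())  # True: the result always comes from the fetched list
```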
The following approach picks a random record from the database (MySQL in this case):
```python
def select_rand_db(self, types=None):
    """Select one random proxy record from MySQL, optionally by type"""
    if types:
        sql = "SELECT ip, port, types FROM eie_ip WHERE types='{}' ORDER BY rand() LIMIT 1".format(types)
    else:
        sql = "SELECT ip, port, types FROM eie_ip ORDER BY rand() LIMIT 1"
    df = pd.read_sql(sql, self.engine)
    results = json.loads(df.to_json(orient='records'))
    if results and len(results) == 1:
        return results[0]
    return None
```
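The same ORDER BY rand() LIMIT 1 trick can be tried locally with SQLite, whose equivalent function is RANDOM(). A minimal sketch against an in-memory database (table and sample rows are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE eie_ip (ip TEXT, port INTEGER, types TEXT)')
conn.executemany('INSERT INTO eie_ip VALUES (?, ?, ?)',
                 [('1.2.3.4', 80, 'HTTP'), ('5.6.7.8', 443, 'HTTPS')])

def select_rand_sqlite(types=None):
    """SQLite version of select_rand_db: one random row, optionally by type."""
    if types:
        row = conn.execute(
            'SELECT ip, port, types FROM eie_ip WHERE types=? ORDER BY RANDOM() LIMIT 1',
            (types,)).fetchone()
    else:
        row = conn.execute(
            'SELECT ip, port, types FROM eie_ip ORDER BY RANDOM() LIMIT 1').fetchone()
    return dict(zip(('ip', 'port', 'types'), row)) if row else None

print(select_rand_sqlite('HTTP'))  # {'ip': '1.2.3.4', 'port': 80, 'types': 'HTTP'}
```

Note that the parameterized `?` placeholders also avoid the SQL-injection risk of building the WHERE clause with string formatting.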