This article describes how to use a proxy with the Python Scrapy crawler framework.
1. Create "middlewares.py" under the Scrapy project
# Import base64; it is only needed if the proxy requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
    # Overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy; b64encode (unlike the
        # deprecated encodestring) does not append a trailing newline
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
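If you have more than one proxy available, a small variant of the same idea picks one at random for each request, much like the user-agent rotation in step 3 below. This is only a sketch: RandomProxyMiddleware and the PROXY_LIST entries are placeholder names, not part of the original recipe.

import random

class RandomProxyMiddleware(object):
    # Placeholder proxy addresses; replace with your own
    PROXY_LIST = [
        "http://PROXY_IP_1:PORT",
        "http://PROXY_IP_2:PORT",
    ]

    def process_request(self, request, spider):
        # request.meta['proxy'] is honoured by the built-in HttpProxyMiddleware,
        # so each request can go out through a different proxy
        request.meta['proxy'] = random.choice(self.PROXY_LIST)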
2. Add the following to the project configuration file (./project_name/settings.py)
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}
Note the two priority numbers: downloader middlewares with lower values have their process_request called first, so ProxyMiddleware (100) sets request.meta['proxy'] before the built-in HttpProxyMiddleware (110) handles the request. In only two steps, every request passes through the proxy. Test ^_^.
from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request

class TestSpider(CrawlSpider):
    name = "test"
    domain_name = "whatismyip.com"
    # The following url is subject to change, you can get the last updated one from here:
    # http://www.whatismyip.com/faq/automation.asp
    start_urls = ["http://xujian.info"]

    def parse(self, response):
        with open('test.html', 'wb') as f:
            f.write(response.body)
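To try it out (assuming the rest of the project is already set up), run the spider from the project directory:

scrapy crawl test

If you point start_urls at an IP-echo page, such as the whatismyip.com automation URL mentioned in the comment, the saved test.html should show the proxy's IP rather than your own.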
3. Use a random user-agent
By default, Scrapy collects every page with a single user-agent, which websites can easily block. The following code randomly selects a user-agent from a predefined list, so different pages are collected with different user-agents.
Add the following code to settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'Crawler.comm.rotate_useragent.RotateUserAgentMiddleware': 400,
}
Note: Crawler is the name of your project, that is, the directory that contains the spider code.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import random
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Randomly select a user-agent for this request
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

    # The default user_agent_list composes Chrome, IE, Firefox, Mozilla,
    # Opera and Netscape; for more user agent strings, see
    # http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
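A quick way to confirm the rotation works is to read the chosen user-agent back from each response, since every response keeps a reference to the request that produced it. This is a hypothetical checker spider; the name and start URL are placeholders:

from scrapy.spider import BaseSpider

class UACheckSpider(BaseSpider):
    name = "uacheck"
    start_urls = ["http://xujian.info"]

    def parse(self, response):
        # response.request is the request that produced this response,
        # so the middleware's choice is visible in its headers
        self.log("User-Agent used: %s" % response.request.headers.get('User-Agent'))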