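When crawling a site's content, the most common problem is that the site limits how many requests a single IP address may make and has anti-scraping measures in place. The most reliable way around this is to rotate IP addresses while crawling, by sending requests through proxies.
Here is how to configure a proxy for crawling in Scrapy.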
1. Create a new file named "middlewares.py" in the Scrapy project
    # base64 is needed only if the proxy requires authentication
    import base64

    # Start your middleware class
    class ProxyMiddleware(object):
        # Override process_request so every request goes through the proxy
        def process_request(self, request, spider):
            # Set the location of the proxy
            request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

            # Use the following lines only if your proxy requires authentication
            proxy_user_pass = "USERNAME:PASSWORD"
            # Set up basic authentication for the proxy.
            # b64encode replaces base64.encodestring, which appended a
            # trailing newline and was removed in Python 3.9.
            encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
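The middleware above always sends requests through one fixed proxy, while the stated goal is IP rotation. The following is a minimal sketch of one way to rotate, assuming you have a list of proxies available. The names `PROXY_POOL`, `RotatingProxyMiddleware`, and the `chooser` parameter are hypothetical and not part of Scrapy; the proxy addresses and credentials are placeholders.

```python
import base64
import random

# Hypothetical proxy pool (an assumption for illustration).
# "auth" is "USERNAME:PASSWORD" for proxies that need it, otherwise None.
PROXY_POOL = [
    {"url": "http://PROXY_1_IP:PORT", "auth": None},
    {"url": "http://PROXY_2_IP:PORT", "auth": "USERNAME:PASSWORD"},
]


class RotatingProxyMiddleware(object):
    """Pick one proxy from the pool for every outgoing request."""

    def __init__(self, pool=None, chooser=random.choice):
        # Both arguments have defaults, so the class can be created
        # without arguments; chooser is injectable to make testing easy.
        self.pool = pool if pool is not None else PROXY_POOL
        self.chooser = chooser

    def process_request(self, request, spider):
        proxy = self.chooser(self.pool)
        request.meta['proxy'] = proxy["url"]
        if proxy["auth"]:
            token = base64.b64encode(proxy["auth"].encode()).decode()
            request.headers['Proxy-Authorization'] = 'Basic ' + token
        else:
            # Do not send credentials meant for a different proxy.
            request.headers.pop('Proxy-Authorization', None)
        # Returning None lets the request continue through the chain.
        return None
```

Random choice is the simplest policy. Round-robin selection, or dropping proxies that keep failing, would follow the same shape but is left out of this sketch.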
2. Register the middleware in the project settings file (./pythontab/settings.py)
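    DOWNLOADER_MIDDLEWARES = {
        # Current import path of the built-in proxy middleware; old Scrapy
        # releases used scrapy.contrib.downloadermiddleware.httpproxy instead.
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
        # The lower number means our middleware sees each request first.
        'pythontab.middlewares.ProxyMiddleware': 100,
    }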
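The numbers in this setting decide the order in which the middlewares handle a request: Scrapy calls each middleware's process_request in increasing order of these values. The sketch below is a simplified model of that ordering, not Scrapy's own code. The helper `request_order` is hypothetical, and the built-in middleware's path should be adjusted to match your Scrapy version.

```python
# Mirrors the setting above: dotted class path -> order value.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'pythontab.middlewares.ProxyMiddleware': 100,
}


def request_order(middlewares):
    """Return the class paths in the order process_request would run.

    A value of None disables a middleware, so such entries are skipped.
    """
    enabled = {path: order for path, order in middlewares.items()
               if order is not None}
    return sorted(enabled, key=enabled.get)


for path in request_order(DOWNLOADER_MIDDLEWARES):
    print(path)
# prints:
# pythontab.middlewares.ProxyMiddleware
# scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware
```

With 100 and 110, the custom ProxyMiddleware sets request.meta['proxy'] before the built-in HttpProxyMiddleware processes the request.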
Python crawler scrapy using proxy configuration----------Yi Tang