Python crawler: configuring a proxy in Scrapy
Reprinted from: http://www.python_tab.com/html/2014/pythonweb_0326/724.html
When crawling a site, the most common problem is that the site rate-limits by IP or has other anti-scraping measures. The best way around this is to rotate IPs by crawling through a proxy (a rotating variant is sketched at the end of this article).
Here is how to configure a proxy for crawling in Scrapy:
1. Create a new file "middlewares.py" in the Scrapy project directory
# Import the base64 library; we only need it if the proxy requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
    # Overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy
        # (b64encode avoids the trailing newline that base64.encodestring adds,
        # which would corrupt the header)
        encoded_user_pass = base64.b64encode(proxy_user_pass)
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
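The snippet above follows the original Python 2 code, where base64 works directly on str. On Python 3 the authentication lines would need to encode and decode bytes; a minimal sketch of just those lines, assuming the same variable names:

        # Python 3 variant of the authentication lines; the rest of the
        # middleware stays the same
        proxy_user_pass = "USERNAME:PASSWORD"
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass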
2. Add the following to the project configuration file (./pythontab/settings.py):
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'pythontab.middlewares.ProxyMiddleware': 100,
}

The numbers are middleware priorities: lower values have their process_request called first, so ProxyMiddleware sets request.meta['proxy'] before the built-in HttpProxyMiddleware handles the request.
Complete.
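As mentioned at the top, the real benefit of proxies comes from rotating IPs rather than sticking to a single one. The middleware below is a minimal sketch of that idea and is not part of the original article; PROXY_LIST and the addresses in it are placeholders you would replace with your own proxies (adding authentication as shown above if needed).

import random

# Placeholder list of proxies; replace with real ones
PROXY_LIST = [
    "http://PROXY_IP_1:PORT",
    "http://PROXY_IP_2:PORT",
    "http://PROXY_IP_3:PORT",
]

class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        # Pick a different proxy for each request so the target site
        # sees the traffic spread across several IPs
        request.meta['proxy'] = random.choice(PROXY_LIST)

Enable it in DOWNLOADER_MIDDLEWARES exactly like ProxyMiddleware in step 2.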