When crawling site content, the most common problem is that the site rate-limits each IP and has anti-scraping measures; the most effective countermeasure is to rotate IPs by crawling through proxies.
Here is how to configure a proxy for crawling in Scrapy.
1. Create a new "middlewares.py" file under the Scrapy project
# Import the base64 library; we only need it if the proxy
# we are going to use requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
    # Overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy
        # (base64.encodestring was removed in Python 3; b64encode works instead)
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
2. Register the middleware in the project configuration file (./pythontab/settings.py)
DOWNLOADER_MIDDLEWARES = {
    # In newer Scrapy versions this built-in middleware lives at
    # 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware'
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'pythontab.middlewares.ProxyMiddleware': 100,
}
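The middleware above always sends traffic through a single fixed proxy. Since the opening paragraph recommends IP rotation, here is a minimal sketch of a rotating variant that picks a random proxy from a pool on each request. The `PROXY_POOL` list and the `RandomProxyMiddleware` name are assumptions for illustration, not part of Scrapy; replace the pool entries with your own proxies.

```python
import base64
import random

# Hypothetical proxy pool -- replace these entries with your own proxies.
# Entries may embed credentials in the usual user:pass@host:port form.
PROXY_POOL = [
    "http://user1:pass1@1.2.3.4:8080",
    "http://5.6.7.8:3128",
]

class RandomProxyMiddleware(object):
    """Sketch of a downloader middleware that rotates proxies per request."""

    def process_request(self, request, spider):
        proxy = random.choice(PROXY_POOL)
        if "@" in proxy:
            # Split embedded credentials out of http://user:pass@host:port
            scheme, rest = proxy.split("://", 1)
            user_pass, host = rest.split("@", 1)
            request.meta['proxy'] = "%s://%s" % (scheme, host)
            # Basic authentication header, same as in the fixed-proxy version
            encoded = base64.b64encode(user_pass.encode()).decode()
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded
        else:
            request.meta['proxy'] = proxy
```

Register it in settings.py the same way as ProxyMiddleware above (e.g. 'pythontab.middlewares.RandomProxyMiddleware': 100). A random choice is the simplest policy; a round-robin over the pool, or removing proxies that repeatedly fail, are natural refinements.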