When crawling site content, the most common problem is that the site limits requests per IP address and applies anti-scraping measures. The most reliable workaround is to crawl through proxies and rotate the IP addresses you use.
Here is how to configure a proxy for crawling in Scrapy.
1. Create a new file, "middlewares.py", inside the Scrapy project (alongside settings.py)
    # base64 is needed ONLY if the proxy we are going to use requires authentication
    import base64


    # Start your middleware class
    class ProxyMiddleware(object):
        # Override process_request
        def process_request(self, request, spider):
            # Set the location of the proxy
            request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

            # Use the following lines if your proxy requires authentication
            proxy_user_pass = "USERNAME:PASSWORD"
            # Set up basic authentication for the proxy.
            # b64encode is used because base64.encodestring was removed in
            # Python 3.9 and also appended a trailing newline to the value.
            encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
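The introduction recommends rotating IPs, while the middleware in step 1 always uses a single proxy. The following is a minimal sketch of one way to rotate, assuming you have a pool of proxies available. The names `PROXY_POOL`, `basic_auth_header`, and `RotatingProxyMiddleware` are hypothetical and not part of Scrapy; only `request.meta['proxy']` and the `Proxy-Authorization` header come from the original example.

```python
import base64
import random

# Hypothetical proxy pool (an assumption): replace with proxies you control
PROXY_POOL = [
    "http://YOUR_PROXY_IP_1:PORT",
    "http://YOUR_PROXY_IP_2:PORT",
    "http://YOUR_PROXY_IP_3:PORT",
]


def basic_auth_header(user_pass):
    """Return a Basic-scheme value for the Proxy-Authorization header."""
    token = base64.b64encode(user_pass.encode("utf-8")).decode("ascii")
    return "Basic " + token


class RotatingProxyMiddleware(object):
    """Sketch: pick a random proxy from the pool for every request."""

    def __init__(self, proxies=None, user_pass=None):
        self.proxies = list(proxies or PROXY_POOL)
        self.user_pass = user_pass  # e.g. "USERNAME:PASSWORD", or None

    def process_request(self, request, spider):
        # Choose a proxy per request so traffic is spread across several IPs
        request.meta["proxy"] = random.choice(self.proxies)
        # Add credentials only when the proxies require authentication
        if self.user_pass:
            request.headers["Proxy-Authorization"] = basic_auth_header(self.user_pass)
```

Random choice is the simplest policy. A round-robin cycle, or dropping proxies that start failing, would be natural refinements, but they are beyond what the original describes. To try this sketch, register the class in DOWNLOADER_MIDDLEWARES in place of ProxyMiddleware.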
2. Add the following to the project configuration file (./pythontab/settings.py)
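    DOWNLOADER_MIDDLEWARES = {
        # Built-in proxy middleware (on Scrapy releases before 1.0 the path was
        # scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware)
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
        # Custom middleware from step 1; the lower number means its
        # process_request runs first
        'pythontab.middlewares.ProxyMiddleware': 100,
    }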
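The numbers in DOWNLOADER_MIDDLEWARES are priorities. Scrapy calls `process_request` in increasing order, so the custom middleware at 100 sets the proxy before the built-in HttpProxyMiddleware at 110 sees the request. The sketch below only sorts the dictionary to make that order visible. It does not invoke Scrapy, and it ignores the default middlewares that Scrapy merges in. `request_order` is a hypothetical helper name, and the built-in middleware's path is written in its current `scrapy.downloadermiddlewares` form.

```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'pythontab.middlewares.ProxyMiddleware': 100,
}


def request_order(middlewares):
    """Return middleware paths in the order process_request would be called."""
    # Entries set to None are disabled in Scrapy, so skip them
    enabled = {path: prio for path, prio in middlewares.items() if prio is not None}
    # Lower priority numbers come first
    return sorted(enabled, key=enabled.get)


for path in request_order(DOWNLOADER_MIDDLEWARES):
    print(path)
```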
Configuring a proxy for a Python Scrapy crawler - Yi Tang