This article covers the usage of downloader middleware in Scrapy, with examples of commonly used middleware.
Downloader middleware sits in the middle of Scrapy's request/response processing: it handles requests as the scheduler sends them out to be downloaded and responses as the website's results come back to the spiders. In other words, downloader middleware is a set of hooks for modifying Scrapy requests and responses.
Write your own downloader middleware
To write a downloader middleware, you define a Python class that implements one or more of the methods described below.
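Schematically, such a class is just a plain class with the relevant hook methods; the following is only a minimal sketch with placeholder bodies (the class name is arbitrary, and you only need the methods you actually use):

class MyDownloaderMiddleware(object):

    def process_request(self, request, spider):
        # Called for each request passing through the downloader middleware.
        return None

    def process_response(self, request, response, spider):
        # Called with each response returned from the downloader.
        return response

    def process_exception(self, request, exception, spider):
        # Called when downloading or process_request() raises an exception.
        return None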
To demonstrate how middleware is used, let's create a project to practice with; this project crawls the site httpbin.org:
scrapy startproject httpbintest
cd httpbintest
scrapy genspider httpbin httpbin.org
The directory structure after creation is as follows:
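Roughly, it is the standard skeleton that Scrapy generates (the spider file name comes from the genspider command above):

httpbintest/
    scrapy.cfg
    httpbintest/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            httpbin.py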
Here we first write a simple proxy middleware that disguises our IP.
After creating the spider, change the parse method in httpbin.py to:
def parse(self, response):
    print(response.text)
Then start the crawler from the command line: scrapy crawl httpbin
At the bottom of the output we can see "origin": "114.250.88.66", which is our own IP address.
What we want to do is disguise the IP through a proxy middleware. In middlewares.py, write the following middleware class:
import logging

class ProxyMiddleware(object):
    logger = logging.getLogger(__name__)

    def process_request(self, request, spider):
        self.logger.debug('Using Proxy')
        request.meta['proxy'] = 'http://127.0.0.1:9743'
        return None
Here I have a local proxy (used for getting over the firewall) at http://127.0.0.1:9743, so I set it as the proxy directly; the proxy exits through a Japanese IP.
Then enable the downloader middleware in the settings.py configuration file; it is disabled (commented out) by default.
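A minimal sketch of the setting, assuming the project and class names used above (543 is just the priority value from the commented-out template):

DOWNLOADER_MIDDLEWARES = {
    'httpbintest.middlewares.ProxyMiddleware': 543,
}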
Then we start the crawler again: scrapy crawl httpbin
From the output log we can see that the middleware we defined has been enabled, the log message we print appears, and the "origin" IP address has changed to the Japanese IP, so our proxy middleware works.
Detailed description of class scrapy.downloadermiddlewares.DownloaderMiddleware
process_request(request, spider)
This method is called for each request that passes through the downloader middleware. It must either return None, return a Response object, return a Request object, or raise IgnoreRequest; the effects of these return values differ.
None: Scrapy continues processing the request, executing the corresponding methods of the other middlewares, until the appropriate downloader handler (download handler) is called and the request is executed (its response downloaded).
Response object: Scrapy will not call any other process_request() or process_exception() methods, nor the corresponding download function; it returns that response. The process_response() methods of the installed middlewares are called for every response that is returned.
Request object: Scrapy stops calling process_request methods and reschedules the returned request. Once the newly returned request is executed, the corresponding middleware chain is called for the downloaded response.
Raise an IgnoreRequest exception: the process_exception() methods of the installed downloader middlewares are called. If none of them handles the exception, the request's errback (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged.
process_response(request, response, spider)
process_response() likewise has three possible outcomes: return a Response object, return a Request object, or raise an IgnoreRequest exception.
If it returns a Response (which can be the same as the incoming response or a completely new object), that response is processed by the process_response() methods of the other middlewares in the chain.
If it returns a Request object, the middleware chain is halted and the returned request is rescheduled for download. This is handled similarly to a request returned by process_request().
If it raises an IgnoreRequest exception, the request's errback (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).
Let's write a simple example, still in the project above. We add the following code to the middleware:
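A minimal sketch of such a process_response() in our ProxyMiddleware, assuming we simply rewrite the status code (201 here is an arbitrary value chosen so the change is easy to spot from the spider):

class ProxyMiddleware(object):
    # ... process_request() from above stays as it is ...

    def process_response(self, request, response, spider):
        # Rewrite the status code so the spider can see that this middleware ran.
        response.status = 201
        return response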
Then print the status code in the spider:
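For example, the parse method in httpbin.py could be changed to something like:

def parse(self, response):
    print(response.status)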
So when we re-run the crawler, we can see the following:
process_exception(request, exception, spider)
Scrapy calls process_exception() when a download handler or a process_request() (from a downloader middleware) raises an exception (including an IgnoreRequest exception).
process_exception() also returns one of three things: None, a Response object, or a Request object.
If it returns None, Scrapy continues processing this exception, calling the process_exception() methods of the other installed middlewares until all of them have been called, and then falls back to the default exception handling.
If it returns a Response object, the process_response() methods of the installed middleware chain are called; Scrapy will not call the process_exception() methods of any other middlewares.
If it returns a Request object, the returned request is rescheduled for download, and, just as when returning a response, this stops the other middlewares' process_exception() methods from executing. This is very useful: it effectively lets us retry failed requests here. For example, when a site blocks our IP because we crawl too frequently, we can add a proxy at this point and keep accessing it. We demonstrate this with the following example.
Here we create a Google crawler: scrapy genspider google www.google.com
Then start it with scrapy crawl google and you can see the following:
Here we write a middleware that adds a proxy when access fails.
First we modify the code in google.py to set the timeout to 10 seconds, otherwise we wait too long. We do this by overriding the spider's make_requests_from_url method and setting the download timeout to 10s there.
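A sketch of what google.py might look like after the change, passing the timeout through the download_timeout key in the request meta:

import scrapy

class GoogleSpider(scrapy.Spider):
    name = 'google'
    allowed_domains = ['www.google.com']
    start_urls = ['http://www.google.com/']

    def make_requests_from_url(self, url):
        # Fail quickly instead of waiting for the default download timeout.
        return scrapy.Request(url, meta={'download_timeout': 10},
                              callback=self.parse, dont_filter=True)

    def parse(self, response):
        print(response.text)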
Then restart the crawler with scrapy crawl google; you can see the following:
Here, if we don't want to retry, we can turn the retry middleware off:
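Roughly, in settings.py (keeping the ProxyMiddleware entry from earlier and pointing at the built-in scrapy.downloadermiddlewares.retry.RetryMiddleware):

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'httpbintest.middlewares.ProxyMiddleware': 543,
}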
With this setting the failed-request retry middleware is disabled; setting a middleware to None turns it off. Restart the crawler and we can see that it no longer retries but fails with an error directly.
We change the proxy middleware to the following, so that when an exception is encountered a proxy is added to the request and the request object is returned; this makes Scrapy request Google again, this time through the proxy.
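A sketch of the changed middleware, reusing the local proxy address from earlier:

import logging

class ProxyMiddleware(object):
    logger = logging.getLogger(__name__)

    def process_exception(self, request, exception, spider):
        self.logger.debug('Get Exception')
        # On failure, add the proxy and return the request so Scrapy downloads it again.
        request.meta['proxy'] = 'http://127.0.0.1:9743'
        return request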
Restart the Google crawler; we can see that first our printed log message "Get Exception" appears, and then, with the proxy added, Google is accessed successfully. Since my proxy is a Japanese node, we land on the Japanese Google site.