Solving Scrapy Performance Issues - Case Three ("Junk" in the Downloader)

Source: Internet
Author: User
Tags: http request

Symptom: System throughput is lower than expected, and the number of request objects in the downloader sometimes appears to exceed CONCURRENT_REQUESTS.

Example: We use a 0.25-second download delay to simulate downloading 1,000 pages, with the default concurrency level of 16; according to the earlier formula, this should take about 19 s. In a pipeline, we use crawler.engine.download() to issue an additional HTTP request to a fake API whose response takes 1 s. Run the program:

$ time scrapy crawl speed -s SPEED_TOTAL_ITEMS=1000 -s \
SPEED_T_RESPONSE=0.25 -s SPEED_API_T_RESPONSE=1 -s \
SPEED_PIPELINE_API_VIA_DOWNLOADER=1
...
s/edule  d/load  scrape  p/line  done   mem
  968     ...     ...     ...      0   32768
  952     ...      0       0       0     ...
  936     ...     ...     ...     ...  32768
...
real	0m55.151s
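The ~19 s baseline quoted in the example can be sanity-checked with the simple throughput model used earlier in this series; the numbers below are the example's own settings:

```python
# Back-of-the-envelope check of the baseline run time, using the simple
# model: ideal time = pages * response_time / concurrency. The gap between
# this and the quoted ~19 s is startup/shutdown overhead.
pages = 1000
t_response = 0.25          # seconds per simulated download
concurrent_requests = 16   # Scrapy's default concurrency

ideal_time = pages * t_response / concurrent_requests
print(ideal_time)  # 15.625
```

The measured 55 s is roughly three times this baseline, which is what flags the downloader as the bottleneck.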

The result is somewhat odd: not only did the program take three times longer to run, but the number of active requests in the downloader (d/load) also exceeded CONCURRENT_REQUESTS. The downloader is clearly the bottleneck of the system, because its load has exceeded its capacity. Rerun the program, connect to Scrapy with telnet from another terminal, and check which request objects in the downloader are active:

$ telnet localhost 6023
>>> engine.downloader.active
set([<POST http://web:9312/ar:1/ti:1000/rr:0.25/benchmark/api>, ...])

The downloader is mostly busy processing the extra HTTP requests from the pipeline rather than downloading web pages for the crawler.

Discussion: At this point you might wish that nobody ever used the crawler.engine.download() function, because it looks complicated; yet Scrapy itself uses it in two places, in the robots.txt middleware and in the media pipeline. So when you need to access a web API, this function is a legitimate option: at least it is better than a blocking API such as Python's popular requests library, and it is easier than learning Twisted programming and then using the treq library. But it comes with a hidden cost.

This kind of error is difficult to debug, so make a habit of proactively inspecting the active requests in the downloader when checking performance. If you find requests to web APIs or multimedia URLs there, it means some of your pipelines use crawler.engine.download() to issue HTTP requests. The CONCURRENT_REQUESTS limit we set does not apply to these request objects, which is why we may find more requests in the downloader than CONCURRENT_REQUESTS. Unless the number of these pseudo-requests (that is, requests initiated with crawler.engine.download()) drops below CONCURRENT_REQUESTS, the scheduler will not dispatch new page requests (that is, requests generated by the crawler).

As a result, the actual throughput of the system matches that of a system whose response time is the 1 s API latency, rather than the 0.25 s page download. This case is particularly insidious because we will not notice the performance degradation at all unless the API access is slower than the page requests.

Solution: You can solve this problem by using the treq library instead of the crawler.engine.download() function. You will find that this improves the performance of the scraper, after which you can gradually increase CONCURRENT_REQUESTS while making sure the API server is not overloaded.

The following program performs the same function as the previous one, except that it uses treq:

$ time scrapy crawl speed -s SPEED_TOTAL_ITEMS=1000 -s \
SPEED_T_RESPONSE=0.25 -s SPEED_API_T_RESPONSE=1 -s \
SPEED_PIPELINE_API_VIA_TREQ=1
...
s/edule  d/load  scrape  p/line  done   mem
  936     ...     ...     ...     ...  49152
  887     ...     ...     ...     ...  66560
  823     ...     ...     ...     ...    ...
...
real	0m19.922s

You will notice something very interesting: there are more items in the pipeline (p/line) than requests in the downloader. This is normal, and it is worth understanding why.

As expected, the downloader is filled with 16 requests, which means the system throughput is T = N/S = 16/0.25 = 64 requests/sec. We can confirm this by watching the growth of the done column. A request spends 0.25 s in the downloader, but because of the slow API access it spends 1 s in the pipeline. This means that in the pipeline (p/line) we expect to see on average N = T * S = 64 * 1 = 64 items. This does not mean the pipeline is the bottleneck, because there is no limit on the number of responses that can be processed concurrently in the pipeline. As long as that number does not grow without bound, everything is fine.
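The arithmetic above can be checked directly, using the example's own settings:

```python
# Throughput is set by the downloader (T = N/S), and Little's law (N = T*S)
# gives the expected number of items sitting in the pipeline at any moment.
n_downloader = 16   # CONCURRENT_REQUESTS, the downloader's capacity
t_download = 0.25   # seconds each page spends in the downloader
t_api = 1.0         # seconds each item spends in the pipeline (API call)

throughput = n_downloader / t_download   # T = N / S
items_in_pipeline = throughput * t_api   # N = T * S

print(throughput)         # 64.0 requests/sec
print(items_in_pipeline)  # 64.0 items on average in p/line
```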
