Analysis of the scrapy-redis implementation of Scrapy distributed crawling

Tags: redis
(1) in "http://www.zhihu.com/question/20899988", referred to:

"Well, suppose you now have 100 machines to work with and how to implement a distributed crawl algorithm with Python."

"We call the 99 smaller machines among the 100 the slaves, and the one larger machine the master. Now look back at the url_queue in the code above: if we can put this queue on the master, then every slave can connect to the master over the network, and whenever a slave finishes downloading a web page, it asks the master for a new page to crawl. And each time a slave grabs a new web page, it sends all of that page's links to the master's queue. Similarly, the Bloom filter is placed on the master, but now the master only sends out URLs that are determined not to have been visited. The Bloom filter sits in the master's memory, while the visited URLs are stored in a Redis instance running on the master, so that all operations are O(1). (At least amortized O(1); for the access efficiency of Redis see: LINSERT – Redis)"

In (1), the distributed spider crawls on multiple machines at the same time (the number of spiders running on each machine is not stated directly, but from the analysis it appears to be one), and this distribution is implemented through scrapy-redis, where the queue is a Redis queue. The realization is to use Redis to store the URLs, split into an unvisited part (url_no) and a visited part (url_yes), with visited-URL lookups done through url_yes plus the Bloom filter. This is the role of Redis in distributed crawling.
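As a concrete illustration of the scheme in (1), here is a minimal sketch. The key names url_no and url_yes follow the article; the Redis host "master", the Bloom filter parameters, and the download/extract_links helpers are illustrative assumptions, not part of the original:

    import hashlib

    import redis

    r = redis.Redis(host="master", port=6379)  # Redis runs on the master machine


    class BloomFilter:
        """Tiny in-memory Bloom filter, kept on the master as in the quote."""

        def __init__(self, size_bits=1 << 24, num_hashes=4):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, url):
            for i in range(self.num_hashes):
                digest = hashlib.md5(f"{i}:{url}".encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, url):
            for pos in self._positions(url):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, url):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(url))


    seen = BloomFilter()


    def enqueue(url):
        """Master side: accept a discovered link, queue it only if unseen.

        The Bloom filter gives an O(1) membership test; the Redis set
        url_yes holds the exact record of visited URLs."""
        if url in seen:
            return
        seen.add(url)
        if not r.sismember("url_yes", url):
            r.rpush("url_no", url)  # hand it to the unvisited queue


    def slave_loop(download, extract_links):
        """Slave side: ask the master for a page, crawl it, report links.

        In a real deployment enqueue() would be a network call to the
        master; download() and extract_links() stand in for the spider."""
        while True:
            url = r.lpop("url_no")  # ask the master for a new page to crawl
            if url is None:
                break
            url = url.decode()
            page = download(url)
            r.sadd("url_yes", url)  # mark the page as visited
            for link in extract_links(page):
                enqueue(link)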

(2) As described in "http://www.douban.com/group/topic/38363928/":
"Distribution is implemented with Redis: Redis stores the project's requests and stats information, which allows centralized management of the crawlers on every machine and thus removes the crawler's performance bottleneck. Redis's efficiency and easy scalability make high-throughput downloading straightforward: when Redis storage or access speed becomes the bottleneck, it can be improved by increasing the number of Redis clusters and crawler clusters."

This is the same idea, except that what it points out here is that the request is stored in Redis, which matches the example project in scrapy-redis; of course, that example also implements a crawler that reads URLs from Redis (younghz: isn't that exactly what was analyzed in (1)? Right, right).
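For reference, a minimal sketch of what this looks like with scrapy-redis, following its documented settings; the Redis host, spider name, and key are illustrative assumptions:

    # settings.py: route scheduling and dedup through Redis
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"  # requests are queued in Redis
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # dedup via a Redis set
    SCHEDULER_PERSIST = True  # keep the queue between runs
    REDIS_URL = "redis://master:6379"

    # myspider.py: a crawler that reads its start URLs from Redis,
    # like the one implemented in the scrapy-redis example project
    from scrapy_redis.spiders import RedisSpider

    class MySpider(RedisSpider):
        name = "myspider"
        redis_key = "myspider:start_urls"  # Redis list the spider waits on

        def parse(self, response):
            yield {"url": response.url,
                   "title": response.css("title::text").get()}

Every machine runs the same spider process; pushing one URL, e.g. redis-cli lpush myspider:start_urls http://example.com/, seeds the whole cluster, after which all new requests are shared through the Redis queue.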

So the above two methods are the applications of Redis in Scrapy distributed crawling. Essentially, everyone (every machine, every spider) puts what it obtains (URLs, requests) into one place (the request queue) to be scheduled.