(1) An answer at "http://www.zhihu.com/question/20899988" says:
"Well, suppose you now have 100 machines available — how would you implement a distributed crawl algorithm in Python?
We call the 99 smaller machines of the 100 "slaves" and the one larger machine the "master". Look back at the url_queue in the code above: if we put this queue on the master, every slave can connect to the master over the network. Whenever a slave finishes downloading a page, it asks the master for a new page to crawl; and whenever a slave parses a new page, it sends all the links found on that page to the master's queue. Likewise, the Bloom filter is placed on the master, but now the master only hands a slave URLs it has determined have not been visited. The Bloom filter sits in the master's memory, while the visited URLs are kept in a Redis instance running on the master, so that all operations are O(1). (At least amortized O(1); for the access complexity of Redis, see: LINSERT – Redis.)"
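The master-side logic quoted above can be sketched in plain Python. This is only an illustrative stand-in: an in-memory deque plays the role of the Redis list, the `BloomFilter` is a minimal toy implementation, and the `Master`, `submit`, and `next_url` names are invented for the sketch.

```python
import hashlib
from collections import deque


class BloomFilter:
    """Toy in-memory Bloom filter, standing in for the one kept in
    the master's memory in the quoted design."""

    def __init__(self, size=2 ** 20, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size // 8)

    def _positions(self, item):
        # Derive several bit positions per item by salting the hash.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


class Master:
    """Holds the shared url_queue. Slaves push discovered links via
    submit() and pop fresh work via next_url(); a deque stands in
    for the Redis list that would be used across machines."""

    def __init__(self):
        self.url_queue = deque()
        self.seen = BloomFilter()

    def submit(self, url):
        # Only enqueue URLs the Bloom filter has not seen (an O(1) check),
        # so slaves are only ever handed unvisited URLs.
        if url not in self.seen:
            self.seen.add(url)
            self.url_queue.append(url)

    def next_url(self):
        # Hand a slave the next unvisited URL, or None if idle.
        return self.url_queue.popleft() if self.url_queue else None


master = Master()
master.submit("http://example.com/a")
master.submit("http://example.com/a")  # duplicate: filtered out by the Bloom filter
master.submit("http://example.com/b")
print(master.next_url())  # http://example.com/a
print(master.next_url())  # http://example.com/b
```

In the real design the deque would be replaced by Redis list operations (LPUSH from the slaves, RPOP on dispatch), which is what makes the queue reachable from all 100 machines.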
This describes one distributed spider crawling on multiple machines at once (the answer does not directly say how many spider processes run on each machine, but from the analysis it appears to be one), with the distribution implemented via scrapy-redis, where the queue in question is a Redis queue. The realization is to use Redis to store the URLs (split into url_no, the URLs yet to be crawled, and url_yes, the visited URLs, which the author accesses through url_yes plus the Bloom filter). That is the role Redis plays in distributed crawling.
(2) As described at "http://www.douban.com/group/topic/38363928/":
"Distribution is implemented with Redis. Redis stores the project's requests and stats, so the crawlers on all machines can be managed centrally, resolving the crawler's performance bottleneck. Redis's efficiency and ease of scaling also make downloading easy to scale: when Redis storage or access speed hits a bottleneck, throughput can be improved by increasing the number of Redis instances and crawler machines."
This is the same idea, except that here it is the requests that are stored in Redis, which matches the example in scrapy-redis. (Younghz: the example also implements a crawler that reads URLs from Redis — which is exactly the approach analyzed in (1), isn't it?)
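In practice, storing requests and the dupe filter in Redis is wired up through Scrapy's settings when using the scrapy-redis package. A hedged sketch of the relevant entries follows; the setting names are taken from the scrapy-redis project and may differ across versions, and `master-host` is a placeholder for the machine running Redis.

```python
# settings.py (sketch; assumes the scrapy-redis package is installed)

# Schedule requests through a Redis-backed queue instead of memory,
# so every machine's crawler pulls work from the same place.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate request fingerprints in Redis, shared by all crawlers.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and dupe filter in Redis between runs.
SCHEDULER_PERSIST = True

# The central Redis instance on the master machine.
REDIS_URL = "redis://master-host:6379"
```

With this configuration, each machine runs an ordinary Scrapy spider, and Redis does the coordination: the scheduler pushes and pops serialized requests, and the dupe filter answers "seen before?" for all crawlers at once.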
So the two approaches above are both applications of Redis in a distributed Scrapy crawler. Essentially, every participant (every machine, every spider) puts what it has obtained (URLs, requests) into one shared place (the request queue) for scheduling.
This article is an English version of an article originally written in Chinese on aliyun.com.