Analysis and implementation of distributed crawling with scrapy-redis

I. Analysis of scrapy-redis's distributed crawling

Scrapy-redis is essentially Scrapy + Redis, and it uses the redis-py client for all of its Redis operations. The role Redis plays in scrapy-redis is explained in the README (Readme.rst) that I translated in my fork of the repository (https://github.com/younghz/scrapy-redis).
In a previous article I analyzed two related write-ups on using Redis to build a distributed crawling hub. It boils down to this: every crawler pushes the URLs (requests) it discovers into a single Redis queue, and every crawler pops its next request (URL) from that same queue.
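To make the idea concrete, here is a minimal sketch of such a shared queue using redis-py (illustrative only: the key name and host are assumptions, and scrapy-redis itself serializes whole Request objects rather than bare URLs).

    import redis

    # Every crawler process connects to the same Redis instance.
    r = redis.StrictRedis(host="localhost", port=6379)
    QUEUE_KEY = "crawler:requests"   # hypothetical shared queue key

    def push_url(url):
        # Any crawler that discovers a new URL pushes it onto the shared queue.
        r.lpush(QUEUE_KEY, url)

    def pop_url():
        # Any idle crawler pops its next URL from the same queue.
        raw = r.rpop(QUEUE_KEY)
        return raw.decode("utf-8") if raw else None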
Scrapy-redis has not been updated for a long time. How it can be made compatible with newer versions of Scrapy is something I discussed in another post (http://blog.csdn.net/u012150179/article/details/38087661); I may later rewrite scrapy-redis against the newer Scrapy interfaces.

II. Implementing distributed crawling

1. Analyzing the example that ships with scrapy-redis

How to run the example is explained in the library's README, but on first contact it raises many questions, for instance: where exactly does the distribution show up, and through which components is it achieved? Also, it is hard to find any trace of distribution in the running results; it feels like two spiders each crawling their own content.
For the first question, I have already explained settings.py while translating and annotating scrapy-redis. The second question is addressed by building my own two-crawler example in item 2 below.

2. A clearer verification of scrapy-redis's distributed design: idea and code

(1) Idea
Implement two crawlers. Crawler A crawls all links under the dmoz.com keyword "business" (set through start_urls); crawler B crawls all links under "game". While they run, observe the URLs of the crawled links: does each crawler stay within its own scope, or do the two sets intersect? Because the two spiders are given different crawl ranges, the observed behaviour reveals whether scheduling is actually shared.
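A sketch of the two spiders, for illustration only: the names, start_urls and callbacks here are assumptions, and the actual code is in the repository linked below.

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class BusinessSpider(CrawlSpider):
        name = "dmoz_business"
        # Crawler A: restricted to links under the "business" keyword.
        start_urls = ["http://www.dmoz.com/Business/"]
        rules = [Rule(LinkExtractor(), callback="parse_page", follow=True)]

        def parse_page(self, response):
            # Log every crawled URL so the ranges of A and B can be compared.
            self.logger.info("A crawled %s", response.url)


    class GameSpider(CrawlSpider):
        name = "dmoz_game"
        # Crawler B: restricted to links under the "game" keyword.
        start_urls = ["http://www.dmoz.com/Games/"]
        rules = [Rule(LinkExtractor(), callback="parse_page", follow=True)]

        def parse_page(self, response):
            self.logger.info("B crawled %s", response.url)

Both spiders become "distributed" only because settings.py points the scheduler at Redis, so they share one request queue.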
(2) Implementation
The code is in my GitHub repo (https://github.com/younghz/scrapy-redis/). For easier observation, DEPTH_LIMIT is set to 1.
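The scrapy-redis related part of settings.py looks roughly like this (a sketch; the option names follow the scrapy-redis version of that time and may differ in newer releases).

    # settings.py (excerpt)

    # Use the Redis-backed scheduler and duplicate filter so all spiders
    # share one request queue.
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

    # Default queue: a priority queue built on a Redis sorted set.
    SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"

    # Keep scheduling state in Redis between runs (clear it manually for a
    # fresh crawl; see section 3 below).
    SCHEDULER_PERSIST = True

    # Only follow links one level deep, to keep the experiment easy to observe.
    DEPTH_LIMIT = 1

    # Where the shared Redis instance lives.
    REDIS_HOST = "localhost"
    REDIS_PORT = 6379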
(3) Observation and analysis
Observation: both crawlers first fetch the links under a single keyword (which keyword comes first depends on the start_urls of whichever crawler was started first), and only then move on to the links under the other keyword.
Analysis: the fact that both crawlers fetch links under the same keyword at the same time shows that the two spiders are being scheduled together; this is the distributed part. The crawl is breadth-first by default. It proceeds in the following steps:

i) Crawler A (or B) is started first. The engine takes the links in spider A's start_urls and delivers them to the scheduler; the engine then asks the scheduler for URLs to crawl and passes them to the downloader; the downloaded responses are handed back to the spider; the spider extracts links according to its rules and hands the new requests to the scheduler via the engine again. (This cycle is described in the Scrapy architecture documentation.) The ordering of requests (URLs) inside the scheduler is implemented by a Redis queue: requests (URLs) are pushed into the queue, and popped out when the engine asks for the next request.
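Roughly, the Redis-backed scheduler plugs into that cycle as sketched below (a simplified view of the idea, not the actual scrapy_redis.scheduler code).

    class RedisSchedulerSketch:
        """Simplified view of a scheduler whose queue lives in Redis."""

        def __init__(self, queue):
            self.queue = queue                 # Redis-backed request queue

        def enqueue_request(self, request):
            # Called by the engine when the spider yields a new request:
            # the request is serialized and pushed into the Redis queue.
            self.queue.push(request)

        def next_request(self):
            # Called by the engine when the downloader has a free slot:
            # the next request is popped out of the Redis queue.
            return self.queue.pop()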


ii) When B is started, B's start_urls are likewise handed to the scheduler (note: it is the same scheduler A uses, backed by the same Redis queue). When B's engine asks for URLs to crawl, what the scheduler hands B may well be URLs that A has not yet finished downloading (by default the scheduler dispatches the earlier-returned URLs first, breadth-first). So A and B first work through A's outstanding links together, and once those are done they download B's requested links together.

iii) Question: how is the scheduling in step ii implemented?
By default scrapy-redis uses SpiderPriorityQueue, a priority queue (neither FIFO nor LIFO) implemented on top of a Redis sorted set.
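The idea behind SpiderPriorityQueue can be sketched with redis-py as follows (assuming redis-py 3.x; the real class serializes whole Requests and derives the key from the spider name).

    import redis

    r = redis.StrictRedis(host="localhost", port=6379)
    KEY = "dmoz:requests"   # hypothetical key; scrapy-redis derives it from the spider

    def push(url, priority=0):
        # Higher Scrapy priority should pop first, so store the negated value:
        # sorted sets return members with the lowest score first.
        r.zadd(KEY, {url: -priority})

    def pop():
        # Read and remove the best-ranked member in one atomic pipeline.
        pipe = r.pipeline()
        pipe.zrange(KEY, 0, 0)
        pipe.zremrangebyrank(KEY, 0, 0)
        results, _ = pipe.execute()
        return results[0].decode("utf-8") if results else None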
3. Details and points to note

Each time you re-crawl, the data left in Redis from the previous run should be cleared first; otherwise it will affect the crawl behaviour.
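One way to clear it, as a sketch (the key names are assumptions; scrapy-redis derives them from the spider name, and flushdb only makes sense if the Redis database is dedicated to the crawler):

    import redis

    r = redis.StrictRedis(host="localhost", port=6379)

    # Remove the leftover request queue and duplicate filter of a previous run.
    r.delete("dmoz_business:requests", "dmoz_business:dupefilter")

    # Or, if this Redis database is used only by the crawler, wipe it entirely:
    # r.flushdb()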
4. Other notes

The difference between a request and a URL: a request is built from a URL by the spider, through make_requests_from_url; this happens inside the spider. The spider returns (or yields) requests to the Scrapy engine, which then delivers them to the scheduler. URLs, by contrast, are either defined in the spider (start_urls) or extracted by the spider from responses.
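A simplified view of that URL-to-request step (make_requests_from_url was the spider helper in Scrapy of that era; later releases deprecate it in favour of start_requests, and the spider name and selectors below are illustrative):

    from scrapy import Request, Spider


    class UrlToRequestSpider(Spider):
        name = "url_to_request_example"   # hypothetical spider for illustration
        start_urls = ["http://www.dmoz.com/Business/"]

        def make_requests_from_url(self, url):
            # URL -> Request; the engine then delivers the Request to the scheduler.
            return Request(url, dont_filter=True)

        def parse(self, response):
            # Extract further URLs and yield them back as new Requests.
            for href in response.css("a::attr(href)").getall():
                yield self.make_requests_from_url(response.urljoin(href))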
Spider and Crawler:

A spider is not the same thing as a crawler: the crawler contains the spider. What the Scrapy architecture describes is the crawler as a whole; the spider's job is to provide start_urls, parse the downloaded responses to extract the desired content, keep extracting new URLs, and so on.


Original article: http://blog.csdn.net/u012150179/article/details/38091411
