Scrapy-redis for distributed crawling: analysis and implementation

The so-called scrapy-redis is simply scrapy + redis, with the redis operations performed through the redis-py client. What role redis plays here, and where scrapy-redis is heading, I have already covered in the translated readme.rst in my forked repository (link: ).
In two earlier related articles I analyzed how redis serves as the distributed hub of the crawl: all URLs (requests) obtained by the crawlers are put into one redis queue, and every crawler takes its requests (URLs) from that single shared queue.
Scrapy-redis has not been updated for a long time. How it can be made compatible with newer versions of scrapy is also discussed in my blog (link: http://blog.csdn.net/u012150179/article/details/38087661); later I may rewrite scrapy-redis against a newer scrapy interface.
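To make that idea concrete, here is a minimal sketch (not scrapy-redis's actual code) of what "one shared redis queue" means. The key name crawler:requests is made up for illustration, and plain URLs stand in for the serialized Request objects that scrapy-redis really stores:

```python
import redis

# One redis server acts as the shared "scheduling center" for every crawler process.
r = redis.Redis(host="localhost", port=6379)

QUEUE_KEY = "crawler:requests"  # hypothetical shared key; all crawlers use the same one


def push_url(url):
    """Any crawler that discovers a new link pushes it onto the shared queue."""
    r.lpush(QUEUE_KEY, url)


def pop_url():
    """Any crawler that is ready to download pops the next link, no matter who pushed it."""
    item = r.rpop(QUEUE_KEY)
    return item.decode() if item else None
```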

II. Implementation of distributed crawling

1. Analysis of the example bundled with scrapy-redis

The bundled example is described in the repository's readme. However, the spiders in the example raise a number of questions on first contact: first, what exactly is distributed here, and in which parts is it implemented? Second, it is hard to see anything distributed in the running results; it looks as though the two spiders are simply each crawling their own pages.
For the first question, I have already given an explanation in my translated and annotated settings.py of scrapy-redis. The second question is exactly what implementing our own example in section 2 below is meant to address.

2. Verifying scrapy-redis's distributed design, and its implementation in code, more clearly

(1) Idea: implement two crawlers. Crawler A crawls all links on dw.com under the keyword business (set through start_urls), and crawler B crawls all links under the keyword game. Then observe whether the URLs each crawler actually fetches stay within its own range or also fall into the other's. Because the two crawling ranges are defined to be different, a conclusion can be drawn from the observed crawling behavior.
(2) Implementation: the code is in the GitHub repo. For ease of observation, depth_limit is set to 1; a simplified sketch of the spiders and settings is shown below.
(3) Phenomenon and analysis. Phenomenon: the two crawlers first crawl, at the same time, the links under a single keyword (which keyword comes first depends on which crawler's start_urls was run first), and only afterwards crawl the links under the other keyword.
Analysis: the fact that both crawlers work on the same keyword at the same time shows they are being scheduled together, which is exactly distributed crawling. By default the crawl proceeds breadth-first, and the steps are as follows:
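For illustration only, here is what the two spiders and the relevant settings could look like. The spider names, the placeholder URLs, and the exact setting names are assumptions for this sketch (check the scrapy-redis version you use); the real code is the one in the repo. Note also that scrapy-redis normally keys its queues by spider name, so the two crawlers only share a queue if they use the same queue key (or the same spider name on different machines).

```python
# spiders.py -- illustrative sketch only; the real code lives in the GitHub repo.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BusinessSpider(CrawlSpider):
    """Crawler A: starts from the 'business' keyword page (placeholder URL)."""
    name = "business"
    start_urls = ["http://www.dw.com/business"]
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        yield {"spider": self.name, "url": response.url}


class GameSpider(CrawlSpider):
    """Crawler B: starts from the 'game' keyword page (placeholder URL)."""
    name = "game"
    start_urls = ["http://www.dw.com/game"]
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        yield {"spider": self.name, "url": response.url}
```

```python
# settings.py (excerpt) -- the scrapy-redis scheduler replaces scrapy's default one
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True   # keep the redis queues between runs (clear manually, see section 3)
REDIS_HOST = "localhost"
REDIS_PORT = 6379
DEPTH_LIMIT = 1            # limit crawl depth so the result is easy to observe
```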

i) First run crawler A (B behaves the same). The engine takes the links in spider A's start_urls and hands them to the scheduler; the engine then asks the scheduler for URLs to crawl and passes them to the downloader; the downloaded responses are returned to the spider, which extracts new links according to its rules and passes them back through the engine to the scheduler. (This whole cycle can be followed in the scrapy architecture diagram.) In the scheduler, the request (URL) queue is implemented with a redis queue: new requests are pushed onto the queue, and the next request to download is popped off it.
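As a rough illustration of that push/pop inside the scheduler, here is a stripped-down, redis-backed scheduler following scrapy's scheduler interface. It stores bare URLs in a single redis list; the real scrapy-redis scheduler serializes the whole request and also runs a redis-based duplicate filter, so treat this as a sketch of the idea, not scrapy-redis's actual code:

```python
import redis
from scrapy import Request


class RedisListScheduler(object):
    """Illustrative scheduler: the request queue is a redis list shared by all crawlers."""

    def __init__(self, server, key="crawler:requests"):
        self.server = server  # redis.Redis connection
        self.key = key        # shared queue key (hypothetical name)

    @classmethod
    def from_crawler(cls, crawler):
        host = crawler.settings.get("REDIS_HOST", "localhost")
        return cls(redis.Redis(host=host))

    def open(self, spider):
        self.spider = spider

    def close(self, reason):
        pass

    def has_pending_requests(self):
        return self.server.llen(self.key) > 0

    def enqueue_request(self, request):
        # "push": every link the spider extracts ends up in the shared redis list
        self.server.lpush(self.key, request.url)
        return True

    def next_request(self):
        # "pop": whichever crawler asks next gets the oldest queued URL (FIFO)
        url = self.server.rpop(self.key)
        return Request(url.decode()) if url else None
```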


ii) Then start B. B's start_urls are likewise handed to the scheduler (note: it is the same scheduler as A's, i.e. the same redis queue). When B's engine asks for URLs to crawl, the scheduler still hands B the URLs that A has queued but not yet downloaded (the default policy schedules the earlier-returned URLs first, breadth-first), so at this stage A and B are both downloading A's outstanding links. Only once those are finished do B's own requests get downloaded.

iii) Question: how is the scheduling policy described in ii) actually implemented?
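As a rough answer: scrapy-redis implements the scheduler's queue on top of redis structures, and a FIFO queue on a redis list reproduces exactly the order observed above; whichever crawler asks the scheduler next receives the oldest pending request, so B first downloads the links A had already queued. A tiny demonstration (key name and URLs are made up):

```python
import redis

r = redis.Redis()
key = "demo:requests"
r.delete(key)

# Crawler A queued its "business" links before B started.
r.lpush(key, "http://www.dw.com/business/1", "http://www.dw.com/business/2")
# Crawler B's "game" link arrives later.
r.lpush(key, "http://www.dw.com/game/1")

# FIFO pop: no matter which crawler asks, A's earlier links come out first.
print(r.rpop(key))  # b'http://www.dw.com/business/1'
print(r.rpop(key))  # b'http://www.dw.com/business/2'
print(r.rpop(key))  # b'http://www.dw.com/game/1'
```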

3. Detailed analysis and points of attention

Each time you re-crawl, you should first clear the data stored in redis; otherwise the crawl will be affected.
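For example, with the usual scrapy-redis key layout (keys are normally named after the spider, e.g. <spider>:requests and <spider>:dupefilter; adjust to whatever keys your configuration actually creates), the leftover state can be removed like this:

```python
import redis

r = redis.Redis()

# Delete the scheduler queue and the duplicate filter left over from the last run.
# The spider names "business" and "game" below match the sketch in section 2.
for key in ("business:requests", "business:dupefilter",
            "game:requests", "game:dupefilter"):
    r.delete(key)

# Or, if the redis database is dedicated to the crawler, simply flush everything:
# r.flushdb()
```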

4. Others

In scrapy, a request is essentially a URL, wrapped together with the metadata needed to fetch and process it.

A spider is not the same thing as a crawler: the crawler contains the spider, and the whole scrapy architecture is the crawler. The spider supplies start_urls, analyzes the downloaded responses to obtain the desired content, and extracts further URLs to continue the crawl.
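To illustrate both notes, a minimal, generic spider sketch (placeholder name and URL): each extracted link is wrapped in a Request, and the engine hands those Requests back to the (redis-backed) scheduler.

```python
import scrapy


class MiniSpider(scrapy.Spider):
    """The spider only supplies start URLs and turns responses into items + new requests."""
    name = "mini"
    start_urls = ["http://www.dw.com/business"]  # placeholder start URL

    def parse(self, response):
        # Extract the content we care about from the downloaded response.
        yield {"url": response.url, "title": response.css("title::text").get()}

        # Extract further URLs and wrap each one in a Request; the engine passes
        # these to the scheduler (with scrapy-redis, into the shared redis queue).
        for href in response.css("a::attr(href)").getall():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)
```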
