Three Redis-based distributed crawler strategies


Preface:

Crawling is largely an IO-bound task, and implementing a distributed crawler is much simpler than implementing distributed computing or distributed storage.
The main points a distributed crawler should consider are the following:

    • Unified scheduling of crawler tasks
    • Unified de-duplication of crawler tasks
    • Storage
    • Speed
    • The more "robust" it is, the better
    • Ideally, support for "resume crawling" (continuing from a breakpoint)

For distributed crawling in Python, the most common setup is the Scrapy framework plus a Redis in-memory database, with the task scheduling in between handled by a module such as scrapy-redis.
Below is a brief introduction to three Redis-based distributed strategies. They are in fact very similar to one another; each simply makes some adjustments to suit a different network or crawling environment (if you spot a mistake, corrections are welcome).
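
As a minimal sketch, wiring a Scrapy project into Redis with scrapy-redis usually amounts to a few lines in settings.py (the REDIS_URL value below is a placeholder for your own Redis server):

    # settings.py -- route scheduling and de-duplication through Redis
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # schedule requests via Redis
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # shared request fingerprint filter
    SCHEDULER_PERSIST = True    # keep the queue on shutdown, enabling "resume crawling"
    REDIS_URL = "redis://localhost:6379"  # placeholder; point at the master's Redis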


"Strategy One"


While extracting data, the slave side takes tasks (Request/URL/ID) from the master, generates new tasks as it crawls, and submits them back to the master. The master side holds just one Redis database and is responsible only for de-duplicating the tasks submitted by the slaves and adding them to the to-crawl queue.

Advantages: this is the default strategy of scrapy-redis, and implementing it is very simple, because scrapy-redis has already done the task scheduling and related work for us; we only need to inherit from RedisSpider and specify a redis_key.
Disadvantages: the tasks scrapy-redis schedules are Request objects, which carry a relatively large amount of information (not only the URL, but also the callback function, headers, etc.). As a result, the crawler slows down and a lot of Redis storage space is consumed. Of course, we can override the relevant methods to schedule plain URLs or user IDs instead.
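
For illustration, a minimal Strategy One spider might look like the sketch below; the spider name and redis_key are made-up examples, not part of the original post:

    from scrapy_redis.spiders import RedisSpider

    class MySpider(RedisSpider):
        name = "myspider"                  # hypothetical spider name
        redis_key = "myspider:start_urls"  # Redis list the spider blocks on for seed URLs

        def parse(self, response):
            # Any Request yielded here is serialized and pushed back into
            # Redis by the scrapy-redis scheduler, so all slaves share it.
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)

Seeds can then be pushed from any machine, e.g. redis-cli lpush myspider:start_urls http://example.com.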


"Strategy Two"


This is an optimized improvement on Strategy One: a program runs on the master side to produce the tasks (Request/URL/ID). The master is responsible for producing tasks, de-duplicating them, and adding them to the to-crawl queue. The slaves only take tasks from the master and crawl.

Advantages: task generation and data extraction are separated, which reduces the data exchange between master and slaves; another benefit of building tasks on the master side is that it is easy to rewrite the de-duplication strategy (when the data volume is large, optimizing de-duplication for performance and speed is important).
Disadvantages: on sites such as QQ or Sina Weibo, a single request may return content containing dozens of user IDs to crawl, that is, dozens of new crawler tasks. But on some sites a single request yields only one or two new tasks, and the returned content also contains the target information the crawler is after; separating task generation from task crawling there would reduce the crawler's efficiency. After all, bandwidth is also a bottleneck for crawlers: we want to send as few requests as possible, which also eases the load on the web server and makes for an ethical crawler. So it depends on the situation.
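
A rough sketch of Strategy Two with the redis-py client is shown below; the key names tasks:seen and tasks:queue are assumptions for illustration:

    import redis

    r = redis.Redis(host="localhost", port=6379, db=0)  # placeholder host

    def master_submit(url):
        """Master side: de-duplicate first, then enqueue.
        SADD returns 1 only the first time a member is added,
        so the set doubles as the de-duplication filter."""
        if r.sadd("tasks:seen", url):
            r.rpush("tasks:queue", url)

    def slave_take(timeout=5):
        """Slave side: block until a task is available, then return it."""
        item = r.blpop("tasks:queue", timeout=timeout)
        return item[1].decode() if item else None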


"Strategy Three"


There is only one set in the master, and it serves purely as a lookup. When a slave encounters a new task, it asks the master whether the task has already been crawled; if not, the slave adds it to its own to-crawl queue and the master records the task as crawled. This strategy resembles Strategy One but is clearly simpler. Strategy One is only simple because scrapy-redis implements the scheduler middleware for us, and that does not apply to crawlers built outside the Scrapy framework.

Advantages: simple, and applicable to non-Scrapy crawlers as well. The load on the master is small, and the data exchanged between master and slaves is also small.
Disadvantages: not "robust" enough; to support the "resume from breakpoint" feature you have to save each to-crawl queue periodically yourself. Also, each slave's to-crawl tasks are private to it and cannot be shared with the other slaves.
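
A minimal sketch of Strategy Three, again with redis-py; the key name seen:urls and the host are assumptions. The trick is that SADD atomically answers "has this been seen?" and marks it as seen in one step:

    from collections import deque
    import redis

    master = redis.Redis(host="master-host", port=6379, db=0)  # placeholder host
    local_queue = deque()  # each slave keeps its own private to-crawl queue

    def on_new_task(url):
        # SADD returns 1 if the URL was unseen (it is now recorded as
        # crawled on the master), 0 if another slave already claimed it.
        if master.sadd("seen:urls", url):
            local_queue.append(url)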


Conclusion:

If the slaves are likened to workers, then the master is the foreman. In Strategy One, a worker reports to the foreman whenever it finds a new task, and goes to the foreman to take a task whenever it needs work. In Strategy Two, the foreman finds the new tasks, and the workers only take tasks from the foreman to work on. In Strategy Three, when a worker encounters a new task it asks the foreman whether that task has already been done; if not, the worker adds it to its own "itinerary".



Please indicate the source when reprinting, thank you! (Original link: http://blog.csdn.net/bone_ace/article/details/50989104)
