Three Redis-based distributed crawler strategies

Preface:

Crawling is mostly an IO-bound task, so a distributed crawler is much simpler to implement than distributed computing or distributed storage.
The most important points a distributed crawler needs to consider are:

1. unified scheduling of crawl tasks;
2. unified deduplication of crawl tasks;
3. storage;
4. speed;
5. an implementation that is as simple and convenient as possible while remaining sufficiently "robust";
6. ideally, support for "breakpoint crawling" (resuming from where an interrupted crawl left off).

Python distributed crawlers commonly use the Scrapy framework plus the Redis in-memory database, with the task scheduling in between handled by something like the scrapy-redis module.
Here is a brief introduction to three Redis-based distributed strategies. They are all quite similar, differing only in adjustments made to suit particular network or crawling environments (if you spot any mistakes, corrections are welcome).


"Strategy One"


The Slaver side takes tasks (Request/URL/ID) from the Master, crawls the data, and at the same time generates new tasks, which it submits back to the Master. The Master side holds just one Redis database and is responsible for deduplicating the tasks submitted by the Slavers and adding them to the queue of tasks to be crawled.

Advantages: scrapy-redis uses this strategy by default, and it is very simple to implement, because scrapy-redis already takes care of task scheduling and the rest of the plumbing; we only need to inherit from RedisSpider and specify a redis_key (see the sketch after this strategy's pros and cons).
Disadvantages: the tasks scrapy-redis schedules are Request objects, which carry a fairly large amount of information (not only the URL but also the callback function, headers, and so on). This slows the crawler down and eats up a lot of Redis storage space. Of course, we can override the relevant methods so that only a URL or a user ID is scheduled instead.
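
As a concrete illustration, here is a minimal sketch of Strategy One with scrapy-redis. The spider name, Redis key, and Redis address are hypothetical placeholders chosen for the example, not values prescribed by the library:

    # settings.py -- route scheduling and deduplication through Redis:
    #   SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    #   DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    #   SCHEDULER_PERSIST = True   # keep the queue across restarts ("breakpoint crawling")
    #   REDIS_URL = "redis://localhost:6379"

    from scrapy_redis.spiders import RedisSpider

    class MySpider(RedisSpider):
        name = "myspider"
        # The Redis list that seed tasks get pushed into.
        redis_key = "myspider:start_urls"

        def parse(self, response):
            # Yield scraped data as usual...
            yield {"url": response.url, "title": response.css("title::text").get()}
            # ...and yield new Requests: scrapy-redis serializes them back
            # into Redis, so any Slaver can pick them up next.
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)

Seeding the queue is then just a Redis push, e.g. LPUSH myspider:start_urls http://example.com/ from redis-cli.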


"Strategy two"


This is a refinement of Strategy One: a program runs on the Master side to generate the tasks (Request/URL/ID). The Master is responsible only for producing tasks, deduplicating them, and adding them to the queue to be crawled; the Slavers just take tasks from the Master and crawl.
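
A minimal sketch of such a Master-side producer, assuming a Redis list myspider:tasks as the shared crawl queue and a Redis set myspider:seen for deduplication (both key names are made up for the example):

    import redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    def produce(task: str) -> bool:
        """Enqueue a task (URL or user ID) unless it has been seen before."""
        # SADD returns 1 only for a genuinely new member, so it serves as
        # an atomic "check and mark" deduplication step.
        if r.sadd("myspider:seen", task):
            r.lpush("myspider:tasks", task)
            return True
        return False

    # Example: seed the queue with user IDs; the duplicate is dropped.
    for uid in ["1001", "1002", "1001"]:
        produce(uid)

A Slaver then simply pops from the other end of the queue, e.g. r.brpop("myspider:tasks").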

Advantages: task generation is separated from data fetching, the division of labor is clear, and the amount of data exchanged between Master and Slaver drops. Building tasks on the Master side also makes it easy to customize the deduplication policy (optimizing the performance and speed of deduplication becomes important when the volume of data is large; a sketch of one such optimization follows below).
Disadvantages: on a site like QQ or Sina Weibo, a single request may return dozens of user IDs to crawl, that is, dozens of new crawler tasks, so the split works well. On some sites, however, a response yields only one or two new tasks while also containing the target information the crawler is after; separating task generation from crawling there means fetching essentially the same pages twice, which lowers crawl efficiency. After all, bandwidth is a crawler's bottleneck, and we should send as few requests as possible, both for speed and to reduce the load on the web server and be an ethical crawler. So it depends on the situation.
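
One hedged sketch of a deduplication optimization for large crawls: store a fixed-size fingerprint of each task instead of the full URL, which bounds the per-entry memory in the Redis set (the key name is again hypothetical):

    import hashlib
    import redis

    r = redis.Redis()

    def is_new(url: str) -> bool:
        """Return True the first time a URL is seen, False afterwards."""
        # A 16-byte MD5 digest is stored instead of the arbitrarily long
        # URL, so memory per entry stays constant.
        fp = hashlib.md5(url.encode("utf-8")).digest()
        return r.sadd("myspider:fingerprints", fp) == 1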


"Strategy three"


The Master holds only one Redis set, which serves purely as a lookup. When a Slaver encounters a new task, it asks the Master whether that task has already been crawled; if not, the Slaver adds it to its own local crawl queue, and the Master records the task as crawled. This resembles Strategy One, but it is clearly simpler. Strategy One is simple only because scrapy-redis implements the scheduler middleware for us, and that does not help crawlers built outside the Scrapy framework.
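
A minimal sketch of the Slaver side of this strategy. The shared set name and Master address are hypothetical; note that a single SADD both performs the "has it been crawled?" query and marks the task as crawled:

    from collections import deque
    import redis

    # The Master is nothing more than this one Redis instance.
    master = redis.Redis(host="master-host", port=6379)
    local_queue = deque()  # each Slaver keeps its own crawl queue

    def offer(task: str) -> None:
        """Queue a newly discovered task locally unless the Master has seen it."""
        # SADD returns 1 if the task was new (and marks it as crawled),
        # 0 if some Slaver already reported it.
        if master.sadd("myspider:crawled", task):
            local_queue.append(task)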

Advantages: simple to implement, and it also works for crawlers not built on the Scrapy framework. The load on the Master is fairly small, and the data exchanged between Master and Slaver is minimal as well.
Disadvantages: not "robust" enough; you have to save the pending-crawl queue yourself at regular intervals to get the "breakpoint crawling" feature (a sketch of this follows below). Also, each Slaver's pending tasks live only in that Slaver, so they cannot be shared or rebalanced.


Conclusion:

If we liken the Slaver to a worker and the Master to a foreman: in Strategy One, a worker reports to the foreman whenever he comes across a new task, and goes to the foreman to pick up a job whenever he needs work; in Strategy Two, the foreman finds the new jobs himself, and the workers just take tasks from him; in Strategy Three, a worker who comes across a new task asks the foreman whether it has already been done, and if not, adds it to his own "to-do list".



Please credit the source when reprinting, thank you. (Original link: http://blog.csdn.net/bone_ace/article/details/50989104)
