Hadoop Scheduler Summary

With the popularity of MapReduce, its open-source implementation, Hadoop, has become increasingly widely used. In a Hadoop system, the scheduler is the component that allocates idle resources in the system to jobs according to some policy. In Hadoop the scheduler is a pluggable module, so users can design their own scheduler to fit their actual application requirements. Three types of schedulers are common in Hadoop:

(Note: the treatment of Hadoop schedulers in this article is not very systematic. For a more systematic treatment, see Chapter 10, "Hadoop Multi-User Job Scheduler Analysis", of my book "Hadoop Technology Insider: An In-Depth Analysis of MapReduce Architecture Design and Implementation Principles" (see the purchase instructions). It compares the configuration methods, implementation mechanisms, and advantages and disadvantages of the three currently popular schedulers, FIFO, Capacity, and Fair, and also introduces several other types of schedulers.)

(1) Default Scheduler: FIFO

The default scheduler in Hadoop. It chooses the next job to execute first by job priority and then by arrival time.
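As a rough illustration (not Hadoop's actual comparator code; the job fields here are hypothetical), the FIFO policy can be sketched in a few lines of Python:

```python
def fifo_order(jobs):
    """Order jobs the way the FIFO scheduler picks them:
    higher priority first, earlier arrival time breaks ties."""
    return sorted(jobs, key=lambda j: (-j["priority"], j["arrival"]))

jobs = [
    {"name": "a", "priority": 1, "arrival": 10},
    {"name": "b", "priority": 2, "arrival": 20},
    {"name": "c", "priority": 2, "arrival": 15},
]
# "c" and "b" share the highest priority; "c" arrived first, so it runs first.
print([j["name"] for j in fifo_order(jobs)])
```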

(2) Capacity Scheduler

The Capacity Scheduler supports multiple queues. Each queue can be configured with a certain amount of resources and uses FIFO scheduling internally; to prevent one user's jobs from monopolizing a queue's resources, the scheduler limits the amount of resources used by jobs submitted by the same user. When scheduling, it first selects a suitable queue using the following policy: compute, for each queue, the ratio of the number of running tasks to the computing resources allocated to that queue, and pick the queue with the lowest ratio. It then selects a job within that queue by job priority and submission time, while also taking the user's resource limits and memory limits into account.
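The queue-selection step can be sketched as follows. This is a simplified model, not the actual CapacityScheduler code, and the field names are made up:

```python
def pick_queue(queues):
    """Select the queue with the lowest ratio of running tasks
    to the capacity (slots) allocated to it."""
    return min(queues, key=lambda q: q["running"] / q["capacity"])

queues = [
    {"name": "prod", "running": 8, "capacity": 10},  # ratio 0.8
    {"name": "dev",  "running": 2, "capacity": 10},  # ratio 0.2
]
# "dev" is the most underserved queue, so it gets the next job.
print(pick_queue(queues)["name"])
```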

(3) Fair Scheduler

Similar to the Capacity Scheduler, it supports multiple queues and multiple users, and the amount of resources in each queue is configurable; jobs in the same queue fairly share all the resources in the queue. For the specific algorithm, see my blog post "Hadoop Fair Scheduler Algorithm Analysis".
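A minimal sketch of fair sharing within one queue, using a max-min-style allocation (the real Fair Scheduler also handles weights, minimum shares, and preemption; this toy version assumes divisible slots):

```python
def fair_shares(total_slots, demands):
    """Max-min fair split of a queue's slots among its jobs:
    jobs that want less than an equal share keep their demand,
    and the surplus is redistributed among the rest."""
    shares = {}
    active = dict(demands)          # job -> outstanding demand
    capacity = float(total_slots)
    while active:
        equal = capacity / len(active)
        satisfied = {j: d for j, d in active.items() if d <= equal}
        if not satisfied:           # everyone wants more than an equal share
            for j in active:
                shares[j] = equal
            break
        for j, d in satisfied.items():
            shares[j] = d
            capacity -= d
            del active[j]
    return shares

# Job "a" only needs 2 slots; "b" and "c" split the remaining 8 evenly.
print(fair_shares(10, {"a": 2, "b": 8, "c": 8}))
```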

In fact, Hadoop has far more than these three schedulers; recently, many Hadoop schedulers aimed at new kinds of applications have appeared.

(4) LATE: A Scheduler for Heterogeneous Clusters

The existing Hadoop schedulers are built on the assumption that the cluster is homogeneous. Specifically, they assume:

1) every node in the cluster has exactly the same performance;

2) a reduce task's three phases, copy, sort, and reduce, each take 1/3 of its time;

3) tasks of the same type in the same job complete in batches and take roughly the same time.

The existing Hadoop scheduler has a significant defect, mainly in its algorithm for detecting straggling tasks: if a task's progress lags more than 20% behind the average progress of its peer tasks, it is treated as a straggler (stragglers determine the job's completion time, so shortening their execution time matters), and a backup (speculative) task is started for it. In a heterogeneous cluster, however, execution times of the same kind of task can differ greatly from node to node, so this rule easily produces a large number of backup tasks.
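The 20% rule described above can be sketched as follows (a hypothetical helper; progress values are in [0, 1]):

```python
def stragglers(progress, gap=0.2):
    """Return tasks whose progress lags the average of their
    peers by more than `gap` (the default rule uses 20%)."""
    avg = sum(progress.values()) / len(progress)
    return [t for t, p in progress.items() if avg - p > gap]

# t3 is 50 points behind the average progress of 0.6, so it is flagged.
print(stragglers({"t1": 0.9, "t2": 0.8, "t3": 0.1}))
```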

LATE (Longest Approximate Time to End, reference [4]) solves the problems of the existing scheduler to some extent. It defines three thresholds: SpeculativeCap, the maximum number of speculative tasks running in the system at once (the authors recommend 10% of the total number of slots); SlowNodeThreshold (recommended value 25%), such that no speculative task is launched on a node whose score (see the paper for how scores are computed) falls below this threshold, i.e. on a slow node; and SlowTaskThreshold (recommended value 25%), such that a speculative task is launched for a task only when its progress rate falls below this threshold relative to its peers in the same batch. The scheduling strategy is: when a node becomes idle and the total number of backup tasks in the system is less than SpeculativeCap, (1) if the node is a slow node (its score is below SlowNodeThreshold), ignore the request; (2) otherwise, rank the currently running tasks by estimated remaining completion time; (3) launch a backup of the task with the longest estimated remaining time whose progress rate is below SlowTaskThreshold.
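The core of LATE, estimating each task's remaining time from its progress rate and speculating the slow task with the longest time left, can be sketched as below. This is a simplified model of the paper's heuristic, with the SpeculativeCap and slow-node checks noted only as comments:

```python
def late_candidate(tasks, now, slow_task_threshold):
    """Pick the running task to speculate: among tasks whose progress
    rate is below slow_task_threshold, take the one with the longest
    estimated time left = (1 - progress) / progress_rate."""
    # Full LATE first checks that total speculative tasks < SpeculativeCap
    # and that the requesting node is not a slow node (SlowNodeThreshold).
    candidates = []
    for t in tasks:
        rate = t["progress"] / (now - t["start"])
        time_left = (1.0 - t["progress"]) / rate
        if rate < slow_task_threshold:
            candidates.append((time_left, t["name"]))
    return max(candidates)[1] if candidates else None

tasks = [
    {"name": "a", "start": 0,  "progress": 0.5},   # rate 0.005, ~100s left
    {"name": "b", "start": 50, "progress": 0.5},   # rate 0.010, ~50s left
]
# Only "a" is below the rate threshold, so it gets the backup task.
print(late_candidate(tasks, now=100, slow_task_threshold=0.008))
```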

(5) Schedulers for Real-Time Jobs: Deadline Scheduler and Constraint-Based Scheduler

These schedulers are mainly used for jobs with time constraints (deadline jobs): each job is given a deadline and must complete within that time. In fact, schedulers of this type fall into two categories: soft real-time (the job is allowed to overrun somewhat) and hard real-time (the job must finish strictly on time).

The Deadline Scheduler (reference [5]) mainly targets soft real-time jobs. It dynamically adjusts the amount of resources a job receives according to the job's running progress and remaining time, so that the job completes by its deadline as far as possible.
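The idea of deriving a resource allocation from progress and remaining time can be illustrated with a toy estimate. This is a simplification inspired by the approach, not the paper's actual model:

```python
import math

def slots_needed(pending_tasks, avg_task_time, time_to_deadline):
    """Minimum number of parallel slots so that the remaining tasks,
    run in waves of avg_task_time each, finish before the deadline."""
    waves = time_to_deadline / avg_task_time   # how many waves still fit
    return math.ceil(pending_tasks / waves)

# 20 pending tasks of ~10s each with 50s left: 5 waves fit, so 4 slots suffice.
print(slots_needed(20, 10, 50))
```

As the deadline approaches, `time_to_deadline` shrinks, fewer waves fit, and the job's slot allocation grows, which is the dynamic adjustment described above.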

The Constraint-Based Scheduler (reference [6]) mainly targets hard real-time jobs. Based on the jobs' deadlines and the real-time jobs currently running in the system, it predicts whether a newly submitted real-time job can complete within its deadline; if not, it reports this back to the user, who can then adjust the job's deadline.
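A hard real-time scheduler's admission check can be sketched like this (a hypothetical simplification assuming a fixed pool of free slots and a uniform average task time):

```python
import math

def admit(job, free_slots, now):
    """Admission control: accept the job only if its estimated
    finish time does not exceed its deadline."""
    waves = math.ceil(job["tasks"] / free_slots)
    finish = now + waves * job["avg_task_time"]
    return finish <= job["deadline"]

job = {"tasks": 10, "avg_task_time": 5, "deadline": 30}
# 10 tasks on 5 slots = 2 waves of 5s each: done at t=10, within the deadline.
print(admit(job, free_slots=5, now=0))
```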

————————————————————————————————————————–

References:

[1] Capacity Scheduler introduction: http://hadoop.apache.org/common/docs/r0.19.2/capacity_scheduler.html

Download: http://hadoop.apache.org/common/docs/r0.20.0/capacity_scheduler.pdf

[2] Fair Scheduler introduction: http://hadoop.apache.org/common/docs/r0.20.2/fair_scheduler.html

Download: http://svn.apache.org/repos/asf/hadoop/mapreduce/trunk/src/contrib/fairscheduler/designdoc/fair_scheduler_design_doc.pdf

[3] Fair Scheduler paper: M. Zaharia, D. Borthakur, J. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Job Scheduling for Multi-User MapReduce Clusters," EECS Department, University of California, Berkeley, Tech. Rep., Apr. 2009.

[4] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica, "Improving MapReduce Performance in Heterogeneous Environments."

[5] Deadline Scheduler paper: J. Polo, D. Carrera, Y. Becerra, J. Torres, E. Ayguade, M. Steinder, and I. Whalley, "Performance-Driven Task Co-Scheduling for MapReduce Environments," in Network Operations and Management Symposium (NOMS), IEEE, pp. 373–380.

[6] Constraint-Based Scheduler paper: K. Kc and K. Anyanwu, "Scheduling Hadoop Jobs to Meet Deadlines," in 2nd IEEE International Conference on Cloud Computing Technology and Science (CloudCom), pp. 388–392.

This is an original article; when reproducing it, please credit: reproduced from Dong's Blog.

Link to this article: http://dongxicheng.org/mapreduce/hadoop-schedulers/
