The purpose of the Hadoop scheduler is to assign the system's idle resources to jobs according to a given policy. The scheduler is a pluggable module, so users can design their own scheduler to fit their actual application requirements. Three types of schedulers are common in Hadoop:
1. FIFO Scheduler (queue-based, first-in, first-out)
The default resource scheduler for Hadoop. Advantage: simple and clear. Disadvantage: it ignores the differing requirements of different jobs.
2. Capacity Scheduler
Supports multiple queues, each of which can be configured with a certain amount of resources and uses a FIFO scheduling policy internally. To prevent one user's jobs from monopolizing a queue's resources, the scheduler limits the amount of resources that jobs submitted by the same user may occupy. It chooses the least-occupied queue and executes high-priority jobs first.
When scheduling, it first selects a suitable queue by the following policy: compute, for each queue, the ratio of the number of running tasks to the compute resources the queue should be assigned, and select the queue with the lowest ratio. It then selects a job within that queue by the following policy: order jobs by priority and submission time, while also respecting per-user resource limits and memory limits.
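To make the queue-selection step concrete, here is a minimal Java sketch. It is not the actual YARN implementation; the `QueueInfo` class and its fields are assumptions made for this example. It simply picks the queue with the lowest ratio of running tasks to assigned capacity:

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical model of a queue for this sketch (not a YARN type).
class QueueInfo {
    final String name;
    final int runningTasks;  // tasks currently running in this queue
    final double capacity;   // compute resources assigned to this queue

    QueueInfo(String name, int runningTasks, double capacity) {
        this.name = name;
        this.runningTasks = runningTasks;
        this.capacity = capacity;
    }

    // Utilization ratio: lower means the queue is further below its share.
    double utilization() {
        return runningTasks / capacity;
    }
}

class CapacitySelection {
    // Select the least-utilized queue, mirroring the policy described above.
    static QueueInfo selectQueue(List<QueueInfo> queues) {
        return queues.stream()
                .min(Comparator.comparingDouble(QueueInfo::utilization))
                .orElseThrow(() -> new IllegalStateException("no queues"));
    }

    public static void main(String[] args) {
        List<QueueInfo> queues = List.of(
                new QueueInfo("prod", 80, 100.0),  // ratio 0.80
                new QueueInfo("dev", 10, 40.0));   // ratio 0.25 -> chosen
        System.out.println(selectQueue(queues).name); // prints "dev"
    }
}
```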
3. Fair Scheduler
Fair scheduling is a method of assigning resources to jobs so that, averaged over time, all jobs receive an equal share of the resources.
When a single job is running, it can use the entire cluster. When other jobs are submitted, the system assigns idle resources (containers) to these new jobs, so that each job gets roughly the same amount of CPU time.
Unlike the Hadoop default scheduler, which maintains a single job queue, this lets small jobs finish in a reasonable amount of time without being starved by large, long-running jobs. It is also a simple way to share a cluster among multiple users. Fair scheduling can be combined with job priorities: the priority is used as a weight that determines the fraction of the total compute time each job receives. Like the Capacity Scheduler, it supports multiple queues and multiple users, the amount of resources in each queue is configurable, and jobs in the same queue share all of the queue's resources fairly.
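As an illustration of how priority can act as a weight under fair scheduling, the following Java sketch divides a pool of containers among jobs in proportion to their weights. The class and method names are made up for this example; this is not a Hadoop API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch (not Hadoop source code): under fair scheduling, a
// job's priority acts as a weight that sets its proportion of the cluster.
class WeightedFairShare {
    // Divide `totalContainers` among jobs in proportion to their weights.
    static Map<String, Double> shares(Map<String, Double> jobWeights,
                                      double totalContainers) {
        double weightSum = jobWeights.values().stream()
                .mapToDouble(Double::doubleValue).sum();
        Map<String, Double> result = new LinkedHashMap<>();
        jobWeights.forEach((job, weight) ->
                result.put(job, totalContainers * weight / weightSum));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("highPriorityJob", 2.0); // twice the weight...
        weights.put("normalJob", 1.0);
        // ...so twice the share: {highPriorityJob=60.0, normalJob=30.0}
        System.out.println(shares(weights, 90));
    }
}
```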
Summary:
As Hadoop has evolved, the Fair Scheduler and the Capacity Scheduler have both grown more complete, gaining hierarchical queue organization, resource preemption, batch scheduling, and so on. As a result, the two schedulers have converged more and more: today they are very similar in both design and supported features, and since the Fair Scheduler supports multiple scheduling policies, it now effectively has all the features of the Capacity Scheduler.
Comparing the two schedulers in Hadoop 2.0 (YARN) helps readers better understand what the Capacity Scheduler and the Fair Scheduler have in common and where they differ. A key point of comparison is the intra-queue scheduling policies they support. FIFO, FAIR, and DRF refer to first-come-first-served scheduling, fair scheduling, and dominant-resource fairness (Dominant Resource Fairness) scheduling, respectively. Their specific meanings are as follows (a combined sketch of the three policies appears after this list):
- FIFO: schedule by priority first; if priorities are equal, schedule by submission time; if submission times are also equal, schedule by (queue or application) name (string comparison).
- FAIR: schedule by the memory-resource usage ratio, i.e., by used_memory/minshare (the core idea is to determine the scheduling order by this ratio, while also taking some boundary conditions into account).
- DRF: based on a design strategy from Mesos, scheduling is performed fairly with respect to each application's dominant resource; this mechanism was introduced in the discussion of the Apache Mesos scheduler.
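The following Java sketch condenses the three policies into comparators. It is illustrative only: the `App` class and its fields are assumptions for this example, not actual YARN types:

```java
import java.util.Comparator;

// Hypothetical application record for this sketch (not a YARN type).
class App {
    String name;
    int priority;         // higher value = higher priority (assumed convention)
    long submitTime;
    double usedMemory;    // memory currently in use
    double minShare;      // configured minimum memory share
    double dominantShare; // max over resource types of (used / cluster total)

    App(String name, int priority, long submitTime,
        double usedMemory, double minShare, double dominantShare) {
        this.name = name; this.priority = priority; this.submitTime = submitTime;
        this.usedMemory = usedMemory; this.minShare = minShare;
        this.dominantShare = dominantShare;
    }
}

class Policies {
    // FIFO: priority first, then submission time, then name (string compare).
    static final Comparator<App> FIFO =
            Comparator.comparingInt((App a) -> -a.priority)
                      .thenComparingLong(a -> a.submitTime)
                      .thenComparing(a -> a.name);

    // FAIR: the app using the smallest fraction of its min share goes first.
    static final Comparator<App> FAIR =
            Comparator.comparingDouble(a -> a.usedMemory / a.minShare);

    // DRF: the app with the smallest dominant resource share goes first.
    static final Comparator<App> DRF =
            Comparator.comparingDouble(a -> a.dominantShare);

    public static void main(String[] args) {
        App a = new App("app-a", 1, 100L, 512, 1024, 0.10);
        App b = new App("app-b", 2, 200L, 256, 1024, 0.40);
        System.out.println(FIFO.compare(a, b) > 0); // true: b has higher priority
        System.out.println(FAIR.compare(b, a) < 0); // true: b uses less of its min share
        System.out.println(DRF.compare(a, b) < 0);  // true: a's dominant share is smaller
    }
}
```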
In fact, Hadoop has more than these three schedulers; recently, many Hadoop schedulers have appeared for new application scenarios.
4. LATE Scheduler for heterogeneous clusters
The existing Hadoop schedulers are built on the assumption of a homogeneous cluster; LATE (Longest Approximate Time to End) is designed with cluster heterogeneity in mind.
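The heart of LATE is estimating each running task's time to completion and speculatively re-executing the expected straggler. Below is a sketch of that heuristic in Java; the `TaskStats` class and its fields are assumptions for this example, not Hadoop APIs:

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical per-task progress record for this sketch.
class TaskStats {
    final String id;
    final double progress;    // fraction complete, 0.0 .. 1.0
    final double elapsedSecs; // time the task has been running

    TaskStats(String id, double progress, double elapsedSecs) {
        this.id = id; this.progress = progress; this.elapsedSecs = elapsedSecs;
    }

    // progressRate = progress / elapsed; timeLeft = (1 - progress) / rate
    double estimatedTimeLeft() {
        double rate = progress / elapsedSecs;
        return (1.0 - progress) / rate;
    }
}

class Late {
    // Speculatively re-execute the task expected to finish last.
    static TaskStats pickSpeculationCandidate(List<TaskStats> running) {
        return running.stream()
                .max(Comparator.comparingDouble(TaskStats::estimatedTimeLeft))
                .orElseThrow(() -> new IllegalStateException("no running tasks"));
    }

    public static void main(String[] args) {
        List<TaskStats> tasks = List.of(
                new TaskStats("t1", 0.9, 90),   // ~10s left
                new TaskStats("t2", 0.2, 100)); // ~400s left -> speculate
        System.out.println(pickSpeculationCandidate(tasks).id); // prints "t2"
    }
}
```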
5. Schedulers for real-time jobs: Deadline Scheduler and Constraint-Based Scheduler
These schedulers are mainly used for jobs with time limits (deadline jobs): each job is given a deadline and should be completed within that time. Schedulers of this type actually come in two kinds: soft real-time schedulers (the job is allowed to overrun its deadline somewhat) and hard real-time schedulers (the job must finish strictly on time).
- The Deadline Scheduler mainly targets soft real-time jobs. It dynamically adjusts the amount of resources available to a job based on the job's running progress and remaining time, so that the job completes within its deadline as far as possible (see the sketch after this list).
- The Constraint-Based Scheduler mainly targets hard real-time jobs. Based on a job's deadline and the system's current real-time load, it predicts whether a newly submitted real-time job can be completed by its deadline; if not, it returns the job to the user so they can adjust the deadline.
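To illustrate the kind of calculation a deadline-driven scheduler performs, here is a simple Java sketch. The names and the simplified model (work measured in task-seconds, perfectly parallel tasks) are assumptions for this example:

```java
// Back-of-the-envelope estimate for a soft-deadline scheduler: given the
// remaining work and the time left until the deadline, compute how many
// concurrent containers the job needs. Names here are hypothetical.
class DeadlineEstimator {
    /**
     * @param remainingTaskSecs total task-seconds of work still to run
     * @param secsToDeadline    wall-clock seconds until the job's deadline
     * @return minimum number of parallel containers to finish in time
     */
    static int containersNeeded(double remainingTaskSecs, double secsToDeadline) {
        if (secsToDeadline <= 0) {
            // Deadline already passed: a hard real-time scheduler would
            // reject the job; a soft one allows some overrun.
            return Integer.MAX_VALUE;
        }
        return (int) Math.ceil(remainingTaskSecs / secsToDeadline);
    }

    public static void main(String[] args) {
        // 3600 task-seconds of work left, 600s to deadline -> 6 containers.
        System.out.println(containersNeeded(3600, 600));
    }
}
```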