Hadoop is a distributed system infrastructure under the Apache Foundation. It has two core components: the distributed file system HDFS, which stores files across all storage nodes in a Hadoop cluster and consists of a NameNode and DataNodes; and the distributed computing engine MapReduce, which is composed of a JobTracker and TaskTrackers.
Hadoop allows you to easily develop distributed applications based on your business needs without understanding the underlying details of the distributed system (a minimal code sketch follows the list below). In practice, multiple applications often share a Hadoop cluster, for example:
- Production applications: data analysis and statistical computing;
- Batch-processing applications: machine learning, etc.;
- Interactive applications: SQL queries.
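To make the "no distributed plumbing required" point concrete, here is a minimal sketch of the canonical word-count job written against Hadoop's `org.apache.hadoop.mapreduce` API: the developer writes only a map and a reduce function, while the framework handles input splitting, shuffling, and fault tolerance. The class names follow the standard Hadoop example; treat the snippet as a sketch rather than a drop-in program.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Map: for each input line, emit (word, 1) for every token.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts for each word; the framework groups by key.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }
}
```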
Therefore, multiple jobs may run simultaneously in a Hadoop cluster, and different kinds of jobs may also depend on one another. How, then, can we make full use of the computing resources of the entire cluster? This requires a job scheduler that can effectively schedule and execute jobs across the cluster.
The Hadoop job scheduler is designed as a plug-in: it is dynamically loaded and pluggable, and third parties can develop their own schedulers to replace Hadoop's default one (a configuration sketch follows the list below). Currently, Hadoop job schedulers mainly include the following three:
- FIFO Scheduler: schedules jobs from a single first-in, first-out queue. On top of it, Hadoop also provides an extended scheduler that caps the total number of tasks each job may run concurrently. The advantage is a very simple implementation and a fast scheduling path; the disadvantage is low resource utilization.
- Capacity Scheduler: uses multiple queues, each allocated a share of the cluster's capacity; idle resources can be dynamically reassigned to heavily loaded queues, and job priorities are supported. The advantage is that multiple jobs can execute concurrently, raising resource utilization, and resource allocation adjusts dynamically to improve job execution efficiency. The disadvantage is that a large amount of system knowledge is needed to set up and select the queues.
- Fair Scheduler: groups jobs into pools; each pool is guaranteed a minimum share of resources, and any surplus is divided evenly among running jobs. The advantage is that jobs can be classified and scheduled, different job types receive different resources to improve quality of service, and the number of concurrently running jobs adjusts dynamically to use resources fully. The disadvantage is that the actual load on individual nodes is not considered, so node load can become unbalanced.
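As a concrete illustration of the plug-in mechanism mentioned above, the sketch below shows how, in Hadoop 1.x, the JobTracker's scheduler is selected through a single configuration property (normally set in mapred-site.xml on the JobTracker node). The property and class names below are from Hadoop 1.x as I recall them; verify them against your version.

```java
import org.apache.hadoop.conf.Configuration;

public class SchedulerSelection {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // When unset, the JobTracker falls back to the default FIFO scheduler
    // (org.apache.hadoop.mapred.JobQueueTaskScheduler).
    conf.set("mapred.jobtracker.taskScheduler",
             "org.apache.hadoop.mapred.FairScheduler");
    // Other Hadoop 1.x scheduler classes (names from memory, check per version):
    //   org.apache.hadoop.mapred.CapacityTaskScheduler         -> Capacity Scheduler
    //   org.apache.hadoop.mapred.LimitTasksPerJobTaskScheduler -> FIFO variant that
    //     caps tasks per job via mapred.jobtracker.taskScheduler.maxRunningTasksPerJob
    System.out.println(conf.get("mapred.jobtracker.taskScheduler"));
  }
}
```

A third-party scheduler plugs in the same way: it subclasses Hadoop's abstract `org.apache.hadoop.mapred.TaskScheduler` and its class name is supplied through this property.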
Although the job schedulers that ship with Hadoop are simple and easy to use, analyzing their principles reveals the following problems (some of which the Capacity Scheduler and Fair Scheduler may not have considered or solved well), offered here for discussion:
- Long jobs and short jobs are not clearly differentiated: long jobs generally need guaranteed quality of service, while short jobs need a sufficiently short response time, but the FIFO scheduler makes no such distinction (this is the problem Facebook's Fair Scheduler was built to address).
- The actual computing load of each TaskTracker node is not fully considered: the FIFO scheduler's JobTracker assigns map/reduce tasks strictly according to each node's configured map/reduce slots, and rarely takes the node's real load into account. One remedy is to use the heartbeat between the JobTracker and the TaskTrackers: each TaskTracker actively reports its real-time load to the JobTracker, and the JobTracker weighs that information before assigning a new map/reduce task (see the load-aware sketch after this list).
- Resource allocation and actual usage on virtualization platforms are not considered: if Hadoop is deployed on a virtualization platform, each VM is a Hadoop node, and VMs on the same physical machine must be treated differently from VMs on different physical machines. These VMs may have CPU, memory, disk, and other resources pre-allocated, yet they may still contend for those resources at runtime, which affects their usefulness as compute nodes; allocating map/reduce slots to them in the usual way is therefore not reasonable enough, and a task running on a contended VM can straggle behind the whole job. In addition, some virtualization features (such as live migration of virtual machines to balance load across physical machines, which is transparent to Hadoop's upper layers) are also worth considering for Hadoop clusters.
- Dependency analysis between jobs is not supported: if the critical path through the job dependency graph can be found, the execution efficiency and response speed of jobs can be improved (see the critical-path sketch below).
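The load-reporting idea from the TaskTracker point above can be sketched in plain Java. This is a hypothetical model, not Hadoop's actual API: `NodeLoad`, the threshold values, and the method names are all illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;

public class LoadAwareAssignment {
  /** Load snapshot a tasktracker would piggyback on each heartbeat (assumed fields). */
  static class NodeLoad {
    final double cpuUtilization;  // 0.0 .. 1.0
    final double memUtilization;  // 0.0 .. 1.0
    final int freeMapSlots;
    NodeLoad(double cpu, double mem, int freeMapSlots) {
      this.cpuUtilization = cpu;
      this.memUtilization = mem;
      this.freeMapSlots = freeMapSlots;
    }
  }

  private final Map<String, NodeLoad> latest = new HashMap<>();
  private static final double CPU_CEILING = 0.85;  // assumed tunable thresholds
  private static final double MEM_CEILING = 0.90;

  /** Record the latest load report when a heartbeat arrives. */
  void onHeartbeat(String tracker, NodeLoad load) { latest.put(tracker, load); }

  /** Free slots alone are not enough: also require real headroom on the node. */
  boolean mayAssignMapTask(String tracker) {
    NodeLoad l = latest.get(tracker);
    return l != null
        && l.freeMapSlots > 0
        && l.cpuUtilization < CPU_CEILING
        && l.memUtilization < MEM_CEILING;
  }

  public static void main(String[] args) {
    LoadAwareAssignment jt = new LoadAwareAssignment();
    jt.onHeartbeat("tracker-1", new NodeLoad(0.95, 0.40, 2)); // CPU saturated: skip
    jt.onHeartbeat("tracker-2", new NodeLoad(0.30, 0.50, 2)); // idle: assign
    System.out.println(jt.mayAssignMapTask("tracker-1")); // false
    System.out.println(jt.mayAssignMapTask("tracker-2")); // true
  }
}
```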
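Likewise, the critical-path idea in the last point can be illustrated with a self-contained sketch (the job names and runtime estimates are hypothetical; a real scheduler would derive both from job metadata): given estimated runtimes and dependency edges, the longest dependency chain bounds the total makespan and identifies the jobs that deserve scheduling priority.

```java
import java.util.*;

public class CriticalPath {
  /** Earliest finish time of job j, via memoized DFS over its dependencies. */
  static int finish(String j, Map<String, Integer> runtime,
                    Map<String, List<String>> deps,
                    Map<String, Integer> memo, Map<String, String> pick) {
    Integer cached = memo.get(j);
    if (cached != null) return cached;
    int longestDep = 0;
    for (String d : deps.getOrDefault(j, List.of())) {
      int f = finish(d, runtime, deps, memo, pick);
      if (f > longestDep) { longestDep = f; pick.put(j, d); } // remember longest predecessor
    }
    int f = longestDep + runtime.get(j);
    memo.put(j, f);
    return f;
  }

  public static void main(String[] args) {
    Map<String, Integer> runtime = Map.of("A", 5, "B", 3, "C", 8, "D", 2);
    Map<String, List<String>> deps = Map.of(
        "C", List.of("A", "B"),   // C waits for A and B
        "D", List.of("C"));       // D waits for C
    Map<String, Integer> memo = new HashMap<>();
    Map<String, String> pick = new HashMap<>();
    int makespan = 0;
    String last = null;
    for (String j : runtime.keySet()) {
      int f = finish(j, runtime, deps, memo, pick);
      if (f > makespan) { makespan = f; last = j; }
    }
    // Walk the recorded predecessors back from the last-finishing job.
    Deque<String> path = new ArrayDeque<>();
    for (String j = last; j != null; j = pick.get(j)) path.addFirst(j);
    System.out.println("critical path " + path + ", makespan " + makespan);
    // prints: critical path [A, C, D], makespan 15
  }
}
```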
These are just a few problems listed based on my own understanding; you are welcome to discuss them.