This article builds on the wordcount example from the previous article, abstracting the simplest possible process to explore how system scheduling works during a MapReduce job.
Scenario 1: Separate data from operations
Wordcount is Hadoop's "hello world" program: it counts the number of times each word appears in the input. The process is as follows:
Now I will describe this process in text.
1. The client submits a job, uploading the MapReduce program and the input data to HDFS.
2. The job is initiated. Based on each machine's idle status, Hadoop schedules one or more tasktracker machines to run the map tasks.
3. Each tasktracker machine copies the program and its input split to the local machine.
4. The tasktracker starts a JVM and performs the map operation.
5. After the map computation is completed, the tasktracker stores the intermediate data on its local disk and notifies the jobtracker.
6. The jobtracker waits for all map tasks to complete, then schedules an idle machine to perform the reduce operation and tells it where the intermediate data is stored.
7. The tasktracker running the reduce task copies the intermediate data to its own machine via RPC, and copies the program from HDFS.
8. It starts a JVM, loads the program, and performs the reduce operation.
9. After the reduce operation is completed, the machine stores the result in HDFS and notifies the jobtracker.
10. When all tasks are completed, the jobtracker notifies the client that the job is finished.
11. The client reads the final result from HDFS.
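The data flow that steps 4 through 9 implement (map, shuffle, reduce) can be sketched in a few lines of Python. This is a single-process simulation to make the phases concrete, not the actual Hadoop API; the function names are my own:

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit a (word, 1) pair for each word in the input split."""
    return [(word, 1) for word in split.split()]

def shuffle(mapped):
    """Shuffle: group intermediate pairs by key, as Hadoop does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# Two input splits, as if stored on two different datanodes.
splits = ["hello hadoop hello world", "hadoop world world"]
mapped = [pair for s in splits for pair in map_phase(s)]
result = reduce_phase(shuffle(mapped))
print(result)  # {'hello': 2, 'hadoop': 2, 'world': 3}
```

In the real cluster, each call to `map_phase` runs in its own JVM on a separate tasktracker, and the shuffle moves data between machines over the network.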
Why is the map's intermediate data stored on the local machine rather than in HDFS? Because an intermediate computation may fail, and if it does, the jobtracker simply selects another machine to redo the task; there is no need to pay the cost of replicating transient data into HDFS. Only the final result is worth persisting.
Scenario 2: Data and node together
The actual situation is certainly not Scenario 1, because moving computation is cheaper than moving data. In Hadoop, the same machine is often both a datanode and a tasktracker. During scheduling, Hadoop gives priority to the machines that already hold the data, so the data does not have to be copied between machines and network bandwidth does not become the computing bottleneck. The example is as follows:
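Hadoop's locality preference can be illustrated with a toy scheduler. The function and data structures below are my own simplification, not Hadoop's actual scheduler code:

```python
def schedule_task(block_locations, idle_nodes):
    """Prefer an idle node that already stores a replica of the
    data block (node-local); otherwise fall back to any idle node,
    which then has to copy the data over the network."""
    for node in block_locations:
        if node in idle_nodes:
            return node, "node-local"        # no data transfer needed
    return sorted(idle_nodes)[0], "remote"   # data must be moved

# Block replicas live on node1 and node3; node2 and node3 are idle.
node, locality = schedule_task(["node1", "node3"], {"node2", "node3"})
print(node, locality)  # node3 node-local
```

The real Hadoop scheduler is more refined than this sketch: between "node-local" and "remote" it also recognizes a "rack-local" level, preferring a machine on the same rack as the data before crossing racks.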
This figure matches the description above, and should be easy to follow. So if Hadoop's actual process is Scenario 2, why did I describe Scenario 1 first? There are two reasons:
1. Scenario 1 is easier to understand.
2. Scenario 1 is easier to implement.
Writing my own cluster scheduling framework based on Hadoop's scheduling principles is one of my recent thoughts and practices. If you are interested, you can try writing one yourself~
Hadoop practice 4 ~ Hadoop Job Scheduling (2)