Hadoop practice 4 ~ Hadoop Job Scheduling (2)

This article continues the wordcount example from the previous article, abstracting the simplest possible flow to explore how the system schedules work during a MapReduce run.

 

Scenario 1: Data and computation on separate machines

WordCount is Hadoop's "hello world" program: it counts the number of times each word appears in the input.
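For concreteness, here is the map and reduce logic in Java. This is essentially the canonical WordCount pair from the Hadoop documentation, written against the org.apache.hadoop.mapreduce API and shown as a sketch rather than a tuned implementation:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // map: for every word in a line of input, emit the pair (word, 1)
    class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce: sum up the 1s emitted for each distinct word
    class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }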

In terms of scheduling, the process runs as follows.

1. The client submits a job, sending the MapReduce program and the input data to HDFS (the driver sketch after this list shows the client side of this step).

2. The job starts. Based on each machine's idle status, Hadoop schedules one (or N) tasktracker machines to run the map operations.

3. Each chosen tasktracker copies the program and its share of the data to its own machine.

4. The tasktracker starts a JVM and performs the map operation.

5. When the map computation finishes, the tasktracker stores the output on its local disk and notifies the jobtracker node.

6. The jobtracker waits for all map machines to finish, schedules an idle machine to perform the reduce operation, and tells it where the map output is stored.

7. The tasktracker performing the reduce operation copies the map output to its own machine over RPC, and copies the program from HDFS.

8. It starts a JVM, loads the program, and performs the reduce operation.

9. When the reduce finishes, that machine stores the result in HDFS and notifies the jobtracker.

10. When all tasks are complete, the jobtracker notifies the client that the job is done.

11. The client reads the final result from HDFS.
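Step 1, "the client submits a job", is only a few lines of client-side code. Here is a minimal driver sketch for the WordCount classes above; the input and output paths are placeholders passed on the command line:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);          // ships the program jar to the cluster
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input already sits in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // final output lands in HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);       // blocks while steps 2-10 run
        }
    }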

Why is the intermediate map output stored on the local machine rather than in HDFS? Because the intermediate computation may fail. If it does, the jobtracker simply selects another machine to redo the task, so there is no need to pay the cost of writing the intermediate data into HDFS; only the final data is valuable.
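On an MRv1 cluster this split between scratch data and final data is visible in the configuration: intermediate map output spills into the local directories listed under mapred.local.dir on each tasktracker, while only the job's final output path points at HDFS. A mapred-site.xml fragment as an example (the directory values here are just an assumption for illustration):

    <!-- mapred-site.xml: scratch space for intermediate map output (local disk, not HDFS) -->
    <property>
      <name>mapred.local.dir</name>
      <value>/data/1/mapred/local,/data/2/mapred/local</value>
    </property>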

Scenario 2: Data and computation on the same machine

The real situation is of course not scenario 1, because moving computation is cheaper than moving data. In Hadoop, the same machine is usually both a datanode and a tasktracker. During scheduling, Hadoop gives priority to the machines that already hold the data to be processed, so the data does not have to be copied between machines and network bandwidth does not become the computing bottleneck.
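The information that makes this possible is the location metadata attached to every input split: InputSplit.getLocations() returns the hostnames of the datanodes holding that block's replicas, and the jobtracker tries to place each map task on one of them. A rough sketch of inspecting this metadata, assuming a hypothetical input path on a running cluster:

    import java.util.Arrays;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class SplitLocations {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "split-locations");
            FileInputFormat.addInputPath(job, new Path("hdfs:///user/demo/input")); // hypothetical path
            // ask the input format how the file breaks into splits, and where each split lives
            List<InputSplit> splits = new TextInputFormat().getSplits(job);
            for (InputSplit split : splits) {
                System.out.println(split + " -> " + Arrays.toString(split.getLocations()));
            }
        }
    }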

 

Scenario 2 follows the same steps as described above, only with the map tasks placed on the machines that already store the data. So, since what Hadoop actually does is scenario 2, why did I describe scenario 1 first? There are two reasons:

1. Scenario 1 is easier to understand.

2. Scenario 1 is easier to implement.

 

Writing your own cluster scheduling framework based on Hadoop's scheduling principles is one of my recent ideas and experiments. If you are interested, you can write one yourself ~
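As a starting point, here is a toy version of the locality-first assignment rule described above. Everything in it is hypothetical (no real Hadoop types); it only captures the core decision a jobtracker makes each time an idle tasktracker asks for work:

    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;

    // Toy scheduler sketch: hand out pending map tasks, preferring data-local assignments.
    class LocalityFirstScheduler {
        // task id -> hosts that hold replicas of that task's input split (names are made up)
        private final Map<String, List<String>> pendingTasks;

        LocalityFirstScheduler(Map<String, List<String>> pendingTasks) {
            this.pendingTasks = pendingTasks;
        }

        /** Called when the tasktracker on `host` reports itself idle; returns a task id or null. */
        String assignTask(String host) {
            // First choice: a task whose input data already sits on this host (scenario 2)
            Iterator<Map.Entry<String, List<String>>> it = pendingTasks.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, List<String>> e = it.next();
                if (e.getValue().contains(host)) {
                    it.remove();
                    return e.getKey();
                }
            }
            // Fallback: no local work left, so accept copying data over the network (scenario 1)
            Iterator<String> any = pendingTasks.keySet().iterator();
            if (!any.hasNext()) {
                return null; // nothing left to schedule
            }
            String task = any.next();
            any.remove();
            return task;
        }
    }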
