This article builds on the wordcount example from the previous article, abstracting the simplest possible process to explore how system scheduling works during a MapReduce job.
Scenario 1: Separate data from operations
Wordcount is Hadoop's "hello world" program: it counts the number of times each word appears in the input. The process is as follows:
Now I will describe this process in text.
1. The client submits a job, uploading the MapReduce program and the input data to HDFS.
2. The job is initiated. Based on each machine's idle status, Hadoop schedules one or more tasktracker machines to run the map tasks.
3. Each tasktracker machine copies the program and its input split to the local machine.
4. The tasktracker starts a JVM and performs the map operation.
5. After the map computation is completed, the tasktracker stores the intermediate data on its local disk and notifies the jobtracker.
6. The jobtracker waits for all map tasks to complete, then schedules an idle machine to perform the reduce operation and tells it where the intermediate data is stored.
7. The tasktracker running the reduce task copies the intermediate data to its own machine via RPC, and copies the program from HDFS.
8. It starts a JVM, loads the program, and performs the reduce operation.
9. After the reduce operation is completed, the machine stores the result in HDFS and notifies the jobtracker.
10. When all tasks are completed, the jobtracker notifies the client that the job is finished.
11. The client reads the final result from HDFS.
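The data flow that steps 4 through 9 implement (map, shuffle, reduce) can be sketched in a few lines of Python. This is a single-process simulation to make the phases concrete, not the actual Hadoop API; the function names are my own:

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit a (word, 1) pair for each word in the input split."""
    return [(word, 1) for word in split.split()]

def shuffle(mapped):
    """Shuffle: group intermediate pairs by key, as Hadoop does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# Two input splits, as if stored on two different datanodes.
splits = ["hello hadoop hello world", "hadoop world world"]
mapped = [pair for s in splits for pair in map_phase(s)]
result = reduce_phase(shuffle(mapped))
print(result)  # {'hello': 2, 'hadoop': 2, 'world': 3}
```

In the real cluster, each call to `map_phase` runs in its own JVM on a separate tasktracker, and the shuffle moves data between machines over the network.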
Why is the map's intermediate data stored on the local machine rather than in HDFS? Because an intermediate computation may fail, and if it does, the jobtracker simply selects another machine to redo the task; there is no need to pay the cost of replicating transient data into HDFS. Only the final result is worth persisting.
Scenario 2: Data and node together
The actual situation is certainly not Scenario 1, because moving computation is cheaper than moving data. In Hadoop, the same machine is often both a datanode and a tasktracker. During scheduling, Hadoop gives priority to the machines that already hold the data, so the data does not have to be copied between machines and network bandwidth does not become the computing bottleneck. The example is as follows:
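Hadoop's locality preference can be illustrated with a toy scheduler. The function and data structures below are my own simplification, not Hadoop's actual scheduler code:

```python
def schedule_task(block_locations, idle_nodes):
    """Prefer an idle node that already stores a replica of the
    data block (node-local); otherwise fall back to any idle node,
    which then has to copy the data over the network."""
    for node in block_locations:
        if node in idle_nodes:
            return node, "node-local"        # no data transfer needed
    return sorted(idle_nodes)[0], "remote"   # data must be moved

# Block replicas live on node1 and node3; node2 and node3 are idle.
node, locality = schedule_task(["node1", "node3"], {"node2", "node3"})
print(node, locality)  # node3 node-local
```

The real Hadoop scheduler is more refined than this sketch: between "node-local" and "remote" it also recognizes a "rack-local" level, preferring a machine on the same rack as the data before crossing racks.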
This figure matches the description above, and should be easy to follow. So if Hadoop's actual process is Scenario 2, why did I describe Scenario 1 first? There are two reasons:
1. Scenario 1 is easier to understand.
2. Scenario 1 is easier to implement.
Writing my own cluster scheduling framework based on Hadoop's scheduling principles is one of my recent thoughts and practices. If you are interested, you can try writing one yourself~
Hadoop practice 4 ~ Hadoop Job Scheduling (2)