016_ General overview of the MapReduce execution process combined with the WordCount program

I. Map task processing

1. Read the contents of the input file and parse them into <key, value> pairs: each line of the input file becomes one <key, value> pair. The map function is called once for each pair.

2. Apply your own logic: process the input key and value and convert them into new <key, value> pairs for output (see the mapper sketch below).
3. Partition the output <key, value> pairs.
4. Within each partition, sort the data by key and group it: all values that share the same key are placed into one collection.
5. (Optional) Combine the grouped data locally.
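
To make steps 1 and 2 concrete, here is a minimal WordCount mapper, written as a sketch against the classic org.apache.hadoop.mapreduce API (TokenizerMapper is the conventional name from the Hadoop WordCount example). It is called once per <offset, line> pair and emits a <word, 1> pair for every token:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Called once per input pair: key = line offset, value = line text.
    public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);  // emit <word, 1> for each token
            }
        }
    }

Partitioning, sorting, grouping, and the optional combine (steps 3-5) are then handled by the framework, not by this class.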

II. Reduce task processing

1. The outputs of multiple map tasks are copied over the network to the appropriate reduce nodes, according to their partitions.
2. The outputs of the map tasks are merged and sorted. Apply your own logic in the reduce function: process the input key and values and convert them into new <key, value> pairs for output (see the reducer sketch below).
3. Save the reduce output to a file.
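
A matching reducer sketch, again assuming the org.apache.hadoop.mapreduce API: by the time reduce is called, the framework has already merged and grouped the values by key (step 2), so the method only needs to sum the collection for each word:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Called once per key, with all of that key's values grouped together.
    public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();        // accumulate the counts for this word
            }
            result.set(sum);
            context.write(key, result);  // emit <word, total>
        }
    }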

III. JobTracker, TaskTracker, and how tasks are divided

The JobTracker receives jobs submitted by users and is responsible for starting and tracking task execution.
The TaskTracker executes the tasks assigned by the JobTracker and manages how individual tasks run on each node.
Job: every compute request from a user is called a job.
Task: each job has to be split up and completed on multiple servers; the execution units that result from the split are called tasks.
Tasks come in two kinds, MapTask and ReduceTask, which perform the map operation and the reduce operation respectively, using the map class and reduce class configured for the job (see the driver sketch below).
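
How a job declares its map class and reduce class can be sketched with the standard Hadoop Job API; TokenizerMapper and IntSumReducer here are the classes sketched in the two sections above, and the input/output paths are taken from the command line:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);          // jar shipped to the task nodes
            job.setMapperClass(TokenizerMapper.class);   // becomes the MapTask logic
            job.setCombinerClass(IntSumReducer.class);   // optional local combine
            job.setReducerClass(IntSumReducer.class);    // becomes the ReduceTask logic
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }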

IV. WordCount processing flow

1. The input files are split into splits. Because the test files are small, each file becomes one split, and each split is divided into lines to form <key, value> pairs. This step is done automatically by the MapReduce framework; the key is the byte offset of the line, which counts the line terminator, so it differs by one or two characters per line between Windows (carriage return plus line feed) and Linux (line feed only).

2. The resulting <key, value> pairs are handed to the user-defined map method, which processes them and produces new <key, value> pairs.

3. After the map method has produced its <key, value> pairs, the Mapper sorts them by key and runs the combine step, summing the values that share the same key to obtain the Mapper's final output.

4. The Reducer first sorts the data it receives from the Mappers, then processes it with the user-defined reduce method to obtain new <key, value> pairs, which become the output of WordCount. A small worked trace of these four steps follows.
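
As a worked illustration of steps 1-4, suppose a split contains the two lines "hello world" and "hello hadoop" (hypothetical sample data; with Linux line endings the second line starts at byte offset 12). The flow would then be:

    split (framework):            <0, "hello world">, <12, "hello hadoop">
    map output:                   <hello, 1>, <world, 1>, <hello, 1>, <hadoop, 1>
    sort + combine (mapper side): <hadoop, 1>, <hello, 2>, <world, 1>
    reduce output (WordCount):    <hadoop, 1>, <hello, 2>, <world, 1>

With only one mapper the combine and reduce results happen to coincide; with many mappers, the reduce step is what merges the partial sums from the different mappers.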

V. Analysis of the run flow of an MR job

1. Start a job on the client.
2. Request a job ID from the JobTracker.
3. Copy the resource files needed to run the job to HDFS, including the jar file packaged from the MapReduce program, the configuration files, and the input split information computed by the client. These files are stored in a folder that the JobTracker creates specifically for the job; the folder name is the job ID. The jar file has 10 replicas by default (controlled by the mapred.submit.replication property), and the input split information tells the JobTracker how many map tasks should be started for the job.
4. When the JobTracker receives the job, it places it in a job queue to wait for the job scheduler (much like process scheduling in an operating system). When the scheduler picks the job according to its scheduling algorithm, it creates one map task for each input split and assigns the map tasks to TaskTrackers for execution. Each TaskTracker has a fixed number of map slots and reduce slots, determined by the number of cores and the amount of memory on its host. Note that map tasks are not assigned to TaskTrackers at random: there is a concept called data locality (data-local), which means a map task is assigned to a TaskTracker that holds the data block the task will process, and the program jar is copied to that TaskTracker to run. This is called "moving the computation, not the data". Data locality is not considered when assigning reduce tasks.
5. The TaskTracker periodically sends a heartbeat to the JobTracker to report that it is still running; the heartbeat also carries information such as the progress of the current map task. When the JobTracker receives the completion message for the job's last task, it marks the job as "successful". When the JobClient queries the job state, it learns that the job is complete and displays a message to the user.
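
From the user's point of view, this whole flow is kicked off by a single client-side submission; a typical invocation looks like this (the jar name and HDFS paths are hypothetical):

    $ hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output

The hadoop jar command starts the job on the client (step 1), and everything from requesting the job ID to the final "success" state proceeds as described above.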
