A Detailed Description of How MapReduce Works

Source: Internet
Author: User
Tags: emit, shuffle, hadoop, mapreduce

This article mainly analyzes the following two points:
Directory:
1. MapReduce Job Run Process
2. The Shuffle and Sort Process in Map and Reduce Tasks

Body:

1. MapReduce Job Run Process


The following is a flowchart I drew with Visio 2010:

Process Analysis:


1. Start a job on the client.


2. Request a job ID from the JobTracker.


3. Copy the resource files required to run the job to HDFS, including the jar file packaged from the MapReduce program, the configuration files, and the input split information computed by the client. These files are stored in a folder that the JobTracker creates specifically for this job; the folder is named after the job ID. The jar file has 10 replicas by default (controlled by the mapred.submit.replication property), and the input split information tells the JobTracker how many map tasks should be started for the job. (A minimal client-side driver sketch appears after step 5 below.)


4. When the JobTracker receives the job, it places it in a job queue and waits for the job scheduler to schedule it (much like process scheduling in an operating system, hehe). When the job scheduler picks the job according to its own scheduling algorithm, it creates one map task for each input split based on the split information and assigns the map tasks to TaskTrackers for execution. Each TaskTracker has a fixed number of map slots and reduce slots, determined by the number of cores and the amount of memory on the host. It is worth emphasizing that map tasks are not assigned to TaskTrackers at random; there is a concept called data locality (data-local): a map task is assigned to a TaskTracker that holds the data block the map will process, and the program jar is copied to that TaskTracker to run. This is called "moving the computation, not the data". Data locality is not considered when assigning reduce tasks.


5. The TaskTracker sends a heartbeat to the JobTracker at regular intervals to tell the JobTracker that it is still running; the heartbeat also carries a lot of information, such as the progress of the current map task. When the JobTracker receives the completion message for the last task of the job, it marks the job as "successful". When the JobClient next polls the job state, it learns that the job is complete and displays a message to the user.
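For reference, here is a minimal, hedged sketch of what the client-side driver for such a job might look like, using the classic org.apache.hadoop.mapred API. The input and output paths are placeholders, and MapClass/Reduce stand in for your mapper and reducer implementations (for example, the word-count classes shown at the end of this article).

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class WordCountDriver {
        public static void main(String[] args) throws IOException {
            // Step 1: start a job on the client
            JobConf conf = new JobConf(WordCountDriver.class);
            conf.setJobName("wordcount");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(MapClass.class);
            conf.setReducerClass(Reduce.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            // Steps 2-3: JobClient asks the JobTracker for a job ID, computes the
            // input splits, and copies the jar, configuration and split information
            // to HDFS before submitting the job and waiting for it to finish.
            JobClient.runJob(conf);
        }
    }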

The above analyzes how MapReduce works at the level of the client, the JobTracker, and the TaskTracker. Below we go into a little more detail and analyze it at the level of the map and reduce tasks.

2. The Shuffle and Sort Process in Map and Reduce Tasks


Again, here is the flowchart I drew in Visio:

Process Analysis:

Map side:


1. Each input split is handled by one map task. By default, the size of one HDFS block (64 MB by default) is one split, although we can also change the block size. The map output is temporarily placed in a circular memory buffer (the buffer defaults to 100 MB, controlled by the io.sort.mb property). When the buffer is about to overflow (by default at 80% of the buffer size, controlled by the io.sort.spill.percent property), a spill file is created on the local file system and the data in the buffer is written to that file. (A configuration sketch for these buffer properties appears after step 4 below.)

2. Before writing to disk, the thread first divides the data into partitions, one partition per reduce task, so that each reduce task receives the data of one partition. This is done to avoid the embarrassing situation where some reduce tasks are allocated large amounts of data while others get little or none. In fact, partitioning is simply hashing the data (a sketch of this hashing appears after the note on shuffle below). The data in each partition is then sorted, and if a combiner has been set, the sorted result is combined; the purpose is to write as little data as possible to disk.

3. By the time the map task writes its last record, there may be many spill files, and these files need to be merged. During merging the data is continually sorted and combined, for two purposes: 1. to minimize the amount of data written to disk in each pass; 2. to minimize the amount of data transmitted over the network during the next copy phase. The spill files are finally merged into one partitioned and sorted file. To further reduce the amount of data transmitted over the network, the map output can also be compressed, simply by setting mapred.compress.map.output to true (see the configuration sketch after step 4).

4. The data in each partition is copied to the corresponding reduce task. One might ask: how does the data in a partition know which reduce task it corresponds to? In fact, the map task keeps in touch with its parent TaskTracker, and the TaskTracker keeps a heartbeat with the JobTracker, so the global information for the whole cluster is held by the JobTracker. Each reduce task simply asks the JobTracker for the locations of its corresponding map outputs.
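As a hedged illustration of the map-side properties mentioned in steps 1 and 3, here is a small sketch applied to the JobConf from the driver sketch earlier; the values are arbitrary examples, not recommendations:

    // Size of the circular in-memory sort buffer in MB (default 100)
    conf.setInt("io.sort.mb", 100);
    // Fraction of the buffer at which the background spill to disk starts (default 0.80)
    conf.setFloat("io.sort.spill.percent", 0.80f);
    // Run the reduce logic as a combiner on each map's sorted output before it is spilled
    conf.setCombinerClass(Reduce.class);
    // Compress the merged map output to reduce disk and network I/O
    conf.setBoolean("mapred.compress.map.output", true);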

At this point, the analysis of the map side is complete. So what exactly is shuffle? Shuffle literally means "to shuffle, to mix". Look at it this way: the data produced by one map is allocated, through hash partitioning, to different reduce tasks. Isn't that a process of shuffling the data?
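That hashing step is, in spirit, what Hadoop's default HashPartitioner does. A minimal sketch, using the key/value types of the word-count example later in this article:

    // Returns the index of the reduce task (i.e. the partition) that a record belongs to
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }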

Reduce side:

1. The reduce side receives data from different map tasks, and the data from each map is sorted. If the amount of data received by the reduce side is small enough, it is kept directly in memory (the buffer size is controlled by the mapred.job.shuffle.input.buffer.percent property, which represents the percentage of heap space used for this purpose). If the amount of data exceeds a certain proportion of the buffer (controlled by mapred.job.shuffle.merge.percent), the data is merged and then spilled to disk. (A configuration sketch for these two properties follows this list.)

2. As the spill files accumulate, a background thread merges them into a larger, sorted file to save time in later merges. In fact, whether on the map side or the reduce side, MapReduce repeatedly performs sort and merge operations; now I finally understand why some people say that sorting is the soul of Hadoop.

3. The merge process produces many intermediate files (written to disk), but MapReduce keeps the amount of data written to disk as small as possible, and the result of the last merge is not written to disk; it is fed directly into the reduce function.
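A hedged configuration sketch for the two reduce-side buffer properties mentioned in step 1 above; the values are arbitrary examples:

    // Fraction of the reduce task's heap used to buffer copied map output
    conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);
    // Usage fraction of that buffer at which an in-memory merge and spill is triggered
    conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);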

It has taken me quite a while to understand how MapReduce works. Shuffle is the heart of MapReduce, and understanding this process helps in writing more efficient MapReduce programs and in tuning Hadoop. I have also drawn a flowchart of the whole process.

In addition, I found another article that is very good; it is quoted below.

Hadoop

Hadoop is an Apache project whose members include HDFS, MapReduce, HBase, Hive, and ZooKeeper. Of these, HDFS and MapReduce are the two most basic and most important members.

HDFS is an open-source implementation of Google's GFS: a highly fault-tolerant distributed file system that provides high-throughput data access and is suitable for storing massive volumes (petabytes) of large files (typically larger than 64 MB). Its working principle is shown below:

It adopts a master/slave structure. The NameNode maintains the metadata within the cluster and provides the ability to create, open, delete, and rename files or directories. DataNodes store the data and handle read and write requests for it. Each DataNode regularly reports a heartbeat to the NameNode, and the NameNode controls the DataNodes through its responses to these heartbeats.
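As a hedged illustration of the client-side view of this architecture, here is a minimal sketch of reading a file through the HDFS FileSystem API (the path is a placeholder): the open() call obtains block locations from the NameNode, and the data itself is then streamed from the DataNodes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // open() asks the NameNode for the file's metadata and block locations
            FSDataInputStream in = fs.open(new Path("/user/hadoop/sample.txt")); // hypothetical path
            try {
                // the bytes are read from the DataNodes holding the blocks
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }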

InfoWorld rated MapReduce the champion of its ten emerging technologies of 2009. MapReduce is a powerful tool for computing over large-scale (TB-level) data: by implementing the two interfaces, Map and Reduce, you can complete computations over TB-level data. Common applications include data-analysis tasks such as log analysis and data mining; it can also be used for scientific computation, such as calculating pi. MapReduce likewise adopts a master/slave structure: the master JobTracker is responsible for scheduling jobs, while the TaskTrackers are responsible for executing tasks.

Analysis of Shuffle and Sort in MapReduce

MapReduce is a very popular distributed computing framework designed to compute over massive amounts of data in parallel. Google was the first to propose this framework, inspired by functional programming languages such as Lisp, Scheme, and ML. The core of the MapReduce framework is divided into two parts: map and reduce. When you submit a computing job to the MapReduce framework, it first splits the job into several map tasks and assigns them to different nodes for execution. Each map task processes part of the input data, and when a map task completes it produces some intermediate files, which serve as the input data for the reduce tasks. The main goal of a reduce task is to summarize the outputs of the preceding maps and write the result. At a high level of abstraction, the MapReduce data flow is shown in Figure 1:

The focus of this article is to analyze the core processes of MapReduce: shuffle and sort. In this article, shuffle refers to the whole process of producing output from a map, including the sorting performed by the system and the transfer of map output to the reducers as input. We will explore how shuffle works, because understanding these basics helps in tuning MapReduce programs.

Let us start the analysis from the map side. When a map begins to produce output, it does not simply write the data to disk, because frequent disk operations would cause severe performance degradation. Its processing is more elaborate: the data is written to a buffer in memory, and some pre-sorting is done to improve efficiency.

Each map task has a circular memory buffer used to hold its output data. The default size is 100 MB, which can be set through the io.sort.mb property. When the amount of data in the buffer reaches a specific threshold (io.sort.mb * io.sort.spill.percent, where io.sort.spill.percent defaults to 0.80), a background thread is started to spill the contents of the buffer to disk. During the spill, the map's output continues to be written to the buffer, but if the buffer fills up the map is blocked until the spill completes. The spill thread writes the buffer's data to disk in a two-level order: first by the partition the data belongs to, and then by key within each partition. The output consists of an index file and a data file. If a combiner is set, it runs on the sorted output. A combiner is a mini-reducer that runs on the node executing the map task; it performs a preliminary reduce on the map's output, making the map output more compact so that less data is written to disk and transferred to the reducers. The spill files are saved in the directory specified by mapred.local.dir and are deleted after the map task finishes.

Whenever the in-memory data reaches the spill threshold, a new spill file is generated, so by the time the map task writes its last output record there may be multiple spill files. Before the map task completes, all spill files are merge-sorted into a single index file and data file, as shown in Figure 3. This is a multi-way merge, with the maximum number of streams merged at once controlled by io.sort.factor (default 10). If a combiner is set and the number of spill files is at least 3 (controlled by the min.num.spills.for.combine property), the combiner runs again to condense the data before the output file is written to disk.
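Again as a hedged sketch on a JobConf (the values simply restate the defaults mentioned above):

    // Maximum number of spill streams merged at once on the map side
    conf.setInt("io.sort.factor", 10);
    // Minimum number of spill files before the combiner is run again during the merge
    conf.setInt("min.num.spills.for.combine", 3);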

Compressing the data written to disk (which is not the same thing as the combiner's condensing) is often a good idea: it makes writing to disk faster, saves disk space, and reduces the amount of data that needs to be transferred to the reducers. The map output is not compressed by default, but this feature can be enabled very simply by setting mapred.compress.map.output to true. The compression library to use is set by mapred.map.output.compression.codec.
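For example, on a JobConf (a sketch; the codec choice is purely illustrative, and the classes come from org.apache.hadoop.io.compress):

    conf.setBoolean("mapred.compress.map.output", true);
    // Choose the codec used to compress the map output, e.g. gzip
    conf.setClass("mapred.map.output.compression.codec",
                  GzipCodec.class, CompressionCodec.class);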

When the spill files have been merged, the map deletes all temporary spill files and informs the TaskTracker that the task is complete. The reducers then fetch the corresponding data over HTTP. The number of worker threads used to serve the partition data is controlled by tasktracker.http.threads; this setting is per TaskTracker, not per map, and its default value is 40. On larger clusters running large jobs it can be increased to raise the data transfer rate.

Now let us move on to the reduce part of shuffle. The map output file is placed on the local disk of the TaskTracker that ran the map task (note: although the map output is always written to local disk, the reduce output usually is not; it is generally written to HDFS), and it is the input data required by the TaskTracker that runs the reduce task. The input data of a reduce task is spread across the outputs of multiple map tasks within the cluster; the map tasks may finish at different times, and the reduce task starts copying a map's output as soon as that map task completes. This stage is called the copy phase. The reduce task has multiple copier threads, so it can fetch map outputs in parallel; the number of threads can be changed by setting mapred.reduce.parallel.copies.

How does the reduce know from which TaskTrackers to fetch the map output? When a map task completes, it notifies its parent TaskTracker of the status update, and the TaskTracker in turn notifies the JobTracker; these notifications are transmitted through the heartbeat mechanism. So, for a given job, the JobTracker knows the mapping between map outputs and TaskTrackers. A thread in the reducer periodically asks the JobTracker for the locations of map outputs until it has fetched all of the data. After a reducer has fetched a map output, the TaskTracker does not delete the data immediately, because the reducer might fail; the data is deleted only after the JobTracker tells the TaskTrackers that the whole job has finished.

If a map output is small enough, it is copied into the memory of the TaskTracker running the reduce task (the size of this buffer is controlled by mapred.job.shuffle.input.buffer.percent, a fraction of the heap). Once the buffer reaches a threshold size, or the number of buffered map outputs reaches a threshold (controlled by mapred.inmem.merge.threshold), the data in the buffer is merged and spilled to disk.
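A hedged sketch of the job-level settings mentioned in the two paragraphs above; the values are only examples:

    // Number of parallel copier threads fetching map output
    conf.setInt("mapred.reduce.parallel.copies", 5);
    // Number of in-memory map outputs that triggers an in-memory merge and spill
    conf.setInt("mapred.inmem.merge.threshold", 1000);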

The copied data accumulates on disk, and a background thread merges it into larger sorted files to save time during the later merges. Map outputs that were compressed are automatically decompressed into memory so that they can be merged.

When all the map outputs have been copied, the reduce task enters the sort phase (more accurately, the merge phase, since the sorting was already done on the map side). This phase merges all the map outputs, and the work is repeated over several rounds.

Suppose there are 50 map outputs (some of which may be held in memory) and the merge factor is 10 (controlled by io.sort.factor, just as for the map-side merge). Then five merge rounds are needed: each round merges 10 files into one, producing 5 intermediate files. After this step, the system does not merge these 5 intermediate files into one; instead they are "fed" directly to the reduce function, eliminating a final round of writing data to disk. The data in this final merge can be mixed, partly in memory and partly on disk.

Because the goal is that the last round merges exactly the merge-factor number of files, the number of files involved in each round can be more subtle in practice. For example, with 40 files, the system does not simply merge 10 files in each of four rounds to end up with 4 files. Instead, only 4 files are merged in the first round; the next three rounds merge 10 files each, finally leaving 4 merged files plus 6 files that were never merged, i.e. exactly 10 files for the last round. Note that this does not change the number of rounds; it is an optimization that minimizes the amount of data written to disk, because the output of the last round is always sent directly to the reduce function.

In the reduce phase, the reduce function is applied to each key of the sorted output. The output of this phase is written directly to the output file system, typically HDFS. In HDFS, because the TaskTracker node also runs a DataNode process, the first block replica is written directly to the local disk. At this point, the analysis of shuffle and sort in MapReduce is complete.

0) When data is uploaded into HDFS, it is cut into many splits (for example 64 MB each), and each split is stored on several DataNodes (redundant storage, so that the failure of one node does not leave the data incomplete and prevent the job from completing).

1) The output of the map is the input of the reduce.

2) The map outputs each record in the form of a <key,value> pair.

3) Before entering the reduce phase, the related data from each map (records with the same key) is shuffled, sorted, and sent to a reducer.

4) In the reduce phase, map outputs with the same key reach the same reducer.

MapReduce programs are designed to compute over massive amounts of data in parallel, which requires dividing the workload across a large number of machines. If the components could arbitrarily share data with each other, the model could not scale to large clusters (hundreds or thousands of nodes): the communication overhead required to keep data synchronized between nodes would make the system unreliable and inefficient at large scale.

In fact, all data elements in MapReduce are immutable, which means they cannot be updated. If you change an input key-value pair in a map task, the change is not fed back to the input file; communication between nodes happens only when a new output key-value pair ((key, value) pair) is generated, and the Hadoop system then passes that output on to the next stage of execution.

List processing

Conceptually, a MapReduce program transforms a list of input data elements into a list of output data elements. A MapReduce program does this twice, using two different idioms: map and reduce, terms taken from list-processing languages such as Lisp, Scheme, or ML.

Mapping lists

The first step of a MapReduce program is called mapping: data elements are provided to the mapper function as input one at a time, and the mapper transforms each element individually into an output data element.

Mapping creates a new output list by applying a function to each element of the input list.

Here is an example of a map function: suppose you have a function toUpper(str) that returns an uppercase version of its input string. You can use this function in a map to turn a list of ordinary strings into a list of uppercase strings. Note that we do not change the input strings here: we return new strings, which become part of a new output list.
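A minimal plain-Java sketch of this idea (assuming java.util.List and java.util.ArrayList are imported; toUpper here is simply String.toUpperCase):

    // Apply toUpper to every element of the input list and build a new output
    // list; the input list itself is never modified.
    static List<String> mapToUpper(List<String> input) {
        List<String> output = new ArrayList<String>();
        for (String s : input) {
            output.add(s.toUpperCase());
        }
        return output;
    }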

Reducing lists

Reducing lets you gather data together. The reducer function receives an iterator over the input list, combines the values together, and returns a single output value.

Reducing iterates over the input list and outputs an aggregated result.

Reducing is typically used to generate "summary" data, turning large-scale data into a smaller summary. For example, "+" can be used as a reducing function to return the sum of the values in the input list.
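A corresponding plain-Java sketch of reducing with "+" as the function (assuming java.util.Iterator is imported):

    // Fold an entire list of values down to a single summary value: their sum.
    static int reduceSum(Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }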

Putting them together in MapReduce

Hadoop's MapReduce framework uses the concepts above to process large-scale data. A MapReduce program has two components: one that implements the mapper and one that implements the reducer. The mapper and reducer idioms described above take on slightly extended meanings in Hadoop, but the basic concepts are the same.

Keys and values: in MapReduce, no value stands alone; every value has a key associated with it, and the key identifies related values. For example, a log of time-coded speedometer readings collected from multiple vehicles could use the license plate number as the key, as follows:

AAA-123 65mph, 12:00pm

ZZZ-789 50mph, 12:02pm

AAA-123 40mph, 12:05pm

CCC-456 25mph, 12:15pm

...

The mapping and reducing functions do not receive just values; they receive (key, value) pairs. Each of these functions likewise emits both a key and a value, which are passed on to the next stage of the data flow.
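As a hedged illustration, here is a hypothetical fragment of a map function for the speedometer log above, written against the same old-style Hadoop types used by the word-count example later in this article; it parses one line and emits the license plate as the key and the speed as the value:

    // "AAA-123 65mph, 12:00pm"  ->  ("AAA-123", 65)
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String[] parts = line.toString().split(" ");
        String plate = parts[0];                                      // the key
        int speed = Integer.parseInt(parts[1].replace("mph,", ""));  // the value
        output.collect(new Text(plate), new IntWritable(speed));
    }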

MapReduce is less strict than other languages about how mappers and reducers work. In more formal functional settings, a mapper produces exactly one output element for each input element, and a reducer produces exactly one output element for each input list. In MapReduce, however, any number of values can be emitted at each stage: a mapper may map one input to zero, one, or a hundred outputs, and a reducer may consume an input list and emit one or several different outputs.

Dividing the reduce space by key: a reducing function turns a large list of values into one (or a few) output values. In MapReduce, all of the values are usually not reduced together. Instead, all values with the same key are sent together to one reducer, and the reduce operations on value lists associated with different keys are performed independently of one another.

Different colors represent different keys, and values with the same key are passed to the same reduce task.

Application Example: Word frequency statistics

We can write a simple MapReduce program to count the number of occurrences of each word in a set of files. For example, suppose we have these files:

Foo.txt: Sweet, this is the Foo file

Bar.txt: This is the bar file

We expect the output to look like this:

Sweet 1

This 2

is 2

The 2

Foo 1

Bar 1

File 2

We can certainly write a MapReduce program to compute this output. The high-level structure will look like this:

mapper (filename, file-contents):
  for each word in file-contents:
    emit (word, 1)

reducer (word, values):
  sum = 0
  for each value in values:
    sum = sum + value
  emit (word, sum)

Listing 4.1: MapReduce word frequency statistics pseudo-code

Several instances of the mapper function are created on different machines in our cluster, each receiving a different input file (we assume there are many files here). The mappers output (word, 1) key-value pairs, which are then transferred to the reducers. Several instances of the reducer method are likewise instantiated on different machines. Each reducer is responsible for processing the list of values associated with a particular word; the values in that list are all 1, and the reducer sums these "1"s into a final count for that word. The reducer then emits the final (word, count) output and writes it to an output file.

We can write a very similar program with Hadoop MapReduce; it is included in the Hadoop distribution under src/examples/org/apache/hadoop/examples/WordCount.java. Part of its code is as follows:

Java code

  // Excerpt from WordCount.java; it relies on imports from java.io, java.util,
  // org.apache.hadoop.io, and org.apache.hadoop.mapred.

  /** A mapper class that tokenizes each input line and emits (word, 1) pairs. */
  public static class MapClass extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

  /** A reducer class that just emits the sum of the input values. */
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

