MapReduce Data Flow


The core components of Hadoop work together as shown in the following figure:

Figure 4.4 High-level MapReduce pipeline

The input to a MapReduce job typically comes from files stored in HDFS on the nodes of the cluster. Running a MapReduce program launches map tasks on many, and possibly all, of the nodes in the cluster, and every map task is identical: mappers have no particular "identity" associated with them, so any mapper can process any input file. Each mapper loads the set of files stored locally on the node where it runs and processes them. This is "moving the computation to the data": the computation is shipped to the node where the data already resides, which avoids extra data-transfer overhead.

When the mapping phase completes, the intermediate key-value pairs generated in that phase must be exchanged between nodes so that all values with the same key are sent to the same reducer. The reduce tasks are spread across the nodes of the cluster in the same way as the mappers. This is the only communication step in MapReduce: individual map tasks do not exchange information with one another, and are not even aware that other map tasks exist. Likewise, different reduce tasks never communicate with each other. The user can never explicitly marshal information from one machine to another; all data transfer is performed by the Hadoop MapReduce platform itself, guided implicitly by the keys associated with the values. This is a fundamental element of Hadoop MapReduce's reliability: if nodes in the cluster fail, tasks must be restartable. If tasks performed side-effect operations, for example communicating with the outside world, any shared state would have to be restored before a task could be restarted. By eliminating communication between tasks and side effects, restarts can be handled gracefully.

A closer look

The previous figure described the Hadoop MapReduce pipeline at a high level. From that diagram you can see where the mapper and reducer components of the word-count program fit in, and how they achieve their goal. Next, we will look at the system more closely to get more detail.

Figure 4.5 Detailed Hadoop MapReduce data flow

Figure 4.5 shows the pipeline in more detail. Although only two nodes are shown, the same pipeline can be replicated across a system spanning a large number of nodes. The next few paragraphs describe each stage of a MapReduce program in detail.

1. Input files:

The input files are where the data for a MapReduce task is initially stored. They normally reside in HDFS. The format of these files is arbitrary: we can use line-based log files, binary formats, multi-line input records, or some other format. These files are typically very large, tens of gigabytes or more.

2. Input format (InputFormat):

The InputFormat class defines how input files are split up and read. It provides several functions:

    • Selects the files or objects that should be used as input;
    • Defines the InputSplits that break a file up into tasks;
    • Provides a factory for RecordReader objects that read the file;

Hadoop comes with several input formats. An abstract class called FileInputFormat sits at the top: all InputFormat classes that operate on files inherit functionality and properties from it. When a Hadoop job is started, FileInputFormat is given a path containing the files to process; it reads all the files in that directory (by default, subdirectories are not included) and then divides each file into one or more InputSplits. You can set the input format applied to your job's input files with the setInputFormat() method of the JobConf object. The following table shows some of the standard input formats:

Input format              Description                                          Key                                             Value
TextInputFormat           Default format; reads lines of text files            The byte offset of the line                     The contents of the line
KeyValueInputFormat       Parses lines into key-value pairs                    All characters up to the first tab character    The remainder of the line
SequenceFileInputFormat   High-performance binary format defined by Hadoop     User-defined                                    User-defined

Table 4.1: Input formats provided by MapReduce

The default input format is TextInputFormat, which treats each line of an input file as a separate record and performs no parsing. This is useful for unformatted data or line-based records such as log files. A more interesting input format is KeyValueInputFormat, which also treats each line of the input file as a separate record. Whereas TextInputFormat uses the entire line as the value, however, KeyValueInputFormat searches for a tab character and splits the line into a key-value pair at that point. This is especially useful when the output of one MapReduce job is used as the input to the next, because the default output format (described in more detail below) writes its data in a format that KeyValueInputFormat can read. Finally, SequenceFileInputFormat reads special Hadoop-specific binary files that include many features allowing Hadoop mappers to read data quickly. Sequence files are block-compressed and provide direct serialization and deserialization of several data types, not just text. Sequence files can be generated as the output of a MapReduce task, and they are an efficient way to feed the intermediate data of one MapReduce job into another.
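To make this concrete, the snippet below shows how an input format might be selected through JobConf in the classic org.apache.hadoop.mapred API. It is a minimal sketch rather than code from this text: the InputFormatExample class name is invented for illustration, and KeyValueTextInputFormat is the concrete class in that API that provides the key-value behaviour described above.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;

    public class InputFormatExample {
        public static void configureInput(JobConf conf) {
            // TextInputFormat is the default, so it never has to be set explicitly.
            // Switch to the key-value format when the input was produced by a
            // previous job's TextOutputFormat (tab-separated key-value lines).
            conf.setInputFormat(KeyValueTextInputFormat.class);
        }
    }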

3. Input splits (InputSplit):

An InputSplit describes the unit of work that makes up a single map task in a MapReduce program. A MapReduce program applied to a data set, referred to collectively as a job, is made up of several, possibly hundreds, of tasks. A map task may read a whole file, but it generally reads only part of one. By default, FileInputFormat and its subclasses break files into 64 MB chunks (the same size as the default HDFS block; Hadoop recommends keeping the split size and the block size the same). You can set the mapred.min.split.size parameter in hadoop-site.xml (in releases after 0.20.* the defaults live in mapred-default.xml) to control the split size, or override the parameter in the JobConf object of a particular MapReduce job. By processing a file in chunks, we allow several map tasks to operate on a single file in parallel. If the file is very large, this can improve performance significantly through parallel processing. More importantly, because the blocks that make up a file may be spread across several nodes in the cluster (and in practice they are), the tasks can be scheduled on those different nodes, so each block is processed locally rather than being transferred from one node to another. Of course, while log files can be split this way sensibly, some file formats do not support chunked processing. In that case you can write a custom InputFormat that controls how your files are split (or not split) into chunks. Custom file formats are described in part five.
The input format defines the list of map tasks that make up the mapping phase; each task corresponds to one input split. The tasks are then scheduled onto cluster nodes according to the physical location of the input file blocks, and several map tasks may be assigned to the same node. Once the tasks are scheduled, a node starts running them, attempting to run as many in parallel as it can. The maximum number of map tasks that may run in parallel on a node is controlled by the mapred.tasktracker.map.tasks.maximum parameter.
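For example, the split size can be overridden per job through JobConf, which inherits a generic set() method from Configuration. This is a hedged sketch: the SplitSizeExample class name and the 128 MB value are chosen only for illustration.

    import org.apache.hadoop.mapred.JobConf;

    public class SplitSizeExample {
        public static void configureSplits(JobConf conf) {
            // Raise the minimum split size to 128 MB (value is in bytes) for this
            // job only, overriding the cluster-wide mapred.min.split.size setting.
            conf.set("mapred.min.split.size", "134217728");
        }
    }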

4. Record reader (RecordReader):

The InputSplit defines how to slice up the work, but does not describe how to access it. The RecordReader class actually loads the data and converts it into key-value pairs suitable for reading by the mapper. The RecordReader instance is determined by the input format; the default input format, TextInputFormat, provides a LineRecordReader, which treats each line of the input file as a new value. The key associated with each line is its byte offset in the file. The RecordReader is invoked repeatedly on the input split until the entire split has been consumed, and each invocation of the RecordReader leads to a call to the mapper's map() method. Conceptually, the framework drives the reader as in the sketch below.
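The following is a rough sketch, in the classic org.apache.hadoop.mapred API, of how a map task drives a RecordReader; it is not the framework's actual source, and the MapRunnerSketch class name is made up for this example.

    import java.io.IOException;

    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class MapRunnerSketch {
        public static <K1, V1, K2, V2> void run(RecordReader<K1, V1> reader,
                                                Mapper<K1, V1, K2, V2> mapper,
                                                OutputCollector<K2, V2> output,
                                                Reporter reporter) throws IOException {
            K1 key = reader.createKey();        // e.g. LongWritable (byte offset) for LineRecordReader
            V1 value = reader.createValue();    // e.g. Text holding the line contents
            while (reader.next(key, value)) {   // advance to the next record in the split
                mapper.map(key, value, output, reporter);
            }
            reader.close();
        }
    }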

5. Mapper:

The mapper performs the interesting user-defined work of the first phase of a MapReduce program. Given a key-value pair, the map() method generates one or more key-value pairs, which are forwarded to the reducers. A new mapper instance is instantiated in a separate Java process for each map task (input split) that makes up the job's input, and individual mappers cannot communicate with one another. This means the reliability of each map task is unaffected by the other map tasks and depends only on the reliability of the local machine. In addition to the key and value, the map() method receives two further parameters (note: in releases from 0.20.x onward the interface has changed, and a Context object replaces these two parameters); a minimal mapper using this interface is sketched after the list below.

    • The OutputCollector object has a collect() method, which is used to forward key-value pairs to the reduce phase of the job.
    • The Reporter object provides information about the current task: its getInputSplit() method returns an object describing the current input split, and it also lets the map task report on its progress. The setStatus() method lets you generate a status message that is fed back to the user, and the incrCounter() method lets you increment shared performance counters; besides the default counters, you may define as many additional counters as you like. Each mapper can increment the counters, and the JobTracker collects the increments from the different processes and aggregates them so they can be read once the job has finished.
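The class below is a minimal word-count mapper written against the classic org.apache.hadoop.mapred interface described above. It is a sketch rather than code from this text; the WordCountMapper class name and the whitespace tokenization are illustrative choices.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            // With TextInputFormat: key = byte offset of the line, value = the line itself.
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                word.set(token);
                output.collect(word, ONE);   // emit (word, 1) toward the reduce phase
            }
        }
    }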

6. Partition & Shuffle:

When the first map tasks complete, a node may continue to run further map tasks, but at this point it also begins exchanging the intermediate outputs of the map tasks with the reducers that need them; this process of moving map output to the reducers is called the shuffle. Each reduce node is assigned a different subset of the keys in the intermediate key space; these subsets (known as "partitions") are the input to the reduce tasks. The key-value pairs produced by a map task may belong to any partition, but all values for the same key are always reduced together, regardless of which mapper produced them. Therefore, all map nodes must agree on where to send each piece of intermediate data. The Partitioner class determines which partition a given key-value pair will go to; the default partitioner computes a hash of the key and assigns the key to a partition based on the result. Custom partitioners are described in detail in part five.
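For illustration, the class below is a hedged sketch of a custom partitioner in the classic org.apache.hadoop.mapred API: it routes keys to partitions by their first character instead of hashing the whole key. The FirstLetterPartitioner class name is invented here; the default hashing behaviour described above corresponds to Hadoop's HashPartitioner.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {

        public void configure(JobConf job) {
            // No configuration is needed for this example.
        }

        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // Keys starting with the same character always land in the same
            // partition, and therefore reach the same reducer.
            String s = key.toString();
            int firstChar = s.isEmpty() ? 0 : s.charAt(0);
            return (firstChar & Integer.MAX_VALUE) % numPartitions;
        }
    }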

7. Sort:

Each reduce task is responsible for reducing all the values associated with the same key. The set of intermediate keys each node receives is automatically sorted by Hadoop before it is presented to the reducer.

8. Reduce:

A Reducer instance is created for each reduce task; it is an instance of user-defined code that performs the second important phase of the job. For each key in the partition assigned to a reducer, the reducer's reduce() method is called exactly once. It receives a key and an iterator over all the values associated with that key; the iterator returns those values in an undefined order. The reducer also receives OutputCollector and Reporter objects, which are used just as in the map() method.
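A minimal word-count reducer in the classic org.apache.hadoop.mapred API is sketched below as the counterpart to the mapper sketch above; the WordCountReducer class name is illustrative.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();   // values for this key arrive in no particular order
            }
            output.collect(key, new IntWritable(sum));   // emit (word, total count)
        }
    }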

9. Output format (OutputFormat):

The key-value pairs handed to the OutputCollector are written to output files, and how they are written is controlled by the output format. OutputFormat plays a role analogous to the InputFormat class described earlier; the OutputFormat instances provided by Hadoop write files on the local disk or in HDFS, and they all inherit from a common FileOutputFormat class. Each reducer writes its results to a separate file in a common output directory; these files are typically named part-nnnnn, where nnnnn is the partition id associated with the reduce task. The output directory is set with FileOutputFormat.setOutputPath(). You can set the specific output format with the setOutputFormat() method of the JobConf object for your MapReduce job. The following table shows the output formats provided:

Output format             Description
TextOutputFormat          Default format; writes each line as "key \t value"
SequenceFileOutputFormat  Writes binary files suitable for reading into subsequent MapReduce jobs
NullOutputFormat          Ignores the data it receives, i.e. writes no output

Table 4.2: Output formats provided by Hadoop

Hadoop provides several OutputFormat instances for writing to files. The basic (default) one is TextOutputFormat, which writes each key-value pair on its own line of a text file. A subsequent MapReduce job can easily re-read this data through the KeyValueInputFormat class, and the format is also human-readable. A better intermediate format for use between MapReduce jobs is SequenceFileOutputFormat, which can rapidly serialize arbitrary data types to a file; the corresponding SequenceFileInputFormat deserializes the file into the same types and presents the data to the next mapper in the same form in which it left the previous reducer. NullOutputFormat generates no output files at all and discards any key-value pairs passed to it by the OutputCollector; this is useful if you explicitly write your own output files in the reduce() method and do not want the Hadoop framework to emit additional empty output files.
RecordWriter: much as InputFormat reads individual records through a RecordReader implementation, the OutputFormat class is a factory for RecordWriter objects, which are used to write individual records to the files, as directed by the OutputFormat.
The output files written by the reducers remain in HDFS for use by your other applications, such as another MapReduce job or a separate program for manual inspection.
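Finally, to show how the pieces described above are wired together, here is a hedged sketch of a job driver in the classic org.apache.hadoop.mapred API. It assumes the WordCountMapper and WordCountReducer classes sketched in sections 5 and 8; the WordCountDriver class name is likewise invented for this example, and the input and output paths are taken from the command line.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCountDriver.class);
            conf.setJobName("wordcount");

            // User-defined map and reduce code (sketched earlier).
            conf.setMapperClass(WordCountMapper.class);
            conf.setReducerClass(WordCountReducer.class);

            // Types of the keys and values handed to the output format.
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            // Input and output formats; these are the defaults, set here for clarity.
            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            // Input directory in HDFS and output directory for the part-nnnnn files.
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }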


