MapReduce Data Flow

Last Update:2015-08-12 Source: Internet

Author: User


The input to MapReduce typically comes from files in HDFS, which are stored on nodes across the cluster. Running a MapReduce program starts map tasks on many, possibly all, of the cluster's nodes. Every map task is equivalent: mappers have no particular "identity" associated with them, so any mapper can process any input file. Each mapper loads the set of files stored locally on the node where it runs and processes them. This moves the computation to the node that holds the data, which avoids the overhead of transferring the data.

When the mapping phase is complete, the intermediate key-value pairs it produced must be exchanged between nodes so that all values with the same key are sent to the same reducer. Reduce tasks are spread across the same cluster nodes as the mappers. This exchange is the only communication step between task nodes in MapReduce. Individual map tasks do not exchange information with one another and are not aware of one another's existence. Similarly, different reduce tasks do not communicate with each other. The user never explicitly marshals information from one machine to another; all data transfer is handled by the Hadoop MapReduce platform itself, guided implicitly by the keys associated with the values. This is fundamental to the reliability of Hadoop MapReduce. If nodes in the cluster fail, tasks must be able to be restarted. If tasks performed operations with side effects, such as communicating with the outside world, the shared state would have to be restored in the restarted task. By eliminating communication and side effects, restarts can be handled much more gracefully.

A Closer Look

The previous figure gave a high-level view of Hadoop MapReduce. From that diagram you can see how the mapper and reducer components are used in the word count program and how they achieve their goal. Next, we take a closer look at the system to fill in more of the details.

Figure 4.5 Details of Hadoop MapReduce data flow

Figure 4.5 shows more of the mechanics of the pipeline. Although it shows only 2 nodes, the same pipeline can be replicated across a system spanning a large number of nodes. The next several paragraphs describe each stage of a MapReduce program in detail.

Input files: This is where the data for a MapReduce task is initially stored. The input files typically reside in HDFS. Their format is arbitrary: they may be line-based log files, a binary format, multi-line input records, or something else entirely. These files are usually very large, tens of gigabytes or more.

InputFormat: The InputFormat class defines how the input files are split up and read. It provides several functions:

Selects the files or other objects that should be used as input;
Defines the InputSplits that break a file into tasks;
Provides a factory for RecordReader objects that read the file;

Hadoop comes with several input formats. An abstract class called FileInputFormat is provided, and all InputFormat classes that operate on files inherit functionality and properties from it. When a Hadoop job starts, FileInputFormat is given a path containing the files to process and reads all files in that directory (by default, subdirectories are not included). It then divides these files into one or more InputSplits. You can choose the input format applied to your job's input files by calling the setInputFormat() method of the JobConf object. The following table lists some of the standard input formats:

Input format	Description	Key	Value
TextInputFormat	Default format; reads lines of text files	The byte offset of the line	The contents of the line
KeyValueTextInputFormat	Parses lines into key-value pairs	Everything up to the first tab character	The remainder of the line
SequenceFileInputFormat	A Hadoop-specific high-performance binary format	User-defined	User-defined

Table 4.1: Input formats provided by MapReduce

The default input format is TextInputFormat, which treats each line of the input file as a separate record and performs no parsing. This is useful for unformatted data or line-based records such as log files. A more interesting input format is KeyValueTextInputFormat, which also treats each line of the input file as a separate record. The difference is that TextInputFormat uses the entire line as the value, whereas KeyValueTextInputFormat breaks the line into a key and a value by searching for a tab character. This is particularly useful for reading the output of one MapReduce job as the input to another, because the default output format (described in more detail below) writes its results in exactly the form this input format expects. Finally, SequenceFileInputFormat reads special Hadoop-specific binary files that include many features designed to let Hadoop mappers read data quickly. Sequence files are block-compressed and provide direct serialization and deserialization of several data types (not just text). Sequence files can be generated as the output of a MapReduce task and are an efficient intermediate representation for data passing from one MapReduce job to another.
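
To make the difference between the two line-based formats concrete, here is a minimal sketch in plain Java (no Hadoop classes) of splitting a line at its first tab character, as the table describes. The class name KeyValueLineSketch and its parse method are hypothetical, and the handling of a line with no tab (the whole line becomes the key, with an empty value) is an assumption of this sketch rather than documented behavior.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

class KeyValueLineSketch {
    // Splits a line at the first tab: the key is everything before it,
    // the value is everything after it (later tabs stay in the value).
    // Assumption: a line without a tab becomes a key with an empty value.
    static Map.Entry<String, String> parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new SimpleEntry<String, String>(line, "");
        }
        return new SimpleEntry<String, String>(line.substring(0, tab), line.substring(tab + 1));
    }

    public static void main(String[] args) {
        Map.Entry<String, String> kv = parse("cat\t3");
        System.out.println("key=" + kv.getKey() + " value=" + kv.getValue());
    }
}
```

A format that uses the whole line as the value would skip this step and pair the unparsed line with its byte offset instead.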

InputSplit: An input split describes the unit of work that makes up a single map task in a MapReduce program. A MapReduce program applied to a data set, collectively referred to as a job, consists of several (possibly several hundred) tasks. A map task may read a whole file, but more often it reads only part of one. By default, FileInputFormat and its subclasses break a file into 64 MB chunks (the same size as the default HDFS block; Hadoop recommends keeping the split size equal to the block size). You can control this value by setting the mapred.min.split.size parameter in hadoop-site.xml (mapred-default.xml in 0.20.* and later), or by overriding the parameter in the JobConf object of a specific MapReduce job. Processing a file in chunks lets several map tasks operate on a single file in parallel. If the file is very large, this parallelism can improve performance significantly. Even more importantly, since the blocks that make up a file may be spread across several nodes in the cluster (in practice, they are), the tasks can be scheduled on each of those nodes, so every block is processed locally instead of being transferred from one node to another. Log files can be split in this straightforward way, but some file formats cannot be processed in chunks. In that case, you can write a custom InputFormat that controls how your files are split (or not split) into chunks. Custom input formats are described in part five.
The input format defines the list of map tasks that make up the mapping phase; each task corresponds to a single input split. The tasks are then assigned to nodes in the system based on where the blocks of the input file physically reside, and an individual node may be assigned several map tasks. Once the tasks are assigned, each node begins running them, attempting to perform as many in parallel as it can. The maximum number of tasks a node runs in parallel is controlled by the mapred.tasktracker.map.tasks.maximum parameter.
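
The chunking described above can be sketched as simple arithmetic. The following plain-Java example is an illustration only, not Hadoop's actual split logic: SplitSketch and computeSplits are hypothetical names, the split size is assumed to be positive, and the sketch ignores block locations and any special handling of the final chunk.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSketch {
    // Returns {offset, length} pairs that together cover a file of fileLength bytes.
    // Each pair stands for one input split, and therefore one map task.
    // Assumption: splitSize > 0.
    static List<long[]> computeSplits(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<long[]>();
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            long length = Math.min(splitSize, fileLength - offset);
            splits.add(new long[] {offset, length});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        for (long[] split : computeSplits(200 * mb, 64 * mb)) {
            System.out.println("offset=" + split[0] + " length=" + split[1]);
        }
    }
}
```

For a 200 MB file and a 64 MB split size, this yields three 64 MB splits followed by one 8 MB split, each of which would become one map task.
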
RecordReader: The InputSplit defines a slice of work but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into key-value pairs suitable for reading by the mapper. The RecordReader instance is defined by the input format. The default input format, TextInputFormat, provides a LineRecordReader, which treats each line of the input file as a new value; the key associated with each line is its byte offset in the file. The RecordReader is invoked repeatedly on the input until the entire input split has been consumed, and each invocation of the RecordReader leads to a call to the map() method of the mapper.
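
As a rough illustration of the (byte offset, line) records described above, the following plain-Java sketch walks over a string and records where each line starts. It is not the LineRecordReader implementation: LineOffsetSketch and readRecords are hypothetical names, and the sketch assumes UTF-8 text with "\n" line endings only.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class LineOffsetSketch {
    // Maps the byte offset at which each line starts to the text of that line.
    // Assumptions: UTF-8 encoding and "\n" as the only line terminator.
    static Map<Long, String> readRecords(String content) {
        Map<Long, String> records = new LinkedHashMap<Long, String>();
        long offset = 0;
        int start = 0;
        while (start < content.length()) {
            int end = content.indexOf('\n', start);
            if (end < 0) {
                end = content.length(); // last line has no trailing newline
            }
            String line = content.substring(start, end);
            records.put(offset, line);
            // Advance by the encoded size of the line plus one byte for '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
            start = end + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        for (Map.Entry<Long, String> r : readRecords("hello world\nfoo\n").entrySet()) {
            System.out.println(r.getKey() + " -> " + r.getValue());
        }
    }
}
```

Each entry of the returned map corresponds to one call to map(): the key is the offset and the value is the line.
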
Mapper: The mapper performs the interesting user-defined work of the first phase of a MapReduce program. Given a key and a value, the map() method emits zero or more key-value pairs, which are forwarded to the reducers. A new mapper instance is created in a separate Java process for each map task (input split) that makes up part of the job's total input, and the individual mappers cannot communicate with one another. As a result, the reliability of each map task is unaffected by other map tasks and depends only on the reliability of the local machine. In addition to the key and the value, the map() method receives two more parameters (note: in versions after 0.20.x the interface changed, and a Context object replaces these two parameters):

The OutputCollector object has a method named collect(), which is used to forward a key-value pair to the reduce phase of the job.
The Reporter object provides information about the current task. Its getInputSplit() method returns an object describing the current input split. It also lets the map task report additional information about its progress to the rest of the system: the setStatus() method emits a status message that is fed back to the user, and the incrCounter() method increments shared performance counters. You can define as many counters as you like in addition to the default ones. Each mapper can increment the counters, and the JobTracker collects the increments made by the different processes and aggregates them for retrieval when the job ends.
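
The shape of a map() call can be sketched without Hadoop on the classpath. In the example below, the nested Collector interface is a hypothetical stand-in for OutputCollector (only its collect() method is modeled, and the Reporter is left out entirely); WordCountMapSketch is likewise an invented name. It shows a word count mapper that receives a byte offset and a line and emits one (word, 1) pair per token.

```java
import java.util.ArrayList;
import java.util.List;

class WordCountMapSketch {
    // Hypothetical stand-in for Hadoop's OutputCollector; only collect() is modeled.
    interface Collector {
        void collect(String key, int value);
    }

    // Emits (word, 1) for every whitespace-separated token in the line.
    // The offset key is ignored, as is typical for word count.
    static void map(long offset, String line, Collector out) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.collect(word, 1);
            }
        }
    }

    public static void main(String[] args) {
        final List<String> emitted = new ArrayList<String>();
        map(0L, "the cat saw the cat", (key, value) -> emitted.add(key + "=" + value));
        System.out.println(emitted);
    }
}
```

Note that a blank line produces no output at all, which is why a mapper is best described as emitting zero or more pairs per input record.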

Partition & Shuffle: After the first map tasks complete, the nodes may still be running further map tasks, but they also begin exchanging the intermediate output of the map tasks with the reducers that need it. This process of moving map output to the reducers is known as shuffling. Each reduce node is assigned a different subset of the keys in the intermediate key space; these subsets (known as "partitions") are the input to the reduce tasks. The key-value pairs produced by a map task may belong to any partition, and all values for the same key are always reduced together, regardless of which mapper produced them. Therefore, all map nodes must agree on where to send each piece of intermediate data. The Partitioner class determines which partition a given key-value pair goes to. The default partitioner computes a hash value for the key and assigns the partition based on the result. Custom partitioners are described in detail in part five.
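
The hash-based assignment described above can be sketched in a few lines of plain Java. HashPartitionSketch and partitionFor are hypothetical names. Masking off the sign bit before taking the remainder is a detail of this sketch that keeps negative hash codes inside the valid range; it is believed to match what Hadoop's default partitioner does, but treat that as an assumption.

```java
class HashPartitionSketch {
    // Maps a key to a partition in the range [0, numPartitions).
    // The mask clears the sign bit so a negative hashCode() cannot yield a negative index.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String[] words = {"cat", "dog", "cat", "bird"};
        for (String word : words) {
            System.out.println(word + " -> partition " + partitionFor(word, 4));
        }
    }
}
```

Because the result depends only on the key and the number of partitions, every map node computes the same destination for the same key without any coordination, which is exactly the agreement the shuffle needs.
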
Sort: Each reduce task is responsible for reducing all the values associated with each of its keys, and the set of intermediate keys received by a node is automatically sorted by Hadoop before it is presented to the reducer.
Reduce: A reducer instance is created for each reduce task. It is an instance of user-provided code that performs the second important phase of a job. For each key in the partition assigned to a reducer, the reducer's reduce() method is called once. It receives the key along with an iterator over all the values associated with that key; the iterator returns those values in an undefined order. The reducer also receives OutputCollector and Reporter objects, which are used in the same way as in the map() method.
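
The sort and reduce stages can be sketched together in plain Java. In this illustration (all names are hypothetical, and no Hadoop classes are involved), a TreeMap plays the role of the framework's sort by keeping keys ordered, and reduce() is called exactly once per key with an iterator over that key's values. Summation is used as the reduce function, as in word count.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SortReduceSketch {
    // Convenience constructor for an intermediate (key, value) pair.
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new SimpleEntry<String, Integer>(key, value);
    }

    // User-defined reduce step: called once per key with an iterator over its values.
    static int reduce(String key, Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    // Groups the pairs by key (the TreeMap keeps keys sorted), then reduces each group once.
    static Map<String, Integer> sortAndReduce(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<Integer>()).add(p.getValue());
        }
        Map<String, Integer> output = new LinkedHashMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            output.put(e.getKey(), reduce(e.getKey(), e.getValue().iterator()));
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(sortAndReduce(Arrays.asList(
                pair("the", 1), pair("cat", 1), pair("the", 1))));
    }
}
```

Since the sum does not depend on the order of the values, this reduce function is safe under the undefined iteration order mentioned above.
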
OutputFormat: The key-value pairs provided to the OutputCollector are written to output files, and the way they are written is governed by the output format. OutputFormat works much like the InputFormat class described earlier. The OutputFormat instances provided by Hadoop write files to the local disk or to HDFS, and they all inherit from a common FileOutputFormat class. Each reducer writes its results to a separate file in a common output directory. These files are typically named part-nnnnn, where nnnnn is the partition ID associated with the reduce task. The output directory is set by FileOutputFormat.setOutputPath(). You can choose a particular output format by calling the setOutputFormat() method of the JobConf object that defines your MapReduce job. The following table lists the output formats provided:

Output format	Description
TextOutputFormat	Default; writes lines in "key \t value" form
SequenceFileOutputFormat	Writes binary files suitable for reading into subsequent MapReduce jobs
NullOutputFormat	Disregards its inputs and writes nothing

Table 4.2: Output formats provided by Hadoop
Hadoop provides several OutputFormat instances for writing files. The basic (default) one is TextOutputFormat, which writes each key-value pair on its own line of a text file. A later MapReduce task can easily read this output back in through the KeyValueTextInputFormat class, and it is also human-readable. A better intermediate format for use between MapReduce jobs is SequenceFileOutputFormat, which rapidly serializes arbitrary data types to a file; the corresponding SequenceFileInputFormat deserializes the file into the same types and presents the data to the next mapper in the same form in which the previous reducer emitted it. NullOutputFormat generates no output files and discards any key-value pairs passed to it by the OutputCollector. It is useful when you explicitly write your own output files in the reduce() method and do not want the Hadoop framework to emit additional empty output files.
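
The two conventions of the default text output, one file per partition and one tab-separated pair per line, can be sketched in plain Java. TextOutputSketch and its methods are hypothetical names, and the five-digit zero padding simply follows the nnnnn pattern used in this article.

```java
import java.util.Locale;

class TextOutputSketch {
    // Builds a per-reducer file name from the partition ID, zero-padded to five digits.
    static String partFileName(int partition) {
        return String.format(Locale.ROOT, "part-%05d", partition);
    }

    // Formats one output record as a line: key, a tab, then the value.
    static String formatRecord(String key, Object value) {
        return key + "\t" + value;
    }

    public static void main(String[] args) {
        System.out.println(partFileName(3));
        System.out.println(formatRecord("cat", 3));
    }
}
```

A line written this way splits back into the same key and value when a later job breaks it at the first tab, which is what makes the default text output easy to chain.
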
RecordWriter: Much as InputFormat reads individual records through a RecordReader, the OutputFormat class is a factory for RecordWriter objects, which are used to write the individual records to the files as directed by the OutputFormat.
The output files written by the reducers are left in HDFS for your use, whether by another MapReduce job, a separate program, or manual inspection.

Additional MapReduce functionality

Figure 4.6 A MapReduce data flow with a combiner inserted
Combiner: The pipeline shown earlier omits a step that can optimize the bandwidth used by a MapReduce job. This step, called the combiner, runs after the mapper and before the reducer. The combiner is optional; if it is appropriate for your job, an instance of the combiner runs on every node that runs map tasks. The combiner receives as input the data emitted by the mapper instances on a given node, and its output, rather than the mappers' output, is then sent to the reducers. The combiner is a "mini-reduce" process that operates only on data generated by a single machine.
Word count is a prime example of where a combiner is useful. The word count program emits a (word, 1) pair for every instance of every word it sees. So if "cat" appears 3 times in the same document, the pair ("cat", 1) is emitted 3 times, and all of these pairs are sent to the reducer. With a combiner, they can be condensed into a single ("cat", 3) pair to send to the reducer. Each node now sends only one value per word to the reducer, which greatly reduces the bandwidth required by the shuffle and speeds up the job. Best of all, we do not need to write any additional code to take advantage of this! If your reduce function is both commutative and associative, it can also be used as the combiner. You can enable combining in the word count program by adding the following line to the driver:

conf.setCombinerClass(Reduce.class);

The combiner should be an instance of the Reducer interface. If your reducer cannot itself be used as a combiner because it is not commutative or not associative, you can still write a third class to serve as the combiner for your job.
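
The effect of a combiner on the word count example can be sketched in plain Java. The names below (CombinerSketch, sumByKey) are hypothetical and no Hadoop classes are used. The same summing function is applied twice: once to one node's map output, acting as the combiner, and once to the combined pairs, acting as the reducer.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerSketch {
    // Sums the values per key. Addition is commutative and associative,
    // so the same function can serve as both combiner and reducer.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<String, Integer>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // Mapper output on one node: ("cat", 1) three times and ("dog", 1) once.
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<Map.Entry<String, Integer>>();
        for (int i = 0; i < 3; i++) {
            mapOutput.add(new SimpleEntry<String, Integer>("cat", 1));
        }
        mapOutput.add(new SimpleEntry<String, Integer>("dog", 1));

        // Combiner pass: four pairs shrink to two before the shuffle.
        Map<String, Integer> combined = sumByKey(mapOutput);
        System.out.println(mapOutput.size() + " pairs -> " + combined.size() + " pairs: " + combined);

        // Reducer pass over the combined pairs gives the same totals.
        System.out.println(sumByKey(new ArrayList<Map.Entry<String, Integer>>(combined.entrySet())));
    }
}
```

A function such as an average would not survive this double application unchanged, which is the kind of case where a separate combiner class is needed.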

Fault Tolerance
One of the main reasons to use Hadoop to run your jobs is its high degree of fault tolerance. Even when jobs run on a large cluster in which individual nodes or network components fail at a high rate, Hadoop can guide them to successful completion.
The primary way Hadoop achieves fault tolerance is by restarting tasks. Individual task nodes (TaskTrackers) are in constant communication with the head node of the system (the JobTracker). If a TaskTracker fails to communicate with the JobTracker for a period of time (by default, 1 minute), the JobTracker assumes that the TaskTracker in question has crashed. The JobTracker knows which map and reduce tasks were assigned to each TaskTracker.
If the job is still in the mapping phase, other TaskTrackers are asked to re-execute all map tasks previously run by the failed TaskTracker. If the job is in the reduce phase, other TaskTrackers re-execute all reduce tasks that were in progress on the failed TaskTracker.
Reduce tasks write their data to HDFS once they complete. Thus, if a TaskTracker has already completed 2 of the 3 reduce tasks assigned to it, only the third task must be executed again. Map tasks are slightly more complicated: even if a node has completed 10 map tasks, the reducers may not yet have copied all of the output of those tasks. If the node crashes at that point, its mapper output is inaccessible. Any completed map tasks must therefore be re-executed so that their output is available to the remaining reducers. All of this is handled automatically by the Hadoop platform.
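
The re-execution rule above can be condensed into a small decision function. This is a sketch of the reasoning only, not JobTracker code: RerunSketch, Kind, State, and mustRerun are hypothetical names, and the sketch assumes that at least some reducers still need the failed node's map output.

```java
class RerunSketch {
    enum Kind { MAP, REDUCE }
    enum State { RUNNING, COMPLETED }

    // Decides whether a task that was assigned to a failed node must run again.
    // Assumption: some reducers have not yet fetched the node's map output.
    static boolean mustRerun(Kind kind, State state) {
        if (state == State.RUNNING) {
            return true; // unfinished work is lost with the node
        }
        // Completed reduce output is already in HDFS and survives the failure;
        // completed map output was only reachable through the failed node.
        return kind == Kind.MAP;
    }

    public static void main(String[] args) {
        for (Kind kind : Kind.values()) {
            for (State state : State.values()) {
                System.out.println(kind + " " + state + " -> rerun=" + mustRerun(kind, state));
            }
        }
    }
}
```

The only combination that is safe to skip is a completed reduce task, which matches the 2-of-3 example above.
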
This fault tolerance underscores the need for program execution to be free of side effects. If mappers and reducers had individual identities and communicated with one another or with the outside world, then restarting a task would require the other nodes to communicate with the new map or reduce instance, and the re-executed task would have to rebuild its earlier intermediate state. That process is complex and error-prone. MapReduce simplifies the problem dramatically by eliminating task identities and the ability of tasks to communicate with each other. An individual task sees only its own direct inputs and knows only its own outputs, which makes failure and restart a clean and dependable process.
Speculative execution: One problem with the Hadoop system is that, by dividing the work across many nodes, a few slow nodes can hold up the rest of the program. For example, if a node has a slow disk controller, it may read its input at only 10% of the speed of all the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task, which takes much longer than the others.
Because tasks are forced to run in isolation from one another, individual tasks do not know where their inputs come from. Tasks trust the Hadoop platform to deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel to exploit differences in machine capabilities. When most of the tasks in a job are coming to a close, the Hadoop platform schedules redundant copies of the remaining tasks across several nodes that have no other work to do. This process is known as speculative execution. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies are still executing speculatively, Hadoop tells the TaskTrackers to abandon those tasks and discard their output, and the reducers then receive their input from whichever mapper completed successfully first.
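
The "first copy to finish wins" idea has a close analogue in the Java standard library, which the following sketch uses as an illustration. It is not how Hadoop schedules attempts: SpeculativeSketch, attempt, and firstToFinish are hypothetical names, the sleep stands in for real work, and the node names are invented. ExecutorService.invokeAny returns the result of one task that completed successfully and cancels the rest.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SpeculativeSketch {
    // Builds one attempt that "processes" the shared input after a simulated delay
    // and reports which node did the work.
    static Callable<String> attempt(String node, long delayMillis) {
        return () -> {
            Thread.sleep(delayMillis);
            return node;
        };
    }

    // Runs all attempts in parallel and returns the node whose attempt finished first.
    // The remaining attempts are cancelled and their results discarded.
    static String firstToFinish(List<Callable<String>> attempts) {
        ExecutorService pool = Executors.newFixedThreadPool(attempts.size());
        try {
            return pool.invokeAny(attempts);
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        String winner = firstToFinish(Arrays.asList(
                attempt("slow-node", 2000), attempt("fast-node", 50)));
        System.out.println("definitive copy came from " + winner);
    }
}
```

Running duplicates like this is only safe because both attempts read the same input and have no side effects, the same property that makes task restarts safe.
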
Speculative execution is enabled by default. You can disable it for mappers and reducers by setting mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution, respectively, to false in the JobConf.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
