MapReduce Data Flow


The core components of Hadoop work together as shown in the following figure:

Figure 4.4 High-level MapReduce pipeline

The input to a MapReduce job typically comes from files stored in HDFS on the nodes of the cluster. Running a MapReduce program launches map tasks on many, and possibly all, of the nodes in the cluster, and every map task is identical: mappers have no particular "identity" associated with them, so any mapper can process any input file. Each mapper loads the set of files stored locally on the node where it runs and processes them. This is "moving the computation to the data": the computation is shipped to the node where the data already resides, which avoids extra data-transfer overhead.

When the mapping phase completes, the intermediate key-value pairs generated in that phase must be exchanged between nodes so that all values with the same key are sent to the same reducer. The reduce tasks are spread across the nodes of the cluster in the same way as the mappers. This is the only communication step in MapReduce: individual map tasks do not exchange information with one another, and are not even aware that other map tasks exist. Likewise, different reduce tasks never communicate with each other. The user can never explicitly marshal information from one machine to another; all data transfer is performed by the Hadoop MapReduce platform itself, guided implicitly by the keys associated with the values. This is a fundamental element of Hadoop MapReduce's reliability: if nodes in the cluster fail, tasks must be restartable. If tasks performed side-effect operations, for example communicating with the outside world, any shared state would have to be restored before a task could be restarted. By eliminating communication between tasks and side effects, restarts can be handled gracefully.

A closer look

The previous figure described the Hadoop MapReduce pipeline at a high level. From that diagram you can see where the mapper and reducer components of the word-count program fit in, and how they achieve their goal. Next, we will look at the system more closely to get more detail.

Figure 4.5 Detailed Hadoop MapReduce data flow

Figure 4.5 shows the pipeline in more detail. Although only two nodes are shown, the same pipeline can be replicated across a system spanning a large number of nodes. The next few paragraphs describe each stage of a MapReduce program in detail.

1. Input files:

The input files are where the data for a MapReduce task is initially stored. They normally reside in HDFS. The format of these files is arbitrary: we can use line-based log files, binary formats, multi-line input records, or some other format. These files are typically very large, tens of gigabytes or more.

2. Input format (InputFormat):

The InputFormat class defines how input files are split up and read. It provides several functions:

    • Selects the files or objects that should be used as input;
    • Defines the InputSplits that break a file up into tasks;
    • Provides a factory for RecordReader objects that read the file;

Hadoop comes with several input formats. An abstract class called FileInputFormat sits at the top: all InputFormat classes that operate on files inherit functionality and properties from it. When a Hadoop job is started, FileInputFormat is given a path containing the files to process; it reads all the files in that directory (by default, subdirectories are not included) and then divides each file into one or more InputSplits. You can set the input format applied to your job's input files with the setInputFormat() method of the JobConf object. The following table shows some of the standard input formats:

Input format              Description                                          Key                                             Value
TextInputFormat           Default format; reads lines of text files            The byte offset of the line                     The contents of the line
KeyValueInputFormat       Parses lines into key-value pairs                    All characters up to the first tab character    The remainder of the line
SequenceFileInputFormat   High-performance binary format defined by Hadoop     User-defined                                    User-defined

Table 4.1: Input formats provided by MapReduce

The default input format is TextInputFormat, which treats each line of an input file as a separate record and performs no parsing. This is useful for unformatted data or line-based records such as log files. A more interesting input format is KeyValueInputFormat, which also treats each line of the input file as a separate record. Whereas TextInputFormat uses the entire line as the value, however, KeyValueInputFormat searches for a tab character and splits the line into a key-value pair at that point. This is especially useful when the output of one MapReduce job is used as the input to the next, because the default output format (described in more detail below) writes its data in a format that KeyValueInputFormat can read. Finally, SequenceFileInputFormat reads special Hadoop-specific binary files that include many features allowing Hadoop mappers to read data quickly. Sequence files are block-compressed and provide direct serialization and deserialization of several data types, not just text. Sequence files can be generated as the output of a MapReduce task, and they are an efficient way to feed the intermediate data of one MapReduce job into another.
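To make this concrete, the snippet below shows how an input format might be selected through JobConf in the classic org.apache.hadoop.mapred API. It is a minimal sketch rather than code from this text: the InputFormatExample class name is invented for illustration, and KeyValueTextInputFormat is the concrete class in that API that provides the key-value behaviour described above.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;

    public class InputFormatExample {
        public static void configureInput(JobConf conf) {
            // TextInputFormat is the default, so it never has to be set explicitly.
            // Switch to the key-value format when the input was produced by a
            // previous job's TextOutputFormat (tab-separated key-value lines).
            conf.setInputFormat(KeyValueTextInputFormat.class);
        }
    }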

3. Input splits (InputSplit):

An InputSplit describes the unit of work that makes up a single map task in a MapReduce program. A MapReduce program applied to a data set, referred to collectively as a job, is made up of several, possibly hundreds, of tasks. A map task may read a whole file, but it generally reads only part of one. By default, FileInputFormat and its subclasses break files into 64 MB chunks (the same size as the default HDFS block; Hadoop recommends keeping the split size and the block size the same). You can set the mapred.min.split.size parameter in hadoop-site.xml (in releases after 0.20.* the defaults live in mapred-default.xml) to control the split size, or override the parameter in the JobConf object of a particular MapReduce job. By processing a file in chunks, we allow several map tasks to operate on a single file in parallel. If the file is very large, this can improve performance significantly through parallel processing. More importantly, because the blocks that make up a file may be spread across several nodes in the cluster (and in practice they are), the tasks can be scheduled on those different nodes, so each block is processed locally rather than being transferred from one node to another. Of course, while log files can be split this way sensibly, some file formats do not support chunked processing. In that case you can write a custom InputFormat that controls how your files are split (or not split) into chunks. Custom file formats are described in part five.
The input format defines the list of map tasks that make up the mapping phase; each task corresponds to one input split. The tasks are then scheduled onto cluster nodes according to the physical location of the input file blocks, and several map tasks may be assigned to the same node. Once the tasks are scheduled, a node starts running them, attempting to run as many in parallel as it can. The maximum number of map tasks that may run in parallel on a node is controlled by the mapred.tasktracker.map.tasks.maximum parameter.
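For example, the split size can be overridden per job through JobConf, which inherits a generic set() method from Configuration. This is a hedged sketch: the SplitSizeExample class name and the 128 MB value are chosen only for illustration.

    import org.apache.hadoop.mapred.JobConf;

    public class SplitSizeExample {
        public static void configureSplits(JobConf conf) {
            // Raise the minimum split size to 128 MB (value is in bytes) for this
            // job only, overriding the cluster-wide mapred.min.split.size setting.
            conf.set("mapred.min.split.size", "134217728");
        }
    }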

4. Record reader (RecordReader):

The InputSplit defines how to slice up the work, but does not describe how to access it. The RecordReader class actually loads the data and converts it into key-value pairs suitable for reading by the mapper. The RecordReader instance is determined by the input format; the default input format, TextInputFormat, provides a LineRecordReader, which treats each line of the input file as a new value. The key associated with each line is its byte offset in the file. The RecordReader is invoked repeatedly on the input split until the entire split has been consumed, and each invocation of the RecordReader leads to a call to the mapper's map() method. Conceptually, the framework drives the reader as in the sketch below.
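The following is a rough sketch, in the classic org.apache.hadoop.mapred API, of how a map task drives a RecordReader; it is not the framework's actual source, and the MapRunnerSketch class name is made up for this example.

    import java.io.IOException;

    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class MapRunnerSketch {
        public static <K1, V1, K2, V2> void run(RecordReader<K1, V1> reader,
                                                Mapper<K1, V1, K2, V2> mapper,
                                                OutputCollector<K2, V2> output,
                                                Reporter reporter) throws IOException {
            K1 key = reader.createKey();        // e.g. LongWritable (byte offset) for LineRecordReader
            V1 value = reader.createValue();    // e.g. Text holding the line contents
            while (reader.next(key, value)) {   // advance to the next record in the split
                mapper.map(key, value, output, reporter);
            }
            reader.close();
        }
    }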

5. Mapper:

The mapper performs the interesting user-defined work of the first phase of a MapReduce program. Given a key-value pair, the map() method generates one or more key-value pairs, which are forwarded to the reducers. A new mapper instance is instantiated in a separate Java process for each map task (input split) that makes up the job's input, and individual mappers cannot communicate with one another. This means the reliability of each map task is unaffected by the other map tasks and depends only on the reliability of the local machine. In addition to the key and value, the map() method receives two further parameters (note: in releases from 0.20.x onward the interface has changed, and a Context object replaces these two parameters); a minimal mapper using this interface is sketched after the list below.

    • The OutputCollector object has a collect() method, which is used to forward key-value pairs to the reduce phase of the job.
    • The Reporter object provides information about the current task: its getInputSplit() method returns an object describing the current input split, and it also lets the map task report on its progress. The setStatus() method lets you generate a status message that is fed back to the user, and the incrCounter() method lets you increment shared performance counters; besides the default counters, you may define as many additional counters as you like. Each mapper can increment the counters, and the JobTracker collects the increments from the different processes and aggregates them so they can be read once the job has finished.
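The class below is a minimal word-count mapper written against the classic org.apache.hadoop.mapred interface described above. It is a sketch rather than code from this text; the WordCountMapper class name and the whitespace tokenization are illustrative choices.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            // With TextInputFormat: key = byte offset of the line, value = the line itself.
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                word.set(token);
                output.collect(word, ONE);   // emit (word, 1) toward the reduce phase
            }
        }
    }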

6. Partition & Shuffle:

When the first map tasks complete, a node may continue to run further map tasks, but at this point it also begins exchanging the intermediate outputs of the map tasks with the reducers that need them; this process of moving map output to the reducers is called the shuffle. Each reduce node is assigned a different subset of the keys in the intermediate key space; these subsets (known as "partitions") are the input to the reduce tasks. The key-value pairs produced by a map task may belong to any partition, but all values for the same key are always reduced together, regardless of which mapper produced them. Therefore, all map nodes must agree on where to send each piece of intermediate data. The Partitioner class determines which partition a given key-value pair will go to; the default partitioner computes a hash of the key and assigns the key to a partition based on the result. Custom partitioners are described in detail in part five.
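For illustration, the class below is a hedged sketch of a custom partitioner in the classic org.apache.hadoop.mapred API: it routes keys to partitions by their first character instead of hashing the whole key. The FirstLetterPartitioner class name is invented here; the default hashing behaviour described above corresponds to Hadoop's HashPartitioner.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {

        public void configure(JobConf job) {
            // No configuration is needed for this example.
        }

        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // Keys starting with the same character always land in the same
            // partition, and therefore reach the same reducer.
            String s = key.toString();
            int firstChar = s.isEmpty() ? 0 : s.charAt(0);
            return (firstChar & Integer.MAX_VALUE) % numPartitions;
        }
    }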

7. Sort:

Each reduce task is responsible for reducing all the values associated with the same key. The set of intermediate keys each node receives is automatically sorted by Hadoop before it is presented to the reducer.

8. Reduce:

A Reducer instance is created for each reduce task; it is an instance of user-defined code that performs the second important phase of the job. For each key in the partition assigned to a reducer, the reducer's reduce() method is called exactly once. It receives a key and an iterator over all the values associated with that key; the iterator returns those values in an undefined order. The reducer also receives OutputCollector and Reporter objects, which are used just as in the map() method.
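A minimal word-count reducer in the classic org.apache.hadoop.mapred API is sketched below as the counterpart to the mapper sketch above; the WordCountReducer class name is illustrative.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();   // values for this key arrive in no particular order
            }
            output.collect(key, new IntWritable(sum));   // emit (word, total count)
        }
    }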

9. Output format (OutputFormat):

The key-value pairs handed to the OutputCollector are written to output files, and how they are written is controlled by the output format. OutputFormat plays a role analogous to the InputFormat class described earlier; the OutputFormat instances provided by Hadoop write files on the local disk or in HDFS, and they all inherit from a common FileOutputFormat class. Each reducer writes its results to a separate file in a common output directory; these files are typically named part-nnnnn, where nnnnn is the partition id associated with the reduce task. The output directory is set with FileOutputFormat.setOutputPath(). You can set the specific output format with the setOutputFormat() method of the JobConf object for your MapReduce job. The following table shows the output formats provided:

Output format             Description
TextOutputFormat          Default format; writes each line as "key \t value"
SequenceFileOutputFormat  Writes binary files suitable for reading into subsequent MapReduce jobs
NullOutputFormat          Ignores the data it receives, i.e. writes no output

Table 4.2: Output formats provided by Hadoop

Hadoop provides several OutputFormat instances for writing to files. The basic (default) one is TextOutputFormat, which writes each key-value pair on its own line of a text file. A subsequent MapReduce job can easily re-read this data through the KeyValueInputFormat class, and the format is also human-readable. A better intermediate format for use between MapReduce jobs is SequenceFileOutputFormat, which can rapidly serialize arbitrary data types to a file; the corresponding SequenceFileInputFormat deserializes the file into the same types and presents the data to the next mapper in the same form in which it left the previous reducer. NullOutputFormat generates no output files at all and discards any key-value pairs passed to it by the OutputCollector; this is useful if you explicitly write your own output files in the reduce() method and do not want the Hadoop framework to emit additional empty output files.
RecordWriter: much as InputFormat reads individual records through a RecordReader implementation, the OutputFormat class is a factory for RecordWriter objects, which are used to write individual records to the files, as directed by the OutputFormat.
The output files written by the reducers remain in HDFS for use by your other applications, such as another MapReduce job or a separate program for manual inspection.
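Finally, to show how the pieces described above are wired together, here is a hedged sketch of a job driver in the classic org.apache.hadoop.mapred API. It assumes the WordCountMapper and WordCountReducer classes sketched in sections 5 and 8; the WordCountDriver class name is likewise invented for this example, and the input and output paths are taken from the command line.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCountDriver.class);
            conf.setJobName("wordcount");

            // User-defined map and reduce code (sketched earlier).
            conf.setMapperClass(WordCountMapper.class);
            conf.setReducerClass(WordCountReducer.class);

            // Types of the keys and values handed to the output format.
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            // Input and output formats; these are the defaults, set here for clarity.
            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            // Input directory in HDFS and output directory for the part-nnnnn files.
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }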


