MapReduce Input and Output Types


The default mapper is IdentityMapper and the default reducer is IdentityReducer; both write their input keys and values to the output unchanged.

The default partitioner is HashPartitioner, which partitions each record according to the hash of its key, as sketched below.
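As a sketch, the behavior of the default HashPartitioner in the old org.apache.hadoop.mapred API amounts to the following (the class name here is just illustrative):

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Simplified view of the default hash partitioning: the partition index is
// derived from the key's hashCode(), masked to stay non-negative, modulo
// the number of reduce tasks.
public class HashPartitionerSketch<K, V> implements Partitioner<K, V> {

    public void configure(JobConf job) {
        // No configuration needed for hash partitioning.
    }

    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```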

Input files: The input files are where the data for a MapReduce task is initially stored. Normally the input files reside in HDFS. Their format is arbitrary: we can use line-based log files, binary files, multi-line input records, or some other format. These files are typically large, dozens of gigabytes or more.

Small files and CombineFileInputFormat

Hadoop performs somewhat poorly when dealing with a large number of small files, because each InputSplit generated by FileInputFormat covers either an entire input file or part of one. If the files are small and numerous, each map operation processes very little input data, yet there are many map tasks, and starting each additional map task incurs a certain performance cost.

CombineFileInputFormat alleviates this problem by optimizing for exactly this situation. Whereas FileInputFormat turns each file into one or more splits, CombineFileInputFormat can pack multiple files into a single input split, so that each map operation has more data to process. CombineFileInputFormat also takes node and rack locality into account when deciding which files to pack into the same split, so it does not compromise the efficiency of an ordinary MapReduce job.
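The class names below come from the newer org.apache.hadoop.mapreduce API (Hadoop 2.x), where a concrete CombineTextInputFormat is available; the input path and the 128 MB split ceiling are illustrative assumptions, not values from this article. A minimal sketch:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SmallFilesJobSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // Pack many small text files into fewer, larger splits so each map
        // task receives a reasonable amount of input.
        job.setInputFormatClass(CombineTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/small-files")); // hypothetical path
        // Cap each combined split at roughly one HDFS block (128 MB here).
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        // ... set mapper, reducer and output path, then job.waitForCompletion(true).
    }
}
```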

Input format: The InputFormat class defines how the input files are split up and read. It provides several functions:

    • Selects the files or objects that should be used as input;
    • Defines the InputSplits that break a file up into tasks;
    • Provides a factory for RecordReader objects that read the file;

Hadoop comes with several input formats. Among them is an abstract type called FileInputFormat; all InputFormats that operate on files inherit functionality and properties from it. When a Hadoop job is started, FileInputFormat is given a path containing the files to process, and it reads all the files in that directory (by default it does not descend into subdirectories). It then divides these files into one or more InputSplits. You can choose the input format to apply to your job's input files with the setInputFormat() method of the JobConf object. Some standard input formats are shown in the table below:
TextInputFormat: the default format; reads lines of text files. Key: the byte offset of the line. Value: the contents of the line.
KeyValueInputFormat: parses each line into a key-value pair. Key: everything up to the first tab character. Value: the remainder of the line.
SequenceFileInputFormat: a high-performance binary format defined by Hadoop. Key: user-defined. Value: user-defined.
SequenceFileAsTextInputFormat: a variant of SequenceFileInputFormat that converts the keys and values to Text objects by calling their toString() methods, which makes a sequence file usable as input to a Streaming job.
SequenceFileAsBinaryInputFormat: another variant of SequenceFileInputFormat that presents the keys and values of a sequence file as opaque binary objects wrapped in BytesWritable, so the application can interpret the byte arrays as whatever type it wants.
DBInputFormat: an input format that reads data from a relational database via JDBC. Because it has no sharding capability, you must be careful not to overwhelm the database with too many mappers; it is best used for loading relatively small data sets.

Table 4.1: Input formats provided by MapReduce
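A minimal sketch of selecting one of these input formats with the JobConf API mentioned above; the driver class and input path are hypothetical:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class InputFormatSketch {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(InputFormatSketch.class);
        // TextInputFormat: key = byte offset of the line, value = line contents.
        conf.setInputFormat(TextInputFormat.class);
        // Directory whose files should be processed (hypothetical path).
        FileInputFormat.setInputPaths(conf, new Path("/logs/input"));
        // ... set mapper, reducer and output settings, then run with JobClient.runJob(conf).
    }
}
```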

The default input format is TextInputFormat, which treats each line of the input file as a separate record and performs no parsing. This is useful for unformatted data or line-based records such as log files. A more interesting input format is KeyValueInputFormat, which also treats each line of the input file as a separate record; but where TextInputFormat uses the whole line as the value, KeyValueInputFormat looks for a tab character and splits the line into a key-value pair (for example, a line reading "apple<tab>42" yields the key "apple" and the value "42"). This is especially useful when the output of one MapReduce job is used as the input of the next, because the default output format (described in more detail below) writes data in exactly the format KeyValueInputFormat expects. Finally there is SequenceFileInputFormat, which reads special Hadoop-specific binary files whose characteristics allow Hadoop mappers to read data quickly. Sequence files are block-compressed and provide direct serialization and deserialization of several arbitrary data types (not just text). Sequence files can be produced as the output of a MapReduce task, and using them for the intermediate data passed from one MapReduce job to another is efficient.

Input splits (InputSplit): An input split describes the unit of work that makes up a single map task in a MapReduce program. A MapReduce program applied to a data set, referred to collectively as a job, consists of several (possibly hundreds of) tasks. A map task may read an entire file, but it typically reads only part of one. By default, FileInputFormat and its subclasses split files into 64 MB chunks (the same size as the default HDFS block; Hadoop recommends keeping the split size equal to the block size). You can set the mapred.min.split.size parameter in hadoop-site.xml (in 0.20.* and later, mapred-default.xml) to control the split size, or override the parameter in the JobConf object of a specific MapReduce job. By processing a file in chunks, we allow several map tasks to operate on a single file in parallel. If the file is very large, this parallelism can improve performance significantly. Even more importantly, because the blocks that make up a file may be spread across several nodes in the cluster (which is in fact the usual case), tasks can be scheduled on those nodes so that each block is processed locally, rather than being transferred from one node to another. Of course, while log files can be split in this chunk-wise manner sensibly, some file formats do not support chunked processing. In that case you can write a custom InputFormat to control how your files are (or are not) split into chunks.
The input format defines the list of map tasks that make up the mapping phase; each task corresponds to one input split. The tasks are then assigned to nodes according to the physical locations of the input file's blocks, and several map tasks may be assigned to the same node. Once tasks are assigned, each node begins running them, attempting to maximize parallelism. The maximum number of tasks running in parallel on a node is controlled by the mapred.tasktracker.map.tasks.maximum parameter.
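As a sketch, the mapred.min.split.size parameter mentioned above can also be raised per job directly on the JobConf; the 128 MB figure is only an example:

```java
import org.apache.hadoop.mapred.JobConf;

public class SplitSizeSketch {
    public static void main(String[] args) {
        JobConf conf = new JobConf(SplitSizeSketch.class);
        // Ask FileInputFormat not to create splits smaller than 128 MB,
        // which reduces the number of map tasks on inputs made of many blocks.
        conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);
    }
}
```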
Record reader (RecordReader): The InputSplit defines how to slice up the work, but not how to access it. The RecordReader class actually loads the data and converts it into key-value pairs suitable for reading by the Mapper. The RecordReader instance is determined by the input format; the default input format, TextInputFormat, provides a LineRecordReader, which treats each line of the input file as a new value and associates with it a key equal to the line's byte offset in the file. The RecordReader is invoked repeatedly on the input split until the entire split has been consumed; each invocation of the RecordReader leads to another call to the Mapper's map() method.
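Conceptually, the framework drives a RecordReader with a loop roughly like the following sketch (a simplified view of what the old-API MapRunner does; the helper class here is hypothetical):

```java
import java.io.IOException;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// The RecordReader produces key-value pairs until the split is exhausted,
// and map() is called once per record it produces.
public class MapLoopSketch {
    public static <K1, V1, K2, V2> void runMapper(RecordReader<K1, V1> reader,
                                                  Mapper<K1, V1, K2, V2> mapper,
                                                  OutputCollector<K2, V2> output,
                                                  Reporter reporter) throws IOException {
        K1 key = reader.createKey();
        V1 value = reader.createValue();
        while (reader.next(key, value)) {
            mapper.map(key, value, output, reporter);
        }
        reader.close();
    }
}
```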
Mapper: The Mapper performs the interesting user-defined work of the first phase of the MapReduce program. Given a key and a value, the map() method emits zero or more key-value pairs, which are forwarded to the Reducers. For each map task (one per input split) of the job, a new Mapper instance is created in a separate Java process, and Mappers cannot communicate with one another. This means the reliability of each map task is unaffected by the other map tasks and is determined only by the reliability of the local machine. Besides the key and the value, the map() method receives two additional parameters (note: in versions after 0.20.x the interface has changed and a Context object replaces these two parameters); both are illustrated in the sketch after the list below:

    • The OutputCollector object has a method named collect(), which is used to forward key-value pairs to the reduce phase of the job.
    • The Reporter object provides information about the current task: its getInputSplit() method returns an object describing the current input split, and it also lets the map task report additional information about its progress. The setStatus() method generates a status message that is fed back to the user, and the incrCounter() method increments shared high-performance counters; besides the default counters, you can define as many additional counters as you like. Each Mapper can increment the counters, and the JobTracker collects the increments made by the different processes and aggregates them for retrieval when the job finishes.
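To illustrate the old-API map() signature and the two extra parameters just described, here is a minimal, hypothetical word-count style mapper:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical word-count mapper: the input key is the byte offset supplied by
// TextInputFormat, the input value is one line of text; for every token it
// emits (word, 1) through the OutputCollector and bumps a custom counter
// through the Reporter.
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE);
        }
        reporter.incrCounter("WordCount", "LINES_PROCESSED", 1); // custom counter
    }
}
```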

Partition & Shuffle: After the first map tasks have completed, a node may still have more map tasks to run, but it also begins exchanging the intermediate outputs of the map tasks with the reducers that need them; this process of moving map outputs to the reducers is known as shuffling. Each reduce node is assigned a different subset of the keys appearing in the intermediate output; these subsets (known as "partitions") are the input data of the reduce tasks. Any key-value pair produced by a map task may belong to any partition, and all values for the same key are always reduced together, regardless of which mapper produced them. Therefore all the map nodes must agree on where to send each piece of intermediate data. The Partitioner class determines which partition a given key-value pair goes to; the default partitioner computes a hash of the key and assigns the key to a partition based on the result. Custom partitioners are described in detail in Part 5, and a minimal sketch follows below.
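A sketch of a custom partitioner using the old-API Partitioner interface; the routing rule (keys starting with "error" always go to partition 0) is made up purely for illustration:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Hypothetical partitioner: keys beginning with "error" always go to
// partition 0; all other keys are spread by hash across the remaining reducers.
public class PrefixPartitioner implements Partitioner<Text, IntWritable> {

    public void configure(JobConf job) {
        // No configuration needed.
    }

    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks <= 1 || key.toString().startsWith("error")) {
            return 0;
        }
        // Spread the remaining keys over partitions 1 .. numReduceTasks-1.
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1);
    }
}
```

It would be registered on the job with conf.setPartitionerClass(PrefixPartitioner.class).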
Sort: Each reduce task is responsible for reducing all the values associated with the same key; the intermediate keys arriving at each node are automatically sorted by Hadoop before they are presented to the Reducer.
Reduce: Each reduce task creates a Reducer instance, an instance of user-defined code responsible for the second important phase of the job. For each key in the partition assigned to a Reducer, the Reducer's reduce() method is called exactly once; it receives the key and an iterator over all the values associated with that key, and the iterator returns those values in an undefined order. The Reducer also receives OutputCollector and Reporter objects, which are used just as in the map() method; see the sketch below.
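A matching minimal reducer using the reduce() signature just described; it is hypothetical and simply sums the counts emitted by the mapper sketch above:

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical word-count reducer: reduce() is called once per key with an
// iterator over all the values for that key; the summed count is written out
// through the same OutputCollector/Reporter pair used in map().
public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
```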
Output format: The key-value pairs handed to the OutputCollector are written to output files, and how they are written is governed by the output format. OutputFormat plays a role analogous to that of the InputFormat class described earlier; the OutputFormat instances provided by Hadoop write files on the local disk or in HDFS, and all of them inherit from the common FileOutputFormat class. Each Reducer writes its output to a separate file in a common output directory; these files are typically named part-nnnnn, where nnnnn is the partition ID associated with the reduce task. The output directory is set with FileOutputFormat.setOutputPath(), and the concrete output format is chosen with the setOutputFormat() method of the JobConf object of the MapReduce job. The output formats provided are shown in the table below:

TextOutputFormat: the default output format; writes each record as a line of the form "key \t value".
SequenceFileOutputFormat: writes binary files suitable for reading as the input of a subsequent MapReduce job.
NullOutputFormat: produces no output files and discards the key-value pairs passed to it.
SequenceFileAsBinaryOutputFormat: the counterpart of SequenceFileAsBinaryInputFormat; writes keys and values as binary data into a sequence file.
MapFileOutputFormat: writes the results to a MapFile. The keys in a MapFile must be added in sorted order, so the Reducer must emit its output keys in sorted order.

Table 4.2: Output formats provided by Hadoop
Hadoop provides several OutputFormat instances for writing to files. The basic (default) one is TextOutputFormat, which writes each key-value pair on its own line of a text file. A subsequent MapReduce job can simply read this data back through the KeyValueInputFormat class, and the format is also human-readable. A format better suited for passing data between MapReduce jobs is SequenceFileOutputFormat, which rapidly serializes arbitrary data types to a file; the corresponding SequenceFileInputFormat deserializes the file into the same types and presents the data to the next Mapper in exactly the form it was emitted by the previous Reducer. NullOutputFormat generates no output files and discards any key-value pairs passed to it by the OutputCollector; it is useful when you explicitly write your own output files inside the reduce() method and do not want the Hadoop framework to write additional, empty output files.
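A sketch of wiring up the output side with the JobConf methods mentioned above; the output path and the choice of SequenceFileOutputFormat are illustrative:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class OutputFormatSketch {
    public static void main(String[] args) {
        JobConf conf = new JobConf(OutputFormatSketch.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        // SequenceFileOutputFormat keeps the data binary, so a follow-on job
        // can read it back with SequenceFileInputFormat.
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        // Each reducer writes one part-nnnnn file under this directory (hypothetical path).
        FileOutputFormat.setOutputPath(conf, new Path("/jobs/out"));
    }
}
```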
RecordWriter: Much as the InputFormat reads individual records through a RecordReader implementation, the OutputFormat class is a factory for RecordWriter objects, which are used to write the individual records to the files as directed by the OutputFormat.
The output files written by the Reducers remain in HDFS, where they can be used by other applications, such as another MapReduce job or a separate program that inspects the results manually.
