MR Input and Output Modes

Source: Internet
Author: User

Customizing inputs and outputs adds a lot of value to MapReduce, for example by skipping the time-consuming step of staging data in HDFS. Instead, a job can accept data directly from the original data source, or send its output straight to handlers that consume the data once the MapReduce job completes. When the stock Hadoop model of file blocks and input splits does not meet such demands, it is time to customize InputFormat and OutputFormat.



Three modes for processing input:
1. Generating data
2. External source input
3. Partition pruning

The map function never sees how this complexity is handled; it simply receives its input key-value pairs.

One mode for processing output:
1. External source output

Customizing input and output in Hadoop
Hadoop lets users modify the way data is loaded from disk in two ways: configuring how contiguous input chunks are generated from HDFS blocks, and configuring how records appear during the map phase. Two classes accomplish this: RecordReader and InputFormat. They are plugged into the Hadoop MapReduce framework in much the same way as mapper and reducer classes.
Hadoop also lets users modify the way data is stored, in a similar fashion, through the OutputFormat and RecordWriter classes.


InputFormat
Hadoop relies on a job's input format to complete three tasks:
1) Validate the job's input configuration (for example, check that the input data exists).
2) Split the input files into logical chunks, each of type InputSplit, and assign each chunk to a map task for processing.
3) Create a RecordReader implementation that generates key-value pairs from the raw InputSplit; these pairs are then sent, one at a time, to their mapper.

OutputFormat
Hadoop relies on a job's output format to complete two main tasks:
1) Validate the job's output configuration.
2) Create a RecordWriter, which is responsible for writing the job's output.

InputFormat abstract class
The most common input formats in Hadoop are subclasses of FileInputFormat; the default is TextInputFormat. The input format first validates the job's input, ensuring that every input path exists. It then splits each input file logically, based on the file's total size in bytes, with the block size as the upper bound of each logical split.
For example: with the HDFS block size set to 64 MB, a 200 MB file yields 4 input splits, with byte ranges 0~64 MB, 64~128 MB, 128~192 MB, and 192~200 MB. Each map task is assigned exactly one input split, and the RecordReader is then responsible for turning all of that split's bytes into key-value pairs.
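The split arithmetic above can be sketched in plain Java. This is a simplified stand-in, not the real FileInputFormat code (which also honors configurable minimum and maximum split sizes); it just walks the file, cutting a chunk of at most one split size at a time:

```java
// Simplified sketch of logical split computation: cover a file of fileSize
// bytes with chunks of at most splitSize bytes each.
import java.util.ArrayList;
import java.util.List;

public class SplitRanges {
    // Returns {start, length} pairs that together cover the whole file.
    static List<long[]> computeSplits(long fileSize, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long start = 0;
        while (start < fileSize) {
            long length = Math.min(splitSize, fileSize - start);
            splits.add(new long[] {start, length});
            start += length;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // The 200 MB file with a 64 MB block size from the example above.
        for (long[] s : computeSplits(200 * mb, 64 * mb)) {
            System.out.println((s[0] / mb) + "~" + ((s[0] + s[1]) / mb) + " MB");
        }
    }
}
```

Running it prints the four ranges 0~64 MB, 64~128 MB, 128~192 MB, and 192~200 MB.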

The RecordReader also handles a subtle boundary issue. Because the boundaries of an input split are arbitrary, they will usually not fall on record boundaries. For example, TextInputFormat uses a LineRecordReader to read text files and create a key-value pair for each line of text (split on newline characters) for each map task: the key is the number of bytes read so far in the file, and the value is the string of all characters up to the newline. Since the byte range of an input split is unlikely to be aligned with newline characters, the LineRecordReader reads past its given "end" to ensure that a complete line is read. This small piece of extra data may belong to a block stored on a different node, in which case it is streamed from the DataNode that holds it. The streaming is handled by an instance of the FSDataInputStream class, so we don't need to care where the blocks live. As long as your own format handles split boundaries this way and is thoroughly tested, it will neither duplicate nor lose any data.
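The boundary rule can be illustrated with a small self-contained sketch, assuming an in-memory byte array in place of HDFS: a reader whose split does not start at byte 0 discards its partial first line (the previous split reads it in full), and every reader finishes the line that straddles its "end", so adjacent splits together produce each line exactly once.

```java
// Toy illustration of LineRecordReader's boundary handling, over a byte
// array rather than an HDFS stream.
import java.util.ArrayList;
import java.util.List;

public class SplitLineReader {
    // Returns the lines this split [start, end) is responsible for.
    static List<String> readLines(byte[] data, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Discard the partial first line; the previous split reads it in full.
            while (pos < data.length && data[pos++] != '\n') { }
        }
        // Emit lines as long as a line starts at or before `end`; the last
        // line read may run past the split boundary so it stays complete.
        while (pos < data.length && pos <= end) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            lines.add(new String(data, lineStart, pos - lineStart));
            pos++; // step past the newline
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "one\ntwo\nthree\nfour\n".getBytes();
        // Split the 19 bytes at offset 9, in the middle of "three".
        System.out.println(readLines(data, 0, 9));   // reads past byte 9 to finish "three"
        System.out.println(readLines(data, 9, 19));  // skips the tail of "three"
    }
}
```

Together the two splits yield "one", "two", "three", "four", each exactly once, even though the cut at byte 9 falls mid-line.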
The InputFormat abstract class consists of two abstract methods:
1) getSplits
    A typical implementation of getSplits obtains the input configuration from the given JobContext object and returns a List of InputSplit objects. Each input split has a method that returns an array of the machines on which its data is stored in the cluster, which gives the framework a basis for deciding which TaskTracker should run each map task. Because this method also runs on the front end (that is, before the job is submitted to the JobTracker), it is a good place to validate the job configuration and throw any necessary exceptions.
2) createRecordReader
    This method is used on the back end to instantiate a RecordReader implementation. The record reader has an initialize method that the framework calls before use.


RecordReader abstract class
The RecordReader creates key-value pairs from a given InputSplit. An InputSplit is a byte-oriented view of a slice of the input; the RecordReader parses it into record-oriented data that the mapper can process. This is also why Hadoop and MapReduce are described as schema-on-read: the schema is defined entirely by the RecordReader implementation, and it changes according to what kind of input the job expects. The RecordReader reads bytes from the input source and converts them into a WritableComparable key and a Writable value. Custom data types are common when writing custom input formats; they are a good object-oriented way to present information to the mapper.
The RecordReader generates key-value pairs within the boundaries produced when the input split was created: the start is the position in the file where the RecordReader should begin generating pairs, and the end is where it should stop reading.


Methods that must be overridden
initialize
    Initializes the record reader. For file-based input formats, this is a good place to seek to the byte position in the file where reading should begin.
getCurrentKey and getCurrentValue
    The framework uses these two methods to pass each generated key-value pair to the mapper implementation.
nextKeyValue
    Reads the next key-value pair, returning true until all the data has been consumed.
getProgress
    The framework uses this method to gather progress metrics (as with InputFormat).
close
    Once there are no more key-value pairs, the framework calls this method for cleanup work.
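To make the calling order concrete, here is a toy mimic of that contract. The types are simplified stand-ins, not the real org.apache.hadoop.mapreduce.RecordReader API: the framework initializes the reader against its split, pulls pairs with nextKeyValue/getCurrentKey/getCurrentValue until the data is exhausted, then calls close.

```java
// Toy mimic of the RecordReader lifecycle: initialize, iterate, close.
import java.util.ArrayList;
import java.util.List;

public class ReaderLoop {
    // A toy "record reader" over an in-memory array of lines.
    static class ToyLineReader {
        private String[] lines;
        private int index = -1;

        void initialize(String[] split) { this.lines = split; }   // seek to the split start
        boolean nextKeyValue() { return ++index < lines.length; } // false once consumed
        Integer getCurrentKey() { return index; }                 // a byte offset in the real reader
        String getCurrentValue() { return lines[index]; }
        float getProgress() { return lines.length == 0 ? 1f : (index + 1f) / lines.length; }
        void close() { lines = null; }                            // cleanup
    }

    // The framework-style run loop driving the reader.
    static List<String> run(String[] split) {
        ToyLineReader reader = new ToyLineReader();
        reader.initialize(split);
        List<String> mapped = new ArrayList<>();
        while (reader.nextKeyValue()) {
            // Here the real framework would call map(key, value).
            mapped.add(reader.getCurrentKey() + "\t" + reader.getCurrentValue());
        }
        reader.close();
        return mapped;
    }

    public static void main(String[] args) {
        System.out.println(run(new String[] {"alpha", "beta"}));
    }
}
```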


OutputFormat
The default output format in Hadoop is TextOutputFormat, which writes all key-value pairs to the HDFS output directory named in the configuration, with keys and values delimited by a tab character; each reduce task writes its records to its own part file in that directory. TextOutputFormat uses a LineRecordWriter to write the key-value pairs of each map or reduce task, calling toString to serialize each key and value to the part file on HDFS, tab-delimited. The separator between key and value can be changed through the job configuration.
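The LineRecordWriter behavior can be sketched with a toy writer (plain Java writing to a StringWriter, not the Hadoop class): each record is key.toString(), the separator (tab by default), value.toString(), and a newline. In Hadoop 2.x and later the separator is configured through the mapreduce.output.textoutputformat.separator property.

```java
// Toy stand-in for LineRecordWriter: serialize each key and value with
// toString(), joined by a configurable separator, one record per line,
// the way TextOutputFormat lays out its part files.
import java.io.StringWriter;

public class ToyLineWriter {
    private final StringWriter out = new StringWriter();
    private final String separator; // tab by default in TextOutputFormat

    ToyLineWriter(String separator) { this.separator = separator; }

    void write(Object key, Object value) {
        out.write(key.toString());
        out.write(separator);
        out.write(value.toString());
        out.write("\n");
    }

    String contents() { return out.toString(); }

    public static void main(String[] args) {
        ToyLineWriter w = new ToyLineWriter("\t");
        w.write("cat", 3);
        w.write("dog", 1);
        System.out.print(w.contents());
    }
}
```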
OutputFormat abstract class
checkOutputSpecs
    Ensures that the output directory does not already exist; otherwise its contents would be overwritten.
getRecordWriter
    Responsible for creating a RecordWriter, which serializes key-value pairs to an output (usually on a FileSystem object).
getOutputCommitter
    The job's output committer sets up each task during initialization, commits each task when it completes successfully, and cleans up when each task finishes (regardless of success or failure). For file-based output, FileOutputCommitter can be used to handle this heavy lifting.
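The checkOutputSpecs idea can be sketched with java.nio in place of the Hadoop FileSystem API: fail fast, before any task runs, if the output directory already exists, so an earlier job's results are never silently overwritten.

```java
// Sketch of the checkOutputSpecs contract using plain java.nio (not the
// Hadoop FileSystem API): refuse to run against an existing output directory.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class OutputSpecs {
    static void checkOutputSpecs(Path outputDir) throws IOException {
        if (Files.exists(outputDir)) {
            throw new IOException("Output directory " + outputDir + " already exists");
        }
    }

    public static void main(String[] args) throws IOException {
        // A hypothetical, not-yet-created output directory under a temp dir.
        Path fresh = Files.createTempDirectory("demo").resolve("job-output");
        checkOutputSpecs(fresh); // passes: the directory does not exist yet
        System.out.println("output spec ok");
    }
}
```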

RecordWriter abstract class
Responsible for writing key-value pairs to the file system. RecordWriter has no initialization phase, but its constructor can be used to set up the record writer when needed.
    write
        The framework calls this method for each key-value pair to be written. How you implement it depends largely on your use case; for example, it can write the pairs to an external key-value store.
    close
        The framework calls this method once there are no more key-value pairs to write.






















