Hadoop Custom Input Format


The starting stage of a MapReduce job is determined by the InputFormat!

1. The role played by InputFormat in the MapReduce framework:
– Splits the input data into logical splits (InputSplit); each split is assigned to a separate mapper
– Provides a RecordReader object that reads <key, value> records from the split for the mapper to process

1.1 Effects of InputFormat on the mapper:
– Determines the number of mappers
– Determines the key and value types received by the mapper's map function

1.2 InputFormat source code interpretation:

– getSplits is responsible for dividing the input data into a set of splits
– createRecordReader returns the object responsible for reading <key, value> records from a split (both methods are sketched just below)
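
For reference, the two methods above have roughly the following shape in org.apache.hadoop.mapreduce.InputFormat; this is a simplified sketch of the new-API class, not a full listing:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public abstract class InputFormat<K, V> {

  // Divide the input into logical splits; each split is handed to one mapper.
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;

  // Create the RecordReader that turns one split into <key, value> records.
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;
}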

1.3 InputSplit source code interpretation:

– InputSplit is an abstract class; the concrete split classes inherit from it
– The getLength() method returns the size of the split
– The getLocations() method returns the list of locations where the split's data is stored (both methods are sketched just below)
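
The two methods are declared roughly like this in org.apache.hadoop.mapreduce.InputSplit (simplified sketch):

import java.io.IOException;

public abstract class InputSplit {

  // Size of the split in bytes; the framework uses it to order splits so that
  // the largest ones are scheduled first.
  public abstract long getLength() throws IOException, InterruptedException;

  // Hostnames where the split's data resides, used for data-local scheduling.
  public abstract String[] getLocations() throws IOException, InterruptedException;
}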

2. MapReduce calls the RecordReader and the run method of the Mapper class

The mapper obtains its key-value pairs through the Context object; the Context's nextKeyValue(), getCurrentKey(), and getCurrentValue() methods in turn call the RecordReader object returned by the InputFormat.
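
The body of Mapper.run (org.apache.hadoop.mapreduce.Mapper) looks roughly like this; every key-value call on the Context is forwarded to the RecordReader created by the InputFormat:

// Simplified sketch of Mapper.run
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKeyValue()) {        // forwarded to RecordReader.nextKeyValue()
    map(context.getCurrentKey(),          // forwarded to RecordReader.getCurrentKey()
        context.getCurrentValue(),        // forwarded to RecordReader.getCurrentValue()
        context);
  }
  cleanup(context);
}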

3. InputFormat class hierarchy

3.1 FileInputFormat:

FileInputFormat is a subclass of InputFormat; all input format classes that use files as their data source inherit from it.
--- Implements the getSplits method
--- The splits it returns are of type FileSplit, a subclass of InputSplit that adds information such as the file path and the start offset of the split within the file
--- Does not implement the createRecordReader method, so it remains an abstract class

FileInputFormat: generating splits

--- By default, one split is generated for each HDFS block of a file
--- The minimum and maximum split sizes can be set with the job configuration parameters mapred.min.split.size and mapred.max.split.size; once they are set, a single split may cover several contiguous blocks of the same file, so that the split size falls within the specified range (a configuration sketch follows this list)
--- A split contains blocks from only one file
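
A minimal sketch of setting these limits on a job; the 128 MB / 256 MB values are arbitrary examples, and the exact property names vary by Hadoop version (newer releases use mapreduce.input.fileinputformat.split.minsize / .maxsize, which the FileInputFormat helper methods set for you):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Legacy property names, as mentioned in the text above.
    conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);  // 128 MB minimum
    conf.setLong("mapred.max.split.size", 256L * 1024 * 1024);  // 256 MB maximum

    Job job = Job.getInstance(conf, "split-size-demo");
    // Equivalent helpers in the new API (mapreduce.lib.input.FileInputFormat):
    FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
    FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
  }
}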

4. The default input format for MapReduce: TextInputFormat

TextInputFormat is the default input format.
– It is a subclass of FileInputFormat and inherits its getSplits method
– Its createRecordReader method returns a LineRecordReader object
– Each line of input produces one <key, value> record (see the mapper sketch after this list)
– key: the byte offset of the line within the file, of type LongWritable
– value: the content of the line, of type Text
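
A minimal mapper sketch showing the types that LineRecordReader delivers; the class name LineOffsetMapper is just an illustrative placeholder:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// key   = byte offset of the line in the file (LongWritable)
// value = the line itself (Text)
public class LineOffsetMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Emit <line text, offset> simply to demonstrate the input types.
    context.write(value, key);
  }
}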

If there are tens of thousands of input files, at least tens of thousands of mappers have to be launched!

4.1 Custom input format to solve the small-file problem:

CombineFileInputFormat

CombineFileInputFormat is an input format designed for small files.
– It inherits from the class FileInputFormat
– It overrides the getSplits method
– The returned splits are of type CombineFileSplit, a subclass of InputSplit that can contain the paths of multiple files
– It is an abstract class; writing a concrete subclass requires implementing the createRecordReader method (a sketch follows this list)
– The recommended return type is CombineFileRecordReader, which processes a CombineFileSplit
– In the CombineFileRecordReader constructor, you also specify a RecordReader class that handles each single file inside the split
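
A minimal sketch of such a concrete subclass, assuming each small file should still be read line by line; the class names CombineSmallTextInputFormat and PerFileLineReader are illustrative placeholders, and the per-file reader simply delegates to Hadoop's LineRecordReader:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Packs many small text files into one split and reads each of them line by line.
public class CombineSmallTextInputFormat extends CombineFileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) throws IOException {
    // CombineFileRecordReader walks the files of the CombineFileSplit and
    // instantiates one PerFileLineReader per file.
    return new CombineFileRecordReader<LongWritable, Text>(
        (CombineFileSplit) split, context, PerFileLineReader.class);
  }

  // Per-file reader; the (CombineFileSplit, TaskAttemptContext, Integer)
  // constructor is the signature CombineFileRecordReader expects.
  public static class PerFileLineReader extends RecordReader<LongWritable, Text> {
    private final LineRecordReader delegate = new LineRecordReader();
    private final FileSplit fileSplit;

    public PerFileLineReader(CombineFileSplit split, TaskAttemptContext context,
                             Integer index) throws IOException {
      // Carve out the single file at position `index` of the combined split.
      fileSplit = new FileSplit(split.getPath(index), split.getOffset(index),
          split.getLength(index), split.getLocations());
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
        throws IOException, InterruptedException {
      delegate.initialize(fileSplit, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      return delegate.nextKeyValue();
    }

    @Override
    public LongWritable getCurrentKey() {
      return delegate.getCurrentKey();
    }

    @Override
    public Text getCurrentValue() {
      return delegate.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException {
      return delegate.getProgress();
    }

    @Override
    public void close() throws IOException {
      delegate.close();
    }
  }
}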

CombineFileInputFormat: generating splits

– The output splits contain blocks from multiple different files
– File splitting principle: http://blog.sina.com.cn/s/blog_5673f78b0101etz4.html

Custom input format MyInputFormat:

– Ensures that files are not split and that each file is assigned to only one split
– A split can contain multiple files
– Each output <key, value> corresponds to one complete text file (a sketch follows this list)
– key: the category name the file belongs to, of type Text
– value: the text content of the file, of type Text
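
One possible sketch of MyInputFormat along those lines, assuming the category name is taken from the file's parent directory; MyInputFormat and WholeFileReader are illustrative names, not a canonical implementation:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

// Whole files are never split, one split may hold many files, and every record
// is <category name, full file content>.
public class MyInputFormat extends CombineFileInputFormat<Text, Text> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // never break a file across splits
  }

  @Override
  public RecordReader<Text, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) throws IOException {
    return new CombineFileRecordReader<Text, Text>(
        (CombineFileSplit) split, context, WholeFileReader.class);
  }

  // Reads one complete (small) file per nextKeyValue() call.
  public static class WholeFileReader extends RecordReader<Text, Text> {
    private final CombineFileSplit split;
    private final Configuration conf;
    private final int index;
    private boolean processed = false;
    private final Text key = new Text();
    private final Text value = new Text();

    public WholeFileReader(CombineFileSplit split, TaskAttemptContext context,
                           Integer index) {
      this.split = split;
      this.conf = context.getConfiguration();
      this.index = index;
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) { }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (processed) {
        return false;
      }
      Path path = split.getPath(index);
      // Assumption: the "category" is the name of the file's parent directory.
      key.set(path.getParent().getName());

      // Files are assumed small enough to buffer in memory in one piece.
      byte[] contents = new byte[(int) split.getLength(index)];
      FileSystem fs = path.getFileSystem(conf);
      FSDataInputStream in = fs.open(path);
      try {
        IOUtils.readFully(in, contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      value.set(contents, 0, contents.length);
      processed = true;
      return true;
    }

    @Override public Text getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
    @Override public void close() { }
  }
}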
