Hadoop Custom Input Format


The starting stage of a MapReduce job is determined by the InputFormat!

1. The role played by InputFormat in the MapReduce framework:
– Splits the input data into logical splits (InputSplit); each split is assigned to a separate mapper
– Provides a RecordReader object that reads <key, value> records from the split for the mapper to process

1.1 Effects of InputFormat on the mapper:
– Determines the number of mappers
– Determines the key and value types received by the mapper's map function

1.2 InputFormat source code interpretation:

– getSplits is responsible for dividing the input data into a set of splits
– createRecordReader returns the object responsible for reading <key, value> records from a split (both methods are sketched just below)
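
For reference, the two methods above have roughly the following shape in org.apache.hadoop.mapreduce.InputFormat; this is a simplified sketch of the new-API class, not a full listing:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public abstract class InputFormat<K, V> {

  // Divide the input into logical splits; each split is handed to one mapper.
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;

  // Create the RecordReader that turns one split into <key, value> records.
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;
}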

1.3 InputSplit source code interpretation:

– InputSplit is an abstract class; the concrete split classes inherit from it
– The getLength() method returns the size of the split
– The getLocations() method returns the list of locations where the split's data is stored (both methods are sketched just below)
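
The two methods are declared roughly like this in org.apache.hadoop.mapreduce.InputSplit (simplified sketch):

import java.io.IOException;

public abstract class InputSplit {

  // Size of the split in bytes; the framework uses it to order splits so that
  // the largest ones are scheduled first.
  public abstract long getLength() throws IOException, InterruptedException;

  // Hostnames where the split's data resides, used for data-local scheduling.
  public abstract String[] getLocations() throws IOException, InterruptedException;
}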

2. MapReduce calls the RecordReader and the run method of the Mapper class

The mapper obtains its key-value pairs through the Context object; the Context's nextKeyValue(), getCurrentKey(), and getCurrentValue() methods in turn call the RecordReader object returned by the InputFormat.
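
The body of Mapper.run (org.apache.hadoop.mapreduce.Mapper) looks roughly like this; every key-value call on the Context is forwarded to the RecordReader created by the InputFormat:

// Simplified sketch of Mapper.run
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKeyValue()) {        // forwarded to RecordReader.nextKeyValue()
    map(context.getCurrentKey(),          // forwarded to RecordReader.getCurrentKey()
        context.getCurrentValue(),        // forwarded to RecordReader.getCurrentValue()
        context);
  }
  cleanup(context);
}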

3. InputFormat class hierarchy

3.1 FileInputFormat:

FileInputFormat is a subclass of InputFormat; all input format classes that use files as their data source inherit from it.
--- Implements the getSplits method
--- The splits it returns are of type FileSplit, a subclass of InputSplit that adds information such as the file path and the start offset of the split within the file
--- Does not implement the createRecordReader method, so it remains an abstract class

FileInputFormat: generating splits

--- By default, one split is generated for each HDFS block of a file
--- The minimum and maximum split sizes can be set with the job configuration parameters mapred.min.split.size and mapred.max.split.size; once they are set, a single split may cover several contiguous blocks of the same file, so that the split size falls within the specified range (a configuration sketch follows this list)
--- A split contains blocks from only one file
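
A minimal sketch of setting these limits on a job; the 128 MB / 256 MB values are arbitrary examples, and the exact property names vary by Hadoop version (newer releases use mapreduce.input.fileinputformat.split.minsize / .maxsize, which the FileInputFormat helper methods set for you):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Legacy property names, as mentioned in the text above.
    conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);  // 128 MB minimum
    conf.setLong("mapred.max.split.size", 256L * 1024 * 1024);  // 256 MB maximum

    Job job = Job.getInstance(conf, "split-size-demo");
    // Equivalent helpers in the new API (mapreduce.lib.input.FileInputFormat):
    FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
    FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
  }
}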

4. The default input format for MapReduce: TextInputFormat

TextInputFormat is the default input format.
– It is a subclass of FileInputFormat and inherits its getSplits method
– Its createRecordReader method returns a LineRecordReader object
– Each line of input produces one <key, value> record (see the mapper sketch after this list)
– key: the byte offset of the line within the file, of type LongWritable
– value: the content of the line, of type Text
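
A minimal mapper sketch showing the types that LineRecordReader delivers; the class name LineOffsetMapper is just an illustrative placeholder:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// key   = byte offset of the line in the file (LongWritable)
// value = the line itself (Text)
public class LineOffsetMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Emit <line text, offset> simply to demonstrate the input types.
    context.write(value, key);
  }
}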

If there are tens of thousands of input files, at least tens of thousands of mappers have to be launched!

4.1 Custom input format to solve the small-file problem:

CombineFileInputFormat

CombineFileInputFormat is an input format designed for small files.
– It inherits from the class FileInputFormat
– It overrides the getSplits method
– The returned splits are of type CombineFileSplit, a subclass of InputSplit that can contain the paths of multiple files
– It is an abstract class; writing a concrete subclass requires implementing the createRecordReader method (a sketch follows this list)
– The recommended return type is CombineFileRecordReader, which processes a CombineFileSplit
– In the CombineFileRecordReader constructor, you also specify a RecordReader class that handles each single file inside the split
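
A minimal sketch of such a concrete subclass, assuming each small file should still be read line by line; the class names CombineSmallTextInputFormat and PerFileLineReader are illustrative placeholders, and the per-file reader simply delegates to Hadoop's LineRecordReader:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Packs many small text files into one split and reads each of them line by line.
public class CombineSmallTextInputFormat extends CombineFileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) throws IOException {
    // CombineFileRecordReader walks the files of the CombineFileSplit and
    // instantiates one PerFileLineReader per file.
    return new CombineFileRecordReader<LongWritable, Text>(
        (CombineFileSplit) split, context, PerFileLineReader.class);
  }

  // Per-file reader; the (CombineFileSplit, TaskAttemptContext, Integer)
  // constructor is the signature CombineFileRecordReader expects.
  public static class PerFileLineReader extends RecordReader<LongWritable, Text> {
    private final LineRecordReader delegate = new LineRecordReader();
    private final FileSplit fileSplit;

    public PerFileLineReader(CombineFileSplit split, TaskAttemptContext context,
                             Integer index) throws IOException {
      // Carve out the single file at position `index` of the combined split.
      fileSplit = new FileSplit(split.getPath(index), split.getOffset(index),
          split.getLength(index), split.getLocations());
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
        throws IOException, InterruptedException {
      delegate.initialize(fileSplit, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      return delegate.nextKeyValue();
    }

    @Override
    public LongWritable getCurrentKey() {
      return delegate.getCurrentKey();
    }

    @Override
    public Text getCurrentValue() {
      return delegate.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException {
      return delegate.getProgress();
    }

    @Override
    public void close() throws IOException {
      delegate.close();
    }
  }
}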

CombineFileInputFormat: generating splits

– The output splits contain blocks from multiple different files
– File splitting principle: http://blog.sina.com.cn/s/blog_5673f78b0101etz4.html

Custom input format MyInputFormat:

– Ensures that files are not split and that each file is assigned to only one split
– A split can contain multiple files
– Each output <key, value> corresponds to one complete text file (a sketch follows this list)
– key: the category name the file belongs to, of type Text
– value: the text content of the file, of type Text
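
One possible sketch of MyInputFormat along those lines, assuming the category name is taken from the file's parent directory; MyInputFormat and WholeFileReader are illustrative names, not a canonical implementation:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

// Whole files are never split, one split may hold many files, and every record
// is <category name, full file content>.
public class MyInputFormat extends CombineFileInputFormat<Text, Text> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // never break a file across splits
  }

  @Override
  public RecordReader<Text, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) throws IOException {
    return new CombineFileRecordReader<Text, Text>(
        (CombineFileSplit) split, context, WholeFileReader.class);
  }

  // Reads one complete (small) file per nextKeyValue() call.
  public static class WholeFileReader extends RecordReader<Text, Text> {
    private final CombineFileSplit split;
    private final Configuration conf;
    private final int index;
    private boolean processed = false;
    private final Text key = new Text();
    private final Text value = new Text();

    public WholeFileReader(CombineFileSplit split, TaskAttemptContext context,
                           Integer index) {
      this.split = split;
      this.conf = context.getConfiguration();
      this.index = index;
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) { }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (processed) {
        return false;
      }
      Path path = split.getPath(index);
      // Assumption: the "category" is the name of the file's parent directory.
      key.set(path.getParent().getName());

      // Files are assumed small enough to buffer in memory in one piece.
      byte[] contents = new byte[(int) split.getLength(index)];
      FileSystem fs = path.getFileSystem(conf);
      FSDataInputStream in = fs.open(path);
      try {
        IOUtils.readFully(in, contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      value.set(contents, 0, contents.length);
      processed = true;
      return true;
    }

    @Override public Text getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
    @Override public void close() { }
  }
}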
