The starting stage of a MapReduce job is determined by its InputFormat!
1. The role InputFormat plays in the MapReduce framework:
– Cuts the input data into logical splits; each split is assigned to a separate mapper
– Provides a RecordReader object that reads <key, value> pairs from the split for the mapper to process
1.1 InputFormat's effect on the mapper:
– Determines the number of mappers
– Determines the keys and values received by the Mapper's map function
1.2 InputFormat source interpretation:
– getSplits() is responsible for dividing the input data into a set of splits
– The object returned by createRecordReader() is responsible for reading <key, value> pairs from a split
1.3 InputSplit source interpretation:
– InputSplit is an abstract class; concrete split classes inherit from it
– The getLength() method returns the size of the split
– The getLocations() method returns the list of locations where the split's data is stored
2. MapReduce calls the run() method of the Mapper class, passing it the RecordReader.
The Mapper obtains key-value pairs through its Context: the Context's nextKeyValue(),
getCurrentKey() and getCurrentValue() methods call the RecordReader object returned by the InputFormat
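The delegation described above can be sketched with simplified stand-in types (plain Java, not the real Hadoop classes; the names merely mirror the Hadoop API):

```java
// Simplified stand-ins for Hadoop's RecordReader, Context and Mapper.
// The real classes live in org.apache.hadoop.mapreduce; these sketches
// only illustrate how Context delegates to the RecordReader.
interface RecordReader<K, V> {
    boolean nextKeyValue();
    K getCurrentKey();
    V getCurrentValue();
}

class Context<K, V> {
    private final RecordReader<K, V> reader; // Context delegates to the RecordReader
    Context(RecordReader<K, V> reader) { this.reader = reader; }
    boolean nextKeyValue() { return reader.nextKeyValue(); }
    K getCurrentKey() { return reader.getCurrentKey(); }
    V getCurrentValue() { return reader.getCurrentValue(); }
}

abstract class Mapper<K, V> {
    // The run() loop driven by the framework: pull records until exhausted,
    // handing each key-value pair to the user's map function.
    void run(Context<K, V> context) {
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
    }
    abstract void map(K key, V value, Context<K, V> context);
}
```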
3. InputFormat class hierarchy
3.1 FileInputFormat:
FileInputFormat is a subclass of InputFormat; all input format classes that use files as their data source inherit from it
– Implements the getSplits() method
– The splits it returns are of type FileSplit, a subclass of InputSplit that adds information such as the file path and the split's start offset
– Does not implement the createRecordReader() method, so it is an abstract class
FileInputFormat: generating splits
By default, one split is generated for each HDFS block of a file
– The job configuration parameters mapred.min.split.size and mapred.max.split.size set the minimum and maximum split size; once these are set, one split may be generated for several contiguous blocks of a file, so that the split size falls within the specified range
– A split contains blocks from only one file
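The clamping rule behind these two parameters can be sketched in plain Java (this mirrors the logic of Hadoop's FileInputFormat.computeSplitSize, reproduced here as a self-contained method):

```java
// How FileInputFormat derives the split size: the HDFS block size,
// clamped into the configured [minSize, maxSize] range.
public class SplitSize {
    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }
}
```

With the defaults (tiny minimum, huge maximum) the split size equals the block size, which is why one split per block is the default behavior; raising the minimum above the block size makes a split span several contiguous blocks.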
4. The default input format of MapReduce is TextInputFormat
TextInputFormat is the default input format
– It is a subclass of FileInputFormat and inherits its getSplits() method
– Its createRecordReader() returns a LineRecordReader object
– Each line of data generates one <key, value> record
– key: the byte offset of the line in the file; the type is LongWritable
– value: the content of the line; the type is Text
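The keying scheme above can be illustrated in plain Java (this is not the real LineRecordReader, just a sketch of how each line is paired with its byte offset):

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// A plain-Java illustration of TextInputFormat's record shape:
// each line becomes <byte offset of the line, line contents>.
public class LineOffsets {
    public static List<Map.Entry<Long, String>> records(String fileContents) {
        List<Map.Entry<Long, String>> out = new ArrayList<>();
        long totalBytes = fileContents.getBytes(StandardCharsets.UTF_8).length;
        long offset = 0;
        for (String line : fileContents.split("\n", -1)) {
            // skip the empty trailing element produced by a final newline
            if (line.isEmpty() && offset == totalBytes) break;
            out.add(new SimpleEntry<>(offset, line));
            // advance past the line plus its terminating '\n'
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }
}
```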
If there are tens of thousands of input files, at least tens of thousands of mappers must be launched!
4.1 A custom input format that solves the small-file problem:
CombineFileInputFormat
CombineFileInputFormat is an input format designed for small files
– It inherits from the class FileInputFormat
– It overrides the getSplits() method
– The splits it returns are of type CombineFileSplit, a subclass of InputSplit that can contain the paths of multiple files
– It is an abstract class; writing a concrete subclass requires implementing the createRecordReader() method
– The recommended return type is CombineFileRecordReader, which is used to process one CombineFileSplit
– The CombineFileRecordReader constructor also specifies a RecordReader used to handle a single file within the split
CombineFileInputFormat: generating splits
The output split may contain blocks from multiple different files
– File-splitting principle: http://blog.sina.com.cn/s/blog_5673f78b0101etz4.html
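The packing idea can be sketched in plain Java: greedily group small files into combined splits no larger than a maximum size. This is a simplification of what CombineFileInputFormat actually does (the real implementation also groups by node and rack locality, which is omitted here):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of packing small files into combined splits.
// Each inner list represents one combined split, holding the sizes
// of the files it contains; no split exceeds maxSplitSize.
public class CombinePacker {
    public static List<List<Long>> pack(List<Long> fileSizes, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : fileSizes) {
            if (!current.isEmpty() && currentSize + size > maxSplitSize) {
                splits.add(current);            // close the full split
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) splits.add(current);
        return splits;
    }
}
```

With this grouping, thousands of small files map to a handful of splits, so far fewer mappers are launched.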
A custom input format, MyInputFormat:
Ensures that files are not split, and that each file is assigned to exactly one split
– A split can contain multiple files
– Each output <key, value> pair corresponds to one complete text file
– key: the category name the file belongs to; the type is Text
– value: the text content of the file; the type is Text
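The record that MyInputFormat emits for one file can be sketched in plain Java. Deriving the category name from the file's parent directory is an assumption for illustration; the actual scheme depends on how the job organizes its input:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Sketch of the <key, value> record for one whole file:
// key = category name (assumed here to be the parent directory name),
// value = the entire contents of the file as a single string.
public class WholeFileRecord {
    public static Map.Entry<String, String> read(Path file) throws IOException {
        String category = file.getParent() == null
                ? "" : file.getParent().getFileName().toString();
        String contents = new String(Files.readAllBytes(file)); // entire file as one value
        return new SimpleEntry<>(category, contents);
    }
}
```

In a real implementation, a RecordReader built this way would return the pair from a single nextKeyValue() call and report the file as unsplittable.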
Hadoop Custom Input format