Source code analysis of Hadoop Data Input

We know that the most important parts of any data-processing project are input, intermediate processing, and output. Today, let's take a closer look at how input is handled in the Hadoop systems we know so well.

In Hadoop, data input is implemented by a corresponding InputFormat class and RecordReader class. The InputFormat class splits the input files into InputSplits, and the RecordReader class reads the records within each split. The details are as follows:

(1) The InputFormat class is an interface.

public interface InputFormat<K, V> {

  InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

  RecordReader<K, V> getRecordReader(InputSplit split,
                                     JobConf job,
                                     Reporter reporter) throws IOException;
}
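Before diving into the implementations, it helps to see where InputFormat plugs into a job. The sketch below uses the old mapred API that this article is based on; the driver class name and the input path are placeholders of our own, not from the Hadoop source.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class InputFormatWiring {
  public static void main(String[] args) {
    // JobConf is the old-API job configuration object.
    JobConf conf = new JobConf(InputFormatWiring.class);
    // Tell the job which InputFormat will split and read its input.
    conf.setInputFormat(TextInputFormat.class);
    // "/user/demo/input" is a placeholder path for illustration only.
    FileInputFormat.setInputPaths(conf, new Path("/user/demo/input"));
  }
}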

(2) The FileInputFormat class implements the InputFormat interface. It implements the getSplits method but not getRecordReader, so FileInputFormat is still an abstract class. One detail worth noting: FileInputFormat uses the isSplitable method to indicate whether a given file supports splitting. The default implementation returns true (splitting is supported), and subclasses generally override it.

public abstract class FileInputFormat<K, V> implements InputFormat<K, V> {

  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    FileStatus[] files = listStatus(job);

    // Save the number of input files in the job-conf
    job.setLong(NUM_INPUT_FILES, files.length);

    long totalSize = 0;                            // compute total size
    for (FileStatus file : files) {                // check we have valid files
      if (file.isDir()) {
        throw new IOException("Not a file: " + file.getPath());
      }
      totalSize += file.getLen();
    }

    long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
    long minSize = Math.max(job.getLong("mapred.min.split.size", 1),
                            minSplitSize);

    // generate splits
    ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
    NetworkTopology clusterMap = new NetworkTopology();
    for (FileStatus file : files) {
      Path path = file.getPath();
      FileSystem fs = path.getFileSystem(job);
      long length = file.getLen();
      BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
      if ((length != 0) && isSplitable(fs, path)) {
        long blockSize = file.getBlockSize();
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);

        long bytesRemaining = length;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
          String[] splitHosts = getSplitHosts(blkLocations,
              length - bytesRemaining, splitSize, clusterMap);
          splits.add(new FileSplit(path, length - bytesRemaining, splitSize,
              splitHosts));
          bytesRemaining -= splitSize;
        }

        if (bytesRemaining != 0) {
          splits.add(new FileSplit(path, length - bytesRemaining, bytesRemaining,
              blkLocations[blkLocations.length - 1].getHosts()));
        }
      } else if (length != 0) {
        String[] splitHosts = getSplitHosts(blkLocations, 0, length, clusterMap);
        splits.add(new FileSplit(path, 0, length, splitHosts));
      } else {
        // Create empty hosts array for zero length files
        splits.add(new FileSplit(path, 0, length, new String[0]));
      }
    }
    LOG.debug("Total # of splits: " + splits.size());
    return splits.toArray(new FileSplit[splits.size()]);
  }

  // This method determines whether a file can be split; true by default.
  protected boolean isSplitable(FileSystem fs, Path filename) {
    return true;
  }

  // getRecordReader is not implemented here, which is why FileInputFormat
  // remains an abstract class.
  public abstract RecordReader<K, V> getRecordReader(InputSplit split,
                                                     JobConf job,
                                                     Reporter reporter)
      throws IOException;
}
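To make the "still abstract" point concrete, here is a minimal hypothetical subclass that completes FileInputFormat by supplying only getRecordReader, reusing Hadoop's stock LineRecordReader. This is an illustrative sketch of ours, not code from the Hadoop source; real jobs would normally use TextInputFormat, shown in the next section.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Hypothetical minimal subclass: it inherits getSplits from FileInputFormat
// and only fills in the abstract getRecordReader method.
public class SimpleLineInputFormat extends FileInputFormat<LongWritable, Text> {
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    reporter.setStatus(split.toString());
    // Delegate record parsing to the stock line reader.
    return new LineRecordReader(job, (FileSplit) split);
  }
}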

(3) The TextInputFormat class implements the getRecordReader method of FileInputFormat and overrides the isSplitable method. It does not implement getSplits itself; that implementation is inherited from the parent class FileInputFormat. (Note that TextInputFormat does not implement the InputFormat interface directly; it extends FileInputFormat and picks up getSplits from it.)

public class TextInputFormat extends FileInputFormat<LongWritable, Text>
    implements JobConfigurable {

  private CompressionCodecFactory compressionCodecs = null;

  public void configure(JobConf conf) {
    compressionCodecs = new CompressionCodecFactory(conf);
  }

  // The subclass overrides the isSplitable method: a compressed file is
  // splitable only if its codec supports splitting.
  protected boolean isSplitable(FileSystem fs, Path file) {
    final CompressionCodec codec = compressionCodecs.getCodec(file);
    if (null == codec) {
      return true;
    }
    return codec instanceof SplittableCompressionCodec;
  }

  // This method creates the reader that feeds records from a split to the
  // corresponding map method.
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit genericSplit, JobConf job,
      Reporter reporter)
      throws IOException {
    reporter.setStatus(genericSplit.toString());
    String delimiter = job.get("textinputformat.record.delimiter");
    byte[] recordDelimiterBytes = null;
    if (null != delimiter) {
      recordDelimiterBytes = delimiter.getBytes();
    }
    return new LineRecordReader(job, (FileSplit) genericSplit,
                                recordDelimiterBytes);
  }
}
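As a usage note, the textinputformat.record.delimiter key read in getRecordReader lets a job replace the default newline record boundary. A small sketch, assuming blank-line-separated records as a purely illustrative choice:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class DelimiterConfig {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    conf.setInputFormat(TextInputFormat.class);
    // Illustrative value: treat blank-line-separated paragraphs as single
    // records instead of the default one-record-per-line behavior.
    conf.set("textinputformat.record.delimiter", "\n\n");
  }
}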

From the above, we can see the class hierarchy through which a text file is delivered to the map method. The next question is how a file is actually split. The hierarchy shows that the splitting logic is implemented in the FileInputFormat class, so to understand how files are split you only need to examine the implementation details of its getSplits method. Below, the getSplits method of FileInputFormat is pasted again, this time annotated line by line.

public InputSplit[] getSplits(JobConf job, int numSplits)
    throws IOException {
  FileStatus[] files = listStatus(job);  // list all input files of the current job

  // Save the number of input files in the job-conf
  job.setLong(NUM_INPUT_FILES, files.length);

  // Compute the total size of all input files of the current job
  long totalSize = 0;
  // Traverse every file
  for (FileStatus file : files) {        // check we have valid files
    if (file.isDir()) {
      throw new IOException("Not a file: " + file.getPath());
    }
    totalSize += file.getLen();
  }

  // numSplits is the requested number of splits, goalSize the average
  // size of a split, and minSize the lower bound on the split size.
  long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
  long minSize = Math.max(job.getLong("mapred.min.split.size", 1),
                          minSplitSize);

  // Generate the splits
  ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
  NetworkTopology clusterMap = new NetworkTopology();
  for (FileStatus file : files) {
    Path path = file.getPath();
    FileSystem fs = path.getFileSystem(job);
    long length = file.getLen();
    // Obtain the block locations of the file
    BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
    // isSplitable decides whether this file can be split (e.g. based on
    // its compression codec, as seen in TextInputFormat above)
    if ((length != 0) && isSplitable(fs, path)) {
      long blockSize = file.getBlockSize();  // obtain the block size of the file
      // computeSplitSize calculates the actual split size
      long splitSize = computeSplitSize(goalSize, minSize, blockSize);

      long bytesRemaining = length;  // remaining bytes of the file

      // SPLIT_SLOP = 1.1: keep splitting while the remaining bytes exceed
      // the split size by more than 10%.
      while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
        // splitHosts records the hosts that store this split's data
        String[] splitHosts = getSplitHosts(blkLocations,
            length - bytesRemaining, splitSize, clusterMap);
        splits.add(new FileSplit(path, length - bytesRemaining, splitSize,
            splitHosts));
        bytesRemaining -= splitSize;
      }

      if (bytesRemaining != 0) {
        splits.add(new FileSplit(path, length - bytesRemaining, bytesRemaining,
            blkLocations[blkLocations.length - 1].getHosts()));
      }
    } else if (length != 0) {
      // If the file cannot be split, the entire file becomes a single split
      String[] splitHosts = getSplitHosts(blkLocations, 0, length, clusterMap);
      splits.add(new FileSplit(path, 0, length, splitHosts));
    } else {
      // Create empty hosts array for zero length files
      splits.add(new FileSplit(path, 0, length, new String[0]));
    }
  }
  LOG.debug("Total # of splits: " + splits.size());
  return splits.toArray(new FileSplit[splits.size()]);
}

// Compute the split size.
protected long computeSplitSize(long goalSize, long minSize,
                                long blockSize) {
  return Math.max(minSize, Math.min(goalSize, blockSize));
}
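To see how computeSplitSize and SPLIT_SLOP interact, here is a small self-contained demo of the same arithmetic. The numbers (1000 MB of input, 4 requested splits, a 128 MB block size) are assumptions chosen for illustration, not values from the article.

public class SplitSizeDemo {
  static final double SPLIT_SLOP = 1.1;  // same 10% slack as FileInputFormat

  static long computeSplitSize(long goalSize, long minSize, long blockSize) {
    return Math.max(minSize, Math.min(goalSize, blockSize));
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024L;
    long goalSize = 250 * mb;  // 1000 MB total / 4 requested splits
    long minSize = 1;          // default mapred.min.split.size
    long blockSize = 128 * mb; // HDFS block size

    long splitSize = computeSplitSize(goalSize, minSize, blockSize);
    System.out.println("splitSize = " + splitSize / mb + " MB");  // 128 MB

    // SPLIT_SLOP in action: a 130 MB file is NOT split, because
    // 130 / 128 is about 1.016, which does not exceed 1.1, so the loop
    // never runs and the whole remainder becomes a single split.
    long bytesRemaining = 130 * mb;
    int splits = 0;
    while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
      bytesRemaining -= splitSize;
      splits++;
    }
    if (bytesRemaining != 0) {
      splits++;
    }
    System.out.println("130 MB file => " + splits + " split(s)");  // 1
  }
}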

To sum up, MapReduce input relies on the FileInputFormat class to split the input data. Before splitting, the isSplitable method is consulted to decide whether a file can be split; if it cannot, the entire file is used as a single split. Therefore, if you do not want a file to be split, override the isSplitable method to return false.
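For instance, a minimal hypothetical subclass (the class name is ours, not Hadoop's) that forces every input file into a single split:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical subclass: every input file becomes exactly one split, so
// getSplits takes its "else if (length != 0)" branch for each file.
public class NonSplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }
}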

Note that if all your input files are small files, the getSplits method will not split them further: each small file becomes exactly one split. Generally, a small file is one smaller than an HDFS block (128 MB by default in recent Hadoop versions). For example, with a 128 MB split size, a 10 MB file gives 10 / 128, which is far below SPLIT_SLOP, so the whole file is emitted as a single split.

