```java
public List<InputSplit> getSplits(JobContext job) throws IOException {
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
    long maxSize = getMaxSplitSize(job);

    List<InputSplit> splits = new ArrayList<InputSplit>();
    List<FileStatus> files = listStatus(job);
    for (FileStatus file : files) {
        Path path = file.getPath();
        long length = file.getLen();
        if (length != 0L) {
            FileSystem fs = path.getFileSystem(job.getConfiguration());
            BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0L, length);
            if (isSplitable(job, path)) {
                long blockSize = file.getBlockSize();
                long splitSize = computeSplitSize(blockSize, minSize, maxSize);

                long bytesRemaining = length;
                while (((double) bytesRemaining) / splitSize > 1.1D) {
                    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
                    splits.add(makeSplit(path, length - bytesRemaining, splitSize,
                            blkLocations[blkIndex].getHosts()));
                    bytesRemaining -= splitSize;
                }

                if (bytesRemaining != 0L) {
                    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
                    splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
                            blkLocations[blkIndex].getHosts()));
                }
            } else { // file is not splitable: one split for the whole file
                splits.add(makeSplit(path, 0L, length, blkLocations[0].getHosts()));
            }
        } else { // zero-length file: empty hosts array
            splits.add(makeSplit(path, 0L, length, new String[0]));
        }
    }
    job.getConfiguration().setLong(
            "mapreduce.input.fileinputformat.numinputfiles", files.size());
    LOG.debug("Total # of splits: " + splits.size());
    return splits;
}
```
On YARN, the number of map tasks does not always match the number the user expects to set; the split calculation above explains why.
The core lines: `long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));` takes the larger of `getFormatMinSplitSize()`, which returns 1 by default, and `getMinSplitSize(job)`, the minimum split size set by the user. `long maxSize = getMaxSplitSize(job);` is the maximum split size set by the user, which defaults to `Long.MAX_VALUE` (9223372036854775807). The split size is then computed by `long splitSize = computeSplitSize(blockSize, minSize, maxSize);`:

```java
protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}
```
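To see what this formula produces for the numbers used in the tests below, here is a small standalone sketch (the class and the method re-creation are local to this sketch, not the Hadoop originals):

```java
// Standalone sketch of the split-size formula; not the Hadoop class itself.
public class SplitSizeDemo {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // 134217728, as in the tests below
        // Defaults (minSize = 1, maxSize = Long.MAX_VALUE): split size == block size
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE)); // 134217728
        // Test 1 below (minSize = 301349250, maxSize = 10000): minSize wins
        System.out.println(computeSplitSize(blockSize, 301349250L, 10000L)); // 301349250
    }
}
```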
Test file size: 297 MB (311,349,250 bytes)
Block size: 128 MB
Test code
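The driver itself is not shown in the original post; a minimal hypothetical sketch (the class name, the argument-based paths, and the reliance on the default identity mapper/reducer are my assumptions) would look roughly like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver used to observe the map count; details are assumptions.
public class SplitTestDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-test");
        job.setJarByClass(SplitTestDriver.class);
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // The per-test min/max split sizes from Tests 1-3 go here.
        // Default identity Mapper/Reducer suffice; only splitting matters.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```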
Test 1
```java
FileInputFormat.setMinInputSplitSize(job, 301349250);
FileInputFormat.setMaxInputSplitSize(job, 10000);
```
After running the test, the number of maps is 1. By the formula above, the split size works out to 301349250, which is smaller than the file size of 311349250, so in theory there should be two maps. The answer lies in the split loop:
```java
while (((double) bytesRemaining) / splitSize > 1.1D) {
    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
    splits.add(makeSplit(path, length - bytesRemaining, splitSize,
            blkLocations[blkIndex].getHosts()));
    bytesRemaining -= splitSize;
}
```
As long as the remaining bytes are no more than 1.1 times the split size, they are packed into a single split. This avoids launching an extra map task that would process only a tiny amount of data and waste resources. Here 311349250 / 301349250 ≈ 1.03 ≤ 1.1, so the loop body never runs and the whole file becomes one split.
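A standalone simulation of the loop (local names, not Hadoop code) makes the effect concrete for the split sizes used in these tests:

```java
// Simulates the split loop outside Hadoop; names are local to this sketch.
public class SplitLoopDemo {
    static int countSplits(long length, long splitSize) {
        int splits = 0;
        long bytesRemaining = length;
        // Same 1.1 slop factor as FileInputFormat's SPLIT_SLOP.
        while (((double) bytesRemaining) / splitSize > 1.1D) {
            splits++;
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            splits++; // the tail split absorbs the remainder
        }
        return splits;
    }

    public static void main(String[] args) {
        long length = 311349250L;                             // 297 MB test file
        System.out.println(countSplits(length, 301349250L));  // Test 1 -> 1
        System.out.println(countSplits(length, 150L * 1024 * 1024)); // Test 2 -> 2
    }
}
```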
Test 2
```java
FileInputFormat.setMinInputSplitSize(job, 150 * 1024 * 1024);
FileInputFormat.setMaxInputSplitSize(job, 10000);
```
The number of maps is 2: the split size is max(150 MB, min(10000, 128 MB)) = 157286400, and 311349250 / 157286400 ≈ 1.98 > 1.1, so the loop cuts one 150 MB split and the remaining ~147 MB becomes the second.
Test 3
Add a small file of a few KB to the original input directory, to test whether it gets merged with the large file.
```java
FileInputFormat.setMinInputSplitSize(job, 150 * 1024 * 1024);
FileInputFormat.setMaxInputSplitSize(job, 10000);
```
The number of maps becomes 3: the small file is not merged into the large file's splits but gets a split of its own.
Looking at the source code again:
```java
for (FileStatus file : files) {
    ...
}
```
Splitting is done file by file: the loop iterates over individual files, and a split never spans two files. That also matches common sense, since different files can have different content and formats.
In summary, the split process is roughly as follows: first traverse the input directory, filter out files that do not qualify, and add the rest to a list; then split each file on its own (the split size comes from the formula above, and a small tail at the end of a file may be merged into the last split); add the resulting splits to the split list; finally, each map task reads its own split for processing.
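Tying it together, the same simulation extended to several files reproduces Test 3 (again a local approximation of getSplits(), not the real implementation; the 4 KB size for the small file is an assumption):

```java
// Approximates getSplits() across several files; a local sketch only.
public class MultiFileSplitDemo {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    static int countSplits(long length, long splitSize) {
        int n = 0;
        long remaining = length;
        while (((double) remaining) / splitSize > 1.1D) { n++; remaining -= splitSize; }
        return remaining != 0 ? n + 1 : n;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;
        // Test 3 settings: minSize = 150 MB, maxSize = 10000
        long splitSize = computeSplitSize(blockSize, 150L * 1024 * 1024, 10000L);
        long[] fileLengths = { 311349250L, 4096L }; // 297 MB file + a few-KB file
        int total = 0;
        for (long len : fileLengths) {
            total += countSplits(len, splitSize); // each file is split on its own
        }
        System.out.println(total); // 3, matching Test 3
    }
}
```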