Map number control in YARN

public List<InputSplit> getSplits(JobContext job) throws IOException {
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
    long maxSize = getMaxSplitSize(job);
    List<InputSplit> splits = new ArrayList<InputSplit>();
    List<FileStatus> files = listStatus(job);
    for (FileStatus file : files) {
        Path path = file.getPath();
        long length = file.getLen();
        if (length != 0L) {
            FileSystem fs = path.getFileSystem(job.getConfiguration());
            BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0L, length);
            if (isSplitable(job, path)) {
                long blockSize = file.getBlockSize();
                long splitSize = computeSplitSize(blockSize, minSize, maxSize);
                long bytesRemaining = length;
                while (((double) bytesRemaining) / splitSize > 1.1D) {
                    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
                    splits.add(makeSplit(path, length - bytesRemaining, splitSize,
                            blkLocations[blkIndex].getHosts()));
                    bytesRemaining -= splitSize;
                }
                if (bytesRemaining != 0L) {
                    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
                    splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
                            blkLocations[blkIndex].getHosts()));
                }
            } else {
                splits.add(makeSplit(path, 0L, length, blkLocations[0].getHosts()));
            }
        } else {
            splits.add(makeSplit(path, 0L, length, new String[0]));
        }
    }
    job.getConfiguration().setLong(
            "mapreduce.input.fileinputformat.numinputfiles", files.size());
    LOG.debug("Total # of splits: " + splits.size());
    return splits;
}

Under YARN, the number of map tasks does not always come out to exactly the number the user expects from his settings.

The core code:

    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));

getFormatMinSplitSize() returns 1 by default; getMinSplitSize(job) returns the minimum split size set by the user, which takes effect when it is greater than 1.

    long maxSize = getMaxSplitSize(job);

getMaxSplitSize(job) returns the maximum split size set by the user; the default is Long.MAX_VALUE (9223372036854775807).

    long splitSize = computeSplitSize(blockSize, minSize, maxSize);

    protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }
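To make the formula concrete, here is a minimal, self-contained Java sketch (the class name is mine; only the formula is taken from computeSplitSize above) that evaluates the split size for the default settings and for the Test 1 settings used below:

```java
// Sketch of the split-size formula, mirrored from FileInputFormat.computeSplitSize.
class SplitSizeDemo {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        // Clamp the block size between the user's min and max settings.
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;  // 134217728, the HDFS block size
        // Defaults: minSize = 1, maxSize = Long.MAX_VALUE -> splitSize == blockSize
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE)); // 134217728
        // Test 1 settings: minSize = 301349250, maxSize = 10000 -> the min wins
        System.out.println(computeSplitSize(blockSize, 301349250L, 10000L)); // 301349250
    }
}
```

Note that when minSize exceeds maxSize, minSize wins: the outer Math.max is applied last.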

 

Test file size: 297 MB (311349250 bytes)

Block size: 128 MB

Test code

Test 1

FileInputFormat.setMinInputSplitSize(job, 301349250);
FileInputFormat.setMaxInputSplitSize(job, 10000);

After running, the number of maps is 1. By the split-size formula above, the split size works out to 301349250, which is smaller than the file size of 311349250 bytes, so in theory there should be two maps. So let us look at the split loop:

while (((double) bytesRemaining) / splitSize > 1.1D) {
    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
    splits.add(makeSplit(path, length - bytesRemaining, splitSize,
            blkLocations[blkIndex].getHosts()));
    bytesRemaining -= splitSize;
}

As long as the remaining bytes amount to no more than 1.1 times the split size, they go into a single split. This avoids launching a second map for a tiny leftover chunk, which would waste resources.

 

Test 2

FileInputFormat.setMinInputSplitSize(job, 150 * 1024 * 1024);
FileInputFormat.setMaxInputSplitSize(job, 10000);

The number of maps is 2.
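The 1.1x slop check in the loop explains both results. Here is a minimal self-contained sketch (the class and method names are mine; the loop logic is mirrored from getSplits, this is not a Hadoop API) that counts the splits for both tests:

```java
// Sketch of the split loop's 1.1x slop rule: count how many splits a file of
// a given length produces for a given split size (logic mirrored from getSplits).
class SplitCountDemo {
    static int countSplits(long length, long splitSize) {
        int splits = 0;
        long bytesRemaining = length;
        // Carve off full-size splits while more than 1.1 split sizes remain.
        while (((double) bytesRemaining) / splitSize > 1.1D) {
            splits++;
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0L) {
            splits++;  // the tail (up to 1.1 * splitSize) becomes one final split
        }
        return splits;
    }

    public static void main(String[] args) {
        // Test 1: 311349250 / 301349250 ~= 1.03 <= 1.1 -> one single split
        System.out.println(countSplits(311349250L, 301349250L)); // 1
        // Test 2: 311349250 / 157286400 ~= 1.98 > 1.1 -> two splits
        System.out.println(countSplits(311349250L, 150L * 1024 * 1024)); // 2
    }
}
```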

Test 3

Add a small file of a few KB to the original input directory, to test whether it gets merged into an existing split.

FileInputFormat.setMinInputSplitSize(job, 150 * 1024 * 1024);
FileInputFormat.setMaxInputSplitSize(job, 10000);

The number of maps changed to 3.

Looking at the source code again:

for (FileStatus file : files) {
    ...
}

Splits are generated per file, which also makes sense: different files may have different content formats, so a split never spans two files.
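Because the loop runs per file, a tiny extra file always contributes at least one split of its own, which is why Test 3 went from 2 maps to 3. A sketch (hypothetical helper names, not Hadoop API; the per-file loop and slop rule are mirrored from getSplits):

```java
// Splits are computed file by file, so a few-KB file adds a whole extra split.
class PerFileSplitsDemo {
    // Same slop logic as the getSplits loop: tail of up to 1.1 * splitSize merges.
    static int countSplits(long length, long splitSize) {
        int n = 0;
        long remaining = length;
        while (((double) remaining) / splitSize > 1.1D) {
            n++;
            remaining -= splitSize;
        }
        if (remaining != 0L) n++;
        return n;
    }

    // One pass per file, like the for (FileStatus file : files) loop.
    static int totalSplits(java.util.List<Long> fileLengths, long splitSize) {
        int total = 0;
        for (long len : fileLengths) {
            total += countSplits(len, splitSize);
        }
        return total;
    }

    public static void main(String[] args) {
        long splitSize = 150L * 1024 * 1024;   // Test 2/3 split size (157286400)
        // 297 MB test file -> 2 splits; a 4 KB file -> 1 split of its own
        java.util.List<Long> files = java.util.Arrays.asList(311349250L, 4096L);
        System.out.println(totalSplits(files, splitSize)); // 3, matching Test 3
    }
}
```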

 

In summary, the split process is roughly as follows: first the input files are listed and non-qualifying ones are filtered out; then each file is split on its own, using the split-size formula above (the tail of a file, up to 1.1 times the split size, may be merged into the last split); the resulting splits are added to the split list; finally, each map task reads and processes its own split.
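Besides calling setMinInputSplitSize/setMaxInputSplitSize in the driver, the same knobs can be set as job configuration properties (Hadoop 2+ key names; the values here just repeat the Test 2 settings as an illustration):

```xml
<!-- Split-size knobs equivalent to FileInputFormat.setMin/MaxInputSplitSize -->
<property>
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <value>157286400</value> <!-- 150 MB: forces splits larger than one block -->
</property>
<property>
  <name>mapreduce.input.fileinputformat.split.maxsize</name>
  <value>10000</value>
</property>
```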
