```java
public List<InputSplit> getSplits(JobContext job) throws IOException {
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
    long maxSize = getMaxSplitSize(job);

    List<InputSplit> splits = new ArrayList<InputSplit>();
    List<FileStatus> files = listStatus(job);
    for (FileStatus file : files) {
        Path path = file.getPath();
        long length = file.getLen();
        if (length != 0L) {
            FileSystem fs = path.getFileSystem(job.getConfiguration());
            BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0L, length);
            if (isSplitable(job, path)) {
                long blockSize = file.getBlockSize();
                long splitSize = computeSplitSize(blockSize, minSize, maxSize);

                long bytesRemaining = length;
                while (((double) bytesRemaining) / splitSize > 1.1D) {
                    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
                    splits.add(makeSplit(path, length - bytesRemaining, splitSize,
                            blkLocations[blkIndex].getHosts()));
                    bytesRemaining -= splitSize;
                }

                if (bytesRemaining != 0L) {
                    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
                    splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
                            blkLocations[blkIndex].getHosts()));
                }
            } else { // file is not splitable: one split for the whole file
                splits.add(makeSplit(path, 0L, length, blkLocations[0].getHosts()));
            }
        } else { // zero-length file: empty hosts array
            splits.add(makeSplit(path, 0L, length, new String[0]));
        }
    }
    job.getConfiguration().setLong(
            "mapreduce.input.fileinputformat.numinputfiles", files.size());
    LOG.debug("Total # of splits: " + splits.size());
    return splits;
}
```
On YARN, the number of map tasks does not always match the number the user expects to set; the split calculation above explains why.
The core lines: `long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));` takes the larger of `getFormatMinSplitSize()`, which returns 1 by default, and `getMinSplitSize(job)`, the minimum split size set by the user. `long maxSize = getMaxSplitSize(job);` is the maximum split size set by the user, which defaults to `Long.MAX_VALUE` (9223372036854775807). The split size is then computed by `long splitSize = computeSplitSize(blockSize, minSize, maxSize);`:

```java
protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}
```
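To see what this formula produces for the numbers used in the tests below, here is a small standalone sketch (the class and the method re-creation are local to this sketch, not the Hadoop originals):

```java
// Standalone sketch of the split-size formula; not the Hadoop class itself.
public class SplitSizeDemo {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // 134217728, as in the tests below
        // Defaults (minSize = 1, maxSize = Long.MAX_VALUE): split size == block size
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE)); // 134217728
        // Test 1 below (minSize = 301349250, maxSize = 10000): minSize wins
        System.out.println(computeSplitSize(blockSize, 301349250L, 10000L)); // 301349250
    }
}
```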
Test file size: 297 MB (311,349,250 bytes)
Block size: 128 MB
Test code
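The driver itself is not shown in the original post; a minimal hypothetical sketch (the class name, the argument-based paths, and the reliance on the default identity mapper/reducer are my assumptions) would look roughly like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver used to observe the map count; details are assumptions.
public class SplitTestDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-test");
        job.setJarByClass(SplitTestDriver.class);
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // The per-test min/max split sizes from Tests 1-3 go here.
        // Default identity Mapper/Reducer suffice; only splitting matters.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```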
Test 1
```java
FileInputFormat.setMinInputSplitSize(job, 301349250);
FileInputFormat.setMaxInputSplitSize(job, 10000);
```
After running the test, the number of maps is 1. By the formula above, the split size works out to 301349250, which is smaller than the file size of 311349250, so in theory there should be two maps. The answer lies in the split loop:
```java
while (((double) bytesRemaining) / splitSize > 1.1D) {
    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
    splits.add(makeSplit(path, length - bytesRemaining, splitSize,
            blkLocations[blkIndex].getHosts()));
    bytesRemaining -= splitSize;
}
```
As long as the remaining bytes are no more than 1.1 times the split size, they are packed into a single split. This avoids launching an extra map task that would process only a tiny amount of data and waste resources. Here 311349250 / 301349250 ≈ 1.03 ≤ 1.1, so the loop body never runs and the whole file becomes one split.
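A standalone simulation of the loop (local names, not Hadoop code) makes the effect concrete for the split sizes used in these tests:

```java
// Simulates the split loop outside Hadoop; names are local to this sketch.
public class SplitLoopDemo {
    static int countSplits(long length, long splitSize) {
        int splits = 0;
        long bytesRemaining = length;
        // Same 1.1 slop factor as FileInputFormat's SPLIT_SLOP.
        while (((double) bytesRemaining) / splitSize > 1.1D) {
            splits++;
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            splits++; // the tail split absorbs the remainder
        }
        return splits;
    }

    public static void main(String[] args) {
        long length = 311349250L;                             // 297 MB test file
        System.out.println(countSplits(length, 301349250L));  // Test 1 -> 1
        System.out.println(countSplits(length, 150L * 1024 * 1024)); // Test 2 -> 2
    }
}
```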
Test 2
```java
FileInputFormat.setMinInputSplitSize(job, 150 * 1024 * 1024);
FileInputFormat.setMaxInputSplitSize(job, 10000);
```
The number of maps is 2: the split size is max(150 MB, min(10000, 128 MB)) = 157286400, and 311349250 / 157286400 ≈ 1.98 > 1.1, so the loop cuts one 150 MB split and the remaining ~147 MB becomes the second.
Test 3
Add a small file of a few KB to the original input directory, to test whether it gets merged with the large file.
```java
FileInputFormat.setMinInputSplitSize(job, 150 * 1024 * 1024);
FileInputFormat.setMaxInputSplitSize(job, 10000);
```
The number of maps becomes 3: the small file is not merged into the large file's splits but gets a split of its own.
Looking at the source code again:
```java
for (FileStatus file : files) {
    ...
}
```
Splitting is done file by file: the loop iterates over individual files, and a split never spans two files. That also matches common sense, since different files can have different content and formats.
In summary, the split process is roughly as follows: first traverse the input directory, filter out files that do not qualify, and add the rest to a list; then split each file on its own (the split size comes from the formula above, and a small tail at the end of a file may be merged into the last split); add the resulting splits to the split list; finally, each map task reads its own split for processing.
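Tying it together, the same simulation extended to several files reproduces Test 3 (again a local approximation of getSplits(), not the real implementation; the 4 KB size for the small file is an assumption):

```java
// Approximates getSplits() across several files; a local sketch only.
public class MultiFileSplitDemo {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    static int countSplits(long length, long splitSize) {
        int n = 0;
        long remaining = length;
        while (((double) remaining) / splitSize > 1.1D) { n++; remaining -= splitSize; }
        return remaining != 0 ? n + 1 : n;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;
        // Test 3 settings: minSize = 150 MB, maxSize = 10000
        long splitSize = computeSplitSize(blockSize, 150L * 1024 * 1024, 10000L);
        long[] fileLengths = { 311349250L, 4096L }; // 297 MB file + a few-KB file
        int total = 0;
        for (long len : fileLengths) {
            total += countSplits(len, splitSize); // each file is split on its own
        }
        System.out.println(total); // 3, matching Test 3
    }
}
```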