Differences in InputSplit size between the old and new versions of Hadoop
In the old version of Hadoop, the size of an InputSplit is determined by the following three parameters:
goalSize: totalSize / numSplits. totalSize is the total size of the input files, and numSplits is the number of map tasks set by the user. The default value is 1.
minSize: the minimum size of an InputSplit, set by mapred.min.split.size. The default value is 1.
blockSize: the block size in HDFS.
splitSize = max(minSize, min(goalSize, blockSize))
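The old-version formula can be checked with a few numbers. The following is only a minimal Python sketch of the arithmetic, not Hadoop code: the function name old_split_size and the sample sizes (a 1 GB input, 64 MB blocks) are assumptions chosen for illustration.

```python
MB = 1024 * 1024

def old_split_size(total_size, num_splits, min_size, block_size):
    """Old-version formula: max(minSize, min(goalSize, blockSize)).

    Hypothetical helper for illustration; all sizes are in bytes.
    """
    # goalSize = totalSize / numSplits (guard against a zero map-task count)
    goal_size = total_size // max(num_splits, 1)
    return max(min_size, min(goal_size, block_size))

# Few map tasks requested: goalSize (512 MB) exceeds blockSize, so blockSize wins.
few_maps = old_split_size(1024 * MB, 2, 1, 64 * MB)

# Many map tasks requested: goalSize (32 MB) is below blockSize, so splits shrink.
many_maps = old_split_size(1024 * MB, 32, 1, 64 * MB)

# minSize above blockSize: minSize wins and a split spans more than one block.
large_min = old_split_size(1024 * MB, 2, 128 * MB, 64 * MB)

print(few_maps // MB, many_maps // MB, large_min // MB)
```

In this sketch, raising the number of map tasks can only make a split smaller than a block, while raising minSize is the only way to make it larger.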
In the new version of Hadoop, the size of an InputSplit is determined by the following three parameters:
maxSize: the maximum size of an InputSplit, set by mapred.max.split.size. The number of map tasks set by the user is no longer considered.
minSize: the minimum size of an InputSplit, set by mapred.min.split.size. The default value is 1.
blockSize: the block size in HDFS.
splitSize = max(minSize, min(maxSize, blockSize))
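The new-version formula can be sketched the same way. This is a minimal Python illustration of the arithmetic, not Hadoop code: the function name new_split_size and the 64 MB block size are assumptions, and the default for max_size assumes that an unset mapred.max.split.size behaves as the largest 64-bit signed integer.

```python
MB = 1024 * 1024
LONG_MAX = 2**63 - 1  # assumed default when mapred.max.split.size is not set

def new_split_size(block_size, min_size=1, max_size=LONG_MAX):
    """New-version formula: max(minSize, min(maxSize, blockSize)).

    Hypothetical helper for illustration; all sizes are in bytes.
    """
    return max(min_size, min(max_size, block_size))

# Defaults: the split size equals the HDFS block size.
default_split = new_split_size(64 * MB)

# maxSize below blockSize: splits become smaller than a block (more map tasks).
capped_split = new_split_size(64 * MB, max_size=32 * MB)

# minSize above blockSize: splits become larger than a block (fewer map tasks).
raised_split = new_split_size(64 * MB, min_size=128 * MB)

print(default_split // MB, capped_split // MB, raised_split // MB)
```

Under these assumptions, the number of map tasks is tuned only through maxSize and minSize, since the requested map-task count no longer enters the formula.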