1. The Conclusion First
1. If you want to increase the number of maps, set mapred.map.tasks to a larger value.
2. If you want to reduce the number of maps, set mapred.min.split.size to a larger value.
3. If the input contains many small files and you still want to reduce the number of maps, merge the small files into larger files first, then apply guideline 2. (A code sketch of these settings follows.)
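As a minimal, hedged sketch of how these knobs are set in a driver program (using the property names from this post; newer Hadoop releases prefer renamed mapreduce.* equivalents, and the classes below assume the standard Hadoop 2.x client API):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Hedged sketch: setting the map-count knobs from the list above on a job's Configuration.
public class MapCountConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Guideline 1: ask for more maps (only a hint; see the analysis below).
        conf.set("mapred.map.tasks", "100");
        // Guideline 2: raise the minimum split size (in bytes) to get fewer maps.
        conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);
        Job job = Job.getInstance(conf, "map-count-demo");
        // ... set input/output paths, mapper class, etc., then submit the job.
    }
}
```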
2. Principle and Analysis Process
I have read a lot of blog posts on this topic and felt that none of them explained it clearly, so I am tidying it up here.
Let's take a look at this diagram.
Input split: before the map phase runs, MapReduce computes input splits from the input files, and each input split is processed by one map task. An input split does not contain the data itself; it records the split's length and an array of locations where the data is stored.
In Hadoop 2.x the default block size is 128 MB; in Hadoop 1.x it is 64 MB. You can change it with dfs.block.size in hdfs-site.xml; note that the unit is bytes (128 MB = 134217728 bytes).
The split size range can be set in mapred-site.xml via mapred.min.split.size and mapred.max.split.size: minSplitSize defaults to 1 byte and maxSplitSize defaults to Long.MAX_VALUE = 9223372036854775807.
So how big is a split?
minSize = max{minSplitSize, mapred.min.split.size}
maxSize = mapred.max.split.size
splitSize = max{minSize, min{maxSize, blockSize}}
Let's look at the source code again.
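For reference, here is a minimal sketch of that rule, matching the formula above; it mirrors the shape of computeSplitSize in the new-API FileInputFormat, though the exact source differs between Hadoop versions:

```java
// Minimal sketch of the split-size rule: splitSize = max(minSize, min(maxSize, blockSize)).
// Mirrors the shape of FileInputFormat.computeSplitSize in the new MapReduce API (version-dependent).
public class SplitSizeSketch {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // 128 MB block
        long minSize   = 1L;                 // mapred.min.split.size default
        long maxSize   = Long.MAX_VALUE;     // mapred.max.split.size default
        // With the defaults, the split size equals the block size: 134217728 bytes.
        System.out.println(computeSplitSize(blockSize, minSize, maxSize));
    }
}
```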
So when we do not set a split size range, the split size is determined by the block size and equals it. For example, if we upload a 258 MB file to HDFS and the block size is 128 MB, the file is divided into three blocks, which correspond to three splits and therefore three map tasks. This raises another question: the third block holds only 2 MB of data while its block size is 128 MB, so how much space does it actually occupy on the Linux file system?
The answer is the actual file size, not the size of a block.
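To make the arithmetic of the 258 MB example concrete, here is a plain calculation (just arithmetic, not a Hadoop API; the real split planner also applies a slop factor when carving splits):

```java
// Plain arithmetic for the 258 MB example above (illustration only, not a Hadoop API).
public class BlockArithmeticSketch {
    public static void main(String[] args) {
        long fileSize  = 258L * 1024 * 1024;  // 258 MB file
        long blockSize = 128L * 1024 * 1024;  // 128 MB block size
        long numBlocks = (fileSize + blockSize - 1) / blockSize;  // ceiling division -> 3 blocks
        long lastBlock = fileSize - (numBlocks - 1) * blockSize;  // data in the last block -> 2 MB
        // The last block occupies only its actual 2 MB on the local file system, not a full 128 MB.
        System.out.println(numBlocks + " blocks, last block holds " + lastBlock + " bytes");
    }
}
```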
Someone has already verified the answer: http://blog.csdn.net/samhacker/article/details/23089157
1. Before adding the new file to HDFS, Hadoop occupied 464 MB on the Linux file system:
2. Add a file of size 2673375 bytes (about 2.5 MB) to HDFS:
2673375 Derby.jar
3. At this point Hadoop occupies 467 MB on Linux, i.e. it grew by roughly the actual file size (about 2.5 MB), not by a full block size:
4. Use hadoop dfs -stat to view the file information:
Here it is clear that the actual size of the file is 2673375 bytes, while its block size is the full configured block size.
5. View the file information through the NameNode web console:
The result is the same: the file size is 2673375 bytes, while its block size is the full configured block size.
6. However, viewing the file with hadoop fsck shows something different: '1 (avg.block size 2673375 B)':
It is worth noting that the output contains the phrase '1 (avg.block size 2673375 B)'. The 'block size' here does not refer to the usual block size metadata attribute; instead it reflects the actual size of the file. Here is a reply from a Hadoop community expert:
"The fsck is showing 'average blocksize' and not the block size metadata attribute of the file like stat shows. In this specific case, the average is just the length of your file, which is lesser than one whole block."
The last question: if HDFS occupies disk space on the Linux file system according to the actual file size, is this "block size" still necessary?
In fact, the block size is still necessary. One obvious use is that when a file grows through append operations, the block size decides when to split it into a new block. Here is another reply from a Hadoop community expert:
"The block size is a meta attribute. If you append tothe file later, it still needs to know when to split further-so it keeps that value as a mere metadata I T can use the advise itself on write boundaries. "
Addendum: I also found the following passage.
Original address: http://blog.csdn.net/lylcore/article/details/9136555
The size of a split is determined by three values: goalSize, minSize, and blockSize. The logic of computeSplitSize is to first take the smaller of goalSize and blockSize, then take the larger of that result and minSize. (If you do not set the number of maps, blockSize is the block size of the current file, and goalSize is the file size divided by the number of maps set by the user; if it is not set, the default is 1.)
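A minimal sketch of that old-API rule (paraphrasing the shape of org.apache.hadoop.mapred.FileInputFormat; the exact source differs between releases, and the input sizes below are hypothetical):

```java
// Sketch of the old-API rule: splitSize = max(minSize, min(goalSize, blockSize)).
public class OldApiSplitSizeSketch {
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long totalSize = 10L * 1024 * 1024 * 1024;  // 10 GB of input (hypothetical)
        int  goalNum   = 200;                       // mapred.map.tasks requested by the user
        long goalSize  = totalSize / Math.max(goalNum, 1);
        long minSize   = 1L;
        long blockSize = 128L * 1024 * 1024;
        // goalSize (~51 MB) is smaller than the 128 MB block, so splits shrink and more maps run.
        System.out.println(computeSplitSize(goalSize, minSize, blockSize));
    }
}
```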
Hadoop provides the parameter mapred.map.tasks to set the number of maps, and we can use it to control the map count. However, setting the number of maps this way is not always effective, because mapred.map.tasks is only a hint to Hadoop; the final number of maps also depends on other factors. To make the explanation easier, first define a few terms:
block_size: the HDFS block size, 64 MB by default, settable via the parameter dfs.block.size
total_size: the overall size of the input
input_file_num: the number of input files
(1) Default map number. If you do not set anything, the default number of maps is related to block_size: default_num = total_size / block_size;
(2) Expected number. The programmer can set the expected number of maps through mapred.map.tasks, but it only takes effect when it is greater than default_num: goal_num = mapred.map.tasks;
(3) Data size per task. You can set the amount of data each map task processes through mapred.min.split.size, but this only takes effect when it is greater than block_size: split_size = max(mapred.min.split.size, block_size); split_num = total_size / split_size;
(4) Calculated map number: compute_map_num = min(split_num, max(default_num, goal_num)).
Besides these configurations, MapReduce follows another principle: the data handled by one map cannot span files, that is, map_num >= input_file_num. So the final number of maps is:
final_map_num = max(compute_map_num, input_file_num)
From the analysis above, when setting the number of maps you can simply remember the following points:
(1) If you want to increase the number of maps, set mapred.map.tasks to a larger value.
(2) If you want to reduce the number of maps, set mapred.min.split.size to a larger value.
(3) If the input contains many small files and you still want to reduce the number of maps, merge the small files into larger files, then apply guideline 2.
Reference: http://blog.csdn.net/dr_guo/article/details/51150278
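Putting the summarized formulas together, here is a small sketch with hypothetical sizes (it reproduces the formulas above, not a real Hadoop API):

```java
// Sketch of the map-count formulas summarized above, with hypothetical input sizes.
public class MapCountSketch {
    public static void main(String[] args) {
        long blockSize    = 64L * 1024 * 1024;        // block_size (64 MB in this passage)
        long totalSize    = 10L * 1024 * 1024 * 1024; // total_size: 10 GB of input
        long inputFileNum = 5;                        // input_file_num
        long goalNum      = 200;                      // goal_num = mapred.map.tasks
        long minSplitSize = 1;                        // mapred.min.split.size

        long defaultNum = totalSize / blockSize;                              // (1) default_num = 160
        long splitSize  = Math.max(minSplitSize, blockSize);                  // (3) split_size
        long splitNum   = totalSize / splitSize;                              //     split_num  = 160
        long computeNum = Math.min(splitNum, Math.max(defaultNum, goalNum));  // (4) capped at 160
        long finalNum   = Math.max(computeNum, inputFileNum);                 // final_map_num = 160
        // Even though 200 maps were requested, these formulas cap the result at 160.
        System.out.println(finalNum);
    }
}
```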
Three words "Hadoop" tells you how to control the number of map processes in MapReduce?