HDFS data blocks
A disk block is the smallest unit of data a disk can read or write, typically 512 bytes.
HDFS also has data blocks, with a default size of 64 MB, so a large file on HDFS is split into many blocks. A small file (smaller than 64 MB) does not occupy a full block's worth of space.
HDFS blocks are made this large to reduce seek (addressing) overhead, and data replication is also performed block by block.
Use the hadoop fsck / -files -blocks command to check the health of all files and blocks under the HDFS root directory (/).
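Block layout can also be inspected programmatically. The following is a minimal sketch (not part of the original article) using the Hadoop Java FileSystem API; the path /data/large.log is a made-up example and error handling is omitted:

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockInfo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/large.log");   // hypothetical file
            FileStatus status = fs.getFileStatus(file);

            // Block size configured for this file system (e.g. 64 MB on older clusters)
            System.out.println("Default block size: " + fs.getDefaultBlockSize(file));

            // One BlockLocation per block: a large file spans many blocks,
            // a small file occupies only one, partially filled, block.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("offset=" + b.getOffset()
                        + " length=" + b.getLength()
                        + " hosts=" + Arrays.toString(b.getHosts()));
            }
            fs.close();
        }
    }

Each BlockLocation corresponds to one block of the file and reports which datanodes hold a replica of it.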
Map task input splits
Why is the recommended split size for a map task the same as the HDFS block size?
Three factors are at play:
A. Number of map tasks = total input size / split size, so the larger the split, the fewer map tasks there are and the less task-startup overhead the system incurs. For example, a 1 GB input yields 16 map tasks with 64 MB splits, but 32 with 32 MB splits.
B. Split-management overhead: the larger the splits, the fewer of them there are, and the easier they are to manage.
Judging from factors A and B alone, it would seem that the larger the split, the better.
C. Network transfer overhead: if a split is so large that it spans multiple HDFS blocks, a single map task has to pull data from several blocks over the network, so the HDFS block size is the natural upper limit on the split size.
To sum up, setting the map task's split size to the HDFS block size is the best choice.
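As a rough sketch of how this works in practice: Hadoop's FileInputFormat derives the split size by clamping the block size between a configurable minimum and maximum, which with default settings makes the split size equal to the block size. The snippet below mirrors that formula as I understand it and is an illustration, not the library's exact code; the input size is a hypothetical example.

    public class SplitSizeDemo {
        // Mirrors the clamping formula: max(minSize, min(maxSize, blockSize))
        static long computeSplitSize(long blockSize, long minSize, long maxSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) {
            long blockSize = 64L * 1024 * 1024;   // 64 MB HDFS block
            long minSize   = 1L;                  // effective default minimum split size
            long maxSize   = Long.MAX_VALUE;      // default maximum split size (unbounded)

            long splitSize = computeSplitSize(blockSize, minSize, maxSize);
            long inputSize = 1024L * 1024 * 1024; // hypothetical 1 GB input file

            System.out.println("split size = " + splitSize + " bytes");
            System.out.println("approx. map tasks = "
                    + (long) Math.ceil((double) inputSize / splitSize));
        }
    }

With the defaults shown, the split size comes out to the block size, and dividing the input size by it gives the number of map tasks, which is exactly the trade-off described by factors A through C above.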