A brief introduction to data blocks and map task input splits in Hadoop HDFS

Source: Internet
Author: User

HDFS data blocks

A disk block is the smallest unit of data that a disk can read or write, typically 512 bytes.

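HDFS also has data blocks, but they are much larger: the default is 64 MB (this is the Hadoop 1.x default; later releases default to 128 MB). Large files on HDFS are therefore divided into many block-sized chunks. A file smaller than one block (less than 64 MB) does not occupy a full block's worth of underlying storage.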
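
The division into blocks can be illustrated with a little arithmetic. The following is only a sketch, not HDFS code: the function name split_into_blocks is hypothetical, and the 64 MB default is the block size assumed in this article.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=64 * MB):
    """Return the sizes (in bytes) of the blocks a file would be divided into.

    Illustrative model only: every block is full except possibly the last,
    which holds just the remaining bytes.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block is only as large as its data
    return sizes


# A 150 MB file becomes two full blocks plus one 22 MB block.
print([size // MB for size in split_into_blocks(150 * MB)])  # [64, 64, 22]
# A 10 MB file is a single block that holds only 10 MB.
print([size // MB for size in split_into_blocks(10 * MB)])   # [10]
```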

HDFS blocks are made this large to reduce addressing (seek) overhead: when a block is large, the time spent transferring it from disk far exceeds the time spent locating its start. Data replication is also done per block.
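
A rough calculation shows why a large block keeps addressing overhead small. The figures below (a 10 ms seek time and a 100 MB/s transfer rate) are assumptions chosen for illustration, not measurements, and the function name seek_overhead_fraction is hypothetical.

```python
MB = 1024 * 1024


def seek_overhead_fraction(block_size, seek_time_s=0.010, transfer_rate=100 * MB):
    """Fraction of the total read time spent seeking, for one block.

    seek_time_s and transfer_rate (bytes per second) are assumed values.
    """
    transfer_time_s = block_size / transfer_rate
    return seek_time_s / (seek_time_s + transfer_time_s)


for size in (512, 4 * 1024, 64 * MB):
    share = seek_overhead_fraction(size)
    print(f"{size:>10} bytes per block: {share:.2%} of read time spent seeking")
```

Under these assumptions, reading in 512-byte units spends almost all of the time seeking, while a 64 MB block spends under 2% of the time seeking.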

Use the command hadoop fsck / -files -blocks to check the health of all files and blocks under the HDFS root directory (/).

Input splits of map tasks:

Why is it recommended that the split size of a map task match the HDFS block size?

There are three factors to consider:

A. Number of map tasks = total input file size / split size. The larger the split, the fewer the map tasks, and the lower the overhead of launching and scheduling them.

B. Split management overhead: the larger the splits, the fewer of them there are, and the easier they are to manage.

From factors A and B alone, it seems that the larger the split, the better.

C. Network transfer overhead

However, if a split is so large that it spans multiple HDFS blocks, those blocks are unlikely to all be stored on the node that runs the map task, so part of the task's input must be transferred over the network. The upper limit on the split size is therefore the size of an HDFS block.
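
The trade-off between factors A, B and C can be put into numbers. The sketch below is a simplified model rather than Hadoop code: it assumes a single 1 GB input file whose splits start at offset 0, and the helper names num_map_tasks and blocks_touched are hypothetical.

```python
MB = 1024 * 1024


def num_map_tasks(total_input_size, split_size):
    """Factor A: number of map tasks = total input size / split size, rounded up."""
    return -(-total_input_size // split_size)  # integer ceiling division


def blocks_touched(split_index, split_size, block_size):
    """Factor C: how many HDFS blocks the split with this index overlaps."""
    start = split_index * split_size
    end = start + split_size - 1  # offset of the last byte in the split
    return end // block_size - start // block_size + 1


total = 1024 * MB   # assumed input size: 1 GB
block = 64 * MB     # block size used in this article

for split in (32 * MB, 64 * MB, 128 * MB):
    tasks = num_map_tasks(total, split)
    worst = max(blocks_touched(i, split, block) for i in range(tasks))
    print(f"split {split // MB:>3} MB: {tasks:>2} map tasks, "
          f"up to {worst} block(s) per split")
```

In this model a 32 MB split stays within one block but doubles the number of map tasks, a 128 MB split halves the number of tasks but spans two blocks, and a 64 MB split gives the fewest tasks that still read from a single block each.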

To sum up, setting the split size of a map task equal to the HDFS block size is the best choice.
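
This conclusion also matches how Hadoop picks a split size by default. The sketch below mirrors the rule used by FileInputFormat, max(minSize, min(maxSize, blockSize)), rewritten in Python for illustration. The default minimum and maximum shown here are assumptions that stand in for the configured values, so check the configuration of the Hadoop version in use.

```python
MB = 1024 * 1024


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Python rendering of the split size rule: max(min_size, min(max_size, block_size)).

    min_size and max_size stand in for the configured minimum and maximum
    split sizes; the defaults here are assumed, not read from a cluster.
    """
    return max(min_size, min(max_size, block_size))


# With no limits configured, the split size equals the block size.
print(compute_split_size(64 * MB) // MB)                      # 64
# A smaller maximum forces smaller splits (more map tasks).
print(compute_split_size(64 * MB, max_size=32 * MB) // MB)    # 32
# A larger minimum forces splits that span more than one block.
print(compute_split_size(64 * MB, min_size=128 * MB) // MB)   # 128
```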
