HDFS Concepts in Detail: Blocks

Source: Internet
Author: User

A disk has a block size, which is the minimum amount of data it can read or write in a single operation. A file system built on a disk manages data in chunks that are an integer multiple of the disk block size. File system blocks are typically a few kilobytes in size, whereas disk blocks are normally 512 bytes. This is transparent to file system users, who simply read or write a file at any offset and length. However, tools that maintain the file system, such as df and fsck, operate at the file system block level.
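As an illustration (a minimal sketch, not from the original text), on a POSIX system you can query the file system's preferred block size with Python's `os.statvfs`:

```python
import os

# Query the file system containing "/" for its block size.
# f_bsize is the file system's preferred I/O block size in bytes;
# on most Linux systems this is 4096, i.e. several 512-byte disk blocks.
st = os.statvfs("/")
print(st.f_bsize)
```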

HDFS also has the concept of a block, but it is a much larger unit: 128 MB by default. Like a file system for a single disk, files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike a file system for a single disk, however, a file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage. Unless otherwise specified, "block" in this book refers to a block in HDFS.
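To make the small-file point concrete, here is a quick sketch (128 MB is the HDFS default block size; the file sizes are made up for illustration):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default

def hdfs_usage(file_size):
    """Return (number of blocks, bytes actually stored) for a file."""
    blocks = math.ceil(file_size / BLOCK_SIZE) if file_size else 0
    # Unlike a single-disk file system, a partial final block only
    # occupies as much underlying storage as it actually contains.
    return blocks, file_size

# A 1 MB file uses one block but only 1 MB of storage, not 128 MB.
print(hdfs_usage(1 * 1024 * 1024))    # (1, 1048576)
# A 300 MB file spans three blocks: 128 + 128 + 44 MB.
print(hdfs_usage(300 * 1024 * 1024))  # (3, 314572800)
```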

Why is a block in HDFS so large?

HDFS blocks are larger than disk blocks in order to minimize the cost of seeks. If a block is large enough, the time to transfer the data from the disk can be made significantly longer than the time to seek to the start of the block. Thus, the time to transfer a large file made up of multiple blocks is dominated by the disk transfer rate.

Let's do a quick calculation. If the seek time is around 10 ms and the transfer rate is 100 MB/s, then to make the seek time 1% of the transfer time, we need a block size of around 100 MB. The default is actually 128 MB, although some HDFS installations configure even larger blocks. This figure will continue to be revised upward as transfer rates grow with new generations of disk drives.
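The arithmetic above can be written out as a short sketch (the 10 ms seek time, 100 MB/s transfer rate, and 1% target are the example figures from the text):

```python
seek_time_s = 0.010     # ~10 ms to seek to the start of a block
transfer_rate = 100e6   # 100 MB/s sustained disk transfer rate
seek_fraction = 0.01    # seeking should cost only 1% of transfer time

# Transfer time must be seek_time / seek_fraction = 1 second,
# so the block must hold one second's worth of transferred data.
block_size = transfer_rate * (seek_time_s / seek_fraction)
print(block_size / 1e6)  # 100.0 -> about 100 MB
```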

Of course, this argument shouldn't be taken too far. A map task in MapReduce normally processes one block at a time, so if the number of tasks is too small (fewer than the number of nodes in the cluster), the job will run slower than it otherwise could.
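For example (a hypothetical 1 GB input file and a hypothetical 50-node cluster, with the default 128 MB block size), the available parallelism is capped by the block count:

```python
import math

block_size = 128 * 1024 * 1024      # HDFS default block size
file_size = 1 * 1024 * 1024 * 1024  # a hypothetical 1 GB input file
cluster_nodes = 50                  # a hypothetical cluster size

# One map task per block: only 8 tasks, so at most 8 of the
# 50 nodes can work on this job at any one time.
map_tasks = math.ceil(file_size / block_size)
print(map_tasks)                    # 8
print(map_tasks < cluster_nodes)    # True -> cluster underutilized
```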

Having a block abstraction in a distributed file system brings several benefits. The first, and most obvious, is that a file can be larger than the capacity of any single disk in the network. The blocks of a file do not need to be stored on the same disk, so they can take advantage of any of the disks in the cluster. In fact, although unusual, a single file on an HDFS cluster could be stored with its blocks occupying all the disks in the cluster.

A second benefit is that making the unit of abstraction a block rather than a file simplifies the storage subsystem. Simplicity is something to strive for in all systems, but it is especially important for a distributed system with a wide variety of failure modes. The storage subsystem deals only with blocks, which simplifies storage management (because blocks are a fixed size, it is easy to calculate how many can be stored on a given disk) and eliminates metadata concerns: blocks are just chunks of stored data, and file metadata such as permission information does not need to be stored with the blocks, so another system can manage metadata separately.

In addition, blocks fit well with replication, which provides fault tolerance and availability. To insure against corrupted blocks and disk or machine failure, each block is replicated to a small number of physically separate machines (typically three). If one block becomes unavailable, a copy can be read from another location in a way that is transparent to the user. A block that is no longer available due to corruption or machine failure can be replicated from its alternative locations to other live machines to bring the replication factor back to the normal level. (See Chapter 4, "Data Integrity," for more on guarding against data corruption.) Similarly, some applications may choose to set a higher replication factor for the blocks of popular files to spread the read load across the cluster.
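One practical consequence worth keeping in mind (a back-of-the-envelope sketch; replication factor 3 is the HDFS default, and the file size is hypothetical) is that replication multiplies the raw storage a file consumes:

```python
replication = 3    # HDFS default replication factor
file_size_gb = 10  # a hypothetical 10 GB file

# Each block is stored on `replication` separate machines, so the raw
# cluster storage consumed is replication times the logical file size.
raw_storage_gb = file_size_gb * replication
print(raw_storage_gb)  # 30
```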

As with a disk file system, the fsck command in HDFS understands blocks. For example, running the following command lists the blocks that make up each file in the file system:

% hadoop fsck / -files -blocks

