Hadoop Series, Part Four: The Hadoop Distributed File System (HDFS)

When a dataset grows beyond the storage capacity of a single physical machine, we can consider using a cluster. File systems that manage storage across a network of machines are called distributed file systems. Introducing multiple nodes brings new problems, the most important of which is how to ensure that data is not lost when a node fails. HDFS (Hadoop Distributed File System), a core subproject of Hadoop, manages storage for the cluster. HDFS is not the only option, however: Hadoop works against a general, abstract file system interface, which allows it to operate over different kinds of file systems; for example, Hadoop can be integrated with Amazon's S3.
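
For readers who want to see this abstraction in code, here is a minimal sketch (the NameNode address hdfs://namenode:9000 and the s3a scheme are illustrative assumptions; the exact scheme and required connector depend on the Hadoop version):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class FsAbstractionDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The same abstract FileSystem API resolves to a concrete
            // implementation based on the URI scheme.
            FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);
            System.out.println("Using: " + hdfs.getUri());
            // With an S3 connector on the classpath, the same call can return
            // an S3-backed file system instead:
            // FileSystem s3 = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
        }
    }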

1 HDFS Design Concepts

1.1 Storing very large files

The "oversized file" here refers to files that are hundreds of MB, GB, or even terabytes in size.

1.2 Streaming data access

HDFS is built around the idea that the most efficient data-processing pattern is write once, read many times. A dataset is typically generated or copied into HDFS once and then analyzed repeatedly over a long period, with each analysis touching a large portion, or even all, of the dataset. The time taken to read the whole dataset therefore matters more than the latency of reading the first record. (Streaming reads minimize disk-seek overhead: the disk seeks once and then reads continuously. The physical construction of hard disks means that seek times have not improved as quickly as transfer rates, so streaming access suits the characteristics of the disk, and large files suit streaming access.) The opposite of streaming access is random access, which demands low latency for locating, querying, or modifying individual records and fits workloads that keep updating data after it is created, as traditional relational databases do.
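
As a rough illustration of this access pattern (the path and NameNode address below are made-up examples), a typical HDFS read simply opens a file and streams it from beginning to end:

    import java.io.InputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class StreamingRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);
            // Open the file once, then read it sequentially end to end;
            // there is no seeking back and forth as in random-access workloads.
            try (InputStream in = fs.open(new Path("/data/logs.txt"))) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }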

1.3 Running on commodity hardware

HDFS is designed to run on ordinary, inexpensive servers (commodity hardware). Even when that hardware fails, fault-tolerance mechanisms keep the data highly available.

2 Scenarios where HDFS is not a good fit

2.1 Scenarios where data access requires low latency

Because HDFS is designed for applications that need high data throughput, it achieves that throughput at the expense of latency.

2.2 Storing a large number of small files

HDFS keeps its metadata (the basic information about each file) in the NameNode's memory, and the NameNode is a single point. Once the number of small files grows large enough, the NameNode's memory can no longer cope.
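
To get a feel for the scale of the problem, here is a back-of-the-envelope estimate; the figure of roughly 150 bytes of NameNode heap per namespace object (file, directory, or block) is a commonly cited approximation, not an exact number:

    public class NamenodeMemoryEstimate {
        public static void main(String[] args) {
            long bytesPerObject = 150L;        // rough, commonly cited figure per file/block/directory
            long smallFiles = 100_000_000L;    // 100 million small files
            long objects = smallFiles * 2;     // roughly one file object plus one block object each
            double gb = objects * bytesPerObject / (1024.0 * 1024 * 1024);
            System.out.printf("Approximate NameNode heap for metadata: %.1f GB%n", gb);
            // Prints roughly 28 GB -- needed regardless of how tiny the files are.
        }
    }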

2.3 Multiple writers and arbitrary file modifications

A file in HDFS may have only one writer at a time, and writes always append data to the end of the file. HDFS supports neither multiple concurrent writers nor modifications at arbitrary positions within a file.
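
The Java API reflects this model: a client either creates a new file or appends to an existing one, and there is no call for overwriting bytes in the middle of a file. A minimal sketch (the path is illustrative, and append support depends on the HDFS version and configuration):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendOnlyWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);
            Path p = new Path("/data/events.log");
            // Writes always go to the end of the file: either a fresh create
            // or an append to an existing file...
            try (FSDataOutputStream out = fs.exists(p) ? fs.append(p) : fs.create(p)) {
                out.writeBytes("one more record\n");
            }
            // ...there is no API for modifying bytes at an arbitrary offset.
        }
    }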

3 Basic concepts of HDFS

3.1 Blocks

Every disk has a block size, the smallest unit of data it can read or write in one operation. HDFS has the concept of a block as well, but an HDFS block is far larger than a typical disk block (which is usually 512 bytes). As in an ordinary disk file system, HDFS splits a file into block-sized chunks that are stored independently (unless stated otherwise, "block" below means an HDFS block, 64 MB by default). Unlike an ordinary disk file system, however, a file smaller than a single block does not occupy a full block's worth of underlying storage.
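
The block size can be set cluster-wide in the configuration or per file when the file is created. As a minimal sketch (the path, buffer size, and replication factor are illustrative values), the create overload that takes an explicit block size looks like this:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CreateWithBlockSize {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);
            long blockSize = 128L * 1024 * 1024;   // use 128 MB blocks for this file
            short replication = 3;                 // default replication factor
            try (FSDataOutputStream out =
                     fs.create(new Path("/data/big.dat"), true, 4096, replication, blockSize)) {
                out.writeBytes("payload...\n");
            }
        }
    }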

3.1.1 Why HDFS uses large blocks

HDFS blocks are much larger than those of ordinary disk file systems in order to minimize the cost of seeks. If the block is large enough, the time spent seeking to the start of the block is small compared with the time spent transferring the data, so the time to process a large file (the workload HDFS is mainly built for) is dominated by the transfer rate.

Suppose the average seek time is 10 ms and the transfer rate is 100 MB/s. To make the seek time only about 1% of the transfer time, the transfer must take around 1 s, and 1 s at 100 MB/s means a block size of roughly 100 MB. In practice the default block size is 64 MB (many HDFS deployments use 128 MB), and block sizes tend to grow as transfer rates improve. The block size should not grow without limit, though: a map task in MapReduce normally processes one block at a time, so for a file of a given size, blocks that are too large mean too few map tasks and a longer-running job.
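
The arithmetic behind that 100 MB figure can be written out directly, using the numbers from the paragraph above:

    public class BlockSizeEstimate {
        public static void main(String[] args) {
            double seekTimeSec = 0.010;         // 10 ms average seek time
            double transferRateMBps = 100.0;    // 100 MB/s transfer rate
            double targetSeekFraction = 0.01;   // seek should be ~1% of transfer time

            // Transfer time must be seekTime / targetSeekFraction = 1 s,
            // and 1 s of transfer at 100 MB/s is 100 MB.
            double blockSizeMB = (seekTimeSec / targetSeekFraction) * transferRateMBps;
            System.out.printf("Block size for ~1%% seek overhead: %.0f MB%n", blockSizeMB);
        }
    }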

3.1.2 Benefits of the block abstraction in a distributed file system

1. Because a file's blocks need not all sit on the same disk, a single file can be larger than any single disk, or indeed any single node, in the cluster, which makes full use of the cluster's storage capacity. It is even possible (though uncommon) for one file to occupy storage on every node in the cluster.

2. Making the block, rather than the file, the unit of abstraction simplifies the storage subsystem. Simplicity is a goal of every storage system and matters especially for distributed file systems, where failures come in many forms. The storage subsystem deals only with blocks, which simplifies storage management (blocks are a fixed size, so it is easy to calculate how many fit on a disk) and removes metadata concerns from that layer (a block is just a chunk of data to be stored; file metadata such as access permissions does not need to be stored alongside the blocks and can be managed separately by another component, the NameNode).

3. Blocks also fit well with replication, the mechanism that provides fault tolerance and availability. To guard against corrupted blocks and disk or machine failures, HDFS keeps several replicas of each block on different machines (three by default). If a block becomes unavailable, a copy can be read from another location transparently, and HDFS re-replicates the block to bring the data back to its previous level of safety (you can also raise the replication factor to increase that level).
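
The replication factor can be tuned per file as well as cluster-wide (the cluster default normally comes from the dfs.replication setting). A minimal sketch that raises the replication of a single file; the path and target factor are illustrative:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RaiseReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);
            // Raise the replication factor of one important file from the
            // default (3) to 5; HDFS copies the extra replicas in the background.
            boolean ok = fs.setReplication(new Path("/data/critical.dat"), (short) 5);
            System.out.println("Replication change accepted: " + ok);
        }
    }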

PS:

You can interact with HDFS at the block level using the fsck command. For example,

    hadoop fsck / -files -blocks

lists the blocks that make up every file in the file system.
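
The same kind of block-level information is available programmatically. A minimal sketch (the path is illustrative) that prints where each block of a file is stored:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlockLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);
            FileStatus status = fs.getFileStatus(new Path("/data/big.dat"));
            // One BlockLocation per block, with the hosts holding its replicas.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
        }
    }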
