HDFS Design Principles
1. Very large files:
"Very large" here means hundreds of MB, GB, or even TB. Yahoo's Hadoop clusters already store PB-scale data.
2. Streaming data access:
Built around a write-once, read-many-times access pattern.
3. Commodity hardware:
HDFS achieves high availability in software, so it does not need expensive hardware to guarantee availability; ordinary PCs or virtual machines from any vendor will do.
Scenarios where HDFS is not a good fit
1. Low-latency data access
HDFS's strength is high-throughput transfer of large amounts of data, not low latency. If you need access times in the tens of milliseconds or below, look past HDFS; HBase can make up for this weakness.
2. Too many small files
The NameNode holds the entire file system's metadata in memory, so the number of files is limited by the NameNode's RAM. Each metadata object (a file, directory, or block) takes roughly 150 bytes.
With one million files, each occupying a single block, that is two million objects (one file object plus one block object per file), or about 300 MB of memory. You can work out how many files your own server could hold.
3. Multiple writers and arbitrary file modification
HDFS does not currently support multiple concurrent writers to a single file, nor modification at arbitrary offsets within a file.
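The small-files memory estimate above (roughly 150 bytes per metadata object, one million single-block files) can be checked with a quick back-of-the-envelope calculation. This is a sketch of the arithmetic only; the 150-byte figure is the approximation quoted in the text, not an exact Hadoop constant.

```python
# Back-of-the-envelope check of the NameNode memory estimate above.
# Assumption: ~150 bytes of NameNode heap per metadata object,
# and one file object plus one block object per single-block file.
BYTES_PER_OBJECT = 150

def namenode_memory_bytes(num_files, blocks_per_file=1):
    """Approximate NameNode heap needed to hold file + block metadata."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

mem = namenode_memory_bytes(1_000_000)
print(mem / 1024 / 1024)  # ~286 MiB, i.e. the "roughly 300 MB" figure
```

Note that the limit scales with the number of objects, not with the bytes of data: a billion 1 KB files would exhaust NameNode memory long before they filled the disks.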
HDFS blocks
To keep seek time a small fraction of transfer time, an HDFS block is much larger than a disk block. The HDFS block size defaults to 64 MB, which sets it apart from ordinary file system blocks.
An HDFS file can be smaller than a block, and a small file does not occupy a full block's worth of underlying storage.
Seek time is around 10 ms and transfer rates are around 100 MB/s, so to make seek time only 1% of transfer time, the block size must be around 100 MB.
It is typically set to 128 MB.
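The block-size reasoning above can be written out explicitly: pick a block size so that the time spent seeking is a chosen fraction of the time spent transferring. A minimal sketch, using the 10 ms and 100 MB/s figures from the text:

```python
# Sketch of the block-size calculation above: choose a block size so
# that seek time is only a small fraction of transfer time.
SEEK_TIME_S = 0.010                         # ~10 ms average seek
TRANSFER_RATE_B_PER_S = 100 * 1024 * 1024   # ~100 MB/s

def block_size_for_seek_fraction(fraction):
    """Block size that makes seek time = `fraction` of transfer time."""
    transfer_time_s = SEEK_TIME_S / fraction
    return transfer_time_s * TRANSFER_RATE_B_PER_S

size = block_size_for_seek_fraction(0.01)   # seek = 1% of transfer
print(size / 1024 / 1024)  # 100.0 -> about 100 MB, hence the 128 MB default
```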
With the block abstraction, HDFS gains three advantages:
1. It can store files larger than any single disk.
2. Storage management is simpler with blocks than with whole files, since every block is the same fixed size.
3. Blocks work better than files as the unit of replication for fault tolerance and high availability.
NameNodes and DataNodes
An HDFS cluster has two types of nodes: a master, the NameNode, and workers, the DataNodes.
The NameNode manages the file system namespace. It maintains the file system tree and the metadata for every file and directory in that tree, and this information is persisted in two files on the local disk: the namespace image and the edit log. Which blocks make up each file, and where those blocks live, are rebuilt in NameNode memory when the system starts and are not stored on disk.
The DataNodes are the workhorses of the file system: they store and retrieve blocks as directed by the NameNode or by clients, and they periodically report the blocks they are holding back to the NameNode.
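The division of labor above can be pictured as two in-memory mappings on the NameNode. The following is a toy sketch, not Hadoop code; the class and method names are invented for illustration. It shows the namespace (file to block IDs, the part persisted via the image and edit log) and block locations (block ID to DataNodes, rebuilt from DataNode block reports rather than stored on disk).

```python
# Toy sketch (not Hadoop code) of the two mappings the NameNode keeps
# in memory: the namespace (persisted via image + edit log) and block
# locations (rebuilt at startup from DataNode block reports).
class ToyNameNode:
    def __init__(self):
        self.file_to_blocks = {}   # namespace: persisted on disk
        self.block_locations = {}  # locations: in-memory only

    def add_file(self, path, block_ids):
        """Record which blocks make up a file (a namespace change)."""
        self.file_to_blocks[path] = list(block_ids)

    def block_report(self, datanode, block_ids):
        """DataNodes periodically report the blocks they hold."""
        for b in block_ids:
            self.block_locations.setdefault(b, set()).add(datanode)

    def locate(self, path):
        """Client read path: resolve file -> blocks -> DataNodes."""
        return [(b, sorted(self.block_locations.get(b, set())))
                for b in self.file_to_blocks[path]]

nn = ToyNameNode()
nn.add_file("/home/a.log", ["blk_1", "blk_2"])
nn.block_report("dn1", ["blk_1"])
nn.block_report("dn2", ["blk_1", "blk_2"])
print(nn.locate("/home/a.log"))
# [('blk_1', ['dn1', 'dn2']), ('blk_2', ['dn2'])]
```

Clients then read the block data directly from the DataNodes; the NameNode only answers the metadata lookup.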
If the NameNode becomes unavailable, the entire HDFS is unusable. To guard against this there are two options:
1. Configure the NameNode to write its metadata to multiple disks, preferably including an independent disk or an NFS mount.
2. Run a secondary NameNode. The secondary NameNode does not normally serve as a NameNode; its main job is to periodically merge the edit log into the namespace image so the edit log does not grow too large, and it keeps its own copy of the merged image. If the NameNode goes down, the secondary can be pressed into service, but because the merge is not real-time, some data loss is very likely.
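The checkpoint merge the secondary NameNode performs can be illustrated with a toy model: replay the edit log's operations on top of the namespace image to produce a new merged image, after which the edit log can start over empty. This is a sketch of the idea only; the data shapes and function name are invented, not Hadoop's actual on-disk formats.

```python
# Toy illustration (not Hadoop code) of the checkpoint idea: replaying
# the edit log on top of the namespace image yields a merged image, so
# the edit log can be truncated instead of growing without bound.
def apply_checkpoint(image, edit_log):
    """Merge an edit log into a namespace image.

    image: dict mapping path -> metadata
    edit_log: list of ("create" | "delete", path, metadata) entries
    Returns the merged image.
    """
    merged = dict(image)
    for op, path, meta in edit_log:
        if op == "create":
            merged[path] = meta
        elif op == "delete":
            merged.pop(path, None)
    return merged

image = {"/usr/data": {"blocks": 3}}
edits = [("create", "/home/log", {"blocks": 1}),
         ("delete", "/usr/data", None)]
print(apply_checkpoint(image, edits))  # {'/home/log': {'blocks': 1}}
```

Any edits made after the last checkpoint exist only in the live NameNode's edit log, which is exactly why failing over to the secondary can lose recent changes.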
HDFS Federation
The NameNode keeps a reference to every file and block in memory, which means that in a large cluster with a great many files, memory becomes the limiting factor.
HDFS Federation, introduced in Hadoop 2.x, lets HDFS run multiple NameNodes, each managing a slice of the namespace: for example, one manages /usr and
another manages /home. The NameNodes are isolated from one another, so the failure of one does not affect the others.
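The partitioning idea can be sketched as a mount table that routes each path to exactly one NameNode. This toy example is not Hadoop code (Hadoop uses ViewFS/mount tables for this in practice); the table contents and function name below are invented, mirroring the /usr and /home example from the text.

```python
# Toy sketch (not Hadoop code) of the Federation idea: each NameNode
# owns an independent slice of the namespace, so any given path is
# served by exactly one NameNode.
MOUNT_TABLE = {            # hypothetical mapping, mirroring the example
    "/usr": "namenode-1",
    "/home": "namenode-2",
}

def route(path):
    """Pick the NameNode responsible for `path` by longest-prefix match."""
    for mount in sorted(MOUNT_TABLE, key=len, reverse=True):
        if path == mount or path.startswith(mount + "/"):
            return MOUNT_TABLE[mount]
    raise KeyError(f"no NameNode mounted for {path}")

print(route("/usr/lib/a.jar"))   # namenode-1
print(route("/home/bob/x.txt"))  # namenode-2
```

Because each NameNode holds metadata only for its own slice, memory pressure is divided across them, and a crash of namenode-1 leaves /home fully available.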