A Detailed Introduction to Hadoop (I): HDFS

Source: Internet
Author: User
Tags: file system

HDFS Design Principles

1. Very large files:

"Very large" here means files of hundreds of MB, GB, or even TB. Yahoo's Hadoop clusters can already store PB-scale data.

2. Streaming data access:

HDFS is built around a write-once, read-many-times access pattern.

3. Commodity hardware:

HDFS achieves high availability in software, so it does not need expensive hardware to guarantee it. Ordinary PCs or virtual machines from any vendor are enough.

Scenarios Where HDFS Is Not a Good Fit

1. Low-latency data access

HDFS's strength is high-throughput transfer of large amounts of data, so it is not suited to low-latency access. If you need access times below roughly 10 milliseconds, you can rule HDFS out; HBase can make up for this shortcoming.

2. Too many small files

The NameNode holds the entire file system's metadata in memory, so the number of files is limited by the NameNode's memory. Each file, directory, and block takes approximately 150 bytes of metadata.

With 1 million files, each occupying only one block, that is 2 million objects and therefore about 300 MB of memory. From this you can work out how many files your own server can hold.
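The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate: the 150-byte figure is the approximation quoted in the text, and the function names are made up for illustration.

```python
# Approximate metadata cost per object (file or block), as quoted in the text.
# This is an estimate, not an exact constant taken from the Hadoop source.
BYTES_PER_OBJECT = 150


def namenode_memory_bytes(num_files, blocks_per_file=1,
                          bytes_per_object=BYTES_PER_OBJECT):
    """Estimate the NameNode memory needed for file and block metadata."""
    # One object for the file itself plus one object per block.
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object


def max_files_for_heap(heap_bytes, blocks_per_file=1,
                       bytes_per_object=BYTES_PER_OBJECT):
    """Invert the estimate: how many files fit in a given amount of memory."""
    return heap_bytes // ((1 + blocks_per_file) * bytes_per_object)


if __name__ == "__main__":
    needed = namenode_memory_bytes(1_000_000)
    print(f"1 million single-block files: about {needed / 1e6:.0f} MB")
```

Directories are ignored here, so a real NameNode would need somewhat more memory than this estimate suggests.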

3. Multiple writers and random modification

HDFS does not currently support multiple concurrent writers to a file, nor modifications at arbitrary offsets within a file.

HDFS Blocks

To minimize the share of time spent on seeks, an HDFS block is much larger than a disk block. The HDFS block size defaults to 64 MB.

Unlike a block in an ordinary file system, a file in HDFS that is smaller than the block size does not occupy a full block's worth of storage.
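As a rough illustration of how a file maps onto blocks, the sketch below computes the length of each block for a given file size. The function name is hypothetical and the 64 MB default simply mirrors the figure in the text; it models only the size arithmetic, not how HDFS stores blocks.

```python
MB = 1024 * 1024


def block_lengths(file_size, block_size=64 * MB):
    """Return the length in bytes of each block a file of file_size occupies."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    lengths = [block_size] * full_blocks
    if remainder:
        # The last block holds only the leftover bytes, not a full block.
        lengths.append(remainder)
    return lengths


if __name__ == "__main__":
    for size_mb in (1, 150):
        sizes = [length // MB for length in block_lengths(size_mb * MB)]
        print(f"{size_mb} MB file -> block sizes in MB: {sizes}")
```

A 1 MB file yields a single 1 MB block, and a 150 MB file yields two full 64 MB blocks plus a 22 MB block.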

With a seek time of around 10 ms and a data transfer rate of about 100 MB/s, the block size must be around 100 MB for the seek time to be 1% of the transfer time.

In practice the block size is typically set to 128 MB.
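The block size estimate can be reproduced with a short calculation. The figures are the rough ones from the text (10 ms seek, 100 MB/s transfer, 1% target), and the function name is chosen here for illustration.

```python
def block_size_for_seek_ratio(seek_time_s, transfer_rate_mb_s, seek_fraction):
    """Block size in MB at which seek time is seek_fraction of transfer time."""
    # If seek time should be seek_fraction of the transfer time, the transfer
    # must last seek_time / seek_fraction seconds.
    transfer_time_s = seek_time_s / seek_fraction
    # Data moved in that time at the given transfer rate.
    return transfer_time_s * transfer_rate_mb_s


if __name__ == "__main__":
    size_mb = block_size_for_seek_ratio(0.010, 100, 0.01)
    print(f"Block size for a 1% seek overhead: about {size_mb:.0f} MB")
```

Faster disks push the result up: with the same seek time and a 200 MB/s transfer rate, the same 1% target gives a block of about 200 MB.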

The block abstraction gives HDFS three advantages:

1. A file can be larger than any single disk.

2. Storing blocks is simpler than storing files, because every block is essentially the same size.

3. Blocks are a better unit than files for fault tolerance and high availability.

NameNodes and DataNodes

An HDFS cluster has two types of nodes: a NameNode, which is the master, and DataNodes, which are the workers.

The NameNode manages the file system namespace. It maintains the file system tree and the metadata for all files and directories in that tree. This information is stored in two files on the local disk: the namespace image and the edit log. The NameNode also knows which blocks make up each file and which DataNodes hold each block, but this block location information is not stored on disk; it is rebuilt in the NameNode's memory when the system starts.

DataNodes do the heavy lifting of the file system. They store and retrieve blocks as directed by the NameNode and by clients, and they periodically report to the NameNode the list of blocks they are storing.
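The split between persisted namespace metadata and in-memory block locations can be pictured with a toy model. Everything below (the class name, method names, and block ID strings) is hypothetical and heavily simplified; it is not Hadoop's API, only a sketch of the idea that locations are learned from DataNode block reports.

```python
from collections import defaultdict


class ToyNameNode:
    """Toy model: the namespace is given up front, locations come from reports."""

    def __init__(self, namespace):
        # File path -> ordered list of block IDs. Stands in for the metadata
        # that the real NameNode persists in the image and edit log.
        self.namespace = namespace
        # Block ID -> set of DataNode names. Kept only in memory.
        self.block_locations = defaultdict(set)

    def receive_block_report(self, datanode, block_ids):
        """Replace what we know about one DataNode with its latest report."""
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, path):
        """Return (block ID, sorted DataNode names) for each block of a file."""
        return [(block_id, sorted(self.block_locations.get(block_id, ())))
                for block_id in self.namespace[path]]


if __name__ == "__main__":
    namenode = ToyNameNode({"/logs/a.log": ["blk_1", "blk_2"]})
    namenode.receive_block_report("dn1", ["blk_1", "blk_2"])
    namenode.receive_block_report("dn2", ["blk_2"])
    print(namenode.locate("/logs/a.log"))
```

After a restart, a fresh `ToyNameNode` would start with an empty `block_locations` map and fill it again as reports arrive, which mirrors why this information does not need to be written to disk.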

If the NameNode is unavailable, the whole of HDFS is unusable. There are two ways to guard against this:

1. Configure the NameNode to write its metadata to multiple disks, preferably independent disks or an NFS mount.

2. Run a secondary NameNode. The secondary NameNode does not normally act as a NameNode. Its main job is to periodically merge the edit log into the namespace image so that the edit log does not grow too large. It keeps its own copy of the merged image, which can be put to use if the NameNode fails. Because that copy is not updated in real time, some data loss is very likely.
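The checkpoint performed by the secondary NameNode amounts to replaying the edit log on top of the last image. The sketch below assumes a made-up record format (tuples such as `("create", path, blocks)`); Hadoop's real image and edit log formats are different, so treat this purely as an illustration of the merge.

```python
def apply_edit(image, edit):
    """Apply one logged operation to an in-memory namespace image."""
    op, path = edit[0], edit[1]
    if op == "create":
        image[path] = list(edit[2])
    elif op == "delete":
        image.pop(path, None)
    elif op == "rename":
        image[edit[2]] = image.pop(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(image, edit_log):
    """Merge the edit log into a copy of the image; return (new image, empty log)."""
    new_image = {path: list(blocks) for path, blocks in image.items()}
    for edit in edit_log:
        apply_edit(new_image, edit)
    # After a checkpoint the log can start over, which keeps it small.
    return new_image, []


if __name__ == "__main__":
    image = {"/a": ["blk_1"]}
    edits = [("create", "/b", ["blk_2"]), ("rename", "/a", "/c"), ("delete", "/b")]
    merged, remaining = checkpoint(image, edits)
    print(merged, remaining)
```

Any edits logged after the last checkpoint exist only on the primary, which is why falling back to the secondary's copy can lose recent changes.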

HDFS Federation

The NameNode keeps references to every file and block in memory, which means that in a large cluster with a great many files, memory becomes the limiting factor. HDFS Federation, introduced in Hadoop 2.x, allows HDFS to have multiple NameNodes, each managing one part of the namespace: for example, one manages /usr and another manages /home. The NameNodes are isolated from each other, so the failure of one does not affect the others.
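One way to picture federation is as a table that maps path prefixes to NameNodes. The sketch below is a simplified illustration with hypothetical names (`resolve_namenode`, `namenode-1`, `namenode-2`); it is not Hadoop's implementation.

```python
def resolve_namenode(path, mount_table):
    """Pick the NameNode whose prefix is the longest match for path."""
    best = None
    for prefix, namenode in mount_table.items():
        # Match the prefix itself or anything beneath it, but not siblings
        # that merely share leading characters (e.g. /usrdata vs /usr).
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, namenode)
    if best is None:
        raise KeyError(f"no NameNode manages {path}")
    return best[1]


if __name__ == "__main__":
    # Hypothetical layout following the /usr and /home example in the text.
    mount_table = {"/usr": "namenode-1", "/home": "namenode-2"}
    for path in ("/usr/lib/data.txt", "/home/alice/notes.txt"):
        print(path, "->", resolve_namenode(path, mount_table))
```

Because each prefix belongs to exactly one NameNode, losing `namenode-2` in this model makes only paths under /home unreachable.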
