HDFS Design Principles
1. Very large files:
"Very large" here means hundreds of MB, GB, or even TB. Yahoo's Hadoop clusters already store PB-scale data.
2. Streaming data access:
Built around a write-once, read-many-times access pattern.
3. Commodity hardware:
HDFS achieves high availability in software, so it does not need expensive hardware to guarantee availability; ordinary PCs or virtual machines from any vendor will do.
Scenarios where HDFS is not a good fit
1. Low-latency data access
HDFS's strength is high-throughput transfer of large amounts of data, not low latency. If you need access times in the tens of milliseconds or below, look past HDFS; HBase can make up for this weakness.
2. Too many small files
The NameNode holds the entire file system's metadata in memory, so the number of files is limited by the NameNode's RAM. Each metadata object (a file, directory, or block) takes roughly 150 bytes.
With one million files, each occupying a single block, that is two million objects (one file object plus one block object per file), or about 300 MB of memory. You can work out how many files your own server could hold.
3. Multiple writers and arbitrary file modification
HDFS does not currently support multiple concurrent writers to a single file, nor modification at arbitrary offsets within a file.
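The small-files memory estimate above (roughly 150 bytes per metadata object, one million single-block files) can be checked with a quick back-of-the-envelope calculation. This is a sketch of the arithmetic only; the 150-byte figure is the approximation quoted in the text, not an exact Hadoop constant.

```python
# Back-of-the-envelope check of the NameNode memory estimate above.
# Assumption: ~150 bytes of NameNode heap per metadata object,
# and one file object plus one block object per single-block file.
BYTES_PER_OBJECT = 150

def namenode_memory_bytes(num_files, blocks_per_file=1):
    """Approximate NameNode heap needed to hold file + block metadata."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

mem = namenode_memory_bytes(1_000_000)
print(mem / 1024 / 1024)  # ~286 MiB, i.e. the "roughly 300 MB" figure
```

Note that the limit scales with the number of objects, not with the bytes of data: a billion 1 KB files would exhaust NameNode memory long before they filled the disks.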
HDFS blocks
To keep seek time a small fraction of transfer time, an HDFS block is much larger than a disk block. The HDFS block size defaults to 64 MB, which sets it apart from ordinary file system blocks.
An HDFS file can be smaller than a block, and a small file does not occupy a full block's worth of underlying storage.
Seek time is around 10 ms and transfer rates are around 100 MB/s, so to make seek time only 1% of transfer time, the block size must be around 100 MB.
It is typically set to 128 MB.
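The block-size reasoning above can be written out explicitly: pick a block size so that the time spent seeking is a chosen fraction of the time spent transferring. A minimal sketch, using the 10 ms and 100 MB/s figures from the text:

```python
# Sketch of the block-size calculation above: choose a block size so
# that seek time is only a small fraction of transfer time.
SEEK_TIME_S = 0.010                         # ~10 ms average seek
TRANSFER_RATE_B_PER_S = 100 * 1024 * 1024   # ~100 MB/s

def block_size_for_seek_fraction(fraction):
    """Block size that makes seek time = `fraction` of transfer time."""
    transfer_time_s = SEEK_TIME_S / fraction
    return transfer_time_s * TRANSFER_RATE_B_PER_S

size = block_size_for_seek_fraction(0.01)   # seek = 1% of transfer
print(size / 1024 / 1024)  # 100.0 -> about 100 MB, hence the 128 MB default
```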
With the block abstraction, HDFS gains three advantages:
1. It can store files larger than any single disk.
2. Storage management is simpler with blocks than with whole files, since every block is the same fixed size.
3. Blocks work better than files as the unit of replication for fault tolerance and high availability.
NameNodes and DataNodes
An HDFS cluster has two types of nodes: a master, the NameNode, and workers, the DataNodes.
The NameNode manages the file system namespace. It maintains the file system tree and the metadata for every file and directory in that tree, and this information is persisted in two files on the local disk: the namespace image and the edit log. Which blocks make up each file, and where those blocks live, are rebuilt in NameNode memory when the system starts and are not stored on disk.
The DataNodes are the workhorses of the file system: they store and retrieve blocks as directed by the NameNode or by clients, and they periodically report the blocks they are holding back to the NameNode.
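The division of labor above can be pictured as two in-memory mappings on the NameNode. The following is a toy sketch, not Hadoop code; the class and method names are invented for illustration. It shows the namespace (file to block IDs, the part persisted via the image and edit log) and block locations (block ID to DataNodes, rebuilt from DataNode block reports rather than stored on disk).

```python
# Toy sketch (not Hadoop code) of the two mappings the NameNode keeps
# in memory: the namespace (persisted via image + edit log) and block
# locations (rebuilt at startup from DataNode block reports).
class ToyNameNode:
    def __init__(self):
        self.file_to_blocks = {}   # namespace: persisted on disk
        self.block_locations = {}  # locations: in-memory only

    def add_file(self, path, block_ids):
        """Record which blocks make up a file (a namespace change)."""
        self.file_to_blocks[path] = list(block_ids)

    def block_report(self, datanode, block_ids):
        """DataNodes periodically report the blocks they hold."""
        for b in block_ids:
            self.block_locations.setdefault(b, set()).add(datanode)

    def locate(self, path):
        """Client read path: resolve file -> blocks -> DataNodes."""
        return [(b, sorted(self.block_locations.get(b, set())))
                for b in self.file_to_blocks[path]]

nn = ToyNameNode()
nn.add_file("/home/a.log", ["blk_1", "blk_2"])
nn.block_report("dn1", ["blk_1"])
nn.block_report("dn2", ["blk_1", "blk_2"])
print(nn.locate("/home/a.log"))
# [('blk_1', ['dn1', 'dn2']), ('blk_2', ['dn2'])]
```

Clients then read the block data directly from the DataNodes; the NameNode only answers the metadata lookup.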
If the NameNode becomes unavailable, the entire HDFS is unusable. To guard against this there are two options:
1. Configure the NameNode to write its metadata to multiple disks, preferably including an independent disk or an NFS mount.
2. Run a secondary NameNode. The secondary NameNode does not normally serve as a NameNode; its main job is to periodically merge the edit log into the namespace image so the edit log does not grow too large, and it keeps its own copy of the merged image. If the NameNode goes down, the secondary can be pressed into service, but because the merge is not real-time, some data loss is very likely.
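The checkpoint merge the secondary NameNode performs can be illustrated with a toy model: replay the edit log's operations on top of the namespace image to produce a new merged image, after which the edit log can start over empty. This is a sketch of the idea only; the data shapes and function name are invented, not Hadoop's actual on-disk formats.

```python
# Toy illustration (not Hadoop code) of the checkpoint idea: replaying
# the edit log on top of the namespace image yields a merged image, so
# the edit log can be truncated instead of growing without bound.
def apply_checkpoint(image, edit_log):
    """Merge an edit log into a namespace image.

    image: dict mapping path -> metadata
    edit_log: list of ("create" | "delete", path, metadata) entries
    Returns the merged image.
    """
    merged = dict(image)
    for op, path, meta in edit_log:
        if op == "create":
            merged[path] = meta
        elif op == "delete":
            merged.pop(path, None)
    return merged

image = {"/usr/data": {"blocks": 3}}
edits = [("create", "/home/log", {"blocks": 1}),
         ("delete", "/usr/data", None)]
print(apply_checkpoint(image, edits))  # {'/home/log': {'blocks': 1}}
```

Any edits made after the last checkpoint exist only in the live NameNode's edit log, which is exactly why failing over to the secondary can lose recent changes.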
HDFS Federation
The NameNode keeps a reference to every file and block in memory, which means that in a large cluster with a great many files, memory becomes the limiting factor.
HDFS Federation, introduced in Hadoop 2.x, lets HDFS run multiple NameNodes, each managing a slice of the namespace: for example, one manages /usr and
another manages /home. The NameNodes are isolated from one another, so the failure of one does not affect the others.
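The partitioning idea can be sketched as a mount table that routes each path to exactly one NameNode. This toy example is not Hadoop code (Hadoop uses ViewFS/mount tables for this in practice); the table contents and function name below are invented, mirroring the /usr and /home example from the text.

```python
# Toy sketch (not Hadoop code) of the Federation idea: each NameNode
# owns an independent slice of the namespace, so any given path is
# served by exactly one NameNode.
MOUNT_TABLE = {            # hypothetical mapping, mirroring the example
    "/usr": "namenode-1",
    "/home": "namenode-2",
}

def route(path):
    """Pick the NameNode responsible for `path` by longest-prefix match."""
    for mount in sorted(MOUNT_TABLE, key=len, reverse=True):
        if path == mount or path.startswith(mount + "/"):
            return MOUNT_TABLE[mount]
    raise KeyError(f"no NameNode mounted for {path}")

print(route("/usr/lib/a.jar"))   # namenode-1
print(route("/home/bob/x.txt"))  # namenode-2
```

Because each NameNode holds metadata only for its own slice, memory pressure is divided across them, and a crash of namenode-1 leaves /home fully available.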