This article explains the HDFS storage mechanism and how it operates, in a concise and easy-to-understand comic form.
I. The Cast of Roles
As shown in the comic, the storage-related roles in HDFS and their functions are as follows:
Client: the system user. It calls the HDFS API to operate on files, exchanges file metadata with the NameNode (NN), and reads and writes data with the DataNodes (DN).
NameNode (NN): the metadata node and the single manager of the system. It is responsible for metadata management, answers metadata queries from clients, assigns data storage nodes, and so on.
DataNode (DN): the data storage node. It stores data blocks and their redundant replicas, and executes block read and write operations.
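To make the division of labor concrete, here is a minimal sketch (not part of the original comic) of a client exercising all three roles through the standard Hadoop Java API: FileSystem.get() resolves paths and metadata through the NameNode, while the returned streams carry the actual bytes to and from DataNodes. The fs.defaultFS URI and the file path are illustrative assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRoles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NN address

            FileSystem fs = FileSystem.get(conf);             // client-side handle

            // Write: the NameNode assigns DataNodes; the stream ships bytes to them.
            try (FSDataOutputStream out = fs.create(new Path("/demo/hello.txt"))) {
                out.writeUTF("hello hdfs");
            }

            // Read: the NameNode returns block locations; bytes come from DataNodes.
            try (FSDataInputStream in = fs.open(new Path("/demo/hello.txt"))) {
                System.out.println(in.readUTF());
            }
            fs.close();
        }
    }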
II. Writing Data
1. Send a write request
The storage unit in HDFS is the block. Files are usually stored in blocks of 64 MB or 128 MB. Unlike in an ordinary file system, a file in HDFS that is smaller than one block does not occupy the whole block's storage space.
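As a quick worked example (assuming the common 128 MB block size): a 300 MB file is split into two full 128 MB blocks plus one 44 MB tail block, and that tail occupies only 44 MB on disk rather than a whole block.

    public class BlockMath {
        public static void main(String[] args) {
            long blockSize = 128L * 1024 * 1024;   // a common dfs.blocksize value
            long fileSize  = 300L * 1024 * 1024;   // hypothetical 300 MB file

            long fullBlocks  = fileSize / blockSize;              // 2 full blocks
            long tailBytes   = fileSize % blockSize;              // 44 MB remainder
            long totalBlocks = fullBlocks + (tailBytes > 0 ? 1 : 0);

            System.out.println(totalBlocks + " blocks; the last one stores only "
                    + tailBytes / (1024 * 1024) + " MB");
        }
    }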
2. File segmentation
3. DataNode assignment
4. Writing the data
5. Finishing the write
6. Recap: each role's part in the write
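From the client's point of view, steps 1 through 5 above are driven by a single create-and-write call: the client library splits the stream into blocks, the NameNode picks a pipeline of DataNodes for each block, and close() completes the file. Below is a hedged sketch using the real FileSystem.create() overload that takes replication and block size; the path and payload are made up.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.nio.charset.StandardCharsets;

    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Step 1: create() is the write request sent to the NameNode.
            FSDataOutputStream out = fs.create(
                    new Path("/demo/big.dat"),   // hypothetical target path
                    true,                        // overwrite if present
                    4096,                        // I/O buffer size
                    (short) 3,                   // replication factor
                    128L * 1024 * 1024);         // block size

            // Steps 2-4: as we write, the client splits the data into blocks,
            // asks the NameNode for DataNodes, and pipelines bytes DN1 -> DN2 -> DN3.
            out.write("some payload".getBytes(StandardCharsets.UTF_8));

            // Step 5: close() flushes the last packet and reports completion
            // to the NameNode.
            out.close();
            fs.close();
        }
    }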
III. Reading a File from HDFS
1. User needs
HDFS uses a write-once-read-many file access model. A file does not change after it has been created, written, and closed. This assumption simplifies data consistency and makes high-throughput data access possible.
2. Contact the metadata node first
3. Download data
As mentioned in the write process, the DataNode locations for each block are sorted by their distance from the client, with the closest DataNode listed first, so the client reads each data block from the nearest replica first (the local node, if a replica is stored there).
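A minimal read sketch under the same assumptions as the write example above: getFileBlockLocations() lists which DataNodes hold each block, nearest replicas first, and open() then streams every block from the closest holder.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.util.Arrays;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path("/demo/big.dat");   // file from the write sketch

            // Ask the NameNode which DataNodes hold each block.
            FileStatus st = fs.getFileStatus(p);
            for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
                System.out.println("block @" + loc.getOffset() + " on "
                        + Arrays.toString(loc.getHosts()));
            }

            // Stream the file; each block is pulled from the nearest replica.
            try (FSDataInputStream in = fs.open(p)) {
                byte[] buf = new byte[4096];
                while (in.read(buf) > 0) { /* consume the bytes */ }
            }
            fs.close();
        }
    }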
4. Food for thought
IV. Fault Tolerance, Part I: Fault Types and Detection Methods
1. Three types of faults
(1) Type 1: node failure
(2) Type 2: network failure
(3) Type 3: data corruption (dirty data)
2. Fault detection mechanisms
(1) Detecting node failure
(2) Detecting communication failure
(3) Detecting data corruption (see the checksum sketch after this list)
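For data corruption in particular, HDFS stores a checksum for each small chunk of a block when the chunk is written (512 bytes per checksum by default) and re-verifies it whenever the data is read back. The sketch below uses plain CRC32 only to show the idea; it is not Hadoop's actual checksum code.

    import java.util.zip.CRC32;

    public class ChecksumCheck {
        // Compute a CRC over a chunk, as done when the chunk is first written.
        static long crcOf(byte[] chunk) {
            CRC32 crc = new CRC32();
            crc.update(chunk);
            return crc.getValue();
        }

        public static void main(String[] args) {
            byte[] chunk = "block contents".getBytes();
            long stored = crcOf(chunk);   // checksum persisted alongside the data

            chunk[0] ^= 0x01;             // simulate silent on-disk corruption

            // On read: recompute and compare. A mismatch marks this replica
            // dirty, and the client fetches another replica instead.
            if (crcOf(chunk) != stored) {
                System.out.println("corrupt chunk detected; try another replica");
            }
        }
    }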
3. Review: heartbeat messages and block reports
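In short: every DataNode sends the NameNode a heartbeat every few seconds (3 seconds by default), plus periodic block reports listing the blocks it holds; a node silent past the timeout (about 10.5 minutes with default settings) is declared dead. The class below is a toy stand-in for that bookkeeping, not Hadoop's code.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class HeartbeatMonitor {
        // Illustrative timeout; HDFS derives ~10.5 min from dfs.heartbeat.interval
        // and dfs.namenode.heartbeat.recheck-interval.
        static final long TIMEOUT_MS = (10 * 60 + 30) * 1000L;

        private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

        // Called whenever a DataNode heartbeat arrives.
        void onHeartbeat(String dataNodeId) {
            lastSeen.put(dataNodeId, System.currentTimeMillis());
        }

        // Periodic sweep: a node silent past the timeout is treated as dead,
        // and its blocks are scheduled for re-replication.
        boolean isDead(String dataNodeId) {
            Long t = lastSeen.get(dataNodeId);
            return t == null || System.currentTimeMillis() - t > TIMEOUT_MS;
        }
    }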
The HDFS storage philosophy is to spend the least money on the cheapest machines and still get the most reliable distributed file system (high fault tolerance at low cost). As the sections above show, HDFS treats machine failure as the norm, so its design fully accounts for the failure of a single machine, a single disk, a single file, and so on.
V. Fault Tolerance, Part II: Read and Write Fault Tolerance
1. Write fault tolerance
2. Read fault tolerance
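Read fault tolerance can be pictured as a failover loop over the replica list the NameNode returned: try the nearest replica first, and on an I/O error or a checksum mismatch move on to the next. The Replica interface below is hypothetical; only the retry shape matters.

    import java.util.List;

    public class ReadWithFailover {
        interface Replica { byte[] read() throws Exception; }   // hypothetical

        // Try each replica in order (nearest first); skip any that fails.
        static byte[] readBlock(List<Replica> replicas) throws Exception {
            Exception last = null;
            for (Replica r : replicas) {
                try {
                    return r.read();    // success: hand back the block's bytes
                } catch (Exception e) {
                    last = e;           // remember the failure, try the next DN
                }
            }
            throw new Exception("all replicas failed", last);
        }
    }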
VI. Fault Tolerance, Part III: DataNode (DN) Failure
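When a DataNode is declared dead, every block that just lost a copy falls below its target replica count, and the NameNode schedules new copies from the surviving replicas onto healthy nodes. A toy sketch of that bookkeeping follows; the map layout and names are assumptions, not Hadoop's internals.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class ReReplication {
        static final int TARGET_REPLICAS = 3;   // cluster replication factor

        // blockId -> DataNodes currently holding a replica of that block
        static List<String> underReplicated(Map<String, Set<String>> blockMap,
                                            String deadNode) {
            List<String> toCopy = new ArrayList<>();
            for (Map.Entry<String, Set<String>> e : blockMap.entrySet()) {
                e.getValue().remove(deadNode);             // drop the lost copy
                if (e.getValue().size() < TARGET_REPLICAS) {
                    toCopy.add(e.getKey());                // schedule a new copy
                }
            }
            return toCopy;
        }
    }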
VII. Replication Rules
1. Racks and DataNodes
2. Replica placement policy
The first replica of a data block is placed, by preference, on the node where the client is writing. If that node is short on space or currently overloaded, an appropriate DataNode elsewhere in the same rack is chosen as the block's local node instead.
If the client is not running on a DataNode at all, a suitable DataNode is picked at random from the whole cluster to act as the block's local node.
HDFS's storage strategy is therefore to keep one replica on the local node and place the other two replicas on two different nodes of another rack.
This lets the cluster survive the loss of an entire rack. At the same time, because each block lives on only two distinct racks, the strategy limits inter-rack data transfer during writes (improving write efficiency) and reduces the total bandwidth needed to read the data. To some extent it balances data safety against network transfer cost.
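Here is a sketch of that placement decision, with hypothetical Node and rack types: replica 1 goes to the writer's own node, and replicas 2 and 3 go to two different nodes chosen from a single other rack. It assumes the cluster has at least one other rack with two usable nodes.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    public class PlacementPolicy {
        record Node(String name, String rack) {}   // hypothetical node type

        static List<Node> pickTargets(Node writer, List<Node> cluster, Random rnd) {
            List<Node> targets = new ArrayList<>();
            targets.add(writer);                   // replica 1: the local node

            // Collect nodes on other racks and pick one remote rack at random.
            List<Node> remote = new ArrayList<>();
            for (Node n : cluster) {
                if (!n.rack().equals(writer.rack())) remote.add(n);
            }
            Collections.shuffle(remote, rnd);
            String remoteRack = remote.get(0).rack();

            // Replicas 2 and 3: two different nodes on that same remote rack.
            for (Node n : remote) {
                if (n.rack().equals(remoteRack) && targets.size() < 3) {
                    targets.add(n);
                }
            }
            return targets;
        }
    }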
"Comic reading" HDFs Storage principle (reprint)