This article uses the Hadoop Distributed File System (HDFS) as an example to expand on the key points of designing a distributed storage service architecture.
Architectural goals
Any software framework or service is created to solve a specific problem. Recall the concerns described in the article "Distributed Storage-Overview". A distributed file system is the file-oriented data model within distributed storage, and it has to solve the capacity expansion and fault tolerance problems that a single-machine file system faces.
The HDFS architecture is therefore designed with these goals:
- Support very large files and large file datasets
- Automatically detect localized hardware failures and recover from them quickly
Based on these goals, and to simplify design and implementation, HDFS assumes a write-once-read-many file access model. This model fits many real-world business scenarios, so it is a reasonable assumption for the architecture. At the same time, the assumption limits the scenarios in which HDFS can be applied.
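The write-once-read-many assumption can be illustrated with a toy model. This is only a sketch: `WriteOnceFile` and its methods are hypothetical names invented for illustration, not part of any HDFS API, and real HDFS semantics are richer than this.

```python
class WriteOnceFile:
    """Toy file that accepts writes only until it is closed (hypothetical model)."""

    def __init__(self):
        self._chunks = []
        self._sealed = False

    def write(self, data: bytes) -> None:
        if self._sealed:
            # After the single write phase, the content can no longer change.
            raise PermissionError("file is sealed; it can only be read")
        self._chunks.append(data)

    def close(self) -> None:
        # Closing ends the write phase and seals the file.
        self._sealed = True

    def read(self) -> bytes:
        # Reads may be repeated any number of times.
        return b"".join(self._chunks)
```

Giving up in-place updates is what keeps the model simple: once a file is sealed, every reader sees the same bytes, so there is no concurrent-modification case to handle.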
Architecture Overview
Here is an architecture diagram from the official documentation:
As the diagram shows, the HDFS architecture consists of three parts, each with clearly delineated responsibilities.
- NameNode
- DataNode
- Client
HDFS uses a centralized master architecture, with the NameNode as the central node of the cluster.
NameNode
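The NameNode's primary responsibility is to manage the metadata of the entire file system, which mainly includes:
- File system namespace
Like a single-machine file system, HDFS organizes files as a directory tree, which is called the file system namespace.
- Replication factor
The number of replicas of a file, set per file.
- Mapping of blocks to DataNodes
The mapping from file blocks to the data nodes that store them.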
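The three kinds of metadata listed above can be sketched as plain data structures. This is a minimal illustration under stated assumptions, not how the NameNode is implemented: `FileEntry`, `ToyMetadata`, and all method names are hypothetical, and the namespace is stored as a flat path table rather than a real tree.

```python
from dataclasses import dataclass, field


@dataclass
class FileEntry:
    replication: int                               # replication factor, set per file
    block_ids: list = field(default_factory=list)  # ordered blocks of the file


class ToyMetadata:
    """Hypothetical in-memory model of the metadata a NameNode manages."""

    def __init__(self):
        self.namespace = {}        # absolute path -> FileEntry
        self.block_locations = {}  # block id -> set of DataNode ids

    def create_file(self, path, replication=3):
        self.namespace[path] = FileEntry(replication)

    def add_block(self, path, block_id, datanode_ids):
        self.namespace[path].block_ids.append(block_id)
        self.block_locations[block_id] = set(datanode_ids)

    def list_dir(self, directory):
        # Derive the directory-tree view from the stored absolute paths.
        prefix = directory.rstrip("/") + "/"
        return sorted({p[len(prefix):].split("/")[0]
                       for p in self.namespace if p.startswith(prefix)})

    def under_replicated(self):
        # Blocks stored on fewer DataNodes than their file's replication factor.
        return [b for entry in self.namespace.values() for b in entry.block_ids
                if len(self.block_locations[b]) < entry.replication]
```

Keeping the replication factor on the file and the locations on the block makes it easy to compare the two and spot blocks that need more replicas.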
In the architecture diagram above, the Metadata ops directed at the NameNode are mainly file creation, deletion, reads, and setting the number of replicas, so no file operation can bypass the NameNode. The NameNode is also responsible for managing DataNodes: new DataNodes joining the cluster, old DataNodes leaving it, and balancing the distribution of file data blocks across DataNodes. The NameNode's design and implementation will be analyzed in a separate article.
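To make the split between metadata operations and data transfer concrete, here is a toy read path. It is a sketch under assumptions: the classes, method names, and block ids are hypothetical, and it only models the idea that a read begins with a metadata lookup at the NameNode while the bytes themselves come from DataNodes.

```python
class ToyNameNode:
    """Holds metadata only: which blocks make up a file and where they live."""

    def __init__(self, block_map):
        self._block_map = block_map  # path -> list of (block id, [DataNode ids])
        self.metadata_calls = 0

    def get_block_locations(self, path):
        self.metadata_calls += 1
        return self._block_map[path]


class ToyDataNode:
    """Holds block contents and serves them to clients."""

    def __init__(self, blocks):
        self._blocks = blocks        # block id -> bytes
        self.reads_served = 0

    def read_block(self, block_id):
        self.reads_served += 1
        return self._blocks[block_id]


def read_file(path, namenode, datanodes):
    data = b""
    # Step 1: one metadata operation against the NameNode.
    for block_id, holders in namenode.get_block_locations(path):
        # Step 2: fetch the bytes from the first DataNode listed for the block.
        data += datanodes[holders[0]].read_block(block_id)
    return data
```

In this model the NameNode is consulted once per read, and the file contents never pass through it.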
DataNode
The DataNode's responsibilities are as follows:
- Store file blocks
- Serve the Client's file read and write requests
- Create, delete, and replicate file blocks
In the architecture diagram, the Block ops arrow points from the NameNode to the DataNode, which can give the mistaken impression that the NameNode actively sends commands to the DataNode. In fact, the NameNode never calls the DataNode. It only returns instructions in its replies to the heartbeats that the DataNode sends to it periodically.
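The heartbeat-driven command flow can be sketched as follows. This is a minimal illustration, assuming a simple per-DataNode command queue; `HeartbeatNameNode`, `HeartbeatDataNode`, and the command strings are hypothetical names, not the actual HDFS protocol.

```python
from collections import defaultdict, deque


class HeartbeatNameNode:
    def __init__(self):
        self._pending = defaultdict(deque)  # DataNode id -> queued commands

    def schedule(self, datanode_id, command):
        # The NameNode only records the command; it does not contact the DataNode.
        self._pending[datanode_id].append(command)

    def handle_heartbeat(self, datanode_id):
        # Queued commands ride back in the heartbeat reply.
        commands = list(self._pending[datanode_id])
        self._pending[datanode_id].clear()
        return commands


class HeartbeatDataNode:
    def __init__(self, node_id, namenode):
        self.node_id = node_id
        self.namenode = namenode
        self.executed = []

    def send_heartbeat(self):
        # The DataNode initiates every exchange and then acts on the reply.
        for command in self.namenode.handle_heartbeat(self.node_id):
            self.executed.append(command)
```

Because every exchange is initiated by the DataNode, a scheduled command takes effect only at that DataNode's next heartbeat.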
Rack1 and Rack2 are marked explicitly in the architecture diagram, indicating that HDFS takes rack awareness into account when placing multiple replicas of file data blocks. We will not go into the details here. The DataNode's design and implementation will be analyzed in a separate article.
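As a rough illustration of rack awareness, the sketch below places three replicas so that they span two racks: one on the writer's node, one on a node in a different rack, and one on another node in that same remote rack. This follows the placement the HDFS documentation describes for a replication factor of three, but the function name, the node and rack labels, and the deterministic choice of candidates are assumptions made for the example.

```python
def place_replicas(writer, rack_of):
    """Pick three DataNodes for a block: writer's node, remote rack, same remote rack.

    rack_of maps a DataNode id to its rack id (hypothetical topology table).
    """
    first = writer
    # Candidates on any rack other than the writer's, in a stable order.
    remote = sorted(n for n in rack_of if rack_of[n] != rack_of[first])
    if not remote:
        raise ValueError("need at least two racks")
    second = remote[0]
    # The third replica shares the second replica's rack but not its node.
    peers = [n for n in remote if rack_of[n] == rack_of[second] and n != second]
    if not peers:
        raise ValueError("need two nodes on the remote rack")
    return [first, second, peers[0]]
```

Spanning two racks means a block survives the loss of a whole rack, while keeping two replicas in one rack limits the traffic that has to cross racks during a write.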
Client
Because the HDFS interaction process is complex, language-specific Clients are provided to simplify usage. The Client's responsibilities are as follows:
- Provide a consistent API for each application programming language, simplifying application programming
- Improve access performance
The Client can improve performance because it can cache data for reads and buffer writes so that they are sent in batches. We will not go into the details here. The Client's design and implementation will be analyzed in a separate article.
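The write-side buffering can be sketched like this. It is a toy model under assumptions: `BufferedBlockWriter`, the `send` callback, and the tiny chunk size are hypothetical and chosen only to make the batching visible. It does not reflect the real Client's internals.

```python
class BufferedBlockWriter:
    """Collects small writes and ships them in fixed-size chunks."""

    def __init__(self, send, chunk_size=4):
        self._send = send              # callable that ships one chunk (hypothetical transport)
        self._chunk_size = chunk_size
        self._buffer = bytearray()

    def write(self, data: bytes) -> None:
        self._buffer.extend(data)
        # Ship only full chunks; the remainder waits for more data.
        while len(self._buffer) >= self._chunk_size:
            self._send(bytes(self._buffer[:self._chunk_size]))
            del self._buffer[:self._chunk_size]

    def close(self) -> None:
        # Flush whatever is left as a final, possibly short, chunk.
        if self._buffer:
            self._send(bytes(self._buffer))
            self._buffer.clear()
```

Many small application writes turn into fewer, larger transfers, which is where the performance gain comes from.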
Summary
I originally wanted to cover the whole HDFS architecture in one article, but while writing I found that was not feasible. A distributed file system is among the most complex kinds of distributed storage systems, and every architectural tradeoff deserves careful scrutiny. Once I started, the material felt endless. So this article only gives an overall pass, and the design and implementation details of each part will be analyzed in dedicated articles.
References
[1] Hadoop documentation. HDFS Architecture.
[2] Robert Chansler, Hairong Kuang, Sanjay Radia, Konstantin Shvachko, and Suresh Srinivas. The Hadoop Distributed File System.
Copyright notice: This is an original article by the author and may not be reproduced without the author's permission.
Back-end Distributed Series: Distributed Storage - HDFS Architecture Analysis