Introduction to HDFS: Principles, Architecture, and Characteristics


This article describes the principles of HDFS: its architecture, the replica mechanism, load balancing, rack awareness, robustness, and the file deletion and recovery mechanism.

1: Detailed analysis of the HDFS architecture

HDFS Architecture

1, Namenode

2, Datanode

3, Secondary Namenode

Data storage Details

Namenode directory Structure

The fsimage is stored in the list of directories configured in hdfs-site.xml (the dfs.name.dir property).


The Namenode holds the HDFS namespace. Any operation that modifies file system metadata is logged by the Namenode in a transaction log called the EditLog. For example, creating a file in HDFS causes the Namenode to insert a record into the EditLog; changing a file's replication factor likewise inserts a record. The Namenode stores the EditLog in its local operating system's file system. The entire file system namespace, including the mapping of data blocks to files and the attributes of files, is stored in a file called the FsImage, which is also kept on the Namenode's local file system.

The Namenode keeps an image of the entire file system namespace and the file-to-block mapping (the BlockMap) in memory. This critical metadata structure is so compact that a Namenode with 4 GB of memory can support a very large number of files and directories. When the Namenode starts, it reads the EditLog and FsImage from disk, applies all EditLog transactions to the in-memory FsImage, saves the new FsImage back to local disk, and then truncates the old EditLog, since its transactions are now reflected in the FsImage. This process is called a checkpoint. In the implementation described here, a checkpoint occurs only when the Namenode starts; periodic checkpointing was planned for the near future.

HDFS Namespace

HDFS supports a traditional hierarchical file organization. Users or applications can create directories and store files in them. The namespace hierarchy is similar to most existing file systems: users can create, delete, move, or rename files. At the time of writing, HDFS supports neither user disk quotas and access permissions nor hard links and soft links, though the HDFS architecture does not preclude implementing these features.

The Namenode maintains the file system namespace; any change to the namespace or its attributes is recorded by the Namenode. An application can set the number of replicas HDFS keeps for a file. This number is called the file's replication factor, and it too is stored by the Namenode.


A Datanode stores HDFS data as files in its local file system and has no knowledge of HDFS files as such: it stores each HDFS data block in a separate local file. The Datanode does not create all files in the same directory; instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories when appropriate. Creating all local files in one directory is not optimal because the local file system may not efficiently support a large number of files in a single directory.

When a Datanode starts, it scans its local file system, produces a list of all the HDFS data blocks corresponding to local files, and sends this list to the Namenode: this is the block report.
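As an illustration only (not Hadoop code), a block report could be built by scanning the local block files; the `blk_<id>` naming follows HDFS convention, but the directory layout and structures here are simplified:

```python
# Hypothetical sketch of a Datanode building its block report at startup.
import os

def build_block_report(data_dir):
    """Scan data_dir for block files and report their IDs and lengths."""
    report = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            # HDFS block files are named blk_<id>; skip checksum .meta files
            if name.startswith("blk_") and not name.endswith(".meta"):
                path = os.path.join(root, name)
                report.append({"block_id": int(name[4:]),
                               "length": os.path.getsize(path)})
    return sorted(report, key=lambda b: b["block_id"])
```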

Configure Secondary Namenode

The conf/masters file specifies which node runs the Secondary Namenode.

Modify the conf/hdfs-site.xml file on the machine configured in the masters file, adding the following options:

core-site.xml: there are two parameters to configure here, though they are usually left at their defaults. fs.checkpoint.period controls how often an HDFS image is recorded; the default is 1 hour. fs.checkpoint.size sets the edit-log size that triggers a checkpoint regardless of the period; the default is 64 MB.

<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value>
  <description>The number of seconds between two periodic checkpoints.</description>
</property>

<property>
  <name>fs.checkpoint.size</name>
  <value>67108864</value>
  <description>The size of the current edit log (in bytes) that triggers a
  periodic checkpoint even if the fs.checkpoint.period hasn't expired.</description>
</property>

Secondary Namenode

The Secondary Namenode periodically merges the FsImage and the edits log, keeping the edits log file below a size limit.

Secondary Namenode processing Process

(1), The Namenode, responding to a Secondary Namenode request, rolls its edit log: it pushes the current edits to the Secondary Namenode and begins writing to a new edit log.

(2), The Secondary Namenode retrieves the FsImage file and the edit log from the Namenode.

(3), The Secondary Namenode loads the FsImage into memory, applies the edit log, and generates a new FsImage file.

(4), The Secondary Namenode pushes the new FsImage back to the Namenode.

(5), The Namenode replaces the old FsImage with the new one and records the time of the checkpoint in the fstime file.
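The merge in steps (3)-(5) can be sketched as replaying edit-log transactions onto an in-memory image. This is a minimal illustration with invented record formats, not Hadoop's actual data structures:

```python
# Hypothetical sketch of a checkpoint: merge fsimage + edit log.

def apply_edit(namespace, edit):
    """Apply one edit-log record to an in-memory namespace image."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = {"replication": edit[2]}
    elif op == "set_replication":
        namespace[path]["replication"] = edit[2]
    elif op == "delete":
        namespace.pop(path, None)

def checkpoint(fsimage, edit_log):
    """Return a new fsimage with all edits applied, plus an empty edit log."""
    new_image = dict(fsimage)      # load fsimage into memory
    for edit in edit_log:          # replay every logged transaction
        apply_edit(new_image, edit)
    return new_image, []           # old edits can now be discarded
```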

HDFS Communication Protocol

All HDFS communication protocols are built on TCP/IP. A client connects to the Namenode through a configurable port and interacts with it via the ClientProtocol; Datanodes interact with the Namenode via the DatanodeProtocol. Each Datanode maintains its connection to the Namenode by periodically sending heartbeats and block reports. A block report includes the properties of each data block: which file the block belongs to, the block ID, and its modification time. The Namenode builds its Datanode-to-block mapping from the block reports Datanodes send at startup. Both ClientProtocol and DatanodeProtocol are wrapped in a remote procedure call (RPC) abstraction; by design, the Namenode never initiates an RPC, it only responds to RPC requests from clients and Datanodes.

Safe Mode in HDFS

When the Namenode starts, it enters a special state called safe mode. While in safe mode, the Namenode does not replicate data blocks. It receives heartbeats and block reports from the Datanodes; each block report lists all the data blocks a Datanode holds. Every block has a specified minimum number of replicas. A block is considered safely replicated when the Namenode has confirmed that its replica count has reached this minimum. Once a configurable percentage of blocks has been confirmed safe (plus an additional 30-second wait), the Namenode exits safe mode. It then determines which blocks are still below their specified replica count and copies them to other Datanodes.
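The exit rule can be stated as a small predicate. This is an illustrative sketch with invented names and a made-up default threshold, not the Namenode's actual implementation:

```python
# Hypothetical safe-mode exit check: leave safe mode once a configured
# fraction of blocks has at least the minimum replica count.

def can_leave_safemode(block_replica_counts, min_replicas=1, threshold=0.999):
    """block_replica_counts: {block_id: number of reported replicas}."""
    if not block_replica_counts:
        return True
    safe = sum(1 for n in block_replica_counts.values() if n >= min_replicas)
    return safe / len(block_replica_counts) >= threshold
```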

2: Anatomy of an HDFS file read

File read process

Process Analysis

1. The client, using the client library that HDFS provides, initiates an RPC request to the remote Namenode;

2. The Namenode returns, as appropriate, some or all of the file's block list; for each block, the Namenode also returns the addresses of the Datanodes holding a replica of that block;

3. The client library selects the Datanode closest to the client to read each block; if the client is itself a Datanode holding the block, the data is read locally.

4. After finishing the current block, the client closes the connection to the current Datanode and finds the best Datanode for the next block;

5. When the blocks in the list have been read but the file is not yet finished, the client library requests the next batch of blocks from the Namenode.

6. Every block read is verified against its checksum; if a read from a Datanode fails the check, the client notifies the Namenode and continues reading from the next Datanode that holds a replica of the block.
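Steps 3 and 6 can be sketched together: prefer the nearest replica and fall back to the next Datanode on a checksum failure. All names and structures below are invented for illustration, and zlib.crc32 stands in for HDFS's actual checksum algorithm:

```python
# Hypothetical client-side read: nearest replica first, checksum fallback.
import zlib

def read_block(replicas, expected_checksum, client_host):
    """replicas: list of (datanode_host, data_bytes), unordered."""
    # prefer a replica on the client's own host, then the rest in order
    ordered = sorted(replicas, key=lambda r: r[0] != client_host)
    for host, data in ordered:
        if zlib.crc32(data) == expected_checksum:
            return data, host          # verified read succeeded
        # corrupt replica: the real client would notify the Namenode here
    raise IOError("all replicas failed checksum verification")
```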

3: Anatomy of an HDFS file write

File Write process

Process Analysis

1. The client, using the client library that HDFS provides, initiates an RPC request to the remote Namenode;

2. The Namenode checks whether the file to be created already exists and whether the creator has permission to create it; on success it creates a record for the file, otherwise the client throws an exception;

3. As the client begins writing, it splits the file into packets, manages them in an internal data queue, and asks the Namenode for new blocks; the Namenode returns a suitable list of Datanodes to store the replicas, according to the configured replication factor.

4. Packets are written to all replicas in a pipeline: each packet is streamed to the first Datanode, which stores it and passes it to the next Datanode in the pipeline, and so on until the last Datanode. In this way the data is written in a pipelined fashion.

5. After the last Datanode stores a packet successfully, an ACK packet travels back through the pipeline to the client. The client library maintains an internal "ack queue"; when the ACK for a packet is received from the Datanodes, that packet is removed from the ack queue.

6. If a Datanode fails during transmission, the current pipeline is closed and the failed Datanode is removed from it. The remaining data continues to flow through the surviving Datanodes in the pipeline, while the Namenode allocates a new Datanode to restore the configured replica count.
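The ack-queue bookkeeping of steps 5-6 can be sketched as follows. The class and method names are invented; this is an illustration of the mechanism, not the real client library:

```python
# Hypothetical sketch: packets wait in an ack queue until the whole
# pipeline acknowledges them; a failed node is dropped from the pipeline.
from collections import deque

class PipelineWriter:
    def __init__(self, datanodes):
        self.pipeline = list(datanodes)
        self.ack_queue = deque()

    def send(self, packet):
        self.ack_queue.append(packet)   # hold until fully acknowledged

    def on_ack(self, packet):
        if self.ack_queue and self.ack_queue[0] == packet:
            self.ack_queue.popleft()    # acked by every node in the pipeline

    def on_datanode_failure(self, node):
        # close the current pipeline and continue with the survivors; the
        # Namenode would later assign a replacement to restore replication
        self.pipeline.remove(node)
```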

pipelined replication

When a client writes data to an HDFS file, it first writes to a local temporary file. Suppose the file's replication factor is set to 3. When the local temporary file accumulates a full block of data, the client obtains a list of Datanodes from the Namenode to hold the replicas. The client then starts transmitting to the first Datanode, which receives the data in small portions (4 KB), writes each portion to its local repository, and simultaneously forwards it to the second Datanode in the list. The second Datanode does the same: it receives data in small portions, writes them locally, and forwards them to the third Datanode, which finally receives the data and stores it. Thus each Datanode receives data from the previous node and forwards it to the next node at the same time, and the data flows from one Datanode to the next in a pipelined manner.
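The chunked forwarding above can be simulated in a few lines. This is a toy model of the data flow only (in-memory byte buffers stand in for Datanodes), assuming the 4 KB portion size the text mentions:

```python
# Hypothetical sketch of pipelined replication in small chunks.

def pipeline_replicate(data, n_replicas=3, chunk_size=4096):
    """Return the bytes stored at each of n_replicas simulated Datanodes."""
    stores = [bytearray() for _ in range(n_replicas)]
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        # each node writes the chunk locally, then forwards it downstream
        for store in stores:
            store.extend(chunk)
    return [bytes(s) for s in stores]
```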

The write path in more detail

The client's request to create a file is not sent to the Namenode immediately. In fact, the HDFS client first caches the file data in a local temporary file, and the application's writes are transparently redirected there. When the temporary file accumulates more than one block of data, the client contacts the Namenode. The Namenode inserts the file name into the file system hierarchy and allocates a data block for it, then returns the Datanode identifier and the target block to the client. The client uploads the data from the local temporary file to the specified Datanode. When the file is closed, the remaining un-uploaded data in the temporary file is transferred to the specified Datanode, and the client tells the Namenode that the file is closed. Only then does the Namenode commit the file creation operation to its log. If the Namenode goes down before the file is closed, the file is lost.

4: Replica mechanism


1. Single Data type

2. Greater number of replicas

3. How to place a copy when writing a file

4. Dynamic Replica creation Strategy

5. Weakening copy consistency requirements

Replica placement Strategy

Modify the number of replicas

What happens when a cluster has only three Datanodes and the Hadoop replication setting is 4?

When a file is uploaded to HDFS, the replication factor in effect at that moment determines how many replicas the file gets; changing the system-wide setting afterwards does not change the replica count of files already uploaded. In other words, the replica count is fixed at upload time and is unaffected by later changes to dfs.replication, unless you change it explicitly with a command. dfs.replication is essentially a client-side parameter: you can specify a replication factor when creating a file, and dfs.replication is only the default used when you do not. Once a file is uploaded, its replica count is set; modifying dfs.replication affects neither existing files nor files created later with an explicit replication factor, only files that subsequently rely on the default. You can, however, change a file's replica count later with the command Hadoop provides: hadoop fs -setrep -R 1 <path>. Also note that if you set dfs.replication in hdfs-site.xml but the conf folder is not on your program's classpath, your program will pick up dfs.replication from hdfs-default.xml, where the default is 3; this may be why dfs.replication always appears to be 3. You can try setting the replication factor explicitly when creating the file. In general a replication factor of 3 is sufficient; larger values add little.
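As the paragraph above notes, dfs.replication is the client-side default set in hdfs-site.xml; a minimal fragment (the value shown is the standard default) looks like:

```xml
<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Default block replication, used when a file does not
  specify its own replication factor at create time.</description>
</property>
```

Files that already exist keep their replica count; to change it afterwards, use, for example, hadoop fs -setrep -R 2 /path.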

5: HDFS load balancing

HDFS data may not be evenly distributed across the Datanodes. A common reason is the addition of new Datanodes to an existing cluster. When a new data block is added (a file's data is stored as a series of blocks), the Namenode weighs several factors before choosing the Datanodes that will receive it. Some of these considerations are:

1. Place a copy of the block of data on the node that is writing the data block.

2. Try to distribute the different copies of the data blocks in different racks so that the cluster can survive a complete loss of a rack.

3. A copy is usually placed on a node in the same rack as the file-writing node, which reduces network I/o across the rack.

4. Distribute HDFs data as evenly as possible in the Datanode of the cluster.
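Considerations 1-3 correspond to HDFS's default placement policy: first replica on the writer's node, second on a different rack, third on the same rack as the second. A minimal sketch, with invented data structures:

```python
# Hypothetical sketch of the default replica placement policy.

def place_replicas(writer, nodes):
    """writer: (host, rack); nodes: list of (host, rack) candidates."""
    chosen = [writer]                              # 1st replica: local node
    # 2nd replica: any node on a different rack than the writer
    second = next(n for n in nodes if n[1] != writer[1] and n not in chosen)
    chosen.append(second)
    # 3rd replica: a different node on the same rack as the second replica
    third = next(n for n in nodes if n[1] == second[1] and n not in chosen)
    chosen.append(third)
    return chosen
```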

6: HDFS rack awareness


Typically, large Hadoop clusters are organized into racks, and network conditions between nodes on the same rack are better than between nodes on different racks. The Namenode therefore tries to keep the replicas of a data block on different racks to improve fault tolerance.

HDFS cannot automatically discover the network topology of the Datanodes in a cluster, so Hadoop lets the cluster administrator specify each node's rack by configuring a topology mapping (the topology.script.file.name parameter in this generation of Hadoop), which provides an IP-to-rackid translation. The Namenode uses it to obtain the rackid of every Datanode machine in the cluster. If no mapping is configured, every IP is translated to /default-rack.

With rack awareness, the Namenode can build a network topology of the Datanodes. Suppose D1 and R1 are switches and the leaves are Datanodes. Then H1's rackid is /D1/R1/H1: H1's parent is R1, and R1's parent is D1. With these rackids, the distance between any two Datanodes can be calculated.

Distance(/D1/R1/H1, /D1/R1/H1) = 0 (the same Datanode)

Distance(/D1/R1/H1, /D1/R1/H2) = 2 (different Datanodes on the same rack)

Distance(/D1/R1/H1, /D1/R2/H4) = 4 (different racks in the same IDC)

Distance(/D1/R1/H1, /D2/R3/H7) = 6 (Datanodes in different IDCs)
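The distances above follow from the rackid paths: count the hops up to the closest common ancestor and back down. A minimal sketch:

```python
# Compute topology distance between two rackid paths like "/D1/R1/H1".

def topo_distance(a, b):
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):        # length of the shared path prefix
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)
```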

7: Accessing HDFS

How to access

HDFS gives applications a variety of access methods: users can access it through the Java API, through a C-language wrapper around that API, or through a browser to view the files in HDFS.

8: Robustness of HDFS

The main goal of HDFS is to keep data storage reliable even in the presence of failures. The three common failure cases are Namenode failure, Datanode failure, and network partitions.

Disk data errors, heartbeat detection and re-replication

Each Datanode periodically sends a heartbeat signal to the Namenode. A network partition can cause some Datanodes to lose contact with the Namenode, which detects this through the missing heartbeats, marks those Datanodes as dead, and stops sending new IO requests to them. Any data stored on a dead Datanode is no longer available. The loss of a Datanode may push some blocks' replica counts below the specified value; the Namenode continuously checks for blocks that need replication and starts re-replication as soon as it finds one. Re-replication can be needed in several situations: a Datanode fails, a replica is corrupted, a disk on a Datanode goes bad, or a file's replication factor is increased.
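The detection and re-replication check can be sketched as two small functions. Timeouts, names, and structures are invented for illustration:

```python
# Hypothetical sketch of heartbeat-based failure detection and the
# under-replication check described above.

def find_dead_nodes(last_heartbeat, now, timeout=600):
    """last_heartbeat: {datanode: time of last heartbeat, in seconds}."""
    return {dn for dn, t in last_heartbeat.items() if now - t > timeout}

def blocks_needing_replication(block_locations, dead, target=3):
    """block_locations: {block_id: set of datanodes holding a replica}."""
    need = {}
    for blk, nodes in block_locations.items():
        live = nodes - dead          # replicas on dead nodes no longer count
        if len(live) < target:
            need[blk] = target - len(live)
    return need
```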

Data integrity

A block fetched from a Datanode may arrive corrupted, whether from a storage device error, a network error, or a software bug. The HDFS client software therefore verifies the contents of HDFS files with checksums. When a client creates an HDFS file, it computes a checksum for each block and saves the checksums in a separate hidden file in the same HDFS namespace. When the client later reads the file, it verifies that the data received from each Datanode matches the checksum in the corresponding checksum file; if not, the client can choose to fetch a replica of that block from another Datanode.
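A minimal sketch of the scheme: compute per-block checksums on create, re-verify on read. zlib.crc32 stands in for HDFS's actual checksum algorithm, and the "hidden file" is modeled as a plain list:

```python
# Hypothetical per-block checksum creation and verification.
import zlib

def make_checksums(blocks):
    """blocks: list of bytes objects -> list of per-block checksums."""
    return [zlib.crc32(b) for b in blocks]

def verify(blocks, checksums):
    """Return indices of blocks whose content no longer matches."""
    return [i for i, (b, c) in enumerate(zip(blocks, checksums))
            if zlib.crc32(b) != c]
```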

Meta Data disk error

The FsImage and EditLog are HDFS's core data structures; if they are corrupted, the whole HDFS instance is unusable. The Namenode can therefore be configured to maintain multiple copies of the FsImage and EditLog: any modification to either is synchronized to all the replicas. This synchronization can reduce the number of namespace transactions the Namenode processes per second, but the cost is acceptable, because although HDFS applications are data-intensive, they are not metadata-intensive. When the Namenode restarts, it selects the most recent consistent FsImage and EditLog to use.

The Namenode is a single point of failure in an HDFS cluster: if the Namenode machine fails, manual intervention is required. The ability to automatically restart the Namenode or fail it over to another machine has not yet been implemented.


Snapshots support storing a copy of the data as it existed at a particular point in time, so that HDFS can be rolled back to a known-good point when data becomes corrupted. HDFS does not currently support snapshots, but support is planned for a future release.

9: HDFS file deletion and recovery

When a user or application deletes a file, the file is not immediately removed from HDFS. Instead, HDFS renames the file into the /trash directory. As long as the file remains in /trash, it can be quickly restored. How long a file stays in /trash is configurable; when that time expires, the Namenode deletes the file from the namespace, which in turn releases the file's data blocks. Note that there is a delay between a user deleting a file and the corresponding increase in free space in HDFS.

As long as a deleted file is still in the /trash directory, the user can restore it by browsing /trash and retrieving the file. The /trash directory holds only the latest copy of each deleted file. Apart from one special behavior, /trash is no different from any other directory: HDFS applies a policy that automatically deletes files in it. The current default policy is to delete files that have been in /trash for more than 6 hours; in the future, this policy will be configurable through a well-defined interface.

Enabling the Recycle Bin (trash)






<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
  <description>Number of minutes between trash checkpoints.
  If zero, the trash feature is disabled.</description>
</property>




1, The fs.trash.interval parameter sets the retention time; 1440 minutes is 1 day.

2, Location of the recycle bin: /user/$USER/.Trash/Current/ on HDFS.

10: HDFS Distributed Cache (DistributedCache)
