"HDFS" Hadoop Distributed File System: Architecture and Design


    • Introduction
    • Prerequisites and Design Objectives
      • Hardware failure
      • Streaming data access
      • Large data sets
      • A simple consistency model
      • "Moving computation is cheaper than moving data"
      • Portability across heterogeneous software and hardware platforms
    • NameNode and DataNode
    • The file system namespace
    • Data replication
      • Replica placement: the first baby steps
      • Replica selection
      • Safe Mode
    • Persistence of file system metadata
    • Communication protocols
    • Robustness
      • Data disk failure, heartbeats, and re-replication
      • Cluster rebalancing
      • Data integrity
      • Metadata disk failure
      • Snapshots
    • Data organization
      • Data blocks
      • Staging
      • Pipeline Replication
    • Accessibility
      • DFSShell
      • DFSAdmin
      • Browser interface
    • Space reclamation
      • File deletes and undeletes
      • Decreasing the replication factor
    • Resources
Introduction

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has much in common with existing distributed file systems, but the differences are also significant. HDFS is highly fault-tolerant and is designed to be deployed on inexpensive machines. It provides high-throughput access to application data and is well suited to applications with large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch search engine project and is part of the Apache Hadoop core project.

Prerequisites and Design Objectives

Hardware failure

Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds of servers, each of which stores part of the file system's data. The reality is that the number of components is huge and any component can fail, which means that some part of HDFS is effectively always non-functional. Therefore, detection of faults and quick, automatic recovery from them are core architectural goals of HDFS.

Streaming data access

Applications that run on HDFS differ from ordinary general-purpose applications: they need streaming access to their data sets. HDFS is designed more for batch processing than for interactive use by users; the emphasis is on high throughput of data access rather than low latency. Many of the hard requirements imposed by the POSIX standard are not needed by HDFS applications, so a few key POSIX semantics have been traded away to increase data throughput.

Large data sets

Applications that run on HDFS work with large data sets. A typical file in HDFS is gigabytes to terabytes in size, so HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster, and a single HDFS instance should be able to support tens of millions of files.

A simple consistency model

HDFS applications need a write-once-read-many access model for files: a file, once created, written, and closed, does not need to change. This assumption simplifies data consistency issues and makes high-throughput data access possible. Map/Reduce applications and web crawlers fit this model well. There is a plan to support appending writes to files in the future.

" Mobile Computing is more cost effective than moving data "

A computation requested by an application is much more efficient if it executes near the data it operates on, especially when the data set is huge. This minimizes network congestion and increases the overall throughput of the system. It is often better to migrate the computation closer to where the data is located than to move the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.

Portability across heterogeneous software and hardware platforms

HDFS is designed to be easily portable from one platform to another. This facilitates the adoption of HDFS as the platform of choice for a broad range of large-scale data applications.

NameNode and DataNode

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode and a number of DataNodes. The NameNode is a central server that manages the file system namespace and regulates clients' access to files. There is typically one DataNode per node in the cluster, and it manages the storage attached to the node it runs on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, which are stored on a set of DataNodes. The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories; it also determines the mapping of blocks to DataNodes. The DataNodes serve read and write requests from the file system's clients, and they create, delete, and replicate blocks under instruction from the NameNode.

The NameNode and DataNode are pieces of software designed to run on ordinary commodity machines, which typically run a GNU/Linux operating system (OS). HDFS is built in the Java language, so any machine that supports Java can run the NameNode or DataNode software; the highly portable Java language means HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode, while each of the other machines in the cluster runs one DataNode instance. The architecture does not preclude running multiple DataNodes on the same machine, but this is rare in practice.

The existence of a single NameNode in a cluster greatly simplifies the system architecture. The NameNode is the arbitrator and repository of all HDFS metadata, and the system is designed so that user data never flows through the NameNode.

The file system namespace

HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside them. The file system namespace hierarchy is similar to that of most existing file systems: users can create, delete, move, or rename files. HDFS does not currently support user disk quotas or access permissions, nor does it support hard links or soft links; however, the architecture does not preclude implementing these features.

The NameNode maintains the file system namespace; any change to the namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that HDFS should maintain. The number of copies of a file is called the replication factor of that file, and this information is also stored by the NameNode.
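
As a concrete illustration of these namespace operations, the minimal sketch below uses the Hadoop Java API listed under Resources; the paths are hypothetical, and the cluster location is assumed to come from the Hadoop configuration files on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Minimal sketch of namespace operations against HDFS, assuming the cluster
    // address is picked up from the Hadoop configuration on the classpath.
    public class NamespaceOps {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();          // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);              // handle backed by the NameNode

            fs.mkdirs(new Path("/foodir"));                    // create a directory (example path)
            fs.rename(new Path("/foodir/old.txt"),
                      new Path("/foodir/new.txt"));            // move/rename within the namespace
            fs.delete(new Path("/foodir/new.txt"), false);     // delete a single file (non-recursive)

            fs.close();
        }
    }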

Data replication

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last one are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file: an application can specify the number of replicas at file creation time and change it later. Files in HDFS are write-once and have strictly one writer at any time.

The NameNode makes all decisions regarding replication of blocks. It periodically receives a heartbeat and a block report (BlockReport) from each DataNode in the cluster. Receipt of a heartbeat implies that the DataNode is functioning properly; a block report contains a list of all the blocks on that DataNode.

Replica placement: the first baby steps

The placement of replicas is critical to HDFS reliability and performance, and an optimized replica placement policy is what distinguishes HDFS from most other distributed file systems. This is a feature that needs a great deal of tuning and accumulated experience. HDFS uses a rack-aware replica placement policy to improve data reliability, availability, and network bandwidth utilization. The current implementation of the policy is only a first step in this direction; its short-term goals are to validate it in production, learn more about its behavior, and build a foundation for testing and research into more sophisticated policies.

Large HDFS instances typically run on a cluster of computers that spans many racks, and communication between two machines in different racks has to go through switches. In most cases, the network bandwidth between two machines in the same rack is greater than the bandwidth between two machines in different racks.

Through a rack-awareness process, the NameNode determines the rack ID that each DataNode belongs to. A simple but non-optimal policy is to place each replica on a different rack. This prevents data loss when an entire rack fails and allows the bandwidth of multiple racks to be used when reading data. It also spreads replicas evenly across the cluster, which helps balance load when a component fails. However, this policy increases the cost of writes, because a write must transfer blocks to multiple racks.

For the common case, when the replication factor is 3, HDFS's placement policy is to put one replica on the local node (the writer's node), another replica on a different node in the same rack, and the last replica on a node in a different rack. This policy cuts inter-rack traffic during writes, which improves write performance. Rack failures are far less likely than node failures, so the policy does not compromise data reliability and availability. At the same time, because blocks are placed on only two racks rather than three, it reduces the aggregate network bandwidth used when reading the data. With this policy the replicas of a block are not evenly distributed across racks: one third of the replicas are on one node, two thirds are on one rack, and the remaining third is distributed across the remaining racks. This improves write performance without compromising data reliability or read performance.
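
The sketch below is an illustrative rendering of that placement rule for a replication factor of 3; it is not the actual NameNode placement code, and the Node type and rack IDs are hypothetical stand-ins.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch ONLY of the default 3-replica placement described above.
    public class PlacementSketch {
        static class Node {
            final String host, rackId;
            Node(String host, String rackId) { this.host = host; this.rackId = rackId; }
        }

        // Targets: the writer's node, another node on the same rack,
        // and one node on a different rack (so only two racks are used).
        static List<Node> chooseTargets(Node writer, List<Node> cluster) {
            List<Node> targets = new ArrayList<>();
            targets.add(writer);                                   // replica 1: local node
            for (Node n : cluster)                                 // replica 2: same rack, different node
                if (n.rackId.equals(writer.rackId) && !n.host.equals(writer.host)) { targets.add(n); break; }
            for (Node n : cluster)                                 // replica 3: any node on a different rack
                if (!n.rackId.equals(writer.rackId)) { targets.add(n); break; }
            return targets;
        }

        public static void main(String[] args) {
            List<Node> cluster = new ArrayList<>();
            cluster.add(new Node("host1", "rackA"));
            cluster.add(new Node("host2", "rackA"));
            cluster.add(new Node("host3", "rackB"));
            for (Node n : chooseTargets(cluster.get(0), cluster))
                System.out.println(n.host + " on " + n.rackId);
        }
    }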

The default replica placement policy described here is still a work in progress.

Replica selection

To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from the replica that is closest to the reader. If a replica exists on the same rack as the reader node, that replica is preferred. If an HDFS cluster spans multiple data centers, a replica resident in the local data center is preferred over any remote replica.

Safe Mode

On startup, the NameNode enters a special state called safe mode. Replication of data blocks does not occur while the NameNode is in safe mode. The NameNode receives heartbeats and block reports from the DataNodes; a block report lists the data blocks that a DataNode is hosting. Every block has a specified minimum number of replicas. A block is considered safely replicated when the NameNode has confirmed that this minimum number of replicas has been reported. After a configurable percentage of the blocks has been confirmed as safely replicated (plus an additional 30 seconds), the NameNode exits safe mode. It then determines which blocks, if any, still have fewer than the specified number of replicas and replicates those blocks to other DataNodes.
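
A simplified sketch of the exit condition described above follows; it is illustrative only (the real NameNode logic and its configuration parameters are more involved), and the threshold fraction is passed in explicitly rather than read from configuration.

    // Illustrative sketch of the safe-mode exit check, not the real NameNode code.
    public class SafeModeSketch {
        // A block is "safely replicated" once its minimum replica count has been reported.
        static boolean canLeaveSafeMode(long safelyReplicatedBlocks, long totalBlocks,
                                        double thresholdPct /* configurable fraction, e.g. 0.999 */) {
            if (totalBlocks == 0) return true;
            return (double) safelyReplicatedBlocks / totalBlocks >= thresholdPct;
        }

        public static void main(String[] args) {
            System.out.println(canLeaveSafeMode(9990, 10000, 0.999)); // true: threshold reached
            System.out.println(canLeaveSafeMode(9500, 10000, 0.999)); // false: keep waiting
        }
    }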

Persistence of file system metadata

The NameNode stores the HDFS namespace. Every change to file system metadata is recorded by the NameNode in a transaction log called the EditLog. For example, creating a file in HDFS causes the NameNode to insert a record into the EditLog; changing a file's replication factor likewise inserts a record. The NameNode keeps the EditLog in a file in its local operating system's file system. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage, which is also kept in the NameNode's local file system.

The NameNode keeps an image of the entire file system namespace and the file-to-block map (BlockMap) in memory. This key metadata structure is designed to be compact, so a NameNode with 4 GB of RAM is enough to support a huge number of files and directories. When the NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory FsImage, flushes this new version of the FsImage to the local disk, and then truncates the old EditLog, because its transactions have now been applied to the persistent FsImage. This process is called a checkpoint. In the current implementation, a checkpoint only occurs when the NameNode starts up; support for periodic checkpointing is planned for the near future.
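
As a rough illustration of this checkpoint sequence (not the actual NameNode code), the sketch below replays logged edits into an in-memory image, persists it, and clears the log; the record strings, the in-memory lists, and the omitted disk I/O are stand-ins.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch of the checkpoint described above.
    public class CheckpointSketch {
        private final List<String> inMemoryImage = new ArrayList<>();  // stands in for the FsImage
        private final List<String> editLog = new ArrayList<>();        // stands in for the EditLog

        void logEdit(String record) { editLog.add(record); }           // e.g. "create /foodir/a"

        void checkpoint() {
            inMemoryImage.addAll(editLog);   // apply every logged transaction to the image
            saveImageToDisk();               // persist the new FsImage
            editLog.clear();                 // the old EditLog is no longer needed
        }

        private void saveImageToDisk() { /* omitted: write inMemoryImage to local disk */ }

        public static void main(String[] args) {
            CheckpointSketch nn = new CheckpointSketch();
            nn.logEdit("create /foodir/myfile.txt");
            nn.logEdit("setReplication /foodir/myfile.txt 3");
            nn.checkpoint();                 // image now reflects both edits; EditLog is empty
        }
    }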

The DataNode stores HDFS data in files in its local file system and has no knowledge of HDFS files; it stores each HDFS block in a separate local file. The DataNode does not create all files in the same directory: it uses a heuristic to determine the optimal number of files per directory and creates subdirectories appropriately. Creating all the local files in one directory would not be optimal, because the local file system might not efficiently support a very large number of files in a single directory. When a DataNode starts up, it scans its local file system, generates a list of all the HDFS data blocks that correspond to these local files, and sends this list to the NameNode: this is the block report.

Communication Protocols

All HDFS communication protocols are layered on top of the TCP/IP protocol. A client connects to a configurable TCP port on the NameNode and talks to it using the ClientProtocol; DataNodes talk to the NameNode using the DataNodeProtocol. A Remote Procedure Call (RPC) abstraction wraps both the ClientProtocol and the DataNodeProtocol. By design, the NameNode never initiates any RPCs; it only responds to RPC requests issued by clients or DataNodes.

Robustness

The primary objective of HDFS is to store data reliably even in the presence of failures. The three common failure types are NameNode failures, DataNode failures, and network partitions.

Data disk failure, heartbeats, and re-replication

Each DataNode sends a heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of heartbeats, marks DataNodes without recent heartbeats as dead, and stops forwarding new I/O requests to them. Any data that was stored on a dead DataNode is no longer available to HDFS. DataNode death may cause the replication factor of some blocks to fall below their specified value; the NameNode constantly tracks which blocks need to be re-replicated and initiates replication as soon as the need is detected. Re-replication may be required for several reasons: a DataNode becomes unavailable, a replica becomes corrupted, a hard disk on a DataNode fails, or the replication factor of a file is increased.
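
A minimal illustrative sketch of dead-node detection by heartbeat age follows; the timeout value and the bookkeeping structure are assumptions for the example, not Hadoop's actual implementation or defaults.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch of marking DataNodes dead when heartbeats stop arriving.
    public class HeartbeatSketch {
        static final long TIMEOUT_MS = 10 * 60 * 1000;      // assumed timeout for the example

        // host -> timestamp of the last heartbeat received by the NameNode
        private final Map<String, Long> lastHeartbeat = new HashMap<>();

        void onHeartbeat(String dataNode, long nowMs) {
            lastHeartbeat.put(dataNode, nowMs);
        }

        boolean isDead(String dataNode, long nowMs) {
            Long last = lastHeartbeat.get(dataNode);
            return last == null || nowMs - last > TIMEOUT_MS;  // no recent heartbeat: mark as dead
        }

        public static void main(String[] args) {
            HeartbeatSketch nn = new HeartbeatSketch();
            nn.onHeartbeat("datanode-1", 0L);
            System.out.println(nn.isDead("datanode-1", 5 * 60 * 1000L));   // false: heartbeat is recent
            System.out.println(nn.isDead("datanode-1", 20 * 60 * 1000L));  // true: silent for too long
        }
    }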

Cluster rebalancing

The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold. In the event of a sudden surge in demand for a particular file, a scheme might dynamically create additional replicas of that file and rebalance other data in the cluster. These types of rebalancing schemes are not yet implemented.

Data integrity

A block of data fetched from a DataNode may arrive corrupted, because of faults in the storage device, network faults, or buggy software. The HDFS client software implements checksum verification of the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents, it verifies that the data it received from each DataNode matches the checksum stored in the corresponding checksum file; if not, the client can opt to retrieve that block from another DataNode that has a replica.
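
The sketch below illustrates the idea with a single CRC32 checksum computed at write time and re-verified at read time. It is a simplification: the real HDFS client checksums fixed-size chunks and stores the checksums in a hidden companion file rather than in memory.

    import java.util.zip.CRC32;

    // Illustrative sketch of checksum-based integrity checking for one block.
    public class ChecksumSketch {
        static long checksumOf(byte[] block) {
            CRC32 crc = new CRC32();
            crc.update(block, 0, block.length);
            return crc.getValue();
        }

        public static void main(String[] args) {
            byte[] block = "some block contents".getBytes();
            long stored = checksumOf(block);                 // computed at write time
            block[0] ^= 0x01;                                // simulate corruption on a DataNode
            boolean ok = checksumOf(block) == stored;        // verified again at read time
            System.out.println(ok ? "block ok" : "mismatch: fetch another replica");
        }
    }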

Metadata disk failure

The FsImage and the EditLog are central data structures of HDFS; corruption of these files can render the entire HDFS instance non-functional. For this reason, the NameNode can be configured to maintain multiple copies of the FsImage and EditLog. Any update to either the FsImage or the EditLog is synchronously applied to each copy. This synchronous updating of multiple copies may reduce the rate of namespace transactions per second that the NameNode can support. The degradation is acceptable, however, because even though HDFS applications are very data-intensive, they are not metadata-intensive. When the NameNode restarts, it selects the latest consistent FsImage and EditLog to use.

The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode machine fails, manual intervention is necessary. Automatic restart or failover of the NameNode software to another machine is not yet supported.

Snapshots

Snapshots support storing a copy of the data at a particular point in time. One use of the snapshot feature would be to roll back a corrupted HDFS instance to a previously known good point in time. HDFS does not currently support snapshots, but support is planned for a future release.

Data organization

Data blocks

HDFS is designed to support very large files; the applications that use HDFS are those that deal with large data sets. These applications write their data only once but read it one or more times, and they require these reads to proceed at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB; thus an HDFS file is chopped up into 64 MB chunks, and, when possible, each chunk resides on a different DataNode.
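
For illustration, the minimal sketch below creates a file with an explicit 64 MB block size and a replication factor of 3 through the Java API's create overload that accepts those parameters; the path, overwrite flag, and buffer size are assumptions for the example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Minimal sketch: create a file with a 64 MB block size and replication factor 3.
    public class CreateWithBlockSize {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            long blockSize = 64L * 1024 * 1024;              // 64 MB blocks, as discussed above
            FSDataOutputStream out = fs.create(
                    new Path("/foodir/bigfile.dat"),         // hypothetical target file
                    true,                                    // overwrite if it exists
                    4096,                                    // I/O buffer size
                    (short) 3,                               // replication factor
                    blockSize);
            out.writeBytes("payload...");                    // blocks are formed as data accumulates
            out.close();
            fs.close();
        }
    }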

Staging

A client request to create a file does not reach the NameNode immediately. In fact, the HDFS client initially caches the file data in a temporary local file, and application writes are transparently redirected to this temporary file. When the local file accumulates data worth one HDFS block, the client contacts the NameNode. The NameNode inserts the file name into the file system hierarchy and allocates a data block for it, then responds to the client with the identity of the DataNode and the destination data block. The client then flushes the block of data from the local temporary file to the specified DataNode. When the file is closed, the remaining unflushed data in the temporary file is transferred to the DataNode, and the client tells the NameNode that the file is closed. At that point the NameNode commits the file creation operation to persistent storage. If the NameNode dies before the file is closed, the file is lost.
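
The sketch below is a purely illustrative rendering of this staging behaviour, not the real HDFS client: the NameNode and DataNode interfaces, the block id, and the in-memory buffer are hypothetical stand-ins used only to show when the locally buffered data is shipped out.

    import java.io.ByteArrayOutputStream;

    // Illustrative sketch of client-side staging as described above.
    public class StagingSketch {
        interface NameNode { String allocateBlock(String path); }            // assigns a block (stand-in)
        interface DataNode { void storeBlock(String blockId, byte[] data); } // receives the block (stand-in)

        static final int BLOCK_SIZE = 64 * 1024 * 1024;

        private final ByteArrayOutputStream staging = new ByteArrayOutputStream(); // local temporary file
        private final NameNode nameNode;
        private final DataNode dataNode;
        private final String path;

        StagingSketch(NameNode nn, DataNode dn, String path) {
            this.nameNode = nn; this.dataNode = dn; this.path = path;
        }

        void write(byte[] data) {
            staging.write(data, 0, data.length);        // application writes land in the local buffer
            if (staging.size() >= BLOCK_SIZE) flushBlock();
        }

        void close() {
            if (staging.size() > 0) flushBlock();       // remaining data is shipped when the file closes
            // ...then the client tells the NameNode that the file is closed
        }

        private void flushBlock() {
            String blockId = nameNode.allocateBlock(path);        // NameNode assigns a block and a target
            dataNode.storeBlock(blockId, staging.toByteArray());  // upload the staged block
            staging.reset();
        }

        public static void main(String[] args) {
            NameNode nn = path -> "blk_0001";
            DataNode dn = (id, data) -> System.out.println("stored " + id + " (" + data.length + " bytes)");
            StagingSketch client = new StagingSketch(nn, dn, "/foodir/bigfile.dat");
            client.write(new byte[10]);                 // a small write stays in the local buffer
            client.close();                             // remaining data is shipped on close
        }
    }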

The above approach was adopted after careful consideration of the target applications that run on HDFS. These applications need streaming writes to files. Without client-side buffering, network speed and congestion would considerably hurt throughput. The approach is not without precedent: earlier distributed file systems, such as AFS, have used client-side caching to improve performance. A POSIX requirement has been relaxed to achieve higher data upload performance.

Pipeline Replication

When a client writes data to an HDFS file, the data is first written to a local temporary file, as explained above. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves from the NameNode a list of DataNodes that will host replicas of that block. The client then starts transferring the block to the first DataNode. The first DataNode receives the data in small portions (4 KB), writes each portion to its local repository, and at the same time transfers that portion to the second DataNode in the list. The second DataNode likewise receives each portion, writes it to its repository, and forwards it to the third DataNode; finally, the third DataNode receives the data and stores it locally. Thus a DataNode can be receiving data from the previous node in the pipeline while simultaneously forwarding it to the next node: the data is pipelined from one DataNode to the next.
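
The following is an illustrative sketch of the forwarding step described above, not the actual DataNode transfer protocol: each node stores the 4 KB portion it receives locally and immediately relays it to the next node in the pipeline.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    // Illustrative sketch of one pipeline stage: store locally, forward downstream.
    public class PipelineSketch {
        static void relay(InputStream fromPrevious, OutputStream localStore,
                          OutputStream toNext) throws IOException {
            byte[] portion = new byte[4 * 1024];                 // data moves in small 4 KB portions
            int n;
            while ((n = fromPrevious.read(portion)) != -1) {
                localStore.write(portion, 0, n);                 // persist locally
                if (toNext != null) toNext.write(portion, 0, n); // and forward down the pipeline
            }
            localStore.flush();
            if (toNext != null) toNext.flush();
        }

        public static void main(String[] args) throws IOException {
            byte[] block = new byte[10 * 1024];
            java.io.ByteArrayOutputStream local = new java.io.ByteArrayOutputStream();
            java.io.ByteArrayOutputStream next = new java.io.ByteArrayOutputStream();
            relay(new java.io.ByteArrayInputStream(block), local, next);
            System.out.println(local.size() + " bytes stored, " + next.size() + " forwarded");
        }
    }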

Accessibility

HDFS can be accessed by applications in many different ways. Natively, it provides a Java API for applications to use; a C-language wrapper for this API is also available. In addition, the files in HDFS can be browsed with a web browser. Access via the WebDAV protocol is under development.
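
As a small example of the Java API access mentioned above, the sketch below opens an HDFS file and prints its contents; the path is hypothetical and the cluster location is taken from the configuration on the classpath.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Minimal sketch of reading an HDFS file through the Java API.
    public class ReadExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataInputStream in = fs.open(new Path("/foodir/myfile.txt"));
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);                // print the file's contents
            }
            reader.close();
            fs.close();
        }
    }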

DFSShell

HDFS organizes user data in the form of files and directories. It provides a command-line interface called DFSShell that lets a user interact with the data in HDFS. The syntax of these commands is similar to other shells (such as bash or csh) that users are already familiar with. Here are some sample action/command pairs:

Action                                                   Command
Create a directory named /foodir                         bin/hadoop dfs -mkdir /foodir
Remove a directory named /foodir                         bin/hadoop dfs -rmr /foodir
View the contents of a file named /foodir/myfile.txt     bin/hadoop dfs -cat /foodir/myfile.txt

DFSShell is targeted at applications and scripts that need to interact with the data stored in HDFS.

DFSAdmin

The DFSAdmin command set is used for administering an HDFS cluster. These commands are intended only for HDFS administrators. Here are some sample action/command pairs:

Action                                       Command
Put the cluster in safe mode                 bin/hadoop dfsadmin -safemode enter
Generate a list of DataNodes                 bin/hadoop dfsadmin -report
Decommission DataNode datanodename           bin/hadoop dfsadmin -decommission datanodename

Browser interface

A typical HDFS installation configures a web server to expose the HDFS namespace through a configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of its files using a web browser.

Space reclamation

File deletes and undeletes

When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it into the /trash directory. The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a configurable amount of time; after this time expires, the NameNode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with it to be freed. Note that there can be an appreciable delay between the time a user deletes a file and the time the corresponding free space appears in HDFS.

A user can undelete a file as long as it remains in the /trash directory. To do so, the user can navigate the /trash directory and retrieve the file. The /trash directory contains only the latest copy of the deleted file. The /trash directory is just like any other directory, with one special feature: HDFS applies a policy that automatically deletes files from it. The current default policy is to delete files that have been in /trash for more than 6 hours. In the future, this policy will be configurable through a well-defined interface.
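
Purely as an illustration of undeleting by moving a file out of the trash, the sketch below renames a trashed file back to its original location through the Java API. The /trash path follows this document's description; actual trash locations vary by Hadoop version and are usually per-user, so treat the paths as assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Minimal sketch of restoring a deleted file by renaming it out of the trash.
    public class UndeleteExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            boolean restored = fs.rename(new Path("/trash/foodir/myfile.txt"),  // hypothetical trashed copy
                                         new Path("/foodir/myfile.txt"));       // original location
            System.out.println(restored ? "restored" : "nothing to restore");
            fs.close();
        }
    }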

Decreasing the replication factor

When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted. The next heartbeat transfers this information to the DataNode, which then removes the corresponding blocks, and the corresponding free space appears in the cluster. As with file deletion, there may be a delay between the completion of the setReplication API call and the appearance of free space in the cluster.
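
A minimal sketch of the setReplication call mentioned above is shown below; the path and the new replication factor are assumptions for the example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Minimal sketch of lowering a file's replication factor via the Java API.
    public class SetReplicationExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            boolean ok = fs.setReplication(new Path("/foodir/bigfile.dat"), (short) 2);
            System.out.println(ok ? "replication factor updated" : "request failed");
            fs.close();  // excess replicas are dropped by DataNodes on subsequent heartbeats
        }
    }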

Resources

HDFS Java API: http://hadoop.apache.org/core/docs/current/api/

HDFS Source code: http://hadoop.apache.org/core/version_control.html

by Dhruba Borthakur

"HDFS" Hadoop Distributed File System: Architecture and Design

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.