Original: http://hadoop.apache.org/core/docs/current/hdfs_design.html
Introduction
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on general-purpose (commodity) hardware. It has much in common with existing distributed file systems, but the differences are significant. HDFS is highly fault tolerant and is designed to be deployed on inexpensive machines. It provides high-throughput access to application data and is suitable for applications with large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project and is now part of the Apache Hadoop core project. The project address is http://hadoop.apache.org/core/.
Assumptions and Design Goals
Hardware Errors
Hardware errors are the norm rather than the exception. An HDFS instance may consist of hundreds of servers, each storing part of the file system's data. The reality is that the number of components is huge and every component has a non-trivial probability of failure, which means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
Streaming Data Access
Applications that run on HDFS differ from ordinary applications in that they need streaming access to their data sets. HDFS is designed more for batch processing than for interactive use. The emphasis is on high throughput of data access rather than low latency. Many of the hard constraints imposed by the POSIX standard are not needed by HDFS applications, so a few key aspects of POSIX semantics have been modified to increase data throughput.
Large Data Sets
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size, so HDFS is tuned to support large files. It should provide high aggregate data transfer bandwidth and scale to hundreds of nodes in a single cluster. A single HDFS instance should support tens of millions of files.
Simple Consistency Model
HDFS applications need a write-once-read-many access model for files. A file, once created, written, and closed, need not be changed. This assumption simplifies data consistency issues and enables high-throughput data access. Map/Reduce applications and web crawler applications fit this model perfectly. There is a plan to support appending writes to files in the future.
"Mobile computing is more cost-effective than moving data"
A computation requested by an application is much more efficient if it executes near the data it operates on, especially when the data set is huge. This minimizes network congestion and increases the overall throughput of the system. Moving the computation closer to the data is clearly better than moving the data to where the application runs. HDFS provides interfaces that let applications move themselves closer to where the data is located.
Portability Across Heterogeneous Hardware and Software Platforms
HDFS is designed to be easily portable from one platform to another. This facilitates the widespread adoption of HDFS as a platform for large-scale data applications.
Namenode and Datanode
HDFS has a master/slave architecture. An HDFS cluster consists of a single Namenode and a number of Datanodes. The Namenode is a central server that manages the file system namespace and regulates client access to files. A Datanode, usually one per node in the cluster, manages the storage attached to the node it runs on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more data blocks, and these blocks are stored on a set of Datanodes. The Namenode executes namespace operations on the file system, such as opening, closing, and renaming files and directories; it also determines the mapping of blocks to specific Datanodes. The Datanodes serve read and write requests from file system clients, and they create, delete, and replicate blocks under instruction from the Namenode.
The Namenode and Datanode are designed to run on ordinary commodity machines, typically running a GNU/Linux operating system (OS). HDFS is written in Java, so any machine that supports Java can run a Namenode or a Datanode. Thanks to the highly portable Java language, HDFS can be deployed on a wide range of machines. A typical deployment has one machine that runs only the Namenode instance, while each of the other machines in the cluster runs one Datanode instance. The architecture does not preclude running multiple Datanodes on the same machine, but in practice this is rare.
The existence of a single Namenode in a cluster greatly simplifies the architecture of the system. The Namenode is the arbiter and repository for all HDFS metadata, and the system is designed so that user data never flows through the Namenode.
The File System Namespace
HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside them. The file system namespace hierarchy is similar to most existing file systems: one can create, delete, move, or rename files. HDFS does not currently support user quotas or access permissions, nor does it support hard links or soft links. However, the HDFS architecture does not preclude implementing these features.
The Namenode maintains the file system namespace. Any change to the namespace or to file system properties is recorded by the Namenode. An application can specify the number of replicas of a file that HDFS should maintain. The number of copies of a file is called the replication factor of that file, and this information is also stored by the Namenode.
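These namespace operations correspond directly to methods on the Hadoop FileSystem Java API. The sketch below is illustrative only: the paths are made up, and the cluster address is assumed to come from the standard configuration files.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NamespaceOps {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();      // picks up the cluster configuration
            FileSystem fs = FileSystem.get(conf);          // handle to the namespace managed by the Namenode
            Path dir = new Path("/foodir");                // hypothetical directory
            fs.mkdirs(dir);                                // create a directory
            fs.rename(dir, new Path("/foodir2"));          // rename it
            fs.delete(new Path("/foodir2"), true);         // delete it recursively
            fs.close();
        }
    }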
Data Replication
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last one are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file; the replication factor can be specified at file creation time and changed later. Files in HDFS are write-once, and there is a strict requirement of a single writer at any time.
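A minimal sketch of specifying the replication factor and block size at creation time with the FileSystem Java API; the path and the chosen values are illustrative assumptions, not values dictated by HDFS.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CreateWithReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            short replication = 3;                   // replication factor for this file
            long blockSize = 64L * 1024 * 1024;      // 64 MB block size
            int bufferSize = 4096;                   // I/O buffer size
            FSDataOutputStream out = fs.create(new Path("/foodir/myfile.txt"),
                                               true, bufferSize, replication, blockSize);
            out.write("hello hdfs".getBytes());      // data is replicated once the file is closed
            out.close();
            fs.close();
        }
    }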
The Namenode makes all decisions regarding replication of blocks. It periodically receives a heartbeat and a block status report (Blockreport) from each Datanode in the cluster. Receipt of a heartbeat implies that the Datanode is functioning properly. A block report contains a list of all blocks stored on that Datanode.
Replica Placement: The First Step
The placement of replicas is critical to HDFS reliability and performance. An optimized replica placement policy is the main feature that distinguishes HDFS from most other distributed file systems, and it requires a great deal of tuning and accumulated experience. HDFS uses a policy called rack awareness to improve data reliability, availability, and network bandwidth utilization. The current implementation of the replica placement policy is only a first step in this direction. The short-term goals are to validate it in a production environment, learn more about its behavior, and build a foundation for testing and researching more sophisticated policies.
Large HDFS instances typically run on a cluster of computers that spans many racks, and communication between two machines in different racks has to go through switches. In most cases, the network bandwidth between two machines in the same rack is greater than the bandwidth between two machines in different racks.
Through a rack-awareness process, the Namenode can determine the rack id to which each Datanode belongs. A simple but non-optimal policy is to place the replicas on distinct racks. This prevents data loss when an entire rack fails and allows the bandwidth of multiple racks to be used when reading data. This policy spreads replicas evenly across the cluster, which makes it easy to balance load when a component fails. However, a write under this policy has to transfer blocks to multiple racks, which increases the cost of writes.
For the common case where the replication factor is 3, HDFS's placement policy is to put one replica on a node in the local rack, another on a different node in the same rack, and the last one on a node in a different rack. This policy cuts the data transfer between racks, which generally improves write efficiency. Since the chance of a rack failure is far smaller than that of a node failure, this policy does not compromise data reliability and availability. At the same time, because the blocks are placed on only two racks rather than three, the policy reduces the aggregate network bandwidth used when reading data. Under this policy, replicas are not evenly distributed across the racks: one third of the replicas are on one node, two thirds of the replicas are on one rack, and the remaining third are evenly distributed across the other racks. This policy improves write performance without compromising data reliability or read performance.
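The policy itself lives inside the Namenode, but its shape can be sketched. The Java below is purely illustrative: Node, ClusterMap, and the pick... helpers are hypothetical stand-ins, not actual Hadoop classes.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch of the replication-factor-3 placement rule described above.
    public class ReplicaPlacementSketch {
        static class Node { String host; String rackId; }

        interface ClusterMap {
            Node localNode();                                          // node of the writing client
            Node pickDifferentNodeOnRack(String rackId, List<Node> exclude);
            Node pickNodeOnDifferentRack(String rackId, List<Node> exclude);
        }

        // First replica on the local node, second on another node in the same rack,
        // third on a node in a different rack.
        static List<Node> chooseTargets(ClusterMap cluster) {
            List<Node> targets = new ArrayList<Node>();
            Node first = cluster.localNode();
            targets.add(first);
            targets.add(cluster.pickDifferentNodeOnRack(first.rackId, targets));
            targets.add(cluster.pickNodeOnDifferentRack(first.rackId, targets));
            return targets;
        }
    }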
The default replica placement policy described here is still a work in progress.
Replica Selection
To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from the replica that is closest to the reader. If a replica exists on the same rack as the reader, that replica is preferred. If an HDFS cluster spans multiple data centers, a replica in the local data center is preferred over any remote replica.
Safe Mode
On startup, the Namenode enters a special state called safe mode. Replication of data blocks does not occur while the Namenode is in safe mode. The Namenode receives heartbeat and block status report messages from the Datanodes; a block report contains the list of data blocks that a Datanode is storing. Every block has a specified minimum number of replicas. A block is considered safely replicated when the Namenode has detected that this minimum number of replicas has been reached. After a configurable percentage of the data blocks have been detected as safe by the Namenode (plus an additional 30 seconds of waiting), the Namenode exits safe mode. It then determines which blocks, if any, still have fewer than the specified number of replicas and replicates them to other Datanodes.
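The exit condition can be expressed as a small predicate. The sketch below is illustrative only; the names and parameters are invented to mirror the description above and are not taken from the Namenode source.

    // Illustrative sketch of the safe mode exit test described above.
    class SafeModeSketch {
        // safeBlocks: blocks whose replica count has reached the configured minimum
        // totalBlocks: all blocks known to the Namenode
        // threshold: configurable fraction of blocks that must be safely replicated
        // extensionMillis: extra wait after the threshold is reached (30 seconds in the text)
        static boolean canLeaveSafeMode(long safeBlocks, long totalBlocks, double threshold,
                                        long millisSinceThresholdReached, long extensionMillis) {
            if (totalBlocks == 0) {
                return true;                       // nothing to wait for on an empty namespace
            }
            boolean thresholdReached = (double) safeBlocks / totalBlocks >= threshold;
            return thresholdReached && millisSinceThresholdReached >= extensionMillis;
        }
    }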
The Persistence of File System Metadata
The HDFS namespace is stored by the Namenode. The Namenode uses a transaction log called the Editlog to persistently record every change to file system metadata. For example, creating a file in HDFS causes the Namenode to insert a record into the Editlog; likewise, changing the replication factor of a file inserts another record. The Namenode keeps the Editlog in its local operating system's file system. The entire file system namespace, including the mapping of blocks to files and the file system properties, is stored in a file called the Fsimage, which also resides in the Namenode's local file system.
The Namenode keeps an image of the entire file system namespace and the file-to-block map (Blockmap) in memory. This key metadata structure is designed to be compact, so a Namenode with 4 GB of RAM is sufficient to support a huge number of files and directories. When the Namenode starts up, it reads the Fsimage and Editlog from disk, applies all the Editlog transactions to the in-memory Fsimage, flushes this new version of the Fsimage to disk, and then truncates the old Editlog, because its transactions have already been applied to the persistent Fsimage. This process is called a checkpoint. In the current implementation, a checkpoint occurs only when the Namenode starts up; periodic checkpointing is planned for the near future.
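The checkpoint sequence can be summarized in a few lines. The sketch below is purely illustrative; FsImage, EditLog, and Transaction are placeholders, not the Namenode's real internal types.

    // Illustrative checkpoint sketch: replay the Editlog onto the in-memory Fsimage,
    // persist the result, then truncate the old Editlog.
    class CheckpointSketch {
        interface Transaction {}
        interface FsImage { void loadFromDisk(); void apply(Transaction t); void saveToDisk(); }
        interface EditLog { Iterable<Transaction> transactions(); void truncate(); }

        void checkpoint(FsImage image, EditLog editLog) {
            image.loadFromDisk();                      // read the on-disk Fsimage into memory
            for (Transaction t : editLog.transactions()) {
                image.apply(t);                        // apply every logged transaction
            }
            image.saveToDisk();                        // flush the new Fsimage to disk
            editLog.truncate();                        // old transactions now live in the Fsimage
        }
    }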
A Datanode stores HDFS data as files in its local file system and has no knowledge of HDFS files. It stores each block of HDFS data in a separate file in its local file system. The Datanode does not create all its files in the same directory; instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories as appropriate. Creating all local files in the same directory is not optimal because the local file system may not efficiently support a huge number of files in a single directory. When a Datanode starts up, it scans its local file system, generates a list of all HDFS data blocks that correspond to those local files, and sends this report to the Namenode: this is the block report.
Communication Protocol
All HDFS communication protocols are layered on top of the TCP/IP protocol. A client connects to a configurable TCP port on the Namenode and talks to it using the ClientProtocol. The Datanodes talk to the Namenode using the DatanodeProtocol. A Remote Procedure Call (RPC) abstraction wraps both the ClientProtocol and the DatanodeProtocol. By design, the Namenode never initiates an RPC; it only responds to RPC requests issued by clients or Datanodes.
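The division of labor can be pictured with a toy pair of interfaces. The method names below are invented for illustration; they are not the actual ClientProtocol or DatanodeProtocol signatures.

    // Toy illustration of the RPC pattern; the real protocols are much richer.
    interface ToyClientProtocol {
        String[] addBlock(String path);                 // client asks the Namenode where to write the next block
        String[] getBlockLocations(String path);        // client asks which Datanodes hold a file's blocks
    }

    interface ToyDatanodeProtocol {
        void sendHeartbeat(String datanodeId);          // Datanode reports that it is alive
        void blockReport(String datanodeId, long[] blockIds); // Datanode reports the blocks it stores
    }
    // The Namenode only implements such interfaces and answers incoming calls;
    // by design it never initiates an RPC to a client or a Datanode.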
Robustness
The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are Namenode failures, Datanode failures, and network partitions.
Disk Data Errors, Heartbeats, and Re-Replication
Each Datanode sends a heartbeat message to the Namenode periodically. A network partition can cause a subset of Datanodes to lose connectivity with the Namenode. The Namenode detects this condition by the absence of heartbeats, marks Datanodes without recent heartbeats as dead, and does not forward any new I/O requests to them. Any data stored on a dead Datanode is no longer available. The death of Datanodes may cause the replication factor of some blocks to fall below their specified value; the Namenode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. Re-replication may be needed for several reasons: a Datanode becomes unavailable, a replica becomes corrupted, a hard disk on a Datanode fails, or the replication factor of a file is increased.
Cluster Rebalancing
The HDFS architecture is compatible with data rebalancing schemes. Such a scheme might automatically move data from one Datanode to another if the free space on a Datanode falls below a certain threshold. In the event of a sudden surge in demand for a particular file, a scheme might also dynamically create additional replicas of that file and rebalance other data in the cluster. These types of rebalancing schemes are not yet implemented.
Data Integrity
A block of data fetched from a Datanode may arrive corrupted because of faults in the storage device, network errors, or buggy software. The HDFS client software performs checksum validation on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents, it verifies that the data received from each Datanode matches the checksum stored in the corresponding checksum file; if not, the client can opt to retrieve that block from another Datanode that holds a replica.
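The checksum idea can be illustrated with plain Java. The CRC32 class below merely stands in for whatever checksum HDFS actually uses, and fetchBlockFromAnotherDatanode is a hypothetical placeholder.

    import java.util.zip.CRC32;

    // Illustrative only: compute a checksum per block when writing, verify on read,
    // and fall back to another replica on mismatch.
    class ChecksumSketch {
        static long checksumOf(byte[] block) {
            CRC32 crc = new CRC32();
            crc.update(block, 0, block.length);
            return crc.getValue();
        }

        static byte[] readVerified(byte[] blockFromDatanode, long expectedChecksum) {
            if (checksumOf(blockFromDatanode) == expectedChecksum) {
                return blockFromDatanode;              // content matches the stored checksum
            }
            return fetchBlockFromAnotherDatanode();    // corrupt replica: try a different Datanode
        }

        static byte[] fetchBlockFromAnotherDatanode() {
            throw new UnsupportedOperationException("placeholder for retrying another replica");
        }
    }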
Metadata Disk Error
The Fsimage and the Editlog are central data structures of HDFS. A corruption of these files can render the entire HDFS instance non-functional. For this reason, the Namenode can be configured to maintain multiple copies of the Fsimage and Editlog. Any update to either the Fsimage or the Editlog is synchronously applied to each of its copies. This synchronous updating of multiple copies may reduce the rate of namespace transactions per second that the Namenode can support. However, this degradation is acceptable, because even though HDFS applications are very data-intensive, they are not metadata-intensive. When the Namenode restarts, it selects the latest consistent Fsimage and Editlog to use.
The Namenode is a single point of failure for an HDFS cluster. If the Namenode machine fails, manual intervention is necessary. Automatic restart and failover of the Namenode to another machine are not yet implemented.
Snapshot
Snapshots support storing a copy of data at a particular instant of time. One use of the snapshot feature is to roll back a corrupted HDFS instance to a previously known good point in time. HDFS does not currently support snapshots, but they are planned for a future release.
Data Blocks
HDFS is designed to support very large files, and the applications that use HDFS are those that deal with large data sets. These applications write their data only once but read it one or more times, and they need these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB; thus, an HDFS file is chopped up into 64 MB blocks, and each block resides on a different Datanode whenever possible.
Staging
A client request to create a file does not reach the Namenode immediately. In fact, the HDFS client first caches the file data in a temporary local file, and application writes are transparently redirected to this temporary file. When the local file accumulates data worth at least one block, the client contacts the Namenode. The Namenode inserts the file name into the file system hierarchy and allocates a data block for it, then replies to the client with the identity of the target Datanode and the destination block. The client then flushes the data from the local temporary file to the specified Datanode. When the file is closed, the remaining un-flushed data in the temporary file is transferred to the Datanode, and the client tells the Namenode that the file is closed. At this point, the Namenode commits the file creation operation into persistent storage. If the Namenode dies before the file is closed, the file is lost.
The above approach was adopted after careful consideration of the target applications that run on HDFS. These applications need streaming writes to files. Without client-side caching, network speed and network congestion would have a considerable impact on throughput. This approach is not without precedent; earlier file systems, such as AFS, used client-side caching to improve performance. A POSIX requirement has been relaxed to achieve higher data upload performance.
Pipeline Replication
When a client is writing data to an HDFS file, the data is first written to a local temporary file, as explained above. Suppose the replication factor of the file is 3. When the local file accumulates a full block of data, the client retrieves a list of Datanodes from the Namenode that will host the replicas of that block. The client then begins transferring the block to the first Datanode. That Datanode receives the data in small portions (4 KB), writes each portion to its local repository, and at the same time transfers that portion to the second Datanode in the list. The second Datanode, in turn, receives each small portion, writes it to its repository, and flushes it to the third Datanode. Finally, the third Datanode writes the data to its local repository. A Datanode can thus receive data from the previous node in the pipeline while forwarding it to the next node at the same time, so the data is pipelined from one Datanode to the next.
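A stripped-down sketch of the forwarding loop run by each Datanode in the pipeline; the streams stand in for the connection from the previous node, the local repository, and the connection to the next node (all names here are illustrative).

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    // Illustrative pipeline step: receive data in 4 KB portions, write each portion
    // to local storage, and forward it to the next Datanode at the same time.
    class PipelineSketch {
        static void relay(InputStream fromPrevious, OutputStream localStorage,
                          OutputStream toNext) throws IOException {
            byte[] portion = new byte[4 * 1024];
            int n;
            while ((n = fromPrevious.read(portion)) != -1) {
                localStorage.write(portion, 0, n);   // persist locally
                if (toNext != null) {
                    toNext.write(portion, 0, n);     // forward to the next Datanode (last node has none)
                }
            }
            localStorage.flush();
            if (toNext != null) {
                toNext.flush();
            }
        }
    }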
Accessibility
HDFS can be accessed by applications in many different ways. Natively, a Java API is available for applications to use; a C language wrapper for this Java API is also provided. In addition, the files in HDFS can be browsed with an ordinary web browser. Access to HDFS through the WebDAV protocol is under development.
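For the native Java API route, reading a file looks roughly like the sketch below; the path is hypothetical and the cluster address is assumed to come from the standard configuration files.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadFromHdfs {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration()); // connects to the configured Namenode
            FSDataInputStream in = fs.open(new Path("/foodir/myfile.txt")); // hypothetical file
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                System.out.write(buffer, 0, n);       // copy the file's contents to stdout
            }
            System.out.flush();
            in.close();
            fs.close();
        }
    }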
DFSShell
HDFS organizes user data in files and directories. It provides a command-line interface called DFSShell that lets users interact with the data in HDFS. The syntax of these commands is similar to other shells (e.g. bash, csh) that users are already familiar with. Here are some sample action/command pairs:
Action: Create a directory named /foodir
Command: bin/hadoop dfs -mkdir /foodir

Action: View the contents of a file named /foodir/myfile.txt
Command: bin/hadoop dfs -cat /foodir/myfile.txt
DFSShell is targeted at applications that need a scripting language to interact with the data stored in the file system.
DFSAdmin
The DFSAdmin command set is used for administering an HDFS cluster. These commands are used only by an HDFS administrator. Here are some sample action/command pairs:
Action: Put the cluster in safe mode
Command: bin/hadoop dfsadmin -safemode enter

Action: Generate a list of Datanodes
Command: bin/hadoop dfsadmin -report

Action: Decommission Datanode datanodename
Command: bin/hadoop dfsadmin -decommission datanodename

Browser Interface
A typical HDFS installation configures a web server on a configurable TCP port to expose the HDFS namespace. This lets users browse the HDFS namespace and view the contents of files using a web browser.
Space Reclamation
File Deletes and Undeletes
When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames (moves) it into the /trash directory. The file can be restored quickly as long as it remains in /trash. The length of time a file stays in /trash is configurable; when that time expires, the Namenode deletes the file from the HDFS namespace. Deleting the file causes the blocks associated with it to be freed. Note that there can be an appreciable time delay between a user deleting a file and the corresponding increase in free space in HDFS.
A user can undelete a file as long as it remains in the /trash directory. To recover a deleted file, the user can browse the /trash directory and retrieve it. The /trash directory contains only the latest copy of the deleted file. The /trash directory is just like any other directory, with one special feature: HDFS applies a policy to automatically delete files from it. The current default policy is to delete files from /trash after they have been retained for more than six hours. In the future, this policy will be configurable through a well-defined interface.
Decrease Replication Factor
When the replication factor of a file is reduced, the Namenode selects excess replicas that can be deleted. This information is passed to the Datanodes at the next heartbeat. Each Datanode then removes the corresponding blocks, and the corresponding free space appears in the cluster. Once again, there may be a delay between the completion of the setReplication API call and the appearance of free space in the cluster.
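As a hedged illustration of the setReplication call mentioned above (the path and the new value are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DecreaseReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Lower the replication factor of an existing file to 2; excess replicas
            // are removed lazily, so free space shows up only after later heartbeats.
            fs.setReplication(new Path("/foodir/myfile.txt"), (short) 2);
            fs.close();
        }
    }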
Reference
HDFS Java API: http://hadoop.apache.org/core/docs/current/api/
HDFS source code: http://hadoop.apache.org/core/version_control.html
by Dhruba Borthakur