Key Points and Architecture of the Hadoop Distributed File System (HDFS) Design


Hadoop introduction: Hadoop is a distributed system infrastructure developed by the Apache Foundation. It lets you develop distributed programs without understanding the underlying distributed details, making full use of the power of clusters for high-speed computing and storage. Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault tolerant and designed to be deployed on low-cost hardware. It provides high-throughput access to application data and suits applications with large data sets. HDFS relaxes a few POSIX requirements so that applications can access file system data as a stream.

Hadoop Website: http://hadoop.apache.org/

Hadoop Chinese documentation: http://hadoop.apache.org/common/docs/r0.21.0/cn/

I. Prerequisites and Design Objectives

1. Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds of servers, and any component can fail; therefore, error detection and fast, automatic recovery are core architectural goals of HDFS.

2. Applications running on HDFS differ from typical general-purpose applications: they mostly perform streaming reads for batch processing. High throughput of data access matters more to them than low latency.

3. HDFS is tuned to support large data sets. A typical file stored in HDFS is gigabytes to terabytes in size, and a single HDFS instance should be able to support tens of millions of files.

4. HDFS applications need a write-once-read-many access model for files. Once a file is created, written, and closed, it does not need to change. This assumption simplifies data consistency and makes high-throughput data access possible. A typical MapReduce job or web crawler application fits this model well.

5. Moving computation is cheaper than moving data. A computation requested by an application is more efficient the closer it executes to the data it operates on, especially when the data set is massive. Moving the computation near the data is clearly better than moving the data to the application, and HDFS provides interfaces for applications to do exactly that.

6. Portability across heterogeneous hardware and software platforms.

II. Namenode and Datanode

HDFS adopts a master/slave architecture. An HDFS cluster consists of a single namenode and a number of datanodes.
1) The namenode is a central server responsible for managing the file system namespace and regulating client access to files.
2) Datanodes, usually one per node in the cluster, manage the storage attached to the nodes they run on. Internally, a file is split into one or more blocks, which are stored in a set of datanodes. The namenode executes file system namespace operations, such as opening, closing, and renaming files and directories, and determines the mapping of blocks to specific datanodes. Under instruction from the namenode, datanodes create, delete, and replicate blocks.
Both the namenode and the datanode are designed to run on ordinary, inexpensive Linux machines. HDFS is written in Java, so it can be deployed on a wide range of machines. A typical deployment has one machine running a dedicated namenode, while each of the other machines in the cluster runs one datanode instance. The architecture does not preclude running multiple datanodes on one machine, but this is rare.


Having a single namenode greatly simplifies the system architecture. The namenode keeps and manages all HDFS metadata, but user data never flows through it: file data is read from and written to the datanodes directly.

III. The File System Namespace

HDFS supports a traditional hierarchical file organization. As in most existing file systems, a user can create directories and create, delete, move, and rename files. HDFS does not yet support user quotas or access permissions, and it does not support links; however, the architecture does not preclude implementing these features. The namenode maintains the file system namespace, and any change to the namespace or to file attributes is recorded by the namenode. An application can specify the number of replicas of a file to be stored in HDFS; this number is called the replication factor of the file and is also saved by the namenode.
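As a minimal sketch of these namespace operations through the standard org.apache.hadoop.fs.FileSystem client API (the class name and paths below are illustrative, and a reachable cluster configured via core-site.xml is assumed):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NamespaceDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);       // client handle to the namenode

            fs.mkdirs(new Path("/user/demo"));                             // create a directory
            fs.rename(new Path("/user/demo"), new Path("/user/renamed")); // rename it
            fs.delete(new Path("/user/renamed"), true);                   // delete it recursively
            fs.close();
        }
    }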

IV. Data Replication

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks of a file are the same size, except the last one. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file; the replication factor can be specified when the file is created and changed later. Files in HDFS are write-once, with strictly one writer at any time. The namenode manages block replication. It periodically receives a heartbeat and a blockreport from each datanode in the cluster: receipt of a heartbeat implies that the datanode is working properly, and a blockreport contains a list of all blocks on that datanode.
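A hedged sketch of setting both per-file parameters at creation time through the FileSystem API (the path and values are illustrative; this create overload takes the replication factor and block size directly):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CreateDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataOutputStream out = fs.create(
                    new Path("/data/sample.bin"),
                    true,                 // overwrite if the file exists
                    4096,                 // io buffer size in bytes
                    (short) 3,            // replication factor for this file
                    64L * 1024 * 1024);   // block size for this file: 64 MB
            out.write(new byte[]{1, 2, 3});
            out.close();
            fs.close();
        }
    }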

1. Replica placement. The placement of replicas is critical to HDFS reliability and performance. HDFS uses a policy called rack-awareness to improve data reliability, availability, and network bandwidth utilization. The short-term goal of this policy is to validate it in production, learn more about its behavior, and build a foundation for testing and research toward more advanced policies. A large HDFS instance usually runs on a cluster of computers spread across many racks. Two machines on different racks must communicate through switches, so in most cases the bandwidth between two nodes in the same rack is greater than that between machines on different racks.

Through a process called rack awareness, the namenode determines the rack id of each datanode. A simple but non-optimal policy is to place replicas on distinct racks. This prevents data loss when an entire rack fails and allows reads to use the bandwidth of multiple racks. This simple policy spreads replicas evenly across the cluster, which helps with load balancing when a component fails. However, it increases the cost of writes, because every write must transfer blocks to multiple racks.

In most cases, the replication factor is 3. The HDFS placement policy is to put one replica on a node in the local rack, another on a different node in the same rack, and the last on a node in a different rack. Rack failures are far less common than node failures, so this policy does not hurt data reliability and availability. One third of the replicas are on one node and two thirds are on one rack, with the remainder distributed across the remaining racks. This policy improves write performance.
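A much-simplified illustration of that rule in Java (the real namenode placement code also weighs node load, free space, and writer locality; here racks is just a map from rack id to the datanodes on it, and the local rack is assumed to hold at least two nodes):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    class PlacementSketch {
        /** racks maps rack id -> datanodes on that rack; localRack is the writer's rack. */
        static List<String> chooseTargets(Map<String, List<String>> racks, String localRack) {
            List<String> targets = new ArrayList<>();
            List<String> local = racks.get(localRack);
            targets.add(local.get(0));                   // replica 1: a node in the local rack
            targets.add(local.get(1));                   // replica 2: another node, same rack
            for (Map.Entry<String, List<String>> e : racks.entrySet()) {
                if (!e.getKey().equals(localRack)) {     // replica 3: any node on a different rack
                    targets.add(e.getValue().get(0));
                    break;
                }
            }
            return targets;
        }
    }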

2. Replica selection. To reduce global bandwidth consumption and read latency, HDFS tries to let a reader read from the replica closest to it. If there is a replica on the same rack as the reader, that replica is read. If an HDFS cluster spans multiple data centers, the reader first tries a replica in the local data center.

3. SafeMode

When the namenode starts, it enters a special state called safemode. A namenode in this state does not replicate data blocks. The namenode receives heartbeats and blockreports from the datanodes; a blockreport lists all the data blocks held by a datanode. Every block has a specified minimum number of replicas, and a block is considered safely replicated once the namenode has confirmed that this minimum has been reached. When a configurable percentage of blocks has been confirmed safe, the namenode exits safemode; it then determines which blocks still have fewer replicas than specified and replicates those blocks to other datanodes.
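The exit condition reduces to a simple fraction check, sketched below with hypothetical names:

    class SafeModeSketch {
        // Illustrative only: the namenode leaves safemode once a configurable
        // fraction of blocks has reached its minimum replica count.
        static boolean canLeaveSafeMode(long safelyReplicatedBlocks, long totalBlocks,
                                        double threshold) {
            if (totalBlocks == 0) return true;   // empty namespace: nothing to wait for
            return (double) safelyReplicatedBlocks / totalBlocks >= threshold;
        }
    }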

V. Persistence of File System Metadata

The namenode stores HDFS metadata. Any operation that modifies file system metadata is recorded by the namenode in a transaction log called the editlog. For example, creating a file in HDFS causes the namenode to insert a record into the editlog; likewise, changing a file's replication factor inserts a record. The namenode stores the editlog in its local OS file system. The entire file system namespace, including the mapping of blocks to files and file attributes, is stored in a file called fsimage, which also resides in the namenode's local file system.

The namenode keeps an image of the entire file system namespace and the file blockmap in memory. This key metadata is very compact: a namenode with 4 GB of memory is enough to support a huge number of files and directories. When the namenode starts, it reads the editlog and fsimage from disk, applies all the transactions in the editlog to the in-memory fsimage, flushes the new fsimage to disk, and then truncates the old editlog, since its transactions have now been applied to the fsimage. This process is called a checkpoint. In the current implementation, a checkpoint occurs only when the namenode starts; support for periodic checkpoints is planned for the near future.
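A toy illustration of that checkpoint sequence, with the image reduced to a map and the editlog to a list of pending operations (the real fsimage and editlog are binary on-disk structures; only the replay-then-truncate ordering is the point here):

    import java.util.List;
    import java.util.Map;

    class CheckpointSketch {
        static Map<String, String> checkpoint(Map<String, String> fsimage,
                                              List<String[]> editlog) {
            for (String[] edit : editlog) {    // replay each logged transaction
                fsimage.put(edit[0], edit[1]); // e.g. path -> attributes
            }
            editlog.clear();                   // truncate: edits are now in the image
            return fsimage;                    // flushing the new image to disk would follow
        }
    }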

A datanode knows nothing about HDFS files; it simply stores each block of HDFS data in a separate file in its local file system. The datanode does not create all files in the same directory: it uses a heuristic to determine the optimal number of files per directory and creates subdirectories as appropriate. Creating all local files in one directory is not optimal, because the local file system may not efficiently support huge numbers of files in a single directory. When a datanode starts, it scans its local file system, generates a list of all HDFS data blocks corresponding to those local files, and sends the report to the namenode: this is the blockreport.
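A sketch of such a startup scan (HDFS block files conventionally carry a blk_ prefix; the directory layout here is illustrative):

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    class BlockReportSketch {
        // Walk the datanode's local storage tree and collect block file names
        // to assemble a blockreport.
        static List<String> scan(File dir) {
            List<String> blocks = new ArrayList<>();
            File[] entries = dir.listFiles();
            if (entries == null) return blocks;
            for (File f : entries) {
                if (f.isDirectory()) blocks.addAll(scan(f));  // recurse into subdirectories
                else if (f.getName().startsWith("blk_")) blocks.add(f.getName());
            }
            return blocks;
        }
    }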

VI. Communication Protocols

All HDFS communication protocols are layered on top of TCP/IP. A client connects to the namenode through a configurable port and talks to it using the ClientProtocol; datanodes talk to the namenode using the DatanodeProtocol. A Remote Procedure Call (RPC) abstraction wraps both the ClientProtocol and the DatanodeProtocol. By design, the namenode never initiates an RPC; it only responds to RPC requests issued by clients and datanodes.
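To make the division of labor concrete, here is an illustrative shape of the two protocols as Java interfaces; the method names are simplified stand-ins, not the exact Hadoop signatures:

    interface ClientProtocolSketch {
        String[] getBlockLocations(String path);   // client asks where a file's blocks live
        void create(String path, short replication);
    }

    interface DatanodeProtocolSketch {
        void sendHeartbeat(String datanodeId);     // periodic liveness signal
        void blockReport(String datanodeId, long[] blockIds);
    }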

VII. Robustness

The primary goal of HDFS is to store data reliably even in the presence of failures. The three most common types of failure are namenode failures, datanode failures, and network partitions.

1. Hard disk data errors, heartbeat detection, and re-replication

Each datanode sends a heartbeat to the namenode periodically. A network partition can cause some datanodes to lose contact with the namenode. The namenode detects this through the absence of heartbeats, marks those datanodes as dead, and stops sending new IO requests to them. Any data stored on a dead datanode is no longer available, and the death of a datanode may cause the replica count of some blocks to fall below their specified value. The namenode constantly tracks which blocks need to be replicated and starts replication whenever necessary. Re-replication may be needed when a datanode fails, a replica is corrupted, a disk on a datanode fails, or the replication factor of a file is increased.
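The detection rule amounts to a timeout over last-seen heartbeat times, sketched below (the 10-minute window is an assumption for illustration, not the configured Hadoop default):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    class HeartbeatMonitorSketch {
        static final long TIMEOUT_MS = 10 * 60 * 1000;   // assumed 10-minute window

        // lastHeartbeat maps datanode id -> last heartbeat time in millis.
        static List<String> findDeadNodes(Map<String, Long> lastHeartbeat, long now) {
            List<String> dead = new ArrayList<>();
            for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
                if (now - e.getValue() > TIMEOUT_MS) dead.add(e.getKey()); // stop routing IO here
            }
            return dead;
        }
    }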

2. Cluster balancing

The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one datanode to another if the free space on a datanode falls below a certain threshold. If demand for a particular file suddenly grows, a scheme might dynamically create additional replicas of the file and distribute them across the cluster to satisfy the application. These rebalancing schemes are not yet implemented.

3. Data Integrity

A block fetched from a datanode may arrive corrupted because of storage device faults, network faults, or buggy software on the datanode. The HDFS client software implements checksum verification of HDFS file contents. When a client creates an HDFS file, it computes a checksum for each block of the file and stores the checksums in a separate hidden file in the same HDFS namespace. When the client retrieves file contents, it verifies that the data received from each datanode matches the checksum in the corresponding checksum file; if not, the client can choose to fetch that block from another datanode that holds a replica.
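A minimal sketch of the per-block checksum idea, using java.util.zip.CRC32 (HDFS's actual on-disk checksum format and chunk granularity differ):

    import java.util.zip.CRC32;

    class ChecksumSketch {
        static long blockChecksum(byte[] blockData) {
            CRC32 crc = new CRC32();
            crc.update(blockData, 0, blockData.length);
            return crc.getValue();              // stored when writing, compared when reading
        }

        static boolean verify(byte[] blockData, long storedChecksum) {
            return blockChecksum(blockData) == storedChecksum;  // mismatch => try another replica
        }
    }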

4. Metadata disk error

The fsimage and the editlog are the central data structures of HDFS; if they are corrupted, the whole HDFS instance becomes unusable. For this reason, the namenode can be configured to maintain multiple copies of the fsimage and editlog. Any update to either is synchronously applied to every copy. This synchronization may reduce the number of namespace transactions the namenode can process per second, but the cost is acceptable because HDFS applications are data-intensive, not metadata-intensive. When the namenode restarts, it selects the most recent consistent fsimage and editlog.

The namenode is a single point of failure in HDFS. If the machine hosting the namenode fails, manual intervention is required; automatic restart or failover of the namenode to another machine has not yet been implemented.

5. Snapshots

Snapshots support storing a copy of the data at a particular instant in time, so that a corrupted HDFS instance can be rolled back to a known good point in time. HDFS does not currently support snapshots.

VIII. Data Organization

1. Data blocks

Applications written for HDFS deal with large data sets: they write data once, read it one or more times, and require reads at streaming speed. HDFS supports write-once-read-many semantics for files. A typical block size used by HDFS is 64 MB, so HDFS files are split into 64 MB chunks, with each chunk residing on a different datanode where possible.
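The chunking itself is simple ceiling arithmetic, sketched below:

    class BlockMathSketch {
        static final long BLOCK_SIZE = 64L * 1024 * 1024;   // 64 MB

        // Every block is full size except possibly the last one.
        static long blockCount(long fileSize) {
            return (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE;   // ceiling division
        }
        // e.g. a 200 MB file occupies 4 blocks: 64 + 64 + 64 + 8 MB
    }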

2. Staging

A client request to create a file does not reach the namenode immediately. Instead, the HDFS client caches the file data in a temporary file on the local machine, and application writes are transparently redirected to this temporary file. When the temporary file accumulates more data than one block size (64 MB by default), the client contacts the namenode. The namenode inserts the file name into the file system hierarchy and allocates a data block for it, then replies to the client with the identity of the target datanode and the destination data block. The client flushes the block of data from the local temporary file to the specified datanode. When the file is closed, any remaining unflushed data in the temporary file is transferred to the datanode, and the client tells the namenode that the file is closed. At that point, the namenode commits the file creation operation to persistent storage. If the namenode dies before the file is closed, the file is lost.
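A toy version of that client-side staging logic (the namenode interaction is elided; only the buffer-until-a-block-is-full behavior is shown):

    import java.io.ByteArrayOutputStream;

    class StagingSketch {
        static final int BLOCK_SIZE = 64 * 1024 * 1024;   // 64 MB default block size
        private final ByteArrayOutputStream staged = new ByteArrayOutputStream();

        void write(byte[] data) {
            staged.write(data, 0, data.length);    // writes accumulate locally
            if (staged.size() >= BLOCK_SIZE) {
                // contact the namenode for a block allocation, then flush the
                // staged data to the returned datanode (omitted in this sketch)
                staged.reset();
            }
        }
    }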

The above approach was adopted after careful consideration of the target applications that run on HDFS. Without client-side caching, network speed and network congestion would have a considerable impact on throughput.

3. Pipeline Replication

When a client writes data to an HDFS file, the data is first written to a local temporary file, as described above. Suppose the file's replication factor is 3. The client obtains from the namenode a list of datanodes that will hold the replicas. It then begins transferring data to the first datanode, which receives the data in small portions (4 KB), writes each portion to its local repository, and at the same time forwards that portion to the second datanode. The second datanode likewise receives each small portion, stores it in its local repository, and passes it on to the third datanode, which simply receives and stores the data. The data is thus pipelined from one datanode to the next: this is pipeline replication.
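A sketch of one node's role in that pipeline: each 4 KB portion is stored locally and forwarded downstream in the same loop (stream setup and acknowledgments are omitted):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    class PipelineSketch {
        static void relay(InputStream fromUpstream, OutputStream localStore,
                          OutputStream toDownstream) throws IOException {
            byte[] portion = new byte[4096];               // data moves in 4 KB portions
            int n;
            while ((n = fromUpstream.read(portion)) != -1) {
                localStore.write(portion, 0, n);           // store locally
                if (toDownstream != null) {
                    toDownstream.write(portion, 0, n);     // forward to the next datanode
                }
            }
        }
    }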

IX. Accessibility

HDFS can be accessed by applications in many ways. Users can interact with HDFS data from the command line with DFSShell, call the Java API, or use the C-language wrapper around that API; a browser interface is also provided, and a WebDAV access method is under development.
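For example, typical DFSShell invocations look like the following (paths are illustrative):

    bin/hadoop fs -mkdir /foodir             # create a directory
    bin/hadoop fs -put localfile.txt /foodir # copy a local file into HDFS
    bin/hadoop fs -cat /foodir/localfile.txt # print a file's contents
    bin/hadoop fs -ls /                      # list the root of the namespace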

X. Space Reclamation

1. File Deletion and Restoration

When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS renames it and moves it into the /trash directory. As long as the file remains in /trash, it can be quickly restored. The length of time a file stays in /trash is configurable; when that time expires, the namenode deletes the file from the namespace. Deleting the file releases the data blocks associated with it. Note that there can be an appreciable delay between a user deleting a file and the corresponding increase in free space in HDFS.

As long as a deleted file remains in /trash, a user who wants it back can browse the /trash directory and retrieve it. The /trash directory contains only the latest copy of each deleted file. /trash is just like any other directory, with one special feature: HDFS applies a policy that automatically deletes files from it. The current default policy is to delete files from /trash that are more than six hours old; in the future, this policy will be configurable through a well-defined interface.

2. Reduction of replication factor

When the replication factor of a file is reduced, the namenode selects excess replicas to delete. It passes this information to the datanodes at the next heartbeat, and each datanode removes the corresponding blocks, freeing the space. Here too, there can be a delay between the setreplication call and the appearance of free space in the cluster.
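The setreplication call mentioned above is exposed through the same FileSystem client API; a minimal sketch (the path is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            fs.setReplication(new Path("/foodir/localfile.txt"), (short) 2); // lower from 3 to 2
            fs.close();
        }
    }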

References:

HDFS Java API: http://hadoop.apache.org/core/docs/current/api/

HDFS source code: http://hadoop.apache.org/core/version_control.html

Original article: http://hadoop.apache.org/core/docs/current/hdfs_design.html

Chinese translation: http://hadoop.apache.org/common/docs/r0.21.0/cn/hdfs_design.html
