Hadoop Distributed File System: Architecture and Design

Contents
  • Introduction
  • Prerequisites and Design Goals
    • Hardware Failure
    • Streaming Data Access
    • Large-Scale Data Sets
    • Simple Consistency Model
    • "Moving Computation Is Cheaper than Moving Data"
    • Portability Between Heterogeneous Software and Hardware Platforms
  • Namenode and Datanode
  • The File System Namespace
  • Data Replication
    • Replica Placement: The First Step
    • Replica Selection
    • Safe Mode
  • Persistence of File System Metadata
  • Communication Protocols
  • Robustness
    • Disk Data Errors, Heartbeats, and Re-Replication
    • Cluster Rebalancing
    • Data Integrity
    • Metadata Disk Failure
    • Snapshots
  • Data Organization
    • Data Blocks
    • Staging
    • Pipeline Replication
  • Accessibility
    • DFSShell
    • DFSAdmin
    • Browser Interface
  • Space Reclamation
    • File Deletion and Recovery
    • Decreasing the Replication Factor
  • References


Introduction

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has much in common with existing distributed file systems, but the differences are also significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost machines. It provides high-throughput access to application data and is suitable for applications with large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project and is now part of the Apache Hadoop Core project. The project address is http://hadoop.apache.org/core/.


Prerequisites and Design Goals
Hardware Failure

Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of servers, each storing part of the file system's data. The fact that there are a huge number of components, and that any component may fail, means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.


Streaming Data Access

Applications that run on HDFS differ from general-purpose applications in that they need streaming access to their data sets. HDFS is designed more for batch processing than for interactive use. The emphasis is on high throughput of data access rather than low latency. Many of the hard constraints imposed by POSIX are not needed by applications targeted at HDFS, so POSIX semantics have been relaxed in a few key areas to increase data throughput.


Large-Scale Data Sets

Applications that run on HDFS have large data sets. A typical HDFS file is gigabytes to terabytes in size, so HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. A single HDFS instance should support tens of millions of files.


Simple Consistency Model

HDFS applications need a write-once-read-many access model for files. A file, once created, written, and closed, need not be changed. This assumption simplifies data consistency and makes high-throughput data access possible. MapReduce applications and web crawler applications fit this model well. There is a plan to extend the model in the future to support appending writes to files.


"Mobile computing is more cost-effective than mobile data"

A computation requested by an application is much more efficient if it is executed near the data it operates on. This reduces network congestion and increases the overall throughput of the system. Moving the computation closer to the data is clearly better than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to the data.


Portability Between Heterogeneous Software and Hardware Platforms

HDFS has been designed to be easily portable from one platform to another. This facilitates the adoption of HDFS as a platform for large-scale data applications.


Namenode and Datanode

HDFS has a master/slave architecture. An HDFS cluster consists of a single Namenode and a number of Datanodes. The Namenode is a central server that manages the file system namespace and regulates clients' access to files. The Datanodes, usually one per node in the cluster, manage the storage attached to the nodes they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more data blocks, and these blocks are stored on a set of Datanodes. The Namenode executes file system namespace operations such as opening, closing, and renaming files and directories. It also determines the mapping of data blocks to specific Datanodes. The Datanodes are responsible for serving read and write requests from the file system's clients, and they create, delete, and replicate data blocks under instruction from the Namenode.

The Namenode and Datanode are designed to run on commodity machines, which typically run a GNU/Linux operating system (OS). HDFS is built using the Java language, so a Namenode or Datanode can be deployed on any machine that supports Java; the highly portable Java language means that HDFS can run on a wide range of machines. A typical deployment has a dedicated machine that runs only the Namenode, while each of the other machines in the cluster runs one Datanode instance. The architecture does not preclude running multiple Datanodes on the same machine, but this is rare in practice.

The existence of a single Namenode in a cluster greatly simplifies the architecture of the system. The Namenode is the arbitrator and repository for all HDFS metadata, and the system is designed so that user data never flows through the Namenode.


The File System Namespace

HDFS supports a traditional hierarchical file organization. A user or application can create directories and store files inside these directories. The file system namespace hierarchy is similar to that of most existing file systems: one can create, delete, move, or rename files. HDFS does not currently support disk quotas or access permissions, nor does it support hard links or soft links. However, the HDFS architecture does not preclude implementing these features.

The Namenode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the Namenode. An application can specify the number of replicas of a file that HDFS should maintain; this number is called the replication factor of the file and is also stored by the Namenode.
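
As a concrete illustration, an application can change a file's replication factor through the Hadoop Java API. This is only a minimal sketch: the file path is a placeholder, and the cluster address comes from whatever configuration files happen to be on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();        // reads core-site.xml / hdfs-site.xml if present
            FileSystem fs = FileSystem.get(conf);            // connects to the configured Namenode
            Path file = new Path("/foodir/myfile.txt");      // placeholder path
            boolean ok = fs.setReplication(file, (short) 4); // ask the Namenode to keep 4 replicas of each block
            System.out.println("Replication change accepted: " + ok);
            fs.close();
        }
    }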


Data Replication

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of data blocks; all blocks in a file except the last one are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file; the replication factor can be specified at file creation time and changed later. Files in HDFS are write-once, and there is strictly one writer at any time.

The Namenode manages the replication of data blocks. It periodically receives a heartbeat and a block status report (Blockreport) from each Datanode in the cluster. Receipt of a heartbeat implies that the Datanode is functioning properly; a Blockreport contains a list of all the data blocks on that Datanode.


Replica Placement: The First Step

The placement of replicas is critical to HDFS reliability and performance. An optimized replica placement policy is an important feature that distinguishes HDFS from most other distributed file systems, and it requires a great deal of tuning and experience. HDFS uses a rack-aware replica placement policy to improve data reliability, availability, and network bandwidth utilization. The current implementation of this policy is only a first step in that direction. Its short-term goals are to validate it in production environments, learn more about its behavior, and build a foundation for testing and research into more sophisticated policies.

Large HDFS instances typically run on clusters of computers spread across many racks. Communication between two machines on different racks has to go through switches. In most cases, the network bandwidth between two machines in the same rack is greater than the bandwidth between two machines in different racks.

Through a rack-awareness process, the Namenode determines the rack ID that each Datanode belongs to. A simple but non-optimal policy is to place replicas on distinct racks. This prevents data loss when an entire rack fails and allows reads to take advantage of the bandwidth of multiple racks. This policy distributes replicas evenly across the cluster, which makes it easy to balance load when a component fails. However, it increases the cost of writes, because a write must transfer blocks to multiple racks.

For the common case where the replication factor is three, HDFS places one replica on a node in the local rack, a second replica on a different node in the same rack, and the third replica on a node in a different rack. This policy cuts inter-rack write traffic, which generally improves write performance. The chance of a rack failure is far smaller than that of a node failure, so this policy does not compromise data reliability or availability. At the same time, because data blocks are placed on only two (not three) different racks, the policy reduces the aggregate network bandwidth used when reading data. Under this policy, replicas are not evenly distributed across racks: one third of the replicas are on one node, two thirds of the replicas are on one rack, and the remaining third is distributed across the other racks. This policy improves write performance without compromising data reliability or read performance.

The default replica placement policy described here is still a work in progress.
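
To make the three-replica placement above concrete, here is a plain-Java sketch of the shape of that decision. The Node class, the cluster list, and the method names are invented for illustration and are not the Namenode's actual placement code; the sketch also assumes suitable candidate nodes exist.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;
    import java.util.function.Predicate;

    // Illustrative sketch of the default 3-replica placement described above.
    class ReplicaPlacementSketch {
        static class Node {
            final String name;
            final String rack;
            Node(String name, String rack) { this.name = name; this.rack = rack; }
        }

        private final List<Node> cluster;
        private final Random random = new Random();

        ReplicaPlacementSketch(List<Node> cluster) { this.cluster = cluster; }

        // Choose targets for one block written from 'writer' with replication factor 3.
        List<Node> chooseTargets(Node writer) {
            List<Node> targets = new ArrayList<>();
            targets.add(writer);                                     // replica 1: the local node
            targets.add(pick(n -> n.rack.equals(writer.rack)
                                  && !n.name.equals(writer.name)));  // replica 2: same rack, different node
            targets.add(pick(n -> !n.rack.equals(writer.rack)));     // replica 3: a node on a different rack
            return targets;
        }

        // Picks a random node matching the filter; assumes at least one candidate exists.
        private Node pick(Predicate<Node> filter) {
            List<Node> candidates = new ArrayList<>();
            for (Node n : cluster) if (filter.test(n)) candidates.add(n);
            return candidates.get(random.nextInt(candidates.size()));
        }
    }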


Replica Selection

To minimize overall bandwidth consumption and read latency, HDFS tries to satisfy a read request from the replica closest to the reader. If a replica exists on the same rack as the reader, that replica is preferred. If the HDFS cluster spans multiple data centers, a replica in the local data center is preferred over a remote one.
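
The "closest replica" preference can be pictured as choosing the replica with the smallest distance to the reader. The sketch below is illustrative only; the Location type and the distance function are invented, not the HDFS client's actual code.

    import java.util.List;

    // Illustrative sketch of closest-replica selection.
    class ReplicaSelectionSketch {
        static class Location {
            final String node, rack, dataCenter;
            Location(String node, String rack, String dataCenter) {
                this.node = node; this.rack = rack; this.dataCenter = dataCenter;
            }
        }

        // Smaller is closer: same node < same rack < same data center < remote.
        static int distance(Location reader, Location replica) {
            if (reader.node.equals(replica.node)) return 0;
            if (reader.rack.equals(replica.rack)) return 1;
            if (reader.dataCenter.equals(replica.dataCenter)) return 2;
            return 3;
        }

        static Location chooseReplica(Location reader, List<Location> replicas) {
            Location best = replicas.get(0);
            for (Location r : replicas) {
                if (distance(reader, r) < distance(reader, best)) best = r;
            }
            return best;
        }
    }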


Safe Mode

On startup, the Namenode enters a special state called safe mode. Replication of data blocks does not occur while the Namenode is in safe mode. The Namenode receives heartbeats and Blockreports from the Datanodes; a Blockreport contains the list of data blocks hosted by a Datanode. Each block has a specified minimum number of replicas. A block is considered safely replicated when the Namenode has confirmed that this minimum number of replicas has been reported. Once a configurable percentage of blocks has been confirmed as safely replicated (plus an additional 30 seconds), the Namenode exits safe mode. It then determines which blocks, if any, still have fewer replicas than specified and replicates those blocks to other Datanodes.
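
The exit condition can be pictured as a simple threshold check. The sketch below is a schematic illustration: the field names, the counters, and any particular threshold value are assumptions, not the Namenode's actual configuration or state.

    // Illustrative sketch of the safe-mode exit check described above.
    class SafeModeSketch {
        private final double safeBlockThreshold;   // configurable fraction, e.g. 0.999 (assumed value)
        private final int minReplicas;             // minimum replicas for a block to count as "safe"
        private final long totalBlocks;
        private long safeBlocks;

        SafeModeSketch(double safeBlockThreshold, int minReplicas, long totalBlocks) {
            this.safeBlockThreshold = safeBlockThreshold;
            this.minReplicas = minReplicas;
            this.totalBlocks = totalBlocks;
        }

        // Called once per block as Blockreports arrive and its replica count becomes known.
        void reportBlockReplicaCount(int reportedReplicas) {
            if (reportedReplicas >= minReplicas) safeBlocks++;
        }

        // The Namenode may leave safe mode once enough blocks are safely replicated
        // (it also waits an extra ~30 seconds after this condition first holds).
        boolean canLeaveSafeMode() {
            if (totalBlocks == 0) return true;
            return (double) safeBlocks / totalBlocks >= safeBlockThreshold;
        }
    }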


Persistence of File System Metadata

The HDFS namespace is stored on the Namenode. The Namenode uses a transaction log called the EditLog to persistently record every change to the file system metadata. For example, creating a file in HDFS causes the Namenode to insert a record into the EditLog; changing a file's replication factor likewise inserts a record. The Namenode stores the EditLog in its local operating system's file system. The entire file system namespace, including the mapping of data blocks to files and the file system properties, is stored in a file called the FsImage, which also resides in the Namenode's local file system.

The Namenode keeps an image of the entire file system namespace and the block map (Blockmap) in memory. This key metadata structure is very compact, so a Namenode with 4 GB of RAM is sufficient to support a huge number of files and directories. When the Namenode starts up, it reads the EditLog and FsImage from disk, applies all the transactions from the EditLog to the in-memory FsImage, flushes this new FsImage back to the local disk, and then truncates the old EditLog, because its transactions have already been applied to the FsImage. This process is called a checkpoint. In the current implementation, a checkpoint only occurs when the Namenode starts up; support for periodic checkpoints is planned for the near future.
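
In outline, a checkpoint is a replay-then-rewrite operation. The sketch below is a simplified illustration with invented interface names (FsImageStore, EditLogStore, NamespaceImage); it shows only the ordering of the steps described above, not the Namenode's actual implementation.

    import java.util.List;

    // Simplified illustration of the checkpoint described above.
    class CheckpointSketch {
        interface NamespaceImage {
            void apply(String transaction);        // replay one logged namespace change
        }
        interface FsImageStore {
            NamespaceImage load();                 // read the FsImage from local disk
            void save(NamespaceImage image);       // write a new FsImage to local disk
        }
        interface EditLogStore {
            List<String> readTransactions();       // all EditLog records since the last FsImage
            void truncate();                       // discard records already folded into the FsImage
        }

        // Performed at Namenode startup in the current implementation.
        static NamespaceImage checkpoint(FsImageStore fsImage, EditLogStore editLog) {
            NamespaceImage image = fsImage.load();             // 1. read the old FsImage
            for (String tx : editLog.readTransactions()) {     // 2. replay the EditLog on top of it
                image.apply(tx);
            }
            fsImage.save(image);                               // 3. persist the merged image
            editLog.truncate();                                // 4. the old EditLog is no longer needed
            return image;                                      // in-memory namespace used to serve requests
        }
    }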

A Datanode stores HDFS data as files in its local file system and has no knowledge of HDFS files. It stores each HDFS data block in a separate file in the local file system. The Datanode does not create all files in the same directory; instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories as appropriate. Creating all local files in the same directory is not optimal, because the local file system might not efficiently support a huge number of files in a single directory. When a Datanode starts up, it scans its local file system, generates a list of all the HDFS data blocks that correspond to these local files, and sends this list to the Namenode as a report: this is the Blockreport.


Communication Protocols

All HDFS communication protocols are layered on top of TCP/IP. A client connects to a configurable TCP port on the Namenode and talks to it using the ClientProtocol. Datanodes talk to the Namenode using the DatanodeProtocol. A Remote Procedure Call (RPC) abstraction wraps both the ClientProtocol and the DatanodeProtocol. By design, the Namenode never initiates an RPC; it only responds to RPC requests issued by clients or Datanodes.
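
To give a flavor of this layering, here is a much-simplified illustration of what RPC interfaces of this shape might look like in Java. The interfaces and method signatures below are invented for illustration and are not the real ClientProtocol or DatanodeProtocol, which are richer and different in detail.

    import java.io.IOException;
    import java.util.List;

    // Invented, much-simplified stand-ins for the two protocols described above.
    interface SimplifiedClientProtocol {
        // Ask the Namenode which Datanodes hold the blocks of a file.
        List<String> getBlockLocations(String path, long offset, long length) throws IOException;

        // Ask the Namenode to allocate a new block for a file being written.
        String addBlock(String path) throws IOException;
    }

    interface SimplifiedDatanodeProtocol {
        // Periodic liveness signal from a Datanode; the Namenode only replies, it never calls out.
        void sendHeartbeat(String datanodeId, long capacity, long remaining) throws IOException;

        // Periodic report of every block the Datanode currently stores.
        void blockReport(String datanodeId, List<Long> blockIds) throws IOException;
    }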


Robustness

The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failure are Namenode failures, Datanode failures, and network partitions.


Disk Data Errors, Heartbeats, and Re-Replication

Each Datanode sends a heartbeat message to the Namenode periodically. A network partition can cause a subset of Datanodes to lose contact with the Namenode. The Namenode detects this condition by the absence of heartbeats, marks those Datanodes as dead, and stops forwarding new IO requests to them. Any data registered on a dead Datanode is no longer available to HDFS. The death of a Datanode may cause the replication factor of some blocks to fall below their specified value. The Namenode constantly tracks which blocks need to be re-replicated and initiates replication whenever necessary. Re-replication may be needed in several cases: a Datanode becomes unavailable, a replica becomes corrupted, a disk on a Datanode fails, or the replication factor of a file is increased.
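
The detection step can be pictured as a periodic timeout check over the last heartbeat seen from each Datanode. The sketch below is illustrative only; the ten-minute timeout, the class name, and the data structures are assumptions, not HDFS's actual values or code.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch of heartbeat-based failure detection.
    class HeartbeatMonitorSketch {
        private static final long HEARTBEAT_TIMEOUT_MS = 10 * 60 * 1000; // assumed 10-minute timeout

        private final Map<String, Long> lastHeartbeat = new HashMap<>(); // datanodeId -> last heartbeat time

        // Called whenever a heartbeat RPC arrives from a Datanode.
        void recordHeartbeat(String datanodeId) {
            lastHeartbeat.put(datanodeId, System.currentTimeMillis());
        }

        // Called periodically: any Datanode silent for too long is considered dead,
        // and the blocks it held become candidates for re-replication elsewhere.
        List<String> findDeadDatanodes() {
            long now = System.currentTimeMillis();
            List<String> dead = new ArrayList<>();
            for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
                if (now - e.getValue() > HEARTBEAT_TIMEOUT_MS) {
                    dead.add(e.getKey());
                }
            }
            return dead;
        }
    }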


Cluster Rebalancing

The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one Datanode to another if the free space on the first Datanode falls below a certain threshold. If demand for a particular file suddenly spikes, a scheme might also dynamically create additional replicas of that file and rebalance other data in the cluster. These rebalancing schemes have not yet been implemented.


Data Integrity

A block of data fetched from a Datanode may arrive corrupted, whether because of faults in the storage device, network errors, or buggy software on the Datanode. The HDFS client software implements checksum verification of the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When the client retrieves file contents, it verifies that the data received from each Datanode matches the checksum stored in the corresponding checksum file. If the checksums do not match, the client can retrieve that block from another Datanode that holds a replica.
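
The per-block check itself is a simple checksum comparison. The sketch below uses CRC32 from the Java standard library purely as an illustration; HDFS's actual checksum format and chunking are not shown here.

    import java.util.zip.CRC32;

    // Illustration of block checksum verification using CRC32 from the JDK.
    class BlockChecksumSketch {
        // Computed by the writer when the block is created and stored in a hidden checksum file.
        static long checksumOf(byte[] blockData) {
            CRC32 crc = new CRC32();
            crc.update(blockData, 0, blockData.length);
            return crc.getValue();
        }

        // Performed by the reader: if the check fails, fetch the block from another replica.
        static boolean verify(byte[] blockData, long expectedChecksum) {
            return checksumOf(blockData) == expectedChecksum;
        }

        public static void main(String[] args) {
            byte[] block = "example block contents".getBytes();
            long stored = checksumOf(block);              // written alongside the block
            System.out.println("block ok: " + verify(block, stored));
        }
    }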


Metadata Disk Failure

The FsImage and the EditLog are central data structures of HDFS. Corruption of these files can render the entire HDFS instance non-functional. For this reason, the Namenode can be configured to maintain multiple copies of the FsImage and EditLog. Any update to either file is synchronously applied to all of its copies. This synchronous replication may reduce the number of namespace transactions the Namenode can process per second. However, the cost is acceptable because HDFS applications, while very data-intensive, are not metadata-intensive. When the Namenode restarts, it selects the latest consistent FsImage and EditLog to use.
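
One common way to arrange this is to point the Namenode's metadata storage at more than one local directory, ideally on different physical disks. The sketch below is only an illustration: the directory paths are placeholders, the property is normally set in hdfs-site.xml rather than in code, and the property name (dfs.name.dir in older Hadoop releases) should be checked against the Hadoop version in use.

    import org.apache.hadoop.conf.Configuration;

    // Sketch: configure the Namenode to keep its metadata in two local directories.
    public class MultipleMetadataDirsSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Placeholder paths; each directory receives a synchronized copy of the metadata.
            conf.set("dfs.name.dir", "/disk1/hdfs/name,/disk2/hdfs/name");
            System.out.println("Namenode metadata directories: " + conf.get("dfs.name.dir"));
        }
    }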

The Namenode is currently a single point of failure in an HDFS cluster. If the Namenode machine fails, manual intervention is necessary. Automatic restart and failover of the Namenode to another machine have not yet been implemented.


Snapshots

Snapshots support storing a copy of data as it existed at a particular instant in time. One use of the snapshot feature is to roll back a corrupted HDFS instance to a previously known good point in time. HDFS does not currently support snapshots, but support is planned for a future release.


Data Organization
Data Blocks

HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets: they write their data once but read it one or more times, and they require those reads to proceed at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB, so an HDFS file is chopped up into 64 MB blocks and, where possible, each block resides on a different Datanode.
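
As a small illustration of the arithmetic, a file's block boundaries follow directly from its length and the block size. The 64 MB constant and the method below are just an example, not HDFS code.

    // Illustration only: how a file's length maps onto 64 MB blocks.
    // The last block is usually smaller than the configured block size.
    public class BlockSplitSketch {
        static final long BLOCK_SIZE = 64L * 1024 * 1024;   // 64 MB, the typical default discussed above

        static long blockCount(long fileLength) {
            if (fileLength == 0) return 0;
            return (fileLength + BLOCK_SIZE - 1) / BLOCK_SIZE;   // ceiling division
        }

        public static void main(String[] args) {
            long fileLength = 200L * 1024 * 1024;            // a 200 MB file
            long blocks = blockCount(fileLength);            // 4 blocks: 64 + 64 + 64 + 8 MB
            long lastBlock = fileLength - (blocks - 1) * BLOCK_SIZE;
            System.out.println(blocks + " blocks, last block is " + lastBlock + " bytes");
        }
    }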


Staging

A client request to create a file does not reach the Namenode immediately. Initially, the HDFS client caches the file data in a temporary local file, and application writes are transparently redirected to this temporary file. When the temporary file accumulates more than one block's worth of data, the client contacts the Namenode. The Namenode inserts the file name into the file system hierarchy, allocates a data block for it, and responds to the client with the identity of the target Datanode and the destination data block. The client then flushes the block of data from the local temporary file to the specified Datanode. When the file is closed, the remaining unflushed data in the temporary file is transferred to the Datanode, and the client tells the Namenode that the file is closed. At that point, the Namenode commits the file creation operation to its persistent log. If the Namenode dies before the file is closed, the file is lost.
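
The client-side staging step is essentially "buffer until a block is full, then flush". The sketch below is a schematic illustration with an invented BlockSink interface standing in for "contact the Namenode, get a target Datanode, and ship the block to it"; it is not the DFSClient's real code.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    // Schematic illustration of client-side staging: writes accumulate locally and
    // are flushed one block at a time.
    class StagingSketch {
        interface BlockSink {
            void writeBlock(byte[] blockData) throws IOException;
        }

        private final long blockSize;
        private final BlockSink sink;
        private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

        StagingSketch(long blockSize, BlockSink sink) {
            this.blockSize = blockSize;
            this.sink = sink;
        }

        // Application writes are redirected here instead of going straight to HDFS.
        void write(byte[] data) throws IOException {
            buffer.write(data);
            while (buffer.size() >= blockSize) {
                flushOneBlock();
            }
        }

        // On close, any remaining buffered data is shipped as a final (short) block.
        void close() throws IOException {
            if (buffer.size() > 0) flushOneBlock();
        }

        private void flushOneBlock() throws IOException {
            byte[] all = buffer.toByteArray();
            int len = (int) Math.min(blockSize, all.length);
            byte[] block = new byte[len];
            System.arraycopy(all, 0, block, 0, len);
            sink.writeBlock(block);                       // in HDFS this goes to a chosen Datanode
            buffer.reset();
            buffer.write(all, len, all.length - len);     // keep any bytes beyond the block boundary
        }
    }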

This client-side caching approach was arrived at after careful consideration of the target applications that run on HDFS, which need streaming writes to files. If a client wrote to a remote file directly without any client-side caching, network speed and congestion would significantly reduce throughput. This approach is not without precedent: earlier distributed file systems, such as AFS, used client-side caching to improve performance. A POSIX requirement has been relaxed to achieve higher performance for data uploads.


Pipeline Replication

When a client writes data to an HDFS file, the data is first written to a local temporary file, as explained above. Suppose the file's replication factor is three. When the local file accumulates a full block of data, the client obtains from the Namenode a list of Datanodes that will host the replicas of that block. The client then begins transferring the block to the first Datanode. The first Datanode receives the data in small portions (4 KB), writes each portion to its local repository, and at the same time transfers that portion to the second Datanode in the list. The second Datanode does likewise: it receives each small portion, writes it to its local repository, and simultaneously forwards it to the third Datanode. Finally, the third Datanode receives the data and stores it locally. A Datanode can therefore receive data from the previous node in the pipeline and forward it to the next node at the same time: the data is pipelined from one Datanode to the next.
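
The forwarding step can be sketched as a chain of stream copies in 4 KB chunks. The sketch below is a simplified illustration with generic streams standing in for the Datanode data-transfer protocol; it is not the Datanode's actual implementation.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    // Simplified illustration of pipeline replication: each node writes a 4 KB chunk
    // locally and forwards the same chunk downstream before reading the next one.
    class PipelineReplicationSketch {
        static final int CHUNK_SIZE = 4 * 1024;   // 4 KB portions, as described above

        // 'upstream' is data arriving from the client or the previous Datanode,
        // 'localStore' is this node's local block file,
        // 'downstream' is the next Datanode in the pipeline (null for the last node).
        static void relayBlock(InputStream upstream, OutputStream localStore,
                               OutputStream downstream) throws IOException {
            byte[] chunk = new byte[CHUNK_SIZE];
            int n;
            while ((n = upstream.read(chunk)) != -1) {
                localStore.write(chunk, 0, n);                 // persist locally
                if (downstream != null) {
                    downstream.write(chunk, 0, n);             // forward the same chunk immediately
                }
            }
            localStore.flush();
            if (downstream != null) downstream.flush();
        }
    }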


Accessibility

HDFS can be accessed by applications in many different ways. Natively, it provides a Java API, and a C-language wrapper for this Java API is also available. In addition, files in HDFS can be browsed and viewed through a web browser. Access through the WebDAV protocol is under development.
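
For instance, reading a file through the Java API looks roughly like the following minimal sketch; the path is a placeholder, and the cluster settings come from whatever configuration files are on the classpath.

    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    // Reads an HDFS file through the Java API and copies it to standard output.
    public class ReadHdfsFileExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);          // connects to the configured Namenode
            InputStream in = fs.open(new Path("/foodir/myfile.txt"));   // placeholder path
            try {
                IOUtils.copyBytes(in, System.out, 4096, false);   // stream the file in 4 KB buffers
            } finally {
                IOUtils.closeStream(in);
                fs.close();
            }
        }
    }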


DFSShell

HDFS organizes user data in files and directories. It provides a command-line interface, DFSShell, that lets users interact with the data in HDFS. The syntax of these commands is similar to that of other shells (such as bash and csh) that users are already familiar with. Here are some sample action/command pairs:

Action                                              Command
Create a directory named /foodir                    bin/hadoop dfs -mkdir /foodir
View the contents of the file /foodir/myfile.txt    bin/hadoop dfs -cat /foodir/myfile.txt

DFSShell can be used in applications that interact with the file system through scripting.


DFSAdmin

The DFSAdmin command set is used for administering an HDFS cluster. These commands are used only by an HDFS administrator. Here are some sample action/command pairs:

Action                                              Command
Put the cluster in safe mode                        bin/hadoop dfsadmin -safemode enter
Generate a list of Datanodes                        bin/hadoop dfsadmin -report
Decommission a Datanode named datanodename          bin/hadoop dfsadmin -decommission datanodename


Browser Interface

A typical HDFS installation will enable a Web server on a configurable TCP port to expose the HDFS namespace. You can use a browser to browse the HDFS namespace and view the file content.


Space Reclamation
File Deletion and Recovery

When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS renames the file into the /trash directory. The file can be restored quickly as long as it remains in /trash. The length of time a file is kept in /trash is configurable; after this time expires, the Namenode deletes the file from the namespace. Deleting the file causes the data blocks associated with it to be freed. Note that there can be an appreciable delay between a user deleting a file and the corresponding increase in free space in HDFS.

A deleted file can be recovered as long as it remains in the /trash directory: a user who wants to undelete a file can browse /trash and retrieve it. The /trash directory contains only the latest copy of the deleted file. The /trash directory is just like any other directory, with one special feature: HDFS applies a policy to automatically delete files from it. The current default policy is to delete files that have been in /trash for more than six hours. In the future, this policy will be configurable through a well-defined interface.


Decreasing the Replication Factor

When the replication factor of a file is reduced, the Namenode selects excess replicas that can be deleted. The next heartbeat transfers this information to the Datanode, which then removes the corresponding data blocks, and the corresponding free space appears in the cluster. Again, there may be a delay between the call to the setReplication API and the appearance of free space in the cluster.


References

HDFS Java API: http://hadoop.apache.org/core/docs/current/api/

HDFS source code: http://hadoop.apache.org/core/version_control.html

By Dhruba Borthakur
