07. HDFS Architecture

Introduction

HDFS is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications with large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project and is now a core part of Apache Hadoop.

Assumptions and Goals

Hardware Failure

Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components, each with a non-trivial probability of failure, means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

Streaming Data Access

Applications that run on HDFS need streaming access to their data sets. They are not general-purpose applications that typically run on general-purpose file systems. HDFS is designed more for batch processing than for interactive use, and the emphasis is on high throughput of data access rather than low latency. POSIX imposes many hard requirements that are not needed for applications targeted at HDFS, so POSIX semantics have been traded away in a few key areas to increase data throughput.

Large data sets

Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes in size, so HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster, and a single HDFS instance should support millions of files.

Simple Coherency Model

HDFS applications need a write-once-read-many access model for files. A file, once created, written, and closed, does not need to be changed. This assumption simplifies data coherency issues and enables high-throughput data access. A MapReduce application or a web crawler application fits this model perfectly. There is a plan to support appending writes to files in the future.
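
To make the write-once-read-many model concrete, here is a minimal sketch using the FileSystem Java API described later in the Accessibility section: the file is created, written, and closed exactly once, and can then be opened for reading any number of times. The cluster address comes from the client configuration, the path is hypothetical, and error handling is omitted.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteOnceReadMany {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();            // reads core-site.xml / hdfs-site.xml if present
            FileSystem fs = FileSystem.get(conf);                // the file system named in the configuration
            Path file = new Path("/user/example/events.log");    // hypothetical path

            // Write once: create the file, stream data into it, close it.
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeUTF("first and only write of this file");
            }

            // Read many: the closed file can now be opened and read repeatedly.
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
            fs.close();
        }
    }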

"Moving computation is cheaperthan moving data"

A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the data set is huge, because it minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to move the computation closer to where the data is located than to move the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.

Portability Across Heterogeneous Hardware and Software Platforms

HDFS has been designed to be easily portable from one platform to another. This facilitates the widespread adoption of HDFS as the platform of choice for a large set of applications.

Namenode and datanodes

HDFS has a master/slave architecture. An HDFS cluster consists of a single namenode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of datanodes, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored on a set of datanodes. The namenode executes file system namespace operations such as opening, closing, and renaming files and directories. It also determines the mapping of blocks to datanodes. The datanodes are responsible for serving read and write requests from the file system's clients. The datanodes also perform block creation, deletion, and replication upon instruction from the namenode.
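
As a rough illustration of the block-to-datanode mapping kept by the namenode, the sketch below (FileSystem Java API; the path is hypothetical) asks for the block locations of a file and prints the datanodes that hold each block.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/example/big-input.dat");   // hypothetical file

            // The namenode answers this query from its metadata; no file data
            // is read, so no datanode traffic is generated.
            FileStatus status = fs.getFileStatus(file);
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset %d, length %d, datanodes %s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }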

The namenode and datanode are pieces of software designed to run on commodity machines, which typically run a GNU/Linux operating system. HDFS is built using the Java language; any machine that supports Java can run the namenode or the datanode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the namenode software, while each of the other machines in the cluster runs one instance of the datanode software. The architecture does not preclude running multiple datanodes on the same machine, but this is rarely the case in a real deployment.

The existence of a single namenode in a cluster greatly simplifies the architecture of the system. The namenode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the namenode.

The file system namespace

HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems: one can create and delete files, move a file from one directory to another, or rename a file. HDFS does not implement user quotas or access permissions, and it does not support hard links or soft links. However, the HDFS architecture does not preclude implementing these features.
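
These namespace operations map directly onto the FileSystem Java API. The following sketch (hypothetical paths, no error handling) creates a directory, renames a file, and deletes a file; each call results in a namespace change recorded by the namenode.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NamespaceOperations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            fs.mkdirs(new Path("/user/example/reports"));            // create a directory
            fs.rename(new Path("/user/example/tmp.csv"),
                      new Path("/user/example/reports/2014.csv"));   // move/rename a file
            fs.delete(new Path("/user/example/old.csv"), false);     // delete a single file (non-recursive)

            fs.close();
        }
    }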

The namenode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the namenode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file, and this information is stored by the namenode.

Data Replication

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file; the replication factor can be specified at file creation time and changed later. Files in HDFS are write-once and have strictly one writer at any time.
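
Because the block size and replication factor are per-file settings, both can be supplied when a file is created. The sketch below uses one of the FileSystem.create overloads; the path, the replication factor of 2, and the 128 MB block size are illustrative assumptions only.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CreateWithReplicationAndBlockSize {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            Path file = new Path("/user/example/archive.dat");   // hypothetical path
            boolean overwrite = true;
            int bufferSize = 4096;
            short replication = 2;                                // replication factor for this file only
            long blockSize = 128L * 1024 * 1024;                  // 128 MB blocks for this file only

            try (FSDataOutputStream out = fs.create(file, overwrite, bufferSize, replication, blockSize)) {
                out.write("payload".getBytes("UTF-8"));
            }
            fs.close();
        }
    }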

The namenode makes all decisions regarding the replication of blocks. It periodically receives a heartbeat and a blockreport from each of the datanodes in the cluster. Receipt of a heartbeat implies that the datanode is functioning properly. A blockreport contains a list of all the blocks on a datanode.


Replica placement: the first baby steps

The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems, and it is a feature that needs a lot of tuning and experience. The purpose of the rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation of the replica placement policy is a first effort in this direction. The short-term goals of this policy are to validate it on production systems, learn more about its behavior, and build a foundation to test and research more sophisticated policies.

Large HDFS instances run on a cluster of computers that commonly spreads across many racks. Communication between two nodes on different racks has to go through switches. In most cases, the network bandwidth between machines in the same rack is greater than the network bandwidth between machines in different racks.

The namenode determines the rack to which each datanode belongs via the process outlined in Hadoop Rack Awareness (see the Hadoop cluster setup documentation). A simple but non-optimal policy is to place replicas on unique racks. This prevents data loss when an entire rack fails and allows use of bandwidth from multiple racks when reading data. This policy evenly distributes replicas in the cluster, which makes it easy to balance load on component failure. However, it increases the cost of writes, because a write needs to transfer blocks to multiple racks.

For the common case, when the replication factor is three, HDFS's placement policy is to put one replica on one node in the local rack, another on a different node in the local rack, and the last on a different node in a different rack. This policy cuts the inter-rack write traffic, which generally improves write performance. The chance of a rack failure is far less than that of a node failure, so this policy does not affect data reliability and availability guarantees. It does, however, reduce the aggregate network bandwidth used when reading data, since a block is placed on only two unique racks rather than three. With this policy, the replicas of a file are not evenly distributed across the racks: one third of the replicas are on one node, two thirds of the replicas are on one rack, and the remaining third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.

The current, default replica placement policy described here is a work in progress.

Replica Selection

To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from the replica that is closest to the reader. If a replica exists on the same rack as the reader node, that replica is preferred to satisfy the read request. If an HDFS cluster spans multiple data centers, a replica resident in the local data center is preferred over any remote replica.

Safemode

On startup, the namenode enters a special state called safemode. Replication of data blocks does not occur while the namenode is in safemode. The namenode receives heartbeat and blockreport messages from the datanodes; a blockreport contains the list of data blocks that a datanode is hosting. Each block has a specified minimum number of replicas, and a block is considered safely replicated when that minimum number of replicas has checked in with the namenode. After a configurable percentage of safely replicated data blocks has checked in with the namenode (plus an additional 30 seconds), the namenode exits the safemode state. It then determines the list of data blocks, if any, that still have fewer than the specified number of replicas, and replicates those blocks to other datanodes.

The persistence of file system metadata

The HDFS namespace is stored by the namenode. The namenode uses a transaction log called the editlog to persistently record every change that occurs to file system metadata. For example, creating a new file in HDFS causes the namenode to insert a record into the editlog indicating this; similarly, changing the replication factor of a file inserts a new record into the editlog. The namenode uses a file in its local host operating system to store the editlog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the fsimage. The fsimage is also stored as a file in the namenode's local file system.

The namenode keeps an image of the entire file system namespace and the file blockmap in memory. This metadata is designed to be compact, such that a namenode with, say, 4 GB of RAM is sufficient to support a huge number of files and directories. When the namenode starts up, it reads the fsimage and editlog from disk, applies all the transactions from the editlog to the in-memory representation of the fsimage, and flushes out this new version into a new fsimage on disk. It can then truncate the old editlog, because its transactions have been applied to the persistent fsimage. This process is called a checkpoint. In the current implementation, a checkpoint only occurs when the namenode starts up; work is in progress to support periodic checkpointing in the near future.

The datanode stores HDFS data in files in its local file system. The datanode has no knowledge about HDFS files; it stores each block of HDFS data in a separate file in the local file system. The datanode does not create all files in the same directory. Instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories appropriately. It is not optimal to create all local files in the same directory, because the local file system might not be able to efficiently support a huge number of files in a single directory. When a datanode starts up, it scans through its local file system, generates a list of all HDFS data blocks that correspond to these local files, and sends this report to the namenode: this is the blockreport.

The communication protocols

All HDFS communication protocols are layered on top of the TCP/IP protocol. A client establishes a connection to a configurable TCP port on the namenode machine and talks the ClientProtocol with the namenode. The datanodes talk to the namenode using the DatanodeProtocol. A Remote Procedure Call (RPC) abstraction wraps both the ClientProtocol and the DatanodeProtocol. By design, the namenode never initiates any RPCs; it only responds to RPC requests issued by datanodes or clients.

Robustness

The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are namenode failures, datanode failures, and network partitions.

Data disk failure, heartbeats and re-replication

Each datanode sends a heartbeat message to the namenode periodically. A network partition can cause a subset of datanodes to lose connectivity with the namenode, which detects this condition by the absence of heartbeats. The namenode marks datanodes without recent heartbeats as dead and does not forward any new I/O requests to them. Any data that was registered to a dead datanode is no longer available to HDFS, and the death of datanodes may cause the replication factor of some blocks to fall below their specified value. The namenode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The need for re-replication may arise for many reasons: a datanode may become unavailable, a replica may become corrupted, a hard disk on a datanode may fail, or the replication factor of a file may be increased.

Cluster rebalancing

The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one datanode to another when the free space on a datanode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.

Data Integrity

It is possible that a block of data fetched from a datanode arrives corrupted. Such corruption can occur because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents, it verifies that the data it received from each datanode matches the checksum stored in the associated checksum file. If not, the client can opt to retrieve that block from another datanode that has a replica of it.
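
This checksum verification happens inside the client library on every read, so applications do not need any extra code for it. An application that wants to compare files end to end can additionally request a whole-file checksum, as in the sketch below (hypothetical path; the checksum algorithm reported depends on the cluster's configuration).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileChecksum;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowFileChecksum {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/example/events.log");   // hypothetical path

            // The result is derived from the per-block checksums already stored
            // alongside the blocks, so the client does not re-read the file data.
            FileChecksum checksum = fs.getFileChecksum(file);
            System.out.println(checksum.getAlgorithmName() + " : " + checksum);
            fs.close();
        }
    }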

Metadata disk failure

The fsimage and the editlog are central data structures of HDFS. Corruption of these files can render an HDFS instance non-functional. For this reason, the namenode can be configured to maintain multiple copies of the fsimage and editlog; any update to either of them causes each copy to be updated synchronously. This synchronous updating of multiple copies may degrade the rate of namespace transactions the namenode can support, but the degradation is acceptable because HDFS applications, although data-intensive, are not metadata-intensive. When a namenode restarts, it selects the latest consistent fsimage and editlog to use.

The namenode is a single point of failure for an HDFS cluster. If the namenode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the namenode software to another machine are not supported.

Snapshot

Snapshots support storing a copy of data at a particular instant of time. One use of the snapshot feature is to roll back a corrupted HDFS instance to a previously known good point in time. HDFS does not currently support snapshots, but it will in a future release.

Data Organization

Data Blocks

HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but read it one or more times, and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB, so an HDFS file is chopped up into 64 MB chunks and, if possible, each chunk resides on a different datanode.

Staging

A client request to create a file does not reach the namenode immediately. In fact, the HDFS client initially caches the file data into a temporary local file, and application writes are transparently redirected to this temporary local file. When the local file accumulates data worth at least one HDFS block size, the client contacts the namenode. The namenode inserts the file name into the file system hierarchy and allocates a data block for it, then responds to the client request with the identity of the datanode and the destination data block. The client then flushes the block of data from the local temporary file to the specified datanode. When the file is closed, the remaining un-flushed data in the temporary local file is transferred to the datanode, and the client then tells the namenode that the file is closed. At this point, the namenode commits the file creation operation into persistent storage. If the namenode dies before the file is closed, the file is lost.

The above approach has been adopted after careful consideration of the target applications that run on HDFS. These applications need streaming writes to files. If a client writes to a remote file directly, without any client-side buffering, the network speed and congestion in the network have a considerable impact on throughput. This approach is not without precedent: earlier distributed file systems, such as AFS, have used client-side caching to improve performance. A POSIX requirement has been relaxed to achieve higher performance for data uploads.

Replication Pipelining

When a client is writing data to an HDFS file, its data is first written to a local file, as explained above. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves a list of datanodes from the namenode; this list contains the datanodes that will host a replica of that block. The client then flushes the data block to the first datanode. The first datanode starts receiving the data in small portions, writes each portion to its local repository, and transfers that portion to the second datanode in the list. The second datanode, in turn, receives each portion of the data block, writes it to its repository, and then flushes it to the third datanode. Finally, the third datanode writes the data to its local repository. Thus, a datanode can receive data from the previous one in the pipeline while at the same time forwarding it to the next one. The data is pipelined from one datanode to the next.

Accessibility

HDFS can be accessed from applications in many different ways. Natively, HDFS provides a FileSystem Java API for applications to use. A C language wrapper for this Java API is also available. In addition, an HTTP browser can be used to browse the files of an HDFS instance. Work is in progress to expose HDFS through the WebDAV protocol.
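
For example, the FileSystem Java API can stream an HDFS file to standard output, much like the -cat shell command shown in the next section. A minimal sketch (the file path is taken from the FS shell example below):

    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            InputStream in = null;
            try {
                in = fs.open(new Path("/foodir/myfile.txt"));      // file used in the FS shell example below
                IOUtils.copyBytes(in, System.out, 4096, false);    // stream the contents to stdout
            } finally {
                IOUtils.closeStream(in);
                fs.close();
            }
        }
    }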

FS Shell

HDFS allows user data to be organized in the form of files and directories. It provides a command line interface called the FS shell that lets a user interact with the data in HDFS. The syntax of this command set is similar to other shells that users are already familiar with. Here are some sample action/command pairs:

Action: Create a directory named /foodir
Command: bin/hadoop dfs -mkdir /foodir

Action: Remove a directory named /foodir
Command: bin/hadoop dfs -rmr /foodir

Action: View the contents of a file named /foodir/myfile.txt
Command: bin/hadoop dfs -cat /foodir/myfile.txt

The FS shell is targeted at applications that need a scripting language to interact with the stored data.

Dfsadmin

The dfsadmin command set is used for administering an HDFS cluster. These commands are used only by an HDFS administrator. Here are some sample action/command pairs:

Action: Put the cluster in safemode
Command: bin/hadoop dfsadmin -safemode enter

Action: Generate a list of datanodes
Command: bin/hadoop dfsadmin -report

Action: Recommission or decommission datanode(s)
Command: bin/hadoop dfsadmin -refreshNodes

Browser Interface

A typical HDFS installation configures a web server to expose the HDFS namespace through a TCP port. This allows a user to use a web browser to browse the HDFS namespace and view its file content.

Space Reclamation

File deletes and undeletes

When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a configurable amount of time; after the expiry of this time, the namenode deletes the file from the HDFS namespace. The deletion of the file causes the blocks associated with it to be freed. Note that there can be an appreciable time delay between the time a file is deleted by a user and the time the corresponding free space appears in HDFS.

A user can undelete a file after deleting it, as long as it remains in the /trash directory: the user can navigate to /trash and retrieve the file. The /trash directory contains only the latest copy of each file that was deleted. The /trash directory is just like any other directory, with one special feature: HDFS applies specified policies to automatically delete files from this directory. Currently, the default deletion interval is 0 (files are deleted immediately without passing through /trash). This value can be configured via the fs.trash.interval property in the core-site.xml file.
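
The same fs.trash.interval setting drives programmatic use of the trash. Note that the FileSystem.delete API removes files directly, bypassing /trash; an application that wants trash semantics can use the org.apache.hadoop.fs.Trash helper instead, as in this sketch (the path and the one-day interval are illustrative assumptions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.Trash;

    public class MoveToTrashExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Illustrative value only: keep trashed files for 1440 minutes (one day).
            // In a real cluster this is normally configured in core-site.xml, as noted above.
            conf.set("fs.trash.interval", "1440");

            FileSystem fs = FileSystem.get(conf);
            Trash trash = new Trash(fs, conf);

            // Returns false (and moves nothing) if trash is disabled, i.e. the interval is 0.
            boolean moved = trash.moveToTrash(new Path("/user/example/obsolete.dat"));
            System.out.println("moved to trash: " + moved);
            fs.close();
        }
    }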

Decrease replication factor

When the replication factor of a file is reduced, the namenode selects excess replicas that can be deleted. The next heartbeat transfers this information to the datanode, which then removes the corresponding blocks, and the corresponding free space appears in the cluster. Once again, there may be a time delay between the completion of the setReplication API call and the appearance of free space in the cluster.
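
A minimal sketch of the setReplication call mentioned above, via the FileSystem Java API (the path and the target value of 2 are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DecreaseReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/example/archive.dat");   // hypothetical path

            // Lower this one file's replication factor to 2. The namenode schedules
            // removal of the excess replicas on later heartbeats, so the freed space
            // may not appear in the cluster immediately.
            boolean accepted = fs.setReplication(file, (short) 2);
            System.out.println("replication change accepted: " + accepted);
            fs.close();
        }
    }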
