Hadoop Distributed File System: Structure and Design

1. Introduction

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities to existing distributed file systems, but the differences are significant. HDFS is highly fault-tolerant and is designed to be deployed on inexpensive hardware. It provides high-throughput access to application data and is suitable for applications with large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project; it is part of the Hadoop project, which in turn is part of the Apache Lucene project. The address of the project is: http://projects.apache.org/projects/hadoop.html.

2. Assumptions and Objectives

2.1. Hardware failure

Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds of server machines, each storing part of the file system's data. The fact that there are a huge number of components, and that each component has a non-trivial probability of failure, means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core design goal of HDFS.

2.2. Streaming data access

Applications that run on HDFS need streaming access to their data sets; they are not the general-purpose applications that typically run on ordinary file systems. HDFS is designed for batch processing rather than interactive use by ordinary users. The emphasis is on high throughput of data access rather than low latency of data access. Many of the hard requirements that POSIX imposes are not needed for HDFS applications, and POSIX semantics have been traded away in a few key areas to increase data throughput rates.

2.3. Large datasets

Applications that run on HDFS use large data sets. A typical HDFS file may be several gigabytes to terabytes in size, so HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster, and a single instance may need to support tens of millions of files.

2.4. Simple consistency model

HDFS applications need a write-once-read-many access model for files. Once a file is created and written, it does not need to be changed. This assumption simplifies data consistency issues and makes high-throughput data access possible. A MapReduce application or a web crawler application fits this model perfectly. There is a plan to support appending writes to files in the future.

2.5. Moving computation is cheaper than moving data

A computation requested by an application is much more efficient when it executes near the data it operates on. This is especially true when the data set is huge. It minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located than to move the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.

2.6. Portability across heterogeneous hardware and software platforms

HDFS is designed to be easily portable from one platform to another. This facilitates the widespread adoption of HDFS as a working platform for a large set of applications.

3. Name node and data nodes

HDFS has a master/slave architecture. A cluster has a single name node, the master server, which manages the file system namespace and regulates client access to files. In addition, there are a number of data nodes, usually one per physical node, which manage the storage attached to the nodes they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of data nodes. The name node executes file system namespace operations such as opening, closing, and renaming files and directories, and it determines the mapping of blocks to data nodes. The data nodes are responsible for serving read and write requests from clients. The data nodes also perform block creation, deletion, and replication upon instruction from the name node.

The name node and data nodes are pieces of software designed to run on commodity machines, which typically run a GNU/Linux operating system. HDFS is implemented in the Java language; any machine that supports Java can run the name node or data node software. Use of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the name node software. The architecture does not preclude running data nodes on that same machine, but this is rarely done in real-world deployments.

Having a single name node in a cluster greatly simplifies the architecture of the system. The name node is the repository and arbitrator of all HDFS metadata. The system is designed so that user data never flows through the name node.

4. File system namespace

HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The namespace hierarchy is similar to that of most other existing file systems: one can create and delete files, move a file from one directory to another, or rename a file. HDFS does not yet implement user quotas or access permissions, and it does not support hard links or soft links. However, the architecture does not preclude implementing these features.

The name node maintains the file system namespace, and any change to the namespace or to its properties is recorded by the name node. An application can specify the number of replicas of a file that HDFS should maintain; this number is called the replication factor of the file and is recorded by the name node.

5. Data replication

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last one are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file, either at creation time or later. Files in HDFS are written once and have strictly one writer at any time.
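
As a concrete illustration of per-file settings, the following is a minimal sketch, not taken from the Hadoop sources, that uses an overload of the Java FileSystem.create call which accepts a replication factor and a block size at creation time; the path, buffer size, and values chosen are only examples.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CreateWithBlockSize {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();        // reads the cluster configuration files
            FileSystem fs = FileSystem.get(conf);            // the default file system, HDFS in this sketch

            // Create a (hypothetical) file with a replication factor of 2 and a 128 MB block size.
            Path file = new Path("/user/demo/big.dat");
            try (FSDataOutputStream out = fs.create(
                    file,
                    true,                                    // overwrite if it already exists
                    4096,                                    // I/O buffer size in bytes
                    (short) 2,                               // replication factor for this file
                    128L * 1024 * 1024)) {                   // block size for this file, in bytes
                out.writeBytes("example data");
            }
        }
    }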

The name node makes all decisions regarding replication of blocks. It periodically receives a heartbeat and a block report from each data node in the cluster. Receipt of a heartbeat implies that the data node is functioning properly. A block report contains a list of all the blocks on that data node.

5.1. Replica placement: the first steps

The placement of replicas is critical to HDFS reliability and performance, and optimizing replica placement is what most distinguishes HDFS from other distributed file systems. It is a feature that needs a lot of tuning and experience. The purpose of a carefully measured replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current placement policy is only a first step in this direction; the short-term goals are to validate it on production systems, learn more about its behavior, and build a foundation for testing and research into more sophisticated policies.

Large HDFS instances run on a cluster of machines that commonly spread across several racks. Two nodes in different racks communicate through switches. In most cases, the network bandwidth between machines in the same rack is greater than that between machines in different racks.

At startup, each data node determines the rack it belongs to and notifies the name node of its rack ID when it registers. HDFS provides APIs to facilitate pluggable modules that determine the rack ID of a node. A simple but non-optimal policy is to place replicas on distinct racks. This prevents data loss when an entire rack fails, and it allows the bandwidth of multiple racks to be used when reading data. This policy also distributes replicas evenly, which balances load when a component fails. However, it increases the cost of writes, because a write must transfer blocks across racks.

For the common case, when the replication factor is three, HDFS's placement policy is to put one replica on a node in the local rack, another on a different node in the same rack, and the last one on a node in a different rack. This policy cuts the inter-rack write traffic, which generally improves write performance. The chance of a rack failure is far smaller than that of a node failure, so this policy does not compromise data reliability and availability. However, because a block is placed on two unique racks rather than three, it does reduce the aggregate network bandwidth that can be used when reading data. With this policy, the replicas of a file are not evenly distributed across the racks.

One third of the replicas are on one node, two thirds of the replicas are on one rack, and the remaining third are evenly distributed across the remaining racks.

This policy improves write performance without compromising data reliability or read performance. The default replica placement policy described here is a work in progress.

5.2. Replica selection

To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from the replica that is closest to the reader. If a replica exists on the same rack as the reader node, that replica is preferred to satisfy the read request. If the HDFS cluster spans multiple data centers, a replica residing in the local data center is preferred over a remote replica.

5.3. Safe Mode

On startup, the name node enters a special state called safe mode, during which no replication of data blocks occurs. The name node receives heartbeat and block report messages from the data nodes. A block report contains the list of data blocks that a data node is holding. Each block has a specified minimum number of replicas. A block is considered safely replicated when that minimum number of replicas has checked in with the name node. After a configurable percentage of blocks have been reported as safely replicated (plus an additional 30 seconds), the name node exits safe mode. It then determines which data blocks, if any, still have fewer than the specified number of replicas, and replicates those blocks to other data nodes.

6. Persistence of file system metadata

The HDFS namespace is stored at the name node. The name node uses a transaction log called the edit log to persistently record every change that occurs to file system metadata. For example, creating a new file in HDFS causes the name node to insert a record into the edit log; similarly, changing the replication factor of a file causes a new record to be inserted into the edit log. The name node uses a file in its local operating system's file system to store the edit log. The entire file system namespace, including the mapping of blocks to files and the file system properties, is stored in a file called the file system image (FsImage), which is likewise stored in the name node's local file system.

The name node keeps an image of the entire file system namespace and the file-to-block map in memory. This key metadata is designed to be compact, so a name node with 4 GB of RAM is sufficient to support a huge number of directories and files. When the name node starts up, it reads the FsImage and the edit log from disk, applies the edit log transactions to the in-memory image, flushes the new image out to disk, and can then truncate the old edit log because its transactions have been persisted. This process is called a checkpoint. In the current implementation, a checkpoint occurs only when the name node starts up; work on supporting periodic checkpoints is in progress.

Data nodes store HDFS data in files in their local file systems. A data node has no knowledge of HDFS files; it stores each block of HDFS data in a separate local file. The data node does not create all files in the same directory; instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories appropriately. Creating all files in the same directory is not optimal because the local file system might not efficiently support a huge number of files in a single directory. When a data node starts up, it scans its local file system, generates a list of the HDFS data blocks that correspond to its local files, and sends this report to the name node: this is the block report.

7. Communication protocol

All HDFS communication protocols are layered on top of the TCP/IP protocol. A client connects to a configurable TCP port on the name node and talks to it using the client protocol. The data nodes talk to the name node using the data node protocol. A remote procedure call (RPC) abstraction wraps both protocols. By design, the name node never initiates any RPCs; it only responds to requests issued by data nodes or clients.

8. Robustness

The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are name node failures, data node failures, and network partitions.

8.1. Data disk failure, heartbeats and re-replication

Each data node periodically sends a heartbeat message to the name node. A network partition can cause a subset of data nodes to lose connectivity with the name node, which detects this condition by the absence of heartbeats. The name node marks data nodes without recent heartbeats as dead and does not forward any new I/O requests to them. Any data registered on a dead data node is no longer available to HDFS. The death of data nodes may cause the replication factor of some blocks to fall below their specified value. The name node constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The need for re-replication may arise for several reasons: a data node becomes unavailable, a replica becomes corrupted, a disk on a data node fails, or the replication factor of a file is increased.

8.2. Cluster rebalancing

The HDFS architecture is compatible with data rebalancing schemes. Such a scheme might automatically move data from one data node to another if the free space on a data node falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.

8.3. Data integrity

It is possible for a block of data fetched from a data node to arrive corrupted, because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same namespace. When a client retrieves file contents, it verifies that the data received from each data node matches the corresponding checksum. If not, the client can choose to retrieve that block from another data node that has a replica of it.
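
The idea behind this mechanism can be sketched as follows. This is not HDFS's actual checksum implementation; it is a minimal illustration in Java using CRC32, with an arbitrary chunk size, showing checksums computed at write time and verified at read time.

    import java.util.zip.CRC32;

    // Sketch: one CRC32 per fixed-size chunk; a mismatch on read signals corruption,
    // and the reader would then fetch the block from another replica.
    public class ChunkChecksum {
        static final int CHUNK_SIZE = 512;          // arbitrary chunk size for this sketch

        // Compute one checksum per chunk of the given data (write path).
        static long[] checksums(byte[] data) {
            int chunks = (data.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
            long[] sums = new long[chunks];
            CRC32 crc = new CRC32();
            for (int i = 0; i < chunks; i++) {
                int off = i * CHUNK_SIZE;
                int len = Math.min(CHUNK_SIZE, data.length - off);
                crc.reset();
                crc.update(data, off, len);
                sums[i] = crc.getValue();
            }
            return sums;
        }

        // Recompute and compare on the read path.
        static boolean verify(byte[] data, long[] expected) {
            return java.util.Arrays.equals(checksums(data), expected);
        }
    }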

8.4. Metadata disk failure

"File system Mirrors" and "edit logs" are central important structures for HDFS, which can cause HDFs instances to be unavailable.

For this reason, the name node can be configured to maintain multiple copies of the FsImage and edit log. Any update to either of them causes each of the copies to be updated synchronously. This synchronous updating may reduce the rate of namespace transactions per second that the name node can support. In practice the degradation is acceptable, because even though HDFS applications are very data-intensive, they are not metadata-intensive. When the name node restarts, it selects the latest consistent copies of these files. The name node machine is a single point of failure for an HDFS cluster: if the name node fails, manual intervention is necessary. Automatic restart and failover of the name node software to another machine is not currently supported.

8.5. Snapshot

Snapshots support storing a copy of data as it existed at a particular instant in time. One use of the snapshot feature might be to roll back a corrupted HDFS instance to a previously known good point in time. HDFS does not currently support snapshots, but a future release will.

9. Data organization structure

9.1. Data block

HDFS is designed to support very large files, and the applications that are compatible with HDFS are those that deal with large data sets. These applications write their data once but read it one or more times, and they require the reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB, so an HDFS file is chopped into 64 MB chunks and, where possible, each chunk resides on a different data node. For example, a 1 GB file is stored as sixteen 64 MB blocks spread across the cluster.

9.2. Staging

A client request to create a file does not reach the name node immediately. Instead, the HDFS client caches the file data into a temporary local file, and application writes are transparently redirected to this local file. When the local file accumulates data beyond one HDFS block size, the client contacts the name node. The name node inserts the file name into the file system hierarchy and allocates a data block for it, then replies to the client with the identity of the target data node and the destination block, so that the client can flush the data from the local cache to that block. When the file is closed, the remaining un-flushed data in the temporary file is transferred to the data node, and the client then tells the name node that the file is closed. At this point, the name node commits the file creation operation to persistent storage. If the name node dies before the file is closed, the file is lost.
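
To make the staging behaviour concrete, here is a toy sketch of the idea, not the real HDFS client: application writes accumulate in a local buffer, and the name node is contacted only when a full block's worth of data has been collected or the file is closed.

    import java.io.ByteArrayOutputStream;

    // Toy illustration of staging; a real client would buffer to a local temporary file
    // and stream each completed block to the data node(s) chosen by the name node.
    public class StagingWriterSketch {
        static final long BLOCK_SIZE = 64L * 1024 * 1024;   // 64 MB, the typical HDFS block size

        private final ByteArrayOutputStream localBuffer = new ByteArrayOutputStream();

        public void write(byte[] data) {
            localBuffer.write(data, 0, data.length);        // redirected application write
            if (localBuffer.size() >= BLOCK_SIZE) {
                flushBlock();                                // contact the name node only now
            }
        }

        public void close() {
            if (localBuffer.size() > 0) {
                flushBlock();                                // ship the remaining partial block
            }
            // ... then tell the name node the file is closed so the creation is committed
        }

        private void flushBlock() {
            // In a real client: ask the name node to allocate a block, learn the target
            // data node(s), and stream the buffered bytes to them.
            localBuffer.reset();
        }
    }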

The above approach has been adopted after careful consideration of the target applications that run on HDFS. These applications need streaming writes to files. If a client wrote directly to a remote file without any client-side buffering, network speed and congestion would considerably affect throughput. This approach is not without precedent: earlier distributed file systems, such as AFS, have used client-side caching to improve performance. A POSIX requirement has been relaxed to achieve higher performance for data uploads.

9.3. Replication pipelining

When a client writes data to an HDFS file, the data is first written to a local file, as explained in the previous section. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client obtains a list of data nodes from the name node; this list contains the data nodes that will hold a replica of that block. The client flushes the block to the first data node. The first data node starts receiving the data in small portions (4 KB), writes each portion to its local repository, and transfers that portion to the second data node in the list. The second data node, in turn, receives each portion, writes it to its own repository, and flushes it to the third data node. Finally, the third data node writes the data to its local repository. A data node can therefore be receiving data from the previous node and forwarding data to the next node at the same time. In other words, the data is pipelined from one data node to the next.
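
A minimal sketch of the pipelining idea follows, as an illustration rather than the real data node code: each node reads the block in 4 KB portions, writes each portion to its own storage, and immediately forwards the same bytes to the next node in the pipeline, if any.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    public class PipelineSketch {
        static final int PORTION_SIZE = 4 * 1024;   // 4 KB portions, as described above

        // 'in' comes from the upstream node (or the client), 'localStore' is this node's
        // local repository, and 'downstream' is the next data node, or null for the last node.
        static void relay(InputStream in, OutputStream localStore, OutputStream downstream)
                throws IOException {
            byte[] portion = new byte[PORTION_SIZE];
            int n;
            while ((n = in.read(portion)) != -1) {
                localStore.write(portion, 0, n);        // persist the portion locally
                if (downstream != null) {
                    downstream.write(portion, 0, n);    // forward the same portion downstream
                }
            }
            localStore.flush();
            if (downstream != null) {
                downstream.flush();
            }
        }
    }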

10. Accessibility

HDFS can be accessed from applications in many different ways. Natively, HDFS provides a set of Java APIs for applications to use, and a C language wrapper for these APIs is also available. An HTTP browser can be used to browse the files of an HDFS instance. Work to access HDFS through the WebDAV protocol is still in progress.
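
As a brief illustration of the Java API, the sketch below lists a directory and reads a file through the org.apache.hadoop.fs.FileSystem abstraction; the paths used are hypothetical, and the code assumes the cluster configuration files are on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadAndList {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();          // loads the cluster configuration
            FileSystem fs = FileSystem.get(conf);              // the default file system, HDFS here

            // List the entries of a (hypothetical) directory.
            for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }

            // Read the first few bytes of a (hypothetical) file.
            try (FSDataInputStream in = fs.open(new Path("/user/demo/sample.txt"))) {
                byte[] head = new byte[128];
                int n = in.read(head);
                System.out.println(new String(head, 0, Math.max(n, 0), "UTF-8"));
            }
        }
    }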

10.1. DFSShell

HDFS organizes user data in the form of files and directories. It provides a command line interface called DFSShell that lets users interact with the data in HDFS. The syntax of this command set is similar to other shells that users are already familiar with. Here are some examples:
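
(Representative commands; the exact option names can vary between Hadoop releases, and the paths are hypothetical.)

    Create a directory named /foodir: bin/hadoop dfs -mkdir /foodir
    View the contents of a file named /foodir/myfile.txt: bin/hadoop dfs -cat /foodir/myfile.txt
    Remove a directory named /foodir: bin/hadoop dfs -rmr /foodir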

DFSShell is targeted at applications and scripts that need to interact with the stored data.

10.2. DFSAdmin

The DFSAdmin command set is designed for administering an HDFS cluster and can be used only by an administrator. Some examples:

    Put the cluster in safe mode: bin/hadoop dfsadmin -safemode enter
    Generate a list of data nodes: bin/hadoop dfsadmin -report
    Decommission a data node named datanodename: bin/hadoop dfsadmin -decommission datanodename

10.3. Browser interface

A typical HDFS installation configures a web server to expose the HDFS namespace through a configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of its files using a web browser.

11. Space reclamation

11.1. File deletes and undeletes

When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first moves it to the /trash directory. The file can be restored quickly as long as it remains in this directory. The time a file remains in /trash is configurable; once that lifetime expires, the system removes the file from the namespace. The deletion of a file causes the corresponding blocks to be freed. Note that there can be an appreciable delay between the time a user deletes a file and the time the corresponding increase in free space appears in the system.

A user can undelete a file after deleting it, as long as the file is still in the trash directory. To do so, the user can browse this directory and retrieve the file. The /trash directory contains only the most recently deleted files, and it has one special feature: HDFS applies a policy to automatically delete files from it. The current default policy is to delete files that have been in /trash for more than 6 hours. In future releases, this policy will be configurable through a well-defined interface.

11.2. Reduce replication factor

When the replication factor of a file is reduced, the name node selects excess replicas that can be deleted. The next heartbeat conveys this information to the data node, which then removes the corresponding blocks, and the corresponding free space appears in the cluster. Once again, there may be a delay between the completion of the setReplication call and the appearance of free space in the cluster.
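
For reference, a minimal sketch of lowering a file's replication factor through the Java API; the path and the new factor are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LowerReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Ask the name node to keep only two replicas of this (hypothetical) file;
            // excess replicas are removed lazily, so free space appears with a delay.
            boolean accepted = fs.setReplication(new Path("/user/demo/big.dat"), (short) 2);
            System.out.println("replication change accepted: " + accepted);
        }
    }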
