1. Introduction
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences are significant. HDFS is highly fault tolerant and designed to be deployed on inexpensive hardware. It provides high-throughput access to application data and is suitable for applications with large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for Apache's Nutch web search engine project. It is part of the Hadoop project, which is in turn part of the Apache Lucene project. The project page is: http://projects.apache.org/Projects/hadoop.html.
2. Assumptions and Objectives
2.1. Hardware Failure
Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds of server machines, each storing part of the file system's data. The fact that there are a huge number of components, and that each component has a non-trivial probability of failure, means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core design goal of HDFS.
2.2. Streaming Data Access
Applications that run on HDFS need streaming access to their data sets. They are not the general-purpose applications that typically run on general-purpose file systems. HDFS is designed more for batch processing than for interactive use by ordinary users. The emphasis is on high throughput of data access rather than low latency of data access. Many of the hard requirements imposed by POSIX are not needed by applications targeted at HDFS, and POSIX semantics have been traded away in a few key areas to increase data throughput rates.
2.3. Large Dataset
Applications that run on HDFS have large data sets. A typical HDFS file may be several gigabytes to several terabytes in size. HDFS is therefore tuned to support large files. It should provide high aggregate data bandwidth, scale to hundreds of nodes in a single cluster, and support tens of millions of files in a single instance.
2.4. Simple Consistency Model
HDFS applications need a write-once-read-many access model for files. Once a file is created and written, it does not need to be changed. This assumption simplifies data consistency issues and makes high-throughput data access possible. The model suits MapReduce applications and web crawlers well. There is a plan to support appending writes in the future.
2.5. Moving Computation Is Cheaper than Moving Data
A computation is much more efficient if it executes near the data it operates on. This is especially true when the data set is huge: it reduces network congestion and increases overall system throughput. The assumption is that it is often better to migrate the computation closer to where the data is stored than to move the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
2.6. Portability Across Heterogeneous Hardware and Software Platforms
HDFS has been designed to be easily portable from one platform to another. This helps HDFS be adopted as the platform of choice for a large set of applications.
3. Name Node and Data Nodes
HDFS has a master/slave architecture. A cluster has a single Name node, the master server, which manages the file system namespace and regulates clients' access to files. There is also a set of data nodes, usually one per physical node, which manage the storage attached to the physical nodes they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more data blocks, and these blocks are stored in a set of data nodes. The Name node executes file system namespace operations such as opening, closing, and renaming files and directories, and it also determines the mapping of data blocks to data nodes. The data nodes serve clients' read and write requests. The data nodes also create and delete data blocks upon instruction from the Name node.
Name nodes and data nodes are pieces of software designed to run on commodity machines. These machines typically run the GNU/Linux operating system. HDFS is implemented in the Java language; any machine that supports Java can run the Name node or data node software. Use of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the Name node; the architecture does not preclude running data nodes on that machine, but real deployments rarely do so.
Having a single Name node in a cluster greatly simplifies the architecture of the system. The Name node is the arbiter and repository for all HDFS metadata. The system is designed so that user data never flows through the Name node.
4. The File System Namespace
HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside those directories. The namespace hierarchy is similar to that of most existing file systems: one can create and remove files, move a file from one directory to another, or rename a file. HDFS does not yet implement user quotas or access permissions, and it does not support hard links or soft links. However, the architecture does not preclude implementing these features.
The Name node maintains the file system namespace. Any change to the namespace or its properties is recorded by the Name node. An application can specify the number of replicas of a file that HDFS should maintain. This number is called the replication factor of the file and is recorded by the Name node.
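For illustration, here is a minimal sketch of how an application might set a file's replication factor through the Hadoop Java FileSystem API; the path and the numeric values are placeholders, and error handling is omitted:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationFactorExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();           // picks up the cluster configuration
            FileSystem fs = FileSystem.get(conf);               // handle to the configured HDFS instance
            Path file = new Path("/user/demo/data.txt");        // placeholder path

            // Create the file with an explicit replication factor of 3 and a 64 MB block size.
            FSDataOutputStream out =
                fs.create(file, true, 4096, (short) 3, 64L * 1024 * 1024);
            out.write("hello hdfs".getBytes());
            out.close();

            // The replication factor can also be changed later; the Name node records the new value.
            fs.setReplication(file, (short) 2);
        }
    }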
5. Data Replication
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks of a file except the last one are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file; the replication factor can be specified at file creation time and changed later. Files in HDFS are write-once and have strictly one writer at any time.
The Name node makes all decisions regarding the replication of blocks. It periodically receives a heartbeat and a block report from each of the data nodes in the cluster. Receipt of a heartbeat implies that the data node is functioning properly. A block report contains a list of all the blocks on a data node.
5.1. Replica Placement: The First Baby Steps
The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement is what distinguishes HDFS from most other distributed file systems; it is a feature that needs a lot of tuning and experience. The purpose of a carefully considered replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current policy for replica placement is only a first step in this direction. The short-term goal is to validate it on production systems, learn more about its behavior, and build a foundation for testing and researching more sophisticated policies.
Large HDFS instances run on a cluster of machines that typically spreads across many racks. Two nodes in different racks have to communicate through switches. In most cases, the network bandwidth between machines in the same rack is greater than the bandwidth between machines in different racks.
At startup, each data node determines the rack it belongs to and notifies the Name node of its rack ID when it registers. HDFS provides an interface through which the rack ID of a node can be determined. A simple but non-optimal policy is to place replicas on distinct racks. This prevents data loss when an entire rack fails and allows the bandwidth of multiple racks to be used when reading data. This policy also distributes replicas evenly, which balances load on component failure. However, it increases the cost of writes, because a write has to transfer blocks across racks.
For the common case, when the replication factor is three, HDFS's placement policy is to put one replica on the local node, another on a node in a different (remote) rack, and the last on a different node in that same remote rack. This policy cuts the inter-rack traffic of write operations, which generally improves write performance. The probability of a rack failure is far lower than that of a node failure, so this policy does not affect data reliability and availability guarantees. Because a block is placed in only two distinct racks rather than three, it does reduce the aggregate network bandwidth available when reading data. With this policy, the replicas of a file are not evenly distributed across the racks: one third of the replicas are on one node, two thirds of the replicas are on one rack, and the remaining third are evenly distributed across the other racks. This policy improves write performance without compromising data reliability or read performance. The implementation of this default replica placement policy is still a work in progress.
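The following is a simplified sketch, in Java, of the rack-aware choice just described. It is illustrative only and is not the Name node's actual placement code; the Node class and chooseTargets method are invented for this example:

    import java.util.ArrayList;
    import java.util.List;

    public class PlacementSketch {
        static class Node {
            final String name;
            final String rackId;
            Node(String name, String rackId) { this.name = name; this.rackId = rackId; }
        }

        /** Choose 3 targets: the writer's node, a node on a remote rack, and a second node on that same remote rack. */
        static List<Node> chooseTargets(Node writer, List<Node> cluster) {
            List<Node> targets = new ArrayList<Node>();
            targets.add(writer);                                    // replica 1: local node
            Node remote = null;
            for (Node n : cluster) {                                // replica 2: first node found on a different rack
                if (!n.rackId.equals(writer.rackId)) { remote = n; targets.add(n); break; }
            }
            if (remote != null) {
                for (Node n : cluster) {                            // replica 3: another node on that same remote rack
                    if (n != remote && n.rackId.equals(remote.rackId)) { targets.add(n); break; }
                }
            }
            return targets;                                         // fewer than 3 targets if the cluster is too small
        }
    }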
5.2. Replica Selection
To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from the replica that is closest to the reader. If a replica exists on the same rack as the reader node, that replica is preferred for satisfying the read request. If an HDFS cluster spans multiple data centers, a replica that resides in the local data center is preferred over a remote replica.
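A minimal sketch of this "closest replica first" idea follows; the Replica class and the distance scoring are invented for illustration, and the real client relies on the cluster's network topology rather than string comparisons:

    public class ReplicaSelectionSketch {
        static class Replica {
            final String host, rackId, dataCenter;
            Replica(String host, String rackId, String dataCenter) {
                this.host = host; this.rackId = rackId; this.dataCenter = dataCenter;
            }
        }

        /** Lower score = closer to the reader: same node < same rack < same data center < remote. */
        static int distance(Replica r, String readerHost, String readerRack, String readerDc) {
            if (r.host.equals(readerHost)) return 0;
            if (r.rackId.equals(readerRack)) return 1;
            if (r.dataCenter.equals(readerDc)) return 2;
            return 3;
        }

        static Replica pick(java.util.List<Replica> replicas, String host, String rack, String dc) {
            Replica best = null;
            for (Replica r : replicas) {
                if (best == null || distance(r, host, rack, dc) < distance(best, host, rack, dc)) best = r;
            }
            return best;
        }
    }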
5.3. Safe Mode
On startup, the Name node enters a special state called safe mode. Replication of data blocks does not occur while the Name node is in safe mode. The Name node receives heartbeats and block reports from the data nodes; a block report lists the data blocks that a data node holds. Every block has a specified minimum number of replicas. A block is considered safely replicated once that minimum number of replicas has been registered with the Name node. After a configurable percentage of blocks have been registered as safely replicated (plus an additional 30 seconds), the Name node exits safe mode. It then determines which data blocks, if any, still have fewer replicas than specified and replicates those blocks.
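As a sketch, the exit condition can be thought of as a simple predicate over the fraction of safely replicated blocks. The threshold value and the helper below are assumptions for illustration, not the actual implementation:

    public class SafeModeSketch {
        static final double SAFE_REPLICATION_THRESHOLD = 0.999;   // assumed configurable percentage
        static final long EXTENSION_MILLIS = 30_000;               // the extra 30 seconds mentioned above

        /** The Name node may leave safe mode once enough blocks have reached their minimum replica count. */
        static boolean mayLeaveSafeMode(long safelyReplicatedBlocks, long totalBlocks, long millisSinceThreshold) {
            if (totalBlocks == 0) return true;
            double fraction = (double) safelyReplicatedBlocks / totalBlocks;
            return fraction >= SAFE_REPLICATION_THRESHOLD && millisSinceThreshold >= EXTENSION_MILLIS;
        }
    }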
6. Persistence of File System Metadata
The HDFS namespace is stored on the Name node. The Name node uses a transaction log called the "Edit log" to persistently record every change to file system metadata. For example, creating a file in HDFS causes the Name node to insert a record into the "Edit log"; similarly, changing the replication factor of a file causes a new record to be inserted into the "Edit log". The Name node uses a file in its local operating system's file system to store the "Edit log". The entire file system namespace, including the mapping of blocks to files and the file system properties, is stored in a file called the "File System Image", which is also kept in the Name node's local file system.
The Name node keeps an image of the entire file system namespace and the file-to-block map in memory. The key metadata items are designed to be compact, so a Name node with 4 GB of RAM can support a huge number of directories and files. When the Name node starts up, it reads the "File System Image" and "Edit log" from disk, applies the transactions in the "Edit log" to the in-memory image, and then flushes the new image back to disk. The old "Edit log" can then be truncated, because its transactions have been persisted. This process is called a checkpoint. In the current implementation, a checkpoint occurs only when the Name node starts up; support for periodic checkpointing is in progress.
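The checkpoint logic can be illustrated with a toy sketch that replays a textual edit log over a textual image file and then truncates the log. The file formats here are invented for illustration and are not HDFS's real on-disk formats:

    import java.io.*;
    import java.util.HashMap;
    import java.util.Map;

    public class CheckpointSketch {
        // Toy in-memory "namespace": path -> replication factor.  The real image also holds the block map, etc.
        static Map<String, Short> loadImage(File fsImage) throws IOException {
            Map<String, Short> ns = new HashMap<String, Short>();
            if (!fsImage.exists()) return ns;
            BufferedReader r = new BufferedReader(new FileReader(fsImage));
            String line;
            while ((line = r.readLine()) != null) {                 // each line: "<path> <replication>"
                String[] parts = line.split(" ");
                if (parts.length == 2) ns.put(parts[0], Short.parseShort(parts[1]));
            }
            r.close();
            return ns;
        }

        static void applyEditLog(Map<String, Short> ns, File editLog) throws IOException {
            if (!editLog.exists()) return;
            BufferedReader r = new BufferedReader(new FileReader(editLog));
            String line;
            while ((line = r.readLine()) != null) {                 // each line: "CREATE <path> <repl>" or "DELETE <path>"
                String[] parts = line.split(" ");
                if (parts[0].equals("CREATE")) ns.put(parts[1], Short.parseShort(parts[2]));
                else if (parts[0].equals("DELETE")) ns.remove(parts[1]);
            }
            r.close();
        }

        static void checkpoint(File fsImage, File editLog) throws IOException {
            Map<String, Short> ns = loadImage(fsImage);
            applyEditLog(ns, editLog);                              // bring the in-memory image up to date
            PrintWriter w = new PrintWriter(new FileWriter(fsImage));
            for (Map.Entry<String, Short> e : ns.entrySet()) w.println(e.getKey() + " " + e.getValue());
            w.close();
            new FileWriter(editLog, false).close();                 // truncate the edit log: its transactions are now durable
        }
    }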
The data node stores HDFS data in files in its local file system. The data node has no knowledge of HDFS files; it stores each HDFS data block in a separate file. The data node does not create all files in the same directory. Instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories as appropriate. Creating all files in one directory is not optimal because the local file system may not efficiently support a huge number of files in a single directory. When a data node starts up, it scans its local file system, generates a list of the correspondence between HDFS data blocks and local files, and sends this report to the Name node: this is the block report.
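A sketch of generating a block report by walking the local storage directory might look like this; it assumes, for illustration, that each block is stored in a local file whose name starts with "blk_":

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    public class BlockReportSketch {
        /** Assume each HDFS block is stored as a local file named "blk_<id>" somewhere under dataDir. */
        static List<Long> collectBlockIds(File dataDir) {
            List<Long> blockIds = new ArrayList<Long>();
            File[] entries = dataDir.listFiles();
            if (entries == null) return blockIds;
            for (File f : entries) {
                if (f.isDirectory()) {
                    blockIds.addAll(collectBlockIds(f));            // recurse into subdirectories
                } else if (f.getName().startsWith("blk_")) {
                    blockIds.add(Long.parseLong(f.getName().substring(4)));
                }
            }
            return blockIds;                                        // this list is what gets sent to the Name node
        }
    }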
7. Communication Protocol
All HDFS communication protocols are layered on top of the TCP/IP protocol. A client connects to a configurable TCP port on the Name node and talks to it using the Client Protocol. The data nodes talk to the Name node using the Data Node Protocol. A Remote Procedure Call (RPC) abstraction wraps both protocols. By design, the Name node never initiates an RPC; it only responds to requests issued by data nodes or clients.
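Conceptually, the two protocols can be pictured as Java interfaces behind the RPC layer. The method names and signatures below are invented for illustration and do not reproduce the actual protocol definitions:

    public interface ClientProtocolSketch {
        /** Ask the Name node to allocate a new block for a file and return the target data nodes. */
        String[] addBlock(String fileName) throws java.io.IOException;

        /** Ask the Name node which data nodes hold the blocks of a file. */
        String[][] getBlockLocations(String fileName) throws java.io.IOException;
    }

    interface DataNodeProtocolSketch {
        /** Periodic heartbeat; the reply can carry commands back (e.g. replicate or delete blocks). */
        String[] sendHeartbeat(String dataNodeId, long capacity, long remaining) throws java.io.IOException;

        /** Full list of the blocks stored on this data node. */
        void blockReport(String dataNodeId, long[] blockIds) throws java.io.IOException;
    }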
8. Robustness
The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failure are Name node failures, data node failures, and network partitions.
8.1. Data Disk Failure, Heartbeats and Re-Replication
Each data node sends a heartbeat message to the Name node periodically. A network partition can cause a subset of data nodes to lose connectivity with the Name node. The Name node detects this condition by the absence of heartbeats; it marks data nodes without recent heartbeats as dead and does not forward any new IO requests to them. Any data that was registered on a dead data node is no longer available to HDFS. The death of data nodes may cause the replication factor of some blocks to fall below their specified value. The Name node constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The need for re-replication may arise for many reasons: a data node becomes unavailable, a replica becomes corrupted, a disk on a data node fails, or the replication factor of a file is increased.
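A sketch of the heartbeat bookkeeping described above might look like the following; the timeout value and data structures are assumptions for illustration only:

    import java.util.*;

    public class HeartbeatMonitorSketch {
        static final long HEARTBEAT_TIMEOUT_MILLIS = 10 * 60 * 1000;   // assumed expiry interval

        private final Map<String, Long> lastHeartbeat = new HashMap<String, Long>();          // data node -> last heartbeat time
        private final Map<String, Set<Long>> blocksOnNode = new HashMap<String, Set<Long>>(); // data node -> blocks it holds
        private final Queue<Long> neededReplications = new LinkedList<Long>();                // blocks waiting to be re-replicated

        void heartbeatReceived(String dataNode) {
            lastHeartbeat.put(dataNode, System.currentTimeMillis());
        }

        /** Mark nodes without a recent heartbeat as dead and queue their blocks for re-replication. */
        void checkForDeadNodes() {
            long now = System.currentTimeMillis();
            for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
                if (now - e.getValue() > HEARTBEAT_TIMEOUT_MILLIS) {
                    Set<Long> lost = blocksOnNode.getOrDefault(e.getKey(), Collections.<Long>emptySet());
                    neededReplications.addAll(lost);   // these blocks now have fewer replicas than required
                }
            }
        }
    }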
8.2. Cluster Rebalancing
The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one data node to another if the free space on the data node falls below a certain threshold. In the event of sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.
8.3. Data Integrity
One possibility is that a block of data fetched from a data node arrives corrupted. Such corruption can occur because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents, it verifies that the data received from each data node matches the checksum stored in the associated checksum file. If not, the client can retrieve that block from another data node that has a replica.
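A minimal sketch of the write-time checksum and read-time verification, using a plain CRC32 for illustration (HDFS's actual checksum file format is not shown):

    import java.util.zip.CRC32;

    public class ChecksumSketch {
        /** Compute the checksum the client stores in the hidden checksum file when the block is written. */
        static long checksumOf(byte[] blockData) {
            CRC32 crc = new CRC32();
            crc.update(blockData, 0, blockData.length);
            return crc.getValue();
        }

        /** On read, verify the data returned by a data node; on mismatch the client tries another replica. */
        static boolean verify(byte[] blockData, long expectedChecksum) {
            return checksumOf(blockData) == expectedChecksum;
        }
    }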
8.4. Metadata Disk Failure
- "File System Image" and "Edit log" are important structures in the center of HDFS. If these files are damaged, HDFS instances may become unavailable.
For this reason, the Name node can be configured to maintain multiple copies of the "File System Image" and "Edit log". Any update to either of them causes each copy to be updated synchronously. This synchronous updating reduces the rate of namespace transactions per second that the Name node can support. In practice the degradation is acceptable: even though HDFS applications are very data intensive, they are not metadata intensive. When the Name node restarts, it selects the latest consistent copies to use. The Name node machine remains a single point of failure for an HDFS cluster; if the Name node machine fails, manual intervention is necessary. Automatic restart of the Name node software, or failover to another machine, is not currently supported.
8.5. Snapshots
A snapshot stores a copy of the data at a particular instant in time. One use of the snapshot feature would be to roll back a corrupted HDFS instance to a previously known good point in time. HDFS does not currently support snapshots, but it will in a future release.
9. Data Organization Structure
9.1. Data Blocks
HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data once but read it one or more times, and they require those reads to proceed at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB; thus, an HDFS file is chopped up into 64 MB blocks and, if possible, each block resides on a different data node.
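As a quick illustration of the arithmetic, a small sketch that maps a file length onto 64 MB blocks:

    public class BlockLayoutSketch {
        static final long BLOCK_SIZE = 64L * 1024 * 1024;          // 64 MB, the typical block size mentioned above

        static long blockCount(long fileLength) {
            return (fileLength + BLOCK_SIZE - 1) / BLOCK_SIZE;     // last block may be smaller than the rest
        }

        public static void main(String[] args) {
            long tenGb = 10L * 1024 * 1024 * 1024;
            System.out.println(blockCount(tenGb) + " blocks");     // prints "160 blocks"
        }
    }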
9.2. Staging
A client request to create a file does not reach the Name node immediately. Instead, the HDFS client caches the file data in a temporary local file, and application writes are transparently redirected to this local file. When the local file accumulates data worth at least one HDFS block size, the client contacts the Name node. The Name node inserts the file name into the file system hierarchy and allocates a data block for it. The Name node responds to the client with the identity of the target data node and the target data block, and the client flushes the data from its local cache to the target data block. When the file is closed, the remaining un-flushed data in the temporary file is also transferred to the data node. The client then tells the Name node that the file is closed, at which point the Name node commits the file creation operation to persistent storage. If the Name node dies before the file is closed, the file is lost.
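A sketch of the client-side staging just described; the temporary file name and the placeholder flush method are invented for illustration:

    import java.io.*;

    public class StagingSketch {
        static final long BLOCK_SIZE = 64L * 1024 * 1024;

        private final File tempFile = new File("hdfs-staging.tmp");   // local temporary file
        private OutputStream out;
        private long bytesInCurrentBlock = 0;

        void open() throws IOException { out = new FileOutputStream(tempFile); }

        /** Application writes are redirected here instead of going to the Name node right away. */
        void write(byte[] data) throws IOException {
            out.write(data);
            bytesInCurrentBlock += data.length;
            if (bytesInCurrentBlock >= BLOCK_SIZE) {
                flushBlockToDataNode();                 // only now does the client talk to the Name node
                bytesInCurrentBlock = 0;
            }
        }

        void close() throws IOException {
            out.close();
            flushBlockToDataNode();                     // ship any remaining, partially filled block
            // ...then tell the Name node the file is closed so it can persist the creation.
        }

        private void flushBlockToDataNode() {
            // Placeholder: ask the Name node for a target block and data node, then stream the buffered data there.
        }
    }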
The above approach was adopted after careful consideration of the applications that run on HDFS. These applications need to write files in a streaming fashion. If a client wrote directly to a remote file without any client-side buffering, network speed and network congestion would have a considerable impact on throughput. This approach is not without precedent: earlier distributed file systems, such as AFS, used client-side caching to improve performance. A POSIX requirement has been relaxed to achieve better data upload performance.
9.3. Replication Pipeline
When a client writes data to an HDFS file, the data is first written to a local file, as explained in the previous section. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of data, the client retrieves from the Name node a list of data nodes that will each hold a replica of that block. The client flushes the block to the first data node, which starts receiving the data in small portions (4 KB), writes each portion to its local repository, and forwards it to the second data node in the list. The second data node likewise receives each small portion, writes it to its repository, and flushes it to the third data node. Finally, the third data node writes the data to its own repository. A data node can therefore receive data from the previous node and, at the same time, forward data to the next node in the pipeline; the data is pipelined from one data node to the next.
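A sketch of the pipelined forwarding described above; the wire format and connection handling are omitted, and the method below is illustrative rather than the real data node protocol:

    import java.io.*;
    import java.net.Socket;

    public class PipelineSketch {
        /** Receive a block from 'in', write it locally, and forward each 4 KB portion to the next node (if any). */
        static void receiveAndForward(InputStream in, File localBlockFile, Socket nextNode) throws IOException {
            byte[] buffer = new byte[4 * 1024];
            OutputStream local = new FileOutputStream(localBlockFile);
            OutputStream downstream = (nextNode == null) ? null : nextNode.getOutputStream();
            int n;
            while ((n = in.read(buffer)) != -1) {
                local.write(buffer, 0, n);                                // persist this portion locally...
                if (downstream != null) downstream.write(buffer, 0, n);  // ...while forwarding it down the pipeline
            }
            local.close();
            if (downstream != null) downstream.flush();
        }
    }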
10. Accessibility
There are many ways to access HDFS from applications. Natively, HDFS provides a Java API for applications to use; a C language wrapper for this Java API is also available. An HTTP browser can also be used to browse the files of an HDFS instance. Access through WebDAV is still in progress.
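For example, a minimal read through the Java API might look like the following; the file path is a placeholder and error handling is omitted:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();          // reads the cluster configuration
            FileSystem fs = FileSystem.get(conf);              // handle to the HDFS instance
            Path file = new Path("/user/demo/data.txt");       // placeholder file

            FSDataInputStream in = fs.open(file);              // streaming read of the file
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                System.out.write(buffer, 0, n);
            }
            in.close();
        }
    }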
10.1. Command Interpreter for the Distributed File System
HDFS organizes user data in files and directories. It provides a command line interface called DFSShell that lets users interact with the data in HDFS. The syntax of these commands is similar to that of other shells users are already familiar with.
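For example, typical action/command pairs look like the following (the directory and file names are placeholders):

Create a directory named /foodir: bin/hadoop dfs -mkdir /foodir
View the contents of a file named /foodir/myfile.txt: bin/hadoop dfs -cat /foodir/myfile.txt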
DFSShell is intended for applications that use scripts to interact with the stored data.
10.2. DFS Administration
The DFSAdmin command set is used to administer an HDFS cluster. These commands are intended for an HDFS administrator only. Here are some example action/command pairs:

Put the cluster in safe mode: bin/hadoop dfsadmin -safemode enter
Generate a list of data nodes: bin/hadoop dfsadmin -report
Decommission data node datanodename: bin/hadoop dfsadmin -decommission datanodename
10.3. Browser Interface
A typical HDFS installation configures a web server to expose the HDFS namespace through a configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of its files with a web browser.
11. Space Reclamation
11.1. File Deletes and Undeletes
When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first moves it to the /trash directory. The file can be restored as long as it remains in this directory. The length of time a file stays in /trash is configurable; after this lifetime expires, the Name node deletes the file from the namespace. Deleting a file causes the corresponding data blocks to be freed. Note that there can be an appreciable delay between the user's deletion of a file and the appearance of the corresponding free space in the system.
A user can undelete a file after deleting it, as long as it remains in the trash. To undelete a file, the user can browse the /trash directory and retrieve it. The /trash directory contains only the most recent copy of each deleted file. The directory has one special feature: HDFS applies a policy to automatically delete files from it. The current default policy is to delete files that have been in /trash for more than six hours. In future releases, this policy will be configurable through a well-defined interface.
11.2. Decreasing the Replication Factor
When the replication factor of a file is reduced, the Name node selects the excess replicas that can be deleted. The next heartbeat transfers this information to the data nodes, which then remove the corresponding blocks, and the corresponding free space appears in the cluster. Once again, there may be a delay between the completion of the setReplication call and the appearance of free space in the cluster.
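For example, an application can lower a file's replication factor through the FileSystem API (the path and the value are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DecreaseReplicationExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Lower the replication factor of a (placeholder) file to 1; excess replicas are removed
            // lazily as data nodes act on instructions carried back in later heartbeats.
            fs.setReplication(new Path("/user/demo/data.txt"), (short) 1);
        }
    }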
12. References
HDFS Java API: http://lucene.apache.org/hadoop/api/
HDFS source code: http://lucene.apache.org/hadoop/version_control.html
Open questions: 1) Name node stability; 2) safety of concurrent access by multiple processes; 3) how to handle small files.