HDFS Architecture Guide 2.6.0
This article is a translation of the original text at the link below:
http://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
Brief introduction
HDFS is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences are also significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. It is now a subproject of Apache Hadoop. The project link is http://hadoop.apache.org/hdfs/.
Assumptions and goals
Hardware failure
Hardware failure is the norm rather than the exception. An HDFS instance may consist of a very large number of server machines, each storing part of the file system's data. With so many components, each with a non-trivial probability of failure, some component of HDFS is likely to be non-functional at any given time. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
Streaming data access
Applications that run on HDFS need streaming access to their data sets. They are not general-purpose applications that typically run on general-purpose file systems. HDFS is designed for batch processing rather than interactive use; the emphasis is on high throughput of data access rather than low latency. Many of the hard requirements imposed by POSIX are not needed by applications targeted for HDFS, so POSIX semantics have been traded in a few key areas to increase data throughput rates.
Large data sets
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. HDFS is therefore tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster, and a single instance should support tens of millions of files.
Simple coherency model
HDFS applications need a write-once-read-many access model for files. Once a file is created, written, and closed, it need not be changed. This assumption simplifies data coherency issues and enables high-throughput data access. A MapReduce application or a web crawler application fits this model perfectly. There is a plan to support appending writes to files in the future.
"Mobile computing is more cost effective than moving data"
A computation requested by an application is much more efficient if it is executed near the data it operates on, especially when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. It is often better to migrate the computation closer to where the data is located than to move the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data resides.
Portability across multiple hardware and software platforms
HDFS has been designed to be easily portable from one platform to another. This facilitates the widespread adoption of HDFS as the platform of choice for a large set of applications.
Namenode and Datanodes
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories; it also determines the mapping of blocks to DataNodes. The DataNodes serve read and write requests from the file system's clients, and also perform block creation, deletion, and replication upon instruction from the NameNode.
The NameNode and DataNode are pieces of software designed to run on commodity machines, which typically run a GNU/Linux operating system. HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. The use of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software, while each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine, but such deployments are rare.
The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. User data is never stored on the NameNode.
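As an illustration of the client/NameNode split described above, the short Java sketch below (not part of the original guide) lists the root directory of a cluster through the FileSystem API. The namespace query is answered entirely by the NameNode; no DataNode is contacted because no file data is read. The URI hdfs://namenode-host:8020 is a placeholder for a real cluster address.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListDirectory {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Connect to the NameNode; the hostname and port are assumptions for this sketch.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);
    // listStatus is a pure namespace operation, served by the NameNode.
    for (FileStatus status : fs.listStatus(new Path("/"))) {
      System.out.println(status.getPath() + "  replication=" + status.getReplication());
    }
    fs.close();
  }
}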
The file system namespace
HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems: one can create, delete, and rename files, or move a file from one directory to another. HDFS does not yet implement user quotas, and it does not support hard links or soft links, although the architecture does not preclude implementing these features.
The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that HDFS should maintain. The number of copies of a file is called the replication factor of that file, and this information is stored by the NameNode.
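Since the replication factor is per-file metadata held by the NameNode, a client can change it through the FileSystem API. The sketch below is a minimal illustration, assuming a file /foodir/myfile.txt already exists; the path is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Ask the NameNode to record a replication factor of 3 for this file.
    // The actual copying or deletion of block replicas happens asynchronously.
    boolean accepted = fs.setReplication(new Path("/foodir/myfile.txt"), (short) 3);
    System.out.println("replication change accepted: " + accepted);
    fs.close();
  }
}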
Data replication
HDFS is designed to reliably store very large files across machines in a large cluster. Each file is stored as a sequence of blocks; all blocks in a file except the last one are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file at creation time, and it can be changed later. Files in HDFS are write-once and have strictly one writer at any time.
The NameNode makes all decisions regarding the replication of blocks. It periodically receives a Heartbeat and a Blockreport from each DataNode in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly; a Blockreport contains a list of all the blocks on that DataNode.
Replica placement: the first baby steps
The placement of replicas is critical to HDFS reliability and performance. Optimized replica placement distinguishes HDFS from most other distributed file systems; it is a feature that needs a lot of tuning and experience. The purpose of the rack-aware replica placement policy is to improve data reliability and availability and to conserve network bandwidth. The current implementation is a first effort in this direction. The short-term goal is to validate it on production systems, learn more about its behavior, and build a foundation for testing and researching more sophisticated policies.
Large HDFS instances run on a cluster of computers that spans many racks. Communication between two nodes on different racks has to go through switches. In most cases, the network bandwidth between machines in the same rack is greater than the network bandwidth between machines in different racks.
The NameNode determines the rack id each DataNode belongs to via the Hadoop rack awareness process. A simple but non-optimal policy is to place each replica on a different rack. This prevents data loss when an entire rack fails and allows the bandwidth of multiple racks to be used when reading data. This policy evenly distributes replicas in the cluster, which makes it easy to balance load on component failure. However, it increases the cost of writes, because each write needs to transfer blocks to multiple racks.
For the common case, when the replication factor is three, HDFS's placement policy is to put one replica on the local node, another on a node in a different (remote) rack, and the last on a different node in that same remote rack. This policy cuts the inter-rack write traffic, which generally improves write performance. The chance of a rack failure is far less than that of a node failure, so this policy does not compromise data reliability and availability. It also reduces the aggregate network bandwidth used when reading data, since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file are not evenly distributed across the racks: one third of the replicas are on one node, two thirds of the replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.
Selection of Replicas
To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from the replica that is closest to the reader: a replica on the same node is preferred, then one on the same rack, then one in the same data center.
Safe Mode
On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur while the NameNode is in Safemode. The NameNode receives Heartbeat and Blockreport messages from the DataNodes; a Blockreport contains the list of data blocks that a DataNode is hosting. Every block has a specified minimum number of replicas, and a block is considered safely replicated once that minimum number of replicas has been reported. After a configurable percentage of safely replicated blocks have checked in (plus an additional 30 seconds), the NameNode exits Safemode. It then determines which blocks, if any, still have fewer than the specified number of replicas, and replicates those blocks to other DataNodes.
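For completeness, the following sketch (an illustration, not taken from the original guide) shows one way a Java client can ask whether the NameNode is currently in Safemode, using the HDFS-specific DistributedFileSystem class; administrators more commonly use the dfsadmin -safemode command shown later in this document.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;

public class SafeModeCheck {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    if (fs instanceof DistributedFileSystem) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      // SAFEMODE_GET only queries the current state; it does not change it.
      boolean inSafeMode = dfs.setSafeMode(SafeModeAction.SAFEMODE_GET);
      System.out.println("NameNode in safe mode: " + inSafeMode);
    }
    fs.close();
  }
}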
Persistence of File system metadata
The NameNode stores the HDFS namespace. It uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. For example, creating a new file in HDFS causes the NameNode to insert a record into the EditLog, and changing the replication factor of a file inserts another record. The NameNode stores the EditLog in its local file system. The entire file system namespace, including the mapping of blocks to files and the file system properties, is stored in a file called the FsImage, which is also kept in the NameNode's local file system.
The NameNode keeps an image of the entire file system namespace and block map in memory. This key metadata is designed to be compact, so that, for example, a NameNode with 4 GB of RAM is sufficient to support a huge number of files and directories. When the NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory FsImage, flushes out a new FsImage to disk, and then discards the old EditLog, because its transactions have been applied to the persistent FsImage. This process is called a checkpoint. In the current implementation, a checkpoint only occurs when the NameNode starts up; periodic checkpointing is planned for a future release.
The DataNode stores HDFS data as files in its local file system. The DataNode has no knowledge of HDFS files; it simply stores each block of HDFS data in a separate file. The DataNode does not create all files in the same directory; instead, it uses a heuristic to determine the optimal number of files per directory. When a DataNode starts up, it scans its local file system, generates a list of all the HDFS data blocks it holds, and sends this list to the NameNode: this is the Blockreport.
Communication protocols
All HDFS communication protocols are layered on top of TCP/IP. A client establishes a connection to a configurable TCP port on the NameNode machine and speaks the ClientProtocol with the NameNode. The DataNodes talk to the NameNode using the DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both protocols. By design, the NameNode never initiates an RPC; it only responds to RPC requests issued by DataNodes or clients.
Robustness
The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are NameNode failures, DataNode failures, and network partitions.
Data disk failure, heartbeat, and re-replication
Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of Heartbeats, marks those DataNodes as dead, and stops forwarding new IO requests to them. Any data registered on a dead DataNode is no longer available to HDFS. The death of a DataNode may cause the replication factor of some blocks to fall below their specified value, in which case the NameNode initiates re-replication of those blocks. The need for re-replication may arise for several reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.
Cluster rebalancing
The HDFS architecture is compatible with data rebalancing schemes. Such a scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold, or dynamically create additional replicas of a file when demand for it suddenly increases. These schemes are not yet implemented.
Data integrity
A block of data fetched from a DataNode may arrive corrupted, because of faults in the storage device, network problems, or buggy software. HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file. When a client retrieves file contents, it verifies that the received data matches the stored checksums; if not, the client can retrieve that block from another DataNode that has a replica of it.
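The checksums described above are also exposed to clients: the FileSystem API can return a whole-file checksum derived from the stored per-block checksums. A minimal sketch, with a placeholder path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowChecksum {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // For HDFS this is typically an MD5-of-MD5-of-CRC32 value computed from
    // the checksums the DataNodes already keep alongside each block.
    FileChecksum checksum = fs.getFileChecksum(new Path("/foodir/myfile.txt"));
    System.out.println(checksum);
    fs.close();
  }
}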
Metadata disk failure
The FsImage and the EditLog are central data structures of HDFS; corruption of these files can render the HDFS instance non-functional. For this reason, the NameNode can be configured to maintain multiple copies of the FsImage and EditLog. Any update to either file causes all copies to be updated synchronously. This synchronous updating may degrade the rate of namespace transactions the NameNode can support, but the degradation is acceptable because HDFS applications are data-intensive rather than metadata-intensive. When the NameNode restarts, it selects the latest consistent FsImage and EditLog to use.
As of 2.6, the NameNode machine is still a single point of failure for an HDFS cluster.
Snapshot
HDFS does not currently support snapshots.
Data organization
Data block
HDFS is designed to support very large files. Applications that use HDFS deal with large data sets: they write their data once, read it one or more times, and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and, if possible, each chunk resides on a different DataNode.
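Because the block size is a per-file setting, a client can choose it when the file is created. The sketch below is an illustration only, using the placeholder path /foodir/bigfile.dat; it creates a file with a 64 MB block size and a replication factor of 3 through one of the FileSystem create overloads.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithBlockSize {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    long blockSize = 64L * 1024 * 1024;            // 64 MB blocks
    FSDataOutputStream out = fs.create(
        new Path("/foodir/bigfile.dat"),
        true,                                      // overwrite if it exists
        4096,                                      // client buffer size in bytes
        (short) 3,                                 // replication factor
        blockSize);
    out.writeBytes("example payload\n");
    out.close();                                   // the file becomes visible on close
    fs.close();
  }
}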
Staging
A client request to create a file does not reach the NameNode immediately. Instead, the HDFS client first caches the file data in a temporary local file, and the application's writes are transparently redirected to this temporary file. When the local file accumulates data worth at least one HDFS block size, the client contacts the NameNode. The NameNode inserts the file name into the file system hierarchy and allocates a data block for it, then replies to the client with the identity of the DataNode and the destination data block. The client flushes the block of data from the local temporary file to that DataNode. When the file is closed, the remaining un-flushed data in the temporary file is transferred to the DataNode, and the client then tells the NameNode that the file is closed. At this point, the NameNode commits the file creation operation to persistent storage. If the NameNode dies before the file is closed, the file is lost.
The above approach was adopted after careful consideration of the applications that run on HDFS, which need streaming writes to files. If a client wrote to a remote file directly, without any client-side buffering, network speed and congestion would considerably reduce write throughput. This approach is not without precedent: earlier distributed file systems, such as AFS, have used client-side caching to improve performance. A POSIX requirement has been relaxed to achieve higher data transfer performance.
Replication Pipeline
When a client writes data to an HDFS file, the data is first written to a local file as explained above. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of data, the client retrieves from the NameNode a list of DataNodes that will host replicas of that block. The client flushes the block to the first DataNode, which starts receiving the data in small portions, writes each portion to its local repository, and transfers that portion to the second DataNode in the list. The second DataNode does the same and forwards to the third, which finally writes the data to its local repository. Thus, a DataNode can receive data from the previous node in the pipeline while simultaneously forwarding it to the next node: the data is pipelined from one DataNode to the next.
Accessibility
Applications can access HDFS data in many different ways. Natively, HDFS provides a FileSystem Java API for applications to use (http://hadoop.apache.org/docs/current/api/). A C language wrapper for this Java API is also available. In addition, an HTTP browser can be used to browse the files of an HDFS instance. Work is in progress to expose HDFS through the WebDAV protocol.
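The fragment below is a minimal sketch of the FileSystem Java API mentioned above: it opens an HDFS file and streams its contents to standard output, the programmatic counterpart of the dfs -cat shell command shown in the next section. The path is a placeholder.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CatFile {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    InputStream in = fs.open(new Path("/foodir/myfile.txt"));
    try {
      IOUtils.copyBytes(in, System.out, 4096, false);  // stream the file to stdout
    } finally {
      IOUtils.closeStream(in);
      fs.close();
    }
  }
}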
FS Shell
HDFS organizes user data into files and directories and provides a command line interface called the FS shell that lets users interact with the data. The syntax of this command set is similar to other shells that users are already familiar with. Some example action/command pairs:
Create a directory /foodir: bin/hadoop dfs -mkdir /foodir
Remove a directory /foodir: bin/hadoop dfs -rmr /foodir
View the contents of a file /foodir/myfile.txt: bin/hadoop dfs -cat /foodir/myfile.txt
The FS shell is targeted at applications that need a scripting language to interact with the stored data.
Dfsadmin
The DFSAdmin command set is used for administering an HDFS cluster. These commands are used only by HDFS administrators. Some example action/command pairs:
Put the cluster in Safemode: bin/hadoop dfsadmin -safemode enter
Generate a list of DataNodes: bin/hadoop dfsadmin -report
Recommission or decommission DataNode(s): bin/hadoop dfsadmin -refreshNodes
Browser interface
A typical HDFS installation configures a web server to expose the HDFS namespace through a configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of its files using a web browser.
Space reclamation
File deletion and Undelete
When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash, and the amount of time it stays there is configurable. After this time expires, the NameNode deletes the file from the HDFS namespace, which causes the blocks associated with the file to be freed. Note that there can be an appreciable delay between the time a file is deleted by the user and the time the corresponding space becomes free in HDFS.
A user can undelete a file as long as it remains in the /trash directory by navigating to /trash and retrieving it. The fs.trash.interval parameter in core-site.xml controls how long files are kept in /trash.
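From the Java API, a deletion that honours the trash behaviour described above can be performed with the org.apache.hadoop.fs.Trash helper rather than a plain delete call, which bypasses the trash. The sketch below is an illustration only, assuming trash is enabled (fs.trash.interval > 0) and using a placeholder path.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class MoveToTrash {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Rename the file into the trash directory instead of removing it outright,
    // so it can still be recovered until its trash interval expires.
    Trash trash = new Trash(fs, conf);
    boolean moved = trash.moveToTrash(new Path("/foodir/myfile.txt"));
    System.out.println("moved to trash: " + moved);
    fs.close();
  }
}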
Reducing the replication factor
When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted and passes this information to the DataNodes on the next Heartbeat. The DataNodes then remove the corresponding blocks, and the freed space appears in the cluster.
References
Hadoop JavaDoc API.
HDFS source code: http://hadoop.apache.org/version_control.html