HDFS is Hadoop's implementation of a distributed file system. It is designed to store massive amounts of data and to provide access to that data to a large number of clients distributed across a network. To use HDFS successfully, you must first understand how it is implemented and how it works.
HDFS Architecture
HDFS's design is based on the design of the Google File System (GFS). Its implementation addresses a number of problems that are present in many other distributed file systems, such as the Network File System (NFS). Specifically, the implementation of HDFS addresses the following:
The ability to store very large amounts of data (terabytes or petabytes). The HDFS design spreads data across a large number of machines and supports much larger file sizes than other distributed file systems such as NFS.
The ability to store data reliably. HDFS uses data replication to cope with individual machines in the cluster failing or becoming inaccessible.
Better integration with Hadoop's MapReduce. HDFS allows data to be read and processed locally (data locality is discussed in detail in Chapter 4).
HDFS's design for scalability and high performance also has its costs. HDFS is suitable only for a particular class of applications; it is not a general-purpose distributed file system. A number of additional decisions and trade-offs shape the architecture and implementation of HDFS, including the following:
HDFS is optimized for high-speed streaming reads, at the cost of reduced random seek performance. This means that if your application reads data from HDFS, it should avoid seeks, or at least minimize their number. Sequential reads are the preferred way to access HDFS files.
HDFS supports only a limited set of file operations (write, delete, append, and read) and does not support updates. It assumes that data is written to HDFS once and then read many times.
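These operations map directly onto Hadoop's FileSystem client API. The following is a minimal sketch of the write-once, read-many pattern; the class name and file path are hypothetical, and it assumes a client whose configuration points at an HDFS cluster (and, for append, a cluster with appends enabled):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsOperations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml/hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);       // connects to the configured file system
            Path path = new Path("/tmp/example.txt");   // hypothetical path

            // Write once: create the file and stream data into it sequentially.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeUTF("hello, hdfs");
            }

            // Read many times: open and read sequentially (the preferred access pattern).
            try (FSDataInputStream in = fs.open(path)) {
                System.out.println(in.readUTF());
            }

            // Append is supported (if enabled on the cluster); in-place updates are not.
            try (FSDataOutputStream out = fs.append(path)) {
                out.writeUTF("more data");
            }

            // Delete the file (second argument: recursive delete for directories).
            fs.delete(path, false);

            fs.close();
        }
    }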
HDFS does not provide a mechanism for local data caching. The overhead of caching is large enough that data should simply be reread from the source, which is not a problem for applications that mostly read large data files sequentially.
HDFS is implemented as a block-structured file system. As shown in Figure 2-1, individual files are split into blocks of a fixed size, which are stored across a Hadoop cluster. A file can consist of several blocks stored on different DataNodes (individual machines in the cluster), with the DataNode for each block chosen more or less at random. As a result, accessing a file usually requires access to multiple DataNodes, which means that HDFS supports file sizes far larger than the disk capacity of any single machine.
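One way to observe this block structure from a client is to ask the NameNode for a file's block locations through FileSystem.getFileBlockLocations(). A minimal sketch (the path is taken from the command line; the class name is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path(args[0]);                 // file to inspect
            FileStatus status = fs.getFileStatus(path);

            // Ask the NameNode which DataNodes hold each block of the file.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }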
A DataNode stores each HDFS data block as a separate file on its local file system, with no knowledge of the HDFS files themselves. To further improve throughput, a DataNode does not create all files in the same directory. Instead, it uses heuristics to determine the optimal number of files per directory and creates subdirectories as appropriate.
One of the requirements for such a block-structured file system is the ability to store, manage, and access file metadata (information about files and blocks) reliably, and to provide fast access to the metadata store. Unlike HDFS files themselves (which follow a write-once, read-many access model), the metadata structures can be modified by a large number of clients concurrently, and it is important to keep this information synchronized at all times. HDFS solves this problem by introducing a dedicated machine, called the NameNode, that holds all the metadata for the entire cluster's file system. This means that HDFS implements a master/slave architecture. A single NameNode (the master server) manages the file system namespace and regulates client access to files. The existence of a single master node in the cluster greatly simplifies the system architecture: the NameNode acts as the sole arbiter and repository of all HDFS metadata.
Because the amount of metadata per file is relatively small (only the file name, access permissions, and the locations of its blocks), the NameNode keeps all of the metadata in memory, which guarantees fast random access. The metadata store is designed to be compact, so a NameNode with 4GB of RAM can support a very large number of files and directories.
The metadata store is also persistent. The entire file system namespace (including the mapping of blocks to files and the file system properties) is contained in a file named FsImage, which is stored in the NameNode's local file system. The NameNode also uses a transaction log to persistently record every change that occurs to the file system metadata (the metadata store). This log is stored in the EditLog file on the NameNode's local file system.
Secondary NameNode
As mentioned earlier, the implementation of HDFS is based on a master/slave architecture. On the one hand, this approach greatly simplifies the overall HDFS architecture. On the other hand, it creates a single point of failure: the loss of the NameNode effectively means the loss of HDFS. To alleviate this problem to some extent, Hadoop implements the Secondary NameNode.
The Secondary NameNode is not a "standby NameNode." It cannot take over the functions of the primary NameNode. Instead, it provides a checkpointing mechanism for the primary NameNode. As described earlier, the NameNode persists the current file system state in two on-disk data structures: an image file and an edit log. The image file represents the HDFS metadata at some point in time, and the edit log is a transaction log (comparable to the log in a database architecture) that records every change to the file system metadata since the image file was created.
During startup (or restart), the NameNode reconstructs its current state by reading the image file and then replaying the edit log. Obviously, the larger the edit log, the longer it takes to replay it, and thus the longer the NameNode takes to start. To improve NameNode startup performance, the edit log is periodically rolled into the image: applying the edit log to the existing image produces a new image file. This operation is quite resource intensive. To minimize the impact of checkpoint creation, and to keep the NameNode's functionality simple, checkpoints are typically created by a Secondary NameNode daemon running on a separate machine.
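The locations of the image file and edit log, and how often checkpoints are taken, are controlled by cluster configuration. The sketch below simply reads those settings; the property names shown (dfs.namenode.name.dir and dfs.namenode.checkpoint.period, with dfs.name.dir and fs.checkpoint.period as the older names) are assumptions about the Hadoop release in use, and the class name is hypothetical:

    import org.apache.hadoop.conf.Configuration;

    public class ShowNameNodePersistenceSettings {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.addResource("hdfs-site.xml");   // load the cluster's HDFS settings, if present

            // Directory (or directories) where the NameNode keeps FsImage and EditLog.
            // "dfs.namenode.name.dir" is the Hadoop 2 name; older releases use "dfs.name.dir".
            String nameDirs = conf.get("dfs.namenode.name.dir", conf.get("dfs.name.dir"));

            // How often a checkpoint is taken, in seconds
            // ("fs.checkpoint.period" on older releases).
            long checkpointPeriod = conf.getLong("dfs.namenode.checkpoint.period",
                    conf.getLong("fs.checkpoint.period", 3600));

            System.out.println("NameNode metadata dirs: " + nameDirs);
            System.out.println("checkpoint period (s):  " + checkpointPeriod);
        }
    }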
As a result of checkpoint creation, the Secondary NameNode holds a (stale) copy of the primary NameNode's persistent state, in the image-file format described previously. When the edit log is kept relatively small, the Secondary NameNode can be used to restore the state of the file system. Be aware that in this case a certain amount of metadata (and the corresponding data) is lost, because the most recent changes, recorded only in the edit log, are not available.
There is ongoing work to create a true backup NameNode that can take over when the primary node fails. The HDFS High Availability implementation introduced in Hadoop 2 is discussed later in this chapter.
To keep the NameNode's memory footprint manageable, the default size of an HDFS block is 64MB, orders of magnitude larger than the block size of most other block-structured file systems. An additional advantage of large data blocks is that they allow HDFS to store large amounts of data sequentially on disk, which supports high-speed streaming reads.
Smaller blocks in HDFS
One of the misconceptions about Hadoop is that a smaller block (one containing less data than the block size) still occupies a full block on the file system. That is not the case: a smaller block occupies only as much disk space as it needs.
This does not mean, however, that a large number of small files can use HDFS effectively. Regardless of a block's size, its metadata occupies exactly the same amount of memory in the NameNode. As a result, a large number of small HDFS files (smaller than the block size) consumes a large amount of NameNode memory, which negatively impacts HDFS scalability and performance.
In a real-world system it is almost impossible to avoid smaller HDFS blocks; it is quite likely that a given HDFS file will occupy a number of complete blocks plus one smaller block. Is this a problem? Given that most HDFS files are fairly large, the number of smaller blocks in the overall system is relatively small, so this is usually not an issue.
A disadvantage of this file organization is that a file requires multiple DataNodes to serve it, which means the file becomes unavailable if any one of those machines fails. To avoid this problem, HDFS replicates each block across multiple machines (three, by default).
Data replication in HDFS is implemented as part of the write operation, in the form of a data pipeline. When a client writes data to an HDFS file, the data is first written to a local file. When the local file accumulates a full block of data, the client asks the NameNode for the list of DataNodes that will hold replicas of that block. The client then writes the block from its local store to the first DataNode in 4KB portions (see Figure 2-1). That DataNode stores the received data in its local file system and forwards it to the next DataNode in the list. Each subsequent DataNode repeats the same action until the last node in the replica set receives the data; that final DataNode stores the data locally without forwarding it further.
If a DataNode fails while a block is being written, it is removed from the pipeline. In this case, once the write of the current block completes, the NameNode re-replicates the block to make up for the replica missing because of the DataNode failure. When the file is closed, any remaining data in the temporary local file is pipelined to the DataNodes, and the client then informs the NameNode that the file is closed. At this point, the NameNode commits the file creation operation to its persistent store. If the NameNode crashes before the file is closed, the file is lost.
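A client that cannot afford to lose data before the file is closed can ask for stronger guarantees. In Hadoop 2, FSDataOutputStream exposes hflush() and hsync() for this purpose; the sketch below (hypothetical path and class name) shows both:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FlushExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path("/tmp/flush-example.txt");   // hypothetical path

            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeBytes("first record\n");
                // hflush() pushes the buffered data through the pipeline so that it is
                // visible to readers even though the file is not yet closed.
                out.hflush();

                out.writeBytes("second record\n");
                // hsync() additionally asks the DataNodes to flush the data to disk.
                out.hsync();
            }   // close() completes the file; the NameNode then commits its creation

            fs.close();
        }
    }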
Default values for the block size and replication factor are specified in the Hadoop configuration, but they can be overridden on a per-file basis. An application can specify the block size and the number of replicas (the replication factor) for a particular file when the file is created.
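With the FileSystem API, these per-file overrides are passed at creation time. A minimal sketch, with hypothetical path, class name, and values:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CreateWithCustomBlocks {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path("/data/custom-blocks.dat");   // hypothetical path

            int bufferSize = 4096;                 // client-side I/O buffer
            short replication = 2;                 // override the configured default
            long blockSize = 128L * 1024 * 1024;   // 128MB blocks for this file only

            try (FSDataOutputStream out =
                         fs.create(path, true, bufferSize, replication, blockSize)) {
                out.writeBytes("payload\n");
            }

            // The replication factor (unlike the block size) can also be changed later.
            fs.setReplication(path, (short) 3);
            fs.close();
        }
    }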
One of the most powerful features of HDFS is the optimization of replica placement, which is critical to HDFS reliability and performance. The NameNode is responsible for all decisions related to block replication, and it periodically receives a heartbeat (every 3 seconds) and a block report from each DataNode. Heartbeats confirm that a DataNode is functioning normally, and a block report lets the NameNode verify that the list of blocks on a DataNode matches the NameNode's own information. One of the first things a DataNode does at startup is send a block report to the NameNode, which allows the NameNode to quickly build a picture of the block distribution across the entire cluster.
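Both intervals are cluster-side settings. The sketch below only reads them from the loaded configuration; the property names (dfs.heartbeat.interval in seconds, dfs.blockreport.intervalMsec in milliseconds), the fallback values, and the class name are assumptions about the Hadoop release in use:

    import org.apache.hadoop.conf.Configuration;

    public class ShowHeartbeatSettings {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.addResource("hdfs-site.xml");   // load the cluster's HDFS settings, if present

            // DataNode-to-NameNode heartbeat interval, in seconds.
            long heartbeatSec = conf.getLong("dfs.heartbeat.interval", 3);

            // Full block report interval, in milliseconds.
            long blockReportMs = conf.getLong("dfs.blockreport.intervalMsec", 21600000L);

            System.out.println("heartbeat interval (s): " + heartbeatSec);
            System.out.println("block report interval (ms): " + blockReportMs);
        }
    }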
An important characteristic of HDFS data replication is rack awareness. Large HDFS instances run on clusters of computers that typically span many racks. Usually, the network bandwidth (and the associated network performance) between machines in the same rack is greater than the network bandwidth between machines on different racks.
The NameNode determines the rack ID that each DataNode belongs to through the Hadoop rack-awareness process. A simple policy is to place individual replicas on separate racks. This policy prevents data loss when an entire rack fails and distributes replicas evenly across the cluster. It also allows the bandwidth of multiple racks to be used when reading data. However, write performance suffers, because in this case a write operation must transfer the block to multiple racks.
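To check how the placement policy laid out a particular file, the block-location listing shown earlier can be extended with BlockLocation.getTopologyPaths(), which returns rack-qualified replica locations (the class name is hypothetical, and the exact path format depends on the cluster's topology configuration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowReplicaRacks {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path(args[0]));

            // For each block, print the rack-qualified location of every replica
            // so you can verify how the placement policy distributed them.
            for (BlockLocation block :
                    fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println(String.join(" ", block.getTopologyPaths()));
            }
            fs.close();
        }
    }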
An optimization of the rack-aware policy is to reduce the number of racks used to fewer than the number of replicas, in order to cut cross-rack write traffic (and thereby improve write performance). For example, when the replication factor is 3, two replicas are placed on one rack and the third on a different rack.
To minimize global bandwidth consumption and read latency, HDFS attempts to satisfy a read request from the replica closest to the reader. If a replica exists on the same rack as the reader node, that replica is used to satisfy the read request.
As mentioned earlier, each DataNode periodically sends a heartbeat message to the NameNode (see Figure 2-1), which the NameNode uses to detect DataNode failures (based on the absence of heartbeat messages). The NameNode marks a DataNode that has not sent a recent heartbeat as dead and stops dispatching new I/O requests to it. Because the data on a dead DataNode is no longer available to HDFS, the death of a DataNode may cause the replication factor of some blocks to fall below their configured value. The NameNode keeps track of the blocks that need to be re-replicated and initiates replication whenever necessary.
Similar to most other existing file systems, HDFS supports a traditional hierarchical file organization. It supports creating and deleting files within directories, moving files between directories, and so on. It also supports user quotas and read/write permissions.
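These namespace operations are also available through the FileSystem API. A minimal sketch with hypothetical paths, permissions, and class name:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class NamespaceOperations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            Path dir = new Path("/user/demo/reports");           // hypothetical paths
            Path oldFile = new Path("/user/demo/report.txt");
            Path newFile = new Path("/user/demo/reports/report.txt");

            fs.mkdirs(dir);                                       // create a directory tree
            if (fs.exists(oldFile)) {
                fs.rename(oldFile, newFile);                      // move a file between directories
            }
            fs.setPermission(dir, new FsPermission((short) 0750)); // restrict read/write access

            for (FileStatus status : fs.listStatus(dir)) {        // list the directory contents
                System.out.println(status.getPath() + " " + status.getPermission());
            }
            fs.close();
        }
    }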