The main limitation of current HDFS implementations is the single namenode. Because all file metadata is kept in the namenode's memory, the amount of memory on that machine determines the number of files a Hadoop cluster can hold. To overcome the memory limit of a single namenode and to scale the name service horizontally, Hadoop 0.23 introduces HDFS Federation, which is based on multiple independent namenodes/namespaces.
The main advantages of HDFS Federation are the following:
Namespace scalability: HDFS cluster storage scales horizontally, but the namespace does not. Adding more namenodes to the cluster extends the namespace, which benefits large deployments (or deployments with many small files).
Performance: the throughput of file system operations is limited by the single namenode. Adding more namenodes to the cluster scales the throughput of file system read/write operations.
Isolation: a single namenode provides no isolation in a multiuser environment. An experimental application can overload the namenode and slow down production-critical applications. With multiple namenodes, different categories of applications and users can be isolated into different namespaces.
As shown in Figure 2-5, HDFS Federation is implemented as a collection of independent namenodes that require no coordination among themselves. All namenodes use the datanodes as common storage for blocks. Each datanode registers with every namenode in the cluster, periodically sends heartbeats and block reports, and handles commands from the namenodes.
Each namespace operates on its own collection of blocks, called a block pool. Although a block pool is dedicated to a specific namespace, its actual data can be placed on any datanode in the cluster. Each block pool is managed independently, which allows a namespace to generate block IDs for new blocks without coordinating with the other namespaces. The failure of one namenode does not prevent the datanodes from serving the remaining namenodes in the cluster.
A namespace together with its block pool is called a namespace volume, a self-contained unit of management. When a namenode/namespace is deleted, the corresponding block pool on the datanodes is deleted as well. During a cluster upgrade, each namespace volume is upgraded as a unit.
The HDFS Federation configuration is backward compatible and allows an existing single-namenode configuration to keep working without any changes. The new configuration is designed so that all nodes in the cluster share an identical configuration, without deploying different configurations based on node type.
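To make this concrete, the following is a minimal sketch of a federated configuration, expressed here through the org.apache.hadoop.conf.Configuration API (in production these properties normally live in hdfs-site.xml). The property names follow later Apache Hadoop releases (0.23 used dfs.federation.nameservices for the nameservice list), and the nameservice IDs and hostnames are placeholders:

    import org.apache.hadoop.conf.Configuration;

    public class FederationConfigSketch {
        // Describes two independent namespaces, ns1 and ns2, each served
        // by its own namenode.
        public static Configuration federatedConf() {
            Configuration conf = new Configuration();
            conf.set("dfs.nameservices", "ns1,ns2");
            conf.set("dfs.namenode.rpc-address.ns1", "namenode1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.ns2", "namenode2.example.com:8020");
            // The same configuration is deployed to every node; each datanode
            // reads the full list and registers with all namenodes above.
            return conf;
        }
    }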
Although HDFS Federation solves the HDFS scalability problem, it does not solve the namenode reliability problem (in fact, it makes things worse: the probability that some namenode fails is now higher). Figure 2-6 shows the new HDFS high-availability architecture, which contains two separate machines configured as namenodes, exactly one of which is active at any point in time. The active namenode is responsible for all client operations in the cluster, while the other (the standby) simply maintains enough state to provide fast failover when needed. To keep the state of the two nodes synchronized, the implementation requires that both have access to a directory on a shared storage device.
When the active node performs any namespace modification, it writes a record of the change to an edit log file located in the shared directory. The standby node continuously watches this directory and applies the changes to its own namespace. During a failover, the standby makes sure it has read all of the changes before promoting itself to the active state.
To support fast failover, the standby node also needs up-to-date information about block locations in the cluster. This is achieved by configuring the datanodes to send block location information and heartbeats to both namenodes.
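As an illustration, a minimal sketch of how such a shared-storage HA pair might be described is shown below, again through the Configuration API. The property names match later Apache Hadoop releases, and the nameservice ID, hostnames, and shared-edits path are placeholders:

    import org.apache.hadoop.conf.Configuration;

    public class HaConfigSketch {
        // Describes one logical nameservice, ns1, backed by an
        // active/standby pair of namenodes (nn1 and nn2).
        public static Configuration haConf() {
            Configuration conf = new Configuration();
            conf.set("dfs.nameservices", "ns1");
            conf.set("dfs.ha.namenodes.ns1", "nn1,nn2");
            conf.set("dfs.namenode.rpc-address.ns1.nn1", "nn1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.ns1.nn2", "nn2.example.com:8020");
            // Directory on the shared storage device (e.g., NFS): the active
            // namenode writes its edit log here and the standby tails it.
            conf.set("dfs.namenode.shared.edits.dir", "file:///mnt/nfs/shared-edits");
            // Client-side helper that probes both namenodes to find the active one.
            conf.set("dfs.client.failover.proxy.provider.ns1",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
            return conf;
        }
    }

Because clients address the cluster by the logical nameservice name rather than by a specific host, a failover between nn1 and nn2 is transparent to them.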
Currently, only manual failover is supported. A core Hadoop patch from Hortonworks, committed to the trunk and the 1.1 branch, eliminates this limitation. That solution is based on the Hortonworks failover controller, which automatically selects an active namenode.
HDFS provides very strong and flexible support for storing large amounts of data. Special file types such as SequenceFile are ideally suited to supporting MapReduce implementations, and MapFile and its derived types (SetFile, ArrayFile, and BloomMapFile) perform well for fast data access.
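As a brief illustration, the sketch below writes a SequenceFile and reads it back sequentially through the standard org.apache.hadoop.io API; the path and record contents are made up. MapFile offers an analogous Writer/Reader interface that additionally maintains an index, enabling fast keyed lookups via MapFile.Reader.get():

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/tmp/demo.seq"); // hypothetical location

            // Write key/value records sequentially.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, IntWritable.class, Text.class);
            try {
                for (int i = 0; i < 100; i++) {
                    writer.append(new IntWritable(i), new Text("record-" + i));
                }
            } finally {
                writer.close();
            }

            // Read the records back in write order.
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
            try {
                IntWritable key = new IntWritable();
                Text value = new Text();
                while (reader.next(key, value)) {
                    System.out.println(key + "\t" + value);
                }
            } finally {
                reader.close();
            }
        }
    }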
However, HDFS supports only a limited set of access patterns: write, delete, and append. Although updates can technically be implemented as overwrites, overwriting works only at the file level, which makes this approach expensive in most cases. In addition, HDFS is designed specifically to support large sequential reads, which means that random access to data incurs a significant performance overhead. Finally, HDFS is ill suited to small files. Although it technically supports them, each file consumes namenode memory, so large numbers of small files create significant namenode-memory overhead and limit the number of files the cluster can hold.
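The sketch below demonstrates the three supported operations through the FileSystem API; the path is a placeholder, and note that on older clusters append must be explicitly enabled (the dfs.support.append property):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AccessPatternDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/tmp/demo.log"); // hypothetical location

            // Write: create a file and stream data into it sequentially.
            FSDataOutputStream out = fs.create(path);
            out.writeBytes("first batch of records\n");
            out.close();

            // Append: the only way to grow an existing file.
            FSDataOutputStream appendOut = fs.append(path);
            appendOut.writeBytes("second batch of records\n");
            appendOut.close();

            // Delete: there is no in-place update; "updating" a record means
            // rewriting the whole file.
            fs.delete(path, false);
        }
    }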
To overcome many of these limitations, Hadoop introduced a more flexible data storage and access model in the form of HBase.