Hadoop 2.0 NameNode HA and Federation in Practice


I. Background

In the second half of 2012, riding the trend toward cloud computing, we began building a Hadoop-based technical solution for backing up and querying the historical transaction data of a large state-owned bank. Because of the nature of the industry, the customer has very high availability requirements, while HDFS had long suffered from a single point of failure. This changed when Apache Hadoop released its 2.0 alpha in May 2012: MRv2 was still immature, but the new HDFS features were already largely usable, in particular high availability (HA) and Federation. Cloudera followed with CDH4.0.1 in July, which includes many of Hadoop 2.0's new features and components, so we tested HA and Federation on CDH4.0.1.

This work was completed together with my colleagues Zhang June and Chanxing.

II. Why HA and Federation Are Needed

1. Single point of failure

Before Hadoop 2.0, a number of techniques attempted to solve the single point of failure problem; we summarize them briefly here.

Secondary NameNode. This is not HA; it only merges the edits log into the fsimage periodically to shorten cluster startup time. When the NameNode (hereafter NN) fails, the Secondary NN cannot take over service immediately, and it cannot even guarantee data integrity: if the NN's data is lost, any changes made to the file system after the last merge are lost with it.

Backup NameNode (HADOOP-4539). It replicates the NN's current in-memory state, so it is a warm standby, but it stops there: there is no failover. It also only checkpoints periodically and cannot guarantee data integrity.

Manually pointing the NN's name.dir at NFS. This is a safe cold standby: it guarantees that metadata is not lost, but recovering the cluster is an entirely manual process.
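As a rough illustration of this workaround (the paths are placeholders, not from the original article), the NN can be pointed at both a local directory and an NFS mount so metadata is written to each; in practice these values live in hdfs-site.xml rather than Java code.

```java
import org.apache.hadoop.conf.Configuration;

// Sketch of the "metadata on NFS" workaround: list a local directory and an
// NFS-mounted directory so the NN writes fsimage/edits to both locations.
// Paths are hypothetical. (Hadoop 1.x calls this key dfs.name.dir;
// Hadoop 2.0 renames it to dfs.namenode.name.dir.)
public class NameDirOnNfsSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("dfs.namenode.name.dir", "/data/dfs/name,/mnt/nfs/dfs/name");
        System.out.println(conf.get("dfs.namenode.name.dir"));
    }
}
```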

Facebook AvatarNode. Facebook has strong operational support, so AvatarNode is only a hot standby with no automatic switching: when the primary NN fails, an administrator must confirm the failure and then manually remap the virtual IP to the standby NN to bring it into service. The advantage of this approach is that it guarantees split-brain cannot occur. Some of its design ideas are very close to those in Hadoop 2.0, and given the timing, Hadoop 2.0 likely drew on Facebook's approach.

There are also a number of solutions that rely largely on external HA mechanisms, such as DRBD, Linux-HA, and VMware FT.

2. Cluster capacity and cluster performance

With a single NN, HDFS has potential problems with cluster scalability and performance. When the cluster grows large enough, the NN process's memory footprint may reach hundreds of gigabytes. A common rule of thumb is that 1 GB of NN heap corresponds to roughly 1 million blocks, which at the default block size is about 64 TB of data (this estimate is rather generous: even if every file had only one block, the metadata per block would not reach 1 KB). At the same time, every metadata read and operation has to go through the NN, for example the client's addBlock and getBlockLocations calls and the DataNode's blockReceived, sendHeartbeat, and blockReport. As the cluster grows, the NN becomes a performance bottleneck. HDFS Federation in Hadoop 2.0 was developed to address both of these problems.
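A quick back-of-the-envelope check of that rule of thumb, using the default 64 MB block size and illustrative heap sizes:

```java
// Back-of-the-envelope check of the sizing rule quoted above:
// roughly 1 GB of NameNode heap per 1 million blocks, with the default
// 64 MB block size. The heap sizes here are illustrative assumptions.
public class NameNodeCapacityEstimate {
    public static void main(String[] args) {
        long blocksPerGbHeap = 1000000L; // rule of thumb from the text
        long blockSizeMb = 64;           // default block size in Hadoop 1.x/2.0

        for (long heapGb : new long[] {1, 100}) {
            long blocks = heapGb * blocksPerGbHeap;
            long capacityTb = blocks * blockSizeMb / 1000000L; // MB -> TB (decimal)
            System.out.printf("%d GB heap -> %d blocks -> ~%d TB of data%n",
                    heapGb, blocks, capacityTb);
        }
        // 1 GB of heap  -> ~64 TB, matching the estimate in the text;
        // 100 GB of heap -> ~6400 TB, i.e. a very large cluster.
    }
}
```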

III. The Implementation of Hadoop 2.0 HA

Image source: HDFS-1623 design document

Image authors: Sanjay Radia, Suresh Srinivas

This diagram shows the overall structure of HA. Its design considerations include:

Use shared storage to synchronize the edits log between the two NNs. Previous versions of HDFS shared nothing except the NN itself; now the NNs also share storage. In effect this only moves the single point of failure, but high-end storage devices offer various forms of RAID plus redundant hardware, including power supplies and network cards, so they are somewhat more reliable than a server. Data consistency is guaranteed by flushing after every metadata change on the NN, together with NFS close-to-open semantics. The community is also trying to put the metadata store on BookKeeper to remove the dependency on shared storage, and Cloudera has contributed a Quorum Journal Manager implementation and code; a Chinese blog post gives a detailed analysis: "Principle and Code Analysis of HDFS HA Based on QJM (Quorum Journal Manager / Paxos)".
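As a concrete sketch (the JournalNode hostnames and journal ID are hypothetical, not from the article), the shared edits directory is expressed through a single configuration key, whether it points at an NFS mount or at a QJM quorum; in a real deployment this setting belongs in hdfs-site.xml.

```java
import org.apache.hadoop.conf.Configuration;

// Sketch of the shared-edits setting used by HA. Hostnames and the journal
// ID ("mycluster") are placeholders.
public class SharedEditsConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // NFS-based shared storage: both NNs mount the same directory.
        // conf.set("dfs.namenode.shared.edits.dir", "file:///mnt/nfs/ha/edits");

        // QJM-based shared storage (the Cloudera-contributed alternative):
        conf.set("dfs.namenode.shared.edits.dir",
                 "qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster");

        System.out.println(conf.get("dfs.namenode.shared.edits.dir"));
    }
}
```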

The DataNode (DN) reports block information to both NNs at the same time. This is the step that lets the standby NN keep an up-to-date view of the cluster; no further explanation is needed.

A FailoverController process monitors and controls the NN process. Obviously we cannot run the heartbeat and other monitoring logic inside the NN process itself; the simplest reason is that a full GC can suspend the NN for more than ten minutes, so there has to be an independent, lightweight watchdog dedicated to monitoring. This is also a loosely coupled design that is easy to extend or replace: the current version uses ZooKeeper (hereafter ZK) for the synchronization lock, but users can readily swap the ZooKeeper FailoverController (hereafter ZKFC) for another HA or leader-election scheme.
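A minimal sketch of the two settings that wire ZKFC-driven automatic failover into the cluster (the ZooKeeper hostnames are placeholders); dfs.ha.automatic-failover.enabled normally goes in hdfs-site.xml and ha.zookeeper.quorum in core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;

// Sketch of the automatic-failover settings consulted by the ZKFC.
// The ZooKeeper quorum hosts below are hypothetical.
public class ZkfcConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
        conf.set("ha.zookeeper.quorum", "zk1:2181,zk2:2181,zk3:2181");
        System.out.println("automatic failover enabled: "
                + conf.getBoolean("dfs.ha.automatic-failover.enabled", false));
    }
}
```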

Isolation (fencing), which prevents split-brain, ensures that only one NN is active at any time. It covers three aspects (a configuration sketch follows this list):

Shared storage fencing: ensure that only one NN can write to the edits log.

Client fencing: ensure that only one NN can respond to client requests.

DataNode fencing: ensure that only one NN can send commands to DNs, such as deleting or copying blocks.
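A minimal sketch of the fencing configuration that backs these guarantees, assuming SSH-based fencing with a shell fallback; the private-key path is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;

// Sketch of fencing settings: methods are tried in order during failover.
// The key path is hypothetical; "shell(/bin/true)" is a last-resort fallback
// that is only safe when the shared storage itself fences writers (as QJM does).
public class FencingConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("dfs.ha.fencing.methods", "sshfence\nshell(/bin/true)");
        conf.set("dfs.ha.fencing.ssh.private-key-files", "/home/hdfs/.ssh/id_rsa");
        System.out.println(conf.get("dfs.ha.fencing.methods"));
    }
}
```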

IV. The Implementation of Hadoop 2.0 Federation

Image source: HDFS-1052 design document

Image authors: Sanjay Radia, Suresh Srinivas
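As a rough sketch of how Federation shows up in configuration (the nameservice names and hostnames are hypothetical, not from the article), two independent namespaces served by two NNs but sharing the same DataNodes can be declared like this:

```java
import org.apache.hadoop.conf.Configuration;

// Sketch of a federated deployment: two nameservices, each with its own NN.
// Every DataNode registers with both and keeps a separate block pool for each.
// Nameservice names and hostnames are placeholders.
public class FederationConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("dfs.nameservices", "ns1,ns2");
        conf.set("dfs.namenode.rpc-address.ns1", "nn1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.ns2", "nn2.example.com:8020");
        System.out.println(conf.get("dfs.nameservices"));
    }
}
```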
