Hadoop HDFS High Availability (HA)

In Hadoop 1.x, the NameNode is a single point of failure for the cluster: once the NameNode fails, the entire cluster is unavailable until it is restarted or a new NameNode is brought up. Notably, the Secondary NameNode does not provide failover capability. Cluster availability suffers in two ways: when the machine fails (for example, a power outage), the administrator must restart the NameNode before the cluster is usable again; and during routine maintenance and upgrades, stopping the NameNode also makes the cluster unavailable for some time.

Architecture

Hadoop HA (High Availability) solves these problems by running two NameNodes in an active/passive configuration, called the Active NameNode and the Standby NameNode. The Standby NameNode acts as a hot backup, allowing rapid failover when a machine fails and a graceful NameNode switch during routine maintenance. HA supports at most two NameNodes: one active and one standby.

The active NameNode handles all client operations (reads and writes), while the standby acts only as a slave, keeping its state as closely synchronized as possible so that a failure can be switched over quickly. To keep the standby NameNode in sync with the active NameNode, both NameNodes communicate with a group of JournalNodes. When the active NameNode performs a namespace modification, it durably logs the change to a majority of the JournalNodes. The standby NameNode continuously watches for these edits and applies them to its own namespace as they appear.

When a failover occurs, the standby ensures it has read all the edit logs from the JournalNodes before becoming the active NameNode, so its state is consistent with the state at the time of the failure.

To ensure failover completes quickly, the standby NameNode also needs up-to-date block location information, that is, which nodes in the cluster hold each block replica. To achieve this, DataNodes are configured with both NameNodes and send block reports and heartbeats to both.

It is essential that only one NameNode is active at any time; otherwise data loss or corruption may occur. If both NameNodes believe they are active, they both try to write data, with no further detection or synchronization between them. To prevent this split-brain scenario, the JournalNodes allow only one NameNode to write at a time, a guarantee enforced internally by maintaining an epoch number, so failover is safe.

There are two ways to share the edit log:
1) an NFS share (stored on NAS/SAN)
2) QJM (Quorum Journal Manager)

With NFS shared storage

As shown in the figure, NFS serves as the shared storage for the primary and standby NameNodes. Split-brain can occur in this scenario: both nodes believe they are the primary NameNode and attempt to write to the edit log, which can corrupt the data. The problem is addressed by configuring a fencing script, whose job is to make sure the previous NameNode is shut down, or at least cut off from the shared edit log, before the new NameNode takes over.

With this scheme, the administrator can manually trigger a NameNode switch (see the example below) and then perform upgrade maintenance. However, the approach has the following problems:
1) Failover is manual only; every failure requires the administrator to intervene and switch over.
2) NAS/SAN provisioning is complex and error-prone, and the NAS itself is a single point of failure.
3) Fencing is complex and frequently misconfigured.
4) It cannot handle unexpected (unplanned) incidents such as hardware or software failures.
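As a concrete illustration, a manual switch is driven with the hdfs haadmin tool; the service IDs nn1 and nn2 below follow the example configuration later in this article:

# gracefully make nn2 active and nn1 standby before maintenance
hdfs haadmin -failover nn1 nn2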

These problems call for a different approach, one that provides automatic failover (introducing ZooKeeper for automation), removes the dependency on dedicated external hardware and software (NAS/SAN), and covers both unexpected incidents and routine maintenance.

Quorum-based storage + ZooKeeper

QJM (Quorum Journal Manager) is a component developed within Hadoop specifically to serve as NameNode shared storage. It runs as a cluster of JournalNodes, each exposing a simple RPC interface through which the NameNode reads and writes data held on that JournalNode's local disk. When the NameNode writes the edit log, it sends the write request to every JournalNode in the cluster, and the write is considered successful once a majority of the nodes acknowledge it. For example, with 3 JournalNodes, a write succeeds as soon as the NameNode receives confirmations from 2 of them.

For automatic failover, a ZKFailoverController (ZKFC) is introduced to monitor NameNode state. A ZKFC runs on each NameNode's host and collaborates with the ZooKeeper cluster to carry out automatic failover. The overall cluster architecture is as follows:

QJM

The NameNode reads and writes edit log data through the RPC interface provided by the QJM client. Writes are quorum-based: the data must be written to a majority of the nodes in the JournalNode cluster.
On the JournalNode (server side)
Each JournalNode runs a lightweight daemon that exposes an RPC interface for clients to call. The actual edit log data is kept on the JournalNode's local disk, at the path given by the dfs.journalnode.edits.dir property.
JournalNodes solve the split-brain problem by means of an epoch number, a mechanism called JournalNode fencing. It works as follows:
1) When a NameNode becomes active, it is assigned an integer epoch number that is unique and higher than the epoch of every previous NameNode.

2) When the NameNode sends a request to a JournalNode, it includes its epoch number. The JournalNode compares the received epoch with the promised epoch it stores locally: if the received epoch is larger, the JournalNode updates its local promised epoch to the received value; if the received epoch is smaller than the local one, the request is rejected.

3) An edit log write must reach a majority of the nodes to succeed, which means the writer's epoch must be at least as high as the promised epoch on a majority of the nodes.

This approach solves the three problems of the NFS scheme:
1) No additional hardware is required; the existing physical machines suffice.
2) Fencing is controlled by the epoch number, avoiding fencing misconfiguration errors.
3) Automatic failover: ZooKeeper handles it, as described next.

Automatic failover with ZooKeeper

As mentioned earlier, to support automatic failover Hadoop introduces two new components: a ZooKeeper quorum and the ZKFailoverController (ZKFC) process.

ZooKeeper's tasks include:
1) Failure detection: each NameNode maintains a persistent session in ZooKeeper. If the NameNode fails, the session expires, and ZooKeeper's event mechanism notifies the other NameNode that a failover is required.
2) NameNode election: if the current active NameNode goes down, the other NameNode tries to acquire an exclusive lock in ZooKeeper; obtaining the lock means it becomes the next active NameNode.

On each NameNode machine, a ZKFC also runs to carry out the following tasks:
1) monitoring the health of its local NameNode;
2) managing the session with ZooKeeper;
3) the ZooKeeper-based NameNode election.

If the local NameNode is healthy and the znode lock used for the election is not held by another node, the ZKFC attempts to acquire it. Winning the exclusive lock means winning the election, after which the ZKFC is responsible for the failover: if necessary it first fences the previous NameNode to make it unusable, and then switches its own NameNode to the active state.
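Automatic failover must also be enabled explicitly in the configuration. A minimal sketch of the relevant settings, assuming a three-node ZooKeeper ensemble whose hostnames are placeholders:

<!-- in hdfs-site.xml -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>

<!-- in core-site.xml; hostnames are placeholders -->
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>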

Deployment and Configuration

Hardware Resources

To run an HA cluster, the following resources are required:
1) NameNode machines: the machines running the active and standby NameNodes should have the same hardware configuration as a NameNode in a non-HA deployment.
2) JournalNode machines: the machines running the JournalNodes. These daemons are lightweight, so they can be co-located with the NameNodes or the YARN ResourceManager. At least 3 JournalNodes must be deployed to tolerate the failure of one node. They are usually deployed in odd numbers: with N JournalNodes in total, the cluster keeps working as long as at most (N-1)/2 of them fail.

Note that the standby NameNode also performs the checkpointing previously done by the Secondary NameNode, so there is no need to deploy a Secondary NameNode separately.

HA Configuration

dfs.nameservices: the logical name of the service:

<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>

NameNode configuration:
dfs.ha.namenodes.[nameservice ID]: the NameNodes belonging to the nameservice:

<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value> <!-- currently at most 2 NameNodes -->
</property>

NameNode RPC addresses:

<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>machine1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>machine2.example.com:8020</value>
</property>

NameNode HTTP server configuration:

<property>
  <name>dfs.namenode.http-address.mycluster.nn1</name>
  <value>machine1.example.com:50070</value>
</property>
<!-- if you have Hadoop security enabled, use https-address instead -->
<property>
  <name>dfs.namenode.http-address.mycluster.nn2</name>
  <value>machine2.example.com:50070</value>
</property>

The shared edit log directory, given as the JournalNode cluster URI; hosts are separated by semicolons:

<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster</value>
</property>

The proxy class that clients use to locate the active NameNode; currently only one implementation is provided:

<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

The local directory on each JournalNode where edit log data is stored:

<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/path/to/journal/node/local/data</value>
</property>

Fencing Method Configuration:

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>

<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/exampleuser/.ssh/id_rsa</value>
</property>
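A related setting worth knowing: if the machine being fenced is down, sshfence cannot connect and the fencing attempt fails after a timeout, which is configurable in milliseconds:

<property>
  <name>dfs.ha.fencing.ssh.connect-timeout</name>
  <value>30000</value>
</property>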

While QJM is used as shared storage, simultaneous writes (split-brain) cannot occur. However, the old NameNode may still serve read requests, so clients can see stale data until that NameNode attempts to write to the JournalNodes and is rejected. It is therefore still recommended to configure a suitable fencing method, as sketched below.
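Besides sshfence, the fencing framework also accepts an arbitrary shell command via the shell(...) method; the methods listed in the value are tried in order until one succeeds. The script path below is a hypothetical placeholder:

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>
    sshfence
    shell(/path/to/custom-fence.sh)
  </value>
</property>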

Deployment Startup

After the configuration is complete, start the QJM cluster by running the following command on each JournalNode machine:

hadoop-daemon.sh start journalnode

Configure and start the ZooKeeper cluster just as in a regular deployment, including the data directory, node IDs, and timing settings, all configured in zoo.cfg. Detailed steps are not listed here.
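For orientation only, a minimal zoo.cfg covering the items just mentioned might look like this; the paths and hostnames are placeholders:

# timing configuration (ticks of 2000 ms)
tickTime=2000
initLimit=10
syncLimit=5
# data save location; each server's node ID goes into <dataDir>/myid
dataDir=/var/lib/zookeeper
clientPort=2181
# the ZooKeeper ensemble; hostnames are placeholders
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888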

Before use, format the HA state in ZooKeeper:

hdfs zkfc -formatZK

Format the NameNode:

hdfs namenode -format

Start the two NameNodes:

# on the master
hadoop-daemon.sh start namenode

# on the standby, copy the formatted metadata over, then start it
hdfs namenode -bootstrapStandby
hadoop-daemon.sh start namenode
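With automatic failover enabled, the ZKFC daemon must also be started on each NameNode machine; until it runs, neither NameNode will be elected active:

hadoop-daemon.sh start zkfc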

Other components are started the same way as in non-HA mode.
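To verify the deployment, query each NameNode's HA state (again using the nn1/nn2 service IDs from the configuration above); killing the active NameNode process should then cause the standby to take over within seconds:

hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2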
