High Availability for the HDFS namenode


Sanjay Radia, Suresh Srinivas

Yahoo! Inc.

(This article is a translation of the NameNode HA design document.)

1. Problem Description

There are many ways to improve the availability of the HDFS NameNode (NN), including reducing startup time, updating configuration without restarting the cluster, reducing upgrade time, and providing manual or automatic NN failover. This article focuses on NN failover, which addresses the NN's single point of failure (SPOF) problem.

 

There are many ways to implement NN failover, including shared storage, virtual IP addresses, and smart clients. Leader election can be done with ZooKeeper, or with architectures similar to Linux HA. These different solutions can share common framework components. The purpose of this article is to define those framework components and to provide a concrete design for a failover solution that gives the HDFS NameNode high availability with proper fencing of HDFS services.

 

2. Terms

1) Active NN: the NN that serves client read/write operations.

2) Standby NN: an NN that waits and becomes active when the active NN dies.

I. In Hadoop 0.21, the BackupNode can be used as a standby; alternatively, the standby can keep the namespace state in sync via a shared storage file system.

3) To avoid confusion, we do not use "primary" and "secondary" for active and standby, because the Secondary NameNode was the checkpointing node in older versions.

4) Cold, warm, and hot failover: characterized by how much of the running active's state the standby NN holds.

I. Cold standby: the standby NN has none of the active's state.

II. Warm standby: the standby has part of the state:

1. It has loaded the fsimage and edit log but has not received block reports;

2. Or it has loaded the fsimage and rolled edit logs and has received all block reports.

III. Hot standby: the standby has all of the active's state and can take over immediately.

3. Upper-layer applications

1) Planned shutdown: a Hadoop cluster often needs to be stopped to upgrade software or configuration. Restarting a 4,000-node Hadoop cluster takes about two hours.

2) Unplanned outage or unresponsive service: the NameNode service can fail because of hardware faults, OS faults, an NN process crash, or an NN process that stops responding for several minutes; such failures can unexpectedly affect important upper-layer applications.

In both cases, a warm or hot failover can reduce downtime. In fact, planned upgrades are the biggest contributor to HDFS service unavailability, because the HDFS NameNode rarely fails (based on Yahoo! and Facebook experience).

4. Not Considered

1) Active-active NNs: our initial design makes one NN active while the other is standby (warm or hot). An optional extension is to let the standby serve read operations. We believe active-active would require significant additional work and possibly a redesign.

2) More than two NNs for one namespace.

3) Wide-area failure recovery, usually called BCP (business continuity planning).

5. Supported failures

1) Single hardware failures (disk, host, network link, etc.) are handled; double failures are not, although even in that case data must not be corrupted.

2) Software failures, for example an NN process crash or deadlock. Note: failover cannot help if the same software fault also strikes the standby when it becomes active.

3) NN garbage collection (GC) is a tricky case: an NN that is in a long GC pause and not responding should not simply be declared dead.

6. Requirements

1) Only one NN may be active at a time.

I. Only the active NN processes and replies to client requests.

II. Only the active NN may change persistent state;

III. Optional: the standby may serve read requests.

2) The first step is to support manual failover; some organizations want failover only for software upgrades, which are the biggest cause of Hadoop cluster unavailability.

3) No automatic failback: when the old active restarts or becomes healthy again, it does not automatically retake the active role.

4) Data integrity is more important than availability.

I. Neither manual nor automatic failover may cause data corruption.

5) Avoid special hardware where possible.

6) HA setup and failure management should be simple, and data corruption should be prevented even when operations fail.

7) A short NN garbage-collection pause should not be treated as a failure or trigger automatic failover.

7. Use Cases

1) Single NN configuration, no failover.

2) Manual failover between active and standby.

I. The standby can be cold, warm, or hot.

3) Automatic failover between active and standby.

I. Two NNs start; one automatically becomes active and the other becomes standby.

II. Active and standby both running.

III. The active fails or becomes unhealthy; the standby takes over.

IV. Active and standby running; the active is manually shut down.

V. Active and standby running; the standby fails and the active continues.

VI. Active running while the standby is down for repair; the active fails and cannot be restarted; the standby is started and becomes active.

VII. Two NNs start, and only one of them becomes active.

VIII. Active and standby running; the active's status becomes unknown and the standby takes over.

 

8. Design Scheme

The following describes the design. Several points have options, such as where to store the NN's live state, how to do leader election (with ZooKeeper, Linux HA, or other means), and how to implement fencing; the rest is straightforward. The two diagrams below show the overall approach with shared storage, using ZooKeeper and Linux HA respectively. The design can also be extended to the BackupNode.

NN HA with shared storage and ZooKeeper (figure)


 

NN HA with shared storage and Linux HA (figure)


1) Shared vs. non-shared storage of NN metadata

The active and standby can share storage (such as NFS), or the active can stream edits to the standby (as the BackupNode does in 0.21). Some considerations follow; a configuration sketch appears after the list.

I. Shared storage becomes a single point of failure and therefore itself requires high availability. BookKeeper is a good fit but is not yet ready for prime time; it can be considered a long-term solution. With BookKeeper, the NN would not need to keep state on local disk, making the NN side effectively "stateless". Some organizations already run HA NFS in their clusters for other reasons.

II. The BackupNode is cheaper because it needs no shared server. However, it does not handle the third case.

III. The BackupNode does not require fencing, as long as it does not have to handle the third case. Shared storage does require fencing. However, if we use STONITH for fencing, all fencing requirements are covered at once.

IV. The BackupNode approach is not symmetric, so the two nodes cannot simply swap roles after a failover.

V. When the BackupNode is down, the active still depends on remote storage for its additional state, effectively falling back to shared storage.
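
As a concrete illustration of the shared-storage option: the HDFS HA feature as eventually shipped expresses this choice through a configuration key. A minimal sketch, assuming an HA NFS mount at an illustrative path (the mount path is an assumption, not part of this design document):

    import org.apache.hadoop.conf.Configuration;

    public class SharedEditsConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // dfs.namenode.shared.edits.dir is the key used by the HDFS HA
            // feature as eventually shipped; the NFS path is illustrative.
            conf.set("dfs.namenode.shared.edits.dir",
                     "file:///mnt/ha-nfs/hdfs/edits");
            System.out.println(conf.get("dfs.namenode.shared.edits.dir"));
        }
    }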

 

2) Parallel block reports to active and standby

In our design, block reports must be sent to both the active and the standby to enable warm or hot failover. Block reports can be sent by the DataNodes directly to both NNs, or relayed to active and standby through an intermediate layer. A sketch of the direct approach follows.
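
A minimal sketch of the direct approach, using illustrative types (NamenodeHandle, the Object report) rather than the actual DataNode classes: the DataNode sends the same report to every configured NN, and a failure to reach one NN must not block the report to the other.

    import java.util.List;

    /** Illustrative only: sends one block report to both active and standby. */
    class BlockReportSender {
        interface NamenodeHandle {            // assumed stand-in for the NN RPC proxy
            void blockReport(Object report) throws Exception;
            String name();
        }

        private final List<NamenodeHandle> namenodes;  // active and standby

        BlockReportSender(List<NamenodeHandle> namenodes) {
            this.namenodes = namenodes;
        }

        /** Report to every NN so the standby stays warm/hot. */
        void send(Object report) {
            for (NamenodeHandle nn : namenodes) {
                try {
                    nn.blockReport(report);
                } catch (Exception e) {
                    // One NN being down must not stop reports to the other.
                    System.err.println("block report to " + nn.name() + " failed: " + e);
                }
            }
        }
    }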

3) Client redirection during failover

When the active fails, clients need to reconnect to the new active. This is called client failover. There are several ways to achieve it:

I. Change the DNS binding: not a good method, because operating systems and many libraries cache DNS lookups, so the change does not take effect immediately.

II. Smart client: relies on server-side redirection, retries, or re-discovery of the active NN (see also section 12 below).

1. Server-based redirection needs care, since the server being redirected away from may wrongly believe it is still active; here the better safeguard is fencing the shared storage so that only one side can write the edit log.

2. Question: does this also work for HTTP and JMX?

3. Failover takes longer, because a client must first try the old NN (possibly dead) before it finds the new NN's address.

III. Use a load balancer to route client requests to the correct NN; this is very difficult at large scale (for example, 100,000 clients).

IV. IP failover: commonly used in production environments.

1. The NameNode servers share a virtual IP address, which always points at the active.

2. Questions: does this work across switches, or only within a single VLAN?

4) Client timeouts while the NN is starting

In some cases the NN takes a long time to start: loading the image, applying edits, and recovering block locations. This can cause clients to time out and assume the NN is dead. Therefore, while the active is starting, client requests should receive a "starting" response indicating that the client should wait and retry. This mode is a special case of safe mode. A client-side retry sketch follows.
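
A minimal client-side sketch of this behavior. NamenodeStartingException is a hypothetical marker for the "starting, please wait" reply; the 10-second delay is an illustrative choice.

    /** Illustrative retry loop: wait while the NN reports it is starting. */
    class StartupAwareInvoker {
        /** Hypothetical exception representing the "starting" reply. */
        static class NamenodeStartingException extends Exception {}

        interface NnCall<T> { T run() throws Exception; }

        <T> T invoke(NnCall<T> call) throws Exception {
            while (true) {
                try {
                    return call.run();
                } catch (NamenodeStartingException e) {
                    // NN is alive but still loading the image / applying edits:
                    // wait and retry instead of declaring it dead.
                    Thread.sleep(10_000L);
                }
            }
        }
    }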

5) Failover control via a failover controller (watchdog) independent of the NN process

Our approach uses a failover controller process that is independent of the NN process. This failover controller is very similar to the resource manager (RM) in Linux HA. In a Linux-HA-based solution, the RM that is part of it can be used directly. For ZooKeeper, we can write our own, or configure the Linux HA resource manager to use ZooKeeper.

The failover controller performs the following functions:

I. Monitor the health of the NN, the OS, the hardware, and other resources such as network connectivity.

II. Heartbeat and leader election. (With ZooKeeper, the heartbeat is sent to ZooKeeper, which performs the election.)

III. The election selects the active. The winning failover controller then instructs its monitored NN to switch from standby to active. (Note that every NN starts as standby and becomes active only when instructed by its failover controller.)

Using an independent failover controller process has the following advantages (a ZooKeeper election sketch follows the list):

I. Integrating this function into the NN would make the heartbeat mechanism vulnerable to NN GC pauses.

II. The failover controller stays a small, compact piece of code, independent of the application whose failures it manages, which improves fault tolerance.

III. The election mechanism becomes pluggable.
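
A minimal sketch of the ZooKeeper-based election, assuming the parent znode /hdfs-ha already exists; the lock path, the 5-second session timeout, and instructLocalNnToBecomeActive() (standing in for the fence-then-activate sequence described below) are all illustrative assumptions.

    import org.apache.zookeeper.*;

    /** Illustrative failover controller using an ephemeral znode as the lock. */
    class ZkElection implements Watcher {
        private static final String LOCK = "/hdfs-ha/active-lock"; // assumed path
        private final ZooKeeper zk;
        private final byte[] myAddress;

        ZkElection(String zkQuorum, byte[] myAddress) throws Exception {
            this.zk = new ZooKeeper(zkQuorum, 5000, this); // session = heartbeat
            this.myAddress = myAddress;
        }

        /** Try to win; the ephemeral node vanishes if our session dies. */
        void elect() throws Exception {
            try {
                zk.create(LOCK, myAddress, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                          CreateMode.EPHEMERAL);
                instructLocalNnToBecomeActive();   // we won the election
            } catch (KeeperException.NodeExistsException e) {
                zk.exists(LOCK, true);             // lost: watch the current holder
            }
        }

        @Override
        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeDeleted) {
                try { elect(); } catch (Exception ignored) {} // holder died: retry
            }
        }

        private void instructLocalNnToBecomeActive() { /* fence, then activate */ }
    }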

6) Fencing

In a failover solution it is essential that only one active NN can update the shared state. Even with an election mechanism, the old active may be partitioned and not immediately realize it is no longer active, so it may keep writing to the shared state. Fencing is the mechanism that prevents the old active from continuing to write to shared storage. Once fenced, the old active's writes must fail with an I/O error and must not be retried; the old active should then exit with an error message (demoting itself to standby is not a good option).

The following shared resources can be considered:

I. Shared storage for NN metadata: ensure that only the active writes to the edit log.

II. DataNodes: ensure that only one NN can delete, move, or otherwise manage replicas on the DataNodes.

III. Clients: a client is not, strictly speaking, shared state updated by the NN, but a client may send an update to either of the two NNs, so we must ensure that only the active NN replies successfully. Note that if shared-storage fencing is in place, a non-active NN that tries to write will be fenced and therefore will not return success to the client.

7) Other failover problems

I. Lease recovery during failover: TBD.

II. Pipeline recovery during failover

 

9. Specific Design

1) Fencing

We have already described fencing, the shared resources/state that must be fenced, and the requirement that an NN exit when a write fails because it has been fenced.

 

2) Fencing shared storage containing NN metadata

With HDFS-1073, the fsimage and edit logs are decoupled, so only the edit log needs fencing. Note that a newly started NN always starts a new edit log; what must be prevented is the old active writing to the old edit log and reporting success to clients.

I. A fencing solution for NFS needs to be investigated.

II. For BookKeeper, we are discussing a fencing solution with the BookKeeper team.

III. Shared disks (SCSI or SAN) come with their own fencing solutions, but they are not a good fit for Hadoop environments.

3) Fencing DataNodes

There are two solutions:

1. In its heartbeat response, each NN returns its status: active or standby.

If the reported status changes, the DN asks ZK which NN is active.

Even if the active changes from A to B and then back to A, the DN can detect it this way.

A variant is for the failover controller to tell the DNs directly, but with many DNs it is hard to wait for all acknowledgments, so this would have to be handled in the protocol.

2. Each NN has a serial number, which is passed to the DNs whenever the NN's status changes.

Each DN tracks this serial number at runtime and obeys only the NN that most recently transitioned from standby to active.

If a previously active NN comes back (for example after a long GC pause), the DNs reject it because its serial number is stale; the new active carries a newer serial number. A sketch of this check follows.
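
A minimal sketch of the DN-side check; the class and method names are illustrative, since the design does not specify them.

    /** Illustrative DN-side guard implementing the serial-number fencing. */
    class NnSerialGuard {
        private long highestSerial = Long.MIN_VALUE;

        /**
         * Accept a command only if it carries the newest serial number seen.
         * An old active returning from a long GC pause carries a stale serial
         * and is rejected, i.e. fenced at the DataNode.
         */
        synchronized boolean accept(long nnSerial) {
            if (nnSerial < highestSerial) {
                return false;          // stale NN: ignore its delete/replicate commands
            }
            highestSerial = nnSerial;  // follow the most recently promoted NN
            return true;
        }
    }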

4) Fencing clients

A client may send an update to either of the two NNs, and only the active NN may acknowledge it; this needs further investigation. Note that if shared-storage fencing is in place, a non-active NN's attempted write will fail and thus never return success to the client.

5) STONITH as the fencing solution

When no better option exists, STONITH ("shoot the other node in the head") is a common fencing solution; it typically shuts the other node down by cutting its power.

6) Leader election and the failover controller process

We have already summarized the advantages of a separate controller process. The failover controller corresponds to the resource manager (RM) in Linux HA; ZooKeeper has no comparable watchdog process. We therefore propose one of the following:

With Linux HA, use the Linux HA resource manager as the failover controller process.

With ZooKeeper, either write our own failover controller that performs the health checks, or use the Linux HA resource manager together with ZooKeeper, effectively using ZooKeeper only as the leader elector.

7) Failover controller process operations

I. Heartbeating, used to confirm that the active is alive; loss of the heartbeat triggers leader election.

With ZooKeeper, the failover controller periodically heartbeats to ZK.

With Linux HA, the resource manager's own heartbeat mechanism plays this role.

II. Health monitoring by the failover controller:

NN process status (e.g. via ps)

A simple request to the NN (to detect conditions such as a GC pause)

OS health

NIC health

Switch health

III. The failover controller must execute a configurable series of commands for each transition, both standby-to-active and active-to-standby. For example, Linux HA lets each managed resource configure such a series of commands.

IV. Standby-to-active requires the following steps:

Fence the shared storage and the DNs (STONITH can be used if no other mechanism is available)

Update the shared client address and/or take over the virtual IP address

Tell the NN to become active

V. Active-to-standby requires the following steps (a sketch of both sequences follows):

Update the client address or release the virtual IP address

Tell the NN to become standby or to exit; if the NN does not respond, kill it.
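
A minimal sketch of both sequences. Every method here stands in for an operation described above; none of these names come from real code.

    /** Illustrative driver for the two state transitions listed above. */
    class TransitionDriver {
        interface Namenode {
            void transitionToActive() throws Exception;
            void transitionToStandby() throws Exception;
        }

        void standbyToActive(Namenode nn) throws Exception {
            fenceSharedStorageAndDatanodes(); // STONITH if nothing else works
            takeOverClientAddress();          // shared address and/or virtual IP
            nn.transitionToActive();          // finally promote the monitored NN
        }

        void activeToStandby(Namenode nn) {
            releaseClientAddress();           // give up the virtual IP
            try {
                nn.transitionToStandby();
            } catch (Exception e) {
                killNnProcess();              // NN unresponsive: kill it
            }
        }

        private void fenceSharedStorageAndDatanodes() { /* elided */ }
        private void takeOverClientAddress()          { /* elided */ }
        private void releaseClientAddress()           { /* elided */ }
        private void killNnProcess()                  { /* elided */ }
    }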

8) NN startup and active/standby state changes

When an NN starts, it enters standby and switches to active only on command from the failover controller.

9) The standby NN

I. Serves no client requests

II. Loads the image and applies edits

III. Receives and processes block reports, but issues no delete or replication commands to the DNs

10) NN becoming active

When an NN becomes active, it finishes applying the latest edits and, until it is ready, tells clients that it is in starting mode.

Open question: should an NN ever transition from active back to standby, or only be restarted?

11) Client redirection

We summarized the two feasible methods above. Details: TBD.

12) Smart Clients

TBD: describe the smart-client method, in which a client that fails to connect to an NN looks up the active through another service (such as ZooKeeper). The advantages and disadvantages of this method still need discussion; a retry-based sketch follows.
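
A minimal smart-client sketch, assuming the client knows both NN addresses from configuration. This variant simply rotates through a configured address list on failure; a ZooKeeper lookup could replace the rotation. All names are illustrative, not the real DFSClient API.

    import java.util.List;

    /** Illustrative wrapper that retries a call against the other NN on failure. */
    class SmartClient {
        interface NnCall<T> { T run(String nnAddress) throws Exception; }

        private final List<String> nnAddresses; // both NNs, from configuration
        private int current = 0;

        SmartClient(List<String> nnAddresses) {
            this.nnAddresses = nnAddresses;
        }

        <T> T invoke(NnCall<T> call) throws Exception {
            Exception last = null;
            // Bounded retries: try each NN a couple of times before giving up.
            for (int i = 0; i < nnAddresses.size() * 2; i++) {
                try {
                    return call.run(nnAddresses.get(current));
                } catch (Exception e) {        // dead NN or standby rejection
                    last = e;
                    current = (current + 1) % nnAddresses.size(); // switch NN
                }
            }
            throw last;
        }
    }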

13) IP failover method

How to set up a standard domain name for this: TBD.

Benefit: works for all protocols, such as HDFS, HTTP, and JMX.

Challenge: making the virtual IP address work across network segments.

14) Shared storage method

The standby reads edits from the shared storage, consuming only rolled (finalized) edit files, never the in-progress one. Details: TBD.

Fencing is described above.

15) Non-shared storage method: use the BackupNode

Describe how the BackupNode works and this method: TBD.

 

10. Appendix

1) Automatic failback

Describe the problem and the conditions under which it occurs.

2) Amnesia

State previously communicated to clients is lost.

3) GC

When the NN does not respond, how do we distinguish a GC pause from a real failure? Needs investigation.
