"Hadoop learning" Apache Hadoop ResourceManager HA

Source: Internet
Author: User
Brief introduction

This guide gives an overview of high availability (HA) for the YARN ResourceManager (RM) and details how to configure and use the feature. The RM is responsible for tracking the resources in a cluster and scheduling applications such as MapReduce jobs. Prior to Hadoop 2.4, the RM was a single point of failure in a YARN cluster. The HA feature adds redundancy in the form of an Active/Standby RM pair, eliminating this single point of failure.

Architecture

RM Failover

RM HA is implemented through an Active/Standby architecture: at any point in time, one RM is Active and the other RM is in Standby mode, waiting to take over if the Active RM fails. The trigger to transition to Active comes either from the administrator (through the CLI) or from the integrated failover controller (only when automatic failover is enabled).

Manual transitions and failover

When automatic failover is not enabled, the administrator must manually transition one of the RMs to Active. To fail over from one RM to the other, first transition the Active RM to Standby, then transition the Standby RM to Active. All of this is done with the "yarn rmadmin" command.
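
The ordering described above can be sketched as a small helper that builds the command lines without running them. This is only an illustration: the function name manual_failover_commands is hypothetical, and the RM IDs rm1 and rm2 are assumed to match the values configured in yarn.resourcemanager.ha.rm-ids. The flags are the "yarn rmadmin" HA options.

```python
from typing import List


def manual_failover_commands(active_id: str, standby_id: str) -> List[List[str]]:
    """Return the yarn rmadmin invocations for a manual failover, in order.

    Hypothetical helper: it only builds the argument lists, it does not run them.
    """
    if active_id == standby_id:
        raise ValueError("active and standby RM IDs must differ")
    return [
        # Step 1: demote the current Active first, so two RMs are never Active at once.
        ["yarn", "rmadmin", "-transitionToStandby", active_id],
        # Step 2: promote the former Standby.
        ["yarn", "rmadmin", "-transitionToActive", standby_id],
        # Step 3: confirm the state of the newly promoted RM.
        ["yarn", "rmadmin", "-getServiceState", standby_id],
    ]


if __name__ == "__main__":
    for cmd in manual_failover_commands("rm1", "rm2"):
        print(" ".join(cmd))
```

Building the commands as lists keeps the order explicit and makes it easy to hand them to a runner such as subprocess.run once they have been reviewed.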

Automatic failover

The RMs have an option to embed the ZooKeeper-based ActiveStandbyElector to decide which RM should be Active. When the Active RM goes down or becomes unresponsive, the other RM is automatically elected as the Active RM and takes over the cluster. Note that, unlike HDFS HA, there is no need to run a separate ZKFC process, because the ActiveStandbyElector embedded in the RMs acts as the failure detector and leader elector in its place.
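
The idea behind lock-based election can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not the ActiveStandbyElector implementation and not a ZooKeeper API: InMemoryLock is a hypothetical stand-in for an ephemeral lock node that vanishes when its owner's session ends, and elect is a hypothetical helper.

```python
from typing import Dict, List, Optional


class InMemoryLock:
    """Toy stand-in for an ephemeral lock node (hypothetical, not a ZooKeeper API)."""

    def __init__(self) -> None:
        self.holder: Optional[str] = None

    def try_acquire(self, candidate: str) -> bool:
        # Only one candidate can hold the lock at any time.
        if self.holder is None:
            self.holder = candidate
        return self.holder == candidate

    def session_lost(self, candidate: str) -> None:
        # An ephemeral node disappears when its owner's session ends.
        if self.holder == candidate:
            self.holder = None


def elect(lock: InMemoryLock, live_rms: List[str]) -> Dict[str, str]:
    """Each live RM tries the lock; the holder is active, the rest are standby."""
    return {rm: ("active" if lock.try_acquire(rm) else "standby") for rm in live_rms}


if __name__ == "__main__":
    lock = InMemoryLock()
    print(elect(lock, ["rm1", "rm2"]))  # rm1 wins the lock first
    lock.session_lost("rm1")            # rm1 goes down or becomes unresponsive
    print(elect(lock, ["rm2"]))         # rm2 now acquires the lock
```

The point of the model is that the lock, not the RMs themselves, decides who is Active, so a failed Active RM gives up its role simply by losing its session.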

Behavior of clients, ApplicationMasters, and NodeManagers on RM failover

When there are multiple RMs, the configuration (yarn-site.xml) used by clients and nodes is expected to list all of them. Clients, ApplicationMasters (AMs), and NodeManagers (NMs) try connecting to the RMs in round-robin fashion until they reach the Active RM. If the Active RM goes down, they resume polling all RMs until they find the new Active RM. This default retry logic is implemented in org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider. You can override it by implementing org.apache.hadoop.yarn.client.RMFailoverProxyProvider and setting the yarn.client.failover-proxy-provider property to the name of your class.
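
The polling behavior can be sketched as a simple round-robin loop. This is a minimal illustration, not the ConfiguredRMFailoverProxyProvider code: find_active_rm and the is_active probe are hypothetical names, and the sketch omits the sleep between failover attempts that a real client would apply.

```python
from typing import Callable, Sequence


def find_active_rm(
    rm_ids: Sequence[str],
    is_active: Callable[[str], bool],
    max_attempts: int,
    start: int = 0,
) -> str:
    """Try the configured RMs in round-robin order until one reports Active.

    is_active is a hypothetical probe; it may raise ConnectionError
    when the RM it is asked about is down.
    """
    if not rm_ids:
        raise ValueError("no RM IDs configured")
    for attempt in range(max_attempts):
        rm_id = rm_ids[(start + attempt) % len(rm_ids)]
        try:
            if is_active(rm_id):
                return rm_id
        except ConnectionError:
            # This RM is unreachable; move on to the next one in the list.
            pass
    raise RuntimeError(f"no active RM found after {max_attempts} attempts")


if __name__ == "__main__":
    states = {"rm1": "standby", "rm2": "active"}
    print(find_active_rm(["rm1", "rm2"], lambda rm: states[rm] == "active", 4))
```

Because the loop wraps around the list, the same function covers both the initial connection and the case where the previously Active RM has gone down.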

Recovering the previous Active RM's state

With ResourceManager Restart enabled, the RM being promoted to Active loads the RM internal state and continues from where the previous Active RM left off, as far as the RM restart feature allows. A new attempt is spawned for each managed application previously submitted to the RM. Applications can checkpoint periodically to avoid losing work. The state store must be visible to both the Active and the Standby RM. Currently, there are two RMStateStore implementations for persistence: FileSystemRMStateStore and ZKRMStateStore. ZKRMStateStore implicitly allows write access to a single RM at any point in time, so it is the recommended store for HA clusters. When using ZKRMStateStore, there is no need for a separate fencing mechanism to address a potential split-brain situation where multiple RMs can potentially assume the Active role.

Deployment

Configuration

Most of the failover functionality can be tuned through configuration properties. The following is a list of the required and important ones. yarn-default.xml contains the complete list, along with the default value of each property. See also the ResourceManager Restart document for instructions on setting up the state store.

| Configuration Property | Description |
| --- | --- |
| yarn.resourcemanager.zk-address | Address of the ZooKeeper quorum. Used for both the state store and embedded leader election. |
| yarn.resourcemanager.ha.enabled | Enable RM HA. |
| yarn.resourcemanager.ha.rm-ids | List of logical IDs for the RMs, for example "rm1,rm2". |
| yarn.resourcemanager.hostname.rm-id | For each rm-id, the hostname of the corresponding RM. Alternatively, each of the RM's service addresses can be set individually. |
| yarn.resourcemanager.ha.id | Identifies the RM in the cluster. This property is optional; however, if it is set, the administrator must ensure that every RM is configured with its own ID. |
| yarn.resourcemanager.ha.automatic-failover.enabled | Enable automatic failover. By default, it is enabled only when HA is enabled. |
| yarn.resourcemanager.ha.automatic-failover.embedded | Use the embedded leader elector to pick the Active RM when automatic failover is enabled. By default, it is enabled only when HA is enabled. |
| yarn.resourcemanager.cluster-id | Identifies the cluster. Used by the elector to ensure that an RM does not take over as Active for another cluster. |
| yarn.client.failover-proxy-provider | The class used by clients, AMs, and NMs to fail over to the Active RM. |
| yarn.client.failover-max-attempts | The maximum number of times the FailoverProxyProvider should attempt failover. |
| yarn.client.failover-sleep-base-ms | The sleep base, in milliseconds, used to calculate the delay between failovers. |
| yarn.client.failover-sleep-max-ms | The maximum sleep time, in milliseconds, between failovers. |
| yarn.client.failover-retries | The number of retries per attempt to connect to a ResourceManager. |
| yarn.client.failover-retries-on-socket-timeouts | The number of retries per attempt to connect to a ResourceManager on socket timeouts. |

Sample Configuration

The following is the minimal configuration required to set up RM failover.

 <property>
   <name>yarn.resourcemanager.ha.enabled</name>
   <value>true</value>
 </property>
 <property>
   <name>yarn.resourcemanager.cluster-id</name>
   <value>cluster1</value>
 </property>
 <property>
   <name>yarn.resourcemanager.ha.rm-ids</name>
   <value>rm1,rm2</value>
 </property>
 <property>
   <name>yarn.resourcemanager.hostname.rm1</name>
   <value>master1</value>
 </property>
 <property>
   <name>yarn.resourcemanager.hostname.rm2</name>
   <value>master2</value>
 </property>
 <property>
   <name>yarn.resourcemanager.zk-address</name>
   <value>zk1:2181,zk2:2181,zk3:2181</value>
 </property>
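
A configuration like the sample above can be sanity-checked before it is deployed. The following is a rough sketch, assuming the properties sit inside the usual configuration root element of yarn-site.xml and that property names are written in lowercase, as Hadoop expects. The helpers load_properties and check_rm_ha are hypothetical, and the check only covers the hostname form, not individually configured service addresses.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def load_properties(xml_text: str) -> Dict[str, str]:
    """Collect name/value pairs from a Hadoop-style configuration document."""
    props: Dict[str, str] = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def check_rm_ha(props: Dict[str, str]) -> List[str]:
    """Return a list of problems with the minimal RM HA settings (empty if none)."""
    problems: List[str] = []
    if props.get("yarn.resourcemanager.ha.enabled", "").lower() != "true":
        problems.append("yarn.resourcemanager.ha.enabled is not true")
    for key in ("yarn.resourcemanager.cluster-id", "yarn.resourcemanager.zk-address"):
        if not props.get(key):
            problems.append(f"{key} is not set")
    raw_ids = props.get("yarn.resourcemanager.ha.rm-ids", "")
    rm_ids = [rm_id.strip() for rm_id in raw_ids.split(",") if rm_id.strip()]
    if len(rm_ids) < 2:
        problems.append("yarn.resourcemanager.ha.rm-ids should list at least two RMs")
    for rm_id in rm_ids:
        # Assumption: hostnames are used rather than per-service addresses.
        key = f"yarn.resourcemanager.hostname.{rm_id}"
        if not props.get(key):
            problems.append(f"{key} is not set")
    return problems


if __name__ == "__main__":
    with open("yarn-site.xml", encoding="utf-8") as handle:
        for problem in check_rm_ha(load_properties(handle.read())):
            print(problem)
```

An empty result only means the minimal properties are present and consistent with each other; it does not prove that the hosts or the ZooKeeper quorum are reachable.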

Admin commands

The "yarn rmadmin" command has a few HA-specific options for checking the health and state of an RM and for transitioning it between Active and Standby. The HA commands take the RM service ID, as set by the yarn.resourcemanager.ha.rm-ids property, as their argument.

$ yarn rmadmin -getServiceState rm1
active

$ yarn rmadmin -getServiceState rm2
standby

If automatic failover is enabled, you cannot use the manual transition commands.

$ yarn rmadmin -transitionToStandby rm1
Automatic failover is enabled for [email protected]
Refusing to manually manage HA state, since it could cause
a split-brain scenario or other incorrect state.
If you are very sure you know what you are doing, please
specify the --forcemanual flag.

See YarnCommands for more details.

ResourceManager Web UI Services

Assuming a standby RM is up and running, the Standby RM automatically redirects all web requests to the Active RM, except for the RM's own "About" page.

Web Services

Assuming a standby RM is up and running, the RM web services described in ResourceManager REST APIs are automatically redirected to the Active RM when they are invoked on the Standby RM.

"Hadoop learning" Apache Hadoop ResourceManager HA

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.