Brief introduction
This guide gives an overview of High Availability (HA) for the YARN ResourceManager (RM) and details how to configure and use the feature. The RM is responsible for tracking the resources in a cluster and scheduling applications such as MapReduce jobs. Prior to Hadoop 2.4, the RM was a single point of failure in a YARN cluster. The HA feature adds redundancy in the form of an Active/Standby RM pair, eliminating this single point of failure.
Architecture
RM Failover
RM HA is implemented through an Active/Standby architecture: at any point in time, one RM is Active and the other RM is in Standby mode, waiting to take over if the Active RM fails. The transition to Active is triggered either by the administrator (via the CLI) or by the integrated failover controller (only when automatic failover is enabled).
Manual transitions and failover
When automatic failover is not enabled, the administrator must manually transition one of the RMs to Active. To fail over from one RM to the other, first transition the Active RM to Standby, and then transition the Standby RM to Active. All of this is done with the "yarn rmadmin" command.
Automatic failover
The RMs have an option to embed the ZooKeeper-based ActiveStandbyElector to decide which RM should be Active. When the Active RM goes down or becomes unresponsive, the other RM is automatically elected Active and takes over the cluster. Note that, unlike HDFS HA, there is no need to run a separate ZKFC daemon, because the ActiveStandbyElector embedded in the RMs acts as both failure detector and leader elector.
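As a rough illustration of the embedded elector described above, the following yarn-site.xml fragment turns automatic failover on explicitly. The property names are the ones listed in the configuration table of this guide. The ZooKeeper hostnames zk1 to zk3 are placeholders borrowed from the sample configuration, not required names.

```xml
<!-- Sketch only: both automatic-failover properties already default to
     enabled once HA is enabled, so setting them is optional. -->
<property>
  <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
  <value>true</value>
</property>
<!-- Placeholder ZooKeeper quorum used by the embedded elector. -->
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
```

Setting the first property to false instead leaves failover in the administrator's hands through yarn rmadmin.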
Client, ApplicationMaster, and NodeManager behavior on RM failover
When there are multiple RMs, the configuration (yarn-site.xml) used by clients and nodes is expected to list all of them. Clients, ApplicationMasters (AMs), and NodeManagers (NMs) try the RMs in round-robin order until they reach the Active RM. If the Active RM goes down, they resume polling all RMs until they find the new Active RM. This default retry logic is implemented in org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider. You can override it by implementing org.apache.hadoop.yarn.client.RMFailoverProxyProvider and setting the yarn.client.failover-proxy-provider property to the name of your class.
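A minimal sketch of how a custom failover provider might be wired in, assuming you have written your own implementation of RMFailoverProxyProvider. The class name com.example.yarn.CustomRMFailoverProxyProvider is hypothetical and used only for illustration; it does not ship with Hadoop.

```xml
<!-- Sketch only: the value below is a hypothetical class name.
     Leaving this property unset keeps the default
     org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider. -->
<property>
  <name>yarn.client.failover-proxy-provider</name>
  <value>com.example.yarn.CustomRMFailoverProxyProvider</value>
</property>
```

The class has to be on the classpath of every client, AM, and NM that reads this configuration, since each of them loads the provider itself.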
Recovering the previous Active RM's state
With ResourceManager Restart enabled, the RM that is promoted to Active loads the RM internal state and continues from where the previous Active RM left off, as far as the RM restart feature allows. A new attempt is spawned for each managed application previously submitted to the RM. Applications can checkpoint periodically to avoid losing work. The state store must be visible from both the Active and the Standby RM. Currently, there are two RMStateStore implementations for persistence: FileSystemRMStateStore and ZKRMStateStore. ZKRMStateStore implicitly allows write access to only a single RM at any point in time, so it is the recommended store for an HA cluster. When using ZKRMStateStore, no separate fencing mechanism is needed to handle a potential split-brain situation in which multiple RMs assume the Active role.
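For illustration, state recovery backed by ZKRMStateStore could be enabled with a fragment along these lines. The two property names come from the ResourceManager Restart feature rather than from this guide, so treat this as a sketch and check it against the documentation for your Hadoop version.

```xml
<!-- Sketch only: enable RM state recovery (ResourceManager Restart). -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<!-- Persist RM state in ZooKeeper, the store recommended above for HA. -->
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
```

ZKRMStateStore uses the ZooKeeper quorum given by yarn.resourcemanager.zk-address, the same setting the embedded leader election relies on.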
Deployment
Configuration
Most of the failover functionality can be tuned through configuration properties. The following is a list of the required and important ones. yarn-default.xml contains the complete list, together with the default value of each property. See also the ResourceManager Restart documentation for instructions on setting up the state store.
| Configuration Property | Description |
|---|---|
| yarn.resourcemanager.zk-address | Address of the ZooKeeper quorum. Used both for the state store and for embedded leader election. |
| yarn.resourcemanager.ha.enabled | Enable RM HA. |
| yarn.resourcemanager.ha.rm-ids | List of logical IDs for the RMs, for example "rm1,rm2". |
| yarn.resourcemanager.hostname.rm-id | For each rm-id, the hostname of the corresponding RM. Alternatively, each of the RM's service addresses can be set individually. |
| yarn.resourcemanager.ha.id | Identifies the RM within the cluster. This is optional; however, if it is set, the administrator must ensure that every RM has its own ID in its configuration. |
| yarn.resourcemanager.ha.automatic-failover.enabled | Enable automatic failover. By default, it is enabled only when HA is enabled. |
| yarn.resourcemanager.ha.automatic-failover.embedded | Use the embedded leader elector to pick the Active RM when automatic failover is enabled. By default, it is enabled only when HA is enabled. |
| yarn.resourcemanager.cluster-id | Identifies the cluster. Used by the elector to ensure that an RM does not take over as Active for another cluster. |
| yarn.client.failover-proxy-provider | The class used by clients, AMs, and NMs to fail over to the Active RM. |
| yarn.client.failover-max-attempts | The maximum number of times the FailoverProxyProvider attempts failover. |
| yarn.client.failover-sleep-base-ms | The sleep base, in milliseconds, used to calculate the delay between failovers. |
| yarn.client.failover-sleep-max-ms | The maximum sleep time, in milliseconds, between failovers. |
| yarn.client.failover-retries | The number of retries per attempt to connect to a ResourceManager. |
| yarn.client.failover-retries-on-socket-timeouts | The number of retries per attempt to connect to a ResourceManager on socket timeouts. |
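To make the client-side failover settings above more concrete, here is a hedged yarn-site.xml sketch. The numeric values are assumptions chosen purely for illustration. They are neither the shipped defaults nor recommendations; yarn-default.xml lists the actual defaults.

```xml
<!-- Sketch only: illustrative values, not defaults. -->
<property>
  <name>yarn.client.failover-max-attempts</name>
  <value>10</value>
</property>
<!-- Base sleep time in milliseconds between failovers. -->
<property>
  <name>yarn.client.failover-sleep-base-ms</name>
  <value>500</value>
</property>
<!-- Upper bound in milliseconds on the sleep time between failovers. -->
<property>
  <name>yarn.client.failover-sleep-max-ms</name>
  <value>15000</value>
</property>
```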
Sample Configuration
The following is the minimal configuration required to set up RM failover.
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>cluster1</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>master1</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>master2</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
Admin commands
The yarn rmadmin command has a few HA-specific options for checking the state of an RM and for transitioning it to Active or Standby. The HA commands take the RM service ID set by the yarn.resourcemanager.ha.rm-ids property as their argument.
$ yarn rmadmin -getServiceState rm1
active
$ yarn rmadmin -getServiceState rm2
standby
If automatic failover is enabled, you cannot use the manual transition commands. You can override this with the --forcemanual flag, but do so with caution.
$ yarn rmadmin -transitionToStandby rm1
Automatic failover is enabled for [email protected]
Refusing to manually manage HA state, since it could cause a split-brain scenario or other incorrect state.
If you are very sure you know what you are doing, please specify the --forcemanual flag.
See YarnCommands for more details.
ResourceManager Web UI Services
Assuming a standby RM is up and running, it automatically redirects all web requests to the Active RM, except for requests to its own "About" page.
Web Services
Assuming a standby RM is up and running, the RM web services described in ResourceManager REST APIs are automatically redirected to the Active RM when they are invoked on the standby RM.