1. Problem description
In our YARN 2.0 cluster, the active ResourceManager (on master2) went down, triggering a ResourceManager failover to the standby.
2. Problem analysis
1) Check the ResourceManager log on master2:
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=warehouse OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1466451117456_12139 CONTAINERID=container_1466451117456_12139_02_000001
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Updating application attempt appattempt_1466451117456_12139_000002 with final state: FAILED, and exit status: -100
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1466451117456_12139_000002 State change from ALLOCATED to FINAL_SAVING
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1466451117456_12139_000002
2016-06-26 12:35:41,504 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type CONTAINER_EXPIRED to the scheduler
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1664)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:1231)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1117)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:114)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:686)
    at java.lang.Thread.run(Thread.java:724)
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Application finished, removing password for appattempt_1466451117456_12139_000002
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1466451117456_12139_000002 State change from FINAL_SAVING to FAILED
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: The number of failed attempts is 0. The max attempts is 2
2016-06-26 12:35:41,505 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1466451117456_12139_000003
2016-06-26 12:35:41,505 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1466451117456_12139_000003 State change from NEW to SUBMITTED
As the log shows, an NPE inside CapacityScheduler caused the ResourceManager to exit. The exit itself is a deliberate fail-fast safety mechanism: once the scheduler hits an unrecoverable exception, keeping the ResourceManager alive would leave it running but unusable, so it terminates and lets the standby take over.
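For context, the dispatch loop behind this behavior follows a common fail-fast pattern, sketched below. This is an illustrative reconstruction, not the actual Hadoop source; the class name and structure are assumptions:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal sketch of a fail-fast event loop (illustrative, not Hadoop code).
public final class FailFastEventLoop implements Runnable {
  private final BlockingQueue<Runnable> events = new LinkedBlockingQueue<>();

  public void post(Runnable event) {
    events.add(event);
  }

  @Override
  public void run() {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        // Each event stands in for something like scheduler.handle(event)
        events.take().run();
      }
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt(); // normal shutdown path
    } catch (Throwable t) {
      // Fatal: a half-broken scheduler is worse than a dead process,
      // since HA failover can only promote the standby once this
      // process is actually gone.
      t.printStackTrace();
      System.exit(-1);
    }
  }
}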
2) The suspected cause is CapacityScheduler's asynchronous scheduling. The relevant source is the schedule() method of org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
static void schedule(CapacityScheduler cs) {
  // First randomize the start point
  int current = 0;
  Collection<FiCaSchedulerNode> nodes = cs.nodeTracker.getAllNodes();
  int start = random.nextInt(nodes.size());
  // While this loop runs, nodes may be modified concurrently by other
  // threads (this is the suspected source of the inconsistent state)
  for (FiCaSchedulerNode node : nodes) {
    if (current++ >= start) {
      cs.allocateContainersToNode(node);
    }
  }
  // Now, just get everyone to be safe
  for (FiCaSchedulerNode node : nodes) {
    cs.allocateContainersToNode(node);
  }
  try {
    Thread.sleep(cs.getAsyncScheduleInterval());
  } catch (InterruptedException e) {}
}
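To make the hazard in those loops concrete, here is a minimal, hypothetical sketch (Node and allocateContainersTo are illustrative stand-ins, not YARN classes). Iterating a live, shared collection while another thread mutates it can throw ConcurrentModificationException, or can hand the scheduler a node whose state has already been torn down, which is exactly the kind of inconsistency that later surfaces as an NPE:

import java.util.ArrayList;
import java.util.Collection;

// Hypothetical stand-ins; only the iteration pattern matters here.
class Node { String name; Node(String n) { name = n; } }

public class IterationRace {
  // Unsafe: iterates the live, shared collection. If another thread
  // removes a node (e.g. on node loss) mid-loop, the iterator may fail
  // or the callee may operate on already-released state.
  static void scheduleUnsafe(Collection<Node> liveNodes) {
    for (Node n : liveNodes) {
      allocateContainersTo(n);
    }
  }

  // Safer: snapshot first, so the copy cannot change under the iterator.
  // The snapshot can still go stale, so the callee must also tolerate
  // nodes that have since disappeared.
  static void scheduleSafer(Collection<Node> liveNodes) {
    for (Node n : new ArrayList<>(liveNodes)) {
      allocateContainersTo(n);
    }
  }

  static void allocateContainersTo(Node n) { /* placeholder */ }
}

A defensive snapshot narrows the window but does not eliminate it, so the fix applied in this incident is simply to turn the asynchronous path off, as described below.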
3. Solution
Modify capacity-scheduler.xml to disable asynchronous scheduling:
<property>
  <name>yarn.scheduler.capacity.schedule-asynchronously.enable</name>
  <value>false</value>
</property>
This change only takes effect after the ResourceManager is restarted.
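If you want to sanity-check what the restarted ResourceManager will pick up, the flag can be read back with Hadoop's Configuration API. A minimal sketch, assuming capacity-scheduler.xml is on the application classpath:

import org.apache.hadoop.conf.Configuration;

public class CheckAsyncScheduling {
  public static void main(String[] args) {
    // Load only capacity-scheduler.xml (assumed to be on the classpath),
    // skipping the default Hadoop resources.
    Configuration conf = new Configuration(false);
    conf.addResource("capacity-scheduler.xml");
    boolean asyncEnabled = conf.getBoolean(
        "yarn.scheduler.capacity.schedule-asynchronously.enable", false);
    System.out.println("Asynchronous scheduling enabled: " + asyncEnabled);
  }
}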
This article is from the "Scattered People" blog; if you repost it, please keep the source link: http://zouqingyun.blog.51cto.com/782246/1878530
Summary: an NPE in CapacityScheduler made the ResourceManager exit abnormally, which is what triggered the failover.