1. Problem description
In our YARN 2.0 cluster, the active ResourceManager (on master2) went down, triggering a ResourceManager failover to the standby.
2. Problem analysis
1) Check the ResourceManager log on master2:
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=warehouse OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1466451117456_12139 CONTAINERID=container_1466451117456_12139_02_000001
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Updating application attempt appattempt_1466451117456_12139_000002 with final state: FAILED, and exit status: -100
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1466451117456_12139_000002 State change from ALLOCATED to FINAL_SAVING
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1466451117456_12139_000002
2016-06-26 12:35:41,504 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type CONTAINER_EXPIRED to the scheduler
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1664)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:1231)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1117)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:114)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:686)
    at java.lang.Thread.run(Thread.java:724)
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Application finished, removing password for appattempt_1466451117456_12139_000002
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1466451117456_12139_000002 State change from FINAL_SAVING to FAILED
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: The number of failed attempts is 0. The max attempts is 2
2016-06-26 12:35:41,505 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1466451117456_12139_000003
2016-06-26 12:35:41,505 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1466451117456_12139_000003 State change from NEW to SUBMITTED
As the log shows, an NPE inside CapacityScheduler caused the ResourceManager to exit. The exit itself is a deliberate fail-fast safety mechanism: once the scheduler hits an unrecoverable exception, keeping the ResourceManager alive would leave it running but unusable, so it terminates and lets the standby take over.
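For context, the dispatch loop behind this behavior follows a common fail-fast pattern, sketched below. This is an illustrative reconstruction, not the actual Hadoop source; the class name and structure are assumptions:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal sketch of a fail-fast event loop (illustrative, not Hadoop code).
public final class FailFastEventLoop implements Runnable {
  private final BlockingQueue<Runnable> events = new LinkedBlockingQueue<>();

  public void post(Runnable event) {
    events.add(event);
  }

  @Override
  public void run() {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        // Each event stands in for something like scheduler.handle(event)
        events.take().run();
      }
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt(); // normal shutdown path
    } catch (Throwable t) {
      // Fatal: a half-broken scheduler is worse than a dead process,
      // since HA failover can only promote the standby once this
      // process is actually gone.
      t.printStackTrace();
      System.exit(-1);
    }
  }
}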
2) The suspected cause is CapacityScheduler's asynchronous scheduling. The relevant source is the schedule() method of org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
static void schedule(CapacityScheduler cs) {
  // First randomize the start point
  int current = 0;
  Collection<FiCaSchedulerNode> nodes = cs.nodeTracker.getAllNodes();
  int start = random.nextInt(nodes.size());
  // While this loop runs, nodes may be modified concurrently by other
  // threads (this is the suspected source of the inconsistent state)
  for (FiCaSchedulerNode node : nodes) {
    if (current++ >= start) {
      cs.allocateContainersToNode(node);
    }
  }
  // Now, just get everyone to be safe
  for (FiCaSchedulerNode node : nodes) {
    cs.allocateContainersToNode(node);
  }
  try {
    Thread.sleep(cs.getAsyncScheduleInterval());
  } catch (InterruptedException e) {}
}
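To make the hazard in those loops concrete, here is a minimal, hypothetical sketch (Node and allocateContainersTo are illustrative stand-ins, not YARN classes). Iterating a live, shared collection while another thread mutates it can throw ConcurrentModificationException, or can hand the scheduler a node whose state has already been torn down, which is exactly the kind of inconsistency that later surfaces as an NPE:

import java.util.ArrayList;
import java.util.Collection;

// Hypothetical stand-ins; only the iteration pattern matters here.
class Node { String name; Node(String n) { name = n; } }

public class IterationRace {
  // Unsafe: iterates the live, shared collection. If another thread
  // removes a node (e.g. on node loss) mid-loop, the iterator may fail
  // or the callee may operate on already-released state.
  static void scheduleUnsafe(Collection<Node> liveNodes) {
    for (Node n : liveNodes) {
      allocateContainersTo(n);
    }
  }

  // Safer: snapshot first, so the copy cannot change under the iterator.
  // The snapshot can still go stale, so the callee must also tolerate
  // nodes that have since disappeared.
  static void scheduleSafer(Collection<Node> liveNodes) {
    for (Node n : new ArrayList<>(liveNodes)) {
      allocateContainersTo(n);
    }
  }

  static void allocateContainersTo(Node n) { /* placeholder */ }
}

A defensive snapshot narrows the window but does not eliminate it, so the fix applied in this incident is simply to turn the asynchronous path off, as described below.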
3. Solution
Modify capacity-scheduler.xml to disable asynchronous scheduling:
<property>
  <name>yarn.scheduler.capacity.schedule-asynchronously.enable</name>
  <value>false</value>
</property>
This change only takes effect after the ResourceManager is restarted.
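If you want to sanity-check what the restarted ResourceManager will pick up, the flag can be read back with Hadoop's Configuration API. A minimal sketch, assuming capacity-scheduler.xml is on the application classpath:

import org.apache.hadoop.conf.Configuration;

public class CheckAsyncScheduling {
  public static void main(String[] args) {
    // Load only capacity-scheduler.xml (assumed to be on the classpath),
    // skipping the default Hadoop resources.
    Configuration conf = new Configuration(false);
    conf.addResource("capacity-scheduler.xml");
    boolean asyncEnabled = conf.getBoolean(
        "yarn.scheduler.capacity.schedule-asynchronously.enable", false);
    System.out.println("Asynchronous scheduling enabled: " + asyncEnabled);
  }
}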
This article is from the "Scattered People" blog; if you repost it, please keep the source link: http://zouqingyun.blog.51cto.com/782246/1878530
Summary: an NPE in CapacityScheduler made the ResourceManager exit abnormally, which is what triggered the failover.