ResourceManager exits abnormally due to a CapacityScheduler NPE, triggering a failover

Source: Internet
Author: User
Tags: failover

First, the problem description

On YARN 2.0, the ResourceManager on Master2 went down, which caused a ResourceManager failover.

Second, the problem analysis

1) Check the ResourceManager log on Master2

2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=warehouse OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1466451117456_12139 CONTAINERID=container_1466451117456_12139_02_000001
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Updating application attempt appattempt_1466451117456_12139_000002 with final state: FAILED, and exit status: -100
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1466451117456_12139_000002 State change from ALLOCATED to FINAL_SAVING
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1466451117456_12139_000002
2016-06-26 12:35:41,504 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type CONTAINER_EXPIRED to the scheduler
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1664)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:1231)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1117)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:114)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:686)
        at java.lang.Thread.run(Thread.java:724)
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Application finished, removing password for appattempt_1466451117456_12139_000002
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1466451117456_12139_000002 State change from FINAL_SAVING to FAILED
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: The number of failed attempts is 0. The max attempts is 2
2016-06-26 12:35:41,505 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1466451117456_12139_000003
2016-06-26 12:35:41,505 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1466451117456_12139_000003 State change from NEW to SUBM

The log shows that an NPE in CapacityScheduler caused the ResourceManager to exit. The exit itself is a deliberate safety mechanism: it prevents a scheduler exception from leaving the ResourceManager running but persistently unavailable.
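The fail-fast behavior described here can be illustrated with a small, self-contained sketch. This is not Hadoop's code: the class FailFastProcessor, its methods, and the onFatal hook (which stands in for the process exit) are all hypothetical names chosen for illustration. The point is only that the first unhandled exception stops event processing instead of letting the loop limp on.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

/** Hypothetical, simplified fail-fast event loop; not Hadoop's implementation. */
public class FailFastProcessor {
    private final Queue<String> events = new ArrayDeque<>();
    private final Consumer<String> handler; // stands in for the scheduler
    private final Runnable onFatal;         // stands in for exiting the process

    public FailFastProcessor(Consumer<String> handler, Runnable onFatal) {
        this.handler = handler;
        this.onFatal = onFatal;
    }

    public void submit(String eventType) {
        events.add(eventType);
    }

    public int pending() {
        return events.size();
    }

    /** Processes queued events in order; returns false if a failure stopped it. */
    public boolean drain() {
        String eventType;
        while ((eventType = events.poll()) != null) {
            try {
                handler.accept(eventType);
            } catch (RuntimeException e) {
                // Fail fast: report, trigger the fatal hook, and stop processing.
                System.err.println("FATAL: error in handling event type " + eventType + ": " + e);
                onFatal.run();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        FailFastProcessor p = new FailFastProcessor(
            type -> {
                if (type.equals("CONTAINER_EXPIRED")) {
                    throw new NullPointerException("simulated scheduler bug");
                }
            },
            () -> System.err.println("Exiting (a real process would terminate here)"));
        p.submit("NODE_UPDATE");
        p.submit("CONTAINER_EXPIRED");
        p.submit("NODE_UPDATE");
        System.out.println("drained cleanly: " + p.drain() + ", unprocessed: " + p.pending());
    }
}
```

In an HA deployment, stopping the failed process in this way is what lets the standby ResourceManager take over, which matches the failover observed here.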

2) The likely cause is CapacityScheduler's asynchronous scheduling. The relevant source code (org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler) is as follows:

    static void schedule(CapacityScheduler cs) {
      // First randomize the start point
      int current = 0;
      Collection<FiCaSchedulerNode> nodes = cs.nodeTracker.getAllNodes();
      int start = random.nextInt(nodes.size());
      // Here, while the loop is running, nodes may have been modified by other threads
      for (FiCaSchedulerNode node : nodes) {
        if (current++ >= start) {
          cs.allocateContainersToNode(node);
        }
      }
      // Now, just get everyone to be safe
      for (FiCaSchedulerNode node : nodes) {
        cs.allocateContainersToNode(node);
      }
      try {
        Thread.sleep(cs.getAsyncScheduleInterval());
      } catch (InterruptedException e) {}
    }
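To make the suspected hazard concrete, here is a minimal sketch of the usual defensive pattern: take a private copy of the node collection and iterate over the copy, so that changes made by other threads cannot affect the loop in progress. This is not the fix Hadoop adopted and it is not confirmed to address this NPE. The class SnapshotScheduling and the method visitOrder are hypothetical, and node names are plain strings standing in for FiCaSchedulerNode. The sketch also assumes the live collection can be copied safely while it changes, as a ConcurrentHashMap view can.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch: iterate over a snapshot instead of a live collection. */
public class SnapshotScheduling {

    /** Returns every node exactly once, starting at a random offset. */
    static List<String> visitOrder(Collection<String> liveNodes) {
        // Copy first: later changes to liveNodes cannot affect this loop.
        List<String> snapshot = new ArrayList<>(liveNodes);
        List<String> visited = new ArrayList<>();
        if (snapshot.isEmpty()) {
            return visited; // nextInt(0) would throw IllegalArgumentException
        }
        int start = ThreadLocalRandom.current().nextInt(snapshot.size());
        for (int i = 0; i < snapshot.size(); i++) {
            // Wrap around so nodes before the start point are still visited.
            visited.add(snapshot.get((start + i) % snapshot.size()));
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, String> nodes = new ConcurrentHashMap<>();
        nodes.put("node1", "rack1");
        nodes.put("node2", "rack1");
        nodes.put("node3", "rack2");
        System.out.println(visitOrder(nodes.keySet()));
    }
}
```

A snapshot only protects the iteration itself. A node removed after the copy is taken can still be visited, so code that acts on each node must tolerate nodes that no longer exist.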

Third, the solution

Modify capacity-scheduler.xml to disable asynchronous scheduling:

<property>
  <name>yarn.scheduler.capacity.schedule-asynchronously.enable</name>
  <value>false</value>
</property>


This change requires a restart of the ResourceManager to take effect.
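Before restarting, it may be worth confirming that the file really carries the intended value. The sketch below uses only the JDK's built-in XML parser to look up a property in Hadoop-style configuration XML. The class AsyncSchedulingCheck and the method lookup are hypothetical helpers, not part of Hadoop, and the XML is passed as a string here to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: read one property from Hadoop-style configuration XML. */
public class AsyncSchedulingCheck {
    static final String KEY = "yarn.scheduler.capacity.schedule-asynchronously.enable";

    /** Returns the trimmed value for key, or null if the property is absent. */
    static String lookup(String xml, String key) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                NodeList names = prop.getElementsByTagName("name");
                NodeList values = prop.getElementsByTagName("value");
                if (names.getLength() == 0 || values.getLength() == 0) {
                    continue; // skip malformed entries
                }
                if (key.equals(names.item(0).getTextContent().trim())) {
                    return values.item(0).getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<configuration><property><name>" + KEY
            + "</name><value>false</value></property></configuration>";
        System.out.println(KEY + " = " + lookup(xml, KEY));
    }
}
```

To check a real file, read its contents into a string first and pass that to lookup.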

This article is from the "Scattered People" blog. Please keep this source link when reposting: http://zouqingyun.blog.51cto.com/782246/1878530
