A Case of a YARN RM Crash


菜菜光 2014-07-08 22:59:06 Category: hadoop

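© Copyright belongs to the author: this is an original work by 51CTO blog author 菜菜光. Please contact the author for permission to repost; otherwise legal liability will be pursued.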

Today we received an alert from the production ResourceManager (RM). The RM log showed the following error:

2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:xxxx:53356 Timed out after 600 secs
2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node xxxx:53356 as it is now LOST
2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: xxxx:53356 Node Transitioned from UNHEALTHY to LOST
2014-07-08 13:22:54,118 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_REMOVED to the scheduler
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeNode(FairScheduler.java:715)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:974)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:108)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:378)
        at java.lang.Thread.run(Thread.java:662)
2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
2014-07-08 13:22:54,119 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 1000
2014-07-08 13:22:54,119 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 2000

This is a known bug, tracked as YARN-502: https://issues.apache.org/jira/browse/YARN-502

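According to the bug description, it can be triggered when the RM removes a NodeManager (NM) that has been marked UNHEALTHY: the node was already removed from the scheduler the first time, so when a later transition (here UNHEALTHY to LOST) sends another NODE_REMOVED event, the second removal fails with the NullPointerException shown above.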
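The failure mode can be sketched in a few lines. This is a minimal illustration, assuming the scheduler keeps a map of tracked nodes and dereferences the lookup result without a null check; ToyScheduler and DoubleRemoveDemo are hypothetical names and this is not FairScheduler's actual code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, heavily simplified stand-in for a scheduler's node bookkeeping.
// It is NOT FairScheduler's real code; it only illustrates the double-removal failure.
class ToyScheduler {
    // nodeId -> memory in MB (made-up payload for the illustration)
    private final Map<String, Integer> nodes = new HashMap<>();
    private int clusterMemoryMb = 0;

    void addNode(String nodeId, int memoryMb) {
        nodes.put(nodeId, memoryMb);
        clusterMemoryMb += memoryMb;
    }

    // Unsafe removal: assumes the node is still tracked.
    void removeNode(String nodeId) {
        Integer memoryMb = nodes.remove(nodeId); // returns null on the second call
        clusterMemoryMb -= memoryMb;             // unboxing null throws NullPointerException
    }

    int getClusterMemoryMb() {
        return clusterMemoryMb;
    }
}

class DoubleRemoveDemo {
    // Returns true if removing the same node twice throws a NullPointerException.
    static boolean secondRemoveThrows() {
        ToyScheduler scheduler = new ToyScheduler();
        scheduler.addNode("xxxx:53356", 8192);
        scheduler.removeNode("xxxx:53356"); // first removal (node became UNHEALTHY)
        try {
            scheduler.removeNode("xxxx:53356"); // second removal (node became LOST)
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("second remove throws NPE: " + secondRemoveThrows());
    }
}
```

The point is only that the second removal finds nothing to remove; where exactly the real scheduler dereferences the missing node is not shown here.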

Following the stack trace, let's look at the code:

org.apache.hadoop.yarn.server.resourcemanager.ResourceManager (inner class SchedulerEventDispatcher, per the stack trace):

  protected ResourceScheduler scheduler;

  private final class EventProcessor implements Runnable { // an EventProcessor thread is started to process events
    @Override
    public void run() {
      SchedulerEvent event;
      while (!stopped && !Thread.currentThread().isInterrupted()) {
        try {
          event = eventQueue.take(); // take an event from the event queue
        } catch (InterruptedException e) {
          LOG.error("Returning, interrupted : " + e);
          return; // TODO: Kill RM.
        }
        try {
          scheduler.handle(event); // handle the event
        } catch (Throwable t) { // catch anything thrown while handling the event
          // An error occurred, but we are shutting down anyway.
          // If it was an InterruptedException, the very act of
          // shutdown could have caused it and is probably harmless.
          if (stopped) {
            LOG.warn("Exception during shutdown: ", t);
            break;
          }
          LOG.fatal("Error in handling event type " + event.getType() // per the log, event.getType() here is NODE_REMOVED
              + " to the scheduler", t);
          if (shouldExitOnError
              && !ShutdownHookManager.get().isShutdownInProgress()) {
            LOG.info("Exiting, bbye..");
            System.exit(-1);
          }
        }
      }
    }
  }

Here we can see that shouldExitOnError controls whether the RM process exits when event handling fails.

  private boolean shouldExitOnError = false; // initially false

  @Override
  public synchronized void init(Configuration conf) { // read from the configuration during initialization
    this.shouldExitOnError =
        conf.getBoolean(Dispatcher.DISPATCHER_EXIT_ON_ERROR_KEY,
          Dispatcher.DEFAULT_DISPATCHER_EXIT_ON_ERROR); // both constants are defined in the Dispatcher interface
    super.init(conf);
  }
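The effect of the flag on the event loop can be sketched as follows. This is a simplified model under stated assumptions, not YARN's dispatcher: ToyEventProcessor and ExitOnErrorDemo are hypothetical names, events are plain strings, and the exit is represented by a callback instead of System.exit(-1) so the behavior can be observed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical miniature of the event loop shown above; names are illustrative, not YARN's.
class ToyEventProcessor {
    private final boolean exitOnError;
    private final Consumer<String> handler;
    private final Runnable exitAction; // stands in for System.exit(-1)
    private final List<String> handled = new ArrayList<>();

    ToyEventProcessor(boolean exitOnError, Consumer<String> handler, Runnable exitAction) {
        this.exitOnError = exitOnError;
        this.handler = handler;
        this.exitAction = exitAction;
    }

    // Processes events in order and returns how many were handled successfully.
    int process(List<String> events) {
        for (String event : events) {
            try {
                handler.accept(event);
                handled.add(event);
            } catch (RuntimeException e) {
                System.err.println("Error in handling event type " + event + ": " + e);
                if (exitOnError) {
                    exitAction.run();
                    break; // in the real loop the JVM would already be gone
                }
                // otherwise: log and keep processing later events
            }
        }
        return handled.size();
    }
}

class ExitOnErrorDemo {
    // Handler that fails on NODE_REMOVED, mimicking the NPE seen in the log.
    static void failingHandler(String event) {
        if (event.equals("NODE_REMOVED")) {
            throw new NullPointerException("node already removed");
        }
    }

    // Runs three events through the loop; exited[0] is set if the exit action fired.
    static int run(boolean exitOnError, boolean[] exited) {
        List<String> events = Arrays.asList("NODE_ADDED", "NODE_REMOVED", "NODE_UPDATE");
        ToyEventProcessor processor = new ToyEventProcessor(
            exitOnError, ExitOnErrorDemo::failingHandler, () -> exited[0] = true);
        return processor.process(events);
    }

    public static void main(String[] args) {
        boolean[] exited = {false};
        System.out.println("exitOnError=true:  handled=" + run(true, exited) + " exited=" + exited[0]);
        exited[0] = false;
        System.out.println("exitOnError=false: handled=" + run(false, exited) + " exited=" + exited[0]);
    }
}
```

With the flag on, the loop stops at the first failing event; with it off, the failure is only logged and later events are still processed, which keeps the daemon alive but possibly in an inconsistent state.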

The org.apache.hadoop.yarn.event.Dispatcher interface:

public interface Dispatcher {

  // Configuration to make sure dispatcher crashes but doesn't do system-exit in
  // case of errors. By default, it should be false, so that tests are not
  // affected. For all daemons it should be explicitly set to true so that
  // daemons can crash instead of hanging around.
  public static final String DISPATCHER_EXIT_ON_ERROR_KEY =
      "yarn.dispatcher.exit-on-error"; // the controlling parameter
  public static final boolean DEFAULT_DISPATCHER_EXIT_ON_ERROR = false; // false by default

  EventHandler getEventHandler();

  void register(Class<? extends Enum> eventType, EventHandler handler);
}

In the init method of the ResourceManager class:

  @Override
  public synchronized void init(Configuration conf) {
    this.conf = conf;
    this.conf.setBoolean(Dispatcher.DISPATCHER_EXIT_ON_ERROR_KEY, true); // for the RM the value is forced to true here, overriding the DEFAULT in Dispatcher
    // ... (rest of init not shown in this excerpt)

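In other words, the RM exits by default when the dispatcher hits an error.
Whether to exit on error is governed by the parameter yarn.dispatcher.exit-on-error, although since the init code above sets it to true unconditionally, a value placed in the configuration file would presumably be overwritten for the RM. Changing this behavior would have a broad impact in any case, so it is better left alone; applying the patch is the right fix.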
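To illustrate how the default in Dispatcher and the setBoolean call in the RM's init interact, here is a minimal sketch. ToyConf is a hypothetical stand-in and not Hadoop's Configuration class; only the key name and the default value are taken from the code above.

```java
import java.util.HashMap;
import java.util.Map;

// Toy key/value configuration for illustration only; NOT Hadoop's Configuration class.
class ToyConf {
    private final Map<String, String> values = new HashMap<>();

    void set(String key, String value) {
        values.put(key, value);
    }

    void setBoolean(String key, boolean value) {
        values.put(key, Boolean.toString(value));
    }

    // Returns the stored value, or defaultValue when the key was never set.
    boolean getBoolean(String key, boolean defaultValue) {
        String v = values.get(key);
        return v == null ? defaultValue : Boolean.parseBoolean(v);
    }
}

class ExitOnErrorConfigDemo {
    static final String KEY = "yarn.dispatcher.exit-on-error";
    static final boolean DEFAULT = false;

    // Value a component sees when nothing has set the key (for example in a unit test).
    static boolean withoutDaemonInit() {
        return new ToyConf().getBoolean(KEY, DEFAULT);
    }

    // Value seen after a daemon-style init that forces the key to true,
    // regardless of what was loaded from a file (valueFromFile may be null).
    static boolean afterDaemonInit(String valueFromFile) {
        ToyConf conf = new ToyConf();
        if (valueFromFile != null) {
            conf.set(KEY, valueFromFile);
        }
        conf.setBoolean(KEY, true); // mirrors the setBoolean call in the RM init excerpt
        return conf.getBoolean(KEY, DEFAULT);
    }

    public static void main(String[] args) {
        System.out.println("no daemon init:        " + withoutDaemonInit());
        System.out.println("daemon init, no file:  " + afterDaemonInit(null));
        System.out.println("daemon init, file=false: " + afterDaemonInit("false"));
    }
}
```

Under these assumptions the programmatic set always wins, which is why the daemon exits on dispatcher errors even though the declared default is false.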

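The official patch is fairly simple: when the RM deactivates an NM, it first checks the node's state so that the removal is not sent to the scheduler a second time: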

--- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
+++ hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
@@ -501,8 +501,13 @@ public DeactivateNodeTransition(NodeState finalState) {
     public void transition(RMNodeImpl rmNode, RMNodeEvent event) {
       // Inform the scheduler
       rmNode.nodeUpdateQueue.clear();
-      rmNode.context.getDispatcher().getEventHandler().handle(
-          new NodeRemovedSchedulerEvent(rmNode));
+      // If the current state is NodeState.UNHEALTHY
+      // Then node is already been removed from the
+      // Scheduler
+      if (!rmNode.getState().equals(NodeState.UNHEALTHY)) {
+        rmNode.context.getDispatcher().getEventHandler()
+          .handle( new NodeRemovedSchedulerEvent(rmNode));
+      }
       rmNode.context.getDispatcher().getEventHandler().handle(
           new NodesListManagerEvent(
               NodesListManagerEventType.NODE_UNUSABLE, rmNode));
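The guard in the patch can be modeled with a small state machine. This is a sketch under the assumption stated in the patch comment (a node that is already UNHEALTHY has already been removed from the scheduler); ToyNode and GuardDemo are hypothetical names, and scheduler events are recorded as strings instead of being dispatched.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of the patched transition; not the real RMNodeImpl.
class ToyNode {
    enum State { RUNNING, UNHEALTHY, LOST }

    private State state = State.RUNNING;
    // Events that would be sent to the scheduler, recorded for inspection.
    final List<String> schedulerEvents = new ArrayList<>();

    State getState() {
        return state;
    }

    // Assumed behavior: becoming UNHEALTHY already removes the node from the scheduler.
    void markUnhealthy() {
        schedulerEvents.add("NODE_REMOVED");
        state = State.UNHEALTHY;
    }

    // Patched deactivation: skip NODE_REMOVED if the node was already UNHEALTHY.
    void deactivate(State finalState) {
        if (state != State.UNHEALTHY) {
            schedulerEvents.add("NODE_REMOVED");
        }
        state = finalState;
    }
}

class GuardDemo {
    // RUNNING -> UNHEALTHY -> LOST: the scheduler should see exactly one removal.
    static int removalsForUnhealthyThenLost() {
        ToyNode node = new ToyNode();
        node.markUnhealthy();
        node.deactivate(ToyNode.State.LOST);
        return node.schedulerEvents.size();
    }

    // RUNNING -> LOST: the removal must still be sent once.
    static int removalsForRunningThenLost() {
        ToyNode node = new ToyNode();
        node.deactivate(ToyNode.State.LOST);
        return node.schedulerEvents.size();
    }

    public static void main(String[] args) {
        System.out.println("UNHEALTHY then LOST: " + removalsForUnhealthyThenLost() + " removal(s)");
        System.out.println("RUNNING then LOST:   " + removalsForRunningThenLost() + " removal(s)");
    }
}
```

Either path yields a single NODE_REMOVED, so the scheduler never sees a removal for a node it no longer tracks.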
