application master 持續org.apache.hadoop.ipc.Client: Retrying connect to server

來源:互聯網
上載者:User

標籤:hadoop   yarn.client   nodemanger   application master   org.apache.hadoop.ipc.client;retrying connect to server   

一、問題現象

    某一個nodemanager退出後,導致 application master中出現大量的如下日誌,並且持續很長時間,application master才成功退出。

2016-06-24 09:32:35,596 INFO [ContainerLauncher #3] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)2016-06-24 09:32:35,596 INFO [ContainerLauncher #9] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)2016-06-24 09:32:35,597 INFO [ContainerLauncher #7] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)2016-06-24 09:32:36,455 INFO [ContainerLauncher #8] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)2016-06-24 09:32:36,539 INFO [ContainerLauncher #5] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)2016-06-24 09:32:36,539 INFO [ContainerLauncher #1] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)2016-06-24 09:32:36,539 INFO [ContainerLauncher #6] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)2016-06-24 09:32:36,539 INFO [ContainerLauncher #2] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)2016-06-24 09:32:36,539 INFO [ContainerLauncher #0] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)2016-06-24 09:32:36,596 INFO [ContainerLauncher #4] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)2016-06-24 09:32:36,597 INFO [ContainerLauncher #3] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)2016-06-24 09:32:36,597 INFO [ContainerLauncher #9] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS).......2016-06-24 12:57:52,328 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)2016-06-24 12:57:53,339 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)2016-06-24 12:58:04,357 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)2016-06-24 12:58:05,367 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)2016-06-24 12:58:06,378 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)2016-06-24 12:58:07,392 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)2016-06-24 12:58:08,399 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)2016-06-24 12:58:09,408 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)2016-06-24 12:58:10,417 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)2016-06-24 12:58:11,425 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)2016-06-24 12:58:12,434 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

1)dchadoop206上的nodemanager退出後(由於重啟),導致application master持續的去串連之前nodemanager上的container。顯然這些container是已經串連不上了。

2)最終經過非常長的時間大概3-4小時後,串連不上的異常才拋出,application master正常結束。

二、問題分析

這個問題主要涉及hadoop的rpc機制。首先看下面兩個配置參數

 #定義client串連到nodemanager的最大逾時時間,不是單次串連,而是經過多少時間串連不上nodemanager,則認為操作失敗 <property>        <name>yarn.client.nodemanager-connect.max-wait-ms</name>        <value>15*60*1000</value> </property> # 定義每次嘗試去串連nodemanager的時間間隔 <property>        <name>yarn.client.nodemanager-connect.retry-interval-ms</name>        <value>10*1000</value> </property>

    根據這兩個參數的定義,ApplicationMaster經過15分鐘仍然連不上nodemanager的container,會取消try connect。但觀察的情況是Application Master 需要等大約30分鐘,才取消try connect。主要原因在於hadoop 的rpc機制如下,首先ApplicationMaster 會根據上面的兩個參數,構造一個RetryUpToMaximumCountWithFixedSleep的重連策略,這個重連策略會通過以下方式計算

MaximumCount:yarn.client.nodemanager-connect.max-wait-ms/yarn.client.nodemanager-connect.retry-interval-ms=90次

而每次的RPC請求中,Client也有自己的重連策略,就是類似這樣的東東:

2016-06-24 09:32:36,455 INFO [ContainerLauncher #8] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)由兩個rpc參數控制,ipc.client.connect.max.retries=10和ipc.client.connect.retry.interval=1000ms控制

所以最終ApplicationMaster 放棄try connect的等待時間是:90*(10+10)=1800s

三、解決辦法

1)在提交map-reduce/hive sql/hive server2等用戶端機器修改yarn-site.xml的以下參數
2)hadoop命令列中通過-D設定該參數

 <property>        <name>yarn.client.nodemanager-connect.max-wait-ms</name>        <value>180000</value> </property>

這樣總的等待時間就是6分鐘。

注意事項

這個修改是不需要做任何重啟yarn組件操作的,是一個用戶端相關的操作!


本文出自 “散人” 部落格,請務必保留此出處http://zouqingyun.blog.51cto.com/782246/1881294

application master 持續org.apache.hadoop.ipc.Client: Retrying connect to server

聯繫我們

該頁面正文內容均來源於網絡整理,並不代表阿里雲官方的觀點,該頁面所提到的產品和服務也與阿里云無關,如果該頁面內容對您造成了困擾,歡迎寫郵件給我們,收到郵件我們將在5個工作日內處理。

如果您發現本社區中有涉嫌抄襲的內容,歡迎發送郵件至: info-contact@alibabacloud.com 進行舉報並提供相關證據,工作人員會在 5 個工作天內聯絡您,一經查實,本站將立刻刪除涉嫌侵權內容。

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.