Background
We upgraded to and rolled out the Spark 2.2.1 thriftserver service over the last two days, so I have been watching how it runs fairly closely, failed jobs in particular. Today I noticed that the task failure rate on one machine was unusually high, with tasks failing with FetchFailedException. In the past I would have chalked this up to resource contention: an executor gets killed for lack of memory, and block fetches from it then fail. A closer look today revealed another cause.
Here is how I tracked it down: 1) First, the error shown in the Spark web UI:
```
FetchFailed(BlockManagerId(149, hadoop848.bx.com, 11681, None), shuffleId=135, mapId=12, reduceId=154, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to hadoop848.bx.com/10.88.69.188:11681
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:513)
    at
```
2) Then I went to the task's stderr page to look at the task log:
```
17/12/11 11:42:02 ERROR RetryingBlockFetcher: Exception while beginning fetch of 6 outstanding blocks (after 1 retries)
java.io.IOException: Failed to connect to hadoop972.bx.com/10.87.112.82:15733
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
    at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:97)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.lambda$initiateRetry$0(RetryingBlockFetcher.java:169)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:745)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: hadoop972.bx.com/10.87.112.82:15733
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(Sing
```
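The "(after 1 retries)" in the first line is Spark's RetryingBlockFetcher at work: how many times it retries a failed fetch and how long it waits between attempts are governed by the shuffle I/O settings. A minimal spark-defaults.conf sketch showing the stock Spark 2.x defaults (not this cluster's actual values):

```
# Shuffle fetch retry behaviour (values shown are the Spark 2.x defaults)
spark.shuffle.io.maxRetries   3     # retries attempted by RetryingBlockFetcher before the fetch fails
spark.shuffle.io.retryWait    5s    # pause between retry attempts
spark.network.timeout         120s  # fallback timeout for network interactions, incl. shuffle connects
```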
The key point in the log is the connection timeout, and that it still timed out after several retries.
3) Log in to hadoop244.bx and use top to check per-process resource usage.
One process, PID 95479, had been sitting above 100% CPU the whole time. 4) Then run jps to see which Java process it was.
It turned out that PID 95479 was a Tez job. That job had been hogging a large amount of CPU for a long time, so other processes on the node (the Spark process in this case) had to wait a long time to get CPU time, the connections timed out, and the Spark job failed.
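For completeness, steps 3 and 4 amount to something like the sketch below (hostname and PID are the ones from this incident; the exact flags may vary with your top/procps version):

```
# step 3: batch-mode top, sorted by CPU by default, to spot the runaway process
top -b -n 1 | head -n 20              # PID 95479 shows sustained CPU usage above 100%

# step 4: map that PID back to a JVM and its job
jps -lvm | grep 95479                 # main class plus JVM args (the YARN container id is usually visible here)
ps -o pid,pcpu,etime,args -p 95479    # how long it has been running and its full command line
```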
Summary: YARN should be given a cgroup-style resource-limiting mechanism so that a single process cannot monopolize resources for long stretches and a misbehaving job cannot drag down other jobs; at the same time, the timeouts should be increased to make jobs more robust when compute or network conditions are poor.
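As a rough illustration of both recommendations: on the YARN side this means switching the NodeManager to the LinuxContainerExecutor with the cgroups resource handler (which also requires a properly configured container-executor.cfg and cgroup mount on every node, omitted here), and on the Spark side it means raising the network timeout. Both snippets are sketches with illustrative values, not our production configuration:

```
<!-- yarn-site.xml: enforce per-container CPU limits via cgroups -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <!-- hard-cap each container at its vcore share instead of letting it burst indefinitely -->
  <name>yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage</name>
  <value>true</value>
</property>
```

```
# spark-defaults.conf: widen the network timeout so transient CPU/network pressure
# does not immediately turn into FetchFailedException (300s is an illustrative value)
spark.network.timeout   300s
```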