Investigating a Spark ERROR: org.apache.spark.shuffle.FetchFailedException

Background

Over the past two days we upgraded our Spark 2.2.1 Thrift server deployment, so I have been watching the service closely, with particular attention to failed jobs. Today I noticed that tasks on one machine were failing at an unusually high rate with FetchFailedException. In the past I would have blamed resource contention: an executor runs out of memory and dies, so block fetches against it fail. A closer look today revealed a different cause.

Here is the investigation process.

1) First, the Spark web UI showed this error:

```
FetchFailed(BlockManagerId(149, hadoop848.bx.com, 11681, None), shuffleId=135, mapId=12, reduceId=154,
  message=org.apache.spark.shuffle.FetchFailedException: Failed to connect to hadoop848.bx.com/10.88.69.188:11681
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:513)
    at 
```
2) Then I opened the executor's stderr page to check the task log:

```
17/12/11 11:42:02 ERROR RetryingBlockFetcher: Exception while beginning fetch of 6 outstanding blocks (after 1 retries)
java.io.IOException: Failed to connect to hadoop972.bx.com/10.87.112.82:15733
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
    at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:97)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.lambda$initiateRetry$0(RetryingBlockFetcher.java:169)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:745)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: hadoop972.bx.com/10.87.112.82:15733
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(Sing
```
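A quick sanity check worth running at this point is whether the fetching node can reach the remote executor's shuffle port at all. A minimal sketch, using the host and port from the log above (assumes nc and ping are available on the node):

```
# Can this node open a TCP connection to the remote executor's shuffle port?
nc -vz -w 5 hadoop972.bx.com 15733   # TCP connect test with a 5-second timeout
ping -c 3 hadoop972.bx.com           # basic reachability and latency check
```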

Either way, the key point of the log is the connection timeout: even after multiple retries, the connection still timed out.

3) I logged in to hadoop244.bx and used top to check per-process resource usage:


There was a process with PID 95479 whose CPU usage stayed above 100%.

4) Then I ran jps to see what each Java process was:


It turned out that PID 95479 belonged to a Tez job. This job had been hogging a large share of CPU for a long time, so other processes on the machine (including the Spark executor in this case) had to wait a long time for CPU cycles, which caused the connections to time out and the Spark job to fail.
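For reference, a minimal sketch of the commands behind steps 3 and 4 (the -o flag assumes a reasonably recent procps-ng top; the PID is the one from this incident):

```
# top in batch mode, one iteration, sorted by CPU usage
top -b -n 1 -o %CPU | head -n 15
# Map the hot PID to a JVM application (full main class plus JVM arguments)
jps -lv | grep 95479
```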

Summary: YARN should adopt cgroup-style resource limits, so that a single process cannot occupy resources for long stretches and one abnormal job cannot drag down others. At the same time, the timeout settings should be increased to make jobs more robust when the compute or network environment is in bad shape. Hedged sketches of both measures follow.
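For the isolation side, a sketch of the yarn-site.xml settings that enable CPU cgroups through the LinuxContainerExecutor. Property names are from Hadoop 2.x and should be verified against the cluster's Hadoop version; with strict-resource-usage enabled, each container is hard-capped at its allocated vcores:

```
<!-- Sketch: enable CPU cgroups so one container cannot monopolize the CPU.
     Verify property names and values against your Hadoop version. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
  <value>/hadoop-yarn</value>
</property>
<property>
  <!-- Hard-cap each container at its allocated vcores -->
  <name>yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage</name>
  <value>true</value>
</property>
```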
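For the timeout side, a minimal sketch of restarting the Thrift server with more generous shuffle-fetch settings; the values are illustrative starting points, not tuned recommendations:

```
# Raise shuffle-fetch retries, the wait between retries, and the
# general network timeout (Spark defaults: 3, 5s, 120s).
./sbin/start-thriftserver.sh \
  --conf spark.shuffle.io.maxRetries=10 \
  --conf spark.shuffle.io.retryWait=30s \
  --conf spark.network.timeout=300s
```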
