Background
We upgraded to and rolled out the Spark 2.2.1 thriftserver service over the last two days, so I have been watching how it runs fairly closely, failed jobs in particular. Today I noticed that the task failure rate on one machine was unusually high, with tasks failing with FetchFailedException. In the past I would have chalked this up to resource contention: an executor gets killed for lack of memory, and block fetches from it then fail. A closer look today revealed another cause.
Here is how I tracked it down: 1) First, the error shown in the Spark web UI:
```
FetchFailed(BlockManagerId(149, hadoop848.bx.com, 11681, None), shuffleId=135, mapId=12, reduceId=154, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to hadoop848.bx.com/10.88.69.188:11681
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:513)
    at
```
2) Then I went to the task's stderr page to look at the task log:
```
17/12/11 11:42:02 ERROR RetryingBlockFetcher: Exception while beginning fetch of 6 outstanding blocks (after 1 retries)
java.io.IOException: Failed to connect to hadoop972.bx.com/10.87.112.82:15733
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
    at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:97)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.lambda$initiateRetry$0(RetryingBlockFetcher.java:169)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:745)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: hadoop972.bx.com/10.87.112.82:15733
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(Sing
```
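The "(after 1 retries)" in the first line is Spark's RetryingBlockFetcher at work: how many times it retries a failed fetch and how long it waits between attempts are governed by the shuffle I/O settings. A minimal spark-defaults.conf sketch showing the stock Spark 2.x defaults (not this cluster's actual values):

```
# Shuffle fetch retry behaviour (values shown are the Spark 2.x defaults)
spark.shuffle.io.maxRetries   3     # retries attempted by RetryingBlockFetcher before the fetch fails
spark.shuffle.io.retryWait    5s    # pause between retry attempts
spark.network.timeout         120s  # fallback timeout for network interactions, incl. shuffle connects
```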
The key point in the log is the connection timeout, and that it still timed out after several retries.
3) Log in to hadoop244.bx and use top to check per-process resource usage.
One process, PID 95479, had been sitting above 100% CPU the whole time. 4) Then run jps to see which Java process it was.
It turned out that PID 95479 was a Tez job. That job had been hogging a large amount of CPU for a long time, so other processes on the node (the Spark process in this case) had to wait a long time to get CPU time, the connections timed out, and the Spark job failed.
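For completeness, steps 3 and 4 amount to something like the sketch below (hostname and PID are the ones from this incident; the exact flags may vary with your top/procps version):

```
# step 3: batch-mode top, sorted by CPU by default, to spot the runaway process
top -b -n 1 | head -n 20              # PID 95479 shows sustained CPU usage above 100%

# step 4: map that PID back to a JVM and its job
jps -lvm | grep 95479                 # main class plus JVM args (the YARN container id is usually visible here)
ps -o pid,pcpu,etime,args -p 95479    # how long it has been running and its full command line
```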
Summary: YARN should be given a cgroup-style resource-limiting mechanism so that a single process cannot monopolize resources for long stretches and a misbehaving job cannot drag down other jobs; at the same time, the timeouts should be increased to make jobs more robust when compute or network conditions are poor.
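As a rough illustration of both recommendations: on the YARN side this means switching the NodeManager to the LinuxContainerExecutor with the cgroups resource handler (which also requires a properly configured container-executor.cfg and cgroup mount on every node, omitted here), and on the Spark side it means raising the network timeout. Both snippets are sketches with illustrative values, not our production configuration:

```
<!-- yarn-site.xml: enforce per-container CPU limits via cgroups -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <!-- hard-cap each container at its vcore share instead of letting it burst indefinitely -->
  <name>yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage</name>
  <value>true</value>
</property>
```

```
# spark-defaults.conf: widen the network timeout so transient CPU/network pressure
# does not immediately turn into FetchFailedException (300s is an illustrative value)
spark.network.timeout   300s
```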