Background
Over the last couple of days I have been upgrading the online Spark 2.2.1 Thrift Server service, so I have been watching how it runs closely, paying particular attention to failed jobs. Today I noticed that tasks on one machine were failing at an unusually high rate, all with FetchFailedException. In the past I would have assumed this was resource contention: an executor dying from insufficient memory, causing the block fetches to fail. A closer look today turned up a different cause.
Here is the tracking process. 1. First, the error shown on the Spark web UI:
```
FetchFailed(BlockManagerId(149, hadoop848.bx.com, 11681, None), shuffleId=135, mapId=12, reduceId=154, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to hadoop848.bx.com/10.88.69.188:11681
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:513)
```
2. Then open the executor's stderr page and look at the task log:
```
17/12/11 11:42:02 ERROR RetryingBlockFetcher: Exception while beginning fetch of 6 outstanding blocks (after 1 retries)
java.io.IOException: Failed to connect to hadoop972.bx.com/10.87.112.82:15733
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
    at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:97)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.lambda$initiateRetry$0(RetryingBlockFetcher.java:169)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:745)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: hadoop972.bx.com/10.87.112.82:15733
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(Sing...
```
The key point in the log is the connection timeout, and the retry timed out as well.
3. Log in to HADOOP244.BX and use top to check per-process resource usage. A process with PID 95479 had been running with CPU usage consistently above 100%.
4. Then use jps to see what the JVM processes are. It turned out that PID 95479 was a Tez job that had been hogging a large amount of CPU for a long time, so other processes (such as the Spark executor in this case) had to wait a very long time for CPU time; the connections timed out and the Spark job failed.
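As a side note, what jps shows can also be obtained through the JDK attach API. The sketch below is my own illustration, not part of the original troubleshooting: it lists the local JVM PIDs with their main class / command line, which is how the CPU-hungry PID from top gets matched to a concrete job. On Java 8 it needs tools.jar on the classpath.

```scala
import scala.collection.JavaConverters._
import com.sun.tools.attach.VirtualMachine // attach API, ships with the JDK (tools.jar on Java 8)

// Roughly what jps does: list local JVMs with their PIDs and display names,
// so the PID seen in top can be matched to a concrete job (here, the Tez container).
object ListJvms {
  def main(args: Array[String]): Unit = {
    VirtualMachine.list().asScala.foreach { vmd =>
      println(f"${vmd.id()}%-8s ${vmd.displayName()}")
    }
  }
}
```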
Summary: We should enable a CGroups-style resource isolation feature in YARN to keep a single process from monopolizing resources for long stretches, so that one misbehaving job does not drag down others, and we should also raise the relevant timeouts so jobs are more robust to a busy compute and network environment.
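On the Spark side, one mitigation in the same spirit is to give shuffle fetches more retries and a longer network timeout. Below is a minimal sketch using standard Spark configuration properties; the values are illustrative, not the ones we finally settled on.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Illustrative values only -- tune to your own cluster.
val conf = new SparkConf()
  // Default network timeout; shuffle connection timeouts fall back to this value.
  .set("spark.network.timeout", "300s")
  // Number of retries for a failed shuffle block fetch (default 3).
  .set("spark.shuffle.io.maxRetries", "6")
  // Wait between retries; the total retry window is roughly maxRetries * retryWait.
  .set("spark.shuffle.io.retryWait", "10s")

val spark = SparkSession.builder()
  .config(conf)
  .getOrCreate()
```

For the Thrift Server these settings would normally go into spark-defaults.conf or be passed as --conf options when starting it, rather than being set in code. The CGroups side is NodeManager configuration (for example, switching the LinuxContainerExecutor resource handler to CgroupsLCEResourcesHandler), which I won't reproduce here.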