Spark ERROR: org.apache.spark.shuffle.FetchFailedException Problem Tracking

Source: Internet
Author: User
Tags: shuffle, CPU usage

Background

Over the past two days I upgraded our online Spark 2.2.1 Thrift Server service, and during the rollout I paid close attention to failed jobs. Today I noticed that the task failure rate on one machine was exceptionally high, with FetchFailedException errors. In the past I would have assumed this was resource contention: an executor dying from insufficient memory, causing block fetches to fail. Looking more closely today, I found a different cause.

Here is the tracking process.

1. First, look at the error shown in the Spark Web UI:

    FetchFailed(BlockManagerId(149, hadoop848.bx.com, 11681, None), shuffleId=135, mapId=12, reduceId=154, message=
    org.apache.spark.shuffle.FetchFailedException: Failed to connect to hadoop848.bx.com/10.88.69.188:11681
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:513)
2. Then open the executor's stderr page to inspect the task log:

    17/12/11 11:42:02 ERROR RetryingBlockFetcher: Exception while beginning fetch of 6 outstanding blocks (after 1 retries)
    java.io.IOException: Failed to connect to hadoop972.bx.com/10.87.112.82:15733
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
        at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:97)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.lambda$initiateRetry$0(RetryingBlockFetcher.java:169)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: hadoop972.bx.com/10.87.112.82:15733
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(Sing

The key point in the log is the connection timeout, and the retry still timed out.

3. Log in to HADOOP244.BX and use top to observe per-process resource usage:
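The screenshot of the top session is not preserved, but the check can be sketched as follows. This is a generic diagnostic sketch, not the author's exact commands: ps sorts processes by CPU so a runaway process (like PID 95479 here) appears at the head of the list.

```shell
# List the top CPU consumers on the node: PID, %CPU, elapsed time, command.
# A process stuck above 100% CPU for a long elapsed time is the suspect.
ps -eo pid,pcpu,etime,comm --sort=-pcpu | head -n 5
```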


A process with PID 95479 had been running at over 100% CPU for a long time.

4. Then run jps to check the Java processes:
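The jps screenshot is also lost, so here is a sketch of the mapping step. The jps output below is a hypothetical sample (the Tez and Spark main-class names are real, but the second PID is invented for illustration); jps -l prints one "&lt;pid&gt; &lt;main class&gt;" pair per JVM, and grepping or awking for the hot PID from top identifies the offending job.

```shell
# Hypothetical sample of `jps -l` output on the node; on a live machine
# you would pipe `jps -l` directly instead of echoing a saved string.
jps_output='95479 org.apache.tez.dag.app.DAGAppMaster
12345 org.apache.spark.executor.CoarseGrainedExecutorBackend'

# Print the main class of the JVM whose PID matches the CPU hog from top.
echo "$jps_output" | awk '$1 == 95479 { print $2 }'
# → org.apache.tez.dag.app.DAGAppMaster
```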


This shows that PID 95479 corresponds to a Tez job that had been monopolizing CPU for a long time. Other processes on the node (such as the Spark executor in this case) had to wait a very long time to be scheduled, so their connections timed out and the Spark job failed.

Summary: YARN should use a cgroup-style resource restriction mechanism to limit long-term monopolization of resources by a single process, so that one abnormal job does not affect others. In addition, increasing the connection timeout makes jobs more robust against a degraded compute and network environment.
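As a sketch of the two mitigations: the Spark property names below are real Spark 2.2 configs, and the yarn-site.xml properties named in the comments are the standard Hadoop cgroups settings, but all values shown are illustrative, not tuned recommendations.

```shell
# Raise Spark's network timeout and shuffle-fetch retries so a briefly
# starved remote node does not immediately fail the job.
spark-submit \
  --conf spark.network.timeout=300s \
  --conf spark.shuffle.io.maxRetries=6 \
  --conf spark.shuffle.io.retryWait=10s \
  ...

# On the YARN side, CPU isolation via cgroups is enabled in yarn-site.xml:
#   yarn.nodemanager.container-executor.class =
#       org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor
#   yarn.nodemanager.linux-container-executor.resources-handler.class =
#       org.apache.hadoop.yarn.util.CgroupsLCEResourcesHandler
# which caps each container's CPU share instead of letting one job
# monopolize the node.
```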
