Background
Over the last couple of days I have been upgrading the online Spark 2.2.1 Thrift Server service, so I have been watching how it runs closely, paying particular attention to failed jobs. Today I noticed that tasks on one machine were failing at an unusually high rate, all with FetchFailedException. In the past I would have assumed this was resource contention: an executor dying from insufficient memory, causing the block fetches to fail. A closer look today turned up a different cause.
Here is the tracking process. 1. First, the error shown on the Spark web UI:
```
FetchFailed(BlockManagerId(149, hadoop848.bx.com, 11681, None), shuffleId=135, mapId=12, reduceId=154, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to hadoop848.bx.com/10.88.69.188:11681
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:513)
```
2. Then open the executor's stderr page and look at the task log:
```
17/12/11 11:42:02 ERROR RetryingBlockFetcher: Exception while beginning fetch of 6 outstanding blocks (after 1 retries)
java.io.IOException: Failed to connect to hadoop972.bx.com/10.87.112.82:15733
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
    at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:97)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.lambda$initiateRetry$0(RetryingBlockFetcher.java:169)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:745)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: hadoop972.bx.com/10.87.112.82:15733
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(Sing...
```
The key point in the log is the connection timeout, and the retry timed out as well.
3. Log in to HADOOP244.BX and use top to check per-process resource usage. A process with PID 95479 had been running with CPU usage consistently above 100%.
4. Then use jps to see what the JVM processes are. It turned out that PID 95479 was a Tez job that had been hogging a large amount of CPU for a long time, so other processes (such as the Spark executor in this case) had to wait a very long time for CPU time; the connections timed out and the Spark job failed.
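As a side note, what jps shows can also be obtained through the JDK attach API. The sketch below is my own illustration, not part of the original troubleshooting: it lists the local JVM PIDs with their main class / command line, which is how the CPU-hungry PID from top gets matched to a concrete job. On Java 8 it needs tools.jar on the classpath.

```scala
import scala.collection.JavaConverters._
import com.sun.tools.attach.VirtualMachine // attach API, ships with the JDK (tools.jar on Java 8)

// Roughly what jps does: list local JVMs with their PIDs and display names,
// so the PID seen in top can be matched to a concrete job (here, the Tez container).
object ListJvms {
  def main(args: Array[String]): Unit = {
    VirtualMachine.list().asScala.foreach { vmd =>
      println(f"${vmd.id()}%-8s ${vmd.displayName()}")
    }
  }
}
```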
Summary: We should enable a CGroups-style resource isolation feature in YARN to keep a single process from monopolizing resources for long stretches, so that one misbehaving job does not drag down others, and we should also raise the relevant timeouts so jobs are more robust to a busy compute and network environment.
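On the Spark side, one mitigation in the same spirit is to give shuffle fetches more retries and a longer network timeout. Below is a minimal sketch using standard Spark configuration properties; the values are illustrative, not the ones we finally settled on.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Illustrative values only -- tune to your own cluster.
val conf = new SparkConf()
  // Default network timeout; shuffle connection timeouts fall back to this value.
  .set("spark.network.timeout", "300s")
  // Number of retries for a failed shuffle block fetch (default 3).
  .set("spark.shuffle.io.maxRetries", "6")
  // Wait between retries; the total retry window is roughly maxRetries * retryWait.
  .set("spark.shuffle.io.retryWait", "10s")

val spark = SparkSession.builder()
  .config(conf)
  .getOrCreate()
```

For the Thrift Server these settings would normally go into spark-defaults.conf or be passed as --conf options when starting it, rather than being set in code. The CGroups side is NodeManager configuration (for example, switching the LinuxContainerExecutor resource handler to CgroupsLCEResourcesHandler), which I won't reproduce here.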