Spark Series 8: Spark Shuffle FetchFailedException Error Resolution

Source: Internet
Author: User
Tags: shuffle


The first half is from: http://blog.csdn.net/lsshlsw/article/details/51213610

The second half is my own optimization approach, offered for reference.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Errors caused by SparkSQL shuffle operations

org.apache.spark.shuffle.MetadataFetchFailedException:
Missing an output location for shuffle 0
org.apache.spark.shuffle.FetchFailedException:
Failed to connect to hostname/192.168.xx.xxx:50268


Errors caused by RDD shuffle operations

WARN TaskSetManager: Lost task 17.1 in stage 4.1 (TID 1386, spark050013): java.io.FileNotFoundException: /data04/spark/tmp/blockmgr-817d372f-c359-4a00-96dd-8f6554aa19cd/2f/temp_shuffle_e22e013a-5392-4edb-9874-a196a1dad97c

FetchFailed(BlockManagerId(6083b277-119a-49e8-8a49-3539690a2a3f-S155, spark050013, 8533), shuffleId=1, mapId=143, reduceId=3, message=
org.apache.spark.shuffle.FetchFailedException: Error in opening FileSegmentManagedBuffer{file=/data04/spark/tmp/blockmgr-817d372f-c359-4a00-96dd-8f6554aa19cd/0e/shuffle_1_143_0.data, offset=997061, length=112503}

(Author's note: for the principles behind shuffle, see my other summary: http://blog.csdn.net/zongzhiyuan/article/details/77676662)


The fixes below approach the problem from two angles: the volume of shuffle data, and the number of partitions used to process that data.

1. Reduce the amount of shuffle data

Consider whether a map-side join or a broadcast join can be used to avoid generating the shuffle altogether.
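A minimal sketch of a broadcast join in Spark (Scala), assuming an illustrative large log table and a small dimension table; the paths, table names, and the accountId join key are all hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join-demo").getOrCreate()

// Illustrative inputs: a large log table and a small dimension table.
val logs = spark.read.parquet("/path/to/logs")  // large; a normal join would shuffle it
val dim  = spark.read.parquet("/path/to/dim")   // small enough to ship to every executor

// broadcast() hints Spark to replicate `dim` to each executor and do a map-side join,
// so the large `logs` table is never shuffled for this join.
val joined = logs.join(broadcast(dim), Seq("accountId"))
```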

Filter out unnecessary data before the shuffle. For example, if the original data has 20 fields, selecting only the fields that actually need to be processed reduces the amount of shuffle data considerably.
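A small sketch of pruning columns before the shuffle, assuming a hypothetical wide table of which only two fields feed the aggregation (the path and column names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("column-pruning-demo").getOrCreate()

// Keep only the fields the aggregation actually needs; long, unused string
// columns then never travel through the shuffle exchange.
val slim   = spark.read.parquet("/path/to/logs").select("accountId", "eventType")
val counts = slim.groupBy("accountId", "eventType").count()
```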

2. SparkSQL and DataFrame join / group by operations (increase shuffle parallelism)

The number of shuffle partitions is controlled by spark.sql.shuffle.partitions (default 200). Increase this value according to the amount of shuffle data and the complexity of the computation.
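For example, the partition count can be raised either at submit time or inside the application; the value 800, the path, and the accountId column below are illustrative only:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("shuffle-partitions-demo").getOrCreate()

// Raise the SQL/DataFrame shuffle partition count (default is 200; 800 is illustrative).
// The same setting can also be passed at submit time: --conf spark.sql.shuffle.partitions=800
spark.conf.set("spark.sql.shuffle.partitions", "800")

val agg = spark.read.parquet("/path/to/logs")
  .groupBy("accountId")
  .count()   // this aggregation now shuffles into 800 partitions
```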

3. RDD operations such as join, groupByKey, and reduceByKey

The number of partitions used by shuffle read and reduce processing is controlled by spark.default.parallelism. By default it is the total number of cores of the running tasks (8 in Mesos fine-grained mode; the number of local cores in local mode). The official recommendation is 2-3 times the total number of cores.
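A sketch for the RDD side: the parallelism can be set globally via spark.default.parallelism, or per operation by passing an explicit partition count. The value 600, the input path, and the key-extraction logic are all illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rdd-parallelism-demo")
  // Global default partition count for RDD shuffles (600 is illustrative).
  .config("spark.default.parallelism", "600")
  .getOrCreate()
val sc = spark.sparkContext

// Build an illustrative pair RDD keyed by the first comma-separated field.
val pairs = sc.textFile("/path/to/logs").map(line => (line.split(",")(0), 1))

// reduceByKey also accepts an explicit partition count, overriding the global default.
val counts = pairs.reduceByKey(_ + _, 600)
```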

4. Increase executor memory

Increase the executor memory appropriately via spark.executor.memory.
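An illustrative way to set it (the 20g value is hypothetical and must fit within what the cluster manager actually allows); the setting has to be in place before the executors are launched:

```scala
import org.apache.spark.sql.SparkSession

// Equivalent at submit time: spark-submit --executor-memory 20g ...
val spark = SparkSession.builder()
  .appName("executor-memory-demo")
  .config("spark.executor.memory", "20g")  // illustrative value
  .getOrCreate()
```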

5. Check whether there is a data skew problem

Have null values been filtered out? Can a hot key be handled individually? Consider changing the partitioning rule for the data.
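A minimal sketch of the first two checks, with all paths, column names, and the hot-key value purely illustrative: drop null join keys, then handle one known hot key separately from the rest.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, col}

val spark = SparkSession.builder().appName("skew-check-demo").getOrCreate()
val logs = spark.read.parquet("/path/to/logs")
val dim  = spark.read.parquet("/path/to/dim")

// 1. Drop rows whose join key is null so they do not all land in one partition.
val nonNull = logs.filter(col("accountId").isNotNull)

// 2. Handle one known hot key on its own (broadcasting its small dimension slice),
//    join the remaining keys normally, then combine the two results.
val hotKey = "ACCOUNT_WITH_MOST_ROWS"   // illustrative
val hot  = nonNull.filter(col("accountId") === hotKey)
  .join(broadcast(dim.filter(col("accountId") === hotKey)), Seq("accountId"))
val rest = nonNull.filter(col("accountId") =!= hotKey)
  .join(dim, Seq("accountId"))
val joined = hot.union(rest)
```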

The above content comes from http://blog.csdn.net/lsshlsw/article/details/51213610


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The scenario I encountered:

Data volume: 1.7 billion log records.

Constraints: some fields are null but the logs cannot be discarded; 200 cores with 20 GB of memory per core; no additional resources available.

Problem Troubleshooting:

1. Many fields were pulled out of the DataFrame, and some of them are very long strings, resulting in a large volume of data.

2. Three fields were aggregated multiple times with reduceByKey; the statistics then had to be converted back to DataFrames and joined with the original data, for a total of 3 joins.

3. During those 3 joins, one key in one of the joins caused a data skew problem.

Solution:

1. Extract the fields needed for the join on their own, and process separately the fields that are not needed for the join or whose values are large, so that each shuffle does not carry a large amount of useless data.

2. In my scenario, the intermediate results are mainly used by downstream rules to filter out problematic accounts, so the filtering can be done in advance: if an aggregated intermediate value is already less than n (the thresholds in the subsequent rules are all greater than n), discard that intermediate result directly instead of letting it enter the shuffle phase of the later joins, further reducing the data volume.

3. For the data skew on one join key, split the original table into 3 parts with the randomSplit operator, perform the join on each smaller part separately (3 joins), and finally combine the 3 results with unionAll, as sketched below.
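A hedged sketch of steps 2 and 3 above, using hypothetical DataFrame names, paths, column names, and threshold; in Spark 1.x the combine method is unionAll, in 2.x+ it is union:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("split-join-demo").getOrCreate()

// Illustrative inputs: the original table and the aggregated intermediate statistics.
val original = spark.read.parquet("/path/to/original")
val stats    = spark.read.parquet("/path/to/stats")

// Step 2: drop intermediate results already below the rule threshold n
// (n = 5 here is purely illustrative) so they never enter the join shuffle.
val usefulStats = stats.filter(col("cnt") >= 5)

// Step 3: split the original table into 3 random pieces, join each piece separately,
// then combine the partial results (unionAll in Spark 1.x, union in 2.x+).
val Array(p1, p2, p3) = original.randomSplit(Array(1.0, 1.0, 1.0), seed = 42)
val joined = p1.join(usefulStats, Seq("accountId"))
  .union(p2.join(usefulStats, Seq("accountId")))
  .union(p3.join(usefulStats, Seq("accountId")))
```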


After these 3 steps my problem was solved. Of course, the right solution varies with the scenario and with personal habits. For other approaches to data skew, see my other summary: http://blog.csdn.net/zongzhiyuan/article/details/77676614


