Spark Series 8: Spark Shuffle FetchFailedException Error Resolution

Source: Internet
Author: User
Tags: shuffle


The first half is from: http://blog.csdn.net/lsshlsw/article/details/51213610

The second half is my own optimization approach, offered for reference.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Errors caused by SparkSQL shuffle operations

org.apache.spark.shuffle.MetadataFetchFailedException:
Missing an output location for shuffle 0
org.apache.spark.shuffle.FetchFailedException:
Failed to connect to hostname/192.168.xx.xxx:50268


Errors caused by RDD shuffle operations

WARN TaskSetManager: Lost task 17.1 in stage 4.1 (TID 1386, spark050013): java.io.FileNotFoundException: /data04/spark/tmp/blockmgr-817d372f-c359-4a00-96dd-8f6554aa19cd/2f/temp_shuffle_e22e013a-5392-4edb-9874-a196a1dad97c

FetchFailed(BlockManagerId(6083b277-119a-49e8-8a49-3539690a2a3f-S155, spark050013, 8533), shuffleId=1, mapId=143, reduceId=3, message=
org.apache.spark.shuffle.FetchFailedException: Error in opening FileSegmentManagedBuffer{file=/data04/spark/tmp/blockmgr-817d372f-c359-4a00-96dd-8f6554aa19cd/0e/shuffle_1_143_0.data, offset=997061, length=112503}

(Author's note: for the principles behind shuffle, see my other summary: http://blog.csdn.net/zongzhiyuan/article/details/77676662)


The fixes below approach the problem from two angles: the volume of shuffle data, and the number of partitions used to process that data.

1. Reduce the amount of shuffle data

Consider whether a map-side join or a broadcast join can be used to avoid generating the shuffle altogether.
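A minimal sketch of a broadcast join in Spark (Scala), assuming an illustrative large log table and a small dimension table; the paths, table names, and the accountId join key are all hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join-demo").getOrCreate()

// Illustrative inputs: a large log table and a small dimension table.
val logs = spark.read.parquet("/path/to/logs")  // large; a normal join would shuffle it
val dim  = spark.read.parquet("/path/to/dim")   // small enough to ship to every executor

// broadcast() hints Spark to replicate `dim` to each executor and do a map-side join,
// so the large `logs` table is never shuffled for this join.
val joined = logs.join(broadcast(dim), Seq("accountId"))
```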

Filter out unnecessary data before the shuffle. For example, if the original data has 20 fields, selecting only the fields that actually need to be processed reduces the amount of shuffle data considerably.
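A small sketch of pruning columns before the shuffle, assuming a hypothetical wide table of which only two fields feed the aggregation (the path and column names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("column-pruning-demo").getOrCreate()

// Keep only the fields the aggregation actually needs; long, unused string
// columns then never travel through the shuffle exchange.
val slim   = spark.read.parquet("/path/to/logs").select("accountId", "eventType")
val counts = slim.groupBy("accountId", "eventType").count()
```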

2. SparkSQL and DataFrame join / group by operations (increase shuffle parallelism)

The number of shuffle partitions is controlled by spark.sql.shuffle.partitions (default 200). Increase this value according to the amount of shuffle data and the complexity of the computation.
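For example, the partition count can be raised either at submit time or inside the application; the value 800, the path, and the accountId column below are illustrative only:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("shuffle-partitions-demo").getOrCreate()

// Raise the SQL/DataFrame shuffle partition count (default is 200; 800 is illustrative).
// The same setting can also be passed at submit time: --conf spark.sql.shuffle.partitions=800
spark.conf.set("spark.sql.shuffle.partitions", "800")

val agg = spark.read.parquet("/path/to/logs")
  .groupBy("accountId")
  .count()   // this aggregation now shuffles into 800 partitions
```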

3. RDD operations such as join, groupByKey, and reduceByKey

The number of partitions used by shuffle read and reduce processing is controlled by spark.default.parallelism. By default it is the total number of cores of the running tasks (8 in Mesos fine-grained mode; the number of local cores in local mode). The official recommendation is 2-3 times the total number of cores.
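A sketch for the RDD side: the parallelism can be set globally via spark.default.parallelism, or per operation by passing an explicit partition count. The value 600, the input path, and the key-extraction logic are all illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rdd-parallelism-demo")
  // Global default partition count for RDD shuffles (600 is illustrative).
  .config("spark.default.parallelism", "600")
  .getOrCreate()
val sc = spark.sparkContext

// Build an illustrative pair RDD keyed by the first comma-separated field.
val pairs = sc.textFile("/path/to/logs").map(line => (line.split(",")(0), 1))

// reduceByKey also accepts an explicit partition count, overriding the global default.
val counts = pairs.reduceByKey(_ + _, 600)
```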

4. Increase executor memory

Increase the executor memory appropriately via spark.executor.memory.
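An illustrative way to set it (the 20g value is hypothetical and must fit within what the cluster manager actually allows); the setting has to be in place before the executors are launched:

```scala
import org.apache.spark.sql.SparkSession

// Equivalent at submit time: spark-submit --executor-memory 20g ...
val spark = SparkSession.builder()
  .appName("executor-memory-demo")
  .config("spark.executor.memory", "20g")  // illustrative value
  .getOrCreate()
```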

5. Check whether there is a data skew problem

Have null values been filtered out? Can a hot key be handled individually? Consider changing the partitioning rule for the data.
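A minimal sketch of the first two checks, with all paths, column names, and the hot-key value purely illustrative: drop null join keys, then handle one known hot key separately from the rest.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, col}

val spark = SparkSession.builder().appName("skew-check-demo").getOrCreate()
val logs = spark.read.parquet("/path/to/logs")
val dim  = spark.read.parquet("/path/to/dim")

// 1. Drop rows whose join key is null so they do not all land in one partition.
val nonNull = logs.filter(col("accountId").isNotNull)

// 2. Handle one known hot key on its own (broadcasting its small dimension slice),
//    join the remaining keys normally, then combine the two results.
val hotKey = "ACCOUNT_WITH_MOST_ROWS"   // illustrative
val hot  = nonNull.filter(col("accountId") === hotKey)
  .join(broadcast(dim.filter(col("accountId") === hotKey)), Seq("accountId"))
val rest = nonNull.filter(col("accountId") =!= hotKey)
  .join(dim, Seq("accountId"))
val joined = hot.union(rest)
```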

The above content comes from http://blog.csdn.net/lsshlsw/article/details/51213610


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The scenario I encountered:

Data volume: 1.7 billion log records.

Constraints: some fields are null but the logs cannot be discarded; 200 cores with 20 GB of memory per core; no additional resources available.

Problem Troubleshooting:

1. Many fields were pulled out of the DataFrame, and some of them are very long strings, resulting in a large volume of data.

2. Three fields were aggregated multiple times with reduceByKey; the statistics then had to be converted back to DataFrames and joined with the original data, for a total of 3 joins.

3. During those 3 joins, one key in one of the joins caused a data skew problem.

Solution:

1. Extract the fields needed for the join on their own, and process separately the fields that are not needed for the join or whose values are large, so that each shuffle does not carry a large amount of useless data.

2. In my scenario, the intermediate results are mainly used by downstream rules to filter out problematic accounts, so the filtering can be done in advance: if an aggregated intermediate value is already less than n (the thresholds in the subsequent rules are all greater than n), discard that intermediate result directly instead of letting it enter the shuffle phase of the later joins, further reducing the data volume.

3. For the data skew on one join key, split the original table into 3 parts with the randomSplit operator, perform the join on each smaller part separately (3 joins), and finally combine the 3 results with unionAll, as sketched below.
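A hedged sketch of steps 2 and 3 above, using hypothetical DataFrame names, paths, column names, and threshold; in Spark 1.x the combine method is unionAll, in 2.x+ it is union:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("split-join-demo").getOrCreate()

// Illustrative inputs: the original table and the aggregated intermediate statistics.
val original = spark.read.parquet("/path/to/original")
val stats    = spark.read.parquet("/path/to/stats")

// Step 2: drop intermediate results already below the rule threshold n
// (n = 5 here is purely illustrative) so they never enter the join shuffle.
val usefulStats = stats.filter(col("cnt") >= 5)

// Step 3: split the original table into 3 random pieces, join each piece separately,
// then combine the partial results (unionAll in Spark 1.x, union in 2.x+).
val Array(p1, p2, p3) = original.randomSplit(Array(1.0, 1.0, 1.0), seed = 42)
val joined = p1.join(usefulStats, Seq("accountId"))
  .union(p2.join(usefulStats, Seq("accountId")))
  .union(p3.join(usefulStats, Seq("accountId")))
```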


After these 3 steps my problem was solved. Of course, the right solution varies with the scenario and with personal habits. For other approaches to data skew, see my other summary: http://blog.csdn.net/zongzhiyuan/article/details/77676614


