First half source: http://blog.csdn.net/lsshlsw/article/details/51213610
The second half is my own optimization approach, offered for everyone's reference.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Errors caused by SparkSQL shuffle operations:

org.apache.spark.shuffle.MetadataFetchFailedException:
Missing an output location for shuffle 0

org.apache.spark.shuffle.FetchFailedException:
Failed to connect to hostname/192.168.xx.xxx:50268

Errors caused by RDD shuffle operations:

WARN TaskSetManager: Lost task 17.1 in stage 4.1 (TID 1386, spark050013): java.io.FileNotFoundException: /data04/spark/tmp/blockmgr-817d372f-c359-4a00-96dd-8f6554aa19cd/2f/temp_shuffle_e22e013a-5392-4edb-9874-a196a1dad97c

FetchFailed(BlockManagerId(6083b277-119a-49e8-8a49-3539690a2a3f-s155, spark050013, 8533), shuffleId=1, mapId=143, reduceId=3, message=
org.apache.spark.shuffle.FetchFailedException: Error in opening FileSegmentManagedBuffer{file=/data04/spark/tmp/blockmgr-817d372f-c359-4a00-96dd-8f6554aa19cd/0e/shuffle_1_143_0.data, offset=997061, length=112503}
(Author's note: for the principles behind shuffle, see my other summary: http://blog.csdn.net/zongzhiyuan/article/details/77676662)
The discussion below approaches the problem from two angles: the amount of shuffle data, and the number of partitions that process the shuffle data.
1. Reduce the amount of shuffle data

Consider whether a map-side join or broadcast join can be used to avoid generating the shuffle altogether.
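As a plain-Python sketch of the idea (data and names here are illustrative, not from the original post): in a broadcast join the small table is shipped to every task as an in-memory hash map, so each record of the large side is joined by a local lookup and no shuffle is needed. In Spark this corresponds to joining against a broadcast table (e.g. `broadcast(small_df)` in Spark SQL, or `sc.broadcast(...)` with RDDs).

```python
small_table = {"cn": "China", "us": "USA"}       # broadcast side (small)
large_table = [("cn", 1), ("us", 2), ("fr", 3)]  # streamed side (large)

# Each large-side record is resolved by a local hash lookup; nothing
# needs to be repartitioned by key, which is why no shuffle occurs.
joined = [
    (code, cnt, small_table[code])
    for code, cnt in large_table
    if code in small_table
]
# Inner-join semantics: ("fr", 3) has no match and is dropped.
```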
Filter out unnecessary data before the shuffle. For example, if the raw data has 20 fields but only the needed fields are selected, the amount of shuffle data drops accordingly.

2. SparkSQL and DataFrame join/group by operations (increasing shuffle parallelism)
The number of partitions is controlled by spark.sql.shuffle.partitions (default 200); raise this value according to the shuffle volume and the complexity of the computation.
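A hedged configuration sketch (it assumes an already-running SparkSession named `spark`; the value 800 is made up for illustration and should be tuned to your workload):

```python
# Illustrative only -- choose the number from your shuffle volume and
# per-task complexity; the Spark default is 200.
spark.conf.set("spark.sql.shuffle.partitions", "800")
```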
3. RDD join, groupByKey, reduceByKey, and similar operations

The number of partitions handled on the shuffle read/reduce side is controlled by spark.default.parallelism. By default this is the total number of cores across the running tasks (8 in Mesos fine-grained mode; the total core count in local mode). The official recommendation is 2-3 times the total number of cores of the running tasks.
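A configuration sketch under the assumptions above (200 cores in this scenario, so 2-3x gives 400-600; the value is illustrative and must be set before the SparkContext starts):

```python
# Assumes pyspark is available; 600 is an illustrative 3x-of-200-cores value.
from pyspark import SparkConf

conf = SparkConf().set("spark.default.parallelism", "600")
```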
4. Increase executor memory

Raise the executor memory value appropriately via spark.executor.memory.
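A minimal sketch (the 20g value mirrors this scenario's hardware; executor memory must be fixed at launch time, it cannot be changed on a running application):

```python
# Equivalent launch-time form: spark-submit --executor-memory 20g ...
from pyspark import SparkConf

conf = SparkConf().set("spark.executor.memory", "20g")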
5. Check for data skew

Have null keys been filtered out? Can a hot key be handled separately? Consider changing the data's partitioning rules.
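The checks above can be sketched in plain Python on a made-up sample (in Spark, the same idea is sampling the RDD and counting per key, e.g. `rdd.sample(...).countByKey()`; the threshold here is illustrative):

```python
from collections import Counter

# Hypothetical sample of (key, value) records; one key dominates.
sample = [("u1", 1), ("u1", 2), ("u1", 3), ("u1", 4), (None, 5), ("u2", 6)]

cleaned = [(k, v) for k, v in sample if k is not None]  # drop null keys first
counts = Counter(k for k, _ in cleaned)                 # records per key
hot_keys = {k for k, c in counts.items() if c >= 3}     # candidate skewed keys
```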
The above content comes from http://blog.csdn.net/lsshlsw/article/details/5121361
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
The scenario I encountered:

Data volume: 1.7 billion log records.
Constraints: some fields are null and the logs cannot be discarded; 200 cores with 20 GB of memory per core; no additional resources available.
Problem diagnosis:
1. Many fields were pulled out of the DataFrame, some of them very long strings, producing a large volume of data.
2. Three fields were aggregated multiple times with reduceByKey; the statistics then had to be converted back to DataFrames and joined with the original data, for a total of 3 joins.
3. During those 3 joins, one key in one of the joins suffered from data skew.
Solution:
1. Extract only the fields needed for the join operations; process the fields that are not joined on, especially those with large values, separately, so that each shuffle does not carry a large amount of useless data.
2. In my scenario the intermediate results are only used by downstream rules to filter out problem accounts, so filtering can be done in advance: if an aggregated intermediate value is already less than n (every downstream rule's threshold is guaranteed to be greater than n), the intermediate statistic is discarded immediately and never enters the shuffle phase of the later joins, further reducing the data volume.
3. For the skewed join key, split the original table into 3 parts with the randomSplit operator, perform the join for each of the 3 smaller tables, and finally combine the 3 results with unionAll.
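Step 3 above can be sketched in plain Python with made-up data (in Spark the pieces are `rdd.randomSplit([1.0, 1.0, 1.0])`, one join per slice, and `unionAll`; this just demonstrates that the union of the partial joins equals the single large join, while the physical work on the hot key is split across slices):

```python
import random

skewed = [(1, "a"), (1, "b"), (1, "c"), (2, "d"), (3, "e")]  # key 1 is hot
other = {1: "X", 2: "Y", 3: "Z"}                             # table joined against

# Mimic randomSplit: assign each record to one of 3 slices at random.
rng = random.Random(42)
slices = [[], [], []]
for record in skewed:
    slices[rng.randrange(3)].append(record)

def inner_join(part):
    return [(k, v, other[k]) for k, v in part if k in other]

# Join each slice separately, then concatenate (Spark's unionAll).
via_slices = [row for part in slices for row in inner_join(part)]
direct = inner_join(skewed)
# Same rows either way (possibly reordered).
```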
After these 3 steps, my problem was solved. Of course, solutions vary with the scenario and with everyone's habits. For other approaches to data skew, see my other summary: http://blog.csdn.net/zongzhiyuan/article/details/77676614