Four Ways to Resolve Spark Data Skew

Source: Internet
Author: User
Tags: join, shuffle, unique id

This article illustrates several Spark data-skew scenarios and the corresponding solutions, including avoiding skew at the data source, adjusting parallelism, using a custom partitioner, replacing reduce-side join with map-side join, and adding a random prefix to skewed keys.

Contents

1 Why handle data skew (Data Skew)
  1.1 What is data skew
  1.2 How data skew is caused
2 How to mitigate/eliminate data skew
  2.1 Try to avoid data skew at the data source
  2.2 Adjust parallelism to scatter different keys of the same task
    2.2.1 Principle
    2.2.2 Case
    2.2.3 Summary
  2.3 Custom partitioner
    2.3.1 Principle
    2.3.2 Case
    2.3.3 Summary
  2.4 Convert reduce-side join to map-side join
    2.4.1 Principle
    2.4.2 Case
    2.4.3 Summary
  2.5 Add a random prefix/suffix to skewed keys
    2.5.1 Principle
    2.5.2 Case
    2.5.3 Summary
  2.6 Add N random prefixes to the big table, expand the small table N times
    2.6.1 Principle
    2.6.2 Case
    2.6.3 Summary
3 Summary

Why handle data skew (Data Skew)

For big data systems such as Spark and Hadoop, large data volumes are not scary; skewed data is.

What is data skew

Data skew means that, in a dataset processed in parallel, one part (for example, one Spark or Kafka partition) holds significantly more data than the others, so that part's processing speed becomes the bottleneck for processing the whole dataset.

How data skew is caused

In Spark, different partitions of the same stage can be processed in parallel, while stages with dependencies between them are processed serially. Suppose a Spark job is divided into two stages, stage 0 and stage 1, and stage 1 depends on stage 0; then stage 1 will not start until stage 0 has finished. Stage 0 may contain N tasks, and these N tasks can run in parallel. If N-1 of those tasks complete in 10 seconds while the remaining task takes 1 minute, the stage takes at least 1 minute in total. In other words, the time a stage takes is determined mainly by its slowest task.
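The timing argument above can be sketched in a few lines of plain Python (the task counts and durations are the made-up numbers from the text, not measurements):

```python
# Toy illustration: a stage finishes only when its slowest task does,
# so one skewed task dominates the stage's wall-clock time.
task_times_s = [10] * 99 + [60]            # N-1 tasks take 10 s, one takes 1 min
stage_time_s = max(task_times_s)           # the stage waits for the slowest task
avg_task_s = sum(task_times_s) / len(task_times_s)
print(stage_time_s, avg_task_s)            # 60 10.5
```

Even though the average task needs only about 10.5 seconds of work, the stage still takes a full minute.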
Because all tasks in the same stage perform the same computation, the time differences between tasks are determined largely by the amount of data each task processes, leaving aside differences in the computing capacity of the nodes.

A stage's data comes from one of two sources: it is either read directly from a data source (for example, HDFS or Kafka), or it is shuffle data read from the previous stage.

How to mitigate/eliminate data skew

Try to avoid data skew at the data source

Take Spark Streaming reading Kafka data via DirectStream as an example. Because each Kafka partition corresponds to one Spark task (partition), whether the data in a Kafka topic is balanced across its partitions directly determines whether Spark experiences data skew when processing that data.
As described in Kafka Anatomy: Kafka Background and Architecture Introduction, how Kafka distributes messages among the partitions of a topic is determined mainly by the partitioner class used on the producer side. If a random partitioner is used, each message is sent to a random partition, so the data is balanced across partitions in terms of probability. In that case, the source stage (the stage that reads the Kafka data directly) does not produce data skew.
However, in many business scenarios, data with the same characteristics must be consumed in order, which requires placing it in the same partition. A typical example is placing all PV (page view) records for the same user into the same partition. In that case, if data skew occurs, it has to be handled by other means.

Adjust parallelism to scatter different keys of the same task

Principle

When Spark shuffles, it partitions the data by default with HashPartitioner (for non-hash-based shuffle). If the parallelism is set inappropriately, a large number of different keys may be assigned to the same task, making that task process far more data than the others and causing data skew.
If the shuffle parallelism is adjusted so that different keys originally assigned to the same task are dispersed to different tasks, the amount of data the overloaded task must process is reduced, thereby alleviating the short-board (straggler) effect caused by data skew.
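A rough sketch of this principle in plain Python (standing in for Spark's HashPartitioner, which assigns a record to task `key.hashCode % numPartitions`; integer keys are used so the demo is deterministic):

```python
from collections import defaultdict

def hash_partition(key, num_partitions):
    # Stand-in for Spark's HashPartitioner: non-negative hash of the key
    # modulo the partition count. Integer keys keep this deterministic.
    return key % num_partitions

def partition_sizes(keys, num_partitions):
    # Count how many records each partition (task) would receive.
    sizes = defaultdict(int)
    for k in keys:
        sizes[hash_partition(k, num_partitions)] += 1
    return dict(sizes)

# Three heavy keys (0, 3, 6) plus some light keys. At parallelism 3 all
# three heavy keys collide on partition 0; at parallelism 7 they scatter
# to partitions 0, 3 and 6 respectively.
keys = [0] * 1000 + [3] * 1000 + [6] * 1000 + list(range(100))
print(partition_sizes(keys, 3))
print(partition_sizes(keys, 7))
```

Raising the parallelism does not split a single heavy key, but it can separate several heavy keys that happened to collide on one task, which is exactly the situation this technique targets.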
Case

The test table, named student_external, contains 1.05 billion rows, each with a unique id. We take the 150 million rows whose id is between 900 million and 1.05 billion and transform them so that every id between 900 million and 940 million maps to a key whose remainder modulo 12 is 8 (that is, at a shuffle parallelism of 12, HashPartitioner assigns this entire subset to task 8), while every other row's id is divided by 100. As a result, rows with ids above 940 million spread evenly across all tasks during the shuffle, while rows with ids below 940 million all land on the same task. The processing is as follows:

INSERT OVERWRITE TABLE test
SELECT CASE WHEN id < 940000000 THEN (9500000 + (CAST(RAND() * 8 AS INTEGER)) * 12)
            ELSE CAST(id / 100 AS INTEGER)
       END,
       name
FROM student_external
WHERE id BETWEEN 900000000 AND 1050000000;
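The arithmetic behind this query can be checked in plain Python (assuming, as above, that HashPartitioner of a non-negative integer key at parallelism 12 reduces to `key % 12`):

```python
# Keys produced by the THEN branch: 9500000 + k*12 for k in 0..7.
# All of them leave remainder 8 modulo 12, so all land on task 8.
skewed_keys = [9500000 + k * 12 for k in range(8)]
print([key % 12 for key in skewed_keys])   # [8, 8, 8, 8, 8, 8, 8, 8]

# Keys produced by the ELSE branch: id // 100. Consecutive ids map to
# consecutive keys, so they cover all 12 task ids (residues mod 12).
spread = {(i // 100) % 12 for i in range(940000000, 940001200)}
print(sorted(spread))                      # [0, 1, 2, ..., 11]
```

This confirms the setup described above: the sub-940-million ids are funneled into a single task while the rest are spread evenly.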
