In this paper, we illustrate several scenarios of Spark data skew and the corresponding solutions, including avoiding skew at the data source, adjusting parallelism, using a custom Partitioner, using a map-side join instead of a reduce-side join, and adding a random prefix to skewed keys.
Contents
1 Why handle data skew (Data Skew)
1.1 What is data skew
1.2 How data skew is caused
2 How to mitigate/eliminate data skew
2.1 Try to avoid data skew at the data source
2.2 Adjust parallelism to spread different keys across different tasks
2.2.1 Principle
2.2.2 Case
2.2.3 Summary
2.3 Custom Partitioner
2.3.1 Principle
2.3.2 Case
2.3.3 Summary
2.4 Convert reduce-side join to map-side join
2.4.1 Principle
2.4.2 Case
2.4.3 Summary
2.5 Add a random prefix/suffix to skewed keys
2.5.1 Principle
2.5.2 Case
2.5.3 Summary
2.6 Add N random prefixes to the big table and expand the small table N times
2.6.1 Principle
2.6.2 Case
2.6.3 Summary
3 Summary

Why handle data skew (Data Skew)
For big data systems such as Spark and Hadoop, a huge data volume is not frightening; what is frightening is data skew.
What is data skew

Data skew means that, in a dataset processed in parallel, one part (such as one partition of an RDD in Spark, or one partition of a Kafka topic) holds significantly more data than the others, so that this part's processing speed becomes the bottleneck for processing the whole dataset.

How data skew is caused
In Spark, different partitions of the same stage can be processed in parallel, while stages that depend on each other are processed serially. Suppose a Spark job is divided into two stages, Stage 0 and Stage 1, and Stage 1 depends on Stage 0; then Stage 1 will not start until Stage 0 has finished. Stage 0 may contain N tasks, and these N tasks can run in parallel. If N-1 of the tasks finish in 10 seconds while the remaining task takes 1 minute, the total time of the stage is at least 1 minute. In other words, the time a stage takes is determined mainly by its slowest task.
Because all the tasks in the same stage perform the same computation, and setting aside differences in the computing capacity of different nodes, the difference in running time between tasks is determined mainly by the amount of data each task processes.

The data sources of a stage fall mainly into two classes: data read directly from a data source, such as HDFS or Kafka, and shuffle data read from the previous stage.

How to mitigate/eliminate data skew

Try to avoid data skew at the data source
Take Spark Streaming consuming Kafka data through the DirectStream approach as an example. Because each Kafka partition corresponds to one Spark task (partition), whether the data is balanced across the partitions of the Kafka topic directly determines whether Spark encounters data skew when processing that data.
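To make the one-to-one mapping concrete, the following is a minimal Spark Streaming sketch in Scala using the spark-streaming-kafka-0-10 API; the broker address, topic name, and group id are placeholders for illustration, not values taken from this article.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object DirectStreamSkewDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectStreamSkewDemo")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder broker and group names, for illustration only.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "skew-demo")

    // With the direct approach each Kafka partition becomes exactly one Spark
    // partition, so an unbalanced topic directly yields skewed Spark tasks.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("pv_topic"), kafkaParams))

    // Print per-partition record counts of each batch to spot skew at the source stage.
    stream.foreachRDD { rdd =>
      rdd.mapPartitionsWithIndex((i, it) => Iterator((i, it.size))).collect().foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}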
As described in "Kafka Anatomy: Kafka Background and Architecture Introduction", how the messages of a topic are distributed across its partitions is determined mainly by the Partitioner implementation used on the producer side. If a random Partitioner is used, each message is sent to a random partition, so in terms of probability the data is balanced across partitions. In that case, the source stage (the stage that reads the Kafka data directly) does not suffer from data skew.
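To illustrate how the producer side determines this distribution, here is a small hedged sketch of a Kafka producer written in Scala against the Kafka Java client; the broker address, topic, and key are invented for the example.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object ProducerPartitioningDemo {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092") // placeholder broker
    props.put("key.serializer", classOf[StringSerializer].getName)
    props.put("value.serializer", classOf[StringSerializer].getName)
    val producer = new KafkaProducer[String, String](props)

    // Without a key, the default partitioner spreads records across partitions,
    // which is balanced in terms of probability: no skew at the source stage.
    producer.send(new ProducerRecord[String, String]("pv_topic", "page=/home user=42"))

    // With a key, the default partitioner hashes the key, so all records with the
    // same key (for example the same user id) land in the same partition; a hot key
    // then produces a hot partition, that is, data skew at the source.
    producer.send(new ProducerRecord[String, String]("pv_topic", "user_42", "page=/home"))

    producer.close()
  }
}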
However, in many business scenarios, data with the same characteristics must be consumed in order, which requires placing data with the same characteristics in the same partition. A typical example is putting all PV (page view) records of the same user in the same partition. If data skew appears in this case, it has to be handled with one of the other approaches described below.

Adjust parallelism to spread different keys across different tasks

Principle
When Spark performs a shuffle, it partitions the data with HashPartitioner by default (not to be confused with Hash Shuffle). If the parallelism is set inappropriately, a large number of different keys may be assigned to the same task, so that this task processes far more data than the others and data skew occurs.
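A minimal sketch of the principle: by default Spark's HashPartitioner sends a record to partition key.hashCode modulo numPartitions (made non-negative), so with an unlucky parallelism many different keys collide on the same task. The key values below are chosen to mirror the case in the next section and are otherwise arbitrary.

import org.apache.spark.HashPartitioner

object HashPartitionerDemo {
  def main(args: Array[String]): Unit = {
    // With 12 partitions, every integer key congruent to 8 modulo 12 is routed
    // to the same task, no matter how many distinct keys there are.
    val partitioner = new HashPartitioner(12)
    val keys = Seq(9500000, 9500012, 9500024, 9500036) // all equal to 8 mod 12
    keys.foreach(k => println(s"key $k -> partition ${partitioner.getPartition(k)}"))
  }
}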
Adjusting the shuffle parallelism so that different keys that used to be assigned to the same task are spread over different tasks reduces the amount of data the original task has to process, and thereby alleviates the straggler (short-board) effect caused by data skew.
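As a hedged sketch of how this adjustment is typically expressed in Scala, assuming nothing about the real job; the sample data and the value 48 are arbitrary placeholders.

import org.apache.spark.sql.SparkSession

object AdjustParallelismDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AdjustParallelismDemo").getOrCreate()
    val sc = spark.sparkContext

    // Placeholder (key, value) data standing in for whatever the job reads upstream.
    val pairRdd = sc.parallelize(1 to 1000000).map(i => (i % 100, 1))

    // RDD API: pass an explicit partition count to the shuffle operator so the
    // keys are hashed over a larger, hopefully better balanced, set of tasks.
    val counts = pairRdd.reduceByKey(_ + _, 48)
    println(counts.getNumPartitions) // 48 reduce tasks instead of the default

    // Spark SQL / DataFrame API: change the shuffle partition count for the session.
    spark.conf.set("spark.sql.shuffle.partitions", "48")

    spark.stop()
  }
}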
Case
There is an existing test table named student_external with 1.05 billion rows, each with a unique id. Take the rows whose id is between 900 million and 1.05 billion, 150 million rows in total, and process them as follows: for every row whose id is between 900 million and 940 million, the resulting key modulo 12 equals 8 (that is, when the shuffle parallelism is 12, HashPartitioner assigns all of this data to task 8); for the remaining rows, the key is the id divided by 100. As a result, rows with an id greater than 940 million are spread evenly over all tasks during the shuffle, while rows with an id below 940 million all end up in the same task. The data is prepared as follows:
INSERT OVERWRITE TABLE test
SELECT CASE WHEN id < 940000000 THEN (9500000 + (CAST(RAND() * 8 AS INT)) * 12)
            ELSE CAST(id / 100 AS INT)
       END,
       name
FROM student_external
WHERE id BETWEEN 900000000 AND 1050000000;
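As a hedged sketch only, the prepared skew can then be observed by running an aggregation over the test table with the shuffle parallelism set to 12; this assumes the table has columns (id, name) as filled by the INSERT above and a Hive-enabled SparkSession, neither of which is spelled out in the text.

import org.apache.spark.sql.SparkSession

object ObserveSkewDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ObserveSkewDemo")
      .enableHiveSupport() // so that the Hive table `test` is visible
      .getOrCreate()

    // With 12 shuffle partitions, every remapped id of the form 9500000 + k * 12
    // hashes to partition 8, so task 8 receives far more shuffle data than the others.
    spark.conf.set("spark.sql.shuffle.partitions", "12")
    spark.sql("SELECT id, count(name) FROM test GROUP BY id").collect()

    spark.stop()
  }
}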