Spark SQL is one of the most widely used components of Apache Spark. It provides a friendly interface for distributed processing of structured data and has been proven in production in many applications. On very large clusters and datasets, however, Spark SQL still faces a number of ease-of-use and scalability challenges. To address them, the Intel Big Data Technology team and Baidu's big data infrastructure engineers have designed and implemented an adaptive execution engine on top of the community release of Spark. This article first discusses the challenges Spark SQL meets on large datasets, then describes the background and basic architecture of adaptive execution and how it addresses those challenges, and finally compares adaptive execution with the community version of Spark SQL on the 100TB TPC-DS benchmark, as well as its use on Baidu's Big SQL platform.
Challenge 1: The shuffle partition number
In Spark SQL, the number of shuffle partitions is controlled by the parameter spark.sql.shuffle.partitions, whose default value is 200. This parameter determines the number of tasks in every reduce stage of a SQL job and has a significant impact on overall query performance. Suppose a query has requested E executors before it runs and each executor has C cores (concurrent execution threads); then the number of tasks the job can run concurrently is E x C, that is, the job's concurrency is E x C. Suppose the shuffle partition number is P; then every subsequent reduce stage has P tasks, while the number of tasks in the map stage depends on the number and size of the input files. Because Spark's job scheduling is preemptive, the E x C concurrent task slots keep taking tasks from the P tasks of a stage until all of them finish, and only then does the job move to the next stage. If one task in this process takes much longer than the others, because it has to process too much data (for example, data skew puts a large amount of data into the same reducer partition) or for other reasons, then the whole stage takes longer to finish, and at the same time most of the E x C execution slots may sit idle, so overall cluster utilization drops sharply.
So what is an appropriate value for spark.sql.shuffle.partitions? If it is set too small, each reduce task has to process more data and, with limited memory, is more likely to spill to the local disk of the compute node. Spilling causes extra disk reads and writes, hurts the performance of the whole query, and in the worst case leads to severe GC problems or even OOM. Conversely, if the shuffle partition number is set too large, several problems appear. First, each reduce task processes only a small amount of data and finishes quickly, which puts a heavier load on the Spark task scheduler. Second, every mapper task has to split its shuffle output into P hash buckets, that is, decide which reduce partition each record belongs to; when P is very large each bucket becomes very small, and when job concurrency is high, the reduce tasks pulling shuffle data generate many small random reads, which degrades performance noticeably when the shuffle data is stored on mechanical hard disks. Finally, when the last stage saves its output it writes P files, which can also leave a large number of small files in HDFS.
In short, the shuffle partition number can be neither too small nor too large. To achieve the best performance, it usually takes several test runs to find the best value for a given SQL query. In a production environment, however, SQL jobs typically process data for different time periods on a schedule, and the data volume can vary greatly, so we cannot afford time-consuming manual tuning for every query, which also means these jobs rarely run at their best.
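To make the tuning burden concrete, the sketch below shows how this value is applied today: one static, hand-chosen setting that every query and every reduce stage in the session inherits. The value 800 and the events table are made up for illustration.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// One static value, chosen by hand, applied to every reduce stage of every query in this session.
spark.conf.set("spark.sql.shuffle.partitions", "800")   // 800 is an arbitrary hand-tuned value
spark.sql("SELECT key, count(*) FROM events GROUP BY key").collect()   // 'events' is a made-up table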
Another problem is that the same shuffle partition number is applied to all stages. Spark divides a SQL job into multiple stages, and the data distribution and size of each stage are generally different, so a single global shuffle partition setting can at best be optimal for some of the stages; there is no way for one global value to be optimal for all of them.
This series of performance and usability problems around the shuffle partition number prompted us to think about a new approach: can we automatically set an appropriate shuffle partition number for each stage based on the shuffle data observed at runtime, such as the size of the data blocks and the number of records?
Challenge 2: The best Spark SQL execution plan
Before executing SQL, Spark SQL parses the SQL statement or Dataset program into a logical plan and then, after a series of optimizations, produces an executable physical plan. Which physical plan is ultimately selected has a significant impact on performance, and choosing the best execution plan is the core job of Spark SQL's Catalyst optimizer. Catalyst was primarily a rule-based optimizer (RBO) in its early stages, and cost-based optimization (CBO) was added in Spark 2.2. Today the execution plan is fixed at planning time and never changes once confirmed. At runtime, however, when more information becomes available, we could often find a better execution plan.
Take join operations as an example. The most common join strategies in Spark are BroadcastHashJoin and SortMergeJoin. BroadcastHashJoin is a map-side join: when one of the tables is much smaller than the broadcast threshold, Spark broadcasts the small table to every executor, and in the map phase each mapper reads a shard of the large table and joins it with the entire small table, avoiding shuffling the large table across the cluster. With SortMergeJoin, both tables are shuffle-written in the map phase, and in the reduce phase each reducer pulls the data of the corresponding partition from both tables into the same task to do the join. The RBO turns a join into a BroadcastHashJoin whenever the estimated data size allows. In Spark, the parameter spark.sql.autoBroadcastJoinThreshold controls the threshold for choosing BroadcastHashJoin and defaults to 10MB. For complex SQL queries, however, the input of a join may be an intermediate result, and at planning time Spark does not know exactly how large the two sides of the join are, or estimates their sizes incorrectly, so the opportunity to optimize the join with BroadcastHashJoin is missed. At runtime, on the other hand, we could choose BroadcastHashJoin dynamically from the shuffle write statistics. The following is an example where the input on one side of the join is only 600KB, but Spark still plans it as a SortMergeJoin.
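The snippet below is a minimal illustration of the two planning-time controls available today: the broadcast threshold and the explicit broadcast hint. The orders and users tables and the user_id column are hypothetical; the point is that both controls depend on size information known before execution, which is exactly what intermediate results lack.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().getOrCreate()

// Planning-time threshold: a relation estimated to be below this size is broadcast (default 10MB).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (10L * 1024 * 1024).toString)

// 'orders' and 'users' are made-up tables; 'user_id' is an assumed join column.
val orders = spark.table("orders")
val users  = spark.table("users")

// Manual escape hatch: force a broadcast even when the size estimate is missing or wrong,
// which requires the author to already know that 'users' is small.
val joined = orders.join(broadcast(users), Seq("user_id"))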
This prompted us to think about a second question: can we dynamically adjust the execution plan using information gathered at runtime?
Challenge 3: Data skew
Data skew is a common cause of poor Spark SQL performance. It means that one partition holds much more data than the others, so a few tasks run much longer than the rest and slow down the entire SQL job. Data skew is very common in real SQL jobs: the hash buckets of the join keys are rarely balanced, and in extreme cases one join key has a particularly large number of records, so a large amount of data inevitably lands in the same partition and causes severe skew. As shown in Figure 2, most tasks finish in about 3 seconds, while the slowest task takes 4 minutes and processes several times as much data as the others.
Today, the common ways to handle data skew in a join are: (1) increase the number of shuffle partitions, hoping the data in a hot partition will be spread across more partitions, which does not help when the records share the same key; (2) increase the BroadcastHashJoin threshold, which in some scenarios converts a SortMergeJoin into a BroadcastHashJoin and avoids the skew caused by shuffling; (3) manually filter out the skewed keys, add a random prefix to them, replicate the corresponding rows of the other table accordingly, and then join (a sketch of this manual salting workaround follows). In short, each of these techniques has its limits and involves a lot of manual work. This led us to a third question: can Spark handle data skew in joins automatically at runtime?
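Below is a rough sketch of the manual salting workaround in (3), written against the DataFrame API. The table names, the join column key, and the bucket count are invented for illustration and have to be chosen by hand in a real job, which is exactly the kind of manual effort adaptive execution aims to remove.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().getOrCreate()
val skewed = spark.table("skewed_tbl")   // the side whose keys are skewed (hypothetical table)
val other  = spark.table("other_tbl")    // the side that must be replicated (hypothetical table)

val saltBuckets = 8   // chosen by hand after inspecting the hot keys

// Spread the hot keys of the skewed side across several buckets with a random suffix.
val salted = skewed.withColumn("salted_key",
  concat_ws("_", col("key"), (rand() * saltBuckets).cast("int").cast("string")))

// Replicate every row of the other side once per bucket so each salted key still has a match.
val expanded = other
  .withColumn("salt", explode(array((0 until saltBuckets).map(i => lit(i.toString)): _*)))
  .withColumn("salted_key", concat_ws("_", col("key"), col("salt")))

val result = salted.join(expanded, "salted_key")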
Background and introduction of adaptive execution
As early as 2015, the Spark community proposed the basic idea of adaptive execution: an interface was added to Spark's DAGScheduler for submitting a single map stage, and an attempt was made to adjust the shuffle partition number at runtime. The implementation had some limitations, however: in certain scenarios it introduced additional shuffles, that is, extra stages, and cases such as three tables joined in the same stage could not be handled well. So the feature stayed in the experimental phase, and its configuration parameters are not mentioned in the official documentation.
Building on this community work, the Intel Big Data Technology team redesigned adaptive execution into a more flexible, self-adaptive execution framework. Under this framework we can add further rules to implement more functionality. The features implemented so far include automatically setting the number of shuffle partitions, dynamically adjusting the execution plan, and dynamically handling data skew.
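For orientation, the ideas in this list later appeared in upstream Apache Spark as Adaptive Query Execution (Spark 3.x). The configuration names below are the upstream ones and are shown only as an assumption of how such features are switched on; the Intel/Baidu build described in this article is a separate implementation and its switches may differ.

val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()
spark.conf.set("spark.sql.adaptive.enabled", "true")                     // turn adaptive execution on
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  // auto-set the reducer count
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            // split skewed join partitions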
Adaptive execution architecture
In Spark SQL, once Spark has settled on the final physical plan, it generates an RDD DAG from the plan's operators. Spark then statically divides the DAG into stages and submits them for execution, so once the execution plan is determined it cannot be changed at runtime. The basic idea of adaptive execution is to divide the execution plan into stages up front, submit the stages one at a time, collect the shuffle statistics of the current stage at runtime to optimize the plan of the next stage, and then submit that next stage for execution.
Figure 3 shows how adaptive execution works. The execution plan tree is first divided into multiple QueryStages, using the Exchange nodes (an Exchange node represents a shuffle in Spark SQL) as boundaries. Each QueryStage is a separate subtree and an independent execution unit. When adding a QueryStage, we also add a QueryStageInput leaf node as the input of its parent QueryStage. For example, for the execution plan of the two-table join in the figure, we create three QueryStages. The plan in the last QueryStage is the join itself, and it has two QueryStageInputs representing its inputs, pointing to the two child QueryStages. To execute a QueryStage, we first submit its child stages and collect their runtime information. When the child stages are complete, we know their output sizes and other statistics and can decide whether the plan in this QueryStage can be optimized. For example, when we learn that one of the tables is only 5MB, below the broadcast threshold, we can convert the SortMergeJoin into a BroadcastHashJoin to optimize the current execution plan. We can also dynamically adjust the number of reducers of this stage according to the amount of shuffle data produced by the child stages. After this series of optimizations, we finally generate the RDD DAG for the QueryStage and submit it to the DAG scheduler for execution.
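The following toy model sketches this control flow. The case classes and helper functions are invented for clarity and are not the actual Spark internal APIs.

case class Stats(bytesByPartition: Array[Long], rowsByPartition: Array[Long])
case class QueryStage(plan: String, children: Seq[QueryStage])

// Re-plan a stage once its children's runtime statistics are known, e.g. rewrite a
// SortMergeJoin into a BroadcastHashJoin when a child output is tiny, pick the reducer
// count from the child output sizes, or split a skewed partition into several tasks.
def reoptimize(plan: String, childStats: Seq[Stats]): String = plan

// Placeholder for building the RDD DAG of the optimized plan and submitting it.
def runAndCollectStats(plan: String): Stats = Stats(Array.empty[Long], Array.empty[Long])

// The adaptive loop: run children first, re-optimize with their statistics, then execute.
def execute(stage: QueryStage): Stats = {
  val childStats = stage.children.map(execute)
  val optimized  = reoptimize(stage.plan, childStats)
  runAndCollectStats(optimized)
}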
Automatically setting the number of reducers
Suppose we set the shuffle partition number to 5, and at the end of the map stage we know that the five partitions are 70MB, 30MB, 20MB, 10MB, and 50MB. If the target data volume per reducer is 64MB, then at runtime we can actually use only 3 reducers: the first reducer handles partition 0 (70MB), the second handles the contiguous partitions 1 to 3 (60MB in total), and the third handles partition 4 (50MB), as shown in Figure 4.
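The packing rule described above can be sketched in a few lines. This is an illustrative reimplementation of the idea, not the code used in the adaptive execution engine.

// Walk the map-output partition sizes in order and pack consecutive partitions into one
// reducer until the target size would be exceeded.
def coalescePartitions(sizes: Array[Long], targetBytes: Long): Seq[Range] = {
  val groups = scala.collection.mutable.ArrayBuffer.empty[Range]
  var start = 0
  var acc   = 0L
  for (i <- sizes.indices) {
    if (i > start && acc + sizes(i) > targetBytes) {
      groups += (start until i)   // close the current reducer's range
      start = i
      acc = 0L
    }
    acc += sizes(i)
  }
  if (start < sizes.length) groups += (start until sizes.length)
  groups.toSeq
}

// The example from the text: 70/30/20/10/50 MB partitions with a 64MB target
// collapse into three reducers covering partitions [0], [1..3], and [4].
val mb = 1024L * 1024
coalescePartitions(Array(70L, 30L, 20L, 10L, 50L).map(_ * mb), 64 * mb)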
In the adaptive execution framework, because each QueryStage knows all of its child stages, it can take all of its inputs into account when adjusting the number of reducers. In addition, we can also use the record count as a target value for each reducer. Shuffle data is usually compressed, and sometimes a partition is not large in bytes but has far more records after decompression than other partitions, which also leads to uneven load. Considering both the data size and the record count therefore gives a better reducer count.
Dynamically adjusting the execution plan
Currently we support dynamically adjusting the join strategy at runtime: when one table turns out to be smaller than the broadcast threshold and the conditions are met, a SortMergeJoin is converted into a BroadcastHashJoin. Because the output partitioning of SortMergeJoin and BroadcastHashJoin differ, a careless conversion could introduce an extra shuffle in the next stage. So when we dynamically adjust the join strategy, we follow one rule: only transform the plan when doing so introduces no additional shuffle.
What do we gain by converting a SortMergeJoin into a BroadcastHashJoin, given that the data has already been shuffle-written to disk and still has to be shuffle-read? Consider the example in Figure 5: tables A and B are joined, each table has 2 map tasks in the map stage, and the shuffle partition number is 5. With SortMergeJoin, 5 reducers must be started in the reduce phase, and each reducer reads its data over the network. If at runtime we find that B can be broadcast and convert the join into a BroadcastHashJoin, we only need to start 2 reducers, each of which reads the entire shuffle output file of one mapper. When scheduling these 2 reducer tasks we can prefer the executors that ran the mappers, so the whole shuffle read becomes a local read and no data is transferred over the network. Reading one file sequentially is also more efficient than the random small-file reads of the original shuffle. In addition, SortMergeJoin often suffers from some degree of data skew, which slows down the whole job; after the conversion to BroadcastHashJoin the data is generally distributed more evenly and the skew is avoided, as the experimental results below show in more detail.
Dynamically handling data skew
In the adaptive execution framework we can easily detect skewed partitions at runtime. When a stage finishes, we collect the shuffle output size and record count of every mapper of that stage. If the data volume or record count of a partition is more than N times the median and also greater than a pre-configured threshold, we treat it as a skewed partition that needs special handling.
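A minimal sketch of this detection rule is shown below. The function and parameter names are illustrative, not the actual configuration keys.

// A partition is treated as skewed when its size is both larger than skewFactor times
// the median and larger than a configured absolute threshold.
def skewedPartitions(sizes: Array[Long], skewFactor: Double, minBytes: Long): Seq[Int] = {
  val median = sizes.sorted.apply(sizes.length / 2)
  sizes.indices.filter(i => sizes(i) > median * skewFactor && sizes(i) > minBytes)
}

// Example: with a 5x factor and a 256MB floor, only partition 0 is flagged.
val mb = 1024L * 1024
skewedPartitions(Array(4096 * mb, 40 * mb, 35 * mb, 50 * mb, 45 * mb), skewFactor = 5.0, minBytes = 256 * mb)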
Suppose we do an inner join between tables A and B, and partition 0 of table A is skewed. Normally, the partition 0 data of both A and B is shuffled to the same reducer; because this reducer has to pull and process a large amount of data over the network, it becomes the slowest task and drags down the whole job. In the adaptive execution framework, once we find that partition 0 of table A is skewed, we use N tasks to process it. Each task reads the shuffle output of only a few mappers and joins it with partition 0 of table B, and finally the results of the N joins are combined with a union. To implement this we also changed the shuffle read interface so that it can read one partition from only a subset of the mappers. In this process, partition 0 of table B is read N times; that adds some cost, but the benefit of handling the skewed data with N tasks still outweighs it. If partition 0 of table B is also skewed, then for an inner join we can additionally split table B's partition 0 into several blocks, join each block with partition 0 of table A, and union the results. For other join types, such as Left Semi Join, we do not yet support splitting table B's partition 0.
Performance comparison of adaptive execution and Spark SQL at 100TB
We built a cluster of 99 machines and ran Spark 2.2 on the 100TB TPC-DS dataset, comparing the performance of the original Spark with adaptive execution. The cluster details are as follows:
The experiments show that under adaptive execution, 92 of the 103 SQL queries improved, 47 of them by more than 10%, with a maximum speedup of 3.8x and no performance regressions. In addition, with the original Spark, 5 queries could not run to completion because of OOM and other problems; in adaptive mode we also addressed these issues, so all 103 queries on the 100TB TPC-DS dataset finished successfully. Below are a few of the queries with the most obvious speedups and their performance improvements.
By analyzing the improved queries carefully we can see the benefits of adaptive execution. The first is the automatic setting of the reducer count: the original Spark used 10976 as the shuffle partition number, while under adaptive execution the reducer count of the following queries was automatically adjusted to 1064 and 1079, and the execution time clearly improved as well. This comes from the reduced scheduling burden and task startup time, as well as fewer disk IO requests.
Original Spark:
Adaptive execution:
Dynamically adjusting the execution plan at runtime and converting SortMergeJoin into BroadcastHashJoin also brings large speedups for some queries. In the example below, the SortMergeJoin took 2.5 minutes because of data skew. Under adaptive execution, because one of the tables is only 2.5KB, the join is converted into a BroadcastHashJoin at runtime and the execution time drops to 10 seconds.
Original Spark:
Adaptive execution:
Challenges and optimizations at the 100TB scale
Running all of the TPC-DS queries successfully on a 100TB dataset is itself a challenge for Apache Spark. Although Spark SQL officially supports all of the TPC-DS queries, that support is based on small datasets. At the 100TB scale, Spark exposes problems that make some queries inefficient or even impossible to finish. During our experiments we therefore made additional optimizations to Spark, on top of the adaptive execution framework, to ensure that all queries run successfully on the 100TB dataset. Some typical problems are described below.
Driver single-point bottleneck when aggregating map output statistics (SPARK-22537)
At the end of each map task, a data structure describing the size of each partition (the CompressedMapStatus or HighlyCompressedMapStatus mentioned below) is returned to the driver. In adaptive execution, when a shuffle map stage finishes, the driver aggregates the partition sizes reported by every mapper to get the total output size of each partition across all mappers. This aggregation is done by a single thread. If the number of mappers is M and the number of shuffle partitions is S, its time complexity is between O(M x S) and O(M x S x log(M x S)): with CompressedMapStatus the complexity is at the lower bound, while with HighlyCompressedMapStatus space is saved but the time is longer, and when almost all partitions are empty the complexity approaches the upper bound.
As M x S grows, this becomes a single-point bottleneck on the driver; an obvious symptom is a pause between the map stage and the reduce stage on the UI. To remove the bottleneck, we split the aggregation evenly across multiple threads, with each thread writing its aggregated values to a disjoint range of the Scala array so that the threads never overlap.
This optimization introduces a new parameter, spark.shuffle.mapOutput.parallelAggregationThreshold (threshold for short), to configure when multi-threaded aggregation is used; the degree of parallelism of the aggregation is the smaller of the number of cores available to the JVM and M x S / threshold + 1.
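The sketch below illustrates the idea of the fix, assuming a helper getSize that reads one partition size from a (Highly)CompressedMapStatus: each thread owns a disjoint slice of the result array, so no synchronization is needed. It is a simplified illustration, not the actual SPARK-22537 patch.

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Sum, per shuffle partition, the sizes reported by every mapper.
def aggregateSizes(numMappers: Int, numPartitions: Int, threads: Int)
                  (getSize: (Int, Int) => Long): Array[Long] = {
  val totals = new Array[Long](numPartitions)
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(threads))
  val slice = (numPartitions + threads - 1) / threads
  val jobs = (0 until threads).map { t =>
    Future {
      val from = t * slice
      val to   = math.min(from + slice, numPartitions)
      // Only this thread writes to totals(from until to), so no locking is needed.
      for (p <- from until to; m <- 0 until numMappers) totals(p) += getSize(m, p)
    }
  }
  Await.result(Future.sequence(jobs), Duration.Inf)
  totals
}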
Shuffle optimization when reading contiguous partitions (SPARK-9853)
In adaptive execution mode, a reducer may read several contiguous blocks of data from one map output file. In the current implementation this is split into many separate getBlockData calls, each reading a small chunk from disk, which requires a lot of disk IO. We optimized this scenario so that Spark reads all of the contiguous blocks at once, which greatly reduces disk IO. In a small benchmark we found that shuffle read performance can improve by 3x.
Avoiding unnecessary partition reads in BroadcastHashJoin
Adaptive execution also creates new optimization opportunities for existing operators. SortMergeJoin has a basic property: each reduce task reads the records of the left table first, and if the left table's partition is empty it does not need to look at the right table's data at all (for non-anti joins). So when some partitions of the left table are empty, the unnecessary reads of the right table are saved, and in SortMergeJoin this falls out naturally from the implementation.
BroadcastHashJoin has no step that partitions the data by the join key, so this optimization is normally missing. Under adaptive execution, however, we can recover it by using the accurate statistics gathered between stages: if a SortMergeJoin is converted into a BroadcastHashJoin at runtime and we know the exact size of every partition, the newly converted BroadcastHashJoin can be told to skip reading the partitions whose counterpart is empty, because they cannot produce any join results.
Trial results on Baidu's production workloads
We applied the adaptive execution optimizations to Baidu Big SQL, Baidu's internal ad hoc query service based on Spark SQL, for further validation in production. We selected a full day of real user queries, replayed them in their original order, and analyzed the results, reaching the following conclusions:
1. For simple queries that finish in seconds, the gain from the adaptive version is not obvious, mainly because their bottleneck and main cost is IO, which adaptive execution does not optimize.
2. Looking at the results by query complexity: the more iterations a query contains and the more complex its multi-table joins, the better adaptive execution performs. We roughly classified queries by operations such as GROUP BY, sort, join, and subqueries; queries containing more than three of these keywords showed clear improvements, with speedups of 50%-200%, mainly from the dynamic adjustment of shuffle parallelism and the join optimizations.
3. From a business point of view, the SortMergeJoin-to-BroadcastHashJoin optimization hits several typical SQL templates in Big SQL. Consider the following requirement: a user wants to aggregate billing information from two different sources for a list of users of interest. The raw billing tables are at the hundreds-of-terabytes level, while the user list contains only the metadata of the relevant users and is under 10MB. The two billing tables have essentially the same fields, so we inner join each of them with the user list and then union the results for further analysis. The SQL is as follows:
SELECT t.c1, t.id, t.c2, t.c3, t.c4, SUM(t.num1), SUM(t.num2), SUM(t.num3)
FROM (
    SELECT c1, t1.id AS id, c2, c3, c4,
           SUM(num1s) AS num1, SUM(num2) AS num2, SUM(num3) AS num3
    FROM basedata.shitu_a t1
    INNER JOIN basedata.user_82_1512023432000 t2 ON (t1.id = t2.id)
    WHERE (event_day = 20171107) AND flag != 'true'
    GROUP BY c1, t1.id, c2, c3, c4
    UNION ALL
    SELECT c1, t1.id AS id, c2, c3, c4,
           SUM(num1s) AS num1, SUM(num2) AS num2, SUM(num3) AS num3
    FROM basedata.shitu_b t1
    INNER JOIN basedata.user_82_1512023432000 t2 ON (t1.id = t2.id)
    WHERE (event_day = 20171107) AND flag != 'true'
    GROUP BY c1, t1.id, c2, c3, c4
) t
GROUP BY t.c1, t.id, t.c2, t.c3, t.c4
The corresponding execution plan from the original Spark is as follows:
For such a scenario, all of the join optimization logic of adaptive execution is hit: the SortMergeJoin is converted into a BroadcastHashJoin during execution, the intermediate memory consumption and extra rounds of sorting are reduced, and the query runs nearly 200% faster.
Based on these three points, our next steps for adaptive execution inside Baidu will focus on routine batch jobs with large data volumes and complex queries, and on dynamic switching driven by query complexity. For complex queries running on large clusters of thousands of nodes, adaptive execution can dynamically adjust the degree of parallelism during the computation, which helps significantly increase cluster resource utilization. In addition, adaptive execution can obtain more complete statistics between the stages of a multi-stage job, and we are also considering exposing the corresponding data and strategy interfaces to advanced users of the Baidu Spark platform so that they can customize strategies for special jobs.
Summary
With the wide adoption of Spark SQL and ever-growing business sizes, the ease-of-use and performance challenges on large datasets will become increasingly apparent. This article discussed three typical issues: tuning the shuffle partition number, choosing the best execution plan, and data skew. These problems are not easy to solve within the existing framework, and adaptive execution handles them well. We introduced the basic architecture of adaptive execution and the specific solutions to these issues. Finally, we verified the advantages of adaptive execution on the 100TB TPC-DS dataset: compared with the original Spark SQL, about 90% of the 103 queries improved significantly, with a maximum speedup of 3.8x, and the 5 queries that previously failed all completed successfully under adaptive execution. Our Big SQL platform at Baidu has also done further validation, achieving up to 2x improvements for complex real queries. In summary, adaptive execution solves many of the challenges Spark SQL meets at big data scale, substantially improves the ease of use and performance of Spark SQL, and improves cluster resource utilization for multi-tenant, highly concurrent jobs on very large clusters. In the future we plan to provide more runtime optimization strategies within the adaptive execution framework and contribute our work back to the community, and we hope more people will join the effort and refine it further.
Reprinted from http://blog.csdn.net/fl63zv9zou86950w/article/details/79049280
Spark SQL Adaptive Execution Practice on 100TB (reprint)