Cost-based Optimizer (CBO) in Apache Spark 2.2 (reproduced)

Source: Internet
Author: User
Tags: joins, shuffle

Apache Spark 2.2 recently introduced an advanced cost-based optimizer framework that collects and leverages a variety of per-column data statistics (e.g., cardinality, number of distinct values, null count, maximum/minimum values, average/maximum length, and so on) to improve the execution plans of query-style jobs. Leveraging these statistics helps Spark make better decisions when choosing the best query plan. Examples of such optimizations include selecting the correct build side in a hash join, choosing the right join type (broadcast hash join vs. shuffle join), and reordering multi-way joins. In this blog post we dive into Spark's cost-based optimizer (CBO) and discuss how Spark collects and stores these statistics, how it optimizes queries, and what performance impact this has on benchmark queries.

An Illuminating Example

At the core of Spark 2.2 is the Catalyst optimizer, a unified library for representing query plans as trees and transforming them by sequentially applying optimization rules. Most of these optimization rules are heuristics: they only consider the structure of the query and ignore the properties of the data being processed, which severely limits their applicability. Let's demonstrate this with a simple example. Consider a query that filters a table T1 of size 500GB and joins the result with another table T2 of size 20GB. Spark uses hash join, which picks the smaller join relation as the build side (the side used to construct the hash table) and the larger relation as the probe side [1]. Because table T2 is smaller than table T1, Apache Spark 2.1 picks the right side as the build side without accounting for the filter (which, in this case, filters out most of the data in T1). Choosing the wrong build side often forces the system to give up the fast hash join and fall back to a sort-merge join because of memory limits. Apache Spark 2.2 does not do this: it collects statistics for each operator and finds that the left side is only 100MB (1 million records) after filtering, while the right side is 20GB (100 million records). With correct size/cardinality information for both sides, Spark 2.2 chooses the left side as the build side, which greatly speeds up the query.

To improve the quality of query execution plans, we strengthened the Spark SQL optimizer with detailed statistics. Starting from the detailed per-table statistics, we propagate the statistics to the other operators as we traverse the query tree bottom-up. At the end of this propagation we can estimate, for every database operator, the number of output records and the output size, which allows an efficient query plan to be chosen.

Statistics Collection Framework

ANALYZE TABLE Command

The CBO relies on detailed statistics to optimize a query plan. To collect these statistics, users can use the following new SQL commands:
ANALYZE TABLE table_name COMPUTE STATISTICS

The above SQL statement collects table-level statistics such as the number of records and the size of the table in bytes. Note that ANALYZE, COMPUTE, and STATISTICS are reserved keywords; the column-level variant below takes specific column names as arguments; and the table-level statistics are stored in the metastore.

ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS column-name1, column-name2, ....

Note that it is not necessary to specify every column of the table in the ANALYZE statement; it is enough to specify the columns involved in filter/join conditions or in group-by clauses.
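
As a concrete illustration, here is a minimal sketch (in Scala) of collecting and inspecting these statistics from a Spark 2.2+ session with Hive metastore support; the table name customers and the column names c_customer_sk and c_birth_country are hypothetical.

    // Minimal sketch: collect table- and column-level statistics for the CBO.
    // The table `customers` and its columns are hypothetical examples.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("cbo-statistics-sketch")
      .enableHiveSupport()   // collected statistics are persisted in the metastore
      .getOrCreate()

    // Table-level statistics: number of records and table size in bytes.
    spark.sql("ANALYZE TABLE customers COMPUTE STATISTICS")

    // Column-level statistics for the columns used in filters, joins, or group-bys.
    spark.sql(
      "ANALYZE TABLE customers COMPUTE STATISTICS FOR COLUMNS c_customer_sk, c_birth_country")

    // Inspect the collected table-level statistics (look for the "Statistics" field).
    spark.sql("DESCRIBE EXTENDED customers").show(truncate = false)

Once collected, the optimizer reads these statistics back from the metastore when planning queries over the table; in newer Spark releases, DESCRIBE EXTENDED customers c_customer_sk also prints the per-column statistics.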

Statistics Types

The statistics collected cover numeric, date, timestamp, string, and binary column types, and consist of the per-column measures listed earlier (distinct count, null count, minimum/maximum values where applicable, and average/maximum length). Since the CBO traverses Spark's logical plan tree in post-order, we can propagate these statistics bottom-up to the other operators. While there are many operators whose statistics we evaluate, each with its corresponding estimation logic, below we explain the two most complex and interesting cases: estimating statistics for the filter and join operators.

Filter Selectivity

A filter condition is a predicate expression specified in the WHERE clause of a SQL SELECT statement. A predicate can be a compound logical expression built with the logical operators AND, OR, and NOT and containing multiple conditions. A single condition usually contains a comparison operator such as =, <, <=, >, >=, or <=> [2]. Estimating the selectivity of an arbitrary filter expression is therefore quite involved. Let's demonstrate some of the selectivity calculations for compound logical expressions that combine multiple conditions (a small numeric sketch follows the list below):
    • For a logical AND expression, the filter selectivity is the selectivity of the left condition multiplied by the selectivity of the right condition: FS(a AND b) = FS(a) * FS(b).
    • For a logical OR expression, the filter selectivity is the selectivity of the left condition plus the selectivity of the right condition, minus the selectivity of their conjunction: FS(a OR b) = FS(a) + FS(b) - FS(a AND b) = FS(a) + FS(b) - (FS(a) * FS(b)).
    • For a logical NOT expression, the filter selectivity is 1.0 minus the selectivity of the original expression: FS(NOT a) = 1.0 - FS(a).
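
As promised above, here is a small sketch of these combination rules; the input selectivities 0.1 and 0.2 are made-up values, and the independence assumption between conditions is the same simplification described in the bullets.

    // Minimal sketch: combining filter selectivities of independent conditions.
    def andSelectivity(fsA: Double, fsB: Double): Double = fsA * fsB
    def orSelectivity(fsA: Double, fsB: Double): Double = fsA + fsB - fsA * fsB
    def notSelectivity(fsA: Double): Double = 1.0 - fsA

    // Made-up example inputs: FS(a) = 0.1, FS(b) = 0.2
    val fsA = 0.1
    val fsB = 0.2
    println(andSelectivity(fsA, fsB)) // ~0.02
    println(orSelectivity(fsA, fsB))  // ~0.28
    println(notSelectivity(fsA))      // 0.9
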
Now let's look at a single condition, which involves a comparison operator such as =, <, <=, >, >=, or <=>. For the case where one operand is a column and the other is a string constant, we first show how the filter selectivity is computed for the equals (=) and less-than (<) operators; the other comparison operators are handled similarly.
    • Equals operator (=): We check whether the string constant in the condition falls within the current [minimum, maximum] range of the column. This step is necessary because an earlier condition may already have narrowed that range. If the constant falls outside the range, the filter selectivity is 0.0; otherwise it is the inverse of the number of distinct values of the column (note: without additional histogram information, we simply assume the column values are uniformly distributed). Later releases will use histograms to improve the accuracy of this estimate.
    • Less-than operator (<): We check where the string constant in the condition falls relative to the column's range. If it is smaller than the current minimum value of the column, the filter selectivity is 0.0 (if it is greater than the maximum value, the selectivity is 1.0). Otherwise, we calculate the filter selectivity from the available information: without a histogram, the selectivity is (constant value - minimum value) / (maximum value - minimum value); with a histogram, we additionally sum the density of the histogram buckets between the column's minimum value and the constant value. Also note that, after this condition, the constant on its right-hand side becomes the column's new maximum value. A small sketch of the no-histogram case follows.
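
Here is a minimal sketch of the two estimates just described, using simple per-column statistics and no histogram; the ColStats case class and the example numbers are illustrative stand-ins, not Spark's internal representation.

    // Minimal sketch: per-condition selectivity estimation without histograms.
    // ColStats is an illustrative stand-in, not Spark's internal column statistics class.
    case class ColStats(min: Double, max: Double, distinctCount: Long)

    // Selectivity of `col = literal`: 0.0 outside [min, max], otherwise 1/distinct.
    def equalsSelectivity(stats: ColStats, literal: Double): Double =
      if (literal < stats.min || literal > stats.max) 0.0
      else 1.0 / stats.distinctCount   // assumes a uniform value distribution

    // Selectivity of `col < literal`: interpolate between min and max.
    def lessThanSelectivity(stats: ColStats, literal: Double): Double =
      if (literal <= stats.min) 0.0
      else if (literal > stats.max) 1.0
      else (literal - stats.min) / (stats.max - stats.min)

    // Made-up statistics: values between 1 and 100 with 100 distinct values.
    val stats = ColStats(min = 1, max = 100, distinctCount = 100)
    println(equalsSelectivity(stats, 42))   // 0.01
    println(lessThanSelectivity(stats, 26)) // (26 - 1) / (100 - 1) ≈ 0.25
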
Join Cardinality

We have discussed filter selectivity; now let's discuss the output cardinality of a join. Before computing the output cardinality of a two-way join, we need the output cardinalities of its child nodes. The cardinality of each join side never exceeds the record count of the original table; more precisely, it is the number of valid records remaining after all the operations executed before the join. We focus on the cardinality of the inner join, since it is often used to derive the cardinalities of the other join types. The estimated number of records produced by joining A and B under the condition A.k = B.k is:

num(A IJ B) = num(A) * num(B) / max(distinct(A.k), distinct(B.k))

where num(A) is the number of valid records in table A after the operations preceding the join, and distinct is the number of distinct values of the join column k. As shown below, from the inner join cardinality we can derive the cardinalities of the other join types (a small sketch follows the list):
    • Left outer join: num(A LOJ B) = max(num(A IJ B), num(A)), i.e. the larger of the inner join output cardinality and the cardinality of the outer side A. This is because every record from the outer side is counted, even if it does not appear in the join output.
    • Right outer join: num(A ROJ B) = max(num(A IJ B), num(B))
    • Full outer join: num(A FOJ B) = num(A LOJ B) + num(A ROJ B) - num(A IJ B)
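
The following minimal sketch restates these formulas in code; the record counts and distinct counts at the end are made-up example inputs.

    // Minimal sketch: join cardinality estimation for A JOIN B ON A.k = B.k.
    def innerJoinCard(numA: Long, numB: Long, distinctAk: Long, distinctBk: Long): Long =
      numA * numB / math.max(distinctAk, distinctBk)

    def leftOuterJoinCard(inner: Long, numA: Long): Long = math.max(inner, numA)
    def rightOuterJoinCard(inner: Long, numB: Long): Long = math.max(inner, numB)
    def fullOuterJoinCard(inner: Long, loj: Long, roj: Long): Long = loj + roj - inner

    // Made-up example: A keeps 1,000,000 valid rows, B keeps 100,000,000 rows,
    // and the join key k has 50,000 / 80,000 distinct values respectively.
    val inner = innerJoinCard(1000000L, 100000000L, 50000L, 80000L)
    val loj   = leftOuterJoinCard(inner, 1000000L)
    val roj   = rightOuterJoinCard(inner, 100000000L)
    val foj   = fullOuterJoinCard(inner, loj, roj)
    println(s"IJ=$inner LOJ=$loj ROJ=$roj FOJ=$foj")
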
Optimal Plan Selection

Now that we have intermediate statistics about the data, let's discuss how to use them to choose the best query plan. Earlier we explained how the build side of a hash join is chosen based on accurate cardinality and size statistics. Similarly, by estimating the cardinality and size of all the operations that precede a join, we can better estimate the size of a join side and decide whether it qualifies for a broadcast join. These statistics also enable cost-based join reordering. We adapted the dynamic programming algorithm of [Selinger 1979] [3] to select the best multi-way join order. More precisely, when building an m-way join we keep only the best plan (the one with the lowest cost) for each subset of the m join items. For example, for a 3-way join over the items {A, B, C}, we consider the best plan among the possible orders, i.e. (A join B) join C, (A join C) join B, and (B join C) join A. The adapted algorithm considers all tree shapes, including left-deep trees, bushy trees, and right-deep trees. We also prune Cartesian products when building new plans, whenever neither the left nor the right subtree contains a reference required by the join condition. This pruning strategy significantly reduces the search space. Most database optimizers estimate the total operator cost by separating it into CPU and I/O components. In Spark, we estimate the cost of a join operation with a simple formula:

Cost = weight * cardinality + (1.0 - weight) * size [4]

The first part of the formula roughly corresponds to CPU cost and the second part to I/O cost. The cost of a join tree is the sum of the costs of all its intermediate joins. A sketch of this cost comparison, together with the configuration flags that control the feature, follows the Q25 query text below.

Performance Testing and Analysis for Queries

We added these cost-based optimizations to Spark in a non-intrusive way and gated the feature behind a global configuration flag, spark.sql.cbo.enabled. In Spark 2.2 this parameter is false by default. The feature is deliberately turned off by default for now: Spark is used in production by thousands of companies, and enabling it by default could put extra pressure on those production environments and lead to undesirable consequences.

Configuration and Methodology

We tested Apache Spark 2.2 query performance with TPC-DS on a cluster of four nodes (each node: Huawei FusionServer RH2288, 40 cores, 384 GB memory), using a scale factor of 1000 (approximately 1TB of data). Collecting statistics for all 24 tables (245 columns in total) took about 14 minutes.

Before verifying the end-to-end results, let's look at one TPC-DS query, Q25 (below), to better understand the power of cost-based join reordering. This query involves three fact tables: store_sales (2.9 billion rows), store_returns (288 million rows), and catalog_sales (1.44 billion rows). It also involves three dimension tables: date_dim (73,000 rows), store (1K rows), and item (300K rows).

SELECT
  i_item_id,
  i_item_desc,
  s_store_id,
  s_store_name,
  sum(ss_net_profit) AS store_sales_profit,
  sum(sr_net_loss) AS store_returns_loss,
  sum(cs_net_profit) AS catalog_sales_profit
FROM
  store_sales, store_returns, catalog_sales,
  date_dim d1, date_dim d2, date_dim d3, store, item
WHERE
  d1.d_moy = 4
  AND d1.d_year = 2001
  AND d1.d_date_sk = ss_sold_date_sk
  AND i_item_sk = ss_item_sk
  AND s_store_sk = ss_store_sk
  AND ss_customer_sk = sr_customer_sk
  AND ss_item_sk = sr_item_sk
  AND ss_ticket_number = sr_ticket_number
  AND sr_returned_date_sk = d2.d_date_sk
  AND d2.d_moy BETWEEN 4 AND 10
  AND d2.d_year = 2001
  AND sr_customer_sk = cs_bill_customer_sk
  AND sr_item_sk = cs_item_sk
  AND cs_sold_date_sk = d3.d_date_sk
  AND d3.d_moy BETWEEN 4 AND 10
  AND d3.d_year = 2001
GROUP BY
  i_item_id, i_item_desc, s_store_id, s_store_name
ORDER BY
  i_item_id, i_item_desc, s_store_id, s_store_name
LIMIT 100
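
As mentioned above, here is a minimal sketch of the cost comparison performed during join reordering, together with the session configuration that toggles the feature; the PlanStats values near the end are made-up illustrative numbers, while spark.sql.cbo.enabled, spark.sql.cbo.joinReorder.enabled, and spark.sql.cbo.joinReorder.card.weight are the actual Spark 2.2 configuration keys.

    // Minimal sketch: toggling the CBO and comparing candidate plans by cost.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("cbo-cost-sketch").getOrCreate()

    // Enable the CBO and cost-based join reordering (both are off by default in 2.2).
    spark.conf.set("spark.sql.cbo.enabled", "true")
    spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

    // Cost = weight * cardinality + (1.0 - weight) * size  (formula [4] above).
    val weight = spark.conf.get("spark.sql.cbo.joinReorder.card.weight", "0.7").toDouble

    // PlanStats is an illustrative stand-in for a candidate plan's estimated statistics.
    case class PlanStats(cardinality: Double, sizeInBytes: Double)
    def cost(p: PlanStats): Double =
      weight * p.cardinality + (1.0 - weight) * p.sizeInBytes

    // Made-up estimates for two candidate orderings of the same multi-way join.
    val leftDeep = PlanStats(cardinality = 1.99e9, sizeInBytes = 4.0e11)
    val bushy    = PlanStats(cardinality = 3.3e8,  sizeInBytes = 6.5e10)

    // The reordering search keeps the cheaper alternative for each set of join items.
    val cheaper = if (cost(bushy) < cost(leftDeep)) "bushy" else "left-deep"
    println(s"cheaper candidate: $cheaper")
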
Q25 Without the CBO

Let's first look at the join tree of Q25 without cost-based optimization. This kind of tree is generally called a left-deep tree. Here, joins #1 and #2 are large fact-to-fact joins: the three fact tables store_sales, store_returns, and catalog_sales are joined together, producing a large intermediate result table. Both joins are executed as shuffle joins with large outputs, and join #1 alone outputs 1.99 billion rows. In total, with the CBO disabled, the query took 241 seconds.

Q25 With the CBO

With the CBO, on the other hand, Spark creates an optimized plan that reduces the intermediate results. In this case, Spark builds a bushy tree rather than a left-deep tree. Under the CBO rules, Spark first joins each fact table with its corresponding dimension table before attempting to join the fact tables with each other. Avoiding the large fact-to-fact join means avoiding a very expensive shuffle. In this query, the intermediate result size is reduced to about one-sixth of what it was before. As a result, Q25 took only 71 seconds, a 3.4x performance improvement.

TPC-DS Query Performance

Now that we have an intuitive feel for why performance improves, let's look at the end-to-end TPC-DS query results. We compared the running times of all the TPC-DS queries with and without the CBO. First, note that for half of the TPC-DS queries there is no performance change, because their query plans are the same with or without the CBO (for example, existing heuristic rules already optimize these queries well even without the CBO). The remaining queries improve, and most interestingly, for 16 of them the CBO changes the query plan dramatically and brings more than a 30% performance improvement. Overall, we observed that these 16 queries run about 2.2x faster on average, with Q72 speeding up by as much as 8x.

Conclusion

To recap, this blog post has shown Apache Spark 2.2's new CBO at various levels of detail. We discussed the statistics collection framework, cardinality propagation through filter and join, the optimizations the CBO enables (build-side selection and multi-way join reordering), and the TPC-DS query performance improvements. Over the last year we resolved 32 sub-tasks under the CBO umbrella JIRA SPARK-16026, involving more than 50 patches and more than 7,000 lines of code. That said, getting a CBO right in a distributed database is very difficult, and this is only a small step in that direction. In future releases we plan to continue in this direction, adding more sophisticated statistics (histograms, count-min sketches, partition-level statistics, etc.) and improving our formulas. We are very excited about the progress we have made and hope you enjoy these improvements. We encourage you to try the new CBO in Apache Spark 2.2!

Further Reading

See the Spark Summit 2017 talk: Cost Based Optimizer in Spark 2.2

    1. The rationale is that a smaller relation is more likely to fit in memory.
    2. <=> denotes the null-safe equality operator: it returns true if both sides are NULL, and false if only one side is NULL.
    3. P. Griffiths Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, T. G. Price, "Access Path Selection in a Relational Database Management System", Proceedings of the ACM SIGMOD Conference, 1979.
    4. weight is a tuning parameter that can be configured via spark.sql.cbo.joinReorder.card.weight (the default is 0.7).
 
