SparkSql predicate push-down rules in outer join queries


SparkSql

SparkSql is a distributed SQL engine built on the Spark computing framework. It uses DataFrame and DataSet to carry structured and semi-structured data for complex query and processing; the DSL it provides lets you express SQL-style queries directly in Scala, and the thrift server exposes SQL querying as a service. SparkSql also provides a Data Source API, through which you can develop connectors to query all kinds of data sources directly, including NoSQL stores, RDBMSs, search engines, and distributed file systems such as HDFS. Systems similar to SparkSql include Hive (in the sense of separating SQL from the underlying computing framework), as well as PrestoDB and Impala, which to some extent target interactive ad-hoc queries.
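As a quick illustration of the two query styles mentioned above, here is a minimal Scala sketch; the SparkSession setup is standard, but the input file people.json and its columns are hypothetical:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = spark.read.json("people.json")   // hypothetical input with columns name, age
df.createOrReplaceTempView("people")

// SQL-style query through the SQL engine
spark.sql("SELECT name FROM people WHERE age > 20").show()

// The same query through the Scala DSL
df.filter($"age" > 20).select("name").show()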

Predicate push-down

Predicate, in English, is defined as follows: a predicate is a function that returns bool (or something that can be implicitly converted to bool), that is, a function whose return value is true or false. Anyone who has used Scala or Spark knows the filter method, a higher-order function whose input parameter is exactly such a true-or-false function. SQL has no such method, only expressions: the expression after WHERE is used for filtering. After the SQL layer parses and processes these statements, inside the database they appear as predicates.
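In Scala terms, a predicate and filter look like this (a trivial, self-contained example):

// A predicate: a function that returns true or false
val greaterThanOne: Int => Boolean = _ > 1

// filter is a higher-order function that takes a predicate
Seq(1, 2, 3).filter(greaterThanOne)   // Seq(2, 3)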

Why do predicates need to be pushed down? Bluntly, the question amounts to: who performs the data filtering, and who is able to perform it? To answer that, we can roughly divide the query process in SparkSql into the following steps:

SparkSql first analyzes the input SQL statement, which includes lexical parsing (akin to word segmentation in search engines), syntax analysis, and semantic analysis (for example, checking that the database or table exists, that GROUP BY is combined with aggregate functions, and other rules). It then generates execution plans, both logical and physical: many optimizations happen in the logical-plan phase, while the physical plan is the generation of the RDD DAG. Once these two steps are complete, the concrete execution begins (the genuinely heavyweight computing), and the various physical operators (transformations on RDDs) come on stage one by one. The two operators relevant to this article are Scan and Filter: the Scan operator reads directly from the underlying data source, and the Filter operator filters the data after the scan.
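You can watch these phases yourself with explain(true), which prints every plan stage (a sketch; the people view from the earlier example is assumed, and the exact output layout varies by Spark version):

spark.sql("SELECT name FROM people WHERE age > 20").explain(true)
// Prints, in order:
//   == Parsed Logical Plan ==     after lexical and syntax analysis
//   == Analyzed Logical Plan ==   after semantic analysis
//   == Optimized Logical Plan ==  where rules such as predicate push-down run
//   == Physical Plan ==           the operators that become the RDD DAG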

We know that SparkSql's Data Source API can be wrapped around all kinds of data sources for querying. If the underlying source cannot filter data efficiently, a full scan is performed and every piece of relevant data is shipped to SparkSql's Filter operator to be filtered. Although the code-generation technology SparkSql uses greatly speeds up that filtering, the process still cannot avoid reading large volumes of data from disk, and may even involve network I/O (for example, when data is not node-local). If, on the other hand, the underlying source can filter quickly during the scan itself, the filtering is handed over to the data source: this is predicate push-down in SparkSql. (Which data sources can filter efficiently, and how SparkSql hands filters to them, is not the focus of this article.)
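For a source that supports push-down, such as Parquet files, the effect is visible in the physical plan. A sketch; the path below is hypothetical, and the exact plan text varies by Spark version:

val users = spark.read.parquet("/data/users")   // hypothetical Parquet data set
users.filter($"id" > 1).explain()
// The scan node of the physical plan typically reports something like:
//   PushedFilters: [IsNotNull(id), GreaterThan(id,1)]
// meaning rows are filtered during the scan itself rather than by a
// separate Filter operator after a full read.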

Outer join queries and join conditions

Outer join queries are divided into left outer joins, right outer joins, and full outer joins. Full outer joins have few use cases, so this article focuses on left outer join queries; right outer joins are the mirror image.

A join query has a join condition; only when two rows, one from each table, satisfy it can they be "joined" and returned together. For example, consider the following query:

 
SELECT LT.value, RT.value
FROM lefttable LT LEFT JOIN righttable RT
  ON LT.id = RT.id AND LT.id > 1
WHERE RT.id > 2

"LT. id = RT. id and lt. the "id> 1" condition is called the "join condition". It is used to determine whether the two rows of the two tables to be joined can be joined together. If this condition is not met, these two rows in two tables are not all kicked out, but are processed differently based on different connection query types, therefore, this is not a single-Table filtering process or a "joint filtering" process for two tables, but the "RT. id> 2 "is called the" join condition ", which is a single table filtering process. The predicate push-down mentioned above can be used in two types of conditions. In SparkSql, there are specific rules. Take the left outer join query as an example. The rules are as follows:


Next, we will analyze the rules in this table in detail.

Suppose we have two tables. Their structure is very simple, and each holds only two rows of data, but that is enough to illustrate the push-down rules. The two tables are as follows (the values l1, l2, r1, r2 are illustrative):

Lefttable:

id | value
1  | l1
2  | l2

Righttable:

id | value
1  | r1
2  | r2
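To make the walk-throughs below reproducible, here is a sketch that builds the two tables as temporary views (again assuming a SparkSession named spark):

import spark.implicits._

// The illustrative rows from the tables above
val lefttable  = Seq((1, "l1"), (2, "l2")).toDF("id", "value")
val righttable = Seq((1, "r1"), (2, "r2")).toDF("id", "value")

lefttable.createOrReplaceTempView("lefttable")
righttable.createOrReplaceTempView("righttable")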

Left-table condition after the join (in WHERE): pushed down

The query statement is as follows:

 
SELECT LT.id, LT.value, RT.value
FROM lefttable LT
LEFT JOIN righttable RT
  ON LT.id = RT.id
WHERE LT.id > 1

Let's analyze the result of pushing "LT.id > 1" down to the left table for filtering. After the filter, the left table becomes:

id | value
2  | l2

Now join with the right table. The remaining left row, id 2, finds a matching id in the right table, and the join result is:

LT.id | LT.value | RT.value
2     | l2       | r2
As you can see, the pushed-down condition filtered out 50% of the left table's data, even if that is just one row out of two. Push-down is legal here because SparkSql parses the query above into the following subquery form:

 
SELECT LT.id, LT.value, RT.value
FROM (SELECT id, value
      FROM lefttable LT
      WHERE LT.id > 1
     ) TT
LEFT JOIN righttable RT
  ON TT.id = RT.id

This is a non-correlated subquery: the subquery can be evaluated completely on its own before the parent query runs, because nothing in it refers to the outer query.
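The rewrite can be observed in the optimized logical plan (a sketch against the temp views built earlier; the exact plan layout depends on the Spark version):

spark.sql("""
  SELECT LT.id, LT.value, RT.value
  FROM lefttable LT LEFT JOIN righttable RT ON LT.id = RT.id
  WHERE LT.id > 1
""").explain(true)
// In the Optimized Logical Plan, the Filter on id > 1 sits below the Join,
// directly above the scan of lefttable: exactly the subquery form above.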

Left-table condition in the join (in ON): not pushed down

The query statement is as follows:

 
SELECT LT.id, LT.value, RT.value
FROM lefttable LT LEFT JOIN righttable RT
  ON LT.id = RT.id AND LT.id > 1

Predicate push-down exists to improve query efficiency; without it we still get correct results, just more slowly. So let's first compute the correct result without pushing anything down. The join proceeds as follows:

Step 1: the left-table row with id 1 finds the same id in the right table, but id 1 does not satisfy the second join condition (LT.id > 1), so the two rows are not joined. The left row's values are retained and the right-table columns become null. (Yes: in a left outer join, a row that fails the join condition still keeps the left table's values, with nulls on the right. That is just how capricious it is.)

Step 2: the left-table row with id 2 also finds id 2 in the right table, and id 2 is greater than 1, so both join conditions are satisfied and it joins with the right-table row; both sides' values are retained. The final query result is:

LT.id | LT.value | RT.value
1     | l1       | null
2     | l2       | r2

If we push "LT. id> 1" to the table, what will happen?
First, the left table is filtered by "LT. id> 1", as follows:


Now connect to the right table. The row with the id of 2 in the left table can be found in the right table. id = RT. id and lt. id> 1 "This join condition, so the values of both tables are retained. If no data exists in the left table, the query is complete. The query result is as follows:


This result disagrees with the correct result computed without push-down, so it is plainly wrong. Hence a left-table condition inside the join condition cannot be pushed down for filtering.
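The wrong result is easy to reproduce by performing the push-down by hand (a sketch against the temp views built earlier):

// Condition left in the ON clause, as SparkSql keeps it (correct)
val correct = spark.sql("""
  SELECT LT.id, LT.value, RT.value
  FROM lefttable LT LEFT JOIN righttable RT
    ON LT.id = RT.id AND LT.id > 1
""")

// Manually pushing the condition into the left table (wrong)
val pushed = spark.sql("""
  SELECT LT.id, LT.value, RT.value
  FROM (SELECT id, value FROM lefttable WHERE id > 1) LT
  LEFT JOIN righttable RT ON LT.id = RT.id
""")

correct.count()   // 2 -- the id 1 row survives with a null right side
pushed.count()    // 1 -- the id 1 row was wrongly filtered away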

Right-table condition in the join (in ON): pushed down

The query statement is as follows:

 
SELECT LT.id, LT.value, RT.value
FROM lefttable LT LEFT JOIN righttable RT
  ON LT.id = RT.id
 AND RT.id > 1

Now we push the right-table join condition "RT.id > 1" down to filter the right table, which leaves:

id | value
2  | r2
Then the left table is joined with the filtered right table. The process is as follows:

Step 1: the left row with id 1 has no match in the filtered right table, so the left values are retained and the right-table columns are null.
Step 2: the left row with id 2 does exist in the right table, and RT.id is greater than 1; both join conditions hold, so both sides' values are retained. The query result is:

LT.id | LT.value | RT.value
1     | l1       | null
2     | l2       | r2
Now compute without push-down (which by definition yields the correct result) and compare. The process is as follows:

Step 1: the left row with id 1 exists in the right table but fails the second join condition, so the rows are not joined: the left values are retained and the right side is null.
Step 2: the left row with id 2 exists in the right table and satisfies the second join condition, so both sides' values are retained. The result is:

LT.id | LT.value | RT.value
1     | l1       | null
2     | l2       | r2
As you can see, for a right-table condition in the join, pushing down and not pushing down produce identical results. So why not push? It filters out half of the right table's data. The equivalent form SparkSql processes is:

 
SELECT LT.id, LT.value, TT.value
FROM lefttable LT
LEFT JOIN (SELECT id, value
           FROM righttable RT
           WHERE RT.id > 1
          ) TT
  ON LT.id = TT.id
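A quick sanity check that the two formulations agree (a sketch against the temp views built earlier):

val onCondition = spark.sql("""
  SELECT LT.id, LT.value, RT.value
  FROM lefttable LT LEFT JOIN righttable RT
    ON LT.id = RT.id AND RT.id > 1
""")

val preFiltered = spark.sql("""
  SELECT LT.id, LT.value, TT.value
  FROM lefttable LT LEFT JOIN
       (SELECT id, value FROM righttable WHERE id > 1) TT
    ON LT.id = TT.id
""")

onCondition.show()   // identical output for both:
preFiltered.show()   // (1, l1, null) and (2, l2, r2)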

Right-table condition after the join (in WHERE): not pushed down

The query statement is as follows:

 
SELECT LT.id, LT.value
FROM lefttable LT LEFT JOIN righttable RT
  ON LT.id = RT.id
WHERE RT.id > 1

This time the condition after the join cannot be pushed down. First, the correct result, computed without push-down. The process is as follows:

Step 1: the left row with id 1 finds a matching id in the right table, so the join condition is satisfied and the rows are joined. But the WHERE condition is then evaluated on the joined row, and the right table's id 1 does not satisfy RT.id > 1, so the joined row is discarded. (Note that if a left row had no match at all, the right side would be null, and null fails RT.id > 1 too; the WHERE filter runs on the result after the join, which is the key point.)

Step 2: the left row with id 2 joins the right row with id 2, and the joined row satisfies the WHERE condition RT.id > 1, so it is kept. The result is:

LT.id | LT.value
2     | l2
Obviously, this is a correct query result that matches the semantics.

Now look at what happens if the condition after the join is pushed down to the right table:

Step 1: filter the right table with RT.id > 1; only the row with id 2 remains.
Step 2: the left row with id 1 has no match in the filtered right table, so the left values are retained and the right side is null.
Step 3: the left row with id 2 exists in the filtered right table, so both sides' values are retained.

The result is as follows:

LT.id | LT.value
1     | l1
2     | l2
Obviously, this is actually a wrong result.
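Again the difference is easy to reproduce by hand (a sketch against the temp views built earlier):

// WHERE condition evaluated after the join (correct)
val correctWhere = spark.sql("""
  SELECT LT.id, LT.value
  FROM lefttable LT LEFT JOIN righttable RT ON LT.id = RT.id
  WHERE RT.id > 1
""")

// WHERE condition manually pushed into the right table (wrong)
val wrongPush = spark.sql("""
  SELECT LT.id, LT.value
  FROM lefttable LT LEFT JOIN
       (SELECT id FROM righttable WHERE id > 1) TT
    ON LT.id = TT.id
""")

correctWhere.count()  // 1 -- only the id 2 row passes RT.id > 1
wrongPush.count()     // 2 -- the id 1 row wrongly survives with a null right side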

At this point, all four rules for left outer join queries have been analyzed. As we saw, the filter conditions of an outer join query in SparkSql cannot be pushed down to the data source in every case. Used properly, push-down greatly improves query performance; used improperly, it produces wrong results, and such wrong results are not easy to detect. So be extremely careful with it.

Recommended reading:

1. SparkSql optimizer: Catalyst

2. A simple illustration of SparkSql Catalyst

3. Phoenix





