Join implementation in Spark SQL
Shangang, 2017-07-19
Tags: spark, database
Join is an important feature of SQL: almost every non-trivial data analysis scenario relies on it. Now that Spark SQL (Dataset/DataFrame) has become the mainstream way to develop Spark applications, it is important for developers to understand how joins are implemented in Spark.
Overview of the Spark SQL process
Before explaining the join implementations, let us briefly review the overall Spark SQL flow. In general there are two ways to use Spark SQL: one is to write SQL statements directly, which requires a metastore such as Hive; the other is to write a Spark application through the Dataset/DataFrame API. As shown in the figure, a SQL statement is parsed (SQL AST) into a query plan, or a query plan is built directly through the APIs provided by Dataset/DataFrame. Query plans fall into two categories: logical plans and physical plans. The logical plan is first resolved by the Analyzer, then optimized by a series of rules in the Optimizer, and finally mapped to a physical plan, which is converted into RDD operations for execution.
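To make these stages concrete, here is a minimal spark-shell-style sketch; the tables, column names and data are made up for illustration. Calling explain with the extended flag prints the parsed, analyzed and optimized logical plans followed by the physical plan.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("join-plan-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// two tiny tables standing in for a large fact table and a small dimension table
val orders = Seq((1, 50), (1, 200), (2, 120)).toDF("user_id", "amount")
val users  = Seq((1, "alice"), (2, "bob"), (3, "carol")).toDF("id", "name")

val joined = orders.join(users, orders("user_id") === users("id"))

// prints the parsed, analyzed and optimized logical plans,
// followed by the physical plan that is finally executed as RDD operations
joined.explain(true)
```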
For more details on how Spark SQL parses and executes queries, see the article "Parsing and execution of SQL". This article does not go into syntax parsing, analysis, or query optimization; it focuses on the physical implementation of joins.
Join basic elements
As shown in the figure, a join consists of three basic elements: the join type, the join condition, and the filter condition. The filter condition can also be folded into the join condition with an AND clause.
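For example, the three elements are easy to see in plain SQL. A small sketch, reusing the hypothetical `spark`, `orders` and `users` from the snippet above:

```scala
// register temporary views so the join can be written as a SQL statement
orders.createOrReplaceTempView("orders")
users.createOrReplaceTempView("users")

val withFilter = spark.sql(
  """SELECT o.user_id, u.name, o.amount
    |FROM orders o
    |LEFT OUTER JOIN users u ON o.user_id = u.id   -- join type and join condition
    |WHERE o.amount > 100                          -- filter condition
    |""".stripMargin)
withFilter.show()
```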
Spark supports all types of joins, including:
- inner join
- left outer join
- right outer join
- full outer join
- left semi join
- left anti join
The implementation of each of these join types is described below; the short sketch that follows shows how to request them through the DataFrame API.
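Each type corresponds to a joinType string accepted by Dataset.join. A brief sketch, reusing the earlier DataFrames and join condition:

```scala
val cond = orders("user_id") === users("id")

val innerJoin  = orders.join(users, cond, "inner")
val leftOuter  = orders.join(users, cond, "left_outer")
val rightOuter = orders.join(users, cond, "right_outer")
val fullOuter  = orders.join(users, cond, "full_outer")
val leftSemi   = orders.join(users, cond, "left_semi")
val leftAnti   = orders.join(users, cond, "left_anti")
```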
Basic join implementation process
In general, the basic join process is as shown in the figure: Spark abstracts the two tables participating in the join into a streamed table (streamIter) and a build (lookup) table (buildIter). Usually streamIter is the large table and buildIter is the small table. We do not need to worry about which table becomes streamIter and which becomes buildIter; Spark decides this automatically based on the join statement.
During the actual computation, Spark iterates over streamIter. For each record rowA taken from streamIter, it computes keyA from the join condition, then uses keyA to look up in buildIter all records rowBs that satisfy the join condition (keyB == keyA). Each record in rowBs is joined with rowA to produce a joined row, and the final result is obtained by applying the filter condition.
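The following plain-Scala sketch is not Spark's internal code, just an illustration of the lookup loop on a single partition, with a made-up Rec class standing in for rows:

```scala
// keyed records standing in for rows; the join key is precomputed for clarity
case class Rec(key: Int, value: String)

// For each stream-side record, compute its key, probe the build side,
// and keep the joined pairs that pass the filter condition.
def joinPartition(
    streamIter: Iterator[Rec],
    buildIter: Iterable[Rec],
    filter: (Rec, Rec) => Boolean): Iterator[(Rec, Rec)] = {
  // index the build side by join key so each probe is cheap
  val buildTable: Map[Int, Seq[Rec]] = buildIter.toSeq.groupBy(_.key)
  for {
    rowA <- streamIter
    rowB <- buildTable.getOrElse(rowA.key, Seq.empty).iterator
    if filter(rowA, rowB)
  } yield (rowA, rowB)
}
```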
From this process it is not hard to see that, for every record taken from streamIter, matching records must be looked up in buildIter, so buildIter needs to be backed by a data structure with good lookup performance. Spark provides three join implementations: sort merge join, broadcast join, and hash join.
Sort merge join implementation
For two records to be joined, records with the same key must first end up in the same partition, so in general a shuffle is required: in the map phase the key of each record is computed from the join condition, and the shuffle write is performed on that key, so that records that may join with each other land in the same partition. Records with the same key from the two tables can then be pulled into the same partition during the shuffle read phase. As mentioned above, buildIter needs a data structure with good lookup performance. A hash table is the obvious choice, but for a large table it is impossible to fit all records into an in-memory hash table. Alternatively, buildIter can be sorted first and looked up sequentially, which also has acceptable lookup cost. Since the Spark shuffle phase naturally supports sorting, this is easy to implement; this is the sort merge join.
In the shuffle read phase, a merge sort is performed for streamIter and buildIter respectively. Because both sides are sorted, for each streamIter record the matching records are found in buildIter by a sequential scan; after one streamIter record has been processed, the lookup for the next streamIter record simply resumes from where the previous lookup in buildIter ended, so the search never has to restart from the beginning. Overall the lookup performance is good.
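A simplified sketch of that sequential lookup, reusing the Rec class from the earlier snippet; it assumes both sides are already sorted by key within the partition:

```scala
// the build-side cursor only ever moves forward, so each stream record
// resumes the search where the previous one stopped
def sortMergeJoin(streamSorted: IndexedSeq[Rec],
                  buildSorted: IndexedSeq[Rec]): Seq[(Rec, Rec)] = {
  val out = scala.collection.mutable.ArrayBuffer.empty[(Rec, Rec)]
  var b = 0
  for (rowA <- streamSorted) {
    // skip build records whose key is smaller than the current stream key
    while (b < buildSorted.length && buildSorted(b).key < rowA.key) b += 1
    // emit all build records with an equal key; do not advance the shared cursor,
    // because the next stream record may carry the same key
    var m = b
    while (m < buildSorted.length && buildSorted(m).key == rowA.key) {
      out += ((rowA, buildSorted(m)))
      m += 1
    }
  }
  out
}
```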
Broadcast join implementation
To get records with the same key into the same partition, we normally shuffle. But if buildIter is a very small table, there is no need to shuffle at all: buildIter can be broadcast directly to every compute node and put into a hash table there, as shown in the figure.
As you can see, no shuffle is needed and the join can be done entirely inside a map operation, which is why this is often called a map join. So when is a broadcast join used? We do not need to worry about it: Spark SQL decides automatically. When the estimated size of buildIter does not exceed the value of the parameter spark.sql.autoBroadcastJoinThreshold (10 MB by default), a broadcast join is used; otherwise a sort merge join is used.
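For illustration, the broadcast join can also be influenced from user code, either by adjusting the threshold or with an explicit broadcast hint; the values below are purely illustrative and reuse the earlier `spark`, `orders` and `users`:

```scala
import org.apache.spark.sql.functions.broadcast

// let Spark decide automatically: tables estimated below this size (in bytes)
// are broadcast; 10 MB is the default value of the threshold
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)

// or hint explicitly that the small side should be broadcast
val broadcastJoined = orders.join(broadcast(users), orders("user_id") === users("id"))
broadcastJoined.explain()   // the physical plan should contain BroadcastHashJoin
```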
Hash join implementation
In addition to the two implementations above, Spark also provides a hash join. Here the records are not sorted in the shuffle read phase; after all, records with the same key from both tables already end up in the same partition, they are just unsorted within the partition. The records from buildIter are put into a hash table so they can be looked up, as shown in the figure.
It is not hard to see that, because the records from buildIter must be held in an in-memory hash table, each partition of buildIter cannot be too large, otherwise it will not fit in memory. By default the hash join implementation is disabled; to use a hash join, the following four conditions must all be met:
- The total estimated size of buildIter exceeds the value of spark.sql.autoBroadcastJoinThreshold, i.e. the broadcast join condition is not met
- The switch that makes Spark prefer the sort merge join is turned off: spark.sql.join.preferSortMergeJoin=false
- The average size of each partition does not exceed the value of spark.sql.autoBroadcastJoinThreshold, i.e. the buildIter records of each partition in the shuffle read phase fit in memory
- The size of streamIter is more than three times that of buildIter
So the conditions for using a hash join are quite strict. In most real scenarios, even when a hash join could be used, a sort merge join is not much worse, which is why Spark prefers the sort merge join by default.
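For completeness, here is a hedged configuration sketch of how one might steer Spark toward a shuffle hash join under the conditions above; the DataFrames, sizes and threshold value are purely illustrative, and the actual plan still depends on Spark's size estimates:

```scala
// turn off the preference for sort merge join
spark.conf.set("spark.sql.join.preferSortMergeJoin", false)
// keep the threshold small enough that the build side exceeds it in total size,
// while each of its shuffle partitions still fits in memory
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1L * 1024 * 1024)

// illustrative data: the stream side is several times larger than the build side
val bigDF   = spark.range(0, 10000000L).toDF("k")
val smallDF = spark.range(0, 2000000L).toDF("k")

val hashJoined = bigDF.join(smallDF, bigDF("k") === smallDF("k"))
hashJoined.explain()   // look for ShuffledHashJoin rather than SortMergeJoin
```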
Below we describe the implementation process of each join type in turn.
Inner join
An inner join returns only the records that satisfy the join condition in both the left and right tables. When we write a SQL statement or use the DataFrame API, we do not need to care which table is on the left and which is on the right: in the query optimization phase Spark SQL automatically makes the large table the left table, i.e. streamIter, and the small table the right table, i.e. buildIter, which keeps lookups into the small table efficient. The basic process follows the general implementation shown above; in the lookup phase, if the right table has no record satisfying the join condition, the record is simply skipped.
Left outer join
A left outer join takes the left table as the base: for each left-table record it looks for matching records in the right table, and if none is found it returns a record whose right-table fields are all null. When writing SQL or using the DataFrame API, we should generally keep the large table on the left and the small table on the right. The basic process is as shown in the figure.
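A tiny made-up example of the null-filling behaviour, reusing `spark` from the earlier sketches:

```scala
val leftDF  = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "l")
val rightDF = Seq((1, "x"), (3, "y")).toDF("id", "r")

leftDF.join(rightDF, Seq("id"), "left_outer").show()
// expected rows (order may vary):
//   (1, a, x)
//   (2, b, null)   <- no match on the right, so the right-side field is null
//   (3, c, y)
```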
Right outer join
A right outer join takes the right table as the base: for each right-table record it looks for matching records in the left table, and if none is found it returns a record whose left-table fields are all null. Here the right table is streamIter and the left table is buildIter, so when writing SQL or using the DataFrame API we should generally keep the large table on the right and the small table on the left. The basic process is as shown in the figure.
Full outer join
A full outer join is relatively complex: it has to behave both as a left outer join and as a right outer join. We cannot simply run a left outer join, then a right outer join, and union the results, because the inner-join part would then appear twice in the final result. Since the full outer join must complete the work of the left outer join and the right outer join at the same time, it is implemented only with the sort merge join, and each table acts as both streamIter and buildIter. The basic process is as shown in the figure.
Since the left table and the right table are both already sorted, we first take one record from each side and compare their keys. If keyA == keyB, rowA and rowB are joined, and both rowA and rowB advance to the next record of their respective tables. If keyA < keyB, there is no right-table record matching the left-table record rowA, so rowA is joined with a null row and rowA advances to the next left-table record. If keyA > keyB, there is no left-table record matching the right-table record rowB, so a null row is joined with rowB and rowB advances to the next right-table record. This loop continues until all records of both tables have been processed.
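The same merge logic, written as a schematic sketch that reuses the Rec class from earlier; duplicate keys are ignored for brevity, and None stands in for the all-null padding row:

```scala
def fullOuterMerge(left: List[Rec], right: List[Rec]): List[(Option[Rec], Option[Rec])] =
  (left, right) match {
    case (Nil, Nil)         => Nil
    case (a :: as, Nil)     => (Some(a), None) :: fullOuterMerge(as, Nil)
    case (Nil, b :: bs)     => (None, Some(b)) :: fullOuterMerge(Nil, bs)
    case (a :: as, b :: bs) =>
      if (a.key == b.key)     (Some(a), Some(b)) :: fullOuterMerge(as, bs)    // keys equal: join and advance both
      else if (a.key < b.key) (Some(a), None)    :: fullOuterMerge(as, right) // no match on the right for a
      else                    (None, Some(b))    :: fullOuterMerge(left, bs)  // no match on the left for b
  }
```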
Left semi join
A left semi join takes the left table as the base: for each left-table record it looks for a matching record in the right table. If the lookup succeeds, only the left-table record is returned; if it fails, the record is skipped. The basic process is as shown in the figure.
Left anti join
A left anti join is exactly the opposite of a left semi join: it takes the left table as the base and looks for a matching record in the right table. If the lookup succeeds, the record is skipped; only left-table records for which no match is found are returned. The basic process is as shown in the figure.
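A short sketch contrasting the two, reusing `leftDF` and `rightDF` from the left outer join example above:

```scala
leftDF.join(rightDF, Seq("id"), "left_semi").show()
// keeps only left rows that have a match (ids 1 and 3), left-table columns only

leftDF.join(rightDF, Seq("id"), "left_anti").show()
// keeps only left rows with no match (id 2), left-table columns only
```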
Summary
Join is a very important feature of database queries; the database field can fairly be described as "a world of joins". Spark SQL, as a distributed data warehouse system, provides comprehensive join support and quietly does a great deal of optimization internally. Understanding how joins are implemented helps us gain a deeper understanding of how our applications actually run.