1. Small table to large table (Broadcast Join)
The small table's data is distributed to every node so that the large table can be joined against it locally. Each executor stores a full copy of the small table, trading space for time by avoiding a shuffle. In Spark SQL this is called a broadcast join.
The conditions for a broadcast join are as follows:
* The table to be broadcast must be smaller than the value configured by spark.sql.autoBroadcastJoinThreshold, which defaults to 10 MB (or the query must carry a broadcast join hint)
* The base table cannot be broadcast; for example, in a left outer join only the right table can be broadcast
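The idea can be sketched in plain Python, without Spark. This is only an illustration of the principle, not Spark's implementation: the function name `broadcast_hash_join` and the sample `orders` and `users` data are made up for this example, and a single dictionary stands in for the copy that every executor would hold.

```python
from collections import defaultdict

def broadcast_hash_join(large_partitions, small_table):
    """Inner-join every partition of the large table against a full copy of the small table."""
    # "Broadcast": build one hash table from the entire small table.
    # In Spark each executor would keep its own copy; here one dict stands in for all of them.
    lookup = defaultdict(list)
    for key, value in small_table:
        lookup[key].append(value)

    joined = []
    # The large table is probed partition by partition, in place: no shuffle is needed.
    for partition in large_partitions:
        for key, value in partition:
            for small_value in lookup.get(key, []):
                joined.append((key, value, small_value))
    return joined

# Hypothetical data: the large table is already split into two partitions.
orders = [[(1, "order-a"), (2, "order-b")], [(1, "order-c"), (3, "order-d")]]
users = [(1, "alice"), (2, "bob")]
result = broadcast_hash_join(orders, users)
```

The cost is visible in the sketch: the whole small table has to fit in `lookup`, which is why the size threshold exists.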
2. Shuffle Hash Join
Because the broadcast table is first collected to the driver and then redundantly distributed to every executor, a broadcast join puts heavy pressure on both the driver and the executors when the table is larger.
Spark can split a large dataset into n smaller datasets by partitioning and process them in parallel.
Relying on the principle that identical keys always fall into the same partition, Spark SQL splits a join between larger tables into n partitions and then hash-joins the data in the corresponding partitions of the two tables.
This relieves some of the pressure of broadcasting one side of the join through the driver, and it also spares each executor from holding an entire broadcast table in memory.
Shuffle Hash Join is divided into two steps:
* The two tables are each re-partitioned by the join keys (that is, shuffled), so that records with the same join key value land in corresponding partitions
* The data in each pair of corresponding partitions is joined: the small table's partition is built into a hash table, which is then probed with the join key value of each record in the large table's partition
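The two steps above can be sketched in plain Python. This is a simplified, single-process illustration rather than Spark's implementation; the function names `shuffle` and `shuffle_hash_join`, the partition count of 3, and the sample data are assumptions made for the example.

```python
from collections import defaultdict

def shuffle(rows, num_partitions):
    """Step 1: send each row to partition hash(key) % n, so equal keys meet in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

def shuffle_hash_join(large, small, num_partitions=3):
    """Inner-join two lists of (key, value) rows partition by partition."""
    large_parts = shuffle(large, num_partitions)
    small_parts = shuffle(small, num_partitions)

    joined = []
    for large_part, small_part in zip(large_parts, small_parts):
        # Step 2: build a hash table from the small side of this partition only...
        table = defaultdict(list)
        for key, value in small_part:
            table[key].append(value)
        # ...and probe it with every record of the large side of the same partition.
        for key, value in large_part:
            for small_value in table.get(key, []):
                joined.append((key, value, small_value))
    return joined

# Hypothetical data with integer join keys.
large = [(1, "a"), (2, "b"), (1, "c"), (4, "d")]
small = [(1, "x"), (2, "y"), (5, "z")]
result = sorted(shuffle_hash_join(large, small))
```

Each hash table only ever holds one partition of the small side, which is what keeps the memory cost per executor below that of a full broadcast.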
Shuffle Hash Join has the following conditions:
* The average partition size does not exceed the value configured by spark.sql.autoBroadcastJoinThreshold, which is 10 MB by default
* As with a broadcast join, the base table cannot be the small (hash table) side; for example, in a left outer join only the right table can be
* One side must be significantly smaller than the other, and the smaller side is the one built into the hash table ("significantly smaller" is defined as at least 3 times smaller, which is an empirical value)
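The two size conditions amount to simple arithmetic, sketched below. The function `small_side_qualifies` is hypothetical and not a Spark API; it assumes that the "average partition size" refers to the smaller table's total size divided by the number of shuffle partitions, and that all sizes are given in bytes.

```python
# Default of spark.sql.autoBroadcastJoinThreshold as stated above: 10 MB.
AUTO_BROADCAST_JOIN_THRESHOLD = 10 * 1024 * 1024

def small_side_qualifies(small_bytes, large_bytes, num_partitions,
                         threshold=AUTO_BROADCAST_JOIN_THRESHOLD):
    """Check the two size conditions for using the small side as the hash table side."""
    # Condition 1: an average partition of the small side fits under the threshold.
    avg_partition_bytes = small_bytes / num_partitions
    fits_per_partition = avg_partition_bytes <= threshold
    # Condition 2: the small side is at least 3 times smaller than the other side.
    much_smaller = small_bytes * 3 <= large_bytes
    return fits_per_partition and much_smaller

MB = 1024 * 1024
ok = small_side_qualifies(500 * MB, 2000 * MB, num_partitions=200)
too_close = small_side_qualifies(500 * MB, 1000 * MB, num_partitions=200)
too_big = small_side_qualifies(4000 * MB, 20000 * MB, num_partitions=200)
```

With 200 partitions, a 500 MB table averages 2.5 MB per partition, so it passes the first check; it only passes the second when the other table is at least 1500 MB.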
3. Large table to large table (Sort Merge Join)
The two tables are each re-shuffled by the join keys so that records with the same join key value land in corresponding partitions. After partitioning, the data within each partition is sorted, and the sorted records of corresponding partitions are then joined.
Because both sequences are sorted, they are traversed from the beginning. When the keys are equal, the joined record is output. When they differ, the side with the smaller key advances: if the left key is smaller, the next left record is taken, otherwise the next right record. Records that have been passed over are simply discarded.
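The merge described above can be sketched in plain Python for a single pair of partitions. This is an illustration under simplifying assumptions, not Spark's implementation: the function name `sort_merge_join` and the sample data are made up, the shuffle step is omitted, and only an inner join is shown.

```python
def sort_merge_join(left, right):
    """Inner-join two lists of (key, value) rows by sorting both sides and merging them."""
    left = sorted(left, key=lambda row: row[0])
    right = sorted(right, key=lambda row: row[0])

    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1  # left key is smaller: advance the left side
        elif left_key > right_key:
            j += 1  # right key is smaller: advance the right side
        else:
            # Equal keys: find the run of right rows that share this key...
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # ...and pair every left row of this key with every row in that run.
            while i < len(left) and left[i][0] == left_key:
                for k in range(j, j_end):
                    joined.append((left_key, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return joined

# Hypothetical unsorted inputs with a duplicated key on both sides.
left_rows = [(3, "c"), (1, "a"), (2, "b"), (2, "bb")]
right_rows = [(2, "x"), (4, "w"), (2, "y"), (3, "z")]
result = sort_merge_join(left_rows, right_rows)
```

Neither side is ever loaded into a hash table; apart from the run of equal keys on the right, only the current record of each side is needed, which is why this strategy suits two large tables.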