Introduction
Join is a common operation in SQL statements, a good table structure can be scattered in different tables, so that they conform to a certain paradigm, reduce table redundancy, update fault tolerance and so on. The best way to establish relationships between tables and tables is by join operations.
For Spark, there are 3 join implementations, each of which corresponds to a different application scenario:
- Broadcast Hash join: Suitable for a smaller table and a large table for join
- Shuffle Hash join: Fits a small table and a large table for join, or a join between two small tables
- Sort Merge join: Suitable for joins between two larger tables
The first two are based on the hash join, but before the hash join need to first shuffle or first broadcast. Here is a detailed explanation of the specific principles of these three different joins
Hash Join
Let's take a look at this SQL statement: SELECT * from Order,item where item.id = order.i_id, very simple a join node, two tables participating in the join are item and Order,join Keys are item.id and order.i_id, respectively. Now assume that this join is using the hash join algorithm, the whole process will go through three steps:
1. Determine the build table and probe table: This concept is important, build table uses the join key to construct the hash table, and probe table uses the join key to probe, the probe succeeds can join together. Typically, a small table will be used as a build table and a large table as a probe table. In this case, item is build table,order for probe Table;
2. Build Hash Table: Read the data of build table, and for each row of data according to join key (Item.id) Hash,hash to the corresponding bucket, generate a record in the hash table. Data is cached in memory, dump to external memory if memory does not fit;
3. Probe: Scan probe table (order) sequentially, use the same hash function to map the records in the hash table, and then check the join condition (item.id = order.i_id) After the mapping succeeds. If the match succeeds, you can join the two together.
The basic process can be consulted, there are two small issues to be concerned about:
1. How is hash join performance? Obviously, the hash join basically only scan two tables once, can think O (a+b), compared to the most extreme Cartesian set operation A*b, do not know how many streets dumped;
2. Why does the build table choose a small table? The reason is very simple, because the built hash table best can be loaded in memory, the most efficient; this also determines that the hash join algorithm is only suitable for at least one small table of the join scene, for the two large table join scene does not apply.
As mentioned above, the hash join is a single-machine join algorithm in the traditional database, which needs to undergo some distributed transformation in the distributed environment, in the final analysis, it is to use the distributed computing resources to parallelize computing and improve the overall efficiency. Hash join distributed transformation generally has two classic scenarios:
1. Broadcast HASH join: distributes one of the small table broadcasts to the partition node on which the other large table resides, and then hashes the partition records on it with each other. Broadcast is suitable for the small table is small, can broadcast the scene directly;
2. Shuffler hash join: Once the size of the small table data is large, it is no longer suitable for broadcast distribution. In this case, the two tables can be re-organized by the join key according to the same principle of the same partition as the join key, so that the join divide and divide into a lot of small joins and take advantage of the parallelization of the cluster resources.
Broadcast Hash Join
As you know, in a common model of a database (such as a star model or snowflake model), tables are generally divided into two types: fact tables and dimension tables. Dimension tables generally refer to fixed, less-variable tables, such as contacts, types of items, etc., with limited general data. The fact table generally records the flow of water, such as sales list, and so on, usually with the growth of time is expanding.
Because the join operation is to connect to records with the same key value in two tables, in Sparksql, the most straightforward way to join two tables is to first partition the key and then take out the same key value in each partition for the connection operation. But this inevitably involves shuffle, and shuffle is a time-consuming operation in spark, and we should design spark applications as much as possible to avoid a lot of shuffle.
In order to avoid shuffle, we can distribute all the data of a dimension table of limited size to each node for use by the fact table when the dimension table and fact table are in the join operation. Executor stores all of the data in the dimension table, sacrificing space to some extent, in exchange for a lot of time-consuming shuffle operations, which are called broadcast joins in Sparksql, as shown in:
Table B is a smaller table, and black means that it is broadcast to each executor node, and each partition of table A will fetch data from table A through the block manager. Depending on the join key for each record, the corresponding record in table B is taken, and the operation is based on the join type. This process is relatively simple, do not repeat.
The conditions for broadcast join are as follows:
1. The broadcast table needs to be less than the value configured by Spark.sql.autoBroadcastJoinThreshold, which defaults to 10M (or hint with broadcast join)
2. The base table cannot be broadcast, such as a left outer join, only the right table can be broadcast
It seems that broadcasting is a more ideal solution, but does it have any shortcomings? It's also obvious. This scheme can only be used to broadcast smaller tables, otherwise the redundant transmission of data is much greater than shuffle overhead, in addition, broadcast performance needs to be broadcast collect to the driver, when there are frequent broadcasts, the memory of driver is also a test.
As shown, the broadcast hash join can be divided into two steps:
1. Broadcast stage: Distribute the small table broadcast to all hosts where the large table resides. Broadcast algorithm can have a lot, the simplest is to send driver,driver and then unified distribution to all executor, or is based on bittorrete-peer thinking;
2. Hash join stage: Execute single-stand hash join, small table mapping, large table probing on each executor;
SPARKSQL specifies that the basic condition for broadcast hash join execution is that the broadcast small table must be less than the parameter spark.sql.autoBroadcastJoinThreshold, which defaults to 10M.
Shuffle Hash Join
When the table on one side compares hours, we choose to broadcast it to avoid shuffle and improve performance. However, because the broadcast table is first collect to the driver segment and then distributed redundantly to each executor, when the table is larger, the use of the broadcast join can cause greater pressure on the driver and executor ends.
However, because Spark is a distributed computing engine, it can be partitioned to divide large quantities of data into small, n-smaller datasets for parallel computing. This thought applied to the join is the shuffle Hash join. Using the same principle of the same inevitable partition key, sparksql the larger table Join divide and conquer, first divides the table into n partitions, and then the two tables in the corresponding partition of the data are hash Join, so that to a certain extent, to reduce the driver broadcast side of the table pressure, It also reduces the memory consumption of the executor end to the entire broadcast table. The principle is as follows:
Shuffle Hash join is divided into two steps:
1. Repartition the two tables according to the join keys, i.e. shuffle, in order to divide records with the same join keys value into the corresponding partition
2. Join the data in the corresponding partition, where the small table partition is constructed as a hash table and then matched according to the join keys value recorded in the large table partition
The Shuffle Hash join has the following conditions:
1. The average size of the partition does not exceed the value configured by Spark.sql.autoBroadcastJoinThreshold, the default is 10M
2. The base table cannot be broadcast, such as a left outer join, only the right table can be broadcast
3. One side of the table to be significantly smaller than the other side, the small side will be broadcast (significantly less than the definition of 3 times times smaller, here is the experience)
As we can see, in a certain size table, sparksql from the angle of time-space integration, the two tables are re-partitioned, and the partition in the small table is hashed to complete the join. On the basis of maintaining certain complexity, the memory pressure of driver and executor is minimized, and the stability of the calculation is improved.
Under Big data conditions, if a table is small, the best choice to perform the join operation is undoubtedly broadcast hash join, which is the most efficient. However, once the size of the small table data increases, the need for broadcast memory, bandwidth and other resources will inevitably be too large, broadcast hash join is no longer the optimal solution. At this point can be partitioned according to the join key, according to the same principle of the same partition key, you can divide and conquer the large table join, divided into many small table joins, make full use of cluster resource parallelization. As shown, the shuffle hash join can also be divided into two steps:
1. Shuffle stage: Two tables are partitioned according to join Key, the records of the same join key are distributed to the same node, and the data of the two tables is distributed to all nodes in the cluster. This process is called shuffle
2. Hash join stage: The data on each partition node executes the single machine hash join algorithm separately.
See here, can be preliminarily summed up if two small table joins can directly use a single-machine hash join, if a large table joins a minimum table, you can choose the broadcast hash join algorithm, and if it is a large table join a small table, you can choose shuffle hash join algorithm; What if it's two big tables for join?
Sort Merge Join
The two implementations described above apply to a table of a certain size, but when two tables are very large, it is obvious that whatever applies will put a lot of pressure on the computational memory. This is because the join is both a hash join, which is to fully load the data on one side into memory, and use the hash code to join the record with equal value of the keys.
When two tables are very large, Sparksql uses a new scheme to join the table, the sort Merge join. This implementation does not have to load one side of the data all the way into the star hash join, but need to sort the data before the join, as shown in:
As you can see, the first two tables are re-shuffle according to the join keys, ensuring that records with the same join keys value are divided into the corresponding partitions. After partitioning, the data within each partition is sorted, sorted, and then connected to the records in the corresponding partition, as shown in:
Looks familiar, doesn't it? is also very simple, because two sequences are ordered, from the beginning to traverse, hit the same key on the output, if different, the left side of the small continue to take the left, vice versa take the right.
It can be seen that regardless of the size of the partition, the sort Merge join does not have to load the data on one side into the memory, but is lost by the fetch, which greatly improves the stability of the SQL join under the large data volume.
Sparksql uses a new algorithm-sort-merge join for two large table joins, as shown in the entire process is divided into three steps:
1. Shuffle stage: The two large tables are re-partitioned according to the join key, and the data of the two tables is distributed across the cluster for distributed parallel processing;
2. Sort stage: The two table data of a single partition node are sorted separately;
3. Merge stage: Performs a join operation on two well-ordered partitioned table data. Join operation is very simple, traversing two sequential sequences respectively, encounter the same join key on the merge output, otherwise take a smaller side, see the schematic:
Through the above analysis, it can be clear that each join algorithm has its own application scenario, the Data Warehouse design is best to avoid large tables and large table join query, Sparksql can also be based on memory resources, The amount of bandwidth resources spark.sql.autoBroadcastJoinThreshold the parameter, so that more joins are actually executed as broadcast hash join.
Reference:
http://blog.csdn.net/asongoficeandfire/article/details/53574034
Https://mp.weixin.qq.com/s/7ohGjdaTC56T4U3ISxGwIw
3 kinds of join implementations of Sparksql