Several joins in Spark SQL

Last Update:2017-08-23 Source: Internet

Author: User

1. Small table to large table (Broadcast Join)

The data of the small table is distributed to every node so that the large table can be joined against it locally. Each executor stores a full copy of the small table, trading space for time by avoiding a shuffle. In Spark SQL this is called a broadcast join.
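The idea can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark's implementation: `broadcast_hash_join` and the sample tables are hypothetical names chosen for illustration, and a single hash table stands in for the copy that each executor would hold.

```python
from collections import defaultdict

def broadcast_hash_join(large_partitions, small_table):
    """Inner-join every partition of the large table against the whole small table."""
    # "Broadcast": build one hash table from the complete small table.
    # On a real cluster each executor would keep its own copy of it.
    lookup = defaultdict(list)
    for key, value in small_table:
        lookup[key].append(value)

    joined = []
    for partition in large_partitions:  # the large table is never shuffled
        for key, value in partition:
            for small_value in lookup.get(key, []):
                joined.append((key, value, small_value))
    return joined

# Two partitions of a "large" table and one "small" table (illustrative data).
orders = [[(1, "order-a"), (2, "order-b")], [(1, "order-c"), (3, "order-d")]]
users = [(1, "alice"), (2, "bob")]
result = broadcast_hash_join(orders, users)
print(result)
```

Each large-table partition is joined where it already lives, which is why no shuffle is needed; the cost is the memory used by the copied small table.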

The conditions for broadcast join are as follows:

* The table to be broadcast must be smaller than the value configured by spark.sql.autoBroadcastJoinThreshold, which defaults to 10 MB (or a broadcast join hint must be given)

* The base table cannot be broadcast; for example, in a left outer join only the right table can be broadcast
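The two conditions can be written as a small predicate. This is a simplified model in plain Python, assuming the rules exactly as stated above; `can_broadcast` and its parameters are hypothetical, and the right outer join case is added by symmetry with the left outer join example.

```python
# Default of spark.sql.autoBroadcastJoinThreshold: 10 MB.
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024

def can_broadcast(table_bytes, side, join_type, hinted=False,
                  threshold=DEFAULT_THRESHOLD_BYTES):
    """Return True if the table on `side` ('left' or 'right') may be broadcast."""
    # Condition 2: the base table of an outer join cannot be broadcast.
    if join_type == "left_outer" and side == "left":
        return False
    if join_type == "right_outer" and side == "right":  # assumed mirror case
        return False
    # Condition 1: below the threshold, or an explicit broadcast hint.
    return hinted or table_bytes < threshold

MB = 1024 * 1024
print(can_broadcast(5 * MB, "right", "left_outer"))          # small right table
print(can_broadcast(5 * MB, "left", "left_outer"))           # base table
print(can_broadcast(50 * MB, "right", "inner", hinted=True)) # large but hinted
```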

2. Shuffle Hash Join

Because the broadcast table is first collected to the driver and then distributed redundantly to every executor, broadcasting a larger table puts considerable pressure on both the driver and the executors.

Spark can use partitioning to split a large dataset into n smaller datasets that are processed in parallel.

Because records with the same key are guaranteed to fall into the same partition, Spark SQL splits the join of two larger tables into n partitions and then hash-joins the data in the corresponding partitions of the two tables.

To some extent this relieves the driver of having to broadcast one side of the join, and it also saves each executor the memory needed to hold the entire broadcast table.

Shuffle hash join is divided into two steps:

* Both tables are repartitioned by the join keys (the shuffle), so that records with the same join key value land in corresponding partitions

* The data in each pair of corresponding partitions is joined: the small table's partition is built into a hash table, which is then probed with the join key value of each record in the large table's partition
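The two steps can be sketched in plain Python. This is an illustration under simplifying assumptions, not Spark's code: `shuffle` and `shuffle_hash_join` are hypothetical helpers, partitions are in-memory lists, and Python's built-in `hash` stands in for Spark's partitioner.

```python
from collections import defaultdict

def shuffle(rows, num_partitions):
    """Step 1: route each (key, value) row to a partition chosen by hashing the key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

def shuffle_hash_join(large, small, num_partitions=4):
    """Inner join: shuffle both sides, then hash-join partition by partition."""
    large_parts = shuffle(large, num_partitions)
    small_parts = shuffle(small, num_partitions)
    joined = []
    for large_part, small_part in zip(large_parts, small_parts):
        # Step 2: build a hash table from the small side of this partition only.
        table = defaultdict(list)
        for key, value in small_part:
            table[key].append(value)
        # Probe it with the rows of the corresponding large-side partition.
        for key, value in large_part:
            for small_value in table.get(key, []):
                joined.append((key, value, small_value))
    return joined

large = [(1, "a"), (2, "b"), (5, "c"), (1, "d")]
small = [(1, "x"), (5, "y"), (7, "z")]
print(sorted(shuffle_hash_join(large, small)))
```

Because both sides use the same hash function and partition count, matching keys always meet in the same partition, and each hash table only has to hold one partition of the small side rather than the whole table.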

Shuffle hash join has the following conditions:

* The average partition size must not exceed the value configured by spark.sql.autoBroadcastJoinThreshold, which is 10 MB by default

* The base table cannot be the side that is built into the hash table; for example, in a left outer join only the right table can be used

* One side must be significantly smaller than the other, and the smaller side is the one built into the hash table ("significantly smaller" is defined as at least 3 times smaller, which is an empirical value)
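As a rough model of these conditions, the check can be written as a plain-Python predicate. `can_shuffle_hash` and its parameters are hypothetical names, the comparison operators follow the wording above rather than any particular Spark version, and the right outer join case is assumed by symmetry.

```python
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024  # spark.sql.autoBroadcastJoinThreshold default

def can_shuffle_hash(build_bytes, probe_bytes, num_partitions, build_side,
                     join_type, threshold=DEFAULT_THRESHOLD_BYTES, ratio=3):
    """Return True if the table on `build_side` may be hashed per partition."""
    # The base table of an outer join cannot be the hashed (build) side.
    if (join_type, build_side) in {("left_outer", "left"), ("right_outer", "right")}:
        return False
    # The average partition of the build side must fit under the threshold.
    avg_partition_ok = build_bytes / num_partitions <= threshold
    # The build side must be at least `ratio` times smaller than the other side.
    much_smaller = build_bytes * ratio <= probe_bytes
    return avg_partition_ok and much_smaller

GB = 1024 ** 3
print(can_shuffle_hash(1 * GB, 10 * GB, 200, "right", "left_outer"))
print(can_shuffle_hash(1 * GB, 2 * GB, 200, "right", "inner"))
```

In the first call the build side averages roughly 5 MB per partition and is ten times smaller than the other side, so all conditions hold; in the second call the sizes are too close together.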

3. Large table to large table (Sort Merge Join)

Both tables are first shuffled by the join keys so that records with the same join key value land in corresponding partitions. The data within each partition is then sorted, and the sorted records in corresponding partitions are joined.

Because both sequences are sorted, they are traversed from the beginning: when the keys match, the joined record is output; when they differ, the side with the smaller key advances (the left side if the left key is smaller, otherwise the right side), and the record that was passed over is discarded.
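The merge step can be sketched in plain Python for a single pair of partitions. This is a simplified illustration, not Spark's implementation: `sort_merge_join` is a hypothetical helper, the shuffle is omitted, and duplicate keys are handled by rescanning the run of matching rows on the right.

```python
def sort_merge_join(left, right):
    """Inner join of two lists of (key, value) rows: sort both, then merge."""
    left = sorted(left, key=lambda row: row[0])
    right = sorted(right, key=lambda row: row[0])
    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1  # left key is smaller: advance the left side
        elif left_key > right_key:
            j += 1  # right key is smaller: advance the right side
        else:
            # Keys match: pair this left row with every right row sharing the key.
            k = j
            while k < len(right) and right[k][0] == left_key:
                joined.append((left_key, left[i][1], right[k][1]))
                k += 1
            i += 1  # keep j, so a following left row with the same key rescans the run
    return joined

left = [(3, "c"), (1, "a"), (2, "b"), (2, "b2")]
right = [(2, "x"), (4, "w"), (2, "y"), (3, "z")]
print(sort_merge_join(left, right))
```

Neither side has to be loaded into a hash table, which is why this approach suits two large tables: after sorting, each partition is read in a single forward pass.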

This article is an English version of an article originally published in Chinese on aliyun.com.
