Several joins in Spark SQL

Source: Internet
Author: User

1. Small table to large table (Broadcast Join)

A broadcast join distributes the data of the small table to every node so that the large table can be joined against it locally. Each executor stores a full copy of the small table, trading space for time by avoiding the shuffle operation. Spark SQL calls this a broadcast join.

The conditions for broadcast join are as follows:

* The broadcast table must be smaller than the value configured by spark.sql.autoBroadcastJoinThreshold, which defaults to 10 MB (or a broadcast join hint must be specified)

* The base table cannot be broadcast; for example, in a left outer join only the right table can be broadcast
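The idea can be illustrated with a minimal plain-Python sketch. It is a conceptual model only, not Spark's implementation: the function name `broadcast_hash_join` and the sample tables are hypothetical, and lists of `(key, value)` tuples stand in for table partitions.

```python
from collections import defaultdict

def broadcast_hash_join(large_partitions, small_rows):
    """Inner-join every partition of the large table against one shared copy of the small table."""
    # Build the lookup table once; in a cluster each executor would hold a full copy of it.
    lookup = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)

    joined = []
    for partition in large_partitions:
        # Each partition is probed where it already lives, so the large table is never shuffled.
        for key, value in partition:
            for small_value in lookup.get(key, []):
                joined.append((key, value, small_value))
    return joined

# Hypothetical sample data: two partitions of a large table and one small table.
orders = [[(1, "order-a"), (2, "order-b")], [(1, "order-c"), (3, "order-d")]]
users = [(1, "alice"), (2, "bob")]
result = broadcast_hash_join(orders, users)
```

Every partition of the large table reads the same lookup table, which is why the small side has to fit in each executor's memory.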

2. Shuffle Hash Join

Because the broadcast table is first collected to the driver and then distributed redundantly to every executor, a broadcast join puts increasing pressure on both the driver and the executors as the table grows.

Spark can split a large dataset into n smaller datasets through partitioning and process them in parallel.

Relying on the principle that records with the same key always land in the same partition, Spark SQL splits the join of larger tables into n partitions and then performs a hash join on the data of the corresponding partitions of the two tables.

To some extent this relieves the driver of having to broadcast one side of the join, and it also means no executor has to hold the entire broadcast table in memory.

Shuffle hash join is divided into two steps:

* The two tables are repartitioned by the join keys (that is, shuffled) so that records with the same join key value end up in corresponding partitions

* The data in each pair of corresponding partitions is joined: the small table's partition is built into a hash table, which is then probed with the join key value of each record in the large table's partition
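The two steps above can be sketched in plain Python. This is a simplified single-process model rather than Spark's code: the names `shuffle` and `shuffle_hash_join` are hypothetical, partitions are ordinary lists, and integer keys are assumed so that `hash(key)` is stable.

```python
from collections import defaultdict

def shuffle(rows, num_partitions):
    """Redistribute rows so that equal keys always land in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

def shuffle_hash_join(large_rows, small_rows, num_partitions=4):
    """Inner join: repartition both sides by key, then hash-join partition by partition."""
    # Step 1: repartition both tables by the join key.
    large_parts = shuffle(large_rows, num_partitions)
    small_parts = shuffle(small_rows, num_partitions)

    joined = []
    # Step 2: join each pair of corresponding partitions independently.
    for large_part, small_part in zip(large_parts, small_parts):
        # Build a hash table from the small side of this partition only.
        table = defaultdict(list)
        for key, value in small_part:
            table[key].append(value)
        # Probe it with every record of the large side of the same partition.
        for key, value in large_part:
            for small_value in table.get(key, []):
                joined.append((key, value, small_value))
    return joined

# Hypothetical sample data.
large = [(1, "a"), (2, "b"), (5, "c"), (7, "d")]
small = [(1, "x"), (5, "y"), (6, "z")]
result = shuffle_hash_join(large, small)
```

Each hash table only covers one partition of the small side, so its memory footprint is roughly the small table's size divided by the number of partitions.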

Shuffle hash join has the following conditions:

* The average partition size does not exceed the value configured by spark.sql.autoBroadcastJoinThreshold, which is 10 MB by default

* The base table cannot be the side that is built into the hash table; for example, in a left outer join only the right table can be used

* One side must be significantly smaller than the other, and the smaller side is the one built into the hash table ("significantly smaller" is defined as at least 3 times smaller, which is an empirical value)
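The conditions listed so far can be combined into a rough decision function. This is a simplified sketch of the rules as this article states them, not Spark's planner: the function `choose_join`, its parameters, and the returned labels are all hypothetical, and only inner and left outer joins are modeled.

```python
MB = 1024 * 1024
BROADCAST_THRESHOLD = 10 * MB  # the 10 MB default cited in the article

def choose_join(left_bytes, right_bytes, num_partitions, join_type="inner"):
    """Return (strategy, small_side) according to the article's conditions (simplified)."""
    # Each candidate is (side, its size, size of the other side).
    # In a left outer join the left (base) table may not be the small side.
    candidates = [("right", right_bytes, left_bytes)]
    if join_type == "inner":
        candidates.append(("left", left_bytes, right_bytes))
    candidates.sort(key=lambda c: c[1])  # prefer the smaller table

    # Broadcast join: the whole small table is below the threshold.
    for side, size, _ in candidates:
        if size < BROADCAST_THRESHOLD:
            return ("broadcast", side)

    # Shuffle hash join: small average partition and at least 3 times smaller.
    for side, size, other in candidates:
        small_partitions = size / num_partitions < BROADCAST_THRESHOLD
        much_smaller = size * 3 <= other
        if small_partitions and much_smaller:
            return ("shuffle_hash", side)

    # Otherwise fall back to the join for two large tables.
    return ("sort_merge", None)

# Hypothetical table sizes with 200 shuffle partitions.
small_right = choose_join(500 * MB, 5 * MB, 200)
medium_right = choose_join(2000 * MB, 400 * MB, 200)
similar_sizes = choose_join(2000 * MB, 1500 * MB, 200)
```

A real planner also takes hints, statistics availability, and other settings into account, so treat the function as a reading aid for the conditions above.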

3. Large table to large table (Sort Merge Join)

The two tables are first shuffled by the join keys so that records with the same join key value end up in corresponding partitions. After partitioning, the data within each partition is sorted, and the sorted records of the corresponding partitions are then joined.

Because both sequences are sorted, the join traverses them from the beginning: when the keys are equal, the match is output; when they differ, the side with the smaller key advances (the left side if the left key is smaller, otherwise the right side).
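The merge step described above can be sketched for a single pair of partitions. This is an illustrative plain-Python version, not Spark's implementation: the function name `sort_merge_join` and the sample rows are hypothetical, and the shuffle is assumed to have already placed matching keys in the same partition.

```python
def sort_merge_join(left_rows, right_rows):
    """Inner-join two lists of (key, value) rows from one partition by sorting and merging."""
    left = sorted(left_rows, key=lambda row: row[0])
    right = sorted(right_rows, key=lambda row: row[0])

    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1  # left key is smaller: advance the left side
        elif left_key > right_key:
            j += 1  # right key is smaller: advance the right side
        else:
            # Keys match: pair this left row with the whole run of equal right keys.
            k = j
            while k < len(right) and right[k][0] == left_key:
                joined.append((left_key, left[i][1], right[k][1]))
                k += 1
            # Keep j in place so the next left row with the same key sees the same run.
            i += 1
    return joined

# Hypothetical sample data with duplicate keys on both sides.
left = [(3, "c"), (1, "a"), (2, "b"), (2, "b2")]
right = [(2, "x"), (4, "w"), (2, "y"), (3, "z")]
result = sort_merge_join(left, right)
```

Neither side has to be loaded into a hash table, which is why this approach suits two large tables.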
