SparkSQL Big Data in Practice: Uncovering the Mystery of Join


This article comes from the NetEase Cloud community.

Join is an advanced feature of databases and big data computation, and most scenarios require complex join operations. This article introduces, at the level of principles, the common join algorithms supported by SparkSQL and the scenarios each is suited to.

Join Background Introduction

Join is a topic that database queries can never avoid. Traditional SQL query operations fall into simple operations (filtering with where, sorting and limiting with order by/limit, etc.), aggregation operations (group by), and join operations. Among these, join is the most complex and costly type of operation, and it is also heavily used in OLAP scenarios, so it deserves in-depth study.

In addition, at the business level, users building a data warehouse will also run into join problems. Tables in a data warehouse are typically divided into "low-level tables" and "high-level tables".

A so-called "low-level table" is generated directly from the data sources imported into the warehouse. It has relatively few columns and can usually be clearly classified as a dimension table or a fact table; such tables are mostly linked to one another by foreign-key dependencies, so queries against them involve a large number of joins and query efficiency is poor. A "high-level table" is built on top of the low-level tables; the common practice is to use SQL statements to pre-join the tables that would otherwise need joins into a single "wide table". Queries against a wide table do not require many joins and are highly efficient. The downside of wide tables is heavy data redundancy and a lag in their generation, so query results may not be up to date.

To get fresher query results, most scenarios still require complex join operations. Joins are complicated: they usually have high time and space complexity, and there are many algorithmic variants, so different scenarios call for different algorithms to achieve the best results. This article introduces several common join algorithms supported by SparkSQL and their applicable scenarios.

Common Join Classifications and Basic Implementation Mechanisms

SparkSQL currently supports three join algorithms: shuffle hash join, broadcast hash join, and sort merge join. The first two are, at bottom, both hash joins; they just require a shuffle or a broadcast before the hash join itself. The hash join algorithm comes from traditional databases, while shuffle and broadcast are the distributed "skin" of big data; combining the two produces a big data algorithm. In that sense, the roots of big data lie in traditional databases. Since the hash join is the "kernel", let us dig into it first, and then analyze the "skin".

Hash Join

Consider this SQL statement: select * from order, item where item.id = order.i_id. It is a very simple join: the two participating tables are item and order, and the join keys are item.id and order.i_id respectively. Assume this join uses the hash join algorithm; the whole process goes through three steps:

    1. Determine the build table and the probe table: this distinction is important. The build table uses its join key to construct a hash table, while the probe table uses its join key to probe that hash table; a successful probe produces a joined row. Typically, the small table serves as the build table and the large table as the probe table. In this example, item is the build table and order is the probe table.
    2. Build the hash table: read the data of the build table (item) and, for each row, hash it by its join key (item.id) into the corresponding bucket, generating one record in the hash table. The data is cached in memory; if it does not fit, it must be spilled to external storage.
    3. Probe: scan the probe table (order) sequentially, map each record with the same hash function, and, after a mapping succeeds, check the join condition (item.id = order.i_id). If the condition matches, the two rows are joined and output.
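The three steps above can be sketched in plain Python. This is an illustrative single-machine sketch of the build/probe idea, not Spark's actual implementation; the table and column names follow the item/order example:

```python
def hash_join(build_rows, probe_rows, build_key, probe_key):
    # Build phase: bucket every build-side (small table) row by its join key.
    table = {}
    for row in build_rows:
        table.setdefault(row[build_key], []).append(row)
    # Probe phase: scan the probe side once, looking up matches by the same key,
    # then emit the merged row for every successful match.
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            yield {**match, **row}

item = [{"id": 1, "name": "apple"}, {"id": 2, "name": "pear"}]
order = [{"o_id": 100, "i_id": 1}, {"o_id": 101, "i_id": 2}, {"o_id": 102, "i_id": 1}]
result = list(hash_join(item, order, "id", "i_id"))
```

Note that each input is scanned exactly once, which is where the O(a+b) cost discussed below comes from.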

With the basic process clear, two small questions deserve attention:

    1. How does the hash join perform? Quite well: it scans each of the two tables essentially once, so its cost can be regarded as O(a+b). Compared with the most extreme alternative, the Cartesian product with cost a*b, it is in a different league entirely.
    2. Why choose the small table as the build table? The reason is simple: the hash table is most efficient when it can be held entirely in memory. This also means the hash join algorithm is only suitable for joins where at least one table is small; it does not apply when both tables are large.

As mentioned above, the hash join is a single-machine join algorithm from traditional databases, and it needs a certain amount of distributed adaptation in a distributed environment; the point is to exploit distributed compute resources for parallelism and improve overall efficiency. There are two classic distributed variants of the hash join:

    1. Broadcast hash join: broadcast one of the tables (the small one) to every partition node holding the other (large) table, then hash join it against the partition records there. Broadcast is only suitable when the small table is small enough to be broadcast directly.
    2. Shuffle hash join: once the "small" table grows large, broadcast distribution is no longer appropriate. In this case, both tables can be repartitioned by the join key, following the principle that records with the same key land in the same partition; the big join is thereby divided into many small joins that run in parallel on the cluster's resources.

Each is explained in detail below.

Broadcast Hash Join

As shown in the figure, broadcast hash join can be divided into two steps:

    1. Broadcast stage: distribute the small table to all hosts where the large table resides. Many broadcast algorithms are possible; the simplest is to collect the small table to the driver, which then distributes it uniformly to every executor, or to use a BitTorrent-style TorrentBroadcast.
    2. Hash join stage: on each executor, perform a single-machine hash join, with the small table as the build side and the large table as the probe side.

SparkSQL's basic rule for choosing broadcast hash join is that the table to be broadcast must be smaller than the parameter spark.sql.autoBroadcastJoinThreshold, which defaults to 10MB.

Shuffle Hash Join

Under big data conditions, if one table is very small, the best choice for the join is undoubtedly broadcast hash join; it is the most efficient. However, once the small table grows, the memory, bandwidth, and other resources required for broadcasting inevitably become too large, and broadcast hash join is no longer optimal. At that point, both tables can be partitioned by the join key, following the principle that records with the same key land in the same partition; the large join is then divided and conquered as many small joins, making full use of the cluster's parallelism. As shown in the figure, shuffle hash join can also be divided into two steps:

    1. Shuffle stage: each of the two tables is partitioned by the join key so that records with the same join key are sent to the same node; the data of both tables ends up distributed across all nodes in the cluster. This process is called a shuffle.
    2. Hash join stage: run the single-machine hash join algorithm on the data of each partition node separately.
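The two stages can likewise be sketched in plain Python: route each record to a partition by hashing its join key, then run a local hash join per partition pair. An illustrative sketch (partition count and names are made up for the example), not Spark's implementation:

```python
def shuffle(rows, key, num_partitions):
    # Shuffle stage: route every record to the partition chosen by hashing
    # its join key, so matching keys from both tables land together.
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions

def shuffle_hash_join(left, right, left_key, right_key, num_partitions=4):
    left_parts = shuffle(left, left_key, num_partitions)
    right_parts = shuffle(right, right_key, num_partitions)
    # Hash join stage: a local hash join per partition pair; on a cluster
    # these would run in parallel on different nodes.
    output = []
    for lp, rp in zip(left_parts, right_parts):
        table = {}
        for row in lp:
            table.setdefault(row[left_key], []).append(row)
        for row in rp:
            for match in table.get(row[right_key], []):
                output.append({**match, **row})
    return output
```

The key property is that both sides use the same hash function and partition count, so any pair of matching rows is guaranteed to meet in the same partition.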

So far we can tentatively conclude: if two small tables are joined, a single-machine hash join can be used directly; if a large table joins a very small table, choose the broadcast hash join algorithm; and if a large table joins a small (but not tiny) table, choose the shuffle hash join algorithm. But what if two large tables are joined?

Sort Merge Join

For joining two large tables, SparkSQL uses a different algorithm: sort-merge join. As shown in the figure, the whole process is divided into three steps:

    1. Shuffle stage: the two large tables are repartitioned by the join key so that the data of both tables is distributed across the cluster for distributed parallel processing.
    2. Sort stage: on each partition node, sort the data of the two tables separately by the join key.
    3. Merge stage: perform the join on the two sorted sets of partitioned data. The merge itself is simple: traverse the two sorted sequences together; when the same join key appears on both sides, output the merged row; otherwise, advance the side with the smaller key. As shown in the following figure:
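The sort and merge stages (steps 2 and 3 above, as they would run on one partition) can be sketched in plain Python. This is an illustrative sketch of the two-cursor merge, not Spark's code, and it assumes the left side has no duplicate keys beyond what the inner rescan handles:

```python
def sort_merge_join(left, right, left_key, right_key):
    # Sort stage: order both sides by the join key.
    left = sorted(left, key=lambda r: r[left_key])
    right = sorted(right, key=lambda r: r[right_key])
    # Merge stage: walk both sorted sequences with two cursors,
    # always advancing the side whose current key is smaller.
    output, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][left_key], right[j][right_key]
        if lk == rk:
            # Emit a pairing for every right-side row sharing this key.
            k = j
            while k < len(right) and right[k][right_key] == lk:
                output.append({**left[i], **right[k]})
                k += 1
            i += 1
        elif lk < rk:
            i += 1
        else:
            j += 1
    return output
```

The extra cost relative to the hash variants is the sort itself, which is why this algorithm sits at the expensive end of the spectrum but tolerates two arbitrarily large inputs.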

From the above analysis, the cost relationship among these joins is clear: cost(broadcast hash join) < cost(shuffle hash join) < cost(sort merge join). Data warehouse design should avoid large-table-to-large-table joins wherever possible, and SparkSQL can also be tuned, via the parameter spark.sql.autoBroadcastJoinThreshold and according to available memory and bandwidth, so that more joins actually execute as broadcast hash joins.

Summary

Because of their unique complexity, join operations are an advanced feature of database and big data computation, and few people can clearly articulate their principles. This article has tried to bring everyone into the world of join and to explain the common join algorithms and their respective application scenarios. The next two articles will dig deeper into the internals of join, lifting its veil bit by bit, so stay tuned!

This article was published on the NetEase Cloud Community with the authorization of its author, Van Hin Xin. Original link: SparkSQL Big Data in Practice: Uncovering the Mystery of Join

