Introduction to the Database Multi-Table Join Method: Hash Join

Last Update:2016-11-12 Source: Internet

Author: User

Tags joins

1. Overview

Hash join is an algorithm that databases use to join multiple tables. Two other join methods are more common: sort-merge join and nested loop join. To make the usage scenarios of hash join clearer, this article first gives a brief introduction to those two methods.

What is a join, and why are there several kinds? For readers who are new to databases, this is probably the first question. Simply put, data lives in different tables, each with its own structure, and tables can be related to one another. Most real queries need information from more than one table. For example, we may need to find the students from Hangzhou in a class table and then use that information to look up their math scores in a score table. Without a multi-table join, we would have to query the first table by hand and feed its results into a second query against the score table, and it is easy to imagine how cumbersome that would be.
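The Hangzhou example can be sketched with Python's built-in sqlite3 module. The table names, column names, and rows below are invented for illustration; they are not from the original article.

```python
import sqlite3

# Hypothetical schema and rows, made up for this illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE scores (student_id INTEGER, subject TEXT, score INTEGER);
    INSERT INTO students VALUES
        (1, 'Li', 'Hangzhou'), (2, 'Wang', 'Beijing'), (3, 'Zhao', 'Hangzhou');
    INSERT INTO scores VALUES
        (1, 'math', 90), (2, 'math', 85), (3, 'math', 78), (1, 'english', 70);
""")

# A single join replaces the manual routine of querying the first table
# and then looking up each result in the second table.
rows = conn.execute("""
    SELECT s.name, sc.score
    FROM students AS s
    JOIN scores AS sc ON sc.student_id = s.id
    WHERE s.city = 'Hangzhou' AND sc.subject = 'math'
    ORDER BY s.id
""").fetchall()

print(rows)  # [('Li', 90), ('Zhao', 78)]
```

Which join algorithm the database picks for such a statement is up to its optimizer; the SQL itself only states what to join.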

Among common databases, Oracle and PostgreSQL both support hash join, while MySQL did not at the time of writing (it added hash join later, in version 8.0.18). Oracle and PG do relatively well in this respect. Implementing hash join itself is not very complex, but the optimizer has to be implemented well for hash join to show its full advantage, and I think that is the hardest part.

Multi-table join queries are divided into inner joins, outer joins, and cross joins. Outer joins are further divided into left outer, right outer, and full outer joins. The same join algorithm has a different cost for different query types, and that cost depends closely on how each query type is implemented. Which join method is actually used is decided by the optimizer by comparing costs; the cost calculation for the join methods is briefly described later. A hash join implementation has many other issues to consider, such as how to handle severe data skew, what to do when the data does not fit in memory, and how to handle hash collisions. These are not the focus of this article and are not covered in detail; each of them could fill an article of its own.

Nested loop Join
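
As a rough illustration of the general technique (not the original article's description, which is missing here), a nested loop join compares every row of one table with every row of the other. The function name, tuple layout, and sample rows are assumptions made for this sketch.

```python
def nested_loop_join(outer, inner, predicate):
    """Pair every outer row with every inner row and keep the matching pairs."""
    result = []
    for o in outer:          # one pass over the outer table
        for i in inner:      # a full pass over the inner table per outer row
            if predicate(o, i):
                result.append((o, i))
    return result


# Hypothetical rows: (join key, payload)
t1 = [(1, "a"), (2, "b"), (3, "c")]
t2 = [(2, "x"), (3, "y"), (3, "z")]

# Equality condition
equal_pairs = nested_loop_join(t1, t2, lambda o, i: o[0] == i[0])
# Any other condition works too, for example "<"
less_pairs = nested_loop_join(t1, t2, lambda o, i: o[0] < i[0])

print(len(equal_pairs), len(less_pairs))  # 3 5
```

The predicate can be arbitrary, which is why this method works for any join condition, but the work grows with the product of the two table sizes.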

Sort Merge-join
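
As a rough illustration of the general technique (not the original article's description, which is missing here), a sort-merge join sorts both inputs on the join key and then walks them in step. The function name, tuple layout, and sample rows are assumptions made for this sketch.

```python
def sort_merge_join(left, right, key):
    """Equi-join two row lists by sorting both on the key and merging them."""
    left = sorted(left, key=key)
    right = sorted(right, key=key)
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = key(left[i]), key(right[j])
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and key(right[j_end]) == lk:
                j_end += 1
            # Pair every left row that has this key with the whole right run.
            while i < len(left) and key(left[i]) == lk:
                for r in right[j:j_end]:
                    result.append((left[i], r))
                i += 1
            j = j_end
    return result


# Hypothetical rows: (join key, payload), deliberately unsorted
t1 = [(3, "c"), (1, "a"), (2, "b")]
t2 = [(3, "y"), (2, "x"), (3, "z")]
merged = sort_merge_join(t1, t2, key=lambda row: row[0])
print(merged)
```

If the inputs are already sorted on the join key, the two `sorted` calls are the part a real system can skip.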

2. Principle and implementation

Put simply, for two tables, hash join builds a hash table from the smaller of the two (call it S), then scans every row of the other table (call it M) and uses the row's join-key value to probe the hash table. The hash table is kept in memory, so the S rows that match each M row are found quickly.
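
The build and probe phases described above can be sketched in a few lines. This is a minimal in-memory version that assumes the whole hash table fits in memory; the function and parameter names are invented for illustration.

```python
from collections import defaultdict


def hash_join(small, large, small_key, large_key):
    """In-memory equi-join: build a hash table on `small`, probe it with `large`."""
    # Build phase: one pass over the small table S.
    table = defaultdict(list)
    for s in small:
        table[small_key(s)].append(s)

    # Probe phase: one pass over the large table M.
    result = []
    for m in large:
        # .get avoids inserting empty buckets for keys that have no match.
        for s in table.get(large_key(m), ()):
            result.append((s, m))
    return result


# Hypothetical rows: (join key, payload)
S = [(1, "s1"), (2, "s2")]
M = [(2, "m1"), (3, "m2"), (2, "m3"), (1, "m4")]
joined = hash_join(S, M, small_key=lambda r: r[0], large_key=lambda r: r[0])
print(joined)
```

Python's dictionary resolves hash collisions internally, so the sketch does not show that part of a real implementation.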

For large result sets, merge join has to sort its inputs efficiently, and nested loop join, being a nested loop, is even less suitable for joining large data sets. Hash join is the method for handling such tricky queries. Especially when joining a large table with a small one, each table basically needs to be scanned only once to produce the final result set.

However, hash join applies only to equi-joins. Join conditions such as <, <=, >, and >= still need a general-purpose algorithm such as nested loop join. If the join keys are already sorted, or the result has to be sorted anyway, the cost of merge join may be lower than that of hash join, in which case merge join has the advantage.

Enough preamble; let's talk about implementation. Take a simple multi-table SQL query as an example: SELECT * FROM t1 JOIN t2 ON t1.c1 = t2.c1 WHERE t1.c2 > t2.c2 AND t1.c1 > 1. How does this SQL statement enter the database system, and how is it processed and taken apart? As the SQL statement itself might say: who knows what I've been through ...
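
Before following the statement through the database, here is one way a hash join could evaluate it. This is a sketch only: the sample rows are invented, building on t2 is an arbitrary choice, and applying t1.c1 > 1 before probing is one possible plan, not necessarily what any particular database does.

```python
# Hypothetical rows in the form (c1, c2)
t1 = [(1, 50), (2, 10), (2, 40), (3, 30)]
t2 = [(1, 5), (2, 20), (3, 30), (3, 10)]


def run_query(t1_rows, t2_rows):
    """SELECT * FROM t1 JOIN t2 ON t1.c1 = t2.c1 WHERE t1.c2 > t2.c2 AND t1.c1 > 1"""
    # Build phase: hash t2 on the equality key c1.
    build = {}
    for r in t2_rows:
        build.setdefault(r[0], []).append(r)

    out = []
    for l in t1_rows:
        # The single-table filter t1.c1 > 1 is applied before probing.
        if not l[0] > 1:
            continue
        # The equality condition is handled by the hash lookup.
        for r in build.get(l[0], []):
            # The non-equality condition is checked on each candidate pair.
            if l[1] > r[1]:
                out.append(l + r)
    return out


result = run_query(t1, t2)
print(result)  # [(2, 40, 2, 20), (3, 30, 3, 10)]
```

Only the equality condition can drive the hash lookup; the comparison t1.c2 > t2.c2 is evaluated afterwards on the pairs that the lookup returns.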

1. First, the statement goes through lexical and syntactic parsing. The output of this step is a syntax tree made of token nodes.

Syntactic analysis, as the name implies, only dissects the syntax: it turns the SQL string into a primitive tree of nodes. Each node has its own identifying tag, but the specific meaning and value of the node are not analyzed or processed at this stage.
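
To make the lexical half of this step concrete, here is a toy tokenizer for the example statement. The token kinds, the keyword list, and the regular expressions are assumptions chosen for this sketch and cover only a small subset of SQL; a real parser goes on to build a tree from these tokens.

```python
import re

# Hypothetical token definitions covering only what the example query needs.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("OP", r"<=|>=|<>|[=<>*.,()]"),
    ("SKIP", r"\s+"),
]
KEYWORDS = {"SELECT", "FROM", "JOIN", "ON", "WHERE", "AND"}
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(sql):
    """Split a SQL string into (kind, text) pairs without interpreting them."""
    tokens = []
    pos = 0
    while pos < len(sql):
        m = TOKEN_RE.match(sql, pos)
        if m is None:
            raise ValueError(f"unexpected character {sql[pos]!r} at position {pos}")
        pos = m.end()
        kind, text = m.lastgroup, m.group()
        if kind == "SKIP":
            continue  # whitespace carries no meaning
        if kind == "IDENT" and text.upper() in KEYWORDS:
            kind = "KEYWORD"
        tokens.append((kind, text))
    return tokens


tokens = tokenize("SELECT * FROM t1 JOIN t2 ON t1.c1 = t2.c1 WHERE t1.c1 > 1")
print(tokens[:4])
```

At this point `t1` is just an identifier; whether such a table exists is a question for the semantic analysis step.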

2. The second step is semantic analysis and rewriting.

Different databases handle rewriting differently: some perform it together with the logical execution process, while others keep it separate.

After semantic analysis, the tree generally has the same shape as the syntax tree, but its nodes now carry specific information. Take the expression after WHERE as an example: each node of this infix expression has its own type and specific information, regardless of what its value is. Once this is done, the rewriting process begins. Rewriting is a logical optimization that makes complex SQL statements simpler or better suited to the way the database processes them.

3. Optimizer processing

Optimizer processing is more complex and is the hardest part of the SQL module. Optimization is never finished, so an optimizer is never optimal, only better. The optimizer has to take all the necessary factors into account: it must be highly general while still guaranteeing both strong optimization and correctness.

The optimizer's most important role is choosing the path: for a multi-table join, it determines the join order and the join method. Different databases handle this differently. PG uses a dynamic programming algorithm and switches to a genetic algorithm when the number of tables is large. Path selection also depends on the cost model, which maintains statistics such as the maximum, minimum, and number of distinct values (NDV) of each column. These statistics are used to calculate selectivity and, from that, cost.
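
To show how such statistics can turn into a selectivity, here is a sketch using common textbook estimates: 1 / max(NDV) for an equality join and a uniform-distribution assumption for a range filter. These formulas and all the numbers are illustrative assumptions, not the cost model of any particular database.

```python
def eq_join_selectivity(ndv_left, ndv_right):
    """Textbook estimate for `left.c = right.c`: one over the larger NDV."""
    return 1.0 / max(ndv_left, ndv_right)


def gt_selectivity(value, col_min, col_max):
    """Estimate for `col > value`, assuming values are spread uniformly."""
    if value >= col_max:
        return 0.0
    if value < col_min:
        return 1.0
    return (col_max - value) / (col_max - col_min)


def estimate_join_rows(rows_left, rows_right, selectivity):
    """Estimated output size: the cross product scaled by the selectivity."""
    return rows_left * rows_right * selectivity


# Hypothetical statistics for t1.c1 and t2.c1
join_sel = eq_join_selectivity(ndv_left=100, ndv_right=500)
filter_sel = gt_selectivity(value=1, col_min=0, col_max=100)
estimated_rows = estimate_join_rows(1_000, 10_000, join_sel)

print(join_sel, filter_sel, round(estimated_rows))
```

The estimated row counts then feed into the cost of each candidate path, which is how statistics end up influencing the choice of join order and join method.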

Back to the main topic: the join method is decided here. Hash join is sensitive to table size, so the plans t1-t2 and t2-t1 are not equivalent, and each join method has its own way of calculating cost.
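
A small experiment shows why the two orders differ for hash join. The sketch below only counts how many rows go into the hash table and how many probes are made; the function name and the sample keys are invented, and real cost models weigh more factors than these counts.

```python
def hash_join_stats(build_keys, probe_keys):
    """Join two lists of key values and report how much work each phase did."""
    table = {}
    for key in build_keys:
        table.setdefault(key, []).append(key)
    matches = sum(len(table.get(key, ())) for key in probe_keys)
    return {
        "hash_table_rows": len(build_keys),  # rows held in memory
        "probes": len(probe_keys),           # lookups against the hash table
        "matches": matches,                  # joined rows produced
    }


# Hypothetical join keys: t1 is large, t2 is small
t1_keys = list(range(1000))
t2_keys = [5, 10, 10, 2000]

build_on_t2 = hash_join_stats(build_keys=t2_keys, probe_keys=t1_keys)
build_on_t1 = hash_join_stats(build_keys=t1_keys, probe_keys=t2_keys)

print(build_on_t2)  # {'hash_table_rows': 4, 'probes': 1000, 'matches': 3}
print(build_on_t1)  # {'hash_table_rows': 1000, 'probes': 4, 'matches': 3}
```

Both orders produce the same number of joined rows, but building on the small table keeps far fewer rows in memory.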

Cost estimates for hash joins:

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.