Parallel computation of relational algebra

Last Update:2014-12-13 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

The parallel execution of SQL queries is derived from Dremel and Impala , so we take this opportunity to learn more about relational databases and the parallel computation of relational algebra.

Speedupand theScaleup

Speedup two times the hardware for half the execution time. Scaleup refers to twice times the hardware in exchange for the same time to perform two times the task. But often things are not that simple, and two times as much hardware can lead to other problems: more CPU -long startup time and communication overhead, and data skew problems associated with parallel computing.

Multi-processor architecture

Shared Memory : Any CPU can access any memory ( global share ) and disk. The advantages are simple, the disadvantage is poor scalability, low availability.

Shared Disk : Any CPU can access any disk, but can only access its main memory. The advantage is that usability and extensibility are good, and the downside is to implement complex and potential performance issues.

do not share : Any CPU can only access its own main memory and disk. The advantages are also extensibility and usability, with the disadvantage of achieving complex and complex balances.

Mixed Type : The system as a whole is a shared nothing schema, but the nodes may be inside other architectures. This blends the benefits of multiple architectures.

Data Partitioning

The purpose of data partitioning is to allow the database to read and write data in parallel, maximizing the potential for I/O . Common partitioning algorithms are:round-robin, range index, hash.

parallelism of relational operations

property of relational algebra itself allows parallelization of relational operations .

Parallel query processing is divided into four main steps:

? translation : A relational algebraic expression is translated into a query tree.

? optimization : Rearrange the join sequence and select different join algorithms to minimize execution overhead.

? parallel : Transforms the query tree into a physical operation tree and loads it into the processor.

? execution : Runs the final execution plan in parallel.

First, a SQL statement is translated into a query tree.

Then, according to the table size, index, and so on, rearrange the join order, and select the appropriate algorithm.

There are several common types of join algorithms:

? Nested loop join: The idea is simple, equivalent to a two-layer loop traversal, the outer layer is the driver table, return the row that satisfies the association condition. Applies to the case where the drive table is small ( after conditional filtering )and the join field on the driver table is indexed. The efficiency is poor when both tables are large.

for each row R1 In the outer table
    for each row R2 in the inner table
        If R1 joins with R2
            return (R1, R2)

&NBSP;&NBSP; sort-merge join join field and then merge sort. When the join field has duplicate values, the equivalent of each duplicate value forms a partition. join field determines the efficiency of sort-merge if the order and the number of duplicates are the same. Applies to situations where both tables are large, especially if the join field has a clustered index equivalent to already ordered , which is highly efficient. The algorithm is primarily consumed on disk.

? hash join: Similar to sort-mergewhen there is a duplicate value condition, it is simply a man-made partition using a hash function. The idea is to scan small tables to build a hash table (build stage, a small table called a build table ), and then scan the large table by row to compare (Probe stage, big table also called probe Table ). For cases where both tables are large and have no indexes, the limit is only applicable to equivalent connections. The algorithm is mainly consumed on the CPU .

In addition, for subqueries there are algorithms such as Semi join and anti join .

Finally, the query tree becomes the physical operation tree, which is the real execution plan. Then, according to the resource situation of the cluster, dispatch to the appropriate node for parallel computation.

References

1 Parallel Query Processing

Parallel computation of relational algebra

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Parallel computation of relational algebra

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

Parallel computation of relational algebra

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support