Parallel computation of relational algebra

Source: Internet
Author: User

The parallel execution of SQL queries is derived from Dremel and Impala , so we take this opportunity to learn more about relational databases and the parallel computation of relational algebra.

Speedupand theScaleup

Speedup two times the hardware for half the execution time. Scaleup refers to twice times the hardware in exchange for the same time to perform two times the task. But often things are not that simple, and two times as much hardware can lead to other problems: more CPU -long startup time and communication overhead, and data skew problems associated with parallel computing.


Multi-processor architecture

Shared Memory : Any CPU can access any memory ( global share ) and disk. The advantages are simple, the disadvantage is poor scalability, low availability.


Shared Disk : Any CPU can access any disk, but can only access its main memory. The advantage is that usability and extensibility are good, and the downside is to implement complex and potential performance issues.


do not share : Any CPU can only access its own main memory and disk. The advantages are also extensibility and usability, with the disadvantage of achieving complex and complex balances.


Mixed Type : The system as a whole is a shared nothing schema, but the nodes may be inside other architectures. This blends the benefits of multiple architectures.


Data Partitioning

The purpose of data partitioning is to allow the database to read and write data in parallel, maximizing the potential for I/O . Common partitioning algorithms are:round-robin, range index, hash.


parallelism of relational operations

property of relational algebra itself allows parallelization of relational operations .


Parallel query processing is divided into four main steps:

? translation : A relational algebraic expression is translated into a query tree.

? optimization : Rearrange the join sequence and select different join algorithms to minimize execution overhead.

? parallel : Transforms the query tree into a physical operation tree and loads it into the processor.

? execution : Runs the final execution plan in parallel.

First, a SQL statement is translated into a query tree.


Then, according to the table size, index, and so on, rearrange the join order, and select the appropriate algorithm.


There are several common types of join algorithms:

? Nested loop join: The idea is simple, equivalent to a two-layer loop traversal, the outer layer is the driver table, return the row that satisfies the association condition. Applies to the case where the drive table is small ( after conditional filtering )and the join field on the driver table is indexed. The efficiency is poor when both tables are large.

for each row R1 In the outer table
    for each row R2 in the inner table
        If R1 joins with R2
            return (R1, R2)

   sort-merge join join field and then merge sort. When the join field has duplicate values, the equivalent of each duplicate value forms a partition. join field determines the efficiency of sort-merge if the order and the number of duplicates are the same. Applies to situations where both tables are large, especially if the join field has a clustered index equivalent to already ordered , which is highly efficient. The algorithm is primarily consumed on disk.

? hash join: Similar to sort-mergewhen there is a duplicate value condition, it is simply a man-made partition using a hash function. The idea is to scan small tables to build a hash table (build stage, a small table called a build table ), and then scan the large table by row to compare (Probe stage, big table also called probe Table ). For cases where both tables are large and have no indexes, the limit is only applicable to equivalent connections. The algorithm is mainly consumed on the CPU .


In addition, for subqueries there are algorithms such as Semi join and anti join .

Finally, the query tree becomes the physical operation tree, which is the real execution plan. Then, according to the resource situation of the cluster, dispatch to the appropriate node for parallel computation.


References

1 Parallel Query Processing

Parallel computation of relational algebra

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.