Parallel Computing of Relational Algebra

Source: Internet
Author: User

Studying Dremel and Impala led me to the parallel execution of SQL queries, so I took this opportunity to learn more about parallel computing in relational databases and relational algebra.

Speedup and Scaleup

Speedup means that doubling the hardware halves the execution time of the same task. Scaleup means that doubling the hardware lets the system complete twice the work in the same amount of time. In practice it is rarely that simple, because more hardware brings problems of its own: more CPUs mean longer startup time, more communication overhead, and data skew across the parallel workers.
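The two metrics can be written as simple ratios of elapsed times. The following is a minimal sketch under the usual textbook definitions; the function and parameter names are illustrative and do not come from the source.

```python
def speedup(time_small_system: float, time_large_system: float) -> float:
    """Same task on both systems.

    With twice the hardware, a value of 2.0 would be linear speedup.
    """
    return time_small_system / time_large_system


def scaleup(time_small_task_small_system: float,
            time_large_task_large_system: float) -> float:
    """Task size and hardware grow by the same factor.

    A value of 1.0 would be linear scaleup; lower values mean the
    larger system needed more time for the proportionally larger task.
    """
    return time_small_task_small_system / time_large_task_large_system
```

For example, if a query takes 100 s on one node and 60 s on two nodes, `speedup(100.0, 60.0)` is below 2.0, which is the kind of shortfall that startup time, communication overhead, and skew produce.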


Multi-processor architecture

Shared Memory: Any CPU can access any memory module and any disk.

Shared Disk: Any CPU can access any disk, but only its own main memory. The advantages are good availability and scalability; the disadvantages are a complex implementation and potential performance problems.


Shared Nothing: Any CPU can access only its own main memory and disks. The advantages are scalability and availability; the disadvantages are a complex implementation and difficult load balancing.


Hybrid: The system as a whole is shared nothing, but each node may internally use a different architecture. This combines the advantages of several architectures.


Data Partition

The purpose of data partitioning is to let the database read and write data in parallel and so make full use of the available I/O bandwidth. Common partitioning schemes are round-robin, range, and hash.
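The three schemes can be sketched in a few lines. This is an illustrative sketch only: rows are plain Python tuples, partitions are in-memory lists standing in for disks or nodes, and all function names are assumptions made for this example.

```python
import zlib
from bisect import bisect_right


def round_robin_partition(rows, n):
    """Row i goes to partition i mod n; rows are spread evenly regardless of content."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def range_partition(rows, key, boundaries):
    """Partition p holds keys k with boundaries[p-1] <= k < boundaries[p].

    `boundaries` must be sorted; it yields len(boundaries) + 1 partitions.
    """
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect_right(boundaries, key(row))].append(row)
    return parts


def hash_partition(rows, key, n):
    """Rows with equal keys always land in the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        # crc32 is chosen here only because it is deterministic across runs;
        # Python's built-in hash() of strings is salted per process.
        h = zlib.crc32(repr(key(row)).encode("utf-8"))
        parts[h % n].append(row)
    return parts
```

Round-robin balances load but gives no help in locating a key; range partitioning supports range scans but can skew if the boundaries are badly chosen; hash partitioning places equal keys together, which is what partitioned joins rely on.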


Parallel relational operations

The properties of relational algebra make it possible to execute relational operations in parallel.


Parallel query processing is mainly divided into four steps:

• Translation: Translate the relational algebra expression into a query tree.

• Optimization: Reorder the joins and choose among join algorithms to minimize the execution cost.

• Parallelization: Convert the query tree into a physical operator tree and assign it to processors.

• Execution: Run the final execution plan in parallel.

First, translate an SQL statement into a query tree.
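As an illustration of this step, the sketch below builds a query tree by hand for a hypothetical query. The tables `emp` and `dept`, their columns, and the node classes are all assumptions made for the example; a real system would produce the tree from a SQL parser.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Scan:
    table: str


@dataclass
class Select:
    predicate: str
    child: "Node"


@dataclass
class Join:
    condition: str
    left: "Node"
    right: "Node"


@dataclass
class Project:
    columns: list
    child: "Node"


Node = Union[Scan, Select, Join, Project]


def render(node, depth=0):
    """Return the tree as indented text, one operator per line."""
    pad = "  " * depth
    if isinstance(node, Scan):
        return f"{pad}Scan({node.table})"
    if isinstance(node, Select):
        return f"{pad}Select({node.predicate})\n" + render(node.child, depth + 1)
    if isinstance(node, Project):
        return f"{pad}Project({', '.join(node.columns)})\n" + render(node.child, depth + 1)
    return (f"{pad}Join({node.condition})\n"
            + render(node.left, depth + 1) + "\n"
            + render(node.right, depth + 1))


# Hypothetical query:
#   SELECT e.name, d.name FROM emp e JOIN dept d ON e.dept_id = d.id
#   WHERE e.salary > 1000
tree = Project(
    ["e.name", "d.name"],
    Select("e.salary > 1000",
           Join("e.dept_id = d.id", Scan("emp"), Scan("dept"))),
)
```

Calling `print(render(tree))` shows the projection at the root, the selection beneath it, and the join over the two table scans at the leaves. An optimizer could later push the selection below the join.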


Then, reorder the joins based on table sizes and available indexes, and select an appropriate join algorithm.


The commonly used join algorithms are:

• Nested loop join: The idea is very simple; it is equivalent to a two-level nested loop. The outer loop scans the driving table, the inner loop scans the other table, and the rows that meet the join condition are returned. This method is suitable when the driving table is small (after its filter conditions are applied) and the join field of the inner table is indexed. It is inefficient when both tables are large.

for each row R1 in the outer table
    for each row R2 in the inner table
        if R1 joins with R2
            return (R1, R2)

• Sort-merge join: The idea is also simple: sort both tables on the join fields, then merge them. When a join field has duplicate values, each group of equal values forms a partition. The efficiency of sort-merge depends on the cost of sorting on the join fields and on the number of duplicate values. This method is suitable when both tables are large, and it is especially efficient when the join field has a clustered index, because the data is then already sorted. The cost is mainly disk I/O.

• Hash join: Similar to sort-merge in how it groups duplicate values, except that it uses a hash function for partitioning. The idea is to scan the smaller table and build a hash table from it (the build phase; the smaller table is called the build table), then scan the larger table row by row and look up each row in the hash table (the probe phase; the larger table is called the probe table). This method is suitable when both tables are large and have no indexes. Its limitation is that it applies only to equijoins. The cost is mainly CPU.
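The three join algorithms can be compared side by side on in-memory lists. This is a minimal sketch for equijoins only; the function names and the use of key-extractor callables are choices made for this example, and a real engine would work on disk pages and indexes, not Python lists.

```python
from collections import defaultdict
from itertools import groupby


def nested_loop_join(outer, inner, key_outer, key_inner):
    """Compare every outer row with every inner row: O(len(outer) * len(inner))."""
    return [(r1, r2) for r1 in outer for r2 in inner
            if key_outer(r1) == key_inner(r2)]


def sort_merge_join(left, right, key_left, key_right):
    """Sort both inputs on the join key, then merge groups of equal keys."""
    lgroups = [(k, list(g)) for k, g in groupby(sorted(left, key=key_left), key_left)]
    rgroups = [(k, list(g)) for k, g in groupby(sorted(right, key=key_right), key_right)]
    out, i, j = [], 0, 0
    while i < len(lgroups) and j < len(rgroups):
        lk, lrows = lgroups[i]
        rk, rrows = rgroups[j]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Duplicate key values: emit the cross product of the two groups.
            out.extend((l, r) for l in lrows for r in rrows)
            i += 1
            j += 1
    return out


def hash_join(build, probe, key_build, key_probe):
    """Build a hash table on the smaller input, then probe it with the larger one."""
    table = defaultdict(list)
    for b in build:  # build phase
        table[key_build(b)].append(b)
    out = []
    for p in probe:  # probe phase
        for b in table.get(key_probe(p), []):
            out.append((b, p))
    return out
```

All three return the same set of matching pairs, possibly in a different order. They differ in where the work goes: repeated scans for the nested loop, sorting for sort-merge, and hashing for the hash join.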


In addition, there are algorithms such as semi join and anti join for subqueries.
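A semi join and an anti join can be sketched with a set of right-side keys. This is a simplified illustration: the function names are assumptions, and SQL NULL semantics (which matter for NOT IN) are ignored.

```python
def semi_join(left, right, key_left, key_right):
    """Left rows that have at least one match on the right (as for IN / EXISTS).

    Each left row appears at most once, however many right rows match.
    """
    keys = {key_right(r) for r in right}
    return [l for l in left if key_left(l) in keys]


def anti_join(left, right, key_left, key_right):
    """Left rows that have no match on the right (as for NOT EXISTS)."""
    keys = {key_right(r) for r in right}
    return [l for l in left if key_left(l) not in keys]
```

Unlike an ordinary join, neither operation returns columns from the right side or duplicates left rows. That is why an optimizer can use them for subqueries in place of a join followed by duplicate elimination.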

Finally, the query tree is converted into a physical operator tree, that is, the actual execution plan. The plan is then scheduled onto suitable nodes of the cluster according to their resources and executed in parallel.
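One way to picture the parallel execution of a join is a partitioned hash join: both inputs are hash-partitioned on the join key, each pair of partitions is joined independently, and the results are concatenated. The sketch below is an assumption-laden illustration, not the method of any particular system. Threads in one process stand in for cluster nodes, and all names are invented for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_by_key(rows, key, n):
    """Hash-partition rows so that equal join keys share a partition number."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(key(row)) % n].append(row)
    return parts


def local_hash_join(build, probe, key_build, key_probe):
    """The work done by one worker on its own pair of partitions."""
    table = defaultdict(list)
    for b in build:
        table[key_build(b)].append(b)
    return [(b, p) for p in probe for b in table.get(key_probe(p), [])]


def parallel_hash_join(build, probe, key_build, key_probe, workers=4):
    """Join matching partitions independently, then concatenate the results."""
    build_parts = partition_by_key(build, key_build, workers)
    probe_parts = partition_by_key(probe, key_probe, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(local_hash_join, b, p, key_build, key_probe)
                   for b, p in zip(build_parts, probe_parts)]
        # Rows with equal keys are always in the same partition, so the
        # union of the per-partition results is the full join result.
        return [pair for f in futures for pair in f.result()]
```

Because matching rows can only meet inside the same partition, the workers never need to exchange data during the join. If one key value is far more frequent than the others, its partition becomes the bottleneck, which is the data skew problem mentioned under speedup and scaleup.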


References

1 Parallel Query Processing
