Parallel Computing of relational algebra and parallel computing of relational algebra

Last Update:2014-12-14 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

The parallel execution of SQL queries has been extended from the learning of Dremel and Impala, so I took this opportunity to learn more about the parallel computing of relational databases and relational algebra.

Speedup and Scaleup

Speedup means that two times of hardware is used for half the execution time. Scaleup refers to two times of hardware in exchange for two times of tasks within the same period of time. But it is often not that simple, and two times of hardware will also bring about other problems: more CPU brings about long start time, open-Sales of communications, and data skew caused by parallel computing.

Multi-processor architecture

Shared Memory: Any CPU can access any memory (global share) and disk. The advantage is simplicity, but the disadvantage is poor scalability and low availability.

Shared Disk: Any CPU can access any disk, but can only access its primary storage. The advantage is that the availability and scalability are good, but the disadvantage is that the implementation is complicated and potential performance problems.

Do not share: Any CPU can only access its primary storage and disk. The advantage is scalability and availability. The disadvantage is to achieve complex and complex balancing.

Hybrid: The system is a shared nothing architecture, but the node may be in another architecture. In this way, the advantages of multiple architectures are mixed.

Data Partition

The purpose of data partitioning is to allow the database to read and write data in parallel to maximize the potential of I/O. Common partition algorithms include: round-robin, range index, and hash.

Parallel relational operations

The attributes of relational algebra allow parallel relational operations.

Parallel query processing is mainly divided into four steps:

ØTranslation: Translates relational algebra expressions into query trees.

ØOptimization: Rearrange the join order and select different join Algorithms to minimize the execution overhead.

ØParallel: Converts a query tree to a physical operation tree and loads it to a processor.

ØRun: Run the final execution plan in parallel.

First, translate an SQL statement into a query tree.

Then, sort the join order based on the table size and index, and select an appropriate algorithm.

Join Algorithms are commonly used in the following ways:

ØNested Loop join: The idea is very simple. It is equivalent to two-layer cyclic traversal. The outer layer is the driving table and the rows that meet the association conditions are returned. This method is applicable when the drive table is small (after filtering by conditions) and the join field in the drive table is indexed. The efficiency is poor when both tables are large.

For each row R1 in the outer table
For each row R2 in the inner table
If R1 joins with R2
Return (R1, R2)

ØSort-merge join: The idea is also very simple, that is, sorting by join fields and then sorting by merging. When a join field has duplicate values, each duplicate value forms a partition. The efficiency of sort-merge is determined by the sorting of Join fields and the number of repeated values. This method is applicable to situations where both tables are large, especially when a clustered index exists in the join field (equivalent to having sorted order), which is highly efficient. Algorithms are mainly used on disks.

ØHash join: Similar to sort-merge in the case of repeated values, it is only the use of hash functions for partitioning. The idea is to scan a small table to create a hash table (in the build stage, a small table is also called a build table), and then scan the large table row by row for comparison (in the probe stage, a large table is also called a probe table ). This method is applicable when both tables are large and have no indexes. The limit is only applicable to equijoin. The algorithm is mainly consumed on the CPU.

In addition, for subqueriesSemi joinAndAnti joinAnd other algorithms.

Finally, the query tree is converted into a physical operation tree, that is, a real execution plan. Then, the cluster resources are scheduled to the appropriate node for parallel computing.

References

1 Parallel Query Processing

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Parallel Computing of relational algebra and parallel computing of relational algebra

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

Parallel Computing of relational algebra and parallel computing of relational algebra

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support