Parallel Computing of Relational Algebra
While studying Dremel and Impala I kept running into the parallel execution of SQL queries, so I took the opportunity to learn more about parallel computing in relational databases and relational algebra.
Speedup and Scaleup
Speedup means using twice the hardware to halve the execution time. Scaleup means using twice the hardware to handle twice the workload in the same amount of time. In practice it is rarely that simple: doubling the hardware also brings new problems, such as longer startup time with more CPUs, communication overhead, and data skew introduced by parallel execution.
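To pin the two terms down, here is a minimal sketch of the usual definitions (the standard textbook formulation, not something taken from the sources above):

def speedup(time_on_small_system, time_on_large_system):
    # Same workload on both systems; linear speedup means the ratio equals the hardware factor N.
    return time_on_small_system / time_on_large_system

def scaleup(time_small_job_small_system, time_large_job_large_system):
    # N times the workload on N times the hardware; linear scaleup means the ratio stays at 1.0.
    return time_small_job_small_system / time_large_job_large_system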
Multi-processor architecture
- Shared memory: any CPU can access any memory (globally shared) and any disk. The advantage is simplicity; the disadvantages are poor scalability and low availability.
- Shared disk: any CPU can access any disk, but only its own main memory. The advantages are good availability and scalability; the disadvantages are a more complex implementation and potential performance problems.
- Shared nothing: each CPU can access only its own main memory and disk. The advantages are scalability and availability; the disadvantages are implementation complexity and the difficulty of load balancing.
- Hybrid: the system as a whole is shared nothing, but each node may internally use one of the other architectures, combining the advantages of several of them.
Data Partitioning
The purpose of data partitioning is to let the database read and write data in parallel and get the most out of its I/O bandwidth. Common partitioning schemes include round-robin, range partitioning, and hash partitioning.
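As a rough illustration, here is a minimal in-memory sketch of the three schemes, assuming the rows arrive as a Python list and the partition count or range boundaries are already known (the function names are mine, not from any particular database):

import bisect

def round_robin_partition(rows, n):
    # Row i goes to partition i mod n: an even spread, but no partition pruning for predicates.
    return [rows[i::n] for i in range(n)]

def range_partition(rows, key, boundaries):
    # boundaries is a sorted list of split points; each row falls into the range it belongs to.
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect.bisect_right(boundaries, key(row))].append(row)
    return parts

def hash_partition(rows, key, n):
    # hash(key) mod n: good for equality predicates and equi-joins, regardless of value order.
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(key(row)) % n].append(row)
    return parts

Roughly speaking, round-robin gives the most even spread, range partitioning lets range scans touch only the relevant partitions, and hash partitioning does the same for equality lookups and equi-joins.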
Parallel relational operations
The properties of relational algebra make relational operators well suited to parallel execution.
Parallel query processing is mainly divided into four steps:
- Translation: translate the relational algebra expression into a query tree.
- Optimization: rearrange the join order and choose join algorithms to minimize execution cost.
- Parallelization: convert the query tree into a physical operator tree and assign its operators to processors.
- Execution: run the final execution plan in parallel.
First, the SQL statement is translated into a query tree.
Then the join order is chosen based on table sizes and available indexes, and an appropriate join algorithm is selected for each join.
The commonly used join algorithms are:
- Nested loop join: the idea is simple, essentially a two-level loop (sketched below). The outer loop scans the driving table, and for each of its rows the inner loop scans the other table and returns the rows that satisfy the join condition. It works well when the driving table is small (after filtering) and the join column of the inner table is indexed; it performs poorly when both tables are large.
def nested_loop_join(outer, inner, joins_with):
    # For each row of the driving (outer) table, scan the inner table and
    # return the pairs that satisfy the join condition.
    for r1 in outer:
        for r2 in inner:
            if joins_with(r1, r2):
                yield (r1, r2)
- Sort-merge join: also simple in principle: sort both inputs on the join columns, then merge them (a sketch follows this list). When a join column contains duplicate values, each group of duplicates forms a partition whose rows are cross-matched. The cost is dominated by sorting on the join columns and by the number of duplicate values. It suits the case where both tables are large, and it is especially efficient when a clustered index exists on the join column (the input is then effectively already sorted). The cost of this algorithm falls mainly on disk I/O.
- Hash join: handles duplicate values much like sort-merge, except that a hash function is used for partitioning. The idea is to scan the smaller table and build a hash table from it (the build stage; the smaller table is also called the build table), then scan the larger table row by row and probe the hash table for matches (the probe stage; the larger table is also called the probe table); a sketch also follows this list. It suits the case where both tables are large and no index is available; the limitation is that it only works for equijoins. The cost of this algorithm falls mainly on the CPU.
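Below is a minimal in-memory sketch of the sort-merge idea, with duplicate join keys handled by cross-matching each group of equal keys; it is only an illustration, not a disk-based implementation:

def sort_merge_join(left, right, key=lambda row: row[0]):
    # Sort both inputs on the join key, then merge them.
    left, right = sorted(left, key=key), sorted(right, key=key)
    i = j = 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # Equal keys: gather the group of duplicates on each side and cross-match them.
            i2, j2 = i, j
            while i2 < len(left) and key(left[i2]) == kl:
                i2 += 1
            while j2 < len(right) and key(right[j2]) == kl:
                j2 += 1
            for l_row in left[i:i2]:
                for r_row in right[j:j2]:
                    yield (l_row, r_row)
            i, j = i2, j2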
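And a minimal in-memory sketch of the hash join's build and probe stages (again only an illustration; a real implementation spills partitions to disk when the build table does not fit in memory):

from collections import defaultdict

def hash_join(build_side, probe_side, build_key, probe_key):
    # Build stage: hash the smaller (build) table on its join key.
    table = defaultdict(list)
    for b_row in build_side:
        table[build_key(b_row)].append(b_row)
    # Probe stage: scan the larger (probe) table and look up matches; works for equijoins only.
    for p_row in probe_side:
        for b_row in table.get(probe_key(p_row), []):
            yield (b_row, p_row)

For example, hash_join(small_rows, big_rows, build_key=lambda r: r[0], probe_key=lambda r: r[0]) joins two lists of tuples on their first column.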
In addition, there are semi join and anti join algorithms, which are mainly used for subqueries.
Finally, the query tree is converted into a physical operator tree, that is, the actual execution plan, and cluster resources are scheduled so that the plan runs in parallel on the appropriate nodes.