Database-query optimization of the relational database system
Query Optimization
It plays an important role in the relational database system.
Relational query optimization is a key factor affecting RDBMS performance.
Due to the high semantic level of relational expressions, the relational system can analyze query semantics from relational expressions, providing the possibility of performing query optimization.
The advantage of query optimization is not only that users do not have to consider how to best express the query for better efficiency, but also that the system can do better than the "optimization" of user programs.
(1) The optimizer can obtain a lot of statistical information from the data dictionary, which is difficult for user programs to obtain.
(2) If the physical statistics of the database changes, the system can automatically optimize the query to select an appropriate execution plan. Programs must be rewritten in non-relational systems, but rewriting programs are often unlikely in practical applications.
(3) The optimizer can consider hundreds of different execution plans. Generally, programmers can only consider a limited number of possibilities.
(4) The optimizer includes many complex optimization technologies that can be mastered only by the best programmers. Automatic system optimization is equivalent to making everyone have these optimization technologies.
RDBMS calculates the execution cost of various query execution policies based on a certain cost model, and then selects the execution scheme with the minimum cost.
Centralized database
The execution overhead mainly includes:
Number of disk access blocks (I/O cost)
Processing time (CPU cost)
Memory overhead of the query
I/O cost is the most important
Distributed Database
Total cost = I/O cost + CPU cost + memory cost + communication cost
Overall goal of query optimization: select an effective policy to obtain the value of the given relational expression to minimize the query cost (in fact, it is relatively small)
[Example 3] Name of the student taking course 2. SQL statement: SELECT Student. sname FROM Student, SC WHERE Student. sno = SC. sno and SC. cno = '2'; assume that there are 1000 student records in the student-course database and 10000 Course Selection records, of which 50 are course 2 electives.
The system can use multiple equivalent relational algebra expressions to complete this query Q1 = π Sname (σ Student. sno = SC. sno ipvsc. cno = '2' (Student × SC) Q2 = π Sname (σ SC. cno = '2' (Student SC) Q3 = π Sname (Student σ SC. cno = '2' (SC ))
I. First case
Q1 = π Sname (σ Student. Sno = SC. Sno limit SC. Cno = '2' (Student × SC ))
1. Calculate generalized Cartesian Product
The following describes how to connect Student and each tuples of SC:
Load as many pieces of a table (such as a Student table) as possible in the memory, and set aside a set of tuples for storing another table (such as a SC table.
Connect each tuples in SC to each tuples in Student. The connected tuples are filled with one piece and written to the intermediate file.
Read from SC and connect it to the Student in memory until the SC table has finished processing.
Read several Student tuples and one SC tuple.
Repeat the above process until the Student table is processed
2. Select
Read the connected tuples in sequence and select records that meet the requirements according to the selection conditions.
It is assumed that the memory processing time is ignored. It takes 5x104 s to read intermediate files (the same as writing intermediate files ).
Assume that only 50 tuples meet the condition can be stored in the memory.
3. perform projection operations
Output the result of step 1 on the Sname for projection to obtain the final result.
In the first case, the total time for query execution is ≈ 105 + 2 × 5 × 104 ≈ 105 s
All memory processing times are ignored
Ii. Case 2
Q2 = π Sname (σ SC. Cno = '2' (Student SC ))
1. Computing natural connections
Execute a natural connection. The read policy for Student and SC tables remains unchanged. The total number of read blocks is still 2100, and the cost is 105 s.
The result of the natural connection is much less than the first case, which is 104
Write these tuples at 104/10/20 = 50 s, 1‰ of the first case
2. Read the intermediate file block and perform the selection operation. the time consumed is also 50 s.
3. output the result of step 1 by projection.
In the second case, the total execution time is ≈ 105 + 50 + 50 ≈ 205 s.
Iii. Third case
Q3 = π Sname (Student σ SC. Cno = '2' (SC ))
1. Select the SC table first. You only need to read the SC table once and it takes 5 s to access the 100 data blocks. Because only 50 tuples meet the conditions, you do not need to use intermediate files.
2. Read the Student table and connect the read Student tuples with the SC tuples in the memory. You only need to read 100 Student tables once, which takes 5 s.
3. Projection of join results
In the third case, the total execution time is ≈ 5 + 5 ≈ 10 s.
Assume that the Cno field in the SC table has an index.
In the first step, you do not need to read all the SC tuples, but you only need to read those tuples (50) with Cno = '2)
The accessed index blocks and the data blocks that meet the conditions in SC are about 3 ~ Four
If the Student table has an index on Sno
The second step does not need to read all Student tuples.
Because only 50 SC records meet the conditions, a maximum of 50 Student records are involved.
The number of Student tables read can be greatly reduced.
The total access time will be further reduced to several seconds
Convert the algebraic expression Q1 to Q2, Q3,
When there is a select and connection operation, first select the operation, so that the joined tuples can be greatly reduced, which is an algebraic Optimization
In Q3
The SC table selection algorithm includes two methods: full table scan and index scan. After preliminary estimation, the index scan method is better.
For the connection between Student and SC tables, the index join on Student tables is less costly, Which is physical optimization.
Algebra Optimization Strategy: Improves query efficiency through equivalent transformation of relational algebra expressions
Equivalent of a relational Algebra Expression: the result of replacing the corresponding relationship in two expressions with the same relationship is the same.
The two relational expressions E1 and E2 are equivalent.
Common equivalence conversion rules: 1. join and Cartesian product exchange laws where E1 and E2 are relational algebra expressions and F are conditions for join operations, then there are E1 × E2 limit E2 × E1 E1 E2 limit E2 E1 E1 E2 limit E2 E12. join, Cartesian product combination law set E1, E2, E3 is a relational Algebra Expression, f1 and F2 are the conditions for the join operation, and there is (E1 × E2) × E3 × E1 × (E2 × E3) (E1 E2) E3 1_e1 (E2 E3) (E1 E2) E3 E1 (E2 E3) 3. the concatenation law of projection (E) Embedding (E) Here, E is a relational Algebra Expression, Ai (I = 1, 2 ,..., N), Bj (j = 1, 2 ,..., M) is the property name and {A1, A2 ,..., An} constitutes {B1, B2 ,..., A subset of Bm. 4. Select the concatenation Law (E) substring (E) Here, E is a relational Algebra Expression, and F1 and F2 are the selection conditions. The selected concatenation rules indicate that the selection conditions can be merged. In this way, you can check all the conditions. 5. Select the exchange law σ F (E) ≡ (σ F (E) for the projection operation. Select the condition F only involves attributes A1 ,..., An. If F does not belong to A1 ,..., An Attribute B1 ,..., Bm has a more general rule: (σ F (E) Then (σ F (E) 6. select the exchange law with Cartesian product. If all the properties involved in F are those in E1, then (E1 × E2) Then (E1) × E2 if F = F1 then F2, in addition, F1 only applies to the attributes in E1, while F2 only applies to the attributes in E2. The above equivalence transformation rule and 6 introduces (E1 × E2) priority (E1) × (E2) If F1 only involves the attributes in E1 and F2 involves the attributes of E1 and E2, there will still be (E1 × E2) variance (E1) × E2) it enables the part to be selected first before Cartesian product. 7. select the same attribute name as the allocation rule for E = E1 1_e2, E1, and E2, then σ F (E1 1_e2) 1_σ F (E1) ∪ σ F (E2) 8. select the distribution law of the Difference operation. If E1 and E2 have the same attribute name, then σ F (E1-E2) ≡ σ F (E1)-σ F (E2) 9. select the allocation law σ F (E1 E2) σ F (E1) σ F (E2) F for natural connections. Only the common attributes of E1 and E2 are involved. 10. the allocation law of projection and Cartesian Product. E1 and E2 are two relational expressions: A1 ,..., An is the attribute of E1, B1 ,..., If Bm is the property of E2, then (E1 × E2) Then (E1) × (E2) 11. if the projection and parallel allocation laws set E1 and E2 to have the same attribute names, then (E1 then E2) Then (E1) Then (E2)
Heuristic optimization of the query treeTypical heuristic rules:
1. Select an operation first. This is the most important and basic optimization policy.
2. perform projection and selection operations simultaneously (pipelining Technology)
If there are several projection and selection operations, and they all operate on the same link, you can complete all these operations while scanning this link to avoid repeated scanning relationships, it also avoids the storage of intermediate relationships.
3. Combine projection with binocular operations before or after projection (pipelining Technology)
4. combine some choices with the Cartesian Product to be executed before it into a Join Operation
5. Find Common subexpressions
If the results of this repeated subexpression are not very closely related and reading this relationship from external memory is much less time than calculating this subexpression, it is cost-effective to calculate a public subexpression and write the result into the intermediate file.
When a view is queried, the expression defining the view is a public subexpression.
Follow these heuristic rules and apply the equivalent transformation formula 9.3.1 to optimize the relational expression algorithm.
Algorithm: optimization of relational expressions
Input: Query tree of a relational expression
Output: optimized query tree
Method:
(1) Use equivalent transformation rule 4 to take shape like σ F1 when F2 then... ∧ Fn (E) is transformed to σ F1 (σ F2 (... (σ Fn (E ))...)).
(2) For each choice, use the Equivalent Conversion rule 4 ~ 9 try to move it to the leaf end of the tree.
(3) For each projection, use the general form in the equivalence transformation rules, and 11 to move it to the leaf end of the tree as much as possible.
Note:
Equivalent Transformation Rule 3: Make some projections disappear
Rule 5 Splits a projection into two, one of which may be moved to the leaf end of the tree.
(4) use Equivalence Transformation Rules 3 ~ 5. Merge the selection and projection strings into a single selection, a single projection, or a selection followed by a projection. Enables multiple choices or projections to be executed at the same time, or all of them are completed in one scan.
(5) group the nodes in the preceding syntax tree. Each binocular computation (×, cosine,-) is a group of all its direct ancestor (these direct ancestor operations are (σ, π ).
If the child leaves are all single-object operations, they are also incorporated into the group.
However, when the binocular operation is a Cartesian Product (×) and is not followed by an equivalent join with it, the selection cannot be in the same group as this binocular operation, separate these single-object operations into one group
Example [5] query statement: Search the student ID and name of the course named MATH. The relational Algebra Expression of the query statement is as follows: π S #, SNAME (σ CNAME = 'Math' SEX = 'F' (c SC S) in the formula above, the symbol is represented by π, σ, and × operations. The following formula is available: π S #, SNAME (σ CNAME = 'Math' ∧ SEX = 'F' (π L (σ C. C # = SC. C # Sort SC. S # = S.S # (C × SC × S) Here L is all attributes in C, SC, and S, removing duplicate attributes.