First, database management system
Database Management System, DBMS
Measure whether the database is a standard:
ACID: Refers to the four features of a transaction in a database management system (DBMS):
1) atomicity (atomicity)
2) Consistency (consistency)
3) Isolation (isolation)
4) Durability (durability)
1, relational database: the database based on the relational database model, the use of relational algebra and other concepts and methods to process the data in the database, but also is organized into a set of formal descriptive table, the form of the role of the essence is loaded with data items of the special collection of collective, The data in these tables can be accessed or re-convened in many different ways without the need to reorganize the database tables.
2. Relational algebra:
In mathematics, relational algebra is the remaining Boolean algebra that supports the unity operation called Inverse (converse). An example of the excitation relationship algebra is the algebra of all two-tuple relationships on set X, with R-s being interpreted as the usual two-dollar relationship compound.
In the database, relational algebra is a branch of first-order logic and a set of relationships under closed-fit operations. An operation acts on one or more relationships to generate a relationship. Relational algebra is a part of computer science.
The relational algebra in pure mathematics is an algebraic structure of mathematical logic and set theory.
Relational algebra is an abstract query language that uses relational operations to express queries as a mathematical tool for studying relational data languages.
The operands of relational algebra are relations, and the results of operations are also related. The operators used in relational algebra include four classes: set operators, specialized relational operators, arithmetic comparators, and logical operators;
The operation of the database relational algebra: selection, projection, Cartesian product (also called "Cross-product" or "cross-Connection"), and set, difference, rename;
SQL's query language is loosely based on relational algebra, although the operands (tables) in SQL are not completely relational, many useful theories about relational algebra do not hold up in the SQL counterpart.
second, the database tuning
1, query Optimization technology:
database query optimization technology mainly includes query reuse technology, query rewrite rules, query algorithm optimization technology, parallel query optimization technology, distributed Query optimization technology, and other aspects (such as framework structure) optimization technology, these 6 technologies constitute a "Generalized database query optimization" concept.
from the perspective of optimized content, query optimization is divided into algebraic optimization and non-algebraic optimization, or logical optimization or physical optimization . Logical optimization is mainly based on the equivalent transformation of relational algebra to make some logical transformations; physical optimization is mainly based on the data reading, table connection, table connection sequence, sequencing and other techniques to optimize the query. "Query rewrite Rules" is a logical optimization, using relational algebra and heuristic rules, "Query algorithm optimization" is a physical optimization, using a cost-estimating multi-table connection algorithm to solve the minimum cost of technology.
2, Database tuning (db Tuning):
Database tuning enables database applications to run faster, with the goal of higher throughput of the database (the more transactions that are completed in a unit of time), and shorter response times (the shorter the response time of each transaction, the better).
The object being tuned is the overall database management system.
The object that the query statement is tuned to is a query statement.
3, the Database tuning method:
1) Manual tuning: Mainly rely on people, inefficient, requiring the operator to fully understand the principle of common sense depends on, but also the application, database management system, operating system and hardware have a broad and deep understanding.
2) Case-based tuning: summary of the classic case of the database parameters of the recommended configuration values, data logic layer design, so as to provide users with some reference for tuning work. However, this approach ignores the dynamic nature of the system and the differences between the different systems.
3) Self-tuning: Set up a model for database system, according to "factors affecting the performance efficiency of database system", the database system automatically carries out parameter configuration.
4, the database tuning is divided into five stages:
1) The first stage: the demand analysis period, the application situation estimation, the system selection strategy.
2) The second stage: project design period, data model design.
3) The third stage: the development period, SQL design, database function activation.
4) The fourth stage: test and operation, database function activation, model system pre-operation, system monitoring and analysis.
5) Fifth stage: on-line and maintenance, system monitoring and analysis.
5, the database tuning five stages of the main technology:
1) Application of the estimate: the use of the application of the method (the business logic into the database read and write distribution logic, in order to read more than write less or read/write equalization to distinguish between OLTP and OLAP, application to the database concurrency, whether the concurrency can be pooled, etc.), data volume, database pressure, peak pressure, etc.
2) System Selection strategy: Determine what kind of database can be applied to the application requirements, and determine whether the database is using open source or commercial, is the use of group or stand-alone system, while the operating system, middleware, hardware, network and other types of selection.
3) data Model design: mainly based on business logic design, from a few angles to consider the logical structure of the table, the content is as follows:
A) e-r model design: Follow the E-R model design principle. However, the occasional appropriate degree of non-normalization can improve system query performance.
b) Data Logic distribution strategy: The purpose is to reduce the data request of unnecessary data volume, the user needs to return the data, the available technology such as partition, with E-R Model sub-table, etc. (such as the typical use of Internet enterprises, according to the different business, sub-database, sub-table and other operations).
c) Data physical storage strategy: The purpose is to reduce IO, such as enabling compression technology, the index and table data storage separation, the different table data distributed in different table space, different table space distributed on different physical storage (especially the large and high-volume table space distributed on different physical storage), log, Indexes and data are distributed in different physical storage.
d) Index: Establish an appropriate index on the object that is frequently queried so that the positive effect of the index is greater than the negative (the maintenance of the index is consumed).
4) SQL Design: Write the correct, query-efficient SQL statements. This is mainly based on "query rewrite rules", in the process of writing statements to be aware of the need to consciously ensure that SQL can be used to the index.
5) The database function is enabled: the database provides some functions to improve the performance and can be used rationally.
A) query reuse: According to the actual situation to configure, can cache query execution plan, query results and so on.
b) database parameter settings: You can set the appropriate parameters such as data buffer.
6) The model system is pre-run. Simulate the actual operating environment on the standby system, increase the pressure to carry on the system test, detect the problem ahead of time.
7) System Monitoring and analysis: In the industrial environment, to strengthen the operation of the system monitoring and daily analysis work, as follows:
A) Application system performance: The collection of user comments on the application system, the system exists problems.
b) OS environment monitoring: real-time monitoring of CPU, memory, IO, etc., and compare real-time situation and historical normal situation.
c) Database internal status monitoring: Some databases provide system tables, views, tools and other means to provide users with information on the internal conditions of the database operation, such as lock information, which need real-time monitoring, and compare real-time situation and historical normal situation.
d) Log analysis: The log of the database, the operating system to find abnormal events, locate the problem.
third, query optimization
1. Query Reuse
Query reuse refers to the use of previous execution results as much as possible in order to save time in the whole process of query calculation and reduce resource consumption.
Query reuse technology is mainly focused on two aspects:
A) reuse of query results : allocating a block of cache in the buffer, storing the SQL statement text and the final result set, and returning the result directly when the same SQL input is entered. The reuse technology of query results saves the time of query plan generation and reduces the resource consumption in the whole process of query execution.
b) reuse of query plans : Caches the execution plan of a query statement and its corresponding syntax tree structure. Query plan reuse technology reduces the time and resource consumption of query plan generation.
Pros and cons of query reuse technology:
A) Advantages: Save CPU and IO consumption. In the actual use of the process, avoiding the disadvantages, according to the actual situation selection.
b) Cons: A large result set consumes a lot of memory resources, and the same SQL can get different result sets from different users.
2. Query optimization technology type
A) Syntax level: Query Language layer optimization, based on the syntax of optimization.
b) Generation: Queries are optimized using formal logic, using the principle of relational algebra.
c) Semantic level: Based on the integrity constraints, the query statements are semantically understood, inferring some of the operations can be optimized.
d) Physical level: physical optimization technology, based on cost estimation model, compare the cost of various execution methods to the least.
3. Query rewriting
is an equivalent conversion of a query statement that produces the same result for any state of any related pattern.
Two targets for query rewriting:
A) Convert the query into an equivalent, more efficient form, such as converting inefficient predicates to highly efficient predicates, eliminating duplicate conditions, and so on.
b) Try to rewrite the query as an equivalent, simple, and non-table-ordered form, providing more choices for the physical query optimization phase, such as view rewriting, sub-query merge transformations, and so on.
Query overrides are based on:
query rewriting is based on relational algebra .
A) The equivalent transformation rules of relational algebra provide theoretical support for query rewriting.
b) After query rewriting, the query optimizer may generate multiple connection paths, which can be preferred from among the candidates.
Query rewriting is based on the syntax, generation, and semantics of optimization, can be unified to the scope of logic optimization, based on the cost estimation model is the physical level of optimization, is the process of selecting the least cost path from the connection path.
Query rewrite ideas:
A) Convert a procedural query to a descriptive query, such as a view rewrite.
b) Convert complex queries (such as nested subqueries, outer join elimination, nested join elimination) as much as possible into a multi-table join query.
c) Convert inefficient predicates to equivalent high-efficiency predicates (such as equivalent predicate overrides).
d) simplifying where and having conditions by using the nature of equations and inequalities.
4. Query optimization algorithm
Query optimization, the process of solving the efficient execution plan for a given query statement. This process includes multi-seed problem solving. Different sub-problems, corresponding to the different solutions, namely the algorithm.
5. Query plan
A query plan, also known as a query tree , consists of a series of internal operators that form an execution of a query according to a certain number of operational relationships. To put it simply, the A and B tables first connect to the intermediate result, and then connect to the other C table to get a new intermediate mode until all the tables are connected.
- Query plan, different nodes on the binary tree:
A) Single table node:
Consider the data acquisition method of a single table:
1) Get data directly from IO.
2) Get data by index.
3) The location of the data through the index and then the IO to the data block to obtain data.
This is a process that resolves a logical field from physical storage to memory, which conforms to the requirements of the von Neumann architecture ( external memory data is read into memory to be processed )
b) Two table nodes (two table connection nodes)
1) Consider how the two tables are connected, how much the cost is, and what the connection path is . Represents an in-memory tuple, how to make a connection between tuples. At this point, tuples are usually already present in memory and can be used directly. This is a logical operation that accomplishes user semantics, but is only a partial operation, involving only two specific relationships. Complete user semantics (the semantics of user connections) and need to work with multiple table connection sequences.
2) different connection algorithms lead to different connection efficiency, such as large data volume can use nested connections, data if the order can use the merge connection, or the first sort after the use of merge connection .
c) Multi-table node
1) Consider how the multi-table join sequence constitutes the least expensive "execution plan". The decision is whether AB joins first or BC, which is a more expensive operation. If too many connections are judged, it can also lead to efficiency problems.
2) multiple relationships are connected in different order, and the CPU resources and memory resources may vary greatly .
3) Many databases use the Left deep tree, the right deep tree, the dense tree three kinds of ways or some of them to connect multiple tables to get a variety of connection paths.
- Strategies for generating optimal query plans:
A) rule-based optimization: defined as "rules" (such as rules based on relational algebra, rules based on experience), according to experience or some methods that have been known or proven to be effective, using these rules to simplify the query plan generation process in accordance with the simplified operation, use heuristic rules to exclude some obviously bad access path, This is rule-based optimization.
b) Cost-optimized: Based on a cost evaluation model, in the process of generating the query plan, the cost of each access path (the access path mainly consists of the three relational nodes above) is calculated, and then the least cost is chosen as the sub-path, so that all tables are connected to a complete path.
Cost Calculation formula: Total Cost = CPU Cost + IO cost ;
c) Mainstream databases are in use for rules-based and cost-based technologies.
Rule-Based optimization has the advantages of simple operation and quick determination of connection mode, but this method only excludes part of the bad possibility, so the result is not necessarily the best.
Based on cost optimization, it is a quantitative comparison of the various possible situations, so that the least cost can be achieved, but if the combination is more than the time spent on judgment will be much.
The implementation of the query optimizer is mostly a combination of two optimization strategies, such as the MySQL PostgreSQL, which takes a query optimization strategy based on rules and cost estimates.
- Why are query optimizations parallel?
A) in a traditional stand-alone database system, given a query, the query optimization algorithm only needs to find an execution plan with minimal execution cost for the query, so the plan must have the fastest response time.
b) In parallel database systems, the goal of query optimization is to find the query execution plan with the minimum response time, which needs to decompose the query work into some sub-work that can run in parallel. Some business databases provide the capability of parallel queries to optimize query execution.
- Criteria for query optimization parallelism
A) available resources in the system (e.g. memory, amount of data in cache, etc.)
b) Number of CPUs
c) Specific algebraic operators in the operation
such as a, B, C, d four tables to connect, each table of a single-table scan can be parallel, in the generation of four table connections in the query plan process, you can choose a and B connections at the same time C and D to connect, so that the connection operation can run in parallel (inter-operation parallelism).
- Within the same SQL, query parallelism can be divided into:
A) in-operation parallelism: The same operation, such as single-table scan operations, two table join operations, sorting operations, and other separate sub-operations, by different CPUs simultaneously executed.
b) parallelism between operations: An SQL query statement can be decomposed into multiple sub-operations, executed by multiple CPUs.
- In distributed database system, query strategy optimization is the focus of query optimization.
Mainly is the data transmission policy, (a, B two nodes of the connection, is a node transmission to the B node or from B to A or the first to filter and then transfer, etc.) and local processing optimization (traditional single-node database query optimization technology)
In the query optimization strategy, the communication cost of the data is the main factor to be considered in the optimization algorithm. Distributed query optimization to reduce the number of transfers and the amount of data as the target of query optimization. In addition to considering CPU cost and IO cost, the cost estimation model in distributed database system also considers the cost of transferring data between nodes through the network. This is the most difference between the Distributed Parallel query optimization technology and the traditional single node database system.
In a distributed database system, the cost estimation model is: total Cost = IO Cost + CPU Cost + communication cost (communication cost is usually greater than IO cost and CPU cost)
four, logical query optimization
1. Basic operation of the query
A) Select operation : the corresponding constraint (=,,, <, etc.), the action object is a row in a two-dimensional table.
Optimization mode: Select the action push-down (such as A.A=B.A and a.a=1 can be converted to a.a=1 and b.a=1, so that the selection of a, B table to the respective table).
Objective: To minimize the number of tuples before the connection operation, so that the intermediate temporary relationship is as small as possible (the number of tuples is few and the numbers of the tuple are less).
Benefits: This may reduce IO and CPU consumption, saving memory space.
b) projection operation : The destination Column object for the corresponding select query.
Optimization method: Projection operation push down
Objective: To minimize the number of columns before the connection operation, so that the intermediate temporary relationship as small as possible (note: Select the operation is to make the number of tuples "as little as possible", the projection operation is to make a tuple "as small as possible").
Advantage: This does not reduce the IO (most of the database storage is row storage, tuples are the most basic unit of reading, so to manipulate the column must read a row of data), but can reduce the concatenated relationship between the tuple size, saving memory space.
c) connection operation : The corresponding connection condition, which represents the condition of two table connections.
The connection operation involves two sub-issues:
1) The order in which each table is connected in a multi-table connection determines the efficiency: If a query statement has only one table, such a statement is simple, but if you have more than one table, it involves the order in which the tables are connected in the most efficient way.
2) Multi-table connection each table is connected in the order of user semantics defined: Query statement multi-table connection has different semantics (such as Cartesian product, internal connection, left and right connections, etc.), which determines that the table between the pre-and post-connection order is not arbitrarily replaced, otherwise, the result set data is different. Therefore, the pre-and post-connection order of a table cannot be exchanged arbitrarily.
2. use explain to get details of select
Syntax: EXPLAIN [Explain_type] <select expression>; Explain_type can be as follows: Extended | partitions | (format=traditional | json);
The explain command, which displays the query execution plan for the SQL statement. Explain returns a row of information for each table used in the SELECT statement.
EXPLAIN Extended Command: Displays a detailed query execution plan for the SQL statement, followed by the show WARNINGS command to view the details.
EXPLAIN partitions command: Displays the query execution plan with partitioned table information for the SQL statement.
Output format of the explain command:
Traditional: Traditional type, isolated by row, each identifying a self-action.
Json:json format.
3. Techniques included in logical query optimization
A) sub-query optimization b) view rewrite c) equivalence predicate rewrite D) conditional simplified e) outer join elimination f) Nesting connection elimination g) Connection elimination h) Semantic optimization i) Non-SPJ optimization
MySQL Learning (v)--Theory of database optimization (i)