Database Optimization Design Scheme

Last Update: 2015-06-15 Source: Internet
Author: User
Tags: Sybase
This paper first discusses basic database table design based on third normal form (3NF), with emphasis on strategies for choosing primary keys and indexes. It then summarizes optimization schemes for a database management system from two further angles: the extended design of database tables and the placement of database table objects.
Keywords: optimization, third normal form (3NF), redundant data, index, data partitioning, object placement
1 Introduction
The goal of database optimization is to avoid disk I/O bottlenecks, reduce CPU utilization, and reduce resource contention. Drawing on years of engineering experience with large database systems such as Sybase, Informix, and Oracle, the authors discuss how to avoid disk I/O bottlenecks and reduce resource contention from three angles: basic table design, extended table design, and the placement of database table objects. We hope readers will find the points clear at a glance.
2 Basic table design based on third normal form
In a table-driven management information system (MIS), the design standard for base tables is third normal form (3NF). The basic feature of 3NF is that every non-key attribute depends only on the primary key. Designing tables in 3NF has several advantages. First, it eliminates redundant data and saves disk space. Second, it provides good data integrity constraints, namely referential integrity based on primary and foreign keys and entity integrity based on the primary key, which makes the data easy to maintain, migrate, and update. Third, the data is reversible: join queries and table merges neither omit nor duplicate rows. Fourth, because redundant data (redundant columns) is eliminated, each data page holds more rows during a query (SELECT), which effectively reduces logical I/O, and because the cache then holds more rows, physical I/O is reduced as well. Fifth, it performs well for most transactions. Sixth, it leaves more flexibility in physical design to meet growing user demands.
In base table design, primary keys, foreign keys, and indexes occupy a very important position, but system designers often focus only on satisfying user requirements and fail to recognize and value them from the standpoint of system optimization. In fact, they are closely tied to the system's runtime performance. The following discusses these basic concepts and their significance from the perspective of database optimization:
(1) Primary key: Primary keys appear in complex SQL statements and are the most frequently used path for data access. A table has only one primary key. The primary key should have a definite value (it cannot be NULL or a default value, and it should be relatively stable), carry no encoded information, and be easy to access. It makes sense to use commonly known columns as primary keys. Short primary keys are best (under 25 bytes): the length of the primary key affects the size of the index, the size of the index affects the number of index pages, and that in turn affects disk I/O. Primary keys are either natural or artificial. A natural primary key is composed of attributes of the entity. It can be composite, but a composite primary key should not have too many columns; a composite key complicates join operations and increases the size of the tables that carry it as a foreign key. An artificial primary key is created when there is no suitable natural key, or when the natural attributes are complex or highly sensitive. An artificial primary key is usually an integer (the smallest type that meets the requirement) with no real-world meaning; it slightly increases the size of its own table but reduces the size of the tables that use it as a foreign key.
(2) Foreign key: A foreign key establishes the relationship between tables in a relational database (referential integrity). A primary key can only migrate from an independent entity to a dependent entity, where it becomes an attribute of the latter and is called a foreign key.
(3) Index: The performance benefit of indexes is obvious. Indexing the columns used in the WHERE clauses of queries and the columns used for sorting avoids full table scans and lets the system reach specific rows directly, without changing the physical structure of the table. Indexes reduce data access time; they can optimize or eliminate time-consuming sorts; they can distribute data across different pages, which scatters inserted data; the system automatically builds a unique index on the primary key, and a unique index guarantees the uniqueness of the data (that is, entity integrity); and the smaller the index key, the more direct the lookup. Freshly built indexes perform best, so indexes should be rebuilt periodically. Indexes also have a price: they take space, they take time to build, and they add maintenance cost on INSERT, DELETE, and UPDATE operations. There are two types of index: clustered and nonclustered. A table can have only one clustered index but multiple nonclustered indexes. Querying through a clustered index is generally faster than querying through a nonclustered index. Before building an index, use the functions provided by the database system to estimate its size.
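As a rough illustration of how an index replaces a full table scan, the sketch below uses Python's built-in sqlite3 module as a stand-in for the systems named in the article. The table `orders`, its columns, and the index name `idx_orders_customer` are hypothetical. It asks SQLite for the query plan before and after creating an index on the column used in the WHERE clause.

```python
import sqlite3


def plan_details(conn, sql, params=()):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

query = "SELECT id, amount FROM orders WHERE customer_id = ?"

# Without an index the only available plan is a scan of the whole table.
plan_before = plan_details(conn, query, (42,))

# Index the column used in the WHERE clause, then ask for the plan again.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan_details(conn, query, (42,))

print(plan_before)
print(plan_after)
```

The exact wording of the plan text differs between SQLite versions, so the sketch only prints it. Other database systems expose the same information through their own plan-display tools.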
① Clustered index: With a clustered index, the data pages are stored in physical key order and the index takes little space. Candidate columns are those used in WHERE clauses, including range queries, fuzzy queries, and highly duplicated columns (sequential disk scans); columns used in joins; and columns used in ORDER BY and GROUP BY clauses. A clustered index is not favorable to inserts, and the clustered index does not have to be built on the primary key.
② Nonclustered index: Compared with a clustered index, it takes more space and is less efficient. Candidate columns are those used in WHERE clauses, including range queries and fuzzy queries (when there is no clustered index); primary key or foreign key columns; point (pointer-type) or small-range queries (returning less than 20% of the table's data); columns used in joins; primary key columns (range queries); and columns used in ORDER BY and GROUP BY clauses. Building multiple nonclustered indexes is advantageous on read-only tables. Indexes also have drawbacks: first, creating them takes time; second, they occupy a large amount of disk space; third, they increase maintenance cost (when indexed columns are modified, the index slows the modification). So when should an index not be built? Do not index small tables (fewer than 5 pages of data); small to medium tables (where single rows are not accessed directly or result sets need no sorting); single-valued domains (dense return values); columns whose values are too long (more than 20 bytes); columns that change frequently; highly duplicated columns; columns with NULL values; or columns that are not used in WHERE clauses or joins. In addition, tables used mainly for data entry should have as few indexes as possible. It is also necessary to guard against ineffective indexes: when a WHERE clause contains more than 5 conditions, the cost of maintaining the index exceeds the benefit it brings, and it is more efficient to stage the data in a temporary table.
Considerations for bulk data import: In real applications, large-volume computations (such as telecom billing) are done by C programs. The bulk data (text files) they produce, computed according to the primary and foreign key relationships, can be imported quickly with the system's own bulk utility (such as Sybase's bcp command). When importing into a database table, first drop the table's indexes, which speeds up the import and shortens the import time. Rebuild the indexes after the import so that queries are optimized.
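The drop, load, rebuild sequence can be sketched as follows. This is a minimal illustration using Python's sqlite3 module in place of Sybase and bcp; the table `calls`, its columns, and the index name `idx_calls_subscriber` are hypothetical, and the generated rows stand in for a text file produced by a billing program.

```python
import sqlite3


def bulk_load(conn, rows):
    """Drop the secondary index, load all rows in one transaction, rebuild the index."""
    conn.execute("DROP INDEX IF EXISTS idx_calls_subscriber")
    with conn:  # a single transaction for the whole load
        conn.executemany(
            "INSERT INTO calls (subscriber_id, seconds) VALUES (?, ?)", rows
        )
    # Rebuild once, after all rows are in place, so later queries can use it.
    conn.execute("CREATE INDEX idx_calls_subscriber ON calls (subscriber_id)")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE calls (id INTEGER PRIMARY KEY, subscriber_id INTEGER, seconds INTEGER)"
)
conn.execute("CREATE INDEX idx_calls_subscriber ON calls (subscriber_id)")

bulk_load(conn, [(i % 500, i % 60) for i in range(10000)])

row_count = conn.execute("SELECT COUNT(*) FROM calls").fetchone()[0]
index_names = [
    r[0]
    for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'calls'"
    )
]
```

Building the index once over the finished table avoids updating it for every inserted row, which is the saving the paragraph above describes.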
(4) Lock: Locks are an important mechanism for concurrent processing. They maintain data consistency under concurrency: work is processed in transactions, and the system uses locks to guarantee data integrity. Deadlocks therefore cannot be avoided entirely, but the design should give full consideration to avoiding long transactions, reducing lock hold time, reducing interaction with the user inside a transaction, preventing the user from controlling the length of a transaction, and avoiding the simultaneous execution of bulk operations, especially time-consuming ones that use the same table. Lock contention: a table can have only one exclusive lock at a time, so one user has to wait while another holds it. As the number of users grows, server performance drops and the system appears to hang. How can deadlocks be avoided? Move from page-level locks to row-level locks to reduce lock contention. Add filler records to small tables, for which moving from page-level to row-level locks makes no difference. If the contention is on the same page, choose an appropriate clustered index to distribute the data across different pages. Create redundant tables. Keep transactions short. A single batch should involve no network interaction.
(5) Query optimization rules: When accessing data in database tables, avoid sorts, joins, and correlated subqueries as much as possible. Experience shows that when optimizing queries you should:
① access as few rows as possible;
② avoid sorting, or sort as few rows as possible; when a large amount of data must be sorted, it is best to put the relevant data into a temporary table first; sort on simple keys (columns), such as integers or short strings;
③ avoid correlated subqueries against the table;
④ avoid complex expressions and non-leading substrings in the WHERE clause, and avoid joining on long strings;
⑤ prefer AND in the WHERE clause and use OR sparingly;
⑥ use the temporary database. For multi-table queries, multiple joins, complex queries, and data filtering, temporary tables (with indexes) can be built to reduce I/O. The drawback is increased space overhead.
Unless every column involved is supported by an index, a join query has to build two dynamic indexes separately and re-sort them in a work table.
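Rules ② and ⑥ can be sketched together: stage only the rows of interest in an indexed temporary table, then sort that small set on a simple integer key. The example uses Python's sqlite3 module as a stand-in; the table `bills`, the temporary table `north_bills`, and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bills (id INTEGER PRIMARY KEY, region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO bills (region, amount) VALUES (?, ?)",
    [("north" if i % 4 == 0 else "south", (i * 37) % 1000) for i in range(2000)],
)

# Stage only the rows of interest in a temporary table.
conn.execute(
    "CREATE TEMP TABLE north_bills AS "
    "SELECT id, amount FROM bills WHERE region = 'north'"
)
# Index the simple integer key that the sort will use.
conn.execute("CREATE INDEX idx_north_amount ON north_bills (amount)")

# Sort the small staged set instead of the full table.
top_five = conn.execute(
    "SELECT id, amount FROM north_bills ORDER BY amount DESC LIMIT 5"
).fetchall()
staged_rows = conn.execute("SELECT COUNT(*) FROM north_bills").fetchone()[0]
```

The staged copy costs extra space, which is the trade-off noted in rule ⑥.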
3 Basic table extension design
Although tables designed in third normal form have their advantages (see Section 2 of this article), in practice they sometimes work against system performance: the whole table has to be scanned when only some of the data is needed; many processes compete for the same data at the same time; the same results are repeatedly computed from the same rows; and when data comes from multiple tables, a large number of join operations are required. All of this consumes disk I/O and CPU time.
In particular, the base table design should be extended in the following situations: many processes frequently access one table; processes access only a subset of the data; calculations are repeated; data is redundant; or some processes need priority or low response times.
How can these disadvantages be avoided? Splitting related tables according to how frequently they are accessed, storing redundant data, storing derived columns, and merging related tables are all effective ways to overcome these unfavorable factors and optimize system operation.
3.1 Splitting tables or storing redundant data
Table splitting comes in two forms: horizontal and vertical. Splitting tables increases the cost of maintaining data integrity.
Horizontal splitting: In the first case, when multiple processes frequently access different rows of a table, split the table horizontally and eliminate redundant data columns in the new tables; if an individual process needs the entire data set, it has to recombine the split tables, which is the price of splitting. In the second case, when the main process repeatedly accesses certain rows, it is best to separate those rows into a subset table (redundant storage). This is worthwhile when disk space overhead is not a concern, but after the split, maintenance is harder: the subset must be updated immediately with a trigger, or in batch by a stored procedure or application code, which adds extra disk I/O overhead.
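A minimal sketch of the first kind of horizontal split, using Python's sqlite3 module as a stand-in. The tables `orders_current` and `orders_archive`, the view `orders_all`, the routing function, and the choice of year as the split criterion are all hypothetical. Processes that need only recent rows read the small table; a process that needs every row reads a view that recombines the parts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_current (id INTEGER PRIMARY KEY, year INTEGER, amount INTEGER);
    CREATE TABLE orders_archive (id INTEGER PRIMARY KEY, year INTEGER, amount INTEGER);
    -- Processes that need every row read through a view that recombines the parts.
    CREATE VIEW orders_all AS
        SELECT id, year, amount FROM orders_current
        UNION ALL
        SELECT id, year, amount FROM orders_archive;
""")


def insert_order(conn, order_id, year, amount, current_year=2015):
    """Route each row to the part of the split that matches its year."""
    table = "orders_current" if year == current_year else "orders_archive"
    conn.execute(
        f"INSERT INTO {table} (id, year, amount) VALUES (?, ?, ?)",
        (order_id, year, amount),
    )


for order_id, year, amount in [(1, 2013, 10), (2, 2014, 20), (3, 2015, 30), (4, 2015, 40)]:
    insert_order(conn, order_id, year, amount)

current_total = conn.execute("SELECT SUM(amount) FROM orders_current").fetchone()[0]
overall_total = conn.execute("SELECT SUM(amount) FROM orders_all").fetchone()[0]
```

The routing logic lives in application code here; in a system with stored procedures it could equally live in the database.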
Vertical splitting (without breaking third normal form): In the first case, when multiple processes frequently access different columns of a table, the table can be split vertically into several tables. This reduces disk I/O (fewer columns per row, more rows per data page, and fewer pages occupied), updates do not have to contend for locks, and there is no redundant data. The drawback is that data integrity must be considered when inserting or deleting rows, and it has to be maintained with stored procedures. In the second case, when the main process repeatedly accesses some of the columns, it is best to store those frequently accessed columns separately as a subset table (redundant storage). This is worthwhile when disk space overhead is not a concern, but it makes the overlapping columns harder to maintain: they must be updated immediately with triggers, or in batch by stored procedures or application code, which adds extra disk I/O overhead. Vertical splitting makes the best use of the cache.
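The first kind of vertical split might look like the sketch below, again using Python's sqlite3 module as a stand-in. The tables `customer_core` and `customer_detail` and their columns are hypothetical, and the Python function `add_customer` plays the role that the text assigns to a stored procedure: it inserts both halves in one transaction so that they stay consistent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Frequently read, narrow columns.
    CREATE TABLE customer_core (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        balance INTEGER NOT NULL
    );
    -- Rarely read, wide columns; shares the primary key with customer_core.
    CREATE TABLE customer_detail (
        id INTEGER PRIMARY KEY REFERENCES customer_core (id),
        address TEXT,
        notes TEXT
    );
""")


def add_customer(conn, cid, name, balance, address, notes):
    """Insert both halves in one transaction so they cannot drift apart."""
    with conn:
        conn.execute("INSERT INTO customer_core VALUES (?, ?, ?)", (cid, name, balance))
        conn.execute("INSERT INTO customer_detail VALUES (?, ?, ?)", (cid, address, notes))


add_customer(conn, 1, "Li", 120, "1 Main St", "prefers email")
add_customer(conn, 2, "Wang", 80, "2 High St", "")

# The hot path touches only the narrow table.
balances = conn.execute("SELECT name, balance FROM customer_core ORDER BY id").fetchall()

# The occasional full-row read joins the two halves on the shared key.
full_row = conn.execute(
    "SELECT c.name, c.balance, d.address FROM customer_core c "
    "JOIN customer_detail d ON d.id = c.id WHERE c.id = ?",
    (1,),
).fetchone()
```

Because both tables keep the same primary key and no column is duplicated, the split stays within third normal form.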
In short, splitting tables for the main process applies when each process needs only a subset of the table and the frequently run main process does not need the whole table. A redundant subset table is created when the frequently run main process needs a subset of the table while other frequently run processes need the whole table.
Note that after splitting a table, its indexes should be reconsidered and rebuilt.
3.2 Storing derived data
For processes that perform a lot of repeated calculation, consider storing the results (redundant storage) if the repeated calculation yields the same result each time (the source columns are stable, so the result does not change), if the calculation involves many rows and incurs extra disk I/O, or if the calculation is complex and takes a lot of CPU time. The cases are as follows:
If the calculation is repeated within a row, add a column to the table to store the result. If a column involved in the calculation is updated, the new column must be updated by a trigger.
If the calculation is repeated per category across a table, add a new table (generally with two columns, the category and the result) to store the results. If a column involved in the calculation is updated, the new table must be updated immediately by a trigger, or in batch by a stored procedure or application code.
If the calculation is repeated across multiple rows (such as ranking), add a column to the table to store the result. If a column involved in the calculation is updated, the new column must be updated by a trigger or stored procedure.
In summary, storing redundant data speeds up access but violates third normal form and raises the cost of maintaining data integrity: the redundant data must be updated immediately by triggers, or in batch by stored procedures or application code.
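The per-category case can be sketched with a two-column result table kept current by triggers. This is an illustration in Python's sqlite3 module, not Sybase syntax; the tables `sales` and `category_totals` and the trigger names are hypothetical, and the sketch handles only inserts and changes to `amount`, not deletes or category changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (
        id INTEGER PRIMARY KEY,
        category TEXT NOT NULL,
        amount INTEGER NOT NULL
    );
    -- Derived table: one row per category with the stored result.
    CREATE TABLE category_totals (category TEXT PRIMARY KEY, total INTEGER NOT NULL);

    CREATE TRIGGER sales_after_insert AFTER INSERT ON sales
    BEGIN
        INSERT OR IGNORE INTO category_totals (category, total) VALUES (NEW.category, 0);
        UPDATE category_totals SET total = total + NEW.amount
        WHERE category = NEW.category;
    END;

    CREATE TRIGGER sales_after_update AFTER UPDATE OF amount ON sales
    BEGIN
        UPDATE category_totals SET total = total - OLD.amount + NEW.amount
        WHERE category = NEW.category;
    END;
""")

conn.executemany(
    "INSERT INTO sales (category, amount) VALUES (?, ?)",
    [("books", 10), ("books", 15), ("music", 7)],
)
conn.execute("UPDATE sales SET amount = 20 WHERE id = 1")

# Readers get the stored result without re-aggregating the sales table.
totals = dict(conn.execute("SELECT category, total FROM category_totals"))
```

Every write to `sales` now also writes to `category_totals`, which is the maintenance cost the summary paragraph refers to.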
3.3 Eliminating expensive joins
When the main process frequently accesses several tables at the same time, consider storing redundant data in the main table, that is, redundant columns or derived columns (which do not depend on the primary key). This breaks third normal form and makes maintenance harder: the redundant column must be updated by a trigger or stored procedure whenever the related column in the source table changes. When the main process always accesses two tables together, the tables can be merged, which reduces disk I/O but again breaks third normal form and makes maintenance harder. Merging a parent-child pair differs from merging a 1:1 pair: merging parent and child tables produces a redundant table, while merging 1:1 tables produces redundant data.
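A redundant column maintained by a trigger might look like the following sketch, written with Python's sqlite3 module as a stand-in. The tables `customers` and `orders`, the redundant column `customer_name`, and the trigger name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers (id),
        customer_name TEXT NOT NULL,  -- redundant copy of customers.name
        amount INTEGER NOT NULL
    );
    -- Keep the redundant copy in step with the source column.
    CREATE TRIGGER customers_name_sync AFTER UPDATE OF name ON customers
    BEGIN
        UPDATE orders SET customer_name = NEW.name WHERE customer_id = NEW.id;
    END;
""")

conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Li')")
conn.execute(
    "INSERT INTO orders (customer_id, customer_name, amount) VALUES (1, 'Li', 50)"
)
conn.execute("UPDATE customers SET name = 'Li Wei' WHERE id = 1")

# The frequent report reads one table; no join against customers is needed.
report = conn.execute("SELECT id, customer_name, amount FROM orders").fetchall()
```

The report avoids the join, while each rename in `customers` now touches every matching row in `orders`.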
4 Placement policy for database objects
The placement strategy for database objects is to evenly distribute the data across the system's disks, balancing I/O access and avoiding I/O bottlenecks.
⑴ Spread access across different disks: make user data span as many devices as possible to avoid I/O contention and access bottlenecks, and place randomly accessed data and sequentially accessed data separately.
⑵ Separate system database I/O from application database I/O. Put the system audit tables and the temporary database tables on disks that are not busy.
⑶ Put the transaction log on a separate disk. This reduces disk I/O overhead, makes recovery after a failure easier, and improves the safety of the system.
⑷ Put frequently accessed "active" tables on different disks. Put frequently used tables and frequently joined tables on separate disks, and even put frequently accessed columns of a table on different disks, so that access is spread across disks and I/O contention is avoided.
⑸ Use segments to separate frequently accessed tables from their (nonclustered) indexes, and to separate text and image data. The purpose of segments is to balance I/O, avoid bottlenecks, increase throughput, enable parallel scans, raise concurrency, and maximize disk throughput. Use logical segments to place an active table and its nonclustered indexes separately so that I/O is balanced. Of course, it is best to make good use of the system default segment. In addition, segments make data backup and recovery more flexible and make system authorization more flexible.
This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.