Database optimization design considerations

Source: Internet
Author: User
Tags: sybase
This article first discusses basic database table design based on the third normal form, with emphasis on strategies for choosing primary keys and indexes. It then outlines optimization approaches for the database management system from the perspectives of extended table design and the placement of database table objects.
Keywords: optimization, third normal form (3NF), redundant data, index, data partitioning, object placement
1. Introduction
The goals of database optimization are to avoid disk I/O bottlenecks, reduce CPU utilization, and reduce resource contention. Drawing on the reference manuals of large database systems such as Sybase, Informix, and Oracle, and on years of practical engineering experience, this article discusses basic table design, extended table design, and the placement of database table objects, focusing on how to avoid disk I/O bottlenecks and reduce resource contention.
2. Basic Table Design Based on the Third Normal Form
In a table-driven management information system (MIS), the design standard for basic tables is the third normal form (3NF). The defining feature of 3NF is that every non-key attribute depends only on the primary key. Table design based on 3NF has many advantages. First, it eliminates redundant data and saves disk storage space. Second, it provides good data integrity: referential integrity based on primary and foreign keys, and entity integrity based on primary keys, make the data easy to maintain, migrate, and update. Third, joins are lossless: when performing join queries or merging tables, no rows are lost or duplicated. Fourth, because redundant columns are eliminated, each data page holds more rows, which reduces logical I/O, and each cache holds more pages, which also reduces physical I/O. Fifth, most transactions perform well against such a design. Sixth, the physical design is flexible and can accommodate growing user requirements.
In basic table design, the primary key, foreign key, and index designs play very important roles. System designers, however, often focus only on meeting user requirements and fail to appreciate these elements from the perspective of system optimization, even though they are closely tied to the system's runtime performance. The following discusses these basic concepts and their significance from the perspective of database optimization:
(1) Primary key: the primary key is used frequently for data access, including in complex SQL statements. A table has only one primary key. The primary key should have a stable value (it cannot be null or a default value, and it should change rarely), should not encode business information, and should be easy to access. Using a frequently used, well-known column as the primary key makes sense. A short primary key is best (under 25 bytes), because the primary key's length affects the size of its index, and the index size in turn affects index page usage and disk I/O. Primary keys are either natural or artificial. A natural primary key is composed of attributes of the entity itself and may be composite; in that case it should not contain too many columns, because a composite primary key complicates join operations and enlarges the foreign key columns of referencing tables. An artificial (surrogate) primary key is introduced when no suitable natural key exists, or when the natural attributes are complex or sensitive. It is usually an integer of the minimum adequate size; it has no business meaning and slightly enlarges the table, but it shrinks every table that references it as a foreign key.
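The natural-versus-artificial trade-off above can be sketched with two table layouts. This is a minimal illustration using Python's built-in SQLite rather than the large systems the article discusses; the table and column names (`phone_bill_natural`, `phone_bill_surrogate`) are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Natural composite primary key: built from real attributes of the entity.
# Composite keys make joins wider and enlarge every referencing table.
cur.execute("""
    CREATE TABLE phone_bill_natural (
        phone_number TEXT NOT NULL,
        bill_month   TEXT NOT NULL,   -- e.g. '2024-01'
        amount       REAL NOT NULL,
        PRIMARY KEY (phone_number, bill_month)
    )
""")

# Artificial (surrogate) primary key: a short integer with no business
# meaning. It slightly enlarges this table but shrinks every foreign key
# column that references it.
cur.execute("""
    CREATE TABLE phone_bill_surrogate (
        bill_id      INTEGER PRIMARY KEY,   -- surrogate key
        phone_number TEXT NOT NULL,
        bill_month   TEXT NOT NULL,
        amount       REAL NOT NULL
    )
""")

cur.execute("INSERT INTO phone_bill_surrogate (phone_number, bill_month, amount) "
            "VALUES ('13800000000', '2024-01', 42.5)")
bill_id = cur.lastrowid   # surrogate value assigned automatically
```

A referencing table would store the 8-byte-at-most `bill_id` instead of repeating the phone number and month, which is the size saving the text describes.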
(2) Foreign key: the foreign key establishes relationships between tables in a relational database (referential integrity). When a primary key migrates from an independent entity to a dependent entity, it becomes an attribute of the latter, called a foreign key.
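Referential integrity in action can be shown in a few lines. Again SQLite stands in for the larger systems (note that SQLite only enforces foreign keys when the pragma is enabled); the `customer`/`orders` tables are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
    )
""")
conn.execute("INSERT INTO customer VALUES (1, 'Alice')")
conn.execute("INSERT INTO orders VALUES (10, 1)")   # valid reference

# An orphan row violates referential integrity and is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (11, 999)")
    violated = False
except sqlite3.IntegrityError:
    violated = True
```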
(3) Index: the benefit of using indexes to optimize system performance is obvious. Indexes can be created on the columns commonly used in WHERE clauses and on the columns used for sorting. They avoid full-table scans, allowing specific rows to be reached directly without changing the table's physical structure, which reduces data access time; they can optimize away or eliminate time-consuming sort operations; and they spread inserted data across different pages. The primary key automatically gets a unique index, so a unique index can also guarantee the uniqueness of the data (entity integrity). The smaller the index key, the more direct the lookup. A freshly built index is the most efficient, so indexes should be rebuilt periodically. Indexes also have a cost: they consume space, take time to create, and must be maintained during insert, delete, and update operations. There are two types of indexes, clustered and non-clustered. A table can have only one clustered index but multiple non-clustered indexes. Queries through a clustered index are faster than through a non-clustered index. Before creating an index, use the database system's utilities to estimate its size.
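The core claim, that an index on a WHERE-clause column replaces a full-table scan with a direct lookup, can be observed through the query plan. This sketch uses SQLite's `EXPLAIN QUERY PLAN` for illustration; the `call_record` table and `idx_caller` index are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE call_record (id INTEGER PRIMARY KEY, caller TEXT, duration INTEGER)")
conn.executemany("INSERT INTO call_record (caller, duration) VALUES (?, ?)",
                 [(f"1380000{i:04d}", i % 600) for i in range(1000)])

def plan(sql):
    # Return SQLite's query-plan description for the statement.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(r[-1] for r in rows)

query = "SELECT * FROM call_record WHERE caller = '13800000042'"
before = plan(query)   # full-table scan: no index on caller yet

conn.execute("CREATE INDEX idx_caller ON call_record (caller)")
after = plan(query)    # an index search replaces the scan
```

Before the index the plan reports a scan of `call_record`; afterwards it reports a search using `idx_caller`, which is the reduction in data access time the text describes.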
① Clustered index: the data pages of a clustered index are stored in physical order, so it has a small footprint. Choose it for columns used in WHERE clauses, including range queries, fuzzy (pattern) queries, and highly duplicated columns (which benefit from continuous disk scans); for join columns; and for columns used in ORDER BY and GROUP BY clauses. A clustered index is unfavourable for inserts, and it is not necessary to build the clustered index on the primary key.
② Non-clustered index: compared with a clustered index, a non-clustered index occupies more space and is less efficient. Choose it for columns used in WHERE clauses, including range queries and fuzzy queries (when no clustered index is available); for primary key or foreign key columns; for point lookups or small ranges (where the result set is less than about 20% of the whole table); for join columns and primary key columns used in joins (range queries); for columns used in ORDER BY and GROUP BY clauses; and for columns to be covered by the index. Creating multiple non-clustered indexes on read-only tables is advantageous. Indexes also have drawbacks: first, creating them takes time; second, they occupy considerable disk space; third, they increase maintenance cost (modifying an indexed column is slower). In which cases should you not create an index? Do not index small tables (fewer than 5 data pages); small to medium tables where single rows are not accessed directly and result sets need no sorting; columns with few distinct values (dense return sets); columns whose values are too long (over about 20 bytes); frequently changing columns; highly duplicated columns; columns that are mostly null; or columns not used in WHERE clauses or join queries. For data-entry tables, create as few indexes as possible. Also guard against useless indexes: when a WHERE clause has more than five conditions, the cost of maintaining the indexes exceeds their benefit, and it is more effective to stage the data in a temporary table.
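The "columns to be covered" point deserves a concrete example: when a non-clustered index contains every column a query touches, the engine can answer from the index alone without visiting the table. A minimal SQLite sketch (the `bill` table and `idx_phone_amount` index are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bill (id INTEGER PRIMARY KEY, phone TEXT, month TEXT, amount REAL)")
conn.executemany("INSERT INTO bill (phone, month, amount) VALUES (?, ?, ?)",
                 [(f"p{i}", "2024-01", float(i)) for i in range(500)])

# The index carries every column the query touches (phone in WHERE,
# amount in SELECT), so the table itself is never read.
conn.execute("CREATE INDEX idx_phone_amount ON bill (phone, amount)")

rows = conn.execute("EXPLAIN QUERY PLAN "
                    "SELECT amount FROM bill WHERE phone = 'p42'").fetchall()
plan_text = " ".join(r[-1] for r in rows)
```

SQLite labels this plan a covering-index search, confirming that the query is satisfied entirely from the index pages.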
Note on bulk data import: in practice, high-volume computations (such as telecom bill rating) are performed in C programs. Batch data of this kind (text files computed according to the primary and foreign key relationships) can be imported quickly using the system's own utilities (such as Sybase's bcp command). When importing into a table, first drop the table's indexes, which speeds up the import and shortens the import time, then recreate the indexes afterwards to optimize queries.
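The drop-load-rebuild pattern can be mirrored in miniature. This sketch uses SQLite and Python in place of bcp and C, with an invented `cdr` (call detail record) table; the point is only the ordering of the steps, so the index is built once instead of being maintained row by row during the load.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cdr (caller TEXT, callee TEXT, seconds INTEGER)")
conn.execute("CREATE INDEX idx_cdr_caller ON cdr (caller)")

# A pre-computed batch, standing in for the text file a billing program emits.
batch = [(f"c{i}", f"d{i}", i % 3600) for i in range(10_000)]

# 1. Drop the index before the bulk load.
conn.execute("DROP INDEX idx_cdr_caller")
# 2. Load the whole batch in one transaction.
with conn:
    conn.executemany("INSERT INTO cdr VALUES (?, ?, ?)", batch)
# 3. Recreate the index afterwards to optimize subsequent queries.
conn.execute("CREATE INDEX idx_cdr_caller ON cdr (caller)")

loaded = conn.execute("SELECT COUNT(*) FROM cdr").fetchone()[0]
```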
(4) Lock: locking is an important mechanism for parallel processing. It maintains consistency under concurrency, that is, transactional processing, and the system uses locks to guarantee data integrity. Deadlocks therefore cannot be ruled out entirely, but their likelihood can be reduced: avoid long transactions, shorten the time locks are held, minimise interaction with users inside transactions, and never let users control the length of a transaction; avoid running batch jobs concurrently, especially against time-consuming tables. On lock contention: a table can hold only one exclusive lock at a time, so while one user holds it, other users wait; as the number of users grows, server performance degrades and the server can appear to hang. How can deadlocks be avoided? Move from page-level locks to row-level locks, which reduces lock contention; where contention within a single page is the problem, padding small tables with filler records can spread the rows, an appropriate clustered index can distribute data across pages, and redundant tables can be created; keep transactions short; and ensure that batch jobs involve no network interaction.
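"Keep transactions short and free of user interaction" is easy to express in code: do all slow work outside the transaction and let a scoped block commit or roll back immediately. A sketch using SQLite's transaction handling (the `account` table and `transfer` helper are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100), (2, 100)")
conn.commit()

def transfer(src, dst, amount):
    # The transaction spans only the two UPDATEs: no user interaction,
    # no slow computation between BEGIN and COMMIT. The 'with' block
    # commits (or rolls back on error) at once, minimising lock time.
    with conn:
        conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?",
                     (amount, src))
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                     (amount, dst))

transfer(1, 2, 30)
balances = dict(conn.execute("SELECT id, balance FROM account"))
```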
(5) Query optimization rules: when accessing table data, avoid sorts, joins, and correlated subqueries as much as possible. Experience suggests the following query optimizations:
① Return as few rows as possible;
② Avoid sorting, or sort as few rows as possible. If a large amount of data must be sorted, it is best to copy the relevant data into a temporary table and sort there, using a simple key (column) such as an integer or a short string;
③ Avoid correlated subqueries;
④ In the WHERE clause, avoid complex expressions, substring matches that do not start at the beginning of the column, and comparisons on long strings;
⑤ In the WHERE clause, prefer AND and use OR sparingly;
⑥ Use the temporary database. For multi-table queries, multiple joins, complex queries, and data filtering, creating a temporary table (with an index) can reduce I/O, at the cost of extra space.
Note also that unless each joined column has an index, a join query forces the optimizer to build dynamic indexes for the joined tables and place the rows in a work table for re-sorting, which is expensive.
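Rule ③ is worth a worked example: a correlated subquery is re-evaluated for every outer row, while the same question can usually be asked once with a join against a pre-aggregated derived table. A minimal SQLite sketch with an invented `emp` table (finding each department's top earner):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (id INTEGER PRIMARY KEY, dept TEXT, salary INTEGER)")
conn.executemany("INSERT INTO emp (dept, salary) VALUES (?, ?)",
                 [("a", 10), ("a", 20), ("b", 5), ("b", 15)])

# Correlated subquery: conceptually re-evaluated for every outer row.
correlated = conn.execute("""
    SELECT id FROM emp e
    WHERE salary = (SELECT MAX(salary) FROM emp WHERE dept = e.dept)
    ORDER BY id
""").fetchall()

# Equivalent join against a derived table aggregated once per department.
joined = conn.execute("""
    SELECT e.id
    FROM emp e
    JOIN (SELECT dept, MAX(salary) AS m FROM emp GROUP BY dept) t
      ON e.dept = t.dept AND e.salary = t.m
    ORDER BY e.id
""").fetchall()
```

Both forms return the same rows; on large tables the join form avoids repeated evaluation of the inner query.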
3. Extended Design of Basic Tables
Although tables designed according to the third normal form have the advantages listed above, in practice they can work against performance optimization: fetching a subset of the data may require scanning the whole table; many processes may compete for the same data at the same time, repeatedly reading the same rows to compute the same result; and when a process draws data from several tables, the resulting joins consume disk I/O and CPU time.
In particular, extended design of the basic tables is worth considering in the following situations: many processes access a table frequently; processes access only a subset of the data; computations are repeated; redundant data already exists; or users require a process to have priority or a low response time.
How can these adverse factors be avoided? Splitting tables, storing redundant data, storing derived columns, and merging related tables are all effective ways to overcome them and optimize system performance.
3.1 Splitting tables and storing redundant data
Tables can be split in two ways: horizontally and vertically. Splitting a table increases the cost of maintaining data integrity.
Horizontal splitting: when several processes frequently access different rows of a table, split the table horizontally into new tables with no redundant columns; queries that need the whole data set must then use joins or unions, which is what makes the split workable. A typical case is storing telecom phone bills in one table per month. Another case is when a main process repeatedly accesses a subset of rows; it is then best to copy the repeatedly accessed rows into a subset table (redundant storage), which is worthwhile when disk space overhead is not a concern. After the split, however, maintenance becomes harder: the tables must be kept consistent with triggers that update immediately, or with stored procedures or application code that update in batches, and this adds disk I/O overhead.
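The monthly phone-bill case can be sketched as one table per month plus a UNION ALL view that reassembles the full history when needed. SQLite again stands in for the production system; the `bill_YYYYMM` naming, the routing helper, and the `bill_all` view are all invented for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Horizontal split: one bill table per month, as in the telecom example.
for month in ("202401", "202402"):
    conn.execute(f"CREATE TABLE bill_{month} (phone TEXT, amount REAL)")

def insert_bill(month, phone, amount):
    # Route each row to its month's table. The application (or a view)
    # must now maintain the split, which is the extra integrity cost.
    conn.execute(f"INSERT INTO bill_{month} VALUES (?, ?)", (phone, amount))

insert_bill("202401", "13800000001", 12.0)
insert_bill("202402", "13800000001", 34.0)

# A UNION ALL view reassembles the whole history when it is needed.
conn.execute("""
    CREATE VIEW bill_all AS
    SELECT '202401' AS month, * FROM bill_202401
    UNION ALL
    SELECT '202402' AS month, * FROM bill_202402
""")
total = conn.execute("SELECT SUM(amount) FROM bill_all "
                     "WHERE phone = '13800000001'").fetchone()[0]
```

Queries confined to one month touch only that month's (small) table, while cross-month reporting pays the union cost explicitly.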
Vertical splitting (which does not break the third normal form): when several processes frequently access different columns of a table, the table can be split vertically into several tables. This reduces disk I/O (each row holds fewer columns, each page holds more rows, and fewer pages are occupied), updates need not contend for locks across the split, and no redundant data is introduced. The disadvantage is that data integrity must be considered when inserting or deleting rows, typically maintained with stored procedures. Another case is when a main process repeatedly accesses certain columns; it is then best to copy the frequently accessed columns into a subset table (redundant storage), which is worthwhile when disk space overhead is not a concern. This, however, makes the overlapping columns harder to maintain: they must be updated immediately with triggers, or in batches with stored procedures or application code, which adds disk I/O overhead. Vertical splitting makes the best possible use of the cache.
In short, splitting tables suits these cases: each process needs a disjoint subset of the table; each process needs some subset of the table; or the frequently accessing main process does not need the whole table. A redundant subset table is appropriate when the main, frequently accessing process needs only a subset of the table while other processes need the whole table.
Note: after splitting a table, consider rebuilding its indexes.
3.2 Storing derived data
For processes that perform large amounts of repeated computation, consider storing the computed results (redundant storage) if the repeated computations always produce the same results (the source columns are stable, so the results do not change), if the computation spans many rows and would otherwise incur extra disk I/O, or if the computation is complex and consumes a lot of CPU time. The cases break down as follows:
If the computation is repeated within a single row, add a column to the table to store the result. If the columns involved in the computation are updated, a trigger must update the new column.
If the computation is repeated per category, add a new table (two columns, category and result, are generally enough) to store the results. If the columns involved in the computation are updated, the new table must be updated immediately with a trigger, or in batches with a stored procedure or application code.
If the computation spans multiple rows (such as a ranking), add a column to the table to store the result. If the columns involved in the computation are updated, a trigger or stored procedure must update the new column.
In short, storing redundant data speeds up access but violates the third normal form and increases the cost of maintaining data integrity: the data must be updated immediately with triggers, or in batches with stored procedures or application code, to stay consistent.
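The first case, a derived column kept consistent by triggers, looks like this in miniature. SQLite trigger syntax is used for illustration; the `order_line` table, its `total` column, and the trigger names are invented, and Sybase's Transact-SQL trigger syntax differs in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE order_line (
        id       INTEGER PRIMARY KEY,
        price    REAL NOT NULL,
        quantity INTEGER NOT NULL,
        total    REAL              -- derived: price * quantity, stored redundantly
    )
""")

# Triggers keep the derived column consistent whenever its source columns
# change: the maintenance cost of trading 3NF purity for read speed.
conn.executescript("""
    CREATE TRIGGER order_line_ins AFTER INSERT ON order_line
    BEGIN
        UPDATE order_line SET total = NEW.price * NEW.quantity WHERE id = NEW.id;
    END;
    CREATE TRIGGER order_line_upd AFTER UPDATE OF price, quantity ON order_line
    BEGIN
        UPDATE order_line SET total = NEW.price * NEW.quantity WHERE id = NEW.id;
    END;
""")

conn.execute("INSERT INTO order_line (price, quantity) VALUES (2.5, 4)")
conn.execute("UPDATE order_line SET quantity = 10 WHERE id = 1")
total = conn.execute("SELECT total FROM order_line WHERE id = 1").fetchone()[0]
```

Readers of `total` never recompute the product, and the triggers guarantee it tracks every change to `price` or `quantity`.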
3.3 Eliminating expensive joins
For main processes that frequently access several tables at once, consider storing redundant data in the primary table, that is, redundant or derived columns (which do not depend on the primary key). This breaks the third normal form and increases maintenance difficulty: when the related columns in the source tables change, triggers or stored procedures must update the redundant columns. When a main process always accesses two tables together, the tables can be merged, which reduces disk I/O but again breaks the third normal form and increases maintenance difficulty. Merging a parent table with its child table differs from merging two tables in a 1:1 relationship: merging parent and child produces a table with redundant rows, while merging a 1:1 pair produces redundant data within the table.
4. Database Object Placement Policy
The database object placement policy distributes data evenly across the system's disks to balance I/O access and avoid I/O bottlenecks.
(1) Spread access across different disks: let user data span as many devices as possible so that parallel I/O avoids contention and access bottlenecks; place randomly accessed data and sequentially accessed data separately.
(2) Separate system database I/O from application database I/O: place the system audit tables and temporary database tables on their own disk.
(3) Place the transaction log on a separate disk. This reduces disk I/O overhead, aids recovery after failures, and improves system reliability.
(4) Place frequently accessed "active" tables on different disks; place frequently used tables and frequently joined tables on separate disks; even place the columns of frequently accessed tables on different disks, so that access is spread across disks and I/O contention is avoided.
Segments separate frequently accessed tables from their (non-clustered) indexes, and separate text and image data from the rest. Segments aim to balance I/O, avoid bottlenecks, increase throughput, enable parallel scans, and increase concurrency, maximising disk throughput. Use the logical segment facility to place active tables and their non-clustered indexes so that I/O is balanced; otherwise, the system's default segment is usually best. Segments also make backup and recovery, and system authorization, more flexible.
