Database optimization design considerations

Source: Internet
Author: User
Tags: sybase


This article first discusses the basic design of database tables based on the third normal form (3NF), focusing on strategies for choosing primary keys and indexes. It then outlines optimization approaches for the database management system from the perspectives of extended table design and the placement of database table objects.

1. Introduction

The goals of database optimization are to avoid disk I/O bottlenecks, reduce CPU utilization, and reduce resource contention. Drawing on the documentation of large database systems such as Sybase, Informix, and Oracle, and on years of practical engineering experience, this article discusses basic table design, extended design, and the placement of database table objects, with emphasis on how to avoid disk I/O bottlenecks and reduce resource contention.

2. Basic Table Design Based on the Third Normal Form

In a table-driven management information system (MIS), the design standard for basic tables is the third normal form (3NF). Its defining property is that every non-key attribute depends only on the primary key. Designing database tables to 3NF has several advantages:

  1. It eliminates redundant data and saves disk storage space;
  2. It provides good data-integrity constraints: referential integrity based on primary and foreign keys, and entity integrity based on primary keys, which makes data easy to maintain, migrate, and update;
  3. The design is reversible: joins for queries or table merges produce neither omissions nor duplicates;
  4. Because redundant columns are eliminated, each data page holds more rows when a SELECT is executed, which reduces logical I/O; each cache holds more pages, which reduces physical I/O;
  5. Most transactions perform well;
  6. The physical design is flexible and can accommodate growing user requirements.
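The advantages above can be made concrete with a small sketch. The example below uses Python's built-in sqlite3 module as a stand-in for a large DBMS such as Sybase; the customer/orders tables and all names are illustrative, not taken from the original text.

```python
import sqlite3

# Illustrative 3NF design: every non-key column depends only on the primary key.
# Customer data lives in one table; orders reference it by foreign key instead
# of repeating the customer's name and city in every order row.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,   -- artificial (surrogate) primary key
    name        TEXT NOT NULL,
    city        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    amount      REAL NOT NULL
);
""")
conn.execute("INSERT INTO customer VALUES (1, 'Ann', 'Dalian')")
conn.execute("INSERT INTO orders VALUES (100, 1, 9.5)")

# The foreign-key constraint rejects orders for nonexistent customers,
# enforcing referential integrity without storing redundant customer data.
try:
    conn.execute("INSERT INTO orders VALUES (101, 99, 1.0)")
    violated = False
except sqlite3.IntegrityError:
    violated = True
```

A join reassembles the customer name with each order with no omission or duplication, which is exactly the reversibility advantage listed above.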

In basic table design, the primary keys, foreign keys, and indexes of a table play a very important role. However, system designers often focus only on meeting user requirements without appreciating these elements from the perspective of system optimization, even though they are closely tied to system performance. The following discusses these basic concepts and their significance from the standpoint of database optimization.

Primary key: Primary keys are used frequently for data access, especially in complex SQL statements. A table has only one primary key. A primary key should have a fixed value: it cannot be NULL or a default value, it should be relatively stable, it should carry no encoded business meaning, and it should be easy to access. A short primary key is best, ideally under 25 bytes, because the length of the primary key affects index size, which in turn affects disk I/O. Primary keys can be natural or artificial. A natural primary key is composed of attributes of the entity itself and may be composite; a composite primary key should not contain too many columns, because it complicates join operations and increases the size of any table that references it as a foreign key. An artificial (surrogate) primary key is introduced when no suitable natural key exists, or when the natural attributes are complex or sensitive. An artificial key is usually a minimal integer value; it has no business meaning and slightly increases the table's own size, but it reduces the size of the tables that reference it as a foreign key.

Foreign key: Foreign keys establish referential integrity between tables in a relational database. A primary key migrates from an independent entity to a dependent entity, where it becomes an attribute of the latter, called a foreign key.

Index: The performance benefit of indexes is obvious. Creating indexes on the columns commonly used in WHERE clauses and on columns used for sorting avoids full-table scans: without changing the physical structure of the table, specific rows can be accessed directly, reducing data-access time. Indexes can optimize or eliminate time-consuming sorts. When data is distributed across different pages, inserts are dispersed. The primary key automatically gets a unique index, so a unique index also guarantees the uniqueness of the data, that is, entity integrity. The smaller the index key, the more direct the lookup. A freshly built index is most efficient, so indexes should be rebuilt periodically. Indexing also has costs: indexes occupy space, take time to create, and must be maintained during INSERT, DELETE, and UPDATE operations. There are two kinds of index, clustered and non-clustered. A table can have only one clustered index but multiple non-clustered indexes. Queries through a clustered index are faster than through a non-clustered one. Before creating an index, use the database system's functions to estimate its size.

  • Clustered index: The data pages of a clustered index are stored in physical order and occupy little extra space. Selection policy: columns used in WHERE clauses, including range queries, fuzzy queries, and highly repetitive columns (which benefit from sequential disk scans); columns used in joins; and columns used in ORDER BY and GROUP BY clauses. A clustered index is unfavorable for inserts, and it does not have to be built on the primary key.
  • Non-clustered index: Compared with a clustered index, it occupies more space and is less efficient. Selection policy: columns used in WHERE clauses, including range queries and fuzzy queries (when no clustered index exists); primary-key or foreign-key columns; point (pointer-style) lookups or queries returning a small result set (less than 20% of the rows in the table); columns used in joins and in primary-key range queries; columns used in ORDER BY and GROUP BY clauses; and columns to be covered by the index. Read-only tables benefit from multiple non-clustered indexes.

Indexes also have drawbacks: first, creating indexes takes time; second, indexes occupy considerable disk space; third, they add maintenance cost (when an indexed column is modified, the index slows down the modification).
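The central benefit described above, an index on a WHERE-clause column replacing a full-table scan, can be observed directly through the query planner. The sketch below uses Python's sqlite3 as a stand-in for a production DBMS; the table and index names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bill (id INTEGER PRIMARY KEY, phone TEXT, fee REAL)")
conn.executemany("INSERT INTO bill (phone, fee) VALUES (?, ?)",
                 [(f"1390000{i:04d}", i * 0.1) for i in range(1000)])

def plan(sql):
    # EXPLAIN QUERY PLAN reports whether the engine scans the whole table
    # or searches an index for the given statement.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT fee FROM bill WHERE phone = '13900000042'"
before = plan(query)   # no index on the WHERE column: a full-table scan

# Indexing the column used in the WHERE clause turns the scan into a search.
conn.execute("CREATE INDEX idx_bill_phone ON bill(phone)")
after = plan(query)
```

Running this, `before` contains a SCAN step, while `after` reports a SEARCH using `idx_bill_phone`: the direct access to specific rows that the section describes.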

When should an index not be created? For small tables (fewer than five pages); for small-to-medium tables that do not need single-row access or whose result sets need no sorting; for dense single-value lookups; for index keys that are too large (over 20 bytes); for frequently changing columns; for highly repetitive columns; for columns containing NULL values; and for columns not used in WHERE clauses or join queries. In addition, for data-entry tables, create as few indexes as possible. It is also necessary to guard against ineffective indexes: when a WHERE clause contains more than five conditions, the cost of maintaining the indexes outweighs their benefit, and it is more effective to materialize the data into a temporary table.

Notes on batch data import: In practice, large computations, such as telecom billing, are implemented in C programs. The resulting batch-data text files (computed from the primary/foreign-key relationships) can be imported quickly with the system's own utilities, such as Sybase's BCP command. Before importing into a table, drop its indexes first, which speeds up the import and shortens the import time; re-create the indexes after the import to optimize subsequent queries.
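The drop-then-reload pattern described above can be sketched as follows. Python's sqlite3 stands in for Sybase's BCP here, and the table and index names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cdr (call_id INTEGER PRIMARY KEY, phone TEXT, seconds INTEGER)")
conn.execute("CREATE INDEX idx_cdr_phone ON cdr(phone)")

# A batch of pre-computed billing rows, standing in for the text file
# a C program would produce.
rows = [(i, f"0411{i:07d}", i % 3600) for i in range(10_000)]

# Drop the index first so the engine need not maintain the B-tree row by row,
# then load the whole batch in a single transaction.
conn.execute("DROP INDEX idx_cdr_phone")
with conn:
    conn.executemany("INSERT INTO cdr VALUES (?, ?, ?)", rows)

# Re-create the index once, after the load, to optimize later queries.
conn.execute("CREATE INDEX idx_cdr_phone ON cdr(phone)")
count = conn.execute("SELECT COUNT(*) FROM cdr").fetchone()[0]
```

The single index build after the load replaces ten thousand incremental index updates during it, which is exactly why the section recommends this ordering.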

Locks: Locking is the essential mechanism for concurrency control; it maintains data consistency by processing changes as transactions, and the system uses locks to guarantee data integrity. Deadlocks therefore cannot be eliminated entirely, but they can be reduced: keep transactions short, reduce the time locks are held, avoid user interaction inside transactions (never let users control transaction length), and avoid running batch jobs concurrently, especially against time-consuming tables. On lock contention: a table can hold only one exclusive lock at a time, so while one user holds it, the others wait; as the number of users grows, server performance degrades and the server may appear to hang. How can deadlocks be reduced? Move from page-level to row-level locking to reduce lock contention; pad small tables with filler records so that contention within the same page is relieved; choose an appropriate clustered index to spread data across different pages; create redundant tables; keep transactions short; and ensure batch jobs involve no network interaction.

Query optimization rules: When accessing data in database tables, avoid sorts, joins, and correlated subqueries wherever possible. Experience suggests the following rules for optimizing queries:

  • Return as few rows as possible;
  • Avoid sorting, or sort as few rows as possible; if a large amount of data must be sorted, it is best to copy the relevant data into a temporary table and sort it there; use simple sort keys, such as integers or short strings;
  • Avoid correlated subqueries;
  • Avoid complex expressions, non-leading substrings, and long-string concatenation in WHERE clauses;
  • In WHERE clauses, prefer AND over OR;
  • Use the temporary database: for multi-table, multi-join, complex queries over data that must be filtered, a temporary table (which can itself be indexed) reduces I/O, at the cost of extra space.

Note that unless each joined column is indexed, the server builds dynamic indexes for the joined tables and places the rows in a worktable for re-sorting, which is expensive.
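The temporary-table rule above can be sketched as follows (Python/sqlite3, illustrative names): the filtered subset is materialized once, indexed, and then grouped, instead of re-scanning the full table for every query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (id INTEGER PRIMARY KEY, phone TEXT, month TEXT, fee REAL)")
conn.executemany(
    "INSERT INTO payment (phone, month, fee) VALUES (?, ?, ?)",
    [(f"p{i % 50}", f"2024-{1 + i % 12:02d}", i * 0.01) for i in range(2400)],
)

# Materialize the filtered rows once into an indexed temporary table, then run
# the grouping and sorting against the small subset instead of the full table.
conn.execute("CREATE TEMP TABLE jan AS SELECT phone, fee FROM payment WHERE month = '2024-01'")
conn.execute("CREATE INDEX idx_jan_phone ON jan(phone)")

totals = conn.execute(
    "SELECT phone, SUM(fee) FROM jan GROUP BY phone ORDER BY phone"
).fetchall()
```

The sort and group touch only the 200 January rows rather than all 2,400, at the cost of the temporary table's space, exactly the trade-off the rule states.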

3. Extended Design of Basic Tables

Although tables designed to the third normal form have the advantages listed above, in practice the design is sometimes unfavorable to performance: fetching some data requires scanning the whole table; many processes compete for the same data simultaneously; the same rows are used repeatedly to compute the same results; and when a process draws data from several tables, the resulting joins consume disk I/O and CPU time.
Extended design of basic tables is needed particularly in the following situations: many processes frequently access one table; processes access only a subset of the data; computations are repeated; redundant data exists; or a process requires priority treatment or a low response time.
How can these adverse factors be avoided? Splitting tables, storing redundant data, storing derived columns, and merging related tables are all effective ways to overcome them and optimize system performance.

3.1 Splitting Tables and Storing Redundant Data
Tables can be split in two ways: horizontally and vertically. Splitting tables increases the cost of maintaining data integrity.
Horizontal splitting: When several processes frequently access different rows of a table, split it horizontally and eliminate the redundant columns in the new tables; joins are then needed to reassemble the data, which is the trade-off of splitting. A typical case is storing telecom phone bills by month. Another case: when a main process repeatedly accesses certain rows, it is best to separate those rows into a subset table (stored redundantly); this is worthwhile when disk space is not a concern. After splitting, however, maintenance becomes harder: the tables must be kept consistent, either immediately with triggers or in batches with stored procedures or application code, which also adds disk I/O overhead.
Vertical splitting (which does not violate the third normal form): When several processes frequently access different columns of a table, it can be split vertically into several tables. Each row then has fewer columns, reducing disk I/O (each page holds more rows and the table occupies fewer pages); updates need not consider locking across the pieces; and there is no redundant data. The disadvantage is that inserts and deletes must preserve integrity across the pieces, maintained with stored procedures. Another case: when a main process repeatedly accesses certain columns, it is best to store those frequently accessed columns separately as a subset table (stored redundantly); again worthwhile when disk space is not a concern, but it adds the difficulty of maintaining the overlapping columns, which must be updated immediately with triggers or in batches with stored procedures or application code, adding disk I/O overhead. Vertical splitting also maximizes cache utilization.
In short, splitting tables is suitable when: each process needs a non-joined subset of the table; each process needs its own subset; or a frequently run main process does not need the whole table. A redundant subset table arises when the main, frequently running processes need only a subset while other processes need the whole table.
Note: after splitting a table, consider rebuilding its indexes.
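Horizontal splitting by month, as in the telecom-billing case above, can be sketched like this; the per-month table names (`bill_202401` and so on) are illustrative, and a real system would generate them from a closed list of billing periods.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One physical table per billing month, so the monthly billing process touches
# only its own partition. The month list is fixed here, so building table
# names with an f-string is safe.
for month in ("202401", "202402"):
    conn.execute(f"CREATE TABLE bill_{month} (phone TEXT, fee REAL)")

def insert_bill(phone, fee, month):
    # Route each row to the partition for its month.
    conn.execute(f"INSERT INTO bill_{month} VALUES (?, ?)", (phone, fee))

insert_bill("13900000001", 12.5, "202401")
insert_bill("13900000001", 8.0, "202402")
insert_bill("13900000002", 3.25, "202401")

# The January billing run reads only the January table.
jan_total = conn.execute("SELECT SUM(fee) FROM bill_202401").fetchone()[0]
```

Queries that span months must union the partitions back together, which is the join cost the section mentions as the trade-off of splitting.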

3.2 Storing Derived Data
For processes that perform a large number of repeated calculations, consider storing the calculation results redundantly when: the results are consistent with the source columns (the result does not change as long as the source does not); the calculation spans many rows and therefore costs extra disk I/O; or the calculation is complex and costs substantial CPU time. The cases break down as follows:
If the repeated calculation is within a row, add a column to the table to store the result. If the source columns are updated, a trigger must update the new column.
If the repeated calculation is per category, add a new table (generally two columns suffice: category and result) to store the results. If the source columns are updated, a trigger must update the new table immediately, or a stored procedure or application code must update it in batches.
If the repeated calculation spans multiple rows (for example, a ranking), add a column to store the result. If the source columns are updated, a trigger or stored procedure must update the new column.

In short, storing redundant data speeds up access but violates the third normal form and increases the cost of maintaining data integrity: the redundancy must be kept consistent, either immediately with triggers or in batches with stored procedures or application code.
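The trigger-maintained redundancy described above can be sketched as follows (Python/sqlite3; the table, column, and trigger names are illustrative).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_item (order_id INTEGER, price REAL, qty INTEGER);

-- Redundantly stored derived data: the order total is kept instead of being
-- recomputed over many rows on every query.
CREATE TABLE order_total (order_id INTEGER PRIMARY KEY, total REAL NOT NULL);

-- The trigger keeps the stored result consistent whenever source rows change,
-- as the section requires.
CREATE TRIGGER trg_item_insert AFTER INSERT ON order_item
BEGIN
    INSERT OR IGNORE INTO order_total (order_id, total) VALUES (NEW.order_id, 0);
    UPDATE order_total SET total = total + NEW.price * NEW.qty
     WHERE order_id = NEW.order_id;
END;
""")
conn.execute("INSERT INTO order_item VALUES (1, 2.5, 4)")
conn.execute("INSERT INTO order_item VALUES (1, 1.0, 3)")
total = conn.execute("SELECT total FROM order_total WHERE order_id = 1").fetchone()[0]
```

Reading the total is now a single-row lookup instead of a multi-row SUM, at the price of the extra trigger maintenance on every insert; a complete version would also need UPDATE and DELETE triggers.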

3.3 Eliminating Expensive Joins
For main processes that frequently access several tables at once, consider storing redundant data in the primary table, that is, redundant or derived columns that do not depend on the primary key. This violates the third normal form and increases maintenance difficulty: when the source columns change, triggers or stored procedures must update the redundant columns. When a main process accesses two tables at once, the tables can be merged, which reduces disk I/O but again violates the third normal form and increases maintenance difficulty. Merging a parent table with a child table differs from merging tables in a 1:1 relationship: merging parent and child yields a redundant table, while merging 1:1 tables yields redundant data within the merged table.

4. Database Object Placement Policy

The database object placement policy distributes data evenly across the system's disks to balance I/O access and avoid I/O bottlenecks.

  1. Distribute access across different disks: spread user data over as many devices as possible so that I/O operations run in parallel, avoiding I/O contention and access bottlenecks; place randomly accessed and sequentially accessed data separately.
  2. Separate system-database I/O from application-database I/O; place the system audit tables and the temporary database on their own disk.
  3. Place the transaction log on a separate disk. This reduces disk I/O overhead, facilitates recovery after failures, and improves system security.
  4. Place frequently accessed "active" tables on different disks; place frequently used tables and frequently joined tables on separate disks; even place the fields of heavily accessed tables on different disks, distributing access so as to avoid I/O contention.
  5. Use segments to separate frequently accessed tables from their non-clustered indexes, and to separate text and image data. Segments aim to balance I/O, avoid bottlenecks, increase throughput, enable parallel scans, and increase concurrency, maximizing disk throughput. Use the logical segment facility to place active tables and their non-clustered indexes so that I/O is balanced; of course, the system's default segments are often the best choice. Segments also make backup, recovery, and authorization more flexible.

The above is an optimization approach drawn from the author's own experience in database management. No specific software is prescribed; readers can apply it according to their own situation.
