Database optimization collection

Source: Internet
Author: User
Source: http://zhidao.baidu.com/question/1236568.html

This article first discusses basic database table design based on the third normal form (3NF), focusing on strategies for choosing primary keys and building indexes. It then outlines optimization approaches for the database management system from the perspectives of extended table design and the placement of database table objects.
Keywords: optimization, third normal form (3NF), redundant data, index, data partitioning, object placement
1 Introduction
The goals of database optimization are to avoid disk I/O bottlenecks, reduce CPU utilization, and reduce resource contention. Drawing on references for large database systems such as Sybase, Informix, and Oracle, and on years of practical engineering experience, this article discusses basic table design, extended table design, and the placement of database table objects, focusing on how to avoid disk I/O bottlenecks and reduce resource contention.
2 Basic Table Design Based on the Third Normal Form
In a table-driven management information system (MIS), basic tables are designed to the third normal form (3NF). The defining property of 3NF is that every non-key attribute depends only on the primary key. Designing database tables to 3NF has many advantages:
(1) It eliminates redundant data, saving disk storage space.
(2) It provides good data integrity constraints: referential integrity via primary/foreign keys and entity integrity via the primary key. This makes the data easy to maintain, port, and update.
(3) The decomposition is lossless: join queries and table merges produce no omissions or duplicates.
(4) Because redundant columns are eliminated, each data page holds more rows during queries (SELECT), so logical I/O is reduced; each cache holds more pages, so physical I/O is reduced as well.
(5) Most transactions perform well.
(6) The physical design is adaptable and can accommodate growing user needs.
In basic table design, the choice of primary keys, foreign keys, and indexes plays a very important role. System designers often focus only on meeting user requirements without recognizing the importance of these choices for system optimization, yet they are closely tied to the system's running performance. The following discusses these basic concepts and their significance from the perspective of database optimization:
(1) Primary key: Primary keys are used heavily for data access in complex SQL statements. A table has exactly one primary key. A primary key value must be definite (not null, not a default value, and relatively stable), should carry no encoded business information, and should be easy to access. It makes sense to use frequently referenced, well-known columns as the primary key. A short primary key is best (under 25 bytes), because key length affects index size, which in turn affects index page count and disk I/O. Primary keys are either natural or artificial. A natural primary key is composed of attributes of the entity itself and may be composite; a composite key should not contain too many columns, since it complicates join operations and enlarges the tables that reference it as a foreign key. An artificial primary key is introduced when no suitable natural key exists, or when the natural attributes are complex or sensitive. It is usually a minimal integer value with no business meaning; it slightly increases the table's own size but reduces the size of every table that references it as a foreign key.
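The trade-off between natural and artificial keys can be sketched with a small example. The following is an illustrative sqlite3 sketch (table and column names are invented for this example; production Sybase/Oracle DDL differs): the wide natural attributes stay in one table, and every referencing table carries only the short integer surrogate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Natural key candidate (name + address) would be long and unstable.
# Artificial key: a small INTEGER surrogate; foreign-key tables store only it.
cur.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,   -- artificial key, a few bytes
        name        TEXT NOT NULL,
        address     TEXT NOT NULL
    )""")
cur.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
    )""")

cur.execute("INSERT INTO customer (name, address) VALUES ('Alice', '1 Main St')")
cust_id = cur.lastrowid
cur.execute("INSERT INTO orders (customer_id) VALUES (?)", (cust_id,))

# The orders row carries only the short integer, not the long natural attributes;
# the join recovers them when needed.
row = cur.execute("""
    SELECT c.name FROM orders o JOIN customer c USING (customer_id)
""").fetchone()
print(row[0])  # Alice
```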
(2) Foreign key: Foreign keys establish the relationships between tables in a relational database (referential integrity). A primary key migrates from an independent entity to a dependent entity, where it becomes an attribute of the latter, called a foreign key.
(3) Index: The performance benefit of indexes is obvious. Indexes can be created on the columns commonly used in WHERE clauses and on columns used for sorting. They avoid full-table scans, allowing specific rows to be accessed directly without changing the table's physical structure, which reduces data access time; they can optimize away time-consuming sort operations; and they spread inserted data across different pages. The primary key automatically gets a unique index, so a unique index also guarantees uniqueness of the data (entity integrity). The smaller the index key, the more direct the lookup. A freshly built index is most efficient, so indexes should be rebuilt periodically. Indexes also have costs: they take space, they take time to create, and they incur maintenance cost during insert, delete, and update operations. There are two kinds of indexes: clustered and non-clustered. A table can have only one clustered index but multiple non-clustered indexes, and queries through a clustered index are faster than through a non-clustered one. Before creating an index, use the database system's functions to estimate its size.
① Clustered index: The data pages of a clustered index are stored in physical order, so the index has a small footprint. Good candidate columns are those used in WHERE clauses for range queries, fuzzy (LIKE) queries, or columns with many duplicate values (which benefit from sequential disk scans); columns used in joins; and columns used in ORDER BY and GROUP BY clauses. A clustered index is unfavorable for inserts, and the clustered index need not be built on the primary key.
② Non-clustered index: Compared with a clustered index, a non-clustered index takes more space and is less efficient. Good candidate columns are those used in WHERE clauses for range queries or fuzzy queries (when no clustered index applies); primary or foreign key columns; point (pointer-style) lookups or small ranges (the result set is less than about 20% of the table); join columns and primary key columns used in joins (range queries); columns used in ORDER BY and GROUP BY; and columns to be covered by the index. Creating multiple non-clustered indexes on read-only tables is advantageous. Indexes also have drawbacks: they take time to create, they occupy significant disk space, and they increase maintenance cost (modifying an indexed column is slowed by index maintenance). When should you not create an index? For small tables (under 5 pages of data); small-to-medium tables when rows are not accessed directly or result sets need no sorting; single-value domains (densely repeated values); index columns that are too long (over about 20 bytes); frequently changed columns; highly duplicated columns; columns with many null values; and columns not used in WHERE clauses or join queries. For heavy data entry, create as few indexes as possible. Also guard against useless indexes: when a WHERE clause carries more than five conditions, the cost of maintaining the indexes can exceed their benefit, and it is more effective to stage the data in a temporary table.
A note on batch data import: in practice, large-volume computations (such as telecom billing) produce batch data sets (text files), computed by C-language programs according to the primary/foreign key relationships, that can be imported quickly with the system's own utilities (such as Sybase's bcp command). Before importing into a table, first drop the table's indexes, which speeds up the import and shortens import time; recreate the indexes after the import to optimize queries.
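The drop-index/bulk-load/recreate-index pattern can be imitated in any DBMS. A minimal sqlite3 sketch (standing in for a real bcp import; table and index names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE bill (acct INTEGER, amount REAL)")
cur.execute("CREATE INDEX idx_bill_acct ON bill (acct)")

rows = [(i % 1000, float(i)) for i in range(50_000)]

# Emulate the bcp-style pattern: drop the index, bulk-insert, rebuild once.
cur.execute("DROP INDEX idx_bill_acct")
cur.executemany("INSERT INTO bill VALUES (?, ?)", rows)
cur.execute("CREATE INDEX idx_bill_acct ON bill (acct)")  # rebuilt in one pass
conn.commit()

count = cur.execute("SELECT COUNT(*) FROM bill").fetchone()[0]
print(count)  # 50000
```

Rebuilding the index once over the finished table is cheaper than maintaining it row by row during the load.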
(4) Lock: Locking is an important mechanism for concurrency. It keeps concurrent data consistent, i.e., processing by transaction; the system uses locks to guarantee data integrity. Deadlocks therefore cannot be ruled out entirely, but we can reduce them: avoid long transactions, shorten lock hold times, minimize user interaction inside transactions, and never let users control transaction length; avoid running batch jobs at the same time, especially against time-consuming tables. Lock contention: a table can hold only one exclusive lock at a time, so while one user holds it, others wait; as the number of users grows, server performance drops and the application can appear to hang. How can deadlocks be avoided? Move from page-level locks to row-level locks to reduce contention; pad small tables with filler rows so that rows that would otherwise contend on one page are spread out; if contention within a page is the problem, choose a suitable clustered index to distribute data across pages; create redundant tables; keep transactions short; and run batch processing without network interaction.
(5) Query optimization rules: When accessing table data, avoid sorts, joins, and correlated subqueries as much as possible. Experience shows that queries should be optimized as follows:
① Touch as few rows as possible;
② Avoid sorting, or sort as few rows as possible. If a large amount of data must be sorted, it is better to copy the relevant data into a temporary table and sort there; sort on a simple key (column), such as an integer or a short string;
③ Avoid correlated subqueries;
④ Avoid complex expressions and non-leading substring matches in the WHERE clause, and avoid comparisons on long strings;
⑤ Prefer AND in the WHERE clause and use OR sparingly;
⑥ Use the temporary database. For multi-table queries, multiple joins, complex queries, and data filtering, staging data in a temporary table (with an index) can reduce I/O, at the cost of extra space.
Unless every joined column has an index, a join query builds dynamic indexes for the two sides, places the rows in a work table, and re-sorts them.
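Rule ⑤ can be observed directly from an optimizer. A minimal sqlite3 sketch follows (EXPLAIN QUERY PLAN output is SQLite-specific; commercial optimizers report plans differently, but the effect of OR on an unindexed column is similar):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
cur.execute("CREATE INDEX idx_a ON t (a)")
cur.executemany("INSERT INTO t VALUES (?, ?)", [(i, i) for i in range(1000)])

def plan(sql):
    # Concatenate the optimizer's plan description for the statement.
    return " ".join(r[3] for r in cur.execute("EXPLAIN QUERY PLAN " + sql))

# AND keeps the predicate on `a` usable by the index...
and_plan = plan("SELECT * FROM t WHERE a = 5 AND b = 5")
# ...while OR with an unindexed column forces a full scan.
or_plan = plan("SELECT * FROM t WHERE a = 5 OR b = 5")

print(and_plan)  # ... USING INDEX idx_a ...
print(or_plan)   # SCAN ...
```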
3 Basic Table Extension Design
Although tables designed to the third normal form have the advantages listed above, they are sometimes unfavorable to performance in practice: fetching a subset of data may require scanning the whole table; many processes may compete for the same data at once; the same result may be computed repeatedly from the same rows; and when a process draws data from multiple tables, the required joins consume disk I/O and CPU time.
In particular, the design of basic tables should be extended in the following situations: many processes frequently access one table; processes access subsets of the data; computations are repeated; redundant data is needed; or users require a process to have priority or a short response time.
How can these unfavorable factors be avoided? Partitioning tables, storing redundant data, storing derived columns, and merging related tables are all effective ways to overcome them and optimize system operation.
3.1 Splitting tables and storing redundant data
Tables can be split in two ways: horizontally and vertically. Splitting a table increases the cost of maintaining data integrity.
Horizontal splitting: when multiple processes frequently access disjoint sets of rows of a table, the table can be split horizontally into several tables with identical columns; joins are then needed to reassemble the whole, which is the price of the split. A typical case is storing telecom call records by month. Another case: when a main process repeatedly accesses certain rows, it is best to copy those frequently accessed rows into a subset table (redundant storage); this matters when disk space is not the constraint. After such a split, however, maintenance becomes harder: the tables must be kept consistent either immediately with triggers or in batches with stored procedures or application code, which also adds disk I/O overhead.
Vertical splitting (without violating the third normal form): when multiple processes frequently access different columns of a table, the table can be split vertically into several tables, reducing disk I/O (each row has fewer columns, each page holds more rows, and fewer pages are read); updates need not worry about locking, and there is no redundant data. The disadvantage is that inserts and deletes must preserve integrity across the pieces, typically maintained with stored procedures. Another case: when a main process repeatedly accesses certain columns, it is best to copy those frequently accessed columns into a subset table (redundant storage); this matters when disk space is not the constraint. The overlapping columns then become harder to maintain and must be kept consistent immediately with triggers or in batches with stored procedures or application code, which also adds disk I/O overhead. Vertical splitting makes the best possible use of the cache.
In short, splitting a table suits these cases: each process needs a disjoint (non-join) subset of the table; each process needs a subset of the table; or a high-frequency main process does not need the whole table. When a main, frequently run process needs a subset of the table while other such processes need the whole table, create a redundant subset table.
Note: after splitting a table, consider re-indexing the resulting tables.
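A minimal sqlite3 sketch of vertical splitting, assuming an invented account table whose hot column (balance) is read constantly while a wide text column is rarely needed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Vertical split: hot columns in one table, wide rarely-read columns in another,
# both sharing the same primary key.
cur.execute("CREATE TABLE account_hot  (acct_id INTEGER PRIMARY KEY, balance REAL)")
cur.execute("CREATE TABLE account_cold (acct_id INTEGER PRIMARY KEY, notes TEXT)")

cur.execute("INSERT INTO account_hot  VALUES (1, 99.5)")
cur.execute("INSERT INTO account_cold VALUES (1, 'long descriptive text')")

# The frequent balance query never touches the wide column...
balance = cur.execute(
    "SELECT balance FROM account_hot WHERE acct_id = 1").fetchone()[0]
# ...and the occasional full view is recovered with a join on the shared key.
full = cur.execute("""
    SELECT h.balance, c.notes
    FROM account_hot h JOIN account_cold c USING (acct_id)
""").fetchone()
print(balance, full)
```

Because each hot row is short, a data page holds more of them, which is exactly the I/O saving the text describes.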
3.2 Storing derived data
For processes that perform a large number of repeated computations, consider storing the computed results (redundant storage) when the results of the repeated computation are identical (the source columns are stable, so the results do not change), when the computation spans many rows and incurs extra disk I/O, or when the computation is complex and consumes significant CPU time. The cases break down as follows:
If the repeated computation is within a single row, add a column to the table to store the result. If the source columns are updated, a trigger must update the stored column.
If the repeated computation is per category, add a new table (two columns, category and result, usually suffice) to store the results. If the source columns are updated, the new table must be updated immediately with a trigger or in batches with a stored procedure or application code.
If the repeated computation spans multiple rows (such as a ranking), add a column to the table to store the result. If the source columns are updated, the column must be updated with a trigger or stored procedure.
In short, storing redundant data speeds up access but violates the third normal form and raises the cost of maintaining data integrity: the redundant data must be updated immediately with triggers or in batches with stored procedures or application code.
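A sketch of the first case, per-row derived data kept consistent by a trigger, in sqlite3 with invented names (trigger syntax varies between DBMSs):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE order_line (
        id    INTEGER PRIMARY KEY,
        qty   INTEGER,
        price REAL,
        total REAL          -- derived column: qty * price, stored redundantly
    )""")
# The trigger keeps the stored result consistent when the source columns change.
cur.execute("""
    CREATE TRIGGER trg_total AFTER UPDATE OF qty, price ON order_line
    BEGIN
        UPDATE order_line SET total = NEW.qty * NEW.price WHERE id = NEW.id;
    END""")

cur.execute("INSERT INTO order_line VALUES (1, 2, 10.0, 20.0)")
cur.execute("UPDATE order_line SET qty = 3 WHERE id = 1")
total = cur.execute("SELECT total FROM order_line WHERE id = 1").fetchone()[0]
print(total)  # 30.0
```

Reads get the result for free; the maintenance cost is paid once, on update, exactly as the text describes.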
3.3 Eliminating expensive joins
For main processes that frequently access several tables at once, consider storing redundant data in the primary table, i.e., redundant or derived columns (which do not depend on the primary key); this violates the third normal form and increases maintenance effort, since triggers or stored procedures must update the redundant columns whenever the source columns change. When a main process always accesses two tables together, the tables can be merged, which reduces disk I/O but again violates the third normal form and increases maintenance effort. Merging a parent table with a child table differs from merging two tables in a 1:1 relationship: merging parent and child repeats the parent's data across rows, while merging a 1:1 pair produces redundant data within the merged table.
4 Database Object Placement Policy
The object placement policy distributes data evenly across the system's disks to balance I/O and avoid I/O bottlenecks.
(1) Spread accesses across disks: let user data span as many devices as possible so that parallel I/O avoids contention and access bottlenecks; place randomly accessed and sequentially accessed data separately.
(2) Separate system-database I/O from application-database I/O: place the system audit tables and the temporary database on their own disk.
(3) Place the transaction log on a separate disk. This reduces disk I/O overhead, aids recovery after failures, and improves system safety.
(4) Place frequently accessed "active" tables on different disks; place frequently used tables and frequently joined tables on separate disks; even spread a frequently accessed table itself across different disks, so that access is distributed and I/O contention avoided.
Segments separate frequently accessed tables from their (non-clustered) indexes, and separate text and image data. Segments aim to balance I/O, avoid bottlenecks, increase throughput, enable parallel scans, raise concurrency, and maximize disk throughput. Use the logical segment facility to place active tables and their non-clustered indexes so as to balance I/O; otherwise, the system's default segment is usually best. Segments also make backup and recovery more flexible, and make system authorization more flexible.

Experience in optimizing SQL Server databases

From: http://www.enet.com.cn/article/2008/0108/A20080108997121.shtml

Database optimization considerations:

1. Create an index for the key fields.

2. Use stored procedures; they make SQL more flexible and efficient.

3. Back up the database and clear junk data.

4. Optimize SQL statement syntax. (Sybase SQL Expert can help; unfortunately, I could not find an unexpired serial number.)

5. Clear and delete logs.

Basic principles for SQL statement optimization:

1. Use indexes to traverse tables faster.

The index created by default is non-clustered, but that is not always optimal. Under a non-clustered index, data is stored on the data pages in no particular physical order. Sound index design should be based on analysis and prediction of the query workload. In general: ① for columns with many duplicate values that are frequently used in range queries (BETWEEN, >, >=, <, <=), ORDER BY, or GROUP BY, create a clustered index; ② when multiple columns are frequently accessed together and each contains duplicate values, consider a composite index; the composite index should try to cover the key queries, and its leading column must be the most frequently used one.

2. Is null and is not null

Nulls and indexes interact badly. In Oracle-style B-tree indexes, a row is omitted from the index when all of its indexed columns are null, so null values cannot be found through the index, and indexing such a column does not improve those lookups (other systems, such as SQL Server, do index nulls). In general, the statement optimizer cannot use an index for IS NULL or IS NOT NULL predicates in the WHERE clause.

3. In and exists

EXISTS is often far more efficient than IN; the difference comes down to full-table scans versus range scans. Almost all subqueries written with the IN operator can be rewritten to use EXISTS.
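The IN-to-EXISTS rewrite looks like this in a minimal sqlite3 sketch (invented tables; whether EXISTS actually wins depends on the optimizer and the data, so the sketch only demonstrates the equivalence of the two forms):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE dept (dept_id INTEGER PRIMARY KEY)")
cur.execute("CREATE TABLE emp (emp_id INTEGER PRIMARY KEY, dept_id INTEGER)")
cur.executemany("INSERT INTO dept VALUES (?)", [(1,), (2,)])
cur.executemany("INSERT INTO emp VALUES (?, ?)", [(1, 1), (2, 1), (3, 9)])

# IN form: the subquery result is materialised, then probed.
in_rows = cur.execute(
    "SELECT emp_id FROM emp WHERE dept_id IN (SELECT dept_id FROM dept)"
).fetchall()

# EXISTS form: a correlated probe that can stop at the first match.
exists_rows = cur.execute("""
    SELECT emp_id FROM emp e
    WHERE EXISTS (SELECT 1 FROM dept d WHERE d.dept_id = e.dept_id)
""").fetchall()

print(sorted(in_rows) == sorted(exists_rows))  # True: same result set
```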

4. For massive queries, use as little formatting (data type conversion) as possible.

5. In SQL Server 2000, if a stored procedure has a single parameter of OUTPUT type, the caller must supply an initial value when calling it; otherwise the call fails.

6. ORDER BY and GROUP BY

An index on the relevant columns can improve the SELECT performance of ORDER BY and GROUP BY phrases. Note: if the indexed column contains null values, the optimizer cannot make use of it.

7. Any operation applied to a column causes a table scan, including database functions and calculation expressions. When querying, move computations to the right-hand side of the comparison wherever possible.
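Moving the computation off the column can be checked against the optimizer. A sqlite3 sketch with invented names follows (the plan text is SQLite-specific, but the sargability principle is general):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE pay (id INTEGER, salary INTEGER)")
cur.execute("CREATE INDEX idx_salary ON pay (salary)")
cur.executemany("INSERT INTO pay VALUES (?, ?)", [(i, i) for i in range(1000)])

def plan(sql):
    # Concatenate the optimizer's plan description for the statement.
    return " ".join(r[3] for r in cur.execute("EXPLAIN QUERY PLAN " + sql))

# Arithmetic applied to the column: the index on salary cannot be used.
bad = plan("SELECT * FROM pay WHERE salary * 12 > 6000")
# Arithmetic moved to the constant side: the predicate becomes index-usable.
good = plan("SELECT * FROM pay WHERE salary > 6000 / 12")

print(bad)   # SCAN ...
print(good)  # ... USING INDEX idx_salary ...
```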

8. IN and OR clauses often cause work tables to be used, defeating the index. If the values do not produce large numbers of duplicates, consider splitting the clause into separate queries; the split clauses should then be able to use the index.

9. SET SHOWPLAN_ALL ON displays the execution plan. DBCC (Database Consistency Checker) is a set of programs for verifying the integrity of a SQL Server database.

10. Use cursors with caution

Where a cursor must be used, consider copying the qualifying rows into a temporary table and defining the cursor over the temporary table; this can improve performance markedly.
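A sketch of the temporary-table variant in sqlite3 (invented names): the qualifying rows are staged once, and the row-by-row loop then touches only the staged subset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE txn (id INTEGER PRIMARY KEY, status TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO txn VALUES (?, ?, ?)",
    [(i, "open" if i % 100 == 0 else "closed", float(i)) for i in range(10_000)])

# Instead of opening a cursor over the whole table, copy only the qualifying
# rows into a temporary table and iterate over that much smaller set.
cur.execute(
    "CREATE TEMP TABLE work AS SELECT id, amount FROM txn WHERE status = 'open'")

processed = 0
for _id, amount in cur.execute("SELECT id, amount FROM work"):
    processed += 1        # row-by-row work happens on 100 rows, not 10,000

print(processed)  # 100
```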

Note: "optimized" here means that the WHERE clause uses an index; without optimization, a table scan or other extra overhead occurs. Experience shows that the greatest gains in SQL Server performance come from logical database design, index design, and query design, and conversely, the biggest performance problems come from deficiencies in these same areas. The essence of SQL optimization is to use statements the optimizer can recognize, given correct results, so as to exploit indexes, reduce the number of table-scan I/Os, and avoid table searches as far as possible. SQL performance optimization is in fact a complex process; the above touches only the application layer, and deeper work involves resource configuration at the database layer, traffic control at the network layer, and the overall design of the operating-system layer.

14 skills in Database Design

As computer technology is applied ever more widely across the economy, and while hardware keeps shrinking, application systems are growing more complex and larger in scale. The database is the core of the whole system, and its design directly affects execution efficiency and system stability. In software development, database design should therefore follow the necessary normal-form theory to reduce redundancy and guarantee data integrity and correctness. Only with a proper database model on a suitable database product can the programming and maintenance burden of the whole system be reduced and its real operating efficiency improved. Developers of small or medium projects can design a sound database with normal-form theory alone, but a software project with a large database needs a complete set of design principles and skills.

I. Set up a Data Group
 
Large databases have many data elements, so a dedicated data group should be set up during design. The database designer is not necessarily the eventual user and cannot consider every data element during system design; once the database is designed, the needed tables are often hard to find. It is therefore best for the data group to be staffed by project backbones familiar with the business.
 
The data group's job is not to design the database but to extract the system's basic data elements from the requirements analysis and from other similar systems, and to review the database. The review covers: whether the new database's elements are complete and can satisfy all business needs; analysis and conversion of old databases (where existing systems exist); and review, control, and necessary adjustment of the database design.

II. Design Principles

1. Standard naming. All database names, table names, and field names must follow unified naming rules and carry descriptions, to ease design, maintenance, and querying.
 
2. Controlled reference of fields. During design, choose a suitable database design management tool so that developers can design in a distributed way while the data group audits centrally. With unified naming rules, an existing field can be referenced directly; otherwise it is designed anew.
 
3. Duplicate control of database tables. If most of a table's fields are found to exist already, the developer should suspect that the table itself already exists; by looking up the tables containing those fields and their designers, duplication can be confirmed.
 
4. Concurrency control. For any one database or table, only one person holds control at a time; others may only query.
 
5. Necessary discussion. After the database design is complete, the data group should discuss it with the people concerned; through discussion they become familiar with the database, catch problems in the design, and gather the information the design needs.
 
6. Review by the data group. Additions and modifications to database tables must be reviewed by the data group to ensure they meet the necessary requirements.
 
7. Header-file handling. After each change to the data, the data group should regenerate the corresponding header files (management software can do this automatically) and notify the developers concerned to modify their programs accordingly.

III. Design Skills

1. Split high-volume tables by category. Frequently used tables (such as parameter tables or code lookup tables) should hold as few records as possible because of their high usage frequency. For example, a bank's account-holder table was originally designed as a single table; that eased programming and maintenance, but analysis showed the large data volume hurt fast data lookup. Splitting it into current-account, fixed-term-account, and public-account tables greatly improved query efficiency.

2. Index design. For large tables, well-chosen indexes improve the operating efficiency of the whole database. Index fields should be chosen among columns with few duplicate values, and with composite indexes attention must be paid to the order in which the fields are listed. For example, for a journal table of over 50,000 records with a composite index built in the order (date, serial number), the number of duplicate date values is close to the table's row count, so a query by serial number took close to 3 seconds; with an index built on the serial number alone, the same query took under 1 second. In large-database design, only well-chosen index fields effectively improve overall efficiency.
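The leading-column effect described above can be reproduced in miniature with sqlite3 (an invented journal table; the 3-second/1-second figures in the text are from the original example, not from this sketch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE flow (day TEXT, serial INTEGER, amt REAL)")
# Low-selectivity column (day) leads the composite index.
cur.execute("CREATE INDEX idx_day_serial ON flow (day, serial)")
cur.executemany("INSERT INTO flow VALUES (?, ?, ?)",
                [("2008-01-08", i, 1.0) for i in range(5000)])

def plan(sql):
    return " ".join(r[3] for r in cur.execute("EXPLAIN QUERY PLAN " + sql))

# Searching by serial alone cannot seek through an index led by day...
by_serial = plan("SELECT amt FROM flow WHERE serial = 4999")
# ...but an index led by the selective column finds the row directly.
cur.execute("CREATE INDEX idx_serial ON flow (serial)")
by_serial_idx = plan("SELECT amt FROM flow WHERE serial = 4999")

print(by_serial)      # SCAN ...
print(by_serial_idx)  # ... USING INDEX idx_serial ...
```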

3. Optimize data operations. In large databases, data-operation efficiency deserves attention. For example, every time a business record is added to the journal table, a serial number must be fetched from the flow-control table and the stored value incremented by one. The response of a single operation is normal, but under batch processing the speed drops markedly. Analysis showed that each increment briefly locks the flow-control table; this table is the core of the whole system's operation and may also be locked by other processes, which slows the entire business. The solution is to request serial numbers in batches, sized to the total batch workload, and update the flow-control table once, greatly speeding batch processing. Another example is insert optimization: inserting a large volume of business data with ordinary INSERT statements is slow, because each insert incurs its own table I/O. Buffered forms, such as Informix's PUT statement, perform the I/O only when a buffer page fills, improving efficiency. When clearing a large table, the DELETE statement is usually used directly; it works for small tables, but on a large table the resulting huge transaction may be slow or even fail. One remedy is to run the delete outside a transaction, but the more effective method is to DROP the table and rebuild it.
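The batch serial-number allocation can be sketched as follows (sqlite3, with an invented flow-control table; a production version must take the read and the update atomically under one lock or serialized transaction):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE flow_ctl (next_no INTEGER)")
cur.execute("INSERT INTO flow_ctl VALUES (1)")

def take_serials(n):
    """Reserve n serial numbers in ONE update instead of n single-row updates,
    so the control table is locked once per batch rather than once per row.
    (In this single-connection sketch the read+update pair is not guarded;
    a concurrent system must serialize it.)"""
    start = cur.execute("SELECT next_no FROM flow_ctl").fetchone()[0]
    cur.execute("UPDATE flow_ctl SET next_no = next_no + ?", (n,))
    conn.commit()
    return range(start, start + n)

batch = list(take_serials(500))
next_no = cur.execute("SELECT next_no FROM flow_ctl").fetchone()[0]
print(batch[0], batch[-1], next_no)  # 1 500 501
```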

4. Adjust database parameters. Parameter tuning is a matter of accumulated experience and should be done by an experienced system administrator. Taking the Informix database as an example, too few locks cause lock-table overflow, and too few logical logs cause inserts into large tables to fail; all such parameters should be adjusted to actual conditions.

5. Necessary tools. During database development and design, small utility tools can be built first, for example, automatic generation of table header files, initial data loading, wrapper functions for data insertion, and error tracking or automatic display, to raise the efficiency of database design and development.

6. Avoid long transactions. Deleting from or inserting into a single large table creates a long transaction; the remedy is to adjust parameters or to split the operation into chunks. A long transaction made up of a sequence of small operations (such as the end-of-day run of a banking system) can be completed as one transaction, but the whole thing may be too large to finish, or an unexpected event may force a lengthy redo. A better approach is to break the whole into several smaller transactions and let the application control the overall flow; then if one small transaction fails, only that one must be redone, which saves time and avoids the long transaction.
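Breaking a long batch into several small committed transactions can be sketched as follows (sqlite3, with an invented end-of-day posting table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE eod (acct INTEGER, posted INTEGER DEFAULT 0)")
cur.executemany("INSERT INTO eod (acct) VALUES (?)", [(i,) for i in range(1000)])

CHUNK = 100
commits = 0
while True:
    # Post one small batch per transaction; a failure only redoes one chunk,
    # and locks are released at every commit instead of held for the whole run.
    cur.execute("""
        UPDATE eod SET posted = 1
        WHERE rowid IN (SELECT rowid FROM eod WHERE posted = 0 LIMIT ?)""",
        (CHUNK,))
    conn.commit()
    commits += 1
    if cur.execute(
            "SELECT COUNT(*) FROM eod WHERE posted = 0").fetchone()[0] == 0:
        break

print(commits)  # 10
```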

7. Plan ahead appropriately. Computer technology develops rapidly, so database design should look ahead: it must not only meet current application needs but also allow for future business growth, and it must support expanding or adding processing functions to the application system.

Compared with small and medium databases, designing and developing large databases is far more complex. Beyond following normal-form theory to increase the system's consistency and integrity, the design should be distributed where appropriate while holding to the basic principles of centralized control and unified audit, so that the database structure stays compact, the distribution balanced, and lookups fast. In database operations, the skills above help raise the execution efficiency of the whole application system, and appropriate forward planning adapts the system to evolving application and development requirements.

 
