"Important" for database optimization with large data volume and high concurrency
I. Design of database structure
If the database model is poorly designed, it not only makes the client and server programs harder to write and maintain, but also hurts the performance of the system in actual operation. A complete database model should therefore be designed before implementation begins.
During the analysis and design phases, the data volume is small and the load is low, so we tend to focus only on implementing functionality and overlook performance weaknesses. Only after the system has been in production for a while do we notice that performance is degrading. Improving it at that point costs more people and resources, and the whole system inevitably turns into a patchwork.
So when designing the overall flow of a system, we must ask whether it can reach an extreme state under high concurrency and heavy data access. For example, the external statistics system suffered a data anomaly on July 16: under heavy concurrent access, the database response time could not keep up with the data refresh rate. At the date boundary (00:00:00), the program checks whether a record for the current date exists in the database and inserts one if it does not. Under low concurrency this causes no problem, but when traffic at the date boundary is heavy, several requests can pass the check at the same time, so several records for the current date are inserted and the data becomes wrong. Once the database model is settled, it is worth drawing a data flow diagram of the system to analyze possible bottlenecks.
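The date-boundary problem can be avoided by letting the database enforce uniqueness instead of relying on a check followed by an insert. The following is a minimal sketch, not the original system's code: it uses Python's sqlite3 module as a runnable stand-in for SQL Server, and the table daily_stats, its columns, and the date value are hypothetical. INSERT OR IGNORE is SQLite syntax; on SQL Server the same idea would rest on a primary key or unique constraint on the date column.

```python
import sqlite3

def ensure_daily_row(conn, stat_date):
    """Create the row for stat_date if it is missing; safe to call repeatedly."""
    # The PRIMARY KEY on stat_date makes the database responsible for
    # uniqueness, so a second insert for the same date is ignored
    # instead of creating a duplicate record.
    conn.execute(
        "INSERT OR IGNORE INTO daily_stats (stat_date, hits) VALUES (?, 0)",
        (stat_date,),
    )

def record_hit(conn, stat_date):
    """Count one access for stat_date."""
    ensure_daily_row(conn, stat_date)
    conn.execute(
        "UPDATE daily_stats SET hits = hits + 1 WHERE stat_date = ?",
        (stat_date,),
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_stats (stat_date TEXT PRIMARY KEY, hits INTEGER NOT NULL)"
)

# Five requests arriving right after the date changes (hypothetical date).
for _ in range(5):
    record_hit(conn, "2024-01-01")

rows = conn.execute("SELECT stat_date, hits FROM daily_stats").fetchall()
print(rows)
```

However many requests arrive at the boundary, the constraint allows only one row per date, and every request still increments the same counter.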
To guarantee the consistency and integrity of the database, a logical design often introduces many table associations and reduces data redundancy as far as possible (for example, the region of a user can be moved out of the user table into a separate table). With low redundancy, data integrity is easy to guarantee, data throughput improves, and the relationships between data elements are expressed clearly. Join queries across several tables (especially large ones), however, perform worse and make the client program harder to write. The physical design therefore needs to compromise: based on the business rules, determine the size of the associated tables and the access frequency of the data items, and add a suitable amount of redundancy for tables that are joined frequently. This improves response time, although it also makes the program more complicated. Designers should weigh this in the design phase according to the type and frequency of the system's operations.
Also, it is best not to use an auto-increment field as the primary key that child tables reference. It makes system migration and data recovery difficult. This is how the mapping relations of the external statistics system were lost (******************).
The original table must be reconstructible from the tables split out of it. The advantage of this rule is that no extra columns are introduced into the split tables, and every table structure you create is only as large as it actually needs to be. Applying this rule is a good habit, but you rarely need it unless you are dealing with very large data. (For example, in a pass system, userid, username and userpassword can be split into a separate table, with userid used as a foreign key in the other tables.)
Points that need particular attention when designing tables:
1. A data row should not exceed 8020 bytes. A longer row occupies two rows in the physical page, which causes storage fragmentation and reduces query efficiency.
2. Fields that can be stored as a numeric type should use a numeric type rather than a string type (for example, a phone number). A string type reduces the performance of queries and joins and increases storage overhead, because the engine compares a string character by character when processing queries and joins, whereas a numeric value needs only one comparison.
3. The fixed-length character type char and the variable-length character type varchar both hold up to 8000 bytes. char is faster to query but uses more storage; varchar is relatively slower to query but saves storage. Choose flexibly when designing fields: char for fields whose length varies little, such as user names and passwords, and varchar for fields whose length varies a lot, such as comments.
4. A field should be as short as possible while still meeting the largest plausible need. This improves query efficiency and reduces the resources consumed when building indexes.
II. Optimization of queries
While still implementing the required functionality, minimize the number of round trips to the database. Use search parameters to minimize the number of rows accessed and the size of the result set, which reduces the network load. Process operations separately where they can be separated, to improve each response time. When using SQL in a data window, try to put the indexed column first in the select list, and keep the structure of the algorithm as simple as possible. When querying, do not use the wildcard as in SELECT * FROM T1; select only the columns you need, such as SELECT col1, col2 FROM T1. Limit the number of result rows where possible, such as SELECT TOP n col1, col2, col3 FROM T1 (where n is the row limit), because in some cases users do not need that much data.
Without an index, the database must perform a full table scan to find a piece of data, traversing every row to find the ones that match the condition. With a small amount of data the difference may not be noticeable, but with a large amount of data the result is extremely poor.
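The difference between a full table scan and an index lookup can be observed directly in a query plan. The sketch below is an illustration under stated assumptions: it uses Python's sqlite3 module and SQLite's EXPLAIN QUERY PLAN rather than SQL Server, and the table t, its data, and the index name idx_t_num are hypothetical. The exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

def plan(conn, sql):
    """Return the planner's readable description of how it will run sql."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # the last column holds the detail text

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, num INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO t (num, name) VALUES (?, ?)",
    [(i % 1000, "name%d" % i) for i in range(10000)],
)

query = "SELECT name FROM t WHERE num = 10"

before = plan(conn, query)                      # no index on num yet
conn.execute("CREATE INDEX idx_t_num ON t (num)")
after = plan(conn, query)                       # same query, index available

print("without index:", before)
print("with index:   ", after)
```

Before the index exists the plan reports a scan of the whole table; afterwards it reports a search that uses idx_t_num.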
Many people do not know how SQL statements are executed in SQL Server and worry that the statements they write will be misunderstood by SQL Server. For example:
SELECT * FROM table1 WHERE name = 'Zhangsan' AND tid > 10000
versus:
SELECT * FROM table1 WHERE tid > 10000 AND name = 'Zhangsan'
Some people are not sure whether these two statements execute equally efficiently, because read literally they do look different. If tid is a clustered index, the second statement appears to search only the rows after the first 10,000, whereas the first statement appears to search the whole table for rows with name = 'Zhangsan' and then apply the condition tid > 10000 to produce the result.
In fact, such worries are unnecessary. SQL Server has a query optimizer that evaluates the search conditions in the WHERE clause and determines which index can narrow the search space of a table scan; in other words, it optimizes automatically. Although the query optimizer can optimize a query automatically based on the WHERE clause, it sometimes does not produce the fast query you intended.
During the query analysis phase, the query optimizer examines each clause of the query and decides whether it can be used to limit the amount of data that must be scanned. A clause that can be used as a search argument (SARG) is called optimizable, and an index can be used to obtain the required data quickly.
SARG definition: an operation that restricts the search, because it specifies an exact match, a match within a range of values, or an AND of two or more such conditions. It takes the form:
column name operator <constant or variable> or <constant or variable> operator column name
The column name appears on one side of the operator, and the constant or variable appears on the other side. For example:
Name = 'John'
Price > 5000
5000 < Price
Name = 'John' AND Price > 5000
If an expression does not satisfy the SARG form, it cannot limit the scope of the search; SQL Server must then check every row against all the conditions in the WHERE clause. An index is therefore useless for expressions that do not satisfy the SARG form.
Therefore, the most important thing in optimizing a query is to write statements that follow the query optimizer's rules, so that index lookups are used and full table scans are avoided.
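The effect of the SARG form can be sketched as follows. This is an illustration, not SQL Server output: it uses Python's sqlite3 module with SQLite's EXPLAIN QUERY PLAN, and the table t1, its data, and the index name idx_t1_f1 are hypothetical. The example assumes f1 is an integer column, where f1 / 2 = 100 matches both 200 and 201 because integer division truncates, so the equivalent SARG is written as a range.

```python
import sqlite3

def plan(conn, sql):
    """Return the planner's readable description of how it will run sql."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, f1 INTEGER, note TEXT)")
conn.execute("CREATE INDEX idx_t1_f1 ON t1 (f1)")
conn.executemany(
    "INSERT INTO t1 (f1, note) VALUES (?, ?)",
    [(i, "row") for i in range(1000)],
)

# Expression on the column: not in SARG form, every row must be evaluated.
non_sarg = "SELECT id, note FROM t1 WHERE f1 / 2 = 100"
# Bare column compared with constants: SARG form, the index limits the search.
sarg = "SELECT id, note FROM t1 WHERE f1 >= 200 AND f1 < 202"

non_sarg_plan = plan(conn, non_sarg)
sarg_plan = plan(conn, sarg)
same_rows = sorted(conn.execute(non_sarg).fetchall()) == sorted(
    conn.execute(sarg).fetchall()
)

print("non-SARG plan:", non_sarg_plan)
print("SARG plan:    ", sarg_plan)
print("same rows:    ", same_rows)
```

Both statements return the same rows, but only the second lets the planner use the index on f1.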
Specific points to note:
1. Avoid testing a field for NULL in the WHERE clause where possible, as it may cause the engine to abandon the index and perform a full table scan, for example:
SELECT id FROM t WHERE num IS NULL
You can set a default value of 0 on num, make sure the num column contains no NULL values, and then query like this:
SELECT id FROM t WHERE num = 0
2. Avoid the != or <> operators in the WHERE clause where possible; otherwise the engine may abandon the index and perform a full table scan. The optimizer cannot use the index to determine how many rows will be hit, so it has to search all rows of the table.
3. Avoid joining conditions with OR in the WHERE clause where possible, as it may cause the engine to abandon the index and perform a full table scan, for example:
SELECT id FROM t WHERE num = 10 OR num = 20
You can query like this instead:
SELECT id FROM t WHERE num = 10
UNION ALL
SELECT id FROM t WHERE num = 20
4. IN and NOT IN should also be used with caution, because IN may prevent the system from using the index, so that it can only search the table data directly. For example:
SELECT id FROM t WHERE num IN (1,2,3)
For consecutive values, use BETWEEN instead of IN:
SELECT id FROM t WHERE num BETWEEN 1 AND 3
5. Avoid searching indexed character data with a pattern that does not start with the leading characters. This also prevents the engine from using the index.
See the following examples:
SELECT * FROM T1 WHERE NAME LIKE '%L%'
SELECT * FROM T1 WHERE SUBSTRING(NAME,2,1) = 'L'
SELECT * FROM T1 WHERE NAME LIKE 'L%'
Even if the NAME field is indexed, the first two queries cannot use the index to speed up the operation, and the engine has to process every row of the table one by one. The third query can use the index.
6. Force the query optimizer to use an index when necessary. Using a parameter in the WHERE clause, for example, can also result in a full table scan. SQL resolves local variables only at run time, but the optimizer cannot defer the choice of access plan to run time; it must choose at compile time. At compile time, however, the value of the variable is still unknown, so it cannot be used as an input for index selection. The following statement will perform a full table scan:
SELECT id FROM t WHERE num = @num
You can force the query to use the index instead:
SELECT id FROM t WITH (INDEX(index_name)) WHERE num = @num
7. Avoid expressions on fields in the WHERE clause where possible, as they cause the engine to abandon the index and perform a full table scan. For example:
SELECT * FROM T1 WHERE F1/2 = 100
should read:
SELECT * FROM T1 WHERE F1 = 100*2
SELECT * FROM RECORD WHERE SUBSTRING(CARD_NO,1,4) = '5378'
should read:
SELECT * FROM RECORD WHERE CARD_NO LIKE '5378%'
SELECT member_number, first_name, last_name FROM
WHERE DATEDIFF(yy, dateOfBirth, GETDATE()) > 21
should read:
SELECT member_number, first_name, last_name FROM
WHERE dateOfBirth < DATEADD(yy, -21, GETDATE())
In short, any operation on a column results in a table scan; this includes database functions, computed expressions and so on. Move the operation to the right of the equals sign whenever possible.
8. Avoid applying functions to fields in the WHERE clause, as this causes the engine to abandon the index and perform a full table scan. For example:
SELECT id FROM t WHERE SUBSTRING(name,1,3) = 'abc' -- ids whose name starts with abc
SELECT id FROM t WHERE DATEDIFF(day, createdate, '2005-11-30') = 0 -- ids created on 2005-11-30
should read:
SELECT id FROM t WHERE name LIKE 'abc%'
SELECT id FROM t WHERE createdate >= '2005-11-30' AND createdate < '2005-12-1'
9. Do not apply functions, arithmetic or other expressions to the left of "=" in the WHERE clause; otherwise the system may not be able to use the index correctly.
10. When an indexed field is used as a condition and the index is a composite index, the first field of the index must appear in the condition for the system to use the index; otherwise the index will not be used. The order of the fields in the condition should also match the index order as far as possible.
11. In many cases EXISTS is a good choice in place of IN:
SELECT num FROM a WHERE num IN (SELECT num FROM b)
Replace it with the following statement:
SELECT num FROM a WHERE EXISTS (SELECT 1 FROM b WHERE num = a.num)
Likewise, compare:
SELECT SUM(T1.C1) FROM T1 WHERE (
(SELECT COUNT(*) FROM T2 WHERE T2.C2 = T1.C2) > 0)
with:
SELECT SUM(T1.C1) FROM T1 WHERE EXISTS (
SELECT * FROM T2 WHERE T2.C2 = T1.C2)
The two produce the same result, but the latter is clearly more efficient than the former, because it does not generate large numbers of locking table scans or index scans.
To check whether a record exists in a table, do not use COUNT(*); it is inefficient and wastes server resources. Use EXISTS instead. For example:
IF (SELECT COUNT(*) FROM table_name WHERE column_name = 'xxx') > 0
can be written as:
IF EXISTS (SELECT * FROM table_name WHERE column_name = 'xxx')
It is often necessary to write a T-SQL statement that compares a parent result set with a child result set to find records that exist in the parent but not in the child, for example:
SELECT a.hdr_key FROM hdr_tbl a -- "hdr_tbl a" gives hdr_tbl the alias a
WHERE NOT EXISTS (SELECT * FROM dtl_tbl b WHERE a.hdr_key = b.hdr_key)
SELECT a.hdr_key FROM hdr_tbl a
LEFT JOIN dtl_tbl b ON a.hdr_key = b.hdr_key WHERE b.hdr_key IS NULL
SELECT hdr_key FROM hdr_tbl
WHERE hdr_key NOT IN (SELECT hdr_key FROM dtl_tbl)
All three forms return the same correct result, but their efficiency decreases in the order shown.
12. Use table variables instead of temporary tables where possible. If the table variable holds a large amount of data, note that its indexes are very limited (only the primary key index).
13. Avoid creating and dropping temporary tables frequently, to reduce the consumption of system table resources.
14. Temporary tables are not forbidden; used appropriately, they can make some routines more efficient, for example when you need to reference a data set from a large table or a frequently used table repeatedly. For one-off events, however, a derived table is the better choice.
15. When creating a temporary table, if a large amount of data is inserted at once, use SELECT INTO instead of CREATE TABLE to avoid generating a large amount of log and so improve speed. If the amount of data is small, CREATE TABLE first and then INSERT, to ease the load on the system tables.
16. If temporary tables are used, be sure to delete them all explicitly at the end of the stored procedure: TRUNCATE TABLE first, then DROP TABLE. This avoids locking the system tables for a long time.
17. SET NOCOUNT ON at the beginning of every stored procedure and trigger, and SET NOCOUNT OFF at the end. There is no need to send a DONE_IN_PROC message to the client after each statement of a stored procedure or trigger.
18. Avoid large transactions where possible, to improve the system's concurrency.
19. Avoid returning large amounts of data to the client. If the amount of data is too large, consider whether the requirement itself is reasonable.
20. Avoid incompatible data types. For example, float and int, char and varchar, binary and varbinary are incompatible. Incompatible data types may prevent the optimizer from performing optimizations it could otherwise have done. For example:
SELECT name FROM employee WHERE salary > 60000
In this statement, if the salary field is of type money, it is hard for the optimizer to optimize, because 60000 is an integer. We should convert the integer to money when writing the program rather than leave the conversion to run time.
21. Make full use of join conditions. In some cases two tables can be joined on more than one condition; writing the join conditions out in full in the WHERE clause may then improve query speed considerably.
Example:
SELECT SUM(A.AMOUNT) FROM ACCOUNT A, CARD B WHERE A.CARD_NO = B.CARD_NO
SELECT SUM(A.AMOUNT) FROM ACCOUNT A, CARD B WHERE A.CARD_NO = B.CARD_NO AND A.ACCOUNT_NO = B.ACCOUNT_NO
The second statement will execute much faster than the first.
22. Use a view to speed up queries
Sorting a subset of a table and creating a view on it can sometimes speed up queries. It helps avoid repeated sort operations and also simplifies the optimizer's work in other ways. For example:
SELECT cust.name, rcvbles.balance, ......other columns
FROM cust, rcvbles
WHERE cust.customer_id = rcvbles.customer_id
AND rcvbles.balance > 0
AND cust.postcode > '98000'
ORDER BY cust.name
If this query is to be executed more than once, all customers with unpaid balances can be collected in a view, sorted by customer name:
CREATE VIEW dbo.V_cust_rcvlbes
AS
SELECT cust.name, rcvbles.balance, ......other columns
FROM cust, rcvbles
WHERE cust.customer_id = rcvbles.customer_id
AND rcvbles.balance > 0
ORDER BY cust.name
Then query the view as follows:
SELECT * FROM V_cust_rcvlbes
WHERE postcode > '98000'
The view contains fewer rows than the base table, and its physical order is the required order, which reduces disk I/O, so the query workload can be reduced substantially.
23. Use DISTINCT rather than GROUP BY where you can
SELECT OrderID FROM Details WHERE UnitPrice > 10 GROUP BY OrderID
can be changed to:
SELECT DISTINCT OrderID FROM Details WHERE UnitPrice > 10
24. Use UNION ALL rather than UNION where you can
UNION ALL does not perform the SELECT DISTINCT step, which saves a lot of unnecessary resources.
25. Try not to use the SELECT INTO statement.
A SELECT INTO statement locks the table and prevents other users from accessing it.
The points above are basic considerations for improving query speed, but in many cases you need to test and compare different statements to find the best one. The best approach is, of course, to test which of the SQL statements implementing the same function takes the least time. If the amount of data in the database is too small for a difference to show, look at the execution plan instead: put the SQL statements that implement the same function into Query Analyzer, press Ctrl+L, and look at which indexes are used and how many table scans occur (these two have the greatest impact on performance), and at each statement's cost as a percentage of the total query cost.
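Comparing equivalent statements side by side can be scripted. The sketch below is one possible harness, not a SQL Server tool: it uses Python's sqlite3 module, with SQLite's EXPLAIN QUERY PLAN standing in for the Query Analyzer plan view, and runs the NOT EXISTS, LEFT JOIN and NOT IN forms from item 11 against small hypothetical data. The table contents and the index name idx_dtl_hdr are assumptions for illustration. Note that the NOT IN form only agrees with the other two when the subquery column contains no NULL values, which the NOT NULL constraint guarantees here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hdr_tbl (hdr_key INTEGER PRIMARY KEY);
    CREATE TABLE dtl_tbl (dtl_key INTEGER PRIMARY KEY, hdr_key INTEGER NOT NULL);
    CREATE INDEX idx_dtl_hdr ON dtl_tbl (hdr_key);
""")
conn.executemany("INSERT INTO hdr_tbl VALUES (?)", [(i,) for i in range(1, 11)])
conn.executemany(
    "INSERT INTO dtl_tbl (hdr_key) VALUES (?)",
    [(i,) for i in (1, 2, 2, 5, 9)],
)

# Three statements that implement the same function.
candidates = {
    "NOT EXISTS": "SELECT a.hdr_key FROM hdr_tbl a WHERE NOT EXISTS "
                  "(SELECT 1 FROM dtl_tbl b WHERE a.hdr_key = b.hdr_key)",
    "LEFT JOIN": "SELECT a.hdr_key FROM hdr_tbl a LEFT JOIN dtl_tbl b "
                 "ON a.hdr_key = b.hdr_key WHERE b.hdr_key IS NULL",
    "NOT IN": "SELECT hdr_key FROM hdr_tbl WHERE hdr_key NOT IN "
              "(SELECT hdr_key FROM dtl_tbl)",
}

results = {}
for label, sql in candidates.items():
    # Collect the rows so the statements can be checked for equivalence.
    results[label] = sorted(row[0] for row in conn.execute(sql))
    # Show the plan steps so index use and scans can be compared by eye.
    steps = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]
    print(label, results[label], steps)
```

Checking that the result sets match before comparing plans or timings guards against "optimizing" a statement into one that returns different rows.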
III. Optimization of algorithms
Avoid cursors where possible, because cursors are inefficient; if a cursor operates on more than 10,000 rows, consider rewriting it. Before using a cursor-based or temporary-table-based method, first look for a set-based solution, which is usually more efficient. Like temporary tables, cursors are not forbidden. For small data sets, a FAST_FORWARD cursor is usually better than other row-by-row methods, especially when several tables must be referenced to get the required data. Routines that include "totals" in the result set are usually faster than the cursor equivalent. If development time allows, try both the cursor-based approach and the set-based approach and see which works better.
A cursor provides a way to scan a particular set step by step. It is typically used to traverse data row by row and to perform different operations depending on the values fetched. Looping over a cursor defined on multiple tables or on a large table (a large data set) can easily put the program into a long wait or even make it hang.
When a cursor really is necessary, consider first copying the qualifying rows into a temporary table and then defining the cursor on the temporary table; this can improve performance significantly.
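The contrast between row-by-row processing and a set-based statement can be sketched as follows. This is an illustration under assumptions, not T-SQL cursor code: Python's sqlite3 module stands in for SQL Server, a Python loop plays the role of the cursor, and the orders table and the discount rule are hypothetical.

```python
import sqlite3

def make_db():
    """Build a small orders table with amounts 10, 20, ..., 1000."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER, discount INTEGER)"
    )
    conn.executemany(
        "INSERT INTO orders (amount, discount) VALUES (?, 0)",
        [(i * 10,) for i in range(1, 101)],
    )
    return conn

def row_by_row(conn):
    # Cursor style: fetch every row, decide in application code,
    # then issue one UPDATE per row.
    for order_id, amount in conn.execute("SELECT id, amount FROM orders").fetchall():
        discount = amount // 10 if amount >= 500 else 0
        conn.execute(
            "UPDATE orders SET discount = ? WHERE id = ?", (discount, order_id)
        )

def set_based(conn):
    # Set style: a single statement applies the same rule to all rows.
    conn.execute(
        "UPDATE orders SET discount = "
        "CASE WHEN amount >= 500 THEN amount / 10 ELSE 0 END"
    )

a, b = make_db(), make_db()
row_by_row(a)
set_based(b)

query = "SELECT id, discount FROM orders ORDER BY id"
same = a.execute(query).fetchall() == b.execute(query).fetchall()
total = b.execute("SELECT SUM(discount) FROM orders").fetchone()[0]
print(same, total)
```

Both versions leave the table in the same state; the set-based one does it in one statement instead of one statement per row, which is the property that usually makes it faster on large tables.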
(Example: the first version of the internal statistics system)
Encapsulating stored procedures
IV. Establishing efficient indexes
Creating an index typically serves two purposes: enforcing the uniqueness of the indexed columns and providing fast access to the data in the table. Large databases have two kinds of index, clustered and nonclustered. A table without a clustered index stores its data in a heap, and all new data is appended to the end of the table. A table with a clustered index stores its data physically in the order of the clustering key, and only one clustered index is allowed per table. From the B-tree structure it follows that adding any index speeds up queries on the indexed columns but slows down insert, update and delete operations, especially when the fill factor is large. For tables with many indexes and frequent inserts, updates and deletes, therefore, set a smaller fill factor on the table and its indexes so that more free space is left in the data pages, which reduces page splits and reorganization work.
An index is one of the most efficient ways to get data from a database. 95% of database performance problems can be solved with indexing. As a rule, I usually use a unique clustered index on the logical primary key, a unique nonclustered index on the system key (as used by stored procedures), and a nonclustered index on any foreign key column. However, indexes are like salt: too much and the dish is ruined. You have to consider how large the database is, how the tables are accessed, and whether they are mainly read or mainly written.
In fact, you can think of an index as a special kind of directory. Microsoft SQL Server provides two kinds of index: the clustered index and the nonclustered index. The following example illustrates the difference between them:
The body of a Chinese dictionary is itself a clustered index. For example, to look up the character "an", we naturally open the first few pages of the dictionary, because a dictionary ordered by pinyin starts with the letter "a" and ends with "z", so "an" falls near the front. If you go through the whole "a" section and still cannot find the character, it is not in your dictionary. Similarly, to look up the character "zhang", you turn to the last part of the dictionary, because its pinyin is "zhang". In other words, the body of the dictionary is itself a directory, and you do not need to consult any other directory to find what you are looking for.
We call body text of this kind, which is itself arranged according to a fixed rule, a "clustered index".
If you know a character, you can find it quickly by its pronunciation. But you may also meet a character you do not recognize and whose pronunciation you do not know. Then you cannot use the method above; instead you look the character up by its radical and turn directly to the page number listed after it. The order of the characters found through the radical catalog and the character lookup table is not the order of the body text, however. For example, when we look up "zhang" in the lookup table under its radical, its page number is 672; the character above it in the table is "chi", on page 63, and the character below it is "nu" (crossbow), on page 390. Clearly these characters are not located just above and below "zhang" in the body. The consecutive "chi, zhang, nu" that you see is their order in the nonclustered index, which is a mapping onto the characters in the body of the dictionary. We can find the character we need this way, but it takes two steps: first find the entry in the catalog, then turn to the page number it gives.
We call this kind of ordering, in which the catalog is purely a catalog and the body is purely the body, a "nonclustered index".
From this it is easy to see that each table can have only one clustered index, because the body text can be sorted in only one way.
(i) When to use clustered or nonclustered indexes
The following table summarizes when to use clustered or nonclustered indexes (very important).
Action description                          Use a clustered index    Use a nonclustered index
Column is often grouped and sorted          should                   should
Returning data within a range               should                   should not
One or very few distinct values             should not               should not
A small number of distinct values           should                   should not
A large number of distinct values           should not               should
Frequently updated column                   should not               should
Foreign key column                          should                   should
Primary key column                          should                   should
Frequently modified index column            should not               should
We can understand the table above using the earlier definitions of clustered and nonclustered indexes. Take returning a range of data as an example. If a table has a time column and the clustered index is built on that column, a query for all the data between January 1, 2004 and October 1, 2004 will be very fast, because the body of your dictionary is sorted by date: the clustered index only has to locate the first and last rows of the range. A nonclustered index, by contrast, must first look up the page number of each item in the catalog and then fetch the content from that page.
(ii) Misconceptions about index use, discussed with practical examples
The purpose of theory is application. Although we have just listed when a clustered or nonclustered index should be used, in practice these rules are easily overlooked or are not applied with the actual situation in mind. Below we discuss misconceptions about index use, based on problems met in practice, to make the approach to building indexes easier to grasp.
1. The primary key is the clustered index
This idea is quite wrong and wastes the clustered index, even though SQL Server creates a clustered index on the primary key by default.
Typically we create an ID column in each table to distinguish each row; this ID column auto-increments, usually with a step of 1. The Gid column in our office automation example is such a column. If we make this column the primary key, SQL Server makes it the clustered index by default. The benefit is that your data is physically sorted by ID in the database, but I do not think that means much.
The advantages of a clustered index are obvious, and the rule that each table can have only one makes it all the more valuable.
From the definition of the clustered index given earlier, its biggest benefit is that it can quickly narrow the query range according to the query requirements and so avoid a full table scan. In practice, because the ID number is generated automatically, we do not know the ID number of each record, so we rarely query by ID number. This makes a clustered index on the ID primary key a waste of resources. Second, building the clustered index on a field whose ID values are all different does not conform to the rule that a clustered index should not be built on a column with a large number of distinct values. Admittedly, this only has a negative effect when users frequently modify records, especially indexed columns; it has no effect on query speed.
In the office automation system, whether the home page is showing the documents and meetings awaiting the user's sign-off or the user is searching documents, every data query depends on two fields: the date and the user's own user name.
Usually the home page of an office automation system shows each user the documents or meetings they have not yet signed off. Although our WHERE clause can restrict the query to items the current user has not signed off, if your system has been running for a long time and holds a lot of data, a full table scan every time a user opens the home page makes little sense. The vast majority of users have already read the documents from a month ago, so doing so only increases the load on the database. Instead, when a user opens the home page, the database can query only that user's unread documents from the last 3 months, using the date field to limit the table scan and improve query speed. If your office automation system has been running for 2 years, your home page will in theory display 8 times faster, or even more.
2. Building any index will significantly improve query speed
In fact, in the example above, statements 2 and 3 are exactly the same, and the indexed field is also the same. The only difference is that the former uses a nonclustered index on the fariqi field and the latter a clustered index on it, yet the query speeds are worlds apart. So simply building an index on any field does not necessarily improve query speed.
From the statement used to build the table, we can see that the fariqi field has 5,003 distinct values in a table of 10 million rows. A clustered index on this field is very appropriate. In practice, we issue several documents every day, and these documents share the same issue date, which matches the rule for building a clustered index exactly: the values should be neither mostly identical nor identical in only very few cases. It is therefore important to build an "appropriate" clustered index to improve query speed.
3. Adding every field whose queries need to be faster to the clustered index will improve query speed
As mentioned above, data queries always depend on two fields: the date and the user's own user name. Since both fields are so important, we can combine them into a composite index.
Many people think that adding any field to the clustered index will speed up queries; others wonder whether queries on the individual fields of a composite clustered index will be slower. With this question in mind, let us look at the following query speeds (the result set is 250,000 rows). The date column fariqi comes first in the composite clustered index, and the user name column neibuyonghu comes after it.
We can see that using only the leading column of the clustered index as the query condition is almost as fast as using all the columns of the composite clustered index, and can even be slightly faster (for the same number of result rows). If only a non-leading column of the composite clustered index is used as the query condition, the index has no effect at all. Statements 1 and 2 run at the same speed because they return the same number of rows. If all the columns of the composite index are used and the result set is small, an "index cover" is formed and performance is optimal. Also keep in mind that, however often you use the other columns of the clustered index, the leading column must be the most frequently used one.
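The leading-column behavior can be reproduced in miniature. The sketch below is an analogy rather than SQL Server itself: it uses Python's sqlite3 module, where a WITHOUT ROWID table is stored in primary key order and so stands in for a composite clustered index on (fariqi, neibuyonghu). The table name documents, the title column, the inserted data, and the extra gid key column (added only to keep the key unique) are assumptions for illustration.

```python
import sqlite3

def plan(conn, sql):
    """Return the planner's readable description of how it will run sql."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

conn = sqlite3.connect(":memory:")
# Rows are physically ordered by (fariqi, neibuyonghu, gid).
conn.execute("""
    CREATE TABLE documents (
        fariqi      TEXT NOT NULL,
        neibuyonghu TEXT NOT NULL,
        gid         INTEGER NOT NULL,
        title       TEXT,
        PRIMARY KEY (fariqi, neibuyonghu, gid)
    ) WITHOUT ROWID
""")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?)",
    [("2004-%02d-01" % (i % 12 + 1), "user%d" % (i % 7), i, "doc")
     for i in range(500)],
)

# Condition on the leading column only.
leading = plan(conn, "SELECT gid, title FROM documents WHERE fariqi > '2004-06-01'")
# Condition on all columns of the composite key that the query mentions.
both = plan(conn, "SELECT gid, title FROM documents "
                  "WHERE fariqi > '2004-06-01' AND neibuyonghu = 'user1'")
# Condition on the non-leading column only.
trailing = plan(conn, "SELECT gid, title FROM documents WHERE neibuyonghu = 'user1'")

print("leading column: ", leading)
print("both columns:   ", both)
print("trailing column:", trailing)
```

The first two plans report a search on the key, while the query that filters only on the non-leading column falls back to scanning the whole table, matching the observation above.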
(iii) Other precautions
"Water can carry a boat, and it can also overturn it." Indexes are the same. Indexing can help improve retrieval performance, but too many or improper indexes make the system inefficient, because every index a user adds to a table gives the database more work to do. Too many indexes can even cause index fragmentation.
So we need to build an "appropriate" index system, and take particular care over the clustered index, so that your database performs at its best.