[Turn] analysis on the database optimization of large data volume and high concurrency

Source: Internet
Author: User

Links: http://www.uml.org.cn/sjjm/201308264.asp

High concurrency database can handle large amount of information at the same time, the application scope is very wide. Today we will discuss the large data volume and high concurrency of the database optimization, we hope to help.

First, the design of database structure

If not design a reasonable database model, not only will increase the client and server segment program programming and maintenance difficulty, and will affect the actual performance of the system. Therefore, the design of a complete database model is necessary before a system starts to implement.

In a system analysis, design phase, because the data volume is small, the load is low. We often only notice the realization of the function, and it is difficult to notice the weakness of performance, and until the system put into actual operation for a period of time, only to find that the performance of the system is reduced, then to consider improving the performance of the system will cost more manpower and resources, and the entire system is inevitably formed a patch project.

So when considering the process of the whole system, we have to consider that in the case of high concurrency and large data volume access, our system will not have the extreme situation. (for example, the data anomaly occurred on the external statistics system on July 16, the response time of the database cannot keep up with the speed of data refresh due to the access of the large data volume.) In the case of a date critical (00:00:00), determine if there is a record of the current date in the database, and a record of the current date is not inserted. In the case of low concurrent access, the problem will not occur, but when the date is critical when the traffic is quite large, when making this judgment, there will be multiple conditions, the database will be inserted in a number of current date records, resulting in data errors. ), after the database model is determined, it is necessary to do a data flow diagram within the system to analyze the possible bottlenecks.

In order to ensure the consistency and integrity of the database, it is often designed to design too many inter-table associations, as far as possible to reduce the data redundancy. (for example, the area of the user table, we can put the region into another regional table) if the data redundancy is low, the data integrity is easy to be guaranteed, the data throughput speed is improved, the data integrity is ensured, and the relationship between the data elements is clearly expressed. In the case of multi-table correlation query (especially big data table), its performance will be reduced, but also improve the programming difficulty of the client program, therefore, the physical design needs to compromise, according to business rules, determine the data size of the associated table, the frequency of access to data items, This kind of data table frequently correlated query should improve the data redundancy design, but increase the operation of inter-table connection query, also make the program become complex, in order to improve the response time of the system, reasonable data redundancy is also necessary. Designers should consider the design phase according to the type and frequency of the system operation.

Also, it is best not to use the self-increment attribute field as the primary key associated with the child table. System migration and data recovery are not facilitated. External statistical system mapping relationship lost (******************).

The original table must be rebuilt from a table separated by it. The advantage of using this rule is that you can ensure that no extra columns are introduced in the detached table, and that all of the table structures you create are as large as their actual needs. Applying this rule is a good habit, but unless you're dealing with a very large piece of data, you won't need it. (for example, a pass system, I can userid,username,userpassword, alone to make a table, and then the USERID as a foreign key to other tables)

Table design specific attention to the problem:

1, the length of the data line should not exceed 8020 bytes, if more than this length, the data in the physical page will occupy two lines resulting in storage fragmentation, reduce query efficiency.

2. The ability to select numeric types instead of string types (phone numbers) with fields of numeric type, which reduces the performance of queries and connections, and increases storage overhead. This is because the engine is processing the query and connecting back to each character in a string by comparison, and for a digital type it only needs to be compared once.

3, for the immutable character type char and the variable character type varchar are 8000 bytes, char query fast, but consumes storage space, varchar query is relatively slow but save storage space. In the design of the field can be flexible choice, such as user name, password, such as the length of the field can be selected char, for the comment, such as the length of a large variety of fields can choose varchar.

4, the length of the field in the maximum possible to meet the needs of the premise, should be set as short as possible, so as to improve the efficiency of the query, but also in the establishment of the index can reduce the consumption of resources.

Second, the optimization of the query

Ensure that on the basis of the implementation of the function, minimize the number of access to the database, through the search parameters, minimize the number of access to the table, minimize the result set, thereby reducing the network burden, can separate the operation as far as possible separate processing, improve each response speed; When using SQL in the Data window, Try to put the index used in the first column of the selection; the structure of the algorithm is as simple as possible; when querying, do not use wildcards such as the SELECT * from T1 statement, use a few columns to select a few columns such as: Select Col1,col2 from T1 , as far as possible, limit the number of result sets as possible: Select TOP col1,col2,col3 from T1, because in some cases the user does not need that much data.

In the absence of an index, the database looks for a single piece of data, it has to do a full table scan, all data is traversed once, to find the matching records. In the case of a small amount of data, there may not be a noticeable difference, but when the amount of data is large, the situation is extremely bad.

How SQL statements are executed in SQL Server, and they worry that the SQL statements they write are misunderstood by SQL Server. Like what:

and execution:

Some people do not know whether the execution efficiency of the above two statements is the same, because if it is simple from the statement, the two statements are indeed different, if the TID is an aggregate index, then the last sentence only from the table of 10,000 records after the row , and the previous sentence to look at the whole table to see a few name= ' Zhangsan ', and then based on the conditions of the constraints tid> 10000来 query results.

In fact, such worries are unnecessary. There is a query analysis optimizer in SQL Server that calculates the search criteria in the WHERE clause and determines which index narrows the search space for table scans, which means that it can be automatically optimized. Although the query optimizer can automate query optimization based on the WHERE clause, sometimes the query optimizer does not query quickly as you intended.

During the query analysis phase, the query optimizer looks at each stage of the query and decides whether it is useful to limit the amount of data that needs to be scanned. If a stage can be used as a scanning parameter (SARG), then it is called an optimization, and the index can be used to quickly obtain the required data.

Sarg definition: Used to limit the search to an operation because it usually refers to a specific match, a value within a range of matches, or a connection of more than two conditions. The form is as follows:

Column name operators < constants or variables > or < constants or variables > operator column names

Column names can appear on one side of the operator, while constants or variables appear on the other side of the operator. Such as:

Name= ' Zhang San '

Price >5000

5000< Price

Name= ' Zhang San ' and price >5000

If an expression does not meet the form of sarg, it cannot limit the scope of the search, which means that SQL Server must determine for each row whether it satisfies all the conditions in the WHERE clause. So an index is useless for an expression that does not satisfy the Sarg form.

Therefore, the most important thing in optimizing queries is to try to make the statements conform to the query optimizer's rules to avoid full table scans and use index queries.

Specific to note:

1. Avoid null-valued fields in the WHERE clause, which will cause the engine to discard full-table scans using the index, such as:

You can set the default value of 0 on NUM, make sure that the NUM column in the table does not have a null value, and then query:

2. Try to avoid using the! = or <> operator in the WHERE clause, or discard the engine for a full table scan using the index. The optimizer will not be able to determine the number of rows to be fatal by the index, so it needs to search all rows of that table.

3. You should try to avoid using or in the WHERE clause to join the condition, otherwise it will cause the engine to abandon using the index for a full table scan, such as:

You can query this:

Select ID from t where num=10  

Select ID from t where num=20

4.in and not in are also used sparingly, because in makes the system unusable with indexes and can only search the data in the table directly. Such as:

Select ID from t where num in

For consecutive values, you can use between instead of in:

Select ID from t where num between 1 and 3

5. Try to avoid searching in indexed character data using non-heading letters. This also makes the engine unusable with indexes.

See the following example:

Select * from T1 Where NAME like '%l% '  
Select * from T1 Where substing (name,2,1) = ' L '
Select * from T1 Where NAME like ' l% '

Even though the name field is indexed, the first two queries are still unable to take advantage of the indexing to speed up the operation, and the engine has to perform the task by one-by-one operations on all tables. The third query can use an index to speed up operations.

6. Forcing the query optimizer to use an index if necessary, such as using parameters in the WHERE clause, can also cause a full table scan. Because SQL resolves local variables only at run time, the optimizer cannot defer the selection of access plans to run time; it must be selected at compile time. However, if an access plan is established at compile time, the value of the variable is still unknown and therefore cannot be selected as an input for the index. The following statement will perform a full table scan: You can force the query to use the index instead:

Select ID from T with (index name) where [email protected]

7. You should try to avoid expression operations on the fields in the WHERE clause, which will cause the engine to discard the full table scan using the index. Such as:

Select * from T1 Where f1/2=100

should read:

Select * from T1 Where f1=100*2  

should read:

Select * from RECORD Where card_no like ' 5,378% '  
Select Member_number, first_name, last_name from members

should read:

Select Member_number, first_name, last_name from members  
Where dateOfBirth < DATEADD (Yy,-21,getdate ())

That is, any action on a column causes a table scan, which includes database functions, calculation expressions, and so on, to move the operation to the right of the equals sign whenever possible.

8. You should try to avoid function operations on the fields in the WHERE clause, which will cause the engine to discard the full table scan using the index. Such as:

Select ID from t where substring (name,1,3) = ' abc '--name ID starting with ABC  

should read:


Select ID from t where createdate>= ' 2005-11-30 ' and createdate< ' 2005-12-1 '

9. Do not perform functions, arithmetic operations, or other expression operations on the left side of "=" in the WHERE clause, or the index may not be used correctly by the system.

10. When using an indexed field as a condition, if the index is a composite index, you must use the first field in the index as a condition to guarantee that the system uses the index, otherwise the index will not be used, and the field order should be consistent with the index order as much as possible.

11. Many times it is a good choice to use exists:

Elect num from a where num in (select num from B)

Replace with the following statement:

Select num from a where exists (select 1 from b where num=a.num)  
Select SUM (T1. C1) from T1 Where (
(Select COUNT (*) from T2 Where t2.c2=t1.c2>0)
Select SUM (T1. C1) from T1where EXISTS (
Select * from T2 Where T2. C2=t1. C2)

Both produce the same result, but the latter is obviously more efficient than the former. Because the latter does not produce a large number of locked table scans or index scans.

If you want to verify that there is a record in the table, do not use COUNT (*) as inefficient and waste server resources. Can be replaced with exists. Such as:

Can be written as:

It is often necessary to write a t_sql statement that compares a parent result set and a child result set to find out if there are records in the parent result set that are not in the child result set, such as:

Select A.hdr_key from Hdr_tbl a----tbl a means that TBL uses alias a instead  
Where not EXISTS (Select * from dtl_tbl b Where a.hdr_key = B.hdr_key)
Select A.hdr_key from Hdr_tbl a

Select Hdr_key from Hdr_tbl

Three kinds of writing can get the same correct results, but the efficiency is reduced in turn.

12. Try to use table variables instead of temporary tables. If the table variable contains a large amount of data, be aware that the index is very limited (only the primary key index).

13. Avoid frequent creation and deletion of temporary tables to reduce the consumption of system table resources.

14. Temporary tables are not unusable, and they can be used appropriately to make certain routines more efficient, for example, when you need to repeatedly reference a dataset in a large table or a common table. However, for one-time events, it is best to use an export table.

15. When creating a temporary table, if you insert a large amount of data at one time, you can use SELECT INTO instead of CREATE table to avoid causing a large number of logs to increase speed, and if the amount of data is small, create table to mitigate the resources of the system tables. Then insert.

16. If a temporary table is used, be sure to explicitly delete all temporary tables at the end of the stored procedure, TRUNCATE table first, and then drop table, which avoids longer locking of the system tables.

17. Set NOCOUNT on at the beginning of all stored procedures and triggers, set NOCOUNT OFF at the end. You do not need to send a DONE_IN_PROC message to the client after each statement that executes the stored procedure and trigger.

18. Try to avoid large transaction operation and improve the system concurrency ability.

19. Try to avoid the return of large data to the client, if the amount of data is too large, should consider whether the corresponding demand is reasonable.

20. Avoid using incompatible data types. For example, float and int, char and varchar, binary, and varbinary are incompatible. Incompatible data types may make the optimizer unable to perform some optimizations that could otherwise have been performed. For example:

Select name from employee Where salary > 60000

In this statement, such as the salary field is a money type, it is difficult for the optimizer to optimize it because 60000 is an integer number. We should convert an integer into a coin type when programming, rather than wait for a run-time conversion.

21. Make full use of the connection conditions, in some cases, there may be more than one connection between the two tables, at this point in the WHERE clause in the connection condition complete write, it is possible to greatly improve the query speed.

Cases:

Select SUM (A.amount) from account a,card B Where a.card_no = b.card_no  

The second sentence will be much faster than the first sentence.

22. Use the view to speed up the query
Sorting a subset of tables and creating views can sometimes speed up queries. It helps to avoid multiple sorting operations, and in other ways simplifies the work of the optimizer. For example:

Select cust.name,rcvbles.balance,......other Columns  
From Cust,rcvbles
Where cust.customer_id = rcvlbes.customer_id
and rcvblls.balance>0
and cust.postcode> "98000"
OrDER by Cust.name

If the query is to be executed more than once, all unpaid customers can be found in a single view and sorted by the customer's name:

Create VIEW DBO. V_cust_rcvlbes  

Select cust.name,rcvbles.balance,......other Columns
From Cust,rcvbles
Where cust.customer_id = rcvlbes.customer_id
and rcvblls.balance>0

Then query in the view in the following way:

Select * from V_cust_rcvlbes  
Where postcode> "98000"

The number of rows in the view is less than the rows in the primary table, and the physical order is the required order, reducing disk I/O, so the query effort can be significantly reduced.

23, you can use distinct without GROUP by

Can be changed to:

24. Use UNION ALL to not use Union

UNION all does not execute the SELECT DISTINCT function, which reduces a lot of unnecessary resources

35. Try not to use the SELECT INTO statement.

The Select inot statement causes the table to lock and prevent other users from accessing the table.

above we mentioned some basic considerations for improving query speed, but in more cases it is often necessary to experiment with different statements to get the best solution. The best way of course is to test, see the implementation of the same function of the SQL statement which execution time is the least, but the database if the amount of data is not comparable, then you can use to view the execution plan, that is: the implementation of the same function of multiple SQL statements to the Query Analyzer, according to Ctrl+l to look at the index used, The number of table scans (both of which have the greatest impact on performance) and the overall cost percentage to be consulted.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.