Common SQL statement writing skills

Source: javaeye blog

Rewriting SQL statements is the most practical and effective way to improve database query performance; sometimes it is the only way, for example when you cannot substantially modify the existing database structure.
The keys to improving query performance by optimizing SQL statements are:
- Establish appropriate indexes based on actual needs;
- Make good use of indexes in every possible way to avoid full table scans;
- Minimize memory and data I/O overhead.

1. Create an index
(1) Creating "appropriate" indexes is the basis of fast queries
An index is a user-defined data structure stored on physical media separately from the table. When you search for data by the value of the indexed column, the index provides quick access to it. The database can, in fact, retrieve the results of a SELECT statement without any index, but as the table grows an "appropriate" index becomes more and more effective. Note the word "appropriate": an index chosen without serious thought about how it will actually be used can damage database performance as easily as improve it.
An index is essentially a special kind of directory. SQL Server provides two types of indexes:
- Clustered index (also called a clustering index)
A body of text whose content is itself arranged according to a rule works like a "clustered index".
For example:
To look up a character in a Chinese dictionary by pinyin, you do not need the directory at all: the body of the dictionary is itself sorted by pronunciation, so you can go straight to the character you want (assuming you know its pronunciation).
- Non-clustered index (nonclustered index)
A directory that is separate from the body, connected to it only through page numbers, works like a "non-clustered index".
For example:
To look up a character by its radical, you first find it in the radical directory and then turn to the page number it gives; the order of the directory and the order of the body are not the same.

The biggest difference between a clustered index and a non-clustered index is that a clustered index physically sorts the data table itself, so once the index condition is matched, the qualifying rows can be read directly and consecutively. A non-clustered index only builds a separate index structure: after an entry is found in the index, the engine must still follow a pointer to read the corresponding data row. In terms of performance, therefore, a clustered index is much better than a non-clustered one.
However, a table may have only one clustered index, which makes it especially valuable, so it should be given to the column that is queried most frequently. In addition, a query can use the index only if it references the leading (first) column of the index; otherwise the index has no effect. That leading column must therefore also be one that is used frequently, so that the index performs at its best.
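For reference, the T-SQL for creating each kind looks like the following; the table, column, and index names here are placeholders for illustration:

-- a table may have at most one clustered index: the rows themselves are sorted by its key
create clustered index ix_orders_order_date on orders (order_date);
-- a table may have many non-clustered indexes: each is a separate structure pointing back to the rows
create nonclustered index ix_orders_customer_id on orders (customer_id);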

(2) When to use a clustered index versus a non-clustered index

Action description                          Use clustered index    Use non-clustered index
Column is frequently grouped or sorted      Yes                    Yes
Returns data within a certain range         Yes                    No
One or very few distinct values             No                     No
Small number of distinct values             Yes                    No
Large number of distinct values             No                     Yes
Frequently updated column                   No                     Yes
Foreign key column                          Yes                    Yes
Primary key column                          Yes                    Yes
Frequently modified index column            No                     Yes

(3) Notes for indexing
1. Do not waste the clustered index on the primary key unless you only query by primary key
Although SQL Server creates a clustered index on the primary key by default, this is usually a waste in practice. We generally add an ID column to each table to distinguish rows; the ID is an identity column, typically incremented by 1. If we make this column the primary key, SQL Server makes it the clustered index by default. The benefit is that the data is physically sorted by ID in the database, but that has little practical value.
From the definition of a clustered index given earlier, its biggest advantage is that it can quickly narrow the query range as the query requires and avoid a full table scan. In practice, ID values are generated automatically and we do not know the ID of each record, so we rarely query by ID, which makes a clustered index on the ID primary key a wasted resource. A clustered index has obvious advantages over a non-clustered one, and since each table may have only one, it is all the more valuable and should go to a field with a high query frequency. Furthermore, every ID value is different, so clustering on the ID column also violates the guideline "do not build a clustered index on a column with a large number of distinct values". The only case this arrangement helps is when users frequently modify record content, especially indexed columns: the ID never changes, so such modifications do not disturb the index, and query speed is unaffected. A minimal sketch of the recommended layout follows.
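This sketch uses the tgongwen table from the experiments later in this article; the column types and the index name are assumptions made for the example:

create table tgongwen (
    gid int identity(1,1) not null primary key nonclustered,  -- identity primary key, deliberately not clustered
    fariqi datetime not null,    -- document date, the most frequently queried column
    reader varchar(100),
    title varchar(200)
);
-- spend the single clustered index on the frequently queried date column instead
create clustered index ix_tgongwen_fariqi on tgongwen (fariqi);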

2. Index creation should be based on actual application requirements
Simply creating an index on some field does not necessarily increase query speed. A rough rule for building a clustered index is: "the values should be neither mostly identical nor almost all distinct". For example, it is appropriate to build a clustered index on the date field of a document table: an office system receives a batch of documents every day, and those documents carry the same issue date, so the performance gain from clustering on the document date should be considerable. Under a clustered index, the data is physically stored on the data pages in order and duplicate values sit together, so a range search can first locate the start and end points of the range and then scan only the data pages in between, avoiding a large-scale scan and improving query speed.
The opposite example: the "gender" column of an employee table has only two distinct values, "male" and "female", so there is no point in indexing it.

3. Add all fields whose query speed needs improving to the clustered index, forming a composite index
From the results of some experiments, we can draw a few conclusions for reference (see the sketch after this list):
- Using only the leading column of a composite clustered index as the query condition is nearly as fast as using all of its columns, and can even be slightly faster (when the result sets are the same size);
- If only a non-leading column of a composite clustered index is used as the query condition, the index has no effect at all;
- If all columns of the composite clustered index are used and the query conditions are strict enough that few rows qualify, "index coverage" is formed and performance is optimal;
- Most importantly, the leading column of a composite clustered index must be the most frequently used column.
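A minimal sketch of the leading-column rule, borrowing the record table and the (date, place, amount) composite index discussed in section 2.9 below; the index name and literal values are assumptions:

create clustered index ix_record_date_place_amount on record (date, place, amount);
-- can use the index: the leading column (date) appears in the condition
select count(*) from record where date > '19991201' and place = 'BJ';
-- cannot seek the index: only non-leading columns appear in the condition
select count(*) from record where place = 'BJ' and amount > 2000;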

4. Some other practical experience:
- Querying with a clustered index is faster than querying with a non-clustered primary key;
- Using a clustered index column in ORDER BY is faster than using an ordinary primary key, especially with small data volumes;
- When searching a time period on a clustered index column, the search time shrinks in proportion to the share of the whole table that the period's data represents, regardless of how many rows the query returns;
- A date column is not slowed down just because its values include minutes and seconds;
- Changing the content of a table changes its indexes; frequent INSERT, UPDATE, and DELETE statements force the system to update indexes at a high cost and reduce overall performance. In general, use indexes when query performance matters more than data-maintenance performance; otherwise weigh the price carefully. In extreme cases, you can drop the index first, apply the bulk of the changes to the table, and then rebuild the index. A sketch of this pattern follows.
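This sketch reuses the record table and index from the earlier example; both names are assumptions:

-- drop the index so the bulk changes do not pay the index-maintenance cost row by row
drop index ix_record_date_place_amount on record;
-- ... perform the bulk INSERT / UPDATE / DELETE here ...
-- rebuild the index once, after the data is in place
create clustered index ix_record_date_place_amount on record (date, place, amount);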

2. Write optimized SQL statements that make full use of indexes
This section describes issues to watch for when writing the WHERE clause of an SQL statement. In some WHERE clauses, even though the columns involved have indexes, poorly written SQL prevents the system from using those indexes at run time and forces a full table scan instead, which greatly reduces response speed.
Before an SQL statement is executed, it goes through a query-analysis phase: SQL Server's built-in query optimizer analyzes each part of the query condition and determines whether it qualifies as a search argument (SARG). Only a condition that meets the SARG standard can use a prepared index to improve query performance.
Definition of SARG: a condition used to restrict a search. It usually refers to an exact match, a match within a definite range, or an AND connection of two or more such conditions. The general format is:
<column name> <operator> <constant or variable>
or
<constant or variable> <operator> <column name>
The column name appears on one side of the operator and the constant or variable on the other. For example:
name = 'zhang san'
price > 5000
5000 < price
name = 'zhang san' and price > 5000
If an expression does not take the SARG form, it cannot limit the search range: SQL Server must check every row against all the conditions in the WHERE clause, that is, scan the whole table. An index is therefore useless to an expression that does not meet the SARG format. For example, when the query condition is "price * 2 > 5000", the index created on the price field cannot be used.
SQL Server's built-in query optimizer can automatically convert some conditions into SARG-compliant form, for example converting "price * 2 > 5000" into "price > 5000 / 2" so that an index can be used. But this conversion is not 100% reliable: sometimes meaning would be lost, and sometimes the conversion is simply impossible. If you are not familiar with how the query optimizer works, the SQL you write may not be executed the way you intended. So do not rely solely on the optimizer: use your own optimization knowledge to write SQL that explicitly follows the SARG standard wherever possible and determine the construction of the query conditions yourself. This helps the query analyzer find the best index match, and it also makes the code easier to read later.
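The two forms side by side; the product table name here is a placeholder for illustration:

-- non-SARG: the expression wraps the column, so the index on price cannot be used
select * from product where price * 2 > 5000;
-- SARG: the column stands alone on one side of the operator, so the index can be used
select * from product where price > 5000 / 2;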
Having introduced SARG, let us illustrate it further with some practical examples:
1. Whether a LIKE clause is a SARG depends on how the % wildcard is used
For example, name like 'zhang%' is a SARG,
while name like '%zhang' is not.
Using the % wildcard as the first character of the string makes the index unusable. Although this is sometimes hard to avoid in practice, you should at least be aware of the phenomenon and know that such usage performs poorly.
2. The "Non" operator does not meet the Sarg format, making indexes unusable.
The most typical case for statements that do not meet the requirements of the Sarg format is to include non-operator statements, such as: Not ,! =, <> ,! <,!> , Not exists, not in, not like, etc.
The following is an example of a not clause:
... Where not (status = 'valid ')
Not operators are also implicitly included in other logical operators, such as <> operators. See the following example:
... Where status <> 'invalid ';

Now look at the example below:
select * from employee where salary <> 3000;
This query can be rewritten so that NOT is not used:
select * from employee where salary < 3000 or salary > 3000;

Although the two queries return the same result, the second one is faster: it allows an index on the salary column to be used, while the first one does not.

3. Function operations do not follow the SARG format
For example, the columns in the following SQL conditions all have appropriate indexes, yet the statements run very slowly:
select * from record where substring(card_no, 1, 6) = '000000' (13 seconds)
select * from record where amount / 30 < 1000 (11 seconds)
select * from record where convert(char(10), date, 112) = '20171201' (10 seconds)
Analysis:
Any operation on a column in the WHERE clause is computed row by row at execution time, so the statement has to perform a full table scan instead of using the index on that column. If the expression can instead be evaluated once when the query is compiled, the SQL optimizer can use the index and avoid the table scan. So rewrite the statements so that the column stands alone on one side:
select * from record where card_no like '000000%' (< 1 second)
select * from record where amount < 1000 * 30 (< 1 second)
select * from record where date = '2017/12/01' (< 1 second)

You will find that the rewritten SQL is much faster.

4. Try not to operate directly on indexed fields
select * from employee where first_name + last_name = 'beill cliton';
The index cannot be used.

Changed to:
select * from employee
where first_name = substr('beill cliton', 1, instr('beill cliton', ' ') - 1)
and last_name = substr('beill cliton', instr('beill cliton', ' ') + 1)
Now the index can be used.
5. Indexes on different data types differ in efficiency; prefer the more efficient types where possible
For example, index searches on numeric types are more efficient than on string types, and index searches on fixed-length strings (char, nchar) are more efficient than on variable-length strings (varchar, nvarchar).
You should improve
where username = 'zhang san' and age > 20
to
where age > 20 and username = 'zhang san'
Note:

Here SQL Server's query analysis and optimization can rearrange the condition order automatically, but arranging it by hand in advance is still recommended.

6. Do not use IS NULL or IS NOT NULL as a query condition
A column containing null values will not be covered by the index, so if a column's data contains nulls, the performance benefit of indexing that column is doubtful, especially when NULL appears as part of the query condition. We recommend avoiding IS NULL and IS NOT NULL, and disallowing NULL in database fields: even when there is no content, use a default value or enter something explicitly, such as '' (an empty string).
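A minimal sketch of the NOT NULL plus empty-default approach; the table and column names are placeholders:

create table customer_note (
    id int identity(1,1) primary key,
    -- NULL is disallowed; an empty string stands in when there is no content
    note varchar(200) not null default ''
);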

7. In some cases IN is equivalent to OR, and neither can make full use of indexes
For example, the stuff table has 200,000 rows and id_no has a non-clustered index. See the following SQL:
select count(*) from stuff where id_no in ('0', '1') (23 seconds)

The IN in the WHERE clause is logically equivalent to OR, so the parser converts it into id_no = '0' or id_no = '1' for execution. We would expect it to look up the results for each OR branch separately, so that the index on id_no could be used. In fact, it adopts the "OR strategy": it first extracts the rows satisfying each OR branch into a worktable in the tempdb database, builds a unique index to remove duplicate rows, and finally computes the result from this temporary table. The index on id_no is therefore never used, and the completion time is also affected by the performance of tempdb.
Practice shows that the more rows the table has, the worse the worktable performs: when stuff reaches 620,000 rows, the execution time becomes very long. If the branches do not generate many duplicate values, it is better to split the OR clause:
select count(*) from stuff where id_no = '0'
select count(*) from stuff where id_no = '1'
and add the two results. Even joining them with UNION is worthwhile, because each statement uses its index and executes quickly:
select count(*) from stuff where id_no = '0'
union
select count(*) from stuff where id_no = '1'
Practical results show that UNION is usually much more efficient than OR. EXISTS and IN are similar in usage and performance, and both produce full table scans, which is inefficient; according to unverified reports, EXISTS may be faster than IN.
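If a single total is needed, the two indexed counts can also be added in one statement; a minimal sketch against the same stuff table:

select (select count(*) from stuff where id_no = '0')
     + (select count(*) from stuff where id_no = '1') as total;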

8. Use workarounds to improve query efficiency
The LIKE keyword supports wildcard matching, but such matching is particularly time-consuming. For example, select * from customer where zipcode like '21___' may use a full table scan even though an index exists on the zipcode field. If the statement is changed to select * from customer where zipcode > '21000', the query uses the index and the speed improves greatly. This kind of workaround has limits, though, and must not distort the business meaning; for postal codes, zipcode like '21___' and zipcode > '21000' return exactly the same results.
9. Use composite indexes efficiently
Assume a composite index has been created on the date, place, and amount fields.
select count(*) from record
where date > '19991201' and date < '19991214' and amount > 2000
(< 1 second)

select date, sum(amount) from record group by date
(11 seconds)

select count(*) from record
where date > '19990901' and place in ('BJ', 'SH')
(< 1 second)
This is a reasonable composite index: date is its leading column, so every statement above can use the index, and the first and third statements form index coverage, so their performance is optimal. If the index itself is hard to change, adjusting the condition order in the SQL to match the index order is also feasible.
10. ORDER BY on the clustered index column is the most efficient sort
Sorting is a time-consuming operation. You should try to simplify sorting or avoid sorting large tables, for example by narrowing the columns involved or sorting only on indexed columns.
Consider the following (gid is the primary key, and fariqi is the clustered index column):
select top 10000 gid, fariqi, reader, title from tgongwen
Time: 196 ms. Scan count 1, logical reads 289, physical reads 1, read-ahead reads 1527.
select top 10000 gid, fariqi, reader, title from tgongwen order by gid asc
Time: 4720 ms. Scan count 1, logical reads 41956, physical reads 0, read-ahead reads 1287.
select top 10000 gid, fariqi, reader, title from tgongwen order by gid desc
Time: 4736 ms. Scan count 1, logical reads 55350, physical reads 10, read-ahead reads 775.
select top 10000 gid, fariqi, reader, title from tgongwen order by fariqi asc
Time: 173 ms. Scan count 1, logical reads 290, physical reads 0, read-ahead reads 0.
select top 10000 gid, fariqi, reader, title from tgongwen order by fariqi desc
Time: 156 ms. Scan count 1, logical reads 289, physical reads 0, read-ahead reads 0.
From the above, an unsorted query and an "order by the clustered index column" query have comparable speed and logical read counts, and both are much faster than an "order by a non-clustered index column" query.
Also, when sorting by a given field, ascending and descending order are essentially equally fast.
3. Measures to reduce the system overhead of data queries
1. Use TOP to minimize the retrieved data volume
TOP is SQL Server's keyword for extracting the first N rows or a given percentage of rows.
select top 20 gid, fariqi, reader, title from tgongwen order by gid desc
select top 60 percent gid, fariqi, reader, title from tgongwen order by gid desc
In practical applications, you should routinely use TOP to cut off unnecessary data and keep only the necessary data set. This reduces the database's logical reads, avoids wasting memory, and improves system performance.

2. Field selection should follow the principle of "request only as much as you need" to avoid "select *"
For example:
select top 10000 gid, fariqi, reader, title from tgongwen order by gid desc
Time: 4673 ms
select top 10000 gid, fariqi, title from tgongwen order by gid desc
Time: 1376 ms
select top 10000 gid, fariqi from tgongwen order by gid desc
Time: 80 ms
From this, the larger the fields and the more of them there are, the more resources the SELECT consumes; fetching an int field, for instance, is much faster than fetching a char field. Each field we drop speeds up extraction accordingly, with the gain determined by the size of the discarded field.
3. Comparison between count(*) and count(field)
Look at some experimental examples (gid is the primary key of tgongwen):
select count(*) from tgongwen
Time: 1500 ms
select count(gid) from tgongwen
Time: 1483 ms
select count(fariqi) from tgongwen
Time: 3140 ms
select count(title) from tgongwen
Time: 52050 ms
From the above, count(*) and count(primary key) are equally fast, and both are faster than counting any field other than the primary key; the longer the field, the slower the count. With count(*), SQL Server automatically picks the smallest field to count on; of course, writing count(primary key) directly is even more explicit.
4. With nested queries, filter data in the inner layer as much as possible
If a column appears in both the main query and the WHERE clause of a subquery, then whenever the column's value changes in the main query, the subquery may have to be re-run. The more nesting levels there are, the lower the efficiency, so avoid subqueries where possible. If a subquery is unavoidable, filter out as many rows as possible inside it (see the sketch below).
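An equivalent pair illustrating the principle; the orders table is a placeholder for this sketch:

-- filtering outside: the derived table hands every row to the outer query first
select o.* from (select * from orders) o where o.amount > 1000;
-- filtering inside: far fewer rows flow out of the subquery
select o.* from (select * from orders where amount > 1000) o;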
5. In multi-table joins, pay attention to table order and filter data as early as possible
When using JOIN for multi-table queries, choose the plan with the least system overhead. The join conditions should take full account of which tables carry indexes and which have many rows, and attention should be paid to the order of the tables. Simply put, reduce the volume of associated data as early as possible.
In general, SQL Server optimizes the table join order automatically. For example:
select name, no
from A
join B on A.id = B.id
join C on C.id = A.id
where name = 'wang'
Although A is listed first in the FROM clause, then B, and finally C, SQL Server may still process table C first. Its selection principle is to start from the table whose conditions limit it to a single row or a few rows, so as to reduce the total amount of data to be searched in the other tables. In most cases SQL Server makes the best choice, but if you find a complex join slower than expected, you can use the SET FORCEPLAN statement to force SQL Server to use the tables in the order they appear: wrap the query between SET FORCEPLAN ON and SET FORCEPLAN OFF, and the tables will be joined in the order you wrote them. Check both execution plans in Query Analyzer to decide which join order to keep. The drawback of SET FORCEPLAN is that it can only be used in stored procedures.
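A minimal sketch of the SET FORCEPLAN usage just described, wrapped around the example query:

set forceplan on
-- the tables are now joined exactly in the written order: A, then B, then C
select name, no
from A
join B on A.id = B.id
join C on C.id = A.id
where name = 'wang'
set forceplan off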

Summary:
- Clustered indexes are precious and should be spent on the columns with the highest query frequency;
- A clustered index shows its full potential when the data is "neither mostly identical nor almost all distinct";
- Keep the order of query conditions consistent with the column order of the composite index;
- Condition expressions should follow the SARG specification so that indexes can be used;
- Any operation on a column, such as a database function or computed expression, may cause a full table scan; when querying, move the computation to the other side of the operator;
- Note that indexes cannot be fully used when null values are present;
- Clauses such as EXISTS, IN, and OR often defeat indexes; if they would not generate many duplicate values, consider splitting them and joining the parts with UNION;
- Make full use of indexed fields when sorting;
- Filter out useless data as early as possible, and pass only the necessary data to subsequent operations.
From the content above, the essence of SQL statement optimization is: on the premise that the results are correct, use statements that the analyzer and optimizer can recognize under the SARG standard, make full use of indexes, reduce the number of data I/O operations, and avoid full table scans.
Some of the points above are guiding principles and some are practical experience; apply them flexibly and choose the appropriate method for the actual situation. The experimental figures listed in this article are for comparison only and have no universal significance. In real projects, make full use of performance monitoring and analysis tools (such as those that ship with SQL Server) to verify your own optimization results.
Finally, a reminder: completing the same complex data operation at the SQL Server database level costs far less than completing it in program code. We therefore recommend studying the important keywords of SQL syntax thoroughly, such as GROUP BY and HAVING, and putting data-manipulation tasks into the database system wherever possible. Performance optimization of a database application is a complex process; what is covered here is only the SQL statement level. Deeper study would also involve resource configuration at the database layer, traffic control at the network layer, and the overall design of the operating system layer.
