Why does the Database need to be indexed?

Last Update:2018-12-08 Source: Internet

Author: User

Tags mysql manual

Here is a summary of my earlier study notes on indexes:

First, understand why an index increases speed. When a database executes an SQL statement, the default method is to scan the entire table against the search conditions and add each matching row to the result set. If we add an index on a field, the database first looks up the rows holding a specific value in the index, which greatly reduces the number of rows it has to examine and significantly increases query speed.

Should indexes always be added? Here are several counterexamples:

1. If you need to retrieve every record in the table every time, the entire table must be scanned in any case, so adding an index makes no sense.

2. Adding an index on a field with a large number of repeated values, such as "gender", is of little use.

3. For tables with few records, adding indexes brings no speed-up and only wastes storage space, because indexes need storage of their own. In addition, every UPDATE, INSERT, or DELETE must also recalculate and update the index on the affected field.
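The difference between a full table scan and an index lookup can be observed directly. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in database; the `employee` table, its data, and the index name `employee_lname` are assumptions made up for illustration, and other engines word their plans differently.

```python
import sqlite3

def query_plan(conn, sql):
    """Return the detail strings of SQLite's EXPLAIN QUERY PLAN for `sql`."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

conn = sqlite3.connect(":memory:")
# Hypothetical table used only for this illustration.
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, lname TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO employee (lname, gender) VALUES (?, ?)",
    [(f"name{i}", "F" if i % 2 else "M") for i in range(1000)],
)

sql = "SELECT * FROM employee WHERE lname = 'name500'"

# Without an index on lname, the planner has to scan the whole table.
plan_without_index = query_plan(conn, sql)

# With an index, the planner can search the index instead.
conn.execute("CREATE INDEX employee_lname ON employee (lname)")
plan_with_index = query_plan(conn, sql)

print(plan_without_index)
print(plan_with_index)
```

The first plan reports a scan of the table, while the second reports a search that names the new index.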

So when should you add an index? Let's look at an example from the MySQL manual. Here is an SQL statement:

SELECT c.companyID, c.companyName
FROM Companies c, User u
WHERE c.companyID = u.fk_companyID
  AND c.numEmployees >= 0
  AND c.companyName LIKE '%I%'
  AND u.groupID IN (SELECT g.groupID FROM Groups g WHERE g.groupLabel = 'executive')

This statement joins three tables and contains several search conditions, including a range comparison and a LIKE match. Without any index, MySQL has to scan 77721876 rows. After indexes are added on the companyID and groupLabel fields, only 134 rows are scanned. In MySQL, you can use EXPLAIN SELECT to view the number of rows scanned. Clearly, for joins and complex search conditions like these, the performance gained from indexes far outweighs the disk space they occupy.

How is an index implemented? Most database vendors implement indexes on the B-tree data structure, because a B-tree is well suited to organizing dynamic search tables on direct-access storage devices such as disks. A B-tree is defined as follows: a B-tree of order m (m >= 3) is an m-way tree that meets the following conditions:

1. Each node contains the fields (j, p0, k1, p1, k2, p2, ..., kj, pj), where j is the number of keys, each k is a key, and each p is a child pointer.

2. All leaf nodes are on the same level, and that level equals the height of the tree.

3. The number of keys j in each non-root node must satisfy ⌈m/2⌉ - 1 <= j <= m - 1.

4. If the tree is not empty, the root has at least one key. If the root is not a leaf, it has at least two subtrees and at most m subtrees.
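To make the definition concrete, here is a small sketch of a B-tree node and a lookup. It is an illustration only, not how any particular database stores its indexes: the `Node` class, the `key_bounds` helper, and the tiny order-3 tree are all assumptions introduced for this example, and the key bound used is the usual ceil(m/2) - 1 <= j <= m - 1.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from math import ceil

@dataclass
class Node:
    keys: list                                     # k1..kj, kept sorted
    children: list = field(default_factory=list)   # p0..pj, empty for a leaf

def key_bounds(m):
    """Minimum and maximum key count of a non-root node in an order-m B-tree."""
    return ceil(m / 2) - 1, m - 1

def search(node, key):
    """Follow one root-to-leaf path and report whether `key` is present."""
    while True:
        i = bisect_left(node.keys, key)       # first position with keys[i] >= key
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:                 # reached a leaf without a match
            return False
        node = node.children[i]               # descend into subtree p_i

# A hand-built order-3 tree: one key in the root, two leaves on the same level.
root = Node(["M"], [Node(["D", "H"]), Node(["Q", "U"])])

print(key_bounds(3))
print(search(root, "H"), search(root, "Z"))
```

Each lookup visits one node per level, which is why the cost depends on the height of the tree rather than on the total number of keys.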

Let's look at a B-tree example. The B-tree with 26 English letters can be constructed as follows:

We can see that searching for a letter in a B-tree only requires following a single path from the root down to a leaf, so the cost grows with the height of the tree, O(log n) for n keys, rather than with n itself. When the data volume is large, such a structure can greatly increase query speed. However, another data structure supports even faster lookups than the B-tree: the hash table. A hash table is defined as follows: let U be the set of all possible keys and K the set of keys actually stored, where |K| is much smaller than |U|. A hash function h maps U onto the indices of a table T[0..M-1], so that for a key k in U, h(k) is the storage address of the corresponding node. A lookup can then be completed in O(1) time on average.

However, hashing has a drawback, the hash collision: two different keys can yield the same result from the hash function. Let m be the length of the hash table and n the number of filled nodes; n/m is the load factor of the table. The larger the load factor, the greater the chance of a collision.
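Collisions and the load factor n/m can be seen in a few lines. The sketch below resolves collisions by chaining, which is one common strategy chosen here as an assumption (the text does not say which strategy a database would use); the class name and the sample keys are made up for the example.

```python
class ChainedHashTable:
    """Toy hash table with m slots; colliding keys share a slot's list."""

    def __init__(self, m):
        self.m = m                      # length of the table
        self.n = 0                      # number of stored keys
        self.buckets = [[] for _ in range(m)]

    def _h(self, key):
        return hash(key) % self.m       # hash function h: key -> slot index

    def put(self, key, value):
        bucket = self.buckets[self._h(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.n += 1

    def get(self, key):
        for k, v in self.buckets[self._h(key)]:
            if k == key:
                return v
        return None

    def load_factor(self):
        return self.n / self.m

table = ChainedHashTable(m=8)
for key in (1, 9, 17, 4):               # 1, 9 and 17 all map to slot 1 when m = 8
    table.put(key, f"row-{key}")

print(table.load_factor(), len(table.buckets[1]), table.get(17))
```

The more keys share a slot, the longer the chain that a lookup has to walk, which is how a high load factor erodes the O(1) behaviour.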
Because of this drawback, databases do not use hashing as the default index implementation. MySQL states that it will try to convert disk-based B-tree indexes into suitable hash indexes based on the query pattern to further improve search speed. I think other database vendors have similar strategies. After all, on the database battlefield, search speed is as important a competitive factor as management and security.

Basic concepts:

Index

You can use indexes to quickly access specific information in database tables. An index is a structure that sorts the values of one or more columns in a database table, for example, the last name (lname) column of the employee table. If you want to find a specific employee by last name, the index helps you get the information faster than searching every row in the table.

An index provides pointers to the data values stored in the specified columns of the table and orders those pointers according to the sort order you specify. The database uses an index much as you use the index of a book: it searches the index to find a specific value and then follows the pointer to the row containing that value.

In a database diagram, you can create, edit, or delete each type of index on the Indexes/Keys property page of the selected table. The index is saved in the database when you save the table it is attached to, or the diagram containing that table. For more information, see create an index.

Note: not all databases use indexes in the same way. For more information, see Database Server considerations or database documentation.

As a general rule, create an index on a table only when the data in the indexed columns is queried frequently. Indexes occupy disk space and slow down adding, deleting, and updating rows. In most cases, the speed advantage that indexes bring to data retrieval far outweighs these costs.

Index Column

You can create an index on a single column or on multiple columns of a database table. A multi-column index lets you distinguish rows that have the same value in one of the columns.

A multi-column index also helps if you often search on two or more columns at the same time or sort by two or more columns. For example, if you often set criteria on the same two columns in a single query, it makes sense to create a multi-column index on those two columns.

Determine whether an index is worthwhile:

Check the WHERE and JOIN clauses of your queries. Every column in either clause is a candidate for an index.
Test the new index to check its effect on the performance of running queries.
Consider the number of indexes already created on the table. It is best to avoid having many indexes on a single table.
Check the definitions of the indexes already created on the table. It is best to avoid overlapping indexes that share columns.
Check the number of unique data values in a column and compare it with the number of rows in the table. The result is the selectivity of the column, which helps you decide whether the column is a good candidate for an index and, if so, which type of index to use.
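The selectivity check in the last item is easy to compute. Below is a minimal sketch using Python's sqlite3 module; the `selectivity` helper and the sample `employee` table are hypothetical names introduced for this example, and the ratio is taken here as distinct values divided by total rows.

```python
import sqlite3

def selectivity(conn, table, column):
    """Distinct values divided by total rows (1.0 means every value is unique).

    `table` and `column` are interpolated into the SQL text, so they must be
    trusted identifiers, never user input.
    """
    distinct, total = conn.execute(
        f"SELECT COUNT(DISTINCT {column}), COUNT(*) FROM {table}"
    ).fetchone()
    return distinct / total if total else 0.0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, lname TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO employee (lname, gender) VALUES (?, ?)",
    [(f"name{i}", "F" if i % 2 else "M") for i in range(1000)],
)

lname_selectivity = selectivity(conn, "employee", "lname")
gender_selectivity = selectivity(conn, "employee", "gender")
print(lname_selectivity, gender_selectivity)
```

A value close to 1 suggests the column narrows a search well; a value close to 0, as for a "gender" column, suggests an index on it would filter out few rows.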

Index type

Based on the functions of the database, you can create three indexes in the Database Designer: unique index, primary key index, and clustered index. For more information about the index functions supported by the database, see the database documentation.

Tip: Although a unique index helps to locate information, we recommend using primary key or unique constraints for the best performance.

Unique Index

A unique index is an index that does not allow any two rows to have the same index value.

When duplicate key values already exist in the data, most databases do not allow you to save a newly created unique index with the table. The database may also reject new data that would create duplicate key values in the table. For example, if a unique index is created on the last name (lname) column of the employee table, no two employees can have the same last name.
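The lname example can be tried out with Python's sqlite3 module as a stand-in database. This is a sketch under that assumption; the table layout and the index name `employee_lname` are made up, and other databases report the violation with their own error types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, lname TEXT)")
conn.execute("CREATE UNIQUE INDEX employee_lname ON employee (lname)")

conn.execute("INSERT INTO employee (lname) VALUES ('Smith')")

# A second 'Smith' would create a duplicate key value, so the insert fails.
try:
    conn.execute("INSERT INTO employee (lname) VALUES ('Smith')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
print(duplicate_rejected, row_count)
```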

Primary Key Index

A database table often has a column or a combination of columns whose values uniquely identify each row in the table. This column (or combination) is called the primary key of the table.

When you define a primary key for a table in a database diagram, the primary key index is created automatically. The primary key index is a specific type of unique index. It requires that each value of the primary key be unique. When a primary key index is used in a query, it also allows quick access to the data.

Clustered Index

In a clustered index, the physical order of the rows in the table is the same as the logical (index) order of the key values. A table can contain only one clustered index.

If an index is not a clustered index, the physical sequence of the row in the table does not match the logical sequence of the key value. Compared with non-clustered indexes, clustered indexes generally provide faster data access speeds.

How to create indexes, and precautions

The most common case is to create an index on a field that appears in the WHERE clause. For convenience, let's first create the following table:

CREATE TABLE mytable (
    id serial PRIMARY KEY,
    category_id int NOT NULL DEFAULT 0,
    user_id int NOT NULL DEFAULT 0,
    adddate int NOT NULL DEFAULT 0
);

If you use statements similar to the following in queries:

SELECT * FROM mytable WHERE category_id = 1;

The most direct response is to create a simple index on category_id:

CREATE INDEX mytable_categoryid ON mytable (category_id);

OK. What if there is more than one selection condition? For example:

SELECT * FROM mytable WHERE category_id = 1 AND user_id = 2;

Your first reaction may be to create another index for user_id. No, that is not the best approach. You can create a multi-column index instead:

CREATE INDEX mytable_categoryid_userid ON mytable (category_id, user_id);

Have you noticed my naming habit? I use "tablename_field1name_field2name". You will soon see why.

Now you have created an index on the appropriate fields. But here is the tricky part: will the database actually use these indexes? Test it. In most databases this is very easy; you only need the EXPLAIN command:

EXPLAIN
SELECT * FROM mytable
WHERE category_id = 1 AND user_id = 2;

This is what Postgres 7.1 returns (exactly as I expected):

NOTICE: QUERY PLAN:

Index Scan using mytable_categoryid_userid on mytable (cost=0.00..2.02 rows=1 width=16)

EXPLAIN

The above is the Postgres output. We can see that the database uses an index during the query (a good start), and that it uses the second index I created. This is the benefit of my naming scheme: you can tell at once that the appropriate index is being used.
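The same check can be scripted. The sketch below repeats the experiment with Python's sqlite3 module instead of Postgres, which is an assumption made so that the example is self-contained; SQLite uses EXPLAIN QUERY PLAN, words its plans differently, and `INTEGER PRIMARY KEY` stands in for `serial`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE mytable (
        id INTEGER PRIMARY KEY,
        category_id INT NOT NULL DEFAULT 0,
        user_id INT NOT NULL DEFAULT 0,
        adddate INT NOT NULL DEFAULT 0
    )
    """
)
conn.execute(
    "CREATE INDEX mytable_categoryid_userid ON mytable (category_id, user_id)"
)

# Column 3 of each EXPLAIN QUERY PLAN row holds the human-readable detail.
plan = [
    row[3]
    for row in conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT * FROM mytable WHERE category_id = 1 AND user_id = 2"
    )
]
index_used = any("mytable_categoryid_userid" in detail for detail in plan)
print(plan, index_used)
```

Because the index name encodes the table and columns, a plain substring check on the plan is enough to confirm that the intended index was picked.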

Next, let's make it a little more complex. What if there is an ORDER BY clause? Believe it or not, most databases benefit from an index when ORDER BY is used.

SELECT * FROM mytable
WHERE category_id = 1 AND user_id = 2
ORDER BY adddate DESC;

Just as you create an index for the fields in the WHERE clause, create one that also covers the field in the ORDER BY clause:

CREATE INDEX mytable_categoryid_userid_adddate
ON mytable (category_id, user_id, adddate);

NOTICE: "mytable_categoryid_userid_adddate" will be truncated to "mytable_categoryid_userid_addda"

CREATE

EXPLAIN SELECT * FROM mytable
WHERE category_id = 1 AND user_id = 2
ORDER BY adddate DESC;

NOTICE: QUERY PLAN:

Sort (cost=2.03..2.03 rows=1 width=16)
  -> Index Scan using mytable_categoryid_userid_addda on mytable (cost=0.00..2.02 rows=1 width=16)

EXPLAIN

Let's take a look at the EXPLAIN output. The database performed an extra sort that we did not expect, and now we know where the performance is being lost. It seems we were a little too optimistic about what the database does on its own. So let's give it more hints.

To skip the sorting step we do not need another index; we only need to change the query slightly. Postgres is used here. We give the database an extra hint by adding the fields from the WHERE clause to the ORDER BY clause. This is purely a technical trick and is not logically required, because no actual sorting happens on the other two fields, but once they are added, Postgres knows what to do.

EXPLAIN SELECT * FROM mytable
WHERE category_id = 1 AND user_id = 2
ORDER BY category_id DESC, user_id DESC, adddate DESC;

NOTICE: QUERY PLAN:

Index Scan Backward using mytable_categoryid_userid_addda on mytable (cost=0.00..2.02 rows=1 width=16)

EXPLAIN

Now the expected index is used, and quite intelligently: the database knows it can read the index backward, starting from the end, and so avoids any sorting.
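Whether a sort step remains can also be checked programmatically. The following sketch uses Python's sqlite3 module rather than Postgres 7.1, so the planner behaviour is SQLite's and the ORDER BY hint described above is not needed there; the helper names `plan` and `needs_sort` are assumptions introduced for this example. SQLite marks an explicit sort with the text "USE TEMP B-TREE FOR ORDER BY".

```python
import sqlite3

QUERY = (
    "SELECT * FROM mytable WHERE category_id = 1 AND user_id = 2 "
    "ORDER BY adddate DESC"
)

def plan(conn):
    """Detail strings of SQLite's query plan for QUERY."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]

def needs_sort(details):
    """True if the plan contains a separate sorting step."""
    return any("TEMP B-TREE" in d for d in details)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER PRIMARY KEY, category_id INT NOT NULL DEFAULT 0, "
    "user_id INT NOT NULL DEFAULT 0, adddate INT NOT NULL DEFAULT 0)"
)

# Index on the WHERE columns only: rows are found quickly but still sorted afterwards.
conn.execute("CREATE INDEX mytable_categoryid_userid ON mytable (category_id, user_id)")
plan_two_columns = plan(conn)

# Index that also ends with the ORDER BY column: rows come out already ordered.
conn.execute("DROP INDEX mytable_categoryid_userid")
conn.execute(
    "CREATE INDEX mytable_categoryid_userid_adddate "
    "ON mytable (category_id, user_id, adddate)"
)
plan_three_columns = plan(conn)

print(needs_sort(plan_two_columns), needs_sort(plan_three_columns))
```

The two-column index is dropped before the second measurement only to keep the comparison unambiguous; in practice you would inspect the plan with all of your real indexes in place.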

That was a fairly detailed walk-through. Still, if your database is huge and your daily page requests reach the millions, I think you will benefit a lot. But what if you need more complex queries, such as joining several tables, especially when the WHERE clause restricts fields from more than one table? I usually try to avoid this, because the database has to combine the rows of every table and then discard the ones that do not fit, which can be very expensive.

If it cannot be avoided, examine each table being joined, create indexes using the strategy above, and then use the EXPLAIN command to verify that the expected indexes are used. If so, fine. If not, you may need to create a temporary table to combine them and index it appropriately.

Note that creating too many indexes slows down updates and inserts, because each index has to be updated as well. For a table that is updated and inserted into frequently, there is no need to create a separate index for a rarely used WHERE clause. For a small table, sorting is cheap, so there is no need for another index either.

The above covers only the basics; there is much more to learn. EXPLAIN alone cannot tell us whether a given approach is optimal. Each database has its own optimizer, and although it may not be perfect, it compares alternatives at query time to decide which is faster. In some cases, creating an index does not make things faster. For example, when an index is stored in non-contiguous storage, it adds to the read load on the disk. So the best choice has to be determined by testing in the actual environment.

If the table is not large at first, there is no need to index it. My view is to create an index only when necessary. Some commands can also be used to optimize a table; for example, MySQL offers OPTIMIZE TABLE.

To sum up, you should now have the basic concepts of how to create appropriate indexes for a database.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
