Explanation of why the database should be indexed

Source: Internet
Author: User
Tags: documentation, mysql, manual, sorts

A database index is a structure attached to a table field to speed up queries. Many people understand indexing only mechanically and assume that adding indexes brings benefits and no harm. This article summarizes my earlier study notes on indexes:

First, understand why an index increases speed. When the database executes an SQL statement, the default approach is a full table scan driven by the search criteria: every row is checked, and the rows that meet the match condition are added to the result set. If we add an index to a field, the query first locates the rows holding a particular value in the index, which greatly reduces the number of rows to traverse and can significantly speed up the query. Should a field always be indexed, then? Here are a few counter-examples:

1. If you need to fetch every record in the table, a full table scan is unavoidable, so adding an index is pointless.

2. For a low-selectivity field with a large number of duplicate values, such as "gender", an index brings little or no benefit.

3. For a table with few records, an index brings no speed gain and only wastes storage, because the index itself needs storage space. A more serious disadvantage is that every UPDATE/INSERT/DELETE must also update the indexes on the affected fields.
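The difference between a full table scan and an index lookup can be observed directly. The following is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for the database discussed in this article; the `people` table, its columns, and the index name are hypothetical, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

def plan(conn, sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

conn = sqlite3.connect(":memory:")
# Hypothetical table, used only for illustration.
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, lname TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people (lname, age) VALUES (?, ?)",
    [(f"name{i}", i % 90) for i in range(10000)],
)

query = "SELECT * FROM people WHERE lname = 'name4242'"

before = plan(conn, query)   # no index on lname yet: the whole table is scanned
conn.execute("CREATE INDEX people_lname ON people (lname)")
after = plan(conn, query)    # the planner can now search the index

print("without index:", before)
print("with index:   ", after)
```

Without the index the plan reports a scan of the table; with the index it reports a search that uses `people_lname`.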

So when is it appropriate to add an index? Consider an example cited from the MySQL manual. Here is the SQL statement:

SELECT c.companyid, c.companyname FROM Companies c, User u WHERE c.companyid = u.fk_companyid AND c.numemployees >= 0 AND c.companyname LIKE '%i%' AND u.groupid IN (SELECT g.groupid FROM Groups g WHERE g.grouplabel = 'Executive')

This statement joins 3 tables and includes several kinds of search criteria, such as a size comparison and a LIKE match. Without any index, MySQL needs to scan 77721876 rows. After we index the two fields companyid and grouplabel, only 134 rows need to be scanned. In MySQL, you can view the estimated number of scanned rows with EXPLAIN SELECT. For multi-table queries with complex search conditions like this one, the performance gain from an index matters far more than the disk space it occupies.

So how is an index implemented? Most database vendors base their index implementations on one data structure: the B-tree. B-trees are well suited to organizing dynamic lookup tables on direct-access storage devices such as disks. The definition is as follows: a B-tree of order m (m >= 3) is an m-way tree that satisfies the following conditions:

1. Each node holds the fields (j, P0, K1, P1, K2, P2, ..., Kj, Pj), where j is the number of keys, K1..Kj are the keys, and P0..Pj are pointers to the children.

2. All leaf nodes are on the same level, and that level number equals the tree height h.

3. Each non-root node holds j keys, where ceil(m/2) - 1 <= j <= m - 1.

4. If the tree is not empty, the root holds at least 1 key; if the root is not a leaf, it has at least 2 subtrees and at most m subtrees.
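The four conditions above can be checked mechanically. The following is a minimal in-memory sketch, not how any particular database stores its indexes: the `Node` class, the `check` and `search` helpers, and the hand-built order-4 tree over the letters A to K are all assumptions made for illustration.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from math import ceil

@dataclass
class Node:
    keys: list                                     # K1..Kj, kept sorted
    children: list = field(default_factory=list)   # P0..Pj, empty for a leaf

def check(node, m, is_root=True):
    """Verify the B-tree conditions for order m and return the subtree height."""
    j = len(node.keys)
    low = 1 if is_root else ceil(m / 2) - 1        # conditions 3 and 4
    assert low <= j <= m - 1, "key count out of range"
    assert node.keys == sorted(node.keys), "keys must be sorted"
    if not node.children:
        return 1
    assert len(node.children) == j + 1, "a node with j keys has j + 1 children"
    heights = {check(child, m, is_root=False) for child in node.children}
    assert len(heights) == 1, "all leaves must be on the same level"  # condition 2
    return 1 + heights.pop()

def search(node, key):
    """Return (found, nodes_visited); one node is visited per level."""
    visited = 0
    while node is not None:
        visited += 1
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True, visited
        node = node.children[i] if node.children else None
    return False, visited

# Hand-built tree of order 4: at most 3 keys per node.
root = Node(["D", "H"], [
    Node(["A", "B", "C"]),
    Node(["E", "F", "G"]),
    Node(["I", "J", "K"]),
])

height = check(root, m=4)
print(height, search(root, "F"), search(root, "Z"))
```

A lookup never visits more nodes than the height of the tree, which is why the structure stays fast as the number of keys grows.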

As an example, a B-tree can be constructed over the 26 English letters. [Figure: a B-tree holding the 26 English letters]

Searching for a letter in such a B-tree visits only one node per level, so the cost grows with the tree height, roughly O(log n) for n keys, rather than with n itself. With large data volumes, such a structure greatly speeds up queries. However, there is another data structure that offers even faster lookups than a B-tree: the hash table. It is defined as follows: let U be the set of all possible keys and K the set of keys actually stored, with |K| much smaller than |U|. Hashing maps U onto the indices of a table T[0..m-1] through a hash function h: for a key k in U, the value h(k) is the storage address of the corresponding node. A lookup can then complete in O(1) time on average.

However, the hash table has a flaw: hash collisions, where two different keys produce the same result under the hash function. Let m be the length of the hash table and n the number of nodes stored in it; n/m is the load factor of the table, and the larger the load factor, the more likely a collision becomes.

Because of this flaw, databases do not use a hash table as the default index implementation. MySQL states that it tries to build an appropriate hash index on top of the disk-based B-tree index, based on the observed query patterns, to further improve search speed. I expect other database vendors to have similar strategies; after all, in the database market, search speed and administrative security are major points of competition.
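The hash table definition, collisions, and the load factor n/m described above can be sketched as follows. This is an illustrative toy, not a database's hash index: the class name, the choice of h(k) = k mod m for integer keys, and collision handling by chaining are all assumptions.

```python
class ChainedHashTable:
    """Toy hash table T[0..m-1] that resolves collisions by chaining."""

    def __init__(self, m):
        self.m = m                            # table length
        self.n = 0                            # number of stored keys
        self.collisions = 0                   # inserts that landed in an occupied slot
        self.slots = [[] for _ in range(m)]

    def _h(self, key):
        # Assumed hash function for integer keys: h(k) = k mod m.
        return key % self.m

    def put(self, key, value):
        slot = self.slots[self._h(key)]
        for pair in slot:
            if pair[0] == key:                # key already present: overwrite
                pair[1] = value
                return
        if slot:                              # a different key already lives here
            self.collisions += 1
        slot.append([key, value])
        self.n += 1

    def get(self, key):
        for k, v in self.slots[self._h(key)]:
            if k == key:
                return v
        return None

    @property
    def load_factor(self):
        return self.n / self.m

table = ChainedHashTable(m=8)
for key, value in [(1, "a"), (9, "b"), (17, "c"), (4, "d")]:
    table.put(key, value)

# Keys 1, 9 and 17 all map to slot 1, so two inserts collide.
print(table.get(17), table.collisions, table.load_factor)
```

With chaining, a lookup stays O(1) only while the chains stay short, which is why a high load factor hurts.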


Introduction to Basic concepts:


Index

Use an index to quickly access specific information in a database table. An index is a structure that sorts the values of one or more columns in a database table, such as the Last Name (lname) column of an employee table. If you want to find a specific employee by last name, the index will help you get that information faster than if you have to search all the rows in the table.

The index provides pointers to the data values stored in the specified columns in the table, and then sorts the pointers according to the sort order that you specify. The database uses the index in the same way that you use an index in a book: it searches the index to find a specific value, and then follows the pointer to the row that contains the value.

In a database diagram, you can create, edit, or delete each index type in the Indexes/Keys property page of the selected table. The index is saved in the database when you save the table to which the index is attached, or when you save the diagram where the table is located. For more information, see Creating Indexes.

Note that not all databases use the index in the same way. For more information, see Database Server considerations, or consult your database documentation.

As a general rule, you need to create an index on a table only when you frequently query the data in the indexed column. Indexes consume disk space and slow down adding, deleting, and updating rows. In most cases, the speed advantage of indexes for data retrieval greatly outweighs these costs.

Indexed Columns

You can create indexes based on single or multiple columns in a database table. Multi-column indexes allow you to distinguish between rows in which one column might have the same value.

Multicolumn indexes are also helpful if you frequently search on two or more columns, or sort by two or more columns, at the same time. For example, if you frequently set criteria for a first and last name in the same query, it would make sense to create a multicolumn index on those two columns.

To determine whether an index is worthwhile:

    • Check the WHERE and JOIN clauses of your queries. Each column included in either clause is a candidate for indexing.
    • Experiment with the new index to check its impact on the performance of running queries.
    • Consider the number of indexes already created on the table. It is best to avoid having many indexes on a single table.
    • Check the definitions of the indexes already created on the table. It is best to avoid overlapping indexes that contain shared columns.
    • Check the number of unique data values in a column and compare it to the number of rows in the table. The result of the comparison is the selectivity of the column, which helps determine whether the column is suitable for indexing and, if so, which type of index to use.
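
The selectivity check in the last item can be computed with a single query. The sketch below uses Python's built-in `sqlite3` module; the `employees` table, its columns, and the `selectivity` helper are hypothetical names chosen for illustration.

```python
import sqlite3

def selectivity(conn, table, column):
    """Distinct values divided by total rows; close to 1.0 means highly selective.

    The table and column names are interpolated directly, so pass only
    trusted identifiers to this helper.
    """
    distinct, total = conn.execute(
        f"SELECT COUNT(DISTINCT {column}), COUNT(*) FROM {table}"
    ).fetchone()
    return distinct / total if total else 0.0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, gender TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO employees (gender, email) VALUES (?, ?)",
    [("M" if i % 2 else "F", f"user{i}@example.com") for i in range(1000)],
)

sel_gender = selectivity(conn, "employees", "gender")   # 2 distinct values in 1000 rows
sel_email = selectivity(conn, "employees", "email")     # every value is distinct
print(sel_gender, sel_email)
```

A column such as `gender` scores very low and is a poor candidate for an index, while a column whose values are all distinct scores 1.0 and is a natural fit for a unique index.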

Index Type

Depending on the capabilities of your database, you can create three types of indexes in the Database Designer: unique indexes, primary key indexes, and clustered indexes. For more information about the index features supported by the database, see the database documentation.

Tip: Although a unique index helps locate information, for best performance it is recommended that you use a primary key or a unique constraint instead.

Unique index

A unique index is one that does not allow any two rows to have the same index value.

When duplicate key values exist in existing data, most databases do not allow a newly created unique index to be saved with the table. The database may also prevent the addition of new data that would create duplicate key values in the table. For example, if a unique index is created on the employee's last name (lname) in the Employees table, no two employees can share the same last name.
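
Both behaviours described above can be tried out in a few lines. This is a sketch using Python's built-in `sqlite3` module as a stand-in database; the table and index names are hypothetical, and other databases report these failures with their own error types and messages.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Case 1: a unique index blocks new rows that would duplicate a key value.
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, lname TEXT)")
conn.execute("CREATE UNIQUE INDEX employees_lname ON employees (lname)")
conn.execute("INSERT INTO employees (lname) VALUES ('Smith')")
try:
    conn.execute("INSERT INTO employees (lname) VALUES ('Smith')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
row_count = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]

# Case 2: a unique index cannot be created over data that already has duplicates.
conn.execute("CREATE TABLE legacy (lname TEXT)")
conn.executemany("INSERT INTO legacy (lname) VALUES (?)", [("Lee",), ("Lee",)])
try:
    conn.execute("CREATE UNIQUE INDEX legacy_lname ON legacy (lname)")
    index_created = True
except sqlite3.DatabaseError:
    index_created = False

print(duplicate_rejected, row_count, index_created)
```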

Primary key Index

Database tables often have one column or combination of columns whose values uniquely identify each row in the table. This column or combination of columns is called the primary key of the table.

Defining a primary key for a table in a database diagram automatically creates a primary key index, which is a specific type of unique index. The index requires that each value in the primary key be unique. When a primary key index is used in a query, it also allows quick access to the data.

Clustered index

In a clustered index, the physical order of rows in a table is the same as the logical (indexed) order of the key values. A table can contain only one clustered index.

If an index is not a clustered index, the physical order of the rows in the table does not match the logical order of the key values. Clustered indexes typically provide faster data access than nonclustered indexes.


Building methods and considerations

The most common scenario is to build an index on a field that appears in the WHERE clause. For the discussion that follows, we first set up a table:

CREATE TABLE mytable (
    id serial PRIMARY KEY,
    category_id int NOT NULL DEFAULT 0,
    user_id int NOT NULL DEFAULT 0,
    adddate int NOT NULL DEFAULT 0
);

Suppose your queries use a statement similar to the following:

SELECT * FROM mytable WHERE category_id=1;

The most straightforward response is to create a simple index on category_id:

CREATE INDEX mytable_categoryid
    ON mytable (category_id);

OK. What if there is more than one condition? For example:

SELECT * FROM mytable WHERE category_id=1 AND user_id=2;

Your first reaction might be to create another index on user_id. That is not the best approach. Instead, you can build one index on multiple columns:

CREATE INDEX mytable_categoryid_userid ON mytable (category_id,user_id);

Did you notice my naming habit? I use the pattern "table name_field 1 name_field 2 name". You will soon see why.

Now that you have indexed the appropriate fields, you may still wonder: does the database actually use these indexes? Testing this is easy in most databases; just use the EXPLAIN command:

EXPLAIN
SELECT * FROM mytable
    WHERE category_id=1 AND user_id=2;

This is what Postgres 7.1 returns (exactly as I expected):

NOTICE:  QUERY PLAN:

Index Scan using mytable_categoryid_userid on
    mytable (cost=0.00..2.02 rows=1 width=16)

EXPLAIN

The above is the Postgres output. You can see that the database used an index for the query (a good start), and that it used the second index I created. This is the benefit of the naming scheme above: you can tell immediately that the appropriate index is in use.
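
The same check can be scripted. The sketch below repeats the experiment with Python's built-in `sqlite3` module instead of Postgres 7.1, so the plan text looks different; `serial` is replaced by SQLite's `INTEGER PRIMARY KEY`, and the choice the planner makes here is SQLite's, which need not match every database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE mytable (
        id INTEGER PRIMARY KEY,
        category_id INT NOT NULL DEFAULT 0,
        user_id INT NOT NULL DEFAULT 0,
        adddate INT NOT NULL DEFAULT 0
    )
""")
# Both indexes from the text: the single-column one and the two-column one.
conn.execute("CREATE INDEX mytable_categoryid ON mytable (category_id)")
conn.execute("CREATE INDEX mytable_categoryid_userid ON mytable (category_id, user_id)")

rows = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM mytable WHERE category_id=1 AND user_id=2"
).fetchall()
plan_text = " | ".join(row[3] for row in rows)   # the fourth column is the plan detail

uses_two_column_index = "mytable_categoryid_userid" in plan_text
print(plan_text)
```

Because the index name encodes the table and the columns, a simple substring test on the plan is enough to confirm which index was picked.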

Next, something a little more complicated: what if there is an ORDER BY clause? Believe it or not, most databases can benefit from an index when ORDER BY is used.

SELECT * FROM mytable
    WHERE category_id=1 AND user_id=2
    ORDER BY adddate DESC;

Just as you create an index for the fields in the WHERE clause, you also include the field from the ORDER BY clause in the index:

CREATE INDEX mytable_categoryid_userid_adddate
    ON mytable (category_id,user_id,adddate);

NOTICE: "mytable_categoryid_userid_adddate" will be truncated to
    "mytable_categoryid_userid_addda"

CREATE

EXPLAIN SELECT * FROM mytable
    WHERE category_id=1 AND user_id=2
    ORDER BY adddate DESC;

NOTICE:  QUERY PLAN:

Sort (cost=2.03..2.03 rows=1 width=16)
  -> Index Scan using mytable_categoryid_userid_addda
       on mytable (cost=0.00..2.02 rows=1 width=16)

EXPLAIN

Look at the EXPLAIN output: the database performed a sort that we did not ask for, and that hurts performance. It seems we were a bit too optimistic about what the database can work out on its own, so let us give it a few more hints.

To skip the sorting step, we do not need another index; we only need to change the query slightly. With Postgres, we give the database an extra hint: in the ORDER BY clause, add the fields from the WHERE clause. This is only a technicality and is not logically necessary, because no actual sorting happens on the other two fields, but once you add them, Postgres knows what to do.

EXPLAIN SELECT * FROM mytable
    WHERE category_id=1 AND user_id=2
    ORDER BY category_id DESC,user_id DESC,adddate DESC;

NOTICE:  QUERY PLAN:

Index Scan Backward using
    mytable_categoryid_userid_addda on mytable
    (cost=0.00..2.02 rows=1 width=16)

EXPLAIN

Now the database uses the index we expected, and it is smart enough to know that it can read the index backward from the end, which avoids any sorting.
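
Whether a separate sort step remains can also be read from the plan programmatically. The sketch below uses Python's built-in `sqlite3` module as a stand-in, so it shows SQLite's planner rather than Postgres 7.1; SQLite reports an extra sort as a temporary B-tree for ORDER BY and does not need the extra ORDER BY columns as a hint. The helper name and the index name `mytable_idx` are hypothetical.

```python
import sqlite3

QUERY = (
    "SELECT * FROM mytable "
    "WHERE category_id=1 AND user_id=2 "
    "ORDER BY adddate DESC"
)

def plan_with_index(index_columns):
    """Build a fresh table with one index and return the query plan text."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE mytable (
            id INTEGER PRIMARY KEY,
            category_id INT NOT NULL DEFAULT 0,
            user_id INT NOT NULL DEFAULT 0,
            adddate INT NOT NULL DEFAULT 0
        )
    """)
    conn.execute(f"CREATE INDEX mytable_idx ON mytable ({index_columns})")
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    conn.close()
    return " | ".join(row[3] for row in rows)

two_col = plan_with_index("category_id, user_id")
three_col = plan_with_index("category_id, user_id, adddate")

# An index that ends with the ORDER BY column lets the planner skip the sort.
print("two-column index:  ", two_col)
print("three-column index:", three_col)
```

With only the two-column index the plan contains an additional sorting step; with the three-column index the rows come out of the index already in the requested order.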

That was a fair amount of detail, but if your database is huge and serves millions of page requests a day, I think you will benefit a lot. But what if you need more complex queries, such as queries that combine multiple tables, especially when the fields in the WHERE conditions come from more than one table? I usually try to avoid this, because the database has to combine the rows of each table and then exclude the ones that do not match, which can be costly.

If you cannot avoid it, examine each table you want to combine, build indexes using the strategy above, and then use the EXPLAIN command to verify that the indexes you expect are used. If they are, fine. If not, you may want to create a temporary table to combine the data and index it appropriately.

Note that too many indexes slow down updates and insertions, because every index must be updated as well. For a table that is updated and inserted into frequently, there is no need to build an index just for a rarely used WHERE clause. For smaller tables, the cost of sorting is not significant, so there is no need to create additional indexes either.

The above covers only the basics; there is much more to learn. EXPLAIN alone cannot tell us whether an approach is optimal. Each database has its own optimizer which, although not perfect, compares the candidate ways of running a query and picks the one it estimates to be faster. In some cases, using an index is not necessarily faster, for example when the index is stored in non-contiguous space, which increases the disk read load. Which option is best should therefore be verified in the actual production environment.

In the beginning, if the table is not large, there is no need to index it; my advice is to add indexes when they are needed. There are also commands for optimizing tables, such as MySQL's OPTIMIZE TABLE.

In summary, you should now have a basic grasp of how to build the right indexes for your database.
