Database indexing of massive data processing

Source: Internet
Author: User

The first part, database index and its optimization one, what is an index

A database index is like a directory in front of a book, speeding up the query speed of a database.
For example such a query: SELECT * FROM table1 where id=44. If there is no index, the entire table must be traversed until the row with the ID equal to 44 is found, and after the index (which must be an index established on the ID column), you can find the position of the line by looking at 44 (that is, in the ID column) to find the line. It is visible that the index is used for positioning.
The index is divided into clustered index and non-clustered index, and the clustered index is in order according to the physical location of the data, and the non-clustered index is different; it is obvious that only one clustered index can be built on a base table. After the clustered index is established, updating the data on the index column often results in a change in the physical order of the records in the table, which makes it more expensive to set up clustered indexes for frequently updated columns, and the clustering index can improve the speed of multi-row retrieval, while the non-clustered index is very fast for single-line retrieval. Create a clustered index such as:

Create cluster index ID on Student (ID);

Two, overview

The purpose of indexing is to speed up the lookup or sorting of records in a table.
There is a cost to indexing a table: one is to increase the storage space for the database, and the other is to spend more time inserting and modifying the data (because the index changes as well).

In the case of streamlining, an index is a structure. In SQL Server, the storage structure of the Index and table (which refers to the table with the clustered index) is the same as the B-tree, which is a balanced multi-fork tree used for lookups. Understanding the concept of B-trees such as:

Understanding why a B-tree is used as a structure for indexes and tables (with clustered indexes) first requires understanding how SQL Server stores data.

In SQL Server, the smallest unit of storage is a page, and the page is not re-divided. Just as cells are no longer part of biology, or atoms are the smallest units in chemistry that cannot be divided. This means that SQL Server reads the page, either the entire read, or not at all, without compromise.

In the case of database retrieval, the disk IO scan is the most time consuming. Because disk scanning involves a lot of physical features, these are quite time consuming. So the B-tree was designed to reduce the number of scans on the disk. If a table or index does not use a B-tree (for tables that do not have a clustered index that uses heap heap storage), then looking up a data requires a full scan of the database pages contained in the entire table. This will undoubtedly greatly increase the IO burden. Instead of using B-trees for storage in SQL Server, you just need to store the root node of the B-tree in memory, and after a few lookups, you can find the page containing the leaf node that holds the required data! The overall scan to avoid this improves performance.

Below, an example is shown to illustrate:

In SQL Server, if the clustered index is not established on the table, it is stored in heap (heap), assuming I have a table like this:

Now there are no indexes on this table, that is, heap storage, and I show a decrease in IO by adding a clustered index (stored in B-tree) to it:

Iii. Understanding clustered and nonclustered indexes

In SQL Server, the two most important types of indexes are clustered and nonclustered indexes. As you can see, these two classifications are clustered around this keyword. So first of all to understand what is aggregation.

The definitions that are clustered in the index:

In order to improve the query speed of a property (or group of properties), it is called aggregation that a tuple with the same value on this or these attributes (called the aggregation code) is stored in a contiguous set of physical blocks.

In simple terms, a clustered index is:

In SQL Server, the role of aggregation is to change the physical order of one column (or columns) to match the logical order, for example, I extract 5 data from the employee of the AdventureWorks database:

When I set up a clustered index on ContactID, I query again:

In SQL Server, the storage of a clustered index is stored as a B-tree, and the leaves of the B-tree store the data of the clustered index directly:

Because the clustered index changes the physical storage order of the table in which it resides, there can be only one clustered index per table.

Nonclustered Indexes

Because each table can have only one clustered index, if our query on a table is not limited to the fields on the clustered index. We also have requirements for indexes outside of the clustered index column, so we need a nonclustered index.

A nonclustered index is, in essence, a clustered index. A nonclustered index does not alter the physical structure of the table in which it is located, but rather generates an additional B-tree structure for a clustered index, but the leaf node is a reference to its own table, which is divided into two types, and the row number is referenced if there is no clustered index on the table. If the clustered index already exists on the table, the page referencing the clustered index is referenced.

A simple nonclustered index concept is as follows:

As you can see, nonclustered indexes require additional space for storage, clustered indexes by indexed columns, and a leaf node in the B-tree contains pointers to the table where the nonclustered indexes are located.

In MSDN, the description graph for nonclustered indexes is:

As you can see, a nonclustered index is also a B-tree structure, unlike a clustered index, where the leaf node of a B-tree is a pointer to a heap or clustered index.

The principle of nonclustered indexes shows that if the physical structure of the table is changed, such as adding or deleting a clustered index, then all nonclustered indexes need to be rebuilt, and this loss of performance is considerable. Therefore, it is best to set up a clustered index and then set up a corresponding nonclustered index.

Clustered Index VS nonclustered index

This is explained earlier by the principle of clustered and nonclustered indexes. It's easy to see that, in most cases, a clustered index is faster than a nonclustered index. Because the B-tree leaf nodes of the clustered index store data directly, the clustered index also requires additional pointers to the leaf nodes to find the data.

Also, for large numbers of continuous data lookups, nonclustered indexes are weak because nonclustered indexes need to find a pointer to each row in the B-Tree of the nonclustered index, and then go to the table where the data is found, so performance can be compromised. Sometimes it's better not to add nonclustered indexes.

Therefore, in most cases, the clustered index is faster than the nonclustered index. But there can only be one clustered index, so choosing the columns that are applied to the clustered index is critical to query performance improvements.

Copyright NOTICE: This article for Bo Master http://www.zuiniusn.com original article, without Bo Master permission not reproduced.

Database indexing of massive data processing

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.