Q: As a database grows, indexing becomes more and more important. Can someone explain, independently of any particular database, how indexing works?

A: (I'll answer my own question.)

Why do I need an index
Data is stored on disk in blocks. A block is always read or written in its entirety, which makes it the atomic unit of disk access. Disk blocks are organized much like a linked list: each block contains a data section and a pointer to the location of the next block, and the blocks need not be stored contiguously (blocks that are logically adjacent can be physically far apart).
Because a table's records can be physically sorted on only one field, querying on an unsorted field requires a linear search, which visits N/2 blocks on average, where N is the number of blocks the table spans. If the field is a non-key field (that is, its values are not unique), the entire tablespace must be scanned: all N blocks.
On a sorted field, however, a binary search can be used, which visits only log2 N blocks. Likewise, for a sorted non-key field, the scan can stop as soon as a value larger than the target is found; there is no need to search any further blocks. The performance improvement is therefore substantial.
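To make these numbers concrete, here is a minimal Python sketch (an illustration only) that counts accesses for a linear scan versus a binary search, with each element of a sorted list standing in for one disk block:

```python
def linear_search(blocks, target):
    """Scan blocks front to back; count how many blocks are touched."""
    accesses = 0
    for value in blocks:
        accesses += 1
        if value == target:
            return accesses
    return accesses

def binary_search(blocks, target):
    """Repeatedly halve the search range; count how many blocks are touched."""
    lo, hi, accesses = 0, len(blocks) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        accesses += 1
        if blocks[mid] == target:
            return accesses
        elif blocks[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return accesses

blocks = list(range(1_000_000))         # one value per "block", already sorted
print(linear_search(blocks, 500_000))   # 500,001 accesses -- about N/2
print(binary_search(blocks, 500_000))   # at most 20 accesses -- ceil(log2 N)
```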
What is an index
An index is a way of sorting a number of records on one or more fields. Creating an index on a field in a table creates another data structure that holds the field's value together with a pointer to the record it belongs to. This index structure is then sorted, allowing binary searches to be performed on it.
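As a rough illustration (a toy model, not how any real engine stores things), an index can be pictured as a sorted list of (value, record pointer) pairs held separately from the table:

```python
import bisect

# Hypothetical table: records addressed by row id (our stand-in for a pointer).
table = {
    0: {"firstName": "Mia",  "emailAddress": "mia@example.com"},
    1: {"firstName": "Alex", "emailAddress": "alex@example.com"},
    2: {"firstName": "Zoe",  "emailAddress": "zoe@example.com"},
}

# Index on firstName: sorted (value, pointer) pairs, kept apart from the table.
index = sorted((rec["firstName"], rid) for rid, rec in table.items())

def lookup(name):
    """Binary-search the index, then follow each pointer to its record."""
    i = bisect.bisect_left(index, (name,))
    matches = []
    while i < len(index) and index[i][0] == name:
        matches.append(table[index[i][1]])   # follow the record pointer
        i += 1
    return matches

print(lookup("Alex"))  # [{'firstName': 'Alex', 'emailAddress': 'alex@example.com'}]
```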
The disadvantage of an index is that it requires additional disk space. With the MyISAM engine, a table's indexes are stored together in one file, so indexing many fields in the same table can quickly grow that file to the size limit imposed by the file system.
How the Index Works
First, consider the schema of a sample database table:
Field name          Data type       Size on disk
id (Primary key)    Unsigned INT    4 bytes
firstName           Char(50)        50 bytes
lastName            Char(50)        50 bytes
emailAddress        Char(100)       100 bytes
Note: char was used rather than varchar so that the size each record occupies on disk is exact and fixed. The sample database contains 5,000,000 rows and has no indexes. We will now analyze two queries against this table: one on id (the sorted key field) and one on firstName (an unsorted non-key field).
Example Analysis One
For the sample database of r = 5,000,000 records, each record has a fixed length of R = 204 bytes. The table is stored with the MyISAM engine, whose default block size is B = 1,024 bytes. The blocking factor of the table is therefore bfr = (B/R) = 1024/204 = 5 records per disk block, and the number of blocks required to hold the entire table is N = (r/bfr) = 5000000/5 = 1,000,000 blocks.
A linear search on the id field, a key field (every value is unique), would require N/2 = 500,000 block accesses on average to find a value. But because this field is sorted, a binary search can be used instead, requiring only log2 1000000 = 19.93, i.e. 20 block accesses. This alone is a huge performance improvement.
Now consider the firstName field. It is unsorted, so a binary search is impossible, and its values are not unique, so the table must be read from beginning to end: exactly N = 1,000,000 block accesses. This is the situation an index is meant to correct.
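The example-one figures can be checked with a few lines of Python:

```python
import math

r = 5_000_000   # number of records
R = 204         # record length in bytes (4 + 50 + 50 + 100)
B = 1024        # disk block size in bytes

bfr = B // R                # blocking factor: records per block -> 5
N = math.ceil(r / bfr)      # blocks spanned by the table -> 1,000,000

print(bfr, N)
print(N // 2)                    # average linear-search cost: 500,000 blocks
print(math.ceil(math.log2(N)))   # binary-search cost on sorted id: 20 blocks
print(N)                         # full scan on unsorted firstName: 1,000,000 blocks
```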
An index record contains only the indexed field and a pointer to the original record, so it is necessarily smaller than the multi-field record it points to. The index therefore occupies fewer disk blocks than the original table, and scanning it requires fewer block accesses. The schema of an index on the firstName field looks like this:
Field name          Data type       Size on disk
firstName           Char(50)        50 bytes
(record pointer)    Special         4 bytes
Note: In MySQL, depending on the size of the table, the size of the pointer may be 2, 3, 4, or 5 bytes.
Example Analysis Two
For the sample database of r = 5,000,000 records, each index record is R = 54 bytes long, and the same default block size B = 1,024 bytes is used. The blocking factor of the index is then bfr = (B/R) = 1024/54 = 18 records per block, and the index occupies N = (r/bfr) = 5000000/18 = 277,778 blocks.
Searches on the firstName field can now use the index to improve performance. A binary search of the index requires log2 277778 = 18.09, i.e. 19 block accesses. One more block access is needed to fetch the actual record from the address found, for a total of 19 + 1 = 20 block accesses, a far cry from the 1,000,000 block accesses required to search the unindexed table.
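Likewise, the example-two arithmetic:

```python
import math

r = 5_000_000   # number of records
R = 54          # index record length in bytes (50 + 4)
B = 1024        # disk block size in bytes

bfr = B // R                    # blocking factor of the index -> 18
N = math.ceil(r / bfr)          # blocks occupied by the index -> 277,778

index_search = math.ceil(math.log2(N))   # 19 block accesses for the binary search
total = index_search + 1                 # + 1 block to read the record itself
print(N, index_search, total)            # 277778 19 20
```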
When to use an index
Creating an index requires additional disk space (277,778 extra blocks in the example above, roughly a 28% increase over the table's 1,000,000 blocks), and too many indexes can exhaust the available disk space. So choose the fields to be indexed carefully.
Since an index can only speed up the search for a matching value, indexing a field that is never searched on is pure waste: the index consumes disk space and adds processing time to every insert and delete operation, where it must be maintained but is never used. In addition, because of how binary search works, the cardinality, or uniqueness, of the data matters. Indexing a field with a cardinality of 2 merely splits the data in half, and with this table's 5,000,000 rows, a cardinality of 1,000 means roughly 5,000 matching records per value. At such low cardinality the effectiveness of the index drops toward that of a linear scan, and the query optimizer will typically discard the index when the cardinality is less than about 30% of the record count, at which point the index is effectively just wasted space.
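The cardinality point can be made concrete with a simplified model that assumes values are evenly distributed across records:

```python
r = 5_000_000   # total records in the table

def expected_matches(cardinality):
    """With evenly distributed values, each lookup matches r/cardinality records."""
    return r // cardinality

for c in (2, 1_000, 100_000, 5_000_000):
    print(f"cardinality={c:>9}: ~{expected_matches(c):>9} rows per lookup, "
          f"selectivity={c / r:.6f}")

# cardinality=2 returns half the table per lookup -> the index barely helps;
# cardinality=5,000,000 (a unique key) returns 1 row -> the index is ideal.
```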
How the query optimizer works
The hardest problem in query optimization is accurately estimating the costs of alternative query plans. Optimizers cost a plan using a mathematical model that depends heavily on estimates of the cardinality (the number of rows) flowing through each step of the plan, and cardinality estimation in turn depends on estimating the selectivity of the query's predicates (the selection factor of predicates). Traditionally, database systems estimate selectivities from fairly detailed statistics on the distribution of values in each column, such as histograms. This technique works well for isolated predicates. Many queries, however, have correlated predicates, for example select count(*) from R where R.make='Honda' and R.model='Accord'. Such predicates are often highly correlated (here model='Accord' implies make='Honda'), and estimating the selectivity of the conjunction is very difficult. Query optimizers choose poor query plans partly because of inaccurate cardinality estimates and partly because of missed correlations. This is why database administrators should update database statistics regularly, especially after significant data loads or unloads. (Translated from Wikipedia: http://en.wikipedia.org/wiki/Query_optimizer.)
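A toy model (all numbers made up for illustration) shows why the common independence assumption goes wrong on correlated predicates:

```python
# Hypothetical single-column statistics for table R (made-up numbers).
total_rows = 1_000_000
sel_make_honda = 0.10      # 10% of rows have make = 'Honda'
sel_model_accord = 0.03    # 3% of rows have model = 'Accord'

# A classic optimizer assumes predicates are independent and multiplies
# their selectivities:
independent_estimate = total_rows * sel_make_honda * sel_model_accord
print(independent_estimate)   # 3,000 rows

# But model = 'Accord' implies make = 'Honda': every Accord is a Honda,
# so the true selectivity of the conjunction is just sel_model_accord.
true_rows = total_rows * sel_model_accord
print(true_rows)              # 30,000 rows -- 10x the independent estimate
```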