Storage and indexing of database management systems

Source: Internet
Author: User

1. Basic concepts

8.1 Data on external storage

The unit of information read from or written to disk is a page. The page size is a DBMS parameter, with typical values of 4 KB or 8 KB. The cost of page I/O (input from disk to main memory and output from main memory to disk) dominates the cost of typical database operations, so database systems are carefully optimized to minimize this cost.

Remember the main points:

1. Disks are the most important external storage devices. They allow us to retrieve any page at a (more or less) fixed cost per page. However, reading several pages sequentially, in the physical order in which they are stored, can cost much less than reading the same pages in random order.

2. Tapes are sequential access devices that force us to read data one page after another, and are therefore typically used to archive data that is rarely needed.

3. Each record in a file has a unique identifier called the record ID, or RID. A RID has the property that it can be used to identify the disk address of the page containing the record.

Data is read into memory for processing and written back to disk for persistent storage by a layer of software called the buffer manager. Space on disk is managed by the disk space manager.

8.2 File organization and Index

A file of records is an important abstraction in a DBMS. A file can be created or destroyed, and records can be inserted into it or deleted from it. The file layer stores the records of a file in a collection of disk pages.

The simplest file structure is an unordered file, also called a heap file.

An index is a data structure on disk that is organized to optimize certain kinds of retrieval operations. We use the term data entry to refer to the records stored in an index file. A data entry with search key value k, denoted k*, contains enough information to locate the data records with search key value k.

There are three main alternatives for what to store as a data entry in an index:
1. The data entry k* is an actual data record (with search key value k).

2. The data entry is a <k, rid> pair, where rid is the record ID of a data record with search key value k.

3. The data entry is a <k, rid-list> pair, where rid-list is a list of record IDs of data records with search key value k.
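
The three alternatives can be sketched in a few lines of Python. This is only an illustration: the Record and RID layouts, the sample rows, and the choice of age as the search key are assumptions made for the example, not part of any particular DBMS.

```python
from collections import namedtuple

# Hypothetical record and RID layouts, used only for illustration.
Record = namedtuple("Record", ["name", "age", "sal"])
RID = namedtuple("RID", ["page_id", "slot"])

# A tiny "file": each record is addressed by its RID.
records = {
    RID(1, 0): Record("Ann", 25, 4000),
    RID(1, 1): Record("Bob", 30, 3500),
    RID(2, 0): Record("Cid", 25, 5000),
}

# Alternative 1: the data entry is the data record itself (search key: age).
alt1 = sorted(records.values(), key=lambda r: r.age)

# Alternative 2: one <k, rid> pair per data record.
alt2 = sorted((rec.age, rid) for rid, rec in records.items())

# Alternative 3: one <k, rid-list> pair per distinct search key value.
alt3 = {}
for age, rid in alt2:
    alt3.setdefault(age, []).append(rid)
```

Alternative 3 stores each key value only once, at the price of variable-length entries.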

8.2.1 Clustered Index

Definition: If the data records of a file are ordered in the same way as, or close to, the data entries of an index, the index is clustered; otherwise it is unclustered. An index that uses the first alternative above is clustered by definition. An index that uses the second or third alternative is clustered only if the data records are sorted on the search key. If the data records are instead stored in arbitrary order, determined only by their physical placement, the order of the data entries has no relationship to the order of the data records, and the index is unclustered.

Why can an index improve performance? Because it lets the system quickly narrow the scope of the search.

8.3 Index data structures

a) Hash-based index:

The records in the file are grouped into buckets, where a bucket consists of a primary page and, possibly, additional overflow pages linked in a chain. The bucket to which a record belongs is determined by applying a special function, called a hash function, to the search key. Given a bucket number, a hash-based index structure allows us to retrieve the primary page of the bucket in one or two disk I/Os.
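
A minimal sketch of this idea, assuming a fixed number of buckets and using Python lists to stand in for a primary page plus its overflow chain. The names NUM_BUCKETS, bucket_of, insert and lookup are hypothetical and chosen only for the example.

```python
NUM_BUCKETS = 4  # assumed fixed number of buckets

def bucket_of(key):
    """Hash function: maps a search key value to a bucket number."""
    return hash(key) % NUM_BUCKETS

# Each list stands in for a primary page plus any overflow pages.
buckets = [[] for _ in range(NUM_BUCKETS)]

def insert(key, rid):
    """Place the data entry <key, rid> in the bucket chosen by the hash function."""
    buckets[bucket_of(key)].append((key, rid))

def lookup(key):
    """Equality search: only the one bucket selected by the hash function is read."""
    return [rid for k, rid in buckets[bucket_of(key)] if k == key]

insert(25, "rid-a")
insert(30, "rid-b")
insert(25, "rid-c")
matches = lookup(25)
```

Because entries with nearby key values usually land in unrelated buckets, this structure offers no help for range searches.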

b) Tree-based index:

The data entries are arranged in sorted order by search key value, and a hierarchical search structure is maintained that directs searches to the page containing the desired data entries.

Unclustered hash and tree indexes offer fast insertions and deletions, but scans and range searches with many matching results are slow. For equality searches a hash index is slightly faster, but a hash index does not support range selections.
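
The reason a tree-based index supports range searches can be sketched with a sorted list, which here stands in for the leaf level of the tree, and a binary search, which stands in for the descent from the root. The sample entries and the function name range_search are assumptions made for the example.

```python
import bisect

# Data entries <age, rid>, kept in search key order as at the leaf level of a tree index.
entries = sorted([(25, "r1"), (41, "r2"), (33, "r3"), (58, "r4"), (47, "r5")])
keys = [k for k, _ in entries]

def range_search(low, high):
    """Return the rids with low <= key <= high.

    A binary search finds the first qualifying entry; the remaining
    matches are adjacent to it, so a short sequential scan suffices.
    """
    start = bisect.bisect_left(keys, low)
    end = bisect.bisect_right(keys, high)
    return [rid for _, rid in entries[start:end]]

in_forties = range_search(40, 50)
```

A hash-based structure cannot do this, because entries with neighboring key values are scattered across buckets.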

Comparison of I/O costs:

Let B be the number of data pages when records are packed onto pages with no wasted space, and R the number of records per page. The average time to read or write a disk page is D, and the average time to process a record is C. In a hashed file organization, a function (the hash function) maps a record to a range of numbers; the time required to apply the hash function to a record is H. For tree indexes, F denotes the fan-out. These parameters allow the I/O costs of the different file organizations to be compared.
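
As a rough illustration of how these parameters combine, the sketch below uses the usual textbook-style estimates for a heap file and a sorted file. The formulas are simplified, the function names and sample values are assumptions, and H and F are not needed for these two organizations.

```python
import math

def heap_scan_cost(B, R, D, C):
    """Scan a heap file: read all B pages and process R records on each."""
    return B * (D + R * C)

def heap_equality_cost(B, R, D, C):
    """Equality search on a key in a heap file: on average half the file is read."""
    return 0.5 * B * (D + R * C)

def sorted_equality_cost(B, R, D, C):
    """Equality search in a sorted file: binary search over pages, then within a page."""
    return D * math.log2(B) + C * math.log2(R)

# Assumed sample values: 1000 pages, 100 records per page,
# 15 ms per page I/O, 100 ns to process one record (times in seconds).
B, R, D, C = 1000, 100, 0.015, 1e-7
scan = heap_scan_cost(B, R, D, C)
heap_eq = heap_equality_cost(B, R, D, C)
sorted_eq = sorted_equality_cost(B, R, D, C)
```

With these values the I/O term dominates: D is several orders of magnitude larger than C, which is why the analysis concentrates on page I/Os.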

a) Design examples for clustered indexes:

SELECT E.dno
FROM Employees E
WHERE E.age > 40

Consider a B+ tree index on age. Whether such an index is worth building depends first of all on the selectivity of the condition. If virtually everyone is older than 40, the index offers very little advantage. Suppose instead that only 10% of the employees are older than 40. Is the index useful then? The answer depends on whether the index is clustered. If it is unclustered, retrieving each qualifying employee may cost one page I/O, which can be more expensive than a sequential scan of the file. If it is clustered, only about 10% of the I/Os of a sequential scan are needed.
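
The argument can be checked with a back-of-the-envelope calculation. The sketch below counts page I/Os only and ignores the few I/Os needed to traverse the index itself; the function names and the sample file size are assumptions made for the example.

```python
def full_scan_ios(num_pages):
    """A sequential scan reads every data page once."""
    return num_pages

def clustered_index_ios(num_pages, selectivity):
    """Qualifying records are stored together, so only that fraction of pages is read."""
    return round(num_pages * selectivity)

def unclustered_index_ios(num_pages, records_per_page, selectivity):
    """Worst case: every qualifying record costs its own page I/O."""
    return round(num_pages * records_per_page * selectivity)

# Assumed file: 1000 pages, 100 records per page, 10% of employees older than 40.
scan_ios = full_scan_ios(1000)
clustered_ios = clustered_index_ios(1000, 0.10)
unclustered_ios = unclustered_index_ios(1000, 100, 0.10)
```

Under these assumptions the clustered index needs a tenth of the I/Os of a scan, while the unclustered index may need ten times as many as the scan.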

SELECT E.dno, COUNT(*)
FROM Employees E
WHERE E.age > 10
GROUP BY E.dno

Suppose there is a B+ tree index on age. We could use it to retrieve the qualifying records, sort them on dno, and so answer the query. If almost all employees are older than 10, however, this is not a good plan, and if the index is unclustered it is probably the worst one.

Now consider whether an index on dno would serve better. We could use that index to retrieve all records grouped by dno, and then count, for each dno value, the records with age greater than 10. (This strategy only requires the records to be grouped, not sorted.) Again, the efficiency of this plan depends heavily on whether the index is clustered. If it is unclustered, each record retrieved may cost one page I/O, and the cost of the plan becomes prohibitive.

Conclusion: If the condition on age is not very selective (for example, when as many as 80% of the records satisfy it), a clustered index should be built on dno. If the condition on age is selective, consider an index on age (not necessarily clustered) to narrow the search quickly.

Clustering is also important for an index on a search key that does not include a candidate key, that is, an index in which several data entries can have the same key value.

SELECT E.dno
FROM Employees E
WHERE E.hobby = 'stamps'

If many people collect stamps, retrieving the records through an unclustered index is inefficient. It may even be cheaper to ignore the index and simply scan all the records. A clustered index on hobby should therefore be considered.

Aggregate operations also affect the choice of index:

SELECT E.dno, COUNT(*)
FROM Employees E
GROUP BY E.dno

The most straightforward plan is to sort Employees on dno and then count the employee records for each dno value. However, if an index on dno is available, the query can be answered by scanning the index alone, without reading the data records.
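
Such an index-only evaluation can be sketched as follows, assuming the index on dno holds <dno, rid> data entries in dno order. The sample entries and the function name count_per_dno are hypothetical.

```python
from itertools import groupby

# Hypothetical data entries of an index on dno: <dno, rid> pairs in dno order.
dno_index = [(10, "r3"), (10, "r7"), (20, "r1"), (20, "r4"), (20, "r9"), (30, "r2")]

def count_per_dno(index_entries):
    """Count entries per dno by scanning the index; no data record is fetched."""
    return {
        dno: sum(1 for _ in group)
        for dno, group in groupby(index_entries, key=lambda entry: entry[0])
    }

counts = count_per_dno(dno_index)
```

The rids are never followed, so the cost is that of scanning the index, which is normally much smaller than the data file. Note that groupby only merges adjacent equal keys, so the sketch relies on the entries being grouped by dno.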

b) Design examples for composite search keys

A composite search key supports a wider range of queries because it can match more selection conditions.

Consider the following query, which returns the employees with 20 < age < 30 and 3000 < sal < 5000:

SELECT E.eid
FROM Employees E
WHERE E.age BETWEEN 20 AND 30
AND E.sal BETWEEN 3000 AND 5000

A composite index on <age, sal> or <sal, age> can help with this query if the conditions in the WHERE clause are selective. Such an index should clearly be a B+ tree index rather than a hash index, because a hash index cannot answer range queries.

The order of the fields in the search key can sometimes make a big difference:

SELECT E.eid
FROM Employees E
WHERE E.age = 25
AND E.sal BETWEEN 3000 AND 5000

A B+ tree index on <age, sal> performs well here because the entries are sorted first by age and then, within each age value, by sal, so the matching entries are adjacent. An index on <sal, age> does not perform as well: the entries are sorted first by sal, so two entries with the same age value may be far apart.
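
The effect of field order can be made concrete by counting how many index entries each ordering has to examine for age = 25 and sal between 3000 and 5000. In this sketch sorted Python lists stand in for the leaf levels of the two B+ trees; the sample data and function names are assumptions.

```python
import bisect

# Hypothetical (age, sal) values for eight employees.
employees = [(25, 2000), (25, 3500), (25, 4800), (25, 9000),
             (31, 3200), (40, 4100), (52, 4900), (60, 3000)]

age_sal = sorted(employees)                     # leaf entries of an index on <age, sal>
sal_age = sorted((s, a) for a, s in employees)  # leaf entries of an index on <sal, age>

def scanned_age_sal(age, lo, hi):
    """Entries examined on <age, sal>: one contiguous run for this age and salary range."""
    start = bisect.bisect_left(age_sal, (age, lo))
    end = bisect.bisect_right(age_sal, (age, hi))
    return end - start

def scanned_sal_age(lo, hi):
    """Entries examined on <sal, age>: the whole salary range, whatever the age."""
    start = bisect.bisect_left(sal_age, (lo, float("-inf")))
    end = bisect.bisect_right(sal_age, (hi, float("inf")))
    return end - start

on_age_sal = scanned_age_sal(25, 3000, 5000)
on_sal_age = scanned_sal_age(3000, 5000)
```

On this data the <age, sal> ordering examines only the two matching entries, while the <sal, age> ordering examines six entries to find the same two.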

Composite indexes are also useful for aggregate queries:

SELECT AVG(E.sal)
FROM Employees E
WHERE E.age = 25
AND E.sal BETWEEN 3000 AND 5000

With an index on <age, sal> or <sal, age>, the query can be answered by scanning the index alone, without retrieving the data records.
