How does a relational database work: cache (translated from the Coding-geek article)


Original link: http://coding-geek.com/how-databases-work/#Buffer-replacement_strategies

The cache chapter is translated first; the other chapters will be translated later. The translated sections keep the structure of the original article.

I. Data Manager

When a query is executed, the query executor fetches data from the tables by sending a request to the data manager. This raises two problems:

    1. Relational databases use a transactional model, so a query must not read data that another operation is modifying at the same time; dirty reads have to be avoided.
    2. Data retrieval is the slowest operation in a database, because the data has to be read from disk. Therefore, the database needs a very powerful data caching system.

In this chapter, we'll look at how relational databases solve these two problems. We won't discuss how the database loads data from disk, since that is not the focus of this article (and space is limited).

II. Cache Manager

As I said before, a database's performance bottleneck is I/O. To improve performance, modern databases use a cache manager.

Instead of reading data directly from disk files, the query executor asks the cache manager for the data. The cache manager manages an in-memory area called the buffer pool. Getting data straight from memory gives database access a huge performance boost. However, it is hard to say exactly how much the cache matters; the answer depends on the kind of operation you are performing:

    • Sequential access vs. random access
    • Read operations vs. write operations

and on the kind of disk the database uses:

    • 7200/10k/15k rpm HDDs
    • SSDs
    • RAID 1/5/...

However, I'd venture that reading data from the memory cache is 100 to 100,000 times faster than reading it directly from disk without a cache.
This leads to another problem (every database has it ...): the cache needs to prefetch data before a query accesses it; otherwise the query has to pause and wait while the cache loads the data from disk into memory.

III. Cache Prefetching

The heart of the problem is data prefetching. The query executor knows what data it will need, because it understands the specific requirements of each query and also knows how the tables are stored. The basic prefetching logic goes like this (a code sketch follows the list):

    1. While processing the first batch of data, the query executor asks the cache manager to load the second batch into the buffer in advance.
    2. When it starts on the second batch, it asks the cache manager to load the third batch in advance; the first batch can now be removed from the cache.
    3. .......
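
As a rough illustration, here is a minimal Python sketch of this double-buffering loop, with hypothetical load_from_disk and process stand-ins (my names, not from the original article); a real database would use its own I/O machinery rather than a thread pool:

    from concurrent.futures import ThreadPoolExecutor

    def load_from_disk(batch_id):
        # Hypothetical stand-in for a real disk read.
        return f"data for batch {batch_id}"

    def process(batch):
        # Hypothetical stand-in for the query executor's work.
        print("processing", batch)

    def run_query(batch_ids):
        if not batch_ids:
            return
        with ThreadPoolExecutor(max_workers=1) as io_thread:
            # Start loading the first batch.
            future = io_thread.submit(load_from_disk, batch_ids[0])
            for next_id in batch_ids[1:]:
                # Before consuming the current batch, ask the cache
                # manager to load the next one in the background.
                next_future = io_thread.submit(load_from_disk, next_id)
                process(future.result())  # current batch, already in memory
                future = next_future      # the old batch can leave the cache
            process(future.result())      # last batch

    run_query([1, 2, 3, 4])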

The cache manager stores all of this data in the buffer pool. To decide whether data in the pool is still being used, the cache manager has to keep extra bookkeeping information about it (a kind of lock called a latch).

Sometimes, however, the query executor doesn't know what data will be needed next, or the database offers no way to specify which data to prefetch. Instead, the database may provide speculative prefetching (for example, if a query is likely to need data blocks 7, 8, and 9 soon, they are loaded into the cache in advance) or sequential prefetching (after a read, the data adjacent to it on disk is also loaded into the cache).

To assess how well the prefetching mechanism works, modern databases expose a metric: the cache hit ratio. It describes how often the query executor finds the requested data in the cache (without having to read from disk).
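
The ratio itself is simple arithmetic; a tiny Python illustration (mine, not from the article):

    def cache_hit_ratio(hits, misses):
        # Fraction of reads served from the buffer pool instead of disk.
        total = hits + misses
        return hits / total if total else 0.0

    # e.g. 9,900 reads answered from the cache, 100 that had to hit the disk
    print(cache_hit_ratio(9900, 100))  # 0.99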

Note: a poor cache hit ratio does not always mean the cache is working badly. For more information, refer to the Oracle documentation.

However, cache memory is limited, so its contents must be replaced constantly. Loading and evicting cached data costs disk I/O and network I/O. If a query is executed frequently, repeatedly loading and evicting the data it needs is very inefficient. To solve this problem, modern databases use buffer replacement strategies.

IV. Buffer Replacement Strategies

Most modern databases use some variant of the LRU algorithm for buffer replacement; this is true at least of SQL Server, MySQL, Oracle, and DB2.

1. LRU

LRU stands for Least Recently Used. The algorithm rests on the assumption that data used recently is very likely to be used again soon and should stay in the cache, while data that has not been used recently can be removed.

For ease of understanding, assume that the data in the cache is not latched (and can therefore be removed). In this simple example the buffer pool holds 3 elements; the walkthrough below shows how the algorithm behaves (a code sketch follows it):

    1. The cache manager uses data 1 and puts it into the buffer.
    2. The cache manager uses data 4 and puts it into the buffer.
    3. The cache manager uses data 3 and puts it into the buffer.
    4. The cache manager uses data 9. The buffer is full, so one piece of data must be removed first. Which one?
      Following the LRU rule, 1 is the least recently used, so 1 is removed and 9 is added.
    5. The cache manager uses data 4 again; 4 becomes the most recently used and its position is updated.
    6. The cache manager uses data 1 again; 1 becomes the most recently used, and 3 is removed.
    7. ......
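
Here is a minimal Python sketch of that behavior (my illustration, not code from the article), using an OrderedDict ordered from least to most recently used:

    from collections import OrderedDict

    class LRUCache:
        def __init__(self, capacity=3):   # the 3-slot pool from the walkthrough
            self.capacity = capacity
            self.pool = OrderedDict()     # least recently used entry first

        def use(self, key):
            if key in self.pool:
                # Steps 5-6: a reused entry becomes the most recently used.
                self.pool.move_to_end(key)
            else:
                if len(self.pool) >= self.capacity:
                    # Step 4: evict the least recently used entry.
                    evicted, _ = self.pool.popitem(last=False)
                    print(f"evicting {evicted}")
                self.pool[key] = "page data"
            print("cache order:", list(self.pool))

    cache = LRUCache()
    for data in (1, 4, 3, 9, 4, 1):       # the sequence from the walkthrough
        cache.use(data)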

The algorithm works well, but it has a limitation. What if a query reads a large table, one whose data exceeds the size of the cache? Plain LRU would then evict everything previously cached, even though the newly loaded table data may be used only once and never again.

2. Algorithm improvements

To deal with this problem, some database management systems add special rules. For example, as the Oracle documentation describes:

For very large tables, the database reads data directly from the disk files, bypassing the cache. For medium tables, the database can either read directly from disk or use the cache; if it uses the cache, it puts the pages it reads at the end of the LRU list (so those pages are removed first when new data is cached).
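
Building on the LRUCache sketch above, a hedged illustration of that "insert at the cold end" rule (again mine, not Oracle's actual mechanism):

    class ScanAwareLRUCache(LRUCache):
        # Pages read by a scan of a medium-sized table enter at the cold
        # (least recently used) end, so they are the first to be evicted.
        def use(self, key, from_table_scan=False):
            if from_table_scan and key not in self.pool:
                if len(self.pool) >= self.capacity:
                    evicted, _ = self.pool.popitem(last=False)
                    print(f"evicting {evicted}")
                self.pool[key] = "page data"
                self.pool.move_to_end(key, last=False)  # cold end of the list
                print("cache order:", list(self.pool))
            else:
                super().use(key)

    cache = ScanAwareLRUCache()
    cache.use("hot page")
    cache.use("scan page", from_table_scan=True)  # sits at the cold end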

The LRU algorithm has a more advanced version called LRU-K. SQL Server, for example, uses LRU-K with K=2. K is the number of most recent accesses to a piece of data that the algorithm takes into account.
The previous walkthrough was the simplest case, LRU-K with K=1, which considers only one access. The principle of LRU-K is as follows:

    1. Record the K most recent accesses to each piece of data.
    2. Give each piece of data a weight based on its accesses: the more (and the more recent) the accesses, the higher the weight.
    3. When a new batch of data is loaded into the cache, data with a high weight is not removed, even if it was loaded into the cache long ago.
    4. If data is not reused for a long time, its weight gradually decreases.

Computing these weights is resource-intensive, which is why SQL Server uses K=2: it performs well for an acceptable overhead.
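
To make the idea concrete, here is a toy LRU-2 eviction rule in Python (my sketch under simplified assumptions, not SQL Server's implementation): each page remembers its last K accesses, and the page whose K-th most recent access is oldest gets evicted.

    import itertools
    from collections import defaultdict, deque

    class LRUKCache:
        def __init__(self, capacity=3, k=2):   # K=2, as in SQL Server
            self.capacity, self.k = capacity, k
            self.pool = set()                  # pages currently cached
            self.clock = itertools.count()     # logical access time
            self.history = defaultdict(lambda: deque(maxlen=k))  # last k accesses

        def use(self, page):
            self.history[page].append(next(self.clock))
            if page in self.pool:
                return
            if len(self.pool) >= self.capacity:
                # Evict the page whose k-th most recent access is oldest.
                # Pages touched fewer than k times rank lowest of all;
                # ties are broken by plain least-recently-used order.
                def rank(p):
                    h = self.history[p]
                    kth = h[0] if len(h) == self.k else -1
                    return (kth, h[-1])
                victim = min(self.pool, key=rank)
                self.pool.discard(victim)
                print(f"evicting {victim}")
            self.pool.add(page)

    cache = LRUKCache()
    for page in (1, 4, 3, 1, 9):
        cache.use(page)
    # Page 1 was touched twice, so despite being oldest it survives; 4 goes.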

For more in-depth knowledge of the LRU algorithm, you can refer to the algorithm documentation (Google Docs).

3. Other algorithms

There are also other algorithmic strategies for managing the cache.

    • 2Q (an LRU-K-like algorithm)
    • CLOCK (an LRU-K-like algorithm)
    • MRU (Most Recently Used: the same logic as LRU but with a different rule)
    • LRFU (Least Recently and Frequently Used)
    • ......

Some databases let you use an algorithm other than the default, so a variety of options are available.

V. Write Buffer

So far we have mostly discussed the read buffer, which loads data into memory before it is used. Databases also have a write buffer, which accumulates multiple modifications to data in memory and writes them to the disk files in one go, reducing frequent disk access (remember, the database's bottleneck is I/O).

Keep in mind that the cache stores pages of data, not the individual rows one might intuitively imagine. A page that has been modified in the cache but not yet saved to disk is called a dirty page. Several strategies exist for choosing the best time to write dirty pages to disk, and that choice is strongly tied to transactions (the topic of the next section).
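
As a closing illustration (a minimal sketch with a hypothetical write_page_to_disk stand-in, not any actual database's flushing policy), a write buffer just marks pages dirty in memory and flushes them in one pass:

    def write_page_to_disk(page_id, data):
        # Hypothetical stand-in for a real disk write.
        print(f"writing page {page_id} to disk")

    class WriteBuffer:
        def __init__(self):
            self.pages = {}     # page_id -> data: the in-memory pages
            self.dirty = set()  # pages modified but not yet on disk

        def modify(self, page_id, data):
            # Modifications accumulate in memory; no disk I/O happens here.
            self.pages[page_id] = data
            self.dirty.add(page_id)

        def flush(self):
            # One pass writes every dirty page, then clears the flags.
            for page_id in sorted(self.dirty):
                write_page_to_disk(page_id, self.pages[page_id])
            self.dirty.clear()

    buf = WriteBuffer()
    buf.modify(1, "row v1")
    buf.modify(1, "row v2")  # a second change to the same page: still one write
    buf.modify(2, "row")
    buf.flush()              # only now does disk I/O happen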

