SSD-based database performance optimization

Source: Internet
Author: User

NOR and NAND

NOR and NAND are both flash memory technologies. NOR flash was first commercialized by Intel. It behaves somewhat like RAM: any memory cell can be accessed directly by address. Its disadvantages are low density (small capacity) and slow write and erase speeds. NAND flash was introduced by Toshiba. It offers high density (large capacity) and fast write and erase speeds, but it can only be accessed through a dedicated I/O interface that performs address translation, which makes it behave somewhat like a disk.

The USB flash drives, SD cards, and SSDs in wide use today are all NAND-based. Vendors package the flash memory behind different interfaces. For example, Intel SSDs use a SATA interface and look like an ordinary SATA disk, while some enterprise flash cards, such as those from Fusion-io, are packaged as PCIe devices.


SLC stands for single-level cell and MLC for multi-level cell. The difference between the two is how much data each cell stores (density). An SLC cell stores one bit and has only two voltage states (0 and 1), while an MLC cell stores two bits and has four voltage states (00, 01, 10, and 11). MLC therefore offers more capacity than SLC, but SLC is simpler and more reliable, reads and writes faster, and lasts longer: an MLC cell can be erased roughly 10,000 times, while an SLC cell can be erased roughly 100,000 times. For this reason, enterprise flash storage products generally use SLC, which is why they are much more expensive than consumer products.

Technical Features of SSD

Compared with a traditional disk, an SSD has no mechanical parts, and the storage medium changes from magnetic media to semiconductor flash cells. An SSD contains an FTL (flash translation layer), which plays a role comparable to the controller in a disk. Its main function is to map the disk's logical block addresses (LBA) to physical flash addresses, so that access is transparent to the OS.

An SSD has none of the seek time or rotational latency of a traditional disk, so it offers very high random read performance, which is its biggest advantage. An SLC SSD can usually deliver more than 35,000 IOPS, whereas a traditional 15K RPM SAS disk can deliver at most about 160 IOPS; for a traditional disk, 35,000 IOPS is an almost astronomical figure. The SSD's advantage in sequential reads is much less pronounced, because sequential reads on a traditional disk need no seeks. A 15K RPM SAS disk can reach a sequential read throughput of about 130 MB/s, and an SLC SSD reaches 170-200 MB/s. So although the SSD has higher throughput than a traditional disk, its advantage there is not large.
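To put these figures side by side, the short calculation below uses only the numbers quoted above (35,000 and 160 IOPS, 130 and 170-200 MB/s). The variable names are illustrative and not part of any tool.

```python
# Figures quoted in the text; names are illustrative only.
ssd_random_read_iops = 35_000   # SLC SSD
sas_random_read_iops = 160      # 15K RPM SAS disk
sas_seq_read_mbps = 130
ssd_seq_read_mbps = (170, 200)

# How many SAS disks would be needed to match one SSD on random reads?
disks_to_match = ssd_random_read_iops / sas_random_read_iops

# Sequential read advantage of the SSD, as a ratio range.
seq_ratio = tuple(round(x / sas_seq_read_mbps, 2) for x in ssd_seq_read_mbps)

print(f"random read: 1 SSD ~ {disks_to_match:.0f} SAS disks")
print(f"sequential read: SSD is {seq_ratio[0]}x to {seq_ratio[1]}x faster")
```

The gap is more than two orders of magnitude for random reads but well under 2x for sequential reads, which is why the rest of the discussion concentrates on random I/O.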

SSD write operations are special. The smallest write unit of an SSD is 4 KB, called a page. Writing to a blank location can be done in 4 KB units, but rewriting a location requires an extra erase step. The erase unit is generally 128 pages (512 KB), and each erase unit is called a block. Writing to a blank page needs no erase. To rewrite the data in a page, however, the SSD must first read the entire block into its cache, modify the data, erase the entire block, and then write the entire block back. Rewriting data is therefore very expensive for an SSD. This characteristic is called erase-before-write.
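The erase-before-write behaviour described above can be sketched with a toy model of a single block. This is a simplified illustration, assuming 4 KB pages and 128 pages per block as in the text; the class and attribute names are hypothetical, and a real controller does far more than this.

```python
PAGE_KB = 4
PAGES_PER_BLOCK = 128

class Block:
    """One erase unit: 128 pages, each either blank (None) or holding data."""
    def __init__(self):
        self.pages = [None] * PAGES_PER_BLOCK
        self.erase_count = 0
        self.pages_programmed = 0  # total physical page writes

    def write(self, page_no, data):
        if self.pages[page_no] is None:
            # Blank page: program it directly, no erase needed.
            self.pages[page_no] = data
            self.pages_programmed += 1
            return
        # Rewrite: read the whole block, modify, erase, write everything back.
        cached = list(self.pages)
        cached[page_no] = data
        self.pages = [None] * PAGES_PER_BLOCK   # erase the block
        self.erase_count += 1
        for i, d in enumerate(cached):
            if d is not None:
                self.pages[i] = d
                self.pages_programmed += 1

block = Block()
for i in range(PAGES_PER_BLOCK):
    block.write(i, f"v1-{i}")   # 128 cheap writes to blank pages
block.write(5, "v2-5")          # one rewrite costs an erase plus 128 page writes
print(block.erase_count, block.pages_programmed)
```

Filling the block costs 128 page writes; changing a single page afterwards costs one erase and another 128 page writes.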

In our tests, the random write performance of an SLC SSD reaches about 3,000 IOPS, and sequential write throughput reaches 170-200 MB/s, both much higher than a traditional disk. However, as writing continues and more and more data has to be rewritten, write performance gradually declines. In our tests SLC is clearly better than MLC in this respect: after a long period of writing, the random write IOPS of MLC drops sharply, while SLC performance stays relatively stable. Vendors use a variety of strategies to keep write performance from dropping.

Wear leveling

Because SSDs suffer from write wear, repeatedly writing the same cells for a long time (as with an Oracle redo log) not only causes write performance problems but also greatly shortens the life of the SSD. A balancing algorithm is therefore needed to make sure every cell in the SSD is used evenly. This is called wear leveling.

Wear leveling is also implemented by the FTL, which evens out wear by migrating data. It depends on some reserved space in the SSD. The basic principle is to maintain two block pools: a free block pool (idle pool) and a data block pool. When a page needs to be rewritten (writing it in its original location would require erasing the entire block first), the SSD does not write to the original location. Instead it takes a new block from the free pool, merges the existing data with the data being rewritten, and writes the result to the new blank block, which needs no erase. The original block is marked invalid (waiting to be erased and recycled), and the new block enters the data pool. A background task periodically picks invalid blocks out of the data pool, erases them, and returns them to the free pool. The benefits are that the same block is not written repeatedly and that writes are faster, because the erase step is taken off the write path.
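The two-pool scheme can be sketched as follows. This is a minimal model under stated assumptions: mapping is kept per block rather than per page, the free pool is a simple FIFO queue, and all names (`WearLevelingFTL`, `background_recycle`, and so on) are hypothetical rather than taken from any real firmware.

```python
from collections import deque

class WearLevelingFTL:
    """Toy FTL: a free pool, a data pool, and an invalid list awaiting erase."""
    def __init__(self, num_blocks):
        self.free_pool = deque(range(num_blocks))
        self.mapping = {}                 # logical block -> physical block
        self.invalid = []                 # blocks waiting to be erased
        self.erase_counts = [0] * num_blocks
        self.contents = {}                # physical block -> {page_no: data}

    def write(self, logical, page_no, data):
        new_phys = self.free_pool.popleft()             # blank block, no erase needed
        old_phys = self.mapping.get(logical)
        merged = dict(self.contents.get(old_phys, {}))  # existing data
        merged[page_no] = data                          # plus the rewritten page
        self.contents[new_phys] = merged
        self.mapping[logical] = new_phys
        if old_phys is not None:
            self.invalid.append(old_phys)               # erase later, in background

    def background_recycle(self):
        while self.invalid:
            phys = self.invalid.pop()
            self.contents.pop(phys, None)               # erase the block
            self.erase_counts[phys] += 1
            self.free_pool.append(phys)                 # back to the free pool

ftl = WearLevelingFTL(num_blocks=4)
for version in range(8):
    ftl.write(logical=0, page_no=0, data=version)       # hammer one logical block
    ftl.background_recycle()
print(ftl.erase_counts)
```

Although the same logical block is rewritten eight times, the erases are spread across all four physical blocks instead of landing on one.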

Wear leveling comes in two types: dynamic and static. They work on the same principle. The difference is that a dynamic algorithm only handles dynamic data: migration is triggered only when data is rewritten, so it does nothing for static data. A static algorithm also balances static data: when the background task finds a block of static data with low wear, it migrates the data to other blocks and puts the low-wear block into the free pool for use. In terms of balancing, the static algorithm is better than the dynamic one, because almost all blocks get used evenly and the life of the SSD is greatly extended. Its disadvantage is that write performance may drop while data is being migrated.
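As a rough illustration of the static case, the function below picks a low-wear block that holds static data once it has fallen far enough behind the most worn block. The threshold-based trigger is an assumption made for this sketch; the text does not say how real controllers decide when to migrate, and the function name is hypothetical.

```python
def pick_static_block_to_migrate(erase_counts, holds_static, threshold):
    """Return the index of a low-wear static block worth migrating, or None.

    erase_counts: erase count per physical block
    holds_static: True where a block holds data that is never rewritten
    threshold:    assumed wear gap that triggers a migration
    """
    static_blocks = [b for b, is_static in enumerate(holds_static) if is_static]
    if not static_blocks:
        return None
    coldest = min(static_blocks, key=lambda b: erase_counts[b])
    if max(erase_counts) - erase_counts[coldest] > threshold:
        return coldest          # move its data elsewhere, then reuse this block
    return None

erase_counts = [50, 2, 48, 51]
holds_static = [False, True, False, False]
print(pick_static_block_to_migrate(erase_counts, holds_static, threshold=10))
```

A dynamic-only algorithm would never touch block 1 here, so its erase budget would go unused while the other three blocks wear out.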

Write Amplification

Because an SSD must erase before it writes, there is a concept called write amplification. For example, to rewrite 4 KB of data, the SSD must first read the entire erase block (128 pages of 4 KB, or 512 KB) into its cache, modify it, and write the entire block back. It actually writes 512 KB of data, so the write amplification factor is 128. The best case for write amplification is 1, which means no amplification.
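The arithmetic can be written out as a small calculation, assuming the 4 KB page and 128-page erase block given earlier. The function name is illustrative.

```python
def write_amplification(host_kb, flash_kb):
    """Ratio of data physically written to flash to data the host asked to write."""
    return flash_kb / host_kb

PAGE_KB = 4
PAGES_PER_BLOCK = 128
BLOCK_KB = PAGE_KB * PAGES_PER_BLOCK   # 512 KB erase block

# Rewriting one page forces the whole block to be written back.
worst_case = write_amplification(host_kb=PAGE_KB, flash_kb=BLOCK_KB)
# Writing one page to a blank location writes exactly that page.
best_case = write_amplification(host_kb=PAGE_KB, flash_kb=PAGE_KB)
print(worst_case, best_case)
```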

A wear leveling algorithm can effectively reduce write amplification, but a poorly designed algorithm still causes it. For example, suppose a user needs to write 4 KB of data and the free block pool has no blank block left. The SSD must then pick a block containing invalid data from the data block pool, read it into the cache, modify it, and write the entire block back. In this case wear leveling still results in write amplification.

Reserving more spare space on the SSD significantly reduces the performance problems caused by write amplification. In our tests, after a long period of random writes, MLC SSD performance dropped sharply (random write IOPS fell as low as 300). Reserving more space for wear leveling noticeably reduces this degradation, and the more space is reserved, the greater the improvement. By comparison, SLC SSD performance is much more stable (random writes hold steady at about 3,000 IOPS even after a long period of random writing). I think this is because SLC SSDs usually have relatively small capacities (32 GB and 64 GB) and a relatively large share of space set aside for wear leveling.

Database I/O Characteristics

There are four types of I/O: sequential read, random read, random write, and sequential write. Sequential reads and writes usually use a large I/O size (up to 1 MB) and are measured mainly by throughput. Random reads and writes use a relatively small I/O size (less than 8 KB) and are measured mainly by IOPS and response time. In a database, a full table scan is sequential read I/O, index access is typical random read I/O, log file writes are sequential write I/O, and data file writes are random write I/O.

Database systems are designed around the access characteristics of traditional disks. The most prominent feature is that log files use sequential logging. A database's log must be written to disk when a transaction commits, so response time requirements are high; writing sequentially effectively cuts disk seek time and reduces latency. Although sequential log writes are physically contiguous, they differ from conventional sequential writes. Each log I/O is very small (usually less than 4 KB), each I/O is independent (the head must lift, seek again, and wait for the platter to rotate into position), and the interval between I/Os is very short. Databases use a log buffer (caching) and group commit (batched commits) to increase the I/O size and reduce the number of I/Os, which lowers response latency. Sequential log writes can therefore be regarded as "random writes at contiguous locations", and the bottleneck is still IOPS rather than throughput.
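The effect of the log buffer and group commit on I/O count can be sketched as below. This is a toy model, not how any particular database implements it: the class name, the 512-byte record size, and the explicit `flush` call are all assumptions for illustration.

```python
class GroupCommitLog:
    """Toy log writer: commits queue up in a log buffer and share one flush."""
    def __init__(self):
        self.buffer = []        # log records waiting to be flushed
        self.io_count = 0       # number of physical log writes
        self.bytes_written = 0

    def commit(self, record: bytes):
        self.buffer.append(record)

    def flush(self):
        if not self.buffer:
            return
        payload = b"".join(self.buffer)   # one larger I/O instead of many small ones
        self.io_count += 1
        self.bytes_written += len(payload)
        self.buffer.clear()

records = [b"x" * 512 for _ in range(8)]  # eight small commit records

one_by_one = GroupCommitLog()
for r in records:
    one_by_one.commit(r)
    one_by_one.flush()                    # flush on every commit

grouped = GroupCommitLog()
for r in records:
    grouped.commit(r)
grouped.flush()                           # single flush for the whole group

print(one_by_one.io_count, grouped.io_count)
```

Both writers put the same number of bytes on disk, but the grouped writer does it in one I/O, which matters when the limit is IOPS rather than throughput.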

Data files use in-place update, meaning that all modifications to a data file are written back to their original location. Unlike log files, data files are not written when a transaction commits. Dirty buffers are flushed to their locations only when the database finds too many dirty buffers or needs to perform a checkpoint, and this is an asynchronous process. In general, random writes to data files do not have very demanding I/O requirements, as long as they keep up with checkpoints and dirty buffer flushing.

Analysis of SSD I/O Characteristics

1. Random read performance is very good. Sequential read performance is average, but still better than an ordinary SAS disk.

2. There is no seek latency, and random writes and sequential writes differ little in response latency.

3. Erase-before-write causes write amplification and hurts write performance.

4. Flash cells wear out with writes. Wear leveling prolongs service life, but it also affects read performance.

5. Read and write response latencies are unequal (reads are much faster than writes), whereas on ordinary disks read and write latencies differ only slightly.

6. Sequential writes perform better than random writes. For example, one 1 MB sequential write is much better than 128 random 8 KB writes, because random writes cause many more erases.

Given these SSD characteristics, putting an entire database on SSD may cause the following problems:

1. Sequential logging to log files repeatedly erases and writes the same location. Even with wear leveling, sustained writing still degrades performance.

2. In-place updates to data files generate a large number of random writes, and erase-before-write causes write amplification.

3. In mixed read/write database workloads, a large number of random writes hurts read performance and causes a lot of I/O latency.

SSD-based database optimization rules:

SSD-based optimization aims to solve the write amplification problem caused by erase-before-write and to separate different types of I/O so that writes have less impact on performance.

1. Change sequential logging to in-page logging to avoid repeated writes to the same location.

2. Use a write cache to merge a large number of random in-place update writes into a small number of sequential writes.

3. Take advantage of the SSD's strong random read capability, reduce the number of writes, and improve read performance.

In-page Logging

In-page logging is an SSD-oriented optimization of database sequential logging. Sequential logging works very well on traditional disks and greatly improves response time, but it is a nightmare for SSDs because the same location has to be written over and over. Wear leveling can spread the load, but it still hurts performance and causes a lot of I/O latency. In-page logging therefore co-locates logs with data and turns sequential log writes into random writes. Because random and sequential write latency differ little on an SSD, this avoids repeated writes to the same location and improves overall performance.

The basic principle of in-page logging: the data buffer holds an in-memory log sector structure, similar to a log buffer, in which each log sector corresponds to one data block. In the data buffer, data and log are not merged; only a correspondence is kept between each data block and its log sector, so the log of a data block can be split off. In the SSD's underlying flash memory, however, data and logs are stored in the same block (erase unit): each block contains both data pages and log pages.

When log information needs to be written (the log buffer is full or a transaction commits), it is written to the corresponding block in flash memory. Log information is thus spread across many different blocks, and within each block it is appended, so no erase is needed. When the log sectors in a block are full, a merge is triggered: the entire block is read, the log sectors are applied to the data to produce the latest version, and the entire block is written back. After that, the log sectors in the block are blank again.

With in-page logging, dirty blocks in the data buffer do not need to be written to flash memory, not even when a dirty buffer has to be evicted. To read the latest data, it is enough to merge the data and the log information in the block.

In-page logging stores logs and data in the same erase unit, which reduces repeated writes to the same flash location and removes the need to write dirty blocks to flash. This greatly reduces the random writes and erases caused by in-place update. Although reads require a merge, the data and its log are stored together and the SSD's random read performance is high, so in-page logging improves overall performance.
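The mechanism can be sketched with a toy erase unit that keeps a small log area next to its data pages. This is only an illustration of the idea: the class `IPLBlock`, the four-slot log area, and the record format are assumptions, not details from the in-page logging design itself.

```python
class IPLBlock:
    """Toy erase unit holding data pages plus a small in-block log area."""
    def __init__(self, data, log_slots=4):
        self.data = dict(data)        # page_no -> value as of the last merge
        self.log = []                 # appended (page_no, new_value) records
        self.log_slots = log_slots
        self.erase_count = 0

    def append_log(self, page_no, value):
        if len(self.log) == self.log_slots:
            self.merge()                    # log area full: rewrite the block once
        self.log.append((page_no, value))   # append only, no erase

    def merge(self):
        for page_no, value in self.log:     # apply log records in order
            self.data[page_no] = value
        self.log = []                       # log area is blank again
        self.erase_count += 1               # block erased and rewritten

    def read(self, page_no):
        value = self.data.get(page_no)
        for p, v in self.log:               # merge data and log on read
            if p == page_no:
                value = v
        return value

blk = IPLBlock({0: "a0", 1: "b0"})
for i in range(1, 7):
    blk.append_log(0, f"a{i}")              # six updates to the same page
print(blk.read(0), blk.read(1), blk.erase_count)
```

Six updates to one page cost a single erase here, where in-place update would erase the block on every rewrite, and reads still see the newest value by applying the log.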

SSD as write cache-append write

An SSD can serve as a write cache for disks, because SSD sequential writes perform better than random writes: one 1 MB sequential write is much better than 128 random 8 KB writes. We can merge a large number of random writes into a small number of sequential writes, which increases the I/O size, reduces the number of I/Os (and erases), and improves write performance. This is similar to the append write approach of many NoSQL products: data is only appended, never rewritten in place, and is merged later.

Basic principle: when dirty blocks need to be written to a data file, the original data file is not updated directly. An I/O merge is performed first: multiple 8 KB dirty blocks are combined into one larger write unit and appended to a cache file (stored on SSD), which avoids erases and improves write performance. Data in the cache file is written in circular order. When the cache file runs short of space, a background process writes its data to the real data files (stored on disk). At this point a second I/O merge combines the data in the cache file into a small number of sequential writes. For the disks, the final I/O is 1 MB sequential writes, which only consume throughput, and disk throughput is not the bottleneck. The IOPS bottleneck is converted into a throughput bottleneck, which raises overall system capacity.

When reading data, the cache file must be checked first. Data in the cache file is stored unordered, so to look it up quickly an in-memory index is usually built for the cache file, and this index is queried first on every read. If the data is found in the cache file, the data file does not need to be read, so this approach acts not only as a write cache but also as a read cache.
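The write path, the in-memory index, and the destaging step can be sketched together. This is a simplified model under stated assumptions: the cache file is a Python list, capacity is counted in entries rather than bytes, destaging empties the whole cache at once, and every name is hypothetical.

```python
class AppendWriteCache:
    """Toy SSD cache file: append-only log plus an in-memory index."""
    def __init__(self, capacity):
        self.capacity = capacity   # max entries before destaging to disk
        self.cache_file = []       # append-only list of (block_id, data)
        self.index = {}            # block_id -> position of the newest copy
        self.disk = {}             # stands in for the real data file
        self.disk_write_batches = 0

    def write(self, block_id, data):
        if len(self.cache_file) >= self.capacity:
            self.destage()
        self.index[block_id] = len(self.cache_file)
        self.cache_file.append((block_id, data))   # append, never overwrite

    def destage(self):
        # Second merge: keep only the newest copy of each block, ordered by
        # block id so the disk sees one ordered batch instead of random writes.
        for block_id in sorted(self.index):
            self.disk[block_id] = self.cache_file[self.index[block_id]][1]
        self.disk_write_batches += 1
        self.cache_file.clear()
        self.index.clear()

    def read(self, block_id):
        pos = self.index.get(block_id)             # check the cache index first
        if pos is not None:
            return self.cache_file[pos][1]
        return self.disk.get(block_id)

cache = AppendWriteCache(capacity=4)
for block_id, data in [(7, "a"), (3, "b"), (7, "c"), (9, "d"), (3, "e")]:
    cache.write(block_id, data)
print(cache.read(7), cache.read(3), cache.read(9), cache.disk_write_batches)
```

Block 3 is served from the cache file while blocks 7 and 9 come from disk, and the two writes to block 7 reach the disk only once.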

SSDs are not well suited to storing database log files. Log files are also appended, but log I/O is small and must be written synchronously, so the writes cannot be merged, and on an SSD this requires a large number of erases. We have also tried putting the redo log on SSD. Since SSD random writes can reach 3,000 IOPS, response latency is much lower than on disk, but this depends on how good the SSD's wear leveling algorithm is, and the log files must be stored on their own device. If log writes are the bottleneck, this is one possible solution. In general, though, I recommend storing log files on ordinary disks rather than SSD.

SSD as read cache-flashcache

Because most database workloads read far more than they write, using SSD as a database flashcache is the simplest of these optimizations. It takes full advantage of SSD read performance while avoiding its write performance problems. There are several ways to implement it: data can be written to the SSD at the same time it is read from disk, or it can be written to the SSD when it is evicted from the buffer. On a read, the buffer is checked first, then the flashcache, and finally the data file.
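The lookup order can be sketched as a three-tier read. This sketch uses the populate-on-read variant mentioned above and plain dictionaries for each tier; the function name and the returned tier label are illustrative only.

```python
def read_block(block_id, buffer, flashcache, datafile):
    """Look up a block in memory, then SSD flashcache, then the data file."""
    if block_id in buffer:
        return buffer[block_id], "buffer"
    if block_id in flashcache:
        buffer[block_id] = flashcache[block_id]
        return buffer[block_id], "flashcache"
    data = datafile[block_id]          # slowest path: read from disk
    buffer[block_id] = data
    flashcache[block_id] = data        # populate the SSD cache on read
    return data, "datafile"

buffer, flashcache = {}, {}
datafile = {1: "row-1", 2: "row-2"}

first = read_block(1, buffer, flashcache, datafile)
buffer.clear()                         # simulate the block being evicted from memory
second = read_block(1, buffer, flashcache, datafile)
third = read_block(1, buffer, flashcache, datafile)
print(first, second, third)
```

The first read goes to disk, the second is served from the SSD after the memory copy is evicted, and the third hits memory.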

The biggest difference between SSD as flashcache and memcache as an external database cache is that data on the SSD survives a power loss. This raises another question: when the database restarts after a failure, is the data in flashcache valid or invalid? If it is treated as valid, the consistency of flashcache data must be guaranteed at all times. If it is treated as invalid, flashcache faces a warm-up problem (the same problem memcache has after a power loss). As far as I know, it is currently treated as invalid in most cases, because keeping flashcache data consistent is very difficult.

Flashcache is a second-level cache between memory and disk. Besides the performance gain, there is a cost argument: the price of SSD lies between that of memory and disk, so using it as a cache layer between the two strikes a balance between performance and price.


As SSD prices keep falling and capacity and performance keep rising, it is only a matter of time before SSDs replace disks.

"Tape is dead, disk is tape, flash is disk, RAM locality is king." (Jim Gray)


Later I will put together a slide deck and give a technical talk on this topic.



