Last Saturday I attended a Baidu technology salon hosted by InfoQ. The topic was MySQL. The turnout was surprisingly large, perhaps because admission was free :)
There were two speakers. The first, from Baidu, mainly introduced Baidu's work on SSD/flash; the second, from Teradata, talked about spatial databases. Frankly, the first talk was very good: you could see that Baidu has done a lot of work on SSDs, and the speaker covered it in detail. The second was like plain boiled water, dull and flavorless, and rather shallow.
Let me focus on the first talk.
The idea is actually very simple. If you want to build on SSDs, you naturally want to play to their unique performance strengths. So what are they? The speaker gave some numbers. In short, an SSD offers good random read performance, mediocre random write performance, and good append (sequential) write performance. To use SSDs well, then, you either pick application scenarios that match these strengths, or you change the application's I/O pattern at the architecture level to fit the characteristics of the SSD.
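If you want a rough feel for these three patterns yourself, a quick throwaway script will do. Everything here (the scratch file name, the 16 KB block size, the counts) is my own choosing, and since it goes through the OS page cache rather than O_DIRECT, the numbers are only indicative, not a real benchmark:

```python
import os
import random
import time

BLOCK = 16 * 1024
COUNT = 2048
PATH = "ssd_probe.dat"   # hypothetical scratch file on the SSD under test

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.3f}s")

# Pre-fill a scratch file so the reads have real data to hit.
with open(PATH, "wb") as f:
    f.write(os.urandom(BLOCK * COUNT))

offsets = [random.randrange(COUNT) * BLOCK for _ in range(COUNT)]

with open(PATH, "r+b") as f:
    def random_read():
        for off in offsets:
            f.seek(off)
            f.read(BLOCK)

    def random_write():
        for off in offsets:
            f.seek(off)
            f.write(b"\x00" * BLOCK)
        f.flush()
        os.fsync(f.fileno())

    timed("random read ", random_read)
    timed("random write", random_write)

with open(PATH, "ab") as f:
    def append_write():
        for _ in range(COUNT):
            f.write(b"\x00" * BLOCK)
        f.flush()
        os.fsync(f.fileno())

    timed("append write", append_write)

os.remove(PATH)
```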
For MySQL, the database naturally sees a large volume of reads, and an SSD will certainly improve those. On the other hand, many workloads are also write-heavy, in particular with random writes, which SSDs do not handle well. So Baidu wrote a roughly 3000-line patch against MySQL's InnoDB storage engine. What does the patch do? Put bluntly, it turns random writes into append writes.
So how can we do this?
Since the slides have not been released yet, I drew a sketch from memory; of course, it leaves out many parts of the architecture. Let's walk through it.
When MySQL needs to write data, it does not write the original data file directly. Instead, the written pages are kept in memory; once enough pages have accumulated, they are packed into a block and the block is appended to a separate cache file. This step is the crux: writes that would originally have hit the data file at random offsets become append writes to the cache file.
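Here is a minimal Python sketch of that write path, under my own assumptions: the names (AppendWriter, flush_block), the 16 KB page size, and the 64-page block threshold are all hypothetical, not Baidu's actual patch, which lives inside InnoDB's C++ code:

```python
import os

PAGE_SIZE = 16 * 1024       # assumed InnoDB-style page size
PAGES_PER_BLOCK = 64        # flush once this many dirty pages accumulate

class AppendWriter:
    """Buffers dirty pages in memory and appends them to a cache file in
    large blocks, instead of writing each page back to the data file at
    a random offset."""

    def __init__(self, cache_path):
        self.cache = open(cache_path, "ab")
        self.buffer = {}    # page_no -> latest page bytes, not yet on disk
        self.page_map = {}  # page_no -> offset of its copy in the cache file

    def write_page(self, page_no, data):
        assert len(data) == PAGE_SIZE
        self.buffer[page_no] = data   # overwrite in memory if dirtied again
        if len(self.buffer) >= PAGES_PER_BLOCK:
            self.flush_block()

    def flush_block(self):
        # One sequential append replaces many scattered random writes.
        base = self.cache.tell()
        for i, (page_no, data) in enumerate(sorted(self.buffer.items())):
            self.page_map[page_no] = base + i * PAGE_SIZE
            self.cache.write(data)
        self.cache.flush()
        os.fsync(self.cache.fileno())
        self.buffer.clear()
```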
On the read side, a page mapping is maintained in memory. When the database needs to read a page, it looks the page up in this mapping to determine whether the latest copy sits in the data file or in the cache file, and then fetches it from the corresponding file. The pages in the cache file end up scattered in no particular order, but that is not a problem, because SSD random read performance is good.
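The lookup itself is straightforward. Continuing the same hypothetical sketch, page_map is the in-memory mapping from the block above, and anything not in it is read from its home offset in the data file:

```python
PAGE_SIZE = 16 * 1024  # same assumed page size as above

def read_page(page_no, page_map, data_file, cache_file):
    """Fetch the freshest copy of a page. page_map records which pages
    currently live in the cache file and at what offset."""
    if page_no in page_map:
        # Redirected page: a random read from the cache file,
        # which an SSD handles well.
        cache_file.seek(page_map[page_no])
        return cache_file.read(PAGE_SIZE)
    # Untouched page: still at its home offset in the data file.
    data_file.seek(page_no * PAGE_SIZE)
    return data_file.read(PAGE_SIZE)
```

In the full design a read would of course also check the in-memory buffer of not-yet-flushed pages first; I leave that out for brevity.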
Finally, at some point the cache file has to be merged back into the original data file. This step does generate a large number of random writes, but through scheduling and control the system can pick a moment when the load is low, say late at night, to run the merge. That keeps the burst of random writes from dragging down system performance.
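A sketch of that merge, with the same hypothetical structures (both files assumed open in r+b mode); sorting by page number at least makes the write-back ascend through the data file rather than jump around completely at random:

```python
import os

PAGE_SIZE = 16 * 1024

def merge_cache(page_map, data_file, cache_file):
    """Write every redirected page back to its home offset in the data
    file, then discard the cache. Scheduled for off-peak hours, since
    this is exactly the random-write burst that was deferred."""
    for page_no, cache_offset in sorted(page_map.items()):
        cache_file.seek(cache_offset)
        page = cache_file.read(PAGE_SIZE)
        data_file.seek(page_no * PAGE_SIZE)
        data_file.write(page)
    data_file.flush()
    os.fsync(data_file.fileno())
    page_map.clear()
    cache_file.seek(0)
    cache_file.truncate()  # cache file starts fresh after the merge
```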
Looked at another way, the scheme does not really convert random writes into append writes; it temporarily turns random writes into append writes, and then finds a suitable time to turn those append writes back into random writes. The design looks quite simple, but I am sure this solution only emerged after many experiments and discussions.
Another frequent criticism of SSDs is their failure rate. One figure Baidu gave is a 2‰ failure rate per week, which is very high. After the meeting I privately asked how they cope with this. The answer, roughly, has several layers: first, at the device level, the SSD itself has some fault-tolerance and recovery mechanisms; second, RAID can be used for fault tolerance; finally, a fault-tolerance algorithm is designed into the distributed architecture.
-- End --