---- Reading Notes on "large-scale distributed storage system: Principle Analysis and Architecture Practice"
I have recently been reading the OceanBase source code, and came across "Large-Scale Distributed Storage System: Principle Analysis and Architecture Practice", written by one of OceanBase's core developers. After reading the sample chapter, I decided to start with this book. It is a good book for anyone preparing to learn about distributed systems: it introduces distributed technologies and projects in a comprehensive and systematic manner. Roughly half of it is devoted to OceanBase, which happens to be exactly what I need. Next, I will write several reading notes dedicated to the storage engine. Without further ado, let's get to the subject.
1. Storage media and read/write
When talking about storage, understanding the characteristics of the storage media is obviously very important. The book covers a lot of hardware detail, but the most important conclusions are concentrated in the storage media comparison table.
Disk media comparison

| Category | Reads/writes per second (IOPS) | Price per GB (RMB) | Random read | Random write |
|----------|-------------------------------|--------------------|-------------|--------------|
| Memory | Tens of millions | 150 | Friendly | Friendly |
| SSD disk | 35,000 | 20 | Friendly | Write amplification |
| SAS disk | 180 | 3 | Disk seek | Disk seek |
| SATA disk | 90 | 0.5 | Disk seek | Disk seek |
From the table, we can see that memory has by far the strongest random read/write capability, far exceeding SSDs and mechanical disks, but as we all know, memory is not persistent. At present, many companies use SSDs in areas with high performance requirements: compared with SAS and SATA disks, random read speed is greatly improved. For random writes, however, there is the write amplification problem.
The write amplification problem stems from how SSDs work: an SSD cannot overwrite data in place; it can only erase and rewrite an entire block. The simplest example is writing 4 KB of data. In the worst case, no block has clean space, but some invalid data can be erased, so the drive reads out the block's valid data, erases the block, and writes it back together with the new 4 KB. A logical write of 4 KB thus causes a physical write of an entire block (128 KB here), a 32x amplification. In addition, an SSD's lifetime is limited by the number of writes it can sustain.
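The worst-case amplification factor follows directly from the two sizes mentioned above; a minimal sketch:

```python
# Worst-case SSD write amplification, using the figures from the text:
# a 4 KB logical write forces a read-erase-rewrite of a 128 KB block.
ERASE_BLOCK_KB = 128  # size of one SSD erase block (figure from the text)
WRITE_KB = 4          # size of the logical write

amplification = ERASE_BLOCK_KB // WRITE_KB
print(amplification)  # 32
```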
Therefore, if SSDs are used as the storage medium of a storage engine, it is advisable to reduce or avoid random writes in the design and replace them with sequential writes.
2. Introduction to the bitcask Storage Model
The basic functions of a storage system are adding, deleting, reading, and modifying records; reads can be further divided into sequential reads and random reads.
In general, most applications are read-heavy, so read performance is a central concern of storage system design. Broadly speaking, fast lookup derives from two ideas: binary search and hashing. For example, the B+ tree model commonly used in relational databases embodies the binary search idea (though the actual implementation is much more complicated than binary search), and it supports sequential scans. The other family is hash-based key-value models, which support only random reads and do not support sequential scans.
The bitcask model discussed today is a log-structured key-value model. A log does not support in-place random writes; like a log file, it only supports appends. Bitcask thus converts random writes into sequential writes, which has two benefits:
- Higher random-write throughput, because a write needs no lookup; the record is simply appended.
- If SSD is used as the storage medium, the new hardware features can be better utilized.
Bitcask involves three kinds of structures: data files, the in-memory index, and hint files. The data files live on disk and contain the raw key/value records. The index is a hash table kept in memory that records the location of every record; when bitcask starts, it reads the location information of all records into this hash table. The hint file is not required for bitcask to work; it exists to speed up building the index at startup.
2.1 log-type data files
At any time, bitcask keeps exactly one data file open for writing, called the active data file. The remaining data files are read-only and are called older data files.
The structure inside a data file is very simple: each write operation appends one entry. Each entry consists of a CRC check value, a timestamp, the key size, the value size, and then the key and value themselves.
A data file is just a sequence of such entries laid out one after another.
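The entry layout above can be sketched in a few lines of Python. This is a hypothetical encoding, not bitcask's exact on-disk format: the field widths (four 32-bit big-endian integers for CRC, timestamp, key size, and value size) are an assumption for illustration.

```python
import struct
import time
import zlib

# Assumed entry layout: crc | timestamp | key_size | value_size | key | value
# (field widths are an illustrative assumption, not bitcask's exact format)
HEADER = struct.Struct(">IIII")  # four big-endian u32 fields

def pack_entry(key: bytes, value: bytes, tstamp=None) -> bytes:
    tstamp = int(time.time()) if tstamp is None else tstamp
    body = struct.pack(">III", tstamp, len(key), len(value)) + key + value
    crc = zlib.crc32(body)  # checksum covers everything after the crc field
    return struct.pack(">I", crc) + body

def unpack_entry(buf: bytes):
    crc, tstamp, ksz, vsz = HEADER.unpack_from(buf, 0)
    assert crc == zlib.crc32(buf[4:HEADER.size + ksz + vsz]), "corrupt entry"
    key = buf[HEADER.size:HEADER.size + ksz]
    value = buf[HEADER.size + ksz:HEADER.size + ksz + vsz]
    return tstamp, key, value

entry = pack_entry(b"name", b"bitcask", tstamp=1700000000)
print(unpack_entry(entry))  # (1700000000, b'name', b'bitcask')
```

The CRC lets a reader detect a torn or corrupted entry, which matters for an append-only file that may be cut short by a crash.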
2.2 index hash table
The index hash table records the key and location information of every record. For each key, it stores the file number, the value size, the value position, and the timestamp of the latest record.
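The index entry described above can be sketched as a small in-memory structure. The field names here mirror the text but are assumptions about naming, not bitcask's actual identifiers:

```python
from collections import namedtuple

# Hypothetical in-memory index entry: file number, value size,
# value position, and timestamp of the latest record for a key.
IndexEntry = namedtuple("IndexEntry", ["file_id", "value_sz", "value_pos", "tstamp"])

keydir = {}  # key -> IndexEntry

def index_put(key, file_id, value_sz, value_pos, tstamp):
    # Keep only the newest record for a key (the later timestamp wins).
    old = keydir.get(key)
    if old is None or tstamp >= old.tstamp:
        keydir[key] = IndexEntry(file_id, value_sz, value_pos, tstamp)

index_put(b"k", file_id=1, value_sz=7, value_pos=16, tstamp=100)
index_put(b"k", file_id=2, value_sz=3, value_pos=48, tstamp=200)  # newer wins
print(keydir[b"k"].file_id)  # 2
```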
2.3 Hint file
When bitcask starts, the index hash table must be rebuilt, and if the data volume is large, startup becomes slow. The hint file exists to accelerate this rebuild. Its format is almost the same as the data file's; the only difference is that where a data file entry stores the value, a hint file entry stores the position of the record in the data file.
In this way, at startup bitcask can read the hint files instead of scanning the full data files, and rebuild the hash table entry by entry much faster.
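The rebuild can be sketched as below. The hint-entry layout (timestamp, key size, value size, value position, then the key) is an illustrative assumption that parallels the data-file entry described earlier:

```python
import struct

# Assumed hint entry: tstamp | key_size | value_size | value_pos | key
# (like a data entry, but holding the record's position instead of the value)
HINT = struct.Struct(">IIIQ")

def write_hint(entries):
    out = b""
    for tstamp, key, value_sz, value_pos in entries:
        out += HINT.pack(tstamp, len(key), value_sz, value_pos) + key
    return out

def rebuild_keydir(hint_bytes, file_id):
    # Scan the hint blob and rebuild the in-memory index without
    # touching the (much larger) data file.
    keydir, off = {}, 0
    while off < len(hint_bytes):
        tstamp, ksz, vsz, pos = HINT.unpack_from(hint_bytes, off)
        off += HINT.size
        key = hint_bytes[off:off + ksz]
        off += ksz
        keydir[key] = (file_id, vsz, pos, tstamp)
    return keydir

blob = write_hint([(100, b"a", 5, 0), (200, b"b", 9, 40)])
print(rebuild_keydir(blob, file_id=7))
```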
3. bitcask features
As mentioned above, the basic functions of the storage system include adding, deleting, reading, and modifying. How is bitcask implemented?
How are records added?
Records written by the user are appended directly to the active file, so the active file keeps growing. When it reaches a certain size, bitcask freezes it and creates a new active file for writing; the previous active file becomes an older data file. Each write also adds a corresponding entry to the index hash table.
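The append-and-rotate behavior above can be sketched as follows. The size threshold and file naming are assumptions chosen for the demo, not bitcask's real values:

```python
import os
import tempfile

MAX_ACTIVE_SIZE = 64  # bytes; a tiny demo threshold (an assumption)

class LogWriter:
    """Minimal sketch of active-file rotation: freeze the active file
    once it grows past a threshold and open a new one."""
    def __init__(self, directory):
        self.dir = directory
        self.file_id = 0
        self._open_active()

    def _open_active(self):
        path = os.path.join(self.dir, f"{self.file_id}.data")
        self.f = open(path, "ab")

    def append(self, record: bytes):
        # Rotate before the write would push the file past the limit.
        if self.f.tell() > 0 and self.f.tell() + len(record) > MAX_ACTIVE_SIZE:
            self.f.close()
            self.file_id += 1
            self._open_active()
        pos = self.f.tell()
        self.f.write(record)
        self.f.flush()
        return (self.file_id, pos)  # what the index hash table would store

with tempfile.TemporaryDirectory() as d:
    w = LogWriter(d)
    locations = [w.append(b"x" * 40) for _ in range(3)]
    w.f.close()
    print(locations)  # each 40-byte record triggers a rotation here
```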
How are records deleted?
Bitcask does not delete a record in place. Instead, it appends a new record with the same key whose value is a deletion marker, and then updates the index hash table. The original record still exists in the data file.
How are records modified?
Bitcask does not support random writes, so modification works exactly like addition: the new record is appended to the active data file, and the corresponding entry in the index hash table is updated to point at it. (A key may therefore correspond to multiple records across the data files; the one with the latest timestamp is authoritative.)
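The delete and modify semantics above can be condensed into one sketch, with the log simulated as a list and the tombstone value chosen arbitrarily for illustration:

```python
TOMBSTONE = b"__deleted__"  # an assumed deletion marker; real values vary

log = []      # append-only data file, simulated as (tstamp, key, value) tuples
keydir = {}   # key -> index of the latest live record in `log`

def put(key, value, tstamp):
    # "Add" and "modify" are the same operation: append, then repoint the index.
    log.append((tstamp, key, value))
    keydir[key] = len(log) - 1

def delete(key, tstamp):
    put(key, TOMBSTONE, tstamp)  # the old records stay in the file
    del keydir[key]

put(b"k", b"v1", 1)
put(b"k", b"v2", 2)   # modify: newer record wins, old one becomes garbage
delete(b"k", 3)
print(len(log), b"k" in keydir)  # 3 False
```

Note that all three operations grow the log; the stale records are only reclaimed later by the merge operation described below.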
How are records read?
To read a record, bitcask first looks up its on-disk location in the index hash table, then fetches the record from the data file with a single disk I/O.
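That read path is just a seek plus a read; a minimal sketch, with an in-memory buffer standing in for a data file:

```python
import io

# The index hash table would supply (value_pos, value_sz); one seek and
# one read then fetch the value. BytesIO stands in for a real data file.
datafile = io.BytesIO(b"xxxxhello world!yyyy")
value_pos, value_sz = 4, 12  # assumed values from the index lookup

datafile.seek(value_pos)
value = datafile.read(value_sz)
print(value)  # b'hello world!'
```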
Merge operations
Because bitcask keeps appending and never overwrites, its data files grow without bound, and much of their content is useless: records superseded by later modifications or marked as deleted. The merge operation removes this garbage and shrinks the data files.
The merge operation periodically scans all records in the older data files and generates new data files (the active data file is excluded because it is still being written). If a key has multiple records, only the latest one is retained, so the redundant data is removed. Hint files can also be generated as a by-product of the merge. Merges are usually performed when the system is idle, for example at one or two o'clock in the morning.
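The merge pass can be sketched as below. The record shape is an assumption carried over from the earlier sketches, with `None` standing in for the deletion marker:

```python
# Sketch of the merge: scan the older files' records, keep only the newest
# record per key, and drop keys whose newest record is a deletion marker.
def merge(records):
    latest = {}
    for tstamp, key, value in records:
        if key not in latest or tstamp >= latest[key][0]:
            latest[key] = (tstamp, value)
    # Rewrite only live records into the new data file.
    return [(t, k, v) for k, (t, v) in latest.items() if v is not None]

old = [(1, b"a", b"1"), (2, b"b", b"2"), (3, b"a", b"9"), (4, b"b", None)]
print(merge(old))  # [(3, b'a', b'9')]
```

Since the surviving records' positions change, a real merge would also emit updated index entries, which is exactly why generating hint files during the merge is cheap.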
4. Summary
Bitcask is a compact key-value storage model. Its log-structured data files append records instead of rewriting them, which raises random-write throughput; the in-memory hash table speeds up lookups; and periodic merges compact the data files and generate hint files that accelerate rebuilding the hash table at startup.
Here is a Python implementation I found online and extended with some functionality: https://github.com/Winnerhust/Code-of-Book/blob/master/Large-Scale-Distributed-Storage-System/bitcask.py
In addition to adding, deleting, and reading/writing, the main functions are as follows:
- When merging data files, it can optionally generate hint files
- It can rebuild the index from hint files at startup