Google's LevelDB: A Quick Trial and a Detailed Look at Its Implementation Principles


LevelDB project address: https://code.google.com/p/leveldb/


Windows is not supported for now. There is a Windows branch, but it has not been updated for two years. I tried the workarounds found online, but a few functions still failed to link, so I decided to set it aside for the time being.

Under Linux things are much more convenient: download the source code, run make all, and it produces both a .a and a .so library. I used the .a file.

I wrote the following example, and it went fairly smoothly.

The code is as follows:
#include <cassert>
#include <iostream>
#include <string>

#include "leveldb/db.h"

int main()
{
    leveldb::DB* db;
    leveldb::Options options;
    options.create_if_missing = true;  // create the database if it does not exist

    leveldb::Status status = leveldb::DB::Open(options, "/home/test/test_leveldb/testdb", &db);

    std::string key = "key123";
    std::string value = "value123";

    status = db->Put(leveldb::WriteOptions(), key, value);
    assert(status.ok());

    std::string value1;
    status = db->Get(leveldb::ReadOptions(), key, &value1);
    assert(status.ok());

    std::cout << "Get, key: " << key << " value1: " << value1 << std::endl;

    std::string key1 = "key456";
    status = db->Put(leveldb::WriteOptions(), key1, value1);
    assert(status.ok());

    status = db->Delete(leveldb::WriteOptions(), key1);
    assert(status.ok());

    delete db;
    return 0;
}



I will not post the output; you will see it when you try it yourself.
Several files will appear in the directory you created above; they hold your persisted data.

Linux flushes buffered data to disk roughly every 30 seconds, and LevelDB uses mmap, so you may lose data on a crash. Reading ideawu's blog: the LevelDB authors believe that a sync-every-second mechanism should be implemented by the user rather than provided by LevelDB, so he wrote ssdb, a store built on LevelDB that is compatible with the Redis protocol and syncs data to disk every second, which greatly reduces the chance of losing data. GitHub address: https://github.com/ideawu/ssdb
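Relatedly, LevelDB itself lets a caller trade write throughput for durability on a per-write basis through WriteOptions::sync. A minimal sketch (the database path and keys here are purely illustrative):

#include <cassert>
#include "leveldb/db.h"

// Minimal sketch: ask LevelDB to flush the write-ahead log to disk before
// Put() returns, trading write throughput for durability.
int main() {
    leveldb::DB* db = nullptr;
    leveldb::Options options;
    options.create_if_missing = true;
    leveldb::Status status = leveldb::DB::Open(options, "/tmp/testdb_sync", &db);
    assert(status.ok());

    leveldb::WriteOptions write_options;
    write_options.sync = true;  // fsync the log before returning
    status = db->Put(write_options, "key123", "value123");
    assert(status.ok());

    delete db;
    return 0;
}

With sync left at its default of false, a write sits in the OS buffer and can be lost if the machine crashes, which is exactly the window the ssdb approach is trying to narrow.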

LevelDB Notes, Part 1: LevelDB 101

The name LevelDB may not mean much to you, but if you are an IT engineer and do not know the two god-tier engineers behind it, you are behind the times: Jeff Dean and Sanjay Ghemawat. Both are heavyweight engineers at Google and two of the handful of Google Fellows.

Jeff Dean: http://research.google.com/people/jeff/index.html, principal designer and implementer of Google's large-scale distributed platforms BigTable and MapReduce.

Sanjay Ghemawat: http://research.google.com/people/sanjay/index.html, principal designer and implementer of Google's large-scale distributed platforms GFS, BigTable, and MapReduce.

LevelDB is an open-source project started by these two engineers. In short, LevelDB is a C++ library that provides persistent key-value storage capable of handling on the order of a billion records. As noted above, these two are the designers and implementers of BigTable. If you know BigTable, you know that this far-reaching distributed storage system has two core components: the master server and the tablet server. The master server handles management of data storage and distributed scheduling, while the actual distributed data storage and read/write operations are carried out by the tablet servers, and LevelDB can be understood as a simplified version of a tablet server.

LevelDB has the following characteristics:

First, LevelDB is a persistent key-value store. Unlike in-memory KV systems such as Redis, LevelDB does not consume memory the way Redis does; most of its data is stored on disk.

Second, LevelDB stores records sorted by key, that is, records with adjacent keys are stored next to each other in the storage files. The application can supply a custom key comparison function, and LevelDB will order the records according to it.
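For illustration, here is a minimal sketch of supplying such a comparator through the public leveldb::Comparator interface; the class name and the reverse ordering are illustrative choices, not anything LevelDB prescribes:

#include <string>
#include "leveldb/comparator.h"
#include "leveldb/db.h"
#include "leveldb/slice.h"

// Sketch of a user-defined comparator: keys are ordered in reverse of the
// default byte order. leveldb::Comparator is the real extension point.
class ReverseComparator : public leveldb::Comparator {
 public:
  int Compare(const leveldb::Slice& a, const leveldb::Slice& b) const override {
    return b.compare(a);  // invert the default byte-wise ordering
  }
  const char* Name() const override { return "ReverseComparator"; }
  // These hooks only help LevelDB shorten index keys; no-ops are legal.
  void FindShortestSeparator(std::string*, const leveldb::Slice&) const override {}
  void FindShortSuccessor(std::string*) const override {}
};

int main() {
  ReverseComparator cmp;
  leveldb::Options options;
  options.create_if_missing = true;
  options.comparator = &cmp;  // must outlive the DB and stay the same across opens
  leveldb::DB* db = nullptr;
  leveldb::Status status = leveldb::DB::Open(options, "/tmp/testdb_reverse", &db);
  if (status.ok()) {
    // Iterating over db now yields keys from largest to smallest.
    delete db;
  }
  return 0;
}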

Next, like most KV systems, LevelDB's interface is simple: the basic operations are writing, reading, and deleting records. Atomic batches of multiple operations are also supported.
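A small sketch of such an atomic batch, using leveldb::WriteBatch to move a value from one key to another so that either both changes apply or neither does (the helper name and keys are illustrative):

#include <cassert>
#include <string>
#include "leveldb/db.h"
#include "leveldb/write_batch.h"

// Sketch of an atomic batch: delete one key and write another in a single
// atomic step.
void MoveValue(leveldb::DB* db, const std::string& from, const std::string& to) {
  std::string value;
  leveldb::Status s = db->Get(leveldb::ReadOptions(), from, &value);
  if (!s.ok()) return;  // nothing to move

  leveldb::WriteBatch batch;
  batch.Delete(from);
  batch.Put(to, value);
  s = db->Write(leveldb::WriteOptions(), &batch);  // both operations apply atomically
  assert(s.ok());
}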

In addition, LevelDB supports snapshots, so that read operations are not affected by writes and always see a consistent view of the data for the duration of the read.
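A minimal sketch of the snapshot interface: a read that goes through a snapshot does not see writes made after the snapshot was taken (keys and values here are illustrative):

#include <string>
#include "leveldb/db.h"

// Sketch of reading through a snapshot.
void SnapshotExample(leveldb::DB* db) {
  const leveldb::Snapshot* snap = db->GetSnapshot();

  db->Put(leveldb::WriteOptions(), "key123", "newer-value");  // later write

  leveldb::ReadOptions read_options;
  read_options.snapshot = snap;
  std::string value;
  db->Get(read_options, "key123", &value);  // still sees the pre-snapshot value

  db->ReleaseSnapshot(snap);  // snapshots must be released when no longer needed
}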

LevelDB also supports features such as data compression, which directly helps reduce storage space and improve I/O efficiency.

LevelDB's performance is outstanding: the official site reports random write throughput of about 400,000 records per second and random read throughput of about 60,000 records per second. In general, LevelDB writes are much faster than reads, and sequential reads and writes are much faster than random ones. As for why this is so, you should understand the underlying reasons after reading the rest of these notes.

LevelDB Notes, Part 2: Overall Architecture

LevelDB is essentially a storage system plus the operational interfaces exposed on top of it. To understand the whole system and its processes, we can look at LevelDB from two angles: static and dynamic. From the static angle, imagine the system is running (data is constantly being inserted, deleted, and read) and we take a photograph of LevelDB at that moment; from the photo we can see how the data is laid out across memory and disk and what state it is in. From the dynamic angle, we mainly want to understand how the system writes a record, reads a record, and deletes a record, along with internal operations such as compaction, and how the system recovers after a crash.

The overall architecture described in this part is mainly the static view; the following parts describe in detail the files and in-memory data structures that make up this static structure. The later parts of these notes mainly take the dynamic view of LevelDB, that is, how the whole system actually operates.

LevelDB is a storage system whose data lives both in memory and in disk files. As mentioned above, if we let LevelDB run for a while and then take a snapshot of it, we would see the following scene:

Figure 1.1: LevelDB structure

As the diagram shows, six main parts make up LevelDB's static structure: the memtable and immutable memtable in memory, and several kinds of files on disk: the current file, the manifest file, log files, and SSTable files. LevelDB has some auxiliary files beyond these, but the six parts above are its main elements.

LevelDB's log files and memtable match the BigTable paper: when an application writes a key:value record, LevelDB first writes it to the log file and, on success, inserts the record into the memtable. That essentially completes the write. Since one write involves only a single sequential disk write and a single in-memory insert, this is the main reason LevelDB writes are so fast.

The log file's role is to avoid losing data when recovering from a crash. Without it, a freshly written record would exist only in memory; if the system crashed before the in-memory data was dumped to disk, that data would be lost (Redis has this problem). To avoid this, LevelDB records the operation in the log file before applying it to memory, so even after a crash the memtable can be rebuilt from the log file and no data is lost.

When the data in the memtable occupies memory up to a threshold, it has to be exported to an on-disk file. LevelDB then creates a new log file and a new memtable, and the old memtable becomes an immutable memtable, which, as the name suggests, can no longer be modified: it can only be read, not written to or deleted from. New data is recorded in the new log file and new memtable, while a background scheduler exports the immutable memtable's data to disk, forming a new SSTable file. SSTables are produced by repeatedly exporting memory contents and by compaction operations, and all SSTable files form a hierarchy of levels: the first level is level 0, the next is level 1, and so on. This is where the name LevelDB comes from.

An SSTable file is ordered by key, that is, records with smaller keys come before records with larger keys, and the SSTables at every level follow this rule. One point deserves attention, though: the SSTable files (suffix .sst) at level 0 are special compared with the other levels. At level 0, two .sst files may have overlapping key ranges. For example, for two level-0 files A and B, file A's key range might be {bar, car} and file B's key range {blue, samecity}; both files could then contain a record with key="blood". At the other levels, .sst files within the same level never overlap: any two .sst files at level L are guaranteed to have disjoint key ranges. This is worth remembering, because many differences in how operations are handled stem from it.
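A conceptual sketch of the consequence, not LevelDB's actual code: a lookup must consider every level-0 file whose key range covers the key, while at levels 1 and above at most one file can match. The struct and function names are invented for illustration:

#include <string>
#include <vector>

// At level 0, every file whose [smallest, largest] range covers the key must
// be searched; at levels >= 1 the ranges are disjoint.
struct FileRange {
  std::string name;
  std::string smallest;
  std::string largest;
};

std::vector<FileRange> Level0Candidates(const std::vector<FileRange>& level0_files,
                                        const std::string& key) {
  std::vector<FileRange> candidates;
  for (const FileRange& f : level0_files) {
    if (key >= f.smallest && key <= f.largest) {
      candidates.push_back(f);  // overlapping ranges: keep every match
    }
  }
  return candidates;
}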

Since an SSTable file at a given level stores its records ordered by key, each file has a smallest key and a largest key. This is very important information, and LevelDB needs to write it down. The manifest does exactly that: it records management information for each SSTable file, such as which level it belongs to, its file name, and its smallest and largest keys. The following figure is a schematic of what the manifest stores:

Figure 2.1: Manifest storage schematic

The figure shows only two files (the manifest records this information for all SSTable files): test.sst1 and test.sst2 at level 0, together with their key ranges. For example, test.sst1's key range is "an" to "banana", and test.sst2's key range is "baby" to "samecity"; you can see that the two key ranges overlap.

What does the current file do? It holds just one piece of information: the name of the current manifest file. During LevelDB's operation, compactions cause the SSTable files to change: new files appear and old files are discarded. The manifest reflects these changes, and a new manifest file is often generated to record them, so current is used to point out which manifest file is the one we should care about.

The content above makes up the overall static structure of LevelDB. In the next parts we will introduce the concrete data layout and structure of the important files and in-memory data.

LevelDB Notes, Part 3: Log Files

The previous part explained that the main purpose of log files in LevelDB is to ensure that no data is lost when recovering from a failure. Because the log is written before the record is inserted into the in-memory memtable, even if the system fails before the memtable's data has been dumped into an SSTable file, LevelDB can rebuild the memtable's contents from the log file, so no data is lost. On this point LevelDB is consistent with BigTable.

Let's look at the physical and logical layout of a log file. LevelDB cuts a log file into physical blocks of 32KB, and each read uses one block as the basic unit. The log file shown in the figure below consists of 3 blocks, so physically a log file is a sequence of contiguous 32KB blocks.

Figure 3.1 Log file layout

The application never sees these blocks; what it sees is a series of key:value pairs. Inside LevelDB, a key:value pair is treated as the data of one record, and a record header is placed in front of the data to hold some management information that makes internal processing easier. Figure 3.2 shows how a record is represented inside LevelDB.

Figure 3.2: Record structure

The record header contains three fields. The checksum is computed over the type and data fields; to avoid processing incomplete or corrupted data, LevelDB validates the data against the stored checksum when reading the record, and if they match, the data is complete and undamaged and processing can continue. The record length gives the size of the data. The data field is the key:value pair mentioned above. The type field describes how each logical record relates to the physical block structure of the log file; specifically, it takes one of four values: FULL, FIRST, MIDDLE, or LAST.
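For reference, a sketch of this header modeled on LevelDB's db/log_format.h (leveldb 1.x): a 7-byte header consisting of a 4-byte checksum, a 2-byte length, and a 1-byte type, followed by the payload. The struct itself is illustrative, since LevelDB serializes these fields by hand rather than writing a struct to disk:

#include <cstdint>

enum RecordType : uint8_t {
  kFullType   = 1,  // record fits entirely inside one block
  kFirstType  = 2,  // first fragment of a record that spans blocks
  kMiddleType = 3,  // interior fragment
  kLastType   = 4   // final fragment
};

struct LogRecordHeader {
  uint32_t checksum;  // CRC over the type byte and the payload
  uint16_t length;    // payload size in bytes
  uint8_t  type;      // one of RecordType
};

const int kBlockSize  = 32 * 1024;  // 32KB physical blocks
const int kHeaderSize = 4 + 2 + 1;  // 7 bytes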

If a record's type is FULL, the record is stored entirely within a single physical block and is not split across blocks; if the record is split across adjacent physical blocks, the type is one of the other three. We will explain using the example shown in Figure 3.1.

Assume there are currently three records: record A, record B, and record C, where record A is 10KB, record B is 80KB, and record C is 12KB; their logical layout in the log file is shown in Figure 3.1. Record A is the blue area in the figure: since 10KB < 32KB it fits in one physical block, so its type is FULL. Record B is 80KB, and block 1 has only 22KB left after record A has been placed in it, so the remainder of block 1 holds the beginning of record B, with type FIRST, marking the starting part of a record. Record B still has 58KB unstored, which can only go into the following physical blocks. Block 2 is only 32KB and still cannot hold the rest of record B, so all of block 2 belongs to record B, with type MIDDLE, meaning it is a middle piece of record B. The remaining part of record B fits entirely into block 3, with type LAST, meaning it is the final piece of record B. Record C is 12KB, and the space left in block 3 in the figure is enough to hold all of it, so it is marked FULL.

From this small example we can see the relationship between logical records and physical blocks: LevelDB reads one physical block at a time and then stitches the logical records back together according to the type field for subsequent processing.
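A conceptual sketch of that splitting logic, ignoring the header bytes to keep the arithmetic simple; this is not LevelDB's code, just the FULL/FIRST/MIDDLE/LAST decision spelled out:

#include <string>
#include <vector>

const int kLogBlockSize = 32 * 1024;

// Returns the fragment types one logical record produces, updating the space
// left in the current block as a side effect.
std::vector<std::string> FragmentTypes(int record_size, int& space_left_in_block) {
  std::vector<std::string> types;
  if (space_left_in_block == 0) space_left_in_block = kLogBlockSize;  // start a new block
  if (record_size <= space_left_in_block) {
    types.push_back("FULL");                         // fits in the current block
    space_left_in_block -= record_size;
    return types;
  }
  int remaining = record_size - space_left_in_block;  // spills into later blocks
  types.push_back("FIRST");
  while (remaining > kLogBlockSize) {
    types.push_back("MIDDLE");                        // fills a whole block
    remaining -= kLogBlockSize;
  }
  types.push_back("LAST");                            // final piece
  space_left_in_block = kLogBlockSize - remaining;
  return types;
}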

LevelDB Notes, Part 4: SSTable Files

The SSTable is a crucial piece of BigTable, and for LevelDB, understanding the details of LevelDB's SSTable implementation also helps in understanding some of the implementation details of BigTable.

This part focuses on the static layout of SSTables. We said in Part 2 (Overall Architecture) that SSTable files form a hierarchy of levels; how that hierarchy is formed is left to the later part on compaction. Here we concentrate on the physical and logical layout of a single SSTable file, which is helpful for understanding how LevelDB runs.

LevelDB has many SSTable files (with suffix .sst) at different levels, and all .sst files share the same internal layout. The previous part described how a log file is divided into physical blocks; an SSTable file is likewise divided into fixed-size physical storage blocks, but the logical layouts of the two are very different. The root cause is that the records in a log file are unordered by key, with no particular relationship between the keys of successive records, whereas a .sst file arranges its records by key from smallest to largest. From the layout described below you can see how this key ordering shapes the structure of the .sst file.

Figure 4.1: .sst file block structure

Figure 4.1 shows the physical partitioning of a .sst file. Like a log file, it is divided into fixed-size blocks, and each block has three parts: the red part is the data storage area, the blue type field indicates whether the data area uses a compression algorithm (Snappy compression or no compression), and the CRC is a checksum used to detect errors introduced while the data was generated or transmitted.

That is the physical layout of a .sst file. Next comes its logical layout; by logical layout we mean that, although everything is stored in physical blocks, each block stores particular content with its own internal structure. Figure 4.2 shows the logical interpretation of a .sst file.

Figure 4.2 Logical layout

As Figure 4.2 shows, a .sst file can be broadly divided into a data storage area and a data management area. The data storage area holds the actual key:value data, while the data management area provides index pointers and similar structures for managing the data, so that records can be found more quickly and conveniently. Both areas are built on the blocks described above: the first blocks of the file hold the actual KV data, followed by the management area that stores management data. The management data falls into four types: the purple meta block, the red metablock index, the blue data index blocks, and a footer block at the end of the file.

LevelDB 1.2 makes no real use of the meta block; it merely reserves an interface that is expected to gain content in later versions. Let's look at the internal structure of the data index area and the file footer.

Figure 4.3 Data index

Figure 4.3 is a schematic of the data index's internal structure. Recall that the KV records inside the data blocks are arranged by key from smallest to largest. Each entry in the index area is index information established for one data block, and each entry contains three items; Figure 4.3 shows the index entry for block i. The first field, in the red part, records a key that is greater than or equal to the largest key in block i; the second field gives block i's starting position in the .sst file; the third field gives the size of block i (which may be compressed). The last two fields are easy to understand: they locate the data block inside the file. The first field needs more explanation: the key stored in the index entry is not necessarily the key of any actual record. In Figure 4.3, suppose block i's smallest key is "samecity" and its largest key is "the best", while block i+1's smallest key is "the fox" and its largest key is "zoo". Then the first field of index entry i records a key that is greater than block i's largest key ("the best") and less than block i+1's smallest key ("the fox"); in the example it is "the c", which satisfies that requirement. The first field of index entry i+1 is "zoo", that is, the largest key of block i+1.
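A conceptual sketch of how such an index is consulted, not LevelDB's actual code (which walks the index block through an iterator): the first entry whose separator key is greater than or equal to the search key names the only data block that could contain it. Names here are illustrative:

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct IndexEntry {
  std::string separator;  // >= max key of this block, < min key of the next
  uint64_t offset;        // where the data block starts in the .sst file
  uint64_t size;          // size of the data block (possibly compressed)
};

const IndexEntry* FindBlock(const std::vector<IndexEntry>& index,
                            const std::string& key) {
  auto it = std::lower_bound(
      index.begin(), index.end(), key,
      [](const IndexEntry& e, const std::string& k) { return e.separator < k; });
  return it == index.end() ? nullptr : &*it;  // nullptr: key is beyond the last block
}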

The internal structure of the footer block at the end of the file is shown in Figure 4.4. metaindex_handle gives the starting position and size of the metaindex block; index_handle gives the starting position and size of the index block. These two fields can be understood as an index of the indexes, set up so that the index contents can be read correctly; they are followed by a padding area and a magic number.

Figure 4.4 Footer

That covers the internal structure of the data management area. Now let's look at how the data portion of a block in the data area is laid out internally (the red part in Figure 4.1); Figure 4.5 is its internal layout.

Figure 4.5 Data Block internal structure

As the diagram shows, its interior is also divided into two parts. The front part is the KV records, ordered by key from smallest to largest; at the end of the block are a number of restart points, which are really just pointers indicating the positions of certain records within the block's content.

What does "restart point" do? We repeatedly stressed that the block content in the KV record is in accordance with the key size ordered, so, adjacent to the two records are likely to overlap key parts, such as key i= "The car", key i+1= "The Color", Then there is the overlap part "The C", in order to reduce the storage of key, key i+1 can only store and the previous key different parts of the "olor", the common part of both from key I can be obtained. The key to the record is stored in the Block Content section, with the main purpose of reducing storage overhead. "Restart point" means: At the beginning of this record, no longer to take only different key parts, but to record all key values, assuming that key i+1 is a restart point, then the key will be stored in a complete "the color", rather than using a simple "olor" way. The block tail indicates which records are the restart points.

Figure 4.6: Record format

What is the internal structure of each KV record in the block content area? Figure 4.6 shows the details. Each record contains five fields: the shared key length (for the "olor" record above, the length of the key part shared with the previous record is the length of "the c", that is, 5); the unshared key length (for "olor", this is 4); the value length, giving the length of the value in key:value; the unshared key content, which actually stores the key suffix "olor"; and the value content, which stores the actual value.
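A conceptual sketch of the prefix compression using the "the car"/"the color" example; real LevelDB varint-encodes these fields inside the block, so only the arithmetic is shown here, with illustrative names:

#include <cstdint>
#include <string>

struct CompressedKey {
  uint32_t shared;       // bytes shared with the previous key
  std::string unshared;  // the remaining suffix that is actually stored
};

CompressedKey Compress(const std::string& prev, const std::string& key) {
  uint32_t shared = 0;
  while (shared < prev.size() && shared < key.size() &&
         prev[shared] == key[shared]) {
    ++shared;
  }
  return {shared, key.substr(shared)};
}

std::string Decompress(const std::string& prev, const CompressedKey& ck) {
  return prev.substr(0, ck.shared) + ck.unshared;
}

// Example: prev = "the car", key = "the color"
//   Compress() gives shared = 5 ("the c") and unshared = "olor".
//   A restart point simply resets prev to "", so the full key is stored.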

These are all the internal mysteries of the. sst file.

LevelDB Notes, Part 5: MemTable in Detail

The previous parts covered the important static structures tied to disk files; this part describes the memtable, an in-memory data structure that plays an important role in the system. Overall, all KV data is stored in the memtable, the immutable memtable, and the SSTables. The immutable memtable is identical in structure to the memtable, except that it is read-only and does not allow writes, while the memtable allows both writes and reads. When the data written to the memtable occupies the specified amount of memory, it is automatically converted into an immutable memtable, which waits to be dumped to disk, and the system automatically generates a new memtable for write operations to accept new data. Once you understand the memtable, the immutable memtable naturally follows.

LevelDB's memtable provides interfaces to write, delete, and read KV data, but in fact the memtable never really deletes anything: deleting a key's value is implemented as inserting a record into the memtable that carries a deletion marker for that key. The real deletion is lazy and happens later, when the KV pair is removed during compaction.
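A conceptual sketch of "delete is just another insert": the key stays in the table with a deletion tag until compaction drops it. std::map stands in for the skiplist, and the class is invented for illustration (the tag values happen to mirror LevelDB's kTypeDeletion/kTypeValue):

#include <cstdint>
#include <map>
#include <string>

enum ValueType : uint8_t { kTypeDeletion = 0, kTypeValue = 1 };

struct Entry {
  ValueType type;
  std::string value;  // empty for deletions
};

class ToyMemTable {
 public:
  void Put(const std::string& key, const std::string& value) {
    table_[key] = {kTypeValue, value};
  }
  void Delete(const std::string& key) {
    table_[key] = {kTypeDeletion, ""};  // insert a tombstone, do not erase
  }
  // True if the key is present and not shadowed by a tombstone.
  bool Get(const std::string& key, std::string* value) const {
    auto it = table_.find(key);
    if (it == table_.end() || it->second.type == kTypeDeletion) return false;
    *value = it->second.value;
    return true;
  }

 private:
  std::map<std::string, Entry> table_;  // kept sorted by key, like the skiplist
};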

Note that the KV pairs in LevelDB's memtable are stored ordered by key: when a new KV pair is inserted, LevelDB must place it in the appropriate position to maintain this key order. In fact, LevelDB's MemTable class is just an interface; the real work, both insertion and lookup, is done by the skiplist behind it, so the core data structure of the memtable is a skiplist.

The skiplist was invented by William Pugh. He published "Skip Lists: A Probabilistic Alternative to Balanced Trees" in Communications of the ACM, June 1990, 33(6), 668-676, which explains the skiplist data structure and its insertion and deletion operations in detail.

The skiplist is an alternative to balanced trees, but unlike a red-black tree, it is implemented with a randomized algorithm, which means that insertion and deletion in a skiplist are simpler.

For a detailed introduction to skiplists, see this article: http://www.cnblogs.com/xuqiang/archive/2011/05/22/2053516.html; it explains things very clearly. LevelDB's skiplist is basically a concrete implementation of it, with nothing particularly special.

A skiplist is not only a simple way to maintain sorted data; compared with a balanced tree it avoids frequent node-rebalancing operations on insert, so write efficiency is very high. LevelDB is a write-heavy system overall, and the skiplist plays an important role in that. Redis also uses a skiplist as an internal data structure to speed up insertions.

LevelDB Notes, Part 6: Writing and Deleting Records

In the previous five parts we introduced LevelDB's static files and their detailed layouts. Starting from this part we look at LevelDB's dynamic operations, such as reading and writing records, compaction, and error recovery.

This part describes LevelDB's record update operations, that is, inserting a KV record or deleting one. LevelDB's update operations are very fast, because its internal mechanisms keep them simple.

Figure 6.1: LevelDB write record

Figure 6.1 is a schematic of how KV data is updated. As the figure shows, an insert operation Put(key, value) involves two concrete steps. First, the record is appended sequentially to the end of the log file described earlier; although this is a disk write, sequential appends to a file are very efficient, so it does not slow writes down. Second, if the log write succeeds, the KV record is inserted into the in-memory memtable. As described earlier, the memtable is just a thin wrapper whose interior is a key-ordered skiplist, and inserting a new record is simple: find the appropriate insertion position, then adjust the corresponding link pointers to splice the new record in. Once this step is done, the write is complete. So an insert involves one sequential append to a disk file and one skiplist insertion in memory, which is the root reason LevelDB's writes are so efficient.

From the process above we can also see that the log file is unordered by key while the memtable is key-ordered. What about deleting a KV record? LevelDB does not delete immediately; deletion is handled the same way as insertion, except that instead of inserting key:value, it inserts key:deletion-mark without actually removing the record. The real deletion is done later by the background compaction.

LevelDB's write operations really are that simple. The real trouble lies in the read operations described next.

LevelDB Notes, Part 7: Reading Records

LevelDB is a single-machine store for large-scale key/value data; from the application's point of view, it is a storage tool. For a competent storage tool, the usual interface is nothing more than writing a new KV pair, deleting a KV pair, reading a KV pair, and updating the value for a key. LevelDB's interface does not directly support update: if you need to update the value of a key, you can either insert a new KV pair with the same key, so that the value associated with the key is effectively updated, or delete the old KV pair first and then insert the new one, accomplishing the update in a more roundabout way.

Suppose the application submits a key; let's look at how LevelDB reads its corresponding value from the stored data. Figure 7.1 is the overall schematic of LevelDB's read process.

Figure 7.1: LevelDB read process

LevelDB first looks in the in-memory memtable; if the memtable contains the key, it returns the corresponding value. If the key is not found in the memtable, it next looks in the immutable memtable, also in memory; likewise, if found it returns, and if not, there is no choice but to search the many SSTable files on disk. Because the SSTables are numerous and divided into levels, reading from them is quite a winding journey. The general principle is: first search the files belonging to level 0, and if found, return the corresponding value; if not, search the files at level 1, and so on, until the key is found in some level's SSTable files, or the highest level has been searched and the lookup fails, meaning the key does not exist anywhere in the system.
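The lookup order can be summarized with a small conceptual sketch (not LevelDB's code): memtable, then immutable memtable, then the levels from 0 upward, stopping at the first hit. Names are illustrative:

#include <string>
#include <vector>

// "Table" here is any component that can answer Get().
struct Table {
  virtual bool Get(const std::string& key, std::string* value) const = 0;
  virtual ~Table() = default;
};

bool LookUp(const Table& memtable,
            const Table& immutable_memtable,
            const std::vector<const Table*>& levels,  // index 0 = level 0, ...
            const std::string& key, std::string* value) {
  if (memtable.Get(key, value)) return true;            // freshest data first
  if (immutable_memtable.Get(key, value)) return true;  // next freshest
  for (const Table* level : levels) {                    // lower levels are fresher
    if (level->Get(key, value)) return true;
  }
  return false;  // not found anywhere: the key does not exist
}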

Why go from the memtable to the immutable memtable, then from the immutable memtable to files, and why search the files from low levels to high levels? This search path is chosen because of data freshness: clearly the memtable holds the freshest KV pairs, the immutable memtable holds the next freshest, and the KV data in all the SSTable files is less fresh than what is in the memtable and immutable memtable. Among SSTable files, if the same key is found at both level L and level L+1, the data at level L must be newer than that at level L+1. In other words, the search path above is ordered by data freshness: the fresher the data, the earlier it is searched.

Why give priority to fresher data? The answer is almost self-evident; an example makes it clear. Suppose we first insert the record {key="www.samecity.com", value="we"} into LevelDB; a few days later the samecity site is renamed "69 same city", and we insert {key="www.samecity.com", value="69 same city"}: same key, different value. Logically we think of LevelDB as holding only one record, the second one, but it may well hold both. If the user now queries key="www.samecity.com", we certainly want the most recently updated record, i.e. the second one, to be returned, which is why fresher data is searched first.

We said above that for SSTable files, if the same key is found at both level L and level L+1, the information at level L must be newer than at level L+1. Strictly speaking this is a conclusion that needs a proof, otherwise the question arises: why? Intuitively it is clear: the data at level L+1 did not spring from a crack in a rock or appear in a dream, so where did it come from? Level L+1's data is produced from level L by compaction (if you do not yet know what compaction is, it is coming later). In other words, the level L+1 SSTable data you see now originally came from level L, and the current level L data is fresher than that older level L data, so the current level L data must be fresher than the current level L+1 data.

There are many SSTable files; how do we quickly find the value for a key? In LevelDB, level 0 has always been a special case, and the process of finding a key at level 0 differs from that at the other levels. Because different level-0 files may have overlapping key ranges, the key being queried may be contained in multiple files. LevelDB's strategy is to first determine which level-0 files contain the key (the manifest records each level, its files, and their key ranges, and LevelDB keeps this mapping table in memory), then sort those files by freshness with the newest files first, and search them in order, reading out the key's value. For levels other than 0, the key ranges of files within a level do not overlap, so the key's value can be found by looking in just one file.

Finally, given a key to query and an SSTable file whose key range contains that key, how does LevelDB search it? LevelDB first checks the in-memory cache for a record of this file. If it is there, it reads from the cache; if not, it opens the SSTable file, loads the file's index portion into memory, and adds it to the cache. Now the cache holds an entry for this SSTable, but only its index portion is in memory. LevelDB then uses the index to locate which block could contain the key, reads that block's contents from the file, and compares the records one by one. If the key is found, the result is returned; if not, this level's SSTable files do not contain the key, and the search moves on to the next level's SSTables.

From the descriptions of writes earlier and reads here, it is clear that reads have to deal with far more complexity than writes, so write speed is necessarily much higher than read speed. In other words, LevelDB is better suited to applications with many more writes than reads. If the application is read-heavy, sequential reads are relatively efficient because most content can be found in the cache; large numbers of random reads should be avoided as much as possible.

LevelDB Notes, Part 8: Compaction

As mentioned earlier, writing a record in LevelDB is very simple, and deleting a record just writes a deletion marker, but reading a record is more complex: it must search memory and the files at every level in order of freshness, which is expensive. To speed up reads, LevelDB uses compaction to compact the existing records: it removes KV data that is no longer valid, reducing the data size and the number of files.

LevelDB's compaction mechanism and process are basically consistent with BigTable. BigTable has three kinds of compaction: minor, major, and full. A minor compaction exports the data in the memtable to an SSTable file; a major compaction merges SSTable files across levels; a full compaction merges all SSTables.

LevelDB implements two of these: minor and major.

We will describe their mechanisms in detail.

First, the minor compaction. Its purpose is, once the in-memory memtable reaches a certain size, to save its contents to a disk file. Figure 8.1 is a schematic of the mechanism.

Figure 8.1 Minor compaction

As Figure 8.1 shows, when the memtable grows to a certain point it is converted into an immutable memtable, which can no longer accept writes and only serves reads. As introduced earlier, the immutable memtable is really a skiplist, a multi-level ordered list in which records are sorted by key. So the minor compaction is also very simple to implement: traverse the immutable memtable's records from smallest to largest key and write them into a new level-0 SSTable file, then build the file's index data. That completes one minor compaction. The figure also shows that deleted records are not actually removed during a minor compaction. The reason is simple: at this point we only know there is a deletion marker for the key, but where is the KV data it refers to? Finding that would require expensive lookups, so the minor compaction does not delete anything; it simply writes the key with its deletion marker into the file as an ordinary record. The real deletion happens in later, higher-level compactions.

When the number of SSTable files at some level exceeds a set threshold, LevelDB selects one file from that level's SSTables (for level > 0) and merges it with the SSTable files of the next higher level, level+1. This is the major compaction.

We know that at levels greater than 0, the keys within each SSTable file are stored in order from smallest to largest, and the key ranges of the files within a level (smallest key to largest key) do not overlap. The level-0 SSTable files are somewhat special: although each file is also ordered by key, level-0 files are produced directly by minor compaction, so any two level-0 SSTable files may have overlapping key ranges. Therefore, when doing a major compaction at a level greater than 0, a single file is selected; but at level 0, once a file is chosen, other level-0 SSTable files are likely to have key ranges that overlap with it. In that case, all the overlapping files must be found and merged with the level-1 files; that is, at level 0, multiple files may take part in a major compaction.

When LevelDB selects a level for compaction, it also has to choose which file at that level to compact. LevelDB uses a small trick here: it takes turns. For example, if file A was compacted this time, then next time it will be the file whose key range comes right after file A's, so that every file gets a chance to be merged with the higher-level files.

If file A at level L has been chosen to merge with files at level L+1, the next question is: which level L+1 files should be merged? LevelDB selects all the files in level L+1 whose key ranges overlap with file A's and merges them with file A.

In other words, once file A at level L is selected, all the files at level L+1 that need to be merged with it are found: B, C, D, and so on. The remaining question is how to perform the major merge itself: given a set of files, each ordered by key, how do we merge them so that the newly generated files are still ordered by key, while discarding the KV data that is no longer valuable?

Figure 8.2 illustrates this process.

Figure 8.2: SSTable compaction

The major compaction proceeds as follows: perform a multi-way merge sort over the input files, repeatedly taking the smallest key, which effectively re-sorts all the records in those files. Then, by certain criteria, decide whether each key still needs to be kept. If it is judged to have no remaining value, it is simply thrown away; if it must be kept, it is written into one of the new SSTable files being generated at level L+1. Processing the KV data one by one in this way produces a series of new level L+1 data files. The old level L file and the level L+1 files that took part in the compaction are now meaningless, so they are all deleted. This completes the merging of the level L and level L+1 file records.

So during a major compaction, what is the criterion for deciding whether a KV record is discarded? One criterion is: for a key, if the same key already exists at a level lower than L, then this KV can be thrown away during the major compaction. As analyzed earlier, if a file at a level below L holds a record for the same key, that record is a newer value for the key, so the old value is meaningless and can be deleted.
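A conceptual sketch of the merge-and-drop step, not LevelDB's implementation (which streams through iterators rather than materializing a map): the input files are ordered newest first, only the newest record per key survives, and a deletion marker can itself be dropped once no older level could still hold the key. The names and the single older_levels_may_contain_key flag are simplifications:

#include <map>
#include <string>
#include <vector>

struct Record {
  std::string key;
  std::string value;
  bool is_deletion;
};

std::vector<Record> MergeCompact(
    const std::vector<std::vector<Record>>& files_newest_first,
    bool older_levels_may_contain_key) {  // simplification: one flag for all keys
  std::map<std::string, Record> newest;   // key -> newest surviving record
  for (const auto& file : files_newest_first) {
    for (const Record& r : file) {
      newest.insert({r.key, r});  // insert() keeps the first (newest) record seen
    }
  }
  std::vector<Record> output;  // std::map iteration keeps the output key-ordered
  for (const auto& kv : newest) {
    if (kv.second.is_deletion && !older_levels_may_contain_key) {
      continue;  // tombstone no longer needed: nothing older can resurrect the key
    }
    output.push_back(kv.second);
  }
  return output;
}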

LevelDB Notes, Part 9: Caches

As the previous part on reads explained, if a read does not find the record in the in-memory memtable, it has to perform several disk accesses. Even in the best case, where the key is found in the newest level-0 file on the first try, it still needs to read the disk twice: once to load the SSTable file's index portion into memory, so that the index can determine which block holds the key, and a second time to read that block's contents and then look up the key's value in memory.

LevelDB introduces two different caches to reduce this cost: the table cache and the block cache. The block cache is optional; a configuration option specifies whether it is enabled.

Figure 9.1 Table Cache

Figure 9.1 shows the structure of the table cache. In this cache, the key is the SSTable's file name, and the value has two parts: a file pointer to the opened SSTable file on disk, used for reading its contents, and a pointer to the in-memory Table structure corresponding to that SSTable. The Table structure holds, in memory, the SSTable's index contents as well as the cache_id used to address the block cache, along with a few other things.

For example, during a Get(key) read, if LevelDB determines that the key falls within the key range of file A at some level, it needs to check whether file A really contains this KV pair. LevelDB first consults the table cache to see whether the file is there; if it is found, the index portion can determine which block contains the key. If the file is not in the cache, LevelDB opens the SSTable file, reads its index portion into memory, inserts it into the cache, and then uses the index to locate which block contains the key. Once the block containing the key has been identified, that block's contents must be read, which is the second disk read.

Figure 9.2 Block Cache

The block cache exists to speed up this second step; Figure 9.2 shows its structure. The cache key is the file's cache_id plus the block's starting offset within the file, block_offset, and the value is the contents of that block.
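A sketch of how such a cache key can be formed; real LevelDB packs the table's cache_id and the block offset as two fixed 64-bit values into a small char buffer, and this version does essentially the same through a std::string:

#include <cstdint>
#include <cstring>
#include <string>

// cache_id plus block offset uniquely identify one block across all open tables.
std::string BlockCacheKey(uint64_t cache_id, uint64_t block_offset) {
  char buf[16];
  std::memcpy(buf, &cache_id, sizeof(cache_id));              // bytes 0-7
  std::memcpy(buf + 8, &block_offset, sizeof(block_offset));  // bytes 8-15
  return std::string(buf, sizeof(buf));
}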

If LevelDB finds the block in the block cache, it can avoid reading the disk and simply search the cached block for the key's value. If it is not found there, the block is read from disk and inserted into the block cache. LevelDB uses these two caches to speed up reads. From this you can see that if reads have good locality, that is, most reads hit the cache, read efficiency should still be quite high, and sequential reads by key should also perform well, because a block read once can be reused several times. But if the workload is random reads, you can infer for yourself how efficient that will be.

LevelDB Notes, Part 10: Version, VersionEdit, and VersionSet

A Version holds, in memory, the information about all the files currently on disk and in memory; normally only one Version is called the "current" version. LevelDB also keeps a series of historical versions. What are these historical versions for?

When an iterator is created, it references the current Version, and as long as the iterator has not been deleted, the Version it references stays alive. This means that when you are done with an iterator, you should delete it promptly.

When a compaction finishes (new files have been generated and the pre-merge files need to be deleted), LevelDB creates a new Version as the current version, and the previous current version becomes a historical version.

The VersionSet is the collection of all Versions; it manages all the Versions that are still alive.

A VersionEdit represents the change between Versions, that is, the delta: it records which files were added and which files were deleted. The following shows the relationship between them.

Version0 + VersionEdit -> Version1
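A conceptual sketch of that relation with simplified, invented types (a real Version also tracks levels, key ranges, and more): a Version is a snapshot of which files are live, and a VersionEdit is the delta a compaction produces.

#include <set>
#include <string>
#include <vector>

struct Version {
  std::set<std::string> live_files;  // .sst file names; levels omitted for brevity
};

struct VersionEdit {
  std::vector<std::string> added_files;
  std::vector<std::string> deleted_files;
};

Version Apply(const Version& base, const VersionEdit& edit) {
  Version next = base;  // start from the current version
  for (const std::string& f : edit.deleted_files) next.live_files.erase(f);
  for (const std::string& f : edit.added_files) next.live_files.insert(f);
  return next;          // this becomes the new "current" version
}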

The VersionEdit is saved to the manifest file, and during recovery the VersionEdits are read back from the manifest file to reconstruct the state.

LevelDB's version control reminds me of double-buffer switching. Double buffering comes from graphics, where it solves the flicker problem when drawing the screen; it is also useful in server programming.

For example, suppose our server holds a dictionary that needs to be updated every day. We can open a new buffer, load the new dictionary into it, and once loading is complete, switch the dictionary pointer to the new dictionary.

LevelDB's version management is similar to double-buffer switching, except that if the old version is still referenced by some iterator, that version is kept; only once it is no longer referenced by any iterator can the version be deleted.
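A minimal sketch of the double-buffer idea in C++, with invented names; shared_ptr reference counting plays the role that iterator references play for LevelDB's versions, keeping the old copy alive until its last reader lets go:

#include <map>
#include <memory>
#include <string>

using Dictionary = std::map<std::string, std::string>;

class DictionaryHolder {
 public:
  std::shared_ptr<const Dictionary> Current() const {
    return std::atomic_load(&current_);  // readers take a counted reference
  }
  void Publish(std::shared_ptr<const Dictionary> fresh) {
    std::atomic_store(&current_, std::move(fresh));  // the "pointer switch"
  }

 private:
  std::shared_ptr<const Dictionary> current_ = std::make_shared<Dictionary>();
};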
