A note up front: this series is my own write-up of LevelDB's implementation principles, organized with reference to the Langerhans Technology blog series (original address: http://www.samecity.com/blog/Index.asp?Sortid=12). I wrote it mainly to deepen my own understanding; the figures in this article are my own, and most of the content closely follows the original, which you can browse at the link above :-). If you are interested in how LevelDB is implemented, feel free to discuss!
LevelDB Daily Notes, Part 1: LevelDB 101
The name LevelDB may not ring a bell, but if you are an IT engineer and have never heard of the following two god-tier engineers, your boss may find that hard to accept: Jeff Dean and Sanjay Ghemawat. Both are heavyweight Google engineers and among the few Google Fellows.
Jeff Dean: http://research.google.com/people/jeff/index.html, one of the principal designers and implementers of Google's large-scale distributed platforms BigTable and MapReduce.
Sanjay Ghemawat: http://research.google.com/people/sanjay/index.html, one of the principal design and implementation engineers of Google's large-scale distributed platforms GFS, BigTable, and MapReduce.
LevelDB is an open-source project started by these two great engineers. In short, LevelDB is a C++ library for persistent key-value storage capable of handling data on the scale of a billion records. As noted above, both of them designed and implemented BigTable, and if you know BigTable, you know that this far-reaching distributed storage system has two core components: the master server and the tablet server. The master server handles management of metadata and distributed scheduling, while the actual distributed data storage, reads, and writes are done by tablet servers. LevelDB can be understood as a simplified version of a tablet server.
LevelDB has the following notable features:
First, LevelDB is a persistent key-value store. Unlike Redis, an in-memory KV system, LevelDB does not consume memory the way Redis does; instead it keeps most of the data on disk.
Second, LevelDB stores records sorted by key value, so records with adjacent keys are stored next to each other in the data files. Applications can supply a custom key comparison function, and LevelDB will order records according to the user-defined comparator.
Again, like most KV systems, LevelDB's interface is simple, with basic operations such as writing records, reading records, and deleting records. Atomic batches of multiple operations are also supported.
In addition, LevelDB supports snapshots, so that read operations are not affected by concurrent writes and always see a consistent view of the data.
Beyond that, LevelDB supports features such as data compression, which directly helps reduce storage space and improve I/O efficiency.
LevelDB's performance is outstanding: the official site reports random write throughput of about 400,000 records per second and random read throughput of about 60,000 records per second. In general, LevelDB writes are much faster than reads, and sequential reads and writes are much faster than random ones. As for why, you will probably understand the underlying reasons after reading the rest of this series.
LevelDB Daily Notes, Part 2: Overall Structure
LevelDB is essentially a storage system plus a set of operation interfaces built on top of it. To make the whole system and its processing flow easier to understand, we can look at LevelDB from two different angles: static and dynamic. From the static angle, imagine the whole system running (constantly inserting, deleting, and reading data) and take a photograph of it; from that photo you can see how the system's data is distributed in memory and on disk, and in what state. From the dynamic angle, the question is how the system writes a record, reads a record, and deletes a record, as well as internal operations beyond these interfaces, such as compaction and how the system recovers after a crash.
The overall architecture described in this part takes the static angle; the following parts then detail the files and in-memory data structures involved in this static structure. The second half of the series introduces LevelDB from the dynamic angle, that is, how the whole system actually works.
LevelDB is a storage system whose storage media for data records include both memory and disk files. As mentioned above, if we give LevelDB a snapshot after it has run for a while, you will see the following scene:
Figure 1.1: LevelDB structure
As you can see, LevelDB's static structure consists of six main parts: the in-memory MemTable and Immutable MemTable, plus several main kinds of files on disk: the CURRENT file, the MANIFEST file, log files, and SSTable files. LevelDB does have some auxiliary files besides these six main parts, but the six files and data structures above are its main constituent elements.
LevelDB's log files and MemTable follow the BigTable paper. When an application writes a key:value record, LevelDB first appends it to the log file and then, on success, inserts the record into the MemTable. That basically completes the write operation: one write involves only one sequential disk write and one memory insert, which is the main reason LevelDB writes are so fast.
The log file's main role in the system is crash recovery without losing data. Without a log file, a record would initially live only in memory; if the system crashed before the in-memory data had been dumped to disk, the data would be lost (Redis has this problem). To avoid this, LevelDB records the operation in the log file before applying it in memory, so that after a crash the in-memory MemTable can be rebuilt from the log file without data loss.
When the data inserted into the MemTable occupies memory up to a limit, the in-memory records need to be exported to an external file. LevelDB then creates a new log file and a new MemTable, and the original MemTable becomes an Immutable MemTable; as the name implies, its contents are immutable and can only be read, not written to or deleted from. New incoming data is recorded in the new log file and new MemTable, while a LevelDB background thread exports the Immutable MemTable's data to disk, forming a new SSTable file. SSTables are produced by repeatedly exporting in-memory data and by compaction operations, and all SSTable files form a hierarchy of levels: the first level is level 0, the next is level 1, and so on, with the level number gradually increasing. This is why it is called LevelDB.
An SSTable file is ordered by key, that is, records with smaller keys come before records with larger keys in the file, and the SSTables at every level are like this. One point deserves attention, though: the level 0 SSTable files (suffix .sst) are special compared with the files at other levels. Two .sst files at this level may overlap in key range. For example, given two level 0 .sst files, file A with key range {bar, car} and file B with key range {blue, samecity}, both files may well contain a record with key "blood". For the SSTable files at the other levels there is no key overlap between .sst files at the same level: any two .sst files at level L (L > 0) are guaranteed not to overlap in key range. This requires special attention, and you will see that many differences in how operations are handled stem from this cause.
Since an .sst file belongs to a specific level and its stored records are key-ordered, each file has a minimum key and a maximum key, and this is important information that LevelDB should write down. The MANIFEST file does exactly this: it records the management information for each SSTable file, such as which level it belongs to, its file name, and its minimum and maximum key. Figure 2.1 is a schematic of what the MANIFEST stores:
Figure 2.1: MANIFEST storage
The figure shows only two files (the MANIFEST records this information for all SSTable files): the level 0 files Test.sst1 and Test.sst2, along with their respective key ranges. For example, Test.sst1's key range is "an" to "banana", and Test.sst2's key range is "baby" to "samecity"; you can see that the two key ranges overlap.
What does the CURRENT file do? It holds only one piece of information: the name of the current MANIFEST file. As LevelDB runs, compactions cause the set of SSTable files to change; new files appear and old files are discarded, and the MANIFEST tracks these changes. From time to time a new MANIFEST file is generated to record them, and CURRENT points out which MANIFEST file is the one we currently care about.
The contents described above make up LevelDB's overall static structure. In the next parts of this series, we first introduce the concrete data layout and structure of the important files and in-memory data.
LevelDB Daily Notes, Part 3: The Log File
The log file's main function in LevelDB is to ensure that no data is lost during recovery from a system failure. Because the log file is written before a record is inserted into the in-memory MemTable, even if the system fails before the MemTable's data has been dumped to the disk's SSTable files, LevelDB can rebuild the MemTable's contents from the log file without losing data. LevelDB and BigTable are consistent on this point.
Let's look at the log file's concrete physical and logical layout. LevelDB divides a log file into physical blocks of 32KB each, and a block is the basic unit of reading. Figure 3.1 shows a log file composed of three blocks; so, physically, a log file is made up of contiguous 32KB blocks.
Figure 3.1: Log file layout
The application does not see these blocks; what it sees is a series of key:value pairs. In LevelDB, one key:value pair is one record of data, and a record header is prepended to the data to hold some management information that eases internal processing. Figure 3.2 shows how a record is represented inside LevelDB.
Figure 3.2: Record structure
The record header contains three fields. "Checksum" is a check code over the "type" and "data" fields, used to avoid processing incomplete or corrupted data; when LevelDB reads a record, it verifies the data, and if the computed checksum matches the stored one, the data is intact and processing can continue. "Record length" holds the size of the data. "Data" is the key:value pair mentioned above. The "type" field indicates the relationship between each record's logical structure and the log file's physical block structure; specifically, there are four values: FULL, FIRST, MIDDLE, and LAST.
If a record's type is FULL, the current record is stored intact within one physical block and is not cut off by a block boundary; if a record is cut across adjacent physical blocks, its fragments take one of the other three types. Let's make this concrete with the example shown in Figure 3.1.
Suppose there are currently three records: record A of size 10KB, record B of size 80KB, and record C of size 12KB. Their logical layout in the log file is shown in Figure 3.1. Record A is the blue area in the figure; since 10KB < 32KB it fits in one physical block, so its type is FULL. Record B is 80KB, and block 1 has only about 22KB left after record A, which is not enough for record B, so the remainder of block 1 holds the beginning of record B with type FIRST, marking the starting fragment of a record. The remaining 58KB of record B can only go into subsequent physical blocks. Block 2 is only 32KB, still not enough for the rest of record B, so all of block 2 is used for record B with type MIDDLE, meaning this is a middle fragment of record B. The remainder of record B fits entirely in block 3, with type LAST, marking the final fragment of record B's data. The yellow record C, being 12KB, fits entirely in the remaining space of block 3, so its type is FULL.
From this small example you can see the relationship between logical records and physical blocks: LevelDB reads one block at a time as a physical read, and then stitches logical records back together according to the type field for subsequent processing.
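The splitting rules above can be sketched in a few lines of code. This is a simplified illustration, not LevelDB's actual writer: it assumes the 32KB block size and a 7-byte header (4-byte checksum, 2-byte length, 1-byte type) described above, and the function name `split_record` is my own.

```python
# A minimal sketch of the log framing described above: cut one record into
# FULL/FIRST/MIDDLE/LAST fragments at 32KB block boundaries, with a 7-byte
# header per fragment. Illustrative only; not LevelDB's actual code.
BLOCK_SIZE = 32 * 1024
HEADER_SIZE = 7  # 4-byte checksum + 2-byte length + 1-byte type

def split_record(data_len, block_offset):
    """Return (fragments, new_block_offset) for a record appended when the
    current block already holds block_offset bytes."""
    fragments = []
    first = True
    remaining = data_len
    while True:
        avail = BLOCK_SIZE - block_offset
        if avail < HEADER_SIZE:
            # Too little room even for a header: skip to the next block.
            block_offset = 0
            avail = BLOCK_SIZE
        payload = min(remaining, avail - HEADER_SIZE)
        last = (payload == remaining)
        if first and last:
            ftype = "FULL"
        elif first:
            ftype = "FIRST"
        elif last:
            ftype = "LAST"
        else:
            ftype = "MIDDLE"
        fragments.append((ftype, payload))
        block_offset = (block_offset + HEADER_SIZE + payload) % BLOCK_SIZE
        remaining -= payload
        first = False
        if last:
            return fragments, block_offset

# Records A (10KB) and B (80KB) from the example above:
frags_a, off = split_record(10 * 1024, 0)
frags_b, off = split_record(80 * 1024, off)
print([t for t, _ in frags_a])  # ['FULL']
print([t for t, _ in frags_b])  # ['FIRST', 'MIDDLE', 'LAST']
```

Running it reproduces the example: record A becomes a single FULL fragment in block 1, while record B is cut into FIRST, MIDDLE, and LAST fragments across blocks 1, 2, and 3.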
LevelDB Daily Notes, Part 4: The SSTable File
The SSTable is a critical piece of BigTable, and understanding LevelDB's SSTable implementation details also helps in understanding some implementation details of BigTable.
This part mainly describes the static layout of the SSTable. We said in Part 2 ("Overall Structure") that SSTable files form a hierarchy of different levels; how that hierarchy is formed is a question we defer to the later part on compaction. This part focuses on the physical and logical layout of a single SSTable file, which is helpful for understanding LevelDB's running process.
LevelDB has many SSTable files at different levels (the suffix .sst is their distinguishing mark), and all .sst files share the same internal layout. The previous part described how a log file is divided into physical blocks; an SSTable likewise divides the file into fixed-size physical storage blocks, but the logical layouts of the two are very different. The root cause is that a log file is unordered by key, that is, consecutive records' keys have no definite ordering relationship, whereas the inside of an .sst file is arranged by record key from small to large. From the SSTable layout described below you will come to appreciate why key ordering is the key to the .sst file's design.
Figure 4.1: The block structure of the .sst file
Figure 4.1 shows the physical partitioning of an .sst file. Like a log file, it is divided into fixed-size storage blocks, and each block has three parts: the red part is the data storage area; the blue type field identifies whether the stored data uses a compression algorithm (Snappy, or uncompressed); and the CRC part is a check code for detecting errors in the data's generation or transmission.
The above is the physical layout of the .sst file; next comes the logical layout of .sst files, that is, although everything is physical blocks, what content each block stores, its internal structure, and so on. Figure 4.2 shows the internal logical layout of the .sst file.
Figure 4.2: Logical layout
As Figure 4.2 shows, at a high level the .sst file can be divided into a data storage area and a data management area. The data storage area holds the actual key:value data, while the management area provides index pointers and other administrative data for finding the corresponding records more quickly and conveniently. Both areas are built out of the blocks described above: the blocks at the front of the file actually store the KV data, and the management area behind them stores the administrative data. The management data comes in four types: the purple meta block, the red metablock index, the blue data index blocks, and a footer block at the end of the file.
LevelDB version 1.2 does not actually use the meta block; it is just a reserved interface, presumably for content in later versions. Below we look at the internal structure of the data index area and the footer at the end of the file.
Figure 4.3: Data index
Figure 4.3 shows the internal structure of the data index. To repeat, the KV records inside a data block are sorted by key from small to large, and each record in the data index area is index information established for one data block. Each index entry contains three fields; the index for block i is shown in red in Figure 4.3. The first field records a key that is greater than or equal to the largest key value in block i; the second field indicates block i's starting position in the .sst file; and the third field indicates the size of data block i (possibly after data compression). The last two fields are easy to understand: they locate the data block within the file. The first field needs a detailed explanation: the key value stored in the index is not necessarily the key of any actual record. In the example of Figure 4.3, suppose block i's smallest key is "samecity" and its largest key is "the best", while block i+1's smallest key is "the fox" and its largest key is "zoo". Then for data block i's index entry, the first field records a key that is greater than or equal to block i's maximum key ("the best") and smaller than block i+1's minimum key ("the fox"); so in the example, index i's first field is "the c", which meets the requirement, while index i+1's first field is "zoo", that is, block i+1's own largest key.
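A short sketch shows how such a separator key can be chosen: find the shortest string that is greater than or equal to one block's maximum key and still smaller than the next block's minimum key. This is a simplified reimplementation in the spirit of LevelDB's `BytewiseComparator::FindShortestSeparator`, not the library's actual code.

```python
# Choose a short index key between two adjacent blocks: >= the first
# block's max key and < the next block's min key. Simplified sketch.
def find_shortest_separator(start: str, limit: str) -> str:
    # Find the length of the common prefix of start and limit.
    i = 0
    while i < min(len(start), len(limit)) and start[i] == limit[i]:
        i += 1
    if i >= min(len(start), len(limit)):
        return start  # one string is a prefix of the other: do not shorten
    b = ord(start[i])
    if b < 0xFF and b + 1 < ord(limit[i]):
        return start[:i] + chr(b + 1)  # bump one character and truncate
    return start

# The example from Figure 4.3: block i's max key is "the best",
# block i+1's min key is "the fox".
print(find_shortest_separator("the best", "the fox"))  # "the c"
```

With the figure's keys, the common prefix is "the " and bumping 'b' to 'c' yields "the c", exactly the first field of index i in the example.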
The internal structure of the footer block at the end of the file is shown in Figure 4.4. metaindex_handle indicates the starting position and size of the metaindex block; index_handle indicates the starting address and size of the index block. These two fields can be understood as an index over the indexes, set up so that the index values can be read correctly; they are followed by a padding area and a magic number.
Figure 4.4: Footer
The above mainly concerned the internal structure of the data management area. Now let us look at the internal layout of the data portion of a block in the data storage area (the red part in Figure 4.1); Figure 4.5 shows it.
Figure 4.5: Data block internal structure
As you can see, the inside is also divided into two parts. The front is the KV records, ordered by key value from small to large; at the tail of the block are a number of "restart points", which are really just pointers indicating the positions of certain records in the block's content.
What is a "restart point"? We have repeatedly emphasized that the KV records in a block's content are ordered by key, which means two adjacent records are likely to share a key prefix. For example, if key i = "the car" and key i+1 = "the color", they share the prefix "the c". To reduce key storage, key i+1 can store only the part that differs from the previous key, "olor", and obtain the common part from key i. Keys are stored this way in the block content area mainly to reduce storage overhead. A "restart point" means: starting from this record, stop storing only the differing part of the key and record the complete key value again. So if key i+1 is a restart point, the key is stored in full as "the color" rather than in the shortened form "olor". The tail of the block indicates which records are these restart points.
Figure 4.6: Record format
What is the internal structure of each KV record in the block content area? Figure 4.6 shows it in detail. Each record contains five fields: the shared key length (for the "olor" record above, the length of the key part shared with the previous record is the length of "the c", i.e. 5); the non-shared key length (for "olor", this is 4); the value length, which indicates the length of the value in key:value; the non-shared key content, which actually stores the key string ("olor"); and the value content field, which stores the actual value.
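The prefix compression and restart points described above can be sketched as follows. This is an illustration of the idea, not LevelDB's exact on-disk encoder; `encode_block` and the tuple layout are my own names for the five fields.

```python
# Encode a block's entries as (shared key length, unshared key length,
# value length, unshared key bytes, value bytes); every restart point
# stores the full key (shared length 0). Illustrative sketch.
def encode_block(entries, restart_interval=16):
    out = []
    restarts = []
    prev_key = ""
    for idx, (key, value) in enumerate(entries):
        if idx % restart_interval == 0:
            restarts.append(idx)
            shared = 0  # restart point: record the complete key again
        else:
            shared = 0
            while (shared < min(len(prev_key), len(key))
                   and prev_key[shared] == key[shared]):
                shared += 1
        unshared = key[shared:]
        out.append((shared, len(unshared), len(value), unshared, value))
        prev_key = key
    return out, restarts

entries = [("the car", "a"), ("the color", "b")]
encoded, restarts = encode_block(entries)
# "the car" and "the color" share the 5-character prefix "the c",
# so the second entry stores only "olor".
print(encoded[1])   # (5, 4, 1, 'olor', 'b')
print(restarts)     # [0]
```

As in the text's example, the record for "the color" carries shared length 5 and non-shared content "olor", while the first record of the block (a restart point) stores "the car" in full.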
That is all there is to the internal mysteries of the .sst file.
LevelDB Daily Notes, Part 5: MemTable in Detail
The previous parts described LevelDB's important static structures in disk files; this part describes the in-memory data structure, the MemTable, whose importance in the whole system is self-evident. Overall, all KV data is stored in the MemTable, the Immutable MemTable, and the SSTables. The Immutable MemTable is structurally identical to the MemTable, except that it is read-only and does not allow write operations, whereas the MemTable allows both writing and reading. When the data written to the MemTable reaches a specified memory limit, it is automatically converted into an Immutable MemTable and waits to be dumped to disk, and the system automatically generates a new MemTable for write operations to write new data into. Once you understand the MemTable, the Immutable MemTable is naturally a cinch.
LevelDB's MemTable provides operation interfaces for writing KV data, and for deleting and reading KV records, but in fact the MemTable has no true delete operation. Deleting a key's value is implemented as inserting a record into the MemTable tagged with a deletion marker for that key; the actual deletion is lazy and happens later, when the KV pair is removed during compaction.
It is important to note that KV pairs in the LevelDB MemTable are stored in order of key size: when the system inserts a new KV pair, LevelDB places it at the appropriate position to maintain key ordering. In fact, LevelDB's MemTable class is just an interface; the real operations, including inserts and reads, are carried out by the skip list behind it, so the MemTable's core data structure is a skip list.
The skip list was invented by William Pugh, who published "Skip Lists: A Probabilistic Alternative to Balanced Trees" in Communications of the ACM, June 1990, 33(6), 668-676, explaining the skip list data structure and its insert and delete operations in detail.
The skip list is an alternative data structure to balanced trees, but unlike red-black trees, it achieves the balancing of the tree through a randomized algorithm, which makes insertion and deletion in a skip list relatively straightforward.
A detailed introduction to skip lists can be found in this article: http://www.cnblogs.com/xuqiang/archive/2011/05/22/2053516.html, which explains them very clearly. LevelDB's skip list is basically a concrete realization of that, with nothing particularly unusual about it.
The skip list is not only simple to implement for maintaining ordered data; compared with a balanced tree, it avoids frequent tree-node rebalancing operations when inserting data, so write efficiency is very high. LevelDB is overall a write-heavy system, and the skip list surely plays an important role in that. Redis also uses a skip list as an internal data structure in order to speed up insert operations.
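A compact skip list in the spirit of the MemTable's core structure can be written as follows: randomized node levels, ordered insertion, and ordered traversal. This is a simplified version of Pugh's algorithm, not LevelDB's actual `SkipList` class; the promotion probability of 1/4 is the value LevelDB uses.

```python
# A minimal skip list: randomized levels, ordered insert, in-order scan.
import random

MAX_LEVEL = 12
P = 0.25  # probability of promoting a node one level up

class Node:
    def __init__(self, key, value, level):
        self.key, self.value = key, value
        self.forward = [None] * level  # next pointers, one per level

class SkipList:
    def __init__(self):
        self.head = Node(None, None, MAX_LEVEL)
        self.level = 1

    def _random_level(self):
        lvl = 1
        while random.random() < P and lvl < MAX_LEVEL:
            lvl += 1
        return lvl

    def insert(self, key, value):
        # Record, at every level, the last node before the insert position.
        update = [self.head] * MAX_LEVEL
        x = self.head
        for i in range(self.level - 1, -1, -1):
            while x.forward[i] and x.forward[i].key < key:
                x = x.forward[i]
            update[i] = x
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        node = Node(key, value, lvl)
        for i in range(lvl):  # splice the node in by adjusting pointers
            node.forward[i] = update[i].forward[i]
            update[i].forward[i] = node

    def get(self, key):
        x = self.head
        for i in range(self.level - 1, -1, -1):
            while x.forward[i] and x.forward[i].key < key:
                x = x.forward[i]
        x = x.forward[0]
        return x.value if x and x.key == key else None

    def items(self):  # in-order traversal along the bottom level
        x = self.head.forward[0]
        while x:
            yield x.key, x.value
            x = x.forward[0]

sl = SkipList()
for k in ["cherry", "apple", "banana"]:
    sl.insert(k, k.upper())
print(list(sl.items()))  # [('apple', 'APPLE'), ('banana', 'BANANA'), ('cherry', 'CHERRY')]
```

However the coin flips land, keys always come back in sorted order along the bottom level, which is exactly the property the MemTable relies on.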
LevelDB Daily Notes, Part 6: Writing and Deleting Records
In the previous five parts we introduced LevelDB's static files and their detailed layouts; starting with this part, we look at LevelDB's dynamic operations, such as reading and writing records, compaction, error recovery, and so on.
This part describes LevelDB's record update operations, that is, inserting a KV record or deleting a KV record. LevelDB's update operations are very fast, because its internal mechanisms make these update operations simple.
Figure 6.1: LevelDB writing a record
Figure 6.1 shows how LevelDB updates KV data. For an insert operation Put(key, value), completing the insert involves two concrete steps. First, the KV record is appended sequentially to the end of the log file described earlier; although this is a disk write, a sequential append to a file is very efficient, so it does not slow the write down. Second, if the log write succeeds, the KV record is inserted into the in-memory MemTable. As described earlier, the MemTable is just a thin layer of encapsulation; inside is a key-ordered skip list, and inserting a new record is also very simple: find the appropriate insert position, then adjust the corresponding link pointers to splice the new record in. With that step done, the write is complete. So one insert operation involves one sequential append to a disk file and one in-memory skip list insert, which is the root cause of LevelDB's highly efficient writes.
From the above you can also see that the log file is unordered by key, while the MemTable is key-ordered. So what if we delete a KV record? For LevelDB there is no immediate delete operation; deletion works just like insertion, except that instead of inserting key:value it inserts key:deletion-marker and does not really remove the record. The real deletion is done later, during a background compaction.
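The two-step write path and tombstone-style delete can be sketched in a toy form. All names here (`ToyDB`, the list standing in for the log, the dict standing in for the key-ordered skip list) are illustrative, not LevelDB's API.

```python
# Toy sketch of the write path: append to the log first, then insert into
# the memtable; a delete is just an insert carrying a deletion marker.
class ToyDB:
    def __init__(self):
        self.log = []        # stands in for the sequential on-disk log
        self.memtable = {}   # stands in for the key-ordered skip list

    def put(self, key, value):
        self.log.append(("put", key, value))   # step 1: sequential log append
        self.memtable[key] = ("value", value)  # step 2: memtable insert

    def delete(self, key):
        # No in-place removal: insert a tombstone; the real deletion
        # happens later, during compaction.
        self.log.append(("del", key, None))
        self.memtable[key] = ("deleted", None)

db = ToyDB()
db.put("www.samecity.com", "we")
db.delete("www.samecity.com")
print(db.memtable["www.samecity.com"])  # ('deleted', None) - a tombstone
print(len(db.log))                      # 2 - both operations were logged
```

Note that the delete leaves a record behind rather than removing one: both operations take the same cheap append-then-insert path, which is why updates and deletes are equally fast.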
LevelDB's write operation is that simple. The real trouble lies in the read operation, introduced next.
LevelDB Daily Notes, Part 7: Reading Records
LevelDB is a standalone store for large-scale key/value data; from an application's perspective, LevelDB is a storage tool. And for any competent storage tool, the common call interface amounts to a few operations: insert a KV, delete a KV, read a KV, and update the value corresponding to a key. LevelDB's interface does not directly expose an update operation. If you need to update a key's value, you can simply and boldly insert a new KV with the same key, so that the value the system associates with that key is updated; or, more tactfully, you can first delete the old KV and then insert the new one, completing the KV update that way.
Suppose the application submits a key; let's see how LevelDB reads the corresponding value from its stored data. Figure 7-1 shows LevelDB's overall reading process.
Figure 7-1: LevelDB's record-reading process
LevelDB first looks in the in-memory MemTable; if the MemTable contains the key, the corresponding value is returned. If the key is not found in the MemTable, the in-memory Immutable MemTable is read next in the same way; if found there, the value is returned. If still not found, there is no choice but to descend to the large set of SSTable files on disk. Because SSTables are numerous and divided into levels, reading data from them is quite a winding journey. The general principle is this: first search the files belonging to level 0; if found, return the corresponding value; if not found, search the files at level 1, and so on repeatedly, until the value is found in some level's SSTable file (or the lookup fails at the highest level, meaning the key does not exist anywhere in the system).
So why does the query go from the MemTable to the Immutable MemTable, then from the Immutable MemTable to the files, and through the levels from low to high? What is the reason? This query path is chosen because it is ordered by how recently the information was updated: the MemTable clearly stores the freshest KV pairs; the KV pairs stored in the Immutable MemTable are next; and the KV data in all SSTable files is certainly not as fresh as the in-memory MemTable and Immutable MemTable. Among SSTable files, if the same key is found at both level L and level L+1, the information at level L must be newer than that at level L+1. In other words, the search path listed above is sorted by the freshness of the data, and the fresher the data, the earlier it is searched.
Why would you prefer to find the fresh data? This is common sense; take an example. Suppose we first insert the record {key="www.samecity.com" value="we"} into LevelDB; a few days later, the Samecity website is renamed to "69 samecity", and we insert the record {key="www.samecity.com" value="69 samecity"}. Same key, different values. Logically we understand LevelDB as holding only one storage record, the second one, but it may well physically hold two records, that is, both of the records above. Now, if the user queries key="www.samecity.com", we certainly want the most recently updated record, the second one, to be returned. That is why fresh data must be looked up first.
We said above: among SSTable files, if the same key is found at both level L and level L+1, the level L information must be newer than the level L+1 information. This is a conclusion that, in theory, requires a proof, or else the following question arises: why on earth? Common sense makes it clear: the level L+1 data did not spring from a crack in a stone, nor arrive in a dream, so where did it come from? Level L+1's data is obtained from level L by compaction (if you don't know what compaction is yet, you will later). In other words, the level L+1 SSTable data you see now came from the former level L, and the current level L data is fresher than that former data; so it is certain that the current level L data is fresher than the current level L+1 data.
There are many SSTable files; how is the key's corresponding value found quickly? In LevelDB, level 0 has always been a special case, and the process of finding a key at level 0 differs from the other levels. Because different files under level 0 may overlap in key range, the queried key may be contained in multiple files. LevelDB's strategy is first to work out which level 0 files contain this key (the MANIFEST records each file's level and key range, and LevelDB keeps this mapping table in memory), sort those files by freshness with the newest file first, and then read them in turn looking for the key's value. For non-zero levels, the files within a level do not overlap in key range, so the key's value can only be in a single file, and only that one file need be searched.
One last question: given a key to query and an SSTable file whose key range contains this key, how does LevelDB carry out the concrete search? LevelDB typically first looks in its in-memory cache for a cached record of this file; if it is present, it reads from the cache, and if not, it opens the SSTable file, loads the file's index portion into memory, and adds it to the cache. At this point the cache holds this SSTable with only its index portion in memory, and according to the index LevelDB can locate which content block may contain this key. It reads that block's content from the file and compares it against the records one by one; if the key is found, the result is returned, and if not, this level's SSTable file does not contain the key, so the search moves on to the next SSTable.
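The freshness-ordered lookup path described above can be sketched as a small function. This is an illustration, not LevelDB's code: each dict stands in for one data source, and the level 0 case (several overlapping files checked newest-first) is collapsed into a single dict per level for brevity.

```python
# Sketch of the read path: memtable, then immutable memtable, then the
# levels from 0 upward; the first hit wins because fresher data always
# sits earlier on this path. A tombstone makes the key read as absent.
TOMBSTONE = object()

def get(key, memtable, immutable, levels):
    for source in [memtable, immutable] + levels:
        if key in source:
            v = source[key]
            return None if v is TOMBSTONE else v
    return None  # searched every level: the key does not exist

memtable = {"www.samecity.com": "69 samecity"}   # freshest value
immutable = {"old-key": TOMBSTONE}               # deleted, not yet compacted
levels = [{"www.samecity.com": "we", "old-key": "stale"}]  # older data

print(get("www.samecity.com", memtable, immutable, levels))  # 69 samecity
print(get("old-key", memtable, immutable, levels))           # None
```

The two lookups show both consequences of the ordering: the fresher value "69 samecity" shadows the stale "we", and the tombstone in the Immutable MemTable hides the stale record still sitting at a deeper level.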
From the earlier introduction of LevelDB and the read operation described here, you can see that reads are far more complex to handle than writes, so the write speed is necessarily much higher than the read speed; in other words, LevelDB suits write-heavy applications better than read-heavy ones. If the application mostly reads, sequential reads will be more efficient, because most of the content will be found in the cache; avoid large numbers of random reads as much as possible.
LevelDB Daily Notes, Part 8: Compaction
As mentioned earlier, for Leveldb, write record operation is very simple, delete records only write a delete tag even if it is done, but the reading records are more complex, need to be in the memory and in each level of the file according to the freshness of the search, the cost is high. In order to speed up the reading speed, Leveldb took a compaction way to collate the existing records, in this way, to remove some no longer valid KV data, reduce the size of the data, reduce the number of files and so on.
LevelDB's compaction mechanism and process are essentially the same as BigTable's. BigTable has three kinds of compaction: minor, major, and full. A minor compaction dumps the data in a memtable into an sstable file; a major compaction merges sstable files across different levels; a full compaction merges all the sstables. LevelDB implements the first two, minor and major. We describe each mechanism in detail below.
Let us look at the minor compaction first. Its purpose is this: when the in-memory memtable reaches a certain size, save its contents into a disk file. Figure 8.1 shows the mechanism.
Figure 8.1 Minor compaction
As Figure 8.1 shows, when the memtable grows to a certain size it is converted into an immutable memtable, which can no longer be written to; its KV contents can only be read. As introduced earlier, the immutable memtable is actually a skip list (a multi-level linked list) whose records are sorted by key. The minor compaction itself is therefore very simple: traverse the immutable memtable's records from smallest key to largest, write them into a new level-0 sstable file, and when the write finishes build the file's index data. That completes one minor compaction. Note also that deleted records are not actually removed during a minor compaction. The reason is simple: at this point we only know the key has been marked deleted, but where is the old KV data? Finding it would require an expensive lookup, so the minor compaction deletes nothing; it just writes the key out as a deletion record. The real deletion happens later, in a higher-level compaction.
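A minimal sketch of this dump, under the simplifying assumption that the immutable memtable is a plain sorted mapping and an sstable is just a list of records (the `DELETED` tombstone and function names are mine, not LevelDB's):

```python
# Tombstone standing in for LevelDB's deletion marker.
DELETED = object()

def minor_compaction(immutable_memtable):
    # immutable_memtable: dict of key -> value (or the DELETED tombstone).
    # Walk the records from smallest to largest key and write them out
    # as a new level-0 "sstable" (modeled here as a plain list).
    sstable = []
    for key in sorted(immutable_memtable):
        # Tombstones are written out too: the real delete only happens
        # during a later, higher-level compaction.
        sstable.append((key, immutable_memtable[key]))
    return sstable

mem = {"b": "2", "a": "1", "c": DELETED}
level0_file = minor_compaction(mem)
```

Because the memtable is already sorted by key, the output file is sorted without any extra work.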
When the number of sstable files in some level exceeds a set threshold, LevelDB selects one sstable file from that level (for level > 0) and merges it with the sstable files of the next higher level (level + 1). This is the major compaction.
We know that for levels greater than 0, the keys within each sstable file are stored in order from smallest to largest, and the key ranges of different files (each file's minimum and maximum keys) do not overlap. Level-0 sstable files are special: each file is itself sorted by key, but because level-0 files are produced directly by minor compaction, any two level-0 sstable files may overlap in key range. So when doing a major compaction at a level greater than 0, picking a single file is enough; at level 0, however, after one file is chosen, other sstable files' key ranges may well overlap with it. In that case all the overlapping files must be found and merged together with the level-1 files. That is, at level 0, more than one file may participate in a major compaction.
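The "find all overlapping level-0 files" step can be sketched like this. Note one subtlety the sketch makes explicit: adding an overlapping file widens the combined key range, which may pull in further files, so the search must repeat until it stabilizes. Function names and the range representation are illustrative assumptions.

```python
def overlaps(range_a, range_b):
    # Two [min_key, max_key] ranges overlap unless one ends
    # strictly before the other begins.
    return not (range_a[1] < range_b[0] or range_b[1] < range_a[0])

def pick_level0_inputs(files, seed):
    # files: list of (min_key, max_key) ranges; seed: index of the
    # initially chosen level-0 file. Keep widening the chosen set
    # until no remaining level-0 file overlaps the combined range.
    chosen = {seed}
    changed = True
    while changed:
        changed = False
        lo = min(files[i][0] for i in chosen)
        hi = max(files[i][1] for i in chosen)
        for i, rng in enumerate(files):
            if i not in chosen and overlaps(rng, (lo, hi)):
                chosen.add(i)
                changed = True
    return sorted(chosen)
```

For example, with ranges a-d, c-g, f-j, and x-z, seeding on the first file transitively pulls in the second and third, but not the disjoint fourth.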
After choosing a level for compaction, LevelDB must also choose which file in that level to compact. Here LevelDB uses a small trick: take turns. For example, if file A was compacted this time, then next time it is the turn of the file whose key range comes right after file A's, say file B. In this way every file gets a chance, in rotation, to be merged with the files of the next level.
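One simple way to implement this rotation, sketched under my own naming: remember the largest key of the file compacted last time, and next time pick the first file whose largest key lies past it, wrapping around to the beginning of the level when the end is reached.

```python
def pick_compaction_file(files, last_compacted_key):
    # files: (min_key, max_key) ranges of one level, sorted by key.
    # last_compacted_key: largest key of the previously compacted
    # file in this level, or None if the level was never compacted.
    for rng in files:
        if last_compacted_key is None or rng[1] > last_compacted_key:
            return rng
    return files[0]  # past the end of the level: wrap around
```

Each level keeps its own such marker, so the rotation in one level is independent of the others.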
Suppose file A at level L is to be merged with files at level L+1; the question then is which level-L+1 files to merge with. LevelDB selects all level-L+1 files whose key ranges overlap file A's and merges them with file A.
That is, file A at level L is chosen first, and then all the level-L+1 files that must take part in the merge are found: B, C, D, and so on. The remaining question is how the major merge itself proceeds: given a set of files, each sorted by key, how do we merge them so that the newly produced files are still sorted by key, while throwing away the KV data that no longer has any value?
Figure 8.2 illustrates this process.
Figure 8.2 Sstable Compaction
The major compaction proceeds as follows: perform a multi-way merge sort over the input files, repeatedly taking the record with the smallest key; this effectively re-sorts all the records across the files. For each record, apply certain criteria to decide whether the key still needs to be kept. If it is judged worthless, throw it away directly; if it should be kept, write it into a newly generated sstable file at level L+1. Processing the KV data one record at a time in this way produces a series of new level-L+1 data files. The original level-L file and the level-L+1 files that took part in the compaction are now meaningless, so they are all deleted. This completes the merge of the level-L and level-L+1 file records.
So during a major compaction, what is the criterion for discarding a KV record? One criterion is: for a given key, if that key also appears at a level lower than L, the KV record can be thrown away during the major compaction. As analyzed earlier, if a file at a level lower than L contains a record with the same key, that record holds a fresher value for the key; the older value is meaningless and can therefore be deleted.
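The merge-and-discard loop can be sketched with a heap-based multi-way merge. This is a simplification of what LevelDB does: here each input file is a key-sorted list of records, files earlier in the input list are assumed fresher (the level-L file before the level-L+1 files), and when the same key appears in several inputs only the freshest record survives.

```python
import heapq

def major_compaction(inputs):
    # inputs: list of key-sorted lists of (key, value) records,
    # ordered freshest-first (e.g. the level-L file, then level L+1).
    # Tag every record with (key, file_rank) so the heap yields
    # equal keys freshest-first.
    heap = []
    for rank, records in enumerate(inputs):
        for key, value in records:
            heapq.heappush(heap, (key, rank, value))
    merged, last_key = [], None
    while heap:
        key, rank, value = heapq.heappop(heap)
        if key == last_key:
            continue  # older record for an already-emitted key: discard
        merged.append((key, value))
        last_key = key
    return merged  # would be split into new level-L+1 sstable files
```

The real implementation streams the inputs rather than loading them all into memory, but the ordering and the discard rule are the same idea.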
LevelDb Daily Notes, Part 9: The Cache in LevelDb
Continuing from the previous part: in LevelDB, if a read does not find the record in the in-memory memtable, more than one disk access follows. Even in the best case, where the key is found on the first try in the freshest level-0 file, two disk reads are still needed: one to read the index portion of the sstable file into memory, so that the block holding the key can be determined from the index; and a second to read that block's contents, which are then searched in memory for the value corresponding to the key.
LevelDB therefore introduces two different caches: the table cache and the block cache. The block cache is optional; whether it is enabled is specified in the configuration.
Figure 9.1 Table Cache
Figure 9.1 shows the structure of the table cache. In this cache the key is the sstable's file name, and the value has two parts: a file pointer to the sstable file opened on disk, for convenient reading, and a pointer to the in-memory Table structure corresponding to that sstable. The Table structure keeps the sstable's index contents in memory, along with the cache_id used to identify entries in the block cache and, of course, some other data.
For example, during a Get(key) read operation, if LevelDB determines that the key falls within the key range of some file A, it must check whether file A really contains that KV record. LevelDB first looks in the table cache to see whether the file is cached; if so, the index section can be used to find out which block may contain the key. If the file is not found in the cache, LevelDB opens the sstable file, reads its index portion into memory, inserts it into the cache, and then uses the index to locate the block. Once the block containing the key is determined, its contents must be read, and that is the second disk read.
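A toy version of the table cache, assuming an LRU eviction policy (LevelDB's table cache is indeed LRU-based, but every name and signature below is illustrative): the key is the sstable file name, the value bundles the open file handle and the parsed index, and `open_table` stands in for the disk open plus index read that happens on a miss.

```python
from collections import OrderedDict

class TableCache:
    def __init__(self, capacity, open_table):
        self.capacity = capacity
        self.open_table = open_table
        self.entries = OrderedDict()   # insertion order tracks recency

    def lookup(self, file_name):
        if file_name in self.entries:
            self.entries.move_to_end(file_name)   # mark most recently used
            return self.entries[file_name]
        entry = self.open_table(file_name)        # cache miss: hit the disk
        self.entries[file_name] = entry
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)      # evict least recently used
        return entry

opened = []
def fake_open(name):
    opened.append(name)                           # count simulated disk opens
    return ("handle:" + name, "index:" + name)

cache = TableCache(capacity=2, open_table=fake_open)
cache.lookup("000001.sst")
cache.lookup("000002.sst")
cache.lookup("000001.sst")                        # served from the cache
```

After the three lookups above, only two "disk" opens have happened; the repeated lookup of the first file is a pure cache hit.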
Figure 9.2 Block Cache
The block cache exists to speed up this second step, and Figure 9.2 shows its structure. The cache key is the file's cache_id plus the block's starting position in the file, block_offset; the value is the contents of that block.
If LevelDB finds the block in the block cache, it avoids the disk read and searches the cached block contents directly for the key's value; if not, it reads the block's contents from disk and inserts them into the block cache. This is how LevelDB speeds up reads with its two caches. It follows that if the data has good locality, so that most reads can be satisfied from the cache, read efficiency will be high; sequential reads by key should also perform well, because a block read in once can be reused many times. With random reads, you can infer for yourself how efficient it will be.
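A matching toy block cache, with the (cache_id, block_offset) compound key the text describes. `read_block` stands in for the real disk read; the class and its fields are my own illustration, not LevelDB's API.

```python
class BlockCache:
    def __init__(self, read_block):
        self.read_block = read_block
        self.blocks = {}          # (cache_id, block_offset) -> block contents
        self.disk_reads = 0       # for demonstration: count real reads

    def get_block(self, cache_id, block_offset):
        key = (cache_id, block_offset)
        if key not in self.blocks:                 # miss: one disk read
            self.disk_reads += 1
            self.blocks[key] = self.read_block(cache_id, block_offset)
        return self.blocks[key]                    # hit: no disk access

cache = BlockCache(lambda cid, off: "block@%d:%d" % (cid, off))
first = cache.get_block(7, 4096)
second = cache.get_block(7, 4096)                  # same block, now cached
```

The cache_id in the key is what ties a cached block back to the Table structure in the table cache, so two sstables with blocks at the same file offset never collide.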
LevelDb Daily Notes, Part 10: Version, VersionEdit, VersionSet
A Version keeps, in memory, the information about all the files currently on disk; normally only one Version is the "current" version. LevelDB also keeps a series of historical versions. What are these historical versions for?
When an iterator is created, it references the current version, and as long as the iterator is not deleted, the version it references stays alive. This means that when you are done with an iterator, you should delete it promptly.
Once a compaction finishes (new files have been generated, and the pre-merge files need to be deleted), LevelDB creates a new version as the current version, and the previous current version becomes a historical version.
A VersionSet is the collection of all versions; it manages every version that is still alive.
A VersionEdit represents the change between versions, a delta or increment of sorts: it records which files were added and which files were deleted. The relationship between them is:
Version0 + VersionEdit --> Version1
Each VersionEdit is saved to the MANIFEST file, and during recovery the VersionEdits are read back from the MANIFEST file to reconstruct the state.
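The relation Version0 + VersionEdit --> Version1 can be sketched concretely. Here a version is modeled as a mapping from level to the set of live file numbers, and an edit records the files a compaction added and deleted; the field names are illustrative, not LevelDB's actual members.

```python
def apply_edit(version, edit):
    # version: {level: set(file_numbers)}
    # edit:    {"added": [(level, file_no)], "deleted": [(level, file_no)]}
    # Build a NEW version; the old one is left untouched, since live
    # iterators may still be reading through it.
    new_version = {lvl: set(files) for lvl, files in version.items()}
    for lvl, f in edit["deleted"]:
        new_version[lvl].discard(f)
    for lvl, f in edit["added"]:
        new_version.setdefault(lvl, set()).add(f)
    return new_version

# A compaction merged level-0 files 1 and 2 into a new level-1 file 4:
v0 = {0: {1, 2}, 1: {3}}
edit = {"deleted": [(0, 1), (0, 2)], "added": [(1, 4)]}
v1 = apply_edit(v0, edit)
```

Replaying every edit recorded in the MANIFEST against an empty initial version reconstructs the current version on recovery, which is exactly why the edits are persisted.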
LevelDB's version control reminds me of double-buffered switching. Double buffering comes from graphics, where it solves the flicker problem when drawing the screen, and it is useful in server programming as well.
For example, suppose our server holds a dictionary that must be updated every day. We can open a new buffer, load the new dictionary into it, and once loading completes, point the dictionary pointer at the new buffer.
LevelDB's version management is similar to double-buffered switching, except that if the old version is still referenced by an iterator, that version persists; only when no iterator references it any longer can the version be deleted.
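The keep-alive rule is simple reference counting, sketched here with names of my own choosing: each iterator takes a reference on the version it was created against, and the version is only discarded when the last reference is dropped.

```python
class Version:
    def __init__(self, name):
        self.name = name
        self.refs = 0
        self.live = True

    def ref(self):
        self.refs += 1

    def unref(self):
        self.refs -= 1
        if self.refs == 0:
            self.live = False   # stand-in for actually freeing the version

old = Version("v1")
old.ref()                       # iterator 1 pins this version
old.ref()                       # iterator 2 pins it as well
# ... a compaction installs a newer current version here; "old" is
# now historical, but must survive while iterators still use it ...
old.unref()                     # iterator 1 deleted
still_alive = old.live          # True: iterator 2 still references it
old.unref()                     # iterator 2 deleted: version can go
```

This is why deleting iterators promptly matters: a forgotten iterator keeps an entire historical version, and all its files, alive.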
Note: this post draws on the Langerhans Technology blog: http://www.samecity.com/blog/Index.asp?SortID=12
Original text: http://www.cnblogs.com/haippy/archive/2011/12/04/2276064.html