Tokyo Cabinet 1.4.19 Reading Notes (i): Overview of the Hash Database
I have started a systematic study of key-value persistent storage. The first project I am reading is Tokyo Cabinet (TC), version 1.4.19.
Tokyo Cabinet supports several database formats, including the hash database, the B+ tree database, the fixed-length database, and the table database. So far I have only read the implementation of the first one, the hash database. I chose it because it seems to be the most commonly used form of TC, and because its algorithms are simpler than the B+ tree's while its performance is no worse.
A few words about how the code in TC is organized. Each of the database formats above is implemented in a single pair of files; for example, the hash database code is entirely contained in tchdb.c/h, only a little over 4,000 lines. Apart from these database implementation files, the remaining code falls broadly into two categories: auxiliary code used by the rest of the project, and standalone CLI programs for managing the databases; for example, tchmgr.c/h is the CLI program for managing the hash database. The point of describing the code organization is simply this: if your interest is the hash database (or any one of the other formats), the amount of code you need to care about in TC is not large.
First look at how the database files are organized.
As the diagram shows, the hash database file is roughly divided into four parts: the file header, the bucket array, the free pool array, and finally the section that actually holds the records. Each part is described below.
1) Database File Header
The header portion of the database file contains some general information about the database, including the following:
| Name | Offset | Length | Feature |
|------|--------|--------|---------|
| Magic number | 0 | 32 | Identification of the database. Begins with "ToKyO CaBiNeT" |
| Database type | 32 | 1 | Hash (0x01) / B+ tree (0x02) / fixed-length (0x03) / table (0x04) |
| Additional flags | 33 | 1 | Logical union of open (1<<0) and fatal (1<<1) |
| Alignment power | 34 | 1 | The alignment size, as a power of 2 |
| Free block pool power | 35 | 1 | The number of elements in the free block pool, as a power of 2 |
| Options | 36 | 1 | Logical union of large (1<<0), deflate (1<<1), bzip2 (1<<2), tcbs (1<<3), and extra codec (1<<4) |
| Bucket number | 40 | 8 | The number of elements of the bucket array |
| Record number | 48 | 8 | The number of records in the database |
| File size | 56 | 8 | The file size of the database |
| First record | 64 | 8 | The offset of the first record |
| Opaque region | 128 | 128 | Users can use this region arbitrarily |
To be clear, the table above comes from Tokyo Cabinet's official documentation. Note also that numeric fields in the database file are stored in little-endian byte order; this will not be repeated below. As the table shows, the database file header is 256 bytes long.
Every API that manipulates a hash database takes a pointer to a TCHDB object. This structure holds all of the information in the database file header, so each time a hash database is opened or created, the header is read into this structure (see the function tchdbloadmeta).
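To make the header layout concrete, here is a minimal sketch of reading a few of the fields above out of the 256-byte header on a little-endian host. This is not the library's tchdbloadmeta; the struct and function names are hypothetical, and the field offsets follow the table above.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
  uint8_t  type;   /* database type: 0x01 = hash */
  uint8_t  apow;   /* alignment size as a power of 2 */
  uint8_t  fpow;   /* free block pool size as a power of 2 */
  uint64_t bnum;   /* number of elements in the bucket array */
  uint64_t rnum;   /* number of records */
  uint64_t fsiz;   /* file size */
  uint64_t frec;   /* offset of the first record */
} HeaderSketch;

/* Parse a few fields out of a 256-byte header buffer; returns -1 on a bad magic. */
static int load_header_sketch(const unsigned char hbuf[256], HeaderSketch *h) {
  if (memcmp(hbuf, "ToKyO CaBiNeT", 13) != 0) return -1;   /* magic string check */
  h->type = hbuf[32];
  h->apow = hbuf[34];
  h->fpow = hbuf[35];
  memcpy(&h->bnum, hbuf + 40, 8);   /* fields are little-endian on disk */
  memcpy(&h->rnum, hbuf + 48, 8);
  memcpy(&h->fsiz, hbuf + 56, 8);
  memcpy(&h->frec, hbuf + 64, 8);
  return 0;
}
```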
2) Bucket Array
Each element of the bucket array is an integer, stored as either 32 bits or 64 bits (64-bit elements are used when the database is created with the large option listed in the header's options field). The value held in each element is the offset, within the database file, of the first record whose key hashes to that bucket index.
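As a rough illustration (not the library's actual accessor), reading the root offset for a bucket might look like the sketch below; the function name is mine, and the sketch assumes a value of 0 means the bucket is empty.

```c
#include <stdint.h>

/* Read the root offset stored in bucket `bidx`. `barray` points at the start of
   the bucket array (just past the 256-byte header); `is64` reflects whether the
   database uses 64-bit bucket elements. A value of 0 is taken to mean "empty". */
static uint64_t bucket_root_offset(const void *barray, int is64, uint64_t bidx) {
  if (is64) {
    const uint64_t *ba64 = (const uint64_t *)barray;
    return ba64[bidx];
  }
  const uint32_t *ba32 = (const uint32_t *)barray;
  return (uint64_t)ba32[bidx];
}
```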
3) Free Pool Array
Each element of the free pool array is defined by the following structure:
```c
typedef struct {                         /* type of structure for a free block */
  uint64_t off;                          /* offset of the block */
  uint32_t rsiz;                         /* size of the block */
} HDBFB;
```
Clearly there are only two members: the offset of the free block in the database file, and the size of the free block. The free pool array records deleted records so that their space can be recycled later; the operations on the free pool are analyzed in detail in a later section.
4) Record Data Area
The structure of each record is shown in the following table:
| Name | Offset | Length | Feature |
|------|--------|--------|---------|
| Magic number | 0 | 1 | Identification of record block. Always 0xc8 |
| Hash value | 1 | 1 | The hash value to decide the path of the hash chain |
| Left chain | 2 | 4 | The alignment quotient of the destination of the left chain |
| Right chain | 6 | 4 | The alignment quotient of the destination of the right chain |
| Padding size | 10 | 2 | The size of the padding |
| Key size | 12 | vary | The size of the key |
| Value size | vary | vary | The size of the value |
| Key | vary | vary | The data of the key |
| Value | vary | vary | The data of the value |
| Padding | vary | vary | Useless data |
Of course, the layout above applies to a live record; when a record is deleted, its layout becomes:
| Name | Offset | Length | Feature |
|------|--------|--------|---------|
| Magic number | 0 | 1 | Identification of record block. Always 0xb0 |
| Block size | 1 | 4 | Size of the block |
The difference is the magic number: a magic number of 0xb0 marks the record as deleted, the 4 bytes that follow store the size of the free block, and the rest of the record can be ignored.
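A small sketch of how a reader can tell the two block types apart from that first magic byte (illustrative names, not library code; little-endian host assumed as above):

```c
#include <stdint.h>
#include <string.h>

enum { MAGIC_RECORD = 0xc8, MAGIC_FREE = 0xb0 };

/* Returns 1 for a live record, 0 for a free (deleted) block whose size is stored
   in *fbsiz, and -1 for anything else. `blk` points at the start of the block. */
static int classify_block(const unsigned char *blk, uint32_t *fbsiz) {
  if (blk[0] == MAGIC_RECORD) return 1;
  if (blk[0] == MAGIC_FREE) {
    memcpy(fbsiz, blk + 1, 4);   /* the 4 bytes after the magic hold the block size */
    return 0;
  }
  return -1;
}
```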
Having gone through the parts of the hash database file, recall from the earlier sketch that the region from the file header through the bucket array is mapped into shared memory with mmap. More than this can be mapped; the point is that the file header plus the bucket array is always mapped. In other words, there is no fixed upper limit on how much of the file is mapped into shared memory, but the lower bound is the header plus the bucket array.
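A minimal sketch of that lower-bound mapping (illustrative names, error handling kept to a minimum; the per-bucket element size is passed in as 4 or 8 bytes, matching the earlier discussion):

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define HDR_SIZE 256   /* size of the database file header */

/* Map the header plus the bucket array from file descriptor `fd`. */
static void *map_header_and_buckets(int fd, uint64_t bnum, size_t belem) {
  size_t len = HDR_SIZE + (size_t)bnum * belem;   /* belem: 4 or 8 bytes per bucket */
  void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  return p == MAP_FAILED ? NULL : p;
}
```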
The free pool, for its part, is held in memory allocated on the heap with malloc and stored in the fbpool pointer of the TCHDB object.
These sections (except for the record area) are read into memory in different ways in order to speed up lookups; the details follow in later sections.
Tokyo Cabinet 1.4.19 Reading Notes (ii): Hash Database Key Lookup Process
This section focuses on how the hash database in TC locates the record for a given key. Because the subsequent delete and insert operations are built on top of lookup, lookup is described first.
From the overview in the previous section, you can see that the record structure has two members, left and right:
```c
typedef struct {                         /* type of structure for a record */
  uint64_t off;                          /* offset of the record */
  uint32_t rsiz;                         /* size of the whole record */
  uint8_t magic;                         /* magic number */
  uint8_t hash;                          /* second hash value */
  uint64_t left;                         /* offset of the left child record */
  uint64_t right;                        /* offset of the right child record */
  uint32_t ksiz;                         /* size of the key */
  uint32_t vsiz;                         /* size of the value */
  uint16_t psiz;                         /* size of the padding */
  const char *kbuf;                      /* pointer to the key */
  const char *vbuf;                      /* pointer to the value */
  uint64_t boff;                         /* offset of the body */
  char *bbuf;                            /* buffer of the body */
} TCHREC;
```

Note that, through the left and right members, the records are organized into a kind of binary tree.
In fact, TC first computes the bucket index and a second hash value from a record's key; the code is as follows:
```c
/* Get the bucket index of a record.
   `hdb' specifies the hash database object.
   `kbuf' specifies the pointer to the region of the key.
   `ksiz' specifies the size of the region of the key.
   `hp' specifies the pointer to the variable into which the second hash value is assigned.
   The return value is the bucket index. */
static uint64_t tchdbbidx(TCHDB *hdb, const char *kbuf, int ksiz, uint8_t *hp){
  assert(hdb && kbuf && ksiz >= 0 && hp);
  uint64_t idx = 19780211;
  uint32_t hash = 751;
  const char *rp = kbuf + ksiz;
  while(ksiz--){
    idx = idx * 37 + *(uint8_t *)kbuf++;
    hash = (hash * 31) ^ *(uint8_t *)--rp;
  }
  *hp = hash;
  return idx % hdb->bnum;
}
```
A special reminder: in the algorithm above, the bucket index computed from the key is taken modulo hdb->bnum, so its value is bounded and can never exceed the bucket count fixed when the TCHDB was initialized. The second hash value, on the other hand, does not appear to be explicitly bounded here. Why? The later content makes this clear.
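For a feel of the two values, here is a small standalone program that mirrors the arithmetic of tchdbbidx as reproduced above for one key. It does not call the (static) library function, and bnum is an arbitrary demo value. Note that when the second hash is stored through the uint8_t pointer it is truncated to its low 8 bits:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
  const char *key = "apple";
  int ksiz = (int)strlen(key);
  uint64_t bnum = 131071;        /* arbitrary bucket count, just for the demo */
  uint64_t idx = 19780211;
  uint32_t hash = 751;
  const char *kp = key;
  const char *rp = key + ksiz;
  while (ksiz--) {
    idx = idx * 37 + *(const uint8_t *)kp++;     /* same recurrence as above */
    hash = (hash * 31) ^ *(const uint8_t *)--rp;
  }
  printf("bucket index = %llu, second hash (low 8 bits) = %u\n",
         (unsigned long long)(idx % bnum), (unsigned)(uint8_t)hash);
  return 0;
}
```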
So all records whose keys compute to the same bucket index are organized as a binary tree, and each bucket array element holds, as an integer, the offset of that tree's root record.
With the related structures now clear, the following flowchart shows the process of locating the record for a key:
Briefly: a lookup first computes the bucket index from the key, then traverses that bucket's binary tree according to the comparison rules.
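The shape of that walk, as I read it, is roughly the sketch below. The comparison order (second hash first, then the key) follows the flowchart; which child is taken on a greater or lesser result follows the library's own comparator, so treat the branch direction here as illustrative. read_record and the struct are stand-ins, not library names.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
  uint8_t  hash;          /* second hash stored in the record header */
  uint64_t left, right;   /* offsets of the child records (0 = none) */
  const char *kbuf;       /* key bytes of the record */
  uint32_t ksiz;
} RecSketch;

/* Hypothetical stand-in for reading a record header + key at `off` from disk. */
static int read_record(uint64_t off, RecSketch *rec) { (void)off; (void)rec; return -1; }

static uint64_t lookup_sketch(uint64_t root_off, const char *kbuf, uint32_t ksiz,
                              uint8_t khash) {
  uint64_t off = root_off;            /* root offset taken from the bucket array */
  while (off != 0) {
    RecSketch rec;
    if (read_record(off, &rec) != 0) return 0;
    int cmp = (khash > rec.hash) - (khash < rec.hash);   /* compare second hash first */
    if (cmp == 0) {                                      /* hashes tie: compare the keys */
      size_t n = ksiz < rec.ksiz ? ksiz : rec.ksiz;
      cmp = memcmp(kbuf, rec.kbuf, n);
      if (cmp == 0) cmp = (int)ksiz - (int)rec.ksiz;
    }
    if (cmp == 0) return off;                  /* found: return this record's offset */
    off = (cmp < 0) ? rec.left : rec.right;    /* otherwise descend one level */
  }
  return 0;                                    /* empty bucket or not found */
}
```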
As mentioned earlier, the bucket array is mapped into shared memory in its entirety with mmap. Let's make a rough estimate. Suppose the memory used by the bucket array is 1 GB and the file that actually holds the records is 16 GB, i.e. bucket elements and records stand in roughly a 1:16 relationship. Assuming the chosen hash algorithm is good enough that keys are spread evenly over the bucket indexes, each bucket's binary tree holds about 16 records on average. A lookup then costs at most about O(4) file reads (each record read is a disk read) plus O(1) memory reads (to fetch the root offset from the bucket array).
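Making the arithmetic explicit, under the assumption above of about 16 records per bucket tree:

$$\log_2 16 = 4 \;\Rightarrow\; \text{cost} \approx 1 \text{ in-memory bucket read} + \text{at most} \approx 4 \text{ record reads from disk (balanced case)}.$$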
But wait, there are some details that are not clear.
First, the binary tree here is not a balanced binary search tree like an AVL tree or a red-black tree. That is, in an extreme case it can degenerate into a linked list, with no elements on one side of the tree and all of them on the other.
Second, in the flowchart above, the first thing compared at each node is the hash value, and the purpose of this value is precisely to address the problem just mentioned. Since this is an ordinary binary tree with no balancing guarantee, the second-level hash is computed to keep it balanced, still under the premise that the chosen hash algorithm is good enough to distribute keys evenly.
As mentioned above, an unbalanced binary tree only degenerates into a list-like, extremely unbalanced tree in extreme situations. Balanced binary trees such as AVL and red-black trees have comparatively complex algorithms; they are harder to code, harder to debug, and harder to trace when something goes wrong. Also remember that the rebalancing these trees perform on insert and delete, such as rotations, involves reading and writing tree nodes, and those nodes live on disk here, so rebalancing would be relatively expensive. The question, then, is whether it is worth optimizing for that extreme case at all. The second-level hash is introduced to partially solve the extreme-imbalance problem; the idea is easy to implement, but it brings another cost: every lookup must compute not only the bucket index from the key but also this extra hash value, which takes time.
Balance, always balance. Time or space: that is the question.
So, after analyzing the key lookup flow of TC's hash database, my biggest impression is this: it does not use complex algorithms or data structures, but through a few clever optimizations, such as the introduction of the second-level hash, it strikes a good balance between runtime efficiency and coding/debugging complexity. Learning to balance these factors is a skill you must master to build real systems, and it only accumulates slowly through experience and reflection.
All right, a quick review of the key points of the whole lookup:
1) All records with the same bucket index are organized as a binary tree.
2) This binary tree is not a balanced binary tree.
3) To mitigate the extreme imbalance that 2) can cause, TC introduces a second-level hash to keep the tree as balanced as possible.
That is how TC organizes records and buckets and how the whole lookup algorithm works. As you can see, neither the algorithms nor the structure definitions are complicated, but thanks to some clever ideas, TC uses the simplest possible algorithms and data structures, avoids some potential pitfalls, and still keeps lookups efficient.
Lookup is the core operation of key-value storage; optimizing this path has a large impact on the performance of the whole system.
Tokyo Cabinet 1.4.19 Reading Notes (iii): Hash Database Delete Flow
This section focuses on the entire process of locating a record by its key and deleting it.

First, look at the flowchart of the process. It is simple and consists of the following steps:
a) First, find the corresponding record based on the key. This was described in the previous section, where it was also noted that the lookup operation is the basis for the subsequent delete and insert operations.
If the record is not found, it was never there, and there is nothing more to do.
Assuming the record to be deleted has been found, the following steps are performed:
b) Set the record's magic number to 0xb0. As mentioned in the overview in the first section, the header of each record uses one of two different magic numbers, and this is how TC determines whether a record has been deleted; setting the magic number to 0xb0 marks the record as deleted.
c) Insert the deleted record into the appropriate position in the free pool array, so that its space can be recycled. This is the focus of the next section, so it is only mentioned here.
d) As mentioned in the previous section, records with the same bucket index are organized as a binary tree. Although it is not a balanced binary tree, deleting a node still breaks the tree's structure, so a suitable record must be found in the tree to take the deleted record's place (a sketch of the whole flow follows this list).
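Here is a pseudocode-level sketch of the delete flow a) through d). Every helper below is a hypothetical stand-in for the corresponding step in the library, stubbed out so the sketch compiles; none of these names come from TC itself.

```c
#include <stdint.h>

/* a) lookup: find the record for the key, returning its offset and size. */
static int  find_record(const char *kbuf, int ksiz, uint64_t *off, uint32_t *rsiz)
            { (void)kbuf; (void)ksiz; (void)off; (void)rsiz; return -1; }
/* b) write the 0xb0 magic and the block size at the record's offset. */
static void write_free_magic(uint64_t off, uint32_t rsiz) { (void)off; (void)rsiz; }
/* c) remember the freed region in the free pool array. */
static void fbpool_insert(uint64_t off, uint32_t rsiz)    { (void)off; (void)rsiz; }
/* d) re-link the bucket's binary tree around the removed node. */
static void tree_unlink(uint64_t off)                     { (void)off; }

static int delete_sketch(const char *kbuf, int ksiz) {
  uint64_t off;
  uint32_t rsiz;
  if (find_record(kbuf, ksiz, &off, &rsiz) != 0) return -1; /* a) not found: done */
  write_free_magic(off, rsiz);  /* b) mark the block on disk as deleted */
  fbpool_insert(off, rsiz);     /* c) remember the hole for later reuse */
  tree_unlink(off);             /* d) repair the bucket's binary tree */
  return 0;
}
```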
Anyone familiar with data structures and algorithms knows that an in-order traversal of a binary search tree visits the keys in sorted order. Preserving that order after deleting a record is therefore the focus of the delete operation, as described in what follows.