Preface
At work we ran into several needs for key-value (KV) storage. For example, a recommendation system needs to store, for each product ID, the list of similar product IDs, or the list of product IDs a user has browsed. Both call for a key-value store. This article describes a simple version of a store built for those requirements. The system used in production is considerably more complex; to keep things easy to follow, the analysis here covers an implementation in which the key is a String and the value is a list of IDs. The store is implemented in Java on top of mmap, which lets it persist data to disk while keeping excellent performance.
Storage-System Architecture
Index
The index of the data. For each key, the index records where the corresponding value is stored in the block.
Block
The location where values are actually stored.
Table
A complete data table, containing the block, the index, and the config.
Config
Configuration information, such as which index is currently in use.
Log
The log entry recorded for each piece of data.
The whole system consists of the main modules above. The following figure shows how each piece of data is written.
As the figure above shows, when a request to write a (key, value) pair arrives, the record is first appended to the log file. The system then selects an index file according to the configuration information and looks up the key in that index file. If there is no record for the key, a new one is created; the record holds the location of the key's value in the block. If the index file already has a record, the location is read from it. Finally, the data is written at that location in the block, and the write request is complete.
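The write path described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the actual implementation: the class name WritePathSketch, the fixed SLOT_SIZE, and the use of a List, a HashMap, and a heap ByteBuffer as stand-ins for the log file, index file, and block are all assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class WritePathSketch {
    // Hypothetical fixed number of bytes reserved for each value in the block.
    static final int SLOT_SIZE = 64;

    private final List<String> log = new ArrayList<>();          // stand-in for the log file
    private final Map<String, Integer> index = new HashMap<>();  // key -> offset in the block
    private final ByteBuffer block = ByteBuffer.allocate(SLOT_SIZE * 1024); // stand-in for the block
    private int nextFree = 0;                                    // next unused offset in the block

    void put(String key, long[] ids) {
        if (Integer.BYTES + ids.length * Long.BYTES > SLOT_SIZE) {
            throw new IllegalArgumentException("value does not fit in one slot");
        }
        // Step 1: append the record to the log.
        log.add(key + "=" + Arrays.toString(ids));

        // Step 2: look the key up in the index; create a record if it is missing.
        Integer offset = index.get(key);
        if (offset == null) {
            if (nextFree + SLOT_SIZE > block.capacity()) {
                throw new IllegalStateException("block is full");
            }
            offset = nextFree;
            nextFree += SLOT_SIZE;
            index.put(key, offset);
        }

        // Step 3: write the value at the recorded location (length, then the IDs).
        block.putInt(offset, ids.length);
        for (int i = 0; i < ids.length; i++) {
            block.putLong(offset + Integer.BYTES + i * Long.BYTES, ids[i]);
        }
    }

    long[] get(String key) {
        Integer offset = index.get(key);
        if (offset == null) {
            return null; // no index record for this key
        }
        int n = block.getInt(offset);
        long[] ids = new long[n];
        for (int i = 0; i < n; i++) {
            ids[i] = block.getLong(offset + Integer.BYTES + i * Long.BYTES);
        }
        return ids;
    }

    int logSize() {
        return log.size();
    }
}
```

Writing the same key twice reuses the existing index record and overwrites the value in place, while the log still receives one entry per write.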
In the diagram the index module is split into index0 and index1. There are two because the index has to be rebuilt when the system expands, and the system switches between the two indexes to do so. There are in fact many blocks; these small blocks together can be understood as making up one large block. Each block has a fixed size, and when space runs out another block is added; in the figure, the next one added would be block5. Of course, index0 or index1 itself also consists of many chunks of data, but those chunks all belong to a single index. In other words, index0 or index1 on its own contains all the index information, and there are two only to support expansion and rebuilding. The reader can treat the block part as a single large block. Index0 corresponds to the entire block; when the index is rebuilt, the index information is switched to index1, which then corresponds to the same block. The advantage is that only one copy of the data needs to be kept, and index expansion is separated from the data.
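One way to picture the switch between index0 and index1 is the sketch below. It assumes a small fixed-capacity open-addressing table as a stand-in for an index file and a single field, active, as a stand-in for the config entry that says which index is in use; the class and method names are hypothetical, and the sketch is single-threaded. Note that only block offsets are copied during the rebuild, so the block data itself is never duplicated.

```java
final class IndexSwitchSketch {
    /** Fixed-capacity open-addressing table: a stand-in for one index file. */
    static final class Index {
        final String[] keys;
        final int[] offsets;

        Index(int capacity) {
            keys = new String[capacity];
            offsets = new int[capacity];
        }

        /** Returns false when the table is full and the key is not present. */
        boolean put(String key, int offset) {
            int cap = keys.length;
            int start = Math.floorMod(key.hashCode(), cap);
            for (int i = 0; i < cap; i++) {
                int s = (start + i) % cap;
                if (keys[s] == null || keys[s].equals(key)) {
                    keys[s] = key;
                    offsets[s] = offset;
                    return true;
                }
            }
            return false;
        }

        /** Returns the block offset for the key, or -1 if the key is absent. */
        int get(String key) {
            int cap = keys.length;
            int start = Math.floorMod(key.hashCode(), cap);
            for (int i = 0; i < cap; i++) {
                int s = (start + i) % cap;
                if (keys[s] == null) return -1;
                if (keys[s].equals(key)) return offsets[s];
            }
            return -1;
        }
    }

    private final Index[] indexes = new Index[2]; // index0 and index1
    private int active = 0;                       // stand-in for the config entry

    IndexSwitchSketch(int initialCapacity) {
        indexes[0] = new Index(initialCapacity);  // capacity must be positive
    }

    void put(String key, int blockOffset) {
        if (!indexes[active].put(key, blockOffset)) {
            rebuild();                            // index is full: expand and switch
            indexes[active].put(key, blockOffset);
        }
    }

    int get(String key) {
        return indexes[active].get(key);
    }

    int activeIndex() {
        return active;
    }

    /** Rebuilds into the standby index with double the capacity, then switches. */
    private void rebuild() {
        Index old = indexes[active];
        Index bigger = new Index(old.keys.length * 2);
        for (int i = 0; i < old.keys.length; i++) {
            if (old.keys[i] != null) {
                bigger.put(old.keys[i], old.offsets[i]); // copy offsets only
            }
        }
        int standby = 1 - active;
        indexes[standby] = bigger;
        active = standby;             // the "config" now points at the new index
        indexes[1 - standby] = null;  // the old index can be discarded
    }
}
```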
Later articles analyze the implementation of each module in detail.
Design Points
Underlying file reads and writes
There are many ways to persist data to disk; two are introduced briefly here. The first accumulates data in memory and, once it reaches a certain amount, has a separate background thread save it to a file. This can be thought of as flushing data to disk periodically. The second writes a log file: when the system crashes or has to restart, the operation log is traversed and the operations are replayed to restore the original data.
This storage system uses the operating system's mmap facility and leaves the work of flushing to disk to the operating system. mmap maps a file into memory so that the process can read and write that memory directly, which is typically much faster than ordinary file writes (readers unfamiliar with mmap are encouraged to read up on it). With this tool, flushing to disk becomes very easy and performance is assured. The only drawback is that the operating system controls when mapped data is flushed to disk, so the flush is not real-time, and data can be lost if the system crashes. Fortunately, this system also keeps log information, which makes it possible to recover the data.
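In Java, mmap is exposed through FileChannel.map, which returns a MappedByteBuffer. The following is a minimal sketch of mapping a file and reading and writing it at arbitrary offsets; the class name MmapSketch and the helper mapFile are assumptions made for the example, not part of the original system.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MmapSketch {
    /**
     * Maps the first `size` bytes of the file for reading and writing.
     * The file is created if needed and grown to `size` by the mapping.
     * The mapping stays valid after the channel is closed.
     */
    static MappedByteBuffer mapFile(Path file, int size) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            return channel.map(FileChannel.MapMode.READ_WRITE, 0, size);
        }
    }

    /** Writes a long at an arbitrary offset; this is a plain memory write. */
    static void writeLong(MappedByteBuffer buffer, int offset, long value) {
        buffer.putLong(offset, value);
    }

    /** Reads a long back from an arbitrary offset. */
    static long readLong(MappedByteBuffer buffer, int offset) {
        return buffer.getLong(offset);
    }
}
```

By default the operating system decides when dirty pages reach the disk. Calling force() on the MappedByteBuffer requests an explicit flush, at the cost of some of the performance benefit.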
Some background: memory reads and writes are the fastest, and sequential disk reads and writes are faster than random ones. The index and block parts are read and written randomly, so they use mmap. The log does not need random access, so it can be written sequentially by appending directly to a file.
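A sequential, append-only log and its replay could look roughly like the sketch below. The tab-separated text format, the class name AppendLogSketch, and the assumption that keys and values contain no tab or newline characters are all choices made for this illustration; the original log format is not described.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

final class AppendLogSketch {
    /** Appends one "key<TAB>value" record to the end of the log file. */
    static void append(Path log, String key, String value) throws IOException {
        String line = key + "\t" + value + "\n";
        Files.write(log, line.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** Replays the log from the start; later records overwrite earlier ones. */
    static Map<String, String> replay(Path log) throws IOException {
        Map<String, String> state = new HashMap<>();
        for (String line : Files.readAllLines(log, StandardCharsets.UTF_8)) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue; // skip malformed or partially written records
            }
            state.put(line.substring(0, tab), line.substring(tab + 1));
        }
        return state;
    }
}
```

A real log would normally keep the file open rather than reopening it for every record; the one-call form here is only to keep the sketch short.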
Even with mmap managing the files, some disk housekeeping problems remain, such as index expansion, block expansion, and the fragmentation left behind after data is deleted. These are covered in detail in later articles.
Concurrency Issues
A high-performance storage system inevitably faces multithreading problems. Many systems use read-write locks, which can become a significant performance bottleneck. Because reads are far more frequent than writes in our workload, the system initially used a read-write lock: only one thread can write at a time, while multiple threads can read concurrently. Performance turned out to be poor, because writes are not actually that infrequent. The final design allows reads to proceed while a write is in progress, but still allows only one thread to write at a time. This introduces a concurrency problem: if one thread is writing a key while another thread happens to be reading that key's value, the reader can see corrupted data. The solutions are a version-number check or double-buffer switching, which later articles analyze in detail.
Data Structures and Algorithms
This storage system uses a number of data structures and algorithms. The index part uses hashing, including some number-theoretic analysis and string hashing. The block part uses algorithms similar to memory allocation, as well as a treap, among others. Later articles analyze these in detail.
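As a small taste of the string hashing used by the index, the sketch below maps a String key to a slot in a table. The polynomial hash with base 31 and the choice of a prime table size are assumptions made for illustration; the hash function actually used by the system is not given here.

```java
final class StringHashSketch {
    /**
     * Polynomial string hash reduced modulo the table size at every step,
     * so the intermediate value never overflows a long.
     * tableSize must be positive; a prime tends to spread keys more evenly.
     */
    static int slotFor(String key, int tableSize) {
        long h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = (h * 31 + key.charAt(i)) % tableSize;
        }
        return (int) h;
    }
}
```

For example, with a table size of 101 the key "ab" hashes as (97 * 31 + 98) mod 101 = 3105 mod 101 = 75, so slotFor("ab", 101) returns 75.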