1) timestamp
Each piece of data must specify a timestamp, which is treated specially in the TSM storage engine in order to optimize subsequent query operations.
2) Point
A point is the data structure corresponding to a single insert statement in InfluxDB. A point is uniquely identified by its series plus its timestamp, which means a single point can carry multiple field names and field values.
3) Series
A series is a collection of data in InfluxDB: within the same database, data with identical retention policy, measurement, and tag set belongs to the same series, and data in the same series is physically stored together in chronological order.
The series key is the measurement name plus the serialized string of all tags; this key is used frequently in what follows.
The structure in the code is as follows:
type Series struct {
	mu          sync.RWMutex
	Key         string            // series key
	tags        map[string]string // tags
	id          uint64            // id
	measurement *Measurement      // measurement this series belongs to
}
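As a concrete illustration of how such a key can be serialized (a minimal sketch: the real implementation also escapes special characters in measurement names, tag keys, and tag values):

package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey builds a simplified series key: the measurement name followed
// by all tag pairs serialized in sorted key order.
func seriesKey(measurement string, tags map[string]string) string {
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var sb strings.Builder
	sb.WriteString(measurement)
	for _, k := range keys {
		sb.WriteString("," + k + "=" + tags[k])
	}
	return sb.String()
}

func main() {
	fmt.Println(seriesKey("cpu_usage", map[string]string{"region": "us-west", "host": "server01"}))
	// Output: cpu_usage,host=server01,region=us-west
}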
4) Shard
Shard is a relatively important concept in InfluxDB, and it is tied to the retention policy. Each retention policy has many shards under it, and each shard stores data for a specified, non-overlapping period of time: for example, data from 7:00 to 8:00 falls into shard 0, and data from 8:00 to 9:00 falls into shard 1. Each shard corresponds to an underlying TSM storage engine, with its own cache, WAL, and TSM files.
When you create a database, a default retention policy is created automatically; under it, each shard holds 7 days of data. The shard duration is computed from the retention duration as follows:
func shardGroupDuration(d time.Duration) time.Duration {
	if d >= 180*24*time.Hour || d == 0 { // 6 months or 0 (infinite retention)
		return 7 * 24 * time.Hour
	} else if d >= 2*24*time.Hour { // 2 days or more
		return 24 * time.Hour
	}
	return time.Hour
}
If you create a new retention policy that keeps data for 1 day, each shard stores 1 hour of data, and data beyond that hour goes into the next shard.
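Plugging a few retention durations into the function above confirms these spans (a quick check; the results in the comments follow directly from the branches of the function):

package main

import (
	"fmt"
	"time"
)

func shardGroupDuration(d time.Duration) time.Duration {
	if d >= 180*24*time.Hour || d == 0 { // 6 months or infinite retention
		return 7 * 24 * time.Hour
	} else if d >= 2*24*time.Hour { // at least 2 days
		return 24 * time.Hour
	}
	return time.Hour
}

func main() {
	fmt.Println(shardGroupDuration(0))                   // 168h0m0s: 7-day shards for the default policy
	fmt.Println(shardGroupDuration(24 * time.Hour))      // 1h0m0s: 1-hour shards for a 1-day policy
	fmt.Println(shardGroupDuration(30 * 24 * time.Hour)) // 24h0m0s: 1-day shards for a 30-day policy
}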
III. Storage engine: TSM Tree
From LevelDB (LSM tree) to BoltDB (mmap B+ tree), and now to its own implementation, InfluxDB today uses the TSM tree algorithm, which is similar to the LSM tree but specifically optimized for InfluxDB's usage patterns.
The TSM tree is a slight modification of the LSM tree that InfluxDB made based on its actual needs.
The TSM storage engine consists mainly of several parts: cache, WAL, TSM files, and compactor.
1) Shard
Shard is not considered one of these components, since it is a concept layered on top of the TSM storage engine. In InfluxDB, a different shard is created depending on the range of data timestamps, and each shard has its own cache, WAL, TSM files, and compactor. This makes it fast to locate the relevant resources for a query by time, speeding up the query process, and it also makes subsequent bulk data deletion very simple and efficient.
In an LSM tree, data is deleted by inserting a delete marker for the specified key; the data is not removed immediately, and only later, when files are compacted and merged, is it actually deleted. Deleting large amounts of data is therefore a very inefficient operation in an LSM tree.
In InfluxDB, the retention time of data is set by the retention policy. When the data in a shard is detected to have expired, the shard's resources are simply released and the related files deleted, which makes removing outdated data very efficient.
2) Cache
The cache is equivalent to the memtable in an LSM tree. It is a simple in-memory map whose key is seriesKey + delimiter + fieldName; in the current code the delimiter is #!~#. Each entry is essentially an array of actual values in chronological order, with the following structure:
type Cache struct {
	commit  sync.Mutex
	mu      sync.RWMutex
	store   map[string]*entry
	size    uint64 // current memory usage
	maxSize uint64 // cache max size

	// snapshot is the cache object currently being written to TSM files.
	// It is kept in memory while flushing so it can be queried along with
	// the cache; it is read-only and should never be modified.
	snapshot     *Cache
	snapshotSize uint64
	snapshotting bool

	// snapshotAttempts is the number of pending or failed WriteSnapshot
	// attempts since the last successful one.
	snapshotAttempts int

	stats        *CacheStatistics
	lastSnapshot time.Time
}
When inserting data, you are actually writing to both the cache and the WAL; you can think of the cache as an in-memory view of the data in the WAL files. When InfluxDB starts, it replays all WAL files and rebuilds the cache, so that data is not lost even after a system crash.
The data in the cache does not grow without bound: a maxSize parameter controls how much memory the cached data may occupy before it is written out to a TSM file. If not configured, the default limit is 25MB. Each time the cache reaches this threshold, the current cache is snapshotted and then cleared, and a new WAL file is created for writing; the remaining WAL files are eventually deleted, and the data in the snapshot is written out to a new TSM file.
The current cache design has a problem: while a snapshot is being written to a new TSM file, the cache may again reach its threshold under a heavy write load before the previous snapshot has been fully flushed to disk. In this situation InfluxDB fails the subsequent write operations; the caller must handle the error and wait for the flush to finish before continuing to write data.
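A minimal sketch of this threshold-and-handoff behavior, using a simplified cache keyed by string and holding plain float slices (the real engine tracks sizes per entry and rotates WAL segments at the same time):

package tsmsketch

import (
	"errors"
	"sync"
)

// SimpleCache is an illustrative stand-in for the TSM cache: values
// accumulate in memory and are handed off as a snapshot once maxSize is
// reached. If the previous snapshot has not finished flushing, the write
// is rejected, mirroring the behavior described above.
type SimpleCache struct {
	mu       sync.Mutex
	store    map[string][]float64
	size     uint64
	maxSize  uint64
	snapshot map[string][]float64 // read-only copy being flushed to a TSM file
}

var ErrCacheFull = errors.New("cache maximum memory size exceeded")

func (c *SimpleCache) Write(key string, values []float64) error {
	c.mu.Lock()
	defer c.mu.Unlock()

	if c.size >= c.maxSize {
		if c.snapshot != nil {
			// The previous snapshot is still flushing: fail the write and
			// let the caller retry, as the article describes.
			return ErrCacheFull
		}
		c.snapshot = c.store // hand off to the compactor
		c.store = make(map[string][]float64)
		c.size = 0
	}
	c.store[key] = append(c.store[key], values...)
	c.size += uint64(len(values)) * 8 // rough size accounting
	return nil
}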
3) WAL
The content of the WAL files is the same as the cache in memory. Its purpose is to persist the data: after a system crash, data that has not yet been written to a TSM file can be recovered from the WAL files.
Because data is appended sequentially to the WAL file, write efficiency is very high. However, if the incoming data is not in chronological order but arrives haphazardly, the points are routed to different shards according to their timestamps, and since each shard has its own WAL file, the writes are no longer purely sequential, which has some impact on performance. According to the official community, a follow-up optimization is planned: using a single WAL file instead of creating one per shard.
Once a single WAL file reaches a certain size, it is rolled over: a new WAL segment file is created for subsequent writes.
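A sketch of that rollover, assuming a fixed segment size limit (the engine's default is on the order of 10MB) and the numbered _00001.wal naming shown in the directory listing later in this article:

package tsmsketch

import (
	"fmt"
	"os"
)

// maxSegmentSize is an assumed limit for this sketch.
const maxSegmentSize = 10 * 1024 * 1024

// WAL appends entries to the current segment file and rolls over to a new
// numbered segment once the size limit would be exceeded.
type WAL struct {
	dir         string
	segmentID   int
	currentSize int64
	file        *os.File
}

func (w *WAL) Write(entry []byte) error {
	if w.file == nil || w.currentSize+int64(len(entry)) > maxSegmentSize {
		if w.file != nil {
			w.file.Close()
		}
		w.segmentID++
		f, err := os.Create(fmt.Sprintf("%s/_%05d.wal", w.dir, w.segmentID))
		if err != nil {
			return err
		}
		w.file, w.currentSize = f, 0
	}
	n, err := w.file.Write(entry)
	w.currentSize += int64(n)
	return err
}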
4) TSM File
A single TSM file has a maximum size of 2GB and is used to store the data.
TSM files use a purpose-built format to optimize query performance and compression; a later section describes the file structure in detail.
5) Compactor
The compactor component runs continuously in the background, checking every second whether data needs to be compacted and merged.
It performs two main operations. One is taking a snapshot once the data in the cache reaches its size threshold, then dumping it into a new TSM file.
The other is merging the current TSM files: multiple small TSM files are combined into one, so that each file approaches the maximum size of a single file. This reduces the number of files, and some data deletion is also carried out at this point.
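Conceptually, the background loop looks like the sketch below (hypothetical callback names; the real compactor uses several planners and runs compactions concurrently):

package tsmsketch

import "time"

// Compactor sketches the background loop described above: once per second
// it checks whether a cache snapshot should be flushed to a new TSM file
// and whether small TSM files should be merged.
type Compactor struct {
	NeedSnapshot  func() bool
	FlushSnapshot func()
	PlanMerges    func() [][]string // groups of TSM file paths to merge
	Merge         func(files []string)
}

func (c *Compactor) Run(stop <-chan struct{}) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			if c.NeedSnapshot() {
				c.FlushSnapshot() // dump the cache snapshot to a new TSM file
			}
			for _, group := range c.PlanMerges() {
				c.Merge(group) // merge small TSM files into larger ones
			}
		}
	}
}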
IV. Directory and file structure
InfluxDB's data storage has three main directories.
By default, these are the meta, wal, and data directories.
meta stores some metadata for the database; there is a meta.db file under the meta directory.
The wal directory holds the write-ahead log files, which end in .wal. The data directory holds the actual storage files, which end in .tsm. The structure under these two directories is similar; the basic layout is as follows:
# WAL directory structure
-- wal
   -- mydb
      -- autogen
         -- 1
            -- _00001.wal
         -- 2
            -- _00035.wal
      -- 2hours
         -- 1
            -- _00001.wal

# Data directory structure
-- data
   -- mydb
      -- autogen
         -- 1
            -- 000000001-000000003.tsm
         -- 2
            -- 000000001-000000001.tsm
      -- 2hours
         -- 1
            -- 000000002-000000002.tsm
Here mydb is the database name, autogen and 2hours are retention policy names, and the next level of directories is named by shard ID: for example, the autogen retention policy has two shards with IDs 1 and 2, each storing data within a certain time range. The level below that contains the actual files, ending in .wal and .tsm respectively.
1) WAL file
An entry in a WAL file corresponds to all the value data under one key (measurement + tags + fieldName), sorted by time.
2) TSM file
A TSM file consists of a header, blocks, an index, and a footer. The header holds a magic number identifying the file type, followed by a version field.
Version (1 byte): in the current TSM1 engine this value is fixed to 1.
Blocks
Inside Blocks are a number of contiguous blocks. A block is the smallest read unit in InfluxDB; each read operation reads one block. Every block is divided into two parts, a CRC32 checksum and the data, where the CRC32 value is used to verify that the data content is intact. The length of the data is recorded in the Index section that follows.
The content of the data part differs depending on the data type. Float values use Gorilla float compression; timestamps form an ascending sequence, so only the time offsets (deltas) actually need to be recorded during compression. String values are compressed with the snappy algorithm.
When decompressed, the data has the format of an 8-byte timestamp followed by the value; the value occupies a different amount of space depending on its type. Strings are variable-length, so the data is stored with the length at the beginning, which is the same format as in the WAL file.
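The first step of reading any block can be sketched as follows, assuming the 4-byte CRC32 is stored big-endian ahead of the compressed payload as described above:

package tsmsketch

import (
	"encoding/binary"
	"errors"
	"hash/crc32"
)

// blockData checks the CRC32 that precedes every block's payload and
// returns the still-compressed data for the type-specific decoder.
func blockData(raw []byte) ([]byte, error) {
	if len(raw) < 4 {
		return nil, errors.New("block too short")
	}
	want := binary.BigEndian.Uint32(raw[:4])
	payload := raw[4:]
	if crc32.ChecksumIEEE(payload) != want {
		return nil, errors.New("block CRC32 mismatch")
	}
	return payload, nil
}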
Index
The Index stores index information about the preceding Blocks. Index entries are sorted first by key in lexicographic order and then by time. During a query, InfluxDB can use the Index information to quickly locate the position in the TSM file of the block to be read.
The diagram shows only some of the fields; in code they are represented by structs similar to the following:
type BlockIndex struct {
	minTime int64
	maxTime int64
	offset  int64
	size    uint32
}

type KeyIndex struct {
	keyLen uint16
	key    string
	typ    byte // type of the data in the block
	count  uint32
	blocks []*BlockIndex
}

type Index []*KeyIndex
Key Len (2 bytes): the length of the key in the next field.
Key (N bytes): here the key is seriesKey + delimiter + fieldName.
Type (1 byte): the type of the field value that fieldName corresponds to, i.e. the type of the data within the block.
Count (2 bytes): the number of block indexes that follow immediately after.
The next four fields form the index information for one block; they repeat count times, and each block index is a fixed 28 bytes, sorted by time.
Min Time (8 bytes): the minimum timestamp of the values in the block.
Max Time (8 bytes): the maximum timestamp of the values in the block.
Offset (8 bytes): the offset of the block within the entire TSM file.
Size (4 bytes): the size of the block. Using the Offset and Size fields, the contents of a block can be read quickly.
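Because each block index entry is a fixed 28 bytes and the entries are sorted by time, the entry covering a given timestamp can be found by binary search. A sketch over the raw entry bytes, assuming the big-endian field layout listed above:

package tsmsketch

import (
	"encoding/binary"
	"sort"
)

const blockIndexSize = 28 // minTime(8) + maxTime(8) + offset(8) + size(4)

// searchBlock returns the first block index entry whose maxTime is >= ts.
// entries is the raw, time-sorted block index region for one key.
func searchBlock(entries []byte, ts int64) (offset int64, size uint32, ok bool) {
	n := len(entries) / blockIndexSize
	i := sort.Search(n, func(i int) bool {
		maxTime := int64(binary.BigEndian.Uint64(entries[i*blockIndexSize+8:]))
		return maxTime >= ts
	})
	if i == n {
		return 0, 0, false
	}
	e := entries[i*blockIndexSize:]
	offset = int64(binary.BigEndian.Uint64(e[16:24]))
	size = binary.BigEndian.Uint32(e[24:28])
	return offset, size, true
}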
Indirect index
The indirect index exists only in memory. It is created to quickly locate a key within the detailed index information, and it supports binary search for fast retrieval.
offsets is an array whose values are the positions of each key in the index data. Because each key is prefixed by a fixed 2-byte length field, the key content at any such position can be read directly.
When querying a specified key, you can binary-search the offsets to locate the key's position in the index data, then narrow down by the time range of the data being queried. Because the BlockIndex entries within a KeyIndex are fixed-length, they too can be binary-searched to locate the BlockIndex containing the data to be queried; the block contents are then read quickly from the TSM file using the offset and block size.
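The first of those two binary searches can be sketched like this, assuming b holds the raw index bytes and each offset points at a key's 2-byte length prefix:

package tsmsketch

import (
	"bytes"
	"encoding/binary"
	"sort"
)

// keyAt reads the length-prefixed key that starts at position off in the
// raw index bytes b.
func keyAt(b []byte, off int32) []byte {
	n := binary.BigEndian.Uint16(b[off : off+2])
	return b[off+2 : off+2+int32(n)]
}

// findKey binary-searches the offsets array for key and returns the
// position of its index entry in b, or -1 if the key is absent.
func findKey(b []byte, offsets []int32, key []byte) int32 {
	i := sort.Search(len(offsets), func(i int) bool {
		return bytes.Compare(keyAt(b, offsets[i]), key) >= 0
	})
	if i < len(offsets) && bytes.Equal(keyAt(b, offsets[i]), key) {
		return offsets[i]
	}
	return -1
}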
Footer
The last 8 bytes of the TSM file hold the offset of the start of the Index section within the file, which makes it easy to load the index information into memory.
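Loading the index is then straightforward; a sketch, assuming the whole TSM file is available as a byte slice:

package tsmsketch

import "encoding/binary"

// indexBytes slices the Index section out of a complete TSM file image:
// the footer (the last 8 bytes) stores the index's starting offset.
func indexBytes(tsm []byte) []byte {
	indexOfs := binary.BigEndian.Uint64(tsm[len(tsm)-8:])
	return tsm[int(indexOfs) : len(tsm)-8]
}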
V. Data query and index structure
Because an LSM tree works by converting a large number of random writes into sequential writes, it greatly improves write performance while sacrificing some read performance. The TSM storage engine was developed based on the LSM tree, so the situation is similar. Databases built on LSM-style structures typically use index files (such as the MANIFEST file in LevelDB) or Bloom filters to optimize read operations.
InfluxDB has two main kinds of indexes, through which queries are optimized.
1) Metadata index
A database's metadata index is stored in the DatabaseIndex struct. It is initialized at database startup by loading index data from the TSM files under all shards, gathering information about all the measurements and series, and caching it in memory.
type DatabaseIndex struct {
	measurements map[string]*Measurement // all measurement objects under the database
	series       map[string]*Series      // all series objects; seriesKey = measurement + tags
	name         string                  // database name
}
The most important contents of this struct are all the measurements and series under the database; their data structures are as follows:
type Measurement struct {
	Name       string `json:"name,omitempty"`
	fieldNames map[string]struct{} // all field names under this measurement

	// in-memory index fields:
	// storing ids (rather than objects) in seriesByTagKeyValue is mainly
	// intended to save memory
	seriesByID map[uint64]*Series // lookup table for series by their id

	// double index from tag key to tag value to a sorted array of series ids;
	// during a query this map is used to quickly filter out all the series ids
	// matching the given tags, after which the contents are read from the
	// files by series key and time range
	seriesByTagKeyValue map[string]map[string]SeriesIDs

	// ids of all series in this measurement, sorted by id
	seriesIDs SeriesIDs
}

type Series struct {
	Key         string            // series key
	tags        map[string]string // tags
	id          uint64            // id
	measurement *Measurement      // measurement this series belongs to
	shardIDs    map[uint64]bool   // shards that have this series defined
}
Metadata queries
InfluxDB supports a number of special query statements (with regular-expression matching) for querying measurement- and tag-related data, such as:
SHOW MEASUREMENTS
SHOW TAG KEYS FROM "measurement_name"
SHOW TAG VALUES FROM "measurement_name" WITH KEY = "tag_key"
For example, suppose we need to find out which machines have uploaded data for the measurement cpu_usage; a possible query statement is:
SHOW TAG VALUES FROM "cpu_usage" WITH KEY = "host"
First, based on the measurement name, the Measurement object for cpu_usage is obtained from DatabaseIndex.measurements.
Then the map keyed by tag value for tagKey=host is obtained from Measurement.seriesByTagKeyValue.
Traversing this map object, its keys are exactly the data we need, as the sketch below shows.
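A sketch of those three steps, using abridged versions of the index types shown above (hypothetical helper; error handling omitted):

package tsmsketch

type SeriesIDs []uint64

type Measurement struct {
	seriesByTagKeyValue map[string]map[string]SeriesIDs // tag key -> tag value -> series ids
}

type DatabaseIndex struct {
	measurements map[string]*Measurement
}

// tagValues mirrors the SHOW TAG VALUES lookup: find the measurement,
// fetch the tag-value map for the given tag key, and return its keys.
func tagValues(idx *DatabaseIndex, measurement, tagKey string) []string {
	m, ok := idx.measurements[measurement] // step 1
	if !ok {
		return nil
	}
	byValue := m.seriesByTagKeyValue[tagKey] // step 2
	values := make([]string, 0, len(byValue))
	for v := range byValue { // step 3: the map keys are the answer
		values = append(values, v)
	}
	return values
}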
Locating data for ordinary queries
For ordinary data query statements, the metadata index above lets you quickly determine all of the series keys, field names, and time ranges involved in the query.
For example, suppose the query fetches the last hour of data for the cpu_usage metric on the machine server01:
SELECT value FROM "cpu_usage" WHERE host = 'server01' AND time > now() - 1h
Based on measurement=cpu_usage, the Measurement object for cpu_usage is obtained from DatabaseIndex.measurements.
The IDs of all matching series are then obtained from DatabaseIndex.measurements["cpu_usage"].seriesByTagKeyValue["host"]["server01"], and their actual Series objects are fetched by ID from the Measurement.seriesByID map.
Note that although only host=server01 is specified here, this does not mean there is only one series under cpu_usage: there may be other tags such as user=1 and user=2, so two series IDs may actually be obtained, and the data under all of these series needs to be fetched.
The shardIDs map in the Series struct records the shards in which data for that series exists, and the Measurement.fieldNames map helps filter out cases where the queried fieldName does not exist.
At this point, in O(1) time complexity, we have obtained all the required series keys, the shard IDs for those series keys, and the time range of the data to query. We can then create data iterators to fetch each series key's data within the specified time range from the different shards. The rest of the query works against the in-memory cache and the Index in the TSM files.
2) TSM file index
The Index portion of the TSM file described above is loaded into memory as an indirect index, enabling fast retrieval. The concrete data structure is:
type indirectIndex struct {
	b       []byte  // raw bytes of the underlying detailed index
	offsets []int32 // array of offsets recording each key's position in b

	minKey, maxKey string

	// min and max times in this file; used to quickly determine whether the
	// data being queried can exist in this file at all, i.e. whether the
	// file needs to be read
	minTime, maxTime int64

	// records which keys have had data deleted within the given time ranges
	tombstones map[string][]TimeRange
}
b corresponds directly to the Index section of the TSM file. By binary-searching offsets, the index information for all blocks of a specified key can be found, and the offset and size information can then be used to read any specified block's data.
type indexEntries struct {
	Type    byte
	entries []IndexEntry
}

type IndexEntry struct {
	// all points in the block fall within this min/max time range
	minTime, maxTime int64

	offset int64  // offset of the block in the TSM file
	size   uint32 // size of the block
}
As explained in the previous section, the metadata index yields all of the required series keys, their corresponding shard IDs, and the time range. With the TSM file index, we can then quickly locate the position of the data in the TSM file by series key and time range.
Reading data from a TSM file
All data read operations in InfluxDB are performed through iterators.
An iterator is an abstract concept that supports nesting: an iterator can fetch and process data from lower-level iterators and pass the results up to the iterator above it.
This part of the code is fairly complex and is not expanded on here. In essence, the lowest-level iterators are cursors that fetch the data.
type cursor interface {
	next() (t int64, v interface{})
}

type floatCursor interface {
	cursor
	nextFloat() (t int64, v float64)
}

// The underlying implementation is a KeyCursor, which reads one block of data at a time.
type floatAscendingCursor struct {
	// in-memory cache of values
	cache struct {
		values Values
		pos    int
	}

	tsm struct {
		tdec      TimeDecoder  // timestamp decoder
		vdec      FloatDecoder // value decoder
		buf       []FloatValue
		values    []FloatValue // cache of FloatValues read from the TSM file
		pos       int
		keyCursor *KeyCursor
	}
}
A cursor provides a next() method for fetching a value, and each data type has its own cursor implementation.
The underlying implementation is the KeyCursor: it caches one block's data at a time, returning values sequentially through next(), and when a block's contents are exhausted it reads the next block via ReadBlock().
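The consumption pattern can be sketched as follows; a sentinel timestamp of -1 is assumed here to mark exhaustion (the engine defines an EOF constant for this purpose):

package tsmsketch

// Cursor is the per-type iteration interface described above: each call to
// Next returns one timestamp/value pair.
type Cursor interface {
	Next() (t int64, v interface{})
}

const eof = int64(-1) // assumed sentinel for "no more data"

// drain reads a cursor until it is exhausted, passing every point to visit.
func drain(c Cursor, visit func(t int64, v interface{})) {
	for t, v := c.Next(); t != eof; t, v = c.Next() {
		visit(t, v)
	}
}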