What is a time series database
Let's start by describing what time series data is. Time series data is a sequence of data points indexed by time. Connected along a time axis, these points form a line that can be turned into multi-dimensional reports revealing trends, regularities, and anomalies. Looking ahead, big data analysis and machine learning over this data enable forecasting and early warning.
A time series database is a database that stores time series data; its basic functions must include fast writes, persistence, and multi-dimensional aggregation queries.
Unlike a traditional database, which records only the current value of each item, a time series database records all historical values, and queries over time series data almost always use time as a filter condition.
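As a toy illustration (a minimal in-memory sketch with made-up numbers, not any particular database's API), a time series query filters on a time range first and then aggregates:

```python
# Hypothetical in-memory example: every query filters on time first.
points = [
    (1420070400, 23.4),  # (unix timestamp, temperature)
    (1420074000, 23.1),
    (1420077600, 22.8),
]

def query_range(points, start, end):
    """Return the points whose timestamp falls in [start, end)."""
    return [(ts, v) for ts, v in points if start <= ts < end]

# Time-range filter, then a simple aggregation over the window.
window = query_range(points, 1420070400, 1420077600)
avg = sum(v for _, v in window) / len(window)
```

A real time series database evaluates the same shape of query, but over billions of points rather than a Python list.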
Examples of time series data
Figure 1: 2015 temperature change chart for Beijing, Shanghai, and Guangzhou
Figure 2: current temperature display for Beijing, Shanghai, and Guangzhou
Here are some basic concepts of time series databases (terminology differs slightly between databases).
Metric: the measurement, equivalent to a table in a relational database.
Data point: a single data sample, equivalent to a row in a relational database.
Timestamp: the time at which the data point was generated.
Field: the value columns under a metric. For example, a location metric has two fields: longitude and latitude. Fields generally store data that changes over time.
Tag: a label, i.e. additional information. Tags typically store attribute information that does not change with the timestamp. The timestamp plus all the tags can be regarded as the table's primary key.
For example, the metric is wind; each data point has a timestamp, two fields (direction and speed), and two tags (sensor and city). In the first and third rows, the sensor is device 95d8-7913 and the city attribute is Shanghai. As time passed, the wind changed: the direction shifted from 23.4 to 23.2, and the speed from 3.4 to 3.3.
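The concepts above can be modeled roughly as follows. This is only a sketch: the field and tag names come from the wind example in the text, while the class itself and its method are illustrative inventions, not any database's schema.

```python
from dataclasses import dataclass, field

@dataclass
class DataPoint:
    metric: str        # e.g. "wind"; plays the role of a table name
    timestamp: int     # when the point was generated
    fields: dict = field(default_factory=dict)  # values that change over time
    tags: dict = field(default_factory=dict)    # attributes that do not

    def series_key(self):
        # The metric plus all tags identifies the series a point belongs to;
        # together with the timestamp it acts like a primary key.
        return (self.metric,) + tuple(sorted(self.tags.items()))

p1 = DataPoint("wind", 1420070400,
               fields={"direction": 23.4, "speed": 3.4},
               tags={"sensor": "95d8-7913", "city": "shanghai"})
p3 = DataPoint("wind", 1420077600,
               fields={"direction": 23.2, "speed": 3.3},
               tags={"sensor": "95d8-7913", "city": "shanghai"})
```

Both points carry the same tags, so they share a series key: they are two samples of the same series at different times.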
Figure 3: basic concepts of a time series database
Challenges for time series databases
Many people may think that adding a timestamp column to a traditional relational database turns it into a time series database. That is fine while the data volume is small, but a small volume means limited dimensions and little detail, so the conclusions drawn from it carry little confidence, and it certainly cannot support large-scale data analysis. Time series databases clearly exist to handle massive-data scenarios.
We can see that a time series database needs to solve several problems:
- Writing time series data: how to support writing tens of millions of data points per second.
- Reading time series data: how to support second-level grouping and aggregation queries over billions of data points.
- Cost sensitivity: massive data implies massive storage costs. Storing this data more cheaply is a problem the time series database must prioritize.
One article cannot cover all of these questions, and each can be optimized from multiple angles. Here we only try to answer, from the perspective of data storage, how to handle writing and reading data at large volume.
Traditional databases store data in B trees because B trees reduce the number of disk seeks for queries and sequential inserts. Disk seeks are very slow, generally around 10 ms, and random reads and writes are dominated by seek time. A random write into a B tree therefore spends most of its time seeking, which makes it slow. SSDs have faster seek times, but they do not solve the problem fundamentally.
A B tree is clearly inappropriate for a time series database, where more than 90% of operations are writes.
The industry mainstream is to replace the B tree with an LSM tree, as in NoSQL systems such as HBase and Cassandra.
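To make the write-optimized idea concrete, here is a heavily simplified LSM-tree sketch (no write-ahead log, no compaction, everything in memory; class and parameter names are made up): writes go into an in-memory memtable, which is flushed as an immutable sorted run when full, and reads check the memtable first, then the runs from newest to oldest.

```python
import bisect

class TinyLSM:
    """Toy LSM tree: fast appends to a memtable, sequential flushes to runs."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}   # recent writes, held in memory
        self.runs = []       # immutable sorted runs ("SSTables"), newest last
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value   # O(1) in-memory write, no disk seek
        if len(self.memtable) >= self.limit:
            self._flush()

    def _flush(self):
        # One sequential write of a sorted run; a real engine writes this
        # to disk, which is why LSM writes avoid random I/O.
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:     # newest data wins
            return self.memtable[key]
        for run in reversed(self.runs):
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)  # binary search in sorted run
            if i < len(keys) and keys[i] == key:
                return run[i][1]
        return None

db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)   # reaches the limit and triggers a flush
db.put("a", 3)   # newer value in the memtable shadows the flushed one
```

The point of the sketch is the trade: writes are always sequential (memtable append plus sorted flush), while reads may have to consult several runs, which matches the write-heavy profile of time series data.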
Shard Design
Shard design is simply the question of how to partition the data. It is very tricky and directly affects read and write performance.
Given the characteristics of time series data, sharding by metric + tags is a good approach: queries usually cover a time range, so with this scheme data with the same metric and tags is allocated to one machine and stored contiguously, and sequential disk reads are very fast. Combined with the single-machine storage structure mentioned above, this enables fast queries.
Going a step further, since the time range of time series data can be very long, we should split it into several segments by time range and store them on different machines. Queries over long series can then run concurrently, which speeds them up.
For example, the first and third rows have the same tags (sensor=95d8-7913; city=Shanghai), so they are assigned to the same shard, while the fifth row has the same tags but falls into a different time segment and is therefore placed in a different shard. The second, fourth, and sixth rows, which share the tags (sensor=f3cc-20f3; city=Beijing), are handled in the same way.
Figure 5: time series data sharding
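The scheme above can be sketched as a two-step routing function. Everything concrete here is an assumption for illustration: the hash choice, the node count, and the one-week segment width are made up, not taken from any real system.

```python
import hashlib

NUM_NODES = 3                     # assumed cluster size
BUCKET_SECONDS = 7 * 24 * 3600    # assumed segment width: one week

def shard_for(metric, tags, timestamp):
    """Route a point: same metric+tags stay together on one node,
    long time ranges are split into segments by time bucket."""
    series_key = metric + "," + ",".join(
        f"{k}={v}" for k, v in sorted(tags.items()))
    node = int(hashlib.md5(series_key.encode()).hexdigest(), 16) % NUM_NODES
    time_bucket = timestamp // BUCKET_SECONDS
    return (node, time_bucket)

tags = {"sensor": "95d8-7913", "city": "shanghai"}
s1 = shard_for("wind", tags, 1420070400)
s2 = shard_for("wind", tags, 1420070400 + 3600)            # same week
s3 = shard_for("wind", tags, 1420070400 + 30 * 24 * 3600)  # a month later
```

Points from the same series in the same week land in the same shard (s1 and s2), while a point a month later keeps the same node-by-series placement but falls into a different time segment (s3), which is exactly the behavior Figure 5 describes.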
Real-world examples
Below I use several open-source time series databases as illustrations.
InfluxDB:
An excellent time series database, but only the single-node version is free and open source; the cluster version is paid. The single-node version still gives a glimpse of the storage scheme: on a single machine, InfluxDB uses an LSM-tree-like storage structure. For sharding, InfluxDB first determines the shard group by time range (in fact, together with the retention policy), and then determines the specific shard by hash.
KairosDB:
The underlying layer uses Cassandra as the distributed storage engine, which, as mentioned above, uses an LSM tree on each single machine.
OpenTSDB:
The bottom layer uses HBase as the distributed storage engine, which also uses an LSM tree.
HBase uses range partitioning: the row key determines the shard, so the data stays globally ordered by row key. Under each row key there can be multiple column families, and each column family can contain multiple columns.
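For illustration, OpenTSDB's row key layout over HBase can be sketched roughly like this. This is a simplification and partly an assumption: real OpenTSDB encodes the metric and tags as fixed-width binary UIDs, whereas here plain strings stand in; the hour-aligned base timestamp reflects the general idea that many points share one row.

```python
def opentsdb_style_row_key(metric, timestamp, tags):
    """Simplified row key: metric, hour-aligned base time, sorted tag pairs.
    Because rows sort by metric and then time, a time-range query
    becomes a sequential scan over adjacent rows."""
    base_time = timestamp - (timestamp % 3600)   # align to the hour
    tag_part = "".join(f"|{k}|{v}" for k, v in sorted(tags.items()))
    return f"{metric}|{base_time:010d}{tag_part}"

# Two points within the same hour map to the same row key; the offset
# within the hour would go into the column qualifier instead.
k1 = opentsdb_style_row_key("wind", 1420070410, {"city": "shanghai"})
k2 = opentsdb_style_row_key("wind", 1420073900, {"city": "shanghai"})
```

This is the range-partitioning story from the paragraph above in miniature: the row key is designed so that the data a time-range query needs sits next to each other on disk.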
Conclusion
As we can see, although each distributed time series database differs slightly in its storage scheme, they are consistent in essence: because time series workloads involve far more writes than reads, each machine uses a single-node storage structure suited to high-throughput writes, and on top of that the distributed layer is carefully designed around the characteristics of time series data. The goal is a shard scheme that makes time series data easy to write and read while keeping the data distribution even, so as to avoid hot spots.
Time series database storage in plain language: in essence, the LSM tree