Time Series Database Compression

Lossless compression

Lossless compression means that the data recovered after decompression is exactly the same as the original data, with no loss of precision. Compressing data ultimately comes down to summarizing its regularities. The regularities of time series data can be summarized in two points: 1) timestamps increase steadily; 2) values change in a stable, regular way. Let me give an example.

Consider a set of time series data with a column of timestamps and a column of values. Looking at it row by row, compression seems difficult; looking at it column by column, a compression scheme presents itself.

First look at the timestamp column. It is a linearly incrementing sequence that can be expressed as [1467627245000, 1000, 4]: 1467627245000 is the first timestamp, 1000 means each subsequent timestamp is larger than the previous one by 1000, and 4 means this pattern occurs 4 times. If 100 timestamps follow such a rule, we can represent them with just 3 Long values, a compression ratio of about 33.
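To make the idea concrete, here is a minimal sketch (not InfluxDB's actual encoder, which uses delta-of-delta encoding and tighter bit packing) that collapses a steadily incrementing timestamp column into (start, delta, count) runs:

```python
def compress_timestamps(timestamps):
    """Encode a timestamp column as (start, delta, count) runs.

    Illustrative sketch only: a steadily incrementing column
    collapses to a handful of numbers.
    """
    runs = []
    start = timestamps[0]
    delta = timestamps[1] - timestamps[0]
    count = 1                      # number of times `delta` has repeated so far
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        d = cur - prev
        if d == delta:
            count += 1
        else:
            runs.append((start, delta, count))
            start, delta, count = prev, d, 1
    runs.append((start, delta, count))
    return runs

# Five timestamps spaced exactly 1000 ms apart collapse into one run:
ts = [1467627245000 + i * 1000 for i in range(5)]
print(compress_timestamps(ts))     # [(1467627245000, 1000, 4)]
```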

Now look at the value column. Taking differences between consecutive values gives (6, -5, 2, -5); adding 5 to each gives (11, 0, 7, 0), and each of these fits in 4 bits. The column can therefore be expressed as [23, 5, 4, 0xb0700000]: 23 is the first value, 5 is the offset that was added, 4 means that four deltas follow, and 0xb0700000 packs the four shifted deltas (0xb, 0x0, 0x7, 0x0) into nibbles, which decodes back to the original values (23, 29, 24, 26, 21). If 100 int values keep following this rule, they can be represented with about 16 ints, a compression ratio of roughly 6.3.
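The delta-plus-offset packing described above can be sketched as follows; the function names and the 32-bit packing layout are illustrative choices, not any particular database's on-disk format:

```python
def pack_deltas(values):
    """Pack small integer deltas into 4-bit nibbles.

    Mirrors the example in the text: store the first value, the offset
    that makes every delta non-negative, the number of deltas, and the
    deltas themselves packed four bits apiece (at most 8 deltas per
    32-bit word in this illustration).
    """
    deltas = [b - a for a, b in zip(values, values[1:])]     # (6, -5, 2, -5)
    offset = -min(deltas)                                    # 5
    shifted = [d + offset for d in deltas]                   # (11, 0, 7, 0), each fits in 4 bits
    assert all(0 <= s <= 0xF for s in shifted)
    packed = 0
    for s in shifted:
        packed = (packed << 4) | s
    packed <<= 4 * (8 - len(shifted))                        # left-align in a 32-bit word
    return values[0], offset, len(deltas), packed

def unpack_deltas(first, offset, count, packed):
    """Invert pack_deltas and recover the original values."""
    out = [first]
    for i in range(count):
        nibble = (packed >> (4 * (7 - i))) & 0xF
        out.append(out[-1] + nibble - offset)
    return out

vals = [23, 29, 24, 26, 21]
enc = pack_deltas(vals)
print([enc[0], enc[1], enc[2], hex(enc[3])])   # [23, 5, 4, '0xb0700000']
print(unpack_deltas(*enc))                     # [23, 29, 24, 26, 21]
```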

The real implementations are a lot more complicated; this is just a simple example. InfluxDB's lossless compression algorithms are fully described in its documentation (note 3), which can be studied together with the open source code for a more in-depth understanding. For floating-point types, the very efficient lossless compression algorithm described in Facebook's Gorilla paper (note 4) has been analyzed in many articles, and InfluxDB also uses this algorithm for floating-point values.
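The heart of Gorilla's floating-point compression is XORing each value with its predecessor and recording only the bits that actually changed. The sketch below shows just that first step; the real encoder additionally emits control bits and reuses the previous leading-zero/meaningful-bit window to save further space:

```python
import struct

def float_to_bits(x: float) -> int:
    """Reinterpret a 64-bit float as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def xor_deltas(values):
    """Yield (xor, leading_zeros, meaningful_bits) for consecutive floats.

    Illustrates only the first step of Gorilla-style float encoding:
    XOR each value with its predecessor and measure how many bits
    actually changed.
    """
    prev = float_to_bits(values[0])
    for v in values[1:]:
        cur = float_to_bits(v)
        x = prev ^ cur
        if x == 0:
            yield (0, 64, 0)                      # identical value: almost free to store
        else:
            leading = 64 - x.bit_length()         # zero bits before the first changed bit
            trailing = (x & -x).bit_length() - 1  # zero bits after the last changed bit
            yield (x, leading, 64 - leading - trailing)
        prev = cur

# A slowly varying series: most XORs have long runs of leading/trailing
# zeros, so only a handful of "meaningful" bits per value need storing.
series = [15.5, 15.5, 15.625, 15.625, 15.75]
for x, lead, meaningful in xor_deltas(series):
    print(f"xor={x:016x} leading_zeros={lead} meaningful_bits={meaningful}")
```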

Lossy compression

Lossy compression means that the data recovered after decompression loses some precision compared with the original; it is mainly used for floating-point numbers. A compression deviation is usually configured to control the loss of precision. The core idea of lossy compression for time series data is fitting: use as few lines as possible to approximate the points, and those lines can be straight lines or curves.

The most famous lossy compression for time series data is OSIsoft's SDT (Swinging Door Trending) algorithm, known in Chinese as the revolving door compression algorithm.

In the figure, the red point is the previously recorded point, the hollow points are discarded points, the green point is the current point, and the black point is the point currently being recorded.

On the left side of the figure, the rectangle formed by the current point, the previously recorded point, and the compression deviation can contain all of the intermediate points, so those points can be discarded.

On the right side of the figure, the rectangle formed by the current point and the previously recorded point cannot contain the intermediate points, so the previous point is recorded. Proceeding this way, most of the data points are discarded. At query time, the discarded points are recovered by interpolating between the recorded points.
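Below is a minimal sketch of the swinging door idea, not OSIsoft's production implementation. It keeps a point only when no single line segment from the last archived point can cover all subsequent points within the compression deviation:

```python
def swinging_door(points, deviation):
    """Swinging Door Trending (SDT), simplified sketch.

    `points` is a list of (t, v) pairs with increasing t; `deviation`
    is the compression deviation E. A point is archived only when the
    "doors" pivoting at (last archived value +/- E) swing past each
    other, i.e. no single segment can cover all points seen since the
    last archived point within E.
    """
    archived = [points[0]]
    t0, v0 = points[0]
    max_up = float("-inf")   # slope of the upper door (running maximum)
    min_low = float("inf")   # slope of the lower door (running minimum)
    prev = points[0]
    for t, v in points[1:]:
        max_up = max(max_up, (v - (v0 + deviation)) / (t - t0))
        min_low = min(min_low, (v - (v0 - deviation)) / (t - t0))
        if max_up > min_low:
            # Doors have crossed: record the previous point and
            # restart the envelope from it.
            archived.append(prev)
            t0, v0 = prev
            max_up = (v - (v0 + deviation)) / (t - t0)
            min_low = (v - (v0 - deviation)) / (t - t0)
        prev = (t, v)
    archived.append(prev)    # always keep the last point
    return archived

# A flat stretch followed by a jump: the flat points are discarded,
# the point just before the jump is kept.
data = [(0, 1.0), (1, 1.2), (2, 0.9), (3, 1.1), (4, 5.0), (5, 5.1)]
print(swinging_door(data, deviation=0.5))   # [(0, 1.0), (3, 1.1), (4, 5.0), (5, 5.1)]
```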

Lossy compression can significantly reduce storage costs. Combined with processing capability on the device side, it can even reduce the amount of data written and the network bandwidth consumed.

Summary

Although the optimal compression scheme cannot be computed, designing a good compression algorithm is still a tractable problem. As shown above, both the lossless and lossy compression algorithms for time series data exploit the characteristics of that data to achieve better compression ratios. With deep learning now so popular, people are curious whether it can bring new approaches to data compression.

Excerpt from: http://www.infoq.com/cn/articles/condense-in-sequential-databases
