Prometheus TSDB Analysis

Overview

Prometheus is a well-known open-source monitoring project. Its monitoring tasks are scheduled to specific servers, which scrape monitoring data from their targets and save it to the local TSDB. The powerful PromQL query language supports rich queries over both real-time and historical time series data.
The TSDB in Prometheus 1.0 (the V2 storage engine) is based on LevelDB and uses the same compression algorithm as Facebook's Gorilla, compressing 16-byte data points to an average of 1.37 bytes.
Prometheus 2.0 introduces a new V3 storage engine that provides higher write and query performance. This article mainly analyzes the design ideas behind that storage engine.
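To see why this compression works so well, consider how consecutive samples differ: scrape timestamps are almost perfectly regular, and values change slowly. The following Go sketch (illustrative only, not Prometheus's actual encoder) demonstrates the two core ideas of the Gorilla algorithm, delta-of-delta timestamps and XOR'd values, by printing how few significant bits each sample actually needs:

package main

import (
	"fmt"
	"math"
	"math/bits"
)

// Sample is one timestamped value in a series.
type Sample struct {
	T int64   // timestamp in seconds
	V float64 // value
}

func main() {
	// Regularly scraped samples: timestamps step by ~15s, values change slowly.
	samples := []Sample{
		{1000, 3.0}, {1015, 3.0}, {1030, 3.5}, {1045, 3.5}, {1060, 4.0},
	}

	prevT, prevDelta := samples[0].T, int64(0)
	prevV := math.Float64bits(samples[0].V)

	for _, s := range samples[1:] {
		// Timestamps: the delta-of-delta is usually 0 for regular scrapes,
		// so it can often be stored in a single bit.
		delta := s.T - prevT
		dod := delta - prevDelta
		prevT, prevDelta = s.T, delta

		// Values: XOR against the previous value leaves mostly zero bits
		// when consecutive values are equal or close to each other.
		cur := math.Float64bits(s.V)
		xor := cur ^ prevV
		prevV = cur

		meaningful := 0
		if xor != 0 {
			meaningful = 64 - bits.LeadingZeros64(xor) - bits.TrailingZeros64(xor)
		}
		fmt.Printf("t=%d delta-of-delta=%d value XOR needs %d significant bits\n",
			s.T, dod, meaningful)
	}
}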

Design ideas

Prometheus stores time series data in blocks of 2 hours. Each block is a directory containing one or more chunk files (which hold the time series data), a metadata file, and an index file (which maps metric names and labels to the locations of the time series data in the chunk files). The most recently written data is kept in an in-memory block and written to disk after 2 hours. To prevent data loss from a program crash, a WAL (write-ahead log) mechanism persists the raw time series data by appending it to a log. When time series are deleted, the deletion entries are recorded in a separate tombstone file rather than being removed from the chunk files immediately.
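How these pieces fit together can be summarized with a few Go types. This is a rough sketch for orientation only; the names and fields are illustrative and do not come from the Prometheus source tree:

// Package tsdbsketch holds illustrative types only.
package tsdbsketch

import "time"

// Block is one two-hour window of time series data on disk.
type Block struct {
	Dir        string    // e.g. ./data/01BKGV7JBM69T2G1BGBGM6KB12
	MinTime    time.Time // inclusive lower bound of the window
	MaxTime    time.Time // exclusive upper bound of the window
	Chunks     []string  // chunk files holding the compressed samples
	IndexFile  string    // maps metric names and labels to chunk locations
	MetaFile   string    // meta.json: time range, compaction level, ...
	Tombstones string    // deletion marks, applied at query time
}

// headBlock is the in-memory block receiving current writes; every
// append goes to the WAL first so a crash cannot lose it.
type headBlock struct {
	wal    *wal                // write-ahead log segments under <dir>/wal/
	series map[uint64][]sample // series ID -> recent samples
}

type sample struct {
	t int64   // timestamp
	v float64 // value
}

type wal struct{ dir string }
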
These 2-hour blocks are compacted into larger blocks in the background: the data is merged into a higher-level block file, and the lower-level block files are removed. This is consistent with LSM-tree stores such as LevelDB and RocksDB.
These designs are highly similar to Gorilla's, so Prometheus is almost a cache-like TSDB. The characteristics of its local storage mean it cannot be used for long-term data storage; it is only suitable for saving and querying time series data over short-term windows, and it is not highly available (downtime makes historical data unreadable).
Given the limitations of local storage, Prometheus provides API interfaces for integrating with long-term storage and saving data to a remote TSDB. The API uses a custom protocol buffer over HTTP and is not yet stable; a switch to gRPC is under consideration.
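
A minimal sketch of what such an integration looks like from the sending side: a protocol-buffer payload POSTed over HTTP. The endpoint URL is hypothetical, and the payload encoding is stubbed out; a real sender would marshal and compress an actual protobuf write request:

package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// encodeWriteRequest stands in for protobuf marshalling and compression;
// it is a placeholder, not the real wire format.
func encodeWriteRequest(metric string, t int64, v float64) []byte {
	return []byte(fmt.Sprintf("%s %d %g", metric, t, v))
}

func main() {
	payload := encodeWriteRequest("http_requests_total", 1700000000, 42)

	// The remote TSDB endpoint URL is hypothetical.
	req, err := http.NewRequest("POST", "http://remote-tsdb:9201/write",
		bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/x-protobuf")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("remote write failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("remote write status:", resp.Status)
}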

Disk file structure of the in-memory block

Before the in-memory block data has been flushed to disk, the block directory mainly contains the WAL files.

./data/01BKGV7JBM69T2G1BGBGM6KB12
./data/01BKGV7JBM69T2G1BGBGM6KB12/meta.json
./data/01BKGV7JBM69T2G1BGBGM6KB12/wal/000002
./data/01BKGV7JBM69T2G1BGBGM6KB12/wal/000001
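
The WAL's job is simply to make every append durable before it is applied to the in-memory block, so a crash can be recovered by replaying the log. A minimal Go sketch of that idea, assuming a simple length-prefixed record format (the real Prometheus WAL format differs):

package main

import (
	"encoding/binary"
	"os"
)

// appendRecord durably logs raw sample data before it is applied to the
// in-memory block, so a crash can be recovered by replaying the log.
func appendRecord(f *os.File, rec []byte) error {
	var lenBuf [4]byte
	binary.BigEndian.PutUint32(lenBuf[:], uint32(len(rec)))
	if _, err := f.Write(lenBuf[:]); err != nil {
		return err
	}
	if _, err := f.Write(rec); err != nil {
		return err
	}
	// fsync: the record must be on disk before the write is acknowledged.
	return f.Sync()
}

func main() {
	// Segment name mirrors the layout above, e.g. wal/000001.
	f, err := os.OpenFile("000001", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	if err := appendRecord(f, []byte(`http_requests_total{app="nginx"} 1700000000 42`)); err != nil {
		panic(err)
	}
}
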
Persistent block

In a persisted block directory the WAL files have been deleted, and the time series data is saved in chunk files. The index file is used to look up the locations of the time series within the chunk files.

./data/01BKGV7JC0RY8A6MACW02A2PJD
./data/01BKGV7JC0RY8A6MACW02A2PJD/meta.json
./data/01BKGV7JC0RY8A6MACW02A2PJD/index
./data/01BKGV7JC0RY8A6MACW02A2PJD/chunks
./data/01BKGV7JC0RY8A6MACW02A2PJD/chunks/000001
./data/01BKGV7JC0RY8A6MACW02A2PJD/tombstones
Mmap

Mmap is used to read the large files produced by compaction and merging (without consuming too many file handles). It establishes a mapping between the process's virtual address space and the file offsets, and data is actually read into physical memory only when a query touches the corresponding locations. Because the kernel's page cache is mapped directly into the process, one data copy is avoided compared with a normal read. After a query finishes, the corresponding memory is reclaimed automatically by Linux according to memory pressure, and until it is reclaimed it can serve hits for the next query. Using mmap to automatically manage the memory cache needed by queries is therefore both simple to manage and efficient.
As can be seen, this is not a completely memory-based TSDB; the difference from Gorilla is that querying historical data requires reading disk files.
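
A minimal Go sketch of this access pattern (Unix only; the chunk path is illustrative):

package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	f, err := os.Open("chunks/000001") // path is illustrative
	if err != nil {
		panic(err)
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		panic(err)
	}

	// Map the whole chunk file into the process address space; no data
	// is read from disk yet.
	data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
		syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(data)

	// Only the pages backing this slice access are actually faulted in;
	// the kernel reclaims them later under memory pressure.
	n := 8
	if len(data) < n {
		n = len(data)
	}
	fmt.Printf("first bytes of chunk: % x\n", data[:n])
}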

Compaction

The main compaction operations include merging blocks, deleting expired data, and restructuring chunk data. Merging multiple blocks into a larger one effectively reduces the total number of blocks and avoids having to merge many blocks for queries that cover long time ranges.
To make deletion efficient, when time series data is deleted only the deleted positions are recorded; a block's entire directory is removed only once all of its data has been deleted. The size to which blocks are merged therefore also needs to be limited, to avoid retaining too much already-deleted space (extra space consumption). A good approach is to cap the maximum block duration at a percentage of the data retention period, such as 10%.
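
As a worked example of that 10% rule, assuming an illustrative 15-day retention period:

package main

import (
	"fmt"
	"time"
)

func main() {
	retention := 15 * 24 * time.Hour // keep 15 days of data
	maxBlock := retention / 10       // cap each block at 10% of retention

	fmt.Println("max block duration:", maxBlock) // prints 36h0m0s
}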

Inverted Index

An inverted index provides fast lookup of data items based on a subset of their contents. In short, it can find all series with the label app="nginx" without traversing every series and checking whether it contains that label.
To this end, each series is assigned a unique ID through which it can be retrieved in constant time; in this case, the ID is the forward index.
For example: if the series with IDs 9, 10, and 29 contain the label app="nginx", then the inverted index for that label is [9, 10, 29], which allows series containing the label to be found quickly.
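
A minimal Go sketch of this forward/inverted index pair; intersecting two sorted postings lists answers multi-label queries without scanning every series (the instance label here is a hypothetical addition to the example above):

package main

import "fmt"

func main() {
	// Inverted index: label -> sorted IDs of the series containing it.
	postings := map[string][]int{
		`app="nginx"`:         {9, 10, 29},
		`instance="10.0.0.1"`: {10, 23, 29, 40},
	}

	// Series matching both labels: [10 29].
	fmt.Println(intersect(postings[`app="nginx"`], postings[`instance="10.0.0.1"`]))
}

// intersect merges two sorted postings lists in O(m+n).
func intersect(a, b []int) []int {
	var out []int
	for i, j := 0, 0; i < len(a) && j < len(b); {
		switch {
		case a[i] == b[j]:
			out = append(out, a[i])
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
	}
	return out
}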

Performance

In the article Writing a Time Series Database from Scratch, the author gives benchmark results of up to 20 million samples written per second on a MacBook Pro. This single-machine performance is higher than the target in the Gorilla paper of 700 million writes per minute (more than 10 million per second).

References

Writing a Time Series Database from Scratch
Prometheus official documentation
Storage and Computation of Time Series Data: Open-Source Time Series Database Analysis (Part 4)

