Objective
In the monitoring field, Prometheus is to Kubernetes what Kubernetes is to container orchestration.
With Heapster no longer developed or maintained, and the InfluxDB clustering solution no longer open source, the Heapster + InfluxDB monitoring stack is only suitable for relatively small k8s clusters, while the Prometheus community is very active. Besides the series of high-quality exporters provided by the official community, such as node_exporter, Telegraf (for centralized metrics collection) + Prometheus is also a good way to reduce the workload of deploying and managing the various exporters.
Today we mainly share some practical experience our company has gained around storage while using Prometheus.
Prometheus Storage Bottleneck
As can be seen from the Prometheus architecture diagram, Prometheus ships with local storage, a TSDB (time series database). The advantage of local storage is that it is simple to operate; the disadvantages are that huge volumes of metrics cannot be persisted and there is a risk of data loss. In actual use we have run into WAL file corruption several times, after which no further data could be written.
Of course, data compression has improved greatly since Prometheus 2.0. To work around the limits of single-node storage, Prometheus does not implement clustered storage itself; instead it provides remote read and write interfaces, letting users choose a suitable time series database to give Prometheus extensibility.
Prometheus can integrate with other remote storage systems in two ways (a sample configuration follows the list):
- Prometheus writes metrics to the remote storage in a standard format
- Prometheus reads metrics back from the remote URL in a standard format
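For example, once an adapter that speaks this remote read/write protocol is running, wiring Prometheus up to it takes only two entries in prometheus.yml. A minimal sketch, in which the adapter host, port, and endpoint paths are assumptions:

    # Hypothetical adapter endpoints; host, port and paths are assumptions
    remote_write:
      - url: "http://prom2click.example:9201/write"
    remote_read:
      - url: "http://prom2click.example:9201/read"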
The significance and value of metrics persistence
In fact, monitoring is not only about keeping real-time track of how the system is running and alerting promptly. The collected monitoring data is also valuable in the following areas:
- Resource auditing and billing. This requires keeping data for a year or even several years.
- Tracing responsibility for faults
- Follow-up analysis and mining, or even applying AI, to achieve intelligent alerting rules, root cause analysis, prediction of an application's QPS trend, proactive HPA, and so on. This, of course, falls into the currently popular AIOps category.
Prometheus Data Persistence Scheme
Solution Selection
Remote storage systems in the community that support Prometheus remote read/write:
- AppOptics: write
- Chronix: write
- Cortex: read and write
- CrateDB: read and write
- Elasticsearch: write
- Gnocchi: write
- Graphite: write
- InfluxDB: read and write
- OpenTSDB: write
- PostgreSQL/TimescaleDB: read and write
- SignalFx: write
- ClickHouse: read and write
The selected solution needs to meet the following requirements:
- Data safety: fault tolerance and backup must be supported
- Good write performance, with sharding support
- The technical solution should not be overly complex
- A query syntax that is friendly for later analysis
- Grafana read support is preferred
- Both read and write must be supported
Based on the above points, ClickHouse satisfies our usage scenario.
ClickHouse is a high-performance columnar database that, being focused on analytics, supports a rich set of analytical functions.
The following are some of the officially recommended usage scenarios for ClickHouse:
- Web and APP Analytics
- Advertising networks and RTB
- Telecommunications
- E-commerce and Finance
- Information security
- Monitoring and Telemetry
- Time series
- Business Intelligence
- Online Games
- Internet of Things
So ClickHouse (CK) is well suited to storing time series data.
In addition, the community already has the Graphouse project, which uses ClickHouse as the storage backend for Graphite.
Performance testing
Write test
On a local Mac, a single ClickHouse instance started with Docker handled the metrics of 3 clusters at an average of 12,910 samples/s, with no write pressure at all. In fact, in actual use at Net Alliance and other companies, rates of 300,000/s have been reached.
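For reference, the tests write into a metrics.samples table. The following is a minimal sketch of such a schema: the column names are inferred from the queries below, while the engine parameters and the graphite_rollup section name are assumptions.

    CREATE DATABASE IF NOT EXISTS metrics;

    -- Sketch of the samples table used in the tests (legacy engine syntax)
    CREATE TABLE IF NOT EXISTS metrics.samples
    (
        date    Date     DEFAULT toDate(0),  -- partition column
        name    String,                      -- metric name
        tags    Array(String),               -- labels flattened as 'key=value' strings
        val     Float64,                     -- sample value
        ts      DateTime,                    -- sample timestamp
        updated DateTime DEFAULT now()       -- row insert time
    )
    ENGINE = GraphiteMergeTree(date, (name, tags, ts), 8192, 'graphite_rollup');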
Query test
    fbe6a4edc3eb :) select count(*) from metrics.samples;

    SELECT count(*)
    FROM metrics.samples

    ┌──count()─┐
    │ 22687301 │
    └──────────┘

    1 rows in set. Elapsed: 0.014 sec. Processed 22.69 million rows, 45.37 MB (1.65 billion rows/s., 3.30 GB/s.)
The kinds of queries most likely to be time-consuming:
1) Aggregate sum query
    fbe6a4edc3eb :) select sum(val) from metrics.samples where arrayExists(x -> 1 == match(x, 'cid=9'),tags) = 1 and name = 'machine_cpu_cores' and ts > '2017-07-11 08:00:00'

    SELECT sum(val)
    FROM metrics.samples
    WHERE (arrayExists(x -> (1 = match(x, 'cid=9')), tags) = 1) AND (name = 'machine_cpu_cores') AND (ts > '2017-07-11 08:00:00')

    ┌─sum(val)─┐
    │     6324 │
    └──────────┘

    1 rows in set. Elapsed: 0.022 sec. Processed 57.34 thousand rows, 34.02 MB (2.66 million rows/s., 1.58 GB/s.)
2) GROUP BY query
    fbe6a4edc3eb :) select sum(val), time from metrics.samples where arrayExists(x -> 1 == match(x, 'cid=9'),tags) = 1 and name = 'machine_cpu_cores' and ts > '2017-07-11 08:00:00' group by toDate(ts) as time;

    SELECT sum(val), time
    FROM metrics.samples
    WHERE (arrayExists(x -> (1 = match(x, 'cid=9')), tags) = 1) AND (name = 'machine_cpu_cores') AND (ts > '2017-07-11 08:00:00')
    GROUP BY toDate(ts) AS time

    ┌─sum(val)─┬───────time─┐
    │     6460 │ 2018-07-11 │
    │      136 │ 2018-07-12 │
    └──────────┴────────────┘

    2 rows in set. Elapsed: 0.023 sec. Processed 64.11 thousand rows, 36.21 MB (2.73 million rows/s., 1.54 GB/s.)
3) Regular Expressions
    fbe6a4edc3eb :) select sum(val) from metrics.samples where name = 'container_memory_rss' and arrayExists(x -> 1 == match(x, '^pod_name=ofo-eva-hub'),tags) = 1 ;

    SELECT sum(val)
    FROM metrics.samples
    WHERE (name = 'container_memory_rss') AND (arrayExists(x -> (1 = match(x, '^pod_name=ofo-eva-hub')), tags) = 1)

    ┌─────sum(val)─┐
    │ 870016516096 │
    └──────────────┘

    1 rows in set. Elapsed: 0.142 sec. Processed 442.37 thousand rows, 311.52 MB (3.11 million rows/s., 2.19 GB/s.)
Summary:
With well-built indexes, query performance is very good even under large data volumes.
Solution Design
The key points of this architecture are as follows:
- Each k8s cluster deploys a prometheus-clickhouse-adapter. This component is discussed in detail below.
- The ClickHouse cluster deployment requires a ZooKeeper (ZK) cluster for consistent replication of table data.
The ClickHouse cluster itself is laid out as follows:
- ReplicatedGraphiteMergeTree + Distributed. With ReplicatedGraphiteMergeTree, tables that share the same ZK path replicate to each other, i.e. their data is kept in sync.
- Each IDC has 3 shards, each holding 1/3 of the data
- Each shard, relying on ZK, has 2 replicas
For the detailed steps and ideas, please refer to "Building a ClickHouse cluster from 0 to 1". Thanks to Brother Peng from Sina for his guidance. Since our actual scenario is time series data, the ReplicatedMergeTree table engine is replaced with ReplicatedGraphiteMergeTree, so that the data gains the graphite_rollup capability. A sketch of the corresponding table definitions is given below.
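For illustration, here is a minimal sketch of the kind of DDL this layout implies, written in the legacy engine syntax; the cluster name, ZK path, macros, sharding key, and column layout are assumptions, and the graphite_rollup section must be defined in the server configuration:

    -- Local replicated table on every node; {shard} and {replica} come from each node's macros
    CREATE TABLE IF NOT EXISTS metrics.samples
    (
        date    Date     DEFAULT toDate(0),
        name    String,
        tags    Array(String),
        val     Float64,
        ts      DateTime,
        updated DateTime DEFAULT now()
    )
    ENGINE = ReplicatedGraphiteMergeTree(
        '/clickhouse/tables/{shard}/metrics.samples', '{replica}',
        date, (name, tags, ts), 8192, 'graphite_rollup'
    );

    -- Distributed table that spreads reads and writes across the 3 shards
    CREATE TABLE IF NOT EXISTS metrics.dist_samples AS metrics.samples
    ENGINE = Distributed(prometheus_ck_cluster, metrics, samples, sipHash64(name));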
The prometheus-clickhouse-adapter component
prometheus-clickhouse-adapter (prom2click) is an adapter that lets ClickHouse serve as remote storage for Prometheus data.
The prometheus-clickhouse-adapter project lacks logging, which is not sufficient for real production use, and some details of its database connection handling are also imperfect. The improvements we made during actual use have already been submitted as PRs.
In actual use, pay attention to the volume of concurrent writes and adjust the startup parameter ch.batch in time; it is the batch size for writes to ClickHouse, and we currently set it to 65536. This matters because ClickHouse's merge engine has a limit of 300 active parts, beyond which it raises an error:
Too many parts (300). Merges are processing significantly slower than inserts
The 300 refers to the number of parts being processed (merged), not the number of batches inserted at a time.
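As a simple way to watch this limit, the number of active parts can be checked in ClickHouse's system.parts table (assuming the metrics database used above):

    -- Count data parts that have not yet been merged away, per table
    SELECT table, count() AS active_parts
    FROM system.parts
    WHERE active AND database = 'metrics'
    GROUP BY table;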
Summary
This article mainly covered our exploration and practical experience with Prometheus storage. A follow-up article will discuss a high-availability Prometheus architecture that separates querying from scraping.