Prometheus in Practice: M3, the Uber-Endorsed Storage Solution

Source: Internet
Author: User
Tags: etcd

Objective

While working on Prometheus remote storage, we have long lacked a solution endorsed by a major vendor. Ideally, a solution would meet the following requirements:

    • Endorsed by a major vendor and open source
    • Able to handle metrics at massive scale
    • Not deeply woven into Prometheus's original architecture the way Cortex is, so that Prometheus itself can be upgraded independently

Thankfully, Uber has open-sourced M3, its storage solution for Prometheus, which includes many components.

M3

Summary

To support the growth of Uber's global operations, we need to be able to quickly store and access billions of metrics on our back-end systems at any given time. As part of our robust and scalable metrics infrastructure, we built M3, a metrics platform that has been in use at Uber for many years.
M3 can reliably store large-scale metrics over long retention periods. To offer these benefits to the wider community, we decided to open-source the M3 platform as a remote storage backend for Prometheus, a popular monitoring and alerting solution. As its documentation notes, Prometheus's scalability and durability are limited by single nodes. The M3 platform aims to provide secure, scalable, and configurable multi-tenant storage for Prometheus metrics.

M3 was released in 2015 and currently houses more than 6.6 billion time series. M3 aggregates 500 million metrics per second and continuously persists 20 million metrics per second to storage on a global scale (using M3DB), using batched writes that persist three replicas of each metric within a zone. It also lets engineers author metric policies that tell M3 to keep certain metrics for shorter or longer retention periods (two days, one month, six months, one year, three years, five years, etc.) and at specific granularities (one second, ten seconds, one minute, ten minutes, etc.). This allows engineers and data scientists to intelligently store time series at different retentions, both fine- and coarse-grained, using metric tags that match defined storage policies. For example, an engineer can choose to store all metrics whose "application" tag is "mobile_api" and whose "endpoint" tag is "signup" at both 10-second granularity for 30 days and one-hour granularity for 5 years.

PS: Uber's metric volume is truly massive, and this solution satisfies the requirements listed in the objective above.
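
To make the storage-policy idea concrete, here is a minimal sketch of how such a policy could be expressed as a mapping rule in m3coordinator's downsampling configuration (the rule name and filter values mirror the example above and are purely illustrative):

downsample:
  rules:
    mappingRules:
      # Keep metrics tagged application=mobile_api and endpoint=signup at
      # 10-second granularity for 30 days and 1-hour granularity for 5 years.
      - name: "mobile_api signup metrics"
        filter: "application:mobile_api endpoint:signup"
        aggregations: ["Last"]
        storagePolicies:
          - resolution: 10s
            retention: 720h    # 30 days
          - resolution: 1h
            retention: 43800h  # 5 years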

Multi-Region query

Cluster architecture

Component Introduction

M3 Coordinator

M3 Coordinator is a service that coordinates reads and writes between upstream systems such as Prometheus and downstream M3DB clusters. It is a bridge that users can deploy to gain the benefits of M3DB, such as long-term storage and multi-DC setups, alongside other monitoring systems such as Prometheus.

M3DB

M3DB is a distributed time series database that provides scalable storage and a reverse index of time series. It is optimized for cost-effective and reliable storage and indexing of both real-time metrics and metrics with long retention periods.

M3 Query

M3 Query is a service that contains a distributed query engine for querying both real-time and historical metrics, with support for several different query languages. It is designed to handle both low-latency real-time queries and longer-running queries that aggregate larger datasets for analytical use cases.

M3 Aggregator

M3 Aggregator is a service that runs as a dedicated metrics aggregator, providing stream-based downsampling driven by dynamic rules stored in etcd. It uses leader election and aggregation window tracking, managing this state in etcd, to reliably emit at-least-once aggregations for downsampled metrics to long-term storage. This provides cost-effective and reliable downsampling and roll-up of metrics. The same features also exist in M3 Coordinator, but the dedicated aggregator is sharded and replicated, whereas M3 Coordinator is not and therefore requires care to deploy and run in a highly available way. Work is also underway to make the aggregator easier for users to adopt without having to write their own compatible producers and consumers.
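
To give a feel for the dynamic rules the aggregator evaluates, here is a hedged sketch of a rollup rule in the downsampling configuration (the metric name http_requests and the status_code grouping tag are hypothetical):

downsample:
  rules:
    rollupRules:
      # Hypothetical rule: convert a counter to a per-second rate, roll it
      # up by status_code, and keep the result at 30s resolution for 30 days.
      - name: "http_requests by status code"
        filter: "__name__:http_requests status_code:*"
        transforms:
          - transform:
              type: "PerSecond"
          - rollup:
              metricName: "http_requests_by_status_code"
              groupBy: ["status_code"]
              aggregations: ["Sum"]
        storagePolicies:
          - resolution: 30s
            retention: 720h  # 30 days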

Integration with Prometheus

Architecture Example

M3 Coordinator Configuration

The simplest configuration for writing to a remote M3DB cluster is to run m3coordinator as a sidecar alongside Prometheus.

First, download the configuration template. Then update the namespaces and client sections to match your new cluster's configuration.

You need to specify the static IPs or hostnames of your M3DB seed nodes, as well as the name and retention values of the namespace you are setting up. You can leave the namespace's storage metrics type as unaggregated, since by default you need a cluster that receives all Prometheus metrics unaggregated. In the future you may also want to aggregate and downsample metrics for longer retention periods; you can come back and update the configuration after setting up those clusters (see the sketch after the configuration below).

listenAddress: 0.0.0.0:7201

metrics:
  scope:
    prefix: "coordinator"
  prometheus:
    handlerPath: /metrics
    listenAddress: 0.0.0.0:7203 # until https://github.com/m3db/m3/issues/682 is resolved
  sanitization: prometheus
  samplingRate: 1.0
  extended: none

clusters:
  - namespaces:
      # We created a namespace called "default" and set its retention to "48h".
      - namespace: default
        retention: 48h
        storageMetricsType: unaggregated
    client:
      config:
        service:
          env: default_env
          zone: embedded
          service: m3db
          cacheDir: /var/lib/m3kv
          etcdClusters:
            - zone: embedded
              endpoints:
                # We have five M3DB nodes, but only three are seed nodes; they are listed here.
                - M3DB_NODE_01_STATIC_IP_ADDRESS:2379
                - M3DB_NODE_02_STATIC_IP_ADDRESS:2379
                - M3DB_NODE_03_STATIC_IP_ADDRESS:2379
      writeConsistencyLevel: majority
      readConsistencyLevel: unstrict_majority
      writeTimeout: 10s
      fetchTimeout: 15s
      connectTimeout: 20s
      writeRetry:
        initialBackoff: 500ms
        backoffFactor: 3
        maxRetries: 2
        jitter: true
      fetchRetry:
        initialBackoff: 500ms
        backoffFactor: 2
        maxRetries: 3
        jitter: true
      backgroundHealthCheckFailLimit: 4
      backgroundHealthCheckFailThrottleFactor: 0.5
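
If you later set up an aggregated cluster for downsampled, longer-retention metrics, the additional namespace entry might look roughly like the following sketch (hypothetical: the namespace name and values are illustrative, and the resolution field is an assumption based on the M3 namespace schema):

      # Hypothetical aggregated namespace for downsampled metrics; the name,
      # retention, and resolution values are illustrative only.
      - namespace: metrics_agg_1m_1y
        retention: 8760h   # one year
        resolution: 1m
        storageMetricsType: aggregated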

Prometheus Configuration

remote_read:
  - url: "http://localhost:7201/api/v1/prom/remote/read"
    # To test reading even when local Prometheus has the data
    read_recent: true

remote_write:
  - url: "http://localhost:7201/api/v1/prom/remote/write"
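
Putting the pieces together, a minimal prometheus.yml that scrapes Prometheus itself and reads and writes through the local m3coordinator could look like the following sketch (the scrape job and interval are illustrative):

# Minimal prometheus.yml sketch; the scrape job and interval are
# illustrative, while the remote endpoints match the m3coordinator above.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

remote_write:
  - url: "http://localhost:7201/api/v1/prom/remote/write"

remote_read:
  - url: "http://localhost:7201/api/v1/prom/remote/read"
    # To test reading even when local Prometheus has the data
    read_recent: true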

Conclusion

This post is only a brief introduction; demos, testing, and source-code analysis will follow in later posts.
