Secret Mirror: an application-level monitoring tool for the Youku Tudou big data platform

Source: Internet
Author: User

Transferred from: http://chuansong.me/n/1208635

Motivation

In the early days of developing a business system, we tended to focus only on the core logic and neglected monitoring of the system itself. The Zenoss (ganglia) setup provided by OPS covers hardware resources well (IO, CPU load, memory, number of connections, etc.). But the metrics that sit between core functionality and hardware went unmonitored: the load of the service itself, JVM state, QPS, TPS, queue sizes, and so on. This data is not a business feature, but it provides a solid basis for later capacity expansion and troubleshooting.

Secret Mirror was designed to meet this need. It provides a lightweight data collection interface, gathers metrics from business systems, and presents them clearly as charts. It also supports real-time monitoring and alerting on key metrics, and offers users a simple operational reporting service.

Secret Mirror has been online for more than a year. After several version iterations, it now provides minute-level metric monitoring for hundreds of big data application scenarios across the group. It collects 500 million metric data points per day, and minute-level monitoring data can be retained for up to 30 days.

Example scenarios

Kafka cluster-wide traffic (bytes) comparison chart

Each IP represents a Kafka node, so you can see at a glance whether traffic is balanced and stable.

Storm application memory leak

Each curve is named IP::PID. The process on 106 is stable, while the memory of the process on 107 grows to a certain value, hits an OOM, and the process then restarts with a new process ID.

Response time distribution of web service pages

p999 = 0.196 means that, of the last 1024 samples, 99.9% completed within 0.196 seconds, and only about 0.1% (one or two requests) took more than roughly 190 milliseconds. As you can see, almost all request latencies are at the millisecond level, with only an occasional request exceeding 190 milliseconds. You can make the same comparison using P99, P98, P75, P50 and other percentiles.
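To make the percentile reading concrete, here is a minimal sketch that keeps the last 1024 samples and computes percentiles with the nearest-rank method. The class name `SlidingWindowSample` and the latency values are hypothetical, and nearest-rank is only one of several percentile definitions; the real system may compute its values differently.

```python
import math
from collections import deque


class SlidingWindowSample:
    """Keeps only the most recent `size` observations (1024, as in the text)."""

    def __init__(self, size=1024):
        self.values = deque(maxlen=size)  # the oldest sample is evicted automatically

    def update(self, value):
        self.values.append(value)

    def percentile(self, q):
        """Nearest-rank percentile: the smallest sample such that at least
        a fraction q of all samples is less than or equal to it."""
        if not self.values:
            raise ValueError("no samples recorded")
        ordered = sorted(self.values)
        rank = max(1, math.ceil(q * len(ordered)))
        return ordered[rank - 1]


window = SlidingWindowSample()
# Hypothetical latencies in seconds: 1022 fast requests and 2 slow ones.
for _ in range(1022):
    window.update(0.003)
window.update(0.196)
window.update(0.250)

p50 = window.percentile(0.50)
p999 = window.percentile(0.999)
```

With these made-up samples, P50 stays at the millisecond level while P999 lands on one of the two slow requests, which is the pattern the chart description refers to.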



Metrics

Drawing on Metrics, Secret Mirror defines four types of statistical measures:

Absolute value: queue size, cache usage, number of online users (usually instantaneous values)

Count: number of GCs, number of errors, cumulative time, total sales, etc. (usually cumulative sums)

Rate: TPS, QPS, number of users coming online per second (usually ratios over time)

Distribution: a distribution over time or over values. For example, 99.99% of calls to a request complete in under 100 milliseconds; this kind of indicator defines response performance.

Every metric collected must belong to one of the types above: it is either a single value or a distribution. We also introduce the concept of a scenario. Different business staff care about different metrics of the same system, so metrics are grouped into scenarios to make viewing and analysis easier.
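The four measure types can be sketched roughly as follows. The class names (`Gauge`, `Counter`, `Meter`, `Histogram`) are illustrative assumptions rather than Secret Mirror's actual API, and the rate here is a plain average since creation, which is simpler than what a production library would normally use.

```python
import time


class Gauge:
    """Absolute value: reads an instantaneous value on demand."""

    def __init__(self, read_fn):
        self.read_fn = read_fn

    def value(self):
        return self.read_fn()


class Counter:
    """Count: a running sum of events."""

    def __init__(self):
        self.count = 0

    def inc(self, n=1):
        self.count += n


class Meter:
    """Rate: average events per second since the meter was created."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.start = clock()
        self.events = 0

    def mark(self, n=1):
        self.events += n

    def rate(self):
        elapsed = self.clock() - self.start
        return self.events / elapsed if elapsed > 0 else 0.0


class Histogram:
    """Distribution: keeps observations and reports what fraction is below a threshold."""

    def __init__(self):
        self.values = []

    def update(self, value):
        self.values.append(value)

    def fraction_below(self, threshold):
        if not self.values:
            return 0.0
        return sum(1 for v in self.values if v < threshold) / len(self.values)


# Usage with made-up data; the fake clock keeps the rate deterministic.
queue = ["job-1", "job-2", "job-3"]
queue_size = Gauge(lambda: len(queue))

errors = Counter()
errors.inc()
errors.inc(3)

fake_now = [0.0]
qps = Meter(clock=lambda: fake_now[0])
qps.mark(50)
fake_now[0] = 10.0  # pretend ten seconds have passed

latency_ms = Histogram()
for ms in (5, 8, 12, 150):
    latency_ms.update(ms)
```

The first three types each reduce to a single number per reporting interval, while the distribution keeps enough samples to answer percentile-style questions.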

Data model and query interface

The data model has to balance functionality against access efficiency, and the query interface has to work with the model to visualize the data. In designing the monitoring data structure we borrowed a real-world investigative method: crime-scene reconstruction. The original motivation was to locate system problems quickly, which means looking for the clues of a crime scene (person, time, place, event). For troubleshooting a program, the corresponding clues are: (application, timestamp, unique process identifier, metric name, metric value).

Look back at the OOM example above. In the days when the picture had to be filled in entirely in your head, all you could do was read system logs with clumsy command-line tools on a black-and-white console. With Secret Mirror, a few simple clicks in the interface replay the scene for you.

Storage table:

The query interface is very simple. A query takes three conditions: the time interval, the metric, and the process (IP or IP+PID). We also offer several presentation modes: comparing the same metric from different sources (for example, checking load balancing), or comparing different metrics from the same source (inflow vs. outflow of the message system, hits vs. misses).
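A minimal in-memory sketch of this data model and query follows. The record fields mirror the five-part tuple described above; the names `MetricPoint` and `query`, the IP addresses, and the sample values are all hypothetical, and the real system queries HBase through Phoenix rather than filtering a Python list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricPoint:
    app: str        # application
    timestamp: int  # epoch seconds
    process: str    # unique process identifier, written as "IP::PID"
    metric: str     # metric name
    value: float    # metric value


def matches_source(process, source):
    """True if source is None, a full "IP::PID", or just the IP part."""
    if source is None:
        return True
    ip = process.split("::", 1)[0]
    return source in (process, ip)


def query(points, metric, start, end, source=None):
    """Return points of one metric with start <= timestamp < end,
    optionally restricted to one IP or one IP::PID."""
    selected = [
        p for p in points
        if p.metric == metric
        and start <= p.timestamp < end
        and matches_source(p.process, source)
    ]
    return sorted(selected, key=lambda p: (p.process, p.timestamp))


points = [
    MetricPoint("storm-app", 60, "10.0.0.106::4321", "heap_used_mb", 512.0),
    MetricPoint("storm-app", 60, "10.0.0.107::5555", "heap_used_mb", 900.0),
    MetricPoint("storm-app", 120, "10.0.0.107::5601", "heap_used_mb", 120.0),
    MetricPoint("storm-app", 120, "10.0.0.106::4321", "gc_count", 3.0),
]

# Same metric, one host: the restarted process shows up under a new PID.
by_ip = query(points, "heap_used_mb", 0, 180, source="10.0.0.107")
# Same metric, one exact process.
by_process = query(points, "heap_used_mb", 0, 180, source="10.0.0.106::4321")
```

Filtering by IP alone returns one curve per process on that host, which is what makes a restart after an OOM visible as a change of PID.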

Acquisition Client Design

The design of the collection client determines how easy the monitoring platform is to use, since business developers are the ones who use it most. What they want is the greatest benefit for the lowest cost. So we considered ease of use from two angles when designing the client:

1. Lightweight client: To monitor at the API level, the collection client has to be embedded in the host application. We chose to do lightweight statistical calculation on the client and to run a silent thread that sends the current results to back-end storage once a minute. The monitoring module must never affect the host program: even when the network is down, the host application is unaware of the exception. Synchronizing results too frequently would put too much pressure on back-end storage and also hurt the performance of the user's application. More importantly, one minute is sufficient for our real-time requirements.

2. Ultra-simple API: What users want most is to complete monitoring with a single line of code, and in practice we achieved that. We could do so because we designed the API around the 80% of requirements that are common, while the remaining 20% are met by calling a more complex API. In addition, some general-purpose monitoring, such as JVM-related monitoring, requires no code from the user at all.
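The two points above can be sketched as a small client that aggregates locally, flushes from a background thread, and swallows every send failure. `MonitorClient`, `count`, and the `send` callback are hypothetical names, not the real client API, and the sketch reports per-interval deltas, which is an assumption about how counts are shipped.

```python
import threading
from collections import defaultdict


class MonitorClient:
    def __init__(self, send, interval_seconds=60.0):
        self._send = send                  # callable that ships a dict to the back end
        self._interval = interval_seconds  # one minute, as described in the text
        self._lock = threading.Lock()
        self._counts = defaultdict(int)
        self._stop = threading.Event()
        # A daemon thread never keeps the host application alive on exit.
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()

    def count(self, name, n=1):
        """The one-line call a business developer adds to their code."""
        with self._lock:
            self._counts[name] += n

    def flush(self):
        """Send the aggregated snapshot; a failure must never reach the host."""
        with self._lock:
            snapshot = dict(self._counts)
            self._counts.clear()
        if not snapshot:
            return
        try:
            self._send(snapshot)
        except Exception:
            pass  # losing monitoring data is acceptable; disturbing the host is not

    def _run(self):
        # Event.wait returns False on timeout and True once stop() has been called.
        while not self._stop.wait(self._interval):
            self.flush()


received = []
client = MonitorClient(send=received.append)
client.count("orders.created")
client.count("orders.created", 2)
client.flush()  # normally triggered by the background thread once a minute


def broken_network(snapshot):
    raise ConnectionError("backend unreachable")


silent = MonitorClient(send=broken_network)
silent.count("orders.created")
silent.flush()  # the error is swallowed; the caller sees nothing
```

Aggregating in memory means the host pays for one dictionary update per event and one network call per minute, regardless of how many events occur.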

For the collection of monitoring data, our design goals are: long retention, tolerance for loss, near real-time delivery, and rich statistics. One phrase describes monitoring data well: "visualizing the application log."

Server-side design

HBase is a great choice when a simple table structure has to store large amounts of data. To meet our query requirements, we installed the Phoenix plugin on the HBase cluster. Phoenix supports a SQL-like language and is easy to integrate with the front-end interface.

For the receiving server we simply use Nginx plus a web server. Under higher concurrency, the receiving server can do some batching and throttling. The receiving server decouples the collection layer from the storage layer. Thanks to this decoupling, Secret Mirror supports MySQL storage in addition to HBase. The receiving server can also handle other data sources, such as collecting JMX monitoring data.
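As one possible reading of "batching and throttling" on the receiving server, the sketch below buffers incoming records, drops new ones once the buffer is full, and writes to storage in fixed-size batches. `BatchingReceiver` and its parameters are hypothetical, and a timer that calls `flush()` periodically is assumed but not shown.

```python
class BatchingReceiver:
    def __init__(self, store, batch_size=500, max_pending=10_000):
        self._store = store              # callable that writes one list of records
        self._batch_size = batch_size
        self._max_pending = max_pending
        self._pending = []
        self.dropped = 0

    def receive(self, record):
        """Buffer one record; return False if it was dropped by throttling."""
        if len(self._pending) >= self._max_pending:
            self.dropped += 1  # monitoring data is allowed to be lost
            return False
        self._pending.append(record)
        return True

    def flush(self):
        """Write everything buffered so far, batch_size records per storage call."""
        pending, self._pending = self._pending, []
        for i in range(0, len(pending), self._batch_size):
            self._store(pending[i:i + self._batch_size])


batches = []
receiver = BatchingReceiver(store=batches.append, batch_size=3, max_pending=7)
accepted = [receiver.receive({"seq": n}) for n in range(10)]
receiver.flush()
```

Because the storage callable is injected, the same receiver could sit in front of HBase or MySQL, in line with the decoupling described above.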

Data is always useful, and not just for monitoring. We wrapped the basic service layer of the data platform and built in many common metrics that track usage of every platform: traffic contributed by the message system, reconciliation of produced and consumed messages, request volume, cache hit rate, amount of data scanned, and so on. Secret Mirror opens up its data access interface so that users can build custom reports and platform administrators can generate resource consumption reports. Its near real-time nature (within a minute) also makes it suitable for SMS and email alerts.

Conclusions and recommendations

In short, Secret Mirror presents an application's operational logs graphically and lets you compare them in multiple ways over any time range, which greatly reduces the difficulty of troubleshooting. Reports give a more intuitive understanding of the program, and early warning helps avoid some problems. Secret Mirror is a data engine that depicts the state of the data platform ecosystem; making the most of it still requires a carefully designed interactive UI or reports.

Client

Sort through the requirements and meet the most common ones with the simplest API; trying to cover everything inevitably makes the API more complex and harder to use;

Do not deliberately pursue highly real-time data; raising cost by 80% for a 1% gain is not worth it;

Stay silent: monitoring must not affect the running of the application itself;

Server side

Decouple, so that both capacity expansion and feature upgrades are easy to carry out;

A middleware layer for data processing strategies makes your basic services more stable, efficient and flexible.

Storage side

Phoenix on HBase lets you use SQL instead of tedious scan queries. Understanding how HBase stores data helps you design more efficient Phoenix tables, for example by placing the fields that appear most often in query conditions first. For larger volumes of data, you can split tables by time and keep delete operations separate from append operations, which avoids IO storms.
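The two storage tips can be illustrated with a small sketch: a row key that puts the most frequently filtered fields first, and one table per day so that expired data is removed by dropping whole tables instead of deleting rows. The key layout, the `metrics_` table prefix, the field order, and the dates are all assumptions made for illustration, not the actual schema.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def row_key(app, metric, process, ts):
    """Fields used in most query conditions come first, so rows for one
    (app, metric) pair sort next to each other; the timestamp is zero-padded
    so that string order matches numeric order."""
    return "|".join([app, metric, process, f"{ts:010d}"])


def table_for(ts, prefix="metrics_"):
    """Name of the daily table that holds a given epoch timestamp (UTC)."""
    day = datetime.fromtimestamp(ts, tz=timezone.utc)
    return prefix + day.strftime("%Y%m%d")


def expired_tables(existing, now_ts, retention_days=30, prefix="metrics_"):
    """Tables older than the retention window. Dropping them replaces
    row-by-row deletes, so deletes never compete with appends."""
    cutoff = table_for(now_ts - retention_days * SECONDS_PER_DAY, prefix)
    return sorted(t for t in existing if t.startswith(prefix) and t < cutoff)


# An arbitrary example date, chosen only to exercise the functions.
now = int(datetime(2016, 3, 15, 12, 0, tzinfo=timezone.utc).timestamp())
key = row_key("storm-app", "heap_used_mb", "10.0.0.107::5555", 60)
today = table_for(now)
to_drop = expired_tables(
    ["metrics_20160213", "metrics_20160214", "metrics_20160315"], now
)
```

Appends only ever touch the current day's table, while cleanup only touches tables that are no longer written to.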
