Secret Mirror: The Application-Level Monitoring Tool of the Youku Tudou Big Data Platform

One of the ten legendary artifacts of antiquity: the Secret Mirror is also known as the Kunlun Mirror. Said to belong to the Queen Mother of the West on Mount Kunlun, it can see into every secret and knows all of past and present!

1. Motivation

In the early stages of business system development, we tend to focus only on business logic and ignore monitoring of the system itself. For hardware resources, the Ganglia and Zenoss deployments provided by our operations colleagues already meet our needs: they monitor each machine's disk, CPU load, memory, load, connection count, and so on. But the layer of monitoring data that sits between core business functionality and hardware metrics is currently blank, for example the load on the service itself, JVM usage, QPS, TPS, and queue sizes. These data are not business features, but they provide a solid basis for later capacity expansion and troubleshooting.

Secret Mirror was born to fill this gap. It provides a lightweight data collection interface that gathers the various metrics of a business system and presents them as charts, displaying each metric visually and clearly. It also offers real-time monitoring and alerting on user metrics, as well as customized reporting services.

Currently, Secret Mirror serves hundreds of monitoring scenarios for big data applications, collecting 500 million monitoring data points per day and keeping them for up to 30 days. The storage cluster uses only four nodes and could still absorb three times the current traffic.

2. Functional design

Secret Mirror provides a graphical query interface that presents the data as curves. Here are some use cases:

Figure 1. Kafka cluster load-balancing comparison (the chart shows byte traffic for different IPs)

Figure 2. Memory leak in a Storm application (the curve name is IP::pid; the process on 106 is stable, while memory on the 107 process grows to a certain value, the process OOMs and restarts, and its process ID changes)

Figure 3. Latency monitoring of a method call (each point means that 99.9% of the calls in the most recent sample pool took less than 0.19 s; the average, p50, p75, p98, p99, and so on are also available)

Do these look familiar? They are the kinds of indicators we care about in daily work. Here we give them a more formal name: a "dimension", that is, one angle from which to observe an application. An application has many indicators worth watching, so we can monitor it from multiple dimensions. Secret Mirror uses Java Metrics (an open-source metrics package, https://dropwizard.github.io/metrics/3.1.0/) and classifies monitoring behavior into several categories: a. absolute value; b. count; c. rate; d. time distribution; e. numerical result distribution. These five metric types cover essentially any measurement need. Some simple examples follow, with a code sketch after the list:

Absolute value: queue size, buffer usage (basically any size-like quantity)

Count: number of GCs, cumulative GC time, number of 403 responses, number of times error1 was returned

Rate: TPS, QPS, calls per second of function1

Time distribution: 50% (75%, 98%, 99%, 99.9%) of function1 calls finish in under how many seconds, plus the maximum and average latency

Numerical result distribution: 50% (75%, 98%, 99%, 99.9%) of function1 return values are below what number
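
To make the mapping concrete, here is a minimal sketch using the Dropwizard Metrics library mentioned above, where Gauge, Counter, Meter, Timer, and Histogram correspond to the five categories; the registry setup and metric names are illustrative, not Secret Mirror's actual client code.

    import com.codahale.metrics.Counter;
    import com.codahale.metrics.Gauge;
    import com.codahale.metrics.Histogram;
    import com.codahale.metrics.Meter;
    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.Timer;

    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    public class MetricTypesExample {
        public static void main(String[] args) {
            MetricRegistry registry = new MetricRegistry();
            LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();

            // a. Absolute value: a Gauge reports the current queue size whenever it is polled
            registry.register("queue.size", (Gauge<Integer>) queue::size);

            // b. Count: a Counter accumulates, e.g. the number of 403 responses
            Counter forbidden = registry.counter("http.403.count");
            forbidden.inc();

            // c. Rate: a Meter tracks events per second (QPS/TPS)
            Meter calls = registry.meter("function1.calls");
            calls.mark();

            // d. Time distribution: a Timer records latency percentiles (p50/p75/p98/p99/p999)
            Timer timer = registry.timer("function1.latency");
            Timer.Context ctx = timer.time();
            // ... invoke function1 here ...
            ctx.stop();

            // e. Numerical result distribution: a Histogram records value percentiles
            Histogram returned = registry.histogram("function1.return.value");
            returned.update(42);

            // One local snapshot; Secret Mirror's client instead ships these numbers once a minute
            System.out.printf("p99 latency = %.3f ms%n",
                    timer.getSnapshot().get99thPercentile() / TimeUnit.MILLISECONDS.toNanos(1));
        }
    }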

The examples above cover roughly 80% of requirements, and when we designed the collection client we aimed squarely at that 80%. Under this premise, the default settings of the common API work in most cases.

Data model and query interface

The design of the data model has to balance functionality against access efficiency, and the query interface should make clever use of the data in the model to present it to users intuitively and from multiple angles. When designing the structure of the monitoring data we took a real-world crime scene as the reference, because the original motivation was to locate system problems quickly. What we really need is (person, time, place, event), or more concretely: (application, timestamp, unique process identifier, dimension and dimension value). Go back and look at the OOM example above: in the days when we had to visualize everything in our heads, we could only inspect system logs from a black-and-white console with ugly command lines; with Secret Mirror, a few clicks on the interface replay the whole scene. The storage table structure is as follows:

AppID: unique identifier of the application
SceneID: scene ID, unique within the application
Timestamp: the timestamp
Location: where the metric was reported from; it can be an IP, an IP+port, or a user-defined identifier
DimValue: the specific metric name; for example, in a load scenario the metrics might be QPS, flush rate, buffer size, and so on
KpiValue: the value of the corresponding metric; it can be a rate, a percentage, or an absolute size
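
As a rough illustration, this structure could be expressed through Phoenix's SQL layer along the following lines; the table name, column types, and row-key order are assumptions for the sketch, not the production schema.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CreateMonitorTable {
        public static void main(String[] args) throws Exception {
            // Phoenix exposes HBase through JDBC; the ZooKeeper quorum here is illustrative.
            // Older clients may first need Class.forName("org.apache.phoenix.jdbc.PhoenixDriver").
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
                 Statement stmt = conn.createStatement()) {
                // Row key = (AppID, SceneID, Timestamp, Location, DimValue), so scans over one
                // application, scene, and time range stay contiguous in HBase.
                stmt.execute(
                    "CREATE TABLE IF NOT EXISTS MONITOR_DATA (" +
                    "  APPID    VARCHAR NOT NULL," +
                    "  SCENEID  VARCHAR NOT NULL," +
                    "  TS       TIMESTAMP NOT NULL," +
                    "  LOCATION VARCHAR NOT NULL," +
                    "  DIMVALUE VARCHAR NOT NULL," +
                    "  KPIVALUE DOUBLE," +
                    "  CONSTRAINT PK PRIMARY KEY (APPID, SCENEID, TS, LOCATION, DIMVALUE))");
            }
        }
    }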

The query interface is very simple: you set a time interval, a dimension, and a process (IP or IP+pid). We also provide several display modes: you can put different curves of the same dimension on one chart (for example, the load-balancing comparison), or put several dimensions of a group of IPs on one chart (for example, comparing the inflow and outflow traffic of the message system, or comparing hit and miss counts).
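
Under those conditions, fetching one curve might look like the sketch below, which continues the hypothetical MONITOR_DATA schema from the previous snippet.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Timestamp;

    public class QueryDimension {
        // Prints the (timestamp, value) curve of one dimension for one process over a time window.
        static void printCurve(Connection conn, String app, String scene, String dim,
                               String location, Timestamp from, Timestamp to) throws SQLException {
            String sql = "SELECT TS, KPIVALUE FROM MONITOR_DATA "
                       + "WHERE APPID = ? AND SCENEID = ? AND DIMVALUE = ? AND LOCATION = ? "
                       + "AND TS BETWEEN ? AND ? ORDER BY TS";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, app);
                ps.setString(2, scene);
                ps.setString(3, dim);
                ps.setString(4, location);
                ps.setTimestamp(5, from);
                ps.setTimestamp(6, to);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getTimestamp("TS") + " -> " + rs.getDouble("KPIVALUE"));
                    }
                }
            }
        }
    }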

Acquisition Client Design

The design of the collection client determines the ease of use of the monitoring platform. Its users are usually business developers, and the goal is to exchange the smallest cost for the greatest benefit. So when designing the client we considered its ease of use from several angles:

1. Lightweight client: to monitor at the API level, the collection client must first be embedded in the application. We chose to do lightweight statistical computation on the client side and to start a silent thread that sends the current results to the back-end storage once a minute; when the network is unreliable, the client does not let the application perceive the exception. Reporting statistics too frequently would not only put too much pressure on the back-end storage, it would also hurt the performance of the user's application; more importantly, one-minute freshness is real-time enough. (A sketch of such a reporter appears after this list.)

2. Ultra-simple API: what users want most is to finish the monitoring work with a single line of code, and in practice we did achieve this. We can do so because we combed out the 80% of common needs; the remaining 20% require calling more complex APIs, and some common monitoring, such as JVM-related metrics, needs no setup at all. (A sketch of such a one-line facade also follows.)
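
A minimal sketch of such a silent, once-a-minute reporter is shown below, assuming a hypothetical HTTP collection endpoint; errors are deliberately swallowed so the host application never notices them.

    import com.codahale.metrics.MetricRegistry;

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class SilentReporter {
        private final MetricRegistry registry;
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor(r -> {
                    Thread t = new Thread(r, "secret-mirror-reporter");
                    t.setDaemon(true);  // never keep the host application alive
                    return t;
                });

        public SilentReporter(MetricRegistry registry) {
            this.registry = registry;
        }

        public void start() {
            // Ship one snapshot per minute; losing a minute of data is acceptable by design
            scheduler.scheduleAtFixedRate(this::reportOnce, 1, 1, TimeUnit.MINUTES);
        }

        private void reportOnce() {
            try {
                byte[] body = snapshotAsBytes();
                HttpURLConnection conn = (HttpURLConnection)
                        new URL("http://collector.example.com/report").openConnection(); // hypothetical endpoint
                conn.setConnectTimeout(2000);
                conn.setReadTimeout(2000);
                conn.setDoOutput(true);
                conn.setRequestMethod("POST");
                try (OutputStream out = conn.getOutputStream()) {
                    out.write(body);
                }
                conn.getResponseCode();
            } catch (Exception ignored) {
                // Network trouble must never propagate into the monitored application
            }
        }

        private byte[] snapshotAsBytes() {
            // Placeholder serialization: a real client would encode counters, meters, timers, etc.
            return ("counters=" + registry.getCounters().size()).getBytes(StandardCharsets.UTF_8);
        }
    }

And the "one line of code" experience could be provided by a thin facade like this one; the class and method names are invented for illustration and are not Secret Mirror's published API.

    // Hypothetical facade over the metrics registry: one static call per monitored event
    public final class Mirror {
        private static final com.codahale.metrics.MetricRegistry REGISTRY =
                new com.codahale.metrics.MetricRegistry();

        private Mirror() {}

        // Typical usage inside business code -- a single line:
        //   Mirror.count("order.create.error");
        public static void count(String name) {
            REGISTRY.counter(name).inc();
        }

        public static void markRate(String name) {
            REGISTRY.meter(name).mark();
        }
    }

Business code then only ever calls Mirror.count(...) or Mirror.markRate(...), while the registry and the reporter stay hidden behind the facade.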

So our positioning for the collected monitoring data is: long retention, loss tolerated, near real time, rich statistics. It may be apt to describe it in one phrase: "visualized application logs."

Server-side design

HBase is a great choice when a simple table structure has to store large amounts of data. To support the query requirements, we installed the Phoenix plugin on the HBase cluster. Phoenix supports an SQL-like language and is easy to integrate with the front-end interface. For the receiving server we simply use Nginx plus a web server; for higher concurrency, the receiving server can do some batching and throttling. One benefit of the receiving server is that it decouples the collection tier from the storage tier, which lets us support MySQL storage in addition to HBase. It also allows the receiving server to handle different data sources, for example collecting JMX monitoring data.
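
For the JMX data sources mentioned above, metrics can be polled with the standard javax.management API; the sketch below reads heap usage from the local platform MBean server, purely to illustrate the idea (a remote collector would connect through JMXConnectorFactory instead).

    import java.lang.management.ManagementFactory;

    import javax.management.MBeanServer;
    import javax.management.ObjectName;
    import javax.management.openmbean.CompositeData;

    public class JmxHeapProbe {
        public static void main(String[] args) throws Exception {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName memory = new ObjectName("java.lang:type=Memory");

            // HeapMemoryUsage is a CompositeData with init/used/committed/max fields
            CompositeData heap = (CompositeData) server.getAttribute(memory, "HeapMemoryUsage");
            long used = (Long) heap.get("used");
            long max = (Long) heap.get("max");

            System.out.printf("heap used = %d bytes of %d%n", used, max);
        }
    }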

Figure 4. Overall architecture of Secret Mirror

Data is always useful, and not just for monitoring. We have wrapped the basic service layer of the data platform and built up many common metrics, so we can produce approximate resource-usage monitoring for all users of the platform, such as each user's traffic contribution to the message system, reconciliation of produced versus consumed messages, request counts, cache hit rate, data scan volume, and so on. Secret Mirror also opens up a data access interface: users can build custom reports, and platform administrators can generate resource consumption reports. In addition, its near-real-time (within a minute) data can drive SMS and email alerts.
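
One simple way to drive such alerts from the near-real-time data is to poll the latest minute and compare it with a threshold; the query, schema, and notification hook in this sketch are all hypothetical.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Timestamp;

    public class ThresholdAlert {
        // Hypothetical rule: alert when the most recent value of one dimension exceeds a limit
        static void check(Connection conn, String app, String scene, String dim,
                          String location, double limit) throws SQLException {
            Timestamp oneMinuteAgo = new Timestamp(System.currentTimeMillis() - 60_000L);
            String sql = "SELECT KPIVALUE FROM MONITOR_DATA "
                       + "WHERE APPID = ? AND SCENEID = ? AND DIMVALUE = ? AND LOCATION = ? "
                       + "AND TS >= ? ORDER BY TS DESC LIMIT 1";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, app);
                ps.setString(2, scene);
                ps.setString(3, dim);
                ps.setString(4, location);
                ps.setTimestamp(5, oneMinuteAgo);
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next() && rs.getDouble(1) > limit) {
                        // Plug SMS or email delivery in here; println stands in for it
                        System.out.println("ALERT: " + dim + " = " + rs.getDouble(1) + " > " + limit);
                    }
                }
            }
        }
    }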

3. Some conclusions and recommendations

In short, what Secret Mirror does is turn an application's operational logs into graphics, and it can present multi-way comparisons over any time range, which greatly reduces the difficulty of troubleshooting. Its reports let us understand a program more intuitively, and its early warnings help us avoid some problems. Secret Mirror is a data engine that portrays the state of the data platform ecosystem, although it of course still requires a carefully designed interactive UI and reports.

Client

Comb through the requirements; the simplest API should satisfy the most common needs. If you try to cover every need, the API inevitably becomes complex and hard to use;

Do not deliberately pursue highly real-time data; raising the cost by 80% for a 1% gain in value is not worth the candle;

Since this is a "visual log", losing some data is acceptable, for the same reason as above;

Stay silent: monitoring must never affect the running of the application itself;

Server side

Decouple: whether you later upgrade the system for scale or for functionality, decoupling pays off greatly;

A middleware-style data processing strategy (such as batching and throttling) makes the basic service more stable and efficient;

Storage side

The biggest problem we encountered with HBase was that deleting data triggered IO storms, and we also hit a case where Phoenix 0.4.0 exhausted the CPU (fixed in 0.4.2). Our solution to both is to split tables by time. In other words, a table rolls over like a log file: deleting old data never causes an IO storm, because the tables being dropped have nothing to do with the table currently being written, and the amount of data in a single table drops sharply, which keeps queries efficient. The drawback is that queries need a short time-interval check, and cross-table queries become cumbersome, since the results of two SQL queries have to be merged.
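
The rolling scheme can be as simple as deriving the table name from the time and merging results when a query window spans several tables; the monthly naming convention below (MONITOR_DATA_yyyyMM) is our illustration, and the real granularity may differ.

    import java.time.Instant;
    import java.time.YearMonth;
    import java.time.ZoneId;
    import java.util.ArrayList;
    import java.util.List;

    public class RollingTables {
        // Map a month to its rolling table, e.g. MONITOR_DATA_201607 (illustrative naming)
        static String tableFor(YearMonth ym) {
            return String.format("MONITOR_DATA_%04d%02d", ym.getYear(), ym.getMonthValue());
        }

        // A query window that crosses a month boundary touches several tables whose
        // results have to be fetched separately and merged by the query layer.
        static List<String> tablesFor(Instant from, Instant to) {
            List<String> tables = new ArrayList<>();
            YearMonth cur = YearMonth.from(from.atZone(ZoneId.systemDefault()));
            YearMonth end = YearMonth.from(to.atZone(ZoneId.systemDefault()));
            while (!cur.isAfter(end)) {
                tables.add(tableFor(cur));
                cur = cur.plusMonths(1);
            }
            // Expiring old data is then just "DROP TABLE MONITOR_DATA_201606": it never
            // touches the table currently being written, so there is no IO storm.
            return tables;
        }
    }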
