Zhu Ye's Internet Architecture Practice Experience S1E4: Six Easy-to-Use Monitoring Brothers

Source: Internet
Author: User
Tags kibana grafana logstash influxdb

The "six brothers" here are simply the ELK stack (Elasticsearch + Logstash + Kibana) and the TIG stack (Telegraf + InfluxDB + Grafana).

The architecture diagram shows the two independent systems, ELK and TIG (TIG is a name I made up myself; unlike ELK, it is not an established term).

Both systems consist of collectors, storage, and a display site. In the diagram, collectors are green, storage is blue-green, and display sites are red.

The components of both systems are free to use, and installation and configuration are relatively simple. (The companies behind them need to make money, of course, so they push their cloud editions; we generally do not use those and deploy locally.)

The ELK system is used mostly to collect, store, search, view, and alarm on log data.

The TIG system is used mostly to collect, store, view, and alarm on metrics data.

With ELK, log volume is often large and sudden bursts of logs are common, while index writes are not that fast, so a message queue such as Kafka is usually placed in front as a buffer.

With ELK there is also a need to filter and parse logs, and to trigger additional alarms, before the data enters ES. Logstash can serve as this aggregation and processing layer, with its rich set of plugins handling all sorts of processing. Logstash's performance is not that high, however, and it consumes a lot of resources, so use it with care.
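As a rough illustration of the parsing such a processing layer performs before events reach ES, the sketch below turns a raw log line into a structured event. The log layout, the field names, and the `parse_line` helper are assumptions made for this example; this is not Logstash's own API.

```python
import re
from typing import Optional

# Hypothetical log layout: "<date> <time> <LEVEL> [<request_id>] <message>"
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+ \S+) (?P<level>[A-Z]+) "
    r"\[(?P<request_id>[^\]]*)\] (?P<message>.*)$"
)


def parse_line(line: str) -> Optional[dict]:
    """Return the structured fields of one log line, or None if it does not match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None  # a real pipeline might tag the event as unparsed instead
    return match.groupdict()


event = parse_line("2018-09-01 12:00:03 ERROR [req-42] payment timeout")
print(event)
```

Once the request ID is a separate field, searching all logs of one request becomes a simple field query instead of a text scan.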

About ELK

The screenshot shows the Kibana interface. Here we can see that the logs of the microservice components have been collected into ES. Kibana lets us search with expressions; the most common search is by the RequestID that ties together a whole chain of microservice calls, or by the user's UserID, to find the related logs. In many companies, developers are used to logging on to servers one by one to search the logs, and the slightly better ones use Ansible to search in bulk. This is actually very inconvenient:

    • Searching text files is much slower than searching the ES index.
    • Searching text across large files consumes considerable memory and CPU on the server and affects the business.
    • File logs are usually archived and compressed, so searching logs from any day other than today is inconvenient.
    • Permissions are hard to control, and opening raw log files for querying may create a security risk of information disclosure.
    • By contrast, while collecting data into ES we can do a lot of extra work, including desensitization, storing to other data sources, and sending email and IM notifications (for example through integration with Slack or DingTalk robots).
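The desensitization step mentioned in the last item can be sketched as follows. The two patterns and the masking rule are assumptions chosen for illustration; real rules depend on what sensitive data the logs actually contain.

```python
import re

# Assumed patterns for this sketch only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
LONG_DIGITS = re.compile(r"\d{11,}")  # e.g. phone or card numbers


def mask_digits(match: re.Match) -> str:
    """Keep the first 3 and last 2 digits and replace the rest with stars."""
    digits = match.group(0)
    return digits[:3] + "*" * (len(digits) - 5) + digits[-2:]


def desensitize(message: str) -> str:
    """Mask e-mail addresses and long digit runs in a log message."""
    message = EMAIL.sub("<email>", message)
    return LONG_DIGITS.sub(mask_digits, message)


print(desensitize("user a.b@example.com phone 13812345678"))
```

Running the masking before the event is indexed means the raw values never reach the searchable store.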

About exceptions

I have always held the view that exceptions cannot be emphasized too much, especially unhandled exceptions that bubble up to the business surface and system exceptions inside services. We can divide exceptions into business exceptions, which business logic raises on purpose, and system exceptions, which cannot be known in advance. A system exception usually means that the underlying infrastructure (network, database, middleware) is jittering or failing, or that the code has a bug (or, if not strictly a bug, an incomplete piece of logic). We need to investigate the root cause of every such exception one by one; if there is no time to investigate right away, record it and investigate later. In some systems with especially heavy traffic there can be hundreds of thousands of exceptions per day, probably of more than 100 different kinds. At the very least, do the following:

    • Comb through the code thoroughly and do not swallow exceptions. Very often the cause of a bug cannot be found because we do not know which exception was swallowed at that point. With ELK we can easily search and filter logs, so logging exceptions and abnormal flows is very helpful for fixing bugs.
    • Monitor and alarm on the frequency of exceptions. For example, if XxException occurred 200 times in the last minute, over time we develop a feel for these exceptions and know from that number that it must be jitter. If XxException occurred 10,000 times in the last minute, we know it is probably not network jitter but a sign that a dependent service is going down, and the emergency troubleshooting process must start immediately.
    • Make sure to pay attention to and handle 100% of null pointer, array index out of bounds, concurrency, and similar exceptions. Each of them is basically a bug that stops the business flow from continuing. Because their absolute numbers are small, they sometimes get buried among other exceptions, so we need to go through them every day and fix them one by one. If such an exception breaks a user's normal flow, that user may be lost; the user may be only one among tens of thousands, but the experience for that user is very poor. I have always felt that we should solve problems before users discover them. Ideally, by the time customer feedback arrives, the problem is already a known issue with a scheduled fix time (most users of free Internet products do not call customer service when a problem blocks their flow; they simply give up on the product).
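The frequency alarm described in the second item can be sketched as a sliding-window counter per exception type. The class name, the 60-second window, and the threshold are assumptions for this example; in practice the counting and alarming would more likely be configured in the monitoring stack than hand-written.

```python
from collections import defaultdict, deque


class ExceptionRateMonitor:
    """Count occurrences of each exception type over a sliding time window."""

    def __init__(self, window_seconds: float = 60.0, alarm_threshold: int = 10_000):
        self.window_seconds = window_seconds
        self.alarm_threshold = alarm_threshold
        self._events = defaultdict(deque)  # exception name -> occurrence times

    def record(self, name: str, now: float) -> bool:
        """Record one occurrence at time `now` (seconds); return True if the alarm should fire."""
        times = self._events[name]
        times.append(now)
        # Drop occurrences that have fallen out of the window.
        while times and times[0] <= now - self.window_seconds:
            times.popleft()
        return len(times) >= self.alarm_threshold

    def count(self, name: str) -> int:
        """Occurrences still inside the window as of the last `record` call."""
        return len(self._events[name])


monitor = ExceptionRateMonitor(window_seconds=60, alarm_threshold=3)
for t in (0.0, 10.0, 20.0):
    fired = monitor.record("XxException", t)
print(fired, monitor.count("XxException"))
```

Keeping a separate threshold per exception type is a natural extension, since the text distinguishes a level that indicates jitter from one that indicates a failing dependency.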

Going one step further, we can assign an ID to every error. If the error can reach the user, show this ID in a not-too-conspicuous place on the 500 page. If the user then reports the problem with a screenshot, we can use the error ID to find the corresponding error in ELK and locate the problem in one step.
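A minimal sketch of the error ID idea, assuming a hypothetical `handle_unexpected_error` hook that a web framework would call for unhandled exceptions. The ID format (12 hex characters) and the response text are illustrative choices.

```python
import logging
import uuid
from typing import Tuple

logger = logging.getLogger("app.errors")


def handle_unexpected_error(exc: Exception) -> Tuple[int, str]:
    """Log the exception under a fresh error ID and build the 500 response."""
    error_id = uuid.uuid4().hex[:12]
    # The same ID goes into the log, so it can be searched for in Kibana later.
    logger.error("error_id=%s unhandled exception", error_id, exc_info=exc)
    body = f"Something went wrong. Error ID: {error_id}"
    return 500, body


status, body = handle_unexpected_error(ValueError("boom"))
print(status, body)
```

The page shows only the opaque ID, while the stack trace stays in the log.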

About TIG

The screenshot shows Grafana. Grafana supports quite a few data sources, and InfluxDB is one of them; Graphite is a product similar to InfluxDB and is also a good choice. Telegraf is the data collection agent from the company behind InfluxDB, and it comes with many plugins. These plugins are not complex, and we could easily write them ourselves in Python at the cost of a little time; since ready-made ones exist, we simply use them. Put plainly, they collect formatted data from the stats interfaces exposed by each middleware and write it to InfluxDB. Let's take a look at the plugins Telegraf supports (image captured from https://github.com/influxdata/telegraf).

With these plugins, used as they are or developed ourselves, it takes little effort to monitor all of our basic components.

About instrumentation points

As shown in the architecture diagram at the beginning of the article, besides using the various Telegraf plugins to collect storage, middleware, and system-level metrics, we also built a small class library, MetricsClient, so that programs can write all kinds of instrumentation data to InfluxDB. In fact, each measurement record written to InfluxDB is just an event with the following information:

    • Timestamp
    • Various tags used for searching
    • Value (elapsed time, number of executions)
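The three parts of such an event map directly onto InfluxDB's line protocol, which a client could produce as in the sketch below. The `to_line_protocol` helper, the tag names, and the single `value` field are assumptions for illustration; the article's own class library is not shown in the source.

```python
import time
from typing import Dict, Optional


def escape_tag(text: str) -> str:
    """Escape commas, equals signs and spaces, which are special in tag keys and values."""
    return text.replace(",", r"\,").replace("=", r"\=").replace(" ", r"\ ")


def to_line_protocol(measurement: str, tags: Dict[str, str], value: float,
                     timestamp_ns: Optional[int] = None) -> str:
    """Format one event as: measurement,tag=... value=<float> <timestamp in ns>."""
    if timestamp_ns is None:
        timestamp_ns = time.time_ns()
    tag_part = "".join(
        f",{escape_tag(key)}={escape_tag(val)}" for key, val in sorted(tags.items())
    )
    # The measurement name is assumed to contain no commas or spaces.
    return f"{measurement}{tag_part} value={float(value)} {timestamp_ns}"


line = to_line_protocol(
    "bankservice",
    {"operation": "sync transfer", "result": "success"},  # hypothetical tags
    12.5,
    timestamp_ns=1_500_000_000_000_000_000,
)
print(line)
```

Tags are indexed and therefore suit the values we filter and group by in Grafana, such as the operation name or the result.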

As the figure for this bankservice shows, we record events for the successes, business exceptions, and system exceptions of various synchronous and asynchronous operations. After some simple configuration in Grafana, the required charts can be presented.

MetricsClient can be called manually in code or applied through AOP. We can even apply this aspect to all methods, so that the execution count, elapsed time, and result (normal, business exception, system exception) of every method are collected automatically into InfluxDB, and then configure the dashboards we need in Grafana for monitoring.
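In Python, the AOP-style collection described above can be approximated with a decorator. In this sketch `BusinessError`, `record_metric`, and the in-memory `recorded` list are hypothetical stand-ins; a real client would write each event to InfluxDB instead.

```python
import functools
import time


class BusinessError(Exception):
    """Hypothetical base class for exceptions raised on purpose by business logic."""


recorded = []  # stand-in for a real client that writes to InfluxDB


def record_metric(name: str, result: str, elapsed_ms: float) -> None:
    recorded.append({"name": name, "result": result, "elapsed_ms": elapsed_ms})


def measured(func):
    """Record the result category and duration of every call to `func`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = "normal"
        try:
            return func(*args, **kwargs)
        except BusinessError:
            result = "business_exception"
            raise
        except Exception:
            result = "system_exception"
            raise
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            record_metric(func.__name__, result, elapsed_ms)
    return wrapper


@measured
def transfer(amount: int) -> int:
    if amount <= 0:
        raise BusinessError("amount must be positive")
    return amount
```

Each call produces exactly one event, and the exception is re-raised unchanged, so the decorator does not alter the behavior of the wrapped method.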

For the RPC framework, it is also recommended to build instrumentation into the framework itself, recording every execution of every RPC method. With graphs configured down to method granularity, a suspected problem can be located in one step when an incident occurs. AOP aspects plus automatic instrumentation in the RPC framework already cover most needs, but it is even better to add instrumentation for some business-level operations in the code.

Suppose we configure two graphs for each business behavior, one for the number of calls and one for call performance, as in the figure. Then:

    • When a problem occurs, we can determine within a short time which part is at fault.
    • We can also make an initial judgment about whether the problem is caused by errors or by a sudden increase in load.

The recommended setup is to follow the data flow from front to back and configure graphs for the volume and performance of data processing at each stage:

    • Data coming in from upstream
    • Data sent to MQ
    • Data received from MQ
    • Data from MQ that has finished processing
    • Requests sent to external systems
    • Requests that received an external response
    • Requests written to the database
    • Requests that looked up the cache
Being able to narrow a problem down quickly to the faulty module, or at least to the line of business, is far better than running around like a headless fly (of course, if we never configure the dashboards we need, all of this is useless). Dashboards must be maintained continuously as the business iterates. Do not let earlier instrumentation fall into disuse after a few iterations, only to find when a problem occurs that every dashboard shows zero calls.


Grafana works very well with an InfluxDB data source, but running queries against MySQL through it never feels particularly convenient. Here I recommend the open source system Metabase, with which we can easily save some SQL for business or monitoring statistics. You might say that business statistics are the concern of operations and should be done through BI, so why should we build these charts? My answer is that even on the technical side we had better have a small business panel. The point is not to focus on the business itself, but to have a place where we can see how the business is running and, at critical moments, judge the scope of the impact.

By now you may have seen that, with these six brothers, we are in fact building a multi-dimensional monitoring system. Here are a few steps for troubleshooting problems; after all, when a major problem occurs we often have only a few minutes:

    • Watch the alarms for exceptions and system-level pressure, and the alarms for business volume dropping to zero (meaning a sudden drop of more than 30%).
    • Use the business dashboards configured in Grafana to determine which module of the system has a pressure or performance problem.
    • Use the service call and traffic panels configured in Grafana to rule out upstream and downstream problems and locate the faulty module.
    • Check the module for errors or exceptions in Kibana.
    • Based on the error screenshot from customer feedback, find the error ID and search Kibana for the logs of the full call chain to find the problem.
    • For the details, another trick is to check the request log. On the web side of the system we can add a switch that, under certain conditions, turns on detailed logging of HTTP requests and responses. With the detailed data of every request, we can "see" a user's entire visit to the site based on the user information, which is very helpful for troubleshooting. This amount of data can be very large, of course, so such a heavy trace function must be turned on with care.
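The volume-drop alarm in the first step above can be sketched as a comparison between two consecutive time buckets. The function name and the choice of comparing adjacent buckets are assumptions; the 30% figure comes from the text.

```python
def volume_dropped(previous: int, current: int, max_drop: float = 0.30) -> bool:
    """Return True when `current` is more than `max_drop` below `previous`."""
    if previous <= 0:
        return False  # nothing to compare against
    drop = (previous - current) / previous
    return drop > max_drop


# Example: calls per minute for one business operation (made-up numbers).
print(volume_dropped(1000, 650))
print(volume_dropped(1000, 800))
```

Comparing against the same minute of the previous day or week, rather than the previous bucket, is a common variation when traffic has a strong daily pattern.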

With instrumentation points, error logs, and detailed request logs in place, are you still afraid you cannot locate the problem?
