Zhu Ye's Internet Architecture Practice Experience S1E4: The Simple, Easy-to-Use Monitoring Six Brothers
"Download this PDF for reading"
The six brothers here refer to nothing more than the ELK suite (Elasticsearch + Logstash + Kibana) and the TIG suite (Telegraf + InfluxDB + Grafana).
The architecture diagram above shows two separate systems, ELK and TIG (TIG is my own shorthand; unlike ELK, it is not a conventional name). Both systems consist of collectors + storage + a display web UI: the green parts are the collectors, the blue parts are the storage, and the red parts are the display web UIs.
All components of both systems are free to use, and installation and configuration are fairly simple (of course, the companies behind them also need to make money, so they naturally push their cloud-hosted versions; we generally do not use the cloud versions and deploy locally instead).
The ELK system is used mainly to collect, store, search, view, and alert on log data.
The TIG system is used mainly to collect, store, view, and alert on all kinds of metrics.
For ELK, the volume of log data is often large and sudden log bursts are common, while index writes are not that fast, so a message queue such as Kafka is generally introduced in front as a buffer.
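As a minimal sketch of that buffering step, assuming the kafka-python client, illustrative broker addresses, and an illustrative topic name app-logs (Logstash would then consume the same topic through its Kafka input plugin):

```python
# Minimal sketch: ship structured log events to Kafka as a buffer in front of Logstash/ES.
# The kafka-python package, the broker addresses and the "app-logs" topic are all
# illustrative assumptions, not the article's actual setup.
import json
from datetime import datetime, timezone

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka1:9092", "kafka2:9092"],
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

def ship_log(level, message, request_id=None, user_id=None):
    """Send one structured log event; Logstash reads the topic and indexes it into ES."""
    event = {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        "requestId": request_id,
        "userId": user_id,
        "service": "bankservice",  # illustrative service name
    }
    producer.send("app-logs", event)

ship_log("INFO", "order created", request_id="req-123", user_id="u-42")
producer.flush()  # make sure buffered events are actually sent before exit
```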
Also for ELK, logs need to be filtered and parsed, and extra alerting may be wanted, before the data enters ES, so Logstash can serve as an aggregation and processing layer, using its rich plugins to do all kinds of processing. Logstash's performance is not that high, however, and it consumes a lot of resources, so pay attention to this when using it.
About ELK: the screenshot above is the Kibana interface. Here we can see that the logs of the microservice components have all been collected into ES, and Kibana lets us search them with expressions; the most common use is to search related logs by the RequestID that strings a whole microservice call chain together, or by the user's UserID. Many companies' developers are still in the habit of logging on to servers one by one to search logs, and the slightly better ones use Ansible to search in bulk, which is actually very inconvenient.
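The same search can also be scripted outside Kibana. A minimal sketch, assuming the elasticsearch Python client, an illustrative app-logs-* index pattern, and a requestId field on the indexed documents:

```python
# Minimal sketch: pull every log line of one request chain out of ES by RequestID.
# The "app-logs-*" index pattern and the field names are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def logs_for_request(request_id):
    resp = es.search(
        index="app-logs-*",
        body={
            "query": {"match": {"requestId": request_id}},
            "sort": [{"@timestamp": {"order": "asc"}}],
            "size": 500,
        },
    )
    return [hit["_source"] for hit in resp["hits"]["hits"]]

for line in logs_for_request("req-123"):
    print(line["@timestamp"], line.get("service"), line["message"])
```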
I have always held one view: exceptions cannot be over-emphasized, especially unhandled exceptions that bubble all the way up to the business surface and system exceptions inside services. We can distinguish between business exceptions, which business logic raises deliberately and whose cause is known in advance, and system exceptions, which cannot be known beforehand. A system exception usually means that the underlying infrastructure (network, database, middleware, and so on) is jittering or failing, or that the code has a bug (and even if it is not strictly a bug, the logic is incomplete). Every such exception needs to be traced to its root cause one by one; if there is no time to investigate right now, record it and investigate later. For some systems with especially heavy business volume there may be hundreds of thousands of exceptions a day, spanning well over a hundred distinct kinds, and even then, at the very least, they need to be kept on record for follow-up.
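A minimal sketch of that distinction in application code (the class and logger names are my own illustration, not anything from the original services):

```python
# Minimal sketch: separate business exceptions (raised on purpose by business logic,
# cause known in advance) from system exceptions (unexpected: infrastructure jitter,
# bugs, or incomplete logic), so they can be logged and investigated differently.
import logging

logger = logging.getLogger("bankservice")

class BusinessException(Exception):
    """Raised deliberately by business logic; the cause is known in advance."""

def handle(operation):
    try:
        return operation()
    except BusinessException as ex:
        # Expected: log at WARNING without a stack trace; normally no investigation needed.
        logger.warning("business exception: %s", ex)
        raise
    except Exception:
        # Unexpected: log at ERROR with a stack trace; each one deserves a root-cause check.
        logger.exception("system exception")
        raise
```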
Even better, we can assign an ID to each error. If the error has a chance of surfacing to the user, display this ID in an inconspicuous spot on the 500 page; then if the user reports the problem with a screenshot, we can easily find the corresponding error in ELK by that error ID and locate the problem in one step.
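A minimal sketch of the error-ID idea, with illustrative field names and page wording:

```python
# Minimal sketch: tag every unexpected error with a short ID, log it as a structured
# field for ELK, and show the same ID discreetly on the 500 page so that a user
# screenshot is enough to find the exact log entry later.
import logging
import uuid

logger = logging.getLogger("bankservice")

def render_500(exc):
    error_id = uuid.uuid4().hex[:8]  # short, screenshot-friendly ID
    # "errorId" is the field we would later search for in Kibana.
    logger.error("unhandled exception, errorId=%s", error_id, exc_info=exc)
    return (
        "<h1>Something went wrong</h1>"
        f"<p style='color:#999;font-size:12px'>Error ID: {error_id}</p>"
    )
```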
About TIG: the screenshot above is Grafana. Grafana supports quite a lot of data sources, InfluxDB being one of them; products similar to InfluxDB, such as Graphite, are also good choices. Telegraf is the data-collection agent suite from the company behind InfluxDB, and it ships with a large number of plugins. These plugins are not complex; we could easily write them ourselves in Python, it would just take time, and ready-made ones already exist. Put plainly, they pull formatted data from the stats interfaces exposed by the various middleware and write it into InfluxDB. The list of input plugins supported by Telegraf can be seen at https://github.com/influxdata/telegraf (which is where the screenshot was taken from).
Using these plugins, or developing our own, it does not take much effort to get all of our basic components monitored.
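For a component without a ready-made plugin, one low-effort option (my own assumption about how such a collector could look, not the article's setup) is a small script that scrapes the component's stats endpoint and prints InfluxDB line protocol, which Telegraf's exec input plugin can run on an interval:

```python
# Minimal sketch of a do-it-yourself collector: scrape a (hypothetical) stats endpoint
# and print InfluxDB line protocol to stdout, so Telegraf's [[inputs.exec]] plugin
# (with data_format = "influx") can run it on an interval and forward the data.
import requests

STATS_URL = "http://localhost:8080/stats"  # hypothetical middleware stats interface

def main():
    stats = requests.get(STATS_URL, timeout=2).json()
    # Line protocol: measurement,tag=value field=value
    print(
        "middleware_stats,host=app01,component=queue "
        f"pending={int(stats['pending'])}i,consumed={int(stats['consumed'])}i"
    )

if __name__ == "__main__":
    main()
```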
About instrumentation: as shown in the architecture diagram at the beginning of this article, besides using Telegraf's various plugins to collect all kinds of storage, middleware, and system-level metrics, we also wrote a small MetricsClient class library so that application code can save all kinds of metric data into InfluxDB. In fact, each record written to InfluxDB is simply an event carrying a measurement name, tags, fields, and a timestamp.
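A minimal sketch of what writing such an event could look like, assuming the influxdb Python client for InfluxDB 1.x; the real MetricsClient is an internal library, so every name below is illustrative:

```python
# Minimal sketch of a MetricsClient-style helper: each call writes one event
# (measurement + tags + fields + timestamp) into InfluxDB.
from datetime import datetime, timezone

from influxdb import InfluxDBClient  # InfluxDB 1.x client

client = InfluxDBClient(host="localhost", port=8086, database="metrics")

def record_event(measurement, result, duration_ms, **tags):
    point = {
        "measurement": measurement,                  # e.g. "bankservice_transfer"
        "tags": {"result": result, **tags},          # result: ok / biz_error / sys_error
        "time": datetime.now(timezone.utc).isoformat(),
        "fields": {"duration_ms": float(duration_ms), "count": 1},
    }
    client.write_points([point])

record_event("bankservice_transfer", "ok", 35.2, host="app01")
```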
As can be seen in the screenshot, for this bankservice we recorded the success and failure of various asynchronous and synchronous operations, business exceptions, and system exceptions as events; then, with some simple configuration in Grafana, we can present the charts we need.
MetricsClient can be called manually in code or applied in AOP fashion; we can even add such instrumentation to all methods, automatically collecting every execution's count, elapsed time, and result (normal, business exception, system exception) into InfluxDB, and then configure the dashboards we need in Grafana for monitoring.
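In Python terms the same idea can be sketched as a decorator, reusing the illustrative record_event helper and BusinessException class from the sketches above:

```python
# Minimal sketch of AOP-style instrumentation: wrap a method, time it, classify the
# result (ok / business exception / system exception) and push one event to InfluxDB.
import functools
import time

def instrumented(measurement):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = "ok"
            try:
                return func(*args, **kwargs)
            except BusinessException:
                result = "biz_error"
                raise
            except Exception:
                result = "sys_error"
                raise
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                record_event(measurement, result, elapsed_ms)
        return wrapper
    return decorator

@instrumented("bankservice_transfer")
def transfer(from_account, to_account, amount):
    ...  # business logic goes here
```

Every decorated method then automatically produces the call-count and duration data that the dashboards discussed below are built on.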
For RPC frameworks it is also recommended to integrate this automatically inside the framework, recording each execution of every RPC method; with charts configured down to method granularity, we can locate the suspected problem in one step when an incident happens. AOP instrumentation plus RPC-framework auto-instrumentation can already cover most needs, but it is even better if we also add some business-level instrumentation in the code.
For each business behavior we can configure two graphs: one for call volume and one for call performance. The recommended practice is to follow the data-processing flow from front to back and configure call-volume and performance charts for every link in the chain; a sketch of the two underlying queries follows below.
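As a sketch of what those two panels might query, here are the call-volume and call-performance queries in InfluxQL, issued through the Python client and continuing the illustrative measurement and field names from above (a Grafana panel would normally run equivalent queries directly against the data source):

```python
# Minimal sketch: the two queries behind a "call volume" panel and a "call performance"
# panel for one link in the flow, bucketed into one-minute intervals.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="metrics")

volume = client.query(
    'SELECT sum("count") FROM "bankservice_transfer" '
    'WHERE time > now() - 1h GROUP BY time(1m), "result"'
)
performance = client.query(
    'SELECT mean("duration_ms"), percentile("duration_ms", 95) '
    'FROM "bankservice_transfer" WHERE time > now() - 1h GROUP BY time(1m)'
)
print(list(volume.get_points()), list(performance.get_points()))
```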
When a problem occurs, we can quickly narrow it down to the offending module, or at least the business line, which is far better than running around like a headless fly (of course, this is useless if we never configured the dashboards we need in the first place). Dashboards must be maintained continuously as the business iterates; do not let them fall behind after a few rounds of iteration, only to find every chart showing 0 calls when a problem finally comes up.
Other: Grafana works very well against the InfluxDB data source, but running queries against MySQL through it never feels particularly convenient. Here I recommend an open-source system, Metabase, with which we can easily save some SQL to produce business or monitoring statistics. You might say that such business statistics are an operations concern that BI should handle, so why would we build these charts? I would say that even though we do technology, we had better keep a small business panel too, not to focus on the business itself, but to have a place where we can see how the business is running and, at critical moments, judge the scope of the impact at a glance.
Well, having read this far, have you seen through the six brothers? What we are actually building is a three-dimensional monitoring system for troubleshooting; after all, when a major problem hits, we often have only a few minutes. With instrumentation metrics, error logs, and detailed request logs in hand, are we still afraid of not being able to locate the problem?