Case | Monitoring Challenges and Solutions in a Service-Based Architecture


Original URL: http://url.cn/kVjUVO

As we all know, system monitoring has always been an important issue for enterprises with complex IT architectures, and it is not a technical challenge that every enterprise can easily solve. OPPO, an international provider of smart terminal devices and mobile internet services, has launched a range of elegant, functional, and reliable smartphones, and its brand awareness ranks among the highest. Like other fast-growing modern enterprises, however, OPPO faces IT challenges of its own; less visible behind the brand is the equally excellent IT team and information technology capability that supports it.

OPPO's back-end systems have grown rapidly in recent years. After the systems were restructured into a service-based architecture, coupling between systems was reduced and development efficiency improved greatly. But along with the benefits of servitization came problems that are hard to monitor: because the call relationships between services are complex, a single interface problem can surface as errors in multiple systems, making it difficult to locate the true source of a failure. The entire request call chain becomes a black box, with no way to trace a request's full call path or discover its performance bottlenecks.

To solve these problems, OPPO developed its own monitoring system which, combined with third-party monitoring, forms a complete monitoring picture from the application request through to back-end processing. The OPPO monitoring system is called OMP (OPPO Monitor Platform); it took half a year to develop, went online in two phases, and now covers all of OPPO's online projects.

Three reasons for developing the system in-house

The choice to develop the monitoring system in-house was driven mainly by three considerations: customization requirements, ease of use, and low development cost.

First, after comparison, it was found that existing open source monitoring software could not meet OPPO's requirements. One of the core requirements for the monitoring system is the ability to monitor the complete call chain of each app request: from the app initiating the request, through back-end load-balanced access, the API server, microservice invocations, caching, message queues, and database access, with timings for each step. After the architecture was split into microservices, service tracing and call-chain monitoring became especially important; without them, system failures and performance bottlenecks are hard to troubleshoot.

Tracing a user request through its complete call chain requires burying points in the API framework, RPC framework, cache operations, database operations, and queue consumption, together with a high-performance processing and storage system. Current open source software cannot meet these needs, which is why major companies develop their own monitoring platforms. Because the service-call-tracing feature is deeply tied to the development framework, and different companies use different frameworks, there are few comparable open source products in the industry.

The second reason is the requirement for permission management and an integrated management interface. The monitoring platform is used frequently not only by operations staff but also by developers and testers. For example, by using the platform to capture JVM young GC/full GC counts and durations, and the stacks and time consumption of the top 10 threads, developers and testers can evaluate code quality and eliminate hidden problems.

With so many users, the monitoring platform needs security and permission management, as well as an integrated management interface that is simple and easy to use. A combination of multiple open source tools can hardly meet these requirements for unified permissions and manageability.

Third, developing a monitoring system is not that difficult. A self-developed monitoring platform has many benefits, but they would be meaningless if development were too difficult to sustain. Built on technologies such as SIGAR, Kafka, Flume, HBase, and Netty, developing a high-performance, scalable system is actually not very hard, nor does it require a large investment of resources.

Six types of monitoring content for all-round online application monitoring

The ultimate goal of OMP is to provide an integrated monitoring system that monitors online application systems from multiple dimensions under a single management interface and permission system. At the current stage, OMP's main monitoring content includes: host performance metrics, middleware performance metrics, real-time service call chains, interface performance metrics, real-time logs, and real-time business metrics.

There is plenty of open source software for host performance monitoring, such as Zabbix and Cacti. The metrics collected are mainly host CPU load, memory usage, per-NIC inbound and outbound traffic, per-disk read and write rates, disk reads and writes per second (IOPS), per-disk space utilization, and so on.

With the help of the open source SIGAR library, host information can be collected easily. To keep the experience of the whole monitoring system consistent, and to meet scalability and stability requirements, we did not adopt Zabbix or another open source monitoring system directly, but developed our own agent program and deployed it on each host to collect information.

SIGAR (System Information Gatherer and Reporter) is an open source tool that provides a cross-platform API for collecting system information. Its core is implemented in C, with bindings for C++, Java, Perl, .NET (C#), Ruby, Python, PHP, and Erlang. (A small collection example in Java follows the list below.)

The information that SIGAR can collect includes:

    • CPU information, including basic information (vendor, model, MHz, cache size) and statistics (user, sys, idle, nice, wait);

    • File system information, including filesystem, size, used, available, use%, and type;

    • Event information, similar to the Windows Service Control Manager;

    • Memory information: total, used, and free physical memory and swap, and RAM size;

    • Network information, including network interface and network routing information;

    • Process information, including each process's memory, CPU usage, state, arguments, and handles;

    • I/O information, including I/O state, read/write sizes, and so on;

    • Service state information;

    • System information, including operating system version, system resource limits, uptime and load, Java version, and so on.
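As a rough sketch (not OPPO's actual agent code), the following Java snippet shows how an agent might read a few of these metrics through the SIGAR API; note that SIGAR also requires its native library to be available on java.library.path.

    import org.hyperic.sigar.CpuPerc;
    import org.hyperic.sigar.Mem;
    import org.hyperic.sigar.Sigar;
    import org.hyperic.sigar.SigarException;

    public class HostMetricsProbe {
        public static void main(String[] args) throws SigarException {
            Sigar sigar = new Sigar();

            // CPU statistics as fractions of total time (user, sys, idle).
            CpuPerc cpu = sigar.getCpuPerc();
            System.out.printf("cpu user=%.2f sys=%.2f idle=%.2f%n",
                    cpu.getUser(), cpu.getSys(), cpu.getIdle());

            // Physical memory figures, in bytes.
            Mem mem = sigar.getMem();
            System.out.printf("mem total=%d used=%d free=%d%n",
                    mem.getTotal(), mem.getUsed(), mem.getFree());

            // 1/5/15-minute load averages (not available on Windows).
            double[] load = sigar.getLoadAverage();
            System.out.printf("load1=%.2f load5=%.2f load15=%.2f%n",
                    load[0], load[1], load[2]);
        }
    }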

For middleware performance monitoring, based on what the business uses, the middleware collected from mainly includes Nginx, MySQL, MongoDB, Redis, Memcached, the JVM, Kafka, and so on. Separate collection servers are deployed; they execute status query commands through each middleware's Java client and parse out the corresponding performance metrics. Some of the collected metrics are listed below (a collection sketch follows the list):

    • JVM: heap memory, permanent generation memory, old generation memory, thread CPU time, thread stacks, young GC, full GC

    • MySQL: slow queries, QPS, TPS, connection count, space size, table locks, row locks ...

    • Redis: QPS, hit rate, connection count, key count, memory usage ...

    • Memcached: QPS, hit rate, memory usage, entry count, connection count ...

    • Nginx: requests per second, connection count, keepalive connections, persistent connection utilization ...
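As an illustration of this approach, the sketch below pulls Redis metrics by issuing the INFO status query through a Java client. Jedis is assumed here; the article does not say which client OPPO uses, and because INFO returns cumulative counters, the collector must diff successive samples to get per-period values, as described above.

    import java.util.HashMap;
    import java.util.Map;
    import redis.clients.jedis.Jedis;

    public class RedisMetricsCollector {
        // Runs INFO against a Redis node and parses the "key:value" lines.
        public static Map<String, String> collect(String host, int port) {
            Map<String, String> metrics = new HashMap<>();
            try (Jedis jedis = new Jedis(host, port)) {
                for (String line : jedis.info().split("\r\n")) {
                    if (line.isEmpty() || line.startsWith("#")) continue;
                    String[] kv = line.split(":", 2);
                    if (kv.length == 2) metrics.put(kv[0], kv[1]);
                }
            }
            return metrics;
        }

        public static void main(String[] args) {
            Map<String, String> m = collect("127.0.0.1", 6379);
            // keyspace_hits / keyspace_misses are cumulative; a collector diffs
            // successive samples to compute the per-period hit rate.
            System.out.println("connected_clients = " + m.get("connected_clients"));
            System.out.println("keyspace_hits     = " + m.get("keyspace_hits"));
            System.out.println("used_memory       = " + m.get("used_memory"));
        }
    }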

After the architecture was split into microservices, service calls became complicated, and problems or performance bottlenecks are often hard to locate, so real-time monitoring of the service call chain is very important.

Service call chain monitoring starts when an app initiates a request and analyzes the time spent and errors at each step: load-balanced access, API server time, microservice call time, cache access time, database access time, message queue processing time, and the error information from each link, making it easy to track down performance bottlenecks and faults.

Because the number of service calls is huge, the monitoring system cannot store every request's call chain. To keep the data manageable for administrators to view, it primarily stores the following requests (a minimal decision sketch follows the list):

    • The slowest top 1000 requests in each period: analyzing the slowest top 1000 requests identifies the major performance bottlenecks, such as database access or a call to a third party's interface taking too long.

    • Sampled requests: a portion of requests is selected at random, and their call chains are stored according to the configured sampling rate.

    • Keyword matches: requests that satisfy a keyword rule have their call chains stored.
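A minimal decision sketch of these three rules, with hypothetical names and a fixed slow threshold standing in for the per-period top-1000 selection (which in practice requires ranking requests within the period):

    import java.util.concurrent.ThreadLocalRandom;

    public class ChainStoragePolicy {
        private final double sampleRate;     // e.g. 0.01 stores 1% of requests
        private final long slowThresholdMs;  // stand-in for "slowest top 1000"
        private final String keyword;        // simplified keyword rule

        public ChainStoragePolicy(double sampleRate, long slowThresholdMs, String keyword) {
            this.sampleRate = sampleRate;
            this.slowThresholdMs = slowThresholdMs;
            this.keyword = keyword;
        }

        // Decide whether to persist this request's call chain.
        public boolean shouldStore(long elapsedMs, String requestUri) {
            if (elapsedMs >= slowThresholdMs) return true;   // slow request
            if (requestUri.contains(keyword)) return true;   // keyword rule
            return ThreadLocalRandom.current().nextDouble() < sampleRate; // sampling
        }
    }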

Interface performance monitoring mainly covers interface availability and response time, and consists of two parts, external monitoring and internal monitoring:

    • External monitoring: carried out by a third-party company, in two forms. One instruments the application to collect performance metrics of real business requests; the other proactively probes the availability and performance of the interfaces from collection sites the third party deploys around the world. External monitoring can only reach the interface service address exposed outside the load balancer; monitoring the interface servers inside would require deploying the third party's agent inside the data center, which would create a large security risk, so monitoring inside the data center is done by internal monitoring.

    • Internal monitoring: done with OMP, which monitors the availability and performance of the interface servers behind the load balancer and detects abnormal nodes in time; based on the cause of the abnormality, OMP calls back a recovery URL provided by the business system to try to restore it (a probe sketch follows this list).
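A hedged sketch of that internal probing behaviour: check an interface node behind the load balancer and, if it fails, call the business-provided recovery URL before alerting anyone. All endpoints and paths here are hypothetical.

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class InterfaceProbe {
        // Returns true if the endpoint answers HTTP 200 within the timeouts.
        static boolean httpOk(String endpoint) {
            try {
                HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
                conn.setConnectTimeout(2000);
                conn.setReadTimeout(2000);
                return conn.getResponseCode() == 200;
            } catch (IOException e) {
                return false;
            }
        }

        public static void main(String[] args) {
            String node = "http://10.0.0.12:8080/health";      // node behind the LB (hypothetical)
            String recovery = "http://10.0.0.12:8080/recover"; // recovery URL from the business system
            if (!httpOk(node)) {
                // Try automatic recovery first; alert staff if the node is still down.
                boolean recovered = httpOk(recovery) && httpOk(node);
                System.out.println(recovered ? "node recovered" : "node still down, alerting");
            }
        }
    }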

The logs generated by applications are scattered across the application servers, and because of strict security management it is very inconvenient for developers to view logs on the online systems; in addition, log content matching certain keywords needs to trigger an alarm notification to the relevant people. OMP collects the logs into an Elasticsearch cluster for log retrieval. OMP's real-time log monitoring mainly includes the following functions:

    • Real-time online log viewing: the monitoring platform can show the contents of log files in real time, similar to the tail -f command, while masking sensitive information (such as passwords);

    • Full-text search: full-text search over log content, with highlighting;

    • Correlated log viewing: view logs by generation time, and view the logs of the components and applications associated with the application the log belongs to;

    • Keyword alerts: users define their own alarm rules, and lines matching a rule trigger mail and SMS notifications (see the sketch below).
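A hypothetical sketch of the keyword-alert function: user-defined rules are matched against incoming log lines, and a notification goes out on a hit. The rule shape and the notification call are illustrative, not OMP's actual API.

    import java.util.List;
    import java.util.regex.Pattern;

    public class KeywordAlerter {
        // A user-defined alarm rule: a pattern plus the people to notify.
        static final class Rule {
            final Pattern pattern;
            final List<String> recipients;
            Rule(Pattern pattern, List<String> recipients) {
                this.pattern = pattern;
                this.recipients = recipients;
            }
        }

        private final List<Rule> rules;

        public KeywordAlerter(List<Rule> rules) { this.rules = rules; }

        // Called for every collected log line.
        public void onLogLine(String appName, String line) {
            for (Rule rule : rules) {
                if (rule.pattern.matcher(line).find()) {
                    notifyPeople(rule.recipients, appName + ": " + line);
                }
            }
        }

        private void notifyPeople(List<String> recipients, String message) {
            // Placeholder for OMP's mail and SMS notification.
            System.out.println("ALERT to " + recipients + " -> " + message);
        }
    }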

The last monitoring content is real-time monitoring of business metrics. Besides the information the monitoring system collects actively, there are business-level metrics that need monitoring, such as the number of orders in a period or the results of third-party data synchronization. These business-level metrics are collected by each business system and reported to the monitoring system, which takes care of chart display and alarm notification.

Four aspects of the OMP system design

First, look at OMP's system architecture. The original article includes an architecture diagram; its components are as follows:

    • Middleware collectors: multiple independently deployed collectors of middleware performance metrics, with failover and task allocation coordinated through ZooKeeper. A collector executes status query commands through each middleware's Java client and parses the results into performance metrics; because a status query returns the latest cumulative values, the collector is also responsible for computing per-period data such as the period average, maximum, and minimum. The collected data is reported to the receiver cluster in real time.

    • Agent monitoring agents: deployed on every server, the agents collect in real time the server's log file content, CPU load, memory usage, NIC inbound/outbound traffic, disk read/write rates, disk IOPS, and so on. The collected data is reported to the receiver cluster in real time; the upload path also applies flow control and a discard policy for log files to prevent blocking.

    • Code burying points: the buried code mainly collects service call chain data, obtaining call-chain timings and error information through the encapsulated cache access layer, database access layer, message queue access layer, and the distributed service framework (RPC). Data captured at the burying points is staged locally, then merged and reported to the receiver cluster once per minute.

    • Business metrics reporting: business metrics are collected by each business system and reported to the receiver cluster; the reporting cycle and strategy are decided by each business.

    • Receiver cluster: an OPPO self-developed data flow component whose architecture references Flume, comprising input, channel, and output; the received data is written to a Kafka queue. It is described in detail below.

    • Kafka message queue: because monitoring data tolerates loss and re-consumption, high-performance Kafka was chosen as the message queue to buffer message processing.

    • Message processing cluster: subscribes to Kafka topics and processes messages in parallel, evaluating alarm rules, sending notifications, and storing data to HBase and ES (a consumer sketch follows this list).

    • HBase: stores metric data; the management console generates real-time charts by querying HBase.

    • Elasticsearch: stores log content for full-text retrieval.
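A minimal sketch of the consuming side: subscribe to a Kafka topic and process monitoring events in a loop. The topic name, group id, and string payloads are assumptions; the article only states that Kafka buffers the monitoring data.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class MonitorEventProcessor {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka1:9092"); // hypothetical broker
            props.put("group.id", "omp-processor");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("omp-metrics")); // hypothetical topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        // Evaluate alarm rules here, then persist to HBase / Elasticsearch.
                        System.out.println(record.value());
                    }
                }
            }
        }
    }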

OPPO data flow supports configuration and management of the traffic it carries. Its design references Flume and comprises three parts: source (input), channel, and sink (output); the channel is a queue that buffers data. The main reasons for not adopting Flume itself are the following:

    • Flume provides a good source → channel → sink framework, but the specific sources and sinks would still have to be implemented in-house to be compatible with the software versions OPPO runs online, and their parameters tuned.

    • Flume consumes too many resources, which is unsuitable for an agent deployed on business servers;

    • Flume's configuration file is less intuitive than XML configuration and cannot be managed through an administrative interface;

    • Flume's management interface is not friendly: you cannot see real-time input/output traffic charts or error counts.

Following Flume's design philosophy, OPPO data flow is a more controllable, more manageable, easy-to-use data flow tool. Using open source software well is not just about adopting it as-is; learning the essence of its design and then improving on it is another way.

In fact, the agent monitoring agents, the middleware collectors, and the receiver cluster are all OPPO data flow components, combining different sources and sinks (a minimal interface sketch follows the component list below). Sources and sinks are developed with the OSF service framework, which gives the agent → receiver path automatic discovery, load balancing, and failover.

Each component combines its sources (input), channel, and sink (output) as follows:

    • Agent monitoring agent: sources TailFileSource, CpuSource, MemorySource, NetworkSource, DiskSource; channel MemoryChannel; sink HttpSink

    • Middleware collector: sources NginxSource, MysqlSource, MongodbSource, RedisSource, JvmSource, MemcachedSource; channel MemoryChannel; sink HttpSink

    • Receiver: source HttpSource; channel FileChannel; sink KafkaSink
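As a minimal illustration of the contract these components share, the sketch below wires a source to a sink through a queue-backed channel. The interface names mirror the list above; OPPO's real implementation is not public, so this is only an assumed shape.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class DataFlowDemo {
        interface Source { void emit(BlockingQueue<byte[]> channel); }
        interface Sink { void drain(BlockingQueue<byte[]> channel); }

        public static void main(String[] args) {
            // The channel is a bounded queue that buffers data between input and output.
            BlockingQueue<byte[]> channel = new LinkedBlockingQueue<>(10_000);

            Source cpuSource = q -> q.offer("cpu.load=0.42".getBytes()); // stands in for CpuSource
            Sink httpSink = q -> {                                       // stands in for HttpSink
                byte[] event = q.poll();
                if (event != null) System.out.println("POST -> " + new String(event));
            };

            cpuSource.emit(channel);
            httpSink.drain(channel);
        }
    }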

The data flow has an embedded management interface where you can view traffic and error messages, and click a component's name to view its historical traffic.

The service call chain is the core focus of monitoring. To trace the whole call chain, OPPO developed the OSF (OPPO Service Framework) distributed service framework and wrapped cache, database, and message queue operations with burying points, so that service call tracing is implemented transparently. It works as follows (a minimal sketch follows the list):

    • Create a unique RequestID at the entry point of the app request and put it into a ThreadLocal;

    • At the cache access layer burying point, take the RequestID from the ThreadLocal and record the cache operation's time;

    • At the database access layer burying point, take the RequestID from the ThreadLocal and record the database operation's time;

    • When calling another microservice (RPC), pass the RequestID to the next microservice, which puts the received RequestID into its own ThreadLocal; cache and database operations inside that microservice likewise record timings and error information against the RequestID;

    • Message queue writes and consumption are also instrumented, passing the RequestID along and recording the time spent consuming messages.
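A minimal sketch of this mechanism with hypothetical class names: the RequestID created at the entry point lives in a ThreadLocal, and each burying point reads it when recording a timing.

    import java.util.UUID;

    public final class TraceContext {
        private static final ThreadLocal<String> REQUEST_ID = new ThreadLocal<>();

        // Called once at the API entry point of the app request.
        public static String begin() {
            String id = UUID.randomUUID().toString();
            REQUEST_ID.set(id);
            return id;
        }

        public static String currentId() { return REQUEST_ID.get(); }

        // Must be called when the request finishes to avoid leaks on pooled threads.
        public static void end() { REQUEST_ID.remove(); }
    }

    // Example burying point in a (hypothetical) cache access layer:
    class InstrumentedCache {
        public Object get(String key) {
            long start = System.nanoTime();
            try {
                return null; // the real cache lookup goes here
            } finally {
                long costUs = (System.nanoTime() - start) / 1_000;
                // Staged locally, then merged and reported to the receivers each minute.
                System.out.printf("requestId=%s op=cache.get cost=%dus%n",
                        TraceContext.currentId(), costUs);
            }
        }
    }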

The call chain data is too large to store in full. The monitoring system stores the slowest top 1000 requests per period, the sampled portion of requests, and the call chains of requests matching keyword rules; these are written to HBase, where the management console can quickly analyze and view them.

The distributed service framework is the key to the service call chain. The open source Dubbo is widely used, but the Dubbo version had not been updated for a long time (some of Dubbo's dependencies conflict with other open source components in the development ecosystem), its code base is large, and its service governance capability is weak, making it difficult to fully control all of Dubbo's details. Hence the previously mentioned self-developed distributed service framework, OSF, with streamlined code that meets the core requirements and integrates deeply with the monitoring system.

OSF implements the propagation of the RequestID across microservice RPC calls and records each service call's time and errors; every minute, the framework summarizes the timings and errors of microservice calls and reports them to the monitoring platform.

The main features of OSF are as follows:

    • Supports a RESTful protocol, with Tomcat, Netty, and the JDK HTTP server as containers;

    • Supports a TCP binary protocol, with Netty as the container;

    • HTTP/2 protocol support is in testing;

    • Supports Protobuf, JProtobuf, Kryo, FST, MessagePack, Jackson, Gson, and Hessian serialization; the serialization format is chosen by the consumer;

    • A registry based on MySQL, providing both server push and client pull to ensure reliable service discovery;

    • The registry provides HTTP APIs to support multiple languages and mobile devices;

    • Supports multi-data-center deployment;

    • I/O threads are separated from the worker thread pool; when a provider is busy, it responds immediately so the client can retry another node.

From the point of view of reliability and scalability, the design mainly includes the following elements:

    • Receiver: the receiver's input is developed with the OSF RESTful protocol; through the registry, clients automatically discover changes in the receiver nodes, and client-side load balancing and failover ensure the receivers' reliability and scalability.

    • Middleware collectors: the collectors elect a master through ZooKeeper, and the master allocates the collection tasks; when collector nodes change, the newly elected master reassigns the tasks, so collector nodes can be added or removed at will, the collection tasks are rebalanced, and collection keeps running reliably (see the election sketch after this list).

    • Message processing: because it is hard for multiple nodes to share the same Kafka topic while staying highly available, OMP predefines several Kafka topics; the message processing nodes elect a master through ZooKeeper, and the master allocates the topics. When a message processing node goes down, the topics it was responsible for are transferred to other nodes for continued processing.

    • Agent monitoring agents: a shell script on each server periodically checks the agent's status and automatically restarts the agent if it is not running; OMP also maintains heartbeat messages with each agent, and if no heartbeat is received for more than three cycles, OMP sends an alarm to notify the relevant personnel.
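A hedged sketch of that election using Apache Curator's LeaderSelector recipe on top of ZooKeeper. The article confirms ZooKeeper-based master election; the use of Curator, the ZooKeeper path, and the task-assignment hook are assumptions.

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.leader.LeaderSelector;
    import org.apache.curator.framework.recipes.leader.LeaderSelectorListenerAdapter;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class CollectorMasterElection {
        public static void main(String[] args) throws Exception {
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
            client.start();

            LeaderSelector selector = new LeaderSelector(client, "/omp/collector-master",
                    new LeaderSelectorListenerAdapter() {
                        @Override
                        public void takeLeadership(CuratorFramework cf) throws Exception {
                            // Only the elected master assigns collection tasks; if it dies,
                            // another collector is elected and re-balances the tasks.
                            assignCollectionTasks();
                            Thread.currentThread().join(); // hold leadership until shutdown
                        }
                    });
            selector.autoRequeue(); // rejoin the election after losing leadership
            selector.start();
        }

        static void assignCollectionTasks() { /* distribute middleware targets here */ }
    }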

The lesson of OPPO's independently developed monitoring system is that everything should start from business needs; the purpose is to solve the problems the business actually encounters. When choosing open source software, know what to adopt and what to leave alone. The industry has many mature open source projects and some bold design ideas worth borrowing, but adopting open source is never as simple as taking it and using it; the guiding principle is control. If an open source component cannot be fully controlled and is not simple enough, a simpler homegrown approach may serve better, since at least emergency fixes can be devised when problems occur. A system must also be manageable; otherwise it runs as a black box, and no IT manager can sleep soundly over that.

The author of this article, Rodechon, currently works in the OPPO foundation technology team on the base technologies of the monitoring platform and the service framework. Since graduating in 2005, he has led the design and development of several product systems and their project management, in the fields of communications, mobile finance, application stores, and PaaS platforms. This article is licensed by the author exclusively to the InfoQ public platform.
