Monitor multiple servers

Source: Internet
Author: User

Monitoring falls into two main categories: system monitoring and business monitoring.

System monitoring covers each host's CPU usage, memory, network bandwidth, and other resource usage, as well as the core indicators of services such as MySQL, Redis, and nginx. This is the most basic kind of monitoring. If it is done well, many problems in a production environment can be detected in advance and prevented.
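As a rough illustration of what a system-monitoring agent collects, here is a minimal Python sketch that takes one snapshot of disk usage and load average using only the standard library. The function name and the choice of metrics are assumptions made for this example; a real agent such as collectd gathers far more.

```python
import os
import shutil
import time


def collect_system_metrics(path="/"):
    """Return a snapshot of a few basic host metrics as a dict."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    metrics = {
        "timestamp": int(time.time()),
        "disk_used_pct": round(100.0 * usage.used / usage.total, 2),
    }
    # os.getloadavg() exists only on Unix-like systems and may raise OSError.
    if hasattr(os, "getloadavg"):
        try:
            metrics["load_1m"] = os.getloadavg()[0]
        except OSError:
            pass
    return metrics


if __name__ == "__main__":
    print(collect_system_metrics())
```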

Business monitoring covers service-related indicators, such as the number of calls to an API per second, the average response time of that API per minute, and the number of online users of the service. It can even include operations-related data, such as the seven-day retention rate, the number of new users added each day, and the number of users lost each day. This data is also very important: it is a barometer of your entire business and a basis for important decisions.
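To make two of these indicators concrete, the following Python sketch derives calls per second and average response time per minute from a list of call records. The record layout (Unix timestamp, response time in milliseconds) and the function name are assumptions for illustration only.

```python
from collections import defaultdict


def summarize_calls(calls):
    """Summarize API calls.

    calls: iterable of (unix_timestamp_seconds, response_time_ms) pairs.
    Returns (calls_per_second, avg_response_ms_per_minute) as two dicts.
    """
    per_second = defaultdict(int)
    per_minute = defaultdict(list)
    for ts, ms in calls:
        per_second[int(ts)] += 1               # bucket by whole second
        per_minute[int(ts) // 60].append(ms)   # bucket by minute index
    avg_per_minute = {m: sum(v) / len(v) for m, v in per_minute.items()}
    return dict(per_second), avg_per_minute


if __name__ == "__main__":
    sample = [(120.1, 100), (120.9, 300), (121.5, 200), (180.0, 50)]
    print(summarize_calls(sample))
```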

Many open-source tools are available for system monitoring, such as Nagios, Cacti, and Zabbix, but deploying and operating them is complicated. Besides the agents, there must be a center to collect, store, and display data, and many plug-ins need to be maintained. Messages are sent to the primary service through a queue. If everything is in the same data center, writing Nagios plug-ins is the better option: management is unified, and you only need to write the plug-in. If the data centers are distributed, you can consider writing some scripts to pass messages between Nagios instances. Writing your own code, however, is time-consuming and difficult to manage. collectd, by contrast, comes with a variety of plug-ins: system CPU and disk utilization and common services such as MySQL, nginx, and Redis can all be monitored, so you get a ready-made set of metrics to watch. It is also easy to install: ./configure && make install.

For business monitoring, you must write your own code to report business data. Currently, the popular solution is StatsD + Graphite, which is lightweight and has SDKs for many languages, so various metrics can be monitored easily.
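Reporting to StatsD needs very little code, because a metric is just a short text line sent over UDP. The sketch below, written without any SDK, formats and sends such lines in Python. The metric names are made up for the example, and the host and port are assumptions (8125 is the usual StatsD default).

```python
import socket


def format_statsd(name, value, metric_type):
    """Build one StatsD line: <name>:<value>|<type>.

    Common types: "c" (counter), "ms" (timer), "g" (gauge).
    """
    return f"{name}:{value}|{metric_type}"


def send_statsd(line, host="127.0.0.1", port=8125):
    """Send one metric line over UDP; fire-and-forget, no reply is expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    # Hypothetical metric names, for illustration only.
    send_statsd(format_statsd("api.login.calls", 1, "c"))
    send_statsd(format_statsd("api.login.response_time", 320, "ms"))
```

Because UDP is connectionless, the business code is not slowed down or broken if the StatsD daemon is unavailable; the trade-off is that individual data points can be lost.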

The choice of monitoring tool generally depends on how your servers are distributed:
If they are spread across many data centers, Ganglia is designed for distributed use and is the first choice for centralized monitoring and processing. Nagios needs some plug-in optimization and structural adjustment to support distributed requirements well. Distributed systems face two concerns: centralized management and reliability. Reliability means avoiding gaps and errors in monitoring so that the data stays accurate; centralized management reduces the workload.
If they are centralized, Ganglia is recommended when the amount of monitoring is large. If it is small, many other monitoring systems will do. For alarm monitoring you can use Nagios; few tools seem to be as flexible. It is still best to adapt the configuration to your own environment, and even the simplest and fastest configuration requires you to develop some rules yourself.


Most monitoring systems are similar to the following:

  1. An agent is installed on each machine to collect performance data of the local machine.

  2. The business deployed on each machine submits its business-related data to the center through an SDK.

  3. Each agent can dynamically load some plug-ins as needed to monitor new metrics.

  4. Generally, each data center has one center that collects the metrics reported by agents and businesses.

  5. The center stores and archives the collected metric data, generally in an RRD database.

  6. The center also has a Web interface for viewing the historical charts of each indicator, and even various views and dashboards that display a set of related indicators together.

  7. The center also sends daily reports on custom key indicators to O&M personnel or other relevant staff.

  8. The center also needs to store various alarm rules. For example, an alarm is generated when an indicator exceeds a threshold several times in a row, when its fluctuation exceeds a certain range, or when no data is reported for an indicator.

  9. The center also needs to converge alarms, for example by merging similar alarms and temporarily suppressing certain types of alarms, to prevent a flood of alarms caused by network jitter. Without this, O&M personnel would be drowned in alarms.

  10. The center sends alerts to O&M personnel in various ways, such as text message, email, and voice call.

  11. The center also reviews and statistically analyzes each alarm to find the weak points of each system and to assess its uptime, stability, and availability.
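Two of the steps above, the "several times in a row" alarm rule and alarm convergence, can be sketched in a few lines of Python. The class names, thresholds, and the 60-second quiet window are all assumptions chosen for illustration; real systems implement these ideas with many more options.

```python
import time


class ConsecutiveThresholdRule:
    """Fire when a value exceeds `threshold` for `times` checks in a row."""

    def __init__(self, threshold, times):
        self.threshold = threshold
        self.times = times
        self.streak = 0

    def check(self, value):
        # Any value at or below the threshold resets the streak.
        self.streak = self.streak + 1 if value > self.threshold else 0
        return self.streak >= self.times


class AlarmSuppressor:
    """Drop repeats of the same alarm key inside a quiet window (seconds)."""

    def __init__(self, window):
        self.window = window
        self.last_sent = {}

    def should_send(self, key, now=None):
        now = time.time() if now is None else now
        last = self.last_sent.get(key)
        if last is not None and now - last < self.window:
            return False  # a similar alarm was sent recently; merge it away
        self.last_sent[key] = now
        return True


if __name__ == "__main__":
    rule = ConsecutiveThresholdRule(threshold=90, times=3)
    suppressor = AlarmSuppressor(window=60)
    for cpu in [95, 96, 80, 91, 92, 93, 94]:
        if rule.check(cpu) and suppressor.should_send("web01.cpu"):
            print("ALERT: web01 CPU above 90% three times in a row")
```

In this run the rule fires on the sixth and seventh samples, but the suppressor lets only the first of the two alerts through, which is the kind of merging that keeps a brief spike from producing a burst of notifications.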

Therefore, it is difficult to build a sound and reliable monitoring system by yourself, which requires a lot of manpower and energy for development and maintenance.

At present, some foreign vendors specialize in hosting the monitoring center as an outsourced O&M service, which saves a lot of work. You still have to install the agents and plug-ins yourself, but this is simple, and there are plenty of O&M tools for batch deployment.

Well-known ones include New Relic, StatHat, and Hosted Graphite. You can either install an agent that reports data to their center or use their SDK to submit custom data. They are responsible for storage, display, and alerting, which saves a lot of manpower.

In China, similar services exist, such as DNSPod's D Monitoring, which recently released a custom monitoring feature. It is compatible with the Graphite reporting interface, so you can deploy collectd to report system metrics. If you want to perform business monitoring, Graphite also has SDKs in various languages. The open-source utilities and software around Graphite can meet many requirements.
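For reference, Graphite's plaintext reporting interface accepts one metric per line in the form "path value timestamp". The Python sketch below formats and sends such lines over TCP. The metric path is invented for the example, the host is a placeholder you must supply, and 2003 is the usual default port of a Graphite Carbon listener; whether a hosted service uses the same endpoint details is an assumption to verify with that service.

```python
import socket
import time


def format_graphite(path, value, timestamp=None):
    """Build one Graphite plaintext line: '<path> <value> <timestamp>\\n'."""
    timestamp = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {timestamp}\n"


def send_graphite(lines, host, port=2003):
    """Send already formatted lines to a Graphite-compatible listener over TCP."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


if __name__ == "__main__":
    # Hypothetical metric path; printed instead of sent so it runs anywhere.
    print(format_graphite("web01.api.login.calls", 42), end="")
```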

If the number of servers is not large, for example fewer than 200, Nagios is worth trying. It is very convenient for monitoring fundamentals such as CPU, memory, and hard disk, and it is also easy to monitor your own services by combining some plug-ins and writing simple scripts.

If the number of servers is large, for example more than 1,000, collecting this information efficiently is complicated. Just thinking about how much data is sent to your monitoring server every second is a headache. You therefore need to design the topology carefully and write a lot of code. Of course, this information can also be collected with an existing open-source framework.
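A back-of-the-envelope calculation shows why the volume matters. The Python sketch below estimates the ingest rate at the monitoring server; every number in the example (200 metrics per server, a 10-second interval, 50 bytes per data point) is an assumption picked for illustration, not a measurement.

```python
def ingest_rate(servers, metrics_per_server, interval_s, bytes_per_metric):
    """Return (data points per second, bytes per second) at the center."""
    points_per_s = servers * metrics_per_server / interval_s
    return points_per_s, points_per_s * bytes_per_metric


if __name__ == "__main__":
    # Assumed workload: 1,000 servers, 200 metrics each, reported every 10 s,
    # roughly 50 bytes per data point on the wire.
    points, nbytes = ingest_rate(1000, 200, 10, 50)
    print(f"{points:.0f} points/s, about {nbytes / 1e6:.1f} MB/s")
```

Under these assumptions a single center would receive 20,000 data points per second, which is why large deployments usually aggregate per data center before forwarding.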

