"Editor's note" based on Dataloop.io's founder and CEO David Gildeh on the current situation of the monitoring tool market and the prospect of future development, we will expand the discussion.
Why is monitoring still a mess?
To research the market and build better monitoring tools, David Gildeh interviewed more than 60 European and American online service providers, ranging from online service giants such as the BBC to small startups in London and the United States. He found that most of these services run on public cloud infrastructure (such as AWS) and follow DevOps practices.
As more and more businesses adopt cloud services and try to build DevOps environments, cloud monitoring has become a pressing need.
To develop a better monitoring tool, we must first answer two questions:
- Which monitoring tools are enterprises currently using, and how many servers do they have?
- Which problems do these monitoring tools address, and how do those problems relate to the number of servers and the deployment environment?
David Gildeh's findings tell us two things.
- First, in some respects monitoring is still poor; this is explained in more detail below.
- Second, monitoring remains a problem because more and more companies are turning to microservices.
What monitoring tools are being used by the enterprise?
Although many new tools have entered the market since 2011, many "old" open source tools, such as Nagios and Zabbix, still dominate it. The interviews found that 70% of companies still use these traditional tools for core system monitoring and alerting (see Figure 1). Data from DZone shows that 43% of the site's enterprise developer users have used Nagios, Zabbix, or Icinga. Nagios is also one of the most popular monitoring tools, holding a 29% market share [1].
The data also shows that about 70% of companies use multiple monitoring tools at the same time, often two or more products side by side. In the European and American markets, the combination of Nagios and Graphite is the most common. As far as we know, a large number of users in China use Zabbix + Grafana. As for paid performance monitoring tools such as New Relic, most users are reluctant to upgrade from the free edition to the paid one because of the price.
However, there are also many "trendy" monitoring tools on the market, aimed primarily at startups. They include newer SaaS monitoring tools such as Cloud Insight (a system monitoring platform), Application Insight (an application performance monitoring platform), and Browser Insight (a front-end performance monitoring platform). Old-fashioned open source tools such as Cacti and Munin are also prominent in this group. Such tools cost less: the free version of Cloud Insight, for example, basically meets the requirements and is well suited to startups and small and medium-sized teams. Because they are easier to deploy and learn and more flexible than Zabbix, they can greatly reduce the staff and time that operations require, and thanks to their strong visualization they are very popular among small and medium-sized teams running 10-40 servers.
[Figure: the Cloud Insight hostmap feature]
How many servers are monitored by the enterprise?
If you compare tool usage with the number of servers a company manages (from startups with fewer than 20 servers to large online services with more than 1,000), you'll find that old-fashioned open source tools (such as Nagios) and paid on-premises tools tend to hold a larger share among companies with larger deployments, while smaller firms prefer developer-focused tools such as Graphite, Logstash, and OneAPM.
On the other hand, small teams often do not practice DevOps and have no dedicated ops staff, so developers tend to choose simple, easy-to-install SaaS monitoring tools or tools popular in the developer community, such as Cloud Insight (http://docs-ci.oneapm.com) in China, Datadog in Europe and America, and Logstash. Once a company's server count reaches a tipping point somewhere between 50 and 100, it can usually afford to bring in DevOps/operations staff or a team, and then starts using time-tested monitoring tools with a broad user base, such as Nagios and Zabbix.
Key trends
Based on interviews with more than 60 online service companies about their monitoring strategies, David Gildeh identified the following four major trends.
1. Build and expand
78% of online services run their own open source monitoring solutions, and many companies spend 4-6 months building a solution from open source components and then tuning it for their environment. The key issue is that many of these tools were designed 10-15 years ago, long before cloud architecture, DevOps, and microservices appeared. Companies therefore have to spend a lot of time tweaking these old-fashioned tools to make them work in today's dynamic environments, which is exhausting work.
Once a company has built and tuned its monitoring system, it needs still more time to adapt that system to the growing volume of data as the business grows. For example, one large online service with more than 1,000 instances on AWS suffered a Zabbix server outage after the 2 TB of storage behind its backend MySQL database filled up. In the end, they simply kept restarting the database rather than trying to scale Zabbix.
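The storage problem described above can be anticipated with some rough arithmetic. The following is only a back-of-the-envelope sketch: the 1,000 hosts and 2 TB come from the example, while the metrics per host, collection interval, and bytes per sample are assumed values chosen for illustration. It also ignores indexes, aggregated history, and any data retention or cleanup the monitoring tool performs.

```python
def days_until_full(capacity_bytes, hosts, metrics_per_host, interval_s, bytes_per_sample):
    """Estimate how many days of raw samples fit into the given capacity."""
    samples_per_day = hosts * metrics_per_host * (86_400 / interval_s)
    bytes_per_day = samples_per_day * bytes_per_sample
    return capacity_bytes / bytes_per_day


days = days_until_full(
    capacity_bytes=2 * 10**12,  # 2 TB, from the example above
    hosts=1000,                 # "more than 1,000 instances"
    metrics_per_host=100,       # assumption
    interval_s=60,              # assumption: one sample per metric per minute
    bytes_per_sample=50,        # assumption: rough on-disk cost of one sample
)
print(f"Storage lasts roughly {days:.0f} days without cleanup")
```

Under these assumed numbers the disk fills in well under a year, which is why retention policies and capacity planning need to be part of running such a system rather than an afterthought.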
2. Spam Alerts
The companies interviewed all complain about the same problem: over-alerting. Clearly no tool, not even those that claim advanced machine learning algorithms, has solved alert fatigue. The problem will only get worse as businesses keep adding servers and running microservices in constantly changing cloud environments. Although many vendors boast about them in their marketing, in practice these machine learning algorithms for anomaly detection or alert prediction do not deliver what people actually want. There is still a long way to go before these tools can automatically filter out alert noise.
One company receives about 5,000 alert emails a day. At that volume the alerts become noise, and most teams simply filter them into a folder or delete them automatically.
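One simple, non-machine-learning way to cut this kind of noise is to suppress repeats of the same alert within a time window. The sketch below is illustrative only and is not taken from any of the tools mentioned here: the class name `AlertSuppressor`, the five-minute window, and the sample events are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class AlertSuppressor:
    """Lets an alert through only if the same (host, check) pair
    has not been notified within the last `window_s` seconds."""
    window_s: float = 300.0
    last_sent: dict = field(default_factory=dict)

    def should_notify(self, host: str, check: str, now: float) -> bool:
        key = (host, check)
        last = self.last_sent.get(key)
        if last is not None and now - last < self.window_s:
            return False  # duplicate inside the window: suppress it
        self.last_sent[key] = now
        return True


suppressor = AlertSuppressor(window_s=300)

# Hypothetical alert stream: (host, check, timestamp in seconds)
events = [
    ("web-1", "cpu_high", 0),
    ("web-1", "cpu_high", 60),
    ("web-1", "cpu_high", 120),
    ("web-2", "disk_full", 130),
    ("web-1", "cpu_high", 400),
]

sent = [(h, c, t) for h, c, t in events if suppressor.should_notify(h, c, t)]
print(f"{len(sent)} of {len(events)} alerts delivered")
```

Windowed suppression does not decide which alerts matter, so it only reduces volume; grouping, escalation, and threshold tuning would still be needed on top of it.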
3. Data silos
Many of the companies interviewed collect real-time data. The data sources include business metrics such as registrations, number of payments, or revenue, which teams use to better understand how the company's service is doing. However, most of the monitoring tools they use have poor usability and outdated UIs, so the collected data stays siloed within the operations team, and other stakeholders cannot easily see the value of this real-time data.
Some services do address data silos by creating custom dashboards, displaying them on TVs in the office, or sharing them via URLs. Cloud Insight, for example, follows a "DevOps + collaboration" concept: using its API and SDK, you can upload custom data to dashboards, including performance data, business data, and operational data, and display it together in real time in many chart forms (line charts, bar charts, pie charts, ...). The upcoming dashboard sharing feature will allow dashboards to be shared in real time. There is near consensus on this point: when a company's monitoring data is easy to share, the monitoring tool shows its value in collaboration between teams, for example by identifying areas that need improvement and giving real-time performance visibility across the business.
Will this be the next trend in system monitoring? We will wait and see.
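To make the idea of uploading business data to a dashboard more concrete, here is a minimal sketch of how a custom metric might be packaged before being sent to a monitoring API. The payload schema, the function `build_metric_payload`, and the metric and tag names are all hypothetical; they do not describe the actual API of Cloud Insight or any other product, so consult the vendor's documentation for the real format.

```python
import json
import time


def build_metric_payload(name, value, tags=None, timestamp=None):
    """Package one data point in a hypothetical JSON schema."""
    return {
        "metric": name,
        "value": value,
        "timestamp": int(timestamp if timestamp is not None else time.time()),
        "tags": sorted(tags or []),
    }


# Hypothetical business metric: number of sign-ups in the last minute.
payload = build_metric_payload(
    "business.signups",
    42,
    tags=["team:growth", "env:prod"],
    timestamp=1_700_000_000,
)
body = json.dumps(payload)  # this string would be the HTTP request body
print(body)
```

Tagging each data point with the owning team and environment is one way to let developers, operations, and business stakeholders filter the same data source in their own dashboards.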
4. Microservices
A key trend in online services is the microservices deployment model, in which independent, cross-functional development teams deploy their own services and support them in production. This strategy allows a large, complex application to scale well. However, it greatly increases the number of servers and services that the DevOps/operations team has to support, so the model only works if the development team becomes the first line of support when a problem occurs.
In this model, ops becomes a "platform" team that provides common tools and processes for the development teams. The platform this team provides includes self-service monitoring, meaning developers must be able to add monitoring and create their own dashboards and alerts independently.
For companies where monitoring data is easy to share, the monitoring tool becomes far more valuable.
Keeping up with rapid deployments and fast-changing instances in the microservices model is a huge challenge for old-fashioned monitoring tools. Zabbix and Nagios themselves also struggle to visualize complex task flows across services and to handle highly dynamic scaling. To make matters worse, current monitoring tools were clearly not designed around the microservices model; most have poor usability and low adoption outside the ops team.
As a result, the new deployment model calls for new monitoring tools designed specifically for microservices, so that operations and development teams can collaborate around the same source of performance data, rather than developers using their own tools (such as New Relic or OneAPM) while ops uses its own (such as Nagios and Zabbix), each isolated from the other.
Conclusion
In the four years since the "#monitoringsucks" event, a wide variety of monitoring tools has emerged. But David Gildeh's research and our own show that many companies are still struggling with monitoring. We believe the main reason is that many new monitoring tools focus only on technical monitoring and do too little to encourage adoption outside the ops team. We also believe that connecting development, operations, and even other departments through reliable monitoring, so that everyone in the enterprise can make decisions based on data, is where IT teams' monitoring of their products is headed.
[1] DZone Performance & Monitoring Survey
[2] http://www.slideshare.net/adriancockcroft/software-architecture-monitoring-microservices-a-challenge
Cloud Insight integrates monitoring, management, computing, collaboration, and visualization to help IT companies, especially small and medium-sized teams, reduce the staff and time they spend on system monitoring. With simpler deployment, more comprehensive data, and better visualization, it makes operations simpler and more efficient.
This article is reposted from the official OneAPM blog.
Who will be Zabbix and Nagios's successor?