Zabbix performance optimization practices

Source: Internet
Author: User

Recently I have been building monitoring for our data platform. I chose zabbix because I have used it before and because I prefer having the monitoring data stored in a database (nagios and cacti cannot compare with zabbix here), which makes further analysis and processing, such as capacity planning, easier. For scalability and performance, a master --- proxy architecture is used. The proxies run in active mode, which reduces the pressure on the master. Several problems came up along the way:

1. Understanding zabbix's own performance. As a first step, I added monitoring of zabbix-related metrics (see: http://1662935.blog.51cto.com/1652935/1345664).

2. Adding hosts to monitoring. I developed a front-end page that uses the zabbix api to add a host with one click, linking its template and assigning its groups in the same step. The host-to-template link is currently matched by host name, which is hard to maintain. Because no cmdb is available yet, my colleagues and I discussed it and plan to maintain our own set of mappings (host -- process, process -- template), updated automatically every day.
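The one-click add described in item 2 can be sketched as a single `host.create` request to the zabbix api. This is a minimal sketch, not the author's actual page: the prefix-to-template table, the template and group IDs, the host name and the token are all hypothetical, and the top-level `auth` field reflects the older JSON-RPC style (newer zabbix releases pass the token differently). The block only builds the request body; sending it to the API endpoint is left out.

```python
import json

# Hypothetical mapping from hostname prefixes to zabbix template IDs.
# Prefixes and IDs are illustrative only, not taken from the article.
TEMPLATE_BY_PREFIX = {
    "hadoop-dn": "10101",
    "hadoop-nn": "10102",
}
DEFAULT_TEMPLATE_ID = "10001"  # assumed ID of a generic base template


def pick_template(hostname):
    """Return the template ID whose prefix matches the host name."""
    for prefix, template_id in TEMPLATE_BY_PREFIX.items():
        if hostname.startswith(prefix):
            return template_id
    return DEFAULT_TEMPLATE_ID


def build_host_create(hostname, ip, group_id, auth_token, request_id=1):
    """Build the JSON-RPC body for the zabbix api method host.create."""
    return {
        "jsonrpc": "2.0",
        "method": "host.create",
        "params": {
            "host": hostname,
            "interfaces": [{
                "type": 1,        # 1 = zabbix agent interface
                "main": 1,        # default interface for this type
                "useip": 1,       # connect by IP rather than DNS
                "ip": ip,
                "dns": "",
                "port": "10050",  # default agent port
            }],
            "groups": [{"groupid": group_id}],
            "templates": [{"templateid": pick_template(hostname)}],
        },
        "auth": auth_token,       # assumption: older API auth style
        "id": request_id,
    }


payload = build_host_create("hadoop-dn-017", "10.0.0.17", "8", "<token>")
print(json.dumps(payload, indent=2))
```

Keeping the name matching in one small function is what makes the later move to a maintained host -- process -- template mapping cheap: only `pick_template` would need to change.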
3. Broken graphs. After adding 200 machines to one proxy, the graphs started to show gaps. Analyzing the history data in the zabbix server database showed data loss: the item interval is 60 s, so there should be 60 values per hour, but the database held only a dozen or so. The items table in the proxy database showed that the delay settings were correct, which ruled out a config sync problem. The agent-side logs then showed that data was only partially collected from the agents (the agents were in passive mode); in other words, the proxy was too busy to poll everything. The fix is to raise StartPollers. The default value is 5, which is far from enough with passive agents; I recommend setting it to about the number of hosts * 1.5.

4. Unreachable problems.

1) A large number of host unreachable alarms fired (on the agent.ping item) although the hosts were reachable. A network monitoring script ruled out connectivity problems along the agent --- proxy --- master path. The remedy is to increase StartPollersUnreachable and UnreachablePeriod.

2) Missed alarms. zabbix does not generate an alarm on an OK ---> unknown transition, so the unreachable alarm cannot detect that a host's items have stopped returning values. Monitoring the host update percent covers this case (see http://1662935.blog.51cto.com/1652935/1345789).

5. The overall update percent of the cluster was very low.

Breaking the figure down by host showed that a few hosts were dragging it down (the agent on several machines had a problem, leaving their values in the unknown state). After these were repaired, the overall update percent rose to about 98%.
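The arithmetic behind item 3 is simple enough to script. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the actual sample count would come from a query against the history table for one item, and the 1.5 factor is the article's rule of thumb rather than an official zabbix recommendation.

```python
import math


def expected_samples(window_seconds, interval_seconds):
    """Number of values an item should produce in the given time window."""
    return window_seconds // interval_seconds


def completeness(actual_samples, window_seconds, interval_seconds):
    """Fraction of the expected values that actually reached the database."""
    return actual_samples / expected_samples(window_seconds, interval_seconds)


def suggested_start_pollers(host_count, factor=1.5):
    """Rule of thumb from the article: hosts * 1.5, rounded up."""
    return math.ceil(host_count * factor)


# One hour of a 60 s item with only 12 stored values, as in the broken graphs.
print(expected_samples(3600, 60))
print(completeness(12, 3600, 60))

# StartPollers suggestion for the 200 passive hosts behind one proxy.
print(suggested_start_pollers(200))
```

A completeness value well below 1.0 for many items on the same proxy points at the proxy's pollers rather than at individual agents, which is the pattern described above.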

6. Proxy server load. One proxy serves about 350 hosts at roughly 200 nvps, yet its load was relatively high. With the agents in passive mode, data collection is the proxy's job, so the more items there are, the heavier the pressure on the proxy. I switched the agents to active mode to push that work out to the agents, leaving the proxy responsible only for data sync and config sync. After the change the pressure on the proxy dropped considerably (in the graphs, items without data are those not yet changed to active). The oversized queue was resolved as well: after the change, no item is delayed by more than 5 minutes.

7. Housekeeper. The housekeeper problem exists on both the master side and the proxy side (on the proxy the housekeeper cannot be disabled). On the master it can be solved by disabling the housekeeper and partitioning the database; because this requires downtime for maintenance, it has not been done yet.
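To make the load shift in item 6 concrete, the polling rate can be estimated from item counts and update intervals, since each item contributes 1/interval new values per second. This is a rough sketch: the item counts below are hypothetical and chosen only so that the total matches the article's figure of about 200 nvps, and the split between passive and active items after the switch is an assumption.

```python
def estimate_nvps(items_by_interval):
    """Estimate new values per second.

    items_by_interval maps an update interval in seconds to the number
    of items using that interval.
    """
    return sum(count / interval for interval, count in items_by_interval.items())


# Hypothetical inventory: 12000 items at a 60 s interval is about 200 nvps.
passive_before = {60: 12000}
print(estimate_nvps(passive_before))

# After switching most items to active checks (assumed split), the proxy
# pollers fetch only the remaining passive items; the agents push the rest.
passive_after = {60: 600}
active_after = {60: 11400}
print(estimate_nvps(passive_after))
print(estimate_nvps(active_after))
```

The total nvps the proxy forwards to the master is unchanged; what drops is the share its own pollers must fetch one connection at a time.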

8. db partition

http://caiguangguang.blog.51cto.com/1652935/1354093

With the adjustments above, zabbix is under basically no pressure (350 servers on a single proxy), and it scales well. A benchmark test is still needed to see how many nvps it can handle.

Conclusion: before optimizing zabbix performance, monitor zabbix's own performance first. When tuning, think about how to spread the load: from the master to the proxies, and from the proxies to the agents. Understand how zabbix works and what each of its processes does, and get to know the zabbix database table structure well.

This article is from the "Food light blog" blog; please be sure to keep this source: http://caiguangguang.blog.51cto.com/1652935/1346372
