On September 4 and September 9, the company's Zabbix monitoring platform raised two large-scale alarms, both reading "Zabbix agent on * * * * * is unreachable for 5 minutes". Each time, every monitored host triggered this alarm.
Fault description: every monitored host raised the alarm, and all graph data was interrupted.
Action: The first thing I did was run the zabbix_get command on the Zabbix server. The data could be retrieved, and prefixing the command with "time" showed that the response time was also short.
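The timing check described above can also be scripted. The following is a minimal sketch, assuming zabbix_get is installed on the Zabbix server and the agent listens on the default passive port 10050. The helper names timed_run and zabbix_get_cmd are made up for illustration, and the address and item key in the usage note are placeholders.

```python
import subprocess
import time


def timed_run(cmd, timeout=30):
    """Run cmd (a list of arguments); return (elapsed_seconds, stdout, returncode)."""
    start = time.monotonic()
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    elapsed = time.monotonic() - start
    return elapsed, result.stdout.strip(), result.returncode


def zabbix_get_cmd(host, key, port=10050):
    """Build a zabbix_get command line: -s host, -p port, -k item key."""
    return ["zabbix_get", "-s", host, "-p", str(port), "-k", key]
```

Usage, with a placeholder address: elapsed, value, rc = timed_run(zabbix_get_cmd("192.0.2.10", "agent.ping")). A return code of 0 with a small elapsed time suggests that passive checks from the server to that agent work at that moment.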
Results: After about 10 minutes all the alarms cleared almost at once, the data on all the graphs was restored, and the graphs were continuous again.
Detection: After the alarms disappeared, the first thing I did was look at the logs on the Zabbix server side.
LOG: cannot send list of active checks to [* * *]: host [* * *] not found
item "vfs.fs.size[c:,used]" on host "* * *" failed: first network error, wait for ... seconds
The server log was full of entries like these. When I logged in to a client, its log mostly showed failed connections to port 10051, system interruptions, and similar errors. I searched Baidu and Bing at length but found nothing that solved my problem, so I could only ask my colleagues.
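A quick way to test from a client whether the server's port 10051 is reachable is a plain TCP connect. This is a minimal sketch; can_connect is a hypothetical helper, and the hostname in the usage note is a placeholder. Note that a single successful probe proves little when packets are only dropped intermittently, so it may be worth repeating the probe several times.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Usage, with a placeholder hostname: can_connect("zabbix.example.com", 10051).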
As of today I think the problem is solved. Since we could not find the cause anywhere else, we finally checked the system log, /var/log/messages, which had been reporting two kinds of errors:
16:17:24 localhost kernel: nf_conntrack: table full, dropping packet.
localhost rsyslogd-2177: imuxsock begins to drop messages from pid 21607 due to rate-limiting
The first says that the connection tracking table is full and the kernel has started dropping packets. The second says that rsyslog is dropping log messages because of rate limiting.
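To see how often these two messages occur, the system log can be scanned with a few lines of code. This is a minimal sketch: the patterns are based only on the two lines quoted above, the exact spacing may differ between syslog setups, and count_errors is a hypothetical helper name.

```python
import re

# Patterns for the two messages quoted above; \s* tolerates missing spaces.
PATTERNS = {
    "conntrack_full": re.compile(r"nf_conntrack:\s*table full, dropping packet"),
    "rsyslog_rate_limit": re.compile(r"imuxsock begins to drop messages"),
}


def count_errors(lines):
    """Count how many of the given log lines match each pattern."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Usage: open /var/log/messages with open("/var/log/messages", errors="replace") and pass the file object to count_errors. Reading that file usually requires root.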
I didn't pay attention to these at first, but my ops boss thought they were related to the Zabbix data loss. We kept digging and eventually found that the iptables firewall service was actually running. As we all know, the firewall and SELinux are usually turned off when Zabbix is first set up, but here the firewall was clearly getting in the way of the data: its connection tracking table filled up, after which packets were dropped.
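How close the connection tracking table is to its limit can be read from the kernel's sysctl files. This is a minimal sketch, assuming a kernel that exposes nf_conntrack_count and nf_conntrack_max under /proc/sys/net/netfilter (older kernels may use different paths). conntrack_usage is a hypothetical helper name.

```python
from pathlib import Path


def conntrack_usage(base="/proc/sys/net/netfilter"):
    """Return (count, maximum, ratio), or None if the conntrack files are absent."""
    base = Path(base)
    try:
        count = int((base / "nf_conntrack_count").read_text().strip())
        maximum = int((base / "nf_conntrack_max").read_text().strip())
    except FileNotFoundError:
        # No such files: connection tracking is probably not loaded.
        return None
    ratio = count / maximum if maximum else 0.0
    return count, maximum, ratio
```

A ratio close to 1.0 means the table is nearly full, which is the condition behind the "table full, dropping packet" message. Reading these files only inspects existing state.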
Then, in the root user's /root/.bash_history, I found that someone had run iptables -L to view the firewall rules. In the end my boss told me:
Remember: do not use iptables commands (such as iptables -nL) to view the current status while the firewall is stopped! Doing so causes the firewall to be started with an empty rule set. Although nothing is blocked, all connection states are tracked, which wastes resources, hurts performance, and may cause the firewall to drop packets on its own!
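One way to check whether the firewall modules are loaded without running iptables at all is to read /proc/modules, the same data that lsmod prints. This is a minimal sketch; the prefix list is an assumption about which module names matter, loaded_netfilter_modules is a hypothetical helper name, and the check cannot see code that is built into the kernel rather than loaded as a module.

```python
def loaded_netfilter_modules(proc_modules_text):
    """Return names of loaded modules related to iptables or connection tracking."""
    # Assumed prefixes; extend as needed for a given system.
    prefixes = ("nf_conntrack", "ip_tables", "iptable_", "ip6_tables", "ip6table_")
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first field of each /proc/modules line is the module name.
        if fields and fields[0].startswith(prefixes):
            names.append(fields[0])
    return names
```

Usage: read the file with open("/proc/modules") and pass its contents to loaded_netfilter_modules. An empty list suggests that none of these modules are currently loaded.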
All right. Finally, after shutting down the firewall, the "kernel: nf_conntrack: table full, dropping packet" message no longer appears.