Yesterday a colleague told me that Cacti had suddenly stopped collecting SNMP data from one of our servers. I said I would take a look and hurried over. After logging on to the server I did a quick check: port 161 was listening and the snmpd process looked normal. I restarted the snmpd service and used
snmpwalk -v 2c -c public localhost .1.3.6.1.2.1.1.3
to pull some data, and it worked. I thought the problem was solved, but the same fault reappeared about 10 seconds later. Running netstat -an | grep -w 161 showed an abnormal Recv-Q value: normally it should be 0, but now it was 86533, meaning data had been received but was sitting in the socket buffer waiting for snmpd to read it. At first we suspected the server was under attack, since the community string we used was the default "public", so we changed it to "pub" in the SNMP configuration file. After restarting the service everything was normal again: netstat -an | grep -w 161 showed Recv-Q back at 0, and no further errors appeared in testing.
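For reference, the check looked roughly like this (the output below is illustrative, not a capture from the actual box):

# netstat -an | grep -w 161
udp    86533      0 0.0.0.0:161             0.0.0.0:*

The second column is Recv-Q. A persistently non-zero Recv-Q on the UDP socket means the kernel is queuing datagrams faster than snmpd is reading them.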
Later my colleague came back and I asked him what he had changed. He said he had stopped the NFS service after noticing it was not needed on that server. I walked him through how I had worked around the problem, and he said it probably had nothing to do with an external attack; more likely it was some internal conflict. So we changed the community string back to "public", and sure enough the fault came right back. Since the trouble started when NFS was disabled, it had to be related to NFS. We confirmed the NFS service really was not required, uninstalled it, and the problem was gone.
Checking online, I found that the problem may be "overflow caused by IP packet fragmentation". The following is what the material said:
When rsize/wsize is larger than the network MTU (1500 on most networks, unless jumbo frames are supported), IP packets are fragmented during UDP transmission. Large numbers of IP fragments consume a lot of CPU at both ends of the connection and make the communication unstable, because if any single fragment of a UDP datagram is lost, the entire RPC must be retransmitted. Every increase in RPC retransmissions drives up latency, and this is the biggest bottleneck of NFS over UDP performance.
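In our case we simply removed NFS, but if the mounts had been needed, the usual workarounds are to keep rsize/wsize at or below the MTU, or to switch the transport to TCP. A rough sketch (server name and paths here are placeholders):

# mount -t nfs -o rsize=1024,wsize=1024 server:/export /mnt
# mount -t nfs -o tcp server:/export /mnt

With TCP, a lost segment triggers a TCP-level retransmission of just that segment instead of a retransmission of the whole RPC.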
If your network topology is complex, the fragments of a UDP datagram may take different routes and may not all reach the server in time. The kernel limits the memory used to cache fragments awaiting reassembly; the upper bound is set by ipfrag_high_thresh. You can view the current values in /proc/sys/net/ipv4/ipfrag_high_thresh and /proc/sys/net/ipv4/ipfrag_low_thresh. Once the memory consumed by unreassembled fragments exceeds ipfrag_high_thresh, the kernel discards fragments until usage drops back to ipfrag_low_thresh.
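You can inspect and raise the thresholds like this (the values shown are typical defaults on older kernels; check your own system):

# cat /proc/sys/net/ipv4/ipfrag_high_thresh
262144
# cat /proc/sys/net/ipv4/ipfrag_low_thresh
196608
# echo 524288 > /proc/sys/net/ipv4/ipfrag_high_thresh

Both values are in bytes.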
Another thing to monitor is the Ip: ReasmFails counter in the file /proc/net/snmp. This is the number of IP reassembly failures; if the value increases quickly while large files are being transferred, you are probably hitting the problem described above.
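A quick way to watch the counter from the shell (netstat -s reports the same statistic in readable form):

# grep '^Ip:' /proc/net/snmp
# netstat -s | grep -i reassembl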
If you are using an NFS client, you may want to monitor IP reassembly failures (the kernel's failures to reassemble fragmented packets), which can be read through the SNMP variable IP-MIB::ipReasmFails.0. Here is a simple command:
# snmpwalk -v 2c -c public localhost IP-MIB::ipReasmFails.0
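A typical response looks like this (the value is illustrative):

IP-MIB::ipReasmFails.0 = Counter32: 0

If the counter climbs steadily under NFS load, fragmentation is likely the culprit.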
Although the problem was eventually solved, one question remains: why did everything go back to normal when I changed the community string away from the default "public"? It seems I have more reading to do.
In short: disabling (but not removing) the NFS service left Cacti unable to collect SNMP data from this server.