1. Symptoms of the problem
On Zabbix, a few IPs repeatedly raised the error "Zabbix agent on the XXXX is unreachable for ten minutes", even though testing the client port with telnet worked fine. After a few minutes, or a few tens of minutes, the alert cleared, and then it fired again. The data curves for those machines were therefore intermittent. The impact was not large, but it was very annoying. The SA did not look into it closely either; they just restarted the service without checking whether that actually solved the problem.
Then I logged in and checked the server log. Besides messages saying that the data for a key could not be obtained because the request timed out, there was another one: "first network error". Testing manually with zabbix_get, I could usually fetch data from the agent without problems, but there were also delays and failures. I could not find the reason for the failures.
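An intermittent failure like this is easier to pin down with a repeated probe than with a single telnet test. The following is a minimal sketch and not part of the original investigation: it opens a TCP connection to the agent port several times and reports the fraction of attempts that failed. The function names `probe_port` and `failure_rate` and all default values are hypothetical; 10050 is the agent port mentioned in this article. Note that it only checks that the port accepts connections, it does not speak the Zabbix protocol.

```python
import socket
import time


def probe_port(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def failure_rate(host, port, attempts=20, pause=0.0, timeout=3.0):
    """Probe the port several times and return the fraction of failed attempts."""
    failures = 0
    for _ in range(attempts):
        if not probe_port(host, port, timeout):
            failures += 1
        time.sleep(pause)
    return failures / attempts
```

A call such as `failure_rate("192.0.2.10", 10050, attempts=50, pause=1.0)` (the address is a placeholder) would give a rough figure for how often the agent port refuses or times out.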
Network connectivity should not have been the problem, even though the message says "first network error". I searched Baidu. Some posts say a wrong key can cause a misleading delay, but when I deliberately wrote a wrong key on my own virtual machine I could not reproduce this error. Others say the zabbix_server side is short of memory. I tried that too and could not reproduce the error; besides, the server's memory is itself monitored. And if memory were the problem, why would only these few machines report errors and not the others?
2. Find the cause of the problem
Considering all this, it had to be a problem on the client. I logged in and saw nothing wrong: the service was running normally, and the configuration file was no different from that of a normal machine. It was strange. Then it occurred to me to look at its connections to the server, and that check was revealing. The connections between port 10050 and the server had run through server-side ports from the 30,000s to above 60,000. Counting port 10050 alone, there were more than 25,000 connections in the TIME_WAIT state; even the connections from my earlier telnet tests were stuck there. When I stopped the zabbix_agent service, these states were not released. On reflection, my telnet tests alone had left 5 entries in there. These connections are zombies: they keep the agent port from being released and keep it too busy, so the server and the client cannot establish a connection and the check finally fails. Because the port is so busy, the server reports "first network error". On the server side, these connections had already been destroyed. Telnet still worked, and zabbix_get could still fetch some values, but a run of consecutive tests would include some failures. The agent was very busy, but not yet dead. It was busy to no purpose.
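The per-state count for port 10050 described above can be reproduced with a short script. This is a sketch under assumptions: it expects the text produced by `netstat -an` on Windows (columns: protocol, local address, foreign address, state), and the function name `count_states` and the addresses in `SAMPLE` are invented for illustration.

```python
from collections import Counter


def count_states(netstat_output, local_port):
    """Count TCP connection states for one local port in `netstat -an` text."""
    counts = Counter()
    suffix = ":" + str(local_port)
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected TCP row: proto, local address, foreign address, state.
        if len(fields) < 4 or fields[0].upper() != "TCP":
            continue
        if fields[1].endswith(suffix):
            counts[fields[3]] += 1
    return counts


# Made-up sample rows in the layout printed by `netstat -an`.
SAMPLE = """\
  TCP    0.0.0.0:10050          0.0.0.0:0              LISTENING
  TCP    10.0.0.5:10050         10.0.0.1:31022         TIME_WAIT
  TCP    10.0.0.5:10050         10.0.0.1:31023         TIME_WAIT
  TCP    10.0.0.5:10050         10.0.0.1:31024         ESTABLISHED
  TCP    10.0.0.5:3389          10.0.0.9:50211         ESTABLISHED
"""

print(count_states(SAMPLE, 10050))
```

On an affected machine the TIME_WAIT entry for the agent port would be in the tens of thousands, while a healthy one stays in the low tens.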
3. Analyzing the cause of the problem
The cause was now clear: Windows 2008 R2 failed to release the occupied ports, so the connections stayed in the TIME_WAIT state and normal data requests could not establish a connection. The server therefore treated this as a network error and logged "first network error". Because no data was available, the Zabbix page displayed "Agent is unreachable".
When I searched Baidu for the related error,
[Image: search results for "Windows ports cannot be released"]
I found many similar reports. According to the official website, a patch is needed. So it appears that even after restarting the server, this kind of problem would still occur.
Anyway, the cause was found, and it is a problem for the Windows engineers.
This process alone had generated more than 25,270 connections. Another one had generated more than 28,000. The command used here was copied on the spot. If the /c parameter is removed, the screen scrolls through the server-side ports from 30000+ to 60000+.
[Image: too many TIME_WAIT states that cannot be released]
4. Re-test the problem
On a normal machine the connections look like the following: a little over 20 ports, depending on the keys being monitored. They do turn over, though, and the server-side ports change with each refresh.
[Image: normal Zabbix agent connection status]
Then I edited the configuration file on the faulty client to change the listening port to 10051, and the connections still kept piling up.
[Image: more and more connections]
As the connections kept increasing, I stopped the service.
[Image: the failed connections do not close]
When the connections reached 100, I stopped the service, and the count stayed right there. On a normal machine the situation is:
[Image: normally the connections are destroyed after the service is stopped]
This normal machine had at most 27 connections. After I stopped the service, they began to decrease until none were left.
This is how these connections arise: the server uses a random port for each data fetch. After the data is fetched, the server destroys the connection, while the client side stays in the TIME_WAIT state. The next fetch uses a new, different port, which creates a new connection that then gets stuck in the same way. If the new port happens to be the same as a previous one, the data can be fetched smoothly; otherwise the fetch fails because the port has too many connections.
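The difference between the healthy and the faulty machine can be illustrated with a toy model. This is a sketch with invented numbers, not a measurement: each poll leaves one closed connection behind on the agent side; on a healthy machine the entry is removed after a fixed wait, on the faulty one it is never released. The function name `simulate_lingering`, the poll interval and the wait time are all hypothetical.

```python
def simulate_lingering(polls, poll_interval, wait_time=None):
    """Return how many closed connections still linger after each poll.

    wait_time is how long a closed connection stays in TIME_WAIT;
    None models the faulty machine, where entries are never released.
    """
    closed_at = []  # times at which connections were closed
    history = []
    for i in range(polls):
        now = i * poll_interval
        if wait_time is not None:
            # Drop entries whose wait has expired.
            closed_at = [t for t in closed_at if now - t < wait_time]
        closed_at.append(now)
        history.append(len(closed_at))
    return history


# Hypothetical figures: one poll every 5 s, entries released after 120 s.
healthy = simulate_lingering(polls=1000, poll_interval=5, wait_time=120)
faulty = simulate_lingering(polls=1000, poll_interval=5, wait_time=None)
print(max(healthy), faulty[-1])
```

With these made-up figures the healthy count levels off at 24 lingering connections, while the faulty count grows by one with every poll and reaches 1000. That matches the shape of what was observed: a few dozen connections on the normal machine, and a number that only ever increases on the faulty one.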
Other machines do not have this problem, although the agents were all deployed together and I tested with this same agent. So it can be concluded that the problem is caused by the system on these particular machines.
This article is from the "Sea Breeze" blog; please be sure to keep this source: http://beyondhf.blog.51cto.com/8953234/1691308
Zabbix_agent repeated alarms-logs show "first network error" issue