Zabbix Alarm
When Zabbix raises an alarm for packet loss, increased latency, a host going down, or a similar problem on a link between machine rooms, log in to the affected machines and run a two-way MTR, that is, one trace from each end toward the other.
Post the two-way MTR results directly in the fault group chat shared by both machine rooms and @-mention the relevant technical staff to speed up the response.
The MTR output shows the node IPs at which packets are lost, latency rises, or the path stops responding. Look up where those IPs are located. If the problem IPs in both directions of the MTR are in the same city, focus the follow-up on the machine room in that city.
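As a rough illustration of reading the MTR results, the sketch below parses the text printed by `mtr --report -n` and lists the hops whose loss is at or above a threshold. The sample report, the host name `web01`, the documentation-range IPs, and the 5% threshold are assumptions made for the example, not values taken from this procedure.

```python
import re

# One hop line of `mtr --report -n` output looks like:
#   "  2.|-- 192.0.2.1      20.0%    10    1.2   1.3   1.0   2.1   0.4"
# Columns after the host: Loss%, Snt, Last, Avg, Best, Wrst, StDev.
HOP_RE = re.compile(
    r"^\s*(\d+)\.\|--\s+(\S+)\s+([\d.]+)%?\s+(\d+)\s+([\d.]+)\s+([\d.]+)"
)

# Illustrative report only; not real measurement data.
SAMPLE = """\
HOST: web01                       Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.0.0.1                   0.0%    10    0.3   0.3   0.2   0.5   0.1
  2.|-- 192.0.2.1                 20.0%    10    1.2   1.3   1.0   2.1   0.4
  3.|-- 198.51.100.7              30.0%    10   25.4  26.0  24.9  30.2   1.6
"""


def parse_mtr_report(text):
    """Return one dict per hop: hop number, host, loss percent, average RTT."""
    hops = []
    for line in text.splitlines():
        m = HOP_RE.match(line)
        if m:
            hops.append({
                "hop": int(m.group(1)),
                "host": m.group(2),
                "loss_pct": float(m.group(3)),
                "avg_ms": float(m.group(6)),
            })
    return hops


def lossy_hops(hops, min_loss_pct=5.0):
    """Keep only hops whose loss is at or above min_loss_pct (assumed threshold)."""
    return [h for h in hops if h["loss_pct"] >= min_loss_pct]


if __name__ == "__main__":
    for h in lossy_hops(parse_mtr_report(SAMPLE)):
        print(h["hop"], h["host"], h["loss_pct"])
```

Running the same parser on the report from each side gives two lists of problem IPs whose locations can then be looked up and compared.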
If the fault has been traced to a city and nobody responds in the group, promptly phone that machine room's 24-hour on-duty staff and tell them which IP node is failing.
Fault point not on backbone link
When an MTR from the server to the terminal shows packet loss starting at the first hop (the first hop is the switch), ping the switch IP from the server to confirm that packets are really being dropped. The switch IP is the server's default gateway address.
If pings to the switch drop packets, the fiber module may be causing the failure; call the network group members promptly.
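The gateway check can be sketched as follows, assuming a Linux server with the `ip` and `ping` commands available. The helper names and the default of 20 probes are hypothetical choices for the example; only the two parsing functions are exercised without touching the network.

```python
import re
import subprocess


def parse_default_gateway(route_output):
    """Extract the gateway IP from `ip route show default` output."""
    m = re.search(r"^default via (\S+)", route_output, re.MULTILINE)
    return m.group(1) if m else None


def parse_ping_loss(ping_output):
    """Extract the loss percentage from the ping summary line."""
    m = re.search(r"([\d.]+)% packet loss", ping_output)
    return float(m.group(1)) if m else None


def gateway_loss(count=20):
    """Ping the default gateway (the switch) and return the loss percentage.

    Returns None if no default route or no ping summary is found.
    """
    routes = subprocess.run(
        ["ip", "route", "show", "default"],
        capture_output=True, text=True,
    ).stdout
    gateway = parse_default_gateway(routes)
    if gateway is None:
        return None
    ping = subprocess.run(
        ["ping", "-c", str(count), gateway],
        capture_output=True, text=True,
    ).stdout
    return parse_ping_loss(ping)
```

A non-zero result from `gateway_loss()` supports the first-hop loss seen in the MTR and is the cue to call the network group.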
If the MTR shows serious packet loss at the second hop, the preliminary judgment is a problem with the machine room's equipment (including agents); contact the machine room staff directly.
Ensure business is not affected
If the machine room says, when contacted, that the fault cannot be fixed promptly, cut business traffic away from the affected machine room.
Contact the network group if you run into a situation that you cannot handle promptly.
When a fault is more than one person can handle, contact the network group to deal with the network failure.
Failure recovery
If the fault is continuous, intermittent, or caused by physical factors, do not cut traffic back.
Once recovery is confirmed, meaning that MTR, ping, and wget all return normal values, traffic can be cut back and normal use resumed. If necessary, control how much traffic is cut back by adjusting the polling ratio.
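One way to make the recovery check explicit is sketched below. The procedure does not define "normal values", so the zero-loss and 50 ms limits, the 0 to 100 weight scale, and the four-step ramp are all assumptions to be replaced with the values your own environment uses.

```python
def path_is_healthy(mtr_loss_pct, ping_loss_pct, avg_rtt_ms, wget_ok,
                    max_loss_pct=0.0, max_rtt_ms=50.0):
    """Return True only if MTR, ping, and wget all look normal.

    max_loss_pct and max_rtt_ms are assumed limits, not values from the procedure.
    """
    return (
        mtr_loss_pct <= max_loss_pct
        and ping_loss_pct <= max_loss_pct
        and avg_rtt_ms <= max_rtt_ms
        and wget_ok
    )


def cutback_steps(target_weight, steps=4):
    """Split the return to target_weight into equal, increasing polling weights."""
    return [round(target_weight * i / steps) for i in range(1, steps + 1)]


if __name__ == "__main__":
    if path_is_healthy(0.0, 0.0, 12.5, wget_ok=True):
        print(cutback_steps(100))
```

Raising the polling weight in stages keeps the amount of traffic exposed small if the fault reappears during the cut-back.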
Recording
Record the failure time from the Zabbix alarm, and take the recovery time from the moment the tests show that the fault has cleared.
If several machine rooms all report faults toward the same machine room, that machine room is most likely the cause, so it is enough to record the fault for that machine room only.
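The "several rooms, one target" rule can be expressed as a small grouping step. In this sketch each alarm is a hypothetical `(source_room, destination_room)` pair, and the minimum of two distinct sources is an assumed cut-off.

```python
def common_destinations(alarms, min_sources=2):
    """Return destination rooms reported as faulty by at least min_sources rooms.

    alarms is an iterable of (source_room, destination_room) pairs.
    """
    sources_by_dst = {}
    for src, dst in alarms:
        sources_by_dst.setdefault(dst, set()).add(src)
    return sorted(
        dst for dst, srcs in sources_by_dst.items() if len(srcs) >= min_sources
    )


if __name__ == "__main__":
    # Room names are placeholders for illustration.
    alarms = [("room-A", "room-C"), ("room-B", "room-C"), ("room-A", "room-D")]
    print(common_destinations(alarms))
```

Any room returned here is the one to record; the individual source-side alarms pointing at it do not need separate entries.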
Record the name of the person on duty and send the email.
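A minimal sketch of such a record is shown below. The field names, the room and person names, and the summary layout are hypothetical; the procedure only requires the failure time, the recovery time, the room, and the person on duty.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FaultRecord:
    room: str
    failed_at: datetime      # taken from the Zabbix alarm
    recovered_at: datetime   # when tests showed the fault had cleared
    on_duty: str

    def duration_minutes(self):
        """Length of the outage in minutes."""
        return (self.recovered_at - self.failed_at).total_seconds() / 60

    def summary(self):
        """Plain-text summary that can be pasted into the report email."""
        return (
            f"Room: {self.room}\n"
            f"Failed: {self.failed_at:%Y-%m-%d %H:%M}\n"
            f"Recovered: {self.recovered_at:%Y-%m-%d %H:%M}\n"
            f"Duration: {self.duration_minutes():.0f} min\n"
            f"On duty: {self.on_duty}"
        )


if __name__ == "__main__":
    record = FaultRecord(
        room="room-C",
        failed_at=datetime(2024, 1, 1, 10, 0),
        recovered_at=datetime(2024, 1, 1, 10, 45),
        on_duty="Alice",
    )
    print(record.summary())
```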
Network 24-hour on-duty FAQ