This afternoon we helped nosi's engineers solve a small network fault. On the way back I replayed the whole troubleshooting process in my head and realized that the way we found the solution is the same way we solve problems in software engineering. It is basically a routine: the methodology is an abstraction that sits above any particular technology.
The problem: the afternoon before yesterday, our network management LAN suddenly began dropping a large number of packets. The network was severely congested, and data communication with one of the data centers (the IDC) was cut off completely. The whole LAN is a single broadcast domain, and my own hosts sit in that broadcast domain, so they were affected too. Without further ado, I went straight to Wireshark. The capture showed a flood of Ethernet II frames that carried only source and destination MAC addresses, with an unknown protocol, which clearly pointed to a broadcast storm. Because the IDC was hit hardest and was completely unreachable, we could not maintain the softswitch equipment there, so I called for a company car and went to the IDC to take a look.
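The symptom seen in the capture can be put into numbers with a quick check. The following is only a minimal sketch: it assumes the frames have already been exported from the capture as (source MAC, destination MAC) pairs and that the storm frames are addressed to the broadcast address ff:ff:ff:ff:ff:ff. The function names and the sample data are made up for illustration and do not come from the real capture.

```python
from collections import Counter

BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"


def broadcast_ratio(frames):
    """Return the fraction of frames whose destination is the broadcast MAC."""
    if not frames:
        return 0.0
    broadcast = sum(1 for _src, dst in frames if dst.lower() == BROADCAST_MAC)
    return broadcast / len(frames)


def top_talkers(frames, n=3):
    """Return the n source MACs that sent the most frames, with their counts."""
    return Counter(src for src, _dst in frames).most_common(n)


# Made-up sample: 8 broadcast frames from one source plus 2 unicast frames.
sample = [("00:11:22:33:44:55", "FF:FF:FF:FF:FF:FF")] * 8 + [
    ("00:aa:bb:cc:dd:01", "00:aa:bb:cc:dd:02"),
    ("00:aa:bb:cc:dd:02", "00:aa:bb:cc:dd:01"),
]

print(broadcast_ratio(sample))
print(top_talkers(sample, 1))
```

A ratio close to 1 on a management LAN that is normally quiet is a strong hint of a storm, and the top talkers show which source addresses keep circulating.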
When I opened the cabinet door, I saw that nosi had recently added a new pair of CEs and a pair of SPCs, and the four network management cables from these devices had all been plugged into the same network management switch. That switch cascades to the aggregation switch and then up to the province: a simple, crude topology. The time at which the management cables were connected and the time at which the fault appeared that afternoon were almost the same, so the two were probably related. Because the devices were not yet in service, we unplugged the four cables and tested again. The fault disappeared.
The first thought was that there was a loop, which is fairly obvious. Anyone with basic networking knowledge would think of it.
After returning, I informed the manufacturer of the problem and asked them to check whether the connections between the CEs and SPCs formed a loop. The manufacturer believed that the layer-3 connection between the CEs could not produce a loop.
Today the manufacturer asked me to go to the data center with them to look for the fault. I connected my laptop to the network management switch and pinged the gateway continuously to watch for packet loss.
The manufacturer said their checks had found no problem. I said: fine, then give me the topology:
The figure above shows the topology. Clearly, any of the four cables could be part of a loop.
I tore off a sheet of paper, copied the topology onto it, and labeled the four cables 1, 2, 3, 4. I told them: unplug all four first, then plug in whichever cable I name, one at a time, and we will see when packet loss appears.
Plug in 1: no packet loss
Plug in 2 as well: no packet loss
Plug in 3 as well: packet loss
Unplug 3, plug in 4: no packet loss
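The plug-and-test sequence above is a simple incremental isolation, and it can be sketched in a few lines. This is only an illustration: `find_faulty_links` and `simulated_probe` are hypothetical names, and the probe stands in for the manual ping test. It simply encodes what was observed on site, namely that loss appeared whenever cable 3 was plugged in.

```python
def find_faulty_links(links, has_packet_loss):
    """Plug links in one at a time; keep the good ones, pull the bad ones.

    links: labels of the suspicious cables, all assumed unplugged at the start.
    has_packet_loss: callable that takes the set of currently plugged links
        and returns True if the ping test shows loss (hypothetical probe).
    """
    plugged = set()
    faulty = []
    for link in links:
        plugged.add(link)
        if has_packet_loss(plugged):
            faulty.append(link)
            plugged.remove(link)  # unplug it again before trying the next one
    return faulty


def simulated_probe(plugged):
    """Stand-in for the manual ping test: loss whenever cable 3 is plugged in."""
    return 3 in plugged


print(find_faulty_links([1, 2, 3, 4], simulated_probe))
```

Starting from a known-good state matters here: each test changes exactly one thing, so a failure points at the cable that was just added.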
Clearly the problem was in the ring that cable 3 belongs to. So much for the manufacturer's layer-3 routing argument. I said it depends on whether there is also a layer-2 connection. At this point the manufacturer mentioned that the interfaces on CE1 and CE2 belong to the same VLAN. I said: there it is. Your two CEs are in the same VLAN 16 and both attach to one broadcast domain, so haven't we just built a loop?
The manufacturer then said they had checked the network management switch's configuration, that STP was not enabled on it, and that this was why the problem occurred. I did not agree. STP cannot be held responsible for the existence of a loop. It is unreasonable to assume that other devices have STP enabled, because the loop should be eliminated even without STP. STP is there for detection and protection, not to give you permission to build loops. I said the root cause is not STP. The root cause is that the CEs should not trunk VLAN 16 through transparently. Each device is already connected to the network management switch independently, and network management does not need protection switching, so this trunk configuration is redundant. The manufacturer nodded in agreement.
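The layer-2 argument can be checked mechanically: treat every device as a node and every link that bridges the same VLAN as an undirected edge, then look for a cycle. The sketch below uses union-find for that. The original figure is not reproduced here, so the device names and links are hypothetical; they only assume that CE1 and CE2 each have a management cable to the switch and that a trunk between them carries VLAN 16.

```python
def has_layer2_loop(links):
    """Return True if the undirected layer-2 links contain a cycle (union-find)."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return True  # a and b were already connected: this link closes a loop
        parent[root_a] = root_b
    return False


# Hypothetical topology: each CE has its own management cable to the switch.
management_cables = [("CE1", "MGMT-SW"), ("CE2", "MGMT-SW")]
# Assumed trunk between the CEs that passes VLAN 16 through at layer 2.
vlan16_trunk = [("CE1", "CE2")]

print(has_layer2_loop(management_cables))
print(has_layer2_loop(management_cables + vlan16_trunk))
```

With only the two management cables the graph is a tree. Adding the trunk connects two nodes that are already connected, which is exactly the condition for a broadcast frame to circulate when nothing such as STP blocks a port.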
We removed VLAN 16 from the trunk, and the problem was solved.
----------------------------------------------
This was a very small fault in network management maintenance, but thinking about it afterwards I found it interesting, because in several places the way I handled it reflected a programmer's way of thinking.
Steps:
| Step | Programming debug | Network troubleshooting |
|---|---|---|
| 1 | Analyze your architecture, locate the code modules that may be causing the problem, and estimate its approximate location. | Draw the network topology, locate possible loops, and mark the suspicious connections. |
| 2 | Roll back to a version that tests clean, then add code and test, add and test, until the problem appears. | Unplug all suspicious connections so the network returns to normal, then plug in and test, plug in and test, until the problem appears. |
| 3 | Find the problem code and analyze the cause. | Find the problem link and analyze the cause. |
| 4 | Modify data or refactor to solve the problem. | Modify the configuration data to solve the problem. |
Principles:
1. Do not assume your own code is fine:
A layer-3 connection has no loops? What about layer 2?
2. Do not blame other factors for your own problem, and do not make your code highly coupled. Running successfully should not depend on extra, non-essential prerequisites:
STP should not be held responsible for the existence of a loop. Eliminating loops should not take STP as a premise. The network should work normally without STP detection and protection.
Analogy:
The same holds for design patterns and communication protocols.
Take the layered network protocol stack (the seven-layer OSI model, or TCP/IP) as an example:
A lower layer is like an abstract class, and the upper layer is like an implementation. The implementation has to depend on the abstraction.
The lower layer encapsulates the upper layer's data, so one lower layer abstracts over and carries many different upper-layer implementations.
If there is a problem at the upper layer, first check whether the lower layer has a problem. In pattern-based architectures, too, problems tend to appear first where the abstraction layer is poorly encapsulated.
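The encapsulation idea in this analogy can be sketched as nested wrappers. This is only an illustration, not a real protocol implementation: the `Pdu` class, the helper functions, and the three layer names are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pdu:
    layer: str
    header: str
    payload: object  # an upper-layer Pdu, or the raw application data


def encapsulate(data, layers):
    """Wrap data layer by layer, top to bottom; return the outermost PDU."""
    pdu = data
    for layer, header in layers:
        pdu = Pdu(layer, header, pdu)
    return pdu


def unwrap(pdu):
    """List the layer names from the outermost PDU inward."""
    names = []
    while isinstance(pdu, Pdu):
        names.append(pdu.layer)
        pdu = pdu.payload
    return names


frame = encapsulate(
    "ping",
    [("transport", "TCP"), ("network", "IP"), ("link", "Ethernet II")],
)
print(unwrap(frame))
```

Each lower layer treats whatever it carries as an opaque payload, which is why a fault at a lower layer, such as a layer-2 loop, shows up in everything above it.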
Once abstracted, everything looks similar. Look at a problem from a high enough vantage point and you can always find what it has in common with other problems, so a method can be ported and reused as a common solution.
The productivity gained by mastering scientific thinking tools is enormous.