Network loops on switch ports should not be underestimated

Source: Internet
Author: User
Tags: network troubleshooting

An improper port connection between Ethernet switches can create a network loop. If the switches involved do not have STP (Spanning Tree Protocol) enabled, the loop leads to endless repeated forwarding of packets and forms a broadcast storm, which in turn causes network faults. We have run into this kind of fault many times while maintaining the campus network, and one troubleshooting process in particular left a deep impression on us.

Fault description

One day, the campus network performance monitoring platform showed a problem with one VLAN: the connection between its access switch and the campus network was interrupted. We checked the aggregation switch located in the network center. The 100BASE-FX port connected to the access switch showed a large amount of inbound traffic but very little outbound traffic, which was abnormal. The aggregation switch itself, however, appeared to be performing normally. We therefore mirrored the abnormal port on the aggregation switch and captured packets with the protocol analysis tool Sniffer. At peak, we captured more than 100,000 packets per second. A simple analysis of these packets showed some common features.

1. The vast majority of packets are 62 bytes long (66 bytes including the 4-byte FCS error-detection field), and they are TCP SYN packets;

2. The source IP address belongs to another network segment, and the destination IP address belongs to the network segment of the building;

3. Although the source IP addresses differ, the source MAC address is always the same;

4. The destination IP address and destination MAC address match the IP-MAC binding configured for the building's VLAN on the aggregation switch;

5. The actual direction of the traffic is opposite to the direction implied by the source and destination IP addresses in the packets.

At the time, we were eager to restore the network as quickly as possible and did not look closely at these features. We saw only the first point and concluded that the network was under an unknown SYN flood attack, probably caused by a new network virus. We immediately disabled the port on the aggregation switch to keep network performance from degrading.

Troubleshooting

To be able to test connectivity from the field, in the network center we took the multi-mode fiber pigtail leading to the building's access switch and connected it, through a media converter and a twisted-pair cable, to a PC that we set up to act as the gateway of the faulty VLAN. Then we went to the building's network manager and asked him to help us find and isolate the hosts infected with the unknown virus as soon as possible.

According to the building's network manager, the network had still been normal the day before, although a department in the building had been adjusting its network at that time. When he came to work that morning, he found the network down and did not know whether the two were related. We believed that a network adjustment should have little to do with a virus infection. In the main wiring room, we unplugged all the network cables from the access switch and connected a laptop to it to reach the test host in the network center. After confirming that the link was good, we plugged half of the remaining cables back into the switch. If the test showed no problem, we continued with the next batch; otherwise we narrowed the search to the half just plugged in, gradually reducing the number of suspect cables.
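The halving procedure described above can be sketched in Python. This is only an illustration under stated assumptions: the function name `find_trigger_cable`, the `link_ok` callback (standing in for the manual connectivity test), and the cable labels are all hypothetical, and the sketch assumes that exactly one cable triggers the fault once the already cleared cables are plugged in.

```python
from typing import Callable, List, Sequence


def find_trigger_cable(cables: Sequence[str],
                       link_ok: Callable[[List[str]], bool]) -> str:
    """Return the cable that breaks the network when plugged in.

    `link_ok(plugged)` is a hypothetical stand-in for the manual test:
    it reports whether the link still works with `plugged` connected.
    """
    known_good: List[str] = []   # cables already plugged back in without trouble
    suspects = list(cables)      # cables not yet cleared
    while len(suspects) > 1:
        mid = len(suspects) // 2
        half, rest = suspects[:mid], suspects[mid:]
        if link_ok(known_good + half):
            # This half is harmless: leave it plugged in and test the rest.
            known_good += half
            suspects = rest
        else:
            # The trigger is in this half: pull it out and split it again.
            suspects = half
    return suspects[0]


# Hypothetical scenario: 24 cables, where cable-5 and cable-17 form a loop
# only when both are plugged in at the same time.
cables = [f"cable-{i}" for i in range(1, 25)]
loop_pair = {"cable-5", "cable-17"}


def link_ok(plugged: List[str]) -> bool:
    return not loop_pair.issubset(plugged)


trigger = find_trigger_cable(cables, link_ok)
print(trigger)
```

With 24 cables this needs five tests instead of up to 24, which matches the practical motivation for plugging cables back in by halves.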

We finally found a cable that triggered the problem: as soon as it was plugged in, the building's network lost its connection to the simulated gateway. The building's network manager identified this cable as belonging to the department that had adjusted its network the day before. He added that the department had previously run two cables, one primary and one backup, so there should be a second one, and we found that one on the switch as well. With either one of the two cables plugged in, the network was fine; with both plugged in at the same time, the problem appeared.

Could a cable trigger a virus's SYN flood attack? By this point the symptoms looked much more like a loop in the network. When we arrived at the department, we found three unmanaged switches stacked together, two of which were each connected to the access switch through one of the two cables, forming a network loop. Apparently the construction staff had not understood the network topology and, while the building's network manager was out, had wired the cables incorrectly on their own, which caused the incident. Once the cause was known, the fix was simple: unplugging one of the two cables restored connectivity. After these twists and turns the network was back to normal, but we kept wondering what had interfered with our judgment.

Fault analysis

This was a typical network loop fault, yet even after capturing so many packets with Sniffer and analyzing them, we did not see the problem. The large number of SYN packets at first sight gave us the illusion of a SYN flood attack. Afterwards, we reviewed the troubleshooting process, carefully re-analyzed the captured packets, and worked out explanations for the five common features listed above, so that similar problems can be handled promptly in the future.

First, look at the first four features. The aggregation switch is a network-layer device, and the network-layer interface of the building's VLAN is configured on it. To enforce the network management policy, we had bound MAC addresses to IP addresses on it, for registered and unregistered addresses alike. A TCP connection can only be established after a three-way handshake. The SYN segment that initiates the connection here is 28 bytes long; adding the 14-byte Ethernet frame header and the 20-byte IP header gives a frame length of 62 bytes as captured by Sniffer (not counting the 4 bytes of error detection). The unicast frame entering the VLAN happened to be a TCP request packet from the Internet. Following the forwarding mechanism, once the CRC check has passed and with the static ARP binding in place, the aggregation switch rewrites the source MAC address of the frame to its own MAC address, rewrites the destination MAC address according to the binding, recalculates the CRC to update the FCS field, and then forwards the re-encapsulated frame to the building's access switch.
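The length arithmetic in this paragraph can be checked with a few lines of Python. The constant names are illustrative; the source gives only the 28-byte total for the SYN segment, so the split into a 20-byte TCP header plus 8 bytes of options is an assumption noted in the comment.

```python
ETH_HEADER = 14   # destination MAC (6) + source MAC (6) + EtherType (2)
IP_HEADER = 20    # IPv4 header without options
TCP_SYN = 28      # from the text; assumed to be 20-byte header + 8 bytes of options
FCS = 4           # frame check sequence, not shown in the capture

captured_len = ETH_HEADER + IP_HEADER + TCP_SYN   # what the analyzer reports
wire_len = captured_len + FCS                     # what is actually on the wire

print(captured_len, wire_len)  # 62 66
```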

Now look at the last feature. A bridge is a store-and-forward device used to connect LANs of a similar type. A bridge listens on all of its ports to every frame transmitted and uses its bridge table as the basis for forwarding. The bridge table is a list of "MAC address, port number" entries that records the port through which each MAC address can be reached. It is refreshed from the source MAC address of each received frame and the port on which that frame arrived. The bridge uses the table as follows: when it receives a frame on a port, it first refreshes the bridge table and then looks up the frame's destination MAC address in the table. If the address is found, the frame is forwarded out of the port associated with it (if that port is the same as the receiving port, the frame is discarded).
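The lookup rules just described can be written as a minimal sketch. The class name `LearningBridge`, the integer port numbers, and the shortened MAC strings are hypothetical; the case where the destination is unknown, covered in the following paragraph, is included so that the sketch is complete.

```python
from typing import Dict, List


class LearningBridge:
    """Toy model of a bridge table; not a model of any real switch."""

    def __init__(self, ports: List[int]) -> None:
        self.ports = ports
        self.table: Dict[str, int] = {}   # MAC address -> port number

    def receive(self, in_port: int, src_mac: str, dst_mac: str) -> List[int]:
        """Return the list of ports the frame is sent out of."""
        # Step 1: refresh the table with the source MAC and receiving port.
        self.table[src_mac] = in_port
        # Step 2: look up the destination MAC.
        out_port = self.table.get(dst_mac)
        if out_port is None:
            # Unknown destination: send to every port except the receiving one.
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:
            # Destination lives on the receiving port: discard the frame.
            return []
        return [out_port]


bridge = LearningBridge(ports=[1, 2, 3])
print(bridge.receive(1, "aa:aa", "bb:bb"))  # [2, 3]  unknown destination
print(bridge.receive(2, "bb:bb", "aa:aa"))  # [1]     learned from the first frame
print(bridge.receive(1, "cc:cc", "aa:aa"))  # []      same port, discarded
```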

If the address is not found, the frame is forwarded to all ports other than the receiving port, that is, it is broadcast. Assume that throughout the forwarding process the destination MAC address of the frame cannot be found in the bridge tables of bridges A, B, C, and D, so none of them knows which port to forward the frame to. When bridge A receives a unicast frame from the upstream network on its uplink port, it broadcasts the frame. Bridges B and C receive the frame and broadcast it as well. Bridge D receives the frame from both bridge B and bridge C and passes the copies back toward bridge A through bridge C and bridge B respectively. At this point, bridge A has received two copies of the unicast frame.

During this loop forwarding, bridge A keeps receiving the same frame on different ports, and because the receiving port keeps changing, the "source MAC, port number" entry in its bridge table keeps changing too. As assumed earlier, the bridge table has no entry for the frame's destination MAC address, so after receiving the two copies bridge A can only broadcast them again to all ports other than the receiving port, which means the frame is also forwarded to the uplink port.
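The forwarding sequence described above can be traced with a small simulation. This is a sketch under simplifying assumptions rather than a model of the real equipment: the topology dictionary `LINKS` is a hypothetical layout matching the A, B, C, D description, every bridge always floods (the destination is never learned), and timing and collisions are ignored.

```python
from collections import Counter

# Hypothetical topology: A has the uplink; A-B, B-D, D-C, C-A close the loop.
LINKS = {
    "A": ["uplink", "B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}


def simulate(hops: int) -> Counter:
    """Flood one unknown-destination frame for `hops` forwarding steps.

    Returns the number of copies each bridge received, plus the number
    of copies sent back out of A's uplink port under the key "uplink".
    """
    seen = Counter()
    # Each in-flight copy is (receiving bridge, neighbour it arrived from).
    in_flight = [("A", "uplink")]
    for _ in range(hops):
        next_round = []
        for bridge, came_from in in_flight:
            seen[bridge] += 1
            for neighbour in LINKS[bridge]:
                if neighbour == came_from:
                    continue               # never send out of the receiving port
                if neighbour == "uplink":
                    seen["uplink"] += 1    # copy goes back to the upstream network
                else:
                    next_round.append((neighbour, bridge))
        in_flight = next_round
    return seen


print(simulate(5))
```

After five steps in this simplified model, bridge A has received the original frame plus two returned copies and has sent two copies back out of the uplink. The two copies keep circulating in opposite directions for as long as the simulation runs, since nothing in plain bridging removes them.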

For each unicast frame, bridge A repeats the process described above. Theoretically, it receives 2^1 frames after the first broadcast, 2^2 frames after the second, and so on, up to 2^n frames after the nth broadcast. As a result, a broadcast storm quickly builds up around bridge A, and the copies of this unicast frame eventually use up the bandwidth of the 100BASE-FX port. During this period many frames may collide with each other and become incomplete, so Sniffer cannot capture them. Still, it is easy to see that this unicast frame is repeated a great many times. We checked the captured packets again and found that almost all of them showed signs of repetition that we had not noticed at the time. For 64-byte frames, the forwarding rate of a 100BASE-FX port on an Ethernet switch can reach about 144,000 pps. In this loop state, it is therefore plausible for Sniffer to capture more than 100,000 packets of 66 bytes per second.
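As a rough cross-check of the packet rates mentioned above, the theoretical line rate of a 100 Mbit/s Ethernet port can be estimated from the frame size. This sketch assumes the standard per-frame overhead of an 8-byte preamble plus a 12-byte inter-frame gap; the function name `line_rate_pps` is illustrative, and a real switch may be specified slightly below the theoretical figure.

```python
def line_rate_pps(frame_bytes: int,
                  link_bps: int = 100_000_000,
                  overhead_bytes: int = 20) -> float:
    """Maximum frames per second for a given frame size (FCS included).

    overhead_bytes = 8-byte preamble + 12-byte inter-frame gap (assumed).
    """
    bits_per_frame = (frame_bytes + overhead_bytes) * 8
    return link_bps / bits_per_frame


print(int(line_rate_pps(64)))  # 148809  minimum-size frames
print(int(line_rate_pps(66)))  # 145348  the 66-byte SYN frames in this case
```

Both results are in the same range as the forwarding rate quoted in the text, and both sit comfortably above the 100,000 packets per second that were captured, so a port saturated by looping 66-byte frames is consistent with the observation.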

For the reasons above, and because the bridge tables of the four switches did not contain the destination MAC address of the packet at the time, whenever the aggregation switch in the upstream network sent a TCP request packet into the building, it kept receiving copies of that packet forwarded back by the building's access switch, which produced the large inbound traffic. The aggregation switch did not forward the received copies again. Internet applications follow a request/response model, and end-to-end communication works only when both the sending and the receiving paths are clear.

Once one of the paths is blocked, the application cannot proceed and ends. After it ends, the requester does not automatically send another request packet for that application. This is why, in the loop state, one direction carried heavy traffic while the other carried almost none. Because VLANs isolate broadcast domains, this heavy traffic did not pass through the network layer, so it did not put much pressure on the aggregation switch.

In fact, because this network loop is a fault at the data link layer, it involves only the source and destination MAC addresses. Whatever type of packet the upper layers encapsulate, it can cause a broadcast storm. In other words, Sniffer could have captured any kind of packet at that time.
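Since the packet type says little in a loop, one way to look for the repetition noted earlier is to measure how many captured frames are exact duplicates. The following is a hedged sketch: `duplication_ratio` is a hypothetical helper, the byte strings stand in for raw captured frames, and the assumption that flood traffic typically varies from packet to packet while looped traffic repeats identically is a rule of thumb, not a guarantee.

```python
from collections import Counter
from typing import Iterable


def duplication_ratio(frames: Iterable[bytes]) -> float:
    """Fraction of captured frames that are exact repeats of an earlier frame."""
    counts = Counter(frames)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    duplicates = total - len(counts)   # every copy beyond the first of each frame
    return duplicates / total


# Hypothetical capture: one SYN frame seen eight times, plus two other frames.
capture = [b"syn-frame-1"] * 8 + [b"syn-frame-2", b"arp-frame-1"]
ratio = duplication_ratio(capture)
print(ratio)  # 0.7
```

A ratio close to 1 on a mirrored port would point toward the same frames circulating in a loop, whereas a low ratio with many distinct SYN packets would fit an attack better.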

Fault prevention

The access layer of the campus network is the user-facing network interface. It contains many uncontrollable elements and is complex, so it should be managed by a dedicated person, and its reliability should be ensured at the equipment level. The access switch in this case is a managed switch and supports STP; the other switches are unmanaged and do not support STP. Had STP been configured on the access switch in advance, this network incident could have been avoided entirely, but for some reason that had not been done. Afterwards, we could only make up for it.

It can be seen that even with STP enabled on the access switch, a loop that forms in the downstream network for some reason can still produce a broadcast storm that affects the upstream VLAN. The access switch should therefore also support broadcast suppression, to confine the impact to a local range. Switches in the downstream network should ideally meet the same requirements, although that is a question of cost-effectiveness. In short, technology and experience matter in network troubleshooting, but maintaining standardized network connections and implementing basic preventive measures matters even more.
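The idea behind broadcast suppression can be illustrated with a simple per-window counter. This is only a conceptual sketch: the class name `BroadcastSuppressor` and its parameters are hypothetical, and real switches implement the feature in hardware with their own thresholds and configuration syntax.

```python
class BroadcastSuppressor:
    """Allow at most `max_per_window` broadcast frames per time window."""

    def __init__(self, max_per_window: int, window_s: float = 1.0) -> None:
        self.max_per_window = max_per_window
        self.window_s = window_s
        self.window_start = 0.0
        self.count = 0

    def allow(self, now: float) -> bool:
        """Return True if a broadcast frame arriving at time `now` is forwarded."""
        if now - self.window_start >= self.window_s:
            # A new window begins: reset the counter.
            self.window_start = now
            self.count = 0
        self.count += 1
        return self.count <= self.max_per_window


suppressor = BroadcastSuppressor(max_per_window=3)
decisions = [suppressor.allow(t) for t in (0.0, 0.1, 0.2, 0.3, 1.0)]
print(decisions)  # [True, True, True, False, True]
```

Frames beyond the threshold are dropped until the next window starts, which is what keeps a storm on one port from consuming the upstream link.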
