Linux Network Fault Analysis

Source: Internet
Author: User
The culprit of network card instability
09:47:42
The network card instability in this case appeared on a recently deployed DB server. During a stress test, the server's network connection became unstable: a little over ten minutes into the test, the server responded very slowly, pings frequently lost packets, and SSH connections were dropped. At first I assumed the DB server had become unresponsive under high concurrency. But when I checked CPU, memory, and disk I/O, none of them was high. All were well below our warning thresholds, and monitoring also showed that the database server had ample resources.

So why was the NIC unstable? I asked the engineers and learned that the DB server is a hot standby server, and that two groups of gigabit NICs had been bonded on it only a few days earlier. According to the engineers, the stress test run before the bonding showed no such problem. Was something wrong in the bonding settings? We decided to go through the gigabit NIC bonding configuration in detail.

[Figure: fault symptom diagram]
1. Check the ifcfg-bond0 and ifcfg-bond1 files

    # cat /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    BOOTPROTO=static
    ONBOOT=yes
    IPADDR=10.58.11.11
    NETMASK=255.255.255.0
    GATEWAY=no

    # cat /etc/sysconfig/network-scripts/ifcfg-bond1
    DEVICE=bond1
    BOOTPROTO=static
    ONBOOT=yes
    IPADDR=10.10.18
    NETMASK=255.255.255.0
    GATEWAY=10.58.121.254
    USERCTL=no

Analysis: this is a standard configuration with no problems. Note that the IP address, subnet mask, and NIC ID must not be set on the individual physical NICs; these settings belong on the virtual adapter (bonding).

2. Check the ifcfg-eth0, ifcfg-eth1, ifcfg-eth2, and ifcfg-eth3 files

    # cat /etc/sysconfig/network-scripts/ifcfg-eth0
    DEVICE=eth0
    ONBOOT=yes
    BOOTPROTO=none
    MASTER=bond0
    SLAVE=yes
    USERCTL=no
    ETHTOOL_OPTS="speed 1000 duplex full autoneg on"

    # cat /etc/sysconfig/network-scripts/ifcfg-eth1
    DEVICE=eth1
    ONBOOT=yes
    BOOTPROTO=none
    MASTER=bond1
    SLAVE=yes
    USERCTL=no
    ETHTOOL_OPTS="speed 1000 duplex full autoneg on"

    # cat /etc/sysconfig/network-scripts/ifcfg-eth2
    DEVICE=eth2
    ONBOOT=yes
    BOOTPROTO=none
    MASTER=bond0
    SLAVE=yes
    USERCTL=no
    ETHTOOL_OPTS="speed 1000 duplex full autoneg on"

    # cat /etc/sysconfig/network-scripts/ifcfg-eth3
    DEVICE=eth3
    ONBOOT=yes
    BOOTPROTO=none
    MASTER=bond1
    SLAVE=yes
    USERCTL=no
    ETHTOOL_OPTS="speed 1000 duplex full autoneg on"

Analysis: in these configuration files, eth0 and eth2 are bound to bond0, and eth1 and eth3 are bound to bond1.

Note: gigabit full duplex can also be set on a NIC temporarily:

    ethtool -s eth0 speed 1000 duplex full autoneg on
    ethtool -s eth1 speed 1000 duplex full autoneg on

3. Check the modprobe.conf configuration file

The file contains the driver aliases for the four NICs (bnx2 for eth2 and eth3), followed by the bonding entries:

    # cat /etc/modprobe.conf
    alias bond0 bonding
    options bond0 miimon=100 mode=0
    alias bond1 bonding
    options bond1 miimon=100 mode=1
    ###BEGINPP
    include /etc/modprobe.conf.pp
    ###ENDPP

Analysis: the entries added for bonding are the alias and options lines for bond0 (miimon=100 mode=0) and bond1 (miimon=100 mode=1).
The main purpose of these entries is to make the system load the bonding module at startup; the virtual network interfaces it exposes are bond0 and bond1. The miimon option controls link monitoring. For example, with miimon=100 the system checks the link status every 100 ms, and if one link fails, traffic is moved to the other. The mode option selects the working mode. The bonding driver has seven modes in total (0 through 6); the two most commonly used are 0 and 1.
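To make the meaning of the two options concrete, here is a minimal Python sketch that reads the bonding entries out of a modprobe.conf-style text. The function name `parse_bonding_options` and the sample text are illustrative assumptions, not part of the original setup; the sketch only handles the simple `alias` and `options` forms shown in step 3.

```python
def parse_bonding_options(modprobe_text):
    """Return {bond_name: {option: value}} for devices aliased to 'bonding'.

    Assumes the simple forms 'alias <name> bonding' and
    'options <name> key=value ...' shown in the article.
    """
    lines = [line.split() for line in modprobe_text.splitlines()]
    # Devices that are declared as bonding interfaces.
    bond_names = {p[1] for p in lines
                  if len(p) == 3 and p[0] == "alias" and p[2] == "bonding"}
    bonds = {}
    for p in lines:
        if len(p) >= 3 and p[0] == "options" and p[1] in bond_names:
            opts = {}
            for item in p[2:]:
                if "=" in item:
                    key, value = item.split("=", 1)
                    opts[key] = int(value) if value.isdigit() else value
            bonds[p[1]] = opts
    return bonds


# Sample text copied from the bonding entries in step 3.
sample = "\n".join([
    "alias bond0 bonding",
    "options bond0 miimon=100 mode=0",
    "alias bond1 bonding",
    "options bond1 miimon=100 mode=1",
])

bonding = parse_bonding_options(sample)
for name, opts in sorted(bonding.items()):
    print(f"{name}: link checked every {opts['miimon']} ms, mode {opts['mode']}")
```

Here bond0 and bond1 both poll the link every 100 ms and differ only in the mode number.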
mode=0 is load balancing (round-robin): both NICs carry traffic. mode=1 is fault tolerance (active-backup): it provides redundancy in an active/standby arrangement, so by default only one NIC carries traffic and the other stands by as a backup.

Note: bonding can only monitor the link itself, that is, whether the path from the host to the switch is up. If the switch's external link goes down while the switch itself is not faulty, bonding still considers the link good and keeps using it.

There is nothing wrong with this part of the configuration either. We have not found the problem yet, but there is one more place that is easily overlooked: the rc.local file. To make NIC bonding take effect right after every boot, the bonding commands are usually placed in rc.local, so this file should be checked as well.

4. Check the rc.local file

    # cat /etc/rc.d/rc.local
    touch /var/lock/subsys/local
    ifenslave bond0 eth0 eth1
    ifenslave bond1 eth2 eth3

Analysis: these commands set up the bonds automatically at startup. Note that they put eth0 and eth1 in bond0, and eth2 and eth3 in bond1. Looking back at step 2, however, eth0 and eth2 are bound to bond0, and eth1 and eth3 are bound to bond1. This looks like the cause of the problem.

So what happens when this configuration is wrong? First, a review of how NIC bonding works. Under normal circumstances, an Ethernet adapter accepts only Ethernet frames whose destination MAC address is its own MAC (along with broadcast frames) and filters out all other frames, which reduces the load on the driver, that is, on software. However, an Ethernet NIC also supports another mode, promiscuous (promisc) mode, in which it receives all frames on the network. Many tools, such as sniffers and tcpdump, run in this mode.
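The receive filtering described above can be written down as a small rule. The following Python sketch is a simplification for illustration only: the function name `nic_accepts` is hypothetical, multicast is ignored, and real NICs apply this filter in hardware.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


def nic_accepts(frame_dst, nic_mac, promisc=False):
    """Decide whether a NIC passes a frame up to the driver.

    Simplified model: in promiscuous mode every frame is accepted;
    otherwise only frames addressed to the NIC's own MAC or to the
    broadcast address are accepted. Multicast is left out.
    """
    if promisc:
        return True
    dst = frame_dst.lower()
    return dst == nic_mac.lower() or dst == BROADCAST


own_mac = "d4:ae:52:7f:d1:74"
print(nic_accepts("D4:AE:52:7F:D1:74", own_mac))                # own address
print(nic_accepts("d4:ae:52:7f:d1:76", own_mac))                # someone else's
print(nic_accepts("d4:ae:52:7f:d1:76", own_mac, promisc=True))  # promiscuous
```

Under this model, two NICs that are given the same MAC both accept the same unicast frames, which is the property the discussion of bonding relies on.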
Bonding builds on the same mechanism at the driver level: it rewrites the MAC addresses so that the two NICs in a bond share one MAC and both accept frames addressed to it. The frames are then handed to the bond driver for processing.

Returning to the rc.local file: through the system engineer's carelessness, the NIC binding there is wrong. A small configuration error like this causes one IP address to correspond to two different MAC addresses, which leads to network latency and instability, much like an ARP attack. When several MAC addresses correspond to the same IP address, the ARP entry for that IP on every machine in the network, including the router, keeps changing, and packets are either lost or sent to the wrong MAC address.
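A mismatch of this kind between the ifcfg files from step 2 and the rc.local commands can be found mechanically. The following is a minimal Python sketch under stated assumptions: the function names are hypothetical, the file contents are passed in as strings rather than read from disk, and only the plain `ifenslave <bond> <slave> ...` form is recognized.

```python
def slaves_from_ifcfg(ifcfg_texts):
    """Return {slave: master} from the contents of ifcfg-ethN files."""
    mapping = {}
    for text in ifcfg_texts:
        fields = {}
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            fields[key.strip().upper()] = value.strip().strip('"')
        device = fields.get("DEVICE")
        master = fields.get("MASTER")
        if device and master and fields.get("SLAVE", "").lower() == "yes":
            mapping[device] = master
    return mapping


def slaves_from_rc_local(rc_text):
    """Return {slave: master} from plain 'ifenslave <bond> <slave>...' lines."""
    mapping = {}
    for line in rc_text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "ifenslave":
            for slave in parts[2:]:
                mapping[slave] = parts[1]
    return mapping


def find_conflicts(ifcfg_map, rc_map):
    """Return {nic: (master per ifcfg, master per rc.local)} where they differ."""
    return {nic: (ifcfg_map[nic], rc_map[nic])
            for nic in sorted(ifcfg_map)
            if nic in rc_map and ifcfg_map[nic] != rc_map[nic]}


# Sample data reduced from steps 2 and 4 of the article.
ifcfg_files = [
    "DEVICE=eth0\nMASTER=bond0\nSLAVE=yes",
    "DEVICE=eth1\nMASTER=bond1\nSLAVE=yes",
    "DEVICE=eth2\nMASTER=bond0\nSLAVE=yes",
    "DEVICE=eth3\nMASTER=bond1\nSLAVE=yes",
]
rc_local = (
    "touch /var/lock/subsys/local\n"
    "ifenslave bond0 eth0 eth1\n"
    "ifenslave bond1 eth2 eth3\n"
)

conflicts = find_conflicts(slaves_from_ifcfg(ifcfg_files),
                           slaves_from_rc_local(rc_local))
for nic, (cfg_master, rc_master) in conflicts.items():
    print(f"{nic}: ifcfg says {cfg_master}, rc.local says {rc_master}")
```

With the sample data, eth1 and eth2 are reported, since each of them is assigned to one bond in its ifcfg file and to the other bond in rc.local.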
We can check the MAC address of each NIC to confirm this.

    eth0  Link encap:Ethernet  HWaddr D4:AE:52:7F:D1:74
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:358839038 errors:0 dropped:0 overruns:0 frame:0
          TX packets:445740732 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:84060158481 (78.2 GiB)  TX bytes:324117093205 (301.8 GiB)
          Interrupt:178 Memory:c6000000-c6012800

    eth1  Link encap:Ethernet  HWaddr D4:AE:52:7F:D1:76
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:1319022534 errors:0 dropped:0 overruns:0 frame:0
          TX packets:827575644 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:402801656790 (375.1 GiB)  TX bytes:249765452577 (232.6 GiB)
          Interrupt:186 Memory:c8000000-c8012800

    eth2  Link encap:Ethernet  HWaddr D4:AE:52:7F:D1:74
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:368142910 errors:0 dropped:0 overruns:0 frame:0
          TX packets:445816695 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:88487806059 (82.4 GiB)  TX bytes:324236716714 (301.9 GiB)
          Interrupt:194 Memory:ca000000-ca012800

    eth3  Link encap:Ethernet  HWaddr D4:AE:52:7F:D1:76
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:1311065414 errors:0 dropped:0 overruns:0 frame:0
          TX packets:827581593 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:400383501186 (372.8 GiB)  TX bytes:249850192137 (232.6 GiB)
          Interrupt:202 Memory:cc000000-cc012800

The output shows that eth0 and eth2 have the same MAC, and that eth1 and eth3 have the same MAC. With the cause identified, we immediately corrected the rc.local file so that it matches the intended configuration:

    ifenslave bond0 eth0 eth2
    ifenslave bond1 eth1 eth3

After restarting the server and repeating the stress test, everything was normal.
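Comparing MAC addresses by eye is error-prone with four interfaces. As a rough aid, the Python sketch below groups interfaces by hardware address from ifconfig-style text. It assumes the older net-tools output format with `Link encap:` and `HWaddr` fields, as in the listing above; the function names are hypothetical, and the sample is reduced to the first line of each interface.

```python
import re
from collections import defaultdict

# Matches lines such as: "eth0  Link encap:Ethernet  HWaddr D4:AE:52:7F:D1:74"
HWADDR_LINE = re.compile(
    r"^(\S+)\s+Link encap:\S+\s+HWaddr\s+([0-9A-Fa-f:]{17})", re.MULTILINE
)


def macs_by_interface(ifconfig_text):
    """Return {interface: mac} parsed from old-style ifconfig output."""
    return {name: mac.lower() for name, mac in HWADDR_LINE.findall(ifconfig_text)}


def shared_macs(iface_macs):
    """Return {mac: [interfaces]} for MACs used by more than one interface."""
    groups = defaultdict(list)
    for name, mac in iface_macs.items():
        groups[mac].append(name)
    return {mac: sorted(names) for mac, names in groups.items() if len(names) > 1}


sample = "\n".join([
    "eth0      Link encap:Ethernet  HWaddr D4:AE:52:7F:D1:74",
    "eth1      Link encap:Ethernet  HWaddr D4:AE:52:7F:D1:76",
    "eth2      Link encap:Ethernet  HWaddr D4:AE:52:7F:D1:74",
    "eth3      Link encap:Ethernet  HWaddr D4:AE:52:7F:D1:76",
])

groups = shared_macs(macs_by_interface(sample))
for mac, names in sorted(groups.items()):
    print(mac, "->", ", ".join(names))
```

Slaves of the same bond are expected to share a MAC, so the grouping is only meaningful when compared against the intended bond layout.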
Summary: bonding two NICs on Linux is a hands-on, detail-sensitive operation. We need to understand the principles behind it, and we also need to be careful when deploying it; a single oversight can cause network instability or take a node down.
