Recently, a new DB server showed an unstable NIC during stress testing. After the stress test had run for only a short while, the server's responses became very slow, pings frequently lost packets, and SSH connections were dropped. At first I assumed the DB server was unresponsive because of high concurrency. But when I checked CPU, memory, and disk I/O, none of them had reached a high value; all were well below our warning thresholds, and monitoring also showed that the database server had ample resources. Strange. So why was the NIC unstable?
I asked the engineers about the situation and learned that the DB server is a hot standby server, and that two pairs of gigabit NICs had been bonded on it just a few days earlier. According to the engineers, this problem did not occur during stress tests before the bonding was set up. So which part of the bonding setup was at fault? We decided to start a detailed check with the gigabit NIC bonding.
Fault symptom diagram:
1. Check the ifcfg-bond0 and ifcfg-bond1 files
# cat /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BOOTPROTO=static
ONBOOT=yes
IPADDR=10.58.11.11
NETMASK=255.255.255.0
GATEWAY=10.58.121.254
USERCTL=no
# cat /etc/sysconfig/network-scripts/ifcfg-bond1
DEVICE=bond1
BOOTPROTO=static
ONBOOT=yes
IPADDR=10.10.10.18
NETMASK=255.255.255.0
GATEWAY=10.58.121.254
USERCTL=no
Analysis: this is a standard configuration with no problems. Do not assign an IP address, subnet mask, or NIC ID to the individual physical NICs; assign that information to the virtual adapter (the bonding device) instead.
2. Check the ifcfg-eth0, ifcfg-eth1, ifcfg-eth2, and ifcfg-eth3 files
# cat /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=none
MASTER=bond0
SLAVE=yes
USERCTL=no
ETHTOOL_OPTS="speed 1000 duplex full autoneg on"
# cat /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
ONBOOT=yes
BOOTPROTO=none
MASTER=bond1
SLAVE=yes
USERCTL=no
ETHTOOL_OPTS="speed 1000 duplex full autoneg on"
# cat /etc/sysconfig/network-scripts/ifcfg-eth2
DEVICE=eth2
ONBOOT=yes
BOOTPROTO=none
MASTER=bond0
SLAVE=yes
USERCTL=no
ETHTOOL_OPTS="speed 1000 duplex full autoneg on"
# cat /etc/sysconfig/network-scripts/ifcfg-eth3
DEVICE=eth3
ONBOOT=yes
BOOTPROTO=none
MASTER=bond1
SLAVE=yes
USERCTL=no
ETHTOOL_OPTS="speed 1000 duplex full autoneg on"
Analysis: in these configuration files, eth0 and eth2 are enslaved to bond0, and eth1 and eth3 are enslaved to bond1.
(Note: to set a NIC to gigabit full duplex temporarily, you can run:
ethtool -s eth0 speed 1000 duplex full autoneg on
ethtool -s eth1 speed 1000 duplex full autoneg on)
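The slave-to-bond mapping read off by eye in step 2 can also be extracted mechanically. The following is a minimal Python sketch, not part of the original troubleshooting session: the helper names parse_ifcfg and slaves_by_master are hypothetical, and the file contents are abbreviated copies of the ifcfg-eth* files above held in a dictionary rather than read from /etc/sysconfig/network-scripts.

```python
from collections import defaultdict

# Abbreviated copies of the ifcfg-eth* files from step 2 (illustrative input).
IFCFG_FILES = {
    "ifcfg-eth0": "DEVICE=eth0\nONBOOT=yes\nBOOTPROTO=none\nMASTER=bond0\nSLAVE=yes\n",
    "ifcfg-eth1": "DEVICE=eth1\nONBOOT=yes\nBOOTPROTO=none\nMASTER=bond1\nSLAVE=yes\n",
    "ifcfg-eth2": "DEVICE=eth2\nONBOOT=yes\nBOOTPROTO=none\nMASTER=bond0\nSLAVE=yes\n",
    "ifcfg-eth3": "DEVICE=eth3\nONBOOT=yes\nBOOTPROTO=none\nMASTER=bond1\nSLAVE=yes\n",
}


def parse_ifcfg(text):
    """Parse KEY=VALUE lines of an ifcfg file into a dict (hypothetical helper)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def slaves_by_master(files):
    """Group slave device names by the bond named in their MASTER setting."""
    groups = defaultdict(list)
    for text in files.values():
        cfg = parse_ifcfg(text)
        if cfg.get("SLAVE") == "yes" and "MASTER" in cfg and "DEVICE" in cfg:
            groups[cfg["MASTER"]].append(cfg["DEVICE"])
    return {master: sorted(slaves) for master, slaves in groups.items()}


if __name__ == "__main__":
    print(slaves_by_master(IFCFG_FILES))
```

On a real host the dictionary would be filled by reading the files from disk; the sketch keeps the input inline so that it runs anywhere.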
3. Check the modprobe.conf configuration file
# cat /etc/modprobe.conf
alias eth0 bnx2
alias eth1 bnx2
alias eth2 bnx2
alias eth3 bnx2
alias scsi_hostadapter megaraid_sas
alias scsi_hostadapter1 ata_piix
alias scsi_hostadapter2 lpfc
alias bond0 bonding
options bond0 miimon=100 mode=0
alias bond1 bonding
options bond1 miimon=100 mode=1
### BEGINPP
include /etc/modprobe.conf.pp
### ENDPP
Analysis: the following lines were added to this file:
alias bond0 bonding
options bond0 miimon=100 mode=0
alias bond1 bonding
options bond1 miimon=100 mode=1
Their main purpose is to make the system load the bonding module at startup and expose the virtual network interface devices bond0 and bond1.
In addition, miimon controls link monitoring. For example, with miimon=100 the system checks the link state every 100 ms; if one link fails, traffic is moved to the other. The mode value selects the working mode. The bonding driver supports several modes (0 through 6), of which 0 and 1 are the most commonly used.
mode=0 selects load balancing (round-robin): both NICs carry traffic.
mode=1 selects fault tolerance (active-backup): it provides redundancy by working in active/standby fashion. That is, by default only one NIC carries traffic and the other is kept as a backup. Note: bonding can only monitor the link itself, that is, whether the connection from the host to the switch is up. If the switch's uplink goes down while the switch itself is not faulty, bonding considers the link healthy and keeps using it.
There is no problem with this part of the configuration.
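To make the difference between the two modes concrete, here is a deliberately simplified Python model of how each one picks the outgoing NIC. It is an illustration only, not the kernel bonding driver's logic: the function names round_robin and active_backup are hypothetical, link state is passed in as a plain dictionary (standing in for what miimon would report), and failback and primary-slave settings are ignored.

```python
def round_robin(slaves, link_up, n_packets):
    """Mode 0 sketch: rotate packets across all slaves whose link is up."""
    usable = [s for s in slaves if link_up.get(s, False)]
    if not usable:
        return []  # no working link, nothing can be sent
    return [usable[i % len(usable)] for i in range(n_packets)]


def active_backup(slaves, link_up, n_packets):
    """Mode 1 sketch: send everything through the first slave whose link is up."""
    active = next((s for s in slaves if link_up.get(s, False)), None)
    if active is None:
        return []
    return [active] * n_packets


if __name__ == "__main__":
    # bond0 (mode 0) with both links up: packets alternate between the NICs.
    print(round_robin(["eth0", "eth2"], {"eth0": True, "eth2": True}, 4))
    # bond1 (mode 1) with eth1 down: the backup eth3 carries all traffic.
    print(active_backup(["eth1", "eth3"], {"eth1": False, "eth3": True}, 3))
```

The model also reflects the limitation noted above: the choice depends only on the host-to-switch link state, so a failure further upstream would not change which NIC is selected.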
We have not found the problem yet, but there is one more place that is easily overlooked: the rc.local file. To make the NIC bonding take effect right after every boot, the bonding commands are usually also placed in rc.local, so we should check this file as well.
4. Check the rc.local file
# cat /etc/rc.d/rc.local
touch /var/lock/subsys/local
ifenslave bond0 eth0 eth1
ifenslave bond1 eth2 eth3
Analysis: these commands are placed here so that the bonding is applied automatically at startup.
Note, however, that they put eth0 and eth1 in bond0, and eth2 and eth3 in bond1. Looking back at step 2, eth0 and eth2 are enslaved to bond0, and eth1 and eth3 to bond1. This looks like the cause of the problem. So what happens when this configuration is wrong?
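A mismatch of this kind between rc.local and the ifcfg files is easy to miss by eye, so a small consistency check can help. The Python sketch below is an assumption-laden illustration rather than an existing tool: parse_ifenslave and find_mismatches are hypothetical helper names, the expected mapping is typed in from step 2, and only plain "ifenslave BOND SLAVE..." lines without option flags are handled.

```python
# Contents of rc.local as shown in step 4 (illustrative input).
RC_LOCAL = """\
touch /var/lock/subsys/local
ifenslave bond0 eth0 eth1
ifenslave bond1 eth2 eth3
"""

# Mapping declared by MASTER= in the ifcfg-eth* files from step 2.
EXPECTED = {"bond0": {"eth0", "eth2"}, "bond1": {"eth1", "eth3"}}


def parse_ifenslave(text):
    """Collect bond -> set of slaves from 'ifenslave BOND SLAVE...' lines.

    Assumption: no option flags (such as -d) are present on the lines.
    """
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "ifenslave":
            result.setdefault(parts[1], set()).update(parts[2:])
    return result


def find_mismatches(expected, actual):
    """Return the bonds whose slave sets differ between the two sources."""
    mismatches = {}
    for bond in sorted(set(expected) | set(actual)):
        want = expected.get(bond, set())
        got = actual.get(bond, set())
        if want != got:
            mismatches[bond] = {"ifcfg": sorted(want), "rc.local": sorted(got)}
    return mismatches


if __name__ == "__main__":
    for bond, diff in find_mismatches(EXPECTED, parse_ifenslave(RC_LOCAL)).items():
        print(bond, diff)
```

With the rc.local shown in step 4, both bonds are reported as inconsistent; with the two sources in agreement the result is an empty dictionary.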
First, let us review how NIC bonding works. Under normal circumstances, an Ethernet adapter accepts only Ethernet frames whose destination MAC address is its own and filters out all other frames, which reduces the load on the driver, that is, on software. However, Ethernet NICs also support another mode, promiscuous (promisc) mode, in which they receive all frames on the network. Many system programs such as sniffers and tcpdump run in this mode. Bonding also runs in this mode, and it modifies the MAC address in the driver, setting the MAC addresses of the two NICs to the same value so that both can receive frames for that specific MAC. The received frames are then handed to the bond driver for processing.
Now back to the rc.local file. Because of the system engineer's carelessness, the NIC bonding there was configured incorrectly. Even such a small configuration error causes one IP address to correspond to two different MAC addresses, which obviously leads to network latency and instability, much like an ARP attack. When multiple MAC addresses correspond to the same IP address, the ARP entry for that IP on every machine in the network, including the router, keeps changing, and packets are either lost or sent to the wrong MAC address.
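The ARP effect described here can be pictured with a toy model of a neighbor's ARP cache. The Python sketch below is only an illustration under stated assumptions: simulate_arp_cache is a hypothetical function, the IP and MAC addresses are taken from this article, and the strictly alternating sequence of ARP replies is invented for the example rather than captured from the server.

```python
def simulate_arp_cache(replies):
    """Replay (ip, mac) ARP replies into a cache.

    Returns the final cache and the number of times an existing
    entry was overwritten with a different MAC address (a "flap").
    """
    cache = {}
    flaps = 0
    for ip, mac in replies:
        if ip in cache and cache[ip] != mac:
            flaps += 1
        cache[ip] = mac
    return cache, flaps


if __name__ == "__main__":
    ip = "10.58.11.11"
    mac_a = "D4:AE:52:7F:D1:74"
    mac_b = "D4:AE:52:7F:D1:76"

    # Healthy case: the IP is always announced with the same MAC.
    print(simulate_arp_cache([(ip, mac_a)] * 6))

    # Faulty case (assumed pattern): the same IP is announced with two MACs.
    print(simulate_arp_cache([(ip, mac_a), (ip, mac_b)] * 3))
```

In the healthy case the entry never changes. In the faulty case every reply overwrites the previous entry, so at any moment a neighbor may be sending frames to whichever MAC answered last.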
We can check the MAC address of each NIC to confirm this.
eth0      Link encap:Ethernet  HWaddr D4:AE:52:7F:D1:74
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:358839038 errors:0 dropped:0 overruns:0 frame:0
          TX packets:445740732 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:84060158481 (78.2 GiB)  TX bytes:324117093205 (301.8 GiB)
          Interrupt:178 Memory:c6000000-c6012800

eth1      Link encap:Ethernet  HWaddr D4:AE:52:7F:D1:76
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:1319022534 errors:0 dropped:0 overruns:0 frame:0
          TX packets:827575644 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:402801656790 (375.1 GiB)  TX bytes:249765452577 (232.6 GiB)
          Interrupt:186 Memory:c8000000-c8012800

eth2      Link encap:Ethernet  HWaddr D4:AE:52:7F:D1:74
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:368142910 errors:0 dropped:0 overruns:0 frame:0
          TX packets:445816695 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:88487806059 (82.4 GiB)  TX bytes:324236716714 (301.9 GiB)
          Interrupt:194 Memory:ca000000-ca012800

eth3      Link encap:Ethernet  HWaddr D4:AE:52:7F:D1:76
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:1311065414 errors:0 dropped:0 overruns:0 frame:0
          TX packets:827581593 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:400383501186 (372.8 GiB)  TX bytes:249850192137 (232.6 GiB)
          Interrupt:202 Memory:cc000000-cc012800
We can see that eth0 and eth2 have the same MAC, and eth1 and eth3 have the same MAC.
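Grouping interfaces by hardware address can also be done with a few lines of Python instead of reading the output by eye. This is a minimal sketch under assumptions: group_by_mac is a hypothetical helper, the sample text contains only the first line of each interface block from the output above, and the regular expression expects the classic net-tools ifconfig layout ("Link encap:Ethernet  HWaddr ...").

```python
import re
from collections import defaultdict

# First line of each interface block from the ifconfig output above.
IFCONFIG_SAMPLE = """\
eth0      Link encap:Ethernet  HWaddr D4:AE:52:7F:D1:74
eth1      Link encap:Ethernet  HWaddr D4:AE:52:7F:D1:76
eth2      Link encap:Ethernet  HWaddr D4:AE:52:7F:D1:74
eth3      Link encap:Ethernet  HWaddr D4:AE:52:7F:D1:76
"""

# Interface name at the start of a line, followed by its hardware address.
HWADDR_RE = re.compile(
    r"^(\S+)\s+Link encap:Ethernet\s+HWaddr\s+([0-9A-Fa-f:]{17})",
    re.MULTILINE,
)


def group_by_mac(text):
    """Map each MAC address to the list of interfaces reporting it."""
    groups = defaultdict(list)
    for name, mac in HWADDR_RE.findall(text):
        groups[mac.upper()].append(name)
    return dict(groups)


if __name__ == "__main__":
    for mac, names in group_by_mac(IFCONFIG_SAMPLE).items():
        print(mac, names)
```

On a live system the text would come from running ifconfig and capturing its output; the inline sample keeps the sketch self-contained.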
Having found the cause, we immediately corrected the rc.local file to match the intended configuration:
ifenslave bond0 eth0 eth2
ifenslave bond1 eth1 eth3
We then restarted the server and ran the stress test again. Everything was normal.
Summary: bonding two NICs on Linux is a detail-sensitive operation. When configuring it, we need not only to understand the underlying principles but also to take care during deployment; a single oversight can cause network instability and bring a node down.
This article is from the "Drop water stone" blog; please keep this source attribution: http://xjsunjie.blog.51cto.com/999372/886294