NIC aggregation at the link layer, based on Linux bonding

Linux can often implement a complex function in the simplest way, especially in networking. Even functions usually considered available only on high-end devices are easy to implement on Linux. Earlier articles have made this point several times, for example for VLANs, advanced routing, and firewalling. This article focuses on Linux bonding, the module that implements port aggregation. At the network device level, Linux has two very successful virtual devices: the tap NIC and the bonding device described here. For the tap NIC, see the earlier article about OpenVPN.
If you are wondering where to find good information about Linux bonding, the answer is the Linux kernel documentation, in $kernel-root/Documentation/networking/bonding.txt. I do not think there is a more authoritative source.
I. Introduction to bonding



Bonding is a Linux kernel driver. Once it is loaded, Linux can bundle multiple physical NICs into one virtual bond NIC. With each new version the bond driver has gained more configurable parameters, and configuration itself has become more convenient.
Port aggregation of physical NICs is useful in many places: to increase network throughput, to provide hot backup, or to configure a host as a bridge. The driver also supports the 802.3ad dynamic port aggregation protocol, and so on. The two most important uses, however, are load balancing and hot backup.
II. Introduction to drivers and changes



The initial version of the Linux bonding driver provided only the basic mechanism and required you to specify the configuration parameters when loading the module; to change them, you had to reload the bonding module. modprobe later gained a rename mechanism, the -o option, which loads a module under a different name.
In this way, one module can be loaded several times with different configuration parameters. Suppose I have four network ports and want two configured for load balancing and two for hot standby. At first this could only be solved by manually compiling the bonding driver under different names. Once modprobe had the -o option, the same driver could be loaded twice. For example:
modprobe bonding -o bond0 mode=0
modprobe bonding -o bond1 mode=1


After loading the bonding driver twice in this way, lsmod shows bond0 and bond1 and no module called bonding, because modprobe renamed the module at load time. In the end, however, this renaming mechanism stopped being supported; as the modprobe man page says, the -o rename mechanism is mainly intended for testing. Finally, bonding gained a sysfs configuration mechanism: the driver is configured by reading and writing files under /sys/class/net/.
In any case, before sysfs fully supported bonding configuration, adding a device to or removing a device from a bond NIC still required the classic ioctl call, so a matching user-space program was needed. That program is ifenslave.
I think it would be ideal if all device configuration in Linux went through sysfs, all kernel and process configuration went through procfs (the kernel is the address space shared by all processes and also has its own kernel threads and process 0, so kernel configuration belongs in procfs), and all messaging used Netlink. Getting rid of imperative ioctl configuration in favor of file-style configuration (system calls such as sendto used by Netlink can also be counted as file system calls) would be more efficient, simpler, and more fun!
III. Bonding configuration parameters



The kernel documentation lists many bonding driver parameters. This article is not a translation of that document, so it neither translates it nor covers parameters irrelevant to the topic. It introduces only the important parameters, and what follows is advice and experience rather than translation.
ad_select:

This parameter belongs to 802.3ad. If you do not understand it, that does not matter: simply read the 802.3ad specification on its own, leaving the Linux bonding driver aside. The existence of this option shows that the Linux bonding driver fully supports the dynamic port aggregation protocol.
arp_interval and arp_ip_target:

Sends ARP packets to some fixed addresses at a fixed interval to monitor the link. Some configurations need ARP to monitor the link because this is layer-3 link monitoring. Monitoring the NIC status or link-layer PDUs can only cover the interfaces at the two ends of the twisted pair; it cannot monitor the health of the whole path to the next-hop router or the target host.
primary:

Names the preferred slave port. When a selection event occurs, this port is chosen ahead of the others for as long as it is available, similar in spirit to the selection behavior in the 802.3ad protocol.
fail_over_mac:

Determines whether the NICs use the same MAC address in hot standby mode. If they do not, the gratuitous ARP mechanism is needed to update the ARP caches of other machines. For example, suppose NIC 1 and NIC 2 are in hot standby mode, NIC 1 has the address mac1, NIC 2 has the address mac2, and NIC 1 has always been the master. When NIC 1 suddenly goes down, NIC 2 must take over. But NIC 2's MAC address differs from NIC 1's, so when other hosts send replies they still use NIC 1's MAC address. Because mac1 is no longer on the network, no NIC receives those packets. Therefore, when NIC 2 takes over the master role, there should be a callback event whose handler broadcasts a gratuitous ARP announcing the changed MAC address.
lacp_rate:

Controls how often 802.3ad LACPDUs are exchanged with the peer device; these LACPDUs are how the peer automatically obtains the link aggregation information.
max_bonds:

The number of bond device interfaces created initially. The default value is 1. This parameter does not limit the maximum number of bond devices that can be created.
use_carrier:

Selects whether link status is obtained through MII ioctls or from the state maintained by the NIC driver. With the former, the MII interface has to be called to probe the hardware. With the latter, the NIC driver probes the hardware itself (using a watchdog or timer) and the bonding driver only reads the result. This depends on the NIC driver supporting status detection; if it does not, the NIC will always appear to be up.
mode:

This is the most important parameter: it determines the mode in which the bond runs. It cannot be changed while the bond device is up; you must first bring the device down (with ifconfig bondX down) and then configure it. The main modes are:
1. balance-rr or 0:

Round-robin load balancing: traffic is distributed in turn among the physical devices enslaved to bondX. Note that you must use a link status detection mechanism. Otherwise, a device that has gone down is still treated as up and keeps being handed packets to send, which results in packet loss.
2. active-backup or 1:

Hot standby mode. In newer versions, a gratuitous ARP is sent automatically on switchover to avoid faults such as the one described under the fail_over_mac parameter.
3. balance-xor or 2:

I do not know why this needs to be a separate mode when the xmit_hash_policy parameter exists. This mode also distributes traffic, but unlike round-robin it takes the source and destination MAC addresses as input and uses XOR followed by modulo to compute the port a packet is sent from.
4. broadcast or 3:

Sends all data on every port. This mode is XX, but its fault tolerance is strong.
5. 802.3ad or 4:

This is simply the 802.3ad method.
...
xmit_hash_policy:

The importance of this parameter is second only to the mode parameter. The mode parameter defines the distribution mode, and this parameter defines the distribution policy. According to the kernel documentation, this parameter is used for mode 2 and mode 4. I think more complex policies could be defined.
1. layer2:

The layer-2 frame header is the input for computing the egress port. As a result, all data streams that pass through the same gateway are sent from a single port. Refining the distribution policy requires some layer-3 information, but that increases the computing overhead. Everything is a trade-off!

2. layer2+3:

Layer-3 IP header information is added on top of 1. The computation costs more, but the load is better balanced: host-to-host streams are formed, and packets of the same stream go to the same port. Following this idea, to balance the load even better we can also use layer-4 information, at a further increase in cost.
3. layer3+4:

Does this need more explanation? Port-to-port streams are formed and the load is balanced even better. However, it is slower!
That is not the end of the story. We do not want to parallelize the transmission of a single TCP stream, in order to avoid reordering and retransmission, because TCP itself is a serial protocol; Intel's 8257x series NIC chips, for example, try hard to avoid distributing the packets of one TCP stream across different CPUs. Likewise, in a port aggregation environment the same TCP stream should be sent through the same port under this policy. But do not forget that TCP travels over IP, and IP may need to fragment. A fragmented IP datagram cannot be classified as part of a TCP stream until it has been reassembled (at the peer, or at a device doing NAT). IP is a completely connectionless protocol and only cares about fragmenting according to the local MTU, so there are many cases in which the layer3+4 policy will not fully satisfy us. The problem is not that serious, though: IP only fragments according to the local MTU, while TCP is end-to-end and can use mechanisms such as MSS and path MTU discovery, together with the sliding window, to minimize IP fragmentation. So the layer3+4 policy is perfectly fine!
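To make the difference between the policies concrete, here is a minimal shell sketch of the hash arithmetic. It follows the formulas as I remember them from older versions of bonding.txt, "(source MAC XOR destination MAC) modulo slave count" for layer2 and "((source port XOR dest port) XOR ((source IP XOR dest IP) AND 0xffff)) modulo slave count" for layer3+4; newer kernels hash differently. The function names are made up for illustration, and using only the last octet of each MAC address is an assumption about how the driver applied the layer2 formula.

```sh
#!/bin/sh
# Sketch only: hypothetical helper names, formulas from older bonding.txt.

# mac_last_byte MAC -> last octet of a colon-separated MAC, in decimal
# (assumption: only the last octet takes part in the layer2 hash)
mac_last_byte() {
    echo $(( 0x${1##*:} ))
}

# ip_to_int A.B.C.D -> the IPv4 address as a single integer
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the dotted quad into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# hash_layer2 SRC_MAC DST_MAC SLAVE_COUNT -> index of the egress slave
hash_layer2() {
    src=$(mac_last_byte "$1")
    dst=$(mac_last_byte "$2")
    echo $(( (src ^ dst) % $3 ))
}

# hash_layer34 SRC_IP SRC_PORT DST_IP DST_PORT SLAVE_COUNT -> slave index
hash_layer34() {
    src_ip=$(ip_to_int "$1")
    dst_ip=$(ip_to_int "$3")
    ip_bits=$(( (src_ip ^ dst_ip) & 0xffff ))
    echo $(( (($2 ^ $4) ^ ip_bits) % $5 ))
}
```

With two slaves, hash_layer34 10.0.0.1 40000 10.0.0.2 80 2 and hash_layer34 10.0.0.1 40001 10.0.0.2 80 2 pick different slaves, while hash_layer2 returns the same slave for every packet between one pair of MAC addresses, whatever the ports are. That is the "same gateway, same port" effect described under layer2.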
miimon and ARP:

miimon can only detect link-layer status, that is, the link-layer connection between the two directly attached ends (a switch port and the NIC directly connected to it). If the switch's uplink port goes down, miimon cannot detect it. It is therefore necessary to check status at the network layer as well, and the simplest, most direct method is ARP: send ARP requests to the gateway, and if no ARP reply has arrived when the timer expires, consider the link down.
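As a rough sketch of what ARP monitoring of the gateway looks like through sysfs, the function below writes the arp_interval and arp_ip_target files of a bond. The function name, the SYSFS_NET variable (added only so the sketch can be pointed at a scratch directory), and the example address 192.0.2.1 are all my own assumptions, not something taken from the kernel documentation.

```sh
#!/bin/sh
# Sketch only. SYSFS_NET is a hypothetical override; on a real system the
# bonding files live under /sys/class/net and writing them requires root.
SYSFS_NET=${SYSFS_NET:-/sys/class/net}

# enable_arp_monitor BOND INTERVAL_MS TARGET_IP...
enable_arp_monitor() {
    bond_dir=$SYSFS_NET/$1/bonding
    interval=$2
    shift 2
    # How often ARP requests are sent, in milliseconds
    echo "$interval" > "$bond_dir/arp_interval"
    # A leading "+" adds a target address; one write per target
    for target in "$@"; do
        echo "+$target" > "$bond_dir/arp_ip_target"
    done
}
```

For example, enable_arp_monitor bond0 1000 192.0.2.1 would ask bond0 to probe the (hypothetical) gateway 192.0.2.1 once per second.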
IV. How do I configure it?



1. First, the traditional method. It is written up in the kernel documentation, so it is mentioned here only for reference; remember to install ifenslave.
2. The newer sysfs configuration method
First, check that your system has a /sys directory and that the file system mounted on it is of type sysfs. Then follow these steps:
Step 1: load the module

root@zyxx: modprobe bonding
Step 2: enter the corresponding directory

root@zyxx: cd /sys/class/net/

Step 3: list the files and get familiar with the terrain (this step can be omitted)

root@zyxx:/sys/class/net# ls
bond0 bonding_masters eth0 eth1 eth2 eth3 eth4 eth5 lo

Step 4: check which bond devices currently exist

root@zyxx:/sys/class/net# cat bonding_masters
bond0

Step 5: add or delete a bond device

root@zyxx:/sys/class/net# echo +(-)X > bonding_masters

# Note: write either +X or -X. "+" adds a bond device and "-" deletes one. The X in "+X" is any name you like; the X in "-X" must be a name that already exists in bonding_masters.
Step 6: go to the newly created bondmy and configure it as you like

root@zyxx:/sys/class/net/bondmy/bonding# ls
active_slave ad_num_ports ad_select arp_validate lacp_rate mode primary use_carrier
ad_actor_key ad_partner_key arp_interval downdelay miimon num_grat_arp slaves xmit_hash_policy
ad_aggregator ad_partner_mac arp_ip_target fail_over_mac mii_status num_unsol_na updelay

1. Add eth2 to bondmy
root@zyxx:/sys/class/net/bondmy/bonding# echo +eth2 > slaves

2. Set the link monitoring interval
root@zyxx:/sys/class/net/bondmy/bonding# echo 100 > miimon

3. Set the mode to hot standby
root@zyxx:/sys/class/net/bondmy/bonding# echo 1 > mode
...
Step 7: reflections

The configuration process is very simple. The module only needs to be loaded once; after that, everything can be configured dynamically.
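The sysfs steps above can be gathered into one small shell function. This is only a sketch under my own assumptions: setup_bond and the SYSFS_NET override are hypothetical names, and the function writes the mode before adding any slave, because the mode cannot be changed once the bond is up and in use.

```sh
#!/bin/sh
# Sketch only. SYSFS_NET is a hypothetical override for trying this against
# a scratch directory; the real files are under /sys/class/net (root needed).
SYSFS_NET=${SYSFS_NET:-/sys/class/net}

# setup_bond BOND MODE MIIMON_MS SLAVE...
setup_bond() {
    bond=$1
    mode=$2
    miimon=$3
    shift 3
    # Create the bond device unless bonding_masters already lists it
    if ! grep -qw "$bond" "$SYSFS_NET/bonding_masters"; then
        echo "+$bond" > "$SYSFS_NET/bonding_masters"
    fi
    # Set the mode first, while the bond is still down and has no slaves
    echo "$mode" > "$SYSFS_NET/$bond/bonding/mode"
    # Link monitoring interval in milliseconds
    echo "$miimon" > "$SYSFS_NET/$bond/bonding/miimon"
    # Enslave each physical NIC; "+" adds, "-" would remove
    for nic in "$@"; do
        echo "+$nic" > "$SYSFS_NET/$bond/bonding/slaves"
    done
}
```

Calling setup_bond bondmy 1 100 eth2 corresponds to the walkthrough: hot standby mode, a 100 ms monitoring interval, and eth2 as a slave. Depending on the kernel version, the bond or the slave may also need to be brought up or down around these writes, which the sketch leaves out.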
V. Implementation of the bonding driver



Having seen how pleasant the configuration is and actually configured a very useful network, I am eager to look at the source code. This is why I like Linux: it lets you patch at will. In fact, the bonding driver is very simple. Like tap, it basically has three parts:
Part 1: initialization


This part is very simple, that is, initializing a net_device and registering it.
Part 2: implement the user configuration interface


There are two types of interface. The first is the traditional ioctl-based configuration, which means implementing an ioctl. The other goes through sysfs and is also very simple: you implement the store/show methods of some attributes. Whichever one you use, one function must be called, namely netdev_set_master. The most important thing this function does is set the master of the physical NIC to the bond device initialized in Part 1:
slave->master = master;
Part 3: initialize the transmit and receive interfaces


The transmit interface is very simple and resembles a multi-port bridge. During initialization the bond device sets start_xmit to bond_start_xmit. Inside bond_start_xmit there is a switch-case. What does it switch on? The bonding mode, of course. For example, in mode 0, the round-robin mode, bond_start_xmit ends up in the following code:
switch-case: bond_xmit_roundrobin...
bond_for_each_slave_from(bond, slave, i, start_at) {
    /* send on the first slave that is up, has link, and is active */
    if (IS_UP(slave->dev) &&
        (slave->link == BOND_LINK_UP) &&
        (slave->state == BOND_STATE_ACTIVE)) {
        res = bond_dev_queue_xmit(bond, skb, slave->dev);
        break;
    }
}
On the receive side, every device starts delivering packets from netif_receive_skb. Near the beginning of that function is the following code:
if (orig_dev->master) {
    if (skb_bond_should_drop(skb))
        null_or_orig = orig_dev;
    else
        /* hand the packet to the bond device instead of the physical NIC */
        skb->dev = orig_dev->master;
}
orig_dev is the physical NIC. If it has a master, that is, once the physical NIC has been bound to a bond device through the user interface, its master field is set. After this code runs, the dev of the skb is set to the master, which is the bond device. From there on, the packet is passed to the upper-layer protocols by the bond device, because all the layer-3 information lives on the bond device, not on the physical NIC orig_dev.
VI. Advanced topics



Admittedly, the analysis in this article is extremely simple. For more advanced and in-depth topics, such as switch configuration and performance issues, see $kernel-root/Documentation/networking/bonding.txt. I strongly recommend reading that document; after reading it, you will be a bonding expert.
Appendix: Configuration Methods


Linux's sysfs and proc (sysctl and so on) mechanisms let you configure system parameters and devices through file I/O. When I first learned to configure Cisco/Huawei devices, I spent tens of thousands of RMB, and in the end the only thing I had really learned was "?". Later I found that netsh on Windows is configured the same way, and is even easier than Cisco/Huawei devices, but all of these are imperative. Imperative here means it reads like an English sentence: the subject is the current user of the system, the predicate is the command, the object is the target device or system, and the modifiers are the command's parameters. This style is very convenient for people used to natural language, but the cost in memorization is not small. In Linux, similar configuration can be done by reading and writing files. I do not know whether that is good or bad. Anyway, when I first used sysfs to configure bonding, my first reaction was "great": no more man ifenslave and no more reading documents in vim. kobject organizes the bonding-related content under $sysfsroot/class/net/, and you configure things simply by accessing the files. Newer bonding versions have almost abandoned the original ifenslave ioctl method, and any bonding configuration can be done through sysfs.
Most importantly, compared with the classic imperative ioctl-style configuration, file-based or Netlink-based configuration avoids bloating the device's VFS-layer code. Look at how messy ioctl code is: every time you add a new command you have to modify one or even several layers of ioctl code, which in practice means adding a case statement to dispatch the new command. The ioctl code ends up layered, yet that still does not solve the excessive coupling between the device driver and the interface.
