Ingress traffic control in Linux using a virtual NIC

The Linux kernel implements a packet queuing mechanism which, combined with a variety of queuing disciplines, allows quite complete traffic control and traffic shaping (hereinafter referred to as traffic control). Traffic control can be applied at two points: egress and ingress. Egress is the trigger point just before a packet is transmitted, while ingress is the trigger point just after a packet has been received. Linux traffic control is asymmetric between these two points: no queuing mechanism is implemented at ingress, so you can hardly do real queue-based traffic control on incoming packets.
Although iptables can be used to simulate rate limiting, implementing it with a real queue takes some extra thought. It is rather like the core idea of e-mail: you have perfect control over what you send, but none over what you receive. Once you absorb that idea, the difficulty of queue-based ingress traffic control becomes clear. Difficult, however, does not mean impossible.
At ingress, Linux implements only a simple, queueless form of flow control. Setting aside the drawbacks of a queueless mechanism compared with a real queue, the position of the ingress hook alone shows how little control we have there. It runs before the packet enters the IP layer, so no IP-layer hooks are available: Netfilter's PREROUTING has not run yet, no mark such as IPMARK can be seen, and there is certainly no association with a socket. It is therefore hard to configure any meaningful queuing policy; all you can see are IP addresses and port numbers.
A realistic requirement for ingress traffic control is to limit the data that clients upload to a local service, such as large file uploads to a server. On one hand, packets beyond what the CPU can process can be discarded early at a low layer, relieving CPU pressure; on the other hand, the I/O seen by the user-space service can be made smoother, depending on the policy.

Since there is a demand, we have to try to meet it. What we know so far is that we can only shape traffic at egress, yet the data must not actually leave the machine. In addition, we want to apply rich policies that go far beyond what the 5-tuple of addresses, protocol and ports can express. An obvious solution is to use a virtual network card, as shown in the figure below:




The schematic above is very simple, but the implementation has several details to get right, the most important being routing. We know that route lookup always consults the local table first, unconditionally, even when policy routing is in use. So if the target address is a local address and we want packets to follow the path above, the address must be removed from the local table. Once it is removed, however, the machine no longer answers ARP requests for that address. Several workarounds are possible:

1. use static ARP entries or ebtables to handle ARP, or use arping to actively announce the address;
2. use a non-local address, modify the xmit function of the virtual NIC, and DNAT it to the local address inside that function, thereby bypassing the local table.

With that settled, we can build the virtual NIC device itself according to the schematic above. First allocate it:

dev = alloc_netdev(0, "ingress_tc", tc_setup);

Then initialize its key fields:
static const struct net_device_ops tc_ops = {
        .ndo_init       = tc_dev_init,
        .ndo_start_xmit = tc_xmit,
};

static void tc_setup(struct net_device *dev)
{
        ether_setup(dev);
        dev->mtu                = (16 * 1024) + 20 + 20 + 12;
        dev->hard_header_len    = ETH_HLEN;     /* 14 */
        dev->addr_len           = ETH_ALEN;     /* 6  */
        dev->tx_queue_len       = 0;
        dev->type               = ARPHRD_LOOPBACK;      /* 0x0001 */
        dev->flags              = IFF_LOOPBACK;
        dev->priv_flags        &= ~IFF_XMIT_DST_RELEASE;
        dev->features           = NETIF_F_SG | NETIF_F_FRAGLIST
                                  | NETIF_F_TSO
                                  | NETIF_F_NO_CSUM
                                  | NETIF_F_HIGHDMA
                                  | NETIF_F_LLTX
                                  | NETIF_F_NETNS_LOCAL;
        dev->ethtool_ops        = &tc_ethtool_ops;
        dev->netdev_ops         = &tc_ops;
        dev->destructor         = tc_dev_free;
}
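The code above references tc_dev_init, tc_dev_free and tc_ethtool_ops without showing them. A minimal sketch of what they could look like, assuming a software-only device that needs nothing beyond what tc_setup() already does (these bodies are my own guess, not taken from the original article; in a real source file they would of course be defined before tc_ops and tc_setup):

/* Hypothetical helpers; the article does not show these bodies. */
static int tc_dev_init(struct net_device *dev)
{
        return 0;               /* nothing extra to allocate here */
}

static void tc_dev_free(struct net_device *dev)
{
        free_netdev(dev);       /* invoked via dev->destructor */
}

/* No special ethtool handling for a purely software device. */
static const struct ethtool_ops tc_ethtool_ops;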

Then construct its xmit function
static netdev_tx_t tc_xmit(struct sk_buff *skb, struct net_device *dev)
{
        skb_orphan(skb);
        /* go straight past layer 2! */
        skb->protocol = eth_type_trans(skb, dev);
        skb_reset_network_header(skb);
        skb_reset_transport_header(skb);
        skb->mac_len = skb->network_header - skb->mac_header;
        /* local reception */
        ip_local_deliver(skb);
        return NETDEV_TX_OK;
}
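For completeness, here is one way the device could be allocated and registered from a module, wrapping the alloc_netdev() call shown earlier in an init/exit pair. This is a minimal sketch assuming the 2.6.x-era three-argument alloc_netdev(); error handling is kept to the essentials:

#include <linux/module.h>
#include <linux/netdevice.h>

static struct net_device *tc_dev;

static int __init tc_module_init(void)
{
        int err;

        tc_dev = alloc_netdev(0, "ingress_tc", tc_setup);
        if (!tc_dev)
                return -ENOMEM;

        err = register_netdev(tc_dev);  /* calls tc_dev_init() via ndo_init */
        if (err)
                free_netdev(tc_dev);
        return err;
}

static void __exit tc_module_exit(void)
{
        unregister_netdev(tc_dev);      /* dev->destructor frees the device */
}

module_init(tc_module_init);
module_exit(tc_module_exit);
MODULE_LICENSE("GPL");

After loading, the interface still has to be brought up, given an egress queuing discipline, and have routes or rules pointed at it; that configuration is where the actual traffic control policy lives.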
Next, let us consider how to steer data packets into the virtual NIC. There are three options:
Solution 1: If you do not want to touch ARP at all, you have to modify the kernel. Here I introduced a routing flag, RT_F_INGRESS_TC: all routes carrying this flag have their traffic steered into the virtual NIC we built. To keep things policy-driven, I did not hard-code this in the data path; instead I changed the lookup order for RT_F_INGRESS_TC routes so that the policy routing tables are searched before the local table. That way, policy routing can be used to direct packets into the virtual NIC.
Solution 2: Write a Netfilter hook and, in its target, mark the traffic to be shaped with NF_QUEUE aimed at the virtual NIC; in the queue handler, set skb->dev to the virtual NIC and call dev_queue_xmit(skb). The schematic then becomes even simpler: you only need to re-inject the packet from the virtual NIC's hard_xmit. A sketch of this approach follows the list of solutions below. (As it turns out, I learned later that IMQ is implemented in exactly this way; fortunately my effort was not wasted.)
Solution 3: This is the quick-test option and was my original idea: delete the target IP address from the local table and then announce it manually with arping. My tests were based on this solution, and the results were good.
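As a rough illustration of Solution 2, the hook side might look like the sketch below, assuming the 2.6.x Netfilter hook API. The TCP-only match is purely illustrative, and the queue-handler side, which would set skb->dev to the virtual NIC and call dev_queue_xmit() as described above (plus the nf_queue_entry bookkeeping that IMQ performs), is omitted:

#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/ip.h>
#include <linux/in.h>

/* Divert selected inbound packets to NF_QUEUE so that a kernel-side
 * queue handler can hand them to the virtual NIC. */
static unsigned int tc_divert_hook(unsigned int hooknum,
                                   struct sk_buff *skb,
                                   const struct net_device *in,
                                   const struct net_device *out,
                                   int (*okfn)(struct sk_buff *))
{
        struct iphdr *iph = ip_hdr(skb);

        if (iph->protocol == IPPROTO_TCP)       /* illustrative match only */
                return NF_QUEUE_NR(0);          /* hand the skb to queue 0 */
        return NF_ACCEPT;
}

static struct nf_hook_ops tc_divert_ops = {
        .hook     = tc_divert_hook,
        .pf       = PF_INET,
        .hooknum  = NF_INET_PRE_ROUTING,
        .priority = NF_IP_PRI_FIRST,
};

/* Registered with nf_register_hook(&tc_divert_ops); the matching queue
 * handler would set skb->dev to the virtual NIC and call
 * dev_queue_xmit(skb), which is exactly what IMQ does. */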
No matter which of the above solutions is chosen, the effect is the same: since we cannot shape traffic at the NIC's ingress, we do it at an egress instead, only not on the physical NIC but on a virtual one that can be customized to meet almost any need. This shows how powerful virtual NICs are: tun, lo, nvi, tc... all of them, and all of the subtlety lives in their respective xmit functions.

Restructuring the Linux kernel protocol stack processing flow

I personally think there is another asymmetry in Linux network processing, namely the split after routing. After the route lookup, processing branches according to the destination: if the routing result carries the LOCAL flag, ip_local_deliver is called, otherwise ip_forward is called (see how rth->u.dst.input is set in ip_route_input_slow). As a result, LOCAL data is delivered directly to the local stack, which is also what the RFCs recommend; the routing algorithm is roughly: check whether the destination address is LOCAL, and if so, receive locally... In my view, however (though I always carry some unexamined prejudice), there is no need for the two paths to part ways. It would be cleaner to send every packet out through a single function: local packets would also be handed to a network card, just a LOOPBACK-style one, so that the entire IP receive path can be described uniformly. Likewise, locally generated data could be emitted through a virtual LOOPBACK NIC instead of being handed straight to the routing module. As shown in the following figure:
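Concretely, the dispatch that creates this asymmetry is set up during route lookup. A simplified paraphrase of the 2.6.x-era ip_route_input_slow() (not a verbatim excerpt):

/* Simplified: the routing code picks the layer-3 input handler directly. */
if (res.type == RTN_LOCAL)
        rth->u.dst.input = ip_local_deliver;    /* local reception branch */
else
        rth->u.dst.input = ip_forward;          /* forwarding branch */

/* The proposal above would instead always hand the skb to some device's
 * xmit routine and let a LOOPBACK-style virtual NIC call
 * ip_local_deliver() itself, just as tc_xmit() does. */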





Such symmetric processing may look as if it hurts efficiency: logically, after a packet reaches layer 3 it drops back to layer 2, and the LOOPBACK NIC at layer 2 then calls ip_local_deliver for local reception. In code, however, this is just a few extra function calls. For the LOCAL path you can install a pass-through route that goes from ip_forward to dev_queue_xmit, and this brings four benefits:
1. the INPUT hook point is no longer required;
2. the FORWARD hook point is no longer required;
3. the OUTPUT hook point is no longer needed either. Together with points 1 and 2, Netfilter's overall architecture loses its saddle shape entirely, and data sent from the local machine no longer needs to be rerouted after DNAT. Handling routing inside the NAT module has always been awkward anyway...

4. policy routing becomes more flexible, because even a route whose destination lies in the local table can be pushed directly into another policy table.
Given point 3 above, we can easily implement any number of virtual NIC devices and put almost any functionality inside them. This ingress traffic control, for example, becomes easy to implement: no kernel changes and no complicated configuration, just a virtual NIC driver plus a few route entries. Re-implementing the protocol stack's processing logic this way may violate its layered design principles to some extent, but it really does bring many benefits, and of course those benefits come at a cost. The disappearance of the three HOOK points deserves particular attention. Although OUTPUT is mounted after routing, it should really be handled before routing; and isn't INPUT likewise preceded by a route lookup? Why split what is essentially POSTROUTING into INPUT and FORWARD, when FORWARD is itself just one outcome of routing? What would really make sense is for INPUT and FORWARD to be sub-HOOK points mounted under POSTROUTING, implemented as HOOK operations, and OUTPUT could then simply be dropped! HOOK points should not be made that specific; that kind of differentiation belongs in the HOOK operations.

The Linux IMQ patch

Only after implementing my own virtual NIC and configuring a usable ingress traffic control setup did I read the IMQ implementation for the Linux kernel. Since my own need for traffic control has never been pressing, I had never paid much attention to IMQ. My principle is to first attempt an implementation myself, or at least work out my own solution as a thought experiment, before studying the standard one (the so-called standard is not that reliable anyway; it is really just another way of saying "the implementation everyone accepts", and there is no true standard at all), and then compare the two. In that spirit of comparison, I have already sketched the IMQ approach above.
The core of the IMQ patch consists of the following points:
1. The name. I think the "Intermediate" in IMQ is very apt: it states clearly that an intermediate layer is used to adapt ingress traffic control;
2. A virtual NIC device. This is the so-called intermediate device;
3. Use of NF_QUEUE. Netfilter's NF_QUEUE mechanism is used to steer packets that need shaping directly into the virtual device, rather than indirectly through policy routing;
4. Extension of the sk_buff structure with IMQ-related management fields. I personally consider this its weakness; I prefer not to modify core code. That said, in my own virtual device implementation, because ip_local_deliver is not exported by the kernel, I would have had to dig its address out of /proc/kallsyms, which is hardly standard either, so I ended up modifying the kernel too, even though I only added one line of code:
EXPORT_SYMBOL(ip_local_deliver);
But I still feel uncomfortable!
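As an aside, on kernels where kallsyms_lookup_name() is exported to modules (roughly 2.6.33 through 5.6), the symbol can be resolved at load time instead of patching the kernel. A minimal sketch; the function-pointer name is mine, not from the article:

#include <linux/kallsyms.h>
#include <linux/skbuff.h>
#include <linux/errno.h>

/* Resolve the unexported ip_local_deliver() at module load time. */
static int (*ip_local_deliver_fn)(struct sk_buff *skb);

static int resolve_ip_local_deliver(void)
{
        ip_local_deliver_fn =
                (int (*)(struct sk_buff *))kallsyms_lookup_name("ip_local_deliver");
        return ip_local_deliver_fn ? 0 : -ENOENT;
}

/* tc_xmit() would then call ip_local_deliver_fn(skb) instead of
 * calling ip_local_deliver(skb) directly. */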
The overall IMQ diagram is as follows:


