The impact of the Linux 3.5 kernel routing subsystem refactoring on redirect routes and the neighbour subsystem

Source: Internet
Author: User

A few years ago I wrote several articles about Linux removing support for the route cache. That removal grew out of a refactoring of the routing subsystem; I will not repeat the reasons here. This article describes the impact of that refactoring on redirect routes and the neighbour subsystem.

In fact, it was only in the last three months that I realized how large these effects are. This article does not go into the details of that work; it is only a summary of some implementation knowledge about the open source Linux kernel protocol stack, kept for future reference, and I would be honored if anyone benefits from it.

Route entries: rtable, dst_entry and neighbour

In the IP protocol stack, sending an IP packet consists of two parts:

IP route lookup

To send a packet successfully there must be a matching route, which is found by the route lookup logic defined by the IP protocol specification. The details of route lookup are not the focus of this article. On Linux, the final result is an rtable structure object, which represents a route entry. The first field embedded in it is a dst_entry struct, so the two can be cast to each other. Its important field is rt_gateway.
rt_gateway is the IP address of the next hop to which the packet must be sent in order to reach its destination; this is the core of hop-by-hop IP forwarding. This concludes the IP route lookup.
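The embedding and the next-hop choice described above can be sketched in a few lines of user-space C. This is only an illustration under stated assumptions: the two structs are drastically reduced stand-ins for the kernel ones, and rt_to_dst, dst_to_rt and pick_nexthop are hypothetical helper names, not kernel functions. A gateway value of 0 is assumed to mean "no gateway, the destination is on-link".

```c
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-ins for the kernel structures (hypothetical field sets). */
struct dst_entry {
    int refcnt;
};

struct rtable {
    struct dst_entry dst;   /* must stay the first member */
    uint32_t rt_gateway;    /* next-hop IPv4 address, 0 assumed to mean on-link */
};

/* Because dst is the first member, a pointer to the rtable and a pointer
 * to its embedded dst_entry hold the same address, so both casts are valid. */
static struct dst_entry *rt_to_dst(struct rtable *rt)
{
    return (struct dst_entry *)rt;
}

static struct rtable *dst_to_rt(struct dst_entry *dst)
{
    return (struct rtable *)dst;
}

/* Next hop: the gateway if the route has one, else the destination itself. */
static uint32_t pick_nexthop(const struct rtable *rt, uint32_t daddr)
{
    return rt->rt_gateway ? rt->rt_gateway : daddr;
}
```

Keeping dst_entry as the first member is what makes the cast cheap: no pointer arithmetic is needed to move between the protocol-independent and the IPv4-specific view of the same route entry.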

IP neighbour resolution

The IP route lookup phase gave us rt_gateway; next we move to the layer-2 implementation, which is the job of IP neighbour resolution. We know that rt_gateway is the neighbour, and now it must be resolved to a hardware address. A neighbour is any network device that is logically directly connected to the local host. "Logically directly connected" means that on Ethernet, every device on the whole Ethernet segment can be a neighbour of the local host; what matters is which one is chosen as the next hop for the current packet. A point-to-point device, by contrast, has exactly one neighbour, the peer device, and having only one means there is no hardware address to resolve! It is worth noting that ignoring this distinction causes a huge performance loss, which I explain at the end of this article.

Note:

For convenience, the rest of this article no longer mentions rtable; route lookup results are represented by dst_entry instead. The code below is not actual Linux protocol stack code but abstract pseudo-code written for ease of explanation, so dst_entry is not the kernel's dst_entry structure; it simply stands for a route entry. The reason is that dst_entry represents the protocol-independent part, and the content of this article is not tied to a specific protocol, so the pseudo-code does not use the protocol-specific rtable struct to represent a route entry.


Refactoring of the routing subsystem in the Linux kernel

Before Linux kernel 3.5, the routing subsystem had a route cache hash table in front of the routing table. It cached the most frequently used dst_entry (for IPv4, rtable) route entries. For each packet, the route cache was searched first using the packet's IP address tuple; on a hit the dst_entry was taken directly from the cache, otherwise the system routing table was searched.
In the 3.5 kernel the route cache is gone. The reasons are not the focus of this article and are described elsewhere. Removing the route cache had a side effect on the neighbour subsystem, and that side effect turned out to be beneficial; most of the rest of this article is spent on it. Before describing in detail how the refactoring affected the neighbour subsystem, there is another, simpler change to mention: the implementation of redirect routes changed.
A redirect route is, by definition, a redirection of a route entry that already exists on the local host. In early kernels, however, redirect routes were stored in various places such as inet_peer, which meant the routing subsystem was coupled to other parts of the stack. In those kernels, no matter where a redirect route entry was stored, it ultimately had to enter the route cache to take effect. Once the route cache was gone completely, the question of where redirect routes are stored was exposed. In order to "resolve redirect routes within the routing subsystem itself", the refactored kernel keeps an exception hash table for each route entry in the routing table. A route entry, fib_info, looks roughly like this:

fib_info {
        Address   nexthop;
        Hash_list exception;
};

An entry in this exception table looks roughly like this:

exception_entry {
        Match_info info;
        Address    new_nexthop;
};

With this in place, when a redirect is received, an exception_entry record is initialized and inserted into the corresponding exception hash table. During route lookup, once a fib_info has been found and before the final dst_entry is constructed, the exception hash table is searched using match_info such as the source IP. If a matching exception_entry is found, the nexthop in fib_info is no longer used to build the dst_entry; instead, the new_nexthop from the matching exception_entry is used.
After this brief introduction to redirect routes, the rest of the article describes the relationship between routing and neighbours.
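A minimal user-space sketch of such a per-route exception table might look as follows. It is not the kernel's data structure: the bucket count, the hash function, the helper names exception_insert and resolve_nexthop, and the choice of a single IPv4 address (match_addr) as the match key are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define EXC_BUCKETS 8   /* hypothetical table size */

struct exception_entry {
    uint32_t match_addr;            /* stands in for match_info (assumed: one address) */
    uint32_t new_nexthop;           /* next hop learned from the redirect */
    struct exception_entry *next;   /* hash chain */
};

struct fib_info {
    uint32_t nexthop;                               /* configured next hop */
    struct exception_entry *exception[EXC_BUCKETS]; /* per-route exception table */
};

static unsigned int exc_hash(uint32_t addr)
{
    return (addr * 2654435761u) % EXC_BUCKETS;
}

/* Called when a redirect arrives: record the exception, or update an
 * existing one for the same key. The caller owns the storage of e. */
static void exception_insert(struct fib_info *fi, struct exception_entry *e)
{
    unsigned int h = exc_hash(e->match_addr);
    struct exception_entry *cur;

    for (cur = fi->exception[h]; cur; cur = cur->next) {
        if (cur->match_addr == e->match_addr) {
            cur->new_nexthop = e->new_nexthop;
            return;
        }
    }
    e->next = fi->exception[h];
    fi->exception[h] = e;
}

/* Tail of the route lookup: a matching exception overrides fi->nexthop. */
static uint32_t resolve_nexthop(const struct fib_info *fi, uint32_t addr)
{
    const struct exception_entry *cur;

    for (cur = fi->exception[exc_hash(addr)]; cur; cur = cur->next)
        if (cur->match_addr == addr)
            return cur->new_nexthop;
    return fi->nexthop;
}
```

The point of the design is locality: the redirect state hangs off the route entry it modifies, so nothing outside the routing subsystem has to be consulted when the dst_entry is built.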

Side effects of the refactoring on the neighbour subsystem

The following is an excerpt found on the web about the impact on neighbours after the route cache was removed:
Neighbours
>Hold link-level nexthop information (for ARP, etc.)
>Routing cache pre-computed neighbours
>Remember: one "route" can refer to several nexthops
>Need to disconnect neighbours from route entries.
>Solution:
    Make neighbour lookups cheaper (faster hash, etc.)
    Compute neighbours at packet send time ...
    ... instead of using precomputed reference via route
>Most of work involved removing dependencies in old setup

In principle the two should be unrelated: the routing subsystem and the neighbour subsystem sit at different layers. The reasonable approach is to link them through the nexthop value of the route entry, via a single neighbour lookup interface:

dst_entry = route table lookup (or route cache lookup, keyed by the skb destination)
nexthop   = dst_entry.nexthop
neigh     = neighbour table lookup (keyed by nexthop)

However, the implementation in the Linux protocol stack is far more complex than this, and the story starts with the 3.5 kernel refactoring.

Before refactoring

Before the refactoring, because the route cache existed, the dst_entry for an skb could usually be found in the cache, and the routing table was not searched at all. The route cache rests on the assumption that the vast majority of skbs do not need a routing table lookup; ideally, all of them hit the cache. For neighbours, the obvious approach was to bind the neighbour to the dst_entry, so that finding the dst_entry in the cache also finds the neighbour. In other words, the route cache cached not only dst_entry but also neighbour.
In fact, before the 3.5 kernel the dst_entry struct had a neighbour field representing the neighbour bound to the route entry. Once the route entry was found in the route cache, the neighbour could be taken from it directly and its output callback invoked.
We can deduce when dst_entry and neighbour are bound: after a routing table lookup, that is, on a route cache miss. Once the routing table lookup completes and before the result is inserted into the route cache, the neighbour binding logic runs.
Like the route cache, the neighbour subsystem maintains a neighbour table and performs replacement, update, and expiry operations on it. There is heavy coupling between the neighbour table and the route cache. Before describing that coupling, let's look at the overall logic:

func ip_output(skb):
        dst_entry = lookup_from_cache(skb.destination);
        if dst_entry == NULL
        then
                dst_entry = lookup_fib(skb.destination);
                nexthop = dst_entry.gateway ?: skb.destination;
                neigh = lookup(neighbour_table, nexthop);
                if neigh == NULL
                then
                        neigh = create(neighbour_table, nexthop);
                        neighbour_add_timer(neigh);
                end
                dst_entry.neighbour = neigh;
                insert_into_route_cache(dst_entry);
        end
        neigh = dst_entry.neighbour;
        neigh.output(neigh, skb);
endfunc
---->to layer2

Consider the following questions:
If a neighbour expires when the neighbour timer fires, can it be deleted?
If a route cache entry expires when the route cache timer fires, can it be deleted?

If you can answer both questions accurately, you understand the relationship between the routing subsystem and the neighbour subsystem well enough. Let's take the first question first.
Suppose the neighbour is deleted. The route cache entries bound to it may still exist, and a later skb that matches one of them can no longer fetch and use the neighbour, because the binding between dst_entry and neighbour only happens on a route cache miss and cannot be redone at this point. Worse, because route entries and neighbours are in a many-to-one relationship, there is no way to walk back from the neighbour to the route cache entries that reference it, so dst_entry.neighbour would reference a deleted neighbour: a wild pointer that triggers an oops and eventually a kernel panic. The obvious answer, therefore, is that even an expired neighbour cannot be deleted; it can only be marked invalid, which can be implemented with reference counting.
Now the second question. An expired route cache entry can be deleted, but remember to decrement the reference count of the neighbour bound to it and, if the count reaches 0, delete the neighbour. That neighbour is exactly the kind from the first question, one that could not be deleted when it expired. From this we can see that the coupling between the route cache and neighbours means that the deletion of an expired neighbour bound to a dst_entry can only be initiated from the route cache, unless the neighbour is not bound to any dst_entry. The revised sending logic is as follows:

func ip_output(skb):
        dst_entry = lookup_from_cache(skb.destination);
        if dst_entry == NULL
        then
                dst_entry = lookup_fib(skb.destination);
                nexthop = dst_entry.gateway ?: skb.destination;
                neigh = lookup(neighbour_table, nexthop);
                if neigh == NULL
                then
                        neigh = create(neighbour_table, nexthop);
                        neighbour_add_timer(neigh);
                end
                inc(neigh.refcnt);
                dst_entry.neighbour = neigh;
                insert_into_route_cache(dst_entry);
        end
        neigh = dst_entry.neighbour;
        # if the neigh is in the INVALID state, this must be handled in the output callback
        neigh.output(neigh, skb);
endfunc

func neighbour_add_timer(neigh):
        inc(neigh.refcnt);
        neigh.timer.func = neighbour_timeout;
        timer_start(neigh.timer);
endfunc

func neighbour_timeout(neigh):
        cnt = dec(neigh.refcnt);
        if cnt == 0
        then
                free_neigh(neigh);
        else
                neigh.status = INVALID;
        end
endfunc

func dst_entry_timeout(dst_entry):
        neigh = dst_entry.neighbour;
        cnt = dec(neigh.refcnt);
        if cnt == 0
        then
                free_neigh(neigh);
        end
        free_dst(dst_entry);
endfunc
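The lifetime rule in this pseudo-code (an expired neighbour is only marked invalid while route cache entries still reference it, and is freed by whoever drops the last reference) can be tried out in a small user-space C model. Everything here is a simplified assumption: the structs carry only the fields needed, timers are replaced by direct function calls, and freed_neighbours is an instrumentation counter that exists only for this sketch.

```c
#include <stdlib.h>

enum neigh_status { NEIGH_VALID, NEIGH_INVALID };

struct neighbour {
    int refcnt;
    enum neigh_status status;
};

struct dst_entry {
    struct neighbour *neighbour;
};

static int freed_neighbours;    /* instrumentation for this sketch only */

/* Drop one reference; the last holder frees the neighbour. */
static void neigh_put(struct neighbour *n)
{
    if (--n->refcnt == 0) {
        free(n);
        freed_neighbours++;
    }
}

/* A new neighbour starts with one reference, held by its own timer. */
static struct neighbour *neigh_create(void)
{
    struct neighbour *n = calloc(1, sizeof(*n));

    if (n)
        n->refcnt = 1;
    return n;
}

/* Binding a route cache entry to the neighbour takes another reference. */
static void dst_bind(struct dst_entry *dst, struct neighbour *n)
{
    n->refcnt++;
    dst->neighbour = n;
}

/* Neighbour timer fired: mark the entry invalid if route cache entries
 * still pin it, then drop the timer's reference. */
static void neighbour_timeout(struct neighbour *n)
{
    if (n->refcnt > 1)
        n->status = NEIGH_INVALID;
    neigh_put(n);
}

/* Route cache entry expired: release its reference on the neighbour. */
static void dst_entry_timeout(struct dst_entry *dst)
{
    neigh_put(dst->neighbour);
    dst->neighbour = NULL;
}
```

With two route cache entries bound to one neighbour, calling neighbour_timeout leaves the neighbour allocated but invalid; it is only freed once dst_entry_timeout has run for both entries, which is exactly the dependency on the route cache that the text describes.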

Finally, let's see where the problem lies.
If the GC parameters of the neighbour table and those of the route cache are out of step, for example neighbours expire quickly while route cache entries expire slowly, many neighbours cannot be deleted and the neighbour table fills up. In that case the route cache has to be forcibly reclaimed, which is coupling in the reverse direction, from the neighbour subsystem back into the routing subsystem. It is simply too messy:

func create(neighbour_table, nexthop):
retry:
        neigh = alloc_neigh(nexthop);
        if neigh == NULL or neighbour_table.num > MAX
        then
                shrink_route_cache();
                retry;
        end
endfunc
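The reverse coupling can be made concrete with a toy model in C. This is a sketch under invented assumptions, not kernel behaviour: the neighbour table has only NEIGH_MAX slots, each route cache slot stores the index of its bound neighbour plus one (0 meaning empty), shrink_route_cache simply drops every cache entry, and cache_shrinks is a counter that exists only for the demonstration.

```c
#define NEIGH_MAX 2     /* hypothetical neighbour table capacity */
#define CACHE_MAX 8     /* hypothetical route cache capacity */

struct neigh_slot {
    int used;
    int expired;
    int refcnt;         /* number of route cache entries pinning this slot */
};

static struct neigh_slot neigh_tbl[NEIGH_MAX];
static int route_cache[CACHE_MAX];  /* bound neighbour index + 1, 0 = empty */
static int cache_shrinks;           /* instrumentation for this sketch only */

static int find_free(void)
{
    for (int i = 0; i < NEIGH_MAX; i++)
        if (!neigh_tbl[i].used)
            return i;
    return -1;
}

/* Neighbour GC may only reclaim expired entries that nothing references. */
static void neigh_gc(void)
{
    for (int i = 0; i < NEIGH_MAX; i++)
        if (neigh_tbl[i].used && neigh_tbl[i].expired && neigh_tbl[i].refcnt == 0)
            neigh_tbl[i].used = 0;
}

/* Forced route cache flush: releases every reference held on neighbours. */
static void shrink_route_cache(void)
{
    for (int i = 0; i < CACHE_MAX; i++) {
        if (route_cache[i]) {
            neigh_tbl[route_cache[i] - 1].refcnt--;
            route_cache[i] = 0;
        }
    }
    cache_shrinks++;
}

static void cache_bind(int cache_idx, int slot)
{
    route_cache[cache_idx] = slot + 1;
    neigh_tbl[slot].refcnt++;
}

/* Returns a slot index, or -1 if the table is still full after reclaiming. */
static int neigh_create(void)
{
    int slot = find_free();

    if (slot < 0) {                 /* table full: try the neighbour GC */
        neigh_gc();
        slot = find_free();
    }
    if (slot < 0) {                 /* still full: must reach into routing */
        shrink_route_cache();
        neigh_gc();
        slot = find_free();
    }
    if (slot < 0)
        return -1;
    neigh_tbl[slot] = (struct neigh_slot){ .used = 1 };
    return slot;
}
```

When both slots are expired but still pinned by cache entries, the neighbour GC alone frees nothing, and neigh_create only succeeds after flushing the route cache. That detour through another subsystem is the mess the text complains about.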


On the relationship between the route cache GC timer and the neighbour subsystem, there is a well-written article about the route cache, Tuning Linux IPv4 route cache, which says:
You could find documentation about those obsolete sysctl values:
net.ipv4.route.secret_interval has been removed in Linux 2.6.35; it was used to trigger an asynchronous flush at a fixed interval to avoid filling the cache.
net.ipv4.route.gc_interval has been removed in Linux 2.6.38. It is still present until Linux 3.2 and has no effect. It was used to trigger an asynchronous cleanup of the route cache. The garbage collector is now considered efficient enough for the job.
UPDATED: net.ipv4.route.gc_interval is back for Linux 3.2. It is still needed to avoid exhausting the neighbour cache because it allows the cache to be cleaned up periodically and not only above a given threshold. Keep it to its default value.


All of this changed with the 3.5 kernel!

After refactoring

After the refactoring, 3.5 and later kernels no longer support the route cache, which means the routing table is queried for every packet (the case where a socket caches its dst_entry is not considered here). Without a route cache there is no cache expiry or replacement to handle, and the routing subsystem becomes completely stateless. dst_entry therefore no longer needs to be bound to a neighbour: since looking up the routing table for every packet is considered cheap enough, the cost of looking up the much smaller neighbour table is negligible by comparison (although the lookup itself cannot be avoided). So the neighbour field was removed from dst_entry, and the IP send logic becomes:

func ip_output(skb):
        dst_entry = lookup_fib(skb.destination);
        nexthop = dst_entry.gateway ?: skb.destination;
        neigh = lookup(neighbour_table, nexthop);
        if neigh == NULL
        then
                neigh = create(neighbour_table, nexthop);
        end
        neigh.output(skb);
endfunc

Route entries are no longer tied to neighbours, so the neighbour table can expire entries independently, and the problem of the neighbour table frequently filling up because route cache GC is too slow disappears.
Not only that, the code also looks much cleaner.
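To see why expiry becomes independent, here is a small user-space C model of the send-time lookup. It is only a sketch: the route lookup is not modelled (the gateway is passed in directly), the table is a fixed array scanned linearly, and neigh_expire and the neigh_creates counter are hypothetical names introduced for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define NEIGH_SLOTS 8            /* hypothetical table capacity */

struct neigh {
    uint32_t key;                /* next-hop IPv4 address */
    int used;
};

static struct neigh neigh_tbl[NEIGH_SLOTS];
static int neigh_creates;        /* instrumentation for this sketch only */

static struct neigh *neigh_lookup(uint32_t nexthop)
{
    for (int i = 0; i < NEIGH_SLOTS; i++)
        if (neigh_tbl[i].used && neigh_tbl[i].key == nexthop)
            return &neigh_tbl[i];
    return NULL;
}

static struct neigh *neigh_create(uint32_t nexthop)
{
    for (int i = 0; i < NEIGH_SLOTS; i++) {
        if (!neigh_tbl[i].used) {
            neigh_tbl[i].key = nexthop;
            neigh_tbl[i].used = 1;
            neigh_creates++;
            return &neigh_tbl[i];
        }
    }
    return NULL;                 /* table full */
}

/* Expiry needs no cooperation from the routing side: nothing holds a
 * long-lived pointer to the entry, so it can simply be dropped. */
static void neigh_expire(uint32_t nexthop)
{
    struct neigh *n = neigh_lookup(nexthop);

    if (n)
        n->used = 0;
}

/* Per-packet path: the neighbour is found (or created) at send time. */
static struct neigh *ip_output(uint32_t gateway, uint32_t daddr)
{
    uint32_t nexthop = gateway ? gateway : daddr;
    struct neigh *n = neigh_lookup(nexthop);

    if (!n)
        n = neigh_create(nexthop);
    return n;
}
```

Two packets through the same gateway share one entry; after neigh_expire the next packet just creates a fresh one. No reference counting between the two subsystems is needed in this model.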

One detail: neighbours on point-to-point and loopback devices

There is a lot of material on the Linux neighbour subsystem, but almost without exception it is about ARP: the various complex ARP protocol operations, queue operations, state machines and so on. There is almost nothing about neighbours beyond ARP, so this final section adds an example. As before, we start with a question:
For an skb sent from a NOARP device, such as a point-to-point device, who is the neighbour?
On broadcast Ethernet, sending a packet to a remote host requires resolving the "next hop" address. That is, each outgoing packet goes through a gateway, the gateway is abstracted as an IP address on the same network segment, and the ARP protocol is needed to determine its hardware address. A point-to-point device, however, has exactly one fixed peer and no broadcast or multicast layer 2, so there is no concept of a gateway; put differently, its next hop is the destination IP address itself.
According to the ip_output function above, the key used for the neighbour table lookup is nexthop. For a point-to-point device, nexthop is the skb's own destination address, and if no entry is found one is created with that key. Now imagine that the destination address space of the skbs sent through a point-to-point device is very large: a huge number of neighbours are created at the same time and inserted into the neighbour table, which inevitably runs into lock contention. In fact, every one of these inserts spins on the write side of the neighbour table's read-write lock!
The logic of neigh_create is as follows:

struct neighbour *neigh_create(struct neigh_table *tbl, const void *pkey,
                               struct net_device *dev)
{
        struct neighbour *n1, *rc, *n = neigh_alloc(tbl);
        ...
        write_lock_bh(&tbl->lock);
        /* insert into the hash table */
        write_unlock_bh(&tbl->lock);
        .......
}

This is a complete bottleneck when skbs for a massive number of destination IPs are sent through a point-to-point device! But the kernel is not that silly. It avoids the problem as follows:

__be32 nexthop = ((struct rtable *)dst)->rt_gateway ?: ip_hdr(skb)->daddr;
if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
        nexthop = 0;

This means that as long as skbs are sent through the same point-to-point device and the pseudo layer-2 information (for example with ipgre) is the same, all skbs use the same neighbour regardless of their destination address. For an ipip tunnel, since the device has no layer-2 information at all, all skbs sent through ipip tunnel devices use a single neighbour, even if they are sent through different ipip tunnel devices.
But after the 3.5 kernel refactoring, things went wrong!
Let's look directly at the 4.4 kernel:
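The effect of forcing the key to 0 can be illustrated by counting neighbour entries in a toy table. This is a hedged sketch for a single device only: the DEMO_IFF_* constants, neigh_key, neigh_touch and send_burst are names invented for the example, and the table is a plain array rather than the kernel's hash table.

```c
#include <stdint.h>

#define DEMO_IFF_LOOPBACK    0x8    /* stand-ins for the real IFF_* flags */
#define DEMO_IFF_POINTOPOINT 0x10

#define NEIGH_SLOTS 64              /* hypothetical table capacity */

static uint32_t neigh_keys[NEIGH_SLOTS];
static int neigh_count;

/* Lookup-or-create on a toy neighbour table for one device. */
static void neigh_touch(uint32_t key)
{
    for (int i = 0; i < neigh_count; i++)
        if (neigh_keys[i] == key)
            return;                         /* hit: no insert, no write lock */
    if (neigh_count < NEIGH_SLOTS)
        neigh_keys[neigh_count++] = key;    /* miss: one more entry */
}

/* Key selection with the loopback/point-to-point check applied. */
static uint32_t neigh_key(unsigned int dev_flags, uint32_t gateway, uint32_t daddr)
{
    uint32_t nexthop = gateway ? gateway : daddr;

    if (dev_flags & (DEMO_IFF_LOOPBACK | DEMO_IFF_POINTOPOINT))
        nexthop = 0;
    return nexthop;
}

/* Send n packets to n distinct on-link destinations; return table size. */
static int send_burst(unsigned int dev_flags, int n)
{
    neigh_count = 0;
    for (int i = 0; i < n; i++)
        neigh_touch(neigh_key(dev_flags, 0, 0x0a000001u + (uint32_t)i));
    return neigh_count;
}
```

With the point-to-point flag set, 50 distinct destinations produce a single entry; with no flag set (which is also what happens when the check is missing), the same burst produces 50 entries, each of which would have required an insert under the table's write lock.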

static inline __be32 rt_nexthop(const struct rtable *rt, __be32 daddr)
{
        if (rt->rt_gateway)
                return rt->rt_gateway;
        return daddr;
}

static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *skb)
{
        ......
        nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
        neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
        if (unlikely(!neigh))
                neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
        if (!IS_ERR(neigh)) {
                int res = dst_neigh_output(dst, neigh, skb);
                return res;
        }
        ......
}

As you can see, the dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT) check has disappeared! This means the kernel has become silly after all: the phenomenon analyzed in the previous section can occur in kernels after 3.5, and in practice it does occur.
After running into this problem, and before looking at the kernel implementations prior to 3.5, my idea was to initialize a global dummy neighbour that simply transmits directly via dev_queue_xmit:

static const struct neigh_ops dummy_direct_ops = {
        .family           = AF_INET,
        .output           = neigh_direct_output,
        .connected_output = neigh_direct_output,
};

struct neighbour dummy_neigh;

void dummy_neigh_init(void)
{
        memset(&dummy_neigh, 0, sizeof(dummy_neigh));
        dummy_neigh.nud_state = NUD_NOARP;
        dummy_neigh.ops = &dummy_direct_ops;
        dummy_neigh.output = neigh_direct_output;
        dummy_neigh.hh.hh_len = 0;
}

static inline int ip_finish_output2(struct sk_buff *skb)
{
        ......
        nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
        if (dev->type == ARPHRD_TUNNEL) {
                neigh = &dummy_neigh;
        } else {
                neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
        }
        if (unlikely(!neigh))
                neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
        ...
}

After looking at the implementation in kernels before 3.5, I found:

if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
        nexthop = 0;

So I decided to use this instead: less code, and more elegant! It results in the following patch:

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -202,6 +202,8 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *s
         rcu_read_lock_bh();
         nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
+        if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
+                nexthop = 0;
         neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
         if (unlikely(!neigh))
                 neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);



