Linux 3.5: Effect of the Routing Subsystem Refactoring on Redirect Routes and the Neighbor Subsystem


A few years ago I wrote several articles about Linux removing route cache support. That removal came from a refactoring of the routing subsystem; I will not repeat the specific reasons here. This article describes the impact of that refactoring on redirect routes and on the neighbor subsystem.

In fact, it was only in the last three months that I discovered how large these effects are. I cannot elaborate on the details of my own work, so here is just a summary of some implementations in the open-source Linux kernel protocol stack, for future reference. It would be a great honor if anyone benefits from it.

rtable, dst_entry, and neighbour. In the IP protocol stack, sending an IP packet consists of two parts:
IP route lookup. To send a packet successfully, you must first have a matching route. This is the route lookup mandated by the IP protocol; on Linux, the final lookup result is an rtable struct object representing a route entry. Its first embedded field is a dst_entry struct, so the two can be cast to each other (a struct sketch of this layout follows below). The important field is rt_gateway.
rt_gateway is simply the IP address of the next hop toward the destination, which is the core of IP's hop-by-hop forwarding. At this point the IP route lookup is complete.
IP neighbor resolution. The route lookup phase has already produced rt_gateway, so the next step is to get the packet out at layer 2; this is the job of neighbor resolution. We know that rt_gateway is a neighbor; now it must be resolved to a hardware address. A neighbor is any device that is logically directly connected to the local machine. "Logically directly connected" means that, for Ethernet, every device on the same Ethernet segment can be a neighbor of the local machine; the key is which one is chosen as the next hop for the current packet. For a POINTOPOINT device there is only one neighbor, namely the peer device, and "only one" means its hardware address does not need to be resolved at all! Note that ignoring this difference can cause a huge performance loss, which I describe at the end of this article.
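A minimal sketch of that layout (a small field subset with placeholder types; exact members and names vary across kernel versions, so take this as illustrative rather than the kernel's actual definition):

#include <stdint.h>

struct sk_buff;                         /* a network packet (opaque here) */

struct dst_entry {                      /* protocol-independent part of a route result */
        int (*output)(struct sk_buff *skb);  /* output hook used when sending */
        /* ... device, metrics, expiry ... */
};

struct rtable {                         /* IPv4-specific route entry */
        struct dst_entry dst;           /* first member, so the two pointers can be cast to each other */
        uint32_t rt_gateway;            /* next-hop IP address chosen by the lookup (__be32 in the kernel) */
        /* ... */
};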

Statement:

For ease of description, rtable is not mentioned below; the result of a route lookup is represented by dst_entry. The code that follows is not the actual Linux protocol-stack code but easily readable, abstracted pseudocode. So dst_entry here is not the kernel's dst_entry struct but simply "a route entry". The reason is that dst_entry represents the protocol-independent part, and the content of this article is also independent of any specific protocol, so the protocol-specific rtable structure is not used in the pseudocode to represent a route entry.


The Linux kernel's refactoring of the routing subsystem. Before kernel version 3.5, the routing subsystem kept a route cache hash table that cached the most recently used dst_entry (for IPv4, rtable) route entries. For each packet, the route cache is searched first, keyed by the packet's IP address tuple; on a hit the dst_entry is returned directly, otherwise the system routing table is searched.
In kernel 3.5 the route cache is gone. Why it was removed is not the focus of this article and has been described elsewhere; what matters here is that removing the route cache had side effects on the neighbor subsystem, and those side effects have proved beneficial. The following sections are devoted to this. Before describing the impact of the refactoring on the neighbor subsystem in detail, let's look at another change: the change in how redirect routes are implemented.
A redirect route is always a redirect of a route entry that already exists on the local machine. In earlier kernels, however, redirect routes were saved in various places, such as inet_peer, which means the routing subsystem was coupled with other parts of the protocol stack. In those early kernels, no matter where a redirect route entry lived, it only took effect once it entered the route cache. After the route cache was removed entirely, the question of where to keep redirect routes was exposed. To solve the redirect problem inside the routing subsystem itself, the refactored kernel keeps an exception hash table for each route entry in the routing table. A route entry, fib_info, looks roughly like this:
fib_info {
        Address   nexthop;
        Hash_list exception;
};
The table items in this exception table are similar to the following:
exception_entry {
        Match_info info;
        Address    new_nexthop;
};
In this way, when a Redirect is received, an exception_entry record is initialized and inserted into the corresponding exception hash table. When a route is looked up and a fib_info is found, before the final dst_entry is built the exception hash table is first searched using the match info (such as the destination/source IP information). If a matching exception_entry is found, the nexthop in the fib_info is not used to construct the dst_entry; the new_nexthop in the exception_entry is used instead.
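The following is a minimal user-space model of that check. It assumes the match info is just the destination address and keeps the exceptions in a plain linked list (the kernel hashes them per route); resolve_nexthop and all field names here are illustrative, not the kernel's:

#include <stdint.h>
#include <stddef.h>

typedef uint32_t addr_t;                 /* stand-in for an IPv4 address */

struct exception_entry {
        addr_t match_daddr;              /* destination the ICMP Redirect applied to (match_info) */
        addr_t new_nexthop;              /* gateway learned from the Redirect */
        struct exception_entry *next;
};

struct fib_info {
        addr_t nexthop;                  /* nexthop of the configured route */
        struct exception_entry *exceptions;  /* exception list (a per-route hash table in the kernel) */
};

/* Pick the nexthop used to build the dst_entry: a matching exception wins,
 * otherwise fall back to the route's own nexthop. */
static addr_t resolve_nexthop(const struct fib_info *fi, addr_t daddr)
{
        const struct exception_entry *e;

        for (e = fi->exceptions; e != NULL; e = e->next)
                if (e->match_daddr == daddr)
                        return e->new_nexthop;
        return fi->nexthop;
}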
Having briefly introduced redirect routes, the following sections describe the relationship between routes and neighbors.

The side effects of the refactoring on the neighbor subsystem are summarized as follows:
Neighbours
> Hold link-level nexthop information (for ARP, etc.)
> Routing cache pre-computed neighbors
> Remember: One "route" can refer to several nexthops
> Need to disconnect neighbors from route entries.
> Solution:
Make neighbor lookups cheaper (faster hash, etc.)
Compute neighbors at packet send time...
... instead of using precomputed reference via route
> Most of the work involved removing dependencies on the old setup

In fact, the two should not be associated at all. The routing subsystem and the neighbor subsystem are two subsystems at different layers. The reasonable approach is to use the nexthop value of the route entry as the only link between them, so that a route entry maps to a unique neighbor through it:
dst_entry = route table lookup (or route cache lookup), keyed by skb's destination
nexthop   = dst_entry.nexthop
neigh     = neighbour table lookup, keyed by nexthop
However, the implementation in the Linux protocol stack was far more complex than this, and the story has to start before the 3.5 refactoring.

Before the refactoring, thanks to the route cache, the dst_entry for most skbs could be found in the cache without searching the routing table at all. The assumption behind the route cache is that the vast majority of skbs do not need a routing-table lookup; ideally they hit the cache. For neighbors, the obvious choice was to bind the neighbor to the dst_entry: when the dst_entry is found in the cache, the neighbor is found along with it. In other words, the route cache cached not only the dst_entry but also the neighbor.
In fact, before kernel 3.5 the dst_entry struct had a neighbour field pointing to the neighbour bound to that route entry. After finding the route entry in the route cache, the neighbour could be taken from it directly and its output callback invoked.
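A rough picture of that pre-3.5 layout (again a field subset; in some 3.x versions the field is actually named _neighbour and accessed through RCU helpers, so this is only a sketch):

struct neighbour;                        /* resolved layer-2 neighbour entry */
struct net_device;
struct sk_buff;

struct dst_entry {
        struct net_device *dev;          /* egress device */
        struct neighbour  *neighbour;    /* neighbour bound when the entry entered the route cache */
        int (*output)(struct sk_buff *skb);  /* ends up invoking the bound neighbour's output */
        /* ... metrics, expires, __refcnt ... */
};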
From this we can deduce when dst_entry and neighbour are bound: after a routing-table lookup, i.e. on a route cache miss. When the lookup result is inserted into the route cache, the logic that binds a neighbour to it is executed.
Like the route cache, the neighbor subsystem maintains its own neighbor table and performs operations on it such as replacement, update, and expiration. There is a great deal of coupling between the neighbor table and the route cache table. Before describing that coupling, let's look at the overall logic:
func ip_output(skb):
        dst_entry = lookup_from_cache(skb.destination);
        if dst_entry == NULL
        then
                dst_entry = lookup_fib(skb.destination);
                nexthop = dst_entry.gateway ?: skb.destination;
                neigh = lookup(neighbour_table, nexthop);
                if neigh == NULL
                then
                        neigh = create(neighbour_table, nexthop);
                        neighbour_add_timer(neigh);
                end
                dst_entry.neighbour = neigh;
                insert_into_route_cache(dst_entry);
        end
        neigh = dst_entry.neighbour;
        neigh.output(neigh, skb);        ----> to layer 2
endfunc
Consider the following two questions:
If a neighbor expires when its timer fires, can it be deleted?
If a route cache entry expires when the route cache timer fires, can it be deleted?

If you can answer these two questions accurately, you understand the relationship between the routing subsystem and the neighbor subsystem. Let's take the first question first.
If the neighbor were deleted, the route cache entries bound to it might still exist, so when a later skb matched such a route cache entry the neighbor could no longer be found or used; since dst_entry and neighbour are bound only on a route cache miss, no re-binding would take place. And because route entries map to neighbors many-to-one, a neighbor cannot keep back-references to the route cache entries that use it. A deleted neighbor still referenced by dst_entry.neighbour is a dangling pointer, which leads to an oops and finally a kernel panic. So the obvious answer is that even when a neighbor expires it cannot be deleted; it can only be marked invalid, which is managed with reference counting. Now for the second question.
A route cache entry that expires can be deleted, but remember to decrement the reference count of the neighbor bound to it; if the count reaches 0, the neighbor is freed. This is exactly the class of neighbor that could not be deleted when it expired. From this we can see the coupling between the route cache and the neighbor table: the expiration and deletion of a neighbor bound to some dst_entry can only be driven from the route cache entry, unless the neighbor is not bound to any dst_entry at all. The overall sending logic is therefore modified as follows:
func ip_output(skb):
        dst_entry = lookup_from_cache(skb.destination);
        if dst_entry == NULL
        then
                dst_entry = lookup_fib(skb.destination);
                nexthop = dst_entry.gateway ?: skb.destination;
                neigh = lookup(neighbour_table, nexthop);
                if neigh == NULL
                then
                        neigh = create(neighbour_table, nexthop);
                        neighbour_add_timer(neigh);
                end
                inc(neigh.refcnt);
                dst_entry.neighbour = neigh;
                insert_into_route_cache(dst_entry);
        end
        neigh = dst_entry.neighbour;
        # if neigh is INVALID, it has to be handled inside the output callback
        neigh.output(neigh, skb);
endfunc

func neighbour_add_timer(neigh):
        inc(neigh.refcnt);
        neigh.timer.func = neighbour_timeout;
        timer_start(neigh.timer);
endfunc

func neighbour_timeout(neigh):
        cnt = dec(neigh.refcnt);
        if cnt == 0
        then
                free_neigh(neigh);
        else
                neigh.status = INVALID;
        end
endfunc

func dst_entry_timeout(dst_entry):
        neigh = dst_entry.neighbour;
        cnt = dec(neigh.refcnt);
        if cnt == 0
        then
                free_neigh(neigh);
        end
        free_dst(dst_entry);
endfunc
Finally, let's look at what problems this brings.
If the gc parameters of the neighbor table and of the route cache are not tuned consistently, for example the neighbors expire quickly while route cache entries expire slowly, many neighbors cannot be freed and the neighbor table fills up. In that case the route cache has to be forcibly shrunk, which is feedback coupling from the neighbor subsystem back into the routing subsystem; this is very messy:
func create(neighbour_table, nexthop):
retry:
        neigh = alloc_neigh(nexthop);
        if neigh == NULL or neighbour_table.num > MAX
        then
                shrink_route_cache();
                retry;
        end
endfunc

The relationship between the route cache's gc timer and the neighbor subsystem is described as follows:
You may find documentation about those obsolete sysctl values:
net.ipv4.route.secret_interval has been removed in Linux 2.6.35; it was used to trigger an asynchronous flush at a fixed interval to avoid filling the cache.
net.ipv4.route.gc_interval has been removed in Linux 2.6.38. It is still present until Linux 3.2 but has no effect. It was used to trigger an asynchronous cleanup of the route cache. The garbage collector is now considered efficient enough for the job.
UPDATED: net.ipv4.route.gc_interval is back for Linux 3.2. It is still needed to avoid exhausting the neighbour cache because it allows to cleanup the cache periodically and not only above a given threshold. Keep it to its default value of 60.


All of this has changed since the 3.5 kernel!

After the refactoring, kernels 3.5 and later removed route cache support entirely; that is, the routing table is queried for every packet (ignoring the dst_entry cached in the socket for now). With no route cache, there are no cache expiration or replacement problems to handle, and the whole routing subsystem becomes completely stateless. Therefore dst_entry no longer needs to be bound to a neighbor: since a routing-table lookup per packet is no longer expensive, the extra cost of a lookup in the much smaller neighbor table can be ignored (although the lookup itself cannot be avoided). So the neighbour field was removed from dst_entry, and the IP sending logic becomes:
func ip_output(skb):
        dst_entry = lookup_fib(skb.destination);
        nexthop = dst_entry.gateway ?: skb.destination;
        neigh = lookup(neighbour_table, nexthop);
        if neigh == NULL
        then
                neigh = create(neighbour_table, nexthop);
        end
        neigh.output(skb);
endfunc
The route entry is no longer associated with a neighbor, so the neighbor table can expire entries on its own, and the frequent overflows that used to happen because the route cache gc was too slow disappear.
Not only that, the code also looks much cleaner.

Details: there are many documents about the Linux neighbor subsystem, but almost all of them talk about ARP: the various complicated ARP protocol operations, queue handling, the state machine. Almost none describe neighbors outside of ARP, such as on POINTOPOINT and LOOPBACK devices, so this article adds an example of that in this last section. Again, let's start with a question:
For a NOARP device, such as a POINTOPOINT device, who is the neighbor of the skbs it sends?
On broadcast Ethernet, to send a packet to a remote host you need to resolve the "next hop" address: every packet is sent through a gateway, and this gateway is abstracted as an IP address on the same network segment, so ARP is needed to resolve it to a hardware address. For a POINTOPOINT device, however, there is only one fixed peer, there is no layer 2 broadcast or multicast, and therefore no gateway concept; in other words, the next hop is simply the destination IP address.
According to the ip_output function above, the key used to query the neighbor table is nexthop. For a POINTOPOINT device, nexthop would then be the skb's destination address, and if it is not found in the table a neighbor is created under that key. If the skbs sent through a POINTOPOINT device cover a very large destination address space, a huge number of neighbors would be created and inserted into the neighbor table at the same time, which inevitably runs into the lock problem: all the insertions spin on the write side of the neighbor table's read/write lock!
The logic of neigh_create is as follows:
struct neighbour *neigh_create(struct neigh_table *tbl, const void *pkey,
                               struct net_device *dev)
{
        struct neighbour *n1, *rc, *n = neigh_alloc(tbl);
        ......
        write_lock_bh(&tbl->lock);
        /* insert into the hash table */
        ......
        write_unlock_bh(&tbl->lock);
        ......
}
When skbs for a massive number of destination IP addresses are sent through a POINTOPOINT device, this lock becomes a bottleneck that cannot be avoided completely! However, the kernel is not that silly; it avoids the problem as follows:
__be32 nexthop = ((struct rtable *)dst)->rt_gateway ?: ip_hdr(skb)->daddr;
if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
        nexthop = 0;
This means that as long as packets go out through the same POINTOPOINT device and carry the same pseudo layer 2 information (for example, IPGRE), all skbs use the same neighbor, regardless of their destination addresses. In the case of an IPIP tunnel, the device has no layer 2 information at all, so all skbs sent through IPIP tunnels use a single neighbor, even if they are sent through different IPIP tunnel devices.
But after the 3.5 refactoring, this became a tragedy!
Let's look at the 4.4 kernel:
static inline __be32 rt_nexthop(const struct rtable *rt, __be32 daddr)
{
    if (rt->rt_gateway)
        return rt->rt_gateway;
    return daddr;
}

static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *skb)
{
  ......
    nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
    neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
    if (unlikely(!neigh))
        neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
    if (!IS_ERR(neigh)) {
        int res = dst_neigh_output(dst, neigh, skb);
        return res;
    }
  ......
}
As you can see, the dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT) check has disappeared! That means the kernel has become silly again: the phenomenon analyzed in the previous section can happen in kernels after 3.5, and in fact it does.
When I ran into this problem, before looking carefully at the pre-3.5 implementation, my idea was to initialize a global dummy neighbor whose output simply calls dev_queue_xmit to send directly:
static const struct neigh_ops dummy_direct_ops = {
    .family =           AF_INET,
    .output =           neigh_direct_output,
    .connected_output = neigh_direct_output,
};

struct neighbour dummy_neigh;

void dummy_neigh_init()
{
    memset(&dummy_neigh, 0, sizeof(dummy_neigh));
    dummy_neigh.nud_state = NUD_NOARP;
    dummy_neigh.ops = &dummy_direct_ops;
    dummy_neigh.output = neigh_direct_output;
    dummy_neigh.hh.hh_len = 0;
}

static inline int ip_finish_output2(struct sk_buff *skb)
{
  ......
    nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
    if (dev->type == ARPHRD_TUNNEL) {
        neigh = &dummy_neigh;
    } else {
        neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
    }
    if (unlikely(!neigh))
        neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
  ......
}
After looking at the pre-3.5 implementation, I found:
if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
        nexthop = 0;
So I decided to adopt this approach instead; my own code was the less elegant one! That produces the following patch:
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -202,6 +202,8 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *s
        rcu_read_lock_bh();
        nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
+       if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
+               nexthop = 0;
        neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
        if (unlikely(!neigh))
                neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
