The next-hop cache in the Linux kernel routing path after 3.5


Up to and including Linux 3.5, the kernel had a routing cache. The intent behind it was good, but the reality was often regrettable. Two of its problems stand out:
1. The hash-based cache is vulnerable to DDoS attacks that exploit hash collisions (this has been written about at length, so I will not repeat it here);
2. Caching route entries whose egress device is a point-to-point device degrades performance.

These problems ultimately stem from a mismatch between how the routing cache is searched and how the routing table is searched. The routing cache requires an exact match on the full tuple, so it has to be built as a flat, one-dimensional hash table, whereas the routing table lookup uses longest-prefix match and can therefore be organized as a multidimensional structure.
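To make the contrast concrete, here is a minimal user-space sketch, not kernel code and with all names invented for illustration, of the two lookup styles: an exact-match cache has to hash the whole tuple, while a routing table lookup only needs the longest prefix that matches the destination.

/* lpm_vs_cache.c - illustrative only, not kernel code */
#include <stdint.h>
#include <stdio.h>

struct prefix { uint32_t net; int len; const char *via; };

/* A toy routing table: the longest matching prefix wins. */
static const struct prefix fib[] = {
    { 0x0A000000u, 8,  "via gw-A" },   /* 10.0.0.0/8  */
    { 0x0A010000u, 16, "via gw-B" },   /* 10.1.0.0/16 */
    { 0x00000000u, 0,  "default"  },   /* 0.0.0.0/0   */
};

static const char *lpm_lookup(uint32_t daddr)
{
    const struct prefix *best = NULL;
    for (unsigned i = 0; i < sizeof(fib) / sizeof(fib[0]); i++) {
        uint32_t mask = fib[i].len ? 0xFFFFFFFFu << (32 - fib[i].len) : 0;
        if ((daddr & mask) == fib[i].net && (!best || fib[i].len > best->len))
            best = &fib[i];
    }
    return best ? best->via : "unreachable";
}

/* An exact-match cache, by contrast, must key on the whole tuple. */
static unsigned int cache_bucket(uint32_t saddr, uint32_t daddr, uint8_t tos)
{
    return ((saddr * 2654435761u) ^ (daddr * 2246822519u) ^ tos) & 1023u;
}

int main(void)
{
    uint32_t src = 0xC0A80001u;        /* 192.168.0.1 */
    uint32_t dst = 0x0A010203u;        /* 10.1.2.3    */

    printf("LPM result for 10.1.2.3: %s\n", lpm_lookup(dst));
    printf("exact-match bucket for (src, dst, tos): %u\n",
           cache_bucket(src, dst, 0));
    return 0;
}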

A route lookup eventually yields a route entry. Leaving policy routing aside, let us look at how meaningless it is to cache routes whose egress device is a point-to-point device.

For a point-to-point device the neighbor set contains exactly one next hop: its peer. It does not even need a neighbor-binding step. Yet if routes of this kind are stuffed into the cache, they consume a huge amount of memory. Imagine 100,000 destination IP addresses that need to communicate, with a source set of another 100,000 addresses: this can produce on the order of 1,000,000 route cache entries. To push it to the extreme, suppose the system holds only a handful of routing table entries at that point; looking up the routing table might then be cheaper than looking up the route cache. In particular, if the routing result goes out a point-to-point device, there is really only one entry worth caching at all. This is the difference between one and many: we usually notice that going from zero to one matters, but going from one to many matters just as much and cannot be underestimated.



Suppose the system has an Ethernet card eth0. Since the same network segment can have many neighbors, different destination IP addresses may resolve to different next hops, so we have to cache a route entry for every eth0-related destination and perform an exact match for every packet. Now suppose the system also has a point-to-point network card. It has exactly one neighbor: for a point-to-point device the peer is logically unique and deterministic, the only member of the device's neighbor set, so no neighbor-binding step is needed at all; any packet sent out of the device necessarily reaches that unique peer. In this case, caching a route entry for every destination reachable through the point-to-point interface adds very little value. Unfortunately, the Linux routing cache mechanism cannot avoid it, because before the route cache is consulted and the routing table is searched, we have no way of knowing that a given packet will ultimately leave through a point-to-point device.

One solution: if the routing table lookup shows that the egress device is point-to-point, set a NOCACHE flag that means "do not cache this entry", and release the entry as soon as the packet has been sent. I think this implementation is simple and clear. I meant to implement it last September, to improve the performance of one of our gateway products, but after I left that job the matter was left behind as well, until I ran into the same problem again recently.
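As a rough illustration of that approach (a sketch only; the structures and the NOCACHE flag name below are stand-ins, not the kernel's actual types), the idea is simply to mark the lookup result as uncacheable when the egress device is point-to-point and free it after transmission:

#include <stdbool.h>
#include <stdio.h>

#define TOY_IFF_POINTOPOINT 0x10   /* stands in for IFF_POINTOPOINT   */
#define TOY_DST_NOCACHE     0x01   /* hypothetical "do not cache" bit */

struct toy_netdev { unsigned int flags; };
struct toy_route  { struct toy_netdev *dev; unsigned int flags; };

/* After the FIB lookup: never cache routes that leave via a p2p device. */
static void mark_nocache_if_p2p(struct toy_route *rt)
{
    if (rt->dev->flags & TOY_IFF_POINTOPOINT)
        rt->flags |= TOY_DST_NOCACHE;
}

static bool should_cache(const struct toy_route *rt)
{
    return !(rt->flags & TOY_DST_NOCACHE);
}

int main(void)
{
    struct toy_netdev tunl0 = { .flags = TOY_IFF_POINTOPOINT };
    struct toy_route  rt    = { .dev = &tunl0, .flags = 0 };

    mark_nocache_if_p2p(&rt);
    printf("cache this route? %s\n", should_cache(&rt) ? "yes" : "no");
    return 0;
}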

However, I have a better suggestion: upgrade the kernel to 3.6 or later. That is all it takes. Of course, if you have to maintain an old product based on an older kernel and are not free to change the code, you are stuck. As it happens, at both my old company and the new one, I have been dealing with 2.6.32-based code for six years.

To broaden the point: route lookup is a genuinely awkward problem. A single device may well hold hundreds of thousands of routes, yet the number of directly connected neighbors can be expressed in a single byte; most nodes probably have no more than 10 neighbors. We burn enormous effort on cache queries and longest-prefix matching just to pull a handful of needles out of a haystack of hundreds of thousands of entries. It has always been a challenging area; compared with TCP acceleration, this field is more self-contained and is unaffected by outside factors, only the algorithm itself matters. And it is not only point-to-point devices; even for ethX devices the ending is sad: you may configure dozens of routes, yet in the end there are probably only five or six distinct next hops. The point-to-point device is just the more extreme case. For a point-to-point device we typically write a route like this:
route add -host/-net a.b.c.d/e dev tunlX
For an ethX device, however, we generally must write the route as:
route add -host/-net a.b.c.d/e gw A.B.C.D
That is, with a point-to-point device the route simply tells the stack to push the packet out of that device, whereas for ethX devices (and in general for all broadcast networks and NBMA devices) address resolution, i.e. next-hop resolution, is needed to know where to send the frame. And that is not all: the routing cache also affects the neighbor subsystem. Put simply, a route cache entry holds a reference to a neighbor, and that neighbor cannot be freed until the route entry is released. Even though a point-to-point device needs no neighbor resolution, the code would still have to treat it specially; unfortunately the Linux kernel has no such special handling, and route entries for point-to-point devices are stuffed into the route cache all the same.
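The reference relationship can be pictured with a simplified model (illustrative only; the real kernel objects are struct rtable and struct neighbour with proper atomic reference counting): a cached route pins its neighbor through a reference count, so the neighbor can only go away after every cached route that points at it has been released.

#include <stdio.h>

/* Simplified stand-ins for the kernel's neighbour and cached-route objects. */
struct toy_neigh { int refcnt; unsigned int ipaddr; };
struct toy_rtcache_entry { struct toy_neigh *neigh; };

static void neigh_hold(struct toy_neigh *n) { n->refcnt++; }

static void neigh_release(struct toy_neigh *n)
{
    if (--n->refcnt == 0)
        printf("neighbour %#x can now be freed\n", n->ipaddr);
}

static void rt_bind_neigh(struct toy_rtcache_entry *rt, struct toy_neigh *n)
{
    rt->neigh = n;
    neigh_hold(n);            /* the cached route pins the neighbour */
}

static void rt_free(struct toy_rtcache_entry *rt)
{
    neigh_release(rt->neigh); /* only now may the neighbour go away */
}

int main(void)
{
    struct toy_neigh peer = { .refcnt = 1, .ipaddr = 0x0A000002 };
    struct toy_rtcache_entry rt1 = { 0 }, rt2 = { 0 };

    rt_bind_neigh(&rt1, &peer);
    rt_bind_neigh(&rt2, &peer);
    rt_free(&rt1);            /* neighbour still held by rt2 */
    rt_free(&rt2);
    neigh_release(&peer);     /* drop the initial reference  */
    return 0;
}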

That is the dilemma of route lookup. The difficulty lies in the many-to-one or many-to-few nature of the mapping: in such a situation, building an exact-match cache can make the outcome even worse, so handling everything in one uniform way is the friendlier choice.

From Linux 3.6 onward, support for the routing cache was removed: every packet to be sent must go through a routing table lookup. The process then becomes roughly the following logic:

dst = lookup_fib_table(skb);
dst_nexthop = alloc_entry(dst);
neigh = bind_neigh(dst_nexthop);
neigh.output(skb);
release_entry(dst_nexthop);
This is a clean process, but at the implementation level of the protocol stack a new problem appears: the alloc/release pair introduces enormous memory churn. We know that memory allocation and deallocation reach beyond the CPU core itself and are expensive. Linux does have the slab cache, but as we also know, caches are layered.
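As a toy, user-space illustration of that churn (purely illustrative; a real kernel path uses slab allocations, not malloc), compare a path that allocates and frees a next-hop object per packet with one that reuses a single cached object:

#include <stdlib.h>

struct toy_nexthop { unsigned int gateway, oif; char pad[64]; };

/* Naive path: one allocation and one free per packet sent. */
static void send_one_alloc(unsigned int gw, unsigned int oif)
{
    struct toy_nexthop *nh = malloc(sizeof(*nh));
    if (!nh)
        return;
    nh->gateway = gw;
    nh->oif = oif;
    /* ... bind_neigh(nh); neigh.output(skb); ... */
    free(nh);
}

/* Cached path: the same object is reused for every packet that hits it. */
static struct toy_nexthop cached_nh;

static void send_one_cached(unsigned int gw, unsigned int oif)
{
    cached_nh.gateway = gw;
    cached_nh.oif = oif;
    /* ... bind_neigh(&cached_nh); neigh.output(skb); ... */
}

int main(void)
{
    for (int i = 0; i < 1000000; i++) {
        send_one_alloc(0x0A000001, 2);   /* per-packet malloc/free   */
        send_one_cached(0x0A000001, 2);  /* no allocator involvement */
    }
    return 0;
}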

In fact, after 3.6 Linux implements a new kind of cache. It no longer caches route entries, because that would require an exact match on the skb's tuple; instead it caches the next hop, and this cache is only reached after going through the lookup_fib_table routine.

That is quite a feat, because the cached object is unique, unless there are exceptions (and there are!). This solves the many-to-one and many-to-few problem. Before you can consult this cache you must first search the routing table, and once that search is done you in principle already know the next hop, unless, again, there are exceptions.

The sole purpose of this new next-hop cache is to avoid allocating and releasing memory. Pseudo-code as follows:

dst = lookup_fib_table(skb);
dst_nexthop = lookup_nh_cache(dst);
if dst_nexthop == NULL; then
    dst_nexthop = alloc_entry(dst);
    if dst_nexthop.cache == true; then
        insert_into_nh_cache(dst_nexthop);
    endif
endif
neigh = bind_neigh(dst_nexthop);
neigh.output(skb);
if dst_nexthop.cache == false; then
    release_entry(dst_nexthop);
endif
In this way, the cache no longer stores entire route entries; it stores the next hop of the routing table lookup result.



In the general case a route entry has only one next hop, so this cache is extremely meaningful. Most of the time, when a route lookup yields a deterministic dst, its next-hop cache is hit and there is no need to allocate a new dst_nexthop structure; the cached one is used directly. If, unluckily, there is no hit, a dst_nexthop is allocated and, where possible, inserted into the next-hop cache. If, unluckily again, the insertion fails, the NOCACHE flag is set, which means the dst_nexthop will be released directly once it has been used.

The preceding paragraphs describe the hit case. So when does the next-hop cache miss? That is easy: it is simply whenever the lookup_nh_cache routine above returns NULL. Several situations can cause this, for example when existing route entries are deleted or updated for some reason, and so on.

I will describe the MTU problem of point-to-point virtual NICs a bit later; before that, let me explain the second common scenario: the redirect route.



A so-called redirect route updates one of this node's routing table entries. Note that the update is not permanent but temporary, so Linux's approach is not to modify the routing table directly but to modify the next-hop cache instead. The process is asynchronous, with pseudo-code as follows:

# The ip_out routine runs the IP transmit logic: it first looks up the standard
# routing table, then consults the redirect list and the next-hop cache to
# decide whether a new dst_nexthop has to be allocated. Unless the NOCACHE flag
# was set from the start, a miss allocates a new dst_nexthop and inserts it
# into the next-hop cache for use by subsequent packets, which avoids
# allocating and releasing memory over and over again.
func ip_out:
    dst = lookup_fib_table(skb);
    dst_nexthop = lookup_redirect_nh(skb.daddr, dst);
    if dst_nexthop == NULL; then
        dst_nexthop = lookup_nh_cache(dst);
    endif
    if dst_nexthop == NULL; then
        dst_nexthop = alloc_entry(dst);
        if dst_nexthop.cache == true; then
            insert_into_nh_cache(dst_nexthop);
        endif
    endif
    neigh = bind_neigh(dst_nexthop);
    neigh.output(skb);
    if dst_nexthop.cache == false; then
        release_entry(dst_nexthop);
    endif
endfunc

# The ip_route_redirect routine creates or updates a dst_nexthop and inserts it
# into a linked list keyed by the destination address of the packet.
func ip_route_redirect:
    dst = lookup_fib_table(icmp.redirect.daddr);
    dst_nexthop = new_dst_nexthop(dst, icmp.redirect.newnexthop);
    insert_into_redirect_nh(dst_nexthop);
endfunc

That is the next-hop cache logic of kernels after 3.6. It is worth noting that it does not reduce the cost of route lookups; it reduces the cost of memory allocation and deallocation. The route lookup itself is unavoidable. However, the result of a route lookup is a route entry, which sits in a hierarchy with the next-hop structure and the neighbor entry, like this:
route entry - next-hop structure - neighbor entry
To send a packet, after the route lookup finishes the stack must bind a next-hop structure and then bind a neighbor. The routing table is just a static table; the data path has no right to modify it, it is only consulted. The protocol stack has to construct a next-hop structure from the information in the route entry it found, and this is where caching the next hop matters: it cuts the cost of that construction.
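The hierarchy can be pictured with a few simplified structures (a rough model for illustration; the real kernel objects are fib_info/fib_nh, rtable/dst_entry and neighbour, with far more fields):

#include <stdint.h>

/* L2 neighbor: the resolved link-layer address of the next hop. */
struct toy_neigh {
    uint32_t ipaddr;
    unsigned char lladdr[6];
};

/* Per-flow next-hop object: what the post-3.6 cache actually stores. */
struct toy_nexthop {
    uint32_t gateway;
    int oif;
    unsigned int mtu;
    struct toy_neigh *neigh;   /* bound just before transmission */
};

/* Static routing table entry: the data path only reads it. */
struct toy_fib_entry {
    uint32_t prefix;
    int plen;
    uint32_t gateway;          /* configured next-hop address */
    int oif;                   /* egress interface index      */
    unsigned int mtu;
};

/* Send path: route entry -> construct next hop -> bind neighbor -> output. */
static void toy_send(const struct toy_fib_entry *fe,
                     struct toy_nexthop *nh, struct toy_neigh *n)
{
    nh->gateway = fe->gateway; /* constructed from the route entry, */
    nh->oif     = fe->oif;     /* which itself is never modified    */
    nh->mtu     = fe->mtu;
    nh->neigh   = n;
    /* neigh_output(n, skb) would follow here */
}

int main(void)
{
    struct toy_fib_entry fe = { 0x0A000000, 8, 0x0A000001, 2, 1500 };
    struct toy_nexthop nh;
    struct toy_neigh n = { 0x0A000001, { 0 } };

    toy_send(&fe, &nh, &n);
    return 0;
}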

Finally, let us look at the effect. If you only read the code, you might be discouraged when you see the rt_dst_alloc call on the input or output path. But if you watch the actual behavior with the following command:
watch -d -n 1 "cat /proc/net/stat/rt_cache"
you will find that the counters in the in_slow_tot and out_slow_tot columns grow very slowly, or even stand still.

This means that the vast majority of packets hit the next-hop cache on both the receive and the transmit path. If you see something different, i.e. one or both of these counters growing quickly, there may be two reasons:
1. Your kernel may not be new enough.
By that I mean your kernel has the bug present in the early 3.10 versions, where the RT_CACHE_STAT_INC(in_slow_tot) call happens before the following code:

if (res.fi) {
    if (!itag) {
        rth = rcu_dereference(FIB_RES_NH(res).nh_rth_input);
        if (rt_cache_valid(rth)) {
            skb_dst_set_noref(skb, &rth->dst);
            err = 0;
            goto out;
        }
        do_cache = true;
    }
}

rth = rt_dst_alloc(net->loopback_dev,
                   IN_DEV_CONF_GET(in_dev, NOPOLICY), false, do_cache);
That is, the counter increment is a leftover from the routing-cache era that mistakenly treats the next-hop cache as the old routing cache. You only need to move RT_CACHE_STAT_INC(in_slow_tot) down next to rt_dst_alloc, so it is counted only when an allocation really happens.
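Concretely, the fix amounts to something like the following sketch against the 3.10-era code shown above (an illustration of the idea, not an exact upstream patch): increment in_slow_tot only on the path where the cache is actually missed and rt_dst_alloc() runs.

if (res.fi) {
    if (!itag) {
        rth = rcu_dereference(FIB_RES_NH(res).nh_rth_input);
        if (rt_cache_valid(rth)) {
            /* next-hop cache hit: no in_slow_tot increment */
            skb_dst_set_noref(skb, &rth->dst);
            err = 0;
            goto out;
        }
        do_cache = true;
    }
}

RT_CACHE_STAT_INC(in_slow_tot);    /* moved here: counted only on a miss */
rth = rt_dst_alloc(net->loopback_dev,
                   IN_DEV_CONF_GET(in_dev, NOPOLICY), false, do_cache);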


2. You may be using a point-to-point device but have not set its MTU correctly.
We know that an IPIP tunnel device is a virtual NIC on Linux; before a packet actually leaves the box it has to go through another round of encapsulation with an extra IP header. Suppose the data eventually leaves through an ethX device whose MTU defaults to 1500. If the MTU of the IPIP tunnel device is also 1500, or anything larger than 1500 minus the necessary header overhead, the MTU has to be updated; and since the next-hop cache entry carries the MTU information, an MTU update means the next-hop cache entry has to be updated as well.
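To put a number on that overhead: plain IPIP encapsulation prepends one extra 20-byte IPv4 header, so with an underlying ethX MTU of 1500 the inner packet must satisfy inner MTU <= 1500 - 20 = 1480 bytes, which is why IPIP tunnel interfaces are usually given an MTU of 1480 or lower (encapsulations such as GRE add even more header, shrinking the budget further).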

On an ordinary physical device this is not a problem, because the MTU is already known before the IP layer sends the data. For an IPIP tunnel device, however, the protocol stack does not know the packet still has to be encapsulated again until it actually hands the data to the tunnel, so it cannot tell in advance that the packet is too large for the MTU; with GSO and TSO involved, things get even more complicated. At this point we have two solutions:
1). Lower the MTU of the IPIP tunnel appropriately, so that even after re-encapsulation the packet does not exceed the physical MTU; then the MTU never has to be updated and the next-hop cache entry is never released because of it.
2). Start with the code!
Looking at the code of rt_cache_valid, the goal is to keep the next-hop cache entry's flag from becoming DST_OBSOLETE_KILL, and that is tied to the MTU: in __ip_rt_update_pmtu you just have to make sure the initial MTU of the next-hop cache entry is not 0. This can be done by adding a check after rt_dst_alloc that initializes the rth field:

if (dev_out->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
    rth->mtu = dev_out->mtu;
else
    rth->mtu = 0;
After testing, the effect is good!

By the way, much like many security protocols, routing table entries and the next-hop cache use a version (generation) ID to manage validity: an entry is valid only if its ID matches the global ID. This greatly simplifies flushing: when a flush happens, you only need to increment the global generation ID.
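A minimal sketch of this versioning scheme (illustrative only; in the kernel the analogous mechanism is the per-namespace generation counter checked by the routing code):

#include <stdbool.h>
#include <stdio.h>

/* Global generation counter: bumping it invalidates every cached entry. */
static unsigned int global_genid;

struct toy_nh_cache_entry {
    unsigned int genid;       /* generation the entry was created in */
    unsigned int gateway;
};

static bool entry_is_valid(const struct toy_nh_cache_entry *e)
{
    return e->genid == global_genid;
}

static void flush_all_cached_routes(void)
{
    global_genid++;           /* O(1) flush: no list walking needed */
}

int main(void)
{
    struct toy_nh_cache_entry e = { .genid = global_genid,
                                    .gateway = 0x0A000001 };

    printf("valid before flush: %d\n", entry_is_valid(&e));
    flush_all_cached_routes();
    printf("valid after flush:  %d\n", entry_is_valid(&e));
    return 0;
}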

Now I can summarize. After Linux 3.6 the route cache was removed and replaced by the next-hop cache, along with various oddities such as the handling of redirect routes. The point of all this is to reduce the overhead of memory management, not the cost of the lookup itself. So let me close with a few words on memory overhead versus lookup overhead.

The two are not on the same level. Memory overhead is mainly tied to the memory-management data structures and to the architecture, which is a complicated subject, whereas lookup overhead is comparatively simple: it depends only on the algorithm's time and space complexity and on the architecture. But why trade lookup cost against memory cost at all? That remains a philosophical question without an answer.
