The next-hop cache for routes after the Linux 3.5 kernel
Up to and including Linux 3.5, there was a route cache. The original intention of this route cache was good, but reality is often disappointing. Here are two of its problems:
1. DDoS attacks against the hash algorithm (there is plenty of material describing this problem, so I will not repeat it here);
2. caching route entries whose egress device is a p2p (point-to-point) device degrades performance.
These problems are essentially caused by the mismatch between the route cache lookup and the route table lookup. The route cache lookup must be exact, so it has to be designed as a one-dimensional hash table, whereas the route table lookup uses longest-prefix matching and can therefore be multidimensional. A route lookup always ends at some route entry; leaving policy routing aside, inserting a route entry whose egress device is a p2p device into the route cache is meaningless.
The peer set of a p2p device contains exactly one next hop, namely its peer, so a p2p device does not even need to bind a neighbor. Yet if such routes are inserted into the route cache, they occupy a huge amount of memory: imagine 10 million destination IP addresses that need to communicate, with 10 million addresses in the source IP set as well; a comparable number of route cache entries may be created. Taken to the extreme, if the system only holds a handful of route entries, looking up the route table may even be cheaper than looking up the route cache. In particular, if the route result is a p2p device, all you need is a way to cache that single entry. This is the many-to-one difference, and this time it is not only meaningful, it cannot be underestimated.
If the system has an Ethernet card eth0, then because the same network segment can have many neighbors and different destination IP addresses may have different next hops, we have no choice but to cache each route entry related to eth0 and perform an exact match for every packet. However, if the system has a p2p network card, it has only one neighbor: for a point-to-point device, the peer is logically unique and definite, and it is the only member of the device's neighbor set. In fact, no neighbor binding is required at all; as long as a packet is sent out of the point-to-point device, it is guaranteed to reach the unique peer. In this case, caching every route entry related to the p2p NIC is of little value. Yet the Linux route cache mechanism could not avoid doing so, because before looking up the route cache and the route table, we do not yet know that the packet will ultimately be sent out of a p2p NIC.
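To make the many-to-few waste concrete, here is a tiny user-space C model (purely illustrative; nothing here is kernel code): an exact-match cache keyed by destination grows with the number of flows, while the useful state behind a p2p device is a single next hop.

#include <stdio.h>

#define FLOWS 1000000

struct next_hop { const char *via; };                         /* the state we actually need           */
struct cache_entry { unsigned daddr; struct next_hop *nh; };  /* what an exact-match cache stores     */

static struct cache_entry cache[FLOWS];                       /* one entry per destination address    */

int main(void)
{
    struct next_hop tunl0_peer = { "the one and only peer of tunl0" };

    for (int i = 0; i < FLOWS; i++) {
        cache[i].daddr = 0x0a000000u + (unsigned)i;           /* 10.0.0.0 + i                         */
        cache[i].nh = &tunl0_peer;                            /* every entry maps to the same next hop */
    }
    printf("exact-match cache entries: %d\n", FLOWS);
    printf("distinct next hops behind the p2p device: 1\n");
    printf("cache memory: %zu bytes for %zu bytes of useful next-hop state\n",
           sizeof(cache), sizeof(tunl0_peer));
    return 0;
}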
One solution is to set a NOCACHE flag when the route table lookup shows that the egress device is a p2p device, meaning the result is not cached and is released once the packet has been sent. I think this implementation is simple and clear. I originally intended to implement it in May, partly to improve the performance of one of our gateway products, but after I left that job it never happened, until I ran into the problem again recently. Now I have a better suggestion: upgrade the kernel to 3.6+. However, if you have to maintain old products based on earlier kernel versions, modifying the code is unavoidable. Fortunately, I have been dealing with 2.6.32 code for six years, at both my old company and the new one.
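A minimal sketch of that NOCACHE idea, with made-up names (this is not code from 2.6.32 or any other kernel, just the shape of the change):

#include <stdbool.h>
#include <stdlib.h>

#define RTF_MY_NOCACHE 0x1u               /* hypothetical flag: do not insert into the route cache */

struct my_route {
    bool dev_is_p2p;                      /* egress device found by the FIB lookup is point-to-point */
    unsigned flags;
};

static void after_fib_lookup(struct my_route *rt)
{
    if (rt->dev_is_p2p)
        rt->flags |= RTF_MY_NOCACHE;      /* skip the cache; the entry only lives for this packet */
}

static void after_xmit(struct my_route *rt)
{
    if (rt->flags & RTF_MY_NOCACHE)
        free(rt);                         /* released as soon as the packet has left */
}

int main(void)
{
    struct my_route *rt = calloc(1, sizeof(*rt));
    if (!rt)
        return 1;
    rt->dev_is_p2p = true;
    after_fib_lookup(rt);                 /* in the real stack, neighbor output would happen here */
    after_xmit(rt);
    return 0;
}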
Aside: route lookup is really an awkward business. There may well be hundreds of thousands of routes on a device, yet the number of neighbors actually attached to a node can be expressed in one byte, and for most nodes it probably does not exceed 10! We spend enormous effort on cache lookups and longest-prefix matching, only to fish a handful of needles out of a sea of hundreds of thousands of entries. So this has always been a challenging field; compared with TCP acceleration it is more self-contained, affected not by external factors but only by the algorithm itself! In fact, not only p2p devices but even ethX devices end up in this sad situation: after configuring dozens of routes, there may be only five or six next hops. p2p devices are just the extreme case. For a p2p device we can write the route like this:
route add -host/-net a.b.c.d/e dev tunlX
However, for an ethX device, we generally have to write the route as:
route add -host/-net a.b.c.d/e gw A.B.C.D
That is to say, a p2p route directly tells the stack to send the packet out of that device, whereas for ethX devices (or any broadcast-type or NBMA device), address resolution or next-hop resolution is required to know where to send the packet. Besides, the route cache also affects the neighbor subsystem: simply put, a route entry references a neighbor, and the neighbor cannot be released before the route entry is released. Even though a p2p device does not need neighbor resolution at all, it would have to be special-cased at the code level; unfortunately, the Linux kernel has no such special case, and route entries for p2p devices are still inserted into the route cache.
The above is the dilemma of route lookup: the mapping is many-to-one or many-to-few, and in such a case building an exact-match cache may make the ending even sadder. Therefore a unified approach was adopted to optimize both cases: after Linux 3.6, every packet must look up the route table before it can be sent! The process then becomes roughly the following logic:
dst = lookup_fib_table(skb);
dst_nexthop = alloc_entry(dst);
neigh = bind_neigh(dst_nexthop);
neigh.output(skb);
release_entry(dst_nexthop);
This is a clean flow, but at the protocol-stack implementation level it brings a new problem: alloc/release causes enormous memory churn, and we know that memory allocation and release are transactions that must be completed outside the CPU, with huge overhead. Linux does have the slab cache, but we also know that caches are hierarchical. So after Linux 3.6 a new kind of route cache was in fact implemented: instead of caching route entries, which would require exact matching of skb tuples, it caches the next hop, and this cache is reached only through the lookup_fib_table routine.
This is the innovation: the cached object is unique, unless some exception occurs! This solves the many-to-one and many-to-few problem. Before consulting the cache, you must first look up the route table, and once the route table lookup is done you in principle already know the next hop; unless otherwise specified (to reiterate!), this new next-hop cache exists only to avoid memory allocation/release. The pseudocode is as follows:
dst = lookup_fib_table(skb);
dst_nexthop = lookup_nh_cache(dst);
if dst_nexthop == NULL; then
    dst_nexthop = alloc_entry(dst);
    if dst_nexthop.cache == true; then
        insert_into_nh_cache(dst_nexthop);
    endif
endif
neigh = bind_neigh(dst_nexthop);
neigh.output(skb);
if dst_nexthop.cache == false; then
    release_entry(dst_nexthop);
endif
In this way, the route cache no longer caches the entire route entry, but caches the next hop of the route table search result.
In general, a route entry has only one next hop, so this cache is extremely meaningful. It means that in most cases, when the route lookup yields a fixed dst, the next-hop cache will hit and no new dst_nexthop structure has to be allocated; the cached one is used directly. If, unluckily, there is no hit, a dst_nexthop is allocated anew and inserted into the next-hop cache whenever possible. If, unluckily again, the insertion does not succeed, the NOCACHE flag is set, which means the dst_nexthop will be released directly after use.
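The following self-contained C sketch models that hit/miss/NOCACHE behavior (the structure and function names are my own, chosen to mirror the pseudocode above; they are not the kernel's):

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

static int allocations;                        /* how many times we actually had to allocate */

struct dst_nexthop {
    int gw;
    bool cached;                               /* false => release after a single use */
};

struct fib_result {
    int gw;
    struct dst_nexthop *nh_cache;              /* the route's single next-hop cache slot */
    bool nocache;                              /* set when the result must not be cached */
};

static struct dst_nexthop *get_nexthop(struct fib_result *res)
{
    if (res->nh_cache)                         /* hit: reuse, no allocation at all */
        return res->nh_cache;

    struct dst_nexthop *nh = malloc(sizeof(*nh));   /* miss: allocate a fresh one */
    allocations++;
    nh->gw = res->gw;
    nh->cached = !res->nocache;
    if (nh->cached)
        res->nh_cache = nh;                    /* park it for the packets that follow */
    return nh;
}

static void put_nexthop(struct dst_nexthop *nh)
{
    if (!nh->cached)
        free(nh);                              /* the NOCACHE case: released right after use */
}

int main(void)
{
    struct fib_result res = { .gw = 42, .nh_cache = NULL, .nocache = false };

    for (int i = 0; i < 3; i++) {              /* three packets toward the same route */
        struct dst_nexthop *nh = get_nexthop(&res);
        put_nexthop(nh);
    }
    printf("3 packets sent, %d allocation(s)\n", allocations);   /* prints 1 */
    free(res.nh_cache);
    return 0;
}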
The above describes when the next-hop cache hits; so under what circumstances does it miss? That is simple: whenever the lookup_nh_cache routine above returns NULL. Several situations can lead to this, for example deleting or updating existing route entries for some reason; this will be explained later through an MTU problem with a p2p virtual network card. Before that, let me describe another common case: redirect routes.
A so-called redirect route updates a route entry in the current node's route table. Note that this update is not permanent but temporary, so Linux does not modify the route table directly; it modifies the next-hop cache instead! The process is asynchronous, and the pseudocode is as follows:
# The IP_OUT routine executes the IP sending logic. It first looks up the standard
# route table, then the redirect list, then the next-hop cache, and only when all
# of these miss does it allocate a new dst_nexthop. Unless the NOCACHE flag was
# specified at the start, the newly allocated dst_nexthop is inserted into the
# next-hop cache so that subsequent packets reuse it, avoiding a fresh memory
# allocation/release for every packet.
func IP_OUT:
    dst = lookup_fib_table(skb);
    dst_nexthop = lookup_redirect_nh(skb.daddr, dst);
    if dst_nexthop == NULL; then
        dst_nexthop = lookup_nh_cache(dst);
    endif
    if dst_nexthop == NULL; then
        dst_nexthop = alloc_entry(dst);
        if dst_nexthop.cache == true; then
            insert_into_nh_cache(dst_nexthop);
        endif
    endif
    neigh = bind_neigh(dst_nexthop);
    neigh.output(skb);
    if dst_nexthop.cache == false; then
        release_entry(dst_nexthop);
    endif
endfunc

# IP_ROUTE_REDIRECT creates or updates a dst_nexthop and inserts it into a linked
# list that uses the packet's destination address as the lookup key.
func IP_ROUTE_REDIRECT:
    dst = lookup_fib_table(icmp.redirect.daddr);
    dst_nexthop = new_dst_nexthop(dst, icmp.redirect.newnexthop);
    insert_into_redirect_nh(dst_nexthop);
endfunc
The above is the next-hop cache logic of kernels after 3.6. It is worth noting that it does not reduce the overhead of route lookup; it reduces the overhead of memory allocation/release! The result of a route lookup is a route entry, which stands in a hierarchical relationship with the next-hop structure and the neighbor structure, as follows:
route entry -> next-hop structure -> neighbor entry
When a packet is sent, after the route lookup it must be bound to a next-hop structure and then to a neighbor. The route table is just a static table; the data path has no right to modify it and only uses it for lookups. The protocol stack must use the information in the route entry it found to construct a next-hop structure, and this is exactly where caching the next hop matters, because it removes the construction overhead!
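An illustrative sketch of that hierarchy in plain C (the type and field names are assumed for illustration; they are not the kernel's struct definitions):

#include <stdio.h>

struct neighbor_entry {                  /* L2 state, e.g. the resolved MAC address */
    unsigned char lladdr[6];
};

struct dst_nexthop {                     /* built from the FIB result; this is what is cached */
    unsigned gateway;
    unsigned mtu;
    struct neighbor_entry *neigh;        /* bound just before transmission */
};

struct fib_entry {                       /* static route table entry; lookup only, never modified by the data path */
    unsigned prefix, prefix_len;
    unsigned gateway;
};

int main(void)
{
    struct fib_entry route     = { 0x0a000000, 8, 0xc0a80101 };          /* 10.0.0.0/8 via 192.168.1.1 */
    struct neighbor_entry arp  = { { 0xde, 0xad, 0xbe, 0xef, 0x00, 0x01 } };
    struct dst_nexthop nh      = { route.gateway, 1500, &arp };          /* constructed from the route entry */

    printf("route entry -> next hop (gw %x, mtu %u) -> neighbor\n", nh.gateway, nh.mtu);
    return 0;
}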
Finally, let's look at the effect. If you only read the code, you may be frustrated to see rt_dst_alloc being called on both the input and the output path. However, if you observe the actual behavior with the following command:
watch -d -n 1 "cat /proc/net/stat/rt_cache"
you will find that the in_slow_tot and out_slow_tot counters grow very slowly, or even not at all! This means that the vast majority of packets hit the next-hop cache on both receive and send. If you see something else, i.e. one of them growing rapidly, there may be two reasons:
1. Your kernel may not be upgraded to a high enough version.
This means your kernel still has a bug: in the first versions of 3.10, the call RT_CACHE_STAT_INC(in_slow_tot); occurs before the following code:
if (res.fi) {
    if (!itag) {
        rth = rcu_dereference(FIB_RES_NH(res).nh_rth_input);
        if (rt_cache_valid(rth)) {
            skb_dst_set_noref(skb, &rth->dst);
            err = 0;
            goto out;
        }
        do_cache = true;
    }
}
rth = rt_dst_alloc(net->loopback_dev,
                   IN_DEV_CONF_GET(in_dev, NOPOLICY), false, do_cache);
...
That is to say, this counter is left over from the days when the route cache still existed, and it mistakenly treats the next-hop cache as the route cache! You only need to move RT_CACHE_STAT_INC(in_slow_tot) down to after rt_dst_alloc.
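A sketch of the intended fix, based on the snippet above (the surrounding lines are elided, and the exact context differs slightly between kernel versions, so treat this as the idea rather than a drop-in patch):

rth = rt_dst_alloc(net->loopback_dev,
                   IN_DEV_CONF_GET(in_dev, NOPOLICY), false, do_cache);
...
RT_CACHE_STAT_INC(in_slow_tot); /* moved here: counted only when a new dst is actually allocated */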
2. You may be using a p2p device, but you have not set its MTU correctly.
We know that the ipip tunnel device is a virtual network card on Linux, and a packet sent through it must be re-encapsulated with an outer IP header. If the data ultimately leaves through ethX, whose MTU is 1500 by default, then whenever the MTU of the ipip tunnel device is 1500, or anything larger than 1500 minus the encapsulation overhead, the MTU has to be updated. The next-hop cache carries the MTU information, so if the MTU needs updating, the next-hop cache entry needs updating too.
On an ordinary physical device this is not a problem, because the MTU is known before the IP layer sends the data. But for an ipip tunnel device, at the moment the IP layer sends data, the stack does not yet know that the packet will be re-encapsulated before it is really transmitted, so it does not know that the packet is too large to send; with GSO and TSO involved, things get even more complicated. There are two solutions:
1) Appropriately reduce the MTU of the ipip tunnel device so that even after re-encapsulation the total length is not oversized; then the next-hop cache entry will not be released because of an MTU update (see the example after this list).
2) Fix it in the code!
Looking at the code of rt_cache_valid, the next-hop cache entry must not have its flag changed to DST_OBSOLETE_KILL, and that change is again related to the MTU. In __ip_rt_update_pmtu, you only need to make sure that the initial MTU of the next-hop cache entry is not 0, which can be done by adding a check when the rth fields are initialized after rt_dst_alloc:
if (dev_out->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
    rth->mtu = dev_out->mtu;
else
    rth->mtu = 0;
After testing, the effect is good!
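As a concrete illustration of option 1): plain IPIP encapsulation adds a 20-byte outer IPv4 header, so with a 1500-byte path MTU the tunnel MTU should be at most 1500 - 20 = 1480 (assuming the default tunl0 device and no further encapsulation underneath), for example:

ip link set dev tunl0 mtu 1480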
By the way, like many security protocols, route table entries and the next-hop cache use a version number to manage validity: an entry is valid only when its ID matches the global ID. This simplifies flushing: when a flush happens, all you have to do is increment the global version ID.
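A toy model of that generation-ID scheme (not the kernel's structures, just the technique):

#include <stdbool.h>
#include <stdio.h>

static unsigned global_genid;            /* bumped once per flush */

struct cached_nh {
    unsigned genid;                      /* generation at the time the entry was created */
    unsigned gateway;
};

static bool entry_valid(const struct cached_nh *e)
{
    return e->genid == global_genid;     /* stale entries simply fail this check lazily */
}

int main(void)
{
    struct cached_nh nh = { global_genid, 42 };
    printf("before flush: valid=%d\n", entry_valid(&nh));
    global_genid++;                      /* the whole "flush" is this single increment */
    printf("after flush:  valid=%d\n", entry_valid(&nh));
    return 0;
}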
Now we can summarize. After Linux 3.6, the route cache was removed and replaced by the next-hop cache. This cache still has many wrinkles, such as the handling of redirect routes... and it mainly reduces the overhead of memory management rather than the overhead of lookup. A word on memory overhead versus lookup overhead: the two are not on the same level. Memory overhead is mainly tied to the memory-management data structures and the architecture, which is a complicated topic, whereas lookup overhead is comparatively simple, depending only on the algorithm's time/space complexity and the architecture. Why lookup overhead ended up being traded for memory overhead, however, remains an open philosophical question!