Conntrack hash lookup / Bloom filtering / cache lookup / big packets vs. small packets / layered processing style in protocol stack processing

1. The advantages and disadvantages of the routing cache
The routing cache has been around for many years. Its essence is "make the fastest memory as small as possible and the slowest memory as large as possible", so that resource utilization is maximized: access efficiency improves while resources are saved. This is the basic principle behind every cache design.
Take memory access: almost every CPU has a built-in L1 and L2 cache, a few affine cores may even share an L3 or L4 cache, then comes physical memory, then the carefully tuned disk swap partition, and finally remote storage. These tiers grow progressively larger and progressively more expensive to access, forming a pyramid-shaped storage system. Such a design achieves its effect mainly through the principle of locality, or, in the large, the Lévy flight principle. That principle is not only effective for memory access; human civilization itself was built up by Lévy flights, but I will not digress into the North American immigration waves and the westward movement, so I stop this paragraph here.
By the way, a protocol stack's routing table is meant to be looked up; I will not even go into Cisco Express Forwarding (CEF) here, the route cache alone is enough to discuss. Early Linux (up to 2.6.39, not 3.x) built a route cache in front of the system routing table: each cache entry was the result of a successful route lookup, with an expiration timeout attached. The Linux route cache design assumes that packets bound for the same destination address arrive in succession, which is again a generalization of the principle of locality. But unlike memory access, or a migration wave toward a single destination, Linux does not cache one entry per destination stream: the route cache is keyed on source IP/destination IP pairs. An entry in the routing table therefore expands into N route cache entries:
1.1.1.0/24 -> 192.168.1.254 ==> {(2.2.2.1, 1.1.1.1, 192.168.1.254), (2.2.2.2, 1.1.1.1, 192.168.1.254), (2.2.2.3, 1.1.1.2, 192.168.1.254), (2.2.2.3, 1.1.1.10, 192.168.1.254), ...}
If the set of source IP/destination IP pairs is huge (especially in a backbone router scenario), the number of route cache entries will far exceed the number of routing table entries, and one has to wonder where the cache's lookup advantage over the routing table went. If a lookup algorithm efficient enough for such a cache existed, why not apply it directly to the longest-prefix-match lookup in the routing table? Clearly the routing table lookup is far more constrained than the route cache lookup.
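To make the comparison concrete, here is a rough sketch of what an early-Linux-style route cache lookup amounts to in software, keyed on the (source, destination) pair with a fall-back to the longest-prefix-match table on a miss. The structure layout and the lpm_lookup() helper are hypothetical, not the kernel's actual code.

```c
#include <stdint.h>

#define RT_HASH_SIZE 4096              /* fixed-size bucket array, like the old rtcache */

struct rt_cache_entry {
    uint32_t saddr, daddr;             /* per-flow key: source/destination IP */
    uint32_t gateway;                  /* cached result of the slow-path lookup */
    struct rt_cache_entry *next;       /* collision chain */
};

static struct rt_cache_entry *rt_hash[RT_HASH_SIZE];

/* Stub standing in for the real longest-prefix-match walk of the routing table. */
static uint32_t lpm_lookup(uint32_t daddr)
{
    (void)daddr;
    return 0;                          /* "no route" placeholder */
}

static unsigned int rt_hash_key(uint32_t saddr, uint32_t daddr)
{
    /* toy mixing; a real implementation would use a keyed hash */
    uint32_t h = saddr ^ (daddr << 7) ^ (daddr >> 9);
    return h & (RT_HASH_SIZE - 1);
}

uint32_t route_lookup(uint32_t saddr, uint32_t daddr)
{
    unsigned int idx = rt_hash_key(saddr, daddr);
    const struct rt_cache_entry *e;

    for (e = rt_hash[idx]; e; e = e->next)     /* fast path: exact match on the pair */
        if (e->saddr == saddr && e->daddr == daddr)
            return e->gateway;

    return lpm_lookup(daddr);                  /* miss: slow path (cache insertion omitted) */
}
```

Every distinct (source, destination) pair that misses would add another cache entry, which is exactly the fan-out problem described above.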
The idea behind the route cache is sound, but it is not well suited to a software implementation. With TCAM hardware it may be possible to do a parallel, multi-dimensional exact match; in software, given the protocol stack's inherent difficulty in exploiting multiple CPU cores, the most efficient approach is a tiered hash, and even the best multi-level hash implementation is not a free lunch. Once you consider abnormal traffic, it is easy to blow up the route cache table, for example by constructing a large number of packets from different source IP addresses to different destination IP addresses.
I will not explain how TCAM works here; there is plenty of material about it online.
2. Conntrack hash lookup
Netfilter has a conntrack module that can establish connection tracking for an arbitrary five-tuple. When a packet comes in, a lookup must be performed to associate it with a connection-tracking entry, and the efficiency of that lookup directly affects the quality of the device. The lookup algorithm is simple: first compute a hash from the five-tuple, then iterate over the conflict list hanging off that hash bucket until an exact match is found; if none is found, a new entry is created.
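As a rough illustration (simplified, not the nf_conntrack source; all names here are hypothetical), the lookup boils down to something like this:

```c
#include <stdint.h>
#include <stdlib.h>

struct ct_tuple {                      /* the five-tuple identifying a flow */
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    uint8_t  proto;
};

struct conn {
    struct ct_tuple tuple;
    struct conn *next;                 /* conflict list within one bucket */
    /* ... per-connection state (timeouts, NAT bindings, ...) would live here */
};

#define CT_HASH_SIZE 16384
static struct conn *ct_hash[CT_HASH_SIZE];

static unsigned int ct_hashfn(const struct ct_tuple *t)
{
    /* toy mix; the real code uses a seeded jhash over the whole tuple */
    return t->saddr ^ (t->daddr * 2654435761u)
         ^ (((uint32_t)t->sport << 16) | t->dport) ^ t->proto;
}

struct conn *ct_find_or_create(const struct ct_tuple *t)
{
    unsigned int idx = ct_hashfn(t) & (CT_HASH_SIZE - 1);
    struct conn *c;

    for (c = ct_hash[idx]; c; c = c->next)     /* walk the conflict list */
        if (c->tuple.saddr == t->saddr && c->tuple.daddr == t->daddr &&
            c->tuple.sport == t->sport && c->tuple.dport == t->dport &&
            c->tuple.proto == t->proto)
            return c;                          /* exact match: existing connection */

    c = calloc(1, sizeof(*c));                 /* no match: create a new tracking entry */
    if (!c)
        return NULL;
    c->tuple = *t;
    c->next = ct_hash[idx];
    ct_hash[idx] = c;
    return c;
}
```

Everything that follows in this section is about the two costs visible here: the quality of the hash function and the length of the conflict list.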
Speaking of hashing: in theory it is an extremely effective time/space trade-off. At one extreme, if you do not care about wasted space, it finds the result in constant time; at the other extreme it degenerates into a plain linked list, and lookup costs the same as traversal. A real hash table sits somewhere between the two. So when implementing conntrack, Linux ought to adjust the hash size dynamically the way the trie-based routing table does, that is, perform appropriate rehash operations to keep the average number of elements per conflict list within a fixed range. That would preserve good scalability, and the only price is memory. But Linux does not appear to do this.
I have repeatedly wanted to optimize nf_conntrack lookup performance, but I always run into the following ideal: "ideally, a conntrack lookup needs only one hash computation, followed by traversing a list of one or a handful of elements". If that ideal held, would laborious optimization be needed at all? Suppose the average conflict list holds 10 elements; then 10,000 hash buckets can accommodate 10*10000 connections, and efficiency is already high. But the ideal is only an ideal. The hash input is the five-tuple and the algorithm is fixed, so if the hash function is not good enough, the distribution of its output stays highly correlated with its input. Theory and practice both show that the hash functions used in lookups do not reach the ideal. To make the output independent of the input, you have to borrow the operations of symmetric cryptography: substitution, permutation, confusion, diffusion, and so on; see the rounds of DES/AES. Clearly this is still not a free lunch!
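The usual mitigation, and what the kernel does in spirit by mixing a random seed into its tuple hash, is to make the bucket a tuple lands in unpredictable to the outside. A toy sketch of such a seeded mixer, with illustrative constants rather than anything from a standard algorithm:

```c
#include <stdint.h>

/* Toy seeded mixer in the spirit of the confusion/diffusion idea above. */
static uint32_t hash_seed;             /* filled once from a random source at startup */

static inline uint32_t mix(uint32_t h, uint32_t v)
{
    h ^= v;
    h *= 0x9e3779b1u;                  /* multiplication spreads input bits (diffusion) */
    h ^= h >> 15;                      /* xor-shift stirs high bits into low bits (confusion) */
    return h;
}

uint32_t tuple_hash(uint32_t saddr, uint32_t daddr,
                    uint16_t sport, uint16_t dport, uint8_t proto)
{
    uint32_t h = hash_seed;            /* unknown to the sender: breaks input/output correlation */
    h = mix(h, saddr);
    h = mix(h, daddr);
    h = mix(h, ((uint32_t)sport << 16) | dport);
    h = mix(h, proto);
    return h;
}
```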
A hash lookup has to be efficient; if an algorithm distributes the actual hash inputs well enough, that is all you need. If the distribution is poor, some conflict lists will be extremely long while others are very short, and for a system that schedules work fairly, the impact on performance is severe.
3. Bloom-filter-based forwarding
If the Linux route cache is bound to be removed, there must be an alternative. And I am not talking only about Linux here, but about the general case: any software-implemented route cache has to worry about the efficiency of the cache lookup itself. A cache only shows an advantage over the slow path when the number of entries is bounded and small; otherwise it is just wishful thinking.
If you bring hardware into it, anything goes, but I am deliberately not talking about hardware here: I want to optimize forwarding efficiency in pure software.
One thing to keep in mind: you cannot improve the performance of every part of a whole, just as a perpetual motion machine is impossible; you have to drain capacity from the unimportant parts and pour it into the parts that need optimizing.
The baseline of such an optimization is to distinguish a fast path from a slow path: look up the fast path first, and if that fails, fall back to the slow-path query and add its result to the fast path. The fast path is fast precisely because it is exclusive: its capacity is small and limited, like citizenship in the Roman Republic, which had to be won through struggle; by the time of Caracalla citizenship became a birthright, and the advantage naturally disappeared. An efficient replacement for the traditional route cache can be built with Bloom filters.
Maintain a Bloom filter of fixed size (but dynamically adjustable at run time) for each next hop. When a packet arrives, first query the Bloom filter of every next hop; the result is one of three cases:
A. exactly one Bloom filter returns 1;
B. more than one Bloom filter returns 1;
C. no Bloom filter returns 1.

In case A, forward the packet directly to that next hop. Case B means a false positive exists: any of the filters that returned 1 may indicate the correct result, so either fall back to the slow path or replicate the packet toward all the candidates. In case C you must fall back to the slow path. Note that we obviously hope for case A, where the computation is very cheap: n hash calculations, which one processor can perform serially or several processors can perform in parallel (see the sketch below).
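A minimal sketch of this classification step, assuming one plain bit-array Bloom filter per next hop and a toy hash family (all names and constants here are illustrative, not from any real router):

```c
#include <stdint.h>
#include <stdbool.h>

#define BLOOM_BITS (1u << 16)          /* filter size; adjustable at run time per the text */
#define BLOOM_K    4                   /* hash functions per lookup */

struct bloom {
    uint8_t bits[BLOOM_BITS / 8];
};

static uint32_t bloom_hash(int k, uint32_t key)
{
    /* toy hash family: a different mix per k (illustrative only) */
    return (key ^ ((uint32_t)(k + 1) * 0x9e3779b1u)) * 2654435761u;
}

static bool bloom_query(const struct bloom *b, uint32_t daddr)
{
    for (int k = 0; k < BLOOM_K; k++) {
        uint32_t bit = bloom_hash(k, daddr) % BLOOM_BITS;
        if (!(b->bits[bit / 8] & (1u << (bit % 8))))
            return false;              /* any zero bit: definitely not via this next hop */
    }
    return true;                       /* all bits set: possibly via this next hop */
}

/* Returns the single matching next hop (case A), or -1 for cases B and C,
 * which the caller resolves on the slow path (or by replicating the packet). */
int classify(const struct bloom nh[], int n_nexthop, uint32_t daddr)
{
    int match = -1, nmatch = 0;

    for (int i = 0; i < n_nexthop; i++)
        if (bloom_query(&nh[i], daddr)) {
            match = i;
            nmatch++;
        }
    return (nmatch == 1) ? match : -1;
}
```

With plain bits, deletion is impossible, which is what the counter variant in the next paragraph addresses.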
To make removing elements easy, the Bloom filter uses an int counter per slot instead of a single bit holding 0 or 1. To prevent a false positive from staying a false positive forever, every Bloom filter keeps a backing store in memory holding the list of its elements; at regular intervals a background task switches the hash algorithms and rebuilds the filter. I am not in favor of introducing a new level, such as an extra cache layer, because it adds maintenance and management complexity. Whoever tied the bell on the tiger must untie it: since the problem is that the same address pairs keep producing the same hash results, change the algorithm rather than introduce a new layer; fix it at the level where it arises instead of compensating elsewhere.
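Deletion is what forces the counters: clearing a shared bit would corrupt other elements. A sketch of the counting variant described above, using a full counter per slot as the text suggests (an 8-bit counter is usually enough in practice):

```c
#include <stdint.h>

#define CBLOOM_SLOTS (1u << 16)
#define CBLOOM_K     4

struct counting_bloom {
    uint32_t counter[CBLOOM_SLOTS];    /* a counter per slot instead of a single bit */
};

static uint32_t cbloom_hash(int k, uint32_t key)
{
    /* same toy hash family idea as in the previous sketch */
    return (key ^ ((uint32_t)(k + 1) * 0x9e3779b1u)) * 2654435761u;
}

void cbloom_add(struct counting_bloom *b, uint32_t key)
{
    for (int k = 0; k < CBLOOM_K; k++)
        b->counter[cbloom_hash(k, key) % CBLOOM_SLOTS]++;
}

void cbloom_del(struct counting_bloom *b, uint32_t key)
{
    for (int k = 0; k < CBLOOM_K; k++) {
        uint32_t s = cbloom_hash(k, key) % CBLOOM_SLOTS;
        if (b->counter[s] > 0)         /* guard against underflow on a bogus delete */
            b->counter[s]--;
    }
}

int cbloom_query(const struct counting_bloom *b, uint32_t key)
{
    for (int k = 0; k < CBLOOM_K; k++)
        if (b->counter[cbloom_hash(k, key) % CBLOOM_SLOTS] == 0)
            return 0;                  /* definitely absent */
    return 1;                          /* possibly present */
}
```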
All in all, the Bloom filter's hash algorithms must be kept lean, and the space it occupies must be kept small.
4. Big packets or small packets
OpenVPN's virtual network card simulates jumbo frames in order to increase encryption/decryption throughput and reduce system call overhead. But are jumbo frames a good idea on real physical links?
Small packets are flexible; that is the true meaning of packet switching. And big packets? They are cumbersome: an infinitely large packet is a circuit-switched stream that occupies the link for a long time, like a container truck or an excavator on the road. Fortunately, a link that enables jumbo frames in practice really does have the capacity to transmit them (meaning it is that much wider!), so there is nothing inherently wrong with that.
But given a wide road, must you run jumbo frames on it? Wouldn't small frames run more efficiently? A two-way ten-lane highway can carry heavy trucks without any problem, but passenger cars probably move better on it; and if the goal is not bulk haulage, a transport aircraft's throughput is below a freighter's, yet its latency is far smaller. Moreover, on the links beyond your first hop, nothing guarantees that a transmitted jumbo frame will not be fragmented along the way. So considering energy consumption, packet-switching efficiency, and fragmentation/reassembly overhead, jumbo frames offer no advantage. I simulate jumbo frames in OpenVPN because I control everything on the OpenVPN link, even though this requires that the physical NICs on the path, as well as the virtual NIC, all carry jumbo data frames: my tests show that the fragmentation/reassembly overhead of such jumbo frames is much larger than the OpenVPN packet encryption/decryption and system call overhead involved.
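For reference, raising the MTU of an interface, which is essentially what emulating jumbo frames on the tun device comes down to apart from OpenVPN's own framing, is a single standard ioctl; the interface name tun0 and the 9000-byte value below are just examples:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>

/* Equivalent of "ip link set dev tun0 mtu 9000"; run with enough privilege. */
int set_mtu(const char *ifname, int mtu)
{
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* any socket works for interface ioctls */
    if (fd < 0)
        return -1;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_mtu = mtu;

    int ret = ioctl(fd, SIOCSIFMTU, &ifr);     /* fails if the driver rejects the size */
    close(fd);
    return ret;
}

int main(void)
{
    if (set_mtu("tun0", 9000) != 0)
        perror("SIOCSIFMTU");
    return 0;
}
```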
So what are jumbo frames for, then? Just to cancel out the benefits of packet switching? In fact they are an optimization for hosts, especially hosts running Windows. Such hosts generally sit at the edge of the network, doing end-to-end processing, so if arriving data frames trigger interrupts too frequently, application-layer processing performance inevitably drops; after all, total resources are limited. Jumbo frames reduce the number of interrupts.
I have said before that jumbo frames were a product of their time rather than an absolute good, and that still holds for a high-performance single link. For data networks with narrow-band links, jumbo frames add overhead and clog the last kilometer of the path. Fragmentation/reassembly at intermediate nodes may happen more than once: any node that needs to parse a protocol header has to reassemble the fragments first, for example stateful NAT, firewalls, traffic classifiers, and so on. As for interrupt frequency, today's gigabit and 10-gigabit cards can reassemble frames in the NIC chip before interrupting the CPU, or accumulate a batch of frames and raise a single interrupt, or switch from interrupts to polling; see the README of the Intel igb/e1000e drivers for details.
5. The layered processing style
Protocol stack design is layered, but that does not mean a protocol stack implementation must also be layered.
Early implementations, or implementations shaped by early ones, such as UNIX STREAMS modules or the Windows NDIS framework, were organized strictly along the layering principle. The advantage is that a caller can insert processing logic at any existing boundary without explicit hook points: in the Windows NDIS framework, as long as you call the APIs following the NDIS conventions, a filter driver can be implemented arbitrarily. And not only in kernel-mode processing: even at the socket level, Windows offers the SPI/LSP mechanism. Apart from the universally recognized BSD socket interface and the built-in TCP, UDP, and IP implementations, everything else can be attached externally; even the TCP/IP stack itself is not mandatory. Look at the properties of a NIC and you will find that the TCP/IP protocol can be loaded or unloaded separately, and if you install VMware, the VMware bridging protocol is installed automatically. In short, everything is pluggable; there are no built-in hook points, and you can insert your own processing module between any two layers.
This design exists to make it easy for developers to add their own logic: a developer only needs to understand the relevant interface to insert logic anywhere, without necessarily understanding how the protocol stack processes packets. The drawback is equally obvious: since only the interfaces are public and the details are not, you cannot call into internals that you may genuinely need, which makes it hard for developers to compose the stack's layers in arbitrary ways.
The Linux protocol stack implementation is not layered at all; it is built entirely on callbacks. It does not distinguish between protocol handlers, miniport drivers, and so on. Because all the callbacks are registered in a fixed order, if you want to insert a hook between two callbacks you must first break the existing link between them. Linux takes another approach: it lets developers organize the calls between modules at the same level themselves. For example, you can call another hard_xmit callback from inside a hard_xmit callback, or call some xmit from inside a recv. This arbitrary composition is recursive, seemingly chaotic but flexible. You will find that Linux bridging, bonding, and VLANs are all implemented this way: these mechanisms do not insert a filter-layer driver the way NDIS does, they directly combine the various xmit calls. This implementation style requires developers to be familiar with the processing logic of the network protocol stack; what they get inside a function is a data unit carrying network-stack semantics, not merely a buffer.
Besides the boundaries between protocol layers, Linux also defines a number of hook points inside the layer processing itself. Netfilter is the most important such framework, and an excellent firewall can be built on it. Note that Netfilter is not only for firewalls: it is a framework built into the protocol stack that can queue, steal, or otherwise divert arbitrary packets, and IPSec VPNs can be implemented on top of it without inserting any extra layer. The fly in the ointment is that Linux has rather few hook mechanisms at the socket level: if you want to hook the connect operation, you have to match TCP's SYN flag on the OUTPUT chain (a sketch follows).
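The registration API has changed across kernel versions, but against reasonably recent kernels (which use nf_register_net_hook) a minimal "hook connect by matching SYN on output" module looks roughly like the sketch below; treat it as an illustration, not a polished driver.

```c
#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <net/net_namespace.h>

/* Catch outgoing TCP SYNs on LOCAL_OUT: the closest thing to a connect() hook. */
static unsigned int syn_hook(void *priv, struct sk_buff *skb,
                             const struct nf_hook_state *state)
{
    struct iphdr *iph = ip_hdr(skb);
    struct tcphdr *tcph;

    if (!iph || iph->protocol != IPPROTO_TCP)
        return NF_ACCEPT;

    tcph = (struct tcphdr *)((unsigned char *)iph + iph->ihl * 4);
    if (tcph->syn && !tcph->ack)
        pr_info("connect attempt to %pI4:%u\n", &iph->daddr, ntohs(tcph->dest));

    return NF_ACCEPT;                  /* observe only; NF_DROP would block the connect */
}

static struct nf_hook_ops syn_ops = {
    .hook     = syn_hook,
    .pf       = NFPROTO_IPV4,
    .hooknum  = NF_INET_LOCAL_OUT,
    .priority = NF_IP_PRI_FIRST,
};

static int __init syn_init(void)
{
    return nf_register_net_hook(&init_net, &syn_ops);
}

static void __exit syn_exit(void)
{
    nf_unregister_net_hook(&init_net, &syn_ops);
}

module_init(syn_init);
module_exit(syn_exit);
MODULE_LICENSE("GPL");
```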
Take load balancing as an example: with NDIS you would have to implement an intermediate-layer filter driver, whereas on Linux, besides IPVS, you can simply use one of the bonding load-balancing modes.
6. Squatting astride the ditch
On early water-flushed public toilets; withheld here... Several answers, and as many reasons.
