Conntrack Hash Lookup / Bloom Filtering / Cache Lookup / Large vs. Small Packets / Layered Processing Style in Protocol Stack Processing


1. The hierarchical storage system and the pros and cons of the route cache. Hierarchical storage has existed for many years. Its essence is to "keep the fastest memory small and the slowest memory large", which maximizes resource utilization: access efficiency improves while resources are saved. This is the basic principle behind all cache designs.
For memory access, almost all CPUs have a built-in level-1 and level-2 cache; cores with good affinity even share a level-3 or level-4 cache. Next comes physical memory, then a carefully tuned disk swap partition, and finally remote memory. These storage tiers grow larger step by step, and the access overhead grows step by step as well, forming a pyramid-shaped storage system. What makes such a design effective is mainly the principle of locality. More broadly, it is the general principle of the Lévy flight seen throughout nature, effective not only in memory access but also in the development of human civilization. I do not want to digress into everything from Australopithecus to the latest North American immigration waves and the westward movement, so I will stop here.
Broadly speaking, the protocol stack's route table exists to be searched; leaving aside Cisco's Express Forwarding (the CEF system), simply put, a route cache would be enough. Earlier Linux versions (up to 2.6.39; the mechanism was removed in the 3.x series) built a route cache on top of the system route table. Each cache entry is a successful route lookup result with a built-in timeout. The premise of the Linux route cache design is that packets destined for a given destination address keep arriving for a while, which is again the principle of locality at work. However, a Linux box acting as a router does not serve a single flow, and the Linux route cache is keyed by source IP/destination IP pairs, so one route table entry can expand into N route cache entries:
1.1.1.0/24 via 192.168.1.254 ==> { (2.2.2.1, 1.1.1.1, 192.168.1.254), (2.2.2.2, 1.1.1.1, 192.168.1.254), (2.2.2.3, 1.1.1.2, 192.168.1.254), (2.2.2.3, 1.1.1.10, 192.168.1.254), ... }
If the number of concurrent source IP/destination IP pairs is huge (especially in a backbone router scenario), the number of route cache entries will far exceed the number of route table entries. Where, then, is the advantage of the route cache over a route table lookup? Is there really a cache query algorithm efficient enough for that? If so, why not apply it directly to longest-prefix matching on the route table itself? Bear in mind that a route table lookup is far less strict than a route cache lookup, which must match the pair exactly.
The idea of a route cache is good, but it is really not well suited to a software implementation. With TCAM hardware you can do exact matching of multi-dimensional vectors in parallel, but in software, given how poorly the protocol stack inherently exploits multiple CPU cores, the most efficient approach is a hierarchical hash, and even the best multi-level hash implementation is not a free lunch. Once abnormal traffic is taken into account, it is easy to blow up your route cache table: for example, an attacker can generate a large number of different source IP addresses accessing different destination IP addresses.
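To make that blow-up concrete, here is a minimal sketch (not the kernel's actual implementation; sizes, names and the constant next hop are illustrative assumptions) of an exact-match route cache keyed by the (source IP, destination IP) pair. Every distinct pair occupies its own entry, so a scan from many sources toward one /24 prefix needs thousands of cache entries even though the route table holds a single route:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define CACHE_BUCKETS 4096              /* illustrative size, not a kernel value */

struct rt_cache_entry {
    uint32_t saddr, daddr;              /* exact-match key: source/destination IP */
    uint32_t nexthop;                   /* cached lookup result */
    struct rt_cache_entry *next;        /* collision chain */
};

static struct rt_cache_entry *cache[CACHE_BUCKETS];
static unsigned long cache_entries;

static unsigned int hash_pair(uint32_t s, uint32_t d)
{
    uint32_t h = s ^ (d * 2654435761u); /* simple multiplicative mix */
    return (h ^ (h >> 16)) & (CACHE_BUCKETS - 1);
}

/* Slow-path stand-in: a real router would do a longest-prefix match here. */
static uint32_t route_lookup_slow(uint32_t daddr)
{
    (void)daddr;
    return 0xC0A801FEu;                 /* 192.168.1.254 for every 1.1.1.0/24 host */
}

static uint32_t route_lookup(uint32_t saddr, uint32_t daddr)
{
    unsigned int b = hash_pair(saddr, daddr);
    struct rt_cache_entry *e;

    for (e = cache[b]; e; e = e->next)  /* fast path: exact match on the pair */
        if (e->saddr == saddr && e->daddr == daddr)
            return e->nexthop;

    e = malloc(sizeof(*e));             /* miss: consult the slow path, cache the result */
    if (!e)
        return route_lookup_slow(daddr);
    e->saddr = saddr;
    e->daddr = daddr;
    e->nexthop = route_lookup_slow(daddr);
    e->next = cache[b];
    cache[b] = e;
    cache_entries++;
    return e->nexthop;
}

int main(void)
{
    /* 1000 distinct sources scanning one /24 destination prefix already
     * produce 254,000 cache entries for a single route table entry. */
    for (uint32_t s = 0; s < 1000; s++)
        for (uint32_t d = 1; d <= 254; d++)
            route_lookup(0x02020000u + s, 0x01010100u + d);
    printf("route table entries: 1, route cache entries: %lu\n", cache_entries);
    return 0;
}
```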
I will not go into the principles of TCAM here; there is plenty of material online.
2. Conntrack hash lookup. Netfilter has a conntrack module that can establish a connection track for any 5-tuple. When a packet arrives, however, a lookup must be performed to associate it with a connection track, and the lookup efficiency directly affects the quality of the device. The lookup algorithm is very simple: first a hash is computed from the 5-tuple, then the collision chain for that hash value is traversed until an exact match is found; if none is found, a new connection track is created.
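A minimal user-space sketch of this find-or-create pattern (illustrative structures, names and sizes only, not the actual nf_conntrack code):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct tuple {                              /* the 5-tuple that identifies a flow */
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    uint8_t  proto;
};

struct conn {                               /* one connection-tracking entry */
    struct tuple t;
    struct conn *next;                      /* collision chain */
};

static struct conn **buckets;               /* heap-allocated so it can grow later */
static unsigned long nbuckets = 16384;      /* illustrative initial hashsize */
static unsigned long nconns;

static uint32_t hash_tuple(const struct tuple *t)
{
    uint32_t h = t->saddr ^ t->daddr ^ ((uint32_t)t->sport << 16 | t->dport) ^ t->proto;
    h *= 2654435761u;                       /* deliberately naive mixing */
    return h ^ (h >> 16);                   /* caller masks this down to a bucket index */
}

static int tuple_equal(const struct tuple *a, const struct tuple *b)
{
    return a->saddr == b->saddr && a->daddr == b->daddr &&
           a->sport == b->sport && a->dport == b->dport && a->proto == b->proto;
}

/* Associate a packet's tuple with its connection track; create one on a miss. */
static struct conn *conn_find_or_create(const struct tuple *t)
{
    unsigned long b = hash_tuple(t) & (nbuckets - 1);
    struct conn *c;

    for (c = buckets[b]; c; c = c->next)    /* walk the collision chain */
        if (tuple_equal(&c->t, t))
            return c;                       /* exact match: existing connection */

    c = calloc(1, sizeof(*c));              /* miss: establish a new track */
    c->t = *t;
    c->next = buckets[b];
    buckets[b] = c;
    nconns++;
    return c;
}

int main(void)
{
    struct tuple t = { 0x01010101, 0x02020202, 1234, 80, 6 };

    buckets = calloc(nbuckets, sizeof(*buckets));
    conn_find_or_create(&t);                /* the first packet creates the track */
    conn_find_or_create(&t);                /* later packets find it again */
    printf("tracked connections: %lu\n", nconns);
    return 0;
}
```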
Speaking of hashing: viewed purely in theory, it is an extremely efficient time/space trade-off. At one extreme, ignoring wasted space, it finds the result in constant time; at the other extreme, it degrades into a plain linked list and the lookup cost is that of a full traversal. A real hash table sits somewhere between the two. Therefore, when implementing conntrack, Linux should dynamically adjust the hashsize, much as the trie-based routing tree does, i.e. perform a rehash at the appropriate time so that the average number of elements on each collision chain stays within a fixed range. That keeps the structure scalable at the cost of nothing but memory. But Linux does not seem to do this.
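Such a rehash policy could look roughly like the sketch below, which reuses the table from the previous sketch (the threshold is an assumption, and a real conntrack table would also need locking and a cap on total memory): double the bucket array whenever the average chain length exceeds the threshold, and move every entry onto its new chain.

```c
#define AVG_CHAIN_MAX 8                     /* illustrative target chain length */

static void maybe_rehash(void)
{
    struct conn **old = buckets, **new_table, *c, *next;
    unsigned long new_count, i;

    if (nconns / nbuckets < AVG_CHAIN_MAX)
        return;                             /* chains are still short enough */

    new_count = nbuckets * 2;               /* keep the size a power of two */
    new_table = calloc(new_count, sizeof(*new_table));
    if (!new_table)
        return;                             /* out of memory: keep the old table */

    for (i = 0; i < nbuckets; i++) {        /* move every entry onto its new chain */
        for (c = old[i]; c; c = next) {
            unsigned long b = hash_tuple(&c->t) & (new_count - 1);
            next = c->next;
            c->next = new_table[b];
            new_table[b] = c;
        }
    }
    free(old);
    buckets = new_table;
    nbuckets = new_count;
}
```

Calling maybe_rehash() after each insertion would keep the average chain length bounded at the cost of occasional rebuild work and extra memory, which is exactly the trade-off described above.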
I have repeatedly tried to optimize nf_conntrack lookup performance, but I always run up against this ideal: "Ideally, a conntrack lookup needs only one hash computation and then a walk over a chain of one, or at most a single-digit number of, elements." If that could be achieved, would optimization still be hard? Suppose the average collision chain holds 10 elements; then 10,000 hash buckets can accommodate 10*10,000 connections very efficiently. But an ideal is just an ideal! The input of the hash computation is the 5-tuple and the algorithm is fixed, so if the hash algorithm is not good enough, the distribution of its output stays highly correlated with its input. Both theory and practice show that a hash used for lookup cannot reach the ideal. To make the output truly independent of the input, you have to borrow the operations of symmetric-key cryptography: substitution and permutation, confusion and diffusion; look at the steps of DES/AES to see that this, too, is not a free lunch!
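One cheap compromise, short of full confusion/diffusion, is to mix a random boot-time key into the tuple hash so that which tuples collide cannot be predicted from the input alone; if I recall correctly, nf_conntrack itself mixes a random value into its jhash in the same spirit. The code below is only an illustrative user-space sketch of the idea, not the kernel's implementation:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static uint32_t hash_seed;                    /* secret per-run key */

static void hash_seed_init(void)
{
    srand((unsigned)time(NULL));              /* stand-in for a real RNG source */
    hash_seed = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
}

static uint32_t mix32(uint32_t h)
{
    h ^= h >> 16;  h *= 0x85ebca6bu;          /* avalanche steps borrowed from */
    h ^= h >> 13;  h *= 0xc2b2ae35u;          /* the murmur3 finalizer */
    h ^= h >> 16;
    return h;
}

/* Hash the 5-tuple together with the secret seed, so the bucket choice is
 * decorrelated from the raw tuple values an attacker controls. */
static uint32_t seeded_tuple_hash(uint32_t saddr, uint32_t daddr,
                                  uint16_t sport, uint16_t dport, uint8_t proto)
{
    uint32_t h = hash_seed;
    h = mix32(h ^ saddr);
    h = mix32(h ^ daddr);
    h = mix32(h ^ ((uint32_t)sport << 16 | dport) ^ proto);
    return h;
}

int main(void)
{
    hash_seed_init();
    printf("%08x\n", seeded_tuple_hash(0x01010101, 0x02020202, 1234, 80, 6));
    return 0;
}
```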
Hash lookup must be highly efficient. If the algorithm hashes the input well enough, that is the end of the story; if the distribution is poor, some collision chains become particularly long and others particularly short, which greatly hurts the performance of a system that is supposed to treat all flows fairly.
3. An alternative to the route cache. Since the Linux route cache has been written off, there has to be an alternative. In fact, I am not talking only about Linux but about the general case: as long as a route cache is implemented in software, the efficiency of the cache lookup itself must be considered. A cache only shows its advantage over the slow path when the number of entries is fixed and small; otherwise it remains nothing but an idea.
With hardware involved everything would be fine, but I am not talking about hardware here; I want to optimize forwarding efficiency in pure software.
Remember one important thing: you cannot improve the performance of every part of the whole, just as a perpetual motion machine is impossible; you have to take capacity away from the less important parts and devote it to the parts that need optimizing.
The baseline of this kind of optimization is to distinguish a fast path from a slow path: look in the fast path first, and if that fails, fall back to the slow-path lookup and add the result to the fast path. The fast path is fast precisely because it is exclusive, with a small and limited capacity, much like citizenship in the Roman Republic, which had to be won through struggle; once citizenship became something handed out freely in the era of Caracalla, the advantage was naturally lost. A Bloom filter can serve as an efficient alternative to the traditional route cache.
Maintain one Bloom filter of fixed size (though it can be adjusted at runtime) for each next hop. When a packet arrives, first test it against the Bloom filters of all next hops. There are only three possible outcomes:
A. Exactly one Bloom filter returns 1;
B. More than one Bloom filter returns 1;
C. No Bloom filter returns 1.

In case A the packet can be sent directly to that next hop. Case B means there is a false positive: any of the filters that returned 1 might indicate the correct result, so either fall back to the slow path or replicate the packet toward all candidate next hops. In case C the slow path must be taken. Of course we want the result to be A; then the computation is very fast, since N hash calculations are all that is needed, and they can be run serially on one processor or in parallel on several. A sketch of this decision logic follows.
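Below is a minimal, illustrative sketch of the per-next-hop classification, using plain bit-array Bloom filters with hypothetical sizes and hash functions (the counting variant discussed next replaces the bits with counters):

```c
#include <stdint.h>
#include <stdio.h>

#define NR_NEXTHOPS   4
#define BLOOM_BITS    (1u << 16)           /* illustrative filter size */
#define BLOOM_NHASH   3                    /* N hash functions per filter */

struct bloom {
    uint8_t bits[BLOOM_BITS / 8];
};

/* Hypothetical hash family: one mixer, different seeds. */
static uint32_t bloom_hash(uint32_t saddr, uint32_t daddr, uint32_t seed)
{
    uint32_t h = saddr ^ (daddr * 0x9e3779b1u) ^ (seed * 0x27d4eb2fu);
    h ^= h >> 15;  h *= 0x85ebca6bu;  h ^= h >> 13;
    return h % BLOOM_BITS;
}

static void bloom_add(struct bloom *b, uint32_t saddr, uint32_t daddr)
{
    for (uint32_t i = 0; i < BLOOM_NHASH; i++) {
        uint32_t bit = bloom_hash(saddr, daddr, i);
        b->bits[bit / 8] |= 1u << (bit % 8);
    }
}

static int bloom_test(const struct bloom *b, uint32_t saddr, uint32_t daddr)
{
    for (uint32_t i = 0; i < BLOOM_NHASH; i++) {
        uint32_t bit = bloom_hash(saddr, daddr, i);
        if (!(b->bits[bit / 8] & (1u << (bit % 8))))
            return 0;                      /* definitely not in this filter */
    }
    return 1;                              /* present, or a false positive */
}

static struct bloom nexthop_filter[NR_NEXTHOPS];

/* Returns the next-hop index for case A, -1 for cases B and C (slow path). */
static int classify(uint32_t saddr, uint32_t daddr)
{
    int match = -1, matches = 0;

    for (int i = 0; i < NR_NEXTHOPS; i++) {
        if (bloom_test(&nexthop_filter[i], saddr, daddr)) {
            match = i;
            matches++;
        }
    }
    if (matches == 1)
        return match;                      /* case A: forward directly */
    return -1;                             /* case B (ambiguous) or C (no hit):
                                              consult the slow path, then
                                              bloom_add() the result */
}

int main(void)
{
    bloom_add(&nexthop_filter[2], 0x02020201, 0x01010101);  /* learned via slow path */
    printf("known pair -> next hop %d\n", classify(0x02020201, 0x01010101));
    printf("unknown pair -> %d (slow path)\n", classify(0x0a000001, 0x0a000002));
    return 0;
}
```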
To make element deletion possible, the Bloom filter uses an int counter per position instead of a single bit holding 0 or 1. To prevent a false positive from being a false positive forever, every Bloom filter also keeps backing storage in memory, a linked list of its elements; at regular intervals the hash algorithms are changed in the background and the filters are rebuilt from that list. I am not in favor of introducing an extra level, such as yet another cache layer, because it adds maintenance work and management complexity. To keep the same IP address pair from always hashing to the same positions, the algorithm is changed rather than a new layer introduced; no compensation at the same layer is needed.
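A counting variant of the filter above might look like this (again only a sketch, reusing bloom_hash() and BLOOM_NHASH from the previous block; the counter width is an assumption):

```c
/* Counting Bloom filter: an int counter per slot instead of a single bit,
 * so that elements can be deleted. */
struct counting_bloom {
    int count[BLOOM_BITS];                 /* counters instead of bits */
};

static void cbloom_add(struct counting_bloom *b, uint32_t saddr, uint32_t daddr)
{
    for (uint32_t i = 0; i < BLOOM_NHASH; i++)
        b->count[bloom_hash(saddr, daddr, i)]++;
}

/* Deletion is what the counters buy us: decrement instead of clearing a bit
 * that other elements may still rely on. */
static void cbloom_del(struct counting_bloom *b, uint32_t saddr, uint32_t daddr)
{
    for (uint32_t i = 0; i < BLOOM_NHASH; i++) {
        int *c = &b->count[bloom_hash(saddr, daddr, i)];
        if (*c > 0)
            (*c)--;
    }
}

static int cbloom_test(const struct counting_bloom *b, uint32_t saddr, uint32_t daddr)
{
    for (uint32_t i = 0; i < BLOOM_NHASH; i++)
        if (b->count[bloom_hash(saddr, daddr, i)] == 0)
            return 0;                      /* definitely not present */
    return 1;                              /* present, or a false positive */
}
```

The periodic rebuild described above would then walk the backing list of elements, re-add each one into a fresh filter computed with new hash seeds, and swap the filters.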
All in all, the hash algorithms of the Bloom filter must be chosen carefully, and the space it occupies must stay small.
4. Large packets or small packets. OpenVPN can simulate jumbo frames on the virtual network card in order to raise encryption/decryption throughput and reduce system call overhead. But is it a good idea to use jumbo frames on a real physical link?
Small packets are flexible, which is the whole point of packet switching. What about big packets? Taken to the extreme, an infinitely large packet is a circuit-switched stream that occupies the link for a long time, like a heavy truck or an excavator hogging the road. Fortunately, in reality a link on which jumbo frames are enabled is capable of carrying them (meaning its road is wide enough), so there is no problem there.
However, just because the road is wide, must you send jumbo frames down it? Wouldn't small frames run even more smoothly? A two-way ten-lane highway has no problem carrying container trucks, but it is even better suited to cars. If you have cargo to move, a transport plane offers no better throughput than a cargo ship, but its latency is far lower. In fact, even if a jumbo frame sets off intact, there is no guarantee it will not be fragmented somewhere along the way. So once energy consumption, packet-switching efficiency, and fragmentation/reassembly overhead are taken into account, jumbo frames have no real advantage. The reason I simulated jumbo frames in OpenVPN is that I fully control everything on the OpenVPN link; even if the frames on both the physical and the virtual interface were jumbo, my tests show that the fragmentation/reassembly overhead of those jumbo frames is much higher than OpenVPN's encryption/decryption and system-call overhead.
So what was the original intent behind jumbo frames? Just to cancel out the benefits of packet switching?? In fact, they are an optimization for hosts, especially hosts running Windows. Such hosts generally sit at the edge of the network and do end-to-end processing, so if incoming data frames raise interrupts too frequently, application-layer processing performance inevitably suffers; total resources are limited, after all! Jumbo frames reduce the number of interrupts.
I once said that jumbo frames are suitable in some situations, not absolutely; that holds for a single high-performance link. On data paths that include narrowband links, jumbo frames add overhead and congest the last mile, and intermediate nodes may have to fragment and reassemble them more than once. Whenever a protocol header has to be examined, fragments have to be reassembled: stateful NAT, firewalls, traffic classification, and so on. As for interrupt frequency, today's gigabit and 10-gigabit cards can already split and reassemble frames in the NIC chip before interrupting the CPU, and they can accumulate a batch of frames, raise one interrupt, and then switch to polling. For details, see the Intel igb/e1000e driver documentation.
5. The layered processing style of the protocol stack. The protocol stack is designed in layers, but that does not mean its implementation must be layered as well.
Early implementations, or implementations influenced by them, such as UNIX STREAMS modules and the Windows NDIS framework, follow the layering principle strictly. The advantage is that a caller can insert arbitrary processing logic between any two existing layers without the stack having to define explicit HOOK points. In the Windows NDIS framework, as long as your code follows the NDIS API conventions, you can implement a filter driver at will. And not only in kernel-mode processing: even at the socket level Windows provides the SPI/LSP mechanism. Apart from the BSD socket interface, the TCP, UDP and IP implementations and so on are built in, and anything else can be plugged in externally; even the TCP/IP stack itself is optional. Look at a network adapter's properties and you will see that the TCP/IP protocol can be installed and uninstalled separately, and if you install VMware, the VMware bridging protocol is installed automatically. In short, everything can be plugged in flexibly: without built-in HOOK points, you can insert your own processing module at any layer.
This design makes it easy for developers to implement their own logic: they only need to understand the relevant interfaces to insert logic at any point, and they may not need to understand the protocol stack's processing logic at all. The drawback is equally obvious: because only the designated APIs are exposed and the internals are not, you cannot call interfaces that are not exposed but that you genuinely need, so it is hard for developers to combine protocol stack layers in arbitrary ways.
The Linux protocol stack implementation is not layered in that sense but entirely callback-based; it does not distinguish between protocol processing, miniport drivers, and so on. Because all callbacks are registered and chained in order, inserting a HOOK between two callbacks would require unhooking the existing link first. Linux therefore takes a different approach: it lets developers organize the calls between layers themselves. For example, a hard_xmit callback can call another device's hard_xmit, and a recv path can call any xmit. This arbitrary, even recursive, composition looks chaotic but is actually flexible: Linux's bridge, bonding, and vlan devices are all implemented this way. Instead of inserting an NDIS-style filter layer, they simply combine the various xmit routines directly. This style requires developers to be intimately familiar with the protocol stack's processing logic, because what a function receives is a data unit carrying protocol stack semantics, not just an opaque buffer. A toy model of this composition style is sketched below.
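Purely to illustrate the composition idea (this is a user-space toy model, not kernel code; all structures and names are invented), here is a bond-like device whose xmit simply selects a slave and invokes that slave's xmit:

```c
#include <stdio.h>

struct sk_buff_model { const char *payload; };     /* stand-in for a packet buffer */

struct netdev_model {
    const char *name;
    int (*xmit)(struct netdev_model *dev, struct sk_buff_model *skb);
    struct netdev_model *slaves[2];                /* lower devices, if any */
    int next_slave;                                /* round-robin cursor */
};

static int phys_xmit(struct netdev_model *dev, struct sk_buff_model *skb)
{
    printf("%s transmits \"%s\" on the wire\n", dev->name, skb->payload);
    return 0;
}

/* The "bond" device's xmit does no I/O of its own: it picks a slave and
 * invokes that slave's xmit callback, i.e. one xmit calling another xmit. */
static int bond_xmit(struct netdev_model *dev, struct sk_buff_model *skb)
{
    struct netdev_model *slave = dev->slaves[dev->next_slave];
    dev->next_slave = (dev->next_slave + 1) % 2;   /* trivial load balancing */
    return slave->xmit(slave, skb);
}

int main(void)
{
    struct netdev_model eth0  = { "eth0",  phys_xmit, { 0 }, 0 };
    struct netdev_model eth1  = { "eth1",  phys_xmit, { 0 }, 0 };
    struct netdev_model bond0 = { "bond0", bond_xmit, { &eth0, &eth1 }, 0 };

    struct sk_buff_model a = { "packet 1" }, b = { "packet 2" };
    bond0.xmit(&bond0, &a);                        /* goes out via eth0 */
    bond0.xmit(&bond0, &b);                        /* goes out via eth1 */
    return 0;
}
```

A vlan or bridge device composes in the same spirit: it rewrites or selects, then hands the data unit to a lower device's xmit.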
Besides the boundaries between protocol layers, Linux also defines several HOOK points inside the layers. Netfilter is a very important framework of this kind, and a great firewall can be built on it. Note that Netfilter is not only for firewalls; it is a framework built into the protocol stack that lets you do almost anything with a packet, such as the QUEUE and STOLEN verdicts, and on top of it an IPSec VPN can be implemented without inserting any new layer. The one shortcoming is that Linux has fewer hook mechanisms at the socket level: if you want to hook the connect operation, you have to match the TCP SYN flag on the OUTPUT chain...
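To illustrate that last point in kernel terms, here is a hedged sketch of a module that watches outgoing TCP SYNs, i.e. connect() attempts, at the IPv4 LOCAL_OUT hook. It assumes a kernel recent enough to have the nf_register_net_hook() API; error handling and any real policy are omitted:

```c
#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <net/net_namespace.h>

/* Catch locally generated TCP SYNs (connect() attempts) on LOCAL_OUT. */
static unsigned int hook_connect(void *priv, struct sk_buff *skb,
                                 const struct nf_hook_state *state)
{
    struct iphdr *iph = ip_hdr(skb);
    struct tcphdr *tcph;

    if (!iph || iph->protocol != IPPROTO_TCP)
        return NF_ACCEPT;

    tcph = tcp_hdr(skb);
    if (tcph->syn && !tcph->ack) {
        /* An outgoing connection attempt: log it, queue it, or drop it here. */
        pr_debug("connect() to %pI4:%u\n", &iph->daddr, ntohs(tcph->dest));
    }
    return NF_ACCEPT;
}

static struct nf_hook_ops connect_hook_ops = {
    .hook     = hook_connect,
    .pf       = NFPROTO_IPV4,
    .hooknum  = NF_INET_LOCAL_OUT,
    .priority = NF_IP_PRI_FILTER,
};

static int __init connect_hook_init(void)
{
    return nf_register_net_hook(&init_net, &connect_hook_ops);
}

static void __exit connect_hook_exit(void)
{
    nf_unregister_net_hook(&init_net, &connect_hook_ops);
}

module_init(connect_hook_init);
module_exit(connect_hook_exit);
MODULE_LICENSE("GPL");
```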
Take load balancing as an example: with NDIS it requires an intermediate-layer filter driver, while on Linux, besides IPVS, it is very convenient to simply use the bonding driver's load-balancing modes.
6. As for why early flush-style public toilets left the waste sitting in a shared trough to be flushed away, there are several answers, for many reasons.
