Protocol stack processing: conntrack hash lookup, Bloom filters, cache lookup, big packets vs. small packets, and layered processing style


1. The pros and cons of the route cache

Caches have existed for many years. Their essence is to use the smallest amount of the fastest memory to front the largest amount of the slowest memory, which maximizes resource utilization: access efficiency improves while resources are saved. This is the basic principle behind all cache design.
For memory access, almost every CPU has a built-in level-1 cache and a level-2 cache; a few cores even have a level-3 or level-4 cache. Below that comes physical memory, then a carefully tuned disk swap partition, and finally remote storage. These tiers get progressively larger and their access costs get progressively higher, forming a pyramid-shaped storage system. The design achieves its effectiveness mainly by exploiting the principle of locality; more broadly, you could call it a Lévy-flight-like pattern. The principle is not only effective for memory access: human civilization follows a similar Lévy-flight pattern, but I will not go from the great southward migrations all the way to the North American immigration waves and the Westward Movement, so I will stop this paragraph here.


By extension, the protocol stack's routing table also has to be looked up. Leaving aside Cisco Express Forwarding (CEF), the route cache alone is enough to illustrate the point.

Early Linux (up to 2.6.39; let's not even talk about 3.x) built a route cache in front of the system routing table. Each cache entry is the result of a successful route lookup, with a built-in expiration timeout. The Linux route cache design assumes that packets destined for the same address will arrive in close succession.

This, too, is a generalization of the principle of locality. However, unlike memory access, which targets a single location, a Linux box acting as a router does not serve a single destination: its route cache stores source IP/destination IP pairs. A single entry in the routing table therefore expands into N route cache entries:
1.1.1.0/24 via 192.168.1.254 ==> {(2.2.2.1, 1.1.1.1, 192.168.1.254), (2.2.2.2, 1.1.1.1, 192.168.1.254), (2.2.2.3, 1.1.1.2, 192.168.1.254), (2.2.2.3, 1.1.1.10, 192.168.1.254), ...}
If the number of concurrent (source IP, destination IP) pairs traversing the box is huge (especially on a backbone router), the number of route cache entries will far exceed the number of routing table entries. So where is the supposed lookup advantage of the route cache over the routing table? Is there really such an efficient exact-match cache lookup algorithm? If there were, why not apply it directly to longest-prefix-match routing table lookup? Clearly, routing table lookup is a far more bounded problem than route cache lookup.
The idea of a route cache is good, but it is not well suited to a software implementation. With TCAM hardware it might be possible to do parallel multi-dimensional exact matching, but in software, given how poorly the protocol stack naturally exploits multiple CPU cores, the most efficient approach is hierarchical hashing, and even the most efficient multi-level hash is not a free lunch. Once you account for abnormal traffic, it is very easy to blow the route cache table up, for example by generating large numbers of packets from different source IP addresses to different destination IP addresses.
I won't go into how TCAM works; there is plenty of material about it online.
2. Conntrack hash lookup

Netfilter includes a conntrack module that can establish connection tracking for arbitrary five-tuples. When a packet arrives, a lookup must be performed so that the packet can be associated with a connection-tracking entry, and the efficiency of this lookup directly affects the quality of the whole device. The lookup algorithm is very simple: first compute a hash from the five-tuple, then walk the collision chain for that hash value until an exact match is found; if none is found, a new entry is created.
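To make the lookup concrete, here is a minimal userspace sketch of that algorithm (the names, bucket count, and FNV-style mixer are illustrative, not the kernel's actual nf_conntrack code): hash the five-tuple, walk the collision chain for an exact match, and create a new entry when nothing matches.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define HASHSIZE 4096   /* bucket count; ideally sized and rehashed dynamically */

struct tuple5 {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    uint8_t  proto;
};

struct conn {
    struct tuple5 key;
    struct conn *next;              /* collision chain */
    /* ... per-connection state: timeout, counters, NAT info ... */
};

static struct conn *buckets[HASHSIZE];

static uint32_t tuple_hash(const struct tuple5 *t)
{
    /* simple FNV-style word mixer; any decent hash works for the sketch */
    uint32_t w[4] = { t->saddr, t->daddr,
                      ((uint32_t)t->sport << 16) | t->dport, t->proto };
    uint32_t h = 2166136261u;
    for (int i = 0; i < 4; i++)
        h = (h ^ w[i]) * 16777619u;
    return h % HASHSIZE;
}

static int tuple_eq(const struct tuple5 *a, const struct tuple5 *b)
{
    return a->saddr == b->saddr && a->daddr == b->daddr &&
           a->sport == b->sport && a->dport == b->dport &&
           a->proto == b->proto;
}

/* hash the tuple, walk the chain for an exact match, create if missing */
static struct conn *conn_lookup_or_create(const struct tuple5 *t)
{
    uint32_t h = tuple_hash(t);
    struct conn *c;

    for (c = buckets[h]; c; c = c->next)
        if (tuple_eq(&c->key, t))
            return c;                       /* found the tracked connection */

    c = calloc(1, sizeof(*c));              /* miss: start tracking it */
    if (!c)
        return NULL;
    c->key = *t;
    c->next = buckets[h];
    buckets[h] = c;
    return c;
}

int main(void)
{
    struct tuple5 t = { .saddr = 0xc0a80101, .daddr = 0x08080808,
                        .sport = 51234, .dport = 53, .proto = 17 };
    struct conn *a = conn_lookup_or_create(&t);
    struct conn *b = conn_lookup_or_create(&t);
    printf("second lookup hit the same entry: %s\n", a == b ? "yes" : "no");
    return 0;
}
```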
A word about hashing. In purely theoretical terms it is a highly tunable time/space trade-off: at one extreme, with no regard for wasted space, it finds a result in constant time; at the other extreme it degenerates into a plain linked list, whose lookup cost is a full traversal. Real hash implementations fall somewhere between the two. So when implementing conntrack, Linux ought to behave like the trie-based routing tree: dynamically adjust the hash size and perform the appropriate rehash operations to keep the average number of elements on each collision chain within a certain range. That preserves good scalability, and the only cost is memory.

But Linux does not seem to do this.
I have repeatedly wanted to optimize nf_conntrack's lookup performance, but I keep running into the following ideal: "ideally, a conntrack lookup needs only one hash computation, followed by iterating over a single element or a chain of single-digit length." If that ideal were already achieved, would there be anything left to optimize? Suppose the average collision-chain length is 10; then 10,000 hash buckets are enough to hold 10 * 10,000 connections with very high efficiency. But an ideal is, after all, only an ideal. The input to the hash computation is the five-tuple and the algorithm is fixed, so if the hash algorithm is not good enough, the uniformity of its output is highly correlated with its input; both theory and practice show that such a hash cannot reach the ideal. To make the hash output independent of its input you would have to borrow the ideas of symmetric-key ciphers: substitution, permutation, confusion, diffusion, and so on (see how DES/AES operate), and as you can see, that is still not a free lunch!
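The dependence of hash uniformity on the input can be demonstrated directly. The sketch below (hypothetical code, not Linux's actual conntrack hash) compares a naive XOR of the five-tuple fields with a seeded mixing function: crafted tuples whose fields XOR to a constant all land in one chain under the naive hash, while the seeded mixer keeps chains short. Linux takes a similar precaution by mixing a random value chosen at initialization into its conntrack hash.

```c
#include <stdio.h>
#include <stdint.h>
#include <time.h>

struct tuple5 { uint32_t saddr, daddr; uint16_t sport, dport; uint8_t proto; };

/* deliberately poor hash: output is trivially predictable from the input */
static uint32_t hash_xor(const struct tuple5 *t)
{
    return t->saddr ^ t->daddr ^ t->sport ^ t->dport ^ t->proto;
}

/* seeded mixer: an attacker cannot predict which bucket a tuple lands in */
static uint32_t hash_mix(const struct tuple5 *t, uint32_t seed)
{
    uint32_t w[4] = { t->saddr, t->daddr,
                      ((uint32_t)t->sport << 16) | t->dport, t->proto };
    uint32_t h = seed;
    for (int i = 0; i < 4; i++) {
        h ^= w[i];
        h *= 0x9e3779b1u;
        h = (h << 13) | (h >> 19);
    }
    return h;
}

int main(void)
{
    enum { BUCKETS = 1024, N = 10000 };
    static int cx[BUCKETS], cm[BUCKETS];
    uint32_t seed = (uint32_t)time(NULL) * 2654435761u;
    int i, mx = 0, mm = 0;

    for (i = 0; i < N; i++) {
        /* crafted tuples: saddr ^ daddr is constant, so hash_xor collapses */
        struct tuple5 t = { .saddr = 0x0a000000u + i,
                            .daddr = (0x0a000000u + i) ^ 0x12345678u,
                            .sport = 40000, .dport = 80, .proto = 6 };
        cx[hash_xor(&t) % BUCKETS]++;
        cm[hash_mix(&t, seed) % BUCKETS]++;
    }
    for (i = 0; i < BUCKETS; i++) {
        if (cx[i] > mx) mx = cx[i];
        if (cm[i] > mm) mm = cm[i];
    }
    printf("longest chain with xor hash:   %d\n", mx);  /* close to N: one chain */
    printf("longest chain with seeded mix: %d\n", mm);  /* stays small */
    return 0;
}
```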


Hash lookup is all about efficiency. If the algorithm hashes its input uniformly enough, all is well. If the distribution is poor, some collision chains become particularly long and others particularly short, and for a system that must treat flows fairly, the impact on performance is very large.
3. Bloom-filter-based forwarding

If the Linux routing cache has to go, there has to be an alternative. In fact I am not talking only about Linux, but about the general case: as long as a routing cache is implemented in software, the efficiency of the cache lookup must be considered. Only when the cache has a fixed and fairly small number of entries will it show an advantage over the slow path; otherwise, give up on the idea.
If you bring hardware into it, then anything goes, but I am certainly not talking about hardware here.

I want to optimize forwarding efficiency in pure software.
One thing must be kept in mind: you cannot improve the performance of every part of a whole, just as a perpetual motion machine is impossible; you must take capacity away from the unimportant parts and pour it into the parts that need optimizing.


The baseline of the optimization is to distinguish a fast path from a slow path. The fast path is searched first; if that fails, the lookup drops to the slow path, and the result is then installed into the fast path. The fast path is fast precisely because it is exclusive, small, and limited, rather like citizenship in the Roman Republic: it had to be won through struggle, and by the time of Caracalla, when citizenship became a right granted to everyone, its natural advantage was gone. The efficient alternative to the traditional route cache is implemented using Bloom filters.
Maintain, for each next hop, a Bloom filter of fixed size (though the size can be adjusted at runtime). When a packet arrives, first query the Bloom filters of all next hops; there are only three possible outcomes:
A. Exactly one Bloom filter returns 1;
B. Several Bloom filters return 1;
C. No Bloom filter returns 1.

For case A, send the packet straight to that next hop. For case B, a false positive exists: any of the filters that returned 1 might indicate the correct result, so either fall back to the slow path or replicate the packet to all possible next hops. For case C, you must fall back to the slow path.

Note that we naturally hope the outcome is case A. In that case the computation is very fast: n hash calculations suffice, and they can be done serially on one processor or in parallel on several.
For the Bloom filters, to make element deletion easy, use an integer counter per slot instead of a single 0/1 bit. To prevent a false positive from remaining a false positive forever, every Bloom filter keeps a backing store in memory holding the list of its elements; at regular intervals, a background task switches the hash algorithms and rebuilds the Bloom filters from that list.
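A minimal sketch of this scheme follows (the sizes, hash mixer, and names are illustrative assumptions, not a reference implementation): one counting Bloom filter per next hop, add/delete via per-slot counters, and a fast-path lookup that returns a next hop only in case A and otherwise falls back to the slow path.

```c
#include <stdio.h>
#include <stdint.h>

#define NEXTHOPS     4
#define FILTER_SLOTS 8192
#define NHASH        3            /* k hash functions per filter */

struct bloom {
    uint32_t counter[FILTER_SLOTS];   /* counters instead of bits, so deletion works */
};

static struct bloom nexthop_filter[NEXTHOPS];

static uint32_t bloom_slot(uint32_t dst, int k)
{
    /* k-th hash: seeded multiplicative mixing, good enough for the sketch */
    uint32_t h = dst * 2654435761u + (uint32_t)k * 0x9e3779b9u;
    h ^= h >> 16;
    return h % FILTER_SLOTS;
}

static void bloom_add(struct bloom *b, uint32_t dst)
{
    for (int k = 0; k < NHASH; k++)
        b->counter[bloom_slot(dst, k)]++;
}

static void bloom_del(struct bloom *b, uint32_t dst)
{
    for (int k = 0; k < NHASH; k++)
        b->counter[bloom_slot(dst, k)]--;
}

static int bloom_maybe_contains(const struct bloom *b, uint32_t dst)
{
    for (int k = 0; k < NHASH; k++)
        if (b->counter[bloom_slot(dst, k)] == 0)
            return 0;
    return 1;   /* "maybe": false positives are possible */
}

/* Returns the next hop index for case A, or -1 when the slow path must run
 * (case B: several candidates; case C: no candidate). */
static int fast_path_lookup(uint32_t dst)
{
    int hit = -1, hits = 0;
    for (int i = 0; i < NEXTHOPS; i++) {
        if (bloom_maybe_contains(&nexthop_filter[i], dst)) {
            hit = i;
            hits++;
        }
    }
    return hits == 1 ? hit : -1;
}

int main(void)
{
    uint32_t dst = 0x01010101;               /* 1.1.1.1 */
    bloom_add(&nexthop_filter[2], dst);      /* slow path learned: via next hop 2 */
    printf("fast path says next hop %d\n", fast_path_lookup(dst));
    bloom_del(&nexthop_filter[2], dst);
    printf("after delete: %d (back to the slow path)\n", fast_path_lookup(dst));
    return 0;
}
```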

I am not in favour of introducing a new level, for example a cache layer in front of the filters, because that adds state to maintain and complexity to manage. The one who tied the bell must untie it: since the goal is to avoid recomputing results for the same IP address pair, the fix is to change the algorithm at the same level, not to compensate by introducing a new layer; that is unnecessary.
In a word, the hash computation of the Bloom filters must stay cheap and stable, and the space they occupy must be small.
4. Big packets or small packets

OpenVPN's virtual NIC simulates jumbo frames in order to increase encryption/decryption throughput and reduce system call overhead. But are jumbo frames a good idea on real physical links?
Small packets are flexible, which is the true spirit of packet switching. And big packets? They are cumbersome; an infinitely large packet is effectively a circuit-switched stream that occupies the link for a long time.

Just like a heavy hauler or an excavator on the road. Fortunately, if a link that enables jumbo frames really has the capacity to carry them (that is, it is much wider), there is nothing wrong with that.


But if you do have a very wide road, must you run jumbo frames on it? Wouldn't running small frames be even more efficient? A two-way ten-lane highway has no problem carrying road trains, but a coupe is expected to run even better on it. If cargo has to be shipped, a sports car's throughput is no match for a freight truck's, but its latency is much smaller.

In fact, on the links after departure, even if you transmit jumbo frames, there is no guarantee they will not be fragmented somewhere along the way.

So, considering energy consumption, packet-switching efficiency, and fragmentation/reassembly overhead, jumbo frames have no real advantage. I simulated jumbo frames in OpenVPN because I could control everything on the OpenVPN link; and even when both the physical NIC and the virtual NIC carried jumbo data frames, my tests showed that the fragmentation/reassembly overhead of those jumbo frames was much larger than OpenVPN's packet encryption/decryption and system call overhead.
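To get a rough feel for where that fragmentation overhead comes from, the arithmetic alone is telling. The back-of-the-envelope sketch below (illustrative numbers, not the measurements above) counts the IPv4 fragments produced when a 9000-byte packet has to cross a standard 1500-byte-MTU link; each fragment incurs its own per-packet processing, and the far end must hold and reassemble all of them.

```c
#include <stdio.h>

int main(void)
{
    int mtu = 1500, iphdr = 20;
    int jumbo_payload = 9000 - iphdr;        /* IP payload of a 9000-byte packet */
    int per_frag = ((mtu - iphdr) / 8) * 8;  /* non-final fragments: multiple of 8 bytes -> 1480 */
    int frags = (jumbo_payload + per_frag - 1) / per_frag;

    printf("9000-byte jumbo packet -> %d fragments over a %d-byte MTU link\n",
           frags, mtu);                      /* prints 7 */
    return 0;
}
```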
So what are jumbo frames for? Just to cancel out the benefits of packet switching? In fact they are an optimization for hosts, especially Windows hosts. Such hosts generally sit at the edge of the network and do end-to-end processing, so if incoming data frames raise interrupts too frequently, the application layer's processing performance is bound to suffer, since total resources are limited. Jumbo frames reduce the number of interrupts.
I have said before that jumbo frames are a product of their time, not an absolute good; that holds for a high-performance single link. On data paths that approach narrow-band links, jumbo frames add overhead and clog the last kilometre of the route.

Fragmentation/reassembly at intermediate nodes may happen more than once: any node that needs to parse protocol headers has to reassemble fragments, for example stateful NAT, firewalls, traffic-flow classification, and so on. As for interrupt frequency, current gigabit and 10-gigabit NICs can fragment and reassemble in the NIC chip and only then interrupt the CPU; they can also accumulate data frames up to a threshold before raising a single interrupt, and poll until the next one. See the readme of Intel's igb/e1000e drivers for details.
5. Layered processing style

The design of the protocol stack is layered; that does not mean the implementation of the protocol stack must also be layered.
Early implementations, or implementations influenced by early ones, such as UNIX STREAMS modules or the Windows NDIS framework, are built strictly according to the layering principle. The advantage of this style is that the caller can insert whatever processing logic it likes at any point between the existing layers, without needing explicitly defined hook points. In the Windows NDIS framework, as long as you call the APIs according to the NDIS conventions, a filter driver can be implemented arbitrarily. This is not limited to kernel processing: even at the socket level, Windows provides the SPI/LSP mechanism. Apart from the universally recognized BSD socket interface, the TCP, UDP, and IP implementations are built in, while everything else can be attached externally; even the TCP/IP stack itself is not mandatory. If you look at the properties of a NIC, you will find that the TCP/IP protocol can be loaded and unloaded separately, and if you install VMware, the VMware bridging protocol installs itself there. In short, everything is pluggable; there are no built-in hook points, and you are free to insert your own processing modules between layers.
This design makes it easy for developers to implement their own logic: they only need to understand the interfaces to insert their logic at any location, and they do not even need to understand the protocol stack's processing logic at all. The disadvantage is equally obvious: because only certain APIs are exposed and the details are not, you cannot call interfaces that are not exposed but that you really need, so it is very difficult for developers to combine protocol stack layers in arbitrary ways.
The Linux protocol stack implementation is not layered at all; it is built entirely on callbacks. It does not distinguish between protocol processing, miniport drivers, and so on. Because all the callbacks are chained in a fixed order, if you want to insert a hook between two callbacks you must first break the existing link between them. Linux takes a different approach to this situation: it lets developers organize the calls between layers themselves. For example, you can invoke a hard_xmit callback inside another hard_xmit callback, and you can invoke xmit, or any xmit, inside a recv. These arbitrary combinations can be recursive; they look chaotic but are actually flexible.

You will find that Linux bridging, bonding, and VLANs are all implemented this way: instead of inserting a filter-layer driver as NDIS does, they directly combine the various xmit calls. This Linux implementation style requires developers to be familiar with the details of the protocol stack's processing logic; what they get inside a function is a data unit that carries network protocol stack semantics, not just a buffer.
Besides the boundaries between protocol layers, Linux also defines a number of hook points inside the layer processing, and netfilter is the most important such framework; on top of it one can build very good firewalls. Note that netfilter is not only for firewalls: it is a framework built into the protocol stack that can queue or steal arbitrary packets, and on this basis IPsec VPNs can be implemented without inserting anything at any layer boundary. The fly in the ointment is that Linux has few hook mechanisms at the socket level.

If you want to hook the connect operation, for example, you have to match TCP's SYN flag on the OUTPUT chain ...
Take load balancing as an example: on NDIS you have to implement an intermediate-layer filter driver, while on Linux, besides IPVS, simply using the bonding driver's load-balancing modes is very convenient.
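To ground the netfilter hook framework described above, here is a minimal kernel-module sketch (using the current nf_register_net_hook API; kernels of the 2.6.x era used nf_register_hook instead). It registers a callback at PRE_ROUTING and simply accepts every packet; a real module would queue, steal, or rewrite packets there.

```c
#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/ip.h>
#include <net/net_namespace.h>

static unsigned int my_hook(void *priv, struct sk_buff *skb,
                            const struct nf_hook_state *state)
{
    const struct iphdr *iph = ip_hdr(skb);

    /* Inspect, mark, queue, or steal the packet here. */
    pr_debug("saw packet, protocol %u\n", iph->protocol);
    return NF_ACCEPT;
}

static struct nf_hook_ops my_ops = {
    .hook     = my_hook,
    .pf       = NFPROTO_IPV4,
    .hooknum  = NF_INET_PRE_ROUTING,
    .priority = NF_IP_PRI_FIRST,
};

static int __init my_init(void)
{
    return nf_register_net_hook(&init_net, &my_ops);
}

static void __exit my_exit(void)
{
    nf_unregister_net_hook(&init_net, &my_ops);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");
```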
6. Squatting over the ditch — on early water-flushed public toilets; withheld here ... as many answers as there are reasons.

