Optimization of a netfilter nf_conntrack flow table lookup: adding a per-CPU cache for conntrack


Sorrow needs to be endured; happiness needs to be shared.
More than once I have profiled the Linux protocol stack with perf and could not stand its performance, but it is so powerful that I cannot give it up. I wanted to implement a fast flow table myself, but that would have meant abandoning many features that depend on conntrack, such as the state match and Linux NAT. Admittedly, I have complained about NAT more than enough, but plenty of people are using it anyway.
I once did an offline, statistics-based conntrack optimization. The idea is simple: derive the hash function dynamically instead of using one fixed algorithm. Sample all the five-tuples passing through the box, then analyze the data offline. For example, splice each five-tuple into a 104-bit string: 32-bit source IP + 32-bit destination IP + 8-bit protocol number + 16-bit source port + 16-bit destination port (in my implementation I ignored the source port, because it is so variable that one can afford to ignore it). Then, for a hash table of N buckets, slide a window of log2(N) bits across the 104-bit string and find the position where the sampled values differ the most; take the bits in that window as the hash, so that the data flows are spread evenly across the hash buckets. If heavy traffic still makes some conflict lists too long, you can build a multi-dimensional nested hash. Turn such a hash table upside down and it looks just like a balanced N-ary tree; isn't an N-ary trie exactly this? The hash function here is "take some bits", which once again shows the unity of trie and hash.
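To make the sliding-window selection concrete, here is a minimal offline sketch. It is my reconstruction, not the original tool: packing the key into 64 bits (ports ignored), the fixed WINDOW of log2(N) bits, and the name best_bit_window() are all assumptions. It counts distinct window values in a traffic sample and picks the most scattered window:

#include <stdint.h>
#include <string.h>

#define WINDOW   10 /* log2(N) for N = 1024 hash buckets */
#define KEY_BITS 64 /* key = source IP + destination IP, ports ignored */

static int best_bit_window(const uint64_t *samples, int n)
{
    int best_off = 0, best_distinct = -1;

    for (int off = 0; off + WINDOW <= KEY_BITS; off++) {
        static uint8_t seen[1 << WINDOW];
        int distinct = 0;

        memset(seen, 0, sizeof(seen));
        for (int i = 0; i < n; i++) {
            uint32_t v = (samples[i] >> off) & ((1u << WINDOW) - 1);
            if (!seen[v]) {
                seen[v] = 1;
                distinct++;
            }
        }
        if (distinct > best_distinct) {
            best_distinct = distinct;
            best_off = off;
        }
    }
    return best_off; /* hash(key) = (key >> best_off) & (N - 1) */
}

The resulting hash is then a single shift and a mask, about as cheap as a lookup can get; the price is the offline sampling and the periodic recomputation.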
Beautiful as the optimization above is, it is still complex. I borrowed the idea from hardware cache design, but compared with a hardware cache such as the CPU cache, the software imitation loses much of the effect, because software can resolve hash conflicts only by traversing or searching, while hardware can compare in parallel. Do not think one algorithm is academically superior to another here; this is determined by physics. Hardware is built from gate circuits, its flow is electric current, and currents run in parallel; software is built from logic, its flow is sequential steps, and an algorithm is a combination of logical steps. There are, of course, many elaborate so-called parallel algorithms, but as far as I know many of them perform poorly: complexity breeds more complexity until you have to give up. Besides, for something this simple, a complicated algorithm is firing a cannon at a fly.
nf_conntrack simple optimization: add a cache. If one stage cannot keep up with the rate of the processing behind it and becomes a bottleneck, add a cache to smooth out the rate difference. The CPU cache exploits exactly this idea, and we should apply the same idea to the efficiency of nf_conntrack. But to decide how, we need at least some qualitative analysis.
If you capture packets with tcpdump, you will find that the result is almost always runs of consecutive packets belonging to the same five-tuple flow. Not absolutely, though: occasionally a packet from another flow is inserted into a run. A quite typical capture might look like the following:
Data flow A, forward direction
Data flow A, forward direction
Data flow A, reverse direction
Data flow A, forward direction
Data flow C, forward direction
Data flow A, reverse direction
Data flow A, forward direction
Data flow B, reverse direction
Data flow B, forward direction
Data flow B, forward direction
....
Do you see the pattern? Packets arriving at the box follow a non-strict temporal locality; that is, packets belonging to one flow keep arriving for a while. As for spatial locality, many people say it is not obvious, but if you carefully analyze the source/destination IP tuples of data flows A, B, C, D, ..., you will find spatial locality there too, and it is the basic principle behind TCAM-based hardware forwarding. The "some bits" a TCAM matches on are the bits where the address space is most scattered, which is a reverse use of spatial locality; on a core transport network, for example, you will find that a large share of the IP addresses are headed for North America or Northern Europe.
I would have liked to explain this law with mathematics and statistics, but that does not suit a public blog. When an interviewer asks me about it I can only rush through a few sentences, and if necessary I will go into depth by e-mail; in a blog, that kind of treatment is showing off, loses a lot of readers, and naturally leaves nobody to comment. The most important thing in a blog is to get to the result quickly: what to actually do.
If the "take some bits" hash above, a reverse use of spatial locality, embodies the law of "efficiency from regularity", then the price of exploiting regularity is complication, and that complication is what stopped me. There is a principle more universal than that law: "efficiency from simplicity". I like simple things and simple people, and this time I proved myself right once again. Before going on, let me briefly describe where the nf_conntrack bottlenecks are.
1. nf_conntrack keeps the tuples of both directions in one hash table, and insert, delete, and modify operations must be protected by a global lock, which serializes a large number of operations.
2. The nf_conntrack hash table uses the jhash algorithm, which takes too many operation steps; if the number of conntracks is small, the hash computation itself eats a huge share of the performance (see the sketch after this list).
[Tip: if you know symmetric ciphers such as DES/AES from cryptography, you will understand that substitution, permutation, and XOR achieve the best confusion of data, making the output look independent of the input and therefore giving the best hash. The price of this effect is an elaborate computation, which is exactly the efficiency problem of encryption and decryption. The operations are so regular (all those "boxes") that they can be implemented entirely in hardware circuits, but executed on a CPU without such hardware they are extremely CPU-intensive. jhash is the same, though not to the same degree.]
3. The nf_conntrack table is global across all CPUs, which requires data synchronization. RCU can minimize the cost, but only until somebody writes to the table.
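To make point 2 concrete, here is a self-contained userspace sketch, mine rather than part of the patch below, contrasting a mix-heavy, jhash-style hash with the "take some bits" hash. The mixer is a simplified stand-in written for this sketch, not the kernel's jhash; the point is only the difference in operation count:

#include <stdint.h>

struct tuple_key {
    uint32_t saddr, daddr;
    uint32_t proto_ports; /* protocol number and ports packed together */
};

/* several rounds of subtract/rotate/xor, jhash-style */
static uint32_t mix_heavy_hash(const struct tuple_key *k, uint32_t buckets)
{
    uint32_t a = k->saddr, b = k->daddr, c = k->proto_ports;

    a -= c; a ^= (c << 4)  | (c >> 28); c += b;
    b -= a; b ^= (a << 6)  | (a >> 26); a += c;
    c -= b; c ^= (b << 8)  | (b >> 24); b += a;
    a -= c; a ^= (c << 16) | (c >> 16); c += b;
    return c & (buckets - 1);
}

/* one shift and one mask: the bits were chosen offline */
static uint32_t take_bits_hash(const struct tuple_key *k,
                               int off, uint32_t buckets)
{
    return (k->saddr >> off) & (buckets - 1);
}

With few conntrack entries the conflict lists are short, so almost the whole lookup cost is the hash computation itself, and the difference above dominates.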


In view of the points above, break them down one by one and the solution emerges.


1. The cache is built per CPU, which fully follows the localization principle of cache design (this is our own per-CPU data, not the CPU's hardware cache).
2. The cache is kept as small as possible: it holds only the flow entries most likely to be hit, while ensuring that a cache miss does not cost too much.
3. A reasonable, adaptive cache replacement policy is established: an entry holds its slot for as long as it keeps earning it, and yields to a newcomer without being thrashed out.


That was my design thinking. In the course of implementing it step by step, I initially kept only a single cache entry: the item most recently found in the conntrack hash table. This matches temporal locality exactly, but in testing I found that if there is a slow flow such as ICMP in the network, the cache jitters badly. Compared with a TCP flow, ICMP is far slower, yet by the queuing principle its packets will eventually be queued up against a TCP flow's, forcing a cache replacement each time. To avoid this sad situation, I added a timestamp field to the conntrack entry, i.e. the conn structure. On every hash lookup, the field is subtracted from the current jiffies and then updated to the current jiffies; cache replacement is performed only when the difference is below a predetermined value. That value can be derived by weighting it against the network bandwidth.
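A minimal sketch of this anti-jitter rule, meant to sit alongside the patch in nf_conntrack_core.c. Both the last_seen field on struct nf_conn and the CACHE_TS_MAX tunable are my assumptions; the test code further below leaves the actual condition as an "Optimization 2" placeholder:

/* assumed tunable, e.g. weighted from link bandwidth */
#define CACHE_TS_MAX 2 /* jiffies */

/* last_seen is a hypothetical jiffies field added to struct nf_conn */
static inline int should_cache(struct nf_conn *ct)
{
    unsigned long delta = jiffies - ct->last_seen;

    ct->last_seen = jiffies;
    /* only back-to-back packets earn a cache slot; a slow ICMP
     * flow fails this test and stops evicting the TCP flows */
    return delta < CACHE_TS_MAX;
}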
But is this perfect? Far from it! Thinking about CPU cache design again, I found that the conntrack cache is a completely different situation. For a CPU, thanks to the virtual memory mechanism, the cache holds data from the address space of the same process (leaving aside the more complex details of CPU cache design), so temporal locality is essentially guaranteed unless a branch jump or function call occurs. Network packets, by contrast, are pure queueing: the namespace of all packets is the worldwide set of IP addresses, so a packet from a random flow can be inserted at any moment. The most common case is data flow switching: if data flows A and B transmit at comparable rates, then depending on the bandwidth along the path, they are likely to arrive alternately, or to alternate every two or three packets. In that case, which one do you take care of? This gives a third principle: efficiency comes from fairness.

So, my final design looks like this:


1. The cache is a linked list, and the length of the list is a parameter worth fine-tuning.
Cache list too short: flow entries get swapped frequently between the conntrack hash table and the cache.
Cache list too long: a cache miss becomes too expensive for the flow entries that cannot hit the cache.
Winner principle: the winner takes all. "For whosoever hath, to him shall be given, and he shall have more abundance: but whosoever hath not, from him shall be taken away even that he hath." (Matthew)
Balancing principle 1, for winners: the time to traverse the cache list must not exceed a standard hash computation plus the (average) traversal of its conflict list.
Balancing principle 2, for losers: a flow that traverses the cache list without a hit loses time it should not have to lose, but the loss must stay within an acceptable range.

Effect: the faster a data flow arrives, the more easily it hits the cache, at very low cost; the slower a data flow, the less likely it is to hit the cache, but it does not pay a heavy price for missing either.
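The two balancing principles bound the cache list length from both sides. Here is a sketch of the sizing rule that also appears in the comment of the test code below; nf_conntrack_max and nf_conntrack_htable_size are existing nf_conntrack globals, while compute_max_cache() and the factor 3 (an empirical value) are assumptions of this sketch:

/* cache list length = average conflict list length / 3, where the
 * average conflict list length = nf_conntrack_max / hash buckets */
static unsigned int compute_max_cache(void)
{
    unsigned int avg_conflict = nf_conntrack_max / nf_conntrack_htable_size;
    unsigned int len = avg_conflict / 3;

    return len ? len : 1; /* keep at least one slot */
}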


2. Timestamp-based cache replacement principle
Cache replacement is performed only when the interval between successive packets of a flow is less than a dynamically computed value.

My intermediate-stage test code is as follows:


/* Modify net/netfilter/nf_conntrack_core.c */
/* email: [EMAIL PROTECTED] */

/* 1. Definitions */
#define A
#ifdef A
/*
 * MAX_CACHE dynamic calculation principle:
 *   cache list length = average conflict list length / 3, where
 *   average conflict list length = net.nf_conntrack_max / net.netfilter.nf_conntrack_buckets
 *   and 3 is an empirical factor.
 */
#define MAX_CACHE 4

struct conntrack_cache {
    struct nf_conntrack_tuple_hash *caches[MAX_CACHE];
};

DEFINE_PER_CPU(struct conntrack_cache, conntrack_cache);
#endif

/* 2. Modify resolve_normal_ct */
static inline struct nf_conn *
resolve_normal_ct(struct net *net, struct sk_buff *skb,
                  unsigned int dataoff, u_int16_t l3num, u_int8_t protonum,
                  struct nf_conntrack_l3proto *l3proto,
                  struct nf_conntrack_l4proto *l4proto,
                  int *set_reply, enum ip_conntrack_info *ctinfo)
{
    struct nf_conntrack_tuple tuple;
    struct nf_conntrack_tuple_hash *h;
    struct nf_conn *ct;
#ifdef A
    int i;
    struct conntrack_cache *cache;
#endif

    if (!nf_ct_get_tuple(skb, skb_network_offset(skb), dataoff, l3num,
                         protonum, &tuple, l3proto, l4proto)) {
        pr_debug("resolve_normal_ct: Can't get tuple\n");
        return NULL;
    }

#ifdef A
    h = NULL;   /* fixed: h must be NULL whenever the cache misses */
    ct = NULL;
    cache = &__get_cpu_var(conntrack_cache);
    rcu_read_lock();
    if (0 /* Optimization 3 */) {
        goto slowpath;
    }
    for (i = 0; i < MAX_CACHE; i++) {
        struct nf_conntrack_tuple_hash *ch = cache->caches[i];
        struct nf_conntrack_tuple_hash *ch0 = cache->caches[0];

        if (ch && nf_ct_tuple_equal(&tuple, &ch->tuple)) {
            ct = nf_ct_tuplehash_to_ctrack(ch);
            if (unlikely(nf_ct_is_dying(ct) ||
                         !atomic_inc_not_zero(&ct->ct_general.use))) {
                ct = NULL;
                goto slowpath;
            }
            if (unlikely(!nf_ct_tuple_equal(&tuple, &ch->tuple))) {
                nf_ct_put(ct);
                ct = NULL;
                goto slowpath;
            }
            /************************ Optimization 1 *************************
             * Do not promote the hit entry straight to the head; raise it
             * by a step inversely proportional to the interval between two
             * cache hits. This avoids sharp jitter of the cache queue
             * itself. Weighting the hit interval with its historical
             * values would work even better.
             *****************************************************************/
            /* raise hit priority according to temporal locality */
            if (i > 0 /* && Optimization 1 */) {
                cache->caches[0] = ch;
                cache->caches[i] = ch0;
            }
            h = ch;
            rcu_read_unlock();
            goto skip;  /* fixed: a hit must not fall into the insert path */
        }
    }
slowpath:
    rcu_read_unlock();
    if (!h)
#endif
    /* look for tuple match */
    h = nf_conntrack_find_get(net, &tuple);
    if (!h) {
        h = init_conntrack(net, &tuple, l3proto, l4proto, skb, dataoff);
        if (!h)
            return NULL;
        if (IS_ERR(h))
            return (void *)h;
    }
#ifdef A
    else {
        int j;
        struct nf_conn *ctp;
        struct nf_conntrack_tuple_hash *chp;

        /************************ Optimization 2 *************************
         * Perform cache replacement only when two consecutive packets
         * arrive within an interval smaller than n, to avoid the cache
         * jitter caused by slow flows such as ICMP.
         *****************************************************************/
        if (0 /* Optimization 2 */) {
            goto skip;
        }
        /************************ Optimization 3 *************************
         * Enable the cache only when the total number of conntracks
         * exceeds 4 times the number of hash buckets: with few
         * conntracks, one hash computation (plus a short walk of the
         * conflict list) already finds the entry, and the cache would
         * cost performance instead.
         *****************************************************************/
        if (0 /* Optimization 3 */) {
            goto skip;
        }
        ct = nf_ct_tuplehash_to_ctrack(h);
        nf_conntrack_get(&ct->ct_general); /* the cache holds its own reference */
        chp = cache->caches[MAX_CACHE - 1];
        for (j = MAX_CACHE - 1; j > 0; j--) {
            cache->caches[j] = cache->caches[j - 1];
        }
        cache->caches[0] = h;
        if (chp) {
            ctp = nf_ct_tuplehash_to_ctrack(chp);
            nf_conntrack_put(&ctp->ct_general); /* drop the evicted entry's reference */
        }
    }
skip:
    if (!ct) {
        ct = nf_ct_tuplehash_to_ctrack(h);
    }
#else
    ct = nf_ct_tuplehash_to_ctrack(h);
#endif

    /* It exists; we have (non-exclusive) reference. */
    if (NF_CT_DIRECTION(h) == IP_CT_DIR_REPLY) {
        *ctinfo = IP_CT_ESTABLISHED + IP_CT_IS_REPLY;
        /* Please set reply bit if this packet OK */
        *set_reply = 1;
    } else {
        /* Once we've had two way comms, always ESTABLISHED. */
        if (test_bit(IPS_SEEN_REPLY_BIT, &ct->status)) {
            pr_debug("nf_conntrack_in: normal packet for %p\n", ct);
            *ctinfo = IP_CT_ESTABLISHED;
        } else if (test_bit(IPS_EXPECTED_BIT, &ct->status)) {
            pr_debug("nf_conntrack_in: related packet for %p\n", ct);
            *ctinfo = IP_CT_RELATED;
        } else {
            pr_debug("nf_conntrack_in: new packet for %p\n", ct);
            *ctinfo = IP_CT_NEW;
        }
        *set_reply = 0;
    }
    skb->nfct = &ct->ct_general;
    skb->nfctinfo = *ctinfo;
    return ct;
}

/* 3. Modify nf_conntrack_init */
int nf_conntrack_init(struct net *net)
{
    int ret;
#ifdef A
    int i;
#endif

    if (net_eq(net, &init_net)) {
        ret = nf_conntrack_init_init_net();
        if (ret < 0)
            goto out_init_net;
    }
    ret = nf_conntrack_init_net(net);
    if (ret < 0)
        goto out_net;

    if (net_eq(net, &init_net)) {
        /* For use by REJECT target */
        rcu_assign_pointer(ip_ct_attach, nf_conntrack_attach);
        rcu_assign_pointer(nf_ct_destroy, destroy_conntrack);

        /* Howto get NAT offsets */
        rcu_assign_pointer(nf_ct_nat_offset, NULL);
    }
#ifdef A
    /* initialize the per-CPU conntrack cache queues */
    for_each_possible_cpu(i) {
        int j;
        struct conntrack_cache *cache;

        cache = &per_cpu(conntrack_cache, i);
        for (j = 0; j < MAX_CACHE; j++)
            cache->caches[j] = NULL;
    }
#endif
    return 0;

out_net:
    if (net_eq(net, &init_net))
        nf_conntrack_cleanup_init_net();
out_init_net:
    return ret;
}
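The if (0 /* Optimization n */) placeholders above are deliberately inert in this test code. A sketch of how Optimizations 2 and 3 might be wired in, under the assumptions already stated (should_cache() and the last_seen field are hypothetical; net->ct.count and nf_conntrack_htable_size are existing kernel symbols; the factor 4 follows the comment in the patch):

/* Optimization 3, in the lookup fast path: skip the cache entirely
 * while the table is sparse */
if (atomic_read(&net->ct.count) < 4 * nf_conntrack_htable_size)
    goto slowpath;

/* Optimization 2, in the insertion path: only a fast flow earns a
 * cache slot */
if (!should_cache(nf_ct_tuplehash_to_ctrack(h)))
    goto skip;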


I hope interested readers get a chance to test this. Results and questions can be sent directly to the mailbox shown in the code comment.

Copyright notice: this is the blogger's original article; do not reproduce it without the blogger's permission.
