Optimizing Netfilter nf_conntrack flow table lookups: adding a per-CPU cache for conntrack

Sorrow needs to endure, happiness needs to be shared
I cannot stand the performance of conntrack: profiling the Linux protocol stack with perf has shown it to be a hot spot again and again. Yet it is so powerful that I cannot give it up. I would like to implement a fast flow table of my own, but then I would have to abandon the many features that depend on conntrack, such as the state match and Linux NAT. I complain a lot about NAT, but plenty of people use it all the same.
I once optimized conntrack lookups based on offline statistics. The idea is simple: replace the uniform hash algorithm with a dynamically computed one. I sample and record every 5-tuple that passes through the box and analyze the data offline. For example, concatenate each 5-tuple into a 104-bit string: 32-bit source IP address + 32-bit destination IP address + 8-bit protocol number + 16-bit source port + 16-bit destination port. (In my implementation I ignored the source port, because it varies so freely that it is not worth considering.) Then, given a hash table of N buckets, slide a window of log2(N) bits over the 104-bit string and find the position where the sampled flows differ the most. Using those bits as the bucket index spreads the flows evenly over the hash buckets. If too many flows still make the collision chains too long, you can nest hash tables in several levels. What does that look like? A balanced N-ary tree. And isn't that just an N-ary trie? Here the hash function is simply "take some bits", which once again shows the unity of tries and hashes.
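The window search described above can be sketched in user-space C. This is only an illustration under assumptions, not the code I used back then: the names window_bits, best_window, KEY_BYTES and MAX_WINDOW_BITS are hypothetical, the sampled keys are assumed to be packed back to back as 13-byte (104-bit) records, and "differ the most" is interpreted as "the window takes the most distinct values across the samples".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KEY_BITS        104              /* 32 + 32 + 8 + 16 + 16 */
#define KEY_BYTES       (KEY_BITS / 8)
#define MAX_WINDOW_BITS 12               /* assumption: at most 4096 buckets */

/* Return `width` bits of `key` starting at bit `start` (bit 0 is the MSB of key[0]). */
static uint32_t window_bits(const uint8_t *key, unsigned start, unsigned width)
{
    uint32_t v = 0;

    for (unsigned i = 0; i < width; i++) {
        unsigned bit = start + i;
        v = (v << 1) | ((key[bit / 8] >> (7 - bit % 8)) & 1u);
    }
    return v;
}

/*
 * Slide a `width`-bit window over all sampled keys and return the start
 * offset whose bits take the most distinct values. `keys` holds `nkeys`
 * records of KEY_BYTES bytes each, stored back to back.
 */
static unsigned best_window(const uint8_t *keys, size_t nkeys, unsigned width)
{
    static uint8_t seen[1u << MAX_WINDOW_BITS];
    unsigned best_start = 0;
    size_t best_distinct = 0;

    assert(width >= 1 && width <= MAX_WINDOW_BITS);
    for (unsigned start = 0; start + width <= KEY_BITS; start++) {
        size_t distinct = 0;

        memset(seen, 0, (size_t)1 << width);
        for (size_t k = 0; k < nkeys; k++) {
            uint32_t b = window_bits(keys + k * KEY_BYTES, start, width);

            if (!seen[b]) {
                seen[b] = 1;
                distinct++;
            }
        }
        if (distinct > best_distinct) {
            best_distinct = distinct;
            best_start = start;
        }
    }
    return best_start;
}
```

At lookup time the "hash function" is then just window_bits(key, best_start, width). Counting distinct values is the crudest possible measure; counting the longest resulting chain per window would be a closer match to the real goal.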
Although the optimization above is elegant, it is still complicated. It borrows from the design of hardware caches. Compared with a hardware cache such as the CPU cache, however, the software imitation is far less effective: software can only resolve hash collisions by traversing or searching one entry at a time, while hardware compares all candidates at once. Purists should not blame the algorithm for not being clever enough; the difference is physical. Hardware is built from gate circuits, and what flows through it is electric current, which spreads in parallel like water. Software is built from logic, and what flows through it is steps; that is what an algorithm is, a sequence of logical steps. There are of course many sophisticated parallel algorithms, but as far as I know many of them do not perform well, and complexity breeds more complexity. In the end I could not afford to go further down that road and dropped it. Besides, developing a complicated algorithm for this would be using a cannon to kill a fly.
Simple optimization of nf_conntrack: adding a cache
If one stage cannot keep up with the processing around it and becomes a bottleneck, add a cache to smooth out the difference. The CPU cache takes this approach, and the same idea should be applied to the efficiency of nf_conntrack. How to do it, though, requires at least some qualitative analysis.
If you capture packets with tcpdump, you will find that a run of consecutive packets almost always belongs to the same 5-tuple flow. This is not absolute, though: sometimes a packet from another flow is interleaved. A plausible capture might look like this:
Forward direction of data stream a
Forward direction of data stream a
Reverse direction of data stream a
Forward direction of data stream a
Forward direction of data stream c
Reverse direction of data stream a
Data stream a (direction unspecified)
Reverse direction of data stream b
Forward direction of data stream b
Forward direction of data stream a
....
Do you see the pattern? Packets arriving at the box exhibit loose temporal locality: packets belonging to one flow tend to arrive back to back. As for spatial locality, many people say it is not obvious, but if you carefully analyze the source/destination IP pairs of flows a, b, c, d, ..., you will find it. It is the fundamental principle behind the design of TCAM hardware forwarding tables. The "some bits" used in a TCAM are the bits that are most scattered across the address space, which exploits spatial locality in reverse. On a core transport network, for example, you will find that a large share of the IP addresses lead to North America or northern Europe.
I had hoped to explain this pattern with mathematics and statistics in this article, but that is not suitable for a public blog. When interviewers ask me this question, I can only answer in a few hurried words. If needed, I am happy to go into depth by email. In a blog post, an in-depth treatment would look like showing off, many readers would drop out, and nobody would leave a comment. The most important thing in a blog is to get to the result quickly, that is, how to do it. So let's get to the point.
If the optimization described above, based on "exploiting spatial locality in reverse" and "hashing by taking some bits", follows the principle that "efficiency comes from regularity", then the price of regularity is complexity, and that complexity is what stopped me from going further. A more universal principle is "efficiency comes from simplicity". I like simple things and simple people, and this time simplicity proved me right once again. Before proceeding, let me briefly describe the bottlenecks of nf_conntrack.
1. nf_conntrack keeps the forward and reverse tuples in one hash table, and insert, delete, and modify operations require protection by a global lock. This leads to a large number of serialized operations.
2. The nf_conntrack hash table uses the jhash algorithm, which involves many operations. When the number of conntrack entries is small, the hash computation itself accounts for a large share of the lookup cost.
[Tip: if you understand symmetric ciphers such as DES/AES, you know that substitution, permutation, and XOR operations scramble data so thoroughly that the output appears unrelated to the input, and that is also what makes the best hash. The price is computational complexity, which is where many encryption and decryption performance problems come from. These operations are so regular (the various boxes) that they can be implemented entirely in hardware circuits; without such hardware they consume a lot of CPU. The same is true of jhash, although to a lesser degree.]
3. The nf_conntrack table is shared globally among all CPUs, which requires data synchronization. RCU mitigates this as far as possible for readers, but what if someone writes to the table?


With the above in mind, the solution takes shape step by step.


1. Build the cache per CPU. This fully follows the principle of cache locality. Isn't the CPU cache the same?
2. Keep the cache as small as possible, store only the flow entries most likely to be hit, and make sure the cost of a cache miss is not too high.
3. Establish a reasonable, adaptive cache replacement policy, so that an entry holding a cache slot earns its place and an entry that does not deserve one stays out.


That is my design philosophy. In the step-by-step implementation I initially kept only one cache entry: the last entry found in the conntrack hash table. This fully exploits temporal locality. During testing, however, I found that if the network carries slow flows such as ICMP, the cache thrashes badly. ICMP is far slower than a TCP flow, but by queuing theory its packets will sooner or later be interleaved into a TCP flow and cause a cache replacement. To avoid this, I added a timestamp field to the conntrack entry, i.e. struct nf_conn. Each time the hash lookup finds the entry, the timestamp is subtracted from the current jiffies and then updated to the current jiffies; cache replacement is performed only when the difference is smaller than a predetermined value. This value can be derived by weighting against the network bandwidth.
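The timestamp rule can be sketched in a few lines of user-space C. This is a minimal sketch under assumptions, not the kernel change itself: struct flow_entry, should_cache and max_gap are hypothetical names, and a plain 64-bit tick counter stands in for jiffies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the timestamp field added to the conntrack entry. */
struct flow_entry {
    uint64_t last_seen;   /* tick at which the hash lookup last found this flow */
};

/*
 * Called each time the hash lookup finds `e`. Records the new timestamp and
 * reports whether the flow may replace a cache entry: only flows whose
 * packets arrive less than `max_gap` ticks apart qualify.
 */
static bool should_cache(struct flow_entry *e, uint64_t now, uint64_t max_gap)
{
    uint64_t gap;

    assert(e != NULL);
    gap = now - e->last_seen;
    e->last_seen = now;
    return gap < max_gap;
}
```

A slow flow such as a once-per-second ping always sees a large gap and never evicts anything, while a busy TCP flow qualifies from its second packet onward.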
But is this perfect? Far from it! Comparing it with CPU cache design, I found that the conntrack cache is a completely different case. For the CPU, thanks to virtual memory, what is cached comes from the address space of a single process (leaving aside the more complex details of CPU caches ...), so temporal locality holds unless a branch or a function call occurs. Network packets, in contrast, are governed entirely by queuing-theory statistics. The namespace of packets is the set of all IP addresses in the world, and a packet from any flow may be interleaved at any moment. One of the most common scenarios is flow alternation. For example, data streams a and b each transmit at a rate comparable to the network bandwidth, so their packets are likely to alternate, one by one or in runs of two or three. In this case, which one should the cache favor? This leads to the third principle: efficiency comes from fairness.

Therefore, my final design looks like the following:


1. The cache is a linked list. The length of the list is a parameter worth fine-tuning.
Cache list too short: flow entries are swapped between the conntrack hash table and the cache too frequently.
Cache list too long: flows that cannot hit the cache pay too high a price for each miss.
Winner principle: winner takes all. Whoever has will be given more, and will have an abundance; whoever does not have, even what he has will be taken from him. (Matthew)
Balancing principle 1, for the winner: traversing the cache list must not take longer than the standard hash computation plus traversing the collision chain (on average).
Balancing principle 2, for the loser: if the list is traversed without a hit, some time is lost that need not have been, but the loss stays within an acceptable range.

Effect: the faster a flow's packets arrive, the more likely it is to hit the cache at very low cost; the slower they arrive, the less likely it is to hit the cache, yet it does not pay a high price for missing.
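The behavior of such a short cache list can be tried out in user space before touching the kernel. The following is a simplified sketch under assumptions: struct flow, struct flow_cache, cache_lookup and cache_insert are hypothetical names, an integer id stands in for the tuple comparison, a fixed array stands in for the list, and reference counting and RCU are left out entirely. In the kernel there would be one such cache per CPU.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CACHE 4

struct flow {
    int id;                            /* stand-in for the conntrack tuple */
};

struct flow_cache {
    struct flow *slot[MAX_CACHE];      /* slot[0] is checked first */
};

/* Scan the cache; on a hit, swap the entry with slot 0 so hot flows stay in front. */
static struct flow *cache_lookup(struct flow_cache *c, int id)
{
    assert(c != NULL);
    for (int i = 0; i < MAX_CACHE; i++) {
        struct flow *f = c->slot[i];

        if (f && f->id == id) {
            if (i > 0) {
                c->slot[i] = c->slot[0];
                c->slot[0] = f;
            }
            return f;
        }
    }
    return NULL;
}

/*
 * Put a flow found by the slow path at the front, shifting the others back.
 * Returns the entry pushed out of the last slot, or NULL if it was empty.
 */
static struct flow *cache_insert(struct flow_cache *c, struct flow *f)
{
    struct flow *evicted;

    assert(c != NULL);
    evicted = c->slot[MAX_CACHE - 1];
    for (int j = MAX_CACHE - 1; j > 0; j--)
        c->slot[j] = c->slot[j - 1];
    c->slot[0] = f;
    return evicted;
}
```

A miss costs at most MAX_CACHE comparisons before falling back to the hash table, which is the bounded loss that balancing principle 2 asks for.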


2. Timestamp-based cache replacement policy
Cache replacement is performed only when the interval between consecutive packets of a flow is smaller than a dynamically calculated value.

My intermediate test code is as follows:


// Modify net/netfilter/nf_conntrack_core.c
// Email: marywangran@126.com

// 1. Definitions
#define A
#ifdef A
/*
 * How MAX_CACHE should be calculated dynamically:
 *   cache list length = average collision chain length / 3
 * where:
 *   average collision chain length =
 *       net.nf_conntrack_max / net.netfilter.nf_conntrack_buckets
 *   3 = empirical value
 */
#define MAX_CACHE 4

struct conntrack_cache {
    struct nf_conntrack_tuple_hash *caches[MAX_CACHE];
};
DEFINE_PER_CPU(struct conntrack_cache, conntrack_cache);
#endif

// 2. Modify resolve_normal_ct
static inline struct nf_conn *
resolve_normal_ct(struct net *net, struct sk_buff *skb,
                  unsigned int dataoff, u_int16_t l3num, u_int8_t protonum,
                  struct nf_conntrack_l3proto *l3proto,
                  struct nf_conntrack_l4proto *l4proto,
                  int *set_reply, enum ip_conntrack_info *ctinfo)
{
    struct nf_conntrack_tuple tuple;
    struct nf_conntrack_tuple_hash *h = NULL;
    struct nf_conn *ct = NULL;
#ifdef A
    int i;
    int hit = 0;    /* set when the entry was found in the per-CPU cache */
    struct conntrack_cache *cache;
#endif

    if (!nf_ct_get_tuple(skb, skb_network_offset(skb), dataoff, l3num,
                         protonum, &tuple, l3proto, l4proto)) {
        pr_debug("resolve_normal_ct: Can't get tuple\n");
        return NULL;
    }

#ifdef A
    cache = &__get_cpu_var(conntrack_cache);
    rcu_read_lock();
    if (0 /* optimization 3 */) {
        goto slowpath;
    }
    for (i = 0; i < MAX_CACHE; i++) {
        struct nf_conntrack_tuple_hash *ch = cache->caches[i];
        struct nf_conntrack_tuple_hash *ch0 = cache->caches[0];

        if (ch && nf_ct_tuple_equal(&tuple, &ch->tuple)) {
            ct = nf_ct_tuplehash_to_ctrack(ch);
            if (unlikely(nf_ct_is_dying(ct) ||
                         !atomic_inc_not_zero(&ct->ct_general.use))) {
                h = NULL;
                goto slowpath;
            } else {
                /* Re-check the tuple now that we hold a reference */
                if (unlikely(!nf_ct_tuple_equal(&tuple, &ch->tuple))) {
                    nf_ct_put(ct);
                    h = NULL;
                    goto slowpath;
                }
            }
            /*
             * Optimization 1 (overview):
             * Do not promote the hit entry straight to the head. Instead,
             * move it up by a number of steps based on the interval between
             * two cache hits, with the number of steps inversely
             * proportional to that interval. This avoids violent thrashing
             * of the cache queue itself. Weighting the hit interval against
             * historical interval values would give even better results.
             */
            /* Raise the priority of the hit entry, based on temporal locality */
            if (i > 0 /* && optimization 1 */) {
                cache->caches[0] = ch;
                cache->caches[i] = ch0;
            }
            h = ch;
            hit = 1;
            break;    /* a tuple can match at most one cache entry */
        }
    }
slowpath:
    /* ct is derived from h again below; never keep a stale pointer here */
    ct = NULL;
    rcu_read_unlock();
    if (!h)
#endif
    /* look for tuple match */
    h = nf_conntrack_find_get(net, &tuple);
    if (!h) {
        h = init_conntrack(net, &tuple, l3proto, l4proto, skb, dataoff);
        if (!h)
            return NULL;
        if (IS_ERR(h))
            return (void *)h;
    }
#ifdef A
    else {
        int j;
        struct nf_conn *ctp;
        struct nf_conntrack_tuple_hash *chp;

        /* Already in the cache: inserting it again would duplicate it */
        if (hit)
            goto skip;
        /*
         * Optimization 2 (overview):
         * Perform cache replacement only when the interval between two
         * consecutive packets is less than n, to avoid cache thrashing
         * caused by slow flows such as ICMP.
         */
        if (0 /* optimization 2 */) {
            goto skip;
        }
        /*
         * Optimization 3 (overview):
         * Enable the cache only when the total number of conntracks is more
         * than four times the number of hash buckets. With few conntracks,
         * one hash computation locates the entry directly or after walking
         * a short collision chain, and the cache would only reduce
         * performance.
         */
        if (0 /* optimization 3 */) {
            goto skip;
        }
        ct = nf_ct_tuplehash_to_ctrack(h);
        nf_conntrack_get(&ct->ct_general);    /* reference held by the cache */
        chp = cache->caches[MAX_CACHE - 1];
        for (j = MAX_CACHE - 1; j > 0; j--) {
            cache->caches[j] = cache->caches[j - 1];
        }
        cache->caches[0] = h;
        if (chp) {
            /* Drop the cache's reference to the evicted entry */
            ctp = nf_ct_tuplehash_to_ctrack(chp);
            nf_conntrack_put(&ctp->ct_general);
        }
    }
skip:
    if (!ct) {
        ct = nf_ct_tuplehash_to_ctrack(h);
    }
#else
    ct = nf_ct_tuplehash_to_ctrack(h);
#endif

    /* It exists; we have (non-exclusive) reference. */
    if (NF_CT_DIRECTION(h) == IP_CT_DIR_REPLY) {
        *ctinfo = IP_CT_ESTABLISHED + IP_CT_IS_REPLY;
        /* Please set reply bit if this packet OK */
        *set_reply = 1;
    } else {
        /* Once we've had two way comms, always ESTABLISHED. */
        if (test_bit(IPS_SEEN_REPLY_BIT, &ct->status)) {
            pr_debug("nf_conntrack_in: normal packet for %p\n", ct);
            *ctinfo = IP_CT_ESTABLISHED;
        } else if (test_bit(IPS_EXPECTED_BIT, &ct->status)) {
            pr_debug("nf_conntrack_in: related packet for %p\n", ct);
            *ctinfo = IP_CT_RELATED;
        } else {
            pr_debug("nf_conntrack_in: new packet for %p\n", ct);
            *ctinfo = IP_CT_NEW;
        }
        *set_reply = 0;
    }
    skb->nfct = &ct->ct_general;
    skb->nfctinfo = *ctinfo;
    return ct;
}

// 3. Modify nf_conntrack_init
int nf_conntrack_init(struct net *net)
{
    int ret;
#ifdef A
    int i;
#endif

    if (net_eq(net, &init_net)) {
        ret = nf_conntrack_init_init_net();
        if (ret < 0)
            goto out_init_net;
    }
    ret = nf_conntrack_init_net(net);
    if (ret < 0)
        goto out_net;

    if (net_eq(net, &init_net)) {
        /* For use by REJECT target */
        rcu_assign_pointer(ip_ct_attach, nf_conntrack_attach);
        rcu_assign_pointer(nf_ct_destroy, destroy_conntrack);

        /* Howto get NAT offsets */
        rcu_assign_pointer(nf_ct_nat_offset, NULL);
    }
#ifdef A
    /* Initialize the per-CPU conntrack cache queue */
    for_each_possible_cpu(i) {
        int j;
        struct conntrack_cache *cache;

        cache = &per_cpu(conntrack_cache, i);
        for (j = 0; j < MAX_CACHE; j++) {
            cache->caches[j] = NULL;
        }
    }
#endif
    return 0;

out_net:
    if (net_eq(net, &init_net))
        nf_conntrack_cleanup_init_net();
out_init_net:
    return ret;
}
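The test code leaves the MAX_CACHE sizing rule and the optimization 3 check as comments and `if (0)` placeholders. The arithmetic they describe can be sketched as two small helpers. This is an assumption-laden sketch, not part of the patch: cache_len and cache_worthwhile are hypothetical names, the values would have to be read from the conntrack configuration by the caller, and rounding the length up to at least 1 is my own choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Cache list length = average collision chain length / 3, where the average
 * chain length is the maximum number of conntracks divided by the number of
 * hash buckets. The divisor 3 is the empirical value from the code comment.
 */
static unsigned cache_len(unsigned conntrack_max, unsigned buckets)
{
    unsigned len;

    assert(buckets > 0);
    len = (conntrack_max / buckets) / 3;
    return len ? len : 1;    /* assumption: never size the cache to zero */
}

/*
 * Optimization 3: the cache only pays off when the number of live conntracks
 * is more than four times the number of hash buckets.
 */
static bool cache_worthwhile(unsigned count, unsigned buckets)
{
    return (uint64_t)count > 4ull * buckets;
}
```

For example, with 16384 buckets and a conntrack maximum of 196608, the average chain length is 12 and the cache length comes out as 4, which matches the MAX_CACHE value hard-coded in the test code.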


Anyone interested is welcome to test it. Results and questions can be sent directly to the email address in the code comment.

Copyright Disclaimer: This article is an original article by the blogger and cannot be reproduced without the permission of the blogger.
