Linux 4.6 kernel optimizations for TCP Reuseport

Source: Internet
Author: User

After a busy day, getting home from work always brings some relief, that much is certain. Time waits for no one, so whenever I have time to spare I want to spend it on the things I like. On the bus home from work, on the sorry screen of my mobile phone, I read about some of the new features of Linux 4.6. Two of them interest me. The first is reuseport, which this article explains; the other is KCM (Kernel Connection Multiplexor), which I plan to write about this weekend. All of this comes from memory and from my own experience. The weather forecast happened to say there would be a rainstorm tonight, which put me in the mood to write, so I wrote this casually; if anything is insufficient or unclear, I hope someone will point it out. I want to start with a Q&A, which is also what most readers expect. If you really understand what my Q&A is trying to say, you need not read what comes after it.
Q&A
When someone asks me about reuseport, our conversation basically goes as follows:
Q1: What is reuseport?
A1: reuseport (the SO_REUSEPORT socket option) is a socket reuse mechanism that allows you to bind multiple sockets to the same IP address/port pair, so that multiple service instances can accept connections on the same port.

Q2: When a connection arrives, how does the system decide which socket handles it?
A2: Different kernels handle this differently. In general, reuseport has two modes: hot backup mode and load balancing mode. In earlier kernel versions, even those patched to support the reuseport option, only hot backup mode existed. From the 3.9 kernel onward, everything changed to load balancing mode. The two modes do not coexist, although I have always wished they could.

"I did not go on to explain what hot backup mode and load balancing mode are, which means I was waiting for the questioner to ask a follow-up."

Q3: What are hot backup mode and load balancing mode?
A3: Let me explain them separately.
Hot backup mode: you create n reuseport sockets, but only one of them works and the others serve as backups. Only when the current working socket is no longer available does the next one take over. The order in which they take over depends on the implementation.
Load balancing mode: all n reuseport sockets you create can work at the same time, and when a connection arrives the system picks one socket to handle it. This achieves load balancing and reduces the pressure on any single service instance.

Q4: How is a socket picked?
A4: This differs between hot backup mode and load balancing mode.
Hot backup mode: in general, all reuseport sockets with the same IP address/port are kept on a linked list, and the first one is used. If that socket dies, it is removed from the list and the second one becomes the first.
Load balancing mode: as in hot backup mode, all reuseport sockets with the same IP address/port are kept on a linked list (you can also think of it as an array, which is more convenient). When a connection arrives, the source IP/source port of the packet is fed into a hash function, and the result is taken modulo the number of reuseport sockets. The resulting index gives the array position of the socket that will handle the connection.

Q5: Could the first packet be processed by socket m and subsequent packets by socket n?
A5: This one is easy. Look at the algorithm: as long as the packets belong to the same flow (the same five-tuple), every hash yields the same index, so the same socket always handles them!
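
The selection rule from A4 and the guarantee from A5 can be sketched in a few lines of C. This is only an illustration of the principle: the FNV-1a mixer and the names flow_hash and pick_socket are my own stand-ins, and the kernel uses a different, keyed hash.

```c
#include <assert.h>
#include <stdint.h>

/* Toy 32-bit hash (FNV-1a over the six input bytes). This is a stand-in
 * for illustration; it is not the hash the kernel uses. */
static uint32_t flow_hash(uint32_t src_ip, uint16_t src_port)
{
    const uint8_t bytes[6] = {
        (uint8_t)(src_ip >> 24), (uint8_t)(src_ip >> 16),
        (uint8_t)(src_ip >> 8),  (uint8_t)src_ip,
        (uint8_t)(src_port >> 8), (uint8_t)src_port,
    };
    uint32_t h = 2166136261u;
    for (int i = 0; i < 6; i++) {
        h ^= bytes[i];
        h *= 16777619u;
    }
    return h;
}

/* Load balancing mode: index = hash(source IP, source port) mod n,
 * where n is the number of reuseport sockets in the group. */
static unsigned pick_socket(uint32_t src_ip, uint16_t src_port, unsigned n)
{
    assert(n > 0);
    return flow_hash(src_ip, src_port) % n;
}
```

Because the index depends only on the source address, the source port, and n, every packet of the same flow lands on the same socket as long as n and the socket positions do not change, which is exactly the condition Q6 goes on to question.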

Q6: Why does this feel a bit mysterious to me? "This last question is one I asked and answered myself, prompted by a colleague ..."
A6: It really is mysterious! TCP maintains its own connections, so we will leave it aside. For UDP, suppose a transaction needs 4 packets. The tuple hash of the first packet indexes to thread 1's socket, so naturally thread 1 processes it. Before the second packet arrives, thread 1 dies, and the position of its socket is taken by another thread's socket, say thread 2's! When the second packet arrives, it is handled by thread 2's socket, but thread 2 knows nothing about the transaction state that thread 1 held for this connection ...

On the implementation of reuseport
The above is basically the whole Q&A about reuseport. More interesting questions and answers can be derived from it, and I strongly recommend it as an interview question.
Look at Q6/A6. The problem does exist, but it is not a big one: it comes from the implementation, not from the reuseport mechanism itself. We can use an array instead of a linked list and keep the array size fixed. Even if thread n dies, the position of its socket is not taken by an existing socket; it must be taken by the socket of a new replacement thread (after all, thread n is dead and a new one has to be created to replace it). That solves the problem. All that is needed is for each thread to serialize its state information before it dies and for the new thread to deserialize it on startup.

So far I have not mentioned anything about the Linux 4.6 kernel optimization for TCP reuseport. A typical clickbait title!
However, the above is almost all there is to say; anything more would just be pasting code. In fact, before the optimizations in Linux 4.6 (for TCP) and Linux 4.5 (for UDP), my answers above were inaccurate. Before 4.5, the Linux kernel's reuseport implementation was not what I described, but in order to explain the concepts and the mechanism I had to use the easier-to-understand description above. The principle is one thing and the implementation is another, so please forgive the mismatch between my explanation of the principle and the Linux implementation.
You could say that the so-called reuseport optimization in 4.5/4.6 is simply a more natural way of implementing it; it was the previous implementation that was unnatural! Looking back over the past three years, many people asked me about reuseport, including several interviewers. If they pressed further with "Q: Are you sure this is how Linux implements it?", I had to answer: "No! It is not. The Linux implementation is rubbish!" Then they would hear me go on about a lot of abstract things, and the conversation would end in a less relaxed atmosphere.

To summarize, the so-called reuseport optimization in Linux 4.5/4.6 is mainly about lookup speed. Before the optimization, the kernel had to traverse every socket on the hash collision list to know which one to take. The choice was based on a bubbling score mechanism, so until one full pass was complete there was no way to tell which socket had the highest score. It was this inefficient because the kernel mixed all reuseport sockets with all other sockets, making the lookup flat. The normal approach is to put them into a group and do a hierarchical lookup: first find the group (which is easy), then find the specific socket within the group. Linux 4.5 made this optimization for UDP, and Linux 4.6 brought it to TCP.

Imagine that a total of 10,000 sockets in the system hash to the same collision list, and 9,950 of them belong to the same reuseport group. With the old algorithm, all 10,000 sockets have to be traversed. With the group-based algorithm, at most 51 entries have to be traversed (the 50 other sockets plus the group), and once the group is found, a single hash step yields the index of the target socket!
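
The arithmetic behind the 10,000 versus 51 comparison can be written down as a tiny worked example. The function names below are hypothetical, and the model assumes a single reuseport group that appears on the collision list as one entry.

```c
#include <assert.h>

/* Flat lookup: every socket on the collision list is scored, so the
 * worst case visits all of them. */
static unsigned flat_visits(unsigned total)
{
    return total;
}

/* Grouped lookup: the group counts as a single entry on the list, so the
 * worst case visits the sockets outside the group plus that one entry. */
static unsigned grouped_visits(unsigned total, unsigned in_group)
{
    assert(in_group >= 1 && in_group <= total);
    return (total - in_group) + 1;
}
```

With total = 10000 and in_group = 9950, flat_visits gives 10000 and grouped_visits gives 51; the hash step inside the group adds a constant cost rather than another traversal.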

Reuseport lookup implementation before Linux 4.5 (4.3 kernel)
The following is the implementation in the Linux 4.3 kernel. It is not optimized, and it is plainly not intuitive. It traverses the hash collision list to determine exactly which reuseport socket to use:
    result = NULL;
    badness = 0;
    udp_portaddr_for_each_entry_rcu(sk, node, &hslot2->head) {
        score = compute_score2(sk, net, saddr, sport,
                      daddr, hnum, dif);
        if (score > badness) { /* bubbling */
            /* found a more appropriate socket, need to re-hash */
            result = sk;
            badness = score;
            reuseport = sk->sk_reuseport;
            if (reuseport) {
                hash = udp_ehashfn(net, daddr, hnum,
                           saddr, sport);
                matches = 1;
            }
        } else if (score == badness && reuseport) { /* locate among reuseport sockets by hash */
            /* found another socket of the same reuseport group, used for positioning */
            matches++;
            if (reciprocal_scale(hash, matches) == 0)
                result = sk;
            hash = next_pseudo_random32(hash);
        }
    }

The traversal is necessary because all reuseport sockets and all other sockets are flattened into the same table, with no prior knowledge of how many reuseport groups there are or how many sockets each group contains. Take the following example:
reuseport group1 - 0.0.0.0:1234 (sk1, sk2, sk3, sk4)
reuseport group2 - 1.1.1.1:1234 (sk5, sk6, sk7)
other sockets (sk8, sk9, sk10, sk11)

Assuming they all hash to the same position, a possible order is:
sk10-sk2-sk3-sk8-sk5-sk7-...
Although sk2 already matches, the more precise sk5 may come after it, which means all 11 sockets must be traversed before you know which one bubbles to the top.

Reuseport lookup in Linux 4.5 (for UDP)/4.6 (for TCP)
Let's look at what is new for reuseport in the 4.5 and 4.6 kernels. Something magical has been added:
    result = NULL;
    badness = 0;
    udp_portaddr_for_each_entry_rcu(sk, node, &hslot2->head) {
        score = compute_score2(sk, net, saddr, sport,
                      daddr, hnum, dif);
        if (score > badness) {
            /* in the reuseport case this means a more appropriate
             * socket group was found, need to re-hash */
            result = sk;
            badness = score;
            reuseport = sk->sk_reuseport;
            if (reuseport) {
                hash = udp_ehashfn(net, daddr, hnum,
                           saddr, sport);
                if (select_ok) {
                    struct sock *sk2;

                    /* a group was found; now hash within the group */
                    sk2 = reuseport_select_sock(sk, hash, skb,
                                sizeof(struct udphdr));
                    if (sk2) {
                        result = sk2;
                        select_ok = false;
                        goto found;
                    }
                }
                matches = 1;
            }
        } else if (score == badness && reuseport) {
            /* This branch is meant to look for a better-matching
             * reuseport group when the hierarchical lookup does not
             * apply; note that 4.5/4.6 looks for a reuseport group.
             * In a sense, this falls back to the pre-4.5 algorithm. */
            matches++;
            if (reciprocal_scale(hash, matches) == 0)
                result = sk;
            hash = next_pseudo_random32(hash);
        }
    }

Let's focus on reuseport_select_sock. This function is the key to the second level of the lookup, the one inside the group. Strictly speaking it should not be called a lookup at all; positioning is the more appropriate word:
    struct sock *reuseport_select_sock(struct sock *sk,
                       u32 hash,
                       struct sk_buff *skb,
                       int hdr_len)
    {
        ...
        prog = rcu_dereference(reuse->prog);
        socks = READ_ONCE(reuse->num_socks);
        if (likely(socks)) {
            /* paired with smp_wmb() in reuseport_add_sock() */
            smp_rmb();

            if (prog && skb)
                /* BPF lets user space inject its own positioning logic,
                 * which makes policy-based load balancing easier */
                sk2 = run_bpf(reuse, socks, prog, skb, hdr_len);
            else
                /* reciprocal_scale simply maps the hash into the
                 * interval [0, socks) */
                sk2 = reuse->socks[reciprocal_scale(hash, socks)];
        }
        ...
    }

It's not that magical, is it? It is basically all covered in the Q&A.
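
For readers who want to try the positioning step outside the kernel, here is a user-space copy of the arithmetic that reciprocal_scale() performs: a multiply and a shift that map a 32-bit hash onto [0, n) without a division. The wrapper group_index is a hypothetical name added for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as the kernel's reciprocal_scale(): scale a 32-bit
 * value into the half-open interval [0, ep_ro). */
static uint32_t reciprocal_scale(uint32_t val, uint32_t ep_ro)
{
    return (uint32_t)(((uint64_t)val * ep_ro) >> 32);
}

/* Hypothetical helper: index of the socket chosen inside a group of
 * num_socks sockets for a given flow hash. */
static uint32_t group_index(uint32_t hash, uint32_t num_socks)
{
    assert(num_socks > 0);
    return reciprocal_scale(hash, num_socks);
}
```

With four sockets, a hash of 0 maps to index 0, 0x80000000 maps to index 2, and 0xFFFFFFFF maps to index 3, so the result never reaches num_socks itself.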

My own reuseport lookup implementation
When I first saw this idea from Google, the Linux kernel did not yet have the implementation built in. At the time I was doing multi-process optimization of OpenVPN on the 2.6.32 kernel, using UDP. After the frustrating failure of a multi-process UDP design, I ported Google's reuseport patch, only to be even more disappointed: how could such a wonderful and simple idea (magic must be simple!) have been implemented like this? (In fact, that implementation persisted until the 4.5 kernel.)

Because I could be sure that there were no other reuseport sockets in the system, and that the device had 16 CPUs, I defined the array as follows:
    #define MAX        18
    struct sock *reusesk[MAX];
Whenever OpenVPN creates a reuseport UDP socket, I add it, in order, to the reusesk array. The final lookup algorithm is modified as follows:
    result = NULL;
    badness = 0;
    udp_portaddr_for_each_entry_rcu(sk, node, &hslot2->head) {
        score = compute_score2(sk, net, saddr, sport,
                      daddr, hnum, dif);
        if (score > badness) {
            result = sk;
            badness = score;
            reuseport = sk->sk_reuseport;
            if (reuseport) {
                hash = inet_ehashfn(net, daddr, hnum,
                            saddr, htons(sport));
    #ifdef EXTENSION
                /* fetch the socket directly by index */
                result = reusesk[hash % MAX];
                /* if there is only one reuseport group, return directly;
                 * otherwise fall back to the original logic */
                if (num_reuse == 1)
                    break;
    #endif
                matches = 1;
            }
        } else if (score == badness && reuseport) {
            matches++;
            if (((u64)hash * matches) >> 32 == 0)
                result = sk;
            hash = next_pseudo_random32(hash);
        }
    }
The modification is very simple. In addition, whenever a socket is destroyed, only the array element at its index is set to NULL; the elements at other indexes are unaffected. When a new socket is created later, it only needs to find a NULL element and take that position. This solves the problem of packets being directed to the wrong socket because socket positions changed (the element at a given index is no longer the same socket once positions move). The impact of that problem is sometimes dramatic, for example when all subsequent sockets shift forward, which affects many sockets; sometimes it is slight, for example when the last socket is moved into the vacated position. Regrettably, even the 4.6 kernel uses the latter method, filling from the tail: only one socket moves, but the problem still exists. Fortunately, reuseport in the 4.6 kernel supports BPF, which means you can write your own socket selection code in user space and inject it into the kernel's reuseport selection logic at run time.
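
The difference between the two removal policies discussed above can be sketched as follows. This is a user-space model under my own assumptions: sockets are represented by small integer ids with 0 meaning "empty", SLOTS is an illustrative size rather than the MAX used earlier, and all names are hypothetical.

```c
#include <assert.h>

#define SLOTS 4   /* illustrative group size, not the MAX defined earlier */

struct group {
    int sock[SLOTS];   /* socket ids; 0 marks an empty slot */
    unsigned count;    /* number of used slots (tail-fill policy only) */
};

/* Tail-fill removal: the last socket moves into the vacated slot and the
 * dense part of the array shrinks by one. */
static void remove_tail_fill(struct group *g, unsigned idx)
{
    assert(idx < g->count);
    g->sock[idx] = g->sock[g->count - 1];
    g->sock[g->count - 1] = 0;
    g->count--;
}

/* Fixed-slot removal: only the vacated slot is cleared; every other
 * socket keeps its index. */
static void remove_fixed_slot(struct group *g, unsigned idx)
{
    assert(idx < SLOTS);
    g->sock[idx] = 0;
}

/* Fixed-slot insertion: a replacement socket takes the first empty slot.
 * Returns the slot index, or -1 if the array is full. */
static int add_fixed_slot(struct group *g, int id)
{
    for (unsigned i = 0; i < SLOTS; i++) {
        if (g->sock[i] == 0) {
            g->sock[i] = id;
            return (int)i;
        }
    }
    return -1;
}
```

Starting from sockets {11, 12, 13, 14} and removing index 1, tail-fill leaves {11, 14, 13} so flows that hashed to the last socket now find it elsewhere, while the fixed-slot policy leaves {11, 0, 13, 14} and a replacement socket can later be added back at index 1.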

Q7&A7, Q8&A8: the last questions
Q7: Is there a unified way to handle the addition and removal of reuseport sockets? For example, dynamic behavior such as creating a new worker thread for more load, or a worker thread dying.
A7: Yes. It is the "consistent hash" algorithm.

First, we hash every socket in a reuseport socket group into a 16-bit linear space, using its PID together with a unique identifier such as its memory address (the first socket occupies the endpoint of the space).

The n sockets thus divide the linear space into n intervals. We call this hashing step the first type of hash! Next, when a packet arrives, how is it mapped to a socket? This takes a second type of hash, whose input is the source IP/source port pair carried by the packet; its result also falls in the 16-bit linear space above.

The socket selected is the one whose first-type hash value is the first one to the left of the second-type hash value.

It is easy to see that if a first-type hash node is deleted or added (meaning a socket is destroyed or created), the only second-type hash values affected are those between that node and the first first-type hash node on its right.

This is a simplified version of the "consistent hash" principle. If you want creation and destruction to have no effect at all, do not bother with these algorithms; just stick with the array.
With the principle understood, let's look at how to implement this mini consistent hash. A real consistent hash implementation is too heavyweight, and plenty of material on it can be found on the Internet; I only offer one idea, not the best one. From the description above, what is ultimately needed is an "interval lookup": determining which interval, among those carved out by the first-type hash values, the second-type hash result falls into. Intuitively, interval matching on a binary tree fits, so I split the intervals above into a binary tree.

Germanic to the left, Rome to the right! After that, a binary search on this tree does the job.
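
As a rough sketch of the interval lookup, the code below uses a sorted array with binary search in place of the binary tree described above; the two are equivalent for lookup purposes, but this is my substitution, not the layout from the text. It assumes the first-type hash values are already computed, sorted in ascending order, and that nodes[0] is 0 so that every key has a node on its left. The name ring_lookup is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* nodes[] holds the first-type hash values (one per socket), sorted in
 * ascending order, with nodes[0] == 0. Returns the index of the largest
 * node that is <= key, i.e. the first node to the left of the key. */
static size_t ring_lookup(const uint16_t *nodes, size_t n, uint16_t key)
{
    assert(n > 0 && nodes[0] == 0);
    size_t lo = 0, hi = n;   /* invariant: nodes[lo] <= key, answer in [lo, hi) */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (nodes[mid] <= key)
            lo = mid;
        else
            hi = mid;
    }
    return lo;
}
```

With nodes {0, 1000, 30000, 50000}, a key of 40000 belongs to the node at 30000. If that node is removed, the key falls to the node at 1000, while keys outside the interval [30000, 50000) keep their previous owner, which is the limited impact described above.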

Q8: What is the difference between TCP and UDP in reuseport lookup processing?
A8: Each TCP connection is uniquely identified by its full five-tuple, which the connection itself maintains, so finding an established TCP connection only requires a lookup by that unique five-tuple. TCP sockets in the listen state are different: when a SYN arrives from a client, the five-tuple has not yet been established, and that is the moment to decide which socket in the reuseport group handles the SYN. Once that socket is determined, a unique five-tuple identity can be established with the client that sent the SYN. So for TCP, only sockets in the listen state need the reuseport mechanism. For UDP, all sockets need the reuseport mechanism, because UDP maintains no connection information; that is, the protocol stack does not record which clients it has communicated or is communicating with. For every packet, the reuseport lookup logic is needed to map it to a socket that handles it.

Do you know what the saddest thing about every day is? Repeating the same words over and over again, every day.
