Multi-core Linux Kernel Path Optimization: TCP Optimization on Multi-core Platforms

0. Declarations

0). About the source

Yesterday I promised the shoe-factory boss that I would finish this, but I was in a complicated mood last night and did not get it done. I wrote the rest on the plane today...

1). Native Linux code

This article assumes that the reader already has a clear understanding of the TCP implementation in the Linux kernel, so it does not analyze the kernel's TCP source code, such as the tcp_v4_rcv path.

2). About my optimization code

Because many complicating factors are involved, this article does not provide complete, compilable optimization source code. The full source code was not written by me alone, and I do not dare to publish it without my collaborators' consent; the idea, however, is mine, so what I can show is the principle and the parts of the code I was responsible for.

3). About TCP

This article does not discuss the TCP protocol itself or its details (for those, refer to the RFCs and the various papers on flow and congestion control algorithms); it covers only optimizations to the framework around the TCP protocol.

1. The Linux TCP implementation

1.1. At the protocol layer, the Linux TCP implementation is divided into two parts.

1) Connection handshake processing. TCP first establishes a connection through the three-way handshake; only after that can data be transmitted. The TCP specification does not mandate any particular implementation; the current socket API is merely one of them, and Linux implements the BSD socket specification.
In Linux, TCP connection processing and data transmission processing are combined at the code level.
2) Data transmission processing. This part is comparatively simple.
1.2. In terms of system architecture, Linux TCP has two parts:
1) Kernel protocol-stack processing, which runs in softirq context. At the top of this path there are three branches: copy the skb directly into the user buffer, simply put the skb on the prequeue queue, or simply put the skb on the backlog queue.
2) User-process processing, i.e. the socket API, which runs in the context of the user process. From section 1.1 we know that the two parts are merged at the code level, so a single socket is operated on by several execution streams, which requires considerable locking overhead.
1.3. Overall diagram of connection processing

Below is an overall diagram of connection processing. The red lines mark where contention occurs, and it is exactly these places that prevent TCP connections from being processed concurrently:

 

 

Let me explain the meanings of these red lines one by one:
Red Line 1: The user process and the protocol stack operate on the same socket. If the user process is copying a packet, the protocol stack must not touch the socket at the same time, and vice versa, so the socket has to be locked temporarily. A single big lock would be too expensive, however, so the Linux kernel protocol stack uses a more elegant scheme.
How the protocol stack locks the socket: the protocol stack runs in softirq context, which may execute in whatever context follows the hard interrupt and therefore cannot sleep, so a spinlock, slock, maintained by the socket itself, must be used. This lock not only protects against races with user processes, it also protects against races between protocol-stack operations on the same socket running on different CPUs (a very common case: many connection requests can arrive at one listener socket at the same time [sadly, they cannot be processed at the same time!]).

How the user process locks the socket: a user process can sleep at any time, so a non-spinning lock, xlock, can be used to protect against races among user processes. However, to be mutually exclusive with the kernel protocol stack's softirq operations on the same socket, slock must be acquired before trying to take xlock, and slock is released once xlock has been taken, or before sleeping when xlock cannot be taken. The related logic is as follows:

 

 

stack_process {
    ...
    spin_lock(socket->slock);        /* 1 */
    process(skb);
    spin_unlock(socket->slock);
    ...
}

user_process {
    ...
    spin_lock(socket->slock);        /* 2 */
    while (true) {
        ...
        spin_unlock(socket->slock);
        sleep;
        spin_lock(socket->slock);    /* 2 */
        if (xlock acquired successfully)
            break;
    }
    spin_unlock(socket->slock);
    ...
}

It can be seen that Linux uses the scheme above to solve two classes of problems neatly: first, synchronization and mutual exclusion among the execution streams operating on a socket; second, the different locking requirements of softirq context and process context.

 

With the socket locks understood, let us look at what the backlog queue does. It is quite simple: the softirq pushes the skb onto a queue belonging to the process that currently holds the socket, and when that process is done and about to release the socket, it checks the queue and, if skbs are found there, processes them in its own context. This is essentially a transfer of responsibility, and the transfer also brings a small optimization: the skb is processed directly in the context of the user process that owns the socket, which avoids some cache refills.
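To make the slock/xlock interplay and the backlog hand-off concrete, here is a minimal user-space analogue written with pthreads. It is a sketch of the idea under my own naming (fake_sock, handle_pkt, and so on are hypothetical), not the kernel code: a spinlock plays the role of slock, an owned flag plays the role of xlock, and packets that arrive while the socket is owned are parked on a backlog that the owner drains on release.

    #include <pthread.h>
    #include <sched.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct pkt { struct pkt *next; /* payload omitted */ };

    struct fake_sock {
        pthread_spinlock_t slock;   /* analogue of the socket's slock */
        bool owned;                 /* analogue of xlock: a user context holds the socket */
        struct pkt *backlog;        /* packets deferred while the socket was owned */
    };

    static void handle_pkt(struct pkt *p) { (void)p; /* protocol processing, simplified */ }

    static void fake_sock_init(struct fake_sock *sk)
    {
        pthread_spin_init(&sk->slock, PTHREAD_PROCESS_PRIVATE);
        sk->owned = false;
        sk->backlog = NULL;
    }

    /* Softirq-side analogue: process the packet now, or park it on the backlog. */
    static void stack_process(struct fake_sock *sk, struct pkt *p)
    {
        pthread_spin_lock(&sk->slock);
        if (!sk->owned) {
            handle_pkt(p);
        } else {
            p->next = sk->backlog;      /* the owner will drain this on release */
            sk->backlog = p;
        }
        pthread_spin_unlock(&sk->slock);
    }

    /* User-side analogue of taking xlock: yield and retry while someone owns it. */
    static void user_lock(struct fake_sock *sk)
    {
        pthread_spin_lock(&sk->slock);
        while (sk->owned) {
            pthread_spin_unlock(&sk->slock);
            sched_yield();              /* stand-in for sleeping on a wait queue */
            pthread_spin_lock(&sk->slock);
        }
        sk->owned = true;
        pthread_spin_unlock(&sk->slock);
    }

    /* User-side analogue of releasing the socket: drain the backlog in process context. */
    static void user_release(struct fake_sock *sk)
    {
        pthread_spin_lock(&sk->slock);
        while (sk->backlog) {
            struct pkt *p = sk->backlog;
            sk->backlog = p->next;
            handle_pkt(p);
        }
        sk->owned = false;
        pthread_spin_unlock(&sk->slock);
    }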
Red Line 2: This line was already touched on while explaining Red Line 1; it is mainly the contention between points 1 and 2 in the code logic above. This contention is not the fiercest. It is essentially vertical contention, between a kernel softirq and a process context on the same CPU. In general the probability of such contention is very low, because a single CPU can only run one execution stream at a time: if it is executing a softirq in kernel mode, the user process must be sleeping or preempted, for example sleeping in accept.
This vertical contention between user-mode and kernel-mode processing hardly ever happens on a single CPU. The user-mode xlock exists simply to arbitrate among user processes; when the vertical case does arise, the kernel hands the packet-processing responsibility over via the backlog. In practice there is little real contention on xlock, and the backlog mainly contributes a small optimization.
Red Line 3 (together with Red Line 3'): This red line resolves contention among multiple user processes. The reason I drew the connection-processing diagram rather than the data-transmission diagram is that the connection diagram is more representative of the problem: on the server side, a main process forks N child processes, or creates multiple threads, which all accept on the same inherited socket; this has almost become a canon of server design. When multiple processes/threads reach slock at the same time, the contention is fierce, and looking at Red Line 3 we find that accept calls on the same socket must queue up.
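As a concrete reminder of the server pattern being described, and of where the contention on Red Line 3 comes from, here is a minimal pre-fork accept server: one listening socket inherited by several children, all blocking in accept on it. This is a plain user-space sketch of the classic model, not part of the optimization; the port and worker count are arbitrary.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NUM_WORKERS 4

    int main(void)
    {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        if (lfd < 0) { perror("socket"); return 1; }

        int one = 1;
        setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);

        if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) { perror("bind"); return 1; }
        if (listen(lfd, 128) < 0) { perror("listen"); return 1; }

        /* The classic pre-fork model: every child blocks in accept() on the
         * same inherited listening socket, so the kernel must serialize them. */
        for (int i = 0; i < NUM_WORKERS; i++) {
            if (fork() == 0) {
                for (;;) {
                    int cfd = accept(lfd, NULL, NULL);
                    if (cfd < 0)
                        continue;
                    const char msg[] = "hello\n";
                    (void)write(cfd, msg, sizeof(msg) - 1);
                    close(cfd);
                }
            }
        }
        for (int i = 0; i < NUM_WORKERS; i++)
            wait(NULL);
        return 0;
    }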
If you look at the Linux TCP implementation, you will find that a large number of its data-structure operations use no locks at all. That is possible precisely because only one execution stream can get in: everyone else has already queued up at the entrance.
On a single CPU this has no real consequence, because a single CPU, even under the most advanced time-sharing preemptive scheduler, is essentially a queuing model; at best the queue can be reordered, and the slock and xlock implemented for TCP merely codify the queuing rules. But on multiple CPUs, is all this queuing really necessary?
Red Line 4: We have dealt with the queuing that happens when multiple user processes/threads access the same socket at the same time; now look at the kernel protocol stack. If several CPUs take connection-request interrupts at the same time, and a large number of those requests target the same listening socket, they all have to queue at Red Line 4! This situation does not exist on a single CPU, where multiple CPUs obviously cannot reach the same place simultaneously...

2. The bottlenecks of Linux TCP connection processing on multi-core CPUs

The red lines above describe the bottlenecks: every point I have shown so far is a queuing point. The block diagram itself is actually fine, and these queuing points are understandable, because a self-contained system design has to resolve every instance of contention, and wherever contention exists it must be resolved by queuing. So the framework does not need to change, and the locks can stay where they are. The problem is not that locks exist; the problem is that the locks are hard to get.
TCP processing runs on whichever CPUs happen to be available, with the CPU treated as a kind of resource; that is the core idea of an operating system. But what about the opposite view? If the CPU is treated as the service provider and the socket as the resource, the problem becomes tractable. This reversal of perspective is important. I first encountered it in nf-HiPAC, which swaps the roles of match and rule; the same idea later guided my design of DxR Pro++, and it applies equally well to TCP optimization.
3. Optimizing Linux TCP connection processing

The overall consideration is a TCP implementation whose connection processing has been optimized. I only care about what the accept API returns: as long as a client socket can be obtained quickly without breaking the listen and bind APIs themselves, modifying their implementation is acceptable.
3.1. Vertically splitting the socket

I split a listen socket into two halves: the upper half faces the user process, and the lower half faces the kernel protocol stack. The original socket looks like this:

 

 

My socket is changed to the following:

 

3.1.1. Eliminating Red Line 1 and Red Line 2. The upper half and the lower half of a socket are now associated only through an accept queue; even while the upper half is occupying the socket, the lower half can keep processing.
However, it turned out that when the accept processes are bound to CPUs, eliminating Red Line 1 does not bring the expected performance gain, and eliminating Red Line 2 contributes little either; without CPU binding, the improvement is actually better than with binding. Apparently the horizontal and vertical splits are not independent and influence each other.
3.2. Horizontally splitting the socket

The principle of the horizontal split is to break the lower half of a socket into multiple copies, one bound to each CPU, similar to ksoftirqd; this eliminates Red Line 4. Why not also split the upper half? At first I thought the process should decide that for itself; later I felt a forced binding would also be needed. As for user processes that unbind themselves, a layer is added to hide the per-CPU sockets: the file descriptor seen by the user process is just a socket descriptor, which actually points to nr_cpus per-CPU sockets, as shown below:

 

 

In this way, Red Line 3, Red Line 3', and Red Line 4 are all eliminated. It might seem that everything is done, but two problems showed up during testing. I will describe them next, together with how the data structures are split apart.
3.3. The Listener hash tables

Because each Listener is replicated nr_cpus times, one copy per CPU, every Listener is added to each CPU's local hash table. Trading space for lock-free parallelism here is worthwhile, because a server does not run a large number of listening services.
3.4. Local Accept queues and the global Accept queue

If every CPU in the system has an accepting process bound to it, the local Accept queues alone guarantee that all connection requests are handed to some specific process. However, if some CPU has no accepting process bound to it, client sockets queued on that CPU's Accept queue would never be returned to any process, and those client sockets would starve.
Therefore a global Accept queue is introduced. The related code logic is as follows:
stack_enqueue_socket {
    if (no user process associated with the Listener is bound to the current CPU) {
        spin_lock(g_table->slock);
        enqueue_global(g_table, cli_socket);
        spin_unlock(g_table->slock);
    } else {
        local_irq_save
        enqueue_local(per_cpu(table), cli_socket);
        local_irq_restore
    }
}

user_accept_dequeue_socket {
    if (the current process is not bound to the current CPU) {
        spin_lock(g_table->slock);
        cli_socket = dequeue_global(g_table);
        spin_unlock(g_table->slock);
    } else {
        local_irq_save
        cli_socket = dequeue_local(per_cpu(table));
        local_irq_restore
    }
    return cli_socket;
}


As you can see, the global Accept queue exists specifically for the stubborn processes that refuse to bind to a CPU, yet performance still improves, because the cost is only a small lock with much finer granularity.
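A user-space analogue of this two-tier Accept-queue arrangement might look like the following sketch: each bound worker prefers a local queue that only it and a producer on the same CPU touch (a cheap, essentially uncontended spinlock standing in for the kernel's local_irq_save section), while unbound "stubborn" workers share a mutex-protected global queue. All names are hypothetical.

    #include <pthread.h>
    #include <stddef.h>

    struct conn { struct conn *next; int fd; };

    /* One local queue per worker: the analogue of a per-CPU Accept queue. */
    struct local_queue {
        pthread_spinlock_t lock;
        struct conn *head;
    };

    /* The global queue is shared by every unbound worker, so it takes a mutex. */
    struct global_queue {
        pthread_mutex_t lock;
        struct conn *head;
    };

    static void queues_init(struct local_queue *lq, struct global_queue *gq)
    {
        pthread_spin_init(&lq->lock, PTHREAD_PROCESS_PRIVATE);
        pthread_mutex_init(&gq->lock, NULL);
        lq->head = NULL;
        gq->head = NULL;
    }

    static void enqueue_conn(struct local_queue *lq, struct global_queue *gq,
                             struct conn *c, int consumer_bound_here)
    {
        if (!consumer_bound_here) {        /* no accepting worker on this CPU */
            pthread_mutex_lock(&gq->lock);
            c->next = gq->head;
            gq->head = c;
            pthread_mutex_unlock(&gq->lock);
        } else {                           /* fast path: the local queue */
            pthread_spin_lock(&lq->lock);
            c->next = lq->head;
            lq->head = c;
            pthread_spin_unlock(&lq->lock);
        }
    }

    static struct conn *dequeue_conn(struct local_queue *lq, struct global_queue *gq,
                                     int caller_is_bound)
    {
        struct conn *c = NULL;
        if (!caller_is_bound) {            /* "stubborn" unbound worker */
            pthread_mutex_lock(&gq->lock);
            if (gq->head) { c = gq->head; gq->head = c->next; }
            pthread_mutex_unlock(&gq->lock);
        } else {
            pthread_spin_lock(&lq->lock);
            if (lq->head) { c = lq->head; lq->head = c->next; }
            pthread_spin_unlock(&lq->lock);
        }
        return c;
    }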

3.5. The softirq CPU distribution problem

Because the lower half of a Listener socket is bound to each CPU and almost all of its data structures are maintained locally, the CPU has effectively become part of TCP. One thing must therefore be guaranteed: all three packets of a handshake must be handled on the same CPU, otherwise errors occur. However, on many systems with irqbalance enabled, interrupts may be distributed to different CPUs; for example, the SYN from a particular client is handled on CPU0 while the final ACK of the handshake lands on CPU1, which causes an error. To avoid this, the lower layer must maintain the mapping between flows and CPUs, which can be done with a technique similar to RFS.

The CPU that handles the handshake is fixed the moment the first SYN packet arrives: it is simply the current CPU, regardless of which CPU the user-space process later runs on. The implementation is therefore much simpler than RFS. The code logic is as follows:

 

netif_receive_skb {
    ...
    hash = myhash(skb);
    cpu = get_hash_cpu(hash);
    if (cpu == -1) {
        record_cpu_hash(hash, current_cpu);
        /* keep processing on the current CPU */
    } else if (current_cpu != cpu) {
        enqueue_to_backlog(skb, cpu);
        /* trigger an IPI so the recorded CPU processes it */
    } else {
        /* normal receive logic */
    }
    ...
}
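To make the idea of "the first SYN nails the flow to whichever CPU received it" concrete, here is a tiny user-space model of the hash-to-CPU table used in the pseudocode above. The table size, the toy hash, the collision handling (last writer wins), and the flow_key layout are all assumptions for illustration; real code would key on the TCP 4-tuple taken from the skb and use a proper hash such as jhash.

    #include <stdint.h>
    #include <string.h>

    #define FLOW_TABLE_SIZE 4096

    struct flow_key {
        uint32_t saddr, daddr;
        uint16_t sport, dport;
    };

    struct flow_entry {
        struct flow_key key;
        int cpu;        /* -1 means "no CPU recorded yet" */
    };

    static struct flow_entry flow_table[FLOW_TABLE_SIZE];

    static void flow_table_init(void)
    {
        for (int i = 0; i < FLOW_TABLE_SIZE; i++)
            flow_table[i].cpu = -1;
    }

    /* A toy 4-tuple hash standing in for myhash() above. */
    static uint32_t myhash(const struct flow_key *k)
    {
        uint32_t h = k->saddr ^ (k->daddr * 2654435761u);
        h ^= ((uint32_t)k->sport << 16) | k->dport;
        return h % FLOW_TABLE_SIZE;
    }

    /* Return the CPU recorded for this flow, or -1 for its first packet. */
    static int get_hash_cpu(const struct flow_key *k)
    {
        struct flow_entry *e = &flow_table[myhash(k)];
        if (e->cpu >= 0 && memcmp(&e->key, k, sizeof(*k)) == 0)
            return e->cpu;
        return -1;
    }

    /* Record that later packets of this flow should be handled on `cpu`. */
    static void record_cpu_hash(const struct flow_key *k, int cpu)
    {
        struct flow_entry *e = &flow_table[myhash(k)];
        e->key = *k;
        e->cpu = cpu;
    }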

 



3.6. Relationship with REUSEPORT and fastsocket

This optimized version has nothing to do with REUSEPORT. REUSEPORT is a patch from Google and is very easy to use; we use it in one of our products, where it load-balances connections across multiple sockets listening on the same IP address and port. The quality of that balancing, however, depends entirely on the hash algorithm. Sina's fastsocket builds on REUSEPORT: it mainly optimizes CPU affinity, and the effect is very good. Related optimizations include the RPS/RFS patches, but fastsocket goes further: it not only carries the RPS/RFS results forward, it also attacks the connection-processing bottleneck, mainly by using CPU binding to split a Listen socket horizontally on top of REUSEPORT, duplicating the Listen socket for reuseport.
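For reference, this is roughly how the REUSEPORT mechanism mentioned above is used from user space: each worker creates its own listening socket with SO_REUSEPORT set (supported since Linux 3.9), and the kernel hashes incoming connections across the group. This shows the stock mechanism only, not the optimization described in this article; the function name and port are illustrative.

    #include <netinet/in.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Each worker process/thread calls this to get its own listener on the same port. */
    static int make_reuseport_listener(uint16_t port)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return -1; }

        int one = 1;
        /* SO_REUSEPORT lets several sockets bind the same address:port;
         * the kernel distributes incoming connections among them by hash. */
        if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0) {
            perror("setsockopt(SO_REUSEPORT)");
            close(fd);
            return -1;
        }

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(port);

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 || listen(fd, 128) < 0) {
            perror("bind/listen");
            close(fd);
            return -1;
        }
        return fd;
    }

Each of N workers would call make_reuseport_listener(8080) and then accept on its own descriptor; which worker receives a given connection is decided by the kernel's hash.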
With fastsocket, the user process decides whether to bind to a CPU, so it is the user process that determines which CPU handles its packets. My approach is the opposite and works bottom-up: the CPU that happens to take the interrupt for the first SYN packet is the CPU that handles the rest of the handshake, regardless of what the user process sets. Even an unbound user process at worst fetches its client socket from the global Accept queue at a small cost; a bound one gets an almost completely lock-free path. With section 3.5 in mind, look at how fastsocket obtains the CPU for a packet:

 

 

netif_receive_skb {
    ...
    socket = inet_lookup(skb);
    cpu = socket->sk_affinity;
    if (current_cpu != cpu) {
        enqueue_to_backlog(skb, cpu);
        /* trigger an IPI so the affine CPU processes it */
    } else {
        /* normal receive logic */
    }
    ...
}

 

Fastsocket performs a socket-table lookup here; for connection processing it can fetch the CPU-bound socket from a local table. If you think fastsocket is adding an extra lookup, you are mistaken: this is in fact another of fastsocket's optimizations, Direct TCP, which resolves the specific socket at this point and can stash the route and other lookup results along with it, so that one lookup serves many later uses.

Direct TCP processing does not seem to respect the protocol stack's layering, and handling a layer-4 protocol this low in the stack can itself carry a performance price in some situations:
1) What if a large share of the packets are being forwarded?
2) What if a large share of the packets are UDP?
3) What if they are attack packets?
4) What if these packets are about to be dropped by Netfilter?
...
This is a bit like another "direct" technique, the socket busy poll. To achieve high performance you must understand the behavior of your own system: if you know, for example, that a server is dedicated to handling TCP proxy requests, then enabling Direct TCP or busy poll is advantageous.
I consider my optimization the more general solution: it is a single, localized tweak that does not disturb anything else. As long as the interrupt behavior for arriving packets is well behaved, everything downstream follows automatically. Even for the four awkward packet types above, the cost is only a cheap hash computation, with no RCU-related memory barriers to pay for. With multi-queue NICs and PCIe MSI now common, plenty of techniques can guarantee that:
1) packets of a flow identified by the same tuple always interrupt the same CPU;
2) the load of processing different flows is well balanced across CPUs.
As long as those two points hold, nothing else needs special care: a TCP handshake packet travels lock-free along its own CPU's lane all the way to the local Accept queue or the global Accept queue, and then switches onto another multi-lane highway of the same width, one operated by the processes.

Finally, the optimized TCP connection processing diagram is as follows:

 


PS: With this solution you do not need to modify applications or link against any library; simply replacing the kernel is enough. TCP Fast Open is not supported yet.

How should we optimize the non-handshake packet processing of TCP?

4. Optimizing Linux TCP data-transmission processing

Before this optimization, the typical TCP client/server arrangement was seen as a one-to-many model, which is why the popular socket programming pattern, listen + accept + fork, has stayed essentially the same to this day, with the various MPM techniques derived along the way. The model itself leads people to assume that TCP connection processing simply is like that, bottleneck included! In fact, both fastsocket and my solution dissolve the one-to-many problem and turn the model into many-to-many.
What about TCP data transmission? It is not a one-to-many model but a natural one-to-one model, and TCP is a strictly ordered protocol, so parallelizing it for the sake of preserving order would be a silly idea. To optimize it, let us first list the possible bottlenecks.
4.1. Bottleneck analysis for one-to-one TCP connections

A TCP connection carries a large amount of data, so memory copying is one bottleneck, but it is not the focus of this article; it can be addressed by techniques such as DMA, zero copy, and scatter/gather I/O. This article focuses on CPU-affinity optimization. Beyond copying, the sheer number of one-to-one connections in the ESTABLISHED state, and the operations on that table, also matter.
Considering a constant stream of short connections under high concurrency, look at the pressure on the kernel's ESTABLISHED hash table:
1) New connections keep arriving (the handshake completes and the client socket is inserted into the table): lock the table and queue up;
2) Connections keep being released: lock the table and queue up;
3) A huge number of connections exist simultaneously, and a huge number of TIME_WAIT sockets slow down lookups for the connections that are working normally;
4) TIME_WAIT sockets are themselves continuously created and released.
...
Long-lived connections may relieve the symptoms, but the root of the disease remains.
Because this is a symmetric one-to-one model, and TCP's strict ordering does not lend itself to parallel per-CPU operation, I did not split it horizontally; and without a horizontal split, a vertical split is pointless, because multiple CPUs would still contend for the queues. The optimization principle here is therefore completely different from that of connection processing.
4.2. How should one-to-one TCP data-transmission processing be optimized?

The goal is already clear: local lookups and local insertion/deletion, minimizing the impact of RCU memory barriers.
Unlike connection processing, the CPU cannot play the leading role here, because a given connection can only be processed on one CPU at a time. Two principles are therefore adopted to speed up processing and keep the cache warm (see the sketch after this list):
1) Add a per-CPU local cache in front of the ESTABLISHED table, which can be accessed without locks;
2) Forbid process migration while the user process is waiting for data.
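Principle 2 has a rough user-space counterpart: a process can pin itself to the CPU it is currently running on before blocking for data and restore its previous CPU mask afterwards. The sketch below does this with sched_getcpu and sched_setaffinity; it only imitates the "no migration while waiting" idea and is not the in-kernel mechanism, which flags the task instead.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Receive one round of data while temporarily pinned to the current CPU. */
    static ssize_t recv_pinned(int fd, void *buf, size_t len)
    {
        cpu_set_t old_mask, pin_mask;
        int cpu = sched_getcpu();

        /* Remember the original affinity so scheduling freedom is restored later. */
        if (cpu < 0 || sched_getaffinity(0, sizeof(old_mask), &old_mask) != 0)
            return recv(fd, buf, len, 0);    /* fall back: no pinning */

        CPU_ZERO(&pin_mask);
        CPU_SET(cpu, &pin_mask);
        sched_setaffinity(0, sizeof(pin_mask), &pin_mask);   /* pin to this CPU */

        ssize_t n = recv(fd, buf, len, 0);   /* wait for data without migrating */

        sched_setaffinity(0, sizeof(old_mask), &old_mask);   /* unpin */
        return n;
    }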

 

My logic is as follows; it turned out to work really well:

 

tcp_user_receive/poll {
    0. routine RPS operations
    1. mark the process as non-migratable
    2. disable interrupts
       add the socket to the current CPU's local ESTABLISHED table
       (optimization point 1: skip the add if it is already in the local table)
       enable interrupts
    3. sleep_wait_for_data_or_event
    4. read the data
    5. disable interrupts
       remove the socket from the local ESTABLISHED table
       (optimization point 1: the removal can be skipped)
       enable interrupts
    6. clear the non-migratable bit
}

tcp_stack_receive {
    1. routine tcp_v4_rcv operations
    2. if the socket is not found in the current CPU's local ESTABLISHED table,
       look it up in the global ESTABLISHED table
    3. routine tcp_v4_rcv operations
}
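A user-space model of the lookup order in tcp_stack_receive above might look like this: consult the current CPU's local cache first and fall back to the global ESTABLISHED lookup on a miss. The table layout, sizes, and function names are illustrative assumptions, not the kernel's data structures.

    #include <stddef.h>
    #include <stdint.h>

    #define NR_CPUS      64
    #define CACHE_SLOTS  256

    struct sock_key { uint32_t saddr, daddr; uint16_t sport, dport; };
    struct sock     { struct sock_key key; /* ... */ };

    /* Small per-CPU caches in front of the big global ESTABLISHED table. */
    static struct sock *local_cache[NR_CPUS][CACHE_SLOTS];

    static uint32_t key_hash(const struct sock_key *k)
    {
        return (k->saddr ^ k->daddr ^ (((uint32_t)k->sport << 16) | k->dport)) % CACHE_SLOTS;
    }

    static int key_equal(const struct sock_key *a, const struct sock_key *b)
    {
        return a->saddr == b->saddr && a->daddr == b->daddr &&
               a->sport == b->sport && a->dport == b->dport;
    }

    /* Placeholder for the ordinary global ESTABLISHED lookup (locking/RCU omitted). */
    static struct sock *global_established_lookup(const struct sock_key *k)
    {
        (void)k;
        return NULL;
    }

    /* Lookup as done by the receive path: local first, global on a miss. */
    static struct sock *established_lookup(int cpu, const struct sock_key *k)
    {
        struct sock *sk = local_cache[cpu][key_hash(k)];
        if (sk && key_equal(&sk->key, k))
            return sk;                          /* hit: no lock, no barrier */
        return global_established_lookup(k);    /* miss: normal global path */
    }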

 

Obviously, the logic above rests on the following judgment: since the process is already waiting for data on some CPU, the best thing to do is to pin it temporarily to that CPU, and at the same time tell the kernel protocol stack to deliver the softirq packets to that CPU. The pinning is temporary, lasting only for one round of data reception, which requires a flag or a bit in task_struct; notifying the protocol stack's softirq routine is accomplished by adding the socket to the local ESTABLISHED cache hash table.
After the process has received its round of data, it must be unpinned so as not to disturb Linux's global domain scheduling (domain scheduling balances processes across CPUs without wrecking cache warmth). Note optimization point 1: does the socket really need to be removed from the local table? Since process migration is generally a low-probability event, the process will very likely stay on the same CPU, so skipping the removal saves a few cycles.
4.3. Client sockets are completely decoupled from connection processing

The one-to-one data-transmission processing of a client socket is independent of the connection processing described in the first part of this article. After accept returns, which process/thread handles the client socket, and whether it is bound to a particular CPU, has no effect on connection processing.
4.4. Routine optimizations in the native Linux stack: prequeue and backlog

Because of the data copy, the kernel protocol stack always prefers to let the user process parse and handle an skb itself; all the kernel needs to do is put the skb on a queue. This is a very common optimization, and the Linux stack has two such queues, the prequeue and the backlog. The former: when the protocol stack finds that no user process is occupying a socket, it tries to hang the skb on the user prequeue, and when a user process later calls recv it dequeues and processes the skbs itself. The latter: when the protocol stack finds that the socket is currently occupied by a user process, it hangs the skb on the backlog queue, and the user process dequeues and processes it when it releases the socket.
4.5. Routine optimizations in the native Linux stack: RPS/RFS

This is essentially a software emulation of RSS, and plenty of material about it already exists. Its purpose is to make the CPU that runs the protocol stack the same CPU that runs the user process. We saw that this is not feasible for connection processing because of the one-to-many relationship, but for one-to-one client sockets it works very well.
4.6. Routine optimizations in the native Linux stack: early demux

The Linux stack assumes that if the first packet of a flow is delivered locally, subsequent packets of that flow can skip the route lookup and go straight to a socket lookup. This is a typical short-circuit lookup optimization, in the spirit of "look up once, use many times", similar to what I did for nf_conntrack.
4.7. Routine optimizations in the native Linux stack: busy poll

As the name implies, the process goes down and fetches skbs from the lower layer itself instead of waiting for the protocol stack to push them up; it is a bit like self-pickup. Since the lower layer has not run the packet through the protocol stack, it cannot be sure the packet is really destined for this socket, so there is some risk, but if you understand your system's behavior in the statistical sense, the optimization pays off.
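For reference, a socket can opt in to busy polling from user space via the SO_BUSY_POLL socket option (available since Linux 3.11), which bounds how many microseconds the kernel may busy-poll the device queue for that socket. A minimal sketch; the 50-microsecond budget in the usage comment is an arbitrary choice:

    #include <stdio.h>
    #include <sys/socket.h>

    /* Ask the kernel to busy-poll the NIC queue for up to `usec` microseconds
     * when this socket has no data ready, instead of sleeping immediately. */
    static int enable_busy_poll(int fd, int usec)
    {
    #ifdef SO_BUSY_POLL
        if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usec, sizeof(usec)) < 0) {
            perror("setsockopt(SO_BUSY_POLL)");
            return -1;
        }
        return 0;
    #else
        (void)fd; (void)usec;
        fprintf(stderr, "SO_BUSY_POLL not available on this system\n");
        return -1;
    #endif
    }

    /* Example usage: enable_busy_poll(connfd, 50); */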

5. Conclusion

5.1. On the optimization of TCP connection processing

If you regard a Listener as just a socket handled by the kernel protocol stack, it is easy to find many spots worth optimizing: split this lock into finer granularity, decide there what may be preempted, and so on, and the code grows ever more complex and bloated. If instead you think of a Listener as a piece of infrastructure, everything becomes much simpler. Infrastructure serves its customers and has two basic properties:
1). It will always be there
2). It is not controlled by the customer
Shouldn't a TCP Listener be infrastructure? It should team up with the CPUs and serve connection requests, rather than being yet another contender scrambling for CPU time, and it should not be optimized around questions like "how can it get more CPU time" or "how can it grab resources from other Listeners".
There are many such examples in the Linux kernel, for instance ksoftirqd/0, ksoftirqd/1, and so on: kernel threads of the kxxxx/n form can be seen as infrastructure, providing the same service on every CPU, never interfering with other CPUs, and always working on per-CPU management data. Following this idea, when a TCP service is needed it is natural to build similar infrastructure: always there, one instance per CPU, producing a client socket as its service product for every new connection. As for handing that product to a process, the two kinds of Accept queues provide the service window! Suppose a listening process or thread crashes: the TCP Listener infrastructure is not affected at all, and thanks to its independence from processes it keeps producing client sockets for the other processes, until all listening processes have stopped or exited, which is easy to track with a reference count.
In this optimized version, everything from the two kinds of Accept queues downward belongs to the TCP Listener infrastructure and is independent of the user's listening processes. Even if the listening processes keep jumping between CPUs, binding and unbinding, the infrastructure below the Accept queues is untouched; all that changes is which Accept queue they fetch their client sockets from. A rough diagram follows:

 

 

Similarly, since the TCP Listener fully isolates the user-process sockets behind the Accept queues, and the NIC and its queues behind the interrupt scheduling, the optimized version does not need to distinguish the four cases of more NIC queues than CPUs, as many queues as CPUs, fewer queues than CPUs, or no multiple queues at all.
5.2. On the optimization of TCP transmission processing

This article says relatively little about optimizing TCP transmission processing, mainly because it is not the core bottleneck: the Linux scheduler already copes well with the cache effects of process switching and migration. That is why the only optimizations I made there address exactly those two aspects, while adding a local ESTABLISHED hash cache brings a modest further improvement. The idea comes from the cache I once added to nf_conntrack and from the hierarchical caching in the slab allocator.
5.3. Consistency with packet forwarding

Some time ago I optimized Linux forwarding performance, introducing a virtual output queue (VOQ) to deal with the speedup-ratio problem: to reach line-rate pps when forwarding, there is an N-times speedup requirement on NIC bandwidth. In that forwarding infrastructure it was therefore the NICs that were enlisted to cooperate with each other, rather than the CPUs. With multi-queue NICs the two views actually coincide: a NIC queue can be bound to a CPU, with one CPU dedicated to input and another dedicated to output.
Combined with a per-CPU skb pool (see the truck-shipping example), each Listener keeps an skb pool on every CPU, so skbs for locally handled packets can be allocated without locks.
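The per-CPU pool idea can be sketched in user space as a per-thread free list of fixed-size buffers: the owning thread allocates and frees without any locking and only falls back to malloc when its list is empty. This is merely an illustration of lock-free local allocation, not the kernel's skb pool; the type names and sizes are invented for the example.

    #include <stdlib.h>

    #define BUF_SIZE 2048

    struct buf {
        struct buf *next;
        char data[BUF_SIZE];
    };

    /* One free list per thread: the analogue of a per-CPU pool. No locks needed,
     * because only the owning thread ever touches its own list. */
    static __thread struct buf *free_list;

    static struct buf *buf_alloc(void)
    {
        struct buf *b = free_list;
        if (b) {
            free_list = b->next;             /* fast path: reuse from the local pool */
            return b;
        }
        return malloc(sizeof(struct buf));   /* slow path: go to the allocator */
    }

    static void buf_free(struct buf *b)
    {
        b->next = free_list;                 /* return to the local pool, lock-free */
        free_list = b;
    }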
Forwarding and local TCP reception can now be unified: for forwarding, the egress is another NIC; for local reception, the egress is an Accept queue (for handshake packets) or a user buffer (for data packets). If there were no user process at the egress and the packet were simply dropped instead of queued, then the per-CPU Accept queue really would behave just like a NIC.
