Linux kernel RPS: load balancing the distribution of network receive softirqs
The routine Linux softirq distribution mechanism and its problem

Linux interrupt handling is divided into a top half and a bottom half. Generally (and this holds in practice), the CPU that takes the hard interrupt executes the interrupt handler, and the softirq (bottom half) is raised on that same CPU. After the hard interrupt handler returns, the softirq either runs immediately on the current CPU, or the softirq kernel thread on that CPU is woken up to handle whatever the hard interrupt marked pending.
In other words, the top half and the softirq associated with the same interrupt vector execute on the same CPU, as can be seen from the raise_softirq interface. The logic of this design is sound, but it works poorly on less intelligent hardware: the kernel has no way to control the distribution of softirqs and can only follow wherever the hard interrupts land. Two problematic situations arise:
1. The hardware can only interrupt one CPU. By the logic above, even if the system has multiple CPU cores, only one of them handles the softirqs, which obviously leaves the load unbalanced across CPUs.
2. The hardware blindly and randomly interrupts multiple CPUs. Note the word "blindly": which CPU gets interrupted depends on the motherboard and the bus and has little to do with the interrupt source, so it bears no relation to the interrupt source's business logic. The motherboard and the interrupt controller, for example, do not care about the contents of the NIC's packets and will not steer interrupts to different CPUs based on packet metadata. That is, the interrupt source has almost no control over which CPU is interrupted, and why should it? It only knows its own business logic; this is the reasoning of an end-to-end design.
Linux's softirq scheduling therefore lacks controllable logic and flexibility: it depends entirely on which CPU the hardware interrupt source happens to hit. Yet the interrupt source is isolated from the CPUs by the interrupt controller and the bus, so the two sides cooperate poorly. A softirq scheduling layer is needed to fix this.
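To make the coupling concrete, here is a minimal sketch of the idea, loosely modeled on kernel/softirq.c rather than copied from it: the pending state is per-CPU, so a raised softirq stays on the interrupting CPU.

    /* A minimal sketch of per-CPU softirq pending state; the real
     * code lives in kernel/softirq.c and differs in detail. */
    #define NR_CPUS        8   /* illustrative core count */
    #define NET_RX_SOFTIRQ 3

    /* One pending bitmask per CPU: a softirq raised here runs here. */
    static unsigned long softirq_pending[NR_CPUS];

    static void my_raise_softirq(int cpu, int nr)
    {
        softirq_pending[cpu] |= 1UL << nr;
        /* The real kernel may also wake this CPU's ksoftirqd; either
         * way, the pending work stays on 'cpu'. */
    }

    /* Called from the NIC's hard interrupt handler, on whichever CPU
     * the hardware happened to interrupt. */
    static void nic_hard_irq_handler(int this_cpu)
    {
        /* ... acknowledge the device, schedule NAPI ... */
        my_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
    }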
This article does not describe a general solution to the problem above; it deals only with network packet processing. RPS was designed by Google from the start, its design is highly customized, and its single goal was to improve the performance of Linux servers. I have transplanted the idea to improve the performance of a Linux router.
RPS-based softirq distribution optimization

In my earlier forwarding-optimization article, "Linux Forwarding Performance Evaluation and Optimization (Forwarding Bottleneck Analysis and Solution)", I tried load balancing the NIC receive softirq. At the time, I split the softirq into two halves:
Upper half: distributes skbs across different CPUs.
Lower half: the actual protocol stack receive processing of the skb.
In fact, the RPS idea available since Linux 2.6.35 may be better, and it requires no splitting of the network receive softirq. It rests on the following facts:
Fact 1: If the NIC is high-end, it supports hardware multi-queue and multiple interrupt vectors. In that case, the interrupt of each queue can be bound directly to a CPU core, and there is no need for the softirq to redistribute skbs.
Fact 2: If the NIC is very low-end, for example it supports neither multiple queues nor multiple interrupt vectors and its interrupts cannot be load-balanced, wouldn't it then be better to skip softirq-level distribution and distribute directly inside the driver (in fact, that is really not good)? Furthermore, even if the single interrupt vector can be load-balanced across CPUs, it is best to disable that, because it damages CPU cache affinity.
In neither of these cases should complex, time-consuming operations or heavy computation be done in interrupt context. Interrupt handlers are device-specific; they are generally maintained not by the framework but by the driver itself, while the protocol stack core maintains only a set of interfaces for drivers to call. Can you guarantee that every driver writer will use RPS correctly rather than misuse it?
The correct approach is to hide the whole mechanism and expose only a single external configuration switch: you (the driver writer) can turn it on or off, and you never have to care how it works.
So the final solution matches my original one, and RPS apparently works the same way: modify the NAPI poll callback on the softirq path! But the poll callback is also maintained by the driver, so instead a HOOK is placed on the common path of the packet to perform the RPS processing.
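A sketch of where such a hook could sit; my_get_rps_cpu() and my_enqueue_to_backlog() are hypothetical stand-ins for the kernel's get_rps_cpu() and enqueue_to_backlog() on the netif_receive_skb()/netif_rx() path in net/core/dev.c.

    /* Sketch of RPS as a hook on the common receive path rather than
     * in any driver. */
    struct sk_buff;

    int  my_get_rps_cpu(struct sk_buff *skb);        /* -1: no steering */
    void my_enqueue_to_backlog(struct sk_buff *skb, int cpu);
    void my_local_stack_input(struct sk_buff *skb);  /* normal local path */

    int my_netif_receive_skb(struct sk_buff *skb, int this_cpu)
    {
        int cpu = my_get_rps_cpu(skb);

        if (cpu >= 0 && cpu != this_cpu) {
            /* Hand the skb to the chosen CPU's input queue and kick
             * its NET_RX softirq; no driver needs to change. */
            my_enqueue_to_backlog(skb, cpu);
            return 0;
        }
        my_local_stack_input(skb);
        return 0;
    }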
Why disable the interrupt load balancing of a low-end NIC across CPUs? The answer: software can do better! Load balancing built on simple hardware's blind interruption may (almost certainly will) be self-defeating.
Why? Because the hardware of a simple, low-end NIC cannot recognize network flows: it sees individual packets but cannot parse their tuple information. If the first packet of a flow is steered to CPU1 and the second to CPU2, the flow's shared data, such as what nf_conntrack records, is used with poor cache efficiency and suffers severe cache bouncing. For TCP flows, the nondeterministic latency of processing a stream's sequential packets concurrently can also reorder them. The most direct idea, then, is to steer all packets of one flow to one CPU.
Now, how should the native RPS code be modified? The RPS feature was introduced into Linux by Google engineers, whose goal was to improve server processing efficiency. They therefore care about the following information:
Which CPU is serving this data flow;
Which CPU is interrupted by the NIC that receives the flow's packets;
Which CPU runs the softirq that processes the flow's packets.
Ideally, these three CPUs should be the same one, so that the CPU cache is used efficiently; the native RPS implementation aims at exactly that. For this purpose, a "flow table" has to be maintained in the kernel, recording the three kinds of CPU information above. It is not a real tuple-based flow table, only a table of that CPU information.
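As a sketch of what such a table holds, loosely modeled on the kernel's rps_sock_flow_table (the field names here are illustrative, not the kernel's): it maps a flow hash to the CPU that last served the flow, rather than storing full tuples.

    /* Sketch of a table that records only CPU information per flow
     * hash, loosely modeled on rps_sock_flow_table. */
    #include <stdint.h>

    struct cpu_flow_table {
        uint32_t mask;     /* table size minus one, a power of two */
        uint16_t cpu[];    /* flow hash -> CPU last serving the flow */
    };

    /* Recorded when the serving side handles the flow ... */
    static inline void flow_table_record(struct cpu_flow_table *t,
                                         uint32_t hash, uint16_t cpu)
    {
        t->cpu[hash & t->mask] = cpu;
    }

    /* ... and consulted on receive to steer packets toward that CPU. */
    static inline uint16_t flow_table_lookup(struct cpu_flow_table *t,
                                             uint32_t hash)
    {
        return t->cpu[hash & t->mask];
    }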
My requirement is different: I care about data forwarding, not local processing. I therefore care only about:
Which CPU is interrupted by the NIC that receives the flow's packets;
Which CPU runs the softirq that processes the flow's packets.
In fact, I am not interested at all in which CPU the transmit logic is scheduled on. The sending thread merely dequeues an skb from the VOQ and transmits it; it does not even touch the packet contents (including the protocol headers). Cache utilization is therefore not a primary concern for the sending thread.
So when Linux runs as a server, the focus is on which CPU serves the flow the packets belong to; when Linux runs as a router, the data-sending logic can be ignored (although it too can be optimized by relaying through a shared second-level or last-level cache). As a router, everything must be fast and as simple as possible, because a router lacks the inherent latency of server-side processing: database queries, business logic, and so on. That inherent service latency is far greater than the network processing latency, so for a server the protocol stack is not the bottleneck. What is a server? A server is the end of the packet's journey; there, the protocol stack is merely an entrance, a piece of infrastructure.
When running as a router, the protocol stack processing latency is the only latency, so that is what must be optimized! What is a router? A router is not the destination of a packet; it is a place packets have to pass through, but must leave as quickly as possible!
Therefore I did not adopt the native RPS approach directly. I simplified the hash computation and maintain no state information at all; I just compute a hash:
    target_cpu = my_hash(source_ip, destination_ip, l4proto, sport, dport) % NR_CPU;

(my_hash only needs to spread the tuple information evenly enough.)
That's all. Accordingly, only the statement above is needed in get_rps_cpu.
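A minimal sketch of this stateless selection; my_hash() here is a placeholder, and any hash that spreads the tuple evenly will do (a real implementation might use jhash-style mixing):

    /* The whole of the simplified, stateless CPU selection. */
    #include <stdint.h>

    #define NR_CPU 4   /* illustrative core count */

    static uint32_t my_hash(uint32_t sip, uint32_t dip, uint8_t l4proto,
                            uint16_t sport, uint16_t dport)
    {
        /* Toy mixing only; real code might use jhash(). */
        uint32_t h = sip ^ dip ^ l4proto;
        h ^= ((uint32_t)sport << 16) | dport;
        h ^= h >> 16;
        h *= 0x45d9f3bu;
        h ^= h >> 13;
        return h;
    }

    static int pick_target_cpu(uint32_t sip, uint32_t dip, uint8_t l4proto,
                               uint16_t sport, uint16_t dport)
    {
        return (int)(my_hash(sip, dip, l4proto, sport, dport) % NR_CPU);
    }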
One complication deserves attention here: IP fragments. If a received fragment is not the first one, no layer-4 information can be extracted from it, so the hash may distribute the fragments and the head fragment to different CPUs; when the IP layer then needs to reassemble, cross-CPU data exchange and synchronization is involved. I leave this issue aside for now.
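One possible mitigation, which this article does not implement, is to fall back to hashing only the address pair for any IP fragment, so every fragment of a datagram converges on the same CPU (reusing my_hash() and NR_CPU from the sketch above):

    /* Possible fragment fallback, not what this article adopts. */
    static int pick_cpu_for_packet(uint32_t sip, uint32_t dip,
                                   uint8_t l4proto, int is_fragment,
                                   uint16_t sport, uint16_t dport)
    {
        if (is_fragment)
            /* Ports are unknown (or misleading, for the head
             * fragment), so hash only the address pair. */
            return (int)(my_hash(sip, dip, l4proto, 0, 0) % NR_CPU);
        return (int)(my_hash(sip, dip, l4proto, sport, dport) % NR_CPU);
    }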
A general framework for NET_RX softirq load balancing

This section gives a general framework, assuming a very low-end NIC:
It does not support multi-queue;
It does not support interrupt load balancing;
It interrupts only CPU 0.
[Figure: the general distribution framework]
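In code form, the framework looks roughly like the sketch below. The queue and IPI helpers are hypothetical stand-ins for the kernel's per-CPU backlog queues and cross-CPU softirq kicks.

    /* Sketch of the distribution framework under the assumptions
     * above: a single-queue NIC whose interrupts all land on CPU 0. */
    struct sk_buff;
    struct skb_queue { struct sk_buff *head, *tail; };

    #define NR_CPU 4

    /* One input queue per CPU, drained by that CPU's NET_RX softirq. */
    static struct skb_queue per_cpu_backlog[NR_CPU];

    void my_queue_skb(struct skb_queue *q, struct sk_buff *skb);
    void my_ipi_raise_net_rx(int cpu);             /* kick remote softirq */
    int  pick_target_cpu_for(struct sk_buff *skb); /* the hash shown earlier */

    /* Runs on CPU 0, inside the NET_RX softirq raised by the hard IRQ. */
    void rx_distribute_on_cpu0(struct sk_buff *skb)
    {
        int cpu = pick_target_cpu_for(skb);

        if (cpu == 0) {
            /* Process locally: enter the protocol stack right here. */
            return;
        }
        my_queue_skb(&per_cpu_backlog[cpu], skb);
        my_ipi_raise_net_rx(cpu);  /* target CPU drains its backlog */
    }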
Next, consider optimizing the CPU affinity of the output processing thread. Its logic is comparatively simple: execute the scheduling policy, then have the NIC send the skb. It does not touch the packet data much (note that, because VOQ is used, the packet's layer-2 information has already been encapsulated when it is placed into the VOQ; with scatter/gather IO support the pieces need not be contiguous, and without it only a memcpy remains...). The CPU cache therefore matters less to it than to the protocol stack receive thread. Still, it must touch the skb once in order to send it, and it must also touch the input NIC's VOQ, or its own, so keeping the CPU cache warm for it certainly helps.
To keep the pipeline from growing too long and inflating latency, I prefer to put the output logic in a separate thread. If CPU cores are plentiful, I still prefer to bind it to a core, preferably not the same core as the input processing. So which core should it be bound to?
I prefer a core that shares the second-level or third-level cache with the input core: the two cores handle network receive and network transmit scheduling respectively, forming a local input/output relay. Given typical motherboard layouts and CPU core packaging, the binding can follow the sketch below.
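As an illustration of the binding: cores 0 and 1 are assumed here to be cache siblings; check the real topology (for example under /sys/devices/system/cpu/cpu0/cache/) before choosing.

    /* Pin the receive thread and the transmit-scheduling thread to
     * two cores that share a cache. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    static int bind_thread_to_cpu(pthread_t tid, int cpu)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(tid, sizeof(set), &set);
    }

    /* Usage, with rx_thread and tx_thread created elsewhere:
     *   bind_thread_to_cpu(rx_thread, 0);   // input processing
     *   bind_thread_to_cpu(tx_thread, 1);   // output scheduling
     */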
Why didn't I analyze the code implementation? First, based on the facts above, I did not use the native RPS implementation as-is but modified it: I perform no complicated hash operations, I relaxed some constraints so the computation is faster, and nothing stateful needs to be maintained at all!
Second, I found that I gradually could no longer understand the code analyses I had written before, and that piles of code-analysis books are hard to follow: it is difficult to match them to a specific version and patch, even though the basic ideas stay the same. So I prefer to describe how events are processed rather than simply analyze code.
Disclaimer: this article is a last-resort remedy for low-end general-purpose devices. If a hardware solution is available, this article can naturally be ignored.