Soft RPS in the Linux kernel: load-balanced distribution of network receive soft interrupts


The usual Linux soft interrupt distribution mechanism and its problems
Linux interrupt handling is split into a top half and a bottom half. In general (and in practice too), the interrupted CPU runs the interrupt handler and raises a soft interrupt (softirq, the bottom half) on that same CPU. When hard interrupt processing returns, the soft interrupt runs on that CPU, or the per-CPU soft interrupt kernel thread (ksoftirqd) is woken to handle the soft interrupts left pending by the hard interrupt.
In other words, the top half and the soft interrupt associated with the same interrupt vector run on the same CPU, as the raise_softirq interface shows. The design is logically sound, but it works poorly with less intelligent hardware. The kernel has no way to control where soft interrupts are distributed, so it can only follow wherever the hard interrupt lands. There are two cases:
1. The hardware can interrupt only one CPU. By the logic above, if the system has more than one CPU core, only one CPU handles soft interrupts, which obviously leaves the load unbalanced across the CPUs.
2. The hardware blindly interrupts multiple CPUs at random. Note the word "blindly". This behaviour depends on the motherboard and the bus and has little to do with the interrupt source. So which CPU gets interrupted is unrelated to the business logic of the interrupt source: the motherboard and the interrupt controller do not look at the contents of the packets on the network card, and they do not interrupt different CPUs based on packet metadata. That is, the interrupt source has little control over which CPU is interrupted. Why should it be the interrupt source that decides? Because only it knows its own business logic; this is an end-to-end design question.
So Linux soft interrupt scheduling lacks controllable logic and flexibility. It depends entirely on which CPU the hardware interrupt source interrupts, and the interrupt source is isolated from the CPUs by the interrupt controller and the bus, so the two coordinate poorly. An extra soft interrupt scheduling layer is therefore needed to solve this.
This article does not address the general form of the problem above; it deals only with network packet processing. RPS was designed by Google and is highly customized, with one simple purpose: improving the performance of Linux servers. I have instead transplanted the idea to improve the performance of Linux routers.

Optimizing soft interrupt distribution with RPS
In the Linux forwarding optimization article "Linux Forwarding performance evaluation and optimization (forwarding bottleneck analysis and solution)", I tried load-balanced distribution of the receive soft interrupt by splitting the soft interrupt itself into a top half and a bottom half:
Top half: distributes SKBs among the CPUs.
Bottom half: performs the actual protocol stack receive processing of the SKB.
In fact, borrowing the RPS idea that was added in Linux 2.6.35 may be better, since it avoids re-partitioning the network receive soft interrupt at all. It rests on the following facts:
Fact 1: the network card is high-end. A high-end network card is bound to support hardware multi-queue and multiple interrupt vectors, so each queue's interrupt can be bound directly to a CPU core and no soft interrupt redistribution of SKBs is needed.
Fact 2: the network card is low-end. If the network card is low-end, for example it supports neither multi-queue nor multiple interrupt vectors and cannot load-balance its interrupt, then there seems to be no need for the soft interrupt to do the distribution: would it not be better to distribute directly inside the driver? (Actually it would not.) In fact, even if load balancing of a single interrupt vector across CPUs is supported, it is best to disable it, because it destroys CPU cache affinity.

Why the two facts above cannot be exploited
Because an interrupt handler cannot perform complex, time-consuming operations or heavy computation. The interrupt handler is device-dependent and is generally the driver's own responsibility, not the framework's. The core framework of the protocol stack maintains only a set of interfaces, and drivers call the APIs in that set. Can you guarantee that driver writers will use RPS correctly rather than misuse it?
The right thing to do is to hide all of this and expose only a set of configuration options: you (the driver writer) can turn it on or off, and you do not have to care how it works.
So the final solution is the same as my original one, and RPS appears to follow the same thinking: modify the NAPI poll callback in the soft interrupt path! However, the poll callback is also maintained by the driver, so instead a hook is placed on the common path that packet data takes, and RPS is handled there.
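For stock RPS, the "set of configurations" is a per-receive-queue CPU bitmap written to /sys/class/net/<dev>/queues/rx-<n>/rps_cpus. A minimal sketch of building such a mask follows; the helper name cpu_mask_hex is hypothetical, and the sketch assumes only the first 32 CPUs are of interest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format the set of CPUs allowed to process packets from one receive
 * queue as the hexadecimal bitmap that rps_cpus expects.
 * Hypothetical helper; handles CPUs 0..31 only.
 * Returns the number of characters written, or -1 on a bad CPU id.
 */
static int cpu_mask_hex(const unsigned int *cpus, size_t n,
                        char *buf, size_t len)
{
    unsigned long mask = 0;

    assert(buf != NULL);
    for (size_t i = 0; i < n; i++) {
        if (cpus[i] >= 32)
            return -1;
        mask |= 1UL << cpus[i];
    }
    return snprintf(buf, len, "%lx", mask);
}
```

For CPUs 1, 2 and 3 this yields the string e, which could then be written to the rps_cpus file of the queue in question (the device and queue names depend on the system).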

Why CPU interrupt load balancing should be disabled for low-end network cards
The answer seems simple: because we can do better in our own software! Simple, blind interrupt load balancing done by simple hardware can be (and almost certainly is) self-defeating!
Why? Because simple low-end NIC hardware does not recognize network flows. It only recognizes that something is a packet; it does not parse the packet's tuple information. If the first packet of a flow is delivered to CPU1 and the second to CPU2, the CPU cache is poorly utilized for the data shared by the flow, such as the entries recorded in nf_conntrack, and cache thrashing can be quite severe. For TCP flows, packets may also be reordered, because processing serially ordered TCP packets in parallel introduces uncertain delays. So the most straightforward idea is to send all packets that belong to one flow to a single CPU.
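A toy model can make the cost of blind distribution concrete. The sketch below is an illustration only: the function name count_migrations, the flow ids and the MAX_FLOWS limit are all made up for the example. It counts how often two consecutive packets of the same flow land on different CPUs, once with round-robin ("blind") placement and once with per-flow placement.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FLOWS 16   /* size of the toy per-flow bookkeeping */

/*
 * flows[i] is the flow id of the i-th received packet.
 * per_flow == 0: blind placement, packet i goes to CPU i % nr_cpus.
 * per_flow != 0: flow-aware placement, flow f always goes to CPU f % nr_cpus.
 * Returns how many times a flow's packet ran on a different CPU than
 * the previous packet of the same flow.
 */
static unsigned int count_migrations(const unsigned int *flows, size_t n,
                                     unsigned int nr_cpus, int per_flow)
{
    unsigned int last_cpu[MAX_FLOWS];
    int seen[MAX_FLOWS] = {0};
    unsigned int migrations = 0;

    assert(nr_cpus > 0);
    for (size_t i = 0; i < n; i++) {
        unsigned int f = flows[i];
        unsigned int cpu;

        if (f >= MAX_FLOWS)
            continue;   /* ignore ids outside the toy table */
        cpu = per_flow ? f % nr_cpus : (unsigned int)(i % nr_cpus);
        if (seen[f] && last_cpu[f] != cpu)
            migrations++;
        seen[f] = 1;
        last_cpu[f] = cpu;
    }
    return migrations;
}
```

With the packet sequence {0, 0, 1, 1, 0, 0} on two CPUs, blind placement moves a flow between CPUs four times, while per-flow placement never does. Each such move stands for shared flow state having to be pulled into another CPU's cache.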

My changes to the native RPS code
As we know, the RPS feature of Linux was introduced by Google, with the goal of improving server processing efficiency. They therefore focused on the following information:
Which CPU is serving this data flow;
Which CPU is interrupted by the network card that received the flow's packet;
Which CPU runs the soft interrupt that processes the flow's packet.
Ideally, the three CPUs above should be the same CPU, so that the CPU cache is used efficiently. That is the goal of the native RPS implementation. For this purpose, of course, the kernel has to maintain a "flow table" that records the three kinds of CPU information above. This flow table is not a true tuple-based flow table; it is simply a table that records that CPU information.
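To show what "a table that simply records CPU information" can look like, here is a minimal sketch loosely in the spirit of the kernel's rps_sock_flow_table. It is not the kernel's code: the structure, the function names and the table size are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FLOW_TABLE_SIZE 256u      /* must be a power of two */
#define NO_CPU          0xFFFFu   /* marks an empty slot */

/* One slot per hash bucket; the slot stores a CPU number, not a tuple. */
struct flow_table {
    uint16_t cpu[FLOW_TABLE_SIZE];
};

static void flow_table_init(struct flow_table *t)
{
    assert(t != NULL);
    for (unsigned int i = 0; i < FLOW_TABLE_SIZE; i++)
        t->cpu[i] = NO_CPU;
}

/* Remember which CPU last served the flow with this hash. */
static void flow_table_record(struct flow_table *t, uint32_t hash,
                              uint16_t cpu)
{
    t->cpu[hash & (FLOW_TABLE_SIZE - 1)] = cpu;
}

/* Steer a packet to the recorded CPU, or to fallback_cpu if none. */
static unsigned int flow_table_lookup(const struct flow_table *t,
                                      uint32_t hash,
                                      unsigned int fallback_cpu)
{
    uint16_t cpu = t->cpu[hash & (FLOW_TABLE_SIZE - 1)];

    return cpu == NO_CPU ? fallback_cpu : cpu;
}
```

Because only the low bits of the hash select a slot, two different flows can share one entry. That is why such a table is not a real flow table. It is still state that has to be maintained, which a pure forwarding box can do without.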
My needs are different: I focus on data forwarding rather than local processing. So my focus is on:
Which CPU is interrupted by the network card that received the flow's packet;
Which CPU runs the soft interrupt that processes the flow's packet.
In fact, I do not care much which CPU sends a packet. The sending thread simply takes an SKB from the VOQ and sends it. It does not process the packet and does not even access its contents (including the protocol headers), so cache utilization is not the primary consideration for the sending thread.
So the questions that matter when Linux is a server, namely which CPU is serving the flow a packet belongs to and which CPU runs the sending logic, can be ignored when Linux is a router (although this can still be optimized with a relay through the level-two cache; see the section on CPU affinity relay optimization). A Linux router must move all data fast and must be as simple as possible, because it has none of the inherent delays of server-side processing that a Linux server has, such as database queries and business logic. Those inherent delays are much larger than the network processing delay, so on a server the network protocol stack is not the bottleneck. What is a server? A server is the endpoint of a packet, where the protocol stack is just an entrance, a piece of infrastructure.
When running as a router, network stack processing delay is the only delay, so optimize it! What is a router? A router is not the endpoint of a packet. It is a place the packet has to pass through, and leave as quickly as possible!
So instead of using RPS's native approach directly, I simplified the hash calculation and no longer maintain any state information, just a hash:
target_cpu = my_hash(source_ip, destination_ip, l4proto, sport, dport) % nr_cpus;
[my_hash only needs to spread this information evenly enough!]
That's all. So get_rps_cpu can be reduced to the single statement above.
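A minimal, self-contained sketch of that one-line hash might look as follows. The struct flow_key layout and the mixing step (borrowed from the MurmurHash3 32-bit finalizer) are assumptions chosen for illustration; the article does not say which function my_hash actually is.

```c
#include <assert.h>
#include <stdint.h>

/* The five fields the stateless hash is computed over (hypothetical layout). */
struct flow_key {
    uint32_t saddr;     /* source IP */
    uint32_t daddr;     /* destination IP */
    uint16_t sport;     /* layer-4 source port */
    uint16_t dport;     /* layer-4 destination port */
    uint8_t  l4proto;   /* layer-4 protocol number */
};

/* Stand-in for my_hash: fold the 5-tuple, then scramble the bits. */
static uint32_t my_hash(const struct flow_key *k)
{
    uint32_t h = k->saddr ^ k->daddr;

    h ^= ((uint32_t)k->sport << 16) | k->dport;
    h ^= k->l4proto;

    /* MurmurHash3 32-bit finalizer, used here only as a cheap mixer. */
    h ^= h >> 16;
    h *= 0x85ebca6bU;
    h ^= h >> 13;
    h *= 0xc2b2ae35U;
    h ^= h >> 16;
    return h;
}

/* No table, no state: the same tuple always yields the same CPU. */
static unsigned int target_cpu(const struct flow_key *k, unsigned int nr_cpus)
{
    assert(nr_cpus > 0);
    return my_hash(k) % nr_cpus;
}
```

Note that this particular mixer does not map the two directions of a connection to the same CPU, because sport and dport are not treated symmetrically. Whether that matters depends on how much per-connection state (for example conntrack entries) the two directions share.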
There is one complication to consider: if an IP fragment is received that is not the first fragment, the layer-4 information is not available, so the fragments of one datagram may be distributed to different CPUs. When the IP layer then has to reassemble them, data exchange and synchronization between CPUs come into play. This problem is not considered for now.
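The article leaves this case open. One possible mitigation, shown here purely as an assumption and not as part of the author's change, is to leave the ports out of the hash whenever a packet is a fragment, so that every fragment of a datagram lands on the same CPU. The function names are hypothetical; frag_off is the IPv4 flags and fragment offset field in host byte order.

```c
#include <assert.h>
#include <stdint.h>

#define FRAG_MF_FLAG     0x2000u   /* IPv4 "more fragments" bit */
#define FRAG_OFFSET_MASK 0x1FFFu   /* IPv4 fragment offset, in 8-byte units */

/* True for any fragment: first, middle or last. */
static int is_fragment(uint16_t frag_off)
{
    return (frag_off & (FRAG_MF_FLAG | FRAG_OFFSET_MASK)) != 0;
}

/* Only a packet with fragment offset 0 carries the layer-4 header. */
static int has_l4_header(uint16_t frag_off)
{
    return (frag_off & FRAG_OFFSET_MASK) == 0;
}

/* Hash on the 5-tuple normally, but on (saddr, daddr, proto) for fragments. */
static unsigned int frag_safe_cpu(uint32_t saddr, uint32_t daddr,
                                  uint8_t proto, uint16_t sport,
                                  uint16_t dport, uint16_t frag_off,
                                  unsigned int nr_cpus)
{
    uint32_t h = saddr ^ daddr ^ proto;

    assert(nr_cpus > 0);
    if (!is_fragment(frag_off))
        h ^= ((uint32_t)sport << 16) | dport;

    /* Same illustrative bit mixer as a generic 32-bit finalizer. */
    h ^= h >> 16;
    h *= 0x85ebca6bU;
    h ^= h >> 13;
    h *= 0xc2b2ae35U;
    h ^= h >> 16;
    return h % nr_cpus;
}
```

The trade-off is that fragmented and unfragmented packets of the same flow may end up on different CPUs, and all fragmented traffic between two hosts collapses onto one CPU.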

Overall framework of the NET RX soft interrupt load balancer
This section gives the overall framework. The network card is assumed to be very low-end:
Multi-queue is not supported;
Interrupt load balancing is not supported;
It will only interrupt CPU0.

Its framework is as follows:
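In code terms, a minimal user-space sketch of such a framework, under the three assumptions above, could look like this. It is a simulation, not the kernel patch: the names cpu_backlog, dispatch_from_cpu0 and process_backlog_on are hypothetical, and the kick_pending flag merely stands in for the inter-processor interrupt that would raise the receive soft interrupt on the target CPU. In the kernel the corresponding pieces are enqueue_to_backlog() and process_backlog().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS   4
#define QUEUE_LEN 64

struct pkt {
    uint32_t flow_hash;   /* result of the stateless tuple hash */
};

/* Per-CPU input queue, filled by CPU0 and drained by its owner. */
struct cpu_backlog {
    struct pkt slots[QUEUE_LEN];
    size_t count;
    int kick_pending;     /* stands in for an IPI to the owning CPU */
};

static struct cpu_backlog backlog[NR_CPUS];

/*
 * Runs on CPU0, the only CPU the NIC interrupts: pick the target CPU
 * and queue the packet there. Returns the target CPU, or -1 on drop.
 */
static int dispatch_from_cpu0(const struct pkt *p)
{
    unsigned int cpu = p->flow_hash % NR_CPUS;
    struct cpu_backlog *q = &backlog[cpu];

    if (q->count == QUEUE_LEN)
        return -1;                /* backlog full: drop */
    q->slots[q->count++] = *p;
    if (cpu != 0)
        q->kick_pending = 1;      /* remote CPU must be woken */
    return (int)cpu;
}

/*
 * Runs on the target CPU: this is where protocol stack receive
 * processing would happen. Returns the number of packets handled.
 */
static size_t process_backlog_on(unsigned int cpu)
{
    struct cpu_backlog *q;
    size_t done;

    assert(cpu < NR_CPUS);
    q = &backlog[cpu];
    done = q->count;
    q->count = 0;
    q->kick_pending = 0;
    return done;
}
```

CPU0 does only the cheap part (hash and enqueue), while the expensive protocol stack work is spread over all CPUs and every flow stays on one of them.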




CPU affinity relay optimization
This section says a little about the output processing thread. Its logic is relatively simple: execute a scheduling policy and then have a network card send the SKB. It does not touch the packet often (note that because VOQ is used, by the time a packet is placed in a VOQ its layer-2 information has already been encapsulated; some of this can use scatter/gather IO, and where that is not supported the only option is memcpy ...), so the CPU cache matters less to it than to the receive-side protocol stack processing thread. However, it still has to touch the SKB once in order to send it, and it also touches the input network card's VOQ or its own, so CPU cache affinity still helps.
To keep the pipeline from spending too long in a single stage and increasing delay, I prefer to put the output logic in a separate thread. If there are enough CPU cores, I would bind it to a core, preferably not the same core that does the input processing. So which core should it be?
I prefer to have two cores that share an L2 or L3 cache handle network receive processing and network send scheduling respectively. This creates a local relay between input and output. Depending on the motherboard layout and the usual CPU core packaging, the arrangement shown can be used:
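A small sketch of choosing the sending core follows. It assumes, purely for illustration, that cores are numbered so that each consecutive group of cores_per_cache cores shares a cache; real topologies vary and can be read from /sys/devices/system/cpu/cpu<N>/cache/index<M>/shared_cpu_list. The function name pick_tx_cpu is hypothetical.

```c
#include <assert.h>

/*
 * Given the core that does receive processing, pick a different core
 * from the same cache-sharing group for send scheduling.
 * Returns the chosen core, or -1 if the group has no other core.
 */
static int pick_tx_cpu(unsigned int rx_cpu, unsigned int cores_per_cache,
                       unsigned int nr_cpus)
{
    unsigned int base;

    assert(cores_per_cache > 0);
    base = rx_cpu - rx_cpu % cores_per_cache;   /* first core of the group */
    for (unsigned int c = base; c < base + cores_per_cache && c < nr_cpus; c++) {
        if (c != rx_cpu)
            return (int)c;
    }
    return -1;
}
```

The chosen core can then be used to pin the output thread, for example with sched_setaffinity() for a user-space thread or kthread_bind() for a kernel thread.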




Why I do not analyze the code implementation
First, I did not use RPS's native implementation as it is but made a number of modifications to it. I do not compute a complex hash, and I relaxed some constraints, all to make the calculation faster. Stateless things need no maintenance!
Second, I have found that I gradually stop understanding the code analyses I wrote earlier, and I also find books that consist largely of code analysis hard to follow, because it is difficult to locate the corresponding version and patches, even though the basic idea stays the same. So I prefer to lay out the flow of event handling rather than simply walk through the code.
Disclaimer: this article describes a last-resort compensation for low-end commodity equipment. If a hardware-assisted solution is available, ignore the approach in this article.

Routine complaints and sighs
The little one is ill with a fever, a lot of work has piled up at the company, and my wife's company has been holding meetings all the time lately! I gave my heart to the moon, but the moon shines on the ditch! What am I going to do tonight? Many people will think I am very tired and will get a good night's sleep. No! At night I am in correspondence, doing DIY, debugging! Reading, in Unix, in Cisco! In ancient Rome and ancient Greece! Because this is the only time that belongs to me! As long as it is raining outside, the harder the better, I can go four days without sleeping or eating properly, with just a simple meal in between and some water, keeping the organic matter that enters my mouth to a minimum! I want to ask: which of my colleagues, at any company, dares to take me on at working overtime? Including Huawei's!
This is not venting anger; it is a strength. I remember discovering in fifth grade that I could do this. In junior high school I often did it, sometimes to solve a few extremely hard math problems. At university it became commonplace: studying, sometimes playing video games purely for fun, or chatting with a female friend for hours until she got sleepy and fell asleep, while I waited for her to wake up. After I started working I changed companies n times and went head-to-head on overtime several times; a whole night plus a day and a half is no problem. I remember one company where every Thursday was a routine all-nighter, which got me excited, especially on rainy days! I dislike normal rest. I prefer collective all-nighters on weekdays followed by work the next evening. The next day the others either took leave or were sleepy, and watching them gave me a somewhat sadistic feeling. Where did the time go? The time is right there! I have the ability to keep everyone in a vicious circle through weekday all-nighters, but I won't do it because I'm a good person. So I often say that overtime is my patent, not yours. For you, overtime is a kind of torture, except of course for those who stay at the company with nothing to do in order to avoid rush hour, hide from their wives and the housework, or claim expenses!
Last night it was raining outside! Harvest: Roman/Etruscan relations; the SONET/SDH framing standard; this article. In between, I took care of the feverish little one.

Copyright notice: this is an original article by the blogger; do not reproduce it without the blogger's permission.
