NIC interrupt load balancing under dense load: SMP affinity and single-queue RPS

Source: Internet
Author: User


Simply put, every hardware device (hard disk, network card, and so on) needs some way to communicate with the CPU, so that the CPU knows when something has happened and can set aside its current work to handle the event. When a hardware device actively interrupts the CPU in this way, it is called a hardware interrupt.

What is SMP IRQ affinity?

In newer kernels, Linux can assign specific interrupts to a specified processor (or group of processors). This is known as SMP IRQ affinity, and it controls how the system responds to various hardware events, letting you restrict or redistribute the server's workload so that the server runs more efficiently. Take NIC interrupts as an example: when SMP IRQ affinity is not set, all NIC interrupts land on CPU0, which overloads CPU0, so network packets cannot be processed efficiently and a bottleneck appears. With SMP IRQ affinity, the NIC's interrupts can be distributed across multiple CPUs, spreading the load and improving packet-processing speed.

An introduction to irqbalance

irqbalance is used to optimize interrupt distribution. It automatically collects system data, analyzes usage patterns, and switches between performance mode and power-save mode according to system load. In performance mode, irqbalance spreads interrupts as evenly as possible across the CPUs to exploit multiple cores and improve performance. In power-save mode, irqbalance concentrates interrupts on the first CPU, so that the remaining idle CPUs can stay asleep and energy consumption is reduced.

When no SMP IRQ affinity is configured and irqbalance is not running:

After configuring SMP IRQ affinity, enabling RPS, and turning off irqbalance:

When writing network programs, you often need to understand how interrupts and soft interrupts are balanced, and to know how many interrupts per second occur on each CPU.
On Linux, interrupt-source information can be read from /proc/interrupts.
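A quick way to see that layout directly (nothing assumed beyond a Linux host):

```shell
# Each numeric column in /proc/interrupts is one CPU's count for that
# IRQ line; the trailing columns name the controller and the source.
head -5 /proc/interrupts
```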

You will see that the counts grow at roughly the same rate on every CPU, because we have irqbalance running. Under light load it is a very good service: its main job is to distribute work sensibly across the CPU cores, which matters especially on today's mainstream multi-core CPUs. Simply put, it spreads the pressure evenly over the cores, which helps performance a great deal.

# service irqbalance status
irqbalance (PID 21745) is running...

First, we can view detailed information about the CPUs in /proc/cpuinfo.

Get the interrupt (IRQ) number of the eth0 NIC and assign it to a shell variable.
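A minimal sketch of that step. The helper irq_for is my own illustration (not from the original article), and it assumes eth0 shows up as a single line in /proc/interrupts:

```shell
#!/bin/sh
# irq_for: read /proc/interrupts-style lines on stdin and print the
# IRQ number of the first line mentioning the given device name.
irq_for() {
    awk -F: -v dev="$1" '$0 ~ dev { gsub(/ /, "", $1); print $1; exit }'
}

# Assign eth0's IRQ number to a shell variable for the later steps.
IRQ=$(irq_for eth0 < /proc/interrupts)
echo "eth0 IRQ: $IRQ"
```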

Turn off the irqbalance auto-assignment service so that we can assign interrupt requests manually:

/etc/init.d/irqbalance stop

Specify the CPU that will handle the interrupt requests of the corresponding NIC.

Here I choose CPU2 to handle this NIC's interrupts.
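The command itself appeared as a screenshot in the original; a reconstruction, assuming the eth0 IRQ number was stored in $IRQ as described above, would be (root required):

```shell
# Mask 4 = binary 00000100 = CPU2 only; written as hex to smp_affinity.
echo 4 > /proc/irq/${IRQ}/smp_affinity
# Read it back to confirm the kernel accepted the mask.
cat /proc/irq/${IRQ}/smp_affinity
```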

Here, 4 is the hexadecimal representation of the CPU mask:

CPU      Binary      Hex

CPU 0    00000001    1

CPU 1    00000010    2

CPU 2    00000100    4

CPU 3    00001000    8
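The table is just powers of two printed in hexadecimal; any core's mask can be computed in one line with plain shell arithmetic:

```shell
# The affinity mask for CPU n is 1 << n, displayed in hex.
printf '%x\n' $((1 << 2))    # CPU2  -> 4
printf '%x\n' $((1 << 13))   # CPU13 -> 2000
```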

Here is a script to share that computes the masks for you:

#!/bin/bash
# Print the hexadecimal smp_affinity mask for each CPU core.
echo "Compute CPU hex masks"
[ $# -ne 1 ] && echo "argument must be the CPU core count" && exit 1
ccn=$1
echo "Print eth0 affinity"
for ((i=0; i<${ccn}; i++))
do
    echo "=============================="
    echo "cpu core $i is affinity"
    ((affinity=1<<i))
    echo "obase=16;${affinity}" | bc
done

If the CPU has 8 cores:

If it has 16 cores:

# sh 16
Compute CPU hex masks
Print eth0 affinity
cpu core 0  is affinity 1
cpu core 1  is affinity 2
cpu core 2  is affinity 4
cpu core 3  is affinity 8
cpu core 4  is affinity 10
cpu core 5  is affinity 20
cpu core 6  is affinity 40
cpu core 7  is affinity 80
cpu core 8  is affinity 100
cpu core 9  is affinity 200
cpu core 10 is affinity 400
cpu core 11 is affinity 800
cpu core 12 is affinity 1000
cpu core 13 is affinity 2000
cpu core 14 is affinity 4000
cpu core 15 is affinity 8000

When configuring smp_affinity, the value is interpreted as a hexadecimal CPU bitmap: if you write 5 (binary 101), it means both CPU0 and CPU2 are involved.
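You can check that combination rule by OR-ing the single-CPU masks together:

```shell
# CPU0 (mask 1) OR CPU2 (mask 4) -> 5, so writing 5 targets both CPUs.
printf '%x\n' $(( (1 << 0) | (1 << 2) ))   # prints 5
```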

You will also notice an smp_affinity_list file in the same directory; it is the decimal (CPU-list) form of the same setting.

The two files are linked: smp_affinity_list uses decimal CPU numbers, which is more readable than smp_affinity's hexadecimal mask.

echo 3,8 > /proc/irq/31/smp_affinity_list
echo 0-4 > /proc/irq/31/smp_affinity_list

Use watch to observe the effect after switching:

# watch -n 2 "cat /proc/interrupts | grep eth"

Now, something that needs explaining:

For a single-queue NIC, configuring multiple CPUs in smp_affinity or smp_affinity_list has no effect. But that is not the end of it: we can use RPS to fix this.

This feature is mainly for single-queue NICs in multi-CPU environments. If the NIC supports multiple queues, you can bind its hard interrupts directly with SMP IRQ affinity; if it does not, use RPS to balance the network soft-interrupt load, i.e. spread the single NIC's soft interrupts across multiple CPUs, avoiding the performance bottleneck caused by overloading a single CPU.

How do you determine whether your NIC supports multiple queues?
The rightmost column of /proc/interrupts carries the queue information.
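Two quick checks, assuming the device is named eth0 (the original showed these as screenshots):

```shell
# A multi-queue NIC shows several per-queue lines in /proc/interrupts,
# typically named like eth0-TxRx-0, eth0-TxRx-1, ... (driver-dependent).
grep eth0 /proc/interrupts

# Each receive queue also appears as an rx-N directory under sysfs.
ls /sys/class/net/eth0/queues/
```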

This one has 4 queues:

This one has 8 queues:

The model used here is IBM x3630 M3

The NIC uses MSI-X interrupts.

In fact, on a system handling large numbers of small packets, irqbalance's optimization has almost no effect; it can even leave the CPU load unevenly distributed, so the machine's performance is never fully used. At that point it should be stopped, and we configure the affinity manually instead.

In any case, under the same conditions I strongly recommend choosing a multi-queue NIC; common ones include Intel's 82575 and 82576 and Broadcom's 57711.

Multi-queue NICs are a technology originally introduced to solve network I/O QoS problems. As network I/O bandwidth kept growing, a single CPU core could no longer keep up with the NIC; with multi-queue driver support, each queue's interrupt can be bound to a different core. In fact, bonding two NICs together can also provide a degree of interrupt load balancing, since the two NICs' interrupt numbers can be bound to different CPU cores.

Since RPS simply balances packets across CPUs, if the application runs on a different CPU from the one handling its soft interrupts, CPU cache efficiency suffers badly. RFS ensures that the CPU running the application is the same one that handles its soft interrupts, making full use of the CPU cache.

There are two ways to configure it:

One way is to bind each queue to its own CPU:

/proc/sys/net/core/rps_sock_flow_entries 32768
/sys/class/net/eth0/queues/rx-0/rps_cpus 00000001
/sys/class/net/eth0/queues/rx-1/rps_cpus 00000002
/sys/class/net/eth0/queues/rx-2/rps_cpus 00000004
/sys/class/net/eth0/queues/rx-3/rps_cpus 00000008
/sys/class/net/eth0/queues/rx-0/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-1/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-2/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-3/rps_flow_cnt 4096

The other is the recommended method, which balanced well in repeated tests: give every queue the full CPU mask.

/sys/class/net/eth0/queues/rx-0/rps_cpus 000000ff
/sys/class/net/eth0/queues/rx-1/rps_cpus 000000ff
/sys/class/net/eth0/queues/rx-2/rps_cpus 000000ff
/sys/class/net/eth0/queues/rx-3/rps_cpus 000000ff
/sys/class/net/eth0/queues/rx-0/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-1/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-2/rps_flow_cnt 4096
/sys/class/net/eth0/queues/rx-3/rps_flow_cnt 4096
/proc/sys/net/core/rps_sock_flow_entries 32768
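Rather than writing each file by hand, the recommended settings above can be applied in a small loop (a sketch assuming eth0 and mask ff, i.e. 8 CPUs; must run as root):

```shell
#!/bin/sh
# Give every receive queue of eth0 the full 8-CPU mask (ff) and a
# 4096-entry per-queue flow table, then size the global RFS table.
for q in /sys/class/net/eth0/queues/rx-*; do
    echo ff   > "$q/rps_cpus"
    echo 4096 > "$q/rps_flow_cnt"
done
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
```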

Summary below:

RPS/RFS is primarily for single-queue NICs in multi-CPU environments. Although it provides virtual queues, they are simulated in software, so it is strongly recommended to use a NIC that supports multiple queues.

A multi-queue, multi-interrupt NIC can still use RPS/RFS after SMP affinity is applied; in that case RPS/RFS acts more like a mediator on the receive side, maximizing use of the CPU cache.

This is just my own understanding; if anything is wrong, feel free to flame me!

I will post the test results in the next couple of days; the machine room on floor 6 lost power, and the few R720xd test machines I borrowed from colleagues will not even boot!

The network test tool used here is netperf.

tar -xzvf netperf-2.5.0.tar.gz
cd netperf-2.5.0
./configure
make
make install


Depending on their scope, netperf's command-line arguments fall into two broad categories, global options and test-specific options, separated by "--":

netperf [global options] -- [test-specific options]

The global options include:

-H host: specify the IP address of the remote machine running netserver
-l testlen: specify the duration of the test, in seconds
-t testname: specify the type of test to run, e.g. TCP_STREAM, UDP_STREAM, TCP_RR, TCP_CRR, UDP_RR

The test-specific options include:

-s size: set the local system's socket send and receive buffer size
-S size: set the remote system's socket send and receive buffer size
-m size: set the size of the test packets sent by the local system
-M size: set the size of the test packets received by the remote system
-D: set the TCP_NODELAY option on the local and remote sockets
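Putting the options together, a bandwidth test like the one in this article might be launched as follows; the address and sizes are hypothetical, shown only to illustrate the syntax:

```shell
# 120-second TCP bulk-transfer test against a netserver at 192.168.0.101
# (hypothetical address); 16 KB sends, 64 KB socket buffers on both ends.
netperf -H 192.168.0.101 -l 120 -t TCP_STREAM -- -m 16384 -s 65536 -S 65536
```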

102 is the requesting side; you can see its NIC running at full capacity, about 118 MB/s.

101 is the server side.

Let's look at the client's results:

1) The remote system (the server) uses a socket receive buffer of 229376 bytes.

2) The local system (the client) uses a socket send buffer of 65507 bytes.

3) The test lasts 120 seconds.

4) The measured throughput is 961 Mbits/sec.

1) The remote system (the server) uses a socket receive buffer of 87380 bytes.

2) The local system (the client) uses a socket send buffer of 16384 bytes.

3) The test lasts 120 seconds.

4) The measured throughput is 941 Mbits/sec.

