[Repost] Overview of Linux network IO parallelization technology

Original: http://codinginet.com/articles/view/201605-linux_net_parallel ("Overview of Linux network IO parallelization" by Mikewei, 2016-05-21)

The internet has experienced explosive growth over the past decade, and if one technology platform has played a key role in that, I think it is Linux, which remains the most important cornerstone even in today's era of cloud computing. We often hear debates about which server programming language is best or how best to design an architecture, but never about which server operating system is best; its status is, for us, a given rather than a choice. From 2.4 in 2001 to 2.6 in 2003, then 3.0 in 2011 and today's 4.6, Linux has kept improving in performance, stability and usability. As users, we often find that the problem is not whether Linux can keep up with our needs, but whether we understand its many features in time to take advantage of them.

This article focuses on technology related to Linux network IO (the protocol stack). About 10 years ago a single box could at best handle tens of thousands of QPS and around 100,000 concurrent connections; today a single box can handle hundreds of thousands of QPS and millions of concurrent connections. Besides hardware improvements, kernel optimization techniques play a big part in this. Moreover, these optimizations are not always enabled by default or optimally configured, so there are scenarios that need tuning; in recent years I have run into a number of server performance problems that simple tuning improved significantly, so it is worth understanding these features. You may have heard of the following kernel features: interrupt affinity, multi-queue NIC, RPS, RFS, XPS, SO_REUSEPORT. There is a lot of material about them on the internet, but I have found very little that introduces them systematically from a global perspective, which is the motivation for this article.

First of all, I think we can identify a main thread: parallelization. Protocol stack processing is essentially CPU-intensive computation, so the common idea behind the key optimization patches over the years has basically been how to make full use of multi-core resources to parallelize that computation. Why does such a simple idea give rise to so many concepts? Because stack processing is itself a complex, layered pipeline, and each layer has its own room for parallel optimization; these patches work at those different layers. I will introduce the individual techniques along this thread.

Linux protocol stack

First, let's review the layered structure of the Linux protocol stack:

At the bottom is the hardware network card (NIC), which typically communicates with the operating system through two ring buffers in memory (rx_ring/tx_ring) plus an interrupt mechanism. When the NIC receives a packet, it writes the packet into rx_ring and raises an interrupt. The CPU then traps into the interrupt handler registered by the OS; in the Linux kernel this is the hard interrupt (HARD-IRQ) context.

In older kernels, the NIC's HARD-IRQ handler takes packets out of rx_ring, puts them into a per-CPU backlog queue, and then raises a SOFT-IRQ to process that backlog (the kernel defers interrupt work that does not need to be done synchronously to a quasi-real-time asynchronous mechanism, and that mechanism is SOFT-IRQ). Newer kernels typically do not use the backlog queue but an improved mechanism called NAPI: the HARD-IRQ handler no longer reads packets directly, it only raises a SOFT-IRQ, and in SOFT-IRQ context the packets are polled out of rx_ring in batches and processed (which greatly reduces the number of interrupts).

Whether through the backlog or through NAPI, packets are processed in SOFT-IRQ context and handed up to the IP layer (for simplicity we use TCP/IP as the example throughout). After the IP layer finishes fragment reassembly and routing, it hands the packet to the transport layer (TCP or UDP). We will not dwell on protocol-specific logic; eventually the packet is placed into the receive queue of the corresponding socket object, and any process blocked on that socket is woken up.

A user-space process manipulates the kernel socket object through a socket fd, typically with blocking read/write or with epoll/select. Note that the Linux file mechanism allows multiple processes to operate on the same socket object concurrently through their own fds. Typical scenarios are multiple processes competing to accept() on one listening socket, or multiple processes competing to read one UDP socket.
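
To make the shared-fd accept pattern concrete, here is a minimal user-space sketch (port 8080, the worker count of 4 and the reply text are arbitrary illustrations, and error handling is omitted): the parent creates one listening socket, forks several workers, and every worker blocks in accept() on that same fd, competing for new connections.

/* Minimal sketch: several forked workers accept() on one shared listening socket. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int lfd = socket(AF_INET, SOCK_STREAM, 0);       /* one listening socket, created once */
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);                     /* illustrative port */
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 128);

    for (int i = 0; i < 4; i++) {                    /* illustrative worker count */
        if (fork() == 0) {                           /* child: one worker process */
            for (;;) {
                int cfd = accept(lfd, NULL, NULL);   /* all workers block on the same fd */
                if (cfd < 0)
                    continue;
                write(cfd, "hello\n", 6);
                close(cfd);
            }
        }
    }
    for (;;)
        wait(NULL);                                  /* parent only reaps children */
}

The same shared-fd pattern appears when multiple processes read one UDP socket.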

Interrupt Scheduling

Software processing of a packet by the protocol stack always starts in hard interrupt (HARD-IRQ) handling. A few facts about interrupt handling are worth knowing:

    • There are many interrupt sources with different functions in a system, each uniquely identified by an interrupt number; for example, each NIC has its own interrupt number.
    • For each interrupt number, the system registers a handler (what we usually call the interrupt handler).
    • In a HARD-IRQ handler (such as a NIC's), work that does not need to be done immediately (such as TCP/IP stack processing) is usually deferred to SOFT-IRQ and executed asynchronously.
    • SOFT-IRQ, as the name implies, is an interrupt-like mechanism built in software; it is likewise divided into types by purpose, each with its own handler. Its main purpose is to keep the real-time impact of interrupts on the system as small as possible.

Whether HARD-IRQ or SOFT-IRQ, the handlers are all execution flows that need to be scheduled (much like threads), so how to schedule these flows efficiently across multiple cores is critical to system performance. The current scheduling rules are roughly:

    • HARD-IRQ handlers for the same interrupt number are serialized: at any moment they run on only one core.
    • HARD-IRQ handlers for different interrupt numbers can run in parallel on different cores.
    • Which core an interrupt number is delivered to is usually decided by the system's I/O APIC (Advanced Programmable Interrupt Controller), which exposes a configuration interface (there is also an optional daemon, irqbalance, that adjusts this dynamically).
    • The SOFT-IRQ raised by a HARD-IRQ handler is typically executed on the same core.

We can use the following command to observe all the interrupt numbers in the system and their scheduling on each core:

cat /proc/interrupts

Now back to the topic of network IO: let's look at protocol stack processing from the point of view of NIC interrupts.

A traditional NIC has one interrupt number per device, and by the scheduling rules above each of its interrupt requests is delivered to a single core. In this scheme, protocol stack processing is parallelized only at the granularity of NIC devices.

With only a single NIC, you will find the CPU cost of interrupt processing concentrated on a single core (remember that by default the corresponding SOFT-IRQ runs on the same core). Worse, if there is also only one application process handling the socket, you may well see all CPU load concentrated on one core (the process scheduler tends to wake a process up on the core that woke it, i.e. that same core). How to optimize? Don't worry, there are several approaches, described below.

If you have more than one NIC, you will typically gain performance from multi-core parallelism.

However, we have also seen an exception: the interrupt requests of two NICs were dispatched to 2 cores, but those were two logical cores (hyper-threads) of the same physical core, so they could not really be processed in parallel. The workaround is to configure interrupt affinity manually (binding an interrupt number to specific cores), for example (here 123 is the NIC's interrupt number and 02 is a hexadecimal CPU bitmask selecting CPU 1):

echo 02 > /proc/irq/123/smp_affinity

Multi-Queue NIC

As mentioned earlier, a traditional single NIC cannot make full use of multiple cores, and even multiple NICs have the same problem when there are fewer of them than cores. The outcome is the multi-queue NIC widely used today; in some materials this technique is also called RSS (Receive-Side Scaling). In short, it parallelizes the IO of a single NIC at the hard interrupt level. It works roughly as follows:

A multi-queue NIC introduces the rx-queue mechanism: input traffic is split horizontally across multiple "virtual NICs", that is, rx-queues. Each rx-queue behaves like an independent device, with its own interrupt number, and can work independently in parallel. The number of rx-queues is generally configured to match the number of cores so that multi-core resources are fully used.

It is worth noting that the algorithm that splits input traffic across rx-queues is a hash over (src IP, src port, dst IP, dst port). (Think about why a flow hash is used rather than something like round-robin: mainly to avoid reordering packets within a flow.) If most of the traffic comes from a small number of ip:port pairs, a multi-queue NIC cannot balance the load well.
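
The toy program below illustrates that property (the struct and hash here are made up for illustration; real NICs typically implement RSS with a Toeplitz hash over the same header fields): packets of one flow always map to the same rx-queue, while different flows spread across queues.

/* Toy illustration of flow-based rx-queue selection (not the real RSS/Toeplitz hash). */
#include <stdint.h>
#include <stdio.h>

struct four_tuple {                    /* illustrative flow key */
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

/* Any deterministic hash works for the illustration: same flow -> same value. */
static uint32_t toy_hash(const struct four_tuple *t) {
    uint32_t h = 2166136261u;          /* FNV-1a over the key bytes */
    const uint8_t *p = (const uint8_t *)t;
    for (size_t i = 0; i < sizeof(*t); i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

int main(void) {
    const unsigned int nr_queues = 8;
    struct four_tuple flow = { 0x0a000001, 0x0a000002, 40000, 80 };
    /* Every packet of this flow lands on the same queue (order preserved);
     * different flows are spread across queues (parallelism). */
    printf("rx-queue = %u\n", toy_hash(&flow) % nr_queues);
    return 0;
}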

You can use the interrupts file mentioned earlier to observe the distribution effect of multi-queue NICs:

cat /proc/interrupts

There is also an article that covers this topic in detail, for reference.

RPS

What about a server without a multi-queue NIC, a typical case being a virtual machine or cloud host: how do we optimize its network IO? The answer is a pure software scheme, RPS & RFS, an optimization contributed by Google engineer Tom Herbert and merged in kernel 2.6.35. It works as follows:

RPS is a receive-side traffic distribution mechanism that works at the NAPI layer (near the entrance of SOFT-IRQ processing); it distributes packets to target cores through the per-CPU backlog queues mentioned earlier. The default distribution algorithm is similar to the multi-queue mechanism: a hash of the IP/port four-tuple is mapped to a core. And just as with the multi-queue mechanism, if the traffic comes from a small number of ip:port pairs the load will not balance well across cores. We currently enable RPS across our AWS virtual machines, and the optimization effect is very noticeable.

Configuring RPS is also simple; the value written is a hexadecimal CPU bitmask (ff enables CPUs 0-7 for this rx-queue):

echo ff > /sys/class/net/eth0/queues/rx-0/rps_cpus

For more detailed configuration instructions, refer to this

How do you see the effect of RPS distribution? Since RPS does its distribution by scheduling extra SOFT-IRQs, we can observe it through the SOFT-IRQ statistics interface:

cat /proc/softirqs | grep NET_RX

We said earlier that RPS plays an important role on servers without a multi-queue NIC; if you already have a multi-queue NIC, do you still need RPS? In my experience so far, enabling RPS on top of a multi-queue NIC brings no obvious improvement. But I think there are still situations where it makes sense, for example when the number of queues is significantly smaller than the number of cores, or when you want the RFS optimization described below, which is enabled as RPS+RFS.

If you are interested in the key kernel code for RPS, you can check it here.

There is also an article about some of the kernel implementation details for reference.

RFS

RFS is an extension of the RPS distribution mechanism. The problem it tries to solve is: since we are already distributing packets in software, can we be smarter and more efficient than the roughly random hash distribution of the hardware multi-queue scheme? For example, one problem with hash distribution is that the core on which the receiving process is waiting is very likely not the core the hash chose, which leads to cache misses and cache line bouncing; in multi-core, high-concurrency scenarios this has a significant performance impact.

RFS tries to optimize this by steering received packets to the core where the receiving process runs. The principle is as follows:

First, RFS maintains a global routing table (the sock flow table in the figure), which records a route from a flow hash (the hash of the four-tuple) to a CPU core. How is an entry created? When a process calls recvmsg (or recv/recvfrom) on a socket, the socket's flow hash (determined by the flow hash of the last packet received on it) is associated with the current CPU core. When RPS later forwards a packet, it first checks whether RFS is enabled and a valid RFS route entry can be found; otherwise it falls back to the default RPS logic.

In addition, RFS maintains a per-queue local routing table (the per-queue flow table in the figure). What is it for? Mainly to avoid the reordering that could occur if the global routing table changes while packets on the old path have not yet been fully processed. The principle is not complicated: the local table records, for a flow hash (in the actual implementation, a hash of the flow hash), the CPU core to which the last packet of that flow was forwarded, together with the tail index of that core's backlog queue at that moment (qtail). When the next packet of the flow is to be forwarded and the global route points to a different CPU core, RFS checks whether the head of the old core's backlog queue has advanced past qtail; if so, the previously forwarded packets have all been processed and it is safe to switch to the new core given by the global route. Otherwise the previously used core is kept, to preserve ordering.
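
The steering decision described in the last two paragraphs can be summarized in a small user-space model, shown below. The struct and function names are illustrative, not the kernel's (the real logic lives in the kernel's RPS/RFS receive path, roughly corresponding to get_rps_cpu()); it only captures the "switch cores only after the old backlog has drained past qtail" rule.

/* Simplified model of the RFS steering decision; names and layout are illustrative. */
struct flow_entry {
    int cpu;                    /* core recorded for this flow in the per-queue table */
    unsigned int last_qtail;    /* backlog tail index when this flow's last packet was enqueued */
};

/* desired_cpu:  core recorded in the global sock flow table (set at recvmsg time), or -1.
 * local:        this flow's entry in the per-queue flow table.
 * backlog_head: how far the backlog queue of local->cpu has been consumed so far. */
static int rfs_select_cpu(int desired_cpu, const struct flow_entry *local,
                          unsigned int backlog_head) {
    if (desired_cpu < 0)
        return -1;                              /* no RFS route: fall back to the plain RPS hash */
    if (local->cpu < 0 || local->cpu == desired_cpu)
        return desired_cpu;                     /* no old route, or nothing changed */
    /* The receiving process moved to another core: only switch once everything queued
     * to the old core has been processed, otherwise packets could be handled out of order. */
    if ((int)(backlog_head - local->last_qtail) >= 0)
        return desired_cpu;                     /* old backlog drained past qtail: safe to switch */
    return local->cpu;                          /* still draining: stay on the old core */
}

int main(void) {
    /* The process moved from core 2 to core 5, but core 2's backlog has only been
     * consumed up to index 90 (< qtail 100), so we keep steering to core 2 for now. */
    struct flow_entry e = { 2, 100 };
    return rfs_select_cpu(5, &e, 90) == 2 ? 0 : 1;
}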

From this principle it follows that RFS works best when each socket has a single process/thread handling it. For scenarios where multiple sockets are handled within one process/thread, it is also recommended to pin that process/thread to a fixed CPU core; this keeps the forwarding routes stable and further reduces cache line bouncing, letting RFS play its full role.

Configuring RFS is also simple. There are two knobs: the size of the global routing table, and the size of each local routing table (usually set to the former divided by the number of rx-queues). For example:

echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt

For more detailed configuration instructions, refer to here.

If you are interested in looking at its implementation, you can start here (record routing) and here (query routing).

There is also an article about the implementation for reference.

XPS

XPS was also authored by Google's Tom Herbert and was introduced in kernel 2.6.38. It solves a transmit-side problem in the multi-queue NIC scenario: by default, when the protocol stack reaches a multi-queue NIC device (one with multiple tx-queues), it picks a tx-queue using a four-tuple hash. The performance cost is that, on a multi-core system, several cores may send to the same tx-queue at the same time; because this requires writing shared memory such as the tx_ring, it triggers cache line bouncing and lowers overall system performance. XPS provides a mechanism to assign each tx-queue to a set of CPUs, so that only one or a few CPU cores ever write to a given tx-queue, avoiding or greatly reducing the cache line bouncing caused by conflicting writes.

Setting up XPS is very simple and similar to RPS; each value below is a hexadecimal CPU bitmask for one tx-queue, as in the following example:

echo 11 > /sys/class/net/eth0/queues/tx-0/xps_cpus
echo 22 > /sys/class/net/eth0/queues/tx-1/xps_cpus
echo 44 > /sys/class/net/eth0/queues/tx-2/xps_cpus
echo 88 > /sys/class/net/eth0/queues/tx-3/xps_cpus

Note that, by its nature, setting up XPS on a non-multi-queue NIC is meaningless and has no effect. Also, if a CPU core does not appear in the xps_cpus setting of any tx-queue, then when that core transmits on the device the kernel falls back to the default hash method to select a tx-queue.
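
To make the selection and fallback rules concrete, here is a toy model of per-CPU tx-queue selection under the example masks above (the data layout and function names are mine, for illustration only; the real selection happens in the kernel's transmit path). The masks are hexadecimal CPU bitmasks, so 11 maps CPUs 0 and 4 to tx-0, 22 maps CPUs 1 and 5 to tx-1, 44 maps CPUs 2 and 6 to tx-2, and 88 maps CPUs 3 and 7 to tx-3.

/* Toy model of XPS tx-queue selection; illustrative only. */
#include <stdio.h>

#define NR_CPUS 8
#define NR_TXQ  4

/* xps_map[cpu] = tx-queue whose xps_cpus mask contains that cpu, or -1 if none does
 * (built here by hand from the echo example above). */
static const int xps_map[NR_CPUS] = { 0, 1, 2, 3, 0, 1, 2, 3 };

static unsigned int pick_txq(int cpu, unsigned int flow_hash) {
    if (cpu >= 0 && cpu < NR_CPUS && xps_map[cpu] >= 0)
        return (unsigned int)xps_map[cpu];   /* XPS: this cpu always writes the same tx-queue */
    return flow_hash % NR_TXQ;               /* no XPS entry for this cpu: default hash fallback */
}

int main(void) {
    printf("cpu 5 sends on tx-%u\n", pick_txq(5, 0xdeadbeefu));  /* tx-1, per the mask 22 */
    return 0;
}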

If you are interested in its implementation, you can look at it from here.

Here is an introduction to this by the original author.

SO_REUSEPORT

So far we have discussed parallel optimizations in the lower layers of the stack, but at the top, the socket layer, there is also an important optimization: the SO_REUSEPORT socket option (be careful not to confuse it with SO_REUSEADDR). Its author is, once again, Tom Herbert (this article owes the man quite a few thanks :)); it was introduced in kernel 3.9.

What kind of problem does it solve? Consider a TCP server that listens on a single port; to exploit multiple cores and raise overall throughput it will use multiple processes or threads. A straightforward approach is to have multiple processes compete to accept() on the listening socket. One drawback you may run into is that the newly accepted sockets are not guaranteed to be balanced across the processes/threads; solving that directly can require cumbersome workarounds (for example, we once used connection counting plus sched_yield to keep the load balanced). Another popular pattern is to have a single thread do listen/accept and then dispatch the accepted sockets to a group of worker threads, each handling the IO of a subset of sockets. This pattern suits long-lived connections, but with short-lived connections and a high connect rate the single accepting thread becomes a bottleneck. Addressing that may mean going back to multiple threads competing to accept, which again brings socket access contention, cache line bouncing and other efficiency problems.

A UDP server has a similar problem: a single reading process easily becomes a single-core bottleneck, while multiple processes competing to read the same socket pay a certain performance cost for the contention.

SO_REUSEPORT solves both problems of the multi-process shared-port scenario: load balancing and access contention. With this option, multiple user processes/threads can each create their own socket, yet share the same port; by default the traffic arriving on that port is distributed across those sockets by a four-tuple hash (recent kernels also support custom distribution policies via BPF). Doesn't that idea look familiar by now? It mirrors the hash distribution we have seen at the lower layers.

With this, TCP/UDP server programming becomes very simple: each process/thread creates its own socket, sets SO_REUSEPORT, binds, and then handles the socket just as a single process would. Performance can also improve significantly; in our earlier experience, after converting a UDP service to this model its QPS roughly doubled.
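
Here is a minimal sketch of that model for UDP (port 9000 and the worker count of 4 are arbitrary illustrations, error handling is omitted, and SO_REUSEPORT requires kernel 3.9 or later): each worker creates its own socket, sets SO_REUSEPORT before bind(), binds the same port, and then reads its socket as if it were the sole owner, while the kernel spreads incoming datagrams across the workers' sockets.

/* Minimal SO_REUSEPORT sketch: every worker owns a private UDP socket on the same port. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

static void worker(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int one = 1;
    /* Must be set before bind(); every worker does the same, so all binds succeed. */
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9000);                 /* illustrative port, shared by all workers */
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    char buf[2048];
    for (;;) {
        /* Datagrams are distributed across the workers' sockets by a flow hash,
         * so there is no shared receive queue and no contention between workers. */
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n > 0)
            printf("worker %d got %zd bytes\n", (int)getpid(), n);
    }
}

int main(void) {
    for (int i = 0; i < 4; i++)                  /* illustrative worker count */
        if (fork() == 0)
            worker();
    for (;;)
        wait(NULL);                              /* parent only reaps children */
}

For TCP the pattern is the same: each worker creates its own SOCK_STREAM socket, sets SO_REUSEPORT, binds and listens on the same port, and calls accept() on its own listener; new connections are spread across the listeners in the same way.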

If you are interested in its implementation, you can look at the source code from here (UDP) and here (TCP).

Here is a more detailed introductory article for reference.

Summary

This article introduced a series of Linux kernel technologies for parallelizing network IO (multi-queue NIC, RPS, RFS, XPS, SO_REUSEPORT). They work at different layers of the protocol stack, but all follow a similar approach: improve the parallelism of network IO processing and minimize cache line bouncing. These great tools can help us build high-performance network servers on any Linux platform.
