Linux forwarding bottleneck analysis, evaluation, optimization, and solutions


Line Speed Concept




Many people misunderstand the concept of line speed. So-called wire-speed capability does not mean the router or switch behaves like a piece of wire; that is impossible. One concept that must be considered is latency. When a packet enters a router or switch, there is a core operation that costs time: for a router it is the route lookup, for a switch it is the MAC/port table lookup. This delay cannot be avoided, and the operation consumes considerable compute resources, so neither a router nor a switch can move a packet internally at anything close to the near-light speed it travels on the cable. By analogy: when you cross an intersection, don't you look left and right first?





So how do you measure the speed of a device? A packet passing through a router undoubtedly experiences delay, but the device has queues and buffers. Now imagine packets entering the input port back to back and packets leaving the output port back to back. If the device can sustain this, then, since we did not number the packets, you cannot tell whether the packet coming out is the one that just went in, and that is line speed.





We can use a capacitor to understand a forwarding device. You might object that a capacitor blocks low frequencies and passes high frequencies; that is not the point here, so consider only its ability to store charge, which resembles store-and-forward. Charging corresponds to a packet entering the input queue buffer; discharging corresponds to a packet leaving the output buffer. The current is the same on both sides of the capacitor, but a particular unit of charge flowing out is not the one that just flowed in on the other side, and charging and discharging have inherent delay.





Back to forwarding devices: switches and routers are measured with different metrics.





For a switch, wire-speed capability is measured by total backplane bandwidth: the delay caused by its table lookup is small, and most of the work happens as the packet crosses the switching matrix, so backplane bandwidth directly determines forwarding efficiency. For a router, the metric is the minimum number of packets per second a port can sustain: if packets enter at 100 per second and leave at 100 per second, the wire speed is 100 pps.





This article is about routers, not switches. A router's core latency is in the route lookup, and this lookup is unaffected by packet length, so a router's wire-speed capability is determined by the efficiency of packet output, not packet input: as long as the queue is long enough and the cache large enough, input can always be absorbed, although input does raise the question of how to dispatch. This also explains why many routers support output flow control rather than input flow control: even if input flow control were perfect, it would still be distorted by the efficiency of the router's output ports, and the flow-control result would no longer be accurate.





The night before this scheme was written, there was a story. I recently got back in touch with a close friend from junior high school, with whom I used to play rock music and mess with audio gear; he now does stage design, lighting, and the like. I asked him: on a big stage, with speakers placed at different positions and distances from the stage, the front, and the source, how do you keep the same sound from different channels in sync? A good ear can hear a difference of milliseconds. He told me: you unify the arrival time, that is, the time the audio stream takes to reach each speaker, and you do that by tuning delays, giving the path to each speaker position a different delay. What a help that was for my design.





Then, the next day, I began to organize this sad and ultimately heartbreaking Linux forwarding optimization scheme.











Summary of problems





Running as a soft router, the Linux kernel protocol stack is no less efficient than the native stacks of other general-purpose operating systems, as discussed below. Measured against the standard of industrial routers, however, it is indeed weak.





The market is full of router products based on the Linux kernel protocol stack, and the Internet is full of articles such as "how to turn Linux into a router": little more than enabling ip_forward, adding a few iptables rules, and wrapping it in a convenient web configuration interface. I want to say these are low-level, even extremely low-level. I would like to share my view of professional routers, but today happens to be a birthday and I am taking the day off, so I will not write that up; I will just sort out my plan.





Why is Linux forwarding inefficient, and how can it be optimized? That is what this article explains. As before, this article may be reproduced freely and these ideas may be implemented in code, but once they are used for commercial purposes I cannot guarantee that no person or organization will hold you accountable, so I deliberately keep the details somewhat vague.





Bottleneck Analysis Overview





1. DMA and memory operations

Let us count the memory operations required to forward one packet, leaving DMA aside for the moment:

* Copy the packet from the NIC into memory
* CPU reads the packet metadata from memory
* Modify the layer-3 header, e.g. decrement the TTL
* After the forwarding decision, encapsulate the layer-2 MAC header
* Copy the packet from memory to the output NIC

Why are these typical memory operations slow? Why do we always have such strong opinions about memory operations? Because every memory access goes over the bus: first, there is bus contention (especially under SMP and DMA), and second, memory itself is orders of magnitude slower than the CPU, so of course it is slow. The usual remedy is to use the CPU cache as much as possible, which requires some locality in the data layout; but for packet reception and other I/O, the data comes from outside the machine and cannot match the locality of data a process generates itself. That is why technologies such as Intel I/OAT are needed to help.





1.1. Linux as a server

Standard zero-copy mapping techniques are fully adequate here. Compared with wire-speed forwarding, a server running on Linux is a snail: the time it spends processing a client request is hard time that cannot be optimized away; this is a compensation principle. The only thing a Linux service needs is fast access to the client's packets, and that can be had via DMA. This article will not further discuss zero copy for the server case; look it up yourself.
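To make the contrast concrete, here is a minimal user-space sketch (my own illustration, not part of the original scheme) using the kernel's standard AF_PACKET PACKET_RX_RING interface, which lets a server-style program read frames in place from a ring shared with the kernel instead of paying a per-packet copy into its own buffer; error handling is omitted.

#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <poll.h>
#include <stdio.h>

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    struct tpacket_req req = {
        .tp_block_size = 4096, .tp_block_nr = 64,
        .tp_frame_size = 2048, .tp_frame_nr = 128,  /* (4096/2048)*64 */
    };
    setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

    /* The kernel fills this ring; we read frames in place. */
    char *ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
                      PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    for (unsigned int i = 0; ; i = (i + 1) % req.tp_frame_nr) {
        struct tpacket_hdr *h =
            (struct tpacket_hdr *)(ring + (size_t)i * req.tp_frame_size);
        while (!(h->tp_status & TP_STATUS_USER)) {   /* frame not ready */
            struct pollfd p = { .fd = fd, .events = POLLIN };
            poll(&p, 1, -1);
        }
        printf("frame: %u bytes\n", h->tp_len);
        h->tp_status = TP_STATUS_KERNEL;             /* return the slot */
    }
}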





1.2. Linux as a forwarding device

Achieving zero copy here requires a DMA mapping exchange technique, and this is the root of Linux's low forwarding performance. Because the input port's input queue and the output port's output queue know nothing about each other, queue lock overhead is too high and bus contention too severe when system resources and multi-port traffic converge on a single output port's queue. Exchanging DMA mappings requires a serious packet queue management facility that dispatches packets from input port queues to output port queues, and Linux has almost nothing of the kind.





Input queue management has been proposed in the router field in recent years, but for Linux this technique is from another world, and I am introducing it into the Linux world.





2. NIC packet queue and buffer management

In the Linux kernel, almost every data structure is allocated when needed and freed afterwards; even with kmem_cache the effect is mediocre, especially for high-speed wire-rate devices (skb memory gets copied frequently without DMA, and even with DMA it is often still not zero copy).

Even high-end NICs do not use a true preallocated memory pool for skb buffer management, so frequent allocation and release cause memory churn. Memory operations are the root of the problem, because they drag in the CPU cache, bus contention, and atomic locks; in fact, memory management is the root of roots. Where to allocate, how much to allocate, when to release, when to reuse: these questions touch on memory-region coloring techniques. Judging by an analysis of the Intel gigabit NIC drivers, Linux does none of this.
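To show what "a true preallocated pool" means, here is a minimal sketch (names and sizes are hypothetical, not from any driver): every buffer is carved out once at initialization and afterwards only recycled through a free list, so the per-packet path never calls the allocator. With one pool per NIC or per thread, as proposed later, no lock is needed either.

#include <stddef.h>

#define POOL_SIZE 4096
#define BUF_SIZE  2048

struct pkt_buf {
    struct pkt_buf *next;              /* free-list linkage */
    unsigned char   data[BUF_SIZE];
};

struct pkt_pool {
    struct pkt_buf *free_list;
    struct pkt_buf  slab[POOL_SIZE];   /* allocated once, never freed */
};

void pool_init(struct pkt_pool *p)
{
    p->free_list = NULL;
    for (size_t i = 0; i < POOL_SIZE; i++) {
        p->slab[i].next = p->free_list;
        p->free_list = &p->slab[i];
    }
}

struct pkt_buf *pool_get(struct pkt_pool *p)    /* O(1), no kmalloc */
{
    struct pkt_buf *b = p->free_list;
    if (b)
        p->free_list = b->next;
    return b;
}

void pool_put(struct pkt_pool *p, struct pkt_buf *b)  /* recycle, no kfree */
{
    b->next = p->free_list;
    p->free_list = b;
}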





3. Route lookup and other lookup operations

Linux does not separate the routing table from the forwarding table; every lookup is a longest-prefix match. Although the trie algorithm beats hashing on massive routing tables, a malformed route distribution can degrade the trie structure or cause frequent backtracking. The route cache is inefficient (lookups are too expensive, it has no fixed size and only a crude aging algorithm, so massive address diversity produces overly long conflict chains), and it was eventually removed from the kernel protocol stack.





Without a good forwarding table, the Linux protocol stack's wire-speed capability hits a bottleneck in the presence of massive routes; this is a scalability problem.





In addition, the results of many lookups could be cached in one place, but the Linux stack has no such cache. For example, a route lookup yields a next hop, the next hop is associated with an output NIC, and the output NIC determines the next-hop MAC address and the source MAC address to encapsulate. All of this should be cached in a single entry, a "forwarding post", yet the Linux stack caches none of it together.
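A sketch of what such a "forwarding post" entry might look like (the layout is hypothetical; it merely bundles the results that the stack currently scatters across the FIB, the neighbour table, and the net_device):

#include <stdint.h>

struct fwd_entry {
    uint32_t prefix;       /* destination network */
    uint8_t  plen;         /* prefix length */
    uint32_t next_hop;     /* next-hop IPv4 address */
    int      out_ifindex;  /* output NIC, resolved once */
    uint8_t  dst_mac[6];   /* next hop's MAC, ready for encapsulation */
    uint8_t  src_mac[6];   /* output NIC's MAC */
};

One lookup that hits this entry yields everything needed to encapsulate and transmit, with no separate neighbour or device queries.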





4. Unreasonable locks


Why lock? Because of SMP. However, the Linux kernel locks almost symmetrically: for example, the routing table is locked on every lookup. Why? For fear that the table changes during the query. But think about it: in a high-speed forwarding scenario, what is the ratio of lookups to modifications per unit time? Do not assume a read-write lock solves this; read-write locks also interact with preemption (even though we propose disabling preemption), and at least a few instruction cycles are wasted. Locks for such asymmetric probabilities are unnecessary.





You only need to guarantee that the kernel itself does not crash. As for the occasional IP forwarding error, never mind: by the IP protocol's own design, it is a best-effort protocol.





5. Interrupt and soft interrupt scheduling


Linux interrupts are split into a top half and a bottom half, and the bottom half is scheduled dynamically: it may run in interrupt context or in the context of a standalone kernel thread, so for environments with real-time requirements, the runtime of protocol stack processing inside a softirq is unpredictable. The stock Linux kernel does not implement Solaris- or Windows-style interrupt priorities. In some cases Linux achieves very high performance thanks to its dynamic behavior and excellent scheduler, but for fixed workloads its scheduling mechanism is clearly inadequate.





And what I need to do is make fixed what is not fixed.





6. Common problems of general-purpose OS kernel protocol stacks


As a general-purpose operating system kernel, the Linux kernel does not only process network data; it has many other subsystems, such as the various file systems and IPC mechanisms. All it can promise is availability, simplicity, and extensibility.





The native Linux protocol stack carries no network-specific optimization, and it typically runs on general-purpose hardware that is not optimized either. NICs sit on the PCI-E bus, and if DMA is managed badly, the performance cost of bus occupancy and contention can cancel out the benefit DMA was meant to bring (in fact this has little to do with forwarding; it mainly helps Linux running as a server, since that involves only a single NIC).





[Note: I do not think the kernel's processing path itself is the bottleneck. That is determined by the layered protocol stack; the bottlenecks sit inside individual layers, namely memory operations (raw overhead) and table lookups (the cost of poor algorithms).]





Overview: Linux forwarding efficiency is affected by the following major factors





* I/O and queue management, and the memory modify/copy on input/output (redesign crossbar-like queue management for DMA ring exchange)

* All kinds of table lookups, above all longest-prefix matching; no associations are established between lookups whose results uniquely determine one another

* SMP processor synchronization (lock overhead; prefer big-read locks and RCU) and cache utilization

* Interrupt and softirq scheduling





Linux Forwarding Performance Improvement Scheme


Overview


The idea of this scheme comes from the new generation of crossbar-based hardware routers. Design points:





1. A redesigned DMA packet management queue (ideas from the Linux O(1) scheduler, crossbar arrays, and VOQs [virtual output queues])

2. A redesigned forwarding table based on positioning rather than longest-prefix lookup

3. Long-path processing (unbroken threads, pipelined processing, increased CPU affinity)

4. Lock-free data structures (based on thread-local data)

5. How to implement it:

5.1. Modify the driver and the kernel protocol stack

5.2. A fully user-mode protocol stack

5.3. Evaluation: a user-mode stack is flexible, but on some platforms it must deal with the cache/TLB/MMU flushes caused by address-space switches





Kernel protocol stack scheme


Optimization framework


0. Routine optimization


1. Bind the NIC's multiple queues to specific CPU cores (use the RSS feature to handle TX and RX separately; a binding sketch follows this list)


[See "Effective Gigabit Ethernet Adapters-intel Gigabit NIC 8257X performance Tuning"]


2. Dynamically adjust the backlog delay and the interrupt throttle rate according to per-packet size statistics and interrupt delay (for Intel gigabit cards)


3. Disable kernel preemption and reduce the clock HZ, since processing is driven by interrupt granularity (see above)


4. If you do not plan to optimize netfilter, disable it at kernel compile time to save instructions


5. Remove the debug and trace compile options to save instruction cycles


6. Turn on the NIC's hardware offload switches (if any)


7. Minimize the number of user-state processes and reduce their priority


8. Native network protocol stack optimization


Since the box no longer serves as a general-purpose OS, every task other than the RX softirq can properly be starved:

* Group the CPUs (consider the Linux cgroup mechanism): dedicate a uniform set of CPUs to the data plane and bind one RX softirq to each, or

* Increase the RX softirq's netdev_budget and its per-run time limit, or

* Simply keep processing as long as any packet is pending. Control-plane/management-plane tasks can be bound to other CPUs.
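A sketch of the binding in point 1 of this list (the IRQ number and mask are made-up example values; /proc/irq/<n>/smp_affinity is the standard interface): each NIC queue's interrupt is pinned to one data-plane CPU by writing a hex CPU bitmask.

#include <stdio.h>

int pin_irq(int irq, unsigned int cpu_mask)
{
    char path[64];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
    f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%x\n", cpu_mask);   /* e.g. 0x2 pins to CPU1 */
    fclose(f);
    return 0;
}

/* e.g. pin_irq(24, 0x2); one call per RX/TX queue interrupt */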





Purpose:


Optimize the native protocol stack itself.





1. Optimize I/O and DMA, reduce memory management operations


1. Reduce PCI-E bus contention, approximating the fully crossed switching of a crossbar

[Tip: on a 16-lane, 8-bit PCI-E bus topology (not a crossbar!), the NIC wire rate cannot be fully loaded; roughly 60% of pps]

2. Reduce DMA contention and lock the bus less [Tip: optimize locked instructions; RISC is preferable, and the core clock can then be raised]

[Tip: swap DMA mappings instead of copying data between the input/output buffer rings! Nowadays only an idiot copies memory under DMA; the right way is DMA remapping, swapping pointers! A pointer-swap sketch follows this list.]

3. Use an skb memory pool to avoid the churn caused by frequent allocation/release in the memory management framework

[Tip: let each thread own a single NIC (better still, let different threads handle input and output), keeping a preallocated circular buffer mapped for DMA]
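Here is the pointer-swap idea from tip 2 in miniature (the descriptor layout is hypothetical): the RX slot and the queue slot exchange buffers, so the RX ring is instantly re-armed with an empty buffer and the full one is queued for output, and no payload byte is ever copied.

struct dma_slot {
    void         *buf;   /* DMA-mapped packet buffer */
    unsigned int  len;
};

static void swap_into_queue(struct dma_slot *rx_slot,
                            struct dma_slot *voq_slot)
{
    void *filled = rx_slot->buf;

    rx_slot->buf  = voq_slot->buf;   /* empty buffer re-arms the RX ring */
    voq_slot->buf = filled;          /* full buffer is now queued */
    voq_slot->len = rx_slot->len;
    rx_slot->len  = 0;
}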





Purpose:


Reduce cache flushes and TLB flushes, and reduce the work of kernel management facilities (such as frequent memory management)





2. Optimize Interrupt Distribution


1. Add long-path support to reduce the TLB and cache flushes caused by process switching

2. Use multi-queue NICs to support interrupt CPU affinity, or simulate soft multi-queue, to improve parallelism

3. At the expense of user-mode process scheduling opportunities, concentrate everything on kernel protocol stack processing, in parallel across multiple CPUs

[Tip: if there are many CPUs, partition them with cgroups]

4. Turn interrupt handling into kernel threads and run the whole long path in parallel across cores, avoiding switching jitter

5. Inside each thread, modularize along the lines of the IXA NP micro-engines (this part of the scheme is not implemented yet; pending)





Purpose:


Reduce cache refreshes and TLB refreshes


Reduce interrupts that break into protocol stack processing too frequently [either use the interrupt throttle rate, or introduce interrupt priorities]





3. Optimize the route lookup algorithm


1. Separate the routing table from the forwarding table, and synchronize them with the RCU mechanism (see the sketch after this list)

2. Use thread-local data as far as possible

Publish one forwarding table per thread (generated from the routing table; OpenVPN does multithreading this way, but fails at it). Use positioning rather than longest-prefix lookup (DXR, or the structure I designed). If you do not copy a forwarding table for each thread, you must redesign the RW lock or use the RCU mechanism.

3. Adopt the hash/trie methods together with DXR, or the DXR Pro positioning structure I designed
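A kernel-style sketch of point 1 (the table type and the helpers fwd_table_find/fib_to_fwd are hypothetical; the RCU primitives are the kernel's real ones): readers walk the forwarding table with no lock at all, and a writer publishes a rebuilt table, freeing the old one only after a grace period.

#include <linux/rcupdate.h>
#include <linux/slab.h>

static struct fwd_table __rcu *fwd;

/* Fast path: per-packet lookup, lock-free. */
int fwd_lookup(u32 daddr, struct fwd_entry *out)
{
    struct fwd_table *t;
    struct fwd_entry *e;
    int ret = -1;

    rcu_read_lock();
    t = rcu_dereference(fwd);
    e = t ? fwd_table_find(t, daddr) : NULL;  /* hypothetical helper */
    if (e) {
        *out = *e;   /* copy out: the entry may be freed after unlock */
        ret = 0;
    }
    rcu_read_unlock();
    return ret;
}

/* Slow path: regenerate the forwarding table from the routing table. */
void fwd_update(void)
{
    struct fwd_table *newt = fib_to_fwd();    /* hypothetical generator */
    struct fwd_table *oldt = rcu_dereference_protected(fwd, 1);

    rcu_assign_pointer(fwd, newt);
    synchronize_rcu();   /* wait for in-flight readers */
    kfree(oldt);
}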





Purpose:


A positioning structure rather than a search structure


Use local tables to avoid lock operations





4. Optimize lock


1. Lookups hit a thread-local table: no locks (not even RW locks) and no disabling of interrupts

2. Associate critical sections with kernel threads; interrupts and preemption need not be disabled (kernel preemption is in fact already disabled at compile time)

3. Replace the contention model with a priority lock queue, to preserve cache warmth

4. Adopt a Windows-style spin lock mechanism

[Tip: the Linux ticket spin lock spins probing a global lock variable, which costs bus and CPU synchronization overhead; the Windows spin lock implements a true queued lock by probing CPU-local variables. The spin-lock part of my input/output queue management structure (detailed below) borrows from the Windows design]





Purpose: keep the granularity of each lock tied to, and no larger than, the resource in its critical section








Optimization Detail Overview


1. DMA and input/output queue optimization





1.1. What's the problem?


If you are familiar with the Linux kernel protocol stack, you know that one of software engineering's ubiquitous "good things" is exactly what makes its forwarding performance poor: the decoupling of what should be tightly coupled.





The fundamental difference between Linux forwarding and Linux as a server is that a server application does not care which NIC a packet came in on or which NIC it leaves by, whereas for forwarding, the input NIC and the output NIC really do need to know about each other. The root cause of Linux's poor forwarding efficiency is not an inefficient routing table but inefficient queue management and I/O management. Fixing this is not technically difficult, but it costs the flexibility and extensibility that the Linux kernel pursues: you must introduce tight coupling of packet management between the NIC driver and the protocol stack.





We will use the Intel gigabit NIC driver e1000e to illustrate the issues above. By the way, even the Intel gigabit driver behaves this way, never mind the others; the root cause is that common NIC drivers and the protocol stack are simply not designed for forwarding.





The driver logic is roughly as follows (simplified pseudocode):

Create RX ring: rx_buffer_info[MAX]
Create TX ring: tx_buffer_info[MAX]

RX process:

i = current position in the RX ring;
while (an skb is available in rx_buffer_info) {
    skb = rx_buffer_info[i].skb;
    rx_buffer_info[i].skb = NULL;
    dma_unmap(rx_buffer_info[i].dma);
    i++;
    [Tip: at this point the skb is detached from the driver and handed entirely to the Linux protocol stack]
    [Tip: at this point the skb's memory is no longer maintained by the RX ring; the protocol stack has pulled it away]
    os_receive_skb(skb);
    [Tip: the protocol stack is responsible for releasing the skb, through interfaces like kfree_skb]
    if (the protocol stack has pulled too many skbs out of the RX ring) {
        alloc_new_skb_from_kmem_cache_to_rx_ring_buffer_info_0_to_max_if_possible;
        [Tip: skbs are allocated afresh from Linux core memory]
    }
}

TX process:

skb = packet from the protocol stack's dev_hard_xmit interface;
i = an available position in the TX ring;
tx_buffer_info[i].skb = skb;
dma_map(tx_buffer_info[i].dma);
while (an skb is available in tx_buffer_info) {
    dma_transmit_skb(tx_buffer_info[i]);
}

[Asynchronously wait for the transmit-complete interrupt, or invoke actively from the NAPI poll]

i = index in tx_buffer_info where transmission has completed;
while (tx_buffer_info holds completed skbs) {
    skb = tx_buffer_info[i].skb;
    dma_unmap(tx_buffer_info[i].dma);
    kfree_skb(skb);
    i++;
}


As the flow above shows, sustained packet forwarding involves a huge number of skb alloc and free operations. If the code above is not intuitive enough, here is a diagram:








Frequent alloc and free of skbs from Linux core memory is not only unnecessary, it also damages CPU cache utilization. Do not pin your hopes on kmem_cache: as we can see, all NICs and sockets share what is almost one core pool of memory. It can be tuned per-device and per-kmem-cache, but unfortunately that tuning is no qualitative leap.





1.2. Build a new DMA ring buffer management facility, a VOQ, establishing queue associations between input and output NICs


By analogy with the Linux O(1) scheduler algorithm, each CPU maintains a single global queue set scattered across the NICs, and performance is optimized by exchanging DMA-mapped pointers between queues instead of copying data, to reach zero copy. But that is only one piece. There is little to say about swapping DMA mapping pointers instead of copying: almost every DMA-capable NIC driver already does it, and if one did not, someone would quickly patch it to.





If you compare this with the crossbar switching fabric of high-end routers and with real VOQ implementations, you will find that, logically, maintaining one data forwarding path between every possible input/output NIC pair nicely avoids head-of-line blocking and competition: congestion affects only the NIC pair involved and needs no global lock. In software we can do the same. Understand that the queue information a Linux NIC driver maintains is fragmented away by the kernel protocol stack; from then on, input and output NICs lose track of each other, so the optimal bipartite-graph algorithm cannot be applied.





In fact, you might view the NICs as one set and the packets awaiting output as the other set, so that forwarding must establish a path between packet and NIC, a classic bipartite matching problem. However, once the path-building operation is separated from that problem, it is no longer a bipartite matching between NICs and packets: because the decoupled routing module has already run for every packet, each packet's output NIC is uniquely determined. The problem becomes a bipartite matching between the output NICs and the set of CPUs that handle the output NICs' output operations.





Here is an optimization point: with a multi-core CPU you can bind a unique CPU to each NIC's output operation, and the bipartite matching problem is solved; what remains is hardware bus contention (on a high-performance crossbar router that too is a matching problem, but it plays out differently on a bus-architecture general-purpose system, which I discuss later; for our purposes there is no remedy other than a higher-grade bus, e.g. PCI-E with 16 lanes at 8 bits). As a complete solution, I cannot assume a multi-core CPU underneath; with only one CPU, can we rely on the Linux process scheduler? Again, as a general-purpose OS kernel, Linux does not optimize for network forwarding, so the process scheduler is another optimization point of this scheme, discussed later.





Finally, here is a sketch of my packet queue management (VOQ) design.








In my VOQ design for the Linux protocol stack, the VOQ must always cooperate with a good output scheduling algorithm to perform at its best.
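For orientation, here is the shape such a VOQ might take (a hypothetical sketch, not the design in the figure above): one sub-queue per input NIC inside each output NIC's VOQ, so input threads never contend with one another, plus a bitmap telling the output scheduler which sub-queues currently hold packets.

#define MAX_NICS 16

struct pkt_buf;                      /* packet carrier, as elsewhere */

struct voq_subq {
    struct pkt_buf *head, *tail;     /* packets from one input NIC */
    unsigned long   bytes;           /* queued bytes, for the scheduler */
};

struct voq {
    int             out_ifindex;     /* owning output NIC */
    unsigned long   pending_bitmap;  /* bit i set: input NIC i has packets */
    struct voq_subq subq[MAX_NICS];  /* indexed by input NIC */
};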





2. Separate the routing table from the forwarding table and establish associations between lookup operations


The Linux protocol stack does not distinguish the routing table from the forwarding table, a distinction that high-end routers clearly require. Admittedly, I never wanted to turn the Linux stack into a professional router stack, but with this core optimization, second in importance only to the queue redesign, its forwarding efficiency becomes much higher.





About three months ago, referring to the DXR structure and borrowing from the MMU idea, I designed a forwarding index structure that achieves 3-step positioning with no longest-prefix-matching process; see my article "A route-positioning structure designed on the ideas of the DXR algorithm", which I will not repeat here. Note that this structure can be generated from the existing Linux protocol stack routing fib, and when the route distribution is irregular it can dynamically fall back to standard DXR in the worst case, for example when routes cannot be aggregated or the route entries partition the IPv4 address space too finely and unevenly. I call this structure DXR Pro++.
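I will not guess at DXR Pro++ itself, but the base "positioning" idea it builds on can be sketched: the top 16 bits of the address index a table directly, and only sparse cells fall back to a small binary search over ranges of the low 16 bits; there is no per-bit longest-prefix walk.

#include <stdint.h>

struct range { uint16_t start; uint16_t nh; }; /* low-16-bit range -> next hop */

struct cell {
    int           nranges;   /* 0: cell resolves directly via 'nh' */
    uint16_t      nh;
    struct range *ranges;    /* sorted by 'start' when nranges > 0 */
};

static struct cell table16[1 << 16];

uint16_t lookup(uint32_t daddr)
{
    struct cell *c  = &table16[daddr >> 16];   /* step 1: direct index */
    uint16_t    low = daddr & 0xffff;

    if (c->nranges == 0)
        return c->nh;                          /* step 2: already done */

    int lo = 0, hi = c->nranges - 1, ans = 0;
    while (lo <= hi) {                         /* step 2: locate range */
        int mid = (lo + hi) / 2;
        if (c->ranges[mid].start <= low) { ans = mid; lo = mid + 1; }
        else                             { hi = mid - 1; }
    }
    return c->ranges[ans].nh;                  /* step 3: read next hop */
}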





As for associations between lookup operations, this is a deeper optimization: build a high-speed flow table at the bottom to short-circuit the protocol stack (for the flow table, compare the conntrack design); the idea comes straight from netfilter's conntrack and SDN flow tables. Admittedly, IP is a stateless network and an intermediate router's forwarding strategy should be stateless too, but that is a metaphysical notion; for deep optimization you must sacrifice a little purity.





When designing the flow table, the definition of a flow need not strictly follow the five-tuple; it can be keyed on any fields of the protocol headers. The information stored in each entry includes, but is not limited to, the following elements:

* Cached routing entry
* Cached neighbour
* Cached NAT
* Cached ACL rules
* Cached layer-2 header information

This keeps a high-speed-searchable flow table at the bottom of the protocol stack. After receiving an skb, the stack matches it against this table; on a hit it can take out the associated data (such as the routing entry) and forward directly, so that in theory only the first packet of a flow walks the standard slow path of the stack (and in fact, after the DXR Pro++ optimization, even the first packet is not slow...). During direct fast forwarding, a hook must still perform the standard routine operations, such as checksums and TTL decrement.


Regarding the elements above, the neighbour and the layer-2 information deserve special mention. The encapsulation step of forwarding has always been considered a bottleneck, because it involves these costly operations:

* writing the output NIC's MAC address as the source: a memory copy
* writing the next hop's MAC address as the destination: another memory copy

Once again we run into the memory operation, the annoying memory operation! If we keep these MAC addresses in a flow-table entry, can the copies be avoided? Seemingly we can only locate them faster; the copy itself remains... Once again a hardware feature comes to the rescue: scatter-gather I/O. In principle, scatter-gather I/O can map discontiguous memory for DMA as if it were contiguous, so (as the illustration below shows) we only need to tell the controller where the MAC header of the outgoing frame is located, and DMA can transmit it directly; there is no need to copy the MAC addresses into the frame's own memory area.
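A user-space analogy of the same scatter-gather idea (my illustration; in-kernel the equivalent is built from DMA scatterlists): the Ethernet header lives in its own small buffer, e.g. inside the cached flow-table entry, the payload stays where it was received, and writev() gathers the two pieces, so the MAC addresses are never copied into the frame's memory area.

#include <sys/uio.h>
#include <stdint.h>
#include <stddef.h>

ssize_t send_frame(int fd, const uint8_t eth_hdr[14],
                   const void *payload, size_t len)
{
    struct iovec iov[2] = {
        { .iov_base = (void *)eth_hdr, .iov_len = 14  }, /* cached header */
        { .iov_base = (void *)payload, .iov_len = len }, /* data in place */
    };
    return writev(fd, iov, 2);   /* the kernel gathers the pieces */
}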








Note in particular that the data in the flow-table cache entries above is heavily redundant: the next hop's MAC address and the output NIC's MAC address are uniquely determined by the routing entry. Saving redundant data is the optimization principle here, while the standard general-purpose Linux kernel stack is designed to avoid redundancy... Since redundant data is kept, synchronization between slow-path data items and fast-path items becomes a problem that must be solved. Exploiting the read/write asymmetry, I use events to notify updates: whenever a slow-path data item changes (routes, MAC information, NAT, ACL information, and so on), the kernel triggers a query that disables the associated entries in the fast flow table. Note that this query need not be fast, because compared with fast forwarding, data synchronization happens at an astronomically lower frequency... As on Cisco-like devices, several kernel threads can periodically refresh slow-path table entries to discover changed data items and trigger the event.





[Tip: the structure of the high-speed flow table can be a multi-level hash (a TCAM-like scheme), or borrow from my DXR Pro++ structure and the multidimensional interval matching structure of the nf-hipac algorithm; personally I think highly of nf-hipac]








3. Routing Cache Optimization


Although the Linux route cache has been removed, the reason was not that caching itself is bad, but that the Linux route cache was badly designed. So the following points are worth trying as optimizations.


* Limit the size of the route software cache to guarantee lookup speed [implement a well-designed aging algorithm and a replacement algorithm]

[Exploit the temporal and spatial locality of Internet access (this requires counting statistics)]

[Self-PK: given my 3-step positioning structure, is a route cache still needed at all?]

* Preload the route cache with commonly used IP addresses, achieving one-step positioning

[The "commonly used" IPs must be updated from counting statistics, or can be set statically]





4. Softirq/NAPI scheduling optimization when RSS multi-queue NICs are not available


* Distribute packets evenly across CPUs by a hash of the protocol headers, waking the remote softirq: a software simulation of RSS


The current mechanism for the network receive softirq is: whichever CPU the NIC interrupts handles that NIC's softirq. When a NIC can interrupt only a fixed CPU, parallelism suffers, for example with only two NICs but sixteen CPU cores. How do we mobilize as many CPU cores as possible? We must modify the receive-softirq processing logic. I want multiple CPUs to process packets in turn instead of leaving them pinned to the interrupted CPU. The modified logic is as follows:





1. All RX softirq kernel threads form an array





struct task_struct *rx_irq_handler[NR_CPUS];








2. All poll lists form an array





struct list_head poll_l[NR_CPUS];








3. Introduce a spin lock to protect the above data





spinlock_t rx_handler_lock;








4. Modify the NAPI scheduling logic





void __napi_schedule(struct napi_struct *n)
{
    unsigned long flags;
    static int curr = 0;
    unsigned int hash = curr++ % NR_CPUS;

    local_irq_save(flags);
    spin_lock(&rx_handler_lock);
    list_add_tail(&n->poll_list, &poll_l[hash]);
    /* mark NET_RX_SOFTIRQ pending on CPU 'hash', not on the local CPU */
    softirq_pending(hash) |= 1 << NET_RX_SOFTIRQ;
    spin_unlock(&rx_handler_lock);
    local_irq_restore(flags);
}








[Tip: pay attention to combining DMA/DCA and CPU cache affinity; without even DMA support there is nothing left to optimize at all]





In theory the hash must be computed over tuple fields at or below the transport layer: you cannot distribute randomly, and no per-packet variable field may enter the hash. High-level protocols such as TCP, and most applications over non-TCP, are strictly order-sensitive, so per-packet parallelization at an intermediate node causes out-of-order arrival at the end node, triggering reassembly and retransmission overhead; the middle box speeds itself up and dumps trouble on the end hosts. For now, however, I ignore this and simply distribute the poll work round-robin across CPUs, which obviously invites exactly that reordering problem.
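The constraint reads, in code, roughly like this (a sketch; in-kernel one would use jhash over the tuple): the CPU is derived only from invariant 5-tuple fields, so every packet of a flow lands on the same CPU and intra-flow order is preserved.

#include <stdint.h>

static inline uint32_t flow_hash(uint32_t saddr, uint32_t daddr,
                                 uint16_t sport, uint16_t dport,
                                 uint8_t proto)
{
    uint32_t h = saddr ^ daddr ^ ((uint32_t)sport << 16 | dport) ^ proto;

    h ^= h >> 16;          /* cheap mixing; use jhash in the kernel */
    h *= 0x45d9f3b;
    h ^= h >> 16;
    return h;
}

static inline int pick_cpu(uint32_t hash, int nr_cpus)
{
    return hash % nr_cpus; /* same flow -> same CPU, no reordering */
}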





To solve the reordering problem, a scheduling layer must be added, splitting the RX softirq again into halves, RX softirq1 and RX softirq2. The upper half, RX softirq1, only keeps pulling skbs off and assigning them to specific CPU cores; the lower half does the protocol stack processing. Modify the NAPI poll logic so that whenever an skb is polled out, its hash is computed and it is placed on the queue of a specific CPU, and finally the RX softirq2 is woken on each CPU that has skbs to process; a bitmap is needed to record which CPUs those are.





But this is a trade-off that needs weighing: is it really worth splitting the RX softirq into top and bottom halves again? What do the scheduling and switching overheads come to? Benchmarks are needed to evaluate it.





* Extend the execution time of the NET softirq as long as packets keep being dispatched in the loop. Put management/control-plane processes into a separate cgroup/cpuset.





5.Linux Scheduler-related modifications


This optimization concerns the completeness of the scheme. After all, we cannot guarantee that the number of CPU cores exceeds twice the number of NICs (input processing and output processing separated), so the scheduling of output events must be considered.


Per the packet queue management design, consider a single CPU core: if more than one bit is set in an output bitmap, exactly which NIC's output gets scheduled? This is plainly a task scheduling problem. Would you entrust the job to the Linux kernel scheduler? I would not.





Because I know that although several NICs may all have packets waiting to send, their workloads differ; this is again a bipartite-graph problem. I need three metric weights to decide which NIC sends first: the head-of-queue waiting time, the total queued byte length, and the packet count. From these a priority prio can be computed; overall, the NIC whose virtual output queue has waited longer, holds more packets, and holds more bytes is the one most worth serving. Computing queue length, however, forces non-local access, such as reading other NICs' virtual output queues, which raises the locking issue again; so consider the simple case of a single metric, packet length. Under the current Linux CFS scheduler, the virtual output queue's packet length must be mapped onto the task's virtual time, i.e. vruntime. Concretely, when an input NIC sets the output NIC's bit in the output bitmap, the following sequence runs:





As long as an skb is queued, set the bit unconditionally:

setbit(outcard, incard);

As long as an skb is queued, subtract from the virtual time of the output thread associated with the output NIC a value computed from the packet length and a normalization constant:

outcard_tx_task_dec_vruntime(outcard, skb->len);








If you are not familiar with Linux CFS scheduling, Google it yourself. In fact, once an output NIC's output task starts running, it also dispatches packets by CFS virtual-time lapse, picking the packet queue descriptor most worth sending and putting it into the TX ring.





One final thought: would the RT scheduling class work better here than CFS? How to trade off the real-time output of a single NIC against fairness of output across multiple NICs? And does using the RT class actually guarantee real-time behavior?





6. Built-in packet classification and packet filtering


This is a topic about netfilter optimization. Netfilter's performance has long been criticized, largely because of iptables; netfilter and iptables must be judged separately. Clearing up that misunderstanding does not make netfilter entirely innocent, of course. Netfilter is a framework which, with preemption disabled, has no locking overhead of its own; the cost is in the hook traversal, where the callbacks' internal logic is uncontrolled. Things like iptables and conntrack (whose original data structures are very coarse-grained: one hash stores both directions' tuples, plus heavy use of large-granularity locks; when memory is not tight, why not use more locks and trade space for freedom? how much space does a lock occupy anyway?) are the big performance eaters, while nf-hipac does much better. So this part of the optimization is hard to pin down.





Still, some advice. Do not lock a critical section blindly: if a data structure is read so much more often than it is written that the writes can be neglected, then do not lock it at all; not even RW locks, not even RCU. Instead, work with copies: make a copy, read the copy, write the original, and after writing the original use an atomic event to invalidate the copy. For example, for the fast flow-table synchronization mentioned above: when a route changes, trigger an atomic event that finds the associated fast-table entries and disables them. The query may be slow, because route updates are infrequent.





This section will not be discussed at length; the recommendations are as follows:

* Preprocess the ACL or NAT rule set (use the nf-hipac approach to replace unpreprocessed rule-by-rule matching)

[The hipac algorithm amounts to preprocessing the rules: the matches are split apart and a multidimensional interval matching algorithm is applied]

* Packet scheduling algorithms (a CFS-like queue; an RB tree storing packet arrival time * h(x), with h a function of packet length)





7. The skb as a container


An skb exists as the container of a packet and must be distinguished from the packet itself; it is merely the packet's carrier, like a truck carrying cargo. It should not be released; it should never be released. Each NIC should have its own truck fleet. If we think of a NIC as an airport and the Linux router as the land between airports, then a truck loads at the airport (a packet) and either carries it to an inland destination (Linux as a server) or carries it to another airport (Linux as a forwarding router). The truck shuttles back and forth; it always belongs to its home airport and returns empty after delivering to another one. Trucks are not requisitioned centrally, and they are certainly not destroyed after each use with a new truck built for the next load (there is the Linux forwarding bottleneck!).





Actually there is an even more efficient approach: a truck that delivers its cargo to another airport or an inland destination need not return empty; it can join the destination's outbound queue and wait to carry a full load back to its own airport. For Linux, though, a truck does not know where it is going until a route lookup happens, so it cannot be sure it will ever return to its home airport. Hence the packet management queues must be fixed: the permanent binding between a NIC's RX ring and its skbs must be removed. For uniformity, the new design treats locally destined packets the same as forwarded ones, except that the output "NIC" becomes a BSD socket. The new design is shown in the following illustration:








In fact, the difference can be seen by comparing trains and taxis. A train's route is fixed: the Harbin-Hankou train belongs to the Harbin railway bureau; it fills with passengers, drops them in Hankou, boards new ones, and must return to Harbin. A taxi is different: a Shanghai C-plate taxi from Jiading theoretically belongs to Jiading and may not refuse fares. It drives someone to Songjiang; in Songjiang the driver hopes someone will ride back to Jiading, but the passenger who gets in (the route lookup) says he wants Minhang; on arrival another passenger says Jiaxing... ever farther away, and the fact is that before a passenger boards, the driver never knows where he is going.





User-mode protocol stack scheme


1. Controversy


On some platforms, if the cache/TLB flush cost of user/kernel switches is not solved, this scheme is not my first choice: on those platforms, whether write-through or write-back, cache accesses do not go through the MMU, the cache does not hold MMU permissions, and the cache is indexed directly by virtual address.





2. Resolving the controversy


Intel I/OAT's DCA technology can be used to avoid the cache jitter caused by context switches.





3. Adopt a PF_RING-like approach


Modify the driver to associate it directly with the DMA buffer ring (see the DMA optimizations in the kernel scheme).





4. Borrow from Tilera's RISC many-core solution


Process the protocol layers as a parallel pipeline, with several packets in flight simultaneously; the number of pipeline stages equals the number of processing modules, on the order of the CPU core count + 2.


[Pipeline inverted]





In essence, the user-mode stack and the kernel stack embody the same solutions; neither owns any of these ideas. A user-mode stack is less constrained, more flexible, and more stable, but not always superior. Note that a great many of the existing disputes here are metaphysical; opinions differ with the beholder.





Stability





For non-professional, non-carrier-grade routers, the stability problem can largely be ignored: there is no 7x24 requirement, and if the box fails you just restart it, no harm done. Technically, though, a few things are worth saying. At high bus speeds, parallel buses suffer inter-symbol interference and memory errors easily; one flipped bit, one unstable level can have unpredictable consequences. That is why high-speed buses like PCI-E transmit data serially, and SATA works the same way for disks.





With multiple NICs doing DMA, contention on a PCI-E-based device's bus is fierce; that is determined by the bus topology regardless of bus type, and once the system bus and the CPU cores are considered too, it gets fiercer still, because the CPUs also take part. A glitch from such contention is like bumping into a table: usually nothing falls over, but sometimes something does. Thinking about this, I feel I should apologize to a client I quarreled with three years ago.

In 2012 I was doing a VPN project, and the customer said my device might go down in the very next second, because of uncertainty. I asked whether if (true) { printf("Cao Ni ma!\n"); } would satisfy him (of course I did not dare put it that way at the time). He said: not necessarily. I exploded... But now it seems he was right.





Good side effects of the VOQ design: QoS




VOQ is a highlight of this scheme, and this article has revolved around it almost entirely; the other highlight, DXR Pro, has been elaborated in other articles and is only cited here. Near the end I return to VOQ, this time from a macroscopic rather than a detail-level point of view.





As long as the input buffer queue is large enough, packet reception is almost wire speed; output, however, is constrained by the scheduling algorithm and head-of-line blocking. Input is passive for the system, triggered by interrupts, while output is active, governed by the system's own design. So for forwarding, "receiving is easy, sending is hard" is a truth. That is why most systems place QoS on the output queue: even if traffic is shaped on the input queue, it contends again at output time, which distorts the QoS intervention applied at input. I remember studying Linux's IMQ input-queue shaping, but I focused on implementation details and never did the conceptual thinking; I will not do it now.





With VOQ plus a well-designed scheduling algorithm, almost all of these problems are excitingly solved. As I mentioned for the output operation, the output thread schedules with a weighted fair algorithm based on packet length and virtual time, but that algorithm's only effect is to send packets at full speed. Would it not be better to make the scheduling algorithm strategically pluggable, or to port the framework and algorithms of the Linux TC module?





Alas, if you Baidu "router wire speed", almost everything you find is "router speed limiting"; that is a joke. For forwarding, you do not need any TC rule to achieve a speed-limiting effect: string a Linux box into the network and it is automatically rate-limited, is it not? With VOQ, you may actually need a rate limit. In a crowded Chinese downtown, a sign saying "speed limit 60" is a joke: which bustling street lets you reach 60? But once you are on the expressway, the 100/120 limit is necessary.





Good side effects of the VOQ design: head-of-line blocking and the speedup-ratio problem





In hardware-router terminology, if packets are routed into output NIC queues, then multiple NICs may queue packets toward one NIC at the same time. For the output NIC this is passive, a sad scramble: for N packets to arrive simultaneously, output bandwidth would have to be N times each input bandwidth. This is the N-times speedup problem. We want the output NIC's packet scheduling to be an active process, orderly and therefore efficient. That is the essence of the VOQ design.





For Linux, whose output queues are software, the N-times speedup problem becomes a queue locking problem: in short, still a regrettable scramble, so the remedy follows the same idea, and in Linux I simulate a VOQ. Here we see the difference between a VOQ and output queuing: with a VOQ, output is an active scheduling process, clearly more efficient; with output queuing, output is a passive scramble, clearly hopeless.





It should be explained that VOQ is only a logical concept, borrowed by analogy from hardware routers. If you insist on output queues instead of a VOQ, then designing multiple output queues, one per NIC, is equally reasonable; it even simplifies things further, compressing away a NIC-assignment step and simplifying scheduling. Thus the third edition of my Linux VOQ design: associate the virtual output queues with the output NIC rather than the input NIC (the reason is analyzed in the next section).





Bus topology and crossbar





In real hardware routers, such as Cisco's and Huawei's devices, route lookup and forwarding are executed by line-card hardware, and the packet stays put during this time; lookup and update are so fast that, relative to the cost of moving the packet to the output NIC queue, table lookup overhead is negligible. So on real hardware routers, building a high-performance switching fabric is what matters most.





Moreover, a hardware router must also decide whether, after route lookup, the input logic pushes the packet directly through the switching fabric to the output NIC queue, or the packet stays put and the output logic later pulls it across the fabric. The difference affects the design of the fabric's arbiter. If several NICs output to the same NIC simultaneously, the push model causes conflicts, since a path through a crossbar is equivalent to a bus and the conflict typically occurs inside the switching fabric; so push designs usually carry a cache on each internal switch node to park packets that lose arbitration. With pull, the situation differs. So which side of the switching fabric the output queue sits on makes a real difference.





But a general-purpose architecture usually connects its NICs over a PCI-E bus, a classic bus structure with no switching fabric; so-called arbitration means bus arbitration. That is not my focus; who told me I only own commodity-architecture hardware?! My optimization does not include bus arbiter design, because I do not understand it.





Therefore, for Linux protocol stack forwarding optimization on a bus-topology commodity architecture, whether the virtual output queues hang off the input NIC or the output NIC matters little. But given the locality benefits of contiguous memory access, I prefer associating the VOQ with the output NIC. If the VOQ were associated with the input NICs, then during output scheduling the output NIC would, following its output bitmap, pull packets from queues inside multiple input NICs' VOQs, and those queues are not contiguous in memory. Associated with the output NIC, each NIC's VOQ is contiguous. As shown in the following illustration:











Implementation notes





We said earlier that the skb should exist only as a container (a truck), so skbs are not to be released. Ideally, when the Linux kernel boots and the network protocol stack initializes, it should self-test its hardware capability and NIC parameters, then allocate MAX skbs, distributed evenly across the NICs, with a socket skb pool reserved for user sockets. Everything afterwards is skb transport behavior: the truck goes where it is directed, and the runs are never wasted.





You may find this unnecessary, since skbs, like most memory in the Linux kernel, are allocated from caches; isn't that exactly what slab and kmem_cache do? I admit the Linux kernel does this well. But kmem_cache is a general-purpose framework; why not lift skbs one level above it? Every call to alloc_skb triggers the internal management machinery of the kmem_cache framework: updating data structures, maintaining lists, possibly even touching the buddy system underneath. So expecting the "efficient" kmem_cache to be used directly is not a good idea.





You may argue: what if system memory runs tight, will the skbs still never be released? Nothing is absolute, but that is outside this article's scope; it involves many things, cgroups among them.





In my skb modification, I added a field indicating what the skb belongs to (a NIC? the socket pool? ...) and where it currently is. This information both keeps the skb from being freed back to kmem_cache and helps optimize cache utilization. This part of the modification is implemented and is being further refined against the Intel gigabit driver. As for the performance of DXR Pro, I have tested it in user space but have not yet ported it into the kernel.





For the fast lookup table, the current idea is to optimize nf_conntrack and do a multi-level hash lookup.
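A toy of the two-level idea (the layout is entirely hypothetical): level 1 hashes the addresses to a bucket group, level 2 hashes ports and protocol within the group; a hit returns the cached forwarding data in two O(1) probes, a miss takes the slow path once.

#include <stdint.h>
#include <stddef.h>

#define L1_BITS 12
#define L2_BITS 6

struct flow_key  { uint32_t saddr, daddr; uint16_t sport, dport; uint8_t proto; };
struct flow_item { struct flow_key key; void *cached; int used; };

static struct flow_item table[1 << L1_BITS][1 << L2_BITS];

void *flow_find(const struct flow_key *k)
{
    uint32_t h1 = (k->saddr ^ k->daddr) & ((1u << L1_BITS) - 1);
    uint32_t h2 = (k->sport ^ k->dport ^ k->proto) & ((1u << L2_BITS) - 1);
    struct flow_item *it = &table[h1][h2];

    if (it->used &&
        it->key.saddr == k->saddr && it->key.daddr == k->daddr &&
        it->key.sport == k->sport && it->key.dport == k->dport &&
        it->key.proto == k->proto)
        return it->cached;   /* hit: routing/NAT/MAC data ready to use */
    return NULL;             /* miss: take the slow path once */
}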





Final statement





This article is only a forwarding tuning scheme for Linux. If you need deeper optimization, look to ASIC and NP hardware solutions, and do not use a bus topology.
