Reception of network packets in the Linux kernel - Part I: concepts and frameworks


Unlike sending, receiving network packets is asynchronous: you never know when someone will suddenly send you a packet. The receive logic therefore consists of two things:
1. Notification that a packet has arrived
2. Receiving the notification and fetching the data from the packet
These two things happen at the two ends of the protocol stack, i.e. at the NIC/protocol stack boundary and at the protocol stack/application boundary:
NIC/protocol stack boundary: the NIC signals packet arrival and interrupts the protocol stack to receive the packet;
Protocol stack/application boundary: the protocol stack places the packet into the socket's queue and notifies the application that data is readable; the application is responsible for reading the data.


This article introduces the details of what happens at these two boundaries, touching on NIC interrupts, NAPI, NIC polling, select/poll/epoll, and so on. It assumes you already have a basic understanding of these topics.

The NIC/protocol stack boundary event

When a packet arrives, the NIC raises an interrupt, and that is how the protocol stack learns of the packet-arrival event. What happens next is entirely up to the protocol stack; that is the job of the NIC interrupt handler. Of course, interrupts are not the only option: a dedicated thread could continuously poll the NIC for incoming packets, but that wastes far too much CPU on busy work, so the approach has essentially been abandoned. For an asynchronous event like this, interrupt-based notification is the natural scheme. The overall receive logic can be broadly divided into the following two approaches:
A. Each arriving packet interrupts the CPU, which schedules the interrupt handler to process it. The receive logic is split into a top half and a bottom half, and the core protocol stack processing is completed in the bottom half.
B. A packet arrives and interrupts the CPU; the CPU schedules the interrupt handler, which disables further interrupt responses and schedules a bottom half that keeps polling the NIC. Once the packets have been drained, or a threshold is reached, interrupts are re-enabled.
Approach A causes severe performance damage when packets keep arriving at high rates, so in that situation approach B is generally used; this is Linux's NAPI.
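To make approach B concrete, here is a minimal sketch of how a hypothetical driver might wire up NAPI. The driver-specific names (my_priv, my_irq_handler, my_poll, my_disable_irq, my_enable_irq, my_receive_one_packet) and the weight of 64 are assumptions for illustration, not real kernel symbols; exact NAPI helper signatures also vary across kernel versions.

    /* Minimal NAPI sketch; assumed driver-specific helpers, not a real driver. */
    #include <linux/netdevice.h>
    #include <linux/interrupt.h>

    struct my_priv {
        struct napi_struct napi;
        struct net_device *dev;
    };

    static irqreturn_t my_irq_handler(int irq, void *data)
    {
        struct my_priv *priv = data;

        my_disable_irq(priv);              /* assumed: mask the NIC's rx interrupt */
        napi_schedule(&priv->napi);        /* defer the real work to the poll routine */
        return IRQ_HANDLED;
    }

    static int my_poll(struct napi_struct *napi, int budget)
    {
        struct my_priv *priv = container_of(napi, struct my_priv, napi);
        int work_done = 0;

        while (work_done < budget) {
            struct sk_buff *skb = my_receive_one_packet(priv);   /* assumed helper */
            if (!skb)
                break;
            netif_receive_skb(skb);        /* hand the packet to the protocol stack */
            work_done++;
        }

        if (work_done < budget) {          /* NIC drained: stop polling, re-enable IRQs */
            napi_complete(napi);           /* newer kernels: napi_complete_done() */
            my_enable_irq(priv);           /* assumed helper */
        }
        return work_done;
    }

    /* registration, e.g. in the probe routine:
     *   netif_napi_add(dev, &priv->napi, my_poll, 64);
     */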

As for events at the NIC/protocol stack boundary, I do not want to say much more, because that involves a great deal of hardware detail. For example, in NAPI mode, how does the NIC buffer packets once interrupts are disabled? On top of that, consider multiprocessor systems: can packets received by one NIC be steered to different CPU cores? That leads to multi-queue NICs, which are not something an ordinary kernel programmer controls; you would have to learn a lot of vendor-specific material, such as Intel's various specifications and their dizzying manuals...


The protocol stack/socket boundary event
So, to make things easier to understand, I will describe what happens at the other boundary, the protocol stack/application boundary, since this is basically the area kernel programmers and even application programmers care about. To simplify the later discussion, I rename this protocol stack/application boundary the protocol stack/socket boundary. The socket isolates the protocol stack from the application: it is an interface that represents the application to the protocol stack and represents the protocol stack to the application. When a packet arrives, the following happens:
1). The protocol stack places the packet into the socket's receive buffer queue and notifies the application that owns the socket;
2). The CPU schedules the application that owns the socket, which pulls the packet out of the receive buffer queue and consumes it.



As the above shows, each socket's receive logic involves the following elements:
Receive queue: the queue into which the protocol stack places packets it has finished processing; the application is woken up to read data from it.


Sleep queue: if an application associated with the socket finds no data readable, it can sleep on this queue. Once the protocol stack queues a packet into the socket's receive queue, the processes or threads on this sleep queue are woken up.


Socket lock: while operating on a socket's metadata, you must hold the socket lock. Note that the receive queue and the sleep queue do not need this lock for protection; the lock protects things such as changes to the socket buffer size and TCP's in-order receive state.

This model is simple and direct: the NIC interrupts the CPU to announce that a packet has arrived and needs handling, and the protocol stack notifies the application in a similar way that data is readable. Before going on to the details and to select/poll/epoll, let me first mention two loosely related topics and then leave them alone; they are relevant enough to deserve a mention, but not much space.
1. Thundering herd and exclusive wakeup
In TCP accept logic, a large web server typically has multiple processes or threads calling accept on the same listening socket at the same time. When the protocol stack queues a client socket into the accept queue, should all of these threads be woken, or just one? If all are woken, clearly only one thread will win the socket; the others fail and go back to sleep, having been woken for nothing. This is the classic TCP thundering herd. Hence exclusive wakeup was introduced: only the first thread on the sleep queue is woken, and the wakeup logic then exits without waking the rest. This avoids the thundering herd.
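As a reference point, exclusive wakeup is what the kernel's wait-queue API exposes through prepare_to_wait_exclusive(): wake_up() wakes all non-exclusive waiters but only one exclusive waiter. Below is a minimal waiter sketch, assuming a hypothetical condition helper my_connection_ready(); sk_sleep() is the accessor from newer kernels and names vary by version.

    /* Sketch: waiting exclusively on a socket's sleep queue.
     * my_connection_ready() is an assumed helper, for illustration only. */
    #include <linux/wait.h>
    #include <linux/sched.h>
    #include <net/sock.h>

    static int wait_for_connection(struct sock *sk)
    {
        DEFINE_WAIT(wait);
        int err = 0;

        while (!my_connection_ready(sk)) {                     /* assumed condition */
            prepare_to_wait_exclusive(sk_sleep(sk), &wait,
                                      TASK_INTERRUPTIBLE);     /* queued as exclusive */
            if (!my_connection_ready(sk))
                schedule();                                    /* sleep until woken */
            if (signal_pending(current)) {
                err = -EINTR;
                break;
            }
        }
        finish_wait(sk_sleep(sk), &wait);
        return err;
    }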
This topic has been discussed endlessly online, but if you think carefully you will find that exclusive wakeup still has a problem: it can greatly reduce efficiency.


Why do I say that? The protocol stack's wakeup operation and the application's actual accept are completely asynchronous: unless the application happens to be blocked in accept at the moment the protocol stack wakes it, nobody can guarantee what the application is doing. A simple example: on a multi-core system, the protocol stack handles several connection requests at the same time, and several threads happen to be waiting on the sleep queue. It would be ideal if the protocol-stack instances running in parallel could wake multiple threads at once, but because a socket has only one accept queue, the queue's exclusive-wakeup mechanism throws that chance away: the single accept queue's enqueue/dequeue lock serializes the whole process and loses the benefit of multi-core parallelism. This is why reuseport, and the fasttcp built on top of it, emerged.

(This weekend I studied in detail what the Linux kernel 4.4 update brings in this area; it was truly eye-opening. I will write a separate article to describe it.)
2. Reuseport and multiple queues
Before I first learned of Google's reuseport, I had personally written a similar patch. The idea came from an analogy with multi-queue NICs: since one NIC can interrupt multiple CPUs, why can't a data-readable event on one socket "interrupt" multiple applications? However, the socket API had long been fixed, which dampened the idea, because a socket is a file descriptor representing a five-tuple (except for unconnected UDP sockets and listening TCP sockets), and a protocol-stack event relates only to a single five-tuple... So, to make the idea workable, the only place to play tricks was the socket API itself: allow multiple sockets to bind the same IP address/port pair, and then distribute traffic among them according to a hash of the source IP address/port. I implemented that idea too, and it is exactly the same idea as a multi-queue NIC, which likewise hashes some n-tuple to deliver interrupts to different CPU cores. Looking back, transplanting the idea felt rather slick. But then I saw Google's reuseport patch and realized I had painstakingly reinvented the wheel... So I turned to the problem of the single accept queue: in the multi-core era, why not maintain one accept queue per CPU core and leave application scheduling to the scheduler subsystem? This time I was not so naive, and then I saw Sina's fasttcp plan.
Of course, because reuseport's hash is computed over the source IP/source port pair, it naturally avoids "spraying" packets of the same flow into the receive queues of different sockets.
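For completeness, here is a minimal user-space sketch of how a server might opt in to reuseport; port 8080, the backlog of 128, and the error handling are arbitrary choices for illustration, and SO_REUSEPORT requires Linux 3.9+ and a libc that defines it.

    /* Minimal SO_REUSEPORT usage sketch: each worker process/thread creates its own
     * listening socket bound to the same address/port; the kernel hashes incoming
     * connections across them. */
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <stdio.h>
    #include <stdlib.h>

    static int make_listener(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        struct sockaddr_in addr;

        if (fd < 0 ||
            setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0) {
            perror("socket/setsockopt");
            exit(1);
        }

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);           /* arbitrary example port */

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(fd, 128) < 0) {
            perror("bind/listen");
            exit(1);
        }
        return fd;
    }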



Well, the digression is over; now for the details.



Management of the receive queue
Managing the receive queue is actually very easy: it is just a list of skbs. The protocol stack locks the queue, inserts the skb, and then wakes up the threads on the socket's sleep queue. A woken thread then takes the lock and pulls the skb data off the socket's receive queue. It is that simple.


At least, that is how it was done in the 2.6.8 kernel.
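As a rough sketch of that early model, the reading side looks roughly like the following; this is a simplification, not real kernel code: wait_for_data() stands in for the sleep-on-the-sleep-queue logic, and skb_copy_datagram_iovec() is the old-kernel copy helper (newer kernels use skb_copy_datagram_msg()).

    /* Simplified sketch of the early (2.6.8-era) receive path, reader side. */
    static int simple_recv(struct sock *sk, struct msghdr *msg, size_t len)
    {
        struct sk_buff *skb;
        int err;

        lock_sock(sk);                                   /* serialize socket users */
        while ((skb = skb_peek(&sk->sk_receive_queue)) == NULL) {
            release_sock(sk);
            err = wait_for_data(sk);                     /* assumed: sleep on the sleep queue */
            if (err)
                return err;
            lock_sock(sk);
        }

        err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, len);  /* old-kernel helper */
        if (!err) {
            __skb_unlink(skb, &sk->sk_receive_queue);    /* remove from the receive queue */
            kfree_skb(skb);                              /* runs the destructor, returns memory */
        }
        release_sock(sk);
        return err ? err : (int)len;
    }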

Later kernel versions optimized this baseline; it has gone through two rounds of optimization.

Receive path optimization 1: introducing the backlog queue
There are complicated details to deal with here, such as adjusting the socket buffer size according to the amount of data received. The application has to lock the entire socket when it calls the recv routine; in a complex multi-core environment, multiple applications may operate on the same socket, and multiple protocol-stack execution paths may also be queuing skbs into the same socket receive buffer [for details see "The only way of multi-core Linux kernel path optimization - TCP optimization on multi-core platforms"], so the natural lock granularity becomes the socket itself. While the application holds the socket, the protocol stack must not sleep and wait, because it may be running in softirq context; the backlog queue was introduced so that the protocol stack neither spins nor blocks. While the application holds the socket, the protocol stack only needs to drop the skb into the backlog queue and return. So who finally processes the backlog queue?
Whoever created the problem takes care of it. Since it was the application locking the socket that forced the protocol stack to put the skb into the backlog, it is the application that, when it releases the socket, moves the backlog skbs into the receive queue, simulating the protocol stack's enqueue-and-wakeup operation.
Once the backlog queue is introduced, the single receive queue becomes a two-stage relay queue, similar to a pipeline. This way the protocol stack never has to block and wait: if it cannot queue the skb into the receive queue immediately, the holder of the socket lock will do it when it gives up the lock. The operation routines are as follows:
Protocol stack enqueuing an skb ---
    acquire the socket spin lock
    if the application occupies the socket: put the skb into the backlog queue
    if the application does not occupy the socket: queue the skb into the receive queue and wake up the sleep queue
    release the socket spin lock
Application receiving data ---
    acquire the socket spin lock
    mark the socket as occupied
    release the socket spin lock
    read the data: since the socket is now exclusively held, the receive-queue skbs can safely be copied to user space
    acquire the socket spin lock
    move the backlog-queue skbs into the receive queue (this really should have been done by the protocol stack, but was postponed to this moment because the application was occupying the socket), and wake up the sleep queue
    release the socket spin lock
As you can see, the so-called socket lock is not a simple spin lock; different paths lock in different ways. In short, it just has to guarantee that the socket's metadata is protected, and the scheme is sound. What we are looking at is a two-level lock model.
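In the kernel, this backlog hand-off is what release_sock() does; the following is a simplified sketch of its shape, with field names following older kernels (details vary by version).

    /* Simplified sketch of release_sock() (net/core/sock.c); versions differ in detail. */
    void release_sock(struct sock *sk)
    {
        spin_lock_bh(&sk->sk_lock.slock);
        if (sk->sk_backlog.tail)
            __release_sock(sk);            /* drain backlog skbs through the receive path */
        sk->sk_lock.owned = 0;             /* drop the "occupied" mark */
        if (waitqueue_active(&sk->sk_lock.wq))
            wake_up(&sk->sk_lock.wq);      /* wake tasks waiting to own the socket */
        spin_unlock_bh(&sk->sk_lock.slock);
    }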

The two-level lock framework
That was rather verbose, but we can actually distill the sequence above into a more abstract, generic pattern that can be applied in other scenarios as well.

Let me describe the pattern.
Participant categories: NON-SLEEP (non-sleeping) paths and SLEEP (sleeping) paths
Number of participants: multiple NON-SLEEP paths, multiple SLEEP paths
Competition: between NON-SLEEP paths, between SLEEP paths, and between NON-SLEEP and SLEEP paths
Data:
X - the locked entity
X.lock - a spin lock, used to lock the non-sleeping path and to protect the mark lock
X.flag - a mark lock, used to lock the sleeping path
X.sleeplist - the queue of tasks waiting to acquire the mark lock

Lock/unlock logic for the Non-sleep class:

spin_lock(X.lock);
if (X.flag == 1) {
    /* add the work to the backlog for later */
    delay_func(...);
} else {
    /* do it directly */
    direct_func(...);
}
spin_unlock(X.lock);

Lock/unlock logic for the Sleep class:
spin_lock(X.lock);
do {
    if (X.flag == 0) {
        break;
    }
    for (;;) {
        ready_to_wait(X.sleeplist);
        spin_unlock(X.lock);
        wait();
        spin_lock(X.lock);
        if (X.flag == 0) {
            break;
        }
    }
} while (0);
X.flag = 1;
spin_unlock(X.lock);

do_something(...);

spin_lock(X.lock);
if (have_delayed_work) {
    do {
        fetch_delayed_work(...);
        direct_func(...);
    } while (have_delayed_work);
}
X.flag = 0;
wakeup(X.sleeplist);
spin_unlock(X.lock);


For socket receive logic, "insert the skb into the receive queue and wake up the socket's sleep queue" is what fills the role of direct_func above, while delay_func's job is to insert the skb into the backlog queue.
The abstract model is essentially a two-level lock: the spin lock on the sleep path is used only to protect the mark bit, and the sleep path locks with the mark bit rather than holding the spin lock itself; changes to the mark bit are in turn protected by the spin lock. This very fast mark operation replaces locking the whole slow business-logic path (for example, socket receive processing), which greatly reduces the CPU time spent spinning under contention.
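On the protocol-stack side, the kernel's TCP input path follows this pattern roughly as sketched below; the shape is modeled on tcp_v4_rcv(), and the exact signatures of sk_add_backlog()/sk_backlog_rcv() vary across kernel versions.

    /* Sketch of the non-sleeping (softirq) side, modeled on the TCP input path. */
    int stack_deliver(struct sock *sk, struct sk_buff *skb)
    {
        int rc = 0;

        bh_lock_sock(sk);                    /* the spin lock: X.lock */
        if (!sock_owned_by_user(sk)) {       /* X.flag == 0: nobody owns the socket */
            rc = sk_backlog_rcv(sk, skb);    /* direct_func: process/enqueue and wake readers */
        } else {
            sk_add_backlog(sk, skb);         /* delay_func: park the skb on the backlog */
        }
        bh_unlock_sock(sk);

        return rc;
    }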

I recently used this model in a real-world scenario with very good results, which is why I took the trouble to abstract it into the code above.
Introducing this two-level lock frees the non-sleeping path: while a sleep-path task occupies the socket, the non-sleeping path can still drop packets into the backlog queue instead of waiting for the sleep-path task to unlock. But sometimes the logic on the sleep path is not that slow; if it is fast, even very fast, so that the lock is held only briefly, could it simply contend for the spin lock directly with the non-sleeping path? This is where the sleep-path fast lock comes in.



Receive path optimization 2: introducing the fast lock
Socket processing logic running in process/thread context competes directly with the kernel protocol stack for the socket's spin lock when both of the following hold:
a. The critical section is very small
b. No other process/thread-context socket processing logic is currently handling this socket.


When these conditions are met, the environment is simple and the competitors are on equal footing. The obvious question, who then processes the backlog queue, is in fact not a problem, because in this situation nothing can reach the backlog: pushing into the backlog requires holding the spin lock, and during a fast lock the process context holds that same spin lock, so the two paths are completely mutually exclusive!

Condition a above is therefore extremely important: if the critical section had a large delay, the protocol-stack path would spin excessively. The new fast-lock framework looks like this:

Fast lock/unlock logic for the Sleep class:

fast = 0;
spin_lock(X.lock);
do {
    if (X.flag == 0) {
        fast = 1;        /* fast path: keep holding the spin lock through the small critical section */
        break;
    }
    for (;;) {
        ready_to_wait(X.sleeplist);
        spin_unlock(X.lock);
        wait();
        spin_lock(X.lock);
        if (X.flag == 0) {
            break;
        }
    }
    X.flag = 1;
    spin_unlock(X.lock);
} while (0);

do_something_very_small(...);

do {
    if (fast == 1) {
        break;           /* fast path: the spin lock was never released, so no backlog work exists */
    }
    spin_lock(X.lock);
    if (have_delayed_work) {
        do {
            fetch_delayed_work(...);
            direct_func(...);
        } while (have_delayed_work);
    }
    X.flag = 0;
    wakeup(X.sleeplist);
} while (0);
spin_unlock(X.lock);


The code above is more complex than a plain spin_lock/spin_unlock because X.flag may already be 1, which means the socket is already being processed by someone else; in that case we still have to fall back to blocking and waiting.
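For reference, mainline kernels expose this idea as lock_sock_fast()/unlock_sock_fast() (added around 2.6.34); a minimal usage sketch:

    /* Usage sketch of the kernel's fast socket lock. lock_sock_fast() returns true
     * if it had to fall back to the slow path (the socket was owned), and
     * unlock_sock_fast() must be told which case happened. */
    bool slow = lock_sock_fast(sk);

    /* ... a very small critical section, e.g. peeking at the receive queue ... */

    unlock_sock_fast(sk, slow);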



The above is the overall architecture of the queues and locks for asynchronous processing at the protocol stack/socket boundary. To sum up, it consists of 5 elements:
A = the socket's receive queue
B = the socket's sleep queue
C = the socket's backlog queue
D = the socket's spin lock
E = the socket's occupied mark
The receive flow described in the preceding sections runs among these five elements.
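As a rough mapping to the actual data structure, these five elements correspond to fields of struct sock; the grouping below is purely illustrative (not a real kernel struct), uses 2.6-era field names, and newer kernels differ (for example, sk_sleep became a struct socket_wq).

    /* Rough correspondence between the five elements and struct sock fields. */
    struct sock_receive_view {                   /* illustrative grouping only */
        struct sk_buff_head  sk_receive_queue;   /* A: receive queue */
        wait_queue_head_t   *sk_sleep;           /* B: sleep queue */
        struct sk_buff_head  sk_backlog;         /* C: backlog queue (really a head/tail pair in struct sock) */
        spinlock_t           slock;              /* D: sk->sk_lock.slock */
        int                  owned;              /* E: sk->sk_lock.owned, the "occupied" mark */
    };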


With this framework in mind, the interface between the protocol stack and the socket can transfer network data safely and asynchronously. If you look at it closely, know enough about the wakeup mechanism of the Linux 2.6 kernel, and have some feel for decoupling, then you should already be able to work out how select/poll/epoll operates.

I will describe that in the second part of this article. I believe that, with a sufficient understanding and mastery of the basic concepts, many things can be deduced by thought alone.
Next, let the skb itself take part in the framework described above.



Relaying the skb
In the Linux protocol stack implementation, an skb represents a packet. An skb can belong either to a socket or to the protocol stack, but never to both at once. An skb belonging to the protocol stack means it is not associated with any socket and the stack alone is responsible for it; an skb belonging to a socket means it has been bound to that socket, and all operations on it are that socket's responsibility.


Linux gives the skb a destructor. Whenever the skb is handed over to a new owner, the previous owner's destructor is invoked and a new destructor is assigned. We care most about the final leg of the relay, from the protocol stack to the socket: before the skb is queued into the socket's receive queue, the following function is called:

static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
{
    skb_orphan(skb);
    skb->sk = sk;
    skb->destructor = sock_rfree;
    atomic_add(skb->truesize, &sk->sk_rmem_alloc);
    sk_mem_charge(sk, skb->truesize);
}

skb_orphan mainly invokes the destructor that the previous owner attached to the skb and then assigns it the new destructor callback, sock_rfree. Once skb_set_owner_r has completed, the skb formally enters the socket's receive queue:
skb_set_owner_r(skb, sk);

/* Cache the SKB length before we tack it onto the receive
 * queue. Once it is added it no longer belongs to us and
 * could be freed by other threads of control pulling packets
 * from the queue.
 */
skb_len = skb->len;

skb_queue_tail(&sk->sk_receive_queue, skb);

if (!sock_flag(sk, SOCK_DEAD))
    sk->sk_data_ready(sk, skb_len);

Finally, sk_data_ready is called to notify the tasks on the socket's sleep queue that data has been queued into the receive queue; it is really just a wakeup operation. The protocol stack then returns.
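For reference, the default sk_data_ready callback is sock_def_readable(), which is essentially just that wakeup; below is a simplified sketch modeled on older kernels (newer kernels use sk_wq and a sync-poll wakeup, and the async-notification constant differs by version).

    /* Simplified sketch of sock_def_readable(); field names from 2.6-era kernels. */
    static void sock_def_readable(struct sock *sk, int len)
    {
        read_lock(&sk->sk_callback_lock);
        if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
            wake_up_interruptible(sk->sk_sleep);        /* wake readers on the sleep queue */
        sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);    /* SIGIO for async (FASYNC) users */
        read_unlock(&sk->sk_callback_lock);
    }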

It is quite clear that the remaining processing of the skb happens entirely in process/thread context: once its data has been read, the skb is not handed back to the protocol stack but is freed by the process/thread itself. So its destructor callback, sock_rfree, mainly gives the buffer space back, doing two things (put together in the sketch after the list):
1. Subtract the space occupied by the skb from the socket's allocated receive memory:
sk->sk_rmem_alloc = sk->sk_rmem_alloc - skb->truesize;
2. Add the space occupied by the skb back to the socket's pre-allocated space:
sk->sk_forward_alloc = sk->sk_forward_alloc + skb->truesize;
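Put together, sock_rfree() looks roughly like this; a simplified sketch, where sk_mem_uncharge() is the helper that credits sk_forward_alloc.

    /* Simplified sketch of sock_rfree() (net/core/sock.c). */
    void sock_rfree(struct sk_buff *skb)
    {
        struct sock *sk = skb->sk;

        atomic_sub(skb->truesize, &sk->sk_rmem_alloc);  /* 1: shrink allocated rcv memory */
        sk_mem_uncharge(sk, skb->truesize);             /* 2: give space back to sk_forward_alloc */
    }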

Accounting for and limiting a protocol's packet memory usage
The kernel protocol stack is only one subsystem of the kernel, and its data comes from outside the machine; the data source is uncontrolled and very easy to DDoS, so the overall memory usage of a protocol has to be limited. For example, all TCP connections together might only be allowed 10 MB of memory. The Linux kernel initially did this accounting only for TCP; statistics and limits for UDP were added later. The configuration shows up as a few sysctl parameters:
net.ipv4.tcp_mem = 18978 25306 37956
net.ipv4.tcp_rmem = 4096 87380 6291456
net.ipv4.tcp_wmem = 4096 16384 4194304
net.ipv4.udp_mem = 18978 25306 37956
....
Each of these takes three values, with the following meanings:
First value mem[0]: normal; as long as memory usage stays below this value, everything is fine;
Second value mem[1]: the warning level; above this value, the protocol starts tightening its belt;
Third value mem[2]: the hard limit; above this value, memory usage is over quota and data will be dropped.
(For tcp_mem and udp_mem these values are counted in pages, while tcp_rmem and tcp_wmem are in bytes.)
Note that these settings limit a protocol as a whole; the rcvbuf configured via setsockopt limits the buffer of a single connection, which is a different thing. When the kernel enforces the per-protocol limit, it uses a pre-allocation mechanism: on the first charge, even for a 1-byte packet, it "overdrafts" a whole page worth of quota to avoid checking on every packet. No actual memory is allocated here, since the real allocation was already decided when the skb was created and when IP fragments were reassembled; this is purely bookkeeping, adding up values and checking whether the limit is exceeded. So the logic here is just arithmetic, and apart from the CPU spent on the calculation it consumes no other machine resources.

The calculation works roughly as follows:
proto.memory_allocated: one per protocol; how much memory the protocol is currently using across kernel socket buffers to store skbs;
sk.sk_forward_alloc: one per socket; how much memory is currently pre-allocated to this socket and available for storing skbs;
skb.truesize: the size of the skb structure itself plus the size of its data.
Just before an skb enters the socket's receive queue, the following accounting routine runs:

ok = 0;
tmp = proto.memory_allocated;
if (skb.truesize < sk.sk_forward_alloc) {
    ok = 1;
    goto addload;
}
pages = how_many_pages(skb.truesize);
tmp = atomic_add(proto.memory_allocated, pages * PAGE_SIZE);
if (tmp < mem[0]) {
    ok = 1;
    /* normal */
}
if (tmp > mem[1]) {
    ok = 2;
    /* tight: start the austerity measures */
}
if (tmp > mem[2]) {
    /* overrun: over the hard limit */
}
if (ok == 2) {
    if (do_something(proto)) {
        ok = 1;
    }
}
addload:
if (ok == 1) {
    sk.sk_forward_alloc = sk.sk_forward_alloc - skb.truesize;
    proto.memory_allocated = tmp;
} else {
    /* drop the skb */
}
When the skb is released by the socket, its destructor runs and sk.sk_forward_alloc is expanded again:
sk.sk_forward_alloc = sk.sk_forward_alloc + skb.truesize;

The protocol's quota is reclaimed (when an skb is released, or when an expired skb is deleted):
if (sk.sk_forward_alloc > PAGE_SIZE) {
    pages = sk.sk_forward_alloc rounded down to whole pages;
    proto.memory_allocated = proto.memory_allocated - pages * PAGE_SIZE;
}

This logic can be seen in the sk_mem_xxx family of functions, for example sk_rmem_schedule.


The first part of this article ends here; the second part will focus on describing the logic of select, poll, and epoll.

