Reception of network packets in the Linux kernel - Part I: concepts and frameworks


Unlike sending packets, receiving packets is asynchronous: you never know when someone will suddenly send you one. The receive-side logic therefore consists of two things:
1. Notification that a packet has arrived
2. Acting on that notification and fetching the data from the packet
These two events occur at the two ends of the protocol stack, namely the NIC/protocol stack boundary and the protocol stack/application boundary:
NIC/protocol stack boundary: the NIC signals packet arrival and interrupts the protocol stack so that it receives the packet;
Protocol stack/application boundary: the protocol stack puts the packet into the socket's queue and notifies the application that data is readable; the application is then responsible for reading the data.
This article describes in some detail what happens at these two boundaries, touching on NIC interrupts, NAPI, NIC polling, select/poll/epoll and so on, and it assumes you already have some familiarity with these.

NIC/protocol stack boundary events
When a packet arrives, the NIC raises an interrupt, which is how the protocol stack learns of the arrival; how the packet is then received is entirely up to the protocol stack, and that is the job of the NIC interrupt handler. Interrupts are not strictly necessary: a dedicated thread could poll the NIC for incoming packets instead, but that consumes far too much CPU for too little work, so the approach has essentially been abandoned; for asynchronous events like this, interrupt-based notification is the usual scheme. The overall receive logic can be divided roughly into two modes:
A. Every packet arrival interrupts the CPU, which dispatches the interrupt handler to process the packet; the receive logic is split into a top half and a bottom half, and the core protocol stack processing is completed in the bottom half.
B. A packet arrival interrupts the CPU; the CPU dispatches the interrupt handler, disables further interrupt delivery, and schedules the bottom half to keep polling the NIC; once the pending packets are drained or a threshold is reached, interrupts are re-enabled.
Mode A causes serious performance damage when packets keep arriving at a high rate, so in that case mode B is generally used, and this is also the approach taken by Linux NAPI.
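To make mode B more concrete, below is a minimal sketch of the NAPI pattern a driver follows. Only netif_napi_add(), napi_schedule(), napi_complete() and netif_receive_skb() are real kernel APIs; the mynic_* device helpers are hypothetical stubs, and exact signatures (for example the weight argument of netif_napi_add()) vary between kernel versions.

#include <linux/interrupt.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct mynic_priv {                              /* hypothetical per-device state */
    struct napi_struct napi;
    /* ring buffers, registers, ... */
};

/* hypothetical device helpers, stubbed out for the sketch */
static void mynic_disable_rx_irq(struct mynic_priv *p) { }
static void mynic_enable_rx_irq(struct mynic_priv *p) { }
static int  mynic_rx_pending(struct mynic_priv *p) { return 0; }
static struct sk_buff *mynic_fetch_skb(struct mynic_priv *p) { return NULL; }

static irqreturn_t mynic_interrupt(int irq, void *dev_id)
{
    struct mynic_priv *priv = dev_id;

    mynic_disable_rx_irq(priv);                  /* mask further RX interrupts */
    napi_schedule(&priv->napi);                  /* defer the real work to the poll routine */
    return IRQ_HANDLED;
}

static int mynic_poll(struct napi_struct *napi, int budget)
{
    struct mynic_priv *priv = container_of(napi, struct mynic_priv, napi);
    int done = 0;

    while (done < budget && mynic_rx_pending(priv)) {
        struct sk_buff *skb = mynic_fetch_skb(priv);
        netif_receive_skb(skb);                  /* hand the packet to the stack */
        done++;
    }

    if (done < budget) {                         /* ring drained below the per-poll budget */
        napi_complete(napi);
        mynic_enable_rx_irq(priv);               /* unmask RX interrupts again */
    }
    return done;
}

/* at probe time: netif_napi_add(netdev, &priv->napi, mynic_poll, 64); */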

As for the events at the NIC/protocol stack boundary, I do not want to say much more, because that would drag in a lot of hardware detail, for example how the NIC buffers packets internally after interrupts are disabled in NAPI mode, or, on a multi-core machine, whether packets received by one NIC can interrupt different CPU cores, which leads to multi-queue NICs. None of this is something an ordinary kernel programmer can control; you would need to study vendor-specific material, such as Intel's assorted specifications and those manuals that make your head spin...


Protocol stack/socket boundary events
So, to make things easier to understand, I have decided to describe the same kind of thing at the other boundary, the protocol stack/application boundary, which is what kernel programmers, and even application programmers, actually care about. To keep the discussion easier to follow I will rename it the protocol stack/socket boundary. The socket isolates the protocol stack from the application: it is an interface that represents the application to the protocol stack, and the protocol stack to the application. When a packet arrives, the following happens:
1). The protocol stack places the packet into the socket's receive buffer queue and notifies the application that owns the socket;
2). The CPU schedules the application that holds the socket, which pulls the packet out of the receive buffer queue, completing the receive.
The overall picture is shown below.



As the figure shows, each socket's receive logic contains the following elements (a struct sock field sketch follows this list):
Receive queue: the queue into which the protocol stack places the packets it has finished processing; the application is woken up to read data from it.
Sleep queue: if there is no data to read, the application associated with the socket can sleep on this queue; as soon as the protocol stack queues a packet onto the socket's receive queue, the processes or threads on this sleep queue are woken up.
Socket lock: while an execution flow operates on the socket's metadata, the socket must be locked. Note that the receive queue and the sleep queue do not need this lock for protection; the lock protects things like the socket buffer size and, for TCP, in-order receive state.
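These elements map onto fields of the kernel's struct sock. The following is a heavily abridged sketch of the relevant fields as they look in 2.6-era kernels; the field names are real, but most members are omitted and the exact layout varies by version.

/* Abridged sketch of struct sock, showing only the fields discussed here. */
struct sock {
    socket_lock_t       sk_lock;            /* "socket lock": spinlock + owned flag + lock wait queue */
    struct sk_buff_head sk_receive_queue;   /* receive queue: skbs ready for the application */
    struct {
        struct sk_buff *head;
        struct sk_buff *tail;
    } sk_backlog;                           /* backlog queue: skbs parked while a user holds the socket */
    wait_queue_head_t  *sk_sleep;           /* sleep queue: readers block here waiting for data */
    atomic_t            sk_rmem_alloc;      /* receive-buffer memory currently in use */
    int                 sk_forward_alloc;   /* memory pre-allocated (charged) to this socket */
    void              (*sk_data_ready)(struct sock *sk, int bytes); /* "data readable" callback */
    /* ... many other members ... */
};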

This model is simple and direct: the NIC interrupts the CPU to announce that a packet has arrived on the wire, and the protocol stack notifies the application that there is data to read. Before continuing with the details and with select/poll/epoll, let me mention two side topics and then drop them; they are related, so they deserve a mention, but they do not need much space.
1. Thundering herd and exclusive wakeup. In the TCP accept logic of a large web server, there are usually multiple processes or threads blocked on the same listening socket at the same time. When the protocol stack queues a client socket onto the accept queue, should all of these threads be woken, or only one? If all of them are woken, clearly only one thread will win the socket; the others fail and go back to sleep, having been woken for nothing. This is the classic TCP thundering herd, and exclusive wakeup was introduced to avoid it: only the first thread on the sleep queue is woken, after which the wakeup logic exits instead of waking the threads behind it.
This topic has been written about endlessly on the Internet, but if you think it through you will find that exclusive wakeup still has a problem, one that can cost a great deal of efficiency.
Why? Because the protocol stack's wakeup and the application's actual accept are completely asynchronous; unless the application happens to be blocked in accept at the moment the protocol stack wakes it, nothing guarantees the accept happens promptly. A simple example: on a multi-core system, several requests arrive in the protocol stack at the same time, and several threads happen to be waiting on the sleep queue. It would be nice if the several protocol stack execution flows could each wake up a thread, but since a socket has only one accept queue, the exclusive wakeup mechanism on that queue largely kills the idea: the lock operations around the single accept queue serialize the whole process, and the advantage of multi-core parallelism is completely lost. This is why REUSEPORT, and the fasttcp-style schemes built on it, emerged. (This weekend I studied the changes brought by Linux kernel 4.4 carefully; they are genuinely eye-opening, and I will describe them in a separate article.)
2. REUSEPORT and multi-queue. Before I learned about Google's REUSEPORT, I had done a similar patch myself, and the idea came from an analogy with multi-queue NICs: if one NIC can interrupt multiple CPUs, why can't a data-readable event on one socket "interrupt" multiple applications? However, the socket API had long been fixed, which dealt my idea a blow, because a socket is a file descriptor representing a five-tuple (unconnected UDP sockets and listening TCP sockets excepted!), and the protocol stack's events relate only to a single five-tuple... So to make the idea workable, one can only work outside the socket API: allow multiple sockets to bind the same IP address/port pair, and then route traffic among them by a hash of the source IP address/port pair. I implemented this idea as well; it is really the same thought as a multi-queue NIC, exactly the same. Don't multi-queue NICs likewise hash a five-tuple (or n-tuple?) to steer interrupts to different CPU cores? Thinking about it carefully, the porting work was painful, and when I later saw Google's REUSEPORT patch I felt I had not tried hard enough and had reinvented the wheel... That still left the problem of the single accept queue: since we are in the multi-core era, why not maintain one accept queue per CPU core and leave application scheduling to the scheduler subsystem? This time the idea was not naive, and then I saw Sina's fasttcp scheme.
Of course, if REUSEPORT's hash is computed over the source IP/source port pair, it directly avoids "interrupting" packets of the same flow onto the receive queues of different sockets.
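For reference, a minimal user-space sketch of SO_REUSEPORT (available since Linux 3.9): several processes each create and bind their own listening socket to the same address/port, and the kernel hashes incoming connections across them. The port number is arbitrary and error handling is omitted for brevity.

#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int one = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;

    /* allow several sockets (one per process) to bind the same ip:port */
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);               /* example port */
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, 128);

    for (;;) {
        int c = accept(fd, NULL, NULL);        /* this process's own accept queue */
        if (c >= 0) {
            printf("pid %d accepted a connection\n", (int)getpid());
            close(c);
        }
    }
}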

Well, that is the end of the digression; now for the details.

Receive queue management
Managing the receive queue is actually very simple: it is just a linked list of skbs. The protocol stack takes the queue's own lock, inserts the skb, and then wakes up the threads on the socket's sleep queue; a woken thread takes the lock in turn and pulls the skb data off the socket's receive queue. It is that simple.
At least, that is how it was done in the 2.6.8 kernel. Later versions are optimized variants of this basic scheme, and the optimization came in two steps.
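As a reminder of what that basic version amounts to, here is a minimal pseudo-kernel sketch; it is illustrative only (the real skb_set_owner_r()/sk_data_ready() code appears later in this article).

#include <net/sock.h>

/* Basic, unoptimized receive path: append the skb and wake up readers. */
static int basic_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
{
    skb_queue_tail(&sk->sk_receive_queue, skb);  /* locks the list internally */
    sk->sk_data_ready(sk, skb->len);             /* wake anyone sleeping on the socket */
    return 0;
}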

Receive path optimization 1: introducing the backlog queue
To handle more complex details, such as adjusting the socket buffer size based on the received data, the application has to lock the whole socket when it calls the recv routine. In a complex multi-core environment, several application threads may operate on the same socket, and several protocol stack execution flows may be queueing skbs into the same socket's receive buffer [for details see "The only way of multi-core Linux kernel path optimization - TCP optimization on multi-core platforms"], so the granularity of the lock naturally becomes the socket itself. While the application holds the socket, the protocol stack is not allowed to sleep, because it may be running in softirq context. So that the protocol stack execution flow does not spin or block either, a backlog queue was introduced: while the application holds the socket, the protocol stack simply puts the skb into the backlog queue and returns. So who ends up processing the backlog queue?
Whoever caused it. It was because the application locked the socket that the protocol stack had to put the skb into the backlog, so when the application releases the socket, it moves the skbs from the backlog queue into the receive queue, performing the enqueue-and-wakeup on behalf of the protocol stack.
Once the backlog queue is introduced, the single receive queue becomes a two-stage relay queue, rather like a pipeline. Either way, the protocol stack never has to block and wait: if it cannot put the skb into the receive queue immediately, the job is handed to whoever owns the socket lock, to be done when that owner gives the lock up. The operation routines are as follows:
Protocol stack enqueueing an skb:
    Acquire the socket spinlock
        If the application occupies the socket: put the skb into the backlog queue
        If the application does not occupy the socket: put the skb into the receive queue and wake up the sleep queue
    Release the socket spinlock

Application receiving data:
    Acquire the socket spinlock
        Mark the socket as occupied
    Release the socket spinlock
    Read data: since the socket is now exclusively held, the contents of the skbs on the receive queue can safely be copied to user space
    Acquire the socket spinlock
        Move the skbs from the backlog queue into the receive queue (this should really have been done by the protocol stack, but was postponed to this moment because the application occupied the socket); wake up the sleep queue
    Release the socket spinlock
As you can see, the so-called socket lock is not a simple spinlock; different paths lock it in different ways. In short, as long as the socket's metadata is protected, the scheme is sound, and what we end up with is a two-layer lock model.
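In the mainline kernel, these two paths are expressed through bh_lock_sock()/sock_owned_by_user()/sk_add_backlog() on the protocol stack side and lock_sock()/release_sock() on the process side. The following is a hedged sketch of how a tcp_v4_rcv-style caller uses them; the two *_placeholder helpers are hypothetical stand-ins, and exact signatures (for example the limit argument later added to sk_add_backlog()) vary across kernel versions.

#include <net/sock.h>

/* hypothetical stand-ins for the real processing */
static void process_skb_now_placeholder(struct sock *sk, struct sk_buff *skb) { /* queue to sk_receive_queue, wake readers */ }
static void copy_to_user_placeholder(struct sock *sk) { /* the recvmsg copy loop */ }

/* Protocol stack side (softirq context): the non-sleeping path. */
static void stack_side_rcv(struct sock *sk, struct sk_buff *skb)
{
    bh_lock_sock(sk);                            /* the socket spinlock */
    if (!sock_owned_by_user(sk))
        process_skb_now_placeholder(sk, skb);    /* nobody holds the socket: handle it directly */
    else
        sk_add_backlog(sk, skb);                 /* socket is held: park the skb on the backlog */
    bh_unlock_sock(sk);
}

/* Process side: the sleeping path. lock_sock() sets the "owned by user" flag;
 * release_sock() replays the backlog through the socket's backlog_rcv callback
 * and wakes up anyone waiting for the lock. */
static void process_side_recv(struct sock *sk)
{
    lock_sock(sk);
    copy_to_user_placeholder(sk);                /* safe: we exclusively own the socket */
    release_sock(sk);
}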

The two-layer lock framework
Spelled out in words, this is rather verbose, so let us condense the sequence above into a more abstract, generic pattern that can be applied in other scenarios as well. The pattern is described as follows.
Participant categories: non-sleep (non-sleeping) class, sleep (sleeping) class
Number of participants: several of the non-sleep class, several of the sleep class
Competition: non-sleep vs. non-sleep, sleep vs. sleep, non-sleep vs. sleep
Data:
X           - the entity being locked
X.lock      - a spinlock, used to lock the non-sleeping path and to protect the flag lock
X.flag      - a flag lock, used to lock the sleeping path
X.sleeplist - the queue of tasks waiting to acquire the flag lock

Lock/unlock logic for the Non-sleep class:
spin_lock(X.lock);
if (X.flag == 1) {
    /* add something to do to the backlog */
    delay_func(...);
} else {
    /* do it directly */
    direct_func(...);
}
spin_unlock(X.lock);

Lock/unlock logic for the Sleep class:
spin_lock(X.lock);
do {
    if (X.flag == 0) {
        break;
    }
    for (;;) {
        ready_to_wait(X.sleeplist);
        spin_unlock(X.lock);
        wait();
        spin_lock(X.lock);
        if (X.flag == 0) {
            break;
        }
    }
} while (0);
X.flag = 1;
spin_unlock(X.lock);

do_something(...);

spin_lock(X.lock);
if (have_delayed_work) {
    do {
        fetch_delayed_work(...);
        direct_func(...);
    } while (have_delayed_work);
}
X.flag = 0;
wakeup(X.sleeplist);
spin_unlock(X.lock);


For the socket receive logic, "insert the skb into the receive queue and wake up the socket's sleep queue" plays the role of direct_func above, while delay_func's job is to insert the skb into the backlog queue.
This abstract model is essentially a two-layer lock: the spinlock on the sleep path is only used to protect the flag bit; the sleep path locks via the flag bit rather than by holding the spinlock itself, and the flag update, a very fast operation under the spinlock, replaces fully locking the slow business-logic path (such as the socket receive path), which greatly reduces the CPU time wasted spinning under contention. I recently used this model in a real scenario and it worked very well, which is why I took the trouble to abstract out the code above.
Introducing this two-layer lock frees the non-sleeping path: while a sleep-path task occupies the socket, the non-sleeping path can still queue packets onto the backlog queue instead of waiting for the sleep-path task to unlock. But sometimes the logic on the sleep path is not that slow; if it is fast, even very fast, so that the lock hold time is very short, can it simply contend for the spinlock directly against the non-sleeping path? This is where the sleep-path fast lock comes in.

Receive path optimization 2: introducing the fast lock
The socket processing logic running in process/thread context can compete directly with the kernel protocol stack for the socket's spinlock, provided the following conditions hold:
A. The critical section is very small
B. No other process/thread context is currently processing this socket.
If these conditions are met, the environment is a simple one in which the competitors have equal status. The obvious question is who then handles the backlog queue, but that turns out to be a non-issue: the backlog cannot be touched here, because manipulating the backlog requires holding the spinlock, and during a fast lock the socket's spinlock is held as well, so the two paths are completely mutually exclusive! This is why condition A is so important: a long critical section would make the protocol stack path spin excessively. The new fast-lock framework is as follows:

Fast lock/unlock logic for the Sleep class:
fast = 0;
spin_lock(X.lock);
do {
    if (X.flag == 0) {
        fast = 1;    /* fast path: nobody owns X, keep holding the spinlock */
        break;
    }
    for (;;) {
        ready_to_wait(X.sleeplist);
        spin_unlock(X.lock);
        wait();
        spin_lock(X.lock);
        if (X.flag == 0) {
            break;
        }
    }
    X.flag = 1;
    spin_unlock(X.lock);
} while (0);

do_something_very_small(...);

do {
    if (fast == 1) {
        break;
    }
    spin_lock(X.lock);
    if (have_delayed_work) {
        do {
            fetch_delayed_work(...);
            direct_func(...);
        } while (have_delayed_work);
    }
    X.flag = 0;
    wakeup(X.sleeplist);
} while (0);
spin_unlock(X.lock);


The code is this complicated, rather than just a plain spin_lock/spin_unlock, because if X.flag is 1 the socket is already being processed by someone else, and the caller must fall back to blocking and waiting.
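Incidentally, the mainline kernel exposes essentially this idea as lock_sock_fast()/unlock_sock_fast(), used for instance on the UDP receive path for small critical sections. A hedged usage sketch follows; only the two lock functions are claimed to be real APIs, the critical-section body is a placeholder.

#include <net/sock.h>

static void tiny_critical_section(struct sock *sk, struct sk_buff *skb) { /* placeholder: must stay very small */ }

static void fast_path_example(struct sock *sk, struct sk_buff *skb)
{
    bool slow;

    slow = lock_sock_fast(sk);        /* keeps the spinlock if nobody owns the socket,
                                         otherwise falls back to the ordinary slow lock */
    tiny_critical_section(sk, skb);
    unlock_sock_fast(sk, slow);       /* plain spin unlock, or the full slow release */
}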

The above is the overall architecture of queues and locks for the asynchronous flows at the protocol stack/socket boundary. To summarize, there are five elements:
A = the socket's receive queue
B = the socket's sleep queue
C = the socket's backlog queue
D = the socket's spinlock
E = the socket's "occupied" flag
The flow among these five elements is as follows:


With this framework in place, the protocol stack and the socket can hand network data to each other safely and asynchronously. If you look at it closely, know the Linux 2.6 kernel's wakeup mechanism well enough, and have some feel for decoupling, I think you can already work out what kind of mechanism select/poll/epoll must be. I will describe that in the second part of this article; my view is that, once the basic concepts are understood and mastered well enough, many things can be derived by thought alone.
Now let us put the skb into the framework above.

Relay transfer of the skb
In the Linux protocol stack implementation, an skb represents a packet. An skb can belong either to a socket or to the protocol stack, but not to both at once. An skb belonging to the protocol stack means it is not yet associated with any socket and the stack alone is responsible for it; an skb belonging to a socket means it has been bound to that socket, and all operations on it are that socket's responsibility.
Linux gives each skb a destructor callback. Whenever an skb gets a new owner, the previous owner's destructor is invoked and a new destructor is installed. What interests us most is the final leg of the skb's journey from the protocol stack to the socket; the following function is called just before the skb is queued onto the socket's receive queue:
static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
{
    skb_orphan(skb);
    skb->sk = sk;
    skb->destructor = sock_rfree;
    atomic_add(skb->truesize, &sk->sk_rmem_alloc);
    sk_mem_charge(sk, skb->truesize);
}

The main job of skb_orphan is to invoke the destructor assigned by the skb's previous owner; afterwards sock_rfree is installed as the new destructor. Once skb_set_owner_r has completed, the skb formally enters the socket's receive queue:
skb_set_owner_r(skb, sk);

/* Cache the SKB length before we tack it onto the receive
 * queue.  Once it is added it no longer belongs to us and
 * could be freed by other threads of control pulling packets
 * from the queue.
 */
skb_len = skb->len;

skb_queue_tail(&sk->sk_receive_queue, skb);

if (!sock_flag(sk, SOCK_DEAD))
    sk->sk_data_ready(sk, skb_len);

Finally, sk_data_ready is called to notify the tasks on the socket's sleep queue that data has been queued onto the receive queue; it is really just a wakeup operation, after which the protocol stack returns. Clearly, all further processing of the skb happens in process/thread context, and once its data has been consumed the skb is not handed back to the protocol stack but is freed by the process/thread itself. So its destructor callback, sock_rfree, mainly returns buffer space to the system, and does two things (a combined sketch follows the two steps):
1. Subtract the space occupied by the skb from the memory already allocated to the socket:
sk->sk_rmem_alloc = sk->sk_rmem_alloc - skb->truesize;
2. Add the space occupied by the skb back to the socket's pre-allocated space:
sk->sk_forward_alloc = sk->sk_forward_alloc + skb->truesize;
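Taken together, a sketch of what sock_rfree amounts to; the helper names follow the post-2.6.25 memory-accounting code, and details differ slightly between kernel versions.

/* Sketch of sock_rfree, the destructor installed by skb_set_owner_r(). */
void sock_rfree(struct sk_buff *skb)
{
    struct sock *sk = skb->sk;

    atomic_sub(skb->truesize, &sk->sk_rmem_alloc);  /* step 1: give the rmem back */
    sk_mem_uncharge(sk, skb->truesize);             /* step 2: sk_forward_alloc += truesize */
}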

Per-protocol packet memory accounting and limits
The kernel protocol stack is only one subsystem of the kernel, and its data comes from outside the machine. Since the data source is not under our control, the stack is vulnerable to DDoS attacks, so it is necessary to cap the total memory usage of each protocol, for example "all TCP connections together may use only 10 MB of memory". The Linux kernel initially accounted only for TCP; accounting and limits for UDP were added later. The limits appear as a set of sysctl parameters:
net.ipv4.tcp_mem = 18978 25306 37956
net.ipv4.tcp_rmem = 4096 87380 6291456
net.ipv4.tcp_wmem = 4096 16384 4194304
net.ipv4.udp_mem = 18978 25306 37956
....
For each of the *_mem parameters, the three values mean the following:
The first value, mem[0]: the normal threshold; as long as memory usage stays below it, everything is fine;
The second value, mem[1]: the warning threshold; above it, the protocol must start tightening its usage;
The third value, mem[2]: the hard limit; above it, memory usage is over the line and data is dropped.
Note that these settings are per protocol; the receive buffer configured via setsockopt (SO_RCVBUF) limits the buffer size of a single connection, which is a different thing. When enforcing the per-protocol limit, the kernel uses a pre-allocation mechanism to avoid checking too frequently: the first time a packet arrives, even if it carries only 1 byte, a whole page is "overdrawn" against the protocol limit. No actual memory is allocated here, because actual allocation was already decided when the skb was generated and when IP fragments were reassembled; this step merely adds up the numbers and checks whether the limit has been exceeded, so the logic is just additions and subtractions, consuming nothing beyond the CPU cycles of the arithmetic.

The bookkeeping uses the following quantities:
proto.memory_allocated: one per protocol; the total amount of memory this protocol's socket buffers currently use to store skbs;
sk.sk_forward_alloc: one per socket; the amount of memory currently pre-allocated to this socket and available for storing skbs;
skb.truesize: the size of the skb structure itself plus the size of its data.
Just before the skb enters the socket's receive queue, the accounting routine runs roughly as follows:
ok = 0;
if (skb.truesize < sk.sk_forward_alloc) {
    ok = 1;
    goto addload;
}
pages = how_many_pages(skb.truesize);
tmp = atomic_add(proto.memory_allocated, pages * PAGE_SIZE);
if (tmp < mem[0]) {
    ok = 1;
    /* normal */
}
if (tmp > mem[1]) {
    ok = 2;
    /* tight: under pressure */
}
if (tmp > mem[2]) {
    /* overrun: over the hard limit */
}
if (ok == 2) {
    if (do_something(proto)) {
        ok = 1;
    }
}
addload:
if (ok == 1) {
    /* (in the real kernel the pages just charged also top up sk.sk_forward_alloc first) */
    sk.sk_forward_alloc = sk.sk_forward_alloc - skb.truesize;
    proto.memory_allocated = tmp;
} else {
    /* drop the skb */
}
When the skb is released by the socket, its destructor replenishes sk.sk_forward_alloc:
sk.sk_forward_alloc = sk.sk_forward_alloc + skb.truesize;

Protocol buffer reclaim (invoked when an skb is released, or when an expired skb is deleted):
if (sk.sk_forward_alloc > PAGE_SIZE) {
    pages = sk.sk_forward_alloc rounded down to whole pages;
    proto.memory_allocated = proto.memory_allocated - pages * PAGE_SIZE;
}

This logic can be seen in the sk_mem_xxx family of functions, for example sk_rmem_schedule.
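For orientation, the receive-side entry point looks roughly like this in 2.6.25-and-later kernels (abridged sketch; __sk_mem_schedule() contains the limit checks sketched above, and details vary by version):

static inline int sk_rmem_schedule(struct sock *sk, unsigned int size)
{
    if (!sk_has_account(sk))        /* this protocol does not use memory accounting */
        return 1;
    return size <= sk->sk_forward_alloc ||
           __sk_mem_schedule(sk, size, SK_MEM_RECV);
}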
This concludes the first part of this article; the second part will focus on the logic of select, poll, and epoll.
