Reception of network packets in the Linux kernel - Part II: select/poll/epoll


As with Part I, these notes are meant to help clarify thinking, for others or for myself, rather than to be yet another so-called source-code analysis. For analysing source code, reading or directly debugging the source itself is always best; any document or book is second to that.

For that reason, this kind of idea-clarifying article is written as much as possible in a flowing, stream-of-thought way, keeping things as simple and clear as possible.
The wakeup callback mechanism of the Linux 2.6+ kernel uses a sleep queue to organize all the tasks waiting for an event, and the wakeup mechanism can asynchronously wake up tasks on that sleep queue. Each node on the sleep queue carries a callback; when the queue is woken, the wakeup logic walks the list of nodes and invokes each node's callback, and if it meets a node marked exclusive during the traversal it stops there and does not continue to the remaining nodes. The overall logic can be expressed by the following pseudo-code:
Sleep Waiting

define sleep_list;
define wait_entry;
wait_entry.task = current_task;
wait_entry.callback = func1;
if (something_not_ready); then
    # enter the blocking path
    add_entry_to_list(wait_entry, sleep_list);
go_on:
    schedule();
    if (something_not_ready); then
        goto go_on;
    endif
    del_entry_from_list(wait_entry, sleep_list);
endif
...

Wake-up mechanism
something_ready;
for_each(sleep_list) as wait_entry; do
    wait_entry.callback(...);
    if (wait_entry.exclusive); then
        break;
    endif
done
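A minimal kernel-style C sketch of the same sleep/wake pattern, assuming the helpers in <linux/wait.h> (DEFINE_WAIT, prepare_to_wait, finish_wait, wake_up); the names something_ready, wait_for_it and it_is_ready are illustrative only, not taken from the kernel:

#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/types.h>

static DECLARE_WAIT_QUEUE_HEAD(sleep_list);      /* the sleep queue */
static bool something_ready;

/* sleep side: queue ourselves, then block until the condition holds */
static void wait_for_it(void)
{
    DEFINE_WAIT(wait_entry);                     /* task = current, default callback */

    for (;;) {
        prepare_to_wait(&sleep_list, &wait_entry, TASK_INTERRUPTIBLE);
        if (something_ready)
            break;
        schedule();                              /* give up the CPU until woken */
    }
    finish_wait(&sleep_list, &wait_entry);       /* dequeue and go back to TASK_RUNNING */
}

/* wake side: walk the sleep queue and invoke each entry's callback */
static void it_is_ready(void)
{
    something_ready = true;
    wake_up(&sleep_list);
}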

It is worth paying close attention to this callback mechanism, because it can do much more than select/poll/epoll; Linux AIO is also built on it. By registering a callback you can make a blocking path do almost anything when it is woken up.

Generally speaking, a callback has the following shape:

common_callback_func(...)
{
    do_something_private;
    wakeup_common;
}

Here do_something_private is logic defined by the wait_entry itself, while wakeup_common is the common logic, whose purpose is to add the task behind the wait_entry to the CPU's run queue and let the CPU schedule it.
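A hedged kernel-style sketch of what installing such a callback on a wait_entry might look like, assuming the init_waitqueue_func_entry/add_wait_queue/default_wake_function interfaces of current kernels; my_private_bookkeeping and register_waiter are names invented here for illustration:

#include <linux/wait.h>
#include <linux/sched.h>

/* the private part: whatever this waiter wants done at wakeup time */
static void my_private_bookkeeping(void)
{
    /* e.g. move the entry onto another list, count the event, ... */
}

/* the signature a wait-queue callback must have */
static int my_callback(struct wait_queue_entry *wq_entry,
                       unsigned int mode, int flags, void *key)
{
    my_private_bookkeeping();                                  /* do_something_private */
    return default_wake_function(wq_entry, mode, flags, key);  /* wakeup_common */
}

static void register_waiter(struct wait_queue_head *sleep_list,
                            struct wait_queue_entry *wq_entry)
{
    init_waitqueue_func_entry(wq_entry, my_callback);  /* install the callback */
    wq_entry->private = current;                       /* task that default_wake_function will wake */
    add_wait_queue(sleep_list, wq_entry);              /* join the sleep queue */
}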


Now a question to think about: if you were implementing select/poll, what should be done in the wait_entry callback?
.....
The logic of select/poll

In most cases, to process network data efficiently, a task handles multiple sockets in batch and reads from whichever one has data. This means all the sockets have to be treated fairly; you cannot block on the "data readable" of any single socket, that is, you cannot call recv/recvfrom in blocking mode on any one socket. This is the real need behind socket multiplexing.


Suppose N sockets are handled by the same task. How do we get multiplexing? Clearly, we have to wait on the "data readable" event rather than on the "actual data"! We want to block on the event "one or more of the N sockets has data readable", which means that as soon as this blocking ends there must be data readable, so the next call to recv/recvfrom will not block. There is one more aspect: the task has to be queued into the sleep_list of all of these sockets at the same time, so that any one socket with readable data is able to wake the task.
With that, the design of the select/poll style of multiplexing is obvious.


select/poll is easy to design. Introduce a poll routine for each socket; its "data readable" judgement looks like the following:

poll()
{
    ...
    if (receive queue is not empty) {
        ev |= POLLIN;
    }
    ...
}

When the task calls select/poll and there is no data to read, the task blocks. At that point it has been queued into the sleep_list of all N sockets; as soon as any one socket has data, the task is woken up, and what happens next is:
for_each_n_socket as sk; do
    event.evt = sk.poll(...);
    event.sk = sk;
    put_event_to_user;
done

As you can see, even if only one socket has readable data, all N sockets are traversed and their poll routines are called again to check whether there is data to read. In fact, when the select/poll task is woken, it does not know which specific socket has readable data; it only knows that at least one of these sockets does, so it has to traverse them all to find out. Once the traversal is complete, the user-space task can read from the sockets that had events, based on the returned result set.
As you can see, select/poll is quite primitive. Suppose there are 100,000 sockets (an exaggeration?) and one of them becomes readable; the system still has to traverse them all. That is why select restricts multiplexing to at most 1024 sockets, controlled by a macro on Linux. select/poll merely implements socket multiplexing; it is not suitable at all for high-volume network server scenarios, because its bottleneck is that it cannot scale as the number of sockets grows.
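To make the user-space side of this concrete, here is a minimal, hedged select() sketch over N already-created sockets (the socks[] array and its setup are assumed, and NSOCKS is an illustrative constant). Note how every descriptor has to be re-registered and then re-scanned on every call, and how FD_SETSIZE caps the whole scheme:

#include <sys/select.h>
#include <sys/socket.h>
#include <stdio.h>

#define NSOCKS 16                 /* illustrative; real code would size this as needed */

void serve(int socks[NSOCKS])
{
    char buf[4096];

    for (;;) {
        fd_set rfds;
        int maxfd = -1;

        FD_ZERO(&rfds);
        for (int i = 0; i < NSOCKS; i++) {        /* re-register every socket, every time */
            FD_SET(socks[i], &rfds);
            if (socks[i] > maxfd)
                maxfd = socks[i];
        }

        if (select(maxfd + 1, &rfds, NULL, NULL, NULL) < 0) {
            perror("select");
            return;
        }

        for (int i = 0; i < NSOCKS; i++) {        /* scan all N to find the ready ones */
            if (FD_ISSET(socks[i], &rfds)) {
                ssize_t n = recv(socks[i], buf, sizeof(buf), 0);
                if (n > 0)
                    printf("fd %d: %zd bytes\n", socks[i], n);
            }
        }
    }
}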
epoll's use of the wait_entry callback

Since a wait_entry callback can do arbitrary things, can it do something smarter than select/poll's wakeup_common?
To do this, epoll prepares a linked list called the ready_list. Every socket on the ready_list has an event; in the case of reading, it really does have readable data.

All the epoll wait_entry callback has to do is add its own socket to this ready_list; when epoll_wait returns, it only needs to traverse the ready_list. epoll_wait sleeps on a separate queue (the single epoll wait list), not on the sockets' sleep queues.
Unlike select/poll, a task using epoll does not need to be queued into the sleep queues of all the multiplexed sockets at the same time. The sockets keep their own queues, and the task only has to sleep on its own separate queue and wait for events. The callback logic of each socket's wait_entry is:

epoll_wakecallback(...)
{
    add_this_socket_to_ready_list;
    wakeup_single_epoll_waitlist;
}
For this, epoll needs an extra call, epoll_ctl ADD, which adds a socket into the epoll table. Its main job is to provide the wakeup callback: the socket is attached to an epoll entry, and at the same time the wait_entry callback is initialized to epoll_wakecallback. The whole epoll_wait and protocol-stack wakeup logic then looks as follows:
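Seen from user space, this "add" step is the epoll_ctl(EPOLL_CTL_ADD) call; a minimal hedged sketch, with error handling trimmed and fd assumed to be an already-open socket:

#include <sys/epoll.h>

int add_to_epoll(int epfd, int fd)
{
    struct epoll_event ev = {0};

    ev.events = EPOLLIN;        /* level-triggered "data readable" by default */
    ev.data.fd = fd;            /* what epoll_wait will hand back to us */

    /* this is the call that installs epoll's wakeup callback on fd's sleep queue */
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}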
The protocol stack wakes up the socket's sleep queue:
1. the packet is queued onto the socket's receive queue;
2. the socket's sleep queue is woken, i.e. each wait_entry's callback is invoked;
3. the callback adds the socket itself to the ready_list;
4. the epoll_wait sleeping on the separate queue is woken.


From this point, epoll_wait continues forward: it traverses the ready_list, calls the poll routine of each socket on it, and collects the events. This work is routine but unavoidable, and none of it is wasted, because every socket on the ready_list really has readable data. That is the essential difference from select/poll (which has to traverse everything even when there is nothing to read).
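Seen from user space, all of this machinery boils down to the familiar epoll_wait loop. A minimal hedged sketch (MAX_EVENTS and wait_loop are illustrative names; epfd is an epoll instance with sockets already added): only descriptors that were actually put on the ready_list come back, so the loop is proportional to the number of events rather than the number of sockets.

#include <sys/epoll.h>
#include <sys/socket.h>
#include <stdio.h>

#define MAX_EVENTS 64

void wait_loop(int epfd)
{
    struct epoll_event events[MAX_EVENTS];
    char buf[4096];

    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);  /* sleeps on epoll's own queue */
        if (n < 0) {
            perror("epoll_wait");
            return;
        }
        for (int i = 0; i < n; i++) {          /* only ready sockets, no full scan */
            if (events[i].events & EPOLLIN) {
                ssize_t r = recv(events[i].data.fd, buf, sizeof(buf), 0);
                if (r > 0)
                    printf("fd %d: %zd bytes\n", events[i].data.fd, r);
            }
        }
    }
}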
To summarize, the epoll logic amounts to the following routines:
Epoll Add logic

define wait_entry
wait_entry.socket = this_socket;
wait_entry.callback = epoll_wakecallback;
add_entry_to_list(wait_entry, this_socket.sleep_list);


Epoll wait Logic
define single_wait_list
define single_wait_entry
single_wait_entry.callback = wakeup_common;
single_wait_entry.task = current_task;
if (ready_list_is_empty); then
    # enter the blocking path
    add_entry_to_list(single_wait_entry, single_wait_list);
go_on:
    schedule();
    if (ready_list_is_empty); then
        goto go_on;
    endif
    del_entry_from_list(single_wait_entry, single_wait_list);
endif
for_each_ready_list as sk; do
    event.evt = sk.poll(...);
    event.sk = sk;
    put_event_to_user;
done

Epoll Wake-up logic
add_this_socket_to_ready_list;
wakeup_single_wait_list;

Putting the above together, we can draw the following flow chart for epoll; compare it with the flow chart in Part I of this article.




As you can see, the essential difference between epoll and select/poll is that when an event occurs, each epoll item (i.e. each socket) has its own separate wakeup callback, whereas for select/poll there is only one. This means that in epoll, a socket event can invoke its own callback to handle itself. From a macroscopic point of view, epoll's efficiency comes from separating two kinds of sleep waits. One is epoll's own sleep wait: it waits for "an event on any of the sockets", which is the condition on which the epoll_wait call returns; it is not appropriate for it to sleep directly on the sockets' sleep queues (there are, after all, so many sockets), so it sleeps only on its own queue. A socket's sleep queue should be related only to that socket itself, so the other kind of sleep wait belongs to each socket individually and sleeps on that socket's own queue.


epoll ET and LT

It is time to mention ET and LT. The biggest controversy about them is which performs better, not how each should be used. Various documents say ET is more efficient, but in practice that is not really the case; for practical purposes LT is just as efficient and at the same time safer.

What is the difference between the two?
Conceptual difference
ET: you are only notified when the state changes, for example when the receive buffer goes from empty to non-empty (from unreadable to readable). If the buffer merely continues to hold data, you are not notified again.
LT: as long as there is data in the buffer, you keep being notified.
I looked up a lot of material, and the answers were all more or less like the above. However, if you look at the Linux implementation, ET becomes even more confusing. What exactly does "the state has changed" mean? For example, suppose 10 packets arrive at the receive buffer in one go. Looking at the flow chart above, the wakeup operation is obviously called 10 times; does that mean the socket is added to the ready_list 10 times? Certainly not: when the second packet arrives and the wakeup callback is called, it finds that the socket is already on the ready_list and definitely does not add it again. At some point epoll_wait returns; the user reads 1 packet and then, suppose the program has a bug, never reads again, leaving 9 packets in the buffer. Here is the problem: if the protocol stack now enqueues one more packet, is there a notification or not? Going by the concept, there should be no notification, because this is not a "change of state"; but in fact, if you try it on Linux you will find that you are notified, because whenever a packet is enqueued on the socket's queue the wakeup callback is triggered and the socket is put on the ready_list again, and for ET the socket had already been removed from the ready_list before epoll_wait returned. So, in ET mode, if you find the program stuck in epoll_wait, you cannot conclude that the reason is that no packet has arrived; it may also be that a previous batch was not read to completion, and if a new packet happens to arrive at that moment, epoll_wait will still return, even though this does not amount to an edge, a change of the buffer's state.


So a "change of the buffer state" cannot simply be understood as whether the buffer has data or not; it is better understood in terms of whether a packet has arrived or not.
ET and LT are concepts from interrupt handling: if you treat the arrival of a packet, that is, its insertion into the socket's receive queue, as the interrupt event, then isn't that exactly what edge triggering means?
Difference in implementation

In terms of code logic, the difference between the ET and LT implementations is that under LT, once a socket has had an event it keeps being added back to the ready_list, until the next poll moves it out, and it is added again only after a poll detects an event of interest. In other words, it is the socket's poll routine that decides whether there is still an event, rather than relying entirely on the wakeup callback; this is constant polling! That is, LT mode keeps polling: every epoll_wait does a poll pass, and only when a poll finds no event of interest does the socket get to rest, at which point only the arrival of a new packet can, via the wakeup callback, put it back onto the ready_list.

At the implementation level, the difference between the two can be seen from the following code:


epoll_wait
for_each_ready_list_item as entry; do
    remove_from_ready_list(entry);
    event = entry.poll(...);
    if (event) then
        put_user;
        if (LT) then
            # whether it stays is decided by the next poll
            add_entry_to_ready_list(entry);
        endif
    endif
done


Difference in performance

The performance difference lies mainly in the organization of data structures and algorithms. For epoll that mainly means the linked-list operations and the wakeup callback work. For ET, the wakeup callback adds the socket to the ready_list; for LT, besides the wakeup callback being able to add the socket to the ready_list, epoll_wait also adds it back to the ready_list for the next poll, so the wakeup callback has a little less to do. But this is not the root of the performance difference. The root of the performance difference is the traversal of the linked list. If a large number of sockets use LT mode, then because every event puts the socket back on the ready_list, even a socket that no longer has any event will be polled one more time to confirm; this extra, meaningless traversal of event-less sockets does not happen with ET. But note that the cost of traversing the list only shows up when the list is really long; do you really think a few hundred sockets will expose LT's disadvantage? It is true that ET reduces the number of "data readable" notifications, but that does not give it an overwhelming advantage.
LT really is easier to use than ET and less prone to getting stuck, so the recommendation is to use LT for ordinary programming rather than using ET to show off now and then.



Programming difference

With epoll ET in blocking mode, you cannot detect the queue-empty event, so you end up blocking in the recv of a single socket rather than in the epoll_wait call that covers all the monitored sockets. This may not affect the running of the code, as long as that one socket keeps receiving data, but it does affect the programming logic: it amounts to disarming multiplexing and starving a large number of sockets; even when they have data, it cannot be read.

Of course, LT has a similar problem, but LT keeps aggressively reporting that data is readable, so events are not as easily lost because of your programming mistakes.
With LT, since it keeps reporting, you can read the data whenever you want to, as long as there is data; it always has the "next poll" as a chance to actively check whether there is more to read. Even if you use blocking mode, as long as you do not block for so long that you starve the other sockets, you can read as much or as little as you like each time. With ET, however, it notifies your application once that data is readable; new data will still trigger notifications, but you cannot control whether and when new data arrives, so you must read all of the data before you may leave, and reading all of it means you must be able to detect that the queue is empty, which in turn means you must use non-blocking mode and read until the EAGAIN error is returned.
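A hedged sketch of that ET discipline: the socket is registered with EPOLLET, put into non-blocking mode, and then drained until recv reports EAGAIN, so no readable data is left behind when the program goes back to epoll_wait (add_et and drain are illustrative names):

#include <sys/epoll.h>
#include <sys/socket.h>
#include <fcntl.h>
#include <errno.h>

/* register with EPOLLET and make the fd non-blocking first */
int add_et(int epfd, int fd)
{
    struct epoll_event ev = {0};

    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
    ev.events = EPOLLIN | EPOLLET;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* on each EPOLLIN notification, read until the queue is provably empty */
void drain(int fd)
{
    char buf[4096];

    for (;;) {
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n > 0)
            continue;                 /* consume and keep reading */
        if (n == 0)
            break;                    /* peer closed the connection */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;                    /* receive queue is really empty now */
        break;                        /* some other error; handle elsewhere */
    }
}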
A few tips for ET mode
1. The accounting of the receive-queue buffer size includes the length of the skb structure itself, about 230 bytes.
2. In ET mode, multiple datagrams arriving fast enough may trigger only one successful epoll wakeup callback, with the socket being added to the ready_list just once, so the number of ready_list insertions can be far smaller than the number of packets received;
   ==> this can cause the receive queue to fill up,
   ==> so that a later large packet cannot be queued,
   ==> a stopper (bottleneck) effect;
   ==> a small packet that fits into the remaining hole of the buffer can still trigger an epoll_wait return in ET mode; if the minimum length were 1, you could even send a 0-length packet to lure epoll_wait into returning,
   ==> but since the skb structure itself is fairly large, this lure is not guaranteed to succeed.


3. The epoll thundering-herd problem; nginx's experience with it is worth looking at.
4. epoll could also borrow NAPI's scheme of shutting off interrupts: until the recv routine returns EAGAIN or an error, the epoll wakeup callback is no longer called. That means that as long as the buffer is not empty, even if new packets come in there is no further notification:
   a. as soon as the socket's epoll wakeup callback has been invoked, further notifications are disabled;
   b. notifications are enabled again once the recv routine returns EAGAIN or an error.
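As an aside (my analogy, not the scheme described above), user space can already get a similar effect with EPOLLONESHOT: after one event the descriptor stops producing notifications until it is re-armed with EPOLL_CTL_MOD, typically after recv has returned EAGAIN. A hedged sketch:

#include <sys/epoll.h>

/* arm (or re-arm) fd for exactly one readable notification */
int arm_once(int epfd, int fd, int already_added)
{
    struct epoll_event ev = {0};

    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = fd;
    return epoll_ctl(epfd, already_added ? EPOLL_CTL_MOD : EPOLL_CTL_ADD, fd, &ev);
}

/* after draining until EAGAIN, call arm_once(epfd, fd, 1) to re-enable notifications */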

