Reception of network packets in the Linux kernel, Part II: select/poll/epoll


Like the first part of this series, this article is meant to clarify ideas, for others and for myself, rather than to be so-called source analysis. If you want to analyze the source code, debugging it directly is the best way; reading documents and books about it is the worst. For that reason this kind of idea-clarifying article is written as free-flowing notes, kept as simple and clear as possible.

The wakeup callback mechanism of the Linux 2.6+ kernel

The Linux kernel organizes all the tasks waiting for an event into a sleep queue, and the wakeup mechanism can asynchronously wake up the tasks on that queue. Every node on the sleep queue carries a callback; the wakeup logic traverses the nodes of the queue and invokes each node's callback in turn, and if during the traversal it encounters a node marked exclusive, it stops and does not visit the remaining nodes. The overall logic can be expressed by the following pseudocode:

Sleep-wait logic
define sleep_list;
define wait_entry;
wait_entry.task = current_task;
wait_entry.callback = func1;
if (something_not_ready); then
    # enter the blocking path
    add_entry_to_list(wait_entry, sleep_list);
go_on:
    schedule();
    if (something_not_ready); then
        goto go_on;
    endif
    del_entry_from_list(wait_entry, sleep_list);
endif
...


Wake-up mechanism
something_ready;
for_each(sleep_list) as wait_entry; do
    wait_entry.callback(...);
    if (wait_entry.exclusive); then
        break;
    endif
done
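
For reference, the pseudocode above maps quite directly onto the classic Linux wait-queue API. Below is a minimal C sketch under that assumption; something_ready() is a placeholder condition of my own, not a real kernel symbol:

#include <linux/wait.h>
#include <linux/sched.h>

static DECLARE_WAIT_QUEUE_HEAD(sleep_list);       /* the sleep queue */

static void sleep_until_ready(void)
{
    /* the wait_entry; its callback defaults to default_wake_function */
    DECLARE_WAITQUEUE(wait_entry, current);

    add_wait_queue(&sleep_list, &wait_entry);
    for (;;) {
        set_current_state(TASK_INTERRUPTIBLE);
        if (something_ready())                    /* placeholder condition */
            break;
        schedule();                               /* block until woken */
    }
    __set_current_state(TASK_RUNNING);
    remove_wait_queue(&sleep_list, &wait_entry);
}

static void make_ready(void)
{
    /* traverses sleep_list and invokes each entry's callback,
     * honouring exclusive entries, as in the pseudocode above */
    wake_up_interruptible(&sleep_list);
}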


We only need to pay close attention to this callback mechanism, because it can do far more than select/poll/epoll: Linux AIO is also built on it. By registering a callback, you can have a blocked path do almost anything at wakeup time. Generally speaking, a callback has the following structure:

common_callback_func(...)
{
    do_something_private;
    wakeup_common;
}


Here, do_something_private is wait_entry's own customized logic, while wakeup_common is the common logic, whose job is to add wait_entry's task to the CPU's ready queue and let the CPU schedule it.
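In kernel terms such a callback is just a wake function installed on the wait entry. A hedged sketch, using the modern wait_queue_entry naming; do_something_private() is a hypothetical placeholder, not a kernel function:

#include <linux/wait.h>

static int common_callback_func(struct wait_queue_entry *wq_entry,
                                unsigned int mode, int sync, void *key)
{
    do_something_private(wq_entry, key);  /* hypothetical private logic */
    /* wakeup_common: put the waiting task on the CPU's ready queue */
    return default_wake_function(wq_entry, mode, sync, key);
}

/* installed with something like:
 *   init_waitqueue_func_entry(&wait_entry, common_callback_func);
 *   add_wait_queue(&sleep_list, &wait_entry);
 */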
Now, a question to ponder: if you were implementing select/poll, what should be done in the wait_entry callback?
.....

The logic of Select/poll

As you know, in most cases, to process network data efficiently a task generally handles multiple sockets in a batch, reading from whichever one has data. That means treating all of these sockets fairly: you cannot block on "data readable" of any single socket, that is, you cannot call recv/recvfrom in blocking mode on any one socket. This is the substantive requirement of socket multiplexing.
Suppose N sockets are handled by the same task; how do you implement the multiplexing logic? Clearly, we must wait for the "data readable" event instead of waiting for the actual data! We want to block on the event "one or more of the N sockets has data readable", so that as soon as the block is lifted there is guaranteed to be readable data, which means the next call to recv/recvfrom will not block! On the other hand, this task has to be queued on the sleep_list of all these sockets at the same time, expecting any of the sockets to wake it as soon as data arrives.
So the design of this kind of multiplexing model, select/poll, is obvious.
select/poll's design is very simple: introduce a poll routine for each socket, which judges "data readable" as follows:

poll()
{
    ...
    if (receive queue is not empty) {
        ev |= POLL_IN;
    }
    ...
}


When a task calls select/poll and there is no data readable, the task blocks; by this point it has been queued on the sleep_list of all N sockets. As soon as data arrives on any socket the task is woken up, and what happens next is:

for_each_n_socket as sk; do
    event.evt = sk.poll(...);
    event.sk = sk;
    put_event_to_user;
done


As you can see, whenever one socket has readable data, all N sockets are traversed and their poll routines invoked to check which of them actually have data. In fact, when the blocked select/poll task is woken up, it does not know which specific sockets have readable data; it only knows that at least one of them does, so it has to traverse them all to find out. Once the traversal is complete, the user-space task can perform read operations on the sockets flagged in the returned result set.
Clearly, select/poll is very primitive: if there are 100,000 sockets (an exaggeration?) and one of them becomes readable, the whole set must be traversed. That is why select limits the number of sockets that can be multiplexed to a maximum of 1024, controlled by a macro on Linux. select/poll merely implements socket multiplexing in the simplest way and is not suited to large-scale network server scenarios; its bottleneck is that it does not scale as the number of sockets grows.
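
To make the cost concrete, here is a minimal user-space select() loop over N sockets; a sketch only, with error handling trimmed, and socks[] assumed to hold already-connected descriptors:

#include <sys/select.h>
#include <sys/socket.h>

void select_loop(int *socks, int n)
{
    for (;;) {
        fd_set rset;
        int i, maxfd = -1;

        FD_ZERO(&rset);
        for (i = 0; i < n; i++) {       /* O(n) just to arm the wait */
            FD_SET(socks[i], &rset);
            if (socks[i] > maxfd)
                maxfd = socks[i];
        }

        if (select(maxfd + 1, &rset, NULL, NULL, NULL) <= 0)
            continue;

        for (i = 0; i < n; i++) {       /* O(n) again to find the ready ones */
            if (FD_ISSET(socks[i], &rset)) {
                char buf[2048];
                recv(socks[i], buf, sizeof(buf), 0);  /* will not block */
            }
        }
    }
}

Note the two O(n) scans per wakeup: one to rebuild the fd_set (select overwrites it), one to find the sockets that are actually ready.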

How epoll uses the wait_entry callback

Since a wait_entry callback can do anything, could it do something more than the wakeup_common of the select/poll scenario?
To this end, epoll prepares a linked list called the ready_list. Every socket on the ready_list has an event; for data reading, it really has data readable. All that epoll's wait_entry callback has to do is add its own socket to this ready_list; when epoll_wait returns, it only needs to traverse the ready_list. epoll_wait sleeps on a separate queue (single_epoll_waitlist), not on any socket's sleep queue.
Unlike select/poll, a task that uses epoll does not need to queue itself on all the multiplexed sockets at once: the sockets each have their own sleep queue, and the task only needs to sleep on its own separate queue waiting for events. Each socket's wait_entry callback logic is:

epoll_wakecallback(...)
{
    add_this_socket_to_ready_list;
    wakeup_single_epoll_waitlist;
}

To make this work, epoll needs an extra call, epoll_ctl ADD, which adds a socket to the epoll table. Its main job is to install the wakeup callback: it assigns the socket an epoll entry and initializes the wait_entry's callback to epoll_wakecallback. The whole wakeup flow from the protocol stack to epoll_wait is as follows:
The protocol stack wakes up the socket's sleep queue:
1. the packet is placed on the socket's receive queue;
2. the socket's sleep queue is woken up, i.e. each wait_entry's callback is invoked;
3. the callback adds its socket to the ready_list;
4. the separate queue on which epoll_wait sleeps is woken up.
From then on, epoll_wait just traverses the ready_list, calling each socket's poll routine to collect events. This step is a formality, because it is guaranteed that every socket on the ready_list has data readable; no effort is wasted. This is the essential difference from select/poll (which has to traverse everything even when a socket has no readable data).
To summarize, the epoll logic performs the following routines:

Epoll add logic
define wait_entry
wait_entry.socket = this_socket;
wait_entry.callback = epoll_wakecallback;
add_entry_to_list(wait_entry, this_socket.sleep_list);



Epoll wait logic
define single_wait_list
define single_wait_entry
single_wait_entry.callback = wakeup_common;
single_wait_entry.task = current_task;
if (ready_list_is_empty); then
    # enter the blocking path
    add_entry_to_list(single_wait_entry, single_wait_list);
go_on:
    schedule();
    if (ready_list_is_empty); then
        goto go_on;
    endif
    del_entry_from_list(single_wait_entry, single_wait_list);
endif
for_each_ready_list as sk; do
    event.evt = sk.poll(...);
    event.sk = sk;
    put_event_to_user;
done


Epoll wake-up logic
add_this_socket_to_ready_list;
wakeup_single_wait_list;


Putting all of the above together, we can draw the following flowchart for epoll; you may compare it with the flowchart in the first part of this series.


[Figure: epoll flowchart (poll2.jpg): http://s1.51cto.com/wyfs02/M02/79/BC/wKioL1aZ81-CEqlcAAL_CrGFxPo563.jpg]


As you can see, the essential difference between epoll and select/poll is that when an event occurs, every epoll item (that is, every socket) has its own separate wakeup callback, whereas for select/poll there is only one! This means that in epoll, each socket's event can be handled by its own callback. From a macroscopic point of view, epoll's efficiency comes from separating sleep-waiting into two kinds. One is epoll's own sleep-wait: it waits for "any one of the sockets has an event", which is the return condition of the epoll_wait call. It cannot reasonably sleep on a socket's sleep queue directly; if it did, whose queue would it sleep on? After all, there are so many sockets... so it sleeps only on its own queue. A socket's sleep queue must be related only to the socket itself, so the other kind of sleep-wait belongs to each individual socket, which sleeps on its own queue.
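
For comparison with the select loop earlier, here is a minimal user-space epoll loop; again a sketch with error handling trimmed, and socks[] assumed to hold connected descriptors:

#include <sys/epoll.h>
#include <sys/socket.h>

void epoll_loop(int *socks, int n)
{
    struct epoll_event ev, events[64];
    int epfd = epoll_create1(0);
    int i;

    for (i = 0; i < n; i++) {           /* one-time registration: this is
                                         * what installs the per-socket
                                         * wakeup callback */
        ev.events = EPOLLIN;            /* LT by default */
        ev.data.fd = socks[i];
        epoll_ctl(epfd, EPOLL_CTL_ADD, socks[i], &ev);
    }

    for (;;) {
        int nready = epoll_wait(epfd, events, 64, -1);
        for (i = 0; i < nready; i++) {  /* only the ready_list is walked */
            char buf[2048];
            recv(events[i].data.fd, buf, sizeof(buf), 0);
        }
    }
}

The registration loop runs once; each wakeup then visits only the sockets epoll reports, i.e. the ready_list.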


ET and LT of Epoll

It is time to talk about ET and LT. The biggest controversy lies in which performs better, not in how they are used. All kinds of documents say that ET is efficient, but in fact it is not; for practical purposes LT is efficient, and safer as well. What is the difference between the two?

The conceptual difference

ET: a notification is delivered only when the state changes, for example when the receive buffer goes from empty to non-empty (from unreadable to readable); if data merely remains in the buffer, no repeated notifications are delivered.
LT: as long as there is data in the buffer, notifications keep being delivered.
I have read a lot of material, and the answers are more or less like the above. But if you look at the Linux implementation, ET becomes even more confusing. What exactly does "the state has changed" mean? For example, suppose 10 packets arrive in the receive buffer at once. Checking against the flowchart above, clearly 10 wakeup operations are invoked; does that mean the socket is added to the ready_list 10 times? Certainly not: when the second packet arrives and the wakeup callback fires, the socket is found to be already on the ready_list, so it is not added again. Now suppose epoll_wait returns and the user reads 1 packet, then, because of a bug in the program, never reads again, leaving 9 packets in the buffer. Here is the question: if the protocol stack now enqueues one more packet, is a notification delivered or not? Going by the conceptual definition, it should not be, because this is not a "change of state" (the buffer was already non-empty). But if you actually try it on Linux, you will find that you are notified: whenever a packet enters the socket's receive queue, the wakeup callback is triggered and the socket is put on the ready_list again, and in ET mode the socket was removed from the ready_list when the previous epoll_wait returned. So if in ET mode you find your program blocked in epoll_wait, you cannot conclude that packets have not arrived or that reading is unfinished; it may also be that data remains in the buffer but no new packet has come. If a new packet does arrive at that moment, epoll_wait will return, even though this brings no edge-like change to the buffer's state.
Therefore, the "change of state" of the buffer cannot be understood simply as "has data versus has no data"; it is really about whether a packet has newly arrived or not.
ET and LT are concepts from interrupt handling. If you regard the arrival of a packet, i.e. its insertion into the socket's receive queue, as an interrupt event, isn't that exactly what edge-triggered means?

The difference in implementation

In the code, the difference between the ET and LT implementations is this: a socket is always added to the ready_list once it has an event, and stays there until the next poll removes it; under LT, if that poll detects an interesting event, the socket is added back to the ready_list. Using the poll routine to decide whether there is an event, rather than relying entirely on the wakeup callback, is the true meaning of "poll": keep polling! In other words, LT mode is full polling: every epoll_wait polls the socket once more, and only when a poll finds no interesting event does the socket get to rest; from that point on, only the arrival of a packet can put it back on the ready_list, via the wakeup callback. The difference between the two can be seen in the following code.

epoll_wait:
for_each_ready_list_item as entry; do
    remove_from_ready_list(entry);
    event = entry.poll(...);
    if (event); then
        put_user;
        if (LT); then
            # LT: re-add based on the conclusion of the poll above
            add_entry_to_ready_list(entry);
        endif
    endif
done
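
This re-add behaviour can be demonstrated with a small self-contained program, using a pipe in place of a socket for simplicity; the expected counts are noted in the comment:

#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>

int main(void)
{
    int p[2], i;
    struct epoll_event ev, out;

    pipe(p);
    int epfd = epoll_create1(0);
    ev.events = EPOLLIN;                /* add EPOLLET here to compare */
    ev.data.fd = p[0];
    epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev);

    write(p[1], "hello", 5);            /* one "packet", never read */

    for (i = 0; i < 3; i++) {
        int n = epoll_wait(epfd, &out, 1, 100 /* ms */);
        printf("epoll_wait #%d -> %d event(s)\n", i + 1, n);
        /* LT: 1, 1, 1  (re-added after each poll still finds data)
         * ET: 1, 0, 0  (removed from ready_list, no new arrival)  */
    }
    return 0;
}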



The difference in performance

The performance difference mainly comes from the organization of data structures and algorithms. For epoll this means the linked-list operations and the wakeup callback operations. Under ET, only the wakeup callback adds a socket to the ready_list; under LT, besides the wakeup callback, epoll_wait itself can also add a socket back to the ready_list for the next epoll_wait to poll, so the wakeup callback does slightly less extra work. But that is not the root of the performance difference. The root is the list traversal: if a huge number of sockets use LT mode, then because every event puts the socket back on the ready_list, a socket that no longer has any event will still be polled once more just to confirm it. This extra, meaningless traversal of event-less sockets does not exist under ET. Note, however, that the cost of traversing a list only shows when the list is long; do you really think a few hundred sockets will expose LT's disadvantage? Admittedly, ET does reduce the number of "data readable" notifications, but that does not in fact give it an overwhelming advantage.
LT is genuinely easier to use than ET, and not prone to deadlock; it is still recommended to program in the ordinary way with LT, rather than to use ET to show off occasionally.

The difference in programming

With epoll ET in blocking mode, the "queue empty" event cannot be recognized, so a task that blocks in recv on a single socket, instead of blocking in the epoll_wait call over all the monitored sockets, can hang there. This does not affect the correctness of the code, as long as data does arrive on that socket, but it wrecks the programming logic: the multiplexing is disarmed, and a large number of other sockets starve even though they have data readable. LT has a similar problem, of course, but LT aggressively keeps reporting "data readable", so events are not easily lost through your programming errors.
For LT, since it keeps giving feedback, you can read whenever you want as long as there is data; there is always a "next poll" chance to actively detect whether there is still data to read. So even in blocking mode, as long as you do not cross the blocking boundary and starve other sockets, you can read as much or as little as you like. For ET, however, the notification only tells your program that data is readable; although a newly arriving packet will trigger another notification, you have no control over whether and when new data will come, so you must read all the data before you leave. Reading everything means you must be able to detect that the queue is empty, which in turn means you must use non-blocking mode and read until recv returns an EAGAIN error.
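
A sketch of the obligatory ET read pattern follows: the descriptor is assumed to have been registered with EPOLLIN | EPOLLET and already set O_NONBLOCK:

#include <errno.h>
#include <sys/socket.h>

void drain_socket(int fd)
{
    char buf[2048];

    for (;;) {
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n > 0)
            continue;                   /* consume and keep looping: no new
                                         * ET notification is guaranteed for
                                         * data that is already queued */
        if (n == 0)
            break;                      /* peer closed the connection */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;                      /* queue drained: now it is safe to
                                         * go back to epoll_wait */
        break;                          /* a real error */
    }
}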

A few tips on ET mode

1. The space a packet occupies in the receive queue buffer includes the length of the skb structure itself, about 230 bytes.
2. In ET mode, the number of times the wakeup callback adds the socket to the ready_list is <= the number of packets received, because multiple datagrams arriving fast enough may cause the socket to be added to the ready_list only once (it is already there), so:
=> the receive queue can fill up while the application is not reading
=> the next large packet cannot get in
=> a "stopper" effect results.
A small packet that fits into the remaining hole of the buffer can still trigger the ET-mode epoll_wait to return; if the minimum length were 1, a 0-length packet could even be sent to lure epoll_wait into returning
=> but since the skb structure has an inherent size, this inducement cannot be guaranteed to succeed.
3. The epoll thundering-herd problem; refer to nginx's experience with it.
4. Epoll could also borrow NAPI's scheme of disabling interrupts: until the recv routine returns EAGAIN or an error occurs, the epoll wakeup callback is not called again, which means that as long as the buffer is not empty, no notification is given even when new packets arrive (a minimal sketch follows this list):
a. once the socket's epoll wakeup callback has been called, subsequent notifications are suppressed;
b. the recv routine re-enables subsequent notifications when it returns EAGAIN or an error.
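
Here is a minimal user-space sketch of tip 4's suppression idea, expressed as a per-socket flag; it is purely illustrative, since the real change would live in the kernel's wakeup callback and recv path:

#include <sys/types.h>

struct sock_state {
    int notify_enabled;     /* 1: the wakeup callback may notify epoll */
};

/* rule (a): the first notification disables further ones */
void on_wakeup_callback(struct sock_state *s)
{
    if (!s->notify_enabled)
        return;             /* suppressed: buffer is known non-empty */
    s->notify_enabled = 0;
    /* add_this_socket_to_ready_list; wakeup_single_wait_list; */
}

/* rule (b): recv re-enables notifications on EAGAIN or error */
void on_recv_result(struct sock_state *s, ssize_t n)
{
    if (n < 0)              /* EAGAIN or a real error */
        s->notify_enabled = 1;
}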

