Epoll: from implementation to application
Applicable scenarios of multiplexing
• When a client handles multiple descriptors at once (for example, interactive input plus a network socket), it must use I/O multiplexing.
• A TCP server that handles both the listening socket and the connected sockets also needs I/O multiplexing (a minimal sketch follows this list).
• A server that handles both TCP and UDP generally needs I/O multiplexing.
• A server that handles multiple services or protocols generally needs I/O multiplexing.
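As a minimal sketch of the second scenario above (not from the original text): one epoll instance watches both the listening socket and every accepted connection; error handling is omitted and handle_client_data() is a hypothetical handler.

/* Minimal sketch: one epoll instance watching the listening socket and
 * every accepted connection. Assumes listenfd is already bound and
 * listening; handle_client_data() is a hypothetical handler. */
#include <sys/epoll.h>
#include <sys/socket.h>

#define MAX_EVENTS 64

extern void handle_client_data(int fd);   /* hypothetical */

void event_loop(int listenfd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev, events[MAX_EVENTS];

    ev.events = EPOLLIN;
    ev.data.fd = listenfd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listenfd, &ev);

    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listenfd) {
                /* the listening socket is readable: accept a new connection */
                int conn = accept(listenfd, NULL, NULL);
                ev.events = EPOLLIN;
                ev.data.fd = conn;
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &ev);
            } else {
                /* a connected socket is readable */
                handle_client_data(fd);
            }
        }
    }
}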
Select/poll/epoll differences
How Epoll works
epoll_create
When the kernel boots, it registers an eventpollfs file system whose file operations implement only poll and release. It then creates slab caches for allocating epitem and eppoll_entry objects and initializes the queues used for recursion checking.
epoll_create allocates an eventpoll object that records the calling user (and thus the limit on the maximum number of monitored fds), a waiting queue, the ready linked list, the root of the red/black tree, and other bookkeeping. It then creates a struct file and an fd for it (the epollfd) and stores the eventpoll object in the file's private_data pointer, so that the eventpoll object can later be recovered from the fd. The new fd is returned to the caller.
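For orientation, a simplified sketch of the eventpoll fields just described (field names follow the kernel source, but many fields are omitted and the exact layout varies between kernel versions; treat it as illustrative only):

/* Simplified, illustrative sketch of struct eventpoll (fs/eventpoll.c);
 * many fields are omitted and details differ between kernel versions. */
struct eventpoll {
    wait_queue_head_t wq;         /* processes sleeping in epoll_wait() */
    wait_queue_head_t poll_wait;  /* used when the epollfd itself is polled */
    struct list_head rdllist;     /* ready list of epitems */
    struct rb_root rbr;           /* red/black tree of monitored fds */
    struct user_struct *user;     /* owner, for the max-watches limit */
    struct file *file;            /* the struct file backing the epollfd */
};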
epoll_ctl
Copy the epoll_event structure from user space into the kernel;
Check that the fd being added supports the poll file operation;
Then fetch the eventpoll object from epfd->file->private_data and dispatch on op to decide whether this is an add, delete, or modify;
For an add, first search the eventpoll's red/black tree for an epitem with the same fd; if none is found the insert can proceed, otherwise a duplicate error is reported;
The corresponding modify and delete operations are comparatively simple;
The eventpoll is locked during the insert;
During the insert, an epitem structure corresponding to the fd is created and its members are initialized, such as the monitored fd and its struct file;
Finally, the fd's file_operations->poll function is called (it calls poll_wait at the end) to register on the device's wait queue: a poll_table variable is passed in and poll_wait is called with it; the poll_table carries a function pointer, and poll_wait really just invokes the function it points to. That function puts an entry for the current epitem onto the device's wait queue and specifies the callback to run when the device's event becomes ready; the callback's job is to put this epitem onto the rdlist.
The epitem structure is then added to the red/black tree.
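From user space, the add/modify/delete dispatch and the duplicate check described above show up in epoll_ctl's return values. A minimal sketch (it assumes epfd and fd already exist; the printed message is just for illustration):

#include <errno.h>
#include <stdio.h>
#include <sys/epoll.h>

/* Exercise the three epoll_ctl operations on one fd. */
static void ctl_demo(int epfd, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };

    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);            /* insert into the rb-tree */

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1 && errno == EEXIST)
        printf("duplicate add rejected: fd is already in the tree\n");

    ev.events = EPOLLIN | EPOLLOUT;
    epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);            /* modify the registered events */

    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);           /* remove from the tree */
}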
epoll_wait
Compute the sleep time (if a timeout was given) and check whether the eventpoll's ready list is empty. If it is, initialize a wait-queue entry, add the current process to the eventpoll's wait queue, and set the process state to sleeping. Check for pending signals (if one is pending, wake up and return interrupted immediately). Otherwise call schedule_timeout to sleep. When the sleep times out or the process is woken, first remove it from the wait queue, then copy the ready events to user space.
To copy the events, the ready list is first spliced onto an intermediate list, which is then traversed and copied to user space one entry at a time.
For each entry, epoll also checks whether it is level-triggered; if so, the entry is inserted back onto the ready list.
The actual implementation has to handle many details: what if an event becomes ready while rdlist is being copied? Can an epollfd that is itself monitored by another epoll cause a wake-up loop? In LT mode, when does an entry finally get removed from rdlist?
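From the user-space side, the sleep/timeout/signal handling above is visible in epoll_wait's return value. A minimal sketch of a wait loop that distinguishes a timeout, a signal interruption, and ready events (assumes epfd was created earlier):

#include <errno.h>
#include <stdio.h>
#include <sys/epoll.h>

#define MAX_EVENTS 16

/* Wait up to 1 second for events on epfd, tolerating signal interruption. */
int wait_once(int epfd, struct epoll_event *events)
{
    int n;
    do {
        n = epoll_wait(epfd, events, MAX_EVENTS, 1000 /* ms */);
    } while (n == -1 && errno == EINTR);   /* woken by a signal: retry */

    if (n == 0)
        printf("timed out, nothing ready\n");
    return n;   /* number of ready events, or -1 on a real error */
}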
Epoll ET and LT
Kernel implementation:
The difference exists only when events are returned from rdlist. The kernel first copies rdlist into a temporary list txlist. If an entry is level-triggered and its event is still ready, the fd is put back onto rdllist, so the next epoll_wait will copy it to the user again. For example, suppose a socket has only been connected and no data has been sent or received: its poll event mask always contains POLLOUT, so every call to epoll_wait returns a POLLOUT event, because the fd keeps being put back onto rdllist. If someone then writes so much data to the socket that the send buffer fills up and the socket would block, the fd is no longer put back onto rdllist and epoll_wait stops returning POLLOUT to the user. If instead we register the socket with EPOLLET and again only connect without sending or receiving data, epoll_wait returns a single POLLOUT notification (because the fd is never put back onto rdllist) and no further notifications follow.
Note: the LT fd is not put back onto rdlist after the user has processed the event; it is copied to the user and then immediately put back onto rdlist. So what happens if the user consumes the event and it is no longer ready, for example a readable event is returned, the user reads until nothing is left, and then calls epoll_wait again: will rdlist hand back a now-unreadable fd? In fact, before each return, epoll calls the fd's poll again with a NULL poll_table to re-check whether the event has changed (passing a real poll_table would add the caller to the wait queue); if the state of the entry in rdlist has changed and the event is no longer ready, it is not returned to the user.
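A small user-space experiment makes the LT/ET difference visible. This sketch uses a socketpair as a stand-in for the connected-but-idle socket in the example: under LT the writable event is reported on every call, under ET only once.

#include <stdio.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Report how many consecutive epoll_wait calls (out of 3) see EPOLLOUT. */
static void probe(unsigned int flags, const char *label)
{
    int sv[2];
    socketpair(AF_UNIX, SOCK_STREAM, 0, sv);   /* sv[0] is idle and writable */

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLOUT | flags, .data.fd = sv[0] };
    epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev);

    int hits = 0;
    for (int i = 0; i < 3; i++)
        hits += epoll_wait(epfd, &ev, 1, 0);   /* 0 ms: just sample the state */

    printf("%s: EPOLLOUT reported %d time(s)\n", label, hits);
    close(epfd);
    close(sv[0]);
    close(sv[1]);
}

int main(void)
{
    probe(0, "LT");        /* expected: 3, the fd keeps returning to rdllist */
    probe(EPOLLET, "ET");  /* expected: 1, the fd is not put back */
    return 0;
}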
Trigger method:
Based on the analysis of the two ways an fd gets added to rdlist, the conditions under which ET mode wakes up (reports ready) are as follows:
For read operations:
(1) When the buffer status changes from unreadable to readable, that is, the buffer status changes from empty to non-empty.
(2) When new data arrives, that is, the amount of content to be read in the buffer increases.
(3) When the buffer already contains readable data (that is, it is not empty) and the user performs epoll_ctl(EPOLL_CTL_MOD) with EPOLLIN on the fd (see the sketch after this list)
For write operations:
(1) When the buffer changes from non-writable to writable, that is, the buffer changes from full to not full.
(2) When buffered data is sent out, that is, the amount of pending data in the send buffer decreases.
(3) When the buffer has writable space (that is, it is not full) and the user performs epoll_ctl(EPOLL_CTL_MOD) with EPOLLOUT on the fd
LT mode is much simpler: in addition to the cases above, as long as the event remains ready it keeps being reported.
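Condition (3) can be used deliberately: re-issuing EPOLL_CTL_MOD on an fd produces a fresh ET notification if data is still pending in the buffer. A minimal sketch (epfd and fd are assumed to exist already):

/* Sketch: re-arm an ET fd so that leftover buffered data generates a
 * new notification on the next epoll_wait. Assumes epfd and fd exist. */
#include <sys/epoll.h>

static int rearm_read(int epfd, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = fd };
    /* MOD on an fd whose buffer is non-empty triggers a new ET event */
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}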
The reason why ET is more efficient than LT:
From the analysis above, LT has to re-process entries on rdlist every time, so more data is copied to the user and each epoll_wait iteration does more work, which naturally costs some performance.
On the other hand, from the user's perspective, ET mode makes EPOLLOUT handling simpler and removes the epoll_ctl(EPOLL_CTL_MOD) calls needed to switch EPOLLOUT on and off, which can improve performance. For example, suppose you need to write 1 MB of data but the socket buffer only takes 256 KB before write returns EAGAIN. In ET mode you simply continue writing the remaining data whenever epoll next reports EPOLLOUT, and when there is nothing left to write you just ignore the event. In LT mode you must first enable EPOLLOUT, and once there is no more data to write you must disable it again (otherwise EPOLLOUT would be reported on every call). Each epoll_ctl is a system call that traps into the kernel and locks the red/black tree. So ET handles EPOLLOUT more conveniently and efficiently, and it is harder to miss events or introduce bugs. If a server's responses are usually small, EPOLLOUT is rarely triggered, and LT is a good fit; redis is an example. In that case you do not even need to register EPOLLOUT up front: if the data is small enough, write it directly, and only register (and later unregister) EPOLLOUT when a write cannot complete. As a high-performance general-purpose server, nginx may push enough traffic to saturate a gigabit link, where EPOLLOUT is triggered easily, so it uses ET.
See zhihu https://www.zhihu.com/question/20502870/answer/89738959.
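As a concrete illustration of the ET write pattern described above, here is a hedged sketch of a handler that drains a pending buffer each time EPOLLOUT arrives on a non-blocking socket; the per-connection bookkeeping (struct conn and its fields) is hypothetical.

#include <errno.h>
#include <unistd.h>

/* Hypothetical per-connection state: data still waiting to be written. */
struct conn {
    int fd;
    const char *pending;   /* start of unsent data */
    size_t pending_len;    /* bytes remaining */
};

/* Called whenever epoll (ET) reports EPOLLOUT for conn->fd. */
static void on_writable(struct conn *c)
{
    while (c->pending_len > 0) {
        ssize_t n = write(c->fd, c->pending, c->pending_len);
        if (n > 0) {
            c->pending += n;
            c->pending_len -= (size_t)n;
        } else if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            return;   /* socket buffer full: wait for the next EPOLLOUT edge */
        } else {
            return;   /* real error: caller should close the connection */
        }
    }
    /* all data written: nothing to do; no epoll_ctl needed in ET mode */
}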
Practical application:
When epoll works in ET mode, for reads: if a read does not drain all the data in the buffer, no further notification will arrive, so the data left in the buffer cannot be read until new data arrives. For writes the issue comes from the non-blocking nature of the fd in ET mode: how to make sure all the data the user wants to write actually gets written.
To solve these two ET-mode read and write problems, we must ensure:
A. For reads: as long as there is data in the buffer, keep reading;
B. For writes: as long as there is still space in the buffer and the user still has data to write, keep writing.
To do this, every connected socket must be in non-blocking mode, because read/write has to loop until an error occurs (for read, you may also stop when the number of bytes actually read is smaller than the number requested). If the file descriptor is not non-blocking, the final read or write will block; the thread then blocks there instead of in epoll_wait, and the tasks on the other file descriptors starve.
That is why it is often said that "ET must work with non-blocking descriptors". Of course, this does not mean ET cannot work in blocking mode, only that blocking mode may run into the problems above.
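A minimal sketch of rule A, the ET-mode read loop on a non-blocking socket; process_data() is a hypothetical consumer of the bytes read.

#include <errno.h>
#include <unistd.h>

/* Hypothetical consumer of the bytes that were read. */
void process_data(const char *buf, ssize_t len);

/* ET-mode read: drain the socket until EAGAIN (or EOF/error). */
static void on_readable(int fd)
{
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            process_data(buf, n);       /* keep reading: more may remain */
        } else if (n == 0) {
            break;                      /* peer closed the connection */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            break;                      /* buffer drained: wait for next edge */
        } else if (errno == EINTR) {
            continue;                   /* interrupted: retry */
        } else {
            break;                      /* real error: caller should close fd */
        }
    }
}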
Accept in ET Mode
Consider this situation: multiple connections arrive at the same time, so the server's TCP accept queue instantly accumulates several ready connections.
In edge-triggered mode epoll notifies only once, and if accept is called just once it handles only one connection; the remaining connections in the accept queue are never processed.
The solution is to wrap accept in a while loop and exit the loop only after all connections in the accept queue have been handled. How do we know the queue is empty? When accept returns -1 with errno set to EAGAIN, all pending connections have been accepted.
The code is as follows:
while ((conn_sock = accept(listenfd, (struct sockaddr *)&remote, (socklen_t *)&addrlen)) > 0) {
    handle_client(conn_sock);
}
if (conn_sock == -1) {
    if (errno != EAGAIN && errno != ECONNABORTED
        && errno != EPROTO && errno != EINTR)
        perror("accept");
}
Extension: when the server uses an I/O multiplexing mechanism (select, poll, epoll, etc.), accept should work in non-blocking mode.
Reason: suppose accept works in blocking mode, and consider this situation: a TCP connection is aborted by the client, that is, after select (or poll/epoll) has reported the listening socket readable but before the server calls accept, the client sends an RST that terminates the connection, so the newly established connection is removed from the accept queue. If the listening socket is blocking, the server then blocks in accept until some other client establishes a new connection. During that time the server is stuck in the accept call (when it should be blocking in select), and the other ready descriptors cannot be processed.
The solution is to set the listening socket to non-blocking. Then, if a client aborts a connection before the server calls accept, the accept call returns -1 immediately. Berkeley-derived implementations handle the aborted connection inside the kernel and never report the event to the caller, while other implementations return with errno set to ECONNABORTED or EPROTO; we should simply ignore these two errors. (For details, see UNP vol. 1, p. 363.)
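For completeness, a small sketch of putting the listening socket into non-blocking mode (with fcntl) before registering it with epoll:

#include <fcntl.h>

/* Set O_NONBLOCK on a socket; returns -1 on failure. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}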
EPOLLONESHOT
In some designs the monitoring of events and the handling of data are separated, for example the main thread monitors with epoll while worker threads receive and process the data. Two threads may then end up operating on the same socket: the main thread hands an event to thread 1, and while thread 1 is still processing it, another notification causes the main thread to hand the same socket to thread 2, leading to inconsistent data. In such cases you generally register an EPOLLONESHOT event on the file descriptor: the operating system will deliver at most one of the readable or writable events registered on it, and only once, unless we re-arm it with epoll_ctl. Correspondingly, the thread that handled the event must re-register it after it finishes processing, otherwise the event will never be triggered again. See section 9.3.4 of Linux High-Performance Server Programming.
This has a drawback, though: every event now costs an extra epoll_ctl call that traps into the kernel, and epoll locks the red/black tree to stay thread-safe, which hurts performance. An alternative is to handle it at the application layer: keep an atomic integer per handle that records whether a thread is currently processing it. When an event arrives, check the atomic integer; if a thread is already processing the handle, do not dispatch it to another thread, otherwise assign a thread. This avoids trapping into the kernel, and epoll_data can be used to carry the atomic integer (or a pointer to it).
EPOLLONESHOT can be used with either ET or LT to prevent data inconsistency, because it prevents re-triggering altogether; the atomic-integer approach, however, only works in ET mode: it does not prevent re-triggering, it only prevents multiple threads from processing the same handle. In some cases the computation cannot keep up with the I/O, i.e. the buffer is not drained promptly, and the receiving thread is separate from the main thread. With LT the main thread would keep being triggered by the still-ready event and busy-loop. With ET a trigger happens only when new data arrives; nothing is triggered while the buffer merely still has content, so the number of wake-ups drops. The main thread may still spin occasionally (an event arrives on an fd that a worker is already handling, so there is nothing to do and it would be better to go straight back to epoll_wait), but this spinning is cheaper than the repeated epoll_ctl calls.
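A hedged sketch of both techniques discussed above: re-arming EPOLLONESHOT after a worker finishes, and the application-level atomic flag carried in epoll_data.ptr. struct conn_state and its fields are hypothetical.

#include <stdatomic.h>
#include <sys/epoll.h>

/* Hypothetical per-connection state stored in epoll_data.ptr. */
struct conn_state {
    int fd;
    atomic_int busy;   /* 1 while some worker thread owns this connection */
};

/* Variant 1: EPOLLONESHOT. The worker calls this after it finishes,
 * otherwise the fd will never be reported again. */
static int rearm_oneshot(int epfd, struct conn_state *c)
{
    struct epoll_event ev = {
        .events = EPOLLIN | EPOLLET | EPOLLONESHOT,
        .data.ptr = c,
    };
    return epoll_ctl(epfd, EPOLL_CTL_MOD, c->fd, &ev);
}

/* Variant 2: atomic flag, ET mode, no extra epoll_ctl. The main thread
 * calls this for each reported event; it returns 1 if a worker should be
 * dispatched and 0 if one is already processing the connection. */
static int try_claim(struct conn_state *c)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&c->busy, &expected, 1);
}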
Epoll misunderstandings
1. Does epoll ET mode only support non-blocking handles?
In fact it also supports blocking handles, but given the typical application scenarios it is generally only suitable for non-blocking use. For more information, see the "Practical application" discussion under "Epoll ET and LT" above.
2. Does epoll use shared memory?
Is epoll more efficient than select because it uses shared memory when copying the ready file descriptors from the kernel? No, that is not correct. The implementation only uses copy_from_user and __put_user to exchange data between the kernel and the user's virtual address space; no shared-memory API is involved.
Problem highlights
Why epoll needs to call op->poll again
Because when an event arrives, every process on the wait queue is woken. An earlier process may already have consumed the event, and may even have removed it, so a process that wakes up later may find nothing left to consume. It therefore has to call poll again to check: if the event is still pending, the item is added to rdlist. Of course, a consumed event is not necessarily removed; flags on the wait queue can control how events are consumed.
Epoll puts LT events from txlist straight back onto rdlist before the user has consumed them. If the user then consumes the event so that it is no longer ready and calls epoll_wait again, will epoll_wait still return it from rdlist?
No: before returning the ready list, epoll calls revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL) again for each entry to re-check the event; if the event has changed and is no longer ready, it is not returned.
Kernel waiting queue:
To support blocking access to devices, the kernel provides wait queues that hold processes; when a device event becomes ready, the processes on the wait queue are woken to consume it. When select/poll/epoll monitor non-blocking handles, though, the wait queue is not used to implement blocking I/O but to implement waiting for state changes, that is, to notify the monitoring process when a readable or writable event occurs.
Was the kernel poll mechanism designed for poll/select?
Each device driver has to provide a set of functions so that the device can be used through the operating system's virtual file system, such as read and write; poll is one of them. For the select and poll implementations (and epoll), it is used to query whether a device is readable, writable, or in an exceptional state.
The two wait queues in eventpoll
There are two wait queues in eventpoll:
wait_queue_head_t wq;
wait_queue_head_t poll_wait;
The former is used when epoll_wait() is called: the caller "sleeps" on this wait queue.
The latter is used when the epollfd itself is polled, that is, when the epollfd is monitored by another epoll and its file->poll() is called.
When a handle monitored by this epoll instance has an event, both wq and poll_wait are woken up.
File operations implemented by eventpollfs
Only poll and release are implemented. Since epoll is itself a file system, its descriptor can in turn be monitored by poll/select/epoll, so the poll method must be implemented; concretely, this is ep_eventpoll_poll. Its internal implementation inserts the thread polling this epollfd into the epollfd's own poll_wait queue and then checks whether any of the handles this epoll monitors has a pending event; if so, that information is returned to whoever is waiting on the epollfd. The check scans the ready list and calls poll on each file to confirm it is really ready. However, the ready list may itself contain epoll files, so calling poll could recurse; those calls are therefore wrapped in ep_call_nested to prevent infinite loops and excessive call depth. For details, see "Recursion depth detection (ep_call_nested)" below.
Epoll and thread safety
If one thread is blocked in epoll_wait(), it is fine for another thread to add a new file descriptor to the same epoll instance; if that descriptor becomes ready, the blocked epoll_wait() will be woken. However, if a monitored file descriptor is closed by another thread, the behaviour is unspecified. On some UNIX systems, select unblocks and returns with the file descriptor reported ready, but I/O on it then fails (unless the descriptor number has been reassigned); on Linux, closing the descriptor in another thread has no effect on the waiter. Either way, you should avoid closing a file descriptor that another thread is still monitoring.
Recursion depth detection (ep_call_nested)
Epoll is itself a file and can therefore be monitored by poll/select/epoll; if epoll instances monitor each other, this could produce an infinite loop. In the epoll implementation, every function that might recurse is wrapped in ep_call_nested, which detects an infinite loop or excessive recursion depth and in that case breaks out of the recursive call and returns directly. The implementation relies on a global linked list, nested_call_node (different call sites use different lists). Each invocation of the potentially recursive function (nproc) adds a node to the list containing the calling context ctx (the process, CPU, or epoll file) and a cookie identifying the object being processed. By checking whether an identical node already exists, it can tell that a loop has occurred; by counting the nodes with the same context, it can measure the recursion depth. See reference 2.
Why do we need to create a file system:
First, state can be maintained in the kernel and preserved across multiple epoll_wait calls (the eventpoll structure is stored there). Second, epoll itself can then be polled/epolled like any other file.
Two callback Functions
Epoll interacts with a device's wait queue by calling the device's poll function. Inside poll, poll_wait invokes the ep_ptable_queue_proc function, which inserts an entry for the current epitem into the device's wait queue and registers ep_poll_callback as the wake-up callback. ep_poll_callback copies the ready handle onto rdlist and wakes up wq, the eventpoll wait queue.
References:
1. http://blog.csdn.net/shuxiaogd/article/details/50366039
Describes kernel blocking, non-blocking, and poll mechanisms, and analyzes the implementation of select.
2. http://paperman825.m.blog.chinaunix.net/uid-28541347-id-4236779.html
Explains the implementation of poll and paves the way for the next post.
3. http://paperman825.m.blog.chinaunix.net/uid-28541347-id-4238524.html?page=2
The epoll implementation; this was my introduction to the topic.
4. http://watter1985.iteye.com/blog/1614039
The poll mechanism and the select/poll/epoll implementations; very comprehensive and systematic.
5. http://21cnbao.blog.51cto.com/109393/120099
The poll mechanism; not directly related, but helpful for understanding the kernel wait queue.
6. https://www.nowcoder.com/discuss/26226
Comments from a nowcoder discussion.
7. http://paperman825.m.blog.chinaunix.net/uid/28541347/sid-193117-list-2.html
The whole series by this author is worth reading! Haha.