With full epoll support in the 2.6 kernel, many articles and code samples on the net claim that replacing the traditional poll with epoll brings a performance boost to web server applications. Most of them, however, do not explain the reasons for the improvement very well, so here I will try to analyze how poll and epoll work in the kernel (2.6.21.1) code and then back the comparison up with some test data.
Poll: First, poll. Both poll and select are familiar to most Unix/Linux programmers; the two work on similar principles and show no obvious performance difference, but select limits the number of file descriptors it can monitor, so poll is used for the explanation here. Poll is a system call whose kernel entry point is sys_poll. sys_poll does almost no processing of its own and calls do_sys_poll directly; the execution of do_sys_poll can be divided into three parts:
1. Copy the user's pollfd array into kernel space. Because the copy is proportional to the length of the array, this step is O(n) in time. The code for it runs from the beginning of do_sys_poll up to the call to do_poll.
2. Query the state of the device behind each file descriptor. If a device is not ready, add an entry to that device's wait queue and continue querying the next device. If no device is ready after all of them have been queried, the current process is suspended by calling schedule_timeout until some device becomes ready or the timeout expires. Once a device is ready, the process is woken up and traverses all the devices again to find the ready ones. Because all devices are traversed twice, the time complexity of this step is O(n), not counting the time spent waiting. The relevant code is in do_poll.
3. Copy the gathered results back to user space and do the cleanup work of releasing memory and detaching from the wait queues. Copying the data to user space and detaching from the wait queues are again O(n) operations; the code covers the part of do_sys_poll from the call to do_poll to the end of the function.
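To make the per-call cost concrete, here is a minimal user-space sketch of a poll loop (error handling is abbreviated; the fds array, nfds and the handle_io helper are assumptions for illustration). On every iteration the whole array crosses the user/kernel boundary and the whole array has to be scanned for ready entries.

#include <poll.h>

/* hypothetical handler for a ready descriptor */
extern void handle_io(int fd, short revents);

void poll_loop(struct pollfd *fds, nfds_t nfds)
{
    for (;;) {
        /* the whole fds array is copied into the kernel on every call: O(n) */
        int nready = poll(fds, nfds, -1);
        if (nready <= 0)
            continue;               /* real code would distinguish EINTR from fatal errors */

        /* the caller must scan every entry to find the ready ones: O(n) */
        for (nfds_t i = 0; i < nfds; i++) {
            if (fds[i].revents)
                handle_io(fds[i].fd, fds[i].revents);
        }
    }
}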
Epoll:
Next, epoll. Unlike poll/select, epoll is no longer a single system call but a set of three: epoll_create, epoll_ctl and epoll_wait. The benefit of splitting it up will become clear below.
First look at sys_epoll_create (the kernel function behind epoll_create). It only does some preparatory work, such as creating and initializing data structures, and finally returns a file descriptor representing the newly created virtual epoll file. This can be considered a constant-time operation.
Epoll is implemented as a virtual file system, which brings at least the following two benefits:
1. Some state can be maintained in the kernel and preserved across multiple epoll_wait calls, such as the set of all monitored file descriptors.
2. The epoll file descriptor itself can in turn be monitored by poll/epoll.
The details of the epoll virtual file system implementation are irrelevant to the performance analysis and are not covered here.
One detail visible in sys_epoll_create is that the size parameter of epoll_create is currently meaningless; any value greater than 0 will do.
Next is sys_epoll_ctl (the kernel function behind epoll_ctl). Note that each call to sys_epoll_ctl handles only one file descriptor. The description here focuses on the execution path when op is EPOLL_CTL_ADD: after some safety checks, sys_epoll_ctl enters ep_insert, which registers ep_poll_callback as a callback on the device's wait queue (assuming the device is not yet ready). Because each epoll_ctl call manipulates only one file descriptor, it too can be regarded as an O(1) operation.
The ep_poll_callback function is critical: it is called back by the system once the awaited device becomes ready, and it performs two operations:
1. Add the ready device to the ready list. This step avoids having to re-scan all devices after one becomes ready, as poll does, reducing the time complexity from O(n) to O(1).
2. Wake up the virtual epoll file.
Finally, sys_epoll_wait; the function that actually does the work is ep_poll. It inserts the calling process into the wait queue of the virtual epoll file and waits there until it is woken up (by the ep_poll_callback function described above), and then calls ep_events_transfer to copy the results to user space. Because only information about ready devices is copied, the cost of this copy does not depend on the total number of monitored descriptors.
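As a point of reference for how the three calls work together, here is a minimal user-space sketch (error handling is abbreviated; listenfd, MAX_EVENTS and the handle_io helper are assumptions for illustration):

#include <stdint.h>
#include <sys/epoll.h>

#define MAX_EVENTS 64

/* hypothetical handler for a ready descriptor */
extern void handle_io(int fd, uint32_t events);

void epoll_loop(int listenfd)
{
    int epfd = epoll_create(1);                      /* the size argument only needs to be > 0 */

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listenfd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listenfd, &ev);   /* registered once, not on every iteration */

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        /* only descriptors that are actually ready are returned and copied out */
        int nready = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < nready; i++)
            handle_io(events[i].data.fd, events[i].events);
    }
}

In a real server, the handler for listenfd would accept() the new connection and register it with another epoll_ctl call; that incremental registration is exactly the point of splitting poll into separate calls.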
One point of interest is how epoll handles EPOLLET, that is, edge-triggered mode. A rough look at the code suggests it mainly shifts part of the work the kernel does in level-triggered mode onto the user; intuitively this should not affect performance much. Interested readers are welcome to discuss.
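For what it is worth, the usual user-space pattern under EPOLLET is to make the descriptor non-blocking and drain it until read() reports EAGAIN, because the readiness notification arrives only once per edge. A minimal sketch under that assumption (epfd and fd are as in the sketch above; the buffer handling is elided):

#include <sys/epoll.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>

/* register fd in edge-triggered mode on an existing epoll instance epfd */
void add_edge_triggered(int epfd, int fd)
{
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
    struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* on an EPOLLIN notification, read until the kernel reports EAGAIN */
void drain(int fd)
{
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0)
            continue;                    /* process buf[0..n) in real code */
        if (n < 0 && errno == EAGAIN)
            break;                       /* no more data until the next edge */
        break;                           /* EOF or error: handle/close in real code */
    }
}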
Poll/epoll Comparison:
On the surface, one round of poll can be seen as made up of one epoll_create, a number of epoll_ctl calls, one epoll_wait and one close. The real reason epoll splits poll into separate calls lies in how poll is used in server software (such as web servers):
1. A large number of file descriptors need to be polled at the same time;
2. After each poll completes, only a small fraction of all polled descriptors are ready;
3. The file descriptor array (ufds) changes only very slightly between poll calls.
The traditional poll call therefore reinvents the wheel every time: it reads the entire ufds array in from user space, copies it all back out when it is done, and at least once per call adds every device to a wait queue and removes it again. These are the reasons it is inefficient.
Epoll takes all of the above into account. There is no need to read the full ufds array in and out on every call; a small part of it can be adjusted with epoll_ctl as needed, and there is no need to perform the wait-queue add and remove work on every epoll_wait. In addition, the improved mechanism means that once a device becomes ready there is no need to search through the entire device array. The most visible point from the user's perspective is that with epoll there is no need to scan all the returned results to find the ready part: that scan goes from O(n) to O(1), which improves performance.
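To illustrate the incremental adjustment: when a connection arrives or goes away, only that one descriptor needs to be registered or dropped, whereas with poll the whole ufds array has to be rebuilt and copied in again on the next call. A small sketch (epfd and connfd as in the earlier example):

#include <sys/epoll.h>
#include <unistd.h>

/* a new connection arrived: register just this one descriptor */
void watch_connection(int epfd, int connfd)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = connfd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, connfd, &ev);
}

/* a connection went away: drop just this one descriptor */
void drop_connection(int epfd, int connfd)
{
    epoll_ctl(epfd, EPOLL_CTL_DEL, connfd, NULL);   /* kernels before 2.6.9 needed a non-NULL dummy event here */
    close(connfd);
}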
This also raises a question: would it not be possible to let epoll_ctl handle multiple fds at once (the way semctl does) and gain some performance, especially if system calls themselves are expensive? The cost of system calls will be analyzed later, however.
 
Poll/epoll test data comparison:
Test environment: I wrote three programs to simulate a server, an active client and a zombie client. The server runs on a machine with a self-compiled vanilla 2.6.11 kernel on PIII 933 hardware; the two clients each run on another PC. Both client PCs have better hardware than the server, mainly to guarantee that the server can be loaded easily; the three machines are connected by a 100M switch.
The server accepts all connections and polls them, replies with a response whenever a request arrives, and then continues polling.
The active client simulates several concurrent active connections that send requests and receive replies without pause.
The zombie client simulates clients that only connect but never send a request, existing solely to occupy the server's poll descriptor slots.
Test process: keep 10 concurrent active connections, continuously adjust the number of concurrent zombie connections, and record the performance difference between poll and epoll at the different ratios. The number of zombie connections takes, in turn, the values 0, 10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120 and 10240.
In the chart, the horizontal axis is the ratio of zombie connections to active connections and the vertical axis is the time, in seconds, needed to complete 40,000 request/reply exchanges. The red line is the poll data and the green line the epoll data. As the chart shows, the time for poll grows linearly with the number of monitored file descriptors, while epoll stays essentially flat, almost unaffected by the number of descriptors.
When all monitored clients are active, poll is slightly more efficient than epoll (mainly near the origin, i.e. when there are zero zombie connections; it is hard to see in the chart), precisely because epoll's implementation is more complex than poll's, and monitoring a small number of descriptors is not its strong point. The test code and the detailed data can be obtained from here; discussion is welcome.
The select() system call provides a mechanism for synchronous multiplexed I/O:
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

int select(int n, fd_set *readfds, fd_set *writefds, fd_set *exceptfds,
           struct timeval *timeout);

FD_CLR(int fd, fd_set *set);
FD_ISSET(int fd, fd_set *set);
FD_SET(int fd, fd_set *set);
FD_ZERO(fd_set *set);
Calling select() blocks until the given file descriptors are ready to perform I/O, or until the optionally specified timeout has elapsed. The monitored file descriptors are broken into three sets, each waiting for a different event. The file descriptors listed in readfds are watched to see whether data is available for reading (whether a read would complete without blocking). The file descriptors listed in writefds are watched to see whether a write would complete without blocking. Finally, the file descriptors in exceptfds are watched for exceptional conditions or the availability of out-of-band data (these states apply only to sockets). Any of the three sets may be NULL, in which case select() does not watch for that type of event. On successful return, each set is modified so that it contains only the file descriptors that are ready for I/O of the corresponding type. For example, assume two file descriptors with the values 7 and 9 are placed in readfds. When select() returns, if 7 is still in the set, that file descriptor is ready to be read without blocking. If 9 is no longer in the set, reading it would probably block (I say "probably" because the data may have become available just after select() returned, in which case a subsequent call to select() will report the descriptor as ready to read). The first parameter, n, is equal to the value of the highest file descriptor in any set, plus one. Consequently, the caller of select() is responsible for finding the highest-valued file descriptor and passing that value plus one as the first argument. The timeout parameter is a pointer to a timeval structure, which is defined as follows:
#include <sys/time.h>

struct timeval {
        long tv_sec;    /* seconds */
        long tv_usec;   /* microseconds */
};
If this parameter is not NULL, select() returns after tv_sec seconds and tv_usec microseconds even if no file descriptor is ready for I/O. On return, the state of the timeout parameter is undefined across different systems, so the timeout (and the file descriptor sets) must be reinitialized before each call to select(). Current Linux versions in fact modify the timeout parameter automatically, setting it to the time remaining; thus, if the timeout was set to 5 seconds and 3 seconds elapse before a file descriptor becomes ready, tv_sec will contain 2 when the call returns. If both values in the timeout are set to 0, the call to select() returns immediately, reporting all events pending at the time of the call but not waiting for any subsequent ones. The file descriptor sets are not manipulated directly; instead, they are managed through a set of helper macros. This lets Unix systems implement the sets however they prefer; most, however, implement them simply as bit arrays. FD_ZERO removes all file descriptors from the given set and should be called before every invocation of select():
fd_set writefds;
FD_ZERO(&writefds);
FD_SET adds a file descriptor to a given set, and FD_CLR removes one from a given set:
FD_SET(fd, &writefds);    /* add 'fd' to the set */
FD_CLR(fd, &writefds);    /* oops, remove 'fd' from the set */
Well-designed code should never have to use FD_CLR, and it really is rarely used in practice. FD_ISSET tests whether a file descriptor is part of a given set; it returns a nonzero integer if the descriptor is in the set and 0 if it is not. FD_ISSET is used after a call to select() to test whether a given file descriptor is ready for the relevant operation:
if (FD_ISSET(fd, &readfds))
        /* 'fd' is readable without blocking! */
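Putting the call and the macros together, a minimal sketch of waiting for one socket to become readable might look like this (sockfd and the 5-second timeout are illustrative choices, not from the text):

#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* wait up to 5 seconds for sockfd to become readable; return nonzero if it is */
int wait_readable(int sockfd)
{
    fd_set readfds;
    struct timeval timeout;

    /* both the set and the timeout must be reinitialized before every call */
    FD_ZERO(&readfds);
    FD_SET(sockfd, &readfds);
    timeout.tv_sec = 5;
    timeout.tv_usec = 0;

    int ret = select(sockfd + 1, &readfds, NULL, NULL, &timeout);
    if (ret <= 0)
        return 0;                       /* timeout (0) or error (-1) */
    return FD_ISSET(sockfd, &readfds);  /* nonzero if sockfd is ready to read */
}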
Because file descriptor sets are statically sized, they impose a limit on both the maximum number of file descriptors that can be placed in a set and the largest descriptor value that may be placed in one, given by FD_SETSIZE. On Linux this value is 1024. We will see the ramifications of this limit later in this chapter.
Return values and error codes: on success, select() returns the number of file descriptors that are ready for I/O across all three sets. If a timeout was given, the return value may be 0. On error it returns -1 and sets errno to one of the following values:
EBADF: an invalid file descriptor was given in one of the sets.
EINTR: a signal was caught while waiting; the call can be reissued.
EINVAL: the parameter n is negative, or the given timeout is invalid.
ENOMEM: insufficient memory was available to complete the request.
--------------------------------------------------------------------------------------------------------------
The poll() system call is System V's multiplexed I/O solution. It solves several shortcomings of select(), although select() is still frequently used (mostly out of habit, or in the name of portability):
#include <sys/poll.h>

int poll(struct pollfd *fds, unsigned int nfds, int timeout);
Unlike select(), poll() does not use the three inefficient bit-based file descriptor sets; instead, it employs a single array of pollfd structures, pointed to by fds. The pollfd structure is defined as follows:
#include <sys/poll.h>

struct pollfd {
        int fd;          /* file descriptor */
        short events;    /* requested events to watch */
        short revents;   /* returned events witnessed */
};
Each pollfd structure specifies one file descriptor to monitor; multiple structures can be passed, telling poll() to monitor multiple file descriptors. The events field of each structure is the event mask for that descriptor and is set by the user; the revents field is the result mask for that descriptor and is set by the kernel when the call returns. Any of the events requested in events may be returned in revents. The valid events are:
POLLIN: there is data to read.
POLLRDNORM: there is normal data to read.
POLLRDBAND: there is priority data to read.
POLLPRI: there is urgent data to read.
POLLOUT: writing will not block.
POLLWRNORM: writing normal data will not block.
POLLWRBAND: writing priority data will not block.
POLLMSG: a SIGPOLL message is available.
In addition, the following events may be returned in revents:
POLLERR: an error occurred on the given file descriptor.
POLLHUP: a hangup occurred on the given file descriptor.
POLLNVAL: the given file descriptor is invalid.
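As a minimal sketch of how these pieces fit together (sockfd and the 5-second timeout are illustrative choices, not from the text):

#include <sys/poll.h>

/* wait up to 5 seconds for sockfd to become readable or writable;
   return the revents mask (0 on timeout or error) */
short wait_ready(int sockfd)
{
    struct pollfd pfd;

    pfd.fd = sockfd;
    pfd.events = POLLIN | POLLOUT;   /* what we want to be notified about */
    pfd.revents = 0;                 /* filled in by the kernel on return */

    int ret = poll(&pfd, 1, 5000);   /* the timeout is in milliseconds */
    if (ret <= 0)
        return 0;                    /* 0 on timeout, -1 on error */

    /* POLLIN set: readable without blocking; POLLOUT set: writable without blocking */
    return pfd.revents;
}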
POLLERR, POLLHUP and POLLNVAL make no sense in the events field, because they are always returned in revents whenever applicable. Using poll() thus differs from select(): you do not need to explicitly request reports of exceptional conditions. POLLIN | POLLPRI is equivalent to select()'s read event and POLLOUT | POLLWRBAND is equivalent to select()'s write event; POLLIN is equivalent to POLLRDNORM | POLLRDBAND, and POLLOUT is equivalent to POLLWRNORM. For example, to monitor whether a file descriptor is both readable and writable, set events to POLLIN | POLLOUT. When poll() returns, check the flags in revents, which correspond to the events requested for that file descriptor. If POLLIN is set, the descriptor can be read without blocking; if POLLOUT is set, it can be written without blocking. These flags are not mutuallyexclusive: both may be set at the same time, meaning that both reads and writes on that descriptor will return without blocking. The timeout parameter specifies how many milliseconds to wait before poll() returns, regardless of whether any I/O is ready. A negative timeout means an infinite wait; a timeout of 0 means the call returns immediately, listing any descriptors already ready for I/O but not waiting for further events; in this case poll(), true to its name, polls once and returns.
Return values and error codes: on success, poll() returns the number of structures whose revents field is nonzero; if no event occurs before the timeout elapses, poll() returns 0. On failure it returns -1 and sets errno to one of the following values:
EBADF: an invalid file descriptor was given in one or more of the structures.
EFAULT: the fds pointer points outside the process's address space.
EINTR: a signal occurred before any requested event; the call can be reissued.
EINVAL: the nfds parameter exceeded the RLIMIT_NOFILE value.
ENOMEM: insufficient memory was available to complete the request.
--------------------------------------------------------------------------------------------------------------
The material above is taken from "O'Reilly - Linux System Programming: Talking Directly to the Kernel and C Library" (2007).
--------------------------------------------------------------------------------------------------------------
Advantages of epoll:
1. Support for a single process opening a large number of socket descriptors (fds). What select can least tolerate is that the fds opened by one process are subject to a limit, set by FD_SETSIZE, whose default value is 2048. That is clearly too small for an IM server that needs to support tens of thousands of connections. You can choose to modify the macro and recompile the kernel, but reports indicate that this degrades network efficiency; alternatively you can choose a multi-process solution (the traditional Apache approach), but although the cost of creating a process on Linux is relatively small, it is still not negligible, and data synchronization between processes is far less efficient than between threads, so that is not a perfect solution either.
Epoll, however, does not have this restriction. The fd limit it is bound by is the maximum number of open files, which is generally far greater than 2048; on a machine with 1GB of memory it is around 100,000. The exact number can be seen with cat /proc/sys/fs/file-max; in general it is strongly related to system memory.
2. I/O efficiency does not decrease linearly with the number of fds. Another Achilles' heel of traditional select/poll is that when you hold a large set of sockets but, because of network latency, only part of them are "active" at any one moment, each call to select/poll still linearly scans the entire set, so efficiency drops linearly. Epoll does not have this problem: it only operates on the "active" sockets. This is because in the kernel implementation epoll is driven by a callback function installed on each fd, so only "active" sockets actively invoke the callback while sockets in the idle state do not. In this respect epoll implements a "pseudo" AIO, because the driving force is in the OS kernel. In some benchmarks, if essentially all the sockets are active, as in a high-speed LAN environment, epoll is no more efficient than select/poll; on the contrary, if epoll_ctl is used too much, efficiency drops slightly. But once idle connections are used to simulate a WAN environment, epoll's efficiency is far above that of select/poll.
3. Use of mmap to speed up message passing between kernel and user space. This touches on the concrete implementation of epoll. Whether for select, poll or epoll, the kernel has to notify user space of the fd results, and avoiding unnecessary memory copies is important; epoll does this by having the kernel and user space mmap the same piece of memory. If, like me, you have been following epoll since the 2.5 kernel, you will certainly not have forgotten the manual mmap step.
4. Kernel fine-tuning. This is not really an advantage of epoll itself but of the Linux platform as a whole. You may have doubts about the Linux platform, but you cannot avoid the fact that it gives you the ability to fine-tune the kernel. For example, the kernel TCP/IP stack uses a memory pool to manage sk_buff structures, and the size of this pool (skb_head_pool) can be adjusted dynamically at run time via echo xxxx > /proc/sys/net/core/hot_list_length. Another example is the second parameter of listen() (the length of the queue of packets that have completed the TCP three-way handshake), which can also be adjusted dynamically according to the memory size of your platform. You can even try the latest NAPI NIC driver architecture on a special system where the number of packets is huge but each packet itself is very small.
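As a side note to point 1, the per-process limit that bounds epoll (as opposed to select's compile-time FD_SETSIZE) can also be queried from a program through the standard getrlimit interface; a small sketch:

#include <stdio.h>
#include <sys/resource.h>

/* print the soft and hard per-process open-file limits, which are what bound epoll */
int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
        printf("soft limit: %lu, hard limit: %lu\n",
               (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max);
    return 0;
}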
Principle of Epoll/poll/select