Scalable Event Multiplexing: epoll and kqueue


I usually prefer Linux to BSD systems, but I really wish Linux had BSD's kqueue.

What is event multiplexing?

Suppose you have a simple web server with two open socket connections. When the server receives an HTTP request on either connection, it should send an HTTP response back to the client. But you have no way of knowing which client will send a message first, or when. The blocking behavior of the BSD socket API means that if you call recv() on one connection, you cannot respond to a request arriving on the other. That is where I/O multiplexing comes in.

One straightforward approach to I/O multiplexing is to dedicate a process or thread to each connection, so that blocking on one connection does not interfere with the others. You then hand all the tedious scheduling/multiplexing work over to the kernel. This multi-threaded architecture comes at a high resource cost: maintaining a large number of threads burdens the kernel, and a separate stack per connection not only increases the memory footprint but also hurts CPU cache locality.

So how do we multiplex I/O without a thread per connection? You could do it by busy-waiting, performing non-blocking socket operations on every connection in turn, but that is far too wasteful. What we really need to know is which sockets are ready. So the kernel provides a separate channel between your application and itself, and notifies you through that channel when a socket becomes ready. This is how select()/poll() work, based on the readiness model.

Overview: select()

select() and poll() work in much the same way. Let's take a quick look at select():

select(int nfds, fd_set *r, fd_set *w, fd_set *e, struct timeval *timeout)

When you call select(), your application supplies three interest sets: r, w, and e. Each set is a bitmap over file descriptors; for example, if you are interested in reading from file descriptor 6, the sixth bit of r is set to 1. The call blocks until one or more file descriptors in the interest sets become ready, so that you can operate on those descriptors without blocking. On return, the kernel overwrites the bitmaps to indicate which file descriptors are ready. From a scalability point of view, we can find four problems:

    1. The size of the bitmaps is fixed (FD_SETSIZE, usually 1024), although there are ways to work around this limitation.
    2. Because the bitmaps are overwritten by the kernel, the application must refill the interest sets before every call.
    3. On every call, both the application and the kernel must scan the entire bitmaps to find out which file descriptors belong to the interest sets and which belong to the result sets. This is especially inefficient for the result sets, which are usually very sparse (at any given moment, only a few file descriptors change state).
    4. The kernel must iterate over the entire interest set on every call to find out which file descriptors are ready. If none are ready, the kernel iterates again to register an internal event handler for each socket.

Overview: poll()

poll() was designed to address these problems.

poll(struct pollfd *fds, int nfds, int timeout)

struct pollfd {
    int fd;
    short events;
    short revents;
};

The implementation of poll() relies on an array of file descriptors rather than bitmaps (so problem 1 is solved). Problem 2 is solved by keeping separate fields for interest events (events) and result events (revents), so the application can maintain and reuse the array across calls. Problem 3 would also have been mitigated had poll() split the array instead of the fields. Problem 4 is inherited and unavoidable: poll() and select() are stateless, so the kernel does not maintain the interest set internally.

Why does this matter for scalability?

If your web server only needs to maintain a relatively small number of connections (say, 100) at a low connection rate (say, 100 per second), then poll() and select() are perfectly sufficient. You may not need to bother with event-driven programming at all if a multi-process or multi-threaded architecture does the job. When performance is not your primary concern, flexibility and ease of development are what matter. The Apache web server is a typical example.

However, if your server has to handle a large number of concurrent connections or a high connection rate, then you really do care about performance. This situation is commonly referred to as the C10K problem. Under such a heavy load, a web server built on select() or poll() can hardly do anything useful, wasting precious CPU cycles instead.

Suppose there are 10,000 concurrent connections. Typically only a small number of file descriptors are ready at any moment, say 10 ready for reading. Then on every poll()/select() call, 9,990 file descriptors are pointlessly copied and scanned.

As mentioned earlier, this problem stems from the statelessness of the select()/poll() interfaces. A paper by Banga et al. (published at USENIX ATC 1999) proposed a new approach: stateful interest sets. By maintaining the interest set inside the kernel, the application no longer has to hand over the entire set on every call. With a declare_interest() call, the kernel incrementally updates the interest set, and the application dispatches events by calling get_next_event().

As often happens, the research inspired real implementations: Linux and FreeBSD came up with their own, namely epoll and kqueue. This in turn means a lack of portability; an epoll-based program cannot run on FreeBSD. It is sometimes claimed that kqueue is technically superior to epoll, so there seems to be little reason for epoll to exist.

epoll in Linux

The epoll interface consists of three calls:

int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

epoll_ctl() and epoll_wait() essentially correspond to declare_interest() and get_next_event(), respectively. epoll_create() creates a context in the kernel, returned as a file descriptor, whereas the original work implicitly assumed a single context per process. Internally, the epoll implementation in the Linux kernel is not very different from that of select()/poll(); the only difference is whether it is stateful. They are essentially designed for the same purpose (event multiplexing over sockets/pipes). See the source files fs/select.c (for select and poll) and fs/eventpoll.c (for epoll) in the Linux source tree for details. You can also find some of Linus Torvalds' early ideas about epoll.

kqueue in FreeBSD

Like epoll, kqueue supports multiple contexts (interest sets) per process. kqueue() does roughly what epoll_create() does. However, kevent() combines the roles of epoll_ctl() (adjusting the interest set) and epoll_wait() (retrieving events).

int kqueue(void);
int kevent(int kq, const struct kevent *changelist, int nchanges,
           struct kevent *eventlist, int nevents,
           const struct timespec *timeout);

In fact, from an ease-of-programming point of view, kqueue is somewhat more complicated than epoll. This is because kqueue is designed to be more abstract and general. Let's take a look at the kevent structure:

struct kevent {
    uintptr_t ident;   /* identifier for this event */
    int16_t   filter;  /* filter for event */
    uint16_t  flags;   /* general flags */
    uint32_t  fflags;  /* filter-specific flags */
    intptr_t  data;    /* filter-specific data */
    void      *udata;  /* opaque user data identifier */
};

The details of these fields are beyond the scope of this article, but you may have noticed that there is no explicit file descriptor field. This is because kqueue was not designed merely as a replacement for select()/poll() socket event multiplexing, but as a general mechanism for handling various kinds of operating system events.

The filter field specifies the kind of kernel event. With EVFILT_READ or EVFILT_WRITE, kqueue works much like epoll, and the ident field holds a file descriptor. Depending on the filter type, ident may instead identify other kinds of events, such as a process ID or a signal number. More details can be found in the man pages and in the kqueue paper.

Comparing the performance of epoll and kqueue

From a performance standpoint, epoll has a design flaw: it cannot apply multiple interest-set updates in a single system call. If 100 file descriptors in your interest set need updating, you have to call epoll_ctl() 100 times. The performance penalty of these excessive system calls is evident, as explained in this article. I suspect this is a legacy of Banga et al.'s original work, whose declare_interest() likewise supported only one update per call. In contrast, a single kevent() call can specify any number of interest-set updates.

Support for non-file types

Another limitation of epoll, which I consider more important, is this: since it was designed as an improvement over select()/poll(), epoll can only work with file descriptors. What's wrong with that? The common saying goes, "In Unix, everything is a file." That is mostly true, but not always. Timers are not files, nor are signals, semaphores, or processes. (In Linux) network devices are not files either. Many things in Unix-like systems are not files, and you cannot use select()/poll()/epoll() event multiplexing on them. A typical network server manages many kinds of resources besides sockets. You would like to monitor them all through a single interface, but you cannot.

To work around this, Linux provides a number of supplementary system calls, such as signalfd(), eventfd(), and timerfd_create(), which convert non-file resources into file descriptors so that you can multiplex them with epoll. But it does not look very elegant... do you really want a separate system call for every kind of resource?

In kqueue, the versatile kevent structure supports a variety of non-file events. For example, your program can be notified when a child process exits (by setting filter = EVFILT_PROC, ident = pid, and fflags = NOTE_EXIT). Even if some resources or events are not supported by the current kernel version, they can be supported by a future kernel without any change to the API.

Disk file support

The last problem is that epoll does not even support all file descriptors: select()/poll()/epoll() do not work with regular disk files. This is because epoll is firmly built on the readiness model. You monitor sockets that are ready, so that subsequent I/O calls on them do not block. Disk files do not fit this model, because they are always ready. Disk I/O blocks when the data is not cached in memory, not because the other side has not sent a message. Disk files fit the completion-notification model: you issue an I/O operation and then wait for a notification of its completion. kqueue supports this approach with the EVFILT_AIO filter type, which ties into the POSIX AIO functions such as aio_read(). On Linux, you can only pray that disk access does not block, by counting on a high cache hit rate (often the case on a typical web server), or move disk I/O to separate threads so that it does not stall the processing of network sockets (as in the Flash architecture).

In our previous article, we proposed a new programming interface, MegaPipe. It is based entirely on the completion-notification model and can be used for both disk and non-disk files. The original text is here.
