When developing high-performance network programs, Windows developers reach for IOCP, and Linux developers reach for epoll. Everyone understands that epoll is an I/O multiplexing technique that can efficiently handle millions of socket handles, far more efficiently than the older select and poll. Using epoll feels fast, almost effortless; so why can it handle so many concurrent connections at such speed?
Let's briefly review how to use the three epoll system calls encapsulated by the C library.
    int epoll_create(int size);
    int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
    int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
You can use epoll_create to create an epoll object. The size parameter is a hint telling the kernel roughly how many handles it will be asked to monitor; if far more handles are added, the kernel makes no guarantees about the result. (On modern kernels the value is ignored altogether, as long as it is greater than zero.)
epoll_ctl operates on the epoll object created above. For example, it can add a newly established socket to epoll for monitoring, or remove a socket that epoll is currently monitoring so it is no longer watched.
When epoll_wait is called, the user-space process blocks until an event occurs on one of the monitored handles or the specified timeout expires, and then the call returns.
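To make the calling pattern concrete, here is a minimal sketch of an epoll-based event loop. It assumes a hypothetical, already-created non-blocking listening socket listen_fd, and omits all error handling:

    #include <sys/epoll.h>
    #include <sys/socket.h>

    #define MAX_EVENTS 64

    /* Minimal sketch: listen_fd is a hypothetical, already-created
     * non-blocking listening socket. Error handling is omitted. */
    void event_loop(int listen_fd)
    {
        struct epoll_event ev, events[MAX_EVENTS];
        int epfd = epoll_create(1024);   /* size is only a hint on modern kernels */

        ev.events = EPOLLIN;             /* monitor for readable events */
        ev.data.fd = listen_fd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

        for (;;) {
            /* Block until at least one handle is ready, or 1000 ms elapse */
            int n = epoll_wait(epfd, events, MAX_EVENTS, 1000);
            for (int i = 0; i < n; i++) {
                if (events[i].data.fd == listen_fd) {
                    int conn = accept(listen_fd, NULL, NULL);
                    ev.events = EPOLLIN;
                    ev.data.fd = conn;
                    epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &ev); /* watch the new socket */
                } else {
                    /* read from / handle events[i].data.fd here */
                }
            }
        }
    }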
From this calling pattern we can already see where epoll is superior to select/poll: the latter two require you to pass the entire list of sockets you want to monitor into the system call on every invocation, which means copying the user-space handle list into kernel space each time. With a large number of handles, that copies tens to hundreds of KB of memory into the kernel on every call, which is very inefficient. epoll_wait, by contrast, is called in the same places select/poll used to be called, yet passes no socket handles to the kernel at all, because the kernel already obtained the list of handles to monitor through epoll_ctl.
Therefore, after you call epoll_create, the kernel stands ready to store the handles to be monitored in kernel space, and each subsequent call to epoll_ctl merely adds one new socket handle to the kernel's data structure.
In the kernel, everything is a file. Therefore, epoll registers a file system of its own with the kernel to store the monitored sockets. When you call epoll_create, a file node is created in this virtual epoll file system. Of course, this is not an ordinary file; it exists only to serve epoll.
When epoll is initialized by the kernel (at operating system startup), epoll also sets aside its own kernel cache area for storing every socket we want to monitor. These sockets are kept in the kernel cache as a red/black tree, which supports fast lookup, insertion, and deletion. This kernel cache area is built by reserving contiguous physical memory pages and creating a slab layer on top of them. Simply put, objects of exactly the size you need are allocated physically ahead of time, and freed objects are kept around and reused on the next allocation. The relevant slab caches are created when epoll initializes:
    static int __init eventpoll_init(void)
    {
        ... ...
        /* Allocates slab cache used to allocate "struct epitem" items */
        epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem),
                0, SLAB_HWCACHE_ALIGN|EPI_SLAB_DEBUG|SLAB_PANIC,
                NULL, NULL);

        /* Allocates slab cache used to allocate "struct eppoll_entry" */
        pwq_cache = kmem_cache_create("eventpoll_pwq",
                sizeof(struct eppoll_entry), 0,
                EPI_SLAB_DEBUG|SLAB_PANIC, NULL, NULL);
        ... ...
    }
epoll's efficiency lies in the fact that even after we call epoll_ctl to insert millions of handles, epoll_wait still returns quickly and delivers the ready event handles to user space efficiently. This is because when we call epoll_create, the kernel not only creates a file node in the epoll file system, but also builds a red/black tree in the kernel cache to store the sockets later submitted through epoll_ctl, and in addition creates a linked list to store ready events. When epoll_wait is called, it only has to check whether this ready list contains anything: if it does, it returns the events; if not, it sleeps, and it returns once the timeout is reached even if the list is still empty. That is why epoll_wait is so efficient.
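The structures at the heart of this design look roughly like the following. This is a deliberately simplified sketch of what the kernel keeps in fs/eventpoll.c, not the exact definitions; locks and most fields are omitted:

    /* Simplified sketch of the kernel structures in fs/eventpoll.c;
     * the real definitions carry locks, wait queues, and more fields. */
    struct eventpoll {
        struct rb_root    rbr;     /* red/black tree of all monitored epitems */
        struct list_head  rdllist; /* linked list of epitems that are ready */
        wait_queue_head_t wq;      /* processes sleeping in epoll_wait */
    };

    struct epitem {
        struct rb_node     rbn;     /* node in the eventpoll red/black tree */
        struct list_head   rdllink; /* node in the ready list when an event fires */
        int                fd;      /* the monitored handle */
        struct epoll_event event;   /* the events the user asked to monitor */
    };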
Generally, even when we monitor millions of handles, only a small number are ready at any one time, so epoll_wait only has to copy a small number of handles from kernel space to user space. How could that possibly be inefficient?!
So how is this ready list maintained? When we execute epoll_ctl, besides placing the socket into the red/black tree attached to the file node in the epoll file system, the kernel also registers a callback function with the interrupt handling path, telling the kernel: if this handle gets an interrupt, put it on the ready list. Therefore, when a socket receives data, the kernel copies the data from the NIC into kernel memory and then inserts the socket into the ready list.
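In rough pseudocode, that callback (a heavily simplified sketch of the kernel's ep_poll_callback, with locking and edge cases stripped away) does something like this:

    /* Simplified sketch of what the kernel's callback does when the
     * protocol stack finds new data on a monitored socket. */
    static int ep_poll_callback(struct epitem *epi, struct eventpoll *ep)
    {
        /* Put this handle on the ready list if it is not already there */
        if (list_empty(&epi->rdllink))
            list_add_tail(&epi->rdllink, &ep->rdllist);

        /* Wake up any process sleeping in epoll_wait */
        wake_up(&ep->wq);
        return 1;
    }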
In this way, one red/black tree, one ready list, and a small amount of kernel cache memory solve the socket handling problem under high concurrency. When epoll_create is executed, the red/black tree and the ready list are created. When epoll_ctl is executed to add a socket handle, it first checks whether the handle already exists in the red/black tree and returns immediately if it does; if not, it inserts the handle into the tree and registers the callback with the kernel so that, when an interrupt event occurs, the handle is inserted into the ready list. When epoll_wait is executed, it simply returns whatever is in the ready list.
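The EPOLL_CTL_ADD path can be sketched the same way. Note that the helper names ep_rbtree_find and register_poll_callback here are illustrative stand-ins, not the kernel's actual symbols:

    /* Simplified sketch of the EPOLL_CTL_ADD path inside epoll_ctl.
     * ep_rbtree_find and register_poll_callback are hypothetical names. */
    static int ep_add(struct eventpoll *ep, int fd, struct epoll_event *event)
    {
        if (ep_rbtree_find(ep, fd))               /* already monitored? */
            return -EEXIST;                       /* return immediately */

        /* Allocate the epitem from the slab cache created at init time */
        struct epitem *epi = kmem_cache_alloc(epi_cache, GFP_KERNEL);
        epi->fd = fd;
        epi->event = *event;

        ep_rbtree_insert(ep, epi);                /* add to the red/black tree */
        register_poll_callback(fd, ep_poll_callback); /* hook the interrupt path */
        return 0;
    }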
Finally, let's look at epoll's two modes: LT (level-triggered) and ET (edge-triggered). Both modes work on top of the machinery described above. The difference is that in LT mode, as long as an event on a handle has not been fully handled, later calls to epoll_wait keep returning that handle, while in ET mode, the handle is returned only the first time.
How does this happen? When an event occurs on a socket handle, the kernel inserts the handle into the ready list described above. Soon after, we call epoll_wait, which copies the ready sockets to user-space memory and clears the ready list. As its final step, epoll_wait re-checks these sockets: if a handle is not in ET mode (that is, it is an LT-mode handle) and there really are unhandled events left on it, the handle is put back onto the just-emptied ready list. So for a non-ET handle, as long as events remain on it, every call to epoll_wait returns it. An ET-mode handle, by contrast, will not be returned by the next epoll_wait even if its events have not been fully processed.
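This is exactly why edge-triggered code must drain a handle completely before calling epoll_wait again. A typical ET read loop looks like this sketch, where conn is a hypothetical non-blocking socket already registered with EPOLLIN | EPOLLET:

    #include <errno.h>
    #include <unistd.h>

    /* ET mode: keep reading until the kernel reports EAGAIN, because
     * epoll_wait will not report this handle again for the same event. */
    void drain_socket(int conn)
    {
        char buf[4096];
        for (;;) {
            ssize_t n = read(conn, buf, sizeof(buf));
            if (n > 0) {
                /* process n bytes of data here */
            } else if (n == -1 && errno == EAGAIN) {
                break;          /* buffer drained; safe to wait again */
            } else {
                close(conn);    /* EOF or a real error */
                break;
            }
        }
    }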