Select, poll, epoll Implementation Analysis, Combined with the Kernel Source Code


Select, poll, and epoll are all I/O multiplexing mechanisms. I/O multiplexing means that a single mechanism can monitor multiple descriptors and, once a descriptor becomes ready (generally readable or writable), notify the program to perform the corresponding read/write operation. However, select, poll, and epoll are all essentially synchronous I/O: after the read/write event is ready, the process still has to perform the read or write itself, and that read or write blocks. With asynchronous I/O, by contrast, the process does not do the read or write itself; the asynchronous I/O machinery is responsible for copying the data from kernel space to user space. Blocking, non-blocking, synchronous, and asynchronous will be covered in detail in the next article.
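As an illustration of this synchronous model, here is a minimal user-space sketch that waits with select for a socket to become readable and then performs the (blocking) read itself; sock_fd is assumed to be an already connected descriptor and error handling is minimal:

#include <sys/select.h>
#include <unistd.h>

/* Minimal sketch: wait until sock_fd becomes readable, then do the read
 * ourselves. sock_fd is assumed to be an already connected socket. */
ssize_t wait_and_read(int sock_fd, char *buf, size_t len)
{
    fd_set rfds;

    FD_ZERO(&rfds);
    FD_SET(sock_fd, &rfds);

    /* Block until the descriptor is ready (no timeout given). */
    if (select(sock_fd + 1, &rfds, NULL, NULL, NULL) <= 0)
        return -1;

    /* The readiness notification is only a notification: copying the data
     * from kernel space to user space is still done by this process, which
     * is why select/poll/epoll count as synchronous I/O. */
    return read(sock_fd, buf, len);
}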

The implementation of select is similar to that of poll, and both have shortcomings that are often criticized; epoll can be regarded as an enhanced version of select and poll.

I. Select implementation

1. Use copy_from_user to copy fd_set from user space to kernel space

2. Register the callback function __pollwait.

3. Traverse all fds and call the corresponding poll method (for a socket, this poll method is sock_poll, and sock_poll calls tcp_poll, udp_poll, or datagram_poll as appropriate).

4. Taking tcp_poll as an example, its core is __pollwait, that is, the callback function registered above.

5. __pollwait hangs current (the current process) on the device's wait queue. Different devices have different wait queues; for tcp_poll, the wait queue is sk->sk_sleep (note that hanging a process on a wait queue does not put it to sleep). When the device receives a message (network device) or fills in file data (disk device), it wakes up the processes sleeping on its wait queue, and at that point current is woken up.

6. When the poll method returns, it returns a mask describing whether the read/write operation is ready, and fd_set is assigned a value according to this mask.

7. If, after traversing all fds, no readable/writable mask has been returned, schedule_timeout is called to put the process that called select (that is, current) to sleep. When a device driver finds that its resource has become readable or writable, it wakes up the processes sleeping on its wait queue. If no one wakes it up within the given timeout (specified by schedule_timeout), the process that called select is woken up anyway, regains the CPU, and traverses the fds again to check whether any fd is ready.

8. Copy fd_set back from kernel space to user space.
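A user-space consequence of steps 1 and 8 is that the fd_set handed to select is copied into the kernel and then overwritten with the result, so the caller has to rebuild the set before every call. A minimal sketch (the monitored set and nfds are assumed to be prepared elsewhere):

#include <string.h>
#include <sys/select.h>

/* Because the fd_set is copied in (step 1) and overwritten with the result
 * on the way out (step 8), it is rebuilt on every iteration. */
void select_loop(const fd_set *monitored, int nfds)
{
    for (;;) {
        fd_set rfds;

        memcpy(&rfds, monitored, sizeof(rfds));   /* rebuilt every time */

        if (select(nfds, &rfds, NULL, NULL, NULL) <= 0)
            break;

        for (int fd = 0; fd < nfds; fd++) {
            if (FD_ISSET(fd, &rfds)) {
                /* fd is ready: perform the blocking read/write here */
            }
        }
    }
}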

Summary:

Disadvantages of select:

(1) Every call to select has to copy the fd set from user space into the kernel; this overhead becomes significant when there are many fds.

(2) Every call to select also requires the kernel to traverse all of the fds passed in; this overhead, too, becomes significant when there are many fds.

(3) The number of file descriptors select supports is too small; the default is 1024.
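That 1024 comes from FD_SETSIZE, the compile-time size of fd_set in glibc; a quick check:

#include <stdio.h>
#include <sys/select.h>

int main(void)
{
    /* On typical Linux/glibc systems this prints 1024; descriptors whose
     * numeric value is >= FD_SETSIZE cannot be placed in an fd_set at all. */
    printf("FD_SETSIZE = %d\n", FD_SETSIZE);
    return 0;
}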

For an analysis of the select source code, see http://zhangyafeikimi.iteye.com/blog/248815

II. Poll implementation

The implementation of poll is very similar to that of select, except that the fd set is described differently: poll uses the pollfd structure instead of select's fd_set structure. The rest is similar.
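For comparison, a minimal sketch of the same readable-wait with poll, showing the pollfd structure that replaces fd_set (sock_fd is assumed to be an open descriptor; a negative timeout_ms means wait forever):

#include <poll.h>

/* Same readiness wait as with select, but using poll's pollfd array
 * instead of an fd_set; there is no FD_SETSIZE-style ceiling. */
int wait_readable(int sock_fd, int timeout_ms)
{
    struct pollfd pfd;

    pfd.fd = sock_fd;
    pfd.events = POLLIN;        /* interested in readability */
    pfd.revents = 0;

    if (poll(&pfd, 1, timeout_ms) <= 0)
        return -1;              /* timeout or error */

    return (pfd.revents & POLLIN) ? 0 : -1;
}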

III. Epoll implementation

Since epoll is an improvement over select and poll, it should avoid the three shortcomings above. How does epoll solve them? Before answering that, let's look at how the call interfaces of epoll, select, and poll differ. select and poll each provide a single function: select or poll. epoll provides three functions: epoll_create, epoll_ctl, and epoll_wait. epoll_create creates an epoll handle, epoll_ctl registers the type of event to be monitored, and epoll_wait waits for events.
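A minimal sketch of how the three calls are typically combined (listen_fd is assumed to be an already created socket; error handling is omitted):

#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 64

/* Minimal sketch of the three-call epoll interface. */
void epoll_loop(int listen_fd)
{
    struct epoll_event ev, events[MAX_EVENTS];
    int epfd, n, i;

    epfd = epoll_create(1);                          /* 1. create the epoll handle */

    ev.events = EPOLLIN;                             /* 2. register interest in listen_fd */
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    for (;;) {
        n = epoll_wait(epfd, events, MAX_EVENTS, -1);   /* 3. wait for events */
        for (i = 0; i < n; i++) {
            /* events[i].data.fd is ready: accept()/read()/write() as needed */
        }
    }
    /* not reached */
    close(epfd);
}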

For the first drawback, the epoll solution lies in the epoll_ctl function. Each time a new event is registered with the epoll handle (EPOLL_CTL_ADD is specified in epoll_ctl), the fd is copied into the kernel, rather than being copied repeatedly on every epoll_wait. epoll thus guarantees that each fd is copied only once over the whole process.

For the second drawback, epoll's solution is not, like select or poll, to add current to each fd's device wait queue in turn on every call. Instead, it hangs current on the queues only once, at epoll_ctl time (this one pass is unavoidable), and specifies a callback function for each fd. When the device becomes ready and wakes up the waiters on its queue, this callback function is invoked, and the callback adds the ready fd to a ready linked list. The job of epoll_wait is then simply to check whether there is any ready fd in that ready list (using schedule_timeout() to sleep a while and check a while, similar to step 7 in the select implementation).

The principle behind this callback mechanism is simple. Let's compare the code select and epoll use to add current to an fd's device wait queue.

Select:

static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
                       poll_table *p)
{
    struct poll_table_entry *entry = poll_get_entry(p);
    if (!entry)
        return;
    get_file(filp);
    entry->filp = filp;
    entry->wait_address = wait_address;
    init_waitqueue_entry(&entry->wait, current);
    add_wait_queue(wait_address, &entry->wait);
}

The implementation of init_waitqueue_entry is as follows:

static inline void init_waitqueue_entry(wait_queue_t *q, struct task_struct *p)
{
    q->flags = 0;
    q->private = p;
    q->func = default_wake_function;
}

This code creates a poll_table_entry structure, entry: it sets current as the private member of entry->wait, sets default_wake_function as the func member of entry->wait, and then links entry->wait into wait_address (wait_address is the device's wait queue, sk_sleep in the tcp_poll case).
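Why the func member matters becomes clearer from how a wake-up walks the wait queue: waking the queue amounts to calling each entry's func. The sketch below is a heavily simplified user-space model of that idea, not the kernel's actual types or code; with select the stored func is default_wake_function (wake the stored task), while epoll, as we will see, stores ep_poll_callback instead.

#include <stddef.h>

/* Heavily simplified model of a wait queue, only to illustrate the role of
 * the func member. These types are illustrative, not the kernel's. */
struct wait_entry {
    void *private;                           /* the task for select, NULL for epoll */
    int (*func)(struct wait_entry *entry);   /* what to do when the device is ready */
    struct wait_entry *next;
};

/* Model of wake_up(): the device calls this when it becomes ready. */
static void wake_up_queue(struct wait_entry *head)
{
    struct wait_entry *e;

    for (e = head; e != NULL; e = e->next)
        e->func(e);   /* default_wake_function or ep_poll_callback runs here */
}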


Now let's look at epoll:

/*
 * This is the callback that is used to add our wait queue to the
 * target file wakeup lists.
 */
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
                                 poll_table *pt)
{
    struct epitem *epi = ep_item_from_epqueue(pt);
    struct eppoll_entry *pwq;

    if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
        init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
        pwq->whead = whead;
        pwq->base = epi;
        add_wait_queue(whead, &pwq->wait);
        list_add_tail(&pwq->llink, &epi->pwqlist);
        epi->nwait++;
    } else {
        /* We have to signal that an error occurred */
        epi->nwait = -1;
    }
}

The implementation of init_waitqueue_func_entry is as follows:

static inline void init_waitqueue_func_entry(wait_queue_t *q,
                                             wait_queue_func_t func)
{
    q->flags = 0;
    q->private = NULL;
    q->func = func;
}

We can see that the overall structure is similar to select's, except that an eppoll_entry structure, pwq, is created: the func member of pwq->wait is set to the callback function ep_poll_callback (rather than default_wake_function, so waking the queue here does not directly wake a process but only runs the callback), and the private member is set to NULL. Finally pwq->wait is linked into whead (that is, the device's wait queue). Thus, when the device's wait queue is woken up, ep_poll_callback is called.

Then, when epoll_wait runs, it checks whether there is a ready fd in the ready linked list. If there is not, it adds the current process to a wait queue (file->private_data->wq) and, in a while (1) loop, keeps checking whether the ready list is empty, using schedule_timeout to sleep a while and check a while. If, while the current process is sleeping, a device becomes ready, the callback function is called; the callback puts the ready fd on the ready list and wakes up the process sleeping on the wait queue (file->private_data->wq), so that epoll_wait can continue executing.
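To make the data flow concrete, here is a heavily simplified user-space model of the ready-list idea just described; the names and types are illustrative only, and the real ep_poll_callback/ep_poll also handle locking, the event mask, the wq wait queue, and the timeout:

#include <stddef.h>

/* Illustrative model only: the per-fd callback appends to a ready list, and
 * the wait side merely checks that list instead of traversing every fd. */
struct ready_item {
    int fd;
    struct ready_item *next;
};

static struct ready_item *ready_list;   /* epoll's ready linked list */

/* Role of ep_poll_callback: called when a device becomes ready. */
static void on_device_ready(struct ready_item *item)
{
    item->next = ready_list;   /* put the ready fd on the list ...            */
    ready_list = item;         /* ... (the real code also wakes the wq queue) */
}

/* Role of epoll_wait: no per-fd traversal, just take whatever is ready. */
static struct ready_item *take_ready(void)
{
    struct ready_item *head = ready_list;

    ready_list = NULL;
    return head;               /* NULL means nothing is ready yet */
}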

For the third disadvantage, epoll has no such limit. The fd ceiling it supports is the maximum number of files that can be opened, which is generally far larger than 2048; on a machine with 1 GB of memory it is roughly 100,000. The exact number can be read from /proc/sys/fs/file-max, and it depends largely on the amount of system memory.
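For reference, that ceiling can be read programmatically as well, giving the same value as cat /proc/sys/fs/file-max:

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/fs/file-max", "r");
    long max;

    if (f == NULL)
        return 1;
    if (fscanf(f, "%ld", &max) == 1)
        printf("fs.file-max = %ld\n", max);   /* system-wide open-file ceiling */
    fclose(f);
    return 0;
}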


Summary:

1. The select and poll implementations have to keep polling all of the fds until a device is ready, and during that time they may alternate between sleeping and waking several times. epoll also has to keep calling epoll_wait and polling the ready list, and it too may alternate between sleeping and waking, but when a device becomes ready epoll calls the callback function, puts the ready fd on the ready list, and wakes up the process sleeping in epoll_wait. So although both sleep and wake repeatedly, select and poll have to traverse the entire fd set each time they are awake, whereas epoll, while awake, only has to check whether the ready list is empty. This saves a great deal of CPU time and is the performance gain brought by the callback mechanism.

2. For select and poll, the fd set must be copied from user space to the kernel once on every call, and current must be hung on the device wait queues once on every call. epoll needs only one copy, and it hangs current on a wait queue only once (at the start of epoll_wait; note that this wait queue is not a device wait queue but a queue defined inside epoll). This also saves considerable overhead.


