Analysis of several implementation mechanisms of IO multiplexing

Source: Internet
Author: User

select, poll, and epoll are all mechanisms for I/O multiplexing. So-called I/O multiplexing means that, through a single mechanism, a program can monitor multiple descriptors; once a descriptor becomes ready (usually ready for reading or writing), the program is notified so it can perform the corresponding read or write. Note that select, poll, and epoll are all essentially synchronous I/O: the program itself must do the reading and writing once the readiness event fires, and that read/write blocks the process. Asynchronous I/O, by contrast, does not make the program responsible for reading and writing; the asynchronous I/O implementation copies the data from kernel space to user space on the program's behalf. Blocking, non-blocking, synchronous, and asynchronous I/O will be explained in detail in the next article.

select and poll are implemented similarly and share a number of much-criticized shortcomings; epoll can be regarded as an enhanced version of both.
First, the implementation of select
1. copy_from_user is used to copy the fd_set from user space into kernel space.
2. The callback function __pollwait is registered.
3. All fds are traversed and each fd's poll method is called (for a socket, this poll method is sock_poll, which dispatches to tcp_poll, udp_poll, or datagram_poll depending on the situation).
4. Taking tcp_poll as an example, its core is a call to __pollwait, i.e. the callback function registered above.
5. The main job of __pollwait is to hang the current process on the device's wait queue. Different devices have different wait queues; for tcp_poll, the wait queue is sk->sk_sleep. (Note that hanging the process on a wait queue does not mean the process is asleep yet.) When the device receives a message (network device) or has finished filling in file data (disk device), it wakes up the processes sleeping on its wait queue, and current is awakened.
6. The poll method returns a mask describing whether the read/write operation is ready, and the fd_set is assigned according to this mask.
7. If, after traversing all fds, no readable/writable mask has been returned, schedule_timeout is called to put the process that called select (i.e. current) to sleep. When a device driver finds its resource readable or writable, it wakes up the processes sleeping on its wait queue. If nothing wakes the process within the specified timeout (the one passed to schedule_timeout), the process calling select is woken anyway, gets the CPU again, and re-traverses the fds to check whether any have become ready.
8. The fd_set is copied from kernel space back to user space.
Summary:
The major drawbacks of select:
(1) Every call to select copies the fd set from user space into kernel space, which is very expensive when there are many fds.
(2) Every call to select also makes the kernel traverse all of the fds passed in, which is likewise very expensive when there are many fds.
(3) select supports too few file descriptors; the default limit is 1024.
Second, the implementation of poll
poll is implemented much like select, except that the fd collection is described differently: poll uses the pollfd structure rather than select's fd_set structure. Everything else is the same.
Third, the implementation of epoll
Since epoll is an improvement over select and poll, it should avoid the three drawbacks above. How does epoll manage that? Before answering, look at the difference in calling interfaces: select and poll each provide only a single function (select or poll), while epoll provides three: epoll_create, epoll_ctl, and epoll_wait. epoll_create creates an epoll handle; epoll_ctl registers the type of event to listen for; epoll_wait waits for events to occur.


For the first drawback, epoll's solution lies in the epoll_ctl function. Each time a new event is registered with the epoll handle (by specifying EPOLL_CTL_ADD in epoll_ctl), the fd is copied into the kernel at that point, rather than being copied again on every epoll_wait. epoll thus guarantees that each fd is copied only once during the whole process.
For the second drawback, epoll's solution is not to add current to each fd's device wait queue on every call, as select and poll do. Instead, it hangs current only once, at epoll_ctl time (this one hanging is unavoidable), and specifies a callback function for each fd. When the device becomes ready and wakes up the waiters on its queue, this callback is invoked, and the callback adds the ready fd to a ready list. The job of epoll_wait is then just to check whether this ready list contains any fds (sleeping for a while via schedule_timeout() and checking again, similar to step 7 of the select implementation).
To illustrate how this callback mechanism works, just compare the code that select and epoll use to add current to the wait queue of the device corresponding to an fd:
select:

static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
                       poll_table *p)
{
    struct poll_table_entry *entry = poll_get_entry(p);
    if (!entry)
        return;
    get_file(filp);
    entry->filp = filp;
    entry->wait_address = wait_address;
    init_waitqueue_entry(&entry->wait, current);
    add_wait_queue(wait_address, &entry->wait);
}


where init_waitqueue_entry is implemented as follows:

static inline void init_waitqueue_entry(wait_queue_t *q, struct task_struct *p)
{
    q->flags = 0;
    q->private = p;
    q->func = default_wake_function;
}


The code above creates a poll_table_entry structure, entry; it sets current as the private member of entry->wait and default_wake_function as the func member of entry->wait, then links entry->wait into wait_address (the device's wait queue, which is sk_sleep in the case of tcp_poll).
Now look at epoll:

/*
 * This is the callback that is used to add our wait queue to the
 * target file wakeup lists.
 */
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
                                 poll_table *pt)
{
    struct epitem *epi = ep_item_from_epqueue(pt);
    struct eppoll_entry *pwq;

    if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
        init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
        pwq->whead = whead;
        pwq->base = epi;
        add_wait_queue(whead, &pwq->wait);
        list_add_tail(&pwq->llink, &epi->pwqlist);
        epi->nwait++;
    } else {
        /* We have to signal that an error occurred */
        epi->nwait = -1;
    }
}


where init_waitqueue_func_entry is implemented as follows:

static inline void init_waitqueue_func_entry(wait_queue_t *q,
                                             wait_queue_func_t func)
{
    q->flags = 0;
    q->private = NULL;
    q->func = func;
}


As you can see, the overall structure is similar to select's, except that an eppoll_entry structure, pwq, is created, and the func member of pwq->wait is set to the callback function ep_poll_callback rather than default_wake_function (so when the device wakes up its wait queue, no process is woken directly; only the callback runs), while the private member is set to NULL. Finally, pwq->wait is linked into whead (the device's wait queue). Thus, when the device wakes up the waiters on its queue, ep_poll_callback is called.

Then, in epoll_wait, the kernel checks whether the ready list contains any fds. If not, it puts the current process on a wait queue (file->private_data->wq) and, inside a while (1) loop, keeps checking whether the ready list is empty, using schedule_timeout to sleep for a while between checks. If the device becomes ready while the current process is asleep, the callback is invoked; the callback puts the ready fd on the ready list and wakes up the process waiting on file->private_data->wq, so that epoll_wait can continue executing.
As for the third drawback, epoll has no such limit. The maximum number of fds it supports is the maximum number of files that can be opened, which is generally far greater than 2048; on a machine with 1 GB of memory it is around 100,000. The exact number can be read with cat /proc/sys/fs/file-max; in general, it depends heavily on system memory.
Summary:
1. select and poll must repeatedly poll the whole fd set until a device becomes ready, possibly alternating between sleeping and waking several times. epoll (in epoll_wait) also polls repeatedly and may likewise alternate between sleeping and waking, but it polls only the ready list: when a device becomes ready, its callback puts the ready fd on the ready list and wakes the process sleeping in epoll_wait. So although both approaches alternate between sleeping and waking, select and poll traverse the entire fd set every time they are "awake", while epoll only has to check whether the ready list is empty, which saves a great deal of CPU time. This is the performance gain that the callback mechanism brings.
2. Every call to select or poll copies the fd set from user space to kernel space once and hangs current on the device wait queues once, whereas epoll copies each fd only once and hangs current only once (at the start of epoll_wait; note that this wait queue is not a device wait queue, just a wait queue defined internally by epoll). This also saves considerable overhead.
