Linux Kernel select/poll/epoll: Implementation and Differences (C language)


What follows are my notes from a period spent studying the kernel implementation of select/poll/epoll.
select, poll, and epoll are I/O multiplexing functions: put simply, they let a single thread handle reads and writes on multiple file descriptors at the same time. The implementations of select and poll are similar; epoll is an extension of select/poll, designed primarily to address their inherent flaws. The epoll functions appeared in kernel version 2.6, and the implementations of all three in the Linux kernel are quite alike.
All three functions require device drivers to provide a poll callback. For sockets these are tcp_poll, udp_poll, and datagram_poll; for a device driver you develop yourself, it is the poll interface function that you implement.
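As a concrete illustration, here is a sketch of the poll callback a character-device driver might provide. This is not from any real driver: the names my_poll, my_wait_queue, and my_data_ready are hypothetical, and real drivers guard the ready flag with locking.

```c
/* Hypothetical driver state: a wait queue plus a data-ready flag. */
static DECLARE_WAIT_QUEUE_HEAD(my_wait_queue);
static int my_data_ready;

static unsigned int my_poll(struct file *filp, poll_table *wait)
{
	unsigned int mask = 0;

	/* Register the caller on our wait queue. For select this ends
	 * up in __pollwait; for epoll, in ep_ptable_queue_proc. */
	poll_wait(filp, &my_wait_queue, wait);

	if (my_data_ready)
		mask |= POLLIN | POLLRDNORM;	/* readable */
	return mask;
}

/* Elsewhere, e.g. in the interrupt handler, when data arrives:
 *	my_data_ready = 1;
 *	wake_up_interruptible(&my_wait_queue);
 */
```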

select implementation (based on the 2.6 kernel; other versions should differ little)
When an application calls select, the kernel enters sys_select, which does some simple initialization and then calls core_sys_select. That function's main job is to copy the descriptor sets from user space into kernel space before calling do_select, which does the real work.
do_select first calls poll_initwait, whose main job is to register __pollwait as the poll_table callback, so that when a device driver's poll calls poll_wait, it is actually calling __pollwait. __pollwait's main job is to put the current process on the driver's wait queue, from which it will be woken when the awaited event arrives.
do_select then runs a for loop. It first iterates over every file descriptor, invoking each descriptor's poll callback to check whether it is ready. Once all descriptors have been traversed, it exits the loop if any descriptor is ready, a signal is pending, an error occurred, or the timeout expired. Otherwise it calls poll_schedule_timeout, putting the current process to sleep until the timeout expires or a ready descriptor wakes it; it then iterates over every descriptor again, calling poll once more to check. The loop repeats until one of the exit conditions is met.
Below is a partial fragment of the select code from the 2.6.31 kernel.
The call chain is:
select --> sys_select --> core_sys_select --> do_select
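Before diving into the kernel side, a minimal user-space use of select() may help. The helper below (demo_select_readable is a name invented for this sketch) creates a pipe, writes a byte, and checks that select() reports the read end ready:

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/* Returns 1 if select() reports the pipe's read end readable. */
int demo_select_readable(void)
{
	int pfd[2];
	fd_set rfds;
	struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };

	if (pipe(pfd) < 0)
		return -1;
	assert(write(pfd[1], "x", 1) == 1);	/* make the read end ready */

	FD_ZERO(&rfds);
	FD_SET(pfd[0], &rfds);
	/* First argument is the highest fd + 1: this is the value
	 * do_select uses to bound its bitmap scan. */
	int n = select(pfd[0] + 1, &rfds, NULL, NULL, &tv);

	int ready = (n == 1 && FD_ISSET(pfd[0], &rfds));
	close(pfd[0]);
	close(pfd[1]);
	return ready;
}
```

Note the user-space mirror of the kernel traversal: the application fills the fd_set before the call and must use FD_ISSET on the result set afterwards.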

int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
{
	ktime_t expire, *to = NULL;
	struct poll_wqueues table;
	poll_table *wait;
	int retval, i, timed_out = 0;
	unsigned long slack = 0;

	/* Find the highest descriptor in the sets to shorten the scan
	 * loops below -- this is why select's first parameter on Linux
	 * is so important. */
	rcu_read_lock();
	retval = max_select_fd(n, fds);
	rcu_read_unlock();
	if (retval < 0)
		return retval;
	n = retval;

	/* Initialize the poll_table; the key step is installing the
	 * __pollwait function address as its callback. */
	poll_initwait(&table);
	wait = &table.pt;
	if (end_time && !end_time->tv_sec && !end_time->tv_nsec) {
		wait = NULL;
		timed_out = 1;
	}
	if (end_time && !timed_out)
		slack = estimate_accuracy(end_time);

	retval = 0;
	/* Main loop: rotate through the descriptors' states here. */
	for (;;) {
		unsigned long *rinp, *routp, *rexp, *inp, *outp, *exp;

		inp = fds->in; outp = fds->out; exp = fds->ex;
		rinp = fds->res_in; routp = fds->res_out; rexp = fds->res_ex;

		for (i = 0; i < n; ++rinp, ++routp, ++rexp) {
			unsigned long in, out, ex, all_bits, bit = 1, mask, j;
			unsigned long res_in = 0, res_out = 0, res_ex = 0;
			const struct file_operations *f_op = NULL;
			struct file *file = NULL;

			/* fd_set in select and fd_set_bits in do_select
			 * store descriptors as bitmaps (1024 bits): if
			 * bit 28 is set, descriptor 28 is in the set. */
			in = *inp++; out = *outp++; ex = *exp++;
			all_bits = in | out | ex;
			/* None of the read/write/exception sets has a
			 * descriptor in this word. */
			if (all_bits == 0) {
				i += __NFDBITS;
				continue;
			}

			for (j = 0; j < __NFDBITS; ++j, ++i, bit <<= 1) {
				int fput_needed;

				if (i >= n)
					break;
				if (!(bit & all_bits))
					continue;
				/* Get the struct file pointer from the
				 * descriptor index. */
				file = fget_light(i, &fput_needed);
				if (file) {
					/* f_op is the set of callbacks
					 * that operate on this file. */
					f_op = file->f_op;
					mask = DEFAULT_POLLMASK;
					if (f_op && f_op->poll) {
						wait_key_set(wait, in, out, bit);
						/* Call the poll implemented in
						 * our device driver: for select
						 * to work, our drivers must
						 * provide a poll implementation. */
						mask = (*f_op->poll)(file, wait);
					}
					fput_light(file, fput_needed);
					if ((mask & POLLIN_SET) && (in & bit)) {
						res_in |= bit;
						retval++;
						/* The poll call above found a
						 * ready descriptor, so there is
						 * no need to keep adding the
						 * current process to wait queues;
						 * do_select will exit after this
						 * traversal. */
						wait = NULL;
					}
					if ((mask & POLLOUT_SET) && (out & bit)) {
						res_out |= bit;
						retval++;
						wait = NULL;
					}
					if ((mask & POLLEX_SET) && (ex & bit)) {
						res_ex |= bit;
						retval++;
						wait = NULL;
					}
				}
			}
			if (res_in)
				*rinp = res_in;
			if (res_out)
				*routp = res_out;
			if (res_ex)
				*rexp = res_ex;
			cond_resched();
		}
		/* Every descriptor has been traversed and we are already
		 * on the wait queues; no further registration needed. */
		wait = NULL;
		/* Exit the loop on a ready descriptor, timeout, or a
		 * pending signal. */
		if (retval || timed_out || signal_pending(current))
			break;
		if (table.error) {	/* exit the loop on error */
			retval = table.error;
			break;
		}

		/*
		 * If this is the first loop and we have a timeout
		 * given, then we convert to ktime_t and set the to
		 * pointer to the expiry value.
		 */
		if (end_time && !to) {
			expire = timespec_to_ktime(*end_time);
			to = &expire;
		}

		/* Put the process to sleep until the timeout, or until
		 * a ready descriptor wakes it. */
		if (!poll_schedule_timeout(&table, TASK_INTERRUPTIBLE,
					   to, slack))
			timed_out = 1;
	}

	poll_freewait(&table);
	return retval;
}

void poll_initwait(struct poll_wqueues *pwq)
{
	/* Set the poll_table callback to __pollwait, so that when we
	 * call poll_wait in a driver, __pollwait runs. */
	init_poll_funcptr(&pwq->pt, __pollwait);
	/* ... */
}

static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
		       poll_table *p)
{
	/* ... */
	/* pollwake runs when the driver calls wake_up on the queue; it
	 * ends up calling the queue's default function,
	 * default_wake_function, to wake the sleeping process. */
	init_waitqueue_func_entry(&entry->wait, pollwake);
	add_wait_queue(wait_address, &entry->wait);	/* join the wait queue */
	/* ... */
}

int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
		    fd_set __user *exp, struct timespec *end_time)
{
	/* ... */
	/* Copy the descriptor sets from user space to kernel space. */
	if ((ret = get_fd_set(n, inp, fds.in)) ||
	    (ret = get_fd_set(n, outp, fds.out)) ||
	    (ret = get_fd_set(n, exp, fds.ex)))
		/* ... */
	ret = do_select(n, &fds, end_time);
	/* ... */
	/* Copy the result sets do_select produced back from kernel
	 * space to user space. */
	if (set_fd_set(n, inp, fds.res_in) ||
	    set_fd_set(n, outp, fds.res_out) ||
	    set_fd_set(n, exp, fds.res_ex))
		ret = -EFAULT;
	/* ... */
}

The implementation of poll is basically similar to select's. Its call chain is:
poll --> do_sys_poll --> do_poll --> do_pollfd
where do_pollfd is invoked on each descriptor to call its poll callback and check its state.
poll's advantage over select is that it has no descriptor limit: select has a hard limit of 1024 that descriptors cannot exceed, while poll is unrestricted.
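The reason for the difference is visible in the API: poll takes an array of struct pollfd of any length instead of a fixed-size bitmap. A minimal sketch (demo_poll_readable is an invented name) on the same pipe trick:

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Returns 1 if poll() reports the pipe's read end readable. Unlike
 * select's fd_set bitmap, the pollfd array has no FD_SETSIZE (1024)
 * ceiling -- it can be as long as resources allow. */
int demo_poll_readable(void)
{
	int pfd[2];

	if (pipe(pfd) < 0)
		return -1;
	assert(write(pfd[1], "x", 1) == 1);	/* make the read end ready */

	struct pollfd fds[1] = {
		{ .fd = pfd[0], .events = POLLIN, .revents = 0 }
	};
	int n = poll(fds, 1, 1000);		/* timeout in milliseconds */

	int ready = (n == 1 && (fds[0].revents & POLLIN));
	close(pfd[0]);
	close(pfd[1]);
	return ready;
}
```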
From the code analysis above, we can summarize the inherent defects of select/poll:
1. Every call to select/poll must copy the descriptor sets from user space to kernel space, and after detection completes, copy the result sets from kernel space back to user space. When there are many descriptors and select is woken frequently, this overhead is considerable.
2. If the back-and-forth copying of the sets were all, it would be nothing; the repeated full traversals of the descriptors are more frightening. In our application, each call to select/poll must first loop over the descriptors to add them to the fd_set: one traversal in the application layer. Then, in kernel space, there is at least one traversal that calls each descriptor's poll callback, and typically two: the first pass finds no ready descriptor and joins the wait queues, and after being woken the kernel traverses everything again. Back in the application layer, we must iterate over all descriptors once more, using FD_ISSET to test the result sets. With many descriptors, all this traversal consumes significant CPU resources.
3. The limit on the number of descriptors. poll has no limit, but select has a hard limit of 1024, and short of modifying the kernel there is no way to raise it.
Do these shortcomings make select/poll worthless? No, that would be wrong.
They remain the best functions for portable code, because almost every platform provides an implementation of their interfaces. When the number of descriptors is not too large, they handle multiplexed I/O perfectly well, and if the descriptors on every connection are all active, their efficiency is not much worse than epoll's.
I once built a TCP server using multiple threads, each thread multiplexing with poll, to handle file transfers; with connections reaching the thousands, the bottleneck was no longer network I/O but disk I/O.

Now let us look at how epoll is implemented in order to solve select/poll's inborn flaws.
epoll is only an extension of select/poll; it is not a subversive redesign inside the Linux kernel, merely a fix for select's defects built on the same foundation. At the bottom it still needs device drivers to provide poll callbacks as the basis for state detection.
epoll consists of three functions: epoll_create, epoll_ctl, and epoll_wait, implemented in the eventpoll.c code.
epoll_create creates the epoll device that manages all the added descriptors; epoll_ctl adds new descriptors and modifies or deletes existing ones;
epoll_wait waits for descriptor events.
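The three calls fit together as in this minimal user-space sketch (demo_epoll_readable is an invented name), again using a pipe as the event source:

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Returns 1 if epoll reports the pipe's read end readable:
 * epoll_create1 builds the eventpoll object, epoll_ctl registers the
 * descriptor once, and epoll_wait drains the internal ready queue. */
int demo_epoll_readable(void)
{
	int pfd[2];

	if (pipe(pfd) < 0)
		return -1;
	int epfd = epoll_create1(0);
	if (epfd < 0)
		return -1;

	struct epoll_event ev = { .events = EPOLLIN, .data.fd = pfd[0] };
	assert(epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == 0);

	assert(write(pfd[1], "x", 1) == 1);	/* ep_poll_callback fires */

	struct epoll_event out;
	int n = epoll_wait(epfd, &out, 1, 1000);

	int ready = (n == 1 && (out.events & EPOLLIN) &&
		     out.data.fd == pfd[0]);
	close(epfd);
	close(pfd[0]);
	close(pfd[1]);
	return ready;
}
```

Note that the descriptor is registered once with EPOLL_CTL_ADD; subsequent epoll_wait calls pass no descriptor sets at all.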
epoll_wait no longer waits by polling every descriptor. epoll keeps an internal ready queue of descriptors, and epoll_wait only needs to check that queue: it alternates sleeping with checking, and once it finds the ready queue non-empty, it copies the descriptors in the queue to user space and returns.
Where does the data in the ready queue come from?
When epoll_ctl adds a new descriptor, the kernel implementation of epoll_ctl installs two callback functions.
One is the qproc callback pointer in the poll_table structure: in select this is __pollwait, and epoll replaces it with ep_ptable_queue_proc.
When epoll_ctl invokes the newly added descriptor's poll callback, the underlying driver calls poll_wait to add a wait-queue entry; that poll_wait call actually runs ep_ptable_queue_proc, which sets the wait-queue entry's callback to ep_poll_callback and adds the entry to the driver's wait-queue head.
Once the underlying driver finds data ready, it calls wake_up on the wait queue, so ep_poll_callback is invoked; in ep_poll_callback, the ready descriptor is added to epoll's ready queue, and the process sleeping in epoll_wait is woken.
This is the essence of epoll's core implementation.
Now see how it cures select/poll's flaws. First, a descriptor is added to epoll's internal manager just once, via the epoll_ctl EPOLL_CTL_ADD command, and stays there until removed with the epoll_ctl EPOLL_CTL_DEL command; unlike select/poll, which must re-add everything on every call, this clearly eliminates the cost of the descriptors constantly shuttling between kernel and user space.
Second, although epoll_wait internally also loops, it only needs to check whether the descriptor ready queue is empty; compared with select/poll, which must call poll on every descriptor each pass, the overhead is negligible.
epoll also has no limit on the number of descriptors: as long as your machine has enough memory, it can accommodate a great many.

The following are partial kernel code fragments for epoll:

struct epitem {
	/* RB tree node used to link this structure to the eventpoll RB tree */
	struct rb_node rbn;
	struct epoll_filefd ffd;	/* the descriptor this item stands for */
	struct epoll_event event;	/* the user-registered event set */
	/* ... other members ... */
};

struct eventpoll {
	/* ... other members ... */
	/* Wait queue used by file->poll() */
	wait_queue_head_t poll_wait;
	/* List of ready file descriptors: the ready queue of epitems */
	struct list_head rdllist;
	/* RB tree root used to store monitored fd structs: every added
	 * descriptor's epitem hangs off this tree. */
	struct rb_root rbr;
};

/* epoll_create */
SYSCALL_DEFINE1(epoll_create1, int, flags)
{
	int error;
	struct eventpoll *ep = NULL;
	/* ... other code ... */
	/* Allocate the eventpoll structure -- the soul of epoll,
	 * holding all the data it needs to manage. */
	error = ep_alloc(&ep);
	if (error < 0)
		return error;
	/* Open a descriptor for the eventpoll; ep is stored in the
	 * file->private_data field. */
	error = anon_inode_getfd("[eventpoll]", &eventpoll_fops, ep,
				 flags & O_CLOEXEC);
	if (error < 0)
		ep_free(ep);
	return error;
}

SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
		struct epoll_event __user *, event)
{
	/* ... other code ... */
	ep = file->private_data;
	/* Search the eventpoll's rbr tree for fd's epitem. */
	epi = ep_find(ep, tfile, fd);
	error = -EINVAL;
	switch (op) {
	case EPOLL_CTL_ADD:
		if (!epi) {
			epds.events |= POLLERR | POLLHUP;
			/* Add the new descriptor: ep_insert installs the
			 * important callbacks and also calls the
			 * descriptor's poll to view its ready state. */
			error = ep_insert(ep, &epds, tfile, fd);
		} else
			error = -EEXIST;
		break;
	/* ... other cases ... */
	}
	/* ... other code ... */
}

static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
		     struct file *tfile, int fd)
{
	/* ... other code ... */
	/* Set the poll_table callback to ep_ptable_queue_proc, which in
	 * turn sets the wait-queue callback to ep_poll_callback and
	 * adds a wait-queue entry. */
	init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
	/* ... */
	/* Invoke the descriptor's poll callback; inside it,
	 * ep_ptable_queue_proc will be called. */
	revents = tfile->f_op->poll(tfile, &epq.pt);
	/* ... */
	/* Add the newly built epitem to the red-black tree. */
	ep_rbtree_insert(ep, epi);
	/* ... */
	/* If the poll call above already found the descriptor ready,
	 * put it straight onto the ready queue. */
	if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
		list_add_tail(&epi->rdllink, &ep->rdllist);
		if (waitqueue_active(&ep->wq))
			wake_up_locked(&ep->wq);
		if (waitqueue_active(&ep->poll_wait))
			pwake++;
	}
	/* ... */
	/* We have to call this outside the lock: the ready queue is not
	 * empty, so wake the process sleeping in epoll_wait. */
	if (pwake)
		ep_poll_safewake(&ep->poll_wait);
	/* ... other code ... */
}

/* This function sets the wait-queue callback to ep_poll_callback, so
 * that when the bottom layer wakes the wait queue on arriving data,
 * ep_poll_callback runs and moves the ready descriptor onto the
 * ready queue. */
static void ep_ptable_queue_proc(struct file *file,
				 wait_queue_head_t *whead, poll_table *pt)
{
	struct epitem *epi = ep_item_from_epqueue(pt);
	struct eppoll_entry *pwq;

	if (epi->nwait >= 0 &&
	    (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
		pwq->whead = whead;
		pwq->base = epi;
		add_wait_queue(whead, &pwq->wait);
		list_add_tail(&pwq->llink, &epi->pwqlist);
		epi->nwait++;
	} else {
		/* We have to signal that an error occurred */
		epi->nwait = -1;
	}
}

static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync,
			    void *key)
{
	int pwake = 0;
	unsigned long flags;
	struct epitem *epi = ep_item_from_wait(wait);
	struct eventpoll *ep = epi->ep;

	/* ... other code ... */
	/* Add the now-ready epitem to the ready queue. */
	if (!ep_is_linked(&epi->rdllink))
		list_add_tail(&epi->rdllink, &ep->rdllist);
	/* ... */
	/* The ready queue is not empty: wake the epoll_wait process. */
	if (pwake)
		ep_poll_safewake(&ep->poll_wait);
	/* ... other code ... */
}

/* The epoll_wait kernel code mainly calls ep_poll; part of it: */
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
		   int maxevents, long timeout)
{
	int res, eavail;
	unsigned long flags;
	long jtimeout;
	wait_queue_t wait;

	/* ... other code ... */
	if (list_empty(&ep->rdllist)) {
		/* The ready queue is empty: add the current process to
		 * the wait queue and loop. */
		init_waitqueue_entry(&wait, current);
		wait.flags |= WQ_FLAG_EXCLUSIVE;
		__add_wait_queue(&ep->wq, &wait);
		for (;;) {
			set_current_state(TASK_INTERRUPTIBLE);
			/* Leave the loop when the ready queue is
			 * non-empty or the timeout has expired. */
			if (!list_empty(&ep->rdllist) || !jtimeout)
				break;
			/* Leave the loop on a pending signal. */
			if (signal_pending(current)) {
				res = -EINTR;
				break;
			}
			spin_unlock_irqrestore(&ep->lock, flags);
			/* Sleep until woken or timed out. */
			jtimeout = schedule_timeout(jtimeout);
			spin_lock_irqsave(&ep->lock, flags);
		}
		__remove_wait_queue(&ep->wq, &wait);
		set_current_state(TASK_RUNNING);
	}
	/* ... other code ... */
	/* ep_send_events' main task is to copy the ready descriptors on
	 * the ready queue into the user-space epoll_event array. */
	if (!res && eavail &&
	    !(res = ep_send_events(ep, events, maxevents)) && jtimeout)
		goto retry;
	return res;
}

As you can see, the loop ep_poll runs for epoll_wait is a very light one: it merely checks the ready queue, so its overhead is very small.

Finally, let us see how a descriptor becoming ready is signalled to select/poll/epoll, taking the TCP network socket as the example.

The poll callback for the TCP protocol is tcp_poll, and the corresponding wait-queue head is the sk_sleep member of struct sock. In tcp_poll, the caller is added to the sk_sleep wait queue to wait for data to become ready.

When the physical NIC receives a packet it raises a hardware interrupt. In the interrupt, the driver's ISR routine builds an skb, copies the data into it, then calls netif_rx to mount the skb on the CPU's input_pkt_queue and raise a softirq. The softirq's net_rx_action callback removes skb packets from input_pkt_queue and, after analysis, calls the protocol-specific callbacks, passing the packet up layer by layer until it reaches struct sock, where the sk_data_ready callback pointer is invoked. sk_data_ready points to the sock_def_readable function, which simply calls wake_up on the sock's sk_sleep wait queue.
The mechanism up to this point is the same for select/poll/epoll; what happens next, when sk_sleep is woken, differs, because each has installed a different wait-queue callback.
In the select/poll implementation, the wait-queue callback is pollwake, which calls default_wake_function and wakes the process blocked in select.
In the epoll implementation, the wait-queue callback is ep_poll_callback, which simply adds the ready descriptor to epoll's ready queue.

So in their kernel implementations select/poll/epoll really do not differ much; in fact they are almost alike.
epoll, although efficient enough to rival the completion ports of the Windows platform, has poor portability:
at present essentially only the Linux platform implements epoll, and it requires a 2.6 or later kernel.
