Linux kernel notes: epoll implementation principles

Source: Internet
Author: User
Tags: epoll

I. Description

The target kernel version is 4.4.10.

This article is just a set of simple notes from my own reading of the source. If you want to understand the implementation of epoll in depth, I strongly recommend the following articles:

The implementation of Epoll (1)

The implementation of Epoll (2)

The implementation of Epoll (3)

The implementation of Epoll (4)

II. epoll_create()

The epoll_create() system call creates an epoll instance and returns the file descriptor for that instance. In the kernel, each epoll instance corresponds one-to-one to an object of type struct eventpoll, which is the core of epoll and is declared in fs/eventpoll.c.

The main steps of the epoll_create() definition are analyzed below.

First, create a struct eventpoll object:

struct eventpoll *ep = NULL;
...
error = ep_alloc(&ep);
if (error < 0)
    return error;

Then allocate an unused file descriptor:

fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));
if (fd < 0) {
    error = fd;
    goto out_free_ep;
}

Then create a struct file object, set the struct file_operations *f_op of the file to the global variable eventpoll_fops, and point its void *private_data at the eventpoll object ep that was just created:

struct file *file;
...
file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
                          O_RDWR | (flags & O_CLOEXEC));
if (IS_ERR(file)) {
    error = PTR_ERR(file);
    goto out_free_fd;
}

Then set the file pointer in the eventpoll object:

ep->file = file;

Finally, the file is installed in the file descriptor table of the current process and the descriptor is returned to the user:

fd_install(fd, file);
return fd;
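The effect of the flags & O_CLOEXEC handling above can be observed from user space. The following is only a small sketch, not kernel code; the helper name epoll_fd_is_cloexec is made up for illustration. It creates an epoll instance through epoll_create1() and checks whether the returned descriptor carries FD_CLOEXEC.

```c
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an epoll instance with the given flags
 * (0 or EPOLL_CLOEXEC) and report whether the new descriptor has
 * FD_CLOEXEC set. Returns 1 or 0, or -1 on error.
 */
int epoll_fd_is_cloexec(int flags)
{
    int epfd = epoll_create1(flags);
    if (epfd < 0)
        return -1;

    /* F_GETFD returns the descriptor flags, of which FD_CLOEXEC is one. */
    int fdflags = fcntl(epfd, F_GETFD);
    close(epfd);
    if (fdflags < 0)
        return -1;
    return (fdflags & FD_CLOEXEC) ? 1 : 0;
}
```

Calling epoll_fd_is_cloexec(EPOLL_CLOEXEC) and epoll_fd_is_cloexec(0) side by side shows that the flag passed by the user ends up on the descriptor itself.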

The main structural relationships after the operation are as follows:

III. epoll_ctl()

The epoll_ctl() system call is defined in the kernel as follows; the meaning of each parameter is described in the epoll_ctl man page:

SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
                struct epoll_event __user *, event)

epoll_ctl() first checks whether op is a delete operation; if it is not, it copies the event parameter from user space into the kernel:

struct epoll_event epds;
...
if (ep_op_has_event(op) &&
    copy_from_user(&epds, event, sizeof(struct epoll_event)))
    goto error_return;

ep_op_has_event() simply checks whether op is anything other than a delete operation:

static inline int ep_op_has_event(int op)
{
    return op != EPOLL_CTL_DEL;
}
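One visible consequence of this check is that a delete does not need an event structure at all. The sketch below is user-space code written for these notes (the helper name is hypothetical); it registers the read end of a pipe and then removes it again with a NULL event pointer, which the epoll_ctl man page allows since Linux 2.6.9.

```c
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: add the read end of a pipe to a fresh epoll
 * instance, then delete it passing NULL as the event pointer.
 * Returns 0 on success, -1 on any failure.
 */
int add_then_delete_with_null_event(void)
{
    int pipefd[2];
    int rc = -1;

    if (pipe(pipefd) < 0)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd >= 0) {
        struct epoll_event ev = { 0 };
        ev.events = EPOLLIN;
        ev.data.fd = pipefd[0];

        /* ADD carries an event, so the kernel copies it from user space. */
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0)
            /* DEL carries no event, so NULL is acceptable here. */
            rc = epoll_ctl(epfd, EPOLL_CTL_DEL, pipefd[0], NULL);
        close(epfd);
    }
    close(pipefd[0]);
    close(pipefd[1]);
    return rc;
}
```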

Next, the kernel checks whether the user set the EPOLLEXCLUSIVE flag. This flag was only added in kernel 4.5, mainly to solve the "thundering herd" problem that arises when the same file descriptor is added to multiple epoll instances; a detailed description can be found here. There are some restrictions on this flag: for example, it can only be set in an EPOLL_CTL_ADD operation, and the target file descriptor must not itself be an epoll instance. The following code checks these restrictions:

/*
 * epoll adds to the wakeup queue at EPOLL_CTL_ADD time only,
 * so EPOLLEXCLUSIVE is not allowed for a EPOLL_CTL_MOD operation.
 * Also, we do not currently supported nested exclusive wakeups.
 */
if (epds.events & EPOLLEXCLUSIVE) {
    if (op == EPOLL_CTL_MOD)
        goto error_tgt_fput;
    if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) ||
            (epds.events & ~EPOLLEXCLUSIVE_OK_BITS)))
        goto error_tgt_fput;
}
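To make the conditions easier to read, here is a user-space model of the same check. It is only a sketch: the function exclusive_request_allowed and the macro MODEL_EXCLUSIVE_OK_BITS are names invented for these notes, and the set of allowed bits is my assumption of what EPOLLEXCLUSIVE_OK_BITS contains, based on the flags the epoll_ctl man page lists as compatible with EPOLLEXCLUSIVE.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/epoll.h>

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28)   /* fallback for older C library headers */
#endif

/* Assumption: the bits that may accompany EPOLLEXCLUSIVE. */
#define MODEL_EXCLUSIVE_OK_BITS \
    ((uint32_t)(EPOLLIN | EPOLLOUT | EPOLLERR | EPOLLHUP | \
                EPOLLWAKEUP | EPOLLET | EPOLLEXCLUSIVE))

/*
 * Hypothetical model of the kernel check: returns true if the request
 * would pass, false if the kernel would jump to its error path.
 */
bool exclusive_request_allowed(int op, uint32_t events, bool target_is_epoll)
{
    if (!(events & EPOLLEXCLUSIVE))
        return true;    /* the check only applies to exclusive requests */
    if (op == EPOLL_CTL_MOD)
        return false;   /* never allowed together with MOD */
    if (op == EPOLL_CTL_ADD &&
        (target_is_epoll || (events & ~MODEL_EXCLUSIVE_OK_BITS)))
        return false;   /* nested epoll target or an incompatible flag */
    return true;
}
```

For example, adding with EPOLLIN | EPOLLEXCLUSIVE passes in this model, while combining EPOLLEXCLUSIVE with EPOLLONESHOT does not, because EPOLLONESHOT is outside the assumed set of allowed bits.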

The next step is to get the struct file objects for the file descriptors passed in, and then get the struct eventpoll object from the private_data field of the epoll instance's struct file:

struct fd f, tf;
struct eventpoll *ep;
...
f = fdget(epfd);
...
tf = fdget(fd);
...
ep = f.file->private_data;

If the file descriptor being added itself represents an epoll instance, the instances could end up monitoring each other in a cycle. The kernel checks for this and returns an error if a cycle would be created. I have not yet read this part of the code carefully, so it is not shown here.
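Even without reading the detection code, its result is easy to observe from user space. The sketch below (the helper name is hypothetical) lets epoll instance A watch instance B and then asks B to watch A. According to the epoll_ctl man page, the second add should fail with ELOOP.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: make epoll instance A watch epoll instance B,
 * then try to make B watch A. Returns the errno of the second add,
 * 0 if it unexpectedly succeeds, or -1 if the setup itself fails.
 */
int errno_of_epoll_cycle(void)
{
    struct epoll_event ev = { 0 };
    int result = -1;
    int a = epoll_create1(0);
    int b = epoll_create1(0);

    if (a >= 0 && b >= 0) {
        ev.events = EPOLLIN;
        ev.data.fd = b;
        if (epoll_ctl(a, EPOLL_CTL_ADD, b, &ev) == 0) {
            /* This add would close the cycle A -> B -> A. */
            ev.data.fd = a;
            result = (epoll_ctl(b, EPOLL_CTL_ADD, a, &ev) == 0) ? 0 : errno;
        }
    }
    if (a >= 0)
        close(a);
    if (b >= 0)
        close(b);
    return result;
}
```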

The next step is to look up the epitem object for the monitored file in the red-black tree of the epoll instance. If the file has not been added before, the lookup returns NULL.

struct epitem *epi;
...
epi = ep_find(ep, tf.file, fd);

ep_find() is essentially a red-black tree lookup. The comparison function used for both lookup and insertion is ep_cmp_ffd(), which first compares the addresses of the struct file objects and, if they are equal, compares the file descriptors. One case in which the struct file address is the same is when different file descriptors refer to the same struct file object as a result of the dup() system call.


static inline int ep_cmp_ffd(struct epoll_filefd *p1,
                             struct epoll_filefd *p2)
{
    return (p1->file > p2->file ? +1 :
            (p1->file < p2->file ? -1 : p1->fd - p2->fd));
}
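Because the tree key is the pair (struct file, fd), a descriptor obtained through dup() counts as a separate entry even though it shares the struct file. The following user-space sketch illustrates this; the helper names are invented for these notes. It adds a descriptor, adds a dup() of it, and then tries to add the original a second time.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Try to add fd to epfd; return 0 on success or the errno on failure. */
static int try_add(int epfd, int fd)
{
    struct epoll_event ev = { 0 };
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0 ? 0 : errno;
}

/*
 * Hypothetical helper: results[0] is the outcome of adding a pipe's read
 * end, results[1] of adding a dup() of it, results[2] of adding the
 * original again. Returns 0, or -1 if the setup fails.
 */
int add_fd_dup_and_repeat(int results[3])
{
    int pipefd[2];
    int rc = -1;

    if (pipe(pipefd) < 0)
        return -1;

    int copy = dup(pipefd[0]);      /* same struct file, different fd */
    int epfd = epoll_create1(0);
    if (copy >= 0 && epfd >= 0) {
        results[0] = try_add(epfd, pipefd[0]);
        results[1] = try_add(epfd, copy);
        results[2] = try_add(epfd, pipefd[0]);  /* identical key */
        rc = 0;
    }
    if (epfd >= 0)
        close(epfd);
    if (copy >= 0)
        close(copy);
    close(pipefd[0]);
    close(pipefd[1]);
    return rc;
}
```

The first two adds are expected to succeed and the third to fail with EEXIST, which matches a lookup keyed on both the file and the descriptor number.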

The next step is to handle the request according to op; here we only look at the add operation, where op equals EPOLL_CTL_ADD. The kernel first checks whether the epitem pointer returned by the previous step is NULL. If it is not NULL, the file has already been added and an error is returned; otherwise ep_insert() is called to do the actual add. Before adding the file, the kernel automatically adds the POLLERR and POLLHUP events to its event mask.

if (!epi) {
    epds.events |= POLLERR | POLLHUP;
    error = ep_insert(ep, &epds, tf.file, fd, full_check);
} else
    error = -EEXIST;
if (full_check)
    clear_tfile_check_list();
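The automatically added events can be seen from user space as well. In the sketch below (a hypothetical helper written for these notes), the read end of a pipe is registered with an empty event mask, the write end is closed, and epoll_wait() is asked what happened. Since the user requested nothing, any EPOLLHUP that is reported comes from the bits the kernel added on its own.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: register the read end of a pipe with an empty
 * event mask, close the write end, and return the event bits reported
 * by epoll_wait(). Returns 0 if nothing is reported or on error.
 */
uint32_t events_after_writer_closes(void)
{
    int pipefd[2];
    uint32_t reported = 0;

    if (pipe(pipefd) < 0)
        return 0;

    int epfd = epoll_create1(0);
    if (epfd >= 0) {
        struct epoll_event ev = { 0 }, out;
        ev.events = 0;              /* the user asks for no events at all */
        ev.data.fd = pipefd[0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0) {
            close(pipefd[1]);       /* hang up the writer side */
            pipefd[1] = -1;
            if (epoll_wait(epfd, &out, 1, 0) == 1)
                reported = out.events;
        }
        close(epfd);
    }
    close(pipefd[0]);
    if (pipefd[1] >= 0)
        close(pipefd[1]);
    return reported;
}
```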

After ep_insert() returns, the full_check flag is tested. This is related to the cycle detection mentioned above and is omitted here.

IV. ep_insert()

In ep_insert(), the kernel first checks whether the number of files being watched by the current user has reached the limit (max_user_watches), and then allocates an epitem object for the file to be added:

int error, revents, pwake = 0;
unsigned long flags;
long user_watches;
struct epitem *epi;
struct ep_pqueue epq;

user_watches = atomic_long_read(&ep->user->epoll_watches);
if (unlikely(user_watches >= max_user_watches))
    return -ENOSPC;
if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
    return -ENOMEM;
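The limit compared against here is exposed to user space through procfs, as documented in the epoll(7) man page. The following sketch simply reads it; the helper name is made up, and the function returns -1 when the file is not available (for example in a restricted container).

```c
#include <stdio.h>

/*
 * Hypothetical helper: read the per-user epoll watch limit from procfs.
 * Returns the limit, or -1 if the file cannot be opened or parsed.
 */
long read_max_user_watches(void)
{
    long limit = -1;
    FILE *f = fopen("/proc/sys/fs/epoll/max_user_watches", "r");

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &limit) != 1)
        limit = -1;
    fclose(f);
    return limit;
}
```

When a user reaches this number of registered watches, a further EPOLL_CTL_ADD fails with ENOSPC, which is the -ENOSPC return in the code above.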

The next step is to initialize the epitem:

INIT_LIST_HEAD(&epi->rdllink);
INIT_LIST_HEAD(&epi->fllink);
INIT_LIST_HEAD(&epi->pwqlist);
epi->ep = ep;
ep_set_ffd(&epi->ffd, tfile, fd);
epi->event = *event;
epi->nwait = 0;
epi->next = EP_UNACTIVE_PTR;
if (epi->event.events & EPOLLWAKEUP) {
    error = ep_create_wakeup_source(epi);
    if (error)
        goto error_create_wakeup_source;
} else {
    RCU_INIT_POINTER(epi->ws, NULL);
}

Next comes the more important step: adding the epitem to the wait queue of the monitored file. A wait queue is essentially a list of callback functions and is defined in include/linux/wait.h. Because each file system implements it differently, the wait queue cannot be obtained directly from the struct file object. Instead, the poll operation of the struct file hands its wait queue back through a callback, and the callback set here is ep_ptable_queue_proc:

struct ep_pqueue epq;
...
/* Initialize the poll table using the queue callback */
epq.epi = epi;
init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);

/*
 * Attach the item to the poll hooks and get current event bits.
 * We can safely use the file* here because its usage count has
 * been increased by the caller of this function. Note that after
 * this operation completes, the poll callback can start hitting
 * the new item.
 */
revents = ep_item_poll(epi, &epq.pt);

The purpose of struct ep_pqueue in the code above is to let the poll callback recover the corresponding epitem object from the poll_table pointer it receives. This technique of embedding one structure in another and recovering the outer one is very common in the Linux kernel.
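The technique can be sketched in plain user-space C. Everything prefixed with model_ below is a stand-in invented for these notes, not the kernel's definition; only the shape (a poll table embedded in a wrapper that also holds the item pointer) follows the description above.

```c
#include <stddef.h>

/* Stand-ins for the kernel types; only the layout idea matters here. */
struct model_poll_table { int unused; };
struct model_epitem { int fd; };

/* Wrapper: the callback only sees &pt, but epi sits right next to it. */
struct model_ep_pqueue {
    struct model_poll_table pt;
    struct model_epitem *epi;
};

/* Recover a pointer to the enclosing structure from a member pointer. */
#define model_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical counterpart of the lookup done in the poll callback. */
struct model_epitem *model_item_from_epqueue(struct model_poll_table *p)
{
    return model_container_of(p, struct model_ep_pqueue, pt)->epi;
}
```

Given a struct model_ep_pqueue q whose epi field points at some item, model_item_from_epqueue(&q.pt) returns that same item, even though the function was handed only the embedded member.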

In the callback ep_ptable_queue_proc(), the kernel creates a struct eppoll_entry object and sets the callback of its wait queue entry to ep_poll_callback(). In other words, ep_poll_callback() is invoked when an event occurs on a monitored file, for example when a socket receives data. The code of ep_ptable_queue_proc() is as follows:

static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
                                 poll_table *pt)
{
    struct epitem *epi = ep_item_from_epqueue(pt);
    struct eppoll_entry *pwq;

    if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
        init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
        pwq->whead = whead;
        pwq->base = epi;
        if (epi->event.events & EPOLLEXCLUSIVE)
            add_wait_queue_exclusive(whead, &pwq->wait);
        else
            add_wait_queue(whead, &pwq->wait);
        list_add_tail(&pwq->llink, &epi->pwqlist);
        epi->nwait++;
    } else {
        /* We have to signal that an error occurred */
        epi->nwait = -1;
    }
}
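The idea of a wait queue as a list of callbacks can be modelled in a few lines of user-space C. This is a deliberately simplified sketch: all model_ names are invented for these notes, there is no locking, and the real kernel structures differ. It only shows that registering an entry stores a function pointer, and that waking the queue calls every stored function.

```c
#include <stddef.h>

struct model_wait_entry;
typedef void (*model_wait_func)(struct model_wait_entry *entry);

/* One registered waiter: a callback plus a pointer to its owner's data. */
struct model_wait_entry {
    model_wait_func func;
    void *private_data;
    struct model_wait_entry *next;
};

/* The queue owned by the monitored object. */
struct model_wait_head {
    struct model_wait_entry *first;
};

/* Register an entry on the queue. */
void model_add_wait_queue(struct model_wait_head *head,
                          struct model_wait_entry *entry)
{
    entry->next = head->first;
    head->first = entry;
}

/* Invoke every registered callback; returns how many were called. */
int model_wake_up(struct model_wait_head *head)
{
    int called = 0;
    for (struct model_wait_entry *e = head->first; e != NULL; e = e->next) {
        e->func(e);
        called++;
    }
    return called;
}

/* Example callback: flag the owner as ready, in the spirit of ep_poll_callback(). */
void model_mark_ready(struct model_wait_entry *entry)
{
    *(int *)entry->private_data = 1;
}
```

In this model, the monitored file would call model_wake_up() when data arrives, and each epoll instance watching it would have registered an entry whose callback marks its own item as ready.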

The relationships between structures such as eppoll_entry and epitem are as follows:

Now back to ep_insert(). After the ep_item_poll() call completes, the fllink field of the epitem is added to the f_ep_links list in the struct file. This makes it possible to find all the struct epitem objects that refer to a struct file, and from each struct epitem to find the struct eventpoll of the epoll instance it belongs to.

spin_lock(&tfile->f_lock);
list_add_tail_rcu(&epi->fllink, &tfile->f_ep_links);
spin_unlock(&tfile->f_lock);
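A list is needed here because one file can be watched by several epoll instances at once, each with its own epitem. The user-space sketch below (the helper name is hypothetical) registers the same pipe descriptor in two epoll instances, writes one byte, and counts how many of the instances report it as readable.

```c
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: watch one pipe from two epoll instances, make it
 * readable, and return how many instances report it. Returns -1 on error.
 */
int instances_reporting_ready(void)
{
    int pipefd[2];
    int epfd[2];
    int ready = -1;

    if (pipe(pipefd) < 0)
        return -1;
    epfd[0] = epoll_create1(0);
    epfd[1] = epoll_create1(0);

    if (epfd[0] >= 0 && epfd[1] >= 0) {
        struct epoll_event ev = { 0 }, out;
        ev.events = EPOLLIN;
        ev.data.fd = pipefd[0];

        /* One file, two registrations: one per epoll instance. */
        if (epoll_ctl(epfd[0], EPOLL_CTL_ADD, pipefd[0], &ev) == 0 &&
            epoll_ctl(epfd[1], EPOLL_CTL_ADD, pipefd[0], &ev) == 0 &&
            write(pipefd[1], "x", 1) == 1) {
            ready = 0;
            for (int i = 0; i < 2; i++)
                if (epoll_wait(epfd[i], &out, 1, 0) == 1)
                    ready++;
        }
    }
    if (epfd[0] >= 0)
        close(epfd[0]);
    if (epfd[1] >= 0)
        close(epfd[1]);
    close(pipefd[0]);
    close(pipefd[1]);
    return ready;
}
```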

The epitem is then inserted into the red-black tree:

ep_rbtree_insert(ep, epi);

Finally, the state is updated and the function returns, which completes the insert operation.

Before returning, ep_insert() also checks whether the file that was just added already has events ready. If it does, the epitem is added to the ready list of the epoll instance. The ready list is covered in the next section.
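This check at add time has a visible effect: a descriptor that is already readable when it is registered is reported right away, without waiting for a new event. A small user-space sketch (the helper name is invented for these notes):

```c
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: write to a pipe first, register its read end
 * afterwards, and return what a non-blocking epoll_wait() reports.
 * Returns the number of ready descriptors, or -1 on error.
 */
int ready_count_right_after_add(void)
{
    int pipefd[2];
    int ready = -1;

    if (pipe(pipefd) < 0)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd >= 0) {
        struct epoll_event ev = { 0 }, out;
        ev.events = EPOLLIN;
        ev.data.fd = pipefd[0];

        /* Data arrives first; the descriptor is registered afterwards. */
        if (write(pipefd[1], "x", 1) == 1 &&
            epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0)
            ready = epoll_wait(epfd, &out, 1, 0);   /* timeout 0: do not block */
        close(epfd);
    }
    close(pipefd[0]);
    close(pipefd[1]);
    return ready;
}
```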

Finally, here is a diagram I drew of the relationships among these structures.
