Linux kernel notes: epoll implementation principles (1)

Last Update:2017-04-17 Source: Internet

Author: User

Tags epoll

I. Description

The target kernel version is 4.4.10.

This article is just a set of simple notes I took while reading the source code. If you want to understand the implementation of epoll in depth, I strongly recommend the following articles:

The Implementation of epoll (1)

The Implementation of epoll (2)

The Implementation of epoll (3)

The Implementation of epoll (4)

II. epoll_create()

The epoll_create() system call creates an epoll instance and returns the file descriptor fd that corresponds to it. In the kernel, each epoll instance corresponds to a struct eventpoll object, which is the core of epoll and is declared in fs/eventpoll.c.

The epoll_create() interface is also defined in fs/eventpoll.c. The main steps in the source code are as follows:

First create a struct eventpoll object:

struct eventpoll *ep = NULL;
...
error = ep_alloc(&ep);
if (error < 0)
    return error;

Then allocate an unused file descriptor:

fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));
if (fd < 0) {
    error = fd;
    goto out_free_ep;
}

Create a struct file object, set its struct file_operations *f_op to the global variable eventpoll_fops, and point its void *private_data at the newly created eventpoll object ep:

struct file *file;
...
file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
                          O_RDWR | (flags & O_CLOEXEC));
if (IS_ERR(file)) {
    error = PTR_ERR(file);
    goto out_free_fd;
}

Then set the file pointer in eventpoll:

ep->file = file;

Finally, install the file in the file descriptor table of the current process under that descriptor, and return the descriptor to the user:

fd_install(fd, file);
return fd;
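From user space, the result of these steps can be observed with a minimal sketch. It assumes a Linux system with glibc; the helper names create_epoll_cloexec() and fd_has_cloexec() are hypothetical and exist only for this illustration. The O_CLOEXEC bit that the kernel passes to get_unused_fd_flags() should show up as FD_CLOEXEC on the returned descriptor.

```c
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: create an epoll instance with close-on-exec set.
 * Returns the new file descriptor, or -1 on error (errno is set). */
int create_epoll_cloexec(void)
{
    return epoll_create1(EPOLL_CLOEXEC);
}

/* Hypothetical helper: returns 1 if FD_CLOEXEC is set on fd,
 * 0 if it is not set, and -1 if fcntl() fails. */
int fd_has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

Calling fd_has_cloexec() on a descriptor obtained from epoll_create1(0) instead should report that the flag is not set.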

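After these steps, the main structures are related as follows: the returned fd refers to the struct file in the process's file descriptor table, the file's f_op points to eventpoll_fops and its private_data points to the struct eventpoll, and the file field of the struct eventpoll points back to the struct file.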

III. epoll_ctl()

The epoll_ctl() system call is defined in the kernel as follows. For the meaning of each parameter, see the epoll_ctl man page.

SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, struct epoll_event __user *, event)

epoll_ctl() first checks whether op is a delete operation. If it is not, the event parameter is copied from user space into the kernel:

struct epoll_event epds;
...
if (ep_op_has_event(op) &&
    copy_from_user(&epds, event, sizeof(struct epoll_event)))
        goto error_return;

ep_op_has_event() simply determines whether op is something other than a delete operation:

static inline int ep_op_has_event(int op)
{
    return op != EPOLL_CTL_DEL;
}
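Because the copy from user space is skipped for EPOLL_CTL_DEL, the event pointer is never read for a delete. A minimal user-space sketch of that consequence follows, assuming Linux with glibc; the function name del_with_null_event() is hypothetical. The epoll_ctl man page documents that a NULL event is accepted for EPOLL_CTL_DEL on kernels since 2.6.9.

```c
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Adds the read end of a pipe to a fresh epoll instance, then removes it
 * with a NULL event pointer. Returns the result of the EPOLL_CTL_DEL call
 * (0 on success), or -1 if any setup step fails. */
int del_with_null_event(void)
{
    int pfd[2];
    int rc = -1;
    struct epoll_event ev = { .events = EPOLLIN };
    int epfd = epoll_create1(0);

    if (epfd < 0)
        return -1;
    if (pipe(pfd) < 0) {
        close(epfd);
        return -1;
    }

    /* ADD needs a real event structure; DEL does not. */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == 0)
        rc = epoll_ctl(epfd, EPOLL_CTL_DEL, pfd[0], NULL);

    close(pfd[0]);
    close(pfd[1]);
    close(epfd);
    return rc;
}
```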

Next, check whether the EPOLLEXCLUSIVE flag is set. This flag is available only in kernel 4.5 and later. It is mainly used to mitigate the "thundering herd" problem that arises when the same file descriptor is added to multiple epoll instances at the same time. For details, refer to here. The flag comes with some restrictions: for example, it can only be set in an EPOLL_CTL_ADD operation, and the target file descriptor must not itself be an epoll instance. The following code checks these restrictions:

/*
 * epoll adds to the wakeup queue at EPOLL_CTL_ADD time only,
 * so EPOLLEXCLUSIVE is not allowed for a EPOLL_CTL_MOD operation.
 * Also, we do not currently supported nested exclusive wakeups.
 */
if (epds.events & EPOLLEXCLUSIVE) {
    if (op == EPOLL_CTL_MOD)
        goto error_tgt_fput;
    if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) ||
            (epds.events & ~EPOLLEXCLUSIVE_OK_BITS)))
        goto error_tgt_fput;
}
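To make the rejected combinations concrete, here is a user-space mirror of that check. It is only a sketch: exclusive_request_ok() and SKETCH_EXCLUSIVE_OK_BITS are hypothetical names, and the contents of the mask are my assumption about what the kernel's EPOLLEXCLUSIVE_OK_BITS allows (EPOLLIN, EPOLLOUT, EPOLLERR, EPOLLHUP, EPOLLWAKEUP, EPOLLET and EPOLLEXCLUSIVE itself), so check it against the kernel source you are reading.

```c
#include <sys/epoll.h>

/* Fallbacks for older headers; values taken from the Linux UAPI header. */
#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28)
#endif
#ifndef EPOLLWAKEUP
#define EPOLLWAKEUP (1u << 29)
#endif

/* Assumed mirror of the kernel's EPOLLEXCLUSIVE_OK_BITS mask. */
#define SKETCH_EXCLUSIVE_OK_BITS \
    ((unsigned int)(EPOLLIN | EPOLLOUT | EPOLLERR | EPOLLHUP | \
                    EPOLLWAKEUP | EPOLLET | EPOLLEXCLUSIVE))

/* Returns 1 if the request would pass the EPOLLEXCLUSIVE check quoted
 * above, 0 if it would be rejected. target_is_epoll stands in for
 * is_file_epoll(tf.file). */
int exclusive_request_ok(int op, unsigned int events, int target_is_epoll)
{
    if (!(events & EPOLLEXCLUSIVE))
        return 1;               /* the check does not apply */
    if (op == EPOLL_CTL_MOD)
        return 0;               /* never allowed on modify */
    if (op == EPOLL_CTL_ADD &&
        (target_is_epoll || (events & ~SKETCH_EXCLUSIVE_OK_BITS)))
        return 0;               /* nested epoll or a disallowed bit */
    return 1;
}
```

For example, under the assumed mask, combining EPOLLEXCLUSIVE with EPOLLONESHOT in an add is rejected because EPOLLONESHOT is not in the allowed bits.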

Next, obtain the struct file objects from the file descriptors that were passed in, and then get the struct eventpoll object from the private_data field of the epoll file's struct file:

struct fd f, tf;
struct eventpoll *ep;
...
f = fdget(epfd);
...
tf = fdget(fd);
...
ep = f.file->private_data;

If the file descriptor being added itself represents an epoll instance, the epoll instances could end up monitoring one another in an endless loop. The kernel checks for this situation and returns an error if such a loop would be created. I have not read this part of the code yet, so I will not include it here.
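Although the kernel side of the loop check is skipped here, its effect is visible from user space. The following is a minimal sketch, assuming Linux with glibc; the function name nested_epoll_cycle_errno() is hypothetical. According to the epoll_ctl man page, an EPOLL_CTL_ADD that would create a circular chain of epoll instances fails with ELOOP.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Makes epoll instance a monitor b, then asks b to monitor a.
 * Returns the errno of the second EPOLL_CTL_ADD (ELOOP is expected),
 * 0 if that call unexpectedly succeeds, or -1 if setup fails. */
int nested_epoll_cycle_errno(void)
{
    int result = -1;
    struct epoll_event ev = { .events = EPOLLIN };
    int a = epoll_create1(0);
    int b = epoll_create1(0);

    if (a >= 0 && b >= 0 &&
        epoll_ctl(a, EPOLL_CTL_ADD, b, &ev) == 0) {
        /* This add would close the cycle a -> b -> a. */
        if (epoll_ctl(b, EPOLL_CTL_ADD, a, &ev) == 0)
            result = 0;
        else
            result = errno;
    }

    if (a >= 0)
        close(a);
    if (b >= 0)
        close(b);
    return result;
}
```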

Next, look up the epitem object that corresponds to the monitored file in the red-black tree of the epoll instance. If no such object exists, that is, the file has not been added before, ep_find() returns NULL.

struct epitem *epi;
...
epi = ep_find(ep, tf.file, fd);

The ep_find() function is essentially a red-black tree search. The comparison function used for both searching and inserting is ep_cmp_ffd(). It first compares the addresses of the struct file objects and, if they are equal, compares the file descriptors. One case in which the struct file address is the same is when different file descriptors refer to the same struct file object after a dup() system call.

static inline int ep_cmp_ffd(struct epoll_filefd *p1,
                             struct epoll_filefd *p2)
{
        return (p1->file > p2->file ? +1:
                (p1->file < p2->file ? -1 : p1->fd - p2->fd));
}
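Since the key is the pair (struct file address, file descriptor), a descriptor and its dup() copy count as two distinct entries. A small user-space sketch of this, assuming Linux with glibc; the function name add_original_and_dup() is hypothetical:

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Registers the read end of a pipe and a dup() of it with the same epoll
 * instance. Returns 1 if both adds succeed, 0 if either add fails, and
 * -1 if setup fails. */
int add_original_and_dup(void)
{
    int pfd[2];
    int dupfd;
    int ok = -1;
    struct epoll_event ev = { .events = EPOLLIN };
    int epfd = epoll_create1(0);

    if (epfd < 0)
        return -1;
    if (pipe(pfd) < 0) {
        close(epfd);
        return -1;
    }

    dupfd = dup(pfd[0]);   /* same struct file, different descriptor */
    if (dupfd >= 0) {
        int first = epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev);
        int second = epoll_ctl(epfd, EPOLL_CTL_ADD, dupfd, &ev);

        ok = (first == 0 && second == 0) ? 1 : 0;
        close(dupfd);
    }

    close(pfd[0]);
    close(pfd[1]);
    close(epfd);
    return ok;
}
```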

Next, different processing is performed depending on op. Here we only look at the add operation, where op equals EPOLL_CTL_ADD. The kernel first checks whether the epitem pointer returned by the previous step is NULL. If it is not NULL, the file has already been added and an error is returned. Otherwise, ep_insert() is called to add it. Before adding the file, the kernel automatically adds the POLLERR and POLLHUP events to the requested event mask.

if (!epi) {
    epds.events |= POLLERR | POLLHUP;
    error = ep_insert(ep, &epds, tf.file, fd, full_check);
} else
    error = -EEXIST;
if (full_check)
    clear_tfile_check_list();

After ep_insert() returns, the full_check flag is tested. This flag is related to the endless loop detection mentioned above and is skipped here.
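Both branches can be observed from user space. The sketch below assumes Linux with glibc; second_add_errno() and events_after_hangup() are hypothetical names. The first function adds the same descriptor twice and is expected to see EEXIST. The second registers a pipe's read end with an empty event mask and is expected to still receive EPOLLHUP once the write end is closed, which matches the kernel OR-ing POLLERR | POLLHUP into the mask.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Adds the same descriptor twice. Returns the errno of the second
 * EPOLL_CTL_ADD, 0 if it succeeds, or -1 if setup fails. */
int second_add_errno(void)
{
    int pfd[2];
    int result = -1;
    struct epoll_event ev = { .events = EPOLLIN };
    int epfd = epoll_create1(0);

    if (epfd < 0)
        return -1;
    if (pipe(pfd) < 0) {
        close(epfd);
        return -1;
    }
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == 0)
        result = (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == 0) ? 0 : errno;

    close(pfd[0]);
    close(pfd[1]);
    close(epfd);
    return result;
}

/* Registers the read end of a pipe with an empty event mask, closes the
 * write end, and returns the event bits reported by epoll_wait()
 * (0 if nothing is reported or if setup fails). */
unsigned int events_after_hangup(void)
{
    int pfd[2];
    unsigned int reported = 0;
    struct epoll_event ev = { .events = 0 };
    struct epoll_event out;
    int epfd = epoll_create1(0);

    if (epfd < 0)
        return 0;
    if (pipe(pfd) < 0) {
        close(epfd);
        return 0;
    }
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == 0) {
        close(pfd[1]);          /* last writer goes away: hangup */
        pfd[1] = -1;
        if (epoll_wait(epfd, &out, 1, 0) == 1)
            reported = out.events;
    }

    if (pfd[1] >= 0)
        close(pfd[1]);
    close(pfd[0]);
    close(epfd);
    return reported;
}
```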

IV. ep_insert()

ep_insert() first checks whether the number of files monitored through epoll by the current user has exceeded the limit. If it has not, an epitem object is allocated for the file being added:

int error, revents, pwake = 0;
unsigned long flags;
long user_watches;
struct epitem *epi;
struct ep_pqueue epq;

user_watches = atomic_long_read(&ep->user->epoll_watches);
if (unlikely(user_watches >= max_user_watches))
        return -ENOSPC;
if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
        return -ENOMEM;

Next, the epitem is initialized:

INIT_LIST_HEAD(&epi->rdllink);
INIT_LIST_HEAD(&epi->fllink);
INIT_LIST_HEAD(&epi->pwqlist);
epi->ep = ep;
ep_set_ffd(&epi->ffd, tfile, fd);
epi->event = *event;
epi->nwait = 0;
epi->next = EP_UNACTIVE_PTR;
if (epi->event.events & EPOLLWAKEUP) {
        error = ep_create_wakeup_source(epi);
        if (error)
                goto error_create_wakeup_source;
} else {
        RCU_INIT_POINTER(epi->ws, NULL);
}
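The limit that max_user_watches holds is exposed to user space through /proc/sys/fs/epoll/max_user_watches, as described in the epoll man page. A minimal sketch for reading it follows; the function name read_max_user_watches() is hypothetical, and the function returns -1 where the file is not available (for example in a restricted container).

```c
#include <stdio.h>

/* Reads the per-user epoll watch limit from procfs.
 * Returns the limit, or -1 if the file cannot be opened or parsed. */
long read_max_user_watches(void)
{
    long value = -1;
    FILE *f = fopen("/proc/sys/fs/epoll/max_user_watches", "r");

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

When the limit is reached, the -ENOSPC returned above surfaces in user space as epoll_ctl() failing with errno set to ENOSPC.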

Next, the epitem object is hooked into the wait queue of the monitored file. A wait queue is essentially a linked list of callback functions and is defined in include/linux/wait.h. Because different file systems implement their files differently, the wait queue cannot be obtained directly from the struct file object. Instead, the code below invokes the poll operation of the struct file, which hands back the file's wait queue through a callback. The callback installed here is ep_ptable_queue_proc:

struct ep_pqueue epq;
...
/* Initialize the poll table using the queue callback */
epq.epi = epi;
init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);

/*
 * Attach the item to the poll hooks and get current event bits.
 * We can safely use the file* here because its usage count has
 * been increased by the caller of this function. Note that after
 * this operation completes, the poll callback can start hitting
 * the new item.
 */
revents = ep_item_poll(epi, &epq.pt);

In the code above, the structure ep_pqueue exists so that the poll callback, which only receives a pointer to the embedded poll_table, can recover the corresponding epitem object. This technique is very common in the Linux kernel.
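The technique is the embed-and-recover pattern behind the kernel's container_of() macro. Here is a self-contained user-space sketch of it. All names with a sketch_ prefix are hypothetical stand-ins, not the kernel's definitions; the only assumption carried over is the shape of ep_pqueue, a poll table followed by a pointer to the item.

```c
#include <stddef.h>

/* User-space stand-in for the kernel's container_of() macro. */
#define sketch_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct sketch_poll_table { int unused; };   /* stand-in for poll_table */
struct sketch_item { int id; };             /* stand-in for struct epitem */

/* Same shape as ep_pqueue: the poll table plus the item it serves. */
struct sketch_pqueue {
    struct sketch_poll_table pt;
    struct sketch_item *item;
};

/* The callback only receives the embedded pt pointer, yet it can step
 * back to the enclosing sketch_pqueue and read the item from it. */
struct sketch_item *sketch_item_from_pt(struct sketch_poll_table *pt)
{
    return sketch_container_of(pt, struct sketch_pqueue, pt)->item;
}
```

The benefit of this design is that the generic poll interface does not need to know anything about epoll: epoll smuggles its own context alongside the poll_table it passes in.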

In the callback function ep_ptable_queue_proc(), the kernel creates a struct eppoll_entry object and sets the callback of its wait queue entry to ep_poll_callback(). In other words, when an event occurs on the monitored file, for example when a socket receives data, ep_poll_callback() is invoked. The code of ep_ptable_queue_proc() is as follows:

static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
                                 poll_table *pt)
{
        struct epitem *epi = ep_item_from_epqueue(pt);
        struct eppoll_entry *pwq;

        if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
                init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
                pwq->whead = whead;
                pwq->base = epi;
                if (epi->event.events & EPOLLEXCLUSIVE)
                        add_wait_queue_exclusive(whead, &pwq->wait);
                else
                        add_wait_queue(whead, &pwq->wait);
                list_add_tail(&pwq->llink, &epi->pwqlist);
                epi->nwait++;
        } else {
                /* We have to signal that an error occurred */
                epi->nwait = -1;
        }
}
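To illustrate the idea of a wait queue as a linked list of callbacks, here is a deliberately simplified user-space model. It is not the kernel's wait queue implementation: every sketch_ name is hypothetical, there is no locking, and exclusive wakeups are not modelled. It only shows how registering an entry with a callback lets the owner of the queue notify epoll without knowing anything about it.

```c
#include <stddef.h>

struct sketch_wait_entry;
typedef void (*sketch_wake_fn)(struct sketch_wait_entry *entry);

/* One registered waiter: a callback plus the data it operates on. */
struct sketch_wait_entry {
    sketch_wake_fn func;
    void *private_data;
    struct sketch_wait_entry *next;
};

/* The queue head owned by the monitored object (for example a socket). */
struct sketch_wait_head {
    struct sketch_wait_entry *first;
};

/* Registers an entry at the front of the queue. */
void sketch_add_wait(struct sketch_wait_head *head, struct sketch_wait_entry *e)
{
    e->next = head->first;
    head->first = e;
}

/* "Wakes up" the queue by invoking every registered callback.
 * Returns the number of callbacks that were called. */
int sketch_wake_up(struct sketch_wait_head *head)
{
    int called = 0;

    for (struct sketch_wait_entry *e = head->first; e; e = e->next) {
        e->func(e);
        called++;
    }
    return called;
}

/* Plays the role of ep_poll_callback(): marks the item as ready. */
void sketch_mark_ready(struct sketch_wait_entry *e)
{
    *(int *)e->private_data = 1;
}
```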

The relationship between eppoll_entry and epitem is as follows: each eppoll_entry is linked into the pwqlist of its epitem through the llink field, its base field points back to the epitem, its whead field records the wait queue head of the monitored file, and its wait field is the entry queued on that wait queue.

Back in ep_insert(): after ep_item_poll() is called, the fllink field of the epitem is added to the f_ep_links linked list of the struct file. This makes it possible to find every struct epitem associated with a struct file, and from each struct epitem to find the struct eventpoll of the epoll instance it belongs to.

spin_lock(&tfile->f_lock);
list_add_tail_rcu(&epi->fllink, &tfile->f_ep_links);
spin_unlock(&tfile->f_lock);

Then the epitem is inserted into the red-black tree:

ep_rbtree_insert(ep, epi);

Once the remaining state has been updated, the insert operation is complete.

Before returning, the kernel checks whether the newly added file already has a requested event pending. If it does, the epitem is added to the ready list of the epoll instance. The ready list is covered in the next part, so I will skip it here.

Finally, I drew a diagram of the relationships among these structs.
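This readiness check at insert time is easy to observe from user space. The sketch below assumes Linux with glibc; the function name ready_count_after_add() is hypothetical. It writes to a pipe before registering the read end, so the data is already pending when EPOLL_CTL_ADD runs, and then polls with a zero timeout.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Writes one byte into a pipe, then registers the read end for EPOLLIN.
 * Returns the number of ready events reported by a non-blocking
 * epoll_wait(), or -1 if setup fails. */
int ready_count_after_add(void)
{
    int pfd[2];
    int n = -1;
    struct epoll_event ev = { .events = EPOLLIN };
    struct epoll_event out;
    int epfd = epoll_create1(0);

    if (epfd < 0)
        return -1;
    if (pipe(pfd) < 0) {
        close(epfd);
        return -1;
    }

    /* The data arrives before the descriptor is registered. */
    if (write(pfd[1], "x", 1) == 1 &&
        epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == 0)
        n = epoll_wait(epfd, &out, 1, 0);

    close(pfd[0]);
    close(pfd[1]);
    close(epfd);
    return n;
}
```

If the kernel did not check readiness during the insert, an event that happened before registration would be missed until the next state change on the file.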
