Linux Kernel epoll Implementation Analysis


The difference between epoll and select/poll

select, poll, and epoll are all mechanisms for I/O multiplexing: a way to monitor multiple file descriptors at once so that, as soon as any descriptor becomes ready, the program is notified and can take the appropriate action.

select is essentially a fixed-size bitmap: 32 machine words of 32 bits each, giving 32*32 = 1024 identifiable descriptors, so FD values are limited to the range 0-1023. When an FD value exceeds the 1024 limit, FD_SETSIZE must be changed (and the program recompiled), after which FDs up to the new 32*max range can be identified.
poll, unlike select, passes an array of struct pollfd to the kernel to describe the events of interest, so there is no hard limit on the number of descriptors. The events and revents fields in pollfd indicate, respectively, the events the caller cares about and the events that actually occurred, so the pollfd array only needs to be initialized once.
epoll is in turn an optimization of poll: it maintains the list of monitored FDs inside the kernel, so there is no need to traverse all FDs after each return. select and poll maintain the FD list in user space and pass the whole list to the kernel on every call. Also unlike poll/select, epoll is not a single system call but consists of three: epoll_create, epoll_ctl, and epoll_wait; the benefit of this split will become clear below. epoll is supported in kernels 2.6 and later.


The major drawbacks of select/poll:
1. Every call to select/poll must copy the FD set from user space to kernel space, which is very expensive when the set is large.
2. Every call to select/poll must also traverse, inside the kernel, all the FDs that were passed in, which is likewise expensive when there are many FDs.
3. The number of file descriptors select supports is too small; the default is 1024.


Why epoll is more efficient than select/poll

A traditional poll call effectively starts from scratch every time: it reads the full ufds array in from user space, and on completion copies everything back out; in addition, every call must add each device to its wait queue and later remove it again. These are the reasons for its inefficiency.

epoll's solution: each new FD is registered with the epoll handle once (by specifying EPOLL_CTL_ADD in epoll_ctl), and it is copied into the kernel at that point rather than being copied again at every epoll_wait. epoll guarantees each FD is copied only once over its whole lifetime. select, poll, and epoll all use wait queues whose callback functions wake up the asynchronously waiting thread (and if a timeout is set, an hrtimer is used). The callback used by select and poll does nothing beyond the wakeup, but epoll's wait-queue callback also adds the now-ready FD to a ready list before waking the waiting process. epoll_wait therefore simply returns this ready list, which contains exactly the ready FDs; the kernel does not have to traverse all FDs, and neither does the user-space program, which just iterates over the returned list of ready FDs.


Why implement eventpollfs

1. State can be kept in the kernel and preserved across multiple epoll_wait calls, such as the set of all monitored file descriptors.
2. The epoll instance is itself a file, so it can in turn be monitored by poll/epoll.


What conditions a file must meet to be added to epoll

In theory, anything that is a file can be added to epoll; "everything is a file" is a core Linux design idea. The struct file definition itself reflects the support for epoll:

struct file {
	/*
	 * fu_list becomes invalid after file_free is called and queued via
	 * fu_rcuhead for RCU freeing
	 */
	union {
		struct list_head	fu_list;
		struct rcu_head		fu_rcuhead;
	} f_u;
	struct path		f_path;
#define f_dentry	f_path.dentry	/* directory entry associated with the file */
#define f_vfsmnt	f_path.mnt	/* mounted filesystem containing the file */
	const struct file_operations	*f_op;	/* file operations table pointer */
	spinlock_t		f_lock;		/* f_ep_links, f_flags, no IRQ */
	atomic_long_t		f_count;	/* reference counter for the file object */
	unsigned int		f_flags;	/* flags specified when the file was opened */
	fmode_t			f_mode;		/* process access mode */
	loff_t			f_pos;		/* current file offset */
	struct fown_struct	f_owner;	/* data for I/O event notification via signals */
	const struct cred	*f_cred;
	struct file_ra_state	f_ra;		/* file read-ahead state */
	u64			f_version;	/* version number, incremented after each use */
#ifdef CONFIG_SECURITY
	void			*f_security;
#endif
	/* needed for tty driver, and maybe others */
	void			*private_data;	/* data for a specific filesystem or device driver */
#ifdef CONFIG_EPOLL
	/* Used by fs/eventpoll.c to link all the hooks to this file */
	struct list_head	f_ep_links;	/* head of the file's epoll wait list */
#endif /* #ifdef CONFIG_EPOLL */
	struct address_space	*f_mapping;	/* pointer to the file's address space object */
#ifdef CONFIG_DEBUG_WRITECOUNT
	unsigned long		f_mnt_write_state;
#endif
};

However, for a file to be added and have its events polled, it must meet two requirements:

1. The file's file_operations must implement the poll operation.

2. A callback function must be registered on the file's wait queue so that the waiting process can be woken up.

The relationships among epoll's key data structures are as follows (the original diagram is not reproduced in this copy):


How epoll polls for events

Let's take a TCP socket as an example and analyze how events are polled.

When an FD is added to epoll via the epoll_ctl system call, the ep_insert function is executed:

/*
 * Must be called with "mtx" held.
 */
static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
		     struct file *tfile, int fd)
{
	int error, revents, pwake = 0;
	unsigned long flags;
	struct epitem *epi;
	struct ep_pqueue epq;

	if (unlikely(atomic_read(&ep->user->epoll_watches) >= max_user_watches))
		return -ENOSPC;
	if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
		return -ENOMEM;

	/* Item initialization follows here ... */
	INIT_LIST_HEAD(&epi->rdllink);
	INIT_LIST_HEAD(&epi->fllink);
	INIT_LIST_HEAD(&epi->pwqlist);
	epi->ep = ep;
	ep_set_ffd(&epi->ffd, tfile, fd);
	epi->event = *event;
	epi->nwait = 0;
	epi->next = EP_UNACTIVE_PTR;

	/* Initialize the poll table using the queue callback */
	epq.epi = epi;
	init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);

	/*
	 * Attach the item to the poll hooks and get current event bits.
	 * We can safely use the file* here because its usage count has
	 * been increased by the caller of this function. Note that after
	 * this operation completes, the poll callback can start hitting
	 * the new item.
	 */
	revents = tfile->f_op->poll(tfile, &epq.pt);

	/*
	 * We have to check if something went wrong during the poll wait queue
	 * install process. Namely an allocation for a wait queue failed due
	 * to high memory pressure.
	 */
	error = -ENOMEM;
	if (epi->nwait < 0)
		goto error_unregister;

	/* Add the current item to the list of active epoll hooks for this file */
	spin_lock(&tfile->f_lock);
	list_add_tail(&epi->fllink, &tfile->f_ep_links);
	spin_unlock(&tfile->f_lock);

	/*
	 * Add the current item to the RB tree. All RB tree operations are
	 * protected by "mtx", and ep_insert() is called with "mtx" held.
	 */
	ep_rbtree_insert(ep, epi);

	/* We have to drop the new item inside our item list to keep track of it */
	spin_lock_irqsave(&ep->lock, flags);

	/* If the file is already "ready" we drop it inside the ready list */
	if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
		list_add_tail(&epi->rdllink, &ep->rdllist);

		/* Notify waiting tasks that events are available */
		if (waitqueue_active(&ep->wq))
			wake_up_locked(&ep->wq);
		if (waitqueue_active(&ep->poll_wait))
			pwake++;
	}
	...
	return error;
}
First, an ep_pqueue structure is initialized to register the wait-queue callback. Then, when the file's poll function runs, the registered callback is executed, adding a wait-queue node to the file's corresponding wait queue:

/*
 * This is the callback that is used to add our wait queue to the
 * target file wakeup lists.
 */
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
				 poll_table *pt)
{
	struct epitem *epi = ep_item_from_epqueue(pt);
	struct eppoll_entry *pwq;

	if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
		pwq->whead = whead;
		pwq->base = epi;
		add_wait_queue(whead, &pwq->wait);
		list_add_tail(&pwq->llink, &epi->pwqlist);
		epi->nwait++;
	} else {
		/* We have to signal that an error occurred */
		epi->nwait = -1;
	}
}
ep_poll_callback is the callback function executed when the descriptor's state changes or a relevant event occurs. For TCP, a state change triggers the following call flow:

sock_def_wakeup (installed on the sock by sock_init_data) ---> wake_up_interruptible_all ---> __wake_up ---> curr->func (which is ep_poll_callback for file descriptors added to epoll)

/*
 * This is the callback that is passed to the wait queue wakeup
 * mechanism. It is called by the stored file descriptors when they
 * have events to report.
 */
static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
	int pwake = 0;
	unsigned long flags;
	struct epitem *epi = ep_item_from_wait(wait);
	struct eventpoll *ep = epi->ep;

	spin_lock_irqsave(&ep->lock, flags);

	/*
	 * If the event mask does not contain any poll(2) event, we consider the
	 * descriptor to be disabled. This condition is likely the effect of the
	 * EPOLLONESHOT bit that disables the descriptor when an event is received,
	 * until the next EPOLL_CTL_MOD will be issued.
	 */
	if (!(epi->event.events & ~EP_PRIVATE_BITS))
		goto out_unlock;

	/*
	 * Check the events coming with the callback. At this stage, not
	 * every device reports the events in the "key" parameter of the
	 * callback. We need to be able to handle both cases here, hence the
	 * test for "key" != NULL before the event match test.
	 */
	if (key && !((unsigned long) key & epi->event.events))
		goto out_unlock;

	/*
	 * If we are transferring events to userspace, we can hold no locks
	 * (because we're accessing user memory, and because of linux f_op->poll()
	 * semantics). All the events that happen during that period of time are
	 * chained in ep->ovflist and requeued later on.
	 */
	if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
		if (epi->next == EP_UNACTIVE_PTR) {
			epi->next = ep->ovflist;
			ep->ovflist = epi;
		}
		goto out_unlock;
	}

	/* If this file is already in the ready list we exit soon */
	if (!ep_is_linked(&epi->rdllink))
		list_add_tail(&epi->rdllink, &ep->rdllist);

	/*
	 * Wake up (if active) both the eventpoll wait list and the ->poll()
	 * wait list.
	 */
	if (waitqueue_active(&ep->wq))
		wake_up_locked(&ep->wq);
	if (waitqueue_active(&ep->poll_wait))
		pwake++;

out_unlock:
	spin_unlock_irqrestore(&ep->lock, flags);

	/* We have to call this outside the lock */
	if (pwake)
		ep_poll_safewake(&ep->poll_wait);

	return 1;
}
What ep_poll_callback does is add the ready epitem to ep->rdllist and then wake up the process waiting on that descriptor, i.e. the process blocked in the epoll_wait system call. The wait loop inside epoll_wait checks whether rdllist is empty; if it is not, it breaks out of the loop, scans rdllist, and copies the descriptors and the events that occurred back to user-space.

At present on Linux, pipe FDs, timerfd, signalfd, eventfd, and so on can all be added to epoll; in addition, an epoll instance can itself be added as an FD to another epoll.

Libevent

libevent is an event-driven network library that runs on Windows, Linux, BSD, and other platforms; internally it wraps select, epoll, kqueue, and other system-call-based event mechanisms behind a common interface. See http://libevent.org/

