It took me half a day to get a general idea of the epoll implementation. This article summarizes the following:
Epoll-related APIs:
1. epoll_create Function
Function declaration: int epoll_create(int size);
This function creates a file descriptor dedicated to epoll. In effect, it allocates space in the kernel to record which of the socket fds you care about have had events, and what those events were.
size is a hint for the number of fds you expect to monitor on this epoll fd; since Linux 2.6.8 the kernel ignores it, but it must still be greater than zero.
2. epoll_ctl Function
Function declaration: int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
This function is used to control events on an epoll file descriptor. You can register events, modify events, and delete events.
Parameters:
epfd: the epoll-specific file descriptor returned by epoll_create;
op: the operation to perform. The possible values are EPOLL_CTL_ADD to register an event, EPOLL_CTL_MOD to modify one, and EPOLL_CTL_DEL to delete one;
fd: the file descriptor to operate on;
event: pointer to the struct epoll_event describing the interest;
Returns 0 on success and -1 on failure.
3. epoll_wait Function
Function declaration: int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
This function is used to poll the occurrence of I/O events;
Parameters:
epfd: the epoll-specific file descriptor returned by epoll_create;
events: the array used to return the events that are ready;
maxevents: the maximum number of events to return in one call;
timeout: how long to wait for an I/O event, in milliseconds (the ep_poll source we read later confirms the unit); -1 blocks indefinitely, 0 returns immediately. -1 is the common choice.
Returns the number of ready events.
The disadvantages of select/poll are:
1. The parameters (the fd set) must be copied from user space into the kernel on every call.
2. Every call scans the full set of file descriptors.
3. At the start of every call, the current process must be put on the wait queue of each file descriptor, and after the call completes it must be removed from each of those wait queues again.
In practice, select/poll may monitor a large number of file descriptors while only a small fraction are ready on any given call, which makes them inefficient in that case. epoll is designed to split the single select/poll operation into one epoll_create + multiple epoll_ctl calls + one epoll_wait. In addition, the kernel adds a pseudo file system, "[eventpoll]", for epoll. Each epoll instance has an inode in this eventpollfs file system, whose main information is stored in a struct eventpoll; the important information about each monitored file is stored in a struct epitem. An eventpoll and its epitems are therefore in a one-to-many relationship. Because the user-space information has already been saved into the kernel by epoll_create and epoll_ctl, repeated calls to epoll_wait do not repeatedly copy parameters, scan file descriptors, or move the current process on and off every wait queue. This avoids the three shortcomings above.
Next let's take a look at their implementation:
Main code paths and files: linux/fs/eventpoll.c and linux/include/linux/eventpoll.h
SYSCALL_DEFINE1(epoll_create, int, size)
{
	if (size <= 0)
		return -EINVAL;

	return sys_epoll_create1(0);
}
SYSCALL_DEFINE1(epoll_create1, int, flags)
{
	int error;
	struct eventpoll *ep = NULL;

	/* Check the EPOLL_* constant for consistency. */
	BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);

	if (flags & ~EPOLL_CLOEXEC)
		return -EINVAL;
	/*
	 * Create the internal data structure ("struct eventpoll").
	 */
	error = ep_alloc(&ep);	/* allocate the eventpoll structure */
	if (error < 0)
		return error;
	/*
	 * Creates all the items needed to setup an eventpoll file. That is,
	 * a file structure and a free file descriptor.
	 */
	error = anon_inode_getfd("[eventpoll]", &eventpoll_fops, ep,
				 flags & O_CLOEXEC);
	if (error < 0)
		ep_free(ep);

	return error;
}

Following the Linux idea that everything is a file, anon_inode_getfd allocates an inode from the eventpoll file system, attaches the eventpoll_fops file-operation set, and stores ep in file->private_data; later we will see ep being retrieved from private_data again. Finally it grabs a free fd, associates it with the file, and that fd is what gets returned to user space.
Main data structure:

struct eventpoll {
	spinlock_t lock;

	struct mutex mtx;

	/* Wait queue used by sys_epoll_wait() */
	wait_queue_head_t wq;

	/* Wait queue used by file->poll() */
	wait_queue_head_t poll_wait;

	/* List of ready file descriptors */
	struct list_head rdllist;

	/* RB tree root used to store monitored fd structs */
	struct rb_root rbr;	/* red/black tree root node */

	/*
	 * This is a single linked list that chains all the "struct epitem" that
	 * happened while transferring ready events to userspace w/out
	 * holding ->lock.
	 */
	struct epitem *ovflist;

	/* The user that created the eventpoll descriptor */
	struct user_struct *user;
};
static int ep_alloc(struct eventpoll **pep)	/* the allocation function */
{
	int error;
	struct user_struct *user;
	struct eventpoll *ep;

	user = get_current_user();
	error = -ENOMEM;
	ep = kzalloc(sizeof(*ep), GFP_KERNEL);
	if (unlikely(!ep))
		goto free_uid;

	spin_lock_init(&ep->lock);	/* initialize locks, wait queues and lists */
	mutex_init(&ep->mtx);
	init_waitqueue_head(&ep->wq);
	init_waitqueue_head(&ep->poll_wait);
	INIT_LIST_HEAD(&ep->rdllist);
	ep->rbr = RB_ROOT;		/* empty red/black tree root */
	ep->ovflist = EP_UNACTIVE_PTR;
	ep->user = user;

	*pep = ep;
	return 0;

free_uid:
	free_uid(user);
	return error;
}
That is the data-structure relationship after epoll_create. Let's continue with epoll_ctl.
SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
		struct epoll_event __user *, event)
{
	int error;
	struct file *file, *tfile;
	struct eventpoll *ep;
	struct epitem *epi;
	struct epoll_event epds;

	error = -EFAULT;
	if (ep_op_has_event(op) &&
	    copy_from_user(&epds, event, sizeof(struct epoll_event)))
		/* copy the event from user space into the kernel */
		goto error_return;

	/* Get the "struct file *" for the eventpoll file */
	error = -EBADF;
	file = fget(epfd);	/* epfd was created by epoll_create */
	if (!file)
		goto error_return;

	/* Get the "struct file *" for the target file */
	tfile = fget(fd);	/* fd is the descriptor to be monitored */
	if (!tfile)
		goto error_fput;

	/* The target file descriptor must support poll */
	error = -EPERM;
	if (!tfile->f_op || !tfile->f_op->poll)
		goto error_tgt_fput;

	/*
	 * We have to check that the file structure underneath the file descriptor
	 * the user passed to us _is_ an eventpoll file. And also we do not permit
	 * adding an epoll file descriptor inside itself.
	 */
	error = -EINVAL;
	if (file == tfile || !is_file_epoll(file))
		goto error_tgt_fput;

	/*
	 * At this point it is safe to assume that the "private_data" contains
	 * our own data structure.
	 */
	ep = file->private_data;	/* the eventpoll created by epoll_create */

	mutex_lock(&ep->mtx);

	/*
	 * Try to lookup the file inside our RB tree. Since we grabbed "mtx"
	 * above, we can be sure to be able to use the item looked up by
	 * ep_find() till we release the mutex.
	 */
	epi = ep_find(ep, tfile, fd);
	/* Look in ep for an epitem corresponding to fd. epitem is a new data
	 * structure, shown below; ep_find performs the red/black tree search
	 * and returns the result. */

	error = -EINVAL;
	switch (op) {
	case EPOLL_CTL_ADD:		/* the add operation */
		if (!epi) {		/* not found, so insert it */
			epds.events |= POLLERR | POLLHUP;
			error = ep_insert(ep, &epds, tfile, fd);	/* do the insert */
		} else			/* add requested but an epitem already exists */
			error = -EEXIST;
		break;
	case EPOLL_CTL_DEL:		/* the delete operation */
		if (epi)
			error = ep_remove(ep, epi);
		else
			error = -ENOENT;
		break;
	case EPOLL_CTL_MOD:		/* the modify operation */
		if (epi) {
			epds.events |= POLLERR | POLLHUP;
			error = ep_modify(ep, epi, &epds);
		} else
			error = -ENOENT;
		break;
	}
	mutex_unlock(&ep->mtx);

error_tgt_fput:
	fput(tfile);
error_fput:
	fput(file);
error_return:
	return error;
}
Main data structure: epitem describes one monitored descriptor; the epitems are linked and managed through the red/black tree in the eventpoll structure.

struct epitem {
	/* RB tree node used to link this structure to the eventpoll RB tree */
	struct rb_node rbn;	/* red/black tree node information */

	/* List header used to link this structure to the eventpoll ready list */
	struct list_head rdllink;

	/*
	 * Works together "struct eventpoll"->ovflist in keeping the
	 * single linked chain of items.
	 */
	struct epitem *next;

	/* The file descriptor information this item refers to */
	struct epoll_filefd ffd;

	/* Number of active wait queue attached to poll operations */
	int nwait;

	/* List containing poll wait queues */
	struct list_head pwqlist;

	/* The "container" of this item */
	struct eventpoll *ep;

	/* List header used to link this item to the "struct file" items list */
	struct list_head fllink;

	/* The structure that describe the interested events and the source fd */
	struct epoll_event event;
};
static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
		     struct file *tfile, int fd)
{
	int error, revents, pwake = 0;
	unsigned long flags;
	struct epitem *epi;
	struct ep_pqueue epq;	/* poll queue structure */

	if (unlikely(atomic_read(&ep->user->epoll_watches) >=
		     max_user_watches))
		return -ENOSPC;
	if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
		return -ENOMEM;	/* allocate the epitem structure */

	/* Item initialization follow here ... */
	INIT_LIST_HEAD(&epi->rdllink);
	INIT_LIST_HEAD(&epi->fllink);
	INIT_LIST_HEAD(&epi->pwqlist);
	epi->ep = ep;
	ep_set_ffd(&epi->ffd, tfile, fd);	/* fill in the ffd structure */
	epi->event = *event;
	epi->nwait = 0;
	epi->next = EP_UNACTIVE_PTR;

	/* Initialize the poll table using the queue callback */
	epq.epi = epi;
	init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
	/* The most important step: register the queueing callback */

	revents = tfile->f_op->poll(tfile, &epq.pt);
	/* Call the target file's poll function, which in turn calls
	 * ep_ptable_queue_proc; the call chain goes through the file's f_op
	 * operation set and we will not analyze it here. */

	error = -ENOMEM;
	if (epi->nwait < 0)
		goto error_unregister;

	spin_lock(&tfile->f_lock);
	list_add_tail(&epi->fllink, &tfile->f_ep_links);
	spin_unlock(&tfile->f_lock);

	ep_rbtree_insert(ep, epi);	/* insert into the red/black tree */

	spin_lock_irqsave(&ep->lock, flags);
	if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
		list_add_tail(&epi->rdllink, &ep->rdllist);
		/* If an expected event has already occurred, hang rdllink on
		 * the rdllist ready queue right away. */

		/* Notify waiting tasks that events are available */
		if (waitqueue_active(&ep->wq))
			wake_up_locked(&ep->wq);	/* decide whether to wake up */
		if (waitqueue_active(&ep->poll_wait))
			pwake++;
	}
	spin_unlock_irqrestore(&ep->lock, flags);

	atomic_inc(&ep->user->epoll_watches);

	/* We have to call this outside the lock */
	if (pwake)
		ep_poll_safewake(&ep->poll_wait);	/* perform the wakeup */

	return 0;

error_unregister:
	ep_unregister_pollwait(ep, epi);
	spin_lock_irqsave(&ep->lock, flags);
	if (ep_is_linked(&epi->rdllink))
		list_del_init(&epi->rdllink);
	spin_unlock_irqrestore(&ep->lock, flags);

	kmem_cache_free(epi_cache, epi);
	return error;
}
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
				 poll_table *pt)
{
	struct epitem *epi = ep_item_from_epqueue(pt);
	struct eppoll_entry *pwq;

	if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
		/* Allocate an eppoll_entry structure, much as select/poll does */
		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
		/* Initialize the wait queue node and register the notification
		 * callback function */
		pwq->whead = whead;
		pwq->base = epi;
		add_wait_queue(whead, &pwq->wait);
		/* whead here is the wait queue head supplied by the monitored
		 * file's own poll implementation */
		list_add_tail(&pwq->llink, &epi->pwqlist);
		epi->nwait++;
	} else {
		/* We have to signal that an error occurred */
		epi->nwait = -1;
	}
}

Main data structures:

struct eppoll_entry {
	/* List header used to link this structure to the "struct epitem" */
	struct list_head llink;

	/* The "base" pointer is set to the container "struct epitem" */
	struct epitem *base;

	/*
	 * Wait queue item that will be linked to the target file wait
	 * queue head.
	 */
	wait_queue_t wait;

	/* The wait queue head that linked the "wait" wait queue item */
	wait_queue_head_t *whead;
};

struct ep_pqueue {
	poll_table pt;
	struct epitem *epi;
};
The most important thing ep_insert does in the code above is to create a struct eppoll_entry, set its wakeup callback function to ep_poll_callback, and add it to the device's wait queue (note that the whead here is the wait queue that every device driver must provide, as described in the previous chapter). Only then can ep_poll_callback be called when the device becomes ready and its wait queue is woken. With select/poll, every system call makes the operating system hang current (the calling process) on the wait queue of every device behind every fd; with thousands of fds, you can imagine how expensive this repeated hanging becomes on every call. epoll hangs current only once, at epoll_ctl time (that first time is unavoidable), and instead gives each fd an instruction: "call the callback function when you become ready." When a device has an event, the callback puts the fd onto rdllist, and each call to epoll_wait merely collects the fds already on rdllist. By using callbacks this cleverly, epoll implements a much more efficient event-driven model. Picture the fairly involved relationships between these structures.
What does ep_poll_callback do? It simply takes the epitem (representing one fd) on the red/black tree whose event arrived and inserts it into ep->rdllist: list_add_tail(&epi->rdllink, &ep->rdllist); so that by the time epoll_wait returns, rdllist holds exactly the ready fds. Finally, let's look at the wait function.
SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
		int, maxevents, int, timeout)
{
	int error;
	struct file *file;
	struct eventpoll *ep;

	/* The maximum number of event must be greater than zero */
	if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
		return -EINVAL;

	/* Verify that the area passed by the user is writeable */
	if (!access_ok(VERIFY_WRITE, events,
		       maxevents * sizeof(struct epoll_event))) {
		error = -EFAULT;
		goto error_return;
	}

	/* Get the "struct file *" for the eventpoll file */
	error = -EBADF;
	file = fget(epfd);
	if (!file)
		goto error_return;

	/*
	 * We have to check that the file structure underneath the fd
	 * the user passed to us _is_ an eventpoll file.
	 */
	error = -EINVAL;
	if (!is_file_epoll(file))
		goto error_fput;

	/*
	 * At this point it is safe to assume that the "private_data" contains
	 * our own data structure.
	 */
	ep = file->private_data;

	/* Time to fish for events ... */
	error = ep_poll(ep, events, maxevents, timeout);
	/* The core step: ep_poll effectively polls the items on the ep */

error_fput:
	fput(file);
error_return:
	return error;
}

static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
		   int maxevents, long timeout)
{
	int res, eavail;
	unsigned long flags;
	long jtimeout;
	wait_queue_t wait;

	/*
	 * Calculate the timeout by checking for the "infinite" value (-1)
	 * and the overflow condition. The passed timeout is in milliseconds,
	 * that why (t * HZ) / 1000.
	 */
	jtimeout = (timeout < 0 || timeout >= EP_MAX_MSTIMEO) ?
		MAX_SCHEDULE_TIMEOUT : (timeout * HZ + 999) / 1000;

retry:
	spin_lock_irqsave(&ep->lock, flags);

	res = 0;
	if (list_empty(&ep->rdllist)) {
		/* No wonder the efficiency is high: only ep->rdllist is checked */
		/*
		 * We don't have any available event to return to the caller.
		 * We need to sleep here, and we will be wake up by
		 * ep_poll_callback() when events will become available.
		 */
		init_waitqueue_entry(&wait, current);
		/* The list is empty, so sleep until woken */
		wait.flags |= WQ_FLAG_EXCLUSIVE;
		__add_wait_queue(&ep->wq, &wait);

		for (;;) {
			/* Woken up; go in and handle it */
			/*
			 * We don't want to sleep if the ep_poll_callback() sends us
			 * a wakeup in between. That's why we set the task state
			 * to TASK_INTERRUPTIBLE before doing the checks.
			 */
			set_current_state(TASK_INTERRUPTIBLE);
			if (!list_empty(&ep->rdllist) || !jtimeout)
				break;
			if (signal_pending(current)) {
				res = -EINTR;
				break;
			}

			spin_unlock_irqrestore(&ep->lock, flags);
			jtimeout = schedule_timeout(jtimeout);
			spin_lock_irqsave(&ep->lock, flags);
		}
		__remove_wait_queue(&ep->wq, &wait);

		set_current_state(TASK_RUNNING);
	}
	/* Is it worth to try to dig for events ? */
	eavail = !list_empty(&ep->rdllist) || ep->ovflist != EP_UNACTIVE_PTR;

	spin_unlock_irqrestore(&ep->lock, flags);

	/*
	 * Try to transfer events to user space. In case we get 0 events and
	 * there's still timeout left over, we go trying again in search of
	 * more luck.
	 */
	if (!res && eavail &&
	    !(res = ep_send_events(ep, events, maxevents)) && jtimeout)
		goto retry;
	/* ep_send_events sends the prepared fds up to user space */

	return res;
}