Linux Kernel Notes: How epoll Is Implemented

Last updated: 2017-04-16. Source: Internet

1. Introduction

These notes are based on kernel version 4.4.10.

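These are just brief notes I took while reading the source. If you want to understand how epoll is implemented, I strongly recommend the following articles: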

The Implementation of epoll(1)

The Implementation of epoll(2)

The Implementation of epoll(3)

The Implementation of epoll(4)

2. epoll_create()

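The epoll_create() system call creates an epoll instance and returns a file descriptor (fd) that refers to it. Inside the kernel, every epoll instance corresponds one-to-one to an object of type struct eventpoll. This object is the heart of epoll, and it is defined in fs/eventpoll.c.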

The entry point for epoll_create() is in the same file. The key steps in the source are as follows:

First, it creates a struct eventpoll object:

struct eventpoll *ep = NULL;
...
error = ep_alloc(&ep);
if (error < 0)
    return error;

Next, it allocates an unused file descriptor:

fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));
if (fd < 0) {
    error = fd;
    goto out_free_ep;
}

It then creates a struct file object, sets the file's struct file_operations *f_op to the global eventpoll_fops, and points the file's void *private_data at the newly created eventpoll object ep:

struct file *file;
...
file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
                          O_RDWR | (flags & O_CLOEXEC));
if (IS_ERR(file)) {
    error = PTR_ERR(file);
    goto out_free_fd;
}

Next, it sets the file pointer inside the eventpoll object:

ep->file = file;

Finally, it installs the file descriptor in the current process's file descriptor table and returns it to the user:

fd_install(fd, file);
return fd;

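When these steps are done, the main structures are linked as follows: the returned fd indexes the new struct file in the process's file descriptor table, the file's f_op points to eventpoll_fops and its private_data points to the struct eventpoll, and ep->file points back to the struct file.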
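One part of this path is easy to observe from user space: the kernel passes flags & O_CLOEXEC to get_unused_fd_flags(), so the close-on-exec flag requested through epoll_create1() should show up on the returned descriptor. The following is a minimal sketch of that check, not kernel code. The helper name epoll_fd_has_cloexec is made up for this example.

```c
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Create an epoll instance with the given flags and report whether the
 * returned descriptor has FD_CLOEXEC set.
 * Returns 1 if set, 0 if not set, -1 on error.
 * (epoll_fd_has_cloexec is a hypothetical helper for this sketch.)
 */
int epoll_fd_has_cloexec(int flags)
{
    int epfd = epoll_create1(flags);
    if (epfd < 0)
        return -1;

    /* F_GETFD returns the descriptor flags, of which FD_CLOEXEC is one. */
    int fdflags = fcntl(epfd, F_GETFD);
    close(epfd);
    if (fdflags < 0)
        return -1;

    return (fdflags & FD_CLOEXEC) ? 1 : 0;
}
```

Calling epoll_fd_has_cloexec(EPOLL_CLOEXEC) and epoll_fd_has_cloexec(0) from a small main() is enough to compare the two cases.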

3. epoll_ctl()

The epoll_ctl() system call is defined in the kernel as follows. See the epoll_ctl man page for the meaning of each parameter:

SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, struct epoll_event __user *, event)

epoll_ctl() first checks whether op is a delete operation. If it is not, the event argument is copied from user space into the kernel:

struct epoll_event epds;
...
if (ep_op_has_event(op) &&
    copy_from_user(&epds, event, sizeof(struct epoll_event)))
        goto error_return;

ep_op_has_event() simply checks that op is not a delete operation:

static inline int ep_op_has_event(int op)
{
    return op != EPOLL_CTL_DEL;
}
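Because the event argument is not copied for EPOLL_CTL_DEL, user space can pass NULL for it when deleting (the epoll_ctl man page documents this for Linux 2.6.9 and later). The sketch below shows this from user space with a pipe. The function name add_then_delete_with_null_event is invented for illustration.

```c
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Add the read end of a pipe to an epoll instance, then delete it again
 * while passing NULL as the event argument.
 * Returns 0 if both calls succeed, -1 otherwise.
 * (Hypothetical helper for this sketch.)
 */
int add_then_delete_with_null_event(void)
{
    int pipefd[2];
    int result = -1;

    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;
    if (pipe(pipefd) < 0) {
        close(epfd);
        return -1;
    }

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipefd[0] };

    /* ADD needs a valid event; DEL does not read it at all. */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0 &&
        epoll_ctl(epfd, EPOLL_CTL_DEL, pipefd[0], NULL) == 0)
        result = 0;

    close(pipefd[0]);
    close(pipefd[1]);
    close(epfd);
    return result;
}
```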

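Next, it checks whether the user set the EPOLLEXCLUSIVE flag. This flag exists only since kernel 4.5, so the check below comes from a kernel newer than 4.4.10. It mainly addresses the "thundering herd" problem that arises when the same file descriptor is added to several epoll instances at once; the epoll_ctl man page describes it in detail. Setting the flag is subject to some restrictions: for example, it can only be set in an EPOLL_CTL_ADD operation, and the target file descriptor must not itself be an epoll instance. The code below checks these restrictions: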

/*
 * epoll adds to the wakeup queue at EPOLL_CTL_ADD time only,
 * so EPOLLEXCLUSIVE is not allowed for a EPOLL_CTL_MOD operation.
 * Also, we do not currently supported nested exclusive wakeups.
 */
if (epds.events & EPOLLEXCLUSIVE) {
    if (op == EPOLL_CTL_MOD)
        goto error_tgt_fput;
    if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) ||
            (epds.events & ~EPOLLEXCLUSIVE_OK_BITS)))
        goto error_tgt_fput;
}

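Next, starting from the file descriptors passed in, it obtains the struct file objects step by step, and then gets the struct eventpoll object from the private_data field of the struct file: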

struct fd f, tf;
struct eventpoll *ep;
...
f = fdget(epfd);
...
tf = fdget(fd);
...
ep = f.file->private_data;

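If the file descriptor being added itself represents an epoll instance, a cycle (and hence an infinite loop) could result. The kernel checks for this case and returns an error if a cycle would be created. I have not read this part of the code closely yet, so I do not reproduce it here.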
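Although I skip the kernel side of the cycle check, its effect is visible from user space: the epoll_ctl man page lists ELOOP for an EPOLL_CTL_ADD that would create a circular chain of epoll instances watching one another. The following is a small sketch of that behaviour. The function name errno_of_epoll_cycle is made up for this example.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Make ep_a watch ep_b, then try to make ep_b watch ep_a.
 * Returns the errno of the second epoll_ctl() call (0 if it
 * unexpectedly succeeds), or -1 if the setup itself fails.
 * (Hypothetical helper for this sketch.)
 */
int errno_of_epoll_cycle(void)
{
    struct epoll_event ev = { .events = EPOLLIN };
    int result = -1;
    int ep_a = epoll_create1(0);
    int ep_b = epoll_create1(0);

    if (ep_a < 0 || ep_b < 0)
        goto out;

    /* ep_a -> ep_b: nesting one epoll instance in another is allowed. */
    ev.data.fd = ep_b;
    if (epoll_ctl(ep_a, EPOLL_CTL_ADD, ep_b, &ev) < 0)
        goto out;

    /* ep_b -> ep_a would close the loop, so the kernel should refuse. */
    ev.data.fd = ep_a;
    if (epoll_ctl(ep_b, EPOLL_CTL_ADD, ep_a, &ev) == 0)
        result = 0;
    else
        result = errno;

out:
    if (ep_a >= 0)
        close(ep_a);
    if (ep_b >= 0)
        close(ep_b);
    return result;
}
```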

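Next, it searches the epoll instance's red-black tree for the epitem object that corresponds to the monitored file. If none exists, meaning the file has not been added before, the result is NULL.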

struct epitem *epi;
...
epi = ep_find(ep, tf.file, fd);

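ep_find() is essentially a red-black tree lookup. The comparison function used for both lookup and insertion is ep_cmp_ffd(): it first compares the addresses of the struct file objects and, if they are equal, then compares the file descriptor numbers. One case in which the struct file addresses are equal is when dup() has made different file descriptors refer to the same struct file object.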

static inline int ep_cmp_ffd(struct epoll_filefd *p1,
                             struct epoll_filefd *p2)
{
    return (p1->file > p2->file ? +1:
            (p1->file < p2->file ? -1 : p1->fd - p2->fd));
}
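To make the ordering concrete, here is a user-space imitation of the same two-level comparison. It is only a sketch: struct toy_filefd and toy_cmp_ffd are names invented for this example, the file field is a plain pointer standing in for struct file *, and the addresses are compared as integers so that the comparison stays well defined in portable C.

```c
#include <stdint.h>

/* User-space stand-in for the kernel's struct epoll_filefd. */
struct toy_filefd {
    const void *file;   /* plays the role of struct file * */
    int fd;             /* file descriptor number */
};

/*
 * Order keys by file address first, then by descriptor number.
 * Returns <0, 0 or >0, like the kernel's ep_cmp_ffd().
 */
int toy_cmp_ffd(const struct toy_filefd *p1, const struct toy_filefd *p2)
{
    uintptr_t f1 = (uintptr_t)p1->file;
    uintptr_t f2 = (uintptr_t)p2->file;

    if (f1 != f2)
        return f1 > f2 ? 1 : -1;
    return p1->fd - p2->fd;
}
```

Two keys with the same file pointer but different fd values, which is the dup() case, compare as different, so both can live in the tree at the same time.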

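Next, the handling depends on the value of op. Here we only look at the add operation, where op equals EPOLL_CTL_ADD. It first checks whether the epitem pointer returned by the previous step is NULL. If it is not NULL, the file has already been added and an error is returned; otherwise ep_insert() is called to perform the actual insertion. Before adding the file, the kernel automatically adds the POLLERR and POLLHUP events to its event mask.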

if (!epi) {
    epds.events |= POLLERR | POLLHUP;
    error = ep_insert(ep, &epds, tf.file, fd, full_check);
} else
    error = -EEXIST;
if (full_check)
    clear_tfile_check_list();
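Both branches of this code can be seen from user space. The sketch below, which uses a pipe and two helper functions whose names are invented for this example, first adds the same descriptor twice and expects EEXIST, then registers only EPOLLIN and checks that EPOLLHUP is still reported once the write end is closed.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Add the same descriptor twice and return the errno of the second
 * attempt (0 if it unexpectedly succeeds, -1 if the setup fails).
 */
int errno_of_duplicate_add(void)
{
    int pipefd[2];
    int result = -1;
    int epfd = epoll_create1(0);

    if (epfd < 0)
        return -1;
    if (pipe(pipefd) < 0) {
        close(epfd);
        return -1;
    }

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipefd[0] };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0) {
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0)
            result = 0;
        else
            result = errno;
    }

    close(pipefd[0]);
    close(pipefd[1]);
    close(epfd);
    return result;
}

/*
 * Register interest in EPOLLIN only, close the write end of the pipe,
 * and return the event mask epoll_wait() reports (0 on any failure).
 */
uint32_t events_after_writer_closes(void)
{
    int pipefd[2];
    uint32_t mask = 0;
    int epfd = epoll_create1(0);

    if (epfd < 0)
        return 0;
    if (pipe(pipefd) < 0) {
        close(epfd);
        return 0;
    }

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipefd[0] };
    struct epoll_event got;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0) {
        close(pipefd[1]);       /* hang up the writer side */
        pipefd[1] = -1;
        if (epoll_wait(epfd, &got, 1, 0) == 1)
            mask = got.events;
    }

    close(pipefd[0]);
    if (pipefd[1] >= 0)
        close(pipefd[1]);
    close(epfd);
    return mask;
}
```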

After ep_insert() returns, the full_check flag is tested. This flag is related to the cycle detection mentioned above and is also skipped here.

4. ep_insert()

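ep_insert() first checks whether the number of files the current user is watching through epoll has reached the limit (max_user_watches). If not, it creates an epitem object for the file being added: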

int error, revents, pwake = 0;
unsigned long flags;
long user_watches;
struct epitem *epi;
struct ep_pqueue epq;

user_watches = atomic_long_read(&ep->user->epoll_watches);
if (unlikely(user_watches >= max_user_watches))
    return -ENOSPC;
if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
    return -ENOMEM;

Next comes the initialization of the epitem:

INIT_LIST_HEAD(&epi->rdllink);
INIT_LIST_HEAD(&epi->fllink);
INIT_LIST_HEAD(&epi->pwqlist);
epi->ep = ep;
ep_set_ffd(&epi->ffd, tfile, fd);
epi->event = *event;
epi->nwait = 0;
epi->next = EP_UNACTIVE_PTR;
if (epi->event.events & EPOLLWAKEUP) {
    error = ep_create_wakeup_source(epi);
    if (error)
        goto error_create_wakeup_source;
} else {
    RCU_INIT_POINTER(epi->ws, NULL);
}

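Next comes an important step: attaching the epitem to the wait queue of the monitored file. A wait queue is essentially a linked list of callback functions, defined in include/linux/wait.h. Because each file system implements things differently, the wait queue cannot be obtained directly from the struct file object. Instead, the poll operation of the struct file is invoked, and it hands back the object's wait queue through a callback. The callback installed here is ep_ptable_queue_proc: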

struct ep_pqueue epq;
...
/* Initialize the poll table using the queue callback */
epq.epi = epi;
init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);

/*
 * Attach the item to the poll hooks and get current event bits.
 * We can safely use the file* here because its usage count has
 * been increased by the caller of this function. Note that after
 * this operation completes, the poll callback can start hitting
 * the new item.
 */
revents = ep_item_poll(epi, &epq.pt);

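The purpose of struct ep_pqueue in the code above is to let the poll callback recover the corresponding epitem object. This technique is very common in the Linux kernel.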
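The idea can be sketched in user space as follows: the callback receives only a pointer to the embedded table, and recovers the enclosing wrapper from the offset of that member. This is a simplified illustration, not the kernel code. All toy_* names are invented here, and toy_container_of is a plain offsetof-based stand-in for the kernel's container_of macro.

```c
#include <stddef.h>

/* Stand-in for the kernel's poll_table: it only carries the callback. */
struct toy_poll_table {
    void (*queue_proc)(struct toy_poll_table *pt);
};

/* Stand-in for struct epitem. */
struct toy_item {
    int id;
    int callbacks_seen;
};

/* Stand-in for struct ep_pqueue: the poll table plus the item it serves. */
struct toy_pqueue {
    struct toy_poll_table pt;
    struct toy_item *item;
};

/* Recover a pointer to the enclosing struct from a pointer to a member. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* The callback is handed only the poll table, yet still finds its item. */
static void toy_queue_proc(struct toy_poll_table *pt)
{
    struct toy_pqueue *pq = toy_container_of(pt, struct toy_pqueue, pt);

    pq->item->callbacks_seen++;
}

/* Plays the role of a file's poll method: it knows only the poll table. */
static void toy_file_poll(struct toy_poll_table *pt)
{
    pt->queue_proc(pt);
}

/* Mirrors the setup in ep_insert(): wrap the item, then call poll. */
int toy_register(struct toy_item *item)
{
    struct toy_pqueue pq = {
        .pt = { .queue_proc = toy_queue_proc },
        .item = item,
    };

    toy_file_poll(&pq.pt);
    return item->callbacks_seen;
}
```

The benefit of this layout is that the generic poll interface never needs to know about epoll's own types; the wrapper carries the extra context alongside the generic table.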

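In the ep_ptable_queue_proc callback, the kernel creates a struct eppoll_entry object and sets the callback of its wait queue entry to ep_poll_callback(). In other words, when an event arrives on the monitored file, for example when a socket receives data, ep_poll_callback() is invoked. The code of ep_ptable_queue_proc() is as follows: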

static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
                                 poll_table *pt)
{
    struct epitem *epi = ep_item_from_epqueue(pt);
    struct eppoll_entry *pwq;

    if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
        init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
        pwq->whead = whead;
        pwq->base = epi;
        if (epi->event.events & EPOLLEXCLUSIVE)
            add_wait_queue_exclusive(whead, &pwq->wait);
        else
            add_wait_queue(whead, &pwq->wait);
        list_add_tail(&pwq->llink, &epi->pwqlist);
        epi->nwait++;
    } else {
        /* We have to signal that an error occurred */
        epi->nwait = -1;
    }
}

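The relationship among eppoll_entry, epitem, and the related structures is as follows: each eppoll_entry is linked into its epitem's pwqlist through llink, its base field points back to the epitem, its whead field points to the file's wait queue head, and its wait entry sits on that wait queue with ep_poll_callback() as the callback.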

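Back in ep_insert(): once the call to ep_item_poll() completes, the fllink field of the epitem is added to the f_ep_links list of the struct file. This makes it possible to find every struct epitem associated with a struct file, and from each struct epitem to find the struct eventpoll of the epoll instance that owns it.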

spin_lock(&tfile->f_lock);
list_add_tail_rcu(&epi->fllink, &tfile->f_ep_links);
spin_unlock(&tfile->f_lock);
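The reason a file keeps a list of epitems is that one file can be watched by several epoll instances at once. A quick user-space sketch of that situation, using a pipe and a function name made up for this example, is:

```c
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Watch one pipe read end from two epoll instances, make it readable,
 * and count how many instances report it.
 * Returns -1 if the setup fails.
 * (Hypothetical helper for this sketch.)
 */
int instances_reporting_same_file(void)
{
    int pipefd[2];
    int ep[2] = { -1, -1 };
    int count = -1;

    if (pipe(pipefd) < 0)
        return -1;

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipefd[0] };

    /* Each ADD creates a separate epitem for the same underlying file. */
    for (int i = 0; i < 2; i++) {
        ep[i] = epoll_create1(0);
        if (ep[i] < 0 ||
            epoll_ctl(ep[i], EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
            goto out;
    }

    if (write(pipefd[1], "x", 1) != 1)
        goto out;

    count = 0;
    for (int i = 0; i < 2; i++) {
        struct epoll_event got;

        /* Timeout 0: just poll the ready list without blocking. */
        if (epoll_wait(ep[i], &got, 1, 0) == 1)
            count++;
    }

out:
    for (int i = 0; i < 2; i++)
        if (ep[i] >= 0)
            close(ep[i]);
    close(pipefd[0]);
    close(pipefd[1]);
    return count;
}
```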

Then the epitem is inserted into the red-black tree:

ep_rbtree_insert(ep, epi);

Finally, a few pieces of state are updated and the function returns, which completes the insertion.

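Before returning, it also checks once more whether the file just added already has events ready. If so, the item is added to the epoll instance's ready list. The ready list will be covered in the next part, so I skip it here.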
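This last check is why a descriptor that is already readable at the time of EPOLL_CTL_ADD is reported right away, without any new event arriving afterwards. The sketch below writes to a pipe before registering its read end. The function name ready_count_after_late_add is invented for this example.

```c
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Make a pipe readable first, add its read end to epoll afterwards,
 * and return what a non-blocking epoll_wait() reports.
 * Returns -1 if the setup fails.
 * (Hypothetical helper for this sketch.)
 */
int ready_count_after_late_add(void)
{
    int pipefd[2];
    int count = -1;
    int epfd = -1;

    if (pipe(pipefd) < 0)
        return -1;

    /* The data arrives before epoll knows anything about this pipe. */
    if (write(pipefd[1], "x", 1) != 1)
        goto out;

    epfd = epoll_create1(0);
    if (epfd < 0)
        goto out;

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipefd[0] };
    struct epoll_event got;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        goto out;

    /* Timeout 0: report only what is already on the ready list. */
    count = epoll_wait(epfd, &got, 1, 0);

out:
    if (epfd >= 0)
        close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return count;
}
```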

Finally, a diagram I drew of how these structures relate to one another: [figure not available]

This article was originally written in Chinese and published on aliyun.com.