2017-07-20
eventfd is a relatively new inter-process communication mechanism in Linux. Unlike a semaphore, an eventfd can be used not only for communication between processes but also by the kernel to send notifications to user-space processes. In vhost, the back-end driver implementation for virtio, eventfd plays an important role as the channel between vhost and KVM. This article walks through the Linux source code to give a brief analysis of how eventfd is implemented.
eventfd is exposed to user space through the following function:
#include <sys/eventfd.h>
int eventfd (unsigned int initval, int flags);
The function returns a file descriptor that supports the usual file operations such as read, write, poll and select; here we only consider read and write. The rest of this article looks at the kernel side, starting with the eventfd2 system call.
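Before the kernel side, here is a minimal user-space sketch of how the descriptor is used. The helper name eventfd_round_trip is made up for illustration, and the sketch assumes a Linux system with glibc's <sys/eventfd.h>. The counter must be non-zero when read() is called, otherwise the blocking read would not return.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Create an eventfd with the given initial count, add `add` to it with
 * write(), then read the counter back. Returns the value read, or
 * UINT64_MAX on any failure. initval + add must be non-zero, because a
 * blocking read() on a zero counter would sleep forever here. */
uint64_t eventfd_round_trip(unsigned int initval, uint64_t add)
{
    uint64_t value = UINT64_MAX;
    int fd = eventfd(initval, 0);

    if (fd < 0)
        return UINT64_MAX;

    /* Both write() and read() transfer exactly one 64-bit integer. */
    if (write(fd, &add, sizeof(add)) == (ssize_t)sizeof(add)) {
        if (read(fd, &value, sizeof(value)) != (ssize_t)sizeof(value))
            value = UINT64_MAX;
    }
    close(fd);
    return value;
}
```

A caller could check the behaviour with assert(eventfd_round_trip(3, 4) == 7): the write adds 4 to the initial count of 3, and the read returns the sum. The kernel entry point behind the call is listed next.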
SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
{
    int fd, error;
    struct file *file;
    error = get_unused_fd_flags(flags & EFD_SHARED_FCNTL_FLAGS);
    if (error < 0)
        return error;
    fd = error;
    file = eventfd_file_create(count, flags);
    if (IS_ERR(file)) {
        error = PTR_ERR(file);
        goto err_put_unused_fd;
    }
    fd_install(fd, file);
    return fd;
err_put_unused_fd:
    put_unused_fd(fd);
    return error;
}
The code is simple. It first obtains a free file descriptor, which is no different from an ordinary file descriptor, and then calls eventfd_file_create to create a file structure. That function does the eventfd-specific setup:
struct file *eventfd_file_create(unsigned int count, int flags)
{
    struct file *file;
    struct eventfd_ctx *ctx;
    /* Check the EFD_* constants for consistency. */
    BUILD_BUG_ON(EFD_CLOEXEC != O_CLOEXEC);
    BUILD_BUG_ON(EFD_NONBLOCK != O_NONBLOCK);
    if (flags & ~EFD_FLAGS_SET)
        return ERR_PTR(-EINVAL);
    ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
    if (!ctx)
        return ERR_PTR(-ENOMEM);
    kref_init(&ctx->kref);
    init_waitqueue_head(&ctx->wqh);
    ctx->count = count;
    ctx->flags = flags;
    file = anon_inode_getfile("[eventfd]", &eventfd_fops, ctx,
                              O_RDWR | (flags & EFD_SHARED_FCNTL_FLAGS));
    if (IS_ERR(file))
        eventfd_free_ctx(ctx);
    return file;
}
In the kernel, each eventfd corresponds to an eventfd_ctx structure. The function allocates memory for the structure and then initializes it. Note the wait queue and count fields: the wait queue holds the tasks that have to block on this eventfd, and count is the value that read and write operate on. anon_inode_getfile is then called to obtain a file object. There is not much to say about it, except that the newly allocated eventfd_ctx is stored in the file structure's private_data member and the file is associated with eventfd's own operation table, eventfd_fops. The table implements only a few functions:
static const struct file_operations eventfd_fops = {
#ifdef CONFIG_PROC_FS
    .show_fdinfo = eventfd_show_fdinfo,
#endif
    .release = eventfd_release,
    .poll = eventfd_poll,
    .read = eventfd_read,
    .write = eventfd_write,
    .llseek = noop_llseek,
};
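The .poll entry is what lets an eventfd take part in poll() and select(). As a hedged user-space sketch (the helper name eventfd_is_readable is hypothetical), the descriptor is reported as readable only while the counter is non-zero, which is the behaviour described in the eventfd manual page:

```c
#include <assert.h>
#include <poll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Returns 1 if poll() reports the eventfd as readable (POLLIN), 0 if
 * not, and -1 on error. The timeout is 0, so the call never blocks. */
int eventfd_is_readable(unsigned int initval)
{
    struct pollfd pfd;
    int ready;
    int fd = eventfd(initval, 0);

    if (fd < 0)
        return -1;
    pfd.fd = fd;
    pfd.events = POLLIN;   /* only ask about readability */
    pfd.revents = 0;
    ready = poll(&pfd, 1, 0);
    close(fd);
    if (ready < 0)
        return -1;
    return (pfd.revents & POLLIN) ? 1 : 0;
}
```

With an initial count of 0 the helper returns 0, and with any non-zero initial count it returns 1, for example assert(eventfd_is_readable(2) == 1).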
We focus on the read and write functions. When user space issues a read on an eventfd file descriptor, the call ends up in eventfd_read from the table above.
static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
                            loff_t *ppos)
{
    struct eventfd_ctx *ctx = file->private_data;
    ssize_t res;
    __u64 cnt;
    if (count < sizeof(cnt))
        return -EINVAL;
    res = eventfd_ctx_read(ctx, file->f_flags & O_NONBLOCK, &cnt);
    if (res < 0)
        return res;
    return put_user(cnt, (__u64 __user *) buf) ? -EFAULT : sizeof(cnt);
}
The function first gets the eventfd_ctx from private_data and then checks the requested read size. cnt is 64 bits, that is 8 bytes, so a read must ask for at least 8 bytes; anything smaller fails with -EINVAL. If the size is fine, it calls eventfd_ctx_read, which fetches the count stored in eventfd_ctx. If that call fails, the error is returned; otherwise the value is copied to user space. The core of the read path is eventfd_ctx_read, which is also where a negative return value originates.
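The size check in eventfd_read can be observed from user space. The following is a small sketch under the assumption of a Linux system with glibc; the helper name short_read_errno is invented for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Try to read only 4 bytes from an eventfd. Returns the errno reported
 * by read(), 0 if the read unexpectedly succeeded, or -1 if the eventfd
 * could not be created. */
int short_read_errno(void)
{
    uint32_t small = 0;   /* deliberately smaller than 8 bytes */
    int err = 0;
    int fd = eventfd(1, 0);

    if (fd < 0)
        return -1;
    if (read(fd, &small, sizeof(small)) < 0)
        err = errno;      /* save errno before close() can change it */
    close(fd);
    return err;
}
```

Because the buffer is smaller than sizeof(__u64), the kernel takes the -EINVAL branch, so assert(short_read_errno() == EINVAL) holds. The listing of eventfd_ctx_read follows.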
ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt)
{
    ssize_t res;
    DECLARE_WAITQUEUE(wait, current);
    spin_lock_irq(&ctx->wqh.lock);
    *cnt = 0;
    res = -EAGAIN;
    if (ctx->count > 0)
        res = 0;
    else if (!no_wait) {
        /* Add the current task to the wait queue. */
        __add_wait_queue(&ctx->wqh, &wait);
        for (;;) {
            /* Mark the task as sleeping (interruptible). */
            set_current_state(TASK_INTERRUPTIBLE);
            /* The counter became non-zero: stop waiting. */
            if (ctx->count > 0) {
                res = 0;
                break;
            }
            /* A signal is pending: stop waiting so it can be handled. */
            if (signal_pending(current)) {
                res = -ERESTARTSYS;
                break;
            }
            /* Otherwise drop the lock and let the scheduler run. */
            spin_unlock_irq(&ctx->wqh.lock);
            schedule();
            spin_lock_irq(&ctx->wqh.lock);
        }
        /* Remove the task from the wait queue. */
        __remove_wait_queue(&ctx->wqh, &wait);
        /* Mark the task as runnable again. */
        __set_current_state(TASK_RUNNING);
    }
    if (likely(res == 0)) {
        /* Fetch the value and update the counter. */
        eventfd_ctx_do_read(ctx, cnt);
        /* Wake up any task waiting to write. */
        if (waitqueue_active(&ctx->wqh))
            wake_up_locked_poll(&ctx->wqh, POLLOUT);
    }
    spin_unlock_irq(&ctx->wqh.lock);
    return res;
}
This function is fairly long, so we go through it step by step. It first takes the wait queue lock so that eventfd_ctx can be accessed safely. res starts out as -EAGAIN. If count is greater than 0, res is set to 0. Otherwise count is 0 (it cannot be negative), and the behaviour depends on the no_wait argument, which is derived from the file flags: if O_NONBLOCK is set, the function does not wait and returns res directly. This is the negative return value mentioned earlier.
If O_NONBLOCK is not set, the caller blocks here because there is no value to read. The current task is added to the eventfd_ctx wait queue. DECLARE_WAITQUEUE(wait, current) deserves a mention: the macro declares and initializes a wait_queue_t object whose wake-up function is default_wake_function. After joining the queue, the task enters an infinite loop in which it sets its state to TASK_INTERRUPTIBLE and checks count. If count is greater than 0, the eventfd has been signaled, so res is set to 0 and the loop ends. If count is still 0, the task checks for a pending signal; a pending signal has to be handled first, so the loop ends with -ERESTARTSYS and the read counts as failed. If neither applies, the lock is released and the scheduler is called, and the checks are repeated when the task runs again. After the loop, the task is removed from the wait queue and its state is set back to TASK_RUNNING.
Finally, if res == 0, which covers every case where count is greater than 0, the value is read. The actual read is done by eventfd_ctx_do_read, a short helper.
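The non-blocking branch can be checked from user space. This is a minimal sketch that assumes Linux with glibc; the helper name empty_read_errno is hypothetical. It creates the eventfd with EFD_NONBLOCK, which the kernel code above treats as O_NONBLOCK, and reads while the counter is still 0.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Read from a non-blocking eventfd whose counter is 0. Returns the
 * errno reported by read(), 0 if the read unexpectedly succeeded, or
 * -1 if the eventfd could not be created. */
int empty_read_errno(void)
{
    uint64_t value = 0;
    int err = 0;
    int fd = eventfd(0, EFD_NONBLOCK);

    if (fd < 0)
        return -1;
    if (read(fd, &value, sizeof(value)) < 0)
        err = errno;      /* save errno before close() can change it */
    close(fd);
    return err;
}
```

Since count is 0 and no_wait is set, eventfd_ctx_read returns its initial -EAGAIN, so assert(empty_read_errno() == EAGAIN) holds. Without EFD_NONBLOCK the same read would sleep in the loop described above. The helper eventfd_ctx_do_read is listed next.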
static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
{
    *cnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
    ctx->count -= *cnt;
}
If the EFD_SEMAPHORE flag is not set, the whole count is returned and then subtracted from count, which resets count to 0. If EFD_SEMAPHORE is set (the flag is available since Linux 2.6.30 according to the manual page), the eventfd behaves like a semaphore: the read returns 1 and count is decremented by 1. Either way count becomes smaller after the read, so a writer that was blocked on the eventfd may now be able to proceed. That is why eventfd_ctx_read checks whether any task is waiting on the queue and, if so, calls wake_up_locked_poll to wake it up.
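The difference between the two modes of eventfd_ctx_do_read can be sketched from user space. The helper read_twice is invented for this example, and it assumes Linux with glibc. It uses EFD_NONBLOCK so that a read on an empty counter fails instead of sleeping.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Create an eventfd with the given initial count and flags, then read
 * it twice without blocking. out[i] receives the value of read i, or 0
 * if that read failed (for example because the counter was empty). */
void read_twice(unsigned int initval, int flags, uint64_t out[2])
{
    int i;
    int fd = eventfd(initval, flags | EFD_NONBLOCK);

    out[0] = out[1] = 0;
    if (fd < 0)
        return;
    for (i = 0; i < 2; i++) {
        uint64_t value;

        if (read(fd, &value, sizeof(value)) == (ssize_t)sizeof(value))
            out[i] = value;
    }
    close(fd);
}
```

With an initial count of 3 and no extra flags, the first read returns 3 and the second fails because the counter is now 0. With EFD_SEMAPHORE, each of the two reads returns 1 and leaves a remaining count of 1.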
A write from user space ends up in eventfd_write. Its implementation mirrors the read path, so it is not repeated here; interested readers can study the source themselves. As mentioned at the beginning, the kernel can also signal an eventfd on its own initiative, which is done through eventfd_signal:
__u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
{
    unsigned long flags;
    spin_lock_irqsave(&ctx->wqh.lock, flags);
    if (ULLONG_MAX - ctx->count < n)
        n = ULLONG_MAX - ctx->count;
    ctx->count += n;
    /* Check whether the wait queue has any waiters. */
    if (waitqueue_active(&ctx->wqh))
        wake_up_locked_poll(&ctx->wqh, POLLIN);
    spin_unlock_irqrestore(&ctx->wqh.lock, flags);
    return n;
}
This function is similar to the write path, but it never blocks. If n is so large that count + n would exceed ULLONG_MAX, n is reduced to the difference between ULLONG_MAX and the current count, so count cannot overflow. Then, if any task is waiting on the queue, it is woken up; such a task should be one that is waiting to read.
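To make the contrast concrete, here is a hedged sketch with two invented helpers. saturating_signal is a plain user-space model of the clamping arithmetic in eventfd_signal, using UINT64_MAX in place of the kernel's ULLONG_MAX (an assumption that holds where unsigned long long is 64 bits wide). write_to_full_errno shows what a user-space write() does in the same situation on a non-blocking descriptor: per the eventfd manual page it fails with EAGAIN instead of clamping.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Model of the clamping in eventfd_signal(): add n to *count without
 * exceeding UINT64_MAX and return the amount actually added. */
uint64_t saturating_signal(uint64_t *count, uint64_t n)
{
    if (UINT64_MAX - *count < n)
        n = UINT64_MAX - *count;
    *count += n;
    return n;
}

/* Fill a non-blocking eventfd to its maximum, then try to add 1 more.
 * Returns the errno of that second write, 0 if it unexpectedly
 * succeeded, or -1 if the setup failed. */
int write_to_full_errno(void)
{
    uint64_t max = UINT64_MAX - 1;  /* largest value the counter can hold */
    uint64_t one = 1;
    int err = 0;
    int fd = eventfd(0, EFD_NONBLOCK);

    if (fd < 0)
        return -1;
    if (write(fd, &max, sizeof(max)) != (ssize_t)sizeof(max)) {
        close(fd);
        return -1;
    }
    if (write(fd, &one, sizeof(one)) < 0)
        err = errno;                /* save errno before close() */
    close(fd);
    return err;
}
```

Starting from a count of UINT64_MAX - 5, saturating_signal(&count, 10) adds only 5 and leaves count at UINT64_MAX, while write_to_full_errno() returns EAGAIN. On a blocking descriptor the second write would instead sleep until a read makes room.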
That concludes the introduction to eventfd. It is a simple mechanism, and the analysis above shows that it belongs among the low-level communication primitives: it is not meant for transferring large amounts of data, only for notification and synchronization.
For details on how to use eventfd, see the manual page: https://linux.die.net/man/2/eventfd
Emmanuel
Resources:
Linux kernel 3.10.1 Source code
Linux EVENTFD Analysis