Source: http://www.cnblogs.com/sharra/archive/2010/12/30/1921287.html
To access devices at a low level, you need to understand how Linux exposes them, and in particular how it handles a set of non-blocking I/O channels at once; the standard term for this is I/O multiplexing. Some sentences below are quoted from other articles without detailed attribution.
Readers familiar with Linux kernel or device-driver development will know the poll and select system calls, and the epoll mechanism introduced in version 2.5 (epoll comprises three system calls). There are many articles online explaining their usage, often very detailed and with in-depth source-code analysis. After reading several of them, I am writing down what I consider most important. My intention is to compare the similarities and differences among the three using concepts as simple as possible.
After some research, I concluded that poll and select belong to the same class of system call: they block while probing a group of non-blocking I/O devices for events (such as readable, writable, or a high-priority error condition), until one of the devices triggers an event or the specified wait time is exceeded. That is, they are not responsible for the I/O itself; they help the caller find which devices are currently ready. The comparable Windows facility is IOCP, which also handles multiplexing, but encapsulates the I/O and the probing together.
Two pieces of background are needed: 1. file descriptors (fd); 2. the poll file operation (f_op->poll).
In Linux, all devices are abstracted as files, and each family of device files has its own virtual file system; a device therefore appears in system-call parameters as a file descriptor. An fd is simply an integer (in particular, the fds for standard input, standard output, and standard error are 0, 1, and 2 respectively). When talking to the kernel, passing an integer fd lets the kernel further check, against its own file tables, whether the descriptor is valid; if the interface instead handed out raw pointers, no such check would be possible, since a kernel pointer is meaningless to user space anyway.
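A minimal sketch of the point that an fd is just a small integer validated by the kernel (using `/dev/null` purely as a convenient device file; the check `fd >= 3` assumes only the three standard streams are already open):

```python
import errno
import os

def fd_demo():
    """Show that an fd is just a small integer validated by the kernel."""
    fd = os.open("/dev/null", os.O_WRONLY)  # lowest unused integer descriptor
    assert fd >= 3                          # 0, 1, 2 are stdin/stdout/stderr
    os.write(fd, b"x")                      # valid fd: the write goes through
    os.close(fd)
    try:
        os.write(fd, b"x")                  # stale integer: the kernel rejects it
    except OSError as e:
        return e.errno                      # EBADF, "bad file descriptor"

print(fd_demo() == errno.EBADF)
```

This is exactly the validity check that a raw pointer interface could not provide: the integer is looked up in the kernel's own table on every call.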
An fd gives access to a file, and through the file to its file_operations table; the operation of interest here is poll, because the poll and select system calls both rely on this poll file operation. It takes two parameters: the file itself, and something that can be seen as a callback invoked when the device is not yet ready. That callback is handed a wait queue owned by the kernel, onto which it mounts the current process (when the device becomes ready, it wakes up every node on its own wait queue, and in this way the current process receives the completion signal). The poll file operation returns a standard event mask, each bit of which indicates a different readiness state (all zeros means no event has been triggered).
Let's start with the early multiplexing calls, poll and select.
In essence, poll and select share the same procedure: run the poll file operation on every specified device. Usually none is ready yet, so the callback registers the current process on each device's wait queue. If no bit is set in any of the returned masks, the call removes the callback pointer, goes into a time-limited sleep, wakes up, polls every device again, sleeps again, and so on until one of the devices triggers an event. As soon as an event fires, the system call returns to user space and the caller can read or write the relevant fds. Of course, not all devices are ready at that point, so the caller has to keep invoking poll or select in a loop. Each invocation scans all the devices once per sleep cycle, i.e. one scan is O(n) in the number of devices, and O(n) scans may be needed. For a typical server program that must listen on thousands of connections concurrently and reuse them, poll and select are therefore a real performance bottleneck. In addition, thousands of fds must be copied from user space into kernel space on every call, an overhead that cannot be ignored.
Poll and select are discussed together because the mechanism is the same; only the parameters and data structures differ slightly. Select takes three fd sets at once, one per channel: input, output, and error/exception. Each set implies the events expected for its channel; for instance, the input set watches for input-ready, input-pending, and error events. Select then runs the poll file operation on each fd the caller cares about and checks the returned mask for events relevant to that fd's channel, e.g. whether an fd in the output set reports output-ready. If any fd of interest has an event, the call returns. To handle three groups of fds with different event rules at once, select represents each group as a bitmap whose length is bounded by the largest fd value, with an upper limit of 1024 (3072 bits across the three groups). Moreover, those are only the incoming bitmaps; there are outgoing bitmaps of the same size, so the storage overhead grows quickly with the number of fds.
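A minimal sketch of the select interface via Python's `select` module, which wraps the C call (the pipe and the 0.1 s timeout are illustrative choices, not part of the API):

```python
import os
import select

def select_demo(timeout=0.1):
    """Watch a pipe's read end with select(); return the fds reported ready."""
    r, w = os.pipe()
    # Nothing written yet: select() sleeps until the timeout, reports no fds.
    ready, _, _ = select.select([r], [], [], timeout)
    assert ready == []
    # A write to the other end makes the read end "readable".
    os.write(w, b"ping")
    ready, _, _ = select.select([r], [], [], timeout)
    os.close(r)
    os.close(w)
    return ready

print(select_demo())  # the read end's fd appears in the ready list
```

Note how the three positional lists mirror the three fd sets (input, output, exception) described above, and how the call returns only which fds are ready; the actual read is still up to the caller.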
Since treating a whole group of fds under one shared event rule is rather coarse, poll attaches the events to each fd individually. The poll() system call is System V's multiplexed-I/O solution. It takes three parameters, the first being a pointer to an array of pollfd structures, i.e. a group of fds with their related information: each structure contains the fd, the expected-event mask, and the returned-event mask. In essence, the fd and the input and output parameters of select are put into one structure, fds are no longer split into three groups, and the events of interest are set per fd by the caller. There is then no need to organize the data as bitmaps, and no need to traverse whole bitmaps. For each fd, the call performs the poll file operation, checks whether the returned mask contains an expected event, and also checks for hang-ups and errors. If some event has been triggered, the call can return.
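The per-fd (expected events, returned events) pairing of struct pollfd can be sketched with Python's `select.poll` wrapper (the pipe and 100 ms timeout are illustrative):

```python
import os
import select

def poll_demo():
    """One pollfd-style entry: register (fd, expected events), get back (fd, returned events)."""
    r, w = os.pipe()
    p = select.poll()
    p.register(r, select.POLLIN)  # "events" field: what the caller cares about
    assert p.poll(100) == []      # 100 ms timeout, nothing readable yet
    os.write(w, b"ping")
    revents = p.poll(100)         # "revents" field: what actually happened
    os.close(r)
    os.close(w)
    return revents

print(poll_demo())  # a list of (fd, event-mask) pairs, here with POLLIN set
```

Unlike select, there is no three-way split and no bitmap sized by the largest fd value; each registered fd carries its own event mask.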
Back to what poll and select have in common: in high-concurrency, many-connection scenarios they show shortcomings that were not originally anticipated, even though poll improves on select. Besides the issues above, every call copies the fd set from user space to kernel space, and in such scenarios poll or select must be invoked over and over, with each invocation possibly sleeping and rescanning the fds several times. What can be done about these repeated, meaningless operations?
These repeated and meaningless operations are: 1. Copying from user space to kernel space. Since the fds are monitored long-term and even the expected events rarely change, the copy is clearly redundant. We can let the kernel keep the full set of monitored fds, and their expected events, long-term, modifying individual entries only when necessary. 2. Adding the current process, in turn, to the wait queue of every fd's device. The only purpose of this is to notify the process when a device becomes ready, so a "proxy" callback function can join each fd's wait queue in place of the current process. (This is something I only concluded later: a Linux wait queue is really a queue of callback functions; the macro that "adds the current process" to a wait queue actually adds a callback that wakes the current process.) So, just as in a poll system call, when the poll file operation finds the device not ready it invokes the callback passed in; but the callback that epoll supplies, instead of enqueueing a wake-up for the current process as the old poll system call did, merely adds the proxy callback to the device's wait queue. When the device becomes ready, this proxy callback puts the device's fd in a designated place and then wakes any process that may be waiting; that process only has to collect the fds from the designated place. Combining 1 and 2: copy each fd only once; once the fd is registered, run the poll file operation on it; if there is already an event, put the fd in the designated place immediately (usually there is not); otherwise add the proxy callback to the fd's wait queue. When an event occurs, the fd lands in the designated place automatically, and the current process no longer needs to poll and sleep over every fd one by one.
This is how the epoll mechanism improves things. Admittedly, when there are few fds, waking the current process directly is not a problem; but as the fds grow ever more numerous, that becomes hard to manage. Previously, when a device event fired, it merely woke the current process, which then had to poll everything again just to find out which device was ready; and since the callback pointers had been removed on waking, the whole registration had to be redone. Clearly, simply waking the process is not enough.
Now the event-triggered callback is made to do more. As before, a callback is invoked while the device is not yet ready; but this callback now registers a second callback on the device, and when the device event fires, that second callback not only wakes the current process but also puts the device's fd in a designated place. It is like a class monitor collecting homework. Before, he had to ask the students one by one whether their notebooks were done, and if not, wait a while and ask again. Now he only has to announce once: whoever finishes should hand the notebook in over there. When the monitor wants the notebooks, he goes and checks, or waits a while and is woken when notebooks arrive, and then takes them away. The principle is simple, as teachers and class officers often say: doing a little more organizational work up front makes everything much easier, especially as the number of things to manage keeps growing.
This mechanism, or pattern, presumably also appears in Java's FutureTask: a bunch of tasks run in a thread pool (tasks, not threads; the interface is Callable&lt;V&gt; rather than Runnable, and Runnable.run becomes Callable.call, which can return a result). Which one finishes first? Must we ask them one by one? No; each task announces its own completion according to a fixed protocol, so the get method on the FutureTask side can immediately learn of the first completed task and obtain the result it returned.
Epoll consists of three system calls: epoll_create, epoll_ctl, and epoll_wait. epoll_create creates and initializes the internal data structures; epoll_ctl adds, deletes, or modifies a specified fd and its expected events; epoll_wait waits for an event on any of the previously specified fds.
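The three calls can be sketched through Python's `select.epoll` wrapper, which is available on Linux only (the pipe and 0.1 s timeout are illustrative; the comments map each step to the underlying system call):

```python
import os
import select

def epoll_demo():
    """epoll: register interest once, then wait repeatedly."""
    r, w = os.pipe()
    ep = select.epoll()               # epoll_create: build the kernel structures
    ep.register(r, select.EPOLLIN)    # epoll_ctl(EPOLL_CTL_ADD): fd copied in once
    assert ep.poll(timeout=0.1) == [] # epoll_wait: nothing ready yet
    os.write(w, b"ping")
    events = ep.poll(timeout=0.1)     # epoll_wait: only the ready fds come back
    ep.unregister(r)                  # epoll_ctl(EPOLL_CTL_DEL)
    ep.close()
    os.close(r)
    os.close(w)
    return events

print(epoll_demo())  # a list of (fd, event-mask) pairs for the ready fds
```

The key contrast with select and poll is visible in the shape of the API: the fd set lives in the kernel between calls, and each epoll_wait returns only the fds the proxy callbacks have already placed in the "designated place", instead of rescanning the whole set.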
I have not looked further into epoll's internal data structures.