Linux I/O Modes and select, poll, epoll in Detail

Source: Internet
Author: User

Note: This article is a summary of what I have learned from many blog posts, and it may contain misunderstandings. Please read it with a critical eye, and point out any mistakes you find.

What is the difference between synchronous I/O and asynchronous I/O, and what are blocking I/O and non-blocking I/O? Different people give different answers in different contexts, so let us first pin down the context of this article: it discusses network I/O in a Linux environment.

Before the explanation, a few concepts need to be introduced:
- user space and kernel space
- process switching
- process blocking
- file descriptors
- cached I/O

User space and kernel space

Operating systems now use virtual memory, so for a 32-bit operating system the addressing space (virtual address space) is 4 GB (2^32 bytes). The core of the operating system is the kernel, which is independent of ordinary applications: it can access the protected memory space and has full access to the underlying hardware devices. To ensure that user processes cannot manipulate the kernel directly, and to keep the kernel secure, the operating system divides the virtual address space into two parts: kernel space and user space. For Linux, the highest 1 GB (virtual addresses 0xC0000000 to 0xFFFFFFFF) is used by the kernel and is called kernel space, while the lower 3 GB (virtual addresses 0x00000000 to 0xBFFFFFFF) is used by each process and is called user space.

Process switching

To control the execution of processes, the kernel must be able to suspend a process running on the CPU and resume the execution of a previously suspended process. This behavior is called process switching. Any process runs with the support of the operating system kernel and is closely tied to the kernel. Switching from one running process to another involves the following steps:
1. Save the processor context, including the program counter and other registers.
2. Update the PCB (process control block) information.
3. Move the process's PCB into the appropriate queue, such as the ready queue or a queue of processes blocked on an event.
4. Select another process to execute and update its PCB.
5. Update the memory-management data structures.
6. Restore the processor context.
In short, process switching is quite resource-intensive.

Process blocking

A running process, because some expected event has not occurred (for example, a request for a system resource failed, it is waiting for an operation to complete, new data has not yet arrived, or there is no new work to do), automatically executes the blocking primitive (block) and changes itself from the running state to the blocked state. Blocking is therefore an active behavior of the process itself, and only a process in the running state (one occupying the CPU) can become blocked. A blocked process consumes no CPU resources.

File descriptors

A file descriptor (fd) is a term in computer science: an abstract concept used to refer to a file. A file descriptor is formally a non-negative integer. In practice it is an index into a table, maintained by the kernel for each process, that records the files the process has opened. When a program opens an existing file or creates a new file, the kernel returns a file descriptor to the process.
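To make the file-descriptor idea concrete, here is a minimal sketch of my own (not from the original article; the path /etc/hostname is just an arbitrary example): open() asks the kernel to open a file, and the integer it returns is simply an index into this process's table of open files.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* the kernel returns an index into this process's open-file table */
    int fd = open("/etc/hostname", O_RDONLY);
    if (fd == -1) {
        perror("open");
        return 1;
    }
    printf("got file descriptor %d\n", fd);  /* typically 3: 0, 1, 2 are stdin, stdout, stderr */
    close(fd);                               /* release the descriptor */
    return 0;
}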
In programming, much low-level work revolves around file descriptors. The concept of a file descriptor, however, is generally found only in operating systems such as UNIX and Linux.

Cached I/O

Cached I/O is also called standard I/O, and the default I/O operations of most file systems are cached I/O. In Linux's cached I/O mechanism, the operating system caches I/O data in the file system's page cache: data is first copied into the operating system kernel's buffer and then copied from the kernel buffer into the application's address space.

The disadvantage of cached I/O: during the transfer, data has to be copied several times between the application address space and the kernel, and the CPU and memory overhead of these copies is considerable.

I/O Models

As just described, for an I/O access (take read as an example), the data is first copied into the operating system kernel's buffer and then copied from that buffer into the application's address space. So when a read operation occurs, it goes through two stages:
1. Waiting for the data to be ready.
2. Copying the data from the kernel into the process.

Precisely because of these two stages, Linux offers the following five network I/O models:
- blocking I/O
- non-blocking I/O
- I/O multiplexing
- signal-driven I/O
- asynchronous I/O

Note: since signal-driven I/O is rarely used in practice, only the remaining four models are discussed here.

Blocking I/O

In Linux, all sockets are blocking by default. A typical read flow looks like this: when the user process calls the recvfrom system call, the kernel begins the first stage of I/O, preparing the data (for network I/O, the data often has not arrived yet; for example, a complete UDP packet may not have been received, so the kernel must wait for enough data to arrive). This takes time; meanwhile, on the user-process side, the entire process is blocked (by its own choice, of course). When the kernel has the data ready, it copies the data from the kernel into user memory and then returns the result; the user process leaves the blocked state and runs again.

So blocking I/O is characterized by the process being blocked in both stages of the I/O.

Non-blocking I/O

Under Linux, a socket can be set to be non-blocking. A read on a non-blocking socket proceeds like this: when the user process issues a read, if the data in the kernel is not yet ready, the call does not block the user process but returns an error immediately. From the user process's point of view, it initiates a read and gets a result immediately, without waiting. When the user process sees that the result is an error, it knows the data is not ready yet and can issue the read again. Once the data in the kernel is ready and a system call from the user process arrives again, the kernel immediately copies the data into user memory and returns.

So non-blocking I/O is characterized by the user process constantly and actively asking the kernel whether the data is ready.
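The loop below is a hedged sketch of this non-blocking pattern, not code from the article: sockfd is assumed to be an already-connected socket created elsewhere. The socket is switched to non-blocking mode, and while the kernel is still waiting for data, recv() returns -1 with errno set to EAGAIN/EWOULDBLOCK, so the process keeps asking.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

static void nonblocking_read(int sockfd)   /* sockfd: assumed connected socket */
{
    char buf[1024];

    /* make the socket non-blocking */
    int flags = fcntl(sockfd, F_GETFL, 0);
    fcntl(sockfd, F_SETFL, flags | O_NONBLOCK);

    for (;;) {
        ssize_t n = recv(sockfd, buf, sizeof(buf), 0);
        if (n >= 0) {
            /* stage 2 finished: data has been copied from the kernel into buf */
            printf("got %zd bytes\n", n);
            break;
        }
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            /* stage 1 not finished: kernel data not ready, ask again later */
            usleep(1000);
            continue;
        }
        perror("recv");   /* a real error */
        break;
    }
}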
I/O multiplexing

I/O multiplexing is what we call select, poll, and epoll; in some places this I/O model is also called event-driven I/O. The benefit of select/epoll is that a single process can handle the I/O of multiple network connections at the same time. The basic principle is that select, poll, and epoll continuously poll all the sockets they are responsible for, and when data arrives on any socket, they notify the user process.

When the user process calls select, the whole process is blocked; at the same time, the kernel "monitors" all the sockets select is responsible for, and as soon as the data in any one of them is ready, select returns. The user process then calls the read operation to copy the data from the kernel into the user process.

So I/O multiplexing is characterized by a mechanism through which one process can wait on many file descriptors at once; as soon as any of these file descriptors (socket descriptors) becomes read-ready, the select() function returns.

Compared with blocking I/O, this flow is actually not much different; in fact, it is slightly worse, because it requires two system calls (select and recvfrom), whereas blocking I/O involves only one (recvfrom). The advantage of select, however, is that it can handle multiple connections at the same time.

Therefore, if the number of connections being handled is not high, a web server using select/epoll does not necessarily perform better than one using multithreading plus blocking I/O, and it may have higher latency. The advantage of select/epoll is not that it handles a single connection faster, but that it can handle more connections.

In the I/O multiplexing model, each socket is in practice usually set to non-blocking; but, as described above, the entire user process is in fact blocked the whole time. It is just blocked by the select function rather than by socket I/O.
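As a minimal sketch of the "two system calls" point above (my own illustration, not from the article; sockfd is assumed to be a connected socket created elsewhere), the process first blocks in select() waiting for readiness and then calls recv() to do the actual copy.

#include <stdio.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>

static void wait_and_read(int sockfd)   /* sockfd: assumed connected socket */
{
    char buf[1024];
    fd_set readfds;

    FD_ZERO(&readfds);
    FD_SET(sockfd, &readfds);          /* watch sockfd for readability */

    /* first system call: blocks until sockfd is readable (NULL timeout = wait forever) */
    int ready = select(sockfd + 1, &readfds, NULL, NULL, NULL);
    if (ready > 0 && FD_ISSET(sockfd, &readfds)) {
        /* second system call: copy the ready data out of the kernel */
        ssize_t n = recv(sockfd, buf, sizeof(buf), 0);
        printf("select reported ready, recv returned %zd\n", n);
    }
}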
Asynchronous I/O

Asynchronous I/O is actually used very little under Linux. Its flow: after the user process initiates the read operation, it can immediately go and do other things. From the kernel's perspective, when it receives an asynchronous read it returns immediately, so the user process is never blocked. The kernel then waits for the data to be ready and copies it into user memory; when all of this is done, the kernel sends a signal to the user process to tell it that the read operation is complete.

Summary

The difference between blocking and non-blocking calls: a blocking I/O call blocks the corresponding process until the operation completes, while a non-blocking I/O call returns immediately even if the kernel has not yet prepared the data.

The difference between synchronous I/O and asynchronous I/O requires a definition first. POSIX defines them as follows:
- A synchronous I/O operation causes the requesting process to be blocked until that I/O operation completes;
- An asynchronous I/O operation does not cause the requesting process to be blocked.

The difference is that synchronous I/O blocks the process while performing the "I/O operation". By this definition, the blocking I/O, non-blocking I/O, and I/O multiplexing described above are all synchronous I/O. Some will object that non-blocking I/O is not blocked. Here is the "tricky" part: the "I/O operation" in the definition refers to the real I/O operation, for example the recvfrom system call. Non-blocking I/O does not block the process while it executes the recvfrom system call and the kernel data is not ready; but once the data in the kernel is ready, recvfrom copies the data from the kernel into user memory, and during that period the process is blocked. Asynchronous I/O is different: when the process initiates the I/O operation, it returns directly and pays no further attention until the kernel sends a signal telling the process that the I/O is complete. Throughout the whole process, the process is never blocked.

Comparing the I/O models, the difference between non-blocking I/O and asynchronous I/O is still quite clear. In non-blocking I/O, although the process is not blocked most of the time, it still has to check actively, and once the data is ready it still has to call recvfrom itself to copy the data into user memory. Asynchronous I/O is completely different: it is as if the user process hands the entire I/O operation over to someone else (the kernel), who sends a signal when it is done. During that time, the user process neither needs to check the status of the I/O operation nor copy the data itself.

I/O Multiplexing: select, poll, epoll in Detail

select, poll, and epoll are all I/O multiplexing mechanisms. I/O multiplexing is a mechanism by which a process can monitor multiple descriptors and, once some descriptor is ready (usually read-ready or write-ready), notify the program to perform the corresponding read or write. But select, poll, and epoll are all essentially synchronous I/O, because they require the process itself to do the reading and writing after the read/write event is ready; that is, the read/write step is blocking. Asynchronous I/O does not: there, the implementation is responsible for copying the data from the kernel into user space.

select

int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);

The file descriptors monitored by select fall into three classes: readfds, writefds, and exceptfds. After the call, select blocks until some descriptor is ready (readable, writable, or has an exceptional condition) or the timeout expires (timeout specifies the wait time; a NULL timeout blocks indefinitely, while a zeroed timeval makes the call return immediately), and then the function returns. When select returns, the ready descriptors are found by traversing the fd_set.

select is currently supported on almost all platforms, and this good cross-platform support is one of its advantages. One drawback of select is that the maximum number of file descriptors a single process can monitor is limited, to 1024 on Linux; the limit can be raised by changing the macro definition or even recompiling the kernel, but this also reduces efficiency.

poll

int poll(struct pollfd *fds, nfds_t nfds, int timeout);

Unlike select, which uses three bitmaps to represent the three fd_sets, poll uses a single array of pollfd structures:

struct pollfd {
    int   fd;        /* file descriptor */
    short events;    /* requested events to watch */
    short revents;   /* returned events witnessed */
};

The pollfd structure contains both the events to monitor and the events that occurred, so the select-style "parameter-value" passing (where the kernel overwrites the input sets) is no longer used. In addition, poll has no limit on the maximum number of descriptors (although performance still degrades when the number is very large).
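Below is a hedged poll() sketch mirroring the select example above (again my own illustration; sockfd is an assumed connected socket): the kernel reports readiness through revents instead of overwriting the input sets.

#include <poll.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

static void poll_and_read(int sockfd)   /* sockfd: assumed connected socket */
{
    char buf[1024];
    struct pollfd pfd;

    pfd.fd = sockfd;
    pfd.events = POLLIN;               /* we only care about readability */

    int ready = poll(&pfd, 1, -1);     /* timeout of -1: block until an event arrives */
    if (ready > 0 && (pfd.revents & POLLIN)) {
        ssize_t n = recv(sockfd, buf, sizeof(buf), 0);
        printf("poll reported ready, recv returned %zd\n", n);
    }
}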
As with select, after poll returns you need to traverse the pollfd array to find the ready descriptors.

From the above, both select and poll need to traverse the file descriptors after returning in order to find the ready sockets. In fact, of a large number of simultaneously connected clients, only very few may be ready at any given moment, so their efficiency decreases linearly as the number of monitored descriptors grows.

epoll

epoll was introduced in the 2.6 kernel and is an enhanced version of the earlier select and poll. Compared with select and poll, epoll is more flexible and has no descriptor limit. epoll uses one file descriptor to manage many other descriptors: it stores the events for the file descriptors the user cares about in an event table inside the kernel, so the copy between user space and kernel space is needed only once.

epoll operations

Working with epoll requires three interfaces:

int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

1. int epoll_create(int size);
Creates an epoll handle. size tells the kernel roughly how many descriptors will be monitored; this differs from the first parameter of select(), which is the value of the largest monitored fd plus one. size does not limit the maximum number of descriptors epoll can monitor; it is only a hint for how much internal data structure the kernel should allocate initially. When the epoll handle has been created, it occupies an fd of its own; under Linux you can see this fd by looking at /proc/<pid>/fd/. So after you have finished using epoll, you must call close(), otherwise fds may be exhausted.

2. int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
Performs operation op on the specified descriptor fd.
- epfd: the return value of epoll_create().
- op: the operation, expressed by three macros: EPOLL_CTL_ADD, EPOLL_CTL_DEL, and EPOLL_CTL_MOD, which add, delete, and modify the monitored events for fd respectively.
- fd: the file descriptor to monitor.
- event: tells the kernel what to monitor; struct epoll_event is defined as:

struct epoll_event {
    __uint32_t   events;  /* epoll events */
    epoll_data_t data;    /* user data variable */
};

events can be a combination of the following macros:
- EPOLLIN: the corresponding file descriptor is readable (including the peer socket being closed gracefully);
- EPOLLOUT: the corresponding file descriptor is writable;
- EPOLLPRI: the corresponding file descriptor has urgent data to read (this should indicate the arrival of out-of-band data);
- EPOLLERR: the corresponding file descriptor has an error;
- EPOLLHUP: the corresponding file descriptor has been hung up;
- EPOLLET: put epoll into edge-triggered mode, as opposed to the default level-triggered mode;
- EPOLLONESHOT: monitor only one event; after the event has been delivered, if you still need to monitor this socket, you must add it to the epoll set again.

3. int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
Waits for I/O events on epfd and returns at most maxevents events. The events parameter is used to receive the collection of events from the kernel; maxevents tells the kernel how large the events array is (it only needs to be greater than zero and simply caps how many events a single call can return). The timeout parameter is the timeout in milliseconds: 0 makes the call return immediately, and -1 makes it wait indefinitely. The function returns the number of events that need to be handled; a return value of 0 means the timeout expired.
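Before looking at the two working modes and the fuller demonstration later in the article, here is a compact sketch of the three calls wired together (my own illustration; listen_fd is assumed to be a listening socket set up elsewhere, and error handling is trimmed to keep the shape visible).

#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

static void epoll_skeleton(int listen_fd)   /* listen_fd: assumed listening socket */
{
    struct epoll_event ev, events[64];

    int epfd = epoll_create(256);            /* size is only a hint to the kernel */

    ev.events = EPOLLIN;                     /* interested in readability */
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);   /* register the descriptor once */

    int n = epoll_wait(epfd, events, 64, -1);         /* block until something is ready */
    for (int i = 0; i < n; i++)
        printf("fd %d is ready (events 0x%x)\n",
               events[i].data.fd, (unsigned)events[i].events);

    close(epfd);   /* the epoll handle occupies an fd of its own, so close it */
}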
epoll working modes

epoll has two modes of operation on file descriptors: LT (level-triggered) and ET (edge-triggered). LT is the default. The two modes differ as follows:

LT mode: when epoll_wait detects that a descriptor event has occurred and reports this event to the application, the application does not have to handle it immediately. The next time epoll_wait is called, it will report the event again.

ET mode: when epoll_wait detects that a descriptor event has occurred and reports this event to the application, the application must handle it immediately. If it does not, the next call to epoll_wait will not report the event again.

1. LT mode
LT (level-triggered) is the default working mode and supports both blocking and non-blocking sockets. In this mode, the kernel tells you whether a file descriptor is ready, and you can then perform I/O on that ready fd. If you do nothing, the kernel will keep notifying you.

2. ET mode
ET (edge-triggered) is the high-speed working mode and supports only non-blocking sockets. In this mode, the kernel tells you through epoll when a descriptor changes from not ready to ready. It then assumes you know that the file descriptor is ready and sends no more readiness notifications for it until you do something that causes the descriptor to become not ready again (for example, sending or receiving until a send or receive transfers less data than requested and produces an EWOULDBLOCK error). Note, however, that if you never perform I/O on this fd (so that it never becomes not ready again), the kernel still does not send further notifications (it notifies only once). ET mode greatly reduces the number of times an epoll event is triggered repeatedly, so it is more efficient than LT mode. When epoll works in ET mode, non-blocking sockets must be used, to avoid starving the task of handling multiple file descriptors because of a blocking read or blocking write on a single file handle.

3. Summary

Consider this example:
1. We add the file handle (RFD) used to read data from a pipe to the epoll descriptor.
2. 2 KB of data is written at the other end of the pipe.
3. epoll_wait(2) is called and returns RFD, indicating that it is ready for a read operation.
4. We read 1 KB of data.
5. epoll_wait(2) is called again...

LT mode: in LT mode, the call to epoll_wait(2) in step 5 will still notify us.

ET mode: if we used the EPOLLET flag when adding RFD to the epoll descriptor in step 1, the call to epoll_wait(2) in step 5 may hang, even though the remaining data is still present in the file's input buffer, while the sender of the data is waiting for a response to the data it has already sent. The ET working mode reports an event only when an event occurs on a monitored file handle. Therefore, at step 5, the caller may end up never getting to the remaining data that is still sitting in the file's input buffer. A sketch of the registration difference follows.
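The sketch below is my own (with a hypothetical conn_fd standing for an already-accepted connection; it is not code from the article). It shows that the only API-level change for ET is adding EPOLLET to the event mask, together with making the descriptor non-blocking, which is exactly what the drain-until-EAGAIN requirement described next demands.

#include <fcntl.h>
#include <sys/epoll.h>

static void add_edge_triggered(int epfd, int conn_fd)   /* conn_fd: hypothetical accepted connection */
{
    /* ET works reliably only with non-blocking descriptors */
    int flags = fcntl(conn_fd, F_GETFL, 0);
    fcntl(conn_fd, F_SETFL, flags | O_NONBLOCK);

    struct epoll_event ev;
    ev.events = EPOLLIN | EPOLLET;   /* drop EPOLLET to get the default LT behaviour */
    ev.data.fd = conn_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, conn_fd, &ev);

    /* After a readiness notification, the reader must drain the socket until
     * recv() fails with EAGAIN; see the loop in the article below. */
}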
When working with epoll in ET mode, once an EPOLLIN event has been generated, the reading code must keep the following in mind: if the size returned by recv() equals the requested size, it is very likely that there is still unread data in the buffer, which also means the event has not been fully handled, so the read must be repeated (a cleaned-up version of the article's loop):

while (rs) {
    buflen = recv(activeevents[i].data.fd, buf, sizeof(buf), 0);
    if (buflen < 0) {
        // In non-blocking mode, EAGAIN means there is no more data to read
        // in the current buffer; here we treat the event as fully handled.
        if (errno == EAGAIN) {
            break;
        } else {
            return;
        }
    } else if (buflen == 0) {
        // The socket on the other end was closed gracefully.
    }
    if (buflen == sizeof(buf)) {
        rs = 1;   // there may be more data, so read again
    } else {
        rs = 0;
    }
}

What EAGAIN means in Linux: development on Linux often runs into errors (with errno set), and EAGAIN is one of the more common ones (for example during non-blocking operations). Literally it is a hint to "try again". The error often occurs when the application performs a non-blocking operation (on a file or socket). For example, if you open a file/socket/FIFO with the O_NONBLOCK flag and then perform a read when no data is available, the program does not block waiting for the data to become ready; instead read returns the error EAGAIN, telling your application that there is nothing to read right now and to try again later. As another example, when a system call (such as fork) fails because there are not enough resources (such as virtual memory), it returns EAGAIN to suggest calling it again (perhaps it will succeed next time).

Code demonstration

The following is incomplete code with imperfect formatting; it is meant to illustrate the process described above, with some boilerplate removed.

#define IPADDRESS   "127.0.0.1"
#define PORT        8787
#define MAXSIZE     1024
#define LISTENQ     5
#define FDSIZE      1000
#define EPOLLEVENTS 100

listenfd = socket_bind(IPADDRESS, PORT);

struct epoll_event events[EPOLLEVENTS];

// Create the epoll descriptor
epollfd = epoll_create(FDSIZE);

// Add the event for the listening descriptor
add_event(epollfd, listenfd, EPOLLIN);

// Event loop
for ( ; ; ) {
    // Returns the number of descriptor events that are ready
    ret = epoll_wait(epollfd, events, EPOLLEVENTS, -1);
    // Handle the received events
    handle_events(epollfd, events, ret, listenfd, buf);
}

// Event handler
static void handle_events(int epollfd, struct epoll_event *events, int num,
                          int listenfd, char *buf)
{
    int i;
    int fd;
    // Traverse only the ready I/O events: num is the count returned by
    // epoll_wait, not the FDSIZE passed to epoll_create.
    for (i = 0; i < num; i++) {
        fd = events[i].data.fd;
        // Dispatch according to the descriptor and the event type
        if ((fd == listenfd) && (events[i].events & EPOLLIN))
            handle_accept(epollfd, listenfd);
        else if (events[i].events & EPOLLIN)
            do_read(epollfd, fd, buf);
        else if (events[i].events & EPOLLOUT)
            do_write(epollfd, fd, buf);
    }
}

// Add an event
static void add_event(int epollfd, int fd, int state)
{
    struct epoll_event ev;
    ev.events = state;
    ev.data.fd = fd;
    epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &ev);
}

// Handle a newly received connection
static void handle_accept(int epollfd, int listenfd)
{
    int clifd;
    struct sockaddr_in cliaddr;
    socklen_t cliaddrlen = sizeof(cliaddr);
    clifd = accept(listenfd, (struct sockaddr *)&cliaddr, &cliaddrlen);
    if (clifd == -1)
        perror("accept error:");
    else {
        printf("accept a new client: %s:%d\n",
               inet_ntoa(cliaddr.sin_addr), ntohs(cliaddr.sin_port));
        // Add the client descriptor and its event
        add_event(epollfd, clifd, EPOLLIN);
    }
}

// Read handling
static void do_read(int epollfd, int fd, char *buf)
{
    int nread;
    nread = read(fd, buf, MAXSIZE);
    if (nread == -1) {
        perror("read error:");
        close(fd);                          // remember to close the fd
        delete_event(epollfd, fd, EPOLLIN); // delete the monitored event
    } else if (nread == 0) {
        fprintf(stderr, "client close.\n");
        close(fd);                          // remember to close the fd
        delete_event(epollfd, fd, EPOLLIN); // delete the monitored event
    } else {
        printf("read message is: %s", buf);
        // Change the descriptor's monitored event from read to write
        modify_event(epollfd, fd, EPOLLOUT);
    }
}

// Write handling
static void do_write(int epollfd, int fd, char *buf)
{
    int nwrite;
    nwrite = write(fd, buf, strlen(buf));
    if (nwrite == -1) {
        perror("write error:");
        close(fd);                           // remember to close the fd
        delete_event(epollfd, fd, EPOLLOUT); // delete the monitored event
    } else {
        modify_event(epollfd, fd, EPOLLIN);
    }
    memset(buf, 0, MAXSIZE);
}

// Delete an event
static void delete_event(int epollfd, int fd, int state)
{
    struct epoll_event ev;
    ev.events = state;
    ev.data.fd = fd;
    epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, &ev);
}

// Modify an event
static void modify_event(int epollfd, int fd, int state)
{
    struct epoll_event ev;
    ev.events = state;
    ev.data.fd = fd;
    epoll_ctl(epollfd, EPOLL_CTL_MOD, fd, &ev);
}

Note: I have omitted the code for the other end (the client).

epoll summary

With select/poll, the kernel scans all the monitored file descriptors only after the process calls a particular method, whereas epoll registers each file descriptor beforehand with epoll_ctl(). Once some file descriptor becomes ready, the kernel uses a callback-like mechanism to activate that descriptor quickly, and the process is notified when it calls epoll_wait(). (This removes the traversal of all file descriptors and replaces it with a mechanism of listening for callbacks; that is where epoll's appeal lies.)

The main advantages of epoll come down to the following points:

1. The number of monitored descriptors is not restricted. The limit it supports is the maximum number of open files, which is generally far greater than 2048; on a machine with 1 GB of memory it is around 100,000. The exact number can be seen with cat /proc/sys/fs/file-max, and in general it is closely related to the amount of system memory. select's biggest drawback is the limit on the number of fds a single process can open, which is insufficient for servers with a large number of connections.
Although a multi-process solution can be chosen instead (as Apache does), and although the cost of creating a process on Linux is relatively small, it is still far from negligible, and data synchronization between processes is much less efficient than synchronization between threads, so this is not a perfect solution either.

2. I/O efficiency does not decrease as the number of monitored fds grows. Unlike select and poll, which poll all descriptors, epoll is implemented through a callback function defined for each fd; only the fds that are ready execute their callback. If there are not many idle or dead connections, epoll's efficiency is not much higher than that of select/poll; but when there are a lot of idle connections, you will find that epoll is far more efficient than select/poll.
