1 select
The select() system call provides a mechanism for synchronous I/O multiplexing:
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>
int select(int n,
           fd_set *readfds,
           fd_set *writefds,
           fd_set *exceptfds,
           struct timeval *timeout);
FD_CLR(int fd, fd_set *set);
FD_ISSET(int fd, fd_set *set);
FD_SET(int fd, fd_set *set);
FD_ZERO(fd_set *set);
A call to select() blocks until one of the given file descriptors is ready to perform I/O, or until the time specified by the optional timeout parameter has elapsed.
The watched file descriptors are divided into three sets, each waiting for a different event. File descriptors listed in readfds are watched to see whether data is available for reading (that is, whether a read operation would complete without blocking). File descriptors listed in writefds are watched to see whether a write operation would complete without blocking. Finally, file descriptors listed in exceptfds are watched to see whether an exception has occurred or out-of-band data is available (these states apply only to sockets). Any of the three sets may be NULL, in which case select() does not watch for that type of event.
On successful return, each set is modified so that it contains only the file descriptors that are ready for that set's type of I/O. For example, assume that two file descriptors, with the values 7 and 9, are placed in readfds. When select() returns, if 7 is still in the set, that file descriptor is ready to be read without blocking. If 9 is no longer in the set, reading it would probably block (probably, because data may have become available just after select() returned; in that case, the next call to select() will report the descriptor as ready for reading).
The first parameter, n, is equal to the value of the highest-numbered file descriptor in any of the sets, plus one. The caller of select() is therefore responsible for finding the largest file descriptor and passing that value plus one as the first parameter.
The timeout parameter is a pointer to a timeval structure, which is defined as follows:
#include <sys/time.h>
struct timeval {
    long tv_sec;   /* seconds */
    long tv_usec;  /* microseconds (10^-6 seconds) */
};
If this parameter is not NULL, select() returns after tv_sec seconds and tv_usec microseconds, even if no file descriptor is ready for I/O. The state of the timeout structure after select() returns is undefined across UNIX systems, so both the timeout and the file descriptor sets must be reinitialized before every call to select(). In fact, current versions of Linux modify the timeout automatically, setting it to the time remaining. Thus, if the timeout was set to 5 seconds and 3 seconds elapsed before a file descriptor became ready, tv_sec will be 2 when select() returns.
If both values in the timeout are set to 0, the call to select() returns immediately. It reports all events pending at the time of the call, but does not wait for any subsequent events.
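The effect on the timeout structure can be observed with a small experiment. The following is a minimal sketch, assuming Linux; the helper name remaining_after_timeout is hypothetical. It calls select() with no descriptor sets at all, so that only the timeout applies, and then reports what the kernel left in the structure.

```c
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: sleep for up to `usec` microseconds using select()
 * with empty sets, then return the number of microseconds left in the
 * timeval. On Linux this is the unslept remainder (0 after a full timeout);
 * other systems may leave the structure untouched. Returns -1 on error. */
long remaining_after_timeout(long usec)
{
    struct timeval tv;

    tv.tv_sec = usec / 1000000;
    tv.tv_usec = usec % 1000000;

    /* n = 0 and three NULL sets: nothing is monitored, only the timeout. */
    if (select(0, NULL, NULL, NULL, &tv) < 0)
        return -1;

    return tv.tv_sec * 1000000L + tv.tv_usec;
}
```

Because the structure may have been overwritten, code that calls select() in a loop should assign tv_sec and tv_usec again before each call rather than reuse the old values.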
The file descriptor sets are not manipulated directly; they are managed through a few helper macros. This allows UNIX systems to implement the sets however they prefer, although most systems simply implement them as bit arrays. FD_ZERO removes all file descriptors from the specified set. It should be called before every call to select():
fd_set writefds;
FD_ZERO(&writefds);
FD_SET adds a file descriptor to the specified set, and FD_CLR removes a file descriptor from the specified set:
FD_SET(fd, &writefds);  /* add 'fd' to the set */
FD_CLR(fd, &writefds);  /* oops, remove 'fd' from the set */
Well-designed code should never have to use FD_CLR, and it is rarely used in practice.
FD_ISSET tests whether a file descriptor is part of the specified set. It returns a nonzero integer if the file descriptor is in the set, and 0 if it is not. FD_ISSET is used after select() returns, to test whether a given file descriptor is ready for the relevant action:
if (FD_ISSET(fd, &readfds))
    /* 'fd' is readable without blocking! */
Because the file descriptor sets are created statically, they impose a limit on the number of file descriptors and on the largest file descriptor value that can be placed in a set. Both are given by FD_SETSIZE. On Linux, this value is 1024.
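Putting the macros together, the following is a minimal sketch of watching two descriptors for readability. The function name first_readable and its return conventions are assumptions made for illustration; the guard against FD_SETSIZE reflects the limit just described.

```c
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_sec seconds for fd_a or fd_b to
 * become readable. Returns the ready descriptor (fd_a if both are ready),
 * -1 on timeout or error, and -2 if a descriptor does not fit in an fd_set. */
int first_readable(int fd_a, int fd_b, long timeout_sec)
{
    fd_set readfds;
    struct timeval tv;
    int n, ret;

    /* Descriptors outside [0, FD_SETSIZE) must never be passed to FD_SET. */
    if (fd_a < 0 || fd_b < 0 || fd_a >= FD_SETSIZE || fd_b >= FD_SETSIZE)
        return -2;

    FD_ZERO(&readfds);
    FD_SET(fd_a, &readfds);
    FD_SET(fd_b, &readfds);

    /* First parameter: highest descriptor in any set, plus one. */
    n = (fd_a > fd_b ? fd_a : fd_b) + 1;

    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;

    ret = select(n, &readfds, NULL, NULL, &tv);
    if (ret <= 0)
        return -1;

    /* Only descriptors that are ready remain in the set. */
    if (FD_ISSET(fd_a, &readfds))
        return fd_a;
    if (FD_ISSET(fd_b, &readfds))
        return fd_b;
    return -1;
}
```

With two pipes, for instance, the function reports the read end of whichever pipe has had data written to it.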
On success, select() returns the number of file descriptors ready for I/O, counted across all three sets. If a timeout was provided, the return value may be 0. On error, -1 is returned and errno is set to one of the following values:
EBADF
An invalid file descriptor was provided in one of the sets.
EINTR
A signal was caught while waiting; the call can be reissued.
EINVAL
The parameter n is negative, or the specified timeout is invalid.
ENOMEM
Insufficient memory was available to complete the request.
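The error codes above suggest a common calling pattern: retry on EINTR, and give up on anything else. The following is a minimal sketch of that pattern; the helper name wait_readable is hypothetical, and restarting with the full timeout after a signal is a simplification.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: wait until fd is readable or timeout_ms elapses.
 * Returns 1 if readable, 0 on timeout, -1 on error (errno is set). */
int wait_readable(int fd, int timeout_ms)
{
    for (;;) {
        fd_set readfds;
        struct timeval tv;
        int ret;

        /* Rebuild the set and the timeout before every call. */
        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;

        ret = select(fd + 1, &readfds, NULL, NULL, &tv);
        if (ret >= 0)
            return ret > 0 ? 1 : 0;
        if (errno != EINTR)
            return -1;  /* EBADF, EINVAL, ENOMEM: report to the caller */
        /* EINTR: a signal was caught; reissue the call. */
    }
}
```

Passing a descriptor that has already been closed makes the call fail with EBADF, which the caller sees as a return value of -1.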
--------------------------------------------------------------------------------------------------------------
2 poll
The poll() system call is System V's multiplexed I/O solution. It solves several limitations of select(), although select() is still frequently used (mostly out of habit, or in the name of portability):
#include <sys/poll.h>
int poll(struct pollfd *fds, unsigned int nfds, int timeout);
Unlike select(), with its inefficient three bitmask-based sets of file descriptors, poll() uses a single array of nfds pollfd structures, pointed to by fds. The pollfd structure is defined as follows:
#include <sys/poll.h>
struct pollfd {
    int fd;         /* file descriptor */
    short events;   /* requested events to watch */
    short revents;  /* returned events witnessed */
};
Each pollfd structure specifies a single file descriptor to watch. Multiple structures can be passed, instructing poll() to watch multiple file descriptors. The events field of each structure is a bitmask of the events to watch for on that file descriptor; the user sets this field. The revents field is a bitmask of the events that were witnessed on the file descriptor; the kernel sets this field on return. Any event requested in the events field may be returned in the revents field. Valid events are as follows:
POLLIN
There is data to read.
POLLRDNORM
There is normal data to read.
POLLRDBAND
There is priority data to read.
POLLPRI
There is urgent data to read.
POLLOUT
Writing will not block.
POLLWRNORM
Writing normal data will not block.
POLLWRBAND
Writing priority data will not block.
POLLMSG
A SIGPOLL message is available.
In addition, the following events may be returned in the revents field:
POLLERR
An error occurred on the given file descriptor.
POLLHUP
A hang-up occurred on the given file descriptor.
POLLNVAL
The given file descriptor is invalid.
These events have no meaning in the events field, because they are always returned in revents when applicable. Unlike with select(), you do not need to request exception reporting explicitly with poll().
POLLIN | POLLPRI is equivalent to select()'s read event, and POLLOUT | POLLWRBAND is equivalent to select()'s write event. POLLIN is equivalent to POLLRDNORM | POLLRDBAND, and POLLOUT is equivalent to POLLWRNORM.
For example, to watch a file descriptor for both readability and writability, we set events to POLLIN | POLLOUT. When poll() returns, we check the flags in revents of the structure corresponding to that file descriptor. If POLLIN is set, the file descriptor can be read without blocking. If POLLOUT is set, the file descriptor can be written without blocking. The flags are not mutually exclusive: both may be set, indicating that both reads and writes on that file descriptor will return without blocking.
The timeout parameter specifies how long to wait, in milliseconds, before poll() returns regardless of whether any I/O is ready. A negative value denotes an infinite timeout. A value of 0 makes the call return immediately, listing any file descriptors with ready I/O but not waiting for further events. In that case poll() lives up to its name: it polls once and returns at once.
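The use of events and revents can be sketched as follows. This is a minimal illustration, not a complete program; the helper name poll_one is hypothetical, and the behaviour noted for closed descriptors is what Linux does.

```c
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: ask poll() about a single descriptor and return the
 * revents bitmask, or -1 if poll() itself fails. A negative timeout_ms
 * waits forever; 0 returns immediately. */
int poll_one(int fd, short events, int timeout_ms)
{
    struct pollfd pfd;

    pfd.fd = fd;
    pfd.events = events;   /* what the caller wants to watch */
    pfd.revents = 0;       /* filled in by the kernel on return */

    if (poll(&pfd, 1, timeout_ms) < 0)
        return -1;
    return pfd.revents;
}
```

A caller would test the result with expressions such as poll_one(fd, POLLIN | POLLOUT, 0) & POLLIN. On Linux, a descriptor that has been closed shows up as POLLNVAL in the returned mask even though POLLNVAL was never requested.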
Return value and error codes
On success, poll() returns the number of file descriptors whose structures have a nonzero revents field. If the timeout expires before any event occurs, poll() returns 0. On failure, poll() returns -1 and sets errno to one of the following values:
EBADF
An invalid file descriptor was given in one or more of the structures.
EFAULT
The pointer fds points outside the address space of the process.
EINTR
A signal occurred before any requested event; the call can be reissued.
EINVAL
The nfds parameter exceeds the RLIMIT_NOFILE value.
ENOMEM
Insufficient memory was available to complete the request.
The above content is from O'Reilly, Linux System Programming: Talking Directly to the Kernel and C Library, 2007.
--------------------------------------------------------------------------------------------------------------
3 epoll
What is epoll? It is a method introduced in the 2.6 kernel for improving I/O performance. In the words of the man page, it is a poll improved to handle large numbers of handles.
Using epoll takes only three system calls: epoll_create(2), epoll_ctl(2), and epoll_wait(2). The only complication is that epoll can work in two modes: LT and ET.
LT (level-triggered) is the default mode and supports both blocking and non-blocking sockets. In this mode the kernel tells you whether a file descriptor is ready, and you can then perform I/O on that ready fd. If you do nothing, the kernel keeps notifying you, so programming errors are less likely in this mode. The traditional select/poll calls are representative of this model.
ET (edge-triggered) is the high-speed mode and supports only non-blocking sockets. In this mode the kernel tells you through epoll when a descriptor changes from not ready to ready. It then assumes that you know the file descriptor is ready and sends no further readiness notifications for it until you do something that makes it not ready again (for example, a send, receive, or accept returns EWOULDBLOCK, or transfers less data than requested). Note that if you never perform I/O on the fd, so that it never becomes not ready again, the kernel sends no more notifications (only once). For TCP, however, the speed-up from ET mode still needs to be confirmed by more benchmarks.
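The difference between the two modes can be observed directly. The following is a minimal sketch, assuming Linux; count_notifications and drain are hypothetical helper names. The first function leaves one unread byte in a pipe and counts how many successive epoll_wait() calls report it; the second shows the read-until-EAGAIN loop that ET mode calls for.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Write one byte into a pipe, register the read end with epoll, then call
 * epoll_wait() `rounds` times WITHOUT reading the byte. Returns how many of
 * those calls reported the descriptor as ready, or -1 on setup failure. */
int count_notifications(int edge_triggered, int rounds)
{
    int p[2], epfd, i, hits = 0;
    struct epoll_event ev, out;

    if (pipe(p) < 0)
        return -1;
    epfd = epoll_create(1);
    if (epfd < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    ev.events = EPOLLIN | (edge_triggered ? EPOLLET : 0);
    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) < 0 ||
        write(p[1], "x", 1) != 1) {
        hits = -1;
    } else {
        for (i = 0; i < rounds; i++)
            if (epoll_wait(epfd, &out, 1, 0) == 1)
                hits++;
    }

    close(epfd);
    close(p[0]);
    close(p[1]);
    return hits;
}

/* In ET mode, read until EAGAIN so that no buffered data is left behind.
 * fd must be non-blocking. Returns bytes consumed, or -1 on a real error. */
long drain(int fd)
{
    char buf[4096];
    long total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            total += n;
            continue;
        }
        if (n == 0)
            return total;                       /* end of file */
        if (errno == EINTR)
            continue;                           /* interrupted: retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;                       /* nothing left for now */
        return -1;
    }
}
```

In LT mode every call keeps reporting the unread byte, while in ET mode only the first call does, which is why an ET handler should drain the descriptor before waiting again.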
(1) Function introduction
Epoll differs from select/poll in that it consists of a group of system calls:
int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
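As a minimal sketch of how the three calls fit together (assuming Linux; watch_all and next_ready are hypothetical helper names), descriptors are registered once and can then be waited on any number of times without being passed in again:

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Register every descriptor in fds[0..n) for readability on a new epoll
 * instance. Returns the epoll descriptor, or -1 on failure. */
int watch_all(const int *fds, int n)
{
    int i;
    int epfd = epoll_create(1);  /* the size argument only has to be positive */

    if (epfd < 0)
        return -1;

    for (i = 0; i < n; i++) {
        struct epoll_event ev;

        ev.events = EPOLLIN;     /* level-triggered by default */
        ev.data.fd = fds[i];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev) < 0) {
            close(epfd);
            return -1;
        }
    }
    return epfd;
}

/* Wait up to timeout_ms for one ready descriptor.
 * Returns that descriptor, or -1 on timeout or error. */
int next_ready(int epfd, int timeout_ms)
{
    struct epoll_event ev;
    int n = epoll_wait(epfd, &ev, 1, timeout_ms);

    return n == 1 ? ev.data.fd : -1;
}
```

Only next_ready() needs to run in the event loop; the set of watched descriptors lives in the kernel and is changed only through further epoll_ctl() calls.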
The epoll system calls were introduced in Linux 2.5.44. Their design departs considerably from the traditional select/poll calls in order to address the shortcomings of the latter. The shortcomings of select/poll are:
1. The parameters must be copied from user space again on every call.
2. All file descriptors are scanned again on every call.
3. At the start of every call, the current process is put on the wait queue of every file descriptor; when the call finishes, the process is removed from every wait queue again.
In practice, select/poll may monitor a large number of file descriptors, and if only a few of them are reported on each call, select/poll is inefficient. The idea behind epoll is to split one select/poll operation into one epoll_create, several epoll_ctl calls, and one epoll_wait. In addition, the kernel adds a file system, "eventpollfs", for epoll operations. Each group of one or more monitored file descriptors has a corresponding inode in the eventpollfs file system, whose main information is stored in the eventpoll struct. The important information about each monitored file is stored in an epitem struct, so the two are in a one-to-many relationship. Because the user-space information has already been saved in kernel space when epoll_create and epoll_ctl run, calling epoll_wait repeatedly afterwards does not copy the parameters again, does not rescan the file descriptors, and does not repeatedly add the current process to and remove it from the wait queues. The three shortcomings above are thereby avoided. Next, let's look at the implementation:
/* Wrapper struct used by poll queueing */
struct ep_pqueue {
    poll_table pt;
    struct epitem *epi;
};
(2) Key structs
This struct is similar to struct poll_wqueues in select/poll. Because epoll has to keep a large amount of information in kernel space, a pointer to a callback function alone is not enough, so a new struct, epitem, is introduced here.
/*
 * Each file descriptor added to the eventpoll interface will
 * have an entry of this type linked to the hash.
 */
struct epitem
{
    struct rb_node rbn;         // red-black tree node, used to store this item in eventpoll
    struct list_head rdllink;   // doubly linked list node, used to link this item into eventpoll's list of completed items
    struct epoll_filefd ffd;    // information about the monitored file descriptor this item refers to
    int nwait;                  // number of active wait queues attached to poll operations
    // Doubly linked list holding the wait queues of the monitored file; similar in function to poll_table in select/poll
    struct list_head pwqlist;
    struct eventpoll *ep;       // points to the eventpoll; multiple epitems correspond to one eventpoll
    struct epoll_event event;   // records the requested events and the corresponding fd
    atomic_t usecnt;            // reference count
    // Doubly linked list node used to link this item to the struct file of the monitored descriptor.
    // The file's f_ep_link list stores all epoll nodes that monitor the file.
    struct list_head fllink;
    struct list_head txlink;    // doubly linked list node used for the transfer queue
    unsigned int revents;       // state of the file descriptor; used during the collection and transfer of events to pin an item's empty event set
};
This struct is used to store the file descriptors associated with an epoll node. They are stored in a lookup table implemented as a red-black tree; why they need to be stored is explained below. There is one epitem for each monitored file descriptor.
struct eventpoll
{
    rwlock_t lock;                // read/write lock
    struct rw_semaphore sem;      // read/write semaphore
    wait_queue_head_t wq;         // wait queue used by sys_epoll_wait()
    wait_queue_head_t poll_wait;  // wait queue used by file->poll()
    struct list_head rdllist;     // list of events whose operations have completed
    struct rb_root rbr;           // stores the file descriptors monitored by epoll
};
This struct stores the extended information of the epoll file descriptor and is kept in the private_data field of the file struct. There is one eventpoll for each epoll file node. An epoll file node usually corresponds to several monitored file descriptors, so one eventpoll struct corresponds to several epitem structs. So where are the events being waited on stored in epoll? See the following:
// Wait structure used by the poll hooks
struct eppoll_entry {
    struct list_head llink;    // list header used to link this structure to the "struct epitem"
    void *base;                // the "base" pointer is set to the container "struct epitem"
    wait_queue_t wait;         // wait queue item that will be linked to the target file wait queue head
    wait_queue_head_t *whead;  // the wait queue head that the "wait" wait queue item is linked to
};
The struct that epoll uses for a wait queue node differs slightly from the struct poll_table_entry used by select/poll. For comparison, here is struct poll_table_entry:
struct poll_table_entry {
    struct file *filp;
    wait_queue_t wait;
    wait_queue_head_t *wait_address;
};
Because each epitem corresponds to one monitored file, the base pointer gives convenient access to the information about the monitored file. And because a file may have several events, llink is used to link those events together.
(3) epoll_create implementation
epoll_create() creates an inode in the eventpollfs file system. This is done by ep_getfd(), which first calls ep_eventpoll_inode() to create an inode, then calls d_alloc() to allocate a dentry for the inode, and finally associates the file, the dentry, and the inode with one another. After ep_getfd() has run, ep_file_init() is called. It allocates the eventpoll struct and stores a pointer to it in the file struct, so that the eventpoll is associated with the file struct.
Note that the size parameter of epoll_create() is only a hint. As long as it is greater than 0, it does not limit the number of file descriptors that can be associated with the epoll inode.
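The behaviour of the size parameter can be checked with a short experiment. This is a minimal sketch, assuming Linux; the helper name try_epoll_create is hypothetical.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: call epoll_create(size) and release the descriptor.
 * Returns 0 on success, otherwise the errno value of the failure. */
int try_epoll_create(int size)
{
    int epfd = epoll_create(size);

    if (epfd < 0)
        return errno;
    close(epfd);
    return 0;
}
```

Any positive size succeeds, whether it is 1 or 65535, while zero and negative values fail with EINVAL.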
(4) epoll_ctl implementation
epoll_ctl implements a series of operations, such as associating a file with an inode of the eventpollfs file system. This is where the eventpoll struct comes in. It is stored in file->private_data and records the important information about the eventpollfs inode. Its member rbr stores all file descriptors monitored by this epoll file node, organized as a red-black tree, which makes node lookup very efficient. epoll_ctl first calls ep_find() to look up the epitem struct in the red-black tree of the eventpoll, and then chooses an operation according to the op parameter. If op is EPOLL_CTL_ADD, the epitem will normally not be found in the red-black tree, so ep_insert() is called to create an epitem struct and insert it into the tree. ep_insert() first allocates an epitem object, initializes it, and places it in the red-black tree. It also puts the current process on the wait queue of the corresponding file. This step is done by the following code:
init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
...
revents = tfile->f_op->poll(tfile, &epq.pt);
The function first calls init_poll_funcptr to register the callback ep_ptable_queue_proc, which is executed when f_op->poll is called. The callback allocates an epoll wait queue node, eppoll_entry, and links it both to the wait queue of the file and to the queue of the epitem. It also registers ep_poll_callback as the wait queue callback. When the file operation completes, and before the current process is woken up, ep_poll_callback() is called; it places the epitem on the completed queue of the eventpoll and wakes up the waiting process. If, after f_op->poll has run, the operation on the monitored file turns out to have completed already, the item is placed on the completed queue and the processes waiting on it are woken up immediately.
(5) epoll_wait implementation
epoll_wait waits for file operations to complete and returns them.
Its main body is ep_poll(). In a for loop, this function checks whether the completed queue holds any completed events. If it does, the results are returned. If not, it calls schedule_timeout() to sleep until the process is woken up again or the timeout expires.
(6) Performance analysis
The epoll mechanism was designed to address the defects of select/poll. Through the newly introduced eventpollfs file system, epoll copies the parameters into kernel space once and does not copy them again on every poll. Splitting the operation into epoll_create, epoll_ctl, and epoll_wait avoids traversing all monitored file descriptors repeatedly. In addition, once the process calling epoll has been woken up, it only has to take the completed events from the completed queue, so the complexity of finding completed events drops from O(N) to O(1). There is, however, a precondition for epoll to improve performance: there must be many file descriptors to monitor, and only a few of them may complete on each call. Whether epoll improves efficiency significantly therefore depends on the actual application scenario and requires further testing.
(7) epoll example
The following code is by the BBS user safedead (http://bbs.chinaunix.net/viewpro.php?uid=407631):
static int s_epfd; // epoll descriptor
{
    // Initialize epoll
    struct epoll_event ev;
    // Create the epoll instance
    s_epfd = epoll_create(65535);
    { // This block can be repeated to add multiple listening sockets to the epoll event set
        // Create a server listening socket
        rc = listen(); // the listen parameters are omitted here; rc stands for the listening socket descriptor
        // Add it to the epoll event set
        ev.events = EPOLLIN;
        ev.data.fd = rc;
        if (epoll_ctl(s_epfd, EPOLL_CTL_ADD, rc, &ev) < 0) {
            fprintf(stderr, "epoll set insertion error: fd=%d\n", rc);
            return (-1);
        }
    }
}
{ // Epoll event handling
    int i, nfds, sock_new;
    struct epoll_event events[16384];
    for (;;) {
        // Wait for epoll events
        nfds = epoll_wait(s_epfd, events, 16384, -1);
        // Process epoll events
        for (i = 0; i < nfds; i++) {
            // events[i].data.fd is the socket reported by the epoll event
            // Accept the connection
            sock_new = accept(events[i].data.fd); // the other accept parameters are omitted here
            if (0 > sock_new) {
                fprintf(stderr, "failed to receive client connection\n");
                continue;
            }
        }
    }
}
--------------------------------------------------------------------------------------------------------------
Why does select fall behind?
First, in the Linux kernel the fd_set used by select is limited: the parameter __FD_SETSIZE defines the number of handles in each fd_set. In the 2.6.15-25-386 kernel that I use, the value is 1024. Searching the kernel source gives:
include/linux/posix_types.h:#define __FD_SETSIZE 1024
In other words, select cannot watch 1025 handles for readability at the same time, nor can it watch 1025 handles for writability at the same time.
Second, in the kernel select is implemented by polling: every check traverses all the handles in the fd_set. The execution time of select is therefore proportional to the number of handles in the fd_set; the more handles select has to check, the longer it takes.
Advantages of epoll:
1. It lets one process open a large number of socket descriptors (fds)
The most intolerable thing about select is that the number of fds a process can use with it is limited by FD_SETSIZE, whose default value is 1024. For an IM server that has to support tens of thousands of connections, this is clearly too few. One option is to modify this macro and recompile the kernel, although reports also indicate that this reduces network efficiency. A second option is a multi-process design (the traditional Apache approach). However, although creating a process is relatively cheap on Linux, the cost is not negligible, and data synchronization between processes is far less efficient than synchronization between threads, so this is not a perfect solution either. Epoll has no such limit. The fd limit it supports is the maximum number of open files, which is generally far greater than 1024. On a machine with 1 GB of memory, for example, it is about 100,000. The exact number can be checked with cat /proc/sys/fs/file-max; in general it depends strongly on the amount of system memory.
2. I/O efficiency does not decline linearly as the number of fds grows
Another fatal weakness of traditional select/poll shows up when you have a large set of sockets of which, because of network latency, only some are "active" at any given time. Every select/poll call still scans the whole set linearly, so efficiency declines linearly. Epoll does not have this problem, because it only acts on "active" sockets: in the kernel, epoll is implemented through a callback function on each fd, and only an "active" socket invokes its callback, while idle sockets do not. In this respect epoll implements a "pseudo" AIO, because the driving force lies in the OS kernel. In some benchmarks where nearly all sockets are active, for example in a high-speed LAN environment, epoll is no more efficient than select/poll; on the contrary, heavy use of epoll_ctl makes it slightly less efficient. But once idle connections are used to simulate a WAN environment, epoll is far more efficient than select/poll.
3. It uses mmap to accelerate message passing between the kernel and user space
This point concerns the concrete implementation of epoll. Whether select, poll, or epoll is used, the kernel has to pass fd notifications to user space, so avoiding unnecessary memory copies is important. Epoll achieves this by having the kernel and user space mmap the same memory. If, like me, you have followed epoll since the 2.5 kernel, you will not have forgotten the manual mmap step.
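The descriptor limits mentioned in point 1 can also be inspected from a program. The following is a minimal sketch, assuming Linux; the three helper names are hypothetical. It reads the per-process limit with getrlimit(), reads the system-wide value from /proc/sys/fs/file-max, and checks whether a descriptor would still fit in an fd_set.

```c
#include <stdio.h>
#include <sys/resource.h>
#include <sys/select.h>

/* Soft limit on open descriptors for this process, or -1 on error. */
long open_file_soft_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
        return -1;
    return (long)rl.rlim_cur;
}

/* System-wide maximum from /proc/sys/fs/file-max, or -1 if unreadable. */
long system_file_max(void)
{
    long value = -1;
    FILE *f = fopen("/proc/sys/fs/file-max", "r");

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Returns 1 if fd can be stored in an fd_set for select(), 0 otherwise. */
int fits_in_fd_set(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}
```

A server that expects more descriptors than FD_SETSIZE allows can use a check like fits_in_fd_set() to detect the point at which select() stops being usable, whereas epoll is bounded only by the open-file limits.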