Epoll in Linux

Source: Internet
Author: User

 

In Linux network programming, select was used for event notification for a long time. Newer Linux kernels provide a mechanism that replaces it: epoll.
Unlike select, epoll does not lose efficiency as the number of monitored FDs grows. The kernel implementation of select polls every FD, so the more FDs are monitored, the longer each call takes. In addition, the header file /usr/include/linux/posix_types.h contains the following definition:
#define __FD_SETSIZE 1024
It means that select can monitor at most 1024 FDs at the same time. You can modify the header file and recompile the kernel to raise this limit, but that does not cure the underlying problem.
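
The limit can be checked from user space. The following is a minimal sketch, assuming a glibc-style <sys/select.h> that exposes the limit as FD_SETSIZE; the helper names fits_in_fd_set and print_select_limit are hypothetical and only serve the illustration.

```c
#include <stdio.h>
#include <sys/select.h>

/* Hypothetical helper: returns 1 if fd can be stored in an fd_set, else 0.
 * Passing a descriptor >= FD_SETSIZE to FD_SET() is undefined behavior,
 * so select()-based code has to reject such descriptors. */
int fits_in_fd_set(int fd)
{
    return fd >= 0 && fd < (int)FD_SETSIZE;
}

/* Hypothetical helper: prints the compile-time limit and returns it. */
int print_select_limit(void)
{
    printf("FD_SETSIZE = %d\n", (int)FD_SETSIZE);
    return (int)FD_SETSIZE;
}
```

A select()-based server would call fits_in_fd_set() on every accepted socket and refuse the connection when it returns 0; epoll has no comparable per-set limit.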

The epoll interface is very simple. It has three functions in total:
1. int epoll_create(int size);
Creates an epoll handle. size tells the kernel roughly how many FDs will be monitored (since Linux 2.6.8 the value is ignored, but it must still be greater than zero). This parameter is different from the first parameter of select(), which is the highest monitored FD plus 1. Note that the epoll handle itself occupies an FD once it is created. On Linux you can see this FD under /proc/<process ID>/fd/, so you must call close() when you are done with epoll; otherwise FDs may be exhausted.
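
As a rough sketch of the create/close pairing described above (the function name epoll_create_demo and the size value 256 are assumptions made for this example):

```c
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: creates an epoll instance, reports the FD it
 * occupies, and closes it again. Returns 0 on success, -1 on failure. */
int epoll_create_demo(void)
{
    /* The size argument is only a hint, but it must be greater than zero. */
    int epfd = epoll_create(256);
    if (epfd < 0) {
        perror("epoll_create");
        return -1;
    }

    /* The handle is an ordinary file descriptor. */
    printf("epoll instance uses fd %d\n", epfd);

    /* Close it when done, otherwise the process leaks one FD per instance. */
    if (close(epfd) < 0) {
        perror("close");
        return -1;
    }
    return 0;
}
```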

2. int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
The epoll event registration function. Unlike select(), which is told what kinds of events to monitor at the moment it waits, epoll registers the event types to monitor in advance. The first parameter is the return value of epoll_create(). The second parameter specifies the action and is one of three macros:
EPOLL_CTL_ADD: register a new fd with epfd;
EPOLL_CTL_MOD: modify the monitored events of an already registered fd;
EPOLL_CTL_DEL: remove an fd from epfd;
The third parameter is the fd to monitor, and the fourth parameter tells the kernel what to monitor. struct epoll_event is defined as follows:
struct epoll_event {
    __uint32_t events; /* epoll events */
    epoll_data_t data; /* user data variable */
};

events can be a combination of the following macros:
EPOLLIN: the file descriptor is readable (this includes a normal shutdown by the peer socket);
EPOLLOUT: the file descriptor is writable;
EPOLLPRI: the file descriptor has urgent data to read (this indicates that out-of-band data has arrived);
EPOLLERR: an error occurred on the file descriptor;
EPOLLHUP: the file descriptor was hung up;
EPOLLET: puts epoll into edge triggered mode for this descriptor, as opposed to level triggered mode.
EPOLLONESHOT: the event is reported only once. To keep monitoring the socket after the event has been delivered, you must re-arm it with epoll_ctl() and EPOLL_CTL_MOD.
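
The three epoll_ctl() actions can be sketched as follows. This is an illustration only: the helper ctl_fd and the function registration_demo are hypothetical names, and a pipe stands in for a real socket.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: issues one epoll_ctl() operation for fd. */
static int ctl_fd(int epfd, int op, int fd, uint32_t events)
{
    struct epoll_event ev;
    ev.events = events;   /* bit mask of EPOLL* flags */
    ev.data.fd = fd;      /* user data handed back by epoll_wait() */
    return epoll_ctl(epfd, op, fd, &ev);
}

/* Registers, modifies, and removes the read end of a pipe.
 * Returns 0 if every step succeeded, -1 otherwise. */
int registration_demo(void)
{
    int fds[2];
    int ok = 0;

    if (pipe(fds) < 0)
        return -1;

    int epfd = epoll_create(1);
    if (epfd >= 0) {
        ok = ctl_fd(epfd, EPOLL_CTL_ADD, fds[0], EPOLLIN) == 0
          && ctl_fd(epfd, EPOLL_CTL_MOD, fds[0], EPOLLIN | EPOLLET) == 0
          && ctl_fd(epfd, EPOLL_CTL_DEL, fds[0], 0) == 0;
        close(epfd);
    }
    close(fds[0]);
    close(fds[1]);
    return ok ? 0 : -1;
}
```

Passing a filled-in event structure to EPOLL_CTL_DEL is not needed on current kernels, but it is harmless and also works on very old ones.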

3. int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
Waits for events, similar to a select() call. The events parameter receives the set of ready events from the kernel. maxevents tells the kernel how many entries the events array can hold. A common rule of thumb is to keep maxevents no larger than the size passed to epoll_create(); the kernel itself only requires maxevents to be greater than zero. The timeout parameter is the timeout in milliseconds: 0 returns immediately, and -1 blocks indefinitely. The function returns the number of events to be processed. A return value of 0 means the call timed out, and -1 indicates an error.
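
A single epoll_wait() call and the handling of its result might look like the sketch below. MAX_EVENTS, wait_once, and wait_demo are hypothetical names chosen for this example, and a pipe is used so the code runs without any network setup.

```c
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 16   /* hypothetical batch size for this sketch */

/* Waits up to timeout_ms and prints every ready descriptor.
 * Returns the number of ready events, 0 on timeout, -1 on error. */
int wait_once(int epfd, int timeout_ms)
{
    struct epoll_event events[MAX_EVENTS];
    int n = epoll_wait(epfd, events, MAX_EVENTS, timeout_ms);
    if (n < 0) {
        perror("epoll_wait");
        return -1;
    }
    for (int i = 0; i < n; i++) {
        if (events[i].events & EPOLLIN)
            printf("fd %d is readable\n", events[i].data.fd);
        if (events[i].events & (EPOLLERR | EPOLLHUP))
            printf("fd %d has an error or hang-up\n", events[i].data.fd);
    }
    return n;
}

/* Returns 0 if an empty pipe times out and a filled pipe reports one event. */
int wait_demo(void)
{
    int fds[2];
    struct epoll_event ev;
    int idle = -1, ready = -1;

    if (pipe(fds) < 0)
        return -1;

    int epfd = epoll_create(1);
    if (epfd >= 0) {
        ev.events = EPOLLIN;
        ev.data.fd = fds[0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) == 0) {
            idle = wait_once(epfd, 0);            /* nothing written yet */
            if (write(fds[1], "x", 1) == 1)
                ready = wait_once(epfd, 100);     /* one byte is waiting */
        }
        close(epfd);
    }
    close(fds[0]);
    close(fds[1]);
    return (idle == 0 && ready == 1) ? 0 : -1;
}
```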

--------------------------------------------------------------------------------------------

ET and LT are described in detail below:

There are two types of epoll events:
Edge triggered (ET)
Level triggered (LT)

For example:
1. A file handle used to read data from a pipe (RFD) is added to the epoll descriptor.
2. 2 KB of data is written at the other end of the pipe.
3. epoll_wait(2) is called and returns RFD, indicating that it is ready for reading.
4. 1 KB of data is read.
5. epoll_wait(2) is called again......

Edge triggered working mode:
If the EPOLLET flag was used when RFD was added to the epoll descriptor in step 1, the call to epoll_wait(2) in step 5 will probably hang, even though the remaining data is still in the file's input buffer, while the sender may still be waiting for a response to the data it already sent. ET mode reports an event only when something changes on the monitored file handle. So in step 5 the caller may end up waiting for data that is already in the input buffer. In the example above, an event is generated on RFD because of the write in step 2, and that event is consumed in step 3. Because the read in step 4 does not empty the input buffer, the call to epoll_wait(2) in step 5 may block indefinitely. When epoll is used in ET mode, use non-blocking file handles, so that a blocking read or write on one handle does not starve the task that is handling multiple file descriptors. To avoid these pitfalls, it is best to use the epoll interface in ET mode as follows:
i. use non-blocking file handles;
ii. wait for an event (call epoll_wait) only after read(2) or write(2) returns EAGAIN. This does not mean every read() must loop until EAGAIN: for stream-oriented handles, when read() returns fewer bytes than requested you can conclude that the buffer is now empty and treat the read event as handled.
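
The five steps and the two recommendations can be replayed with a real pipe. The following is a sketch under stated assumptions: set_nonblocking, drain, and et_demo are hypothetical names, and the 100 ms timeout stands in for the indefinite blocking described above so the example terminates.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: switches fd to non-blocking mode (recommendation i). */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical helper: reads until EAGAIN (recommendation ii).
 * Returns the number of bytes drained, or -1 on a real error. */
static long drain(int fd)
{
    char buf[512];
    long total = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) { total += n; continue; }
        if (n == 0) return total;                       /* writer closed */
        if (errno == EINTR) continue;                   /* interrupted: retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK) return total;
        return -1;
    }
}

/* Replays steps 1 to 5 in ET mode. Returns the result of the second
 * epoll_wait() (-1 on setup failure); *left receives the number of
 * bytes that were still buffered at that point. */
int et_demo(long *left)
{
    int fds[2];
    char chunk[2048] = {0};
    char half[1024];
    struct epoll_event ev, out;
    int second = -1;

    *left = -1;
    if (pipe(fds) < 0)
        return -1;

    int epfd = epoll_create(1);
    if (epfd >= 0 && set_nonblocking(fds[0]) == 0) {
        ev.events = EPOLLIN | EPOLLET;                              /* step 1 */
        ev.data.fd = fds[0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) == 0 &&
            write(fds[1], chunk, sizeof(chunk)) == (ssize_t)sizeof(chunk) && /* step 2 */
            epoll_wait(epfd, &out, 1, 100) == 1 &&                  /* step 3 */
            read(fds[0], half, sizeof(half)) == (ssize_t)sizeof(half)) {     /* step 4 */
            second = epoll_wait(epfd, &out, 1, 100);                /* step 5 */
            *left = drain(fds[0]);
        }
    }
    if (epfd >= 0)
        close(epfd);
    close(fds[0]);
    close(fds[1]);
    return second;
}
```

With this setup the second epoll_wait() is expected to time out even though 1 KB is still buffered, which is why the handler should call something like drain() before waiting again.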

Level triggered working mode:
By contrast, when the epoll interface is used in LT mode it is simply a faster poll(2) and can be used wherever poll(2) is used, since it has the same semantics. Even in ET mode, multiple events can be generated when data arrives in multiple chunks, so the caller can set the EPOLLONESHOT flag. After epoll_wait(2) delivers an event, epoll then disables the associated file handle in the epoll descriptor. Therefore, when EPOLLONESHOT is set, it is the caller's responsibility to re-arm the file handle using epoll_ctl(2) with EPOLL_CTL_MOD.
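
The re-arming step can be illustrated with a short sketch. The function name oneshot_demo is hypothetical, a pipe replaces the socket, and the 100 ms timeouts only keep the example from blocking.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Returns 0 if the observed sequence is: event delivered, no event while
 * disabled, event delivered again after re-arming. Returns -1 otherwise. */
int oneshot_demo(void)
{
    int fds[2];
    struct epoll_event ev, out;
    int result = -1;

    if (pipe(fds) < 0)
        return -1;

    int epfd = epoll_create(1);
    if (epfd >= 0) {
        ev.events = EPOLLIN | EPOLLONESHOT;
        ev.data.fd = fds[0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) == 0 &&
            write(fds[1], "x", 1) == 1) {
            int first = epoll_wait(epfd, &out, 1, 100);  /* event is delivered */
            int muted = epoll_wait(epfd, &out, 1, 100);  /* handle is now disabled */
            /* Re-arm: the event mask has to be supplied again. */
            int rearm = epoll_ctl(epfd, EPOLL_CTL_MOD, fds[0], &ev);
            int again = epoll_wait(epfd, &out, 1, 100);  /* unread byte is reported */
            if (first == 1 && muted == 0 && rearm == 0 && again == 1)
                result = 0;
        }
        close(epfd);
    }
    close(fds[0]);
    close(fds[1]);
    return result;
}
```

In a multi-threaded server the worker that handled the event would normally issue the EPOLL_CTL_MOD call once it has finished with the handle, so that no second thread is woken for the same handle in the meantime.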

ET and LT explained in more detail:

LT (level triggered) is the default mode and supports both blocking and non-blocking sockets. In this mode the kernel tells you whether a file descriptor is ready, and you can then perform I/O on the ready FD. If you do nothing, the kernel keeps notifying you, so programming errors are less likely in this mode. The traditional select/poll calls are representative of this model.
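
The repeated notification in LT mode can be observed with the pipe example from earlier. This is a minimal sketch; lt_demo is a hypothetical name and the 100 ms timeout is an arbitrary choice.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Writes 2 KB into a pipe, reads 1 KB, and waits again in LT mode.
 * Returns the result of the second epoll_wait(), or -1 on setup failure. */
int lt_demo(void)
{
    int fds[2];
    char chunk[2048] = {0};
    char half[1024];
    struct epoll_event ev, out;
    int second = -1;

    if (pipe(fds) < 0)
        return -1;

    int epfd = epoll_create(1);
    if (epfd >= 0) {
        ev.events = EPOLLIN;              /* no EPOLLET: level triggered */
        ev.data.fd = fds[0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) == 0 &&
            write(fds[1], chunk, sizeof(chunk)) == (ssize_t)sizeof(chunk) &&
            epoll_wait(epfd, &out, 1, 100) == 1 &&
            read(fds[0], half, sizeof(half)) == (ssize_t)sizeof(half)) {
            /* 1 KB is still buffered, so the kernel reports the FD again. */
            second = epoll_wait(epfd, &out, 1, 100);
        }
        close(epfd);
    }
    close(fds[0]);
    close(fds[1]);
    return second;
}
```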

ET (edge triggered) is the high-speed mode and only supports non-blocking sockets. In this mode the kernel notifies you through epoll when a descriptor changes from not ready to ready. It then assumes you know the descriptor is ready and sends no further readiness notifications for it until you do something that makes the descriptor no longer ready (for example, a send, receive, or accept call transfers less than the requested amount or fails with EWOULDBLOCK). Note that if you never perform I/O on this FD, so that it never becomes not ready again, the kernel will not send further notifications (only once). Whether ET mode actually provides a speedup for TCP still needs more benchmark confirmation.

Many tests show that when there are not many idle or dead connections, epoll is not much more efficient than select/poll. When there are many idle connections (for example, the many slow connections of a WAN environment), epoll is much more efficient than select/poll. (Not tested)

In addition, when epoll works in ET mode and an EPOLLIN event is generated, keep in mind when reading that if recv() returns exactly the requested size, the buffer very likely still holds unread data. That means the event has not been fully handled, so you need to read again:
while (rs)
{
    buflen = recv(activeevents[i].data.fd, buf, sizeof(buf), 0);
    if (buflen < 0)
    {
        // In non-blocking mode, errno == EAGAIN means the buffer currently has no more data to read.
        // Treat the event as handled.
        if (errno == EAGAIN)
            break;
        else
            return;
    }
    else if (buflen == 0)
    {
        // The socket at the peer end was closed normally.
    }
    if ((size_t)buflen == sizeof(buf))
        rs = 1; // the buffer may hold more data: read again
    else
        rs = 0;
}

In addition, if the sender produces data faster than the receiver consumes it (that is, the program using epoll reads faster than the socket it forwards to can take the data), then even when send() returns, the data has only been copied to the send buffer and has not necessarily reached the receiver. Eventually the buffer fills up and send() fails with EAGAIN (see man send), and the data passed to that call is not sent. So send() needs to be wrapped in a socket_send() function that handles this case. The function tries to write all the data before returning and returns -1 on error. Inside socket_send(), when the write buffer is full (send() returns -1 and errno is EAGAIN), it waits and then retries. This approach is not perfect: in theory it may block inside socket_send() for a long time, but there is no better way.

#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

ssize_t socket_send(int sockfd, const char *buffer, size_t buflen)
{
    ssize_t tmp;
    size_t total = buflen;
    const char *p = buffer;

    while (1)
    {
        tmp = send(sockfd, p, total, 0);
        if (tmp < 0)
        {
            // When send() is interrupted by a signal it would be possible to
            // continue writing, but here -1 is returned.
            if (errno == EINTR)
                return -1;

            // On a non-blocking socket this error means the write buffer queue is full;
            // retry after a delay.
            if (errno == EAGAIN)
            {
                usleep(1000);
                continue;
            }

            return -1;
        }

        // Everything that was left has been written.
        if ((size_t)tmp == total)
            return (ssize_t)buflen;

        // Partial write: advance past the bytes already sent and try again.
        total -= tmp;
        p += tmp;
    }

    return tmp; // not reached
}
