0 Preface
When learning network programming, we always start with the simplest server program:
socket -> bind -> listen -> accept -> read -> write -> return
The next step is to learn how to handle concurrent requests from clients. The main approaches are: using a multi-threaded/multi-process model, using an I/O multiplexing model, or combining multithreading with I/O multiplexing.
When using the I/O multiplexing model, we usually start with the select system call. However, we often hear that select is too inefficient, and that large projects never use select/poll but epoll instead.
So why is the select system call inefficient? And what exactly does epoll improve to make it so much more efficient?
This article tries to answer, by tracing the kernel source of select, why select is not suitable for large projects with high concurrency.
1 Blocking Calls and Non-blocking Calls
When using multiplexing, the files being monitored are generally required to be in non-blocking mode. So let's first look at what blocking mode and non-blocking mode are.
Concept Description
When reading or writing a file [note], if the read/write cannot return immediately, the caller is put to sleep until the file becomes readable/writable. This is blocking mode.
Note: In Unix culture, everything is a file. Whether it is a regular disk file, a device, or a network socket, it appears as a file in the virtual file system. Users read and write files through the same read/write interface, regardless of how the file exists underneath. That is why in Linux you see all kinds of "files"; eventfd, for example, implements event notification through a file descriptor.
Take a TCP connection as an example: user-space read/write interacts with the TCP buffers rather than directly with the NIC driver. When there is no data in the receive buffer, a read cannot return immediately, so the calling process blocks until the peer sends data and the kernel copies it into the receive buffer. Similarly, when network latency is high or the receiver is slow and the sender's send buffer fills up, a write cannot place data into the send buffer (not even a single byte), and the calling process blocks.
Note that whether a call blocks depends on the type of file, in other words, on the device driver that backs the file.
For example, for normal disk files, read/write can essentially always return immediately, unless the disk fails. For network connections, pipes and similar file types, read/write may well block, and that is their normal mode of operation.
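As a quick user-space illustration (a minimal sketch; the open socket fd sock and the helper name try_read are assumptions for this example), a file can be switched to non-blocking mode with fcntl, after which a read that would otherwise sleep returns -1 with errno set to EAGAIN/EWOULDBLOCK:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Switch an already-open fd (e.g. a connected socket) to non-blocking mode
 * and attempt a read; error handling is trimmed for brevity. */
ssize_t try_read(int sock, char *buf, size_t len)
{
    int flags = fcntl(sock, F_GETFL, 0);
    fcntl(sock, F_SETFL, flags | O_NONBLOCK);   /* non-blocking mode */

    ssize_t n = read(sock, buf, len);
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        fprintf(stderr, "no data yet, would have blocked\n");
    return n;
}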
Implementation in the Kernel
The kernel implementation of a blocking call involves two things: how the blocked process enters the sleep state, and how the driver wakes the sleeping process.
Let's first be clear about where, and by whom, these two operations are implemented.
The process goes to sleep only when a call to read/write cannot proceed because some condition is not satisfied (for example, the receive buffer is empty during a read), so a conditional check has to be performed first. This work must be, and can only be, done by the driver: in the VFS, the read/write system calls eventually invoke the driver's internal implementation (the function pointers are stored in the struct file_operations of the struct file). As a result, the work of putting the process to sleep is implemented in the driver's read/write.
Similarly, the process is woken only when the relevant condition becomes true. When read blocks, it is because there is no data to read; so when does new data arrive? For a local file, it is when some process writes data (calls write); for a network connection, it is when the peer sends data over. Both of these wake-up actions are likewise performed in the driver's write path.
Therefore, the sleep and wake-up actions are implemented in the read and write methods of the device driver corresponding to the file.
Next, let's describe how the process sleeps and how it wakes up. At the core is a data structure called the wait queue.
Wait Queue
The wait queue is built on struct list_head and consists of a head node and a number of queue elements.
struct __wait_queue_head {
    spinlock_t lock;
    struct list_head task_list;
};
The header node contains a pointer to the first queue element and a spin lock that protects the queue: when inserting (add_wait_queue) new elements or deleting (remove_wait_queue) old elements, the spin lock is used to guarantee synchronization.
struct __wait_queue {
    unsigned int flags;
    void *private;
    wait_queue_func_t func;
    struct list_head task_list;
};
Each element in the wait queue contains a callback function func that is invoked when the element is woken up. The private pointer points to the task_struct of the process, so it is known which process to wake. The elements are linked into a doubly linked list through their list_head members.
Sleep
Calling any of the following functions puts the process to sleep on a wait queue.
wait_event(wq, condition);
wait_event_timeout(wq, condition, timeout);
wait_event_interruptible(wq, condition);
wait_event_interruptible_timeout(wq, condition, timeout);
The difference between the two pairs of functions is the state of the calling process while it sleeps: the former use the TASK_UNINTERRUPTIBLE state, the latter use the TASK_INTERRUPTIBLE state, in which the sleep can be interrupted by a signal.
Going to sleep is divided into four steps: check whether the condition is satisfied, and if it is, do not sleep; otherwise define a new wait_queue_t queue element and insert it into the wait queue wq; set the process state; and call schedule() to yield the CPU and let other processes run.
Note that there may be more than one element on a wait queue, meaning several processes are waiting at the same time. When the condition becomes true, all of these processes are woken up. However, the first process scheduled after the wake-up may "consume" the condition, so every woken process must check again whether the condition still holds once it runs.
This is also why a lot of code puts sleep operations in the while loop.
/* Wait for the event */
while (!condition) {
    /* If this is a non-blocking call, return immediately */
    if (filp->f_flags & O_NONBLOCK)
        return -EAGAIN;
    if (wait_event_interruptible(wq, condition)) {
        /* interrupted by a signal */
        return -ERESTARTSYS;
    }
}
/* Handle the event */
If the file's O_NONBLOCK flag is set, i.e. non-blocking mode, then when the condition is not satisfied the call returns -EAGAIN directly instead of going to sleep.
Wake Up
In contrast to the sleep functions, calling either of the following functions wakes the processes on a queue.
void wake_up(wait_queue_head_t *queue);
void wake_up_interruptible(wait_queue_head_t *queue);
These two functions wake the processes sleeping on a queue in the following steps: for each element in the queue, call its callback function func; if the callback returns a non-zero value, check the element's flags, and if WQ_FLAG_EXCLUSIVE is set, stop the traversal and do not wake any further processes; otherwise continue with the next element.
So what are the callback function and the flags of this wait queue element? The wait_queue_t structure defined inside wait_event has the following values:
wait_queue_t name = {
    .private   = current,
    .func      = autoremove_wake_function,
    .task_list = LIST_HEAD_INIT((name).task_list),
};
autoremove_wake_function calls default_wake_function to wake the sleeping process, removes the element from the wait queue's list, and returns 1 on success.
The flags field has the WQ_FLAG_EXCLUSIVE flag cleared in the prepare_to_wait function. The order of the function calls is as follows:
wait_event_interruptible -> DEFINE_WAIT -> prepare_to_wait -> schedule -> finish_wait
Among them, prepare_to_wait completes the first three steps of going to sleep.
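To make this call order concrete, here is a rough sketch of what wait_event_interruptible(wq, condition) expands to in older kernels (simplified, not the literal kernel macro; details vary across versions):

#define my_wait_event_interruptible(wq, condition)                         \
({                                                                         \
    int __ret = 0;                                                         \
    /* DEFINE_WAIT: .private = current, .func = autoremove_wake_function */\
    DEFINE_WAIT(__wait);                                                   \
    for (;;) {                                                             \
        /* insert __wait into the queue and set TASK_INTERRUPTIBLE */      \
        prepare_to_wait(&wq, &__wait, TASK_INTERRUPTIBLE);                 \
        if (condition)                                                     \
            break;                                                         \
        if (signal_pending(current)) {                                     \
            __ret = -ERESTARTSYS;        /* interrupted by a signal */     \
            break;                                                         \
        }                                                                  \
        schedule();                      /* yield the CPU */               \
    }                                                                      \
    finish_wait(&wq, &__wait);           /* back to TASK_RUNNING, dequeue */\
    __ret;                                                                 \
})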
Therefore, wake_up_interruptible and wake_up will wake up all the processes on the queue.
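Putting this section together, below is a minimal sketch of the pattern in a hypothetical character driver (the demo_* names are made up for illustration; locking and driver registration are omitted): read sleeps on a wait queue while there is no data and honors O_NONBLOCK, and write wakes the sleeping readers.

#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/uaccess.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(demo_readq);    /* wait queue for readers */
static char demo_buf[256];
static size_t demo_len;                        /* bytes currently available */

static ssize_t demo_read(struct file *filp, char __user *buf,
                         size_t count, loff_t *ppos)
{
    while (demo_len == 0) {                    /* condition: no data to read */
        if (filp->f_flags & O_NONBLOCK)
            return -EAGAIN;                    /* non-blocking: return at once */
        if (wait_event_interruptible(demo_readq, demo_len != 0))
            return -ERESTARTSYS;               /* interrupted by a signal */
    }
    count = min(count, demo_len);
    if (copy_to_user(buf, demo_buf, count))
        return -EFAULT;
    demo_len = 0;
    return count;
}

static ssize_t demo_write(struct file *filp, const char __user *buf,
                          size_t count, loff_t *ppos)
{
    count = min(count, sizeof(demo_buf));
    if (copy_from_user(demo_buf, buf, count))
        return -EFAULT;
    demo_len = count;
    wake_up_interruptible(&demo_readq);        /* wake every sleeping reader */
    return count;
}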
2 The select System Call
/*
 * @nfds: the largest fd to be monitored, plus 1
 * @readfds: set of fds to be monitored for readability
 * @writefds: set of fds to be monitored for writability
 * @exceptfds: set of fds to be monitored for exceptional conditions
 * @timeout: timeout setting; return after waiting the specified time
 * Return: the number of fds that satisfy their conditions; -1 on error, 0 on timeout
 */
int select(int nfds, fd_set *readfds, fd_set *writefds,
           fd_set *exceptfds, struct timeval *timeout);
On a normal return, readfds/writefds/exceptfds have bits set for the fds that satisfy the corresponding conditions.
Let's take a look at what fd_set really is.
#define __NFDBITS (8 * sizeof(unsigned long))
#define __FD_SETSIZE 1024
#define __FDSET_LONGS (__FD_SETSIZE / __NFDBITS)

typedef struct {
    unsigned long fds_bits[__FDSET_LONGS];
} __kernel_fd_set;
On x86 machines, select supports at most 1024 files (__FD_SETSIZE). Each file in the fd_set is represented by one bit, so __FDSET_LONGS long integers are needed, placed in an array.
The following set of functions is specifically used to manipulate fd_set:
void FD_SET(int fd, fd_set *set);
void FD_ZERO(fd_set *set);
void FD_CLR(int fd, fd_set *set);
int  FD_ISSET(int fd, fd_set *set);
Timeout Setting
If you do not want select to wait indefinitely, you can have it return 0 after a specified amount of time. The timeout is given by the timeout parameter, a timeval structure.
struct timeval {
    long tv_sec;    /* seconds */
    long tv_usec;   /* microseconds */
};
Inside the kernel, however, the timespec structure is used.
struct timespec {
    long tv_sec;    /* seconds */
    long tv_nsec;   /* nanoseconds */
};
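To tie this section together, here is a minimal user-space sketch of calling select on a single socket (the connected socket fd sock and the helper name wait_readable are assumptions for this example):

#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

/* Wait up to 5 seconds for sock to become readable, then read from it. */
int wait_readable(int sock)
{
    fd_set rset;
    struct timeval tv = { .tv_sec = 5, .tv_usec = 0 };

    FD_ZERO(&rset);
    FD_SET(sock, &rset);

    /* nfds is the largest monitored fd plus one */
    int ret = select(sock + 1, &rset, NULL, NULL, &tv);
    if (ret == -1) {
        perror("select");
        return -1;
    }
    if (ret == 0) {
        fprintf(stderr, "timeout\n");
        return 0;
    }
    if (FD_ISSET(sock, &rset)) {
        char buf[1024];
        ssize_t n = read(sock, buf, sizeof(buf));
        printf("read %zd bytes\n", n);
    }
    return ret;
}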
3 Implementation of select
Now let's look at how select is actually implemented.
Suppose we were going to design select ourselves; how would we do it?
3.1 A simple select implementation
Let's simplify the problem first and assume that select only monitors whether files are readable. A simple algorithm (pseudocode) looks like this:
count = 0
FD_ZERO(&res_rset)
for fd in read_set:
    if readable(fd):
        count++
        FD_SET(fd, &res_rset)
        break
    else:
        add_to_wait_queue

if count > 0:
    return count
else:
    wait_any_event
    return count
The pseudocode above only demonstrates the check for readability. The algorithm is simple: traverse all the files; if some file is readable, select does not need to block and can return directly, marking that file as readable; otherwise the calling process is added to the read wait queue of each file's device driver and goes to sleep.
This algorithm leaves a few problems unresolved: how to determine whether a file is currently readable (or writable), i.e. how to implement the readable(fd) function; and how to add the process to a driver's read wait queue, because if we simply call wait_event_interruptible, the process goes to sleep as soon as it encounters the first unreadable fd and can no longer monitor the remaining files.
Thinking about it carefully, both of these problems depend on the specific device driver. Therefore, to make select/poll feasible, Linux stipulates that the device driver of every file that supports select/poll monitoring must implement the poll function in its struct file_operations:
3.2 The poll file operation
/*
 * @filp: pointer to the open file
 * @p: the poll_table passed in
 * Return: a mask flagging the current state of the file, such as readable (POLLIN),
 *         writable (POLLOUT), error (POLLERR) or end of file (POLLHUP)
 */
unsigned int (*poll)(struct file *filp, struct poll_table_struct *p);
The poll function does two things: it judges the current state of the file and reports it in the return value, and it calls the poll_wait function on this driver's wait queue.
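As an illustration, a poll implementation for the hypothetical demo driver sketched in section 1 (demo_readq and demo_len come from that sketch; assumes <linux/poll.h>) might look like this:

#include <linux/poll.h>

static unsigned int demo_poll(struct file *filp, poll_table *wait)
{
    unsigned int mask = 0;

    poll_wait(filp, &demo_readq, wait);   /* register the caller on our wait queue */
    if (demo_len > 0)
        mask |= POLLIN | POLLRDNORM;      /* data available: file is readable */
    return mask;
}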
What's the use of the poll_wait function?
static inline void poll_wait(struct file *filp, wait_queue_head_t *wait_address, poll_table *p)
{
    if (p && wait_address)
        p->qproc(filp, wait_address, p);
}
It simply calls a function pointer stored in the poll_table. This poll_table is the one passed in when select calls the file's poll function.
So, back to the two questions raised at the beginning of this section: the first has been answered, and the second can be handled inside this p->qproc.
Let's see how select in the Linux kernel implements this function:
/* 1. The poll_table is embedded in a poll_wqueues structure */
struct poll_wqueues {
    poll_table pt;
    struct poll_table_page *table;
    struct task_struct *polling_task;   /* task_struct of the sleeping process */
    int triggered;
    int error;
    int inline_index;
    struct poll_table_entry inline_entries[N_INLINE_POLL_ENTRIES];
};

/* 2. do_select calls poll_initwait to initialize the poll_wqueues */
void poll_initwait(struct poll_wqueues *pwq)
{
    init_poll_funcptr(&pwq->pt, __pollwait);   /* initialize the poll_table */
    pwq->polling_task = current;
    pwq->triggered = 0;
    pwq->error = 0;
    pwq->table = NULL;
    pwq->inline_index = 0;
}

/* 3. p->qproc points to the __pollwait function */
static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
                       poll_table *p)
{
    struct poll_wqueues *pwq = container_of(p, struct poll_wqueues, pt);
    struct poll_table_entry *entry = poll_get_entry(pwq);

    if (!entry)
        return;
    get_file(filp);
    entry->filp = filp;
    entry->wait_address = wait_address;
    entry->key = p->key;
    init_waitqueue_func_entry(&entry->wait, pollwake);
    entry->wait.private = pwq;
    add_wait_queue(wait_address, &entry->wait);
}
__pollwait adds the process to the file's wait queue. select allocates a poll_table_entry structure for each file it monitors, used to manage that file; it contains the file pointer, the wait queue head and the wait queue element.
Memory Management for poll_table_entry
How does poll_wqueues allocate memory for new poll_table_entry objects?
It uses a combination of static and dynamic allocation.
If only a few files are monitored, the poll_table_entry objects are placed in an array embedded in the poll_wqueues structure, avoiding any extra memory allocation; if there are many files, additional poll_table_page objects are allocated.
Each poll_table_page occupies one page of memory, and all poll_table_pages are linked together into a list.
struct poll_table_page {
    struct poll_table_page *next;
    struct poll_table_entry *entry;
    struct poll_table_entry entries[0];
};
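For reference, older kernels implement exactly this allocation strategy in poll_get_entry, roughly as follows (a lightly trimmed sketch; details vary across kernel versions):

static struct poll_table_entry *poll_get_entry(struct poll_wqueues *p)
{
    struct poll_table_page *table = p->table;

    /* first use the inline array embedded in poll_wqueues */
    if (p->inline_index < N_INLINE_POLL_ENTRIES)
        return p->inline_entries + p->inline_index++;

    /* otherwise allocate page-sized poll_table_page chunks as needed */
    if (!table || POLL_TABLE_FULL(table)) {
        struct poll_table_page *new_table;

        new_table = (struct poll_table_page *) __get_free_page(GFP_KERNEL);
        if (!new_table) {
            p->error = -ENOMEM;
            return NULL;
        }
        new_table->entry = new_table->entries;
        new_table->next = table;
        p->table = new_table;
        table = new_table;
    }

    return table->entry++;
}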
The Wake-up Callback Function: pollwake
Notice that in __pollwait the wait_queue_t callback function pointer is set to pollwake, instead of the autoremove_wake_function set by wait_event_interruptible.
pollwake calls __pollwake, which sets pwq->triggered to 1 and then calls default_wake_function directly to wake the pwq->polling_task process, i.e. the process that went to sleep inside the select system call.
Therefore, whenever any of the files becomes readable/writable or raises an error, pollwake is invoked and the sleeping process resumes execution.
One Last Question
After the process is woken up, how does it know which file's driver woke it? We will find the answer in the kernel's actual implementation of select.
3.3 Implementation of select in the kernel
Tracing the kernel implementation of sys_select, the flow is roughly this: copy the user-space timeout into a kernel-space end_time and normalize the time value, then call core_sys_select, which works as follows:
Based on the largest fd value passed in, compute how many bytes are needed to hold all the fds (1 bit per fd), and decide whether to allocate that memory on the stack or on the heap. A total of 6 fd_sets are needed: the in, out and exception sets passed in by the user, plus the res_in, res_out and res_exception sets to be returned to the user. Copy the 3 input fd_sets from user space into kernel space, initialize the output fd_sets to 0, then call do_select and obtain its return value ret. do_select's job is to initialize the poll_wqueues object and invoke each driver's poll function, much like the simple select we wrote above. Its procedure is as follows:
1. Call poll_initwait to initialize the poll_wqueues object table, including its member poll_table; if the user passed in a timeout that is non-NULL but set to 0, set the poll_table pointer wait (i.e. &table.pt) to NULL.
2. OR the in, out and exception sets together to get all_bits, then iterate over the fds whose bit is 1 in all_bits and look up the file pointer filp in the process's fd table.
3. Set the key value of wait (POLLIN_SET, POLLOUT_SET, POLLEX_SET or a combination, depending on what the user asked for) and call filp->poll(filp, wait) to obtain the return value mask.
4. Check whether the file satisfies the requested condition according to mask; if it does, set the corresponding bit in res_in/res_out/res_exception, execute retval++, and set wait to NULL.
5. After every 32 files (actually the number of bits in a long integer), call cond_resched() once to voluntarily yield the CPU and let other processes run.
6. After all the files have been traversed, set wait to NULL and check whether any file satisfied its condition (whether retval is non-zero), whether the call has timed out, or whether there is a pending signal; if so, break out of the loop and go to step 7. Otherwise call poll_schedule_timeout to put the process to sleep until the timeout expires (if no timeout was set, schedule() is called directly). When the process resumes: if it resumed because of the timeout, pwq->triggered is 0; if it was woken by some file's driver, pwq->triggered has been set to 1 (see the pollwake callback above). The loop then starts over and every file is polled again, which is also how select finds out, after being woken, which files are now ready.
7. Finally, call poll_freewait to remove the process from the wait queues of all the files, free the allocated poll_table_page objects, and return retval.
After do_select returns, core_sys_select copies res_in, res_out and res_exception back into the user's in, out and exception sets and returns ret. Finally, sys_select calls poll_select_copy_remaining to copy the remaining timeout back to user space.
That is the whole flow of select. The logic is actually quite simple and closely resembles the simple select we sketched ourselves.
It relies on an important trick: passing NULL as the poll_table pointer when invoking a driver's poll function tells the driver not to add the process to its wait queue. In the flow above, this trick is used twice: if a non-NULL timeout of 0 is passed in, the process will not sleep on any file; and once any file turns out to be immediately readable/writable, the process will not be queued to sleep on the remaining files.
As you can see, this implementation is inefficient, for several reasons:
1. The number of files that can be monitored at once is limited to 1024. That is far too few for a server that must handle hundreds of thousands of concurrent requests.
2. Every call to select has to traverse the bits from 0 up to the largest fd, voluntarily rescheduling after every 32 fds (two context switches each time). Just imagine how slow that is if the fd to be monitored is 1000. Moreover, with many fds, if the small fds are always readable, the large fds may never get monitored.
3. Memory copying overhead: the fd_sets have to be copied back and forth between user space and kernel space, and a poll_table_entry object is allocated for every fd.
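To make these costs concrete, here is a sketch of a typical select-based echo loop (listen_fd is assumed to be a bound, listening socket; error handling is trimmed). Note how the fd_set is rebuilt and copied into the kernel on every iteration, and how every fd up to max_fd is scanned each time:

#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_CLIENTS 100   /* keep well below FD_SETSIZE (1024) */

void event_loop(int listen_fd)
{
    int clients[MAX_CLIENTS];
    int nclients = 0;

    for (;;) {
        fd_set rset;
        int max_fd = listen_fd;

        /* the set is rebuilt from scratch on every iteration ... */
        FD_ZERO(&rset);
        FD_SET(listen_fd, &rset);
        for (int i = 0; i < nclients; i++) {
            FD_SET(clients[i], &rset);
            if (clients[i] > max_fd)
                max_fd = clients[i];
        }

        /* ... and copied to the kernel, which scans fds 0..max_fd each call */
        if (select(max_fd + 1, &rset, NULL, NULL, NULL) <= 0)
            continue;

        if (FD_ISSET(listen_fd, &rset) && nclients < MAX_CLIENTS)
            clients[nclients++] = accept(listen_fd, NULL, NULL);

        for (int i = 0; i < nclients; i++) {
            if (FD_ISSET(clients[i], &rset)) {
                char buf[512];
                ssize_t n = read(clients[i], buf, sizeof(buf));
                if (n > 0) {
                    write(clients[i], buf, n);          /* echo back */
                } else {
                    close(clients[i]);
                    clients[i] = clients[--nclients];   /* drop this client */
                    i--;                                /* recheck swapped slot */
                }
            }
        }
    }
}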
4 Summary
Although select is inefficient and largely no longer used in large projects, its kernel implementation is simpler and easier to understand than epoll's.
Building on this, we can go on to study the implementation of epoll; comparing the two gives a better understanding of why epoll is so much more efficient. That is what I plan to look at next.