Deep analysis of the Linux select mechanism

Source: Internet
Author: User

As an implementation of I/O multiplexing, select raises the level of abstraction and batches the work: instead of blocking in the actual read or write system calls, as in the traditional approach, the process blocks in the select system call until one of the descriptors it cares about becomes ready. Of course, the better choice today is epoll; for example, epoll sits underneath Java NIO. This article aims only to explain how the select mechanism works. Without reading the source code, it is hard to really understand these I/O multiplexing techniques, and interview experience has taught me that knowledge which is never put into practice stays superficial.

Interview question: Can the maximum descriptor limit of select be changed? (To be investigated further.)

User-space API:
    /* According to POSIX.1-2001 */
    #include <sys/select.h>

    /* According to earlier standards */
    #include <sys/time.h>
    #include <sys/types.h>
    #include <unistd.h>

    int select(int nfds, fd_set *readfds, fd_set *writefds,
               fd_set *exceptfds, struct timeval *timeout);

    void FD_CLR(int fd, fd_set *set);
    int  FD_ISSET(int fd, fd_set *set);
    void FD_SET(int fd, fd_set *set);
    void FD_ZERO(fd_set *set);

    #include <sys/select.h>

    int pselect(int nfds, fd_set *readfds, fd_set *writefds,
                fd_set *exceptfds, const struct timespec *timeout,
                const sigset_t *sigmask);
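As a minimal sketch of how these calls fit together, the helper below waits until a single descriptor becomes readable. The name wait_readable and its millisecond parameter are illustrative assumptions, not part of the API.

```c
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error.
 */
int wait_readable(int fd, long timeout_ms)
{
    fd_set readfds;
    struct timeval tv;
    int ready;

    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;              /* an fd_set cannot represent this descriptor */

    FD_ZERO(&readfds);          /* the sets must be rebuilt before every call */
    FD_SET(fd, &readfds);

    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* nfds is the highest descriptor number plus one */
    ready = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (ready < 0)
        return -1;
    return (ready > 0 && FD_ISSET(fd, &readfds)) ? 1 : 0;
}
```

A caller would typically invoke this on one end of a pipe or socket before issuing the actual read.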

Note: The API behavior has changed here (see UNPv1, p. 127): the call is allowed to update the timeout value, and the kernel code reflects this.
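The effect of this note can be observed from user space. The following is a sketch only, assuming Linux behavior; POSIX leaves the contents of the timeout unspecified after select returns, so portable code should reinitialize it before every call. The helper name remaining_after_select is hypothetical.

```c
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Hypothetical helper: call select() on fd with a timeout of `secs` seconds
 * and return, in microseconds, whatever the call left in the timeval.
 * Returns -1 if select() itself fails.
 */
long remaining_after_select(int fd, long secs)
{
    fd_set readfds;
    struct timeval tv;

    tv.tv_sec = secs;
    tv.tv_usec = 0;

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);

    if (select(fd + 1, &readfds, NULL, NULL, &tv) < 0)
        return -1;

    /* On Linux this is the time not slept; other systems may leave tv as it was. */
    return (long)tv.tv_sec * 1000000L + (long)tv.tv_usec;
}
```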


The main kernel functions involved in the select system call are, in order: sys_select(), core_sys_select(), do_select(), and poll_select_copy_remaining(). The code makes the flow clear at a glance.


    /*
     * The SYSCALL_DEFINE5 macro expands this into the usual system call form:
     * asmlinkage long sys_select(int n, fd_set __user *inp, fd_set __user *outp,
     *                            fd_set __user *exp, struct timeval __user *tvp);
     */
    SYSCALL_DEFINE5(select, int, n, fd_set __user *, inp, fd_set __user *, outp,
                    fd_set __user *, exp, struct timeval __user *, tvp)
    {
        struct timespec end_time, *to = NULL;
        struct timeval tv;
        int ret;

        if (tvp) {  // a timeout was supplied
            if (copy_from_user(&tv, tvp, sizeof(tv)))
                return -EFAULT;

            to = &end_time;
            // convert from timeval (seconds, microseconds) to timespec
            // (seconds, nanoseconds), then set up the timeout
            if (poll_select_set_timeout(to,
                    tv.tv_sec + (tv.tv_usec / USEC_PER_SEC),
                    (tv.tv_usec % USEC_PER_SEC) * NSEC_PER_USEC))
                return -EINVAL;
        }

        // the core work
        ret = core_sys_select(n, inp, outp, exp, to);
        // core_sys_select() has processed the fd_sets; now update the timeout value
        ret = poll_select_copy_remaining(&end_time, tvp, 1, ret);

        return ret;
    }

    /*
     * We can actually return ERESTARTSYS instead of EINTR, but I'd
     * like to be certain this leads to no problems. So I return
     * EINTR just for safety.
     *
     * Update: ERESTARTSYS breaks at least the xview clock binary, so
     * I'm trying ERESTARTNOHAND which restart only when you want to.
     */
    int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
                        fd_set __user *exp, struct timespec *end_time)
    {
        // poll.h: fd_set_bits packs six unsigned long pointers:
        // three input descriptor sets and three result sets
        fd_set_bits fds;
        void *bits;
        int ret, max_fds;
        unsigned int size;
        struct fdtable *fdt;
        /* Allocate small arguments on the stack to save memory and be faster.
         * 256 bytes are pre-allocated, which is enough in most cases;
         * larger requests are allocated separately below.
         */
        long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];

        ret = -EINVAL;
        if (n < 0)
            goto out_nofds;

        /* max_fds can increase, so grab it once to avoid race */
        rcu_read_lock();
        // get the open file descriptor table (RCU pointer dereference)
        fdt = files_fdtable(current->files);
        max_fds = fdt->max_fds;
        rcu_read_unlock();
        if (n > max_fds)
            n = max_fds;  // clamp the argument

        /*
         * Up to size*8 descriptors are monitored. Each one needs 6 bits:
         * three say whether we care about read, write and exception events,
         * and three hold the results (res_in, res_out, res_ex).
         * This gives the memory layout shown in Figure 1.
         */
        size = FDS_BYTES(n);
        bits = stack_fds;
        if (size > sizeof(stack_fds) / 6) {
            /* Not enough space in on-stack array; must use kmalloc */
            ret = -ENOMEM;
            bits = kmalloc(6 * size, GFP_KERNEL);
            if (!bits)
                goto out_nofds;
        }
        fds.in      = bits;
        fds.out     = bits +   size;
        fds.ex      = bits + 2*size;
        fds.res_in  = bits + 3*size;
        fds.res_out = bits + 4*size;
        fds.res_ex  = bits + 5*size;

        // fetch the descriptor sets from user space
        if ((ret = get_fd_set(n, inp, fds.in)) ||
            (ret = get_fd_set(n, outp, fds.out)) ||
            (ret = get_fd_set(n, exp, fds.ex)))
            goto out;
        // zero the result bitmaps
        zero_fd_set(n, fds.res_in);
        zero_fd_set(n, fds.res_out);
        zero_fd_set(n, fds.res_ex);

        // everything is set up; do the real work
        ret = do_select(n, &fds, end_time);

        if (ret < 0)
            goto out;
        if (!ret) {
            ret = -ERESTARTNOHAND;
            if (signal_pending(current))
                goto out;
            ret = 0;
        }

        // do_select() returned successfully; copy the ready-descriptor
        // results in fds back to user space with copy_to_user()
        if (set_fd_set(n, inp, fds.res_in) ||
            set_fd_set(n, outp, fds.res_out) ||
            set_fd_set(n, exp, fds.res_ex))
            ret = -EFAULT;

    out:
        if (bits != stack_fds)
            kfree(bits);
    out_nofds:
        return ret;
    }

    // the core work of select
    int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
    {
        ktime_t expire, *to = NULL;
        struct poll_wqueues table;
        poll_table *wait;
        int retval, i, timed_out = 0;
        unsigned long slack = 0;
        unsigned int busy_flag = net_busy_loop_on() ? POLL_BUSY_LOOP : 0;
        unsigned long busy_end = 0;

        // find the highest descriptor number select has to check
        rcu_read_lock();
        retval = max_select_fd(n, fds);
        rcu_read_unlock();

        if (retval < 0)
            return retval;
        n = retval;

        poll_initwait(&table);
        wait = &table.pt;
        // if the timeout value (seconds and nanoseconds) is zero, mark that we must not wait
        if (end_time && !end_time->tv_sec && !end_time->tv_nsec) {
            wait->_qproc = NULL;
            timed_out = 1;
        }

        if (end_time && !timed_out)
            slack = select_estimate_accuracy(end_time);

        // retval counts the ready descriptors below, so reset it to 0 first
        retval = 0;
        for (;;) {
            unsigned long *rinp, *routp, *rexp, *inp, *outp, *exp;
            bool can_busy_loop = false;

            inp = fds->in; outp = fds->out; exp = fds->ex;
            rinp = fds->res_in; routp = fds->res_out; rexp = fds->res_ex;

            for (i = 0; i < n; ++rinp, ++routp, ++rexp) {
                unsigned long in, out, ex, all_bits, bit = 1, mask, j;
                unsigned long res_in = 0, res_out = 0, res_ex = 0;

                in = *inp++; out = *outp++; ex = *exp++;
                all_bits = in | out | ex;
                // walk the bitmaps one long at a time: descend only if this
                // word contains descriptors we care about, otherwise skip
                // ahead by BITS_PER_LONG (32 or 64) bits
                if (all_bits == 0) {
                    i += BITS_PER_LONG;
                    continue;
                }

                // this word contains descriptors we care about,
                // so examine it bit by bit (Figure 2)
                for (j = 0; j < BITS_PER_LONG; ++j, ++i, bit <<= 1) {
                    struct fd f;
                    if (i >= n)
                        break;
                    if (!(bit & all_bits))
                        continue;
                    // a 1 bit in the current word means the corresponding
                    // descriptor must be checked; at this point i is the
                    // file descriptor number
                    f = fdget(i);
                    if (f.file) {
                        const struct file_operations *f_op;
                        f_op = f.file->f_op;
                        mask = DEFAULT_POLLMASK;
                        // dispatch to the poll function pointer in the
                        // file's file_operations; the result is a mask
                        if (f_op->poll) {
                            wait_key_set(wait, in, out,
                                         bit, busy_flag);
                            mask = (*f_op->poll)(f.file, wait);  // TODO
                        }
                        // fdget() above took a reference on the file, so drop it here
                        fdput(f);
                        /* Check whether the descriptor is ready for an event
                         * we care about; if so, record it in the result
                         * bitmap and increment the ready count.
                         */
                        if ((mask & POLLIN_SET) && (in & bit)) {
                            res_in |= bit;
                            retval++;
                            wait->_qproc = NULL;
                        }
                        if ((mask & POLLOUT_SET) && (out & bit)) {
                            res_out |= bit;
                            retval++;
                            wait->_qproc = NULL;
                        }
                        if ((mask & POLLEX_SET) && (ex & bit)) {
                            res_ex |= bit;
                            retval++;
                            wait->_qproc = NULL;
                        }
                        /* got something, stop busy polling */
                        if (retval) {
                            can_busy_loop = false;
                            busy_flag = 0;

                        /*
                         * only remember a returned
                         * POLL_BUSY_LOOP if we asked for it
                         */
                        } else if (busy_flag & mask)
                            can_busy_loop = true;
                    }
                }
                // this word has been scanned; store the results
                if (res_in)
                    *rinp = res_in;
                if (res_out)
                    *routp = res_out;
                if (res_ex)
                    *rexp = res_ex;
                /* Voluntarily yield so that other processes may run;
                 * execution resumes here afterwards.
                 */
                cond_resched();
            }
            // one full polling pass is done
            wait->_qproc = NULL;
            // leave the infinite loop if a descriptor is ready,
            // the timeout expired, or a signal is pending
            if (retval || timed_out || signal_pending(current))
                break;
            if (table.error) {
                retval = table.error;
                break;
            }

            /* only if found POLL_BUSY_LOOP sockets && not out of time */
            if (can_busy_loop && !need_resched()) {
                if (!busy_end) {
                    busy_end = busy_loop_end_time();
                    continue;
                }
                if (!busy_loop_timeout(busy_end))
                    continue;
            }
            busy_flag = 0;

            /* if a timeout was given and this is the first pass (to == NULL) */
            if (end_time && !to) {
                // convert from timespec to ktime_t (a 64-bit signed value)
                expire = timespec_to_ktime(*end_time);
                to = &expire;
            }

            /* Set the task state to TASK_INTERRUPTIBLE and sleep until woken
             * or timed out; execution resumes here once the task is
             * TASK_RUNNING again.
             */
            if (!poll_schedule_timeout(&table, TASK_INTERRUPTIBLE,
                                       to, slack))
                timed_out = 1;
        }

        // release the poll wait queues
        poll_freewait(&table);

        return retval;
    }
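The timeout arithmetic at the top of the select entry point can be tried out in user space. This is a small sketch, not kernel code: the function name and the SK_ constants are made up for illustration, and only the two expressions mirror the ones passed to poll_select_set_timeout() above.

```c
#include <sys/time.h>
#include <time.h>

/* Illustrative stand-ins for the kernel's USEC_PER_SEC and NSEC_PER_USEC */
enum { SK_USEC_PER_SEC = 1000000, SK_NSEC_PER_USEC = 1000 };

/*
 * Fold any whole seconds hidden in tv_usec into tv_sec and convert the
 * remaining microseconds to nanoseconds.
 */
struct timespec timeval_to_timespec_sketch(struct timeval tv)
{
    struct timespec ts;

    ts.tv_sec  = tv.tv_sec + tv.tv_usec / SK_USEC_PER_SEC;
    ts.tv_nsec = (tv.tv_usec % SK_USEC_PER_SEC) * SK_NSEC_PER_USEC;
    return ts;
}
```

For example, a timeval of 1 second and 2,500,000 microseconds becomes 3 seconds and 500,000,000 nanoseconds.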

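Fig. 1: Memory layout of the six bitmaps (in, out, ex, res_in, res_out, res_ex) set up in core_sys_select()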
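The six-bitmap layout that core_sys_select() sets up can be reproduced in user space. The sketch below copies the sizing macros as I understand them from include/linux/poll.h; the struct name fd_set_bits_sketch and the function layout_fd_sets are hypothetical, and calloc stands in for the kernel's stack buffer or kmalloc.

```c
#include <stddef.h>
#include <stdlib.h>

/* User-space copies of the sizing macros from include/linux/poll.h */
#define FDS_BITPERLONG (8 * sizeof(long))
#define FDS_LONGS(nr)  (((nr) + FDS_BITPERLONG - 1) / FDS_BITPERLONG)
#define FDS_BYTES(nr)  (FDS_LONGS(nr) * sizeof(long))

/* Same six fields as the kernel's fd_set_bits */
struct fd_set_bits_sketch {
    unsigned long *in, *out, *ex;
    unsigned long *res_in, *res_out, *res_ex;
};

/*
 * Carve one zeroed allocation of 6 * FDS_BYTES(n) bytes into six bitmaps.
 * Returns the base pointer (to be freed by the caller) or NULL on failure.
 */
void *layout_fd_sets(int n, struct fd_set_bits_sketch *fds)
{
    size_t size;
    char *bits;

    if (n <= 0)
        return NULL;
    size = FDS_BYTES((size_t)n);
    bits = calloc(6, size);
    if (!bits)
        return NULL;

    fds->in      = (unsigned long *)(bits);
    fds->out     = (unsigned long *)(bits + size);
    fds->ex      = (unsigned long *)(bits + 2 * size);
    fds->res_in  = (unsigned long *)(bits + 3 * size);
    fds->res_out = (unsigned long *)(bits + 4 * size);
    fds->res_ex  = (unsigned long *)(bits + 5 * size);
    return bits;
}
```

With n = 100, each bitmap occupies 16 bytes whether long is 32 or 64 bits wide, so the whole block is 96 bytes and fits comfortably in the 256-byte stack buffer.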

Fig. 2: Bit-by-bit scan of one long word of the bitmaps in do_select()
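The word-skipping scan in do_select() can be isolated into a few lines. This is a user-space sketch under stated assumptions: scan_interest and BITS_PER_LONG_SKETCH are made-up names, and instead of calling each file's poll function the sketch merely records which descriptor numbers would be visited.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG_SKETCH (sizeof(unsigned long) * CHAR_BIT)

/*
 * Collect the descriptor numbers below n whose bit is set in any of the
 * three bitmaps, storing at most cap of them in fds_out.
 * Returns the number of descriptors found.
 */
size_t scan_interest(const unsigned long *in, const unsigned long *out,
                     const unsigned long *ex, size_t n,
                     int *fds_out, size_t cap)
{
    size_t i = 0, found = 0;

    while (i < n) {
        unsigned long all_bits = *in++ | *out++ | *ex++;
        unsigned long bit = 1;
        size_t j;

        if (all_bits == 0) {            /* nothing of interest in this word */
            i += BITS_PER_LONG_SKETCH;
            continue;
        }
        for (j = 0; j < BITS_PER_LONG_SKETCH; ++j, ++i, bit <<= 1) {
            if (i >= n)
                break;
            if (!(bit & all_bits))
                continue;
            if (found < cap)
                fds_out[found] = (int)i;    /* i is the descriptor number */
            found++;
        }
    }
    return found;
}
```

The early skip is what keeps sparse sets cheap: a process watching only descriptor 3 and one descriptor two words later touches the words in between with a single comparison each.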




References:
(1) Linux kernel 3.18 source code
(2) Linux man page
(3) UNPv1

Time: 3h

