A Bloodcase Triggered by a Lock-Free Message Queue (6) -- RingQueue (Part 2): The Art of Sleep, Continued

Source: Internet
Author: User

Directory

(1) The Cause  (2) The Mixed Spin Lock  (3) q3.h and RingBuffer

(4) RingQueue (Part 1): Spin Locks  (5) RingQueue (Part 2): The Art of Sleep

(6) RingQueue (Part 2): The Art of Sleep, Continued

Opening

This is the follow-up to Article 5. This content will also be appended to the end of Article 5: RingQueue (Part 2): The Art of Sleep.

Recap

At the end of the previous article we summarized the sleep policies on Windows and Linux, as shown in the figure from that article:

 

Although sched_yield() covers the functionality of both Sleep(0) and SwitchToThread() on Windows (the blue box and the dashed box in the figure), two functions shown in gray are still missing: ① SwitchToOtherCoresLowerThreads() and ② SwitchToLocalCoreLowerThreads() (strictly, these should also cover threads of equal priority, but "Equal" is omitted because the names are already too long). Windows, for its part, is missing ① SwitchToOtherCoresLowerThreads(). In other words, Linux lacks the ability to switch specifically to lower-priority threads, while Windows provides SwitchToThread(), which can switch to a lower-priority thread on the current core, but it still cannot switch to lower-priority threads on other cores, i.e. it also lacks ① SwitchToOtherCoresLowerThreads().

 

That is, even if the Windows and Linux policies are combined, the coverage is still incomplete. On reflection, though, the statement above is not quite accurate. sched_yield() is not actually unable to switch to lower-priority threads: judging from the three Linux scheduling policies described in the previous article, sched_yield() may well switch to a thread whose priority is lower than its own, for example under SCHED_FIFO (first-in, first-out) or SCHED_RR (round-robin). What you cannot do is restrict the switch to threads waiting on the current thread's core, which is what SwitchToThread() does on Windows.

 

On Windows, although the API looks more complete, the ability to switch to a lower-priority thread on another core is in fact missing, i.e. SwitchToOtherCoresLowerThreads().

 

On Linux, what is missing first of all is the ability to switch only to threads waiting on the current core. If the thread being switched to was previously running on another core, the cache it was using may have gone cold and must be reloaded, which hurts performance. (I do not know exactly how Linux chooses here; the source code would tell, but one thing seems certain: under some scheduling policies, such as SCHED_OTHER, the policy used by ordinary threads, a thread already waiting on the current core is probably preferred. However, when there is no suitable waiting thread on the current core, Windows's SwitchToThread() returns immediately, whereas Linux will still pick a waiting thread from another core to run. Under the other two scheduling policies, especially SCHED_FIFO, threads waiting on the current core may not be preferred at all.)

In general, though, Linux offers three different scheduling policies, which is richer than Windows's single round-robin strategy; at the very least you can combine the policies to build a more effective scheduling scheme.
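To make the policy names concrete, here is a minimal sketch (not from the original article) of how a thread can opt into one of these policies; sched_setscheduler() is a standard Linux call, and the priority value used is just an illustrative choice that requires suitable privileges for the real-time policies.

#include <sched.h>
#include <stdio.h>

/* Switch the calling process/thread to SCHED_FIFO with the given priority.
 * SCHED_RR or SCHED_OTHER could be passed instead; the real-time policies
 * normally require root or CAP_SYS_NICE. */
static int use_fifo_policy(int priority)
{
    struct sched_param param;
    param.sched_priority = priority;   /* 1..99 for SCHED_FIFO on Linux */
    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
        perror("sched_setscheduler");
        return -1;
    }
    return 0;
}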

Windows, for its part, avoids to some extent migrating threads to other CPU cores, but it lacks the ability to switch to lower-priority threads on other cores, so its policy is also incomplete. Windows may eventually decide to run such low-priority threads based on thread starvation, but that process is bound to be relatively slow, and in some special situations it can lead to something close to a deadlock.

So the scheduling policies of Linux and Windows each have their merits. Overall Linux seems slightly better, simply because it offers more choices; with a careful design the outcome can be better. But broadly speaking both have shortcomings, and neither is complete.
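As a side note, here is a minimal sketch (my own, not from the article) of how the primitives compared above are typically wrapped in portable code; the comments reflect the documented behavior of Sleep(0), SwitchToThread() and sched_yield().

/* A minimal portable yield helper (a sketch, not RingQueue's code). */
#if defined(_WIN32)
#include <windows.h>

static void yield_cpu(int allow_lower_priority)
{
    if (allow_lower_priority)
        SwitchToThread();   /* may run a lower-priority ready thread,
                               but only on the current processor */
    else
        Sleep(0);           /* yields only to ready threads of equal priority */
}
#else
#include <sched.h>

static void yield_cpu(int allow_lower_priority)
{
    (void)allow_lower_priority;
    sched_yield();          /* Linux: the scheduler picks any ready task;
                               there is no per-core / per-priority variant */
}
#endif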

 

More complete scheduling

So how could scheduling be made more complete? You may already have guessed. Following the analysis above, we want to expose more detail and more controllable parameters, so that scheduling becomes more complete and more flexible. In other words, we want to decide from where we switch and to where we switch. We need a more powerful interface, one that covers almost every conceivable situation; it is powerful, flexible, and at times aggressive. We merely provide the interface; how you use it, or misuse it, is your own business (it should not crash, but it may well confuse).

 

We tentatively call this interface scheduler_switch() (it was originally going to be called scheduler_switch_thread()). The function prototype is roughly:

 

int scheduler_switch(pthread_array_t *threads, pthread_priority_t priority_threshold, int priority_type, cpuset_t cpu_mask, int force_now, int slice_count);

 

threads: a group of candidate threads. The thread to switch to is chosen from this group according to the parameters below. Only threads in the ready state are considered; running threads are excluded. If this value is NULL, the candidates are all waiting threads in the system, i.e. the behavior matches the default behavior of sched_yield().

priority_threshold: a thread-priority threshold. Together with priority_type below, it determines whether candidates must be above, below, equal to, or at least equal to this threshold. If the value is -1, the current thread's priority is used.

priority_type: the comparison applied to priority_threshold, one of >, <, =, >= and <=.

cpu_mask: the set of CPU cores that may be switched to. If 0, only threads on the current thread's core are eligible; otherwise each bit represents one CPU core, similar to the CPU-affinity type cpuset_t.

force_now: if 1, the selected thread is given the time slice immediately, regardless of the system's priority rules and scheduling policy, and the number of time slices it is forced to run is given by slice_count. If 0, the selected thread waits for the system to decide whether it runs immediately; it may be placed in a very short waiting queue.

slice_count: the number of time slices the selected thread runs after the switch. If 0, the system decides how many time slices it gets.

 

Return value: 0 if the switch succeeded; -1 if it failed (i.e. no thread could be switched to).

 

This is what I mean by aggressive: you can decide how many time slices the target thread runs without interruption after the switch, and if the threads in the group specified by threads would not otherwise obtain a time slice immediately, the force_now parameter lets you take one by force; in effect, the thread that would have received the time slice has to wait until the specified thread has run its slices. If the slice_count you specify is too large, other threads may not get to run at all, so this value should probably have an upper bound, say 100 or 256 time slices.
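To make the parameters concrete, here is a purely hypothetical usage sketch. scheduler_switch(), pthread_array_t, pthread_priority_t and cpuset_t are proposed in this article and do not exist in any real library; the typedefs and the PRIORITY_TYPE_GE constant below are invented solely for illustration.

#include <stdint.h>

/* Hypothetical declarations so the sketch is self-contained; in a real
 * implementation these would come from the scheduler's own headers. */
typedef struct pthread_array_t pthread_array_t;
typedef int      pthread_priority_t;
typedef uint64_t cpuset_t;
#define PRIORITY_TYPE_GE  4      /* assumed encoding of the ">=" comparison */

int scheduler_switch(pthread_array_t *threads,
                     pthread_priority_t priority_threshold, int priority_type,
                     cpuset_t cpu_mask, int force_now, int slice_count);

/* Yield to any ready thread in 'committers' whose priority is at least the
 * current thread's (-1 = use current priority), on any core (all mask bits
 * set), without forcing the time slice and letting the system pick the
 * slice count. */
static int yield_to_committers(pthread_array_t *committers)
{
    return scheduler_switch(committers,
                            (pthread_priority_t)-1,   /* current priority */
                            PRIORITY_TYPE_GE,
                            (cpuset_t)~(uint64_t)0,   /* any CPU core */
                            0,                         /* force_now: no */
                            0);                        /* slices: system decides */
}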

As you can see, this interface covers almost all of the existing scheduling functionality on Windows and Linux, extends it to some degree, and offers considerable flexibility; if you think something is still missing, please tell me. How to implement it is another matter; here we only care whether it is theoretically possible. A concrete implementation could be worked out by studying the Linux kernel source. It may not be easy, but in principle it should be doable. On Windows, since the source is closed, there is no way in; ReactOS might be an option, but ReactOS is based on the Windows NT kernel and its technology may be somewhat dated (note: ReactOS is an open-source operating system that imitates Windows NT and Windows 2000; see [Wikipedia: ReactOS]).

 

Linux Kernel

Download the Linux Kernel source code:

https://www.kernel.org/

(We recommend downloading both 2.6.32.65 and the latest stable version, 3.18.4. Android is based on the 2.6 kernel; if you are interested, you can also look at the Android kernel directly.)

The kernel version of Ubuntu 14.04 LTS is 3.13 (see the Ubuntu release list).

 

References:

Linux Kernel scheduling algorithm (1) -- quickly locate the process with the highest priority

Linux Kernel-Process Scheduling (1)

IBM: Linux 2.6 Scheduling System Analysis

 

Note: after reading the Linux kernel source, I find that the actual behavior of sched_yield() is not quite what was described above. The correct flow should be: first traverse the run queue of the core the current thread is on for a suitable task thread; if one is found, switch to it. If not, I am currently not sure whether a task thread is migrated from another core's RunQueue to the current core; in principle it should be. If so, the only difference between sched_yield() and Sleep(0) on Windows is that Sleep(0) will not switch to a thread of lower priority than its own, whereas sched_yield() will. Sleep(0)'s policy should otherwise be similar to sched_yield(): it looks on the current core first and then on the other cores. I will not revise the earlier description, so please bear this in mind.

 

Because Linux 3.18.4 has changed a great deal and has become much more complex and harder to read (it took quite some effort just to locate the schedule() function), we use the 2.6 kernel as the example.

 

Reading the 2.6.32.65 kernel source, we find that the system's scheduling function is schedule(void), which is close to our proposed interface and implements similar functionality. schedule(void) picks the next runnable task thread from the RunQueue of the current core and switches the MMU, the register state, and the stack, i.e. performs a context switch.

 

From /include/linux/smp.h together with /arch/x86/include/asm/smp.h or /arch/arm/include/asm/smp.h, we can see that smp_processor_id() returns the current CPU core number: on x86 it is obtained via percpu_read(cpu_number), and on ARM via current_thread_info()->cpu. Then rq = cpu_rq(cpu) fetches the RunQueue, an independent queue of runnable task threads maintained per CPU core.

 

/*
 * schedule() is the main scheduler function.
 */
asmlinkage void __sched schedule(void)
{
    struct task_struct *prev, *next;
    unsigned long *switch_count;
    struct rq *rq;
    int cpu;

need_resched:
    preempt_disable();
    cpu = smp_processor_id();       /* get the current CPU core number */
    rq = cpu_rq(cpu);
    rcu_sched_qs(cpu);
    prev = rq->curr;
    switch_count = &prev->nivcsw;

    release_kernel_lock(prev);
need_resched_nonpreemptible:

    schedule_debug(prev);

    if (sched_feat(HRTICK))
        hrtick_clear(rq);

    spin_lock_irq(&rq->lock);
    update_rq_clock(rq);
    clear_tsk_need_resched(prev);

    if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
        if (unlikely(signal_pending_state(prev->state, prev)))
            prev->state = TASK_RUNNING;
        else
            deactivate_task(rq, prev, 1);
        switch_count = &prev->nvcsw;
    }

    pre_schedule(rq, prev);         /* prepare for scheduling? */

    if (unlikely(!rq->nr_running))
        idle_balance(cpu, rq);

    /* Record the run time of the previous task thread and update its
       average run time and average overlap time. */
    put_prev_task(rq, prev);
    /* Walk the scheduling classes from the highest priority downwards
       until the next runnable task thread is found. */
    next = pick_next_task(rq);

    if (likely(prev != next)) {
        sched_info_switch(prev, next);
        perf_event_task_sched_out(prev, next, cpu);

        rq->nr_switches++;
        rq->curr = next;
        ++*switch_count;

        /* Switch the context and release the runqueue spin lock. */
        context_switch(rq, prev, next); /* unlocks the rq */
        /*
         * the context switch might have flipped the stack from under
         * us, hence refresh the local variables.
         */
        cpu = smp_processor_id();
        rq = cpu_rq(cpu);
    } else
        spin_unlock_irq(&rq->lock);

    post_schedule(rq);

    if (unlikely(reacquire_kernel_lock(current) < 0))
        goto need_resched_nonpreemptible;

    preempt_enable_no_resched();
    if (need_resched())             /* reschedule? */
        goto need_resched;
}
EXPORT_SYMBOL(schedule);

 

One of the key functions here is pick_next_task(rq), which finds the next runnable task with the highest priority. It clearly searches the current core's RunQueue first, but from this code alone we cannot tell whether, when no suitable task thread exists on this core, tasks are migrated from another core's RunQueue; that probably depends on the pick_next_task() implementation inside struct sched_class, which is a function pointer. The source of pick_next_task(rq) is as follows:

 

/*
 * Pick up the highest-prio task:
 */
static inline struct task_struct *
pick_next_task(struct rq *rq)
{
    const struct sched_class *class;
    struct task_struct *p;

    /*
     * Optimization: we know that if all tasks are in
     * the fair class we can call that function directly:
     */
    /* fair_sched_class is the time-fair (CFS) scheduling class; it covers the
       SCHED_NORMAL, SCHED_BATCH and SCHED_IDLE policies, as can be seen from
       the __setscheduler() function. */
    if (likely(rq->nr_running == rq->cfs.nr_running)) {
        p = fair_sched_class.pick_next_task(rq);
        if (likely(p))
            return p;
    }

    /* sched_class_highest = &rt_sched_class, the real-time scheduling class;
       it covers SCHED_FIFO and SCHED_RR, as can be seen from __setscheduler(). */
    class = sched_class_highest;
    for ( ; ; ) {
        p = class->pick_next_task(rq);
        if (p)
            return p;
        /*
         * Will never be NULL as the idle class always
         * returns a non-NULL p:
         */
        class = class->next;
    }
}

 

Let us set this question aside and study it later, or leave it to interested readers; if you reach any conclusions, please let me know.

 

scheduler_switch()

Let us look at the prototype of scheduler_switch() again:

 

int scheduler_switch(pthread_array_t *threads, pthread_priority_t priority_threshold, int priority_type, cpuset_t cpu_mask, int force_now, int slice_count);

 

Here pthread_array_t *threads denotes a group of threads. The reason for this parameter is that in q3.h, both push() and pop() must serialize the step in which a thread verifies that its commit has succeeded; if thread scheduling during that step can follow the order we want, and a suitable sleep mechanism is added, contention may be reduced and efficiency improved.

 

Pay attention to lines 22 and 23 in the following code:

 

 1  static inline int
 2  push(struct queue *q, void *m)
 3  {
 4      uint32_t head, tail, mask, next;
 5      int ok;
 6
 7      mask = q->head.mask;
 8
 9      do {
10          head = q->head.first;
11          tail = q->tail.second;
12          if ((head - tail) > mask)
13              return -1;
14          next = head + 1;
15          ok = __sync_bool_compare_and_swap(&q->head.first, head, next);
16      } while (!ok);
17
18      q->msgs[head & mask] = m;
19      asm volatile ("" ::: "memory");
20
21      /* This is a blocking lock: it serializes the commit step, letting threads pass one by one from the smallest sequence number to the largest. */
22      while (unlikely(q->head.second != head))
23          _mm_pause();
24
25      q->head.second = next;
26
27      return 0;
28  }

 

Because lines 22 and 23 have no sleep policy at all, the thread simply spins in place. When contention is fierce and the intervals between operations are very short (two parameters describe a contention situation: one is how many threads are competing, the other is how often they compete, i.e. the interval at which contention occurs; many competitors plus short intervals means very fierce contention, which is exactly our case here), the lack of a sleep policy makes the threads fight intensely for the resource. When the total number of push() and pop() threads exceeds the total number of CPU cores, q3.h falls into a state somewhere between a livelock and a deadlock: the queue becomes extremely slow, and a run that should finish quickly can take a minute or even several minutes.
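As one illustration of what "adding a sleep policy" to lines 22-23 might look like, here is a sketch of mine (not RingQueue's actual code): spin briefly, then yield, then genuinely sleep. The thresholds 64 and 128 and the helper name wait_for_turn are arbitrary choices for the example, and a production version would also need a proper atomic read of the commit sequence.

#include <stdint.h>
#include <sched.h>       /* sched_yield() */
#include <time.h>        /* nanosleep() */
#include <emmintrin.h>   /* _mm_pause() */

/* Wait until the published commit sequence reaches our own 'head', backing
 * off progressively instead of spinning forever. 'commit_seq' would be
 * &q->head.second in the push() shown above. */
static void wait_for_turn(volatile uint32_t *commit_seq, uint32_t head)
{
    int spin = 0;
    while (*commit_seq != head) {
        if (spin < 64)
            _mm_pause();                       /* cheap pause, keep the core */
        else if (spin < 128)
            sched_yield();                     /* let another thread run */
        else {
            struct timespec ts = { 0, 1000 };  /* about 1 microsecond */
            nanosleep(&ts, NULL);              /* really sleep, free the core */
        }
        spin++;
    }
}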

What we can do instead is put the threads whose turn has not yet come to sleep, and then wake them up in order of their own sequence number head (not in the order in which they went to sleep, because between line 16 and line 22 the execution order is indeterminate: a thread with a larger sequence number may reach line 22 first while a thread with a smaller sequence number is preempted and only reaches line 22 later; so when waking up, we wake by sequence number head, from smallest to largest). This, however, creates a new problem. Suppose a thread A running on core X reaches line 22 and its sequence number head is still far greater than q->head.second, so it must either switch to another thread or go to sleep. If it switches, the thread B whose head equals q->head.second is selected and woken up; but the core X that currently holds the time slice is very likely not the core Y that thread B originally ran on, so waking B means migrating it from core Y to core X, which is exactly what we do not want. It would be better to run thread B on core Y, if possible by interrupting the thread C currently running on core Y and handing its time slice to B. But then the next thread to be woken may again not be on the core that currently holds the time slice, leading to another such interruption; too many interruptions naturally hurt efficiency. A better overall planning strategy would therefore be welcome, but such a strategy does not look easy to implement.

Either we accept frequent thread migration, or we allow both behaviors in some fixed proportion: for example, with probability 0.4 the woken thread is migrated to the current core, and with probability 0.6 it is not migrated and we instead interrupt the core that the woken thread belongs to (the probabilities sum to 1.0). Alternatively, we can wake two adjacent threads at a time (adjacent in sequence number head); call them thread A and thread B. After A passes, B is next, so waking B a little early and letting it spin while waiting its turn works well (this is in fact very similar to q3.h, except that a sleep policy has been applied, so it can cope with any number of push() and pop() threads). Moreover, if the two threads' cores happen to be crossed, i.e. thread A originally ran on core X, thread B originally ran on core Y, and the thread currently calling sched_yield() is on core Y, then let B run on core Y and interrupt the thread on core X to run A. Ideally the interruption is done first, letting thread A run on core X, and only then do we switch to thread B on core Y; if that ordering cannot be achieved, the two can also happen at the same time. Even when the cores are not crossed, this scheme still increases the probability that at least one of the two threads runs on its original core.
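For reference, here is a much simpler (and coarser) sketch of "wake in sequence-number order" written with standard pthread primitives rather than the per-core wake-up described above; the commit_gate_t type and its functions are invented for this example.

#include <stdint.h>
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    uint32_t        commit_seq;          /* plays the role of q->head.second */
} commit_gate_t;

static void gate_wait(commit_gate_t *g, uint32_t head)
{
    pthread_mutex_lock(&g->lock);
    while (g->commit_seq != head)        /* not our turn yet: sleep */
        pthread_cond_wait(&g->cond, &g->lock);
    pthread_mutex_unlock(&g->lock);
}

static void gate_publish(commit_gate_t *g, uint32_t next)
{
    pthread_mutex_lock(&g->lock);
    g->commit_seq = next;                /* our slot is now visible */
    pthread_cond_broadcast(&g->cond);    /* wake sleepers; only the one whose
                                            head == next will proceed */
    pthread_mutex_unlock(&g->lock);
}

Broadcasting wakes every sleeper just so the one whose head matches can proceed, which is exactly the kind of inefficiency the finer-grained scheduler_switch() idea tries to avoid.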

 

Why not simply make the pthread_array_t *threads group a first-in, first-out queue? First, as mentioned above, just as in q3.h, the order in which threads go to sleep is not necessarily the order we want. Second, our problem is itself a FIFO-queue problem; nesting another FIFO queue inside it seems unreasonable. That said, a FIFO of threads is genuinely useful in some situations, and the two do not conflict: we simply ensure that a thread enqueued earlier carries a smaller sequence number than one enqueued later. To simplify the logic, we can store the elements in a fixed-size array (the queue capacity must be specified up front) and use two singly linked lists, an active_list and a free_list, for insertion and traversal; during traversal the sequence number decides which element is selected, so the structure behaves as a queue ordered by sequence number, as sketched below.
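Here is a minimal sketch of that fixed-array, two-list structure as I imagine it; the capacity, field names and functions are all invented for illustration, and sequence-number wraparound is ignored for brevity.

#include <stdint.h>
#include <pthread.h>

#define WAIT_QUEUE_CAPACITY 64          /* capacity must be fixed up front */

typedef struct wait_node {
    uint32_t          seq;              /* the waiter's sequence number (its 'head') */
    pthread_t         thread;           /* the sleeping thread to wake later */
    struct wait_node *next;
} wait_node_t;

typedef struct {
    wait_node_t  nodes[WAIT_QUEUE_CAPACITY];   /* fixed storage, no malloc */
    wait_node_t *active_list;                  /* sorted by seq, ascending */
    wait_node_t *free_list;                    /* unused slots */
} wait_queue_t;

static void wait_queue_init(wait_queue_t *wq)
{
    int i;
    wq->active_list = NULL;
    wq->free_list = NULL;
    for (i = WAIT_QUEUE_CAPACITY - 1; i >= 0; i--) {
        wq->nodes[i].next = wq->free_list;     /* chain all slots onto free_list */
        wq->free_list = &wq->nodes[i];
    }
}

/* Insert a waiter, keeping active_list ordered by sequence number so that a
 * traversal visits waiters in commit order. */
static int wait_queue_insert(wait_queue_t *wq, uint32_t seq, pthread_t thread)
{
    wait_node_t  *node = wq->free_list;
    wait_node_t **link;

    if (node == NULL)
        return -1;                             /* queue is full */
    wq->free_list = node->next;

    node->seq = seq;
    node->thread = thread;

    link = &wq->active_list;
    while (*link != NULL && (*link)->seq < seq)
        link = &(*link)->next;                 /* find the sorted position */
    node->next = *link;
    *link = node;
    return 0;
}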

 

A caveat: this part is a bit superfluous, but having written it I am not going to delete it. Some of it I have genuinely thought through; some of it I wrote down as it occurred to me. I do not find it ideal, but it may serve as a way of thinking about the problem.

 

In fact, a day before writing the previous article and preparing this one, I came across the article "The traps and thoughts of condition variables", which was timely and overlaps to some extent with mine. So I went through the pthread_cond_xxxx() related code in the glibc source and gained a better understanding of condition variables. Even without deep study, I realized that condition variables are not implemented as simply as I had assumed; at the very least, their overall mechanism is not the same as the wake-up mechanism I described above. We always want to implement a perfect wake-up mechanism, but the most perfect version in our minds may be hard to realize, because some things in multi-threaded programming are simply irreconcilable. Also, in order to simplify implementation, the POSIX specification differs quite a bit from Windows in how the facility is used. After reading that article I found that my original understanding of pthread condition variables was off: I had thought they were very similar to Windows Events. I did know some differences between the two, mainly the manual and automatic reset modes of CreateEvent(), and PulseEvent() versus pthread_cond_broadcast(); but since I had never actually used them in anger, the real differences only became clear after reading that article. The biggest difference lies in the implementation logic: to simplify the implementation, the POSIX specification requires a condition variable to be used together with a pthread_mutex_t in order to work correctly. This differs from Windows's WaitForSingleObject() and ResetEvent(); the event-related functions on Windows already include this mutex-like operation internally, which makes them more intuitive and easier to use, while pthread_cond_t is the more flexible of the two.
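To illustrate that last point, here is the canonical POSIX pattern (a generic sketch of mine, not glibc's internals): the condition variable is always paired with a mutex and a predicate, and the wait sits in a loop because of possible spurious wakeups; the equivalent Windows event usage is shown only in a comment.

#include <pthread.h>

static pthread_mutex_t mtx  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int ready = 0;                     /* the predicate */

static void posix_wait(void)
{
    pthread_mutex_lock(&mtx);
    while (!ready)                        /* loop guards against spurious wakeups */
        pthread_cond_wait(&cond, &mtx);   /* atomically unlocks mtx and sleeps */
    pthread_mutex_unlock(&mtx);
}

static void posix_signal(void)
{
    pthread_mutex_lock(&mtx);
    ready = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&mtx);
}

/* Windows: the event object carries the state itself, so no separate mutex
 * or predicate is strictly required in the simple case:
 *
 *   HANDLE ev = CreateEvent(NULL, FALSE, FALSE, NULL);   // auto-reset event
 *   WaitForSingleObject(ev, INFINITE);                   // waiter
 *   SetEvent(ev);                                        // signaler
 */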

 

You may ask: since scheduler_switch() would presumably be implemented in the kernel, why do its parameters use pthread types? In fact, if the principle works, they can simply be changed to the kernel's kthread types, or two versions can be provided, one for the kernel and one for pthreads; that is not a big problem.

 

scheduler_switch() is a "heuristic" design. Our goal is to make the sleep policy more complete so that it serves us better. If it can be implemented, so much the better; if it cannot, or not easily, then consider it food for thought.

 

To be continued

Thank you for reading. If you think this article is well written, please recommend it or leave a comment, so that others can find it through the "likes" and "comments" sections on the home page.

 

RingQueue

RingQueue is available on GitHub. I dare say it is a decent mixed spin lock; you can download it and take a look. It supports Makefile, CodeBlocks, Visual Studio 2008/2010/2013, and CMake, and builds on Windows, MinGW, Cygwin, Linux, Mac OS X, and so on.

 

Directory

(1) The Cause  (2) The Mixed Spin Lock  (3) q3.h and RingBuffer

(4) RingQueue (Part 1): Spin Locks  (5) RingQueue (Part 2): The Art of Sleep

(6) RingQueue (Part 2): The Art of Sleep, Continued

 

Previous article: A Bloodcase Triggered by a Lock-Free Message Queue (5) -- RingQueue (Part 2): The Art of Sleep

