A Murder Caused by a Lock-Free Message Queue: How to Be a Real Programmer? (V) — RingQueue (Middle): The Art of Sleep

Source: Internet
Author: User
Tags: sleep function

Directory

(i) Cause (ii) Hybrid spin lock (iii) q3.h and RingBuffer

(iv) RingQueue (Upper): Spin lock (v) RingQueue (Middle): The Art of Sleep

Opening

I have been studying Disruptor these past few days, including the .NET port. The .NET port has not kept up with the Java releases; only v2.10 is available online. I have not examined it closely, but it clearly differs a great deal from the latest Java version, Disruptor 3.30. Still, I used the .NET 2.10 port to write a test program similar to our problem, and it produced the same kind of results as the Java 3.30 version. I also downloaded a C++ port, glanced at it, and set it aside: for one, the version is too old; for another, it depends on Boost and C++11. I favor lightweight code with few dependencies, and if I really wanted one I would rather write it myself, so I did not bother benchmarking it. I have already started porting the principles of Disruptor 3.3 to C++.

For convenience, I uploaded my revised Disruptor .NET version to GitHub: https://github.com/shines77/disruptor.net. The vs2013-net4.5 directory is the Visual Studio 2013 version targeting .NET Framework 4.5; the original targets .NET 4.0, and upgrading to 4.5 required modifying some files, so there are two versions. After my adjustments, the .NET 4.5 version is the more refined one: x64 and x86 are strictly separated, though there is no essential difference between them. The original GitHub address of Disruptor .NET 2.10 is https://github.com/disruptor-net/Disruptor-net. If you do not want to download the complete project, you can view the test code NPnCRingQueueTest.cs here, under the RingQueue\disruptor\csharp directory.

RingQueue

Let's first look at RingQueue. The previous article actually included some RingQueue tests; did any careful reader spot which version is slower, and by how much the other is faster? Let's take a look:

As we can see, the hybrid spin lock spin2_push() is the fastest, while mutex_push(), which uses the operating system's mutex, is the slowest; the former is about 3.5 times as fast as the latter.

After many runs we found that spin_push(), spin1_push(), and spin2_push() are actually close in speed, but spin2_push() is relatively more stable. A multithreaded run is affected by many factors, so the results differ from run to run; overall, spin2_push() is the most stable of the three.

Next, let's take a look at q3.h's test results:

We found that q3.h is even slower than the operating-system mutex version. The attentive reader will surely ask why the q3.h test used only 4 threads while the earlier tests used 8. This is a limitation of the q3.h code itself: when the total number of threads exceeds the actual number of CPU cores, it becomes extremely slow. My CPU has 4 cores, so I could only test with push_cnt = 2 and pop_cnt = 2. In fact, even when the earlier versions were tested with 2 push threads and 2 pop threads, they were still faster than q3.h; interested readers can try it themselves. q3.h can be modified to remove this limitation, but that is a story for later.

Clearly, our hybrid spin lock spin2_push() performs well: it is 4.18 times as fast as q3.h. The minimum time for spin2_push() is actually much lower than 716 milliseconds; the fastest runs reach about 580 milliseconds.

Here's a re-description of my test environment:

CPU: Intel Q8200 2.4 GHz / 4 cores

System: Windows 7 SP1 / 64-bit

Memory: 4 GB DDR2 1066 (dual channel)

Build platform: Visual Studio 2013 Ultimate Update 2 (with cl.exe)

The above results were measured in x86 mode (64-bit Windows is compatible with x86 mode). In x64 mode, Sleep(1) has an efficiency problem that makes the system-mutex version very slow (over ten seconds, or even tens of seconds); the system mutex presumably uses something similar to Sleep(1) internally. Our spin_push(), spin1_push(), and spin2_push(), which also use Sleep(1), are affected as well, but only very slightly. I am not yet sure of the cause, so I did not benchmark in x64 mode.

spin2_push()

Why is spin2_push() so fast? Over the past few days of research and comparison, I tested the Java version of Disruptor 3.30 and the .NET version of Disruptor 2.10. Here are the results for both:

Java version of Disruptor 3.30:

.NET version of Disruptor 2.10:

At this point you must be wondering why Disruptor is so slow. Frankly, so am I...

Note that Disruptor brings us a new perspective: a variety of scenarios (1P + 1C, 1P + nC, multi-P + 1C) and a variety of message-handling styles, all worth learning from. In particular, the single-producer and single-consumer modes can be simplified, which we had not considered and will not consider here; we will stick to discussing the multi-producer + multi-consumer model, because the single-producer and single-consumer modes are comparatively simple.

In single-producer + single-consumer mode (1 producer + 1 consumer), Disruptor is indeed very fast. In that case it really is lock-free; in fact it is not only lock-free but wait-free, except that it must wait when the queue is full or empty (any implementation must wait then). However, Disruptor is not fully optimized for 1P + 1C; it only considers 1P + nC, so its consumers are not wait-free. Fast as Disruptor's 1P + 1C mode is, it could actually be faster. And in actual tests, 1P + nC mode (single producer + multiple consumers) is not necessarily fast: measured, 2P + 2C was even faster than 1P + 3C, for reasons unexplained.

You can refer to the article "Disruptor Usage Guide" here. Disruptor provides many scenarios; the most common is EventHandler, which lets every EventProcessor take each message from the queue and process it — that is, each message is processed repeatedly, once per handler. Since we want to implement a multi-producer, multi-consumer FIFO (first-in, first-out) message queue, EventHandler does not fit. From that article's description, we know that only the WorkerPool + WorkHandler pattern fits our scenario:

One thing worth noting: while writing this article, I was reading the publish() function of the single-producer SingleProducerSequencer.java (where cursor is a Sequence object) and found:

Going into Sequence.java, the Sequence::set() function is defined as follows (I had always assumed set() was simply a write operation, and I did not understand what putOrderedLong() specifically meant):

Searching online, I stumbled on the fact that Unsafe.putOrderedLong(this, VALUE_OFFSET, value) can internally use a global spinlock to guarantee write/store ordering; see "Source Code Anatomy of sun.misc.Unsafe".

Not only that: all of Java's Unsafe atomic operations (compareAndSwap and so on) may go through this spinlock; see that article for details. The spinlock is a static variable, so all of these operations can end up spinning on it. The spin wait may be very short, but this is still different from a single local CAS loop: as long as independent CAS and other atomic operations target different addresses, the cache lines they lock are likely to differ from those used by atomic operations elsewhere, so the chance of blocking and interfering with one another is relatively small. With this spinlock, however, everything contends on the same cache line, and everything blocks together. The impact is hard to assess, but at the very least it is not good practice.

The spinlock code looks like this:

So no matter how good a lock-free or wait-free algorithm Disruptor comes up with, its internal implementation may still contain a "lock"; it can never be entirely "lock-free". To be fair, in Disruptor 3.30 (Java edition), even in multi-producer + multi-consumer mode, Disruptor does implement a genuinely lock-free method (setting aside Unsafe's lock), but it additionally uses an array of the same size as buffer_size to record availability flags, and each producer must find the minimum of an array of sequence numbers covering all consumers (the sequence number each consumer has read up to). That cost is not necessarily small, though it may still be faster than non-lock-free methods.

As to why it is so slow, part of the reason may be the language itself. Disruptor also has another problem: the more threads, the slower it gets, whereas spin2_push() gets faster with more threads, up to a limit — a limit consistent with the common rule of thumb that the number of threads is best set to about twice the total number of CPU cores. I will write a separate article to discuss Disruptor in detail.

Comparison with q3.h

In the fourth article, (iv) RingQueue (Upper): Spin lock, we mentioned that we chose a spin lock because q3.h can be seen as consisting of two lock-free structures and two "locks", so we might as well use a "spin lock" directly. Take push(): in the CAS loop that claims a sequence number, all push() threads compete; the contention mainly locks and invalidates the cache line holding q->head.first, and the line read by head = q->head.first; tail = q->tail.second; is likewise invalidated whenever those values are written elsewhere. Then comes the serialization step that confirms the commit, which is a true "lock" that blocks other threads. q3.h's mistake is not sleeping sensibly at this point: when the total number of threads exceeds the number of cores, one thread can block all the others while itself holding no time slice, producing a livelock. Eventually that thread does get a time slice, but the others have waited a long time; with no sleep anywhere, CPU usage stays high while the whole process crawls. The fix is to spin moderately, yield(), or sleep a little inside this "lock". The "lock" itself does not invalidate the cache line, but when q->head.second is written with a new value, false sharing still occurs. The analysis of pop() is similar. With spin2_push(), by contrast, the number of competing threads is roughly halved (assuming equal numbers of producer and consumer threads). The drawback is heavy contention, since the spin lock has a single point of competition: the lock cycle. But once the lock is taken, only one thread holds read and write rights, so the RingBuffer's internal operations have no false-sharing problem — "one man guards the pass, and ten thousand cannot break through."

Another problem with q3.h, mentioned in the third article, is that the four variables head.first, head.second, tail.first, and tail.second should sit on four different cache lines to reduce cache invalidation (i.e., false sharing). Disruptor does this very well. Fixing these issues would probably make q3.h faster, though I do not know by how much; I will try it in a future update. But it would certainly still be slower than a C++ port of the current Disruptor — of that I am sure.

spin2_push()

So, let's take a look at our hybrid spin lock spin2_push(); see the spin2_push_() function in RingQueue.h:

At the end of the previous article I mentioned Thread.Yield(). "Yield" means to give way, to concede — that is, to hand the CPU over to other threads voluntarily. I also said earlier that denghe.net (Lao Deng) once posted C# internals viewed with a reflection tool; what matters there is actually not Thread.Yield() but the SpinWait class under System.Threading. Note that this is not Thread.SpinWait(n): Thread.SpinWait(n) simply spins for n cycles, while Thread.Yield() maps to the Windows API SwitchToThread(). The jimi_yield() inside spin2_push() is likewise defined as SwitchToThread() on Windows. SwitchToThread() is a very special function — as I wrote in the comments of jimi_yield(), it yields only to other waiting threads on the CPU core where the calling thread resides (not to waiting threads on other CPU cores). We will discuss it in detail later.

The SpinWait class we are talking about looks a lot like our spin2_push() — but in truth, we imitated it. The SpinWait.cs source can be found online; I have uploaded it to GitHub as SpinWait.cs, under the \douban directory of the RingQueue project.

Let's see what SpinWait.cs looks like, focusing on two of the more important places:

There is also SpinOnce():

As you can see, I made a few improvements to SpinOnce(): SLEEP_0_EVERY_HOW_MANY_TIMES is changed to 4 and SLEEP_1_EVERY_HOW_MANY_TIMES to 64. These are the interval values between executions of Sleep(0) and Sleep(1), chosen so the check can use a bit operation for a little extra efficiency. In fact, this place differs little from using %; values like 5 and 20 would also work, though in my experience 20 can go slightly larger, especially in x64 mode, for reasons I touched on earlier. These are empirical values from a lot of testing; you can tune them for your own situation.

The truly key setting is the spin count threshold. You can see that SpinWait.cs uses a threshold of 10, while spin2_push() uses 1 (setting it to 2 also works). This is a crucial point: it determines the performance of the hybrid spin lock, and it depends on how long the locked region is held. If the lock is held briefly, the spin count should be small; if it is held long, the threshold can be larger — but not too large, or we spin forever and waste CPU time. Rather than spinning for a long time, it is better to check whether other threads need the CPU and, if so, switch to one and hand it the time slice. If no other thread needs a time slice and this repeats a few times, we put the thread to sleep, i.e., Sleep(1). Sleep(0), by contrast, does not sleep: it switches to a ready thread of the same or higher priority. We will say more about this later; the spin2_push() function carries detailed comments and explanations.

We call the strategy of the SpinOnce() function in SpinWait.cs a "sleep strategy", and sleeping well is an "art of sleep".

Let's take a look at the process/thread scheduling principles of operating systems.

Process/Thread Scheduling

Operating systems use many strategies for CPU process/thread scheduling. Unix-like systems use a time-slice algorithm, while Windows is preemptive.

Linux

In the time-slice algorithm, all runnable processes are queued. The operating system assigns each process, in order, a slice of time during which it is allowed to run. If the process is still running when its time slice ends, the CPU is taken away and given to another process. If the process blocks or finishes before its slice ends, the CPU switches immediately. All the scheduler has to do is maintain a list of ready processes; a process that exhausts its time slice is moved to the end of the queue.

Windows

A so-called preemptive operating system means that once a process gets the CPU, it occupies it completely unless it voluntarily gives it up. In this model, the operating system assumes all processes are "well-behaved" and will voluntarily yield the CPU.

In a preemptive operating system, the OS computes a total priority for each process from its base priority and its hunger time (how long it has gone without the CPU). It then gives the CPU to the process with the highest total priority. When that process finishes executing, or voluntarily suspends itself, the OS recomputes the total priorities of all processes and hands the CPU to the new highest.

Splitting the Cake

We can describe the two algorithms with a cake-splitting scene. Suppose there is a steady stream of cake (a steady stream of CPU time), one knife-and-fork set (one CPU), and 10 people waiting to eat the cake (10 processes).

If a Unix/Linux operating system is in charge of splitting the cake, the rule is: each person eats for at most 1 minute, then the next one takes over; after the last person finishes, we start again from the first. So regardless of the 10 people's different priorities, hunger levels, and appetites, everyone gets 1 minute at a time. Of course, if someone is not very hungry, or eats little and is full after 30 seconds, he can tell the operating system "I'm done" (suspend). The OS then lets the next person in, and the one who just finished is placed at the end of the queue.

If the Windows operating system is in charge of splitting the cake, the scene is more interesting. The rule is: I will compute a total priority for each of you from your base priority and hunger level. The person with the highest total priority comes up to eat the cake — until he no longer wants to. When he is done, I recompute the priorities and hunger levels, and hand the cake to whoever is highest now.

In this setting, things get amusing. One person may be a pretty young woman, born with a high priority, so she often gets to eat cake. Another may be a poor loser who is also slow, so his priority is especially low, and it takes a long time before his turn comes (though as time passes he grows hungrier, his total priority keeps rising, and one day it will be his turn). Moreover, if a big glutton happens to grab the knife and fork, he may monopolize the cake for a long stretch, eating continuously while everyone beside him just swallows their saliva...

The following can also happen: the OS computes that pretty No. 5 has the highest total priority, by a wide margin, so she is called up to eat the cake. No. 5 eats for a little while, feels less hungry, and says, "I'm done for now" (suspend). The OS then recomputes everyone's priority. Since No. 5 just ate, her hunger dropped and her total priority fell, while everyone else waited a bit longer, grew hungrier, and rose. But it is still possible that No. 5's priority remains higher than everyone else's — only by a little now, yet still the highest. So the OS says: No. 5, come up and eat cake... (No. 5 is secretly annoyed: I just ate... I'm trying to lose weight... who told you to be born so pretty and get such a high priority?)

The above is adapted from "Understanding the Thread.Sleep Function": http://www.cnblogs.com/ILove/archive/2008/04/07/1140419.html

(Author's postscript: I found that article while searching for "C# SpinWait" as I wrote this post, and it happened to explain the OS scheduling principles well. I had not been particularly clear about the scheduling differences between Linux and Windows; reading it filled that gap.)

Thread.Sleep(n)

We mentioned that No. 5 said "I'm full, I'll skip for now" (suspend). How is that implemented? In C# it is Thread.Sleep(n); the corresponding Windows API is Sleep(n); the Linux APIs are sleep(n) and usleep(n); and in Java it is Thread.sleep(n). So what does Thread.Sleep(n) mean? Under Windows, it means: I will rest for n milliseconds and not compete for the CPU. Projected onto the cake scene: you all eat first; I'm full, I don't want any for the next half hour, I give up my place and will come back after resting for 30 minutes. Under Linux the meaning is the same; the only difference may be how the thread rejoins the queue after the sleep completes.

When the process/thread has slept long enough and rejoins the CPU competition, does Windows immediately hand the time slice to the thread that just woke up? Or does it recompute the thread's total priority from its hunger (it slept a long time) and base priority, compare it against the other processes/threads, and give the time slice to whichever has the highest total priority — not necessarily the one that woke up? Intuitively the former seems more reasonable: with the latter, even after your sleep ends, your total priority may still not beat the other threads, so regaining a time slice may be delayed, which does not seem very scientific. Yet Windows appears to have chosen the latter: MSDN notes that even Sleep(0) is not guaranteed to resume execution immediately; the process/thread is merely set to the ready state. Ready simply declares: I am back in the race for the CPU.

For this detail, let's look at what MSDN says about the Sleep() function:

The gist:

After the sleep interval elapses, the thread is ready to run. If you specify a sleep time of 0 milliseconds, the thread relinquishes the remainder of its time slice but remains ready. Note that a ready thread is not guaranteed to run immediately; consequently, the thread may not run until some time after the sleep interval elapses. For more details, see Scheduling Priorities.

The Scheduling Priorities page mentions:

The gist:

The operating system treats all threads of the same priority as equal, and assigns time slices to the highest-priority threads in round-robin fashion. If no thread at that priority is ready, the system assigns time slices to the threads at the next-lower priority, again in round-robin fashion. If a higher-priority thread becomes runnable, the system preempts the lower-priority thread (without letting it finish its time slice) and assigns a full time slice to the higher-priority thread. For more information, see Context Switches.

Glossary: round-robin scheduling — executing the corresponding tasks in rotation at a fixed time interval.

Reference: http://en.wikipedia.org/wiki/Round-robin_scheduling

So "Understanding the Thread.Sleep Function" is not entirely correct: Windows also has the concept of a time slice, but unlike Linux it does not distribute slices evenly. Preemptive means that a high-priority thread can, when needed, force low-priority threads to give up their time slices — rather "overbearing". The system as a whole still polls at a fixed interval (round robin) to decide which thread runs next. Generally, the Windows time slice is about 10–15 ms. This is also the minimum precision of the Sleep() function under default settings, and it can be changed via timeBeginPeriod().

The last paragraph of Scheduling Priorities also mentions:

The gist:

However, if a thread is waiting for lower-priority threads to complete some task, it is important that the waiting high-priority thread block its own execution. To do so, use the wait functions, a critical section, or the Sleep(), SleepEx(), or SwitchToThread() functions. These are preferable to having the thread execute a spin loop; otherwise the processor may become effectively deadlocked, because a low-priority thread may never be scheduled.

In this paragraph, Microsoft hints that for good sleep and thread-switch management we should use the Sleep() and SwitchToThread() functions as appropriate — and jimi_yield() is exactly SwitchToThread() under Windows. This matches the SpinWait.cs we mentioned earlier, which uses Sleep(0), Sleep(1), and SwitchToThread() respectively. As for the wait functions and critical sections, I have not studied them; critical sections are not easy to dig into, but the difference between the WaitForSingleObject() family and Sleep() is worth studying. Since there seems to be no place to use that technique here, I will set it aside for now.

Sleep(0), Sleep(1), and SwitchToThread(): the stories they have to tell

(To be continued in this section...)

Rest

That's it for now — this has already run long. I will keep extending this article later rather than opening a new post; much remains unwritten, and I will add it gradually, so readers who are interested, please check back. Those who understand will understand; I just want to explain things a little more clearly.

RingQueue

RingQueue's GitHub address is https://github.com/shines77/RingQueue; you can also download a UTF-8 encoded version: https://github.com/shines77/RingQueue-utf8. I dare say it is a decent hybrid spin lock — download it and have a look. It supports Makefile, Code::Blocks, Visual Studio 2008/2010/2013 and later, and CMake, on Windows, MinGW, Cygwin, Linux, Mac OS X, and more.


Previous: A Murder Caused by a Lock-Free Message Queue: How to Be a Real Programmer? (IV) — RingQueue (Upper): Spin lock

(To be continued...)

