A murder caused by a lock-free message queue: How to be a real programmer? (4)--month: About RingQueue (Part 1)

Last Update:2015-01-08 Source: Internet

Author: User


Directory

A murder caused by a lock-free message queue: How to be a real programmer? (1)--ground: Cause

A murder caused by a lock-free message queue: How to be a real programmer? (2)--month: Spin lock

A murder caused by a lock-free message queue: How to be a real programmer? (3)--to: q3.h and RingBuffer

A murder caused by a lock-free message queue: How to be a real programmer? (4)--month: About RingQueue (Part 1)

Opening words

I have not been feeling well these past two days. I even got through almost all of the latest adaptation of "Deer Ding Kee" (the current version), mostly watching it lying in bed; it is not actually that good, I was probably just too bored. Still, I now have a general understanding of Disruptor, and I modified Disruptor's test code so that it is basically equivalent to the RingQueue test. The current results differ a little from RingQueue's, but not by much. In any case, Disruptor is not as fast as I had imagined; I do not know whether that is down to the language itself, so maybe I should try to find a C++ version of Disruptor. Also, the more producers and consumers Disruptor has, the slower it gets. At the same time I updated RingQueue a bit and added a throughput figure in ops/sec, similar to Disruptor's.

Test code for Disruptor: https://github.com/shines77/RingQueue/blob/master/disruptor/RingBufferPerfTest.java

I will write a detailed analysis of Disruptor later (mainly of the multi-producer, multi-consumer case with bounded capacity; the new version has changed somewhat compared with the articles found online). If you have experience with it, please help me test it and find problems. Apart from Disruptor's sleep strategy, which differs from mine, everything else about it looks good. At the same time, I found two articles on wait-free algorithms for multi-producer, multi-consumer queues through the Wikipedia link provided by Korall. I studied them for a while but could not tell whether they are correct; and even if they are, would an implementation actually be faster?

Lock-free

Prompted by a comment from reader Korall on the third article, let's discuss the definition of lock-free. We define it this way:

If a method is lock-free, it guarantees that some thread calling it always completes in a finite number of steps. A thread that is suspended, blocked or crashed can never stop the other threads from making progress; that is what "no lock" means.

The opposite of lock-free is lock-based, or blocking; a near-synonym is non-blocking.

Reference from: http://ifeve.com/lock-free-and-wait-free/

Wait-free

Definition of Wait-free (no Wait):

If a method is wait-free, it guarantees that every call finishes in a finite number of steps. It can also be understood as "no loops", or as a loop count that is constant or bounded.

Reference from: http://ifeve.com/lock-free-and-wait-free/
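To make the two definitions concrete, here is a minimal C++ sketch. It is an illustration only, not code from RingQueue; the function names and the helper run_counter_demo are made up for this example. The CAS loop is lock-free: a thread may retry, but only because another thread's update succeeded. The fetch_add version has no loop in the source; on x86 it typically compiles to a single locked instruction, which is why it is usually described as wait-free there (on other architectures it may be implemented as a retry loop).

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Lock-free: the CAS can fail and be retried, but it only fails when some
// other thread changed the counter, so the system as a whole always progresses.
inline void increment_lock_free(std::atomic<int>& counter) {
    int seen = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_strong(seen, seen + 1,
                                            std::memory_order_relaxed)) {
        // On failure `seen` now holds the current value; just retry.
    }
}

// No retry loop at the source level: each call is one atomic add.
inline void increment_wait_free(std::atomic<int>& counter) {
    counter.fetch_add(1, std::memory_order_relaxed);
}

// Hypothetical helper: start `threads` threads, each calling `fn`
// `per_thread` times on a shared counter, and return the final value.
template <typename Fn>
int run_counter_demo(Fn fn, int threads, int per_thread) {
    assert(threads > 0 && per_thread >= 0);
    std::atomic<int> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) fn(counter);
        });
    for (auto& th : pool) th.join();
    return counter.load();
}
```

Both variants always end with the same total; the difference lies only in how much work a single call may need under contention.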

Q3.h Analysis

The previous article analyzed how Sinclair's q3.h works. Following that analysis and the reminder from reader Korall, we can see that the first half of q3.h's push() is lock-free. Lock-free here means that if a thread sleeps indefinitely or crashes while taking a sequence number (assuming a crash is possible), it does not block other threads from taking theirs. The second half of push(), the step that confirms the commit, is not lock-free but blocking, because of the statement q->head.second = next;. If thread A sleeps indefinitely or crashes after taking its number but before that statement takes effect, other threads are blocked when confirming their own commits: the preceding sequence number has not been committed, so every later thread has to wait forever, which amounts to a deadlock. This step is therefore effectively a "lock". pop() is the same: its first half is lock-free and its confirmation step is a "lock".

Taken as a whole, q3.h's push() and pop() consist of two lock-free sections and two "locks". The two "locks" are separate for push() and pop(): contention on push()'s "lock" occurs only among push() threads, contention on pop()'s "lock" occurs only among pop() threads, and the same holds for the lock-free sections. Even so, two "locks" plus two lock-free sections add up to a fair amount of contention overall.

So why not simply use a single "lock"? At the very least it is worth trying, and that is how the hybrid spin lock RingQueue::spin2_push() was born.
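The two-phase structure analyzed in this section can be sketched in C++ roughly as follows. This is a reconstruction for illustration, not the q3.h source: the class name TwoPhaseQueue, the use of std::atomic and the yield-based waiting are all assumptions, and only the first/second naming follows the head.second statement quoted above. Phase 1 claims a sequence number with CAS (lock-free); phase 2 waits for all earlier numbers to be committed, and that wait is where a stalled thread blocks everyone behind it.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>

template <typename T, std::size_t Capacity>
class TwoPhaseQueue {
    static_assert((Capacity & (Capacity - 1)) == 0, "Capacity must be a power of two");
    static constexpr std::uint32_t kMask = Capacity - 1;
    struct Cursor {
        std::atomic<std::uint32_t> first{0};   // next sequence number to claim
        std::atomic<std::uint32_t> second{0};  // every number below this is committed
    };
    Cursor head_, tail_;
    T slots_[Capacity];

public:
    void push(const T& item) {
        std::uint32_t head;
        for (;;) {  // Phase 1 (lock-free): claim a sequence number.
            head = head_.first.load(std::memory_order_relaxed);
            std::uint32_t tail = tail_.second.load(std::memory_order_acquire);
            if (head - tail > kMask) { std::this_thread::yield(); continue; }  // full
            if (head_.first.compare_exchange_weak(head, head + 1,
                                                  std::memory_order_relaxed)) break;
        }
        slots_[head & kMask] = item;
        // Phase 2 (blocking): wait until every earlier push has committed.
        while (head_.second.load(std::memory_order_acquire) != head)
            std::this_thread::yield();
        head_.second.store(head + 1, std::memory_order_release);
    }

    T pop() {
        std::uint32_t tail;
        for (;;) {  // Phase 1 (lock-free): claim the next item to read.
            tail = tail_.first.load(std::memory_order_relaxed);
            std::uint32_t head = head_.second.load(std::memory_order_acquire);
            if (tail == head) { std::this_thread::yield(); continue; }  // empty
            if (tail_.first.compare_exchange_weak(tail, tail + 1,
                                                  std::memory_order_relaxed)) break;
        }
        T item = slots_[tail & kMask];
        // Phase 2 (blocking): wait until every earlier pop has committed.
        while (tail_.second.load(std::memory_order_acquire) != tail)
            std::this_thread::yield();
        tail_.second.store(tail + 1, std::memory_order_release);
        return item;
    }
};

// Single-threaded smoke test of the FIFO order.
inline void two_phase_queue_demo() {
    TwoPhaseQueue<int, 8> q;
    q.push(1);
    q.push(2);
    assert(q.pop() == 1);
    assert(q.pop() == 2);
}
```

Note that this sketch waits (rather than returning an error) when the queue is full or empty, so a single thread must not push more than Capacity items without popping.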

A failed attempt

That was not my first thought, though. At first I followed the general idea of q3.h and tried to do everything with CAS, that is, so-called lock-free. This code is RingQueue::push() and RingQueue::pop() in include\ringqueue\ringqueue.h; the code is as follows:

We soon find a problem: a thread may be put to sleep right after its CAS completes, and once that happens, the problems described above can occur. The root cause is that although CAS guarantees that the head a thread obtains is unique, it cannot update core.queue[head & kmask] = item; in the same atomic step; if it could, this would be perfect. Then I thought of double CAS: update core.info.head and core.queue[head & kmask] together with one double CAS, and everything would be fine. But I soon found that double CAS on x86 requires the updated memory to be contiguous. core.info.head and core.queue[head & kmask] are not contiguous in memory, so double CAS cannot be used, at least not on current x86 CPUs, and because of the logic involved there is no way to make them contiguous. Unless Intel implements the enhanced double CAS we want some day...

So this method does not work. Although it is simple, it suffers many conflicts, so it is not very fast, and the correctness check also confirms that it is buggy. Interested readers can uncomment the line //ringqueue_test(0, true); in main() and define TEST_FUNC_TYPE as 0 in test.h.
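The flaw can be replayed deterministically on a single thread. The sketch below is a hypothetical reconstruction of the idea, not the RingQueue source: NaiveCasQueue and its member names are made up, push is split into its two steps so that a "sleep" between them can be simulated, and the full-queue check is left out for brevity.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct NaiveCasQueue {
    static constexpr std::uint32_t kMask = 7;   // capacity 8
    std::atomic<std::uint32_t> head{0};
    std::atomic<std::uint32_t> tail{0};
    int queue[8] = {};                           // 0 marks "never written"

    // Step 1 of push: CAS makes the claimed index unique...
    std::uint32_t claim_slot() {
        std::uint32_t h = head.load();
        while (!head.compare_exchange_weak(h, h + 1)) {}
        return h;
    }
    // Step 2 of push: ...but the store is a separate, later operation.
    void store_item(std::uint32_t h, int item) { queue[h & kMask] = item; }

    bool pop(int& out) {
        std::uint32_t t = tail.load();
        do {
            if (t == head.load()) return false;  // looks empty
        } while (!tail.compare_exchange_weak(t, t + 1));
        out = queue[t & kMask];
        return true;
    }
};

// Replays the bad interleaving: the producer stalls between its two steps.
inline int pop_during_stalled_push() {
    NaiveCasQueue q;
    std::uint32_t slot = q.claim_slot();  // producer claims index 0, then "sleeps"
    int got = -1;
    bool ok = q.pop(got);                 // consumer sees head == 1 and pops
    q.store_item(slot, 42);               // producer wakes up too late
    assert(ok);                           // the pop "succeeded"...
    return got;                           // ...but returned the unwritten slot (0)
}
```

The consumer receives 0 instead of 42: once head has advanced, nothing tells the consumer that the slot itself has not been written yet.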

Spin lock

So I started writing a "lock". Naturally, we construct a structure called spin_mutex_t. It is called spin_mutex because that is what TBB calls its lock, and since it is logically a mutual exclusion, calling it a mutex is also appropriate, although the common habit is to call it spin_lock.

In fact, the only really useful field is a single uint32_t locked: 1 means locked and 0 means unlocked. padding1, padding2 and padding3 are there for alignment; because we did not declare a byte alignment on the structure itself (I was too lazy), a padding1 had to be added at the beginning as well, and wasting a little memory does not matter. The fields after them, spin_counter, recurse_counter, thread_id and so on, are reserved for future extension; they are unused here and were written for fun.

Because I had never written a spin lock before, I wrote this:

jimi_lock_test_and_set32() is actually InterlockedExchange() or __sync_lock_test_and_set(). The code starts with while (spin_mutex.locked != 0) to wait until the state is 0 (unlocked); when it is 0 the thread enters the locked region and atomically sets the state to 1 (locked); on leaving the region it sets the state back to 0 (unlocked). Everything looks normal, but it livelocks: a message may occasionally get through, but very, very slowly, and it may even turn into a deadlock; in any case the state is almost unknowable. To see why, look at the line marked in red, between the check and the atomic set. The state the thread observed really was 0 (unlocked), but if the thread is put to sleep at that point, the state is still 0, so other threads can enter the locked region and set the state to 1. When the first thread wakes up, it does not know the lock is already 1 (locked); it believes it is inside the protected region and sets the state to 1 again. Two threads are now inside the locked region at the same time, and access to the shared resource is no longer synchronized. I also suspected that jimi_lock_test_and_set32() might fail because another thread holds the cache line locked, so that the write of 1 sometimes does not succeed. Strictly speaking an atomic exchange always completes, so the gap between the check and the set is the real reason several threads end up in the locked region together. When the queue is full or empty, the state becomes even more unpredictable.

This code is still kept as RingQueue::spin9_push() and RingQueue::spin9_pop(), and, in the compiled version, as RingQueue::spin8_push() and RingQueue::spin8_pop(); the two are the same.
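The check-then-set race can also be replayed on one thread. The sketch below is a hypothetical reconstruction of the flawed pattern, not the spin9_push() source; BrokenSpinLock and its member names are made up, and std::atomic's exchange() stands in for jimi_lock_test_and_set32().

```cpp
#include <atomic>
#include <cassert>

struct BrokenSpinLock {
    std::atomic<unsigned> locked{0};   // 0 = unlocked, 1 = locked

    // Step 1: the check done by while (locked != 0).
    bool looks_free() const { return locked.load() == 0; }

    // Step 2: the set. The exchange always writes 1; the previous state it
    // returns is ignored, which is what makes the pattern unsafe.
    void mark_locked() { locked.exchange(1); }
};

// Replays "A checks, B checks, A sets, B sets" and counts the "owners".
inline int owners_after_race() {
    BrokenSpinLock lock;
    int owners = 0;
    bool a_ok = lock.looks_free();   // thread A passes the check...
    bool b_ok = lock.looks_free();   // ...is preempted, and B passes it too
    if (a_ok) { lock.mark_locked(); ++owners; }
    if (b_ok) { lock.mark_locked(); ++owners; }
    return owners;                   // 2: both believe they hold the lock
}
```

Had the code used the value returned by the exchange (the old state) to decide who won, only one of the two would have entered; that is essentially what the CAS version in the next section does.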

Compare-and-swap

Soon afterwards, we used CAS (compare-and-swap) to fix the faulty code above. CAS is introduced in the third article of this series.

/* Semantics of CAS; the real operation performs these steps as one atomic step. */
int val_compare_and_swap(volatile int *dest_ptr, int old_value, int new_value)
{
    int orig_value = *dest_ptr;
    if (*dest_ptr == old_value)
        *dest_ptr = new_value;
    return orig_value;
}

What we need is to test that the lock state is 0 and set it to 1 in a single step, and CAS does exactly that. Because a CAS operation is atomic, the earlier problem of several threads entering the locked region at once cannot occur.

RingQueue::spin_push()

This is the first version of the spin lock, RingQueue::spin_push(). As you can see, it does not actually spin; it calls jimi_wsleep(0), which is equivalent to Sleep(0) on Windows and to sched_yield() on Linux. Rather than really sleeping, it gives up the CPU and switches to another thread.

Another point: on the line before each of the two "spin_mutex.locked = 0;" statements, I commented out an atomic write. That is what I wrote originally, but after careful analysis I realized that since the door was closed with CAS when taking the lock, the final "unlock" does not need an atomic operation to update the lock state; the moment it is set to 0, the lock is released. This is more efficient, and if you think about it you will see why. The only thing required is a memory barrier (or compiler barrier) before the lock state is updated, such as Jimi_ReadWriteBarrier() in the code. I later confirmed this by reading a lot of other code. It is nothing unusual, but in case it is unfamiliar it is worth mentioning.
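A minimal C++ sketch of this first version might look like the following. It is an illustration under assumptions, not the RingQueue code: SimpleSpinLock and count_with_lock are made-up names, std::this_thread::yield() plays the role of jimi_wsleep(0), and a release store takes the place of the barrier followed by a plain write.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class SimpleSpinLock {
    std::atomic<unsigned> locked_{0};   // 0 = unlocked, 1 = locked
public:
    void lock() {
        unsigned expected = 0;
        // CAS: "if it is 0, make it 1" happens as one atomic step.
        while (!locked_.compare_exchange_weak(expected, 1,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed)) {
            expected = 0;                // a failed CAS overwrote it with the value seen
            std::this_thread::yield();   // give up the CPU instead of spinning
        }
    }
    void unlock() {
        // No read-modify-write is needed: only the owner writes here. The
        // release ordering keeps critical-section writes before this store.
        locked_.store(0, std::memory_order_release);
    }
};

// Hypothetical check: several threads bump a plain (non-atomic) counter
// under the lock; the total is exact only if the lock really excludes.
inline long count_with_lock(int threads, int per_thread) {
    assert(threads > 0 && per_thread >= 0);
    SimpleSpinLock lock;
    long total = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) {
                lock.lock();
                ++total;
                lock.unlock();
            }
        });
    for (auto& th : pool) th.join();
    return total;
}
```

The unlock is cheaper than a second CAS or exchange because it is an ordinary store with ordering constraints rather than a locked read-modify-write.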

I also wrote a version that really spins; see the code of RingQueue::spin_push(). To enable it, define the macro USE_SPIN_MUTEX_COUNTER as 1 in test.h. The spin loop is controlled by the macro MUTEX_MAX_SPIN_COUNT (the default is 1).

This is the same strategy used by spin_mutex in Intel's multithreading library TBB.

Test results

With just this small improvement the speed is already good; the drawback is that it is not stable enough.

Sometimes this is the case:

Test environment

Here's a look at my test environment:

CPU: Intel Q8200 2.4G, 4 cores

System: Windows 7 SP1

Memory: 4G/DDR2 1066

Compiler: Visual Studio Ultimate Update 2

The test uses 4 push() threads (producers) and 4 pop() threads (consumers), a total of 8 million messages, a queue capacity of 1024x768 (buffer_size), x86 mode, and no CPU affinity.

System Mutex Lock

If this speed means nothing to you, let's find something to use as a baseline. A mutex is a good fit: on Windows it is the critical section (CRITICAL_SECTION), and on Linux it is pthread_mutex_t:

This is the result for PUSH_CNT = 2, POP_CNT = 2, but for the system mutex, this value is much different.

The system mutex version is RingQueue::mutex_push().
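For readers who want a picture of what such a baseline looks like, here is a minimal C++ sketch. It is not the mutex_push() source: MutexRingQueue is a made-up name, and std::mutex is used as a portable stand-in for the platform mutex (the original uses the critical section on Windows and pthread_mutex_t on Linux).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>

template <typename T, std::size_t Capacity>
class MutexRingQueue {
    static_assert((Capacity & (Capacity - 1)) == 0, "Capacity must be a power of two");
    std::mutex mutex_;          // one lock guards both indices and the slots
    std::uint32_t head_ = 0;    // next slot to write
    std::uint32_t tail_ = 0;    // next slot to read
    T slots_[Capacity];
public:
    // Returns false instead of waiting when the queue is full.
    bool push(const T& item) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (head_ - tail_ == Capacity) return false;
        slots_[head_++ & (Capacity - 1)] = item;
        return true;
    }
    // Returns false instead of waiting when the queue is empty.
    bool pop(T& out) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (head_ == tail_) return false;
        out = slots_[tail_++ & (Capacity - 1)];
        return true;
    }
};
```

Because a single lock covers both ends, producers and consumers all contend on the same mutex, which is what makes it a useful lower bar for the spin-lock variants.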

Test results for Q3.h

Below are the test results for q3.h by Douban's Sinclair. As you can see, it is not very fast, even slower than the system mutex. It only works properly when (PUSH_CNT + POP_CNT) <= the number of CPU cores; otherwise it is extremely slow. So the data below was also obtained with PUSH_CNT = 2, POP_CNT = 2:

Thread.Yield()

To be honest, RingQueue::spin_push() is still a bit too simple. Even TBB's spinning version is not much more complicated, and it feels like something is still missing.

I once saw, in the c++1y Boost Communication group (QQ group: 296561497), denghe.net post the source of C#'s Thread.Yield(), decompiled with a reflection tool (what was posted to the group was a picture). I had probably seen similar things elsewhere, but when you are not actually about to use something, you do not notice the issue. Later I searched the chat history of two groups for an hour and still could not find it. I knew I could find it online, but the one denghe posted seemed a bit special. He later told me it was a decompiled SpinWait.cs; by then I had in fact also found SpinWait.cs, but because I am not very familiar with C#, I had not thought of using a reflection tool.

I already knew roughly how Thread.Yield() works, but having a concrete version to refer to is better. Apart from its parameters being fixed and not adjustable, the C# Thread.Yield() code is set up quite sensibly, and it is the embryonic form of RingQueue::spin2_push(). Of course, I made some minor adjustments, because how efficient a hybrid spin lock is depends on how you tune these parameters. This is a sleep strategy, or an art of sleeping, and I will describe it in more detail later. For today, I will keep you in suspense.
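As a rough picture of what such a sleep strategy looks like, here is a C++ sketch of an escalating back-off: spin briefly, then yield, and occasionally sleep. It is not spin2_push(). The names WaitAction, next_action and HybridSpinLock are made up, and the thresholds (10, 5, 20) are assumptions loosely modeled on a reading of the .NET SpinWait.cs; they may not match any particular .NET version and are not RingQueue's values. Standard C++ has no direct Sleep(0), so yield() stands in for it here.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

enum class WaitAction { Spin, Yield, Sleep0, Sleep1 };

constexpr unsigned kYieldThreshold = 10;  // attempts that busy-spin first
constexpr unsigned kSleep0Every = 5;      // every 5th yield becomes "sleep(0)"
constexpr unsigned kSleep1Every = 20;     // every 20th yield becomes "sleep(1 ms)"

// Decides what a waiter should do on its Nth failed attempt (0-based).
inline WaitAction next_action(unsigned attempt) {
    if (attempt < kYieldThreshold) return WaitAction::Spin;
    unsigned yields = attempt - kYieldThreshold;
    if (yields % kSleep1Every == kSleep1Every - 1) return WaitAction::Sleep1;
    if (yields % kSleep0Every == kSleep0Every - 1) return WaitAction::Sleep0;
    return WaitAction::Yield;
}

class HybridSpinLock {
    std::atomic<unsigned> locked_{0};   // 0 = unlocked, 1 = locked
public:
    bool try_lock() {
        unsigned expected = 0;
        return locked_.compare_exchange_strong(expected, 1,
                                               std::memory_order_acquire);
    }
    void lock() {
        for (unsigned attempt = 0; !try_lock(); ++attempt) {
            switch (next_action(attempt)) {
            case WaitAction::Spin:
                // Busy-wait with an exponentially growing budget, watching
                // the lock word with cheap loads before retrying the CAS.
                for (unsigned i = 0; i < (4u << attempt); ++i)
                    if (locked_.load(std::memory_order_relaxed) == 0) break;
                break;
            case WaitAction::Yield:
            case WaitAction::Sleep0:
                std::this_thread::yield();
                break;
            case WaitAction::Sleep1:
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
                break;
            }
        }
    }
    void unlock() { locked_.store(0, std::memory_order_release); }
};
```

Keeping the policy in a separate function makes the parameters easy to tune and to test in isolation, which matters because the efficiency of a hybrid lock depends mostly on these numbers.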

About using RingQueue

The most important part of the RingQueue source is include\ringqueue\test.h, whose macro definitions are the switches for the various parts of the build. I have commented almost all of them and covered most of what I could think of; if something is unclear you can study it yourself or ask me. The most important ones are PUSH_CNT and POP_CNT: as the names suggest, PUSH_CNT is the number of push() threads and POP_CNT is the number of pop() threads, corresponding to the number of producers and consumers respectively. If you want to test q3.h, (PUSH_CNT + POP_CNT) must be less than or equal to your CPU core count, otherwise it will be absurdly slow. For example, on a dual-core CPU you can only define PUSH_CNT = 1, POP_CNT = 1. If you are not testing q3.h, I recommend setting both PUSH_CNT and POP_CNT to your CPU core count, that is, PUSH_CNT = 2, POP_CNT = 2 in this example. Generally speaking, setting the total number of threads to twice the number of CPU cores improves CPU utilization; it is a good choice, though not necessarily the optimal one. For the other settings, see test.h.

Dormancy

That is all for today. I originally wanted to write everything in one go; if possible, I will merge the parts together later.

Actually there is another reason this post was delayed: somehow I ended up promoting my blog post in the Redui group and got into a bit of an argument with some people there. After two days, despite some friction, they eventually accepted some of my thinking about RingQueue, and in the end we discussed it together, which made me a little happy; at least it went better than in the Skynet group. There is not much left to say except a lot of details. I also ran into some strange problems and would like to write those details up, but I find them hard to keep under control; perhaps I should hold the article back a bit longer, or you may find it faster to read the source directly.

RingQueue

RingQueue's GitHub address is https://github.com/shines77/RingQueue, and a UTF-8 encoded version can be downloaded from https://github.com/shines77/RingQueue-utf8. I daresay it is a nice hybrid spin lock; download it and have a look. It supports Makefile, Code::Blocks, Visual Studio 2008, 2010 and 2013, and CMake, and runs on Windows, MinGW, Cygwin, Linux, Mac OS X and so on. It may not support ARM, as I have no test environment for it.

(To be continued... Stay tuned...)
