A bloody case caused by a lock-free message queue: how to be a real programmer? (4) -- Moon: About RingQueue (Part 1)

Last Update:2015-01-08 Source: Internet

Author: User


Contents

A bloody case caused by a lock-free message queue: how to be a real programmer? (1) -- Earth: Cause

A bloody case caused by a lock-free message queue: how to be a real programmer? (2) -- Moon: Spin lock

A bloody case caused by a lock-free message queue: how to be a real programmer? (3) -- Earth: q3.h and RingBuffer

A bloody case caused by a lock-free message queue: how to be a real programmer? (4) -- Moon: About RingQueue (Part 1)

Opening Remarks

I have not been feeling well these past two days. I even finished the latest "Lu Ding Ji" (liangdong version), mostly from bed. It is actually not that good; perhaps I was just too bored. Still, I studied Disruptor and modified its test code so that it is close to the RingQueue test. There is still a gap between Disruptor's current results and RingQueue's, though not a large one. In any case, Disruptor is not as fast as I had imagined. I don't know whether the language itself is the cause; maybe I should try a C++ version of Disruptor. Disruptor also gets slower as the number of producers and consumers grows, which is hard to believe. I also updated RingQueue to report throughput in ops/sec, similar to Disruptor.

The Disruptor test code is at: https://github.com/shines77/RingQueue/blob/master/disruptor/RingBufferPerfTest.java

I will write a detailed analysis of Disruptor later (mainly the multi-producer, multi-consumer case; my ability is limited, and the new version differs somewhat from the articles I found online). If you have experience with it, you are welcome to run the test first and help me find problems. Apart from Disruptor's sleep strategy, which is not as good as mine, the rest of it looks good. I also found two papers on wait-free algorithms for multiple producers and consumers through the Wikipedia link that korall provided. After studying them for a while, I still can't tell whether they are correct, and even if they are, would an implementation actually be faster?

Lock-free

Prompted by korall, a reader who commented on part 3, let's discuss the definition of lock-free. We define it as follows:

A method is lock-free if it guarantees that, however the threads are scheduled, some thread calling it always completes within a finite number of steps. In other words, a thread that is suspended or blocked cannot prevent the other threads from finishing, hence "no lock".

The opposite of lock-free is locking, or blocking; a near-synonym is non-blocking.

Reference: http://ifeve.com/lock-free-and-wait-free/

Wait-free

Definition of wait-free (no waiting):

A method is wait-free if it guarantees that every call finishes within a finite number of steps. You can think of it as "no loops", or as loops whose iteration count is a constant or is bounded.

Reference: http://ifeve.com/lock-free-and-wait-free/
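To make the two definitions concrete, here is a minimal C++ sketch. It is not taken from RingQueue or q3.h, and the function names are made up for illustration. The first function finishes in a bounded number of steps for every caller on x86; the second retries in a CAS loop, so an individual thread may retry many times, but a retry only happens because some other thread succeeded.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Wait-free on x86: fetch_add compiles to a single LOCK XADD, so every caller
// finishes in a bounded number of steps. (On LL/SC architectures the same call
// may be implemented as a retry loop, which is only lock-free.)
inline uint32_t next_ticket_wait_free(std::atomic<uint32_t>& counter) {
    return counter.fetch_add(1, std::memory_order_relaxed);
}

// Lock-free but not wait-free: increments the counter only while it is below
// `limit`. A failed CAS means another thread changed the counter, so the
// system as a whole made progress even though this thread must retry.
inline uint32_t bounded_increment_lock_free(std::atomic<uint32_t>& counter,
                                            uint32_t limit) {
    assert(limit > 0);
    uint32_t seen = counter.load(std::memory_order_relaxed);
    while (seen < limit &&
           !counter.compare_exchange_weak(seen, seen + 1,
                                          std::memory_order_relaxed)) {
        // compare_exchange_weak stored the current value in `seen`; retry.
    }
    return seen;  // value observed before the increment, or >= limit if saturated
}
```

Neither function can be stalled by a thread that goes to sleep at an unlucky moment, which is the property the "lock" parts discussed in this article do not have.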

q3.h Analysis

The previous article analyzed how Sinclair's q3.h works. As korall pointed out, the first half of q3.h's push(), the step that takes a sequence number, is lock-free. Lock-free here means that if a thread sleeps indefinitely or crashes (assuming a crash is possible) while taking its number, other threads taking numbers are not blocked.

The second half of push(), which confirms that the submission succeeded, is not lock-free; it blocks. Suppose thread A has taken its number and then sleeps indefinitely or crashes anywhere before the statement q->head.second = next; takes effect. Every other thread will then be stuck in the confirmation step: because the earlier sequence number has not been committed, all later threads must wait forever, which is a deadlock. This step is therefore effectively a "lock". pop() works the same way: the first half is lock-free, and the confirmation step that follows is a "lock".

Taken together, push() and pop() in q3.h consist of two lock-free parts and two "locks". The two "locks" are independent of each other: contention on the push() "lock" happens only among push() threads, and contention on the pop() "lock" only among pop() threads; the same holds for the lock-free parts. Still, with two "locks" and two lock-free parts, the overall contention does not seem small.

So why not simply use one "lock" for everything? It is at least worth trying, and that is how the hybrid spin lock RingQueue::spin2_push() was born.
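The two phases can be sketched roughly as follows. This is a simplified C++ illustration written from the description above, not the actual q3.h code: the struct, the field names (head_first, head_second, tail_second) and the fixed capacity are assumptions, and pop() is omitted.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical layout, modeled on the description of q3.h in the text.
struct TwoPhaseQueue {
    static constexpr uint32_t kCapacity = 8;            // must be a power of two
    static constexpr uint32_t kMask = kCapacity - 1;
    std::atomic<uint32_t> head_first{0};   // next sequence number to hand out
    std::atomic<uint32_t> head_second{0};  // every number below this is committed
    std::atomic<uint32_t> tail_second{0};  // advanced by consumers (not shown)
    void* slots[kCapacity] = {};
};

inline bool two_phase_push(TwoPhaseQueue& q, void* item) {
    assert(item != nullptr);  // this sketch does not store null items
    uint32_t head = q.head_first.load(std::memory_order_relaxed);

    // Phase 1 (lock-free): claim a unique sequence number with CAS.
    for (;;) {
        uint32_t tail = q.tail_second.load(std::memory_order_acquire);
        if (head - tail > TwoPhaseQueue::kMask) {
            return false;  // queue is full
        }
        if (q.head_first.compare_exchange_weak(head, head + 1,
                                               std::memory_order_relaxed)) {
            break;  // `head` is now ours; on failure it was reloaded for us
        }
    }
    q.slots[head & TwoPhaseQueue::kMask] = item;

    // Phase 2 (blocking): wait until every earlier number has been committed.
    // If the owner of an earlier number sleeps or dies, we wait here forever.
    while (q.head_second.load(std::memory_order_acquire) != head) {
        std::this_thread::yield();
    }
    q.head_second.store(head + 1, std::memory_order_release);
    return true;
}
```

The while loop in phase 2 is the "lock" in question: no mutex appears anywhere, yet one stalled producer holds up every producer behind it.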

Failed attempts

But that is not what I thought at first. My first idea was a CAS-based implementation along the general lines of q3.h, the so-called lock-free approach. This code is in RingQueue.h (under include\), as RingQueue::push() and RingQueue::pop().

I soon found a problem: a thread may be put to sleep right after its CAS completes, and once that happens the problem described above can occur. The root cause is that although CAS guarantees each thread claims a unique head, it cannot also perform core.queue[head & kMask] = item; in the same atomic step. If it could, this would be perfect. I then thought of using a double CAS to update core.info.head and core.queue[head & kMask] together. But I soon found that double CAS on x86 requires the updated memory to be contiguous. core.info.head and core.queue[head & kMask] are not contiguous in memory, so double CAS cannot be used, at least on current x86 CPUs, and the logic of the queue does not allow us to make them contiguous. Unless Intel implements the enhanced double CAS we want some day...

So this method does not work. It is simple, but contention is heavy, so it is not very fast, and the correctness check shows that it has a bug. If you are interested, uncomment // RingQueue_Test(0, true); in main() and define TEST_FUNC_TYPE as 0 in test.h to try it.
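The contiguity restriction is easier to see in code. The sketch below is an illustration only, with hypothetical names: two 32-bit fields are packed into one 64-bit word, which is what lets a single compare-exchange update them together. A queue slot at an unrelated address cannot be folded into the same operation.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Two 32-bit indices that share one 64-bit word (hypothetical layout).
struct HeadTail {
    uint32_t head;
    uint32_t tail;
};

inline uint64_t pack(HeadTail v) {
    return (static_cast<uint64_t>(v.tail) << 32) | v.head;
}

inline HeadTail unpack(uint64_t word) {
    return HeadTail{static_cast<uint32_t>(word),
                    static_cast<uint32_t>(word >> 32)};
}

// Advances head by one only if BOTH head and tail still have the expected
// values. This works because both fields live in the same atomic word.
inline bool advance_head_if_unchanged(std::atomic<uint64_t>& word,
                                      HeadTail expected) {
    uint64_t old_word = pack(expected);
    HeadTail desired = expected;
    desired.head += 1;
    return word.compare_exchange_strong(old_word, pack(desired),
                                        std::memory_order_acq_rel);
}
```

In the queue, the index and the slot it selects sit at different addresses (and the slot address changes with every push), so there is no single word to hand to the compare-exchange.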

Spin lock

So I started to write the "lock". Taking things for granted, I defined a struct, spin_mutex_t. It is named "spin_mutex" because that is what TBB calls its spin lock; logically it is a mutex, so the name fits, although spin_lock is the more customary name.

Only uint32_t locked holds useful data: 1 means locked and 0 means unlocked. padding1, padding2, and padding3 are filler for alignment. Because I did not declare an alignment for the struct itself (I was too lazy), padding1 has to sit at the very beginning; wasting a little memory does no harm. The remaining fields, spin_counter, recurse_counter, and thread_id, are reserved for future extension. They are not used here; I wrote them for fun.
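A struct along these lines might look like the sketch below. The field names come from the text; the field types, the padding sizes, and the 64-byte cache line are assumptions, so the real definition in RingQueue may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, as on typical x86 CPUs.
constexpr std::size_t kCacheLineSize = 64;

struct spin_mutex_t {
    // Leading filler: keeps `locked` off the cache line of whatever data
    // happens to precede the struct, since the struct itself is not aligned.
    char padding1[kCacheLineSize];
    volatile uint32_t locked;  // 1 = locked, 0 = unlocked
    char padding2[kCacheLineSize - sizeof(uint32_t)];
    // Reserved for future extension; unused by the lock itself.
    uint32_t spin_counter;
    uint32_t recurse_counter;
    uint32_t thread_id;
    char padding3[kCacheLineSize - 3 * sizeof(uint32_t)];
};

static_assert(offsetof(spin_mutex_t, locked) == kCacheLineSize,
              "locked starts one cache line into the struct");
static_assert(sizeof(spin_mutex_t) == 3 * kCacheLineSize,
              "the struct spans three cache lines");
```

With C++11 or later, declaring the struct with alignas(64) would make the leading padding1 unnecessary, at the cost of requiring suitably aligned storage.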

I say "took it for granted" because I had never written a spin lock before, so my first version was naive.

jimi_lock_test_and_set32() is really InterlockedExchange() on Windows or __sync_lock_test_and_set() with GCC. The code starts with while (spin_mutex.locked != 0) to wait until the lock is in state 0 (unlocked). Once it sees 0, it enters the locked region and uses the atomic operation to set the state to 1 (locked). On leaving the locked region, it sets the state back to 0 (unlocked).

Everything looks normal, but at run time it livelocks. Occasionally a message gets through, but very slowly, and it may even end in deadlock; the state is almost unpredictable.

The cause lies in the gap between the check and the atomic set (the line marked in red in the original listing). The state may really have been 0 (unlocked) when it was checked, but if the thread is put to sleep at that point, the state is still 0, so another thread can enter the locked region and set the state to 1. When the first thread wakes up, it does not know that the lock is already in state 1. It enters the locked region too and sets the state to 1 again. Two threads are now inside the locked region at the same time, and access to the shared resources is no longer synchronized.

Note also that jimi_lock_test_and_set32() returns the previous lock state. This code ignores that value, so a thread that loses the race never finds out. Under heavy contention this makes it even more common for several threads to be inside the locked region at once. When the queue is full or empty, the state becomes unpredictable.

This code is kept as RingQueue::spin9_push() and RingQueue::spin9_pop(). You can also look at the tidied-up version, RingQueue::spin8_push() and RingQueue::spin8_pop(); the two behave the same.
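The flawed pattern, and one way to repair it with the same exchange primitive, can be sketched as follows. This is an illustration using std::atomic rather than the author's jimi_lock_test_and_set32(), and the function names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Flawed: the check and the set are two separate steps.
inline void broken_lock(std::atomic<uint32_t>& locked) {
    while (locked.load(std::memory_order_acquire) != 0) {
        std::this_thread::yield();
    }
    // A thread preempted right here lets another thread pass the check too.
    locked.exchange(1, std::memory_order_acquire);  // previous value is ignored
}

// Repaired: the value returned by exchange tells us whether we were the one
// who changed 0 to 1. Anything else means another thread holds the lock.
inline void exchange_lock(std::atomic<uint32_t>& locked) {
    while (locked.exchange(1, std::memory_order_acquire) != 0) {
        // Wait with plain loads until the lock looks free, then try again.
        while (locked.load(std::memory_order_relaxed) != 0) {
            std::this_thread::yield();
        }
    }
}

inline void unlock(std::atomic<uint32_t>& locked) {
    locked.store(0, std::memory_order_release);
}
```

In the repaired version, writing 1 over an existing 1 is harmless: the holder's state is unchanged, and the loser learns from the returned value that it must keep waiting.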

Compare-And-Swap

Soon afterwards, I fixed the faulty code above with CAS (Compare-And-Swap). For more about CAS, see part 3 of this series.

/* Semantics only: a real CAS performs all of this as one atomic step. */
int val_compare_and_swap(volatile int *dest_ptr,
                         int old_value, int new_value)
{
    int orig_value = *dest_ptr;
    if (*dest_ptr == old_value)
        *dest_ptr = new_value;
    return orig_value;   /* caller compares this with old_value */
}

What we need is to set the lock state to 1 only when it is currently 0. CAS does exactly this, and because a CAS operation is atomic, we no longer run into the problem of more than one thread entering the locked region at the same time.
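As a minimal sketch of this idea, here is a CAS-based lock written with std::atomic. It is not the RingQueue code; the class name and the helper are made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

class CasSpinLock {
public:
    // Succeeds only if the state was 0 and this call changed it to 1.
    bool try_lock() {
        uint32_t expected = 0;
        return locked_.compare_exchange_strong(expected, 1,
                                               std::memory_order_acquire,
                                               std::memory_order_relaxed);
    }
    void lock() {
        while (!try_lock()) {
            std::this_thread::yield();
        }
    }
    void unlock() { locked_.store(0, std::memory_order_release); }

private:
    std::atomic<uint32_t> locked_{0};
};

// Demo helper: several threads increment a plain (non-atomic) counter that is
// protected only by the lock. A correct lock yields threads * iterations.
inline uint64_t count_under_lock(int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    CasSpinLock lock;
    uint64_t counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    return counter;
}
```

Calling count_under_lock(4, 20000) should return exactly 80000; with the check-then-set version described earlier, lost updates would be possible.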

RingQueue::spin_push()

This is the first version of the spin lock, RingQueue::spin_push(). It does not actually spin; it calls jimi_wsleep(0) instead. jimi_wsleep(0) is equivalent to Sleep(0) on Windows and sched_yield() on Linux. It does not really sleep; it yields the CPU so that another thread can run.

Another detail: in front of each of the two spin_mutex.locked = 0; statements there is a commented-out line that performs the write atomically. I wrote it that way at first, but after careful analysis I realized that since CAS "closes the door" when entering the lock, there is no need for an atomic operation when unlocking at the end. A plain store of 0 is enough to unlock, and it is cheaper. Why? The lock holder is the only thread allowed to write the flag at that moment, so there is no competing write to guard against. The only thing required is a memory barrier or compiler barrier before the store that updates the lock state, such as Jimi_ReadWriteBarrier() in the code, so that the writes made inside the locked region are not moved past the unlock. I picked this up from other code. It is nothing unusual, but I mention it in case it is unfamiliar.
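Put together, a push()/pop() pair guarded by such a lock might look like the sketch below. It is written in portable C++ and is not the RingQueue source: the class name and layout are assumptions, std::this_thread::yield() stands in for jimi_wsleep(0), and a release store stands in for "compiler barrier plus plain store" (on x86 a release store compiles to an ordinary store).

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

template <typename T, uint32_t Capacity>
class LockedRingQueue {
    static_assert(Capacity > 0 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    bool push(T item) {
        lock();
        const bool ok = (head_ - tail_) < Capacity;  // false when full
        if (ok) {
            slots_[head_ & kMask] = item;
            ++head_;
        }
        unlock();
        return ok;
    }

    bool pop(T& out) {
        lock();
        const bool ok = head_ != tail_;  // false when empty
        if (ok) {
            out = slots_[tail_ & kMask];
            ++tail_;
        }
        unlock();
        return ok;
    }

private:
    static constexpr uint32_t kMask = Capacity - 1;

    void lock() {
        uint32_t expected = 0;
        while (!locked_.compare_exchange_weak(expected, 1,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed)) {
            expected = 0;               // a failed CAS overwrote `expected`
            std::this_thread::yield();  // give up the time slice, do not spin
        }
    }

    // No read-modify-write is needed here: only the lock holder writes 0.
    // The release ordering keeps the queue writes above from moving below.
    void unlock() { locked_.store(0, std::memory_order_release); }

    std::atomic<uint32_t> locked_{0};
    uint32_t head_ = 0;  // protected by the lock
    uint32_t tail_ = 0;  // protected by the lock
    T slots_[Capacity] = {};
};
```

A single lock serializes producers and consumers alike, which is the trade-off considered here against the two "locks" and two lock-free parts of q3.h.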

I have also written a version that really spins; see the code of RingQueue::spin_push(). Defining the macro USE_SPIN_MUTEX_COUNTER as 1 in the header enables the spinning version. The number of spin iterations is controlled by the macro MUTEX_MAX_SPIN_COUNT (1 by default).

This strategy is modeled on spin_mutex in TBB, Intel's multithreading library.

Test Results

After this small improvement the speed is good, but the drawback is that it is not stable enough: the results can differ from one run to the next.

Test Environment

My test environment is as follows:

CPU: Intel Q8200 2.4G 4-core

System: Windows 7 sp1

Memory: 4 GB/DDR2 1066

Compiler: Visual Studio 2013 Ultimate update 2

Four push() threads (producers) and four pop() threads (consumers) are used. The total number of messages is 8,000,000 (8 million), the queue capacity (buffer_size) is 1024, the build is in x86 mode, and CPU affinity is not enabled.

System mutex lock

If these speeds mean little to you on their own, we need something to use as a benchmark, and nothing is more suitable than the system mutex. On Windows it is the critical section (CRITICAL_SECTION); on Linux it is pthread_mutex_t.

The mutex results were obtained with PUSH_CNT = 2 and POP_CNT = 2. For the system mutex, however, the numbers are only slightly different.

The system mutex version is RingQueue::mutex_push().

q3.h Test Results

The q3.h test is for the q3.h written by Sinclair of Douban. It is not very fast, and is even a little slower than the mutex. q3.h only works normally when (PUSH_CNT + POP_CNT) <= the number of CPU cores; otherwise it is very slow. Its data was therefore also obtained with PUSH_CNT = 2 and POP_CNT = 2.

Thread.Yield()

To be honest, RingQueue::spin_push() is still a little too simple. Even the spinning version modeled on TBB is not complicated.

I once saw DengHe post, in the C++1y Boost group (QQ group: 296561497), the source code of .NET's Thread.Yield() for C#, decompiled with a reflection tool (it was sent to the group as images). I had seen similar things in many places before, but you don't look closely until you actually need them. I spent more than an hour searching the chat history of two groups without finding it. I could find similar code online, but the one DengHe posted seemed a little different. He later told me it was a decompiled SpinWait.cs. I had in fact found SpinWait.cs myself at the time, but since I am not very familiar with C#, it did not occur to me to use a reflection tool.

Thread.Yield() itself is a general idea, but a concrete version is a more useful reference. Apart from its parameters being fixed and not adjustable, the C# code is reasonably designed, and it is the prototype of RingQueue::spin2_push(). Of course, I made some small adjustments, because the efficiency of a hybrid spin lock depends on how these parameters are tuned. This is a sleep strategy, or perhaps the art of sleeping, and I will describe it in detail later. For today I will keep you in suspense...
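As a rough idea of what such a sleep strategy looks like, here is a C++ sketch of a tiered wait: spin briefly first, then mostly yield, with an occasional zero-length sleep and, less often, a 1 ms sleep. The general shape follows .NET's SpinWait, but the type names and the three thresholds are illustrative assumptions; they are neither RingQueue's tuned values nor a copy of the C# source.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

enum class WaitAction { Spin, Yield, Sleep0, Sleep1 };

struct HybridWait {
    // Illustrative parameters; tuning these is the whole point of the strategy.
    static constexpr uint32_t kYieldThreshold = 10;  // pure spinning before this
    static constexpr uint32_t kSleep0Every = 5;      // every 5th yield: sleep(0)
    static constexpr uint32_t kSleep1Every = 20;     // every 20th yield: sleep(1 ms)

    uint32_t count = 0;  // number of waits performed so far

    WaitAction next_action() const {
        if (count < kYieldThreshold) {
            return WaitAction::Spin;
        }
        const uint32_t yields = count - kYieldThreshold;
        if (yields % kSleep1Every == kSleep1Every - 1) return WaitAction::Sleep1;
        if (yields % kSleep0Every == kSleep0Every - 1) return WaitAction::Sleep0;
        return WaitAction::Yield;
    }

    void wait_once() {
        switch (next_action()) {
        case WaitAction::Spin: {
            // Busy-wait, doubling the length each time (4, 8, 16, ...).
            const uint32_t iterations = 4u << count;
            for (volatile uint32_t i = 0; i < iterations; i = i + 1) {
            }
            break;
        }
        case WaitAction::Yield:
            std::this_thread::yield();
            break;
        case WaitAction::Sleep0:
            // Portable stand-in for Sleep(0) on Windows.
            std::this_thread::sleep_for(std::chrono::milliseconds(0));
            break;
        case WaitAction::Sleep1:
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
            break;
        }
        ++count;
    }

    void reset() { count = 0; }
};
```

A lock would create one HybridWait per acquisition attempt and call wait_once() after each failed try, so short waits stay cheap and long waits stop burning CPU.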

Usage of RingQueue

The most important file in the RingQueue source is include\RingQueue\test.h. The macros defined there are the compile-time switches for each part. I have written comments for them; if something is unclear, you can study it yourself or ask me. I have tried to cover most cases.

The most important settings are PUSH_CNT and POP_CNT. As the names suggest, PUSH_CNT is the number of push() threads and POP_CNT is the number of pop() threads, that is, the numbers of producers and consumers. If you want to test q3.h, (PUSH_CNT + POP_CNT) must be less than or equal to the number of CPU cores, otherwise it will be surprisingly slow. For example, on a dual-core CPU, use PUSH_CNT = 1 and POP_CNT = 1.

If you are not testing q3.h, I recommend setting PUSH_CNT and POP_CNT each to the number of CPU cores, that is, PUSH_CNT = 2 and POP_CNT = 2 on a dual-core CPU. In general, a total thread count of twice the number of cores improves CPU utilization. This is a good set of values, though not necessarily the best. For the other settings, see test.h.

Sleep

I'll stop here for today. I had wanted to write it all in one go; if I can, I will merge the parts later.

There is another reason this article was delayed. I don't know what got into me, but I posted my blog in a group I am in and got into a small quarrel with some people there. After two days, and despite some friction, they finally accepted some of my ideas about RingQueue, and in the end we discussed it together. It was quite fun, at least better than the SkyNet group. There is not that much to say in principle, but there are a lot of details, and I also ran into some strange problems. I would like to explain the details well, but I find that I am not yet managing that. Maybe I will try again in a later article, or you may find it quicker to read the source code directly.

RingQueue

The GitHub address of RingQueue is https://github.com/shines77/RingQueue. I dare say it is a good hybrid spin lock; you are welcome to download it and take a look. It supports Makefile, CodeBlocks, Visual Studio 2008, 2010, and 2013, and CMake, and it runs on Windows, MinGW, Cygwin, Linux, Mac OS X, and so on. ARM may not be supported, since I have no environment to test it on.

(To be continued... Coming soon...)
