A bloody case caused by a lock-free message queue: how to be a real programmer? (3) -- location: q3.h and RingBuffer

Last Update:2015-01-05 Source: Internet

Author: User

Tags lock queue


Directory

A bloody case caused by a lock-free message queue: how to be a real programmer? (1) -- location: Cause

A bloody case caused by a lock-free message queue: how to be a real programmer? (2) -- month: spin lock

A bloody case caused by a lock-free message queue: how to be a real programmer? (3) -- location: q3.h and RingBuffer

Lock-free queue

The first article mentioned Chen Hao's article "Implementation of lock-free queues". One section of that article, "implement lock-free queues with arrays", describes a lock-free queue implemented with a RingBuffer:

RingBuffer is a good thing. It works well in both lock-free and locked queues. As that article says, a RingBuffer uses sequence numbers (or indexes) and stores the queue in an array. Compared with storing the queue in a linked list, the advantage is that the ABA problem is avoided (for more on the ABA problem, see that article or search Google or Baidu). A FIFO (first-in-first-out) queue built from a linked list and pointers can avoid the ABA problem only by using Double CAS (Double Compare And Swap). Although the article "Implementation of lock-free queues" covers this, its description is messy. It reads as if the author was only relaying the material without really understanding it. Still, the material is all there.

Its description of RingBuffer, like the rest, contains many errors. Updating head and tail in a RingBuffer does not require Double CAS. There is no need to update (TAIL, EMPTY) to (x, ...); a plain CAS on HEAD or on TAIL is enough. Even so, that article is still useful. At least it tells you that this technique exists, although you have to read it critically and work out the details yourself. That is a merit too; being left completely in the dark would be no fun. Haha.

Compare-and-swap

Compare and Swap, or Compare and Set (CAS), is a common technique in lock-free programming. For details, see the Compare-and-swap article on Wikipedia:

int val_compare_and_swap(volatile int *dest_ptr,
                         int old_value, int new_value)
{
    int orig_value = *dest_ptr;
    if (*dest_ptr == old_value)
        *dest_ptr = new_value;
    return orig_value;
}

This means: first save the value of *dest_ptr (dest_ptr is a pointer), then check whether *dest_ptr equals the expected old value old_value. If it does, write the new value new_value to *dest_ptr; otherwise do nothing. The return value is the original value of *dest_ptr. The C code above only illustrates the semantics. A real CAS operation is guaranteed to be atomic by the CPU, that is, the whole operation cannot be interrupted by the CPU or by anything else. On x86 the corresponding instruction is CMPXCHG (used with the LOCK prefix on multiprocessor systems).

Compare and Swap also has a second form:

bool bool_compare_and_swap(volatile int *dest_ptr,
                           int old_value, int new_value)
{
    if (*dest_ptr == old_value) {
        *dest_ptr = new_value;
        return true;
    }
    return false;
}

This is really a variant of val_compare_and_swap(): if *dest_ptr equals old_value, the new value new_value is written and true is returned; otherwise false is returned.
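The two forms above correspond to the GCC/Clang builtins __sync_val_compare_and_swap and __sync_bool_compare_and_swap (the second one is the builtin used later in this article). The following is a minimal sketch, assuming a GCC-compatible compiler; the helper names cas_val and cas_fetch_add are made up for illustration.

```c
/* Value-returning form: returns what *dest held before the attempt.
 * The store happens only if *dest == expected. */
static inline int cas_val(volatile int *dest, int expected, int desired)
{
    return __sync_val_compare_and_swap(dest, expected, desired);
}

/* Typical CAS retry loop: atomically add delta to *counter.
 * Returns the value the counter held just before the update. */
static inline int cas_fetch_add(volatile int *counter, int delta)
{
    int seen;
    do {
        seen = *counter;   /* snapshot the current value */
        /* The swap succeeds only if nobody changed *counter since the snapshot. */
    } while (!__sync_bool_compare_and_swap(counter, seen, seen + delta));
    return seen;
}
```

If the CAS fails, another thread changed the value between the snapshot and the swap, so the loop takes a fresh snapshot and tries again. The do while() loops in push() and pop() follow the same pattern.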

RingBuffer

Let's look at what a RingBuffer is. First, forget the article mentioned above. We take the implementation in q3.h as an example:

File address: https://github.com/shines77/RingQueue/blob/master/douban/q3.h

Definition

We define RingBuffer as follows:

The definition above describes a RingBuffer with a capacity of 8. The capacity of a RingBuffer is usually a power of 2, so that the array index can be computed as index = sequence & mask; (with mask = capacity - 1;), which is cheaper than a modulo. The index is the array subscript. The array holds the data A, B, C, D, and the remaining slots are Empty. Head is the head pointer of the queue (the position where the next data can be pushed), and Tail is the tail pointer of the queue (the position from which the next data can be popped). Head and Tail are both sequence numbers that increase monotonically, and they may be greater than or equal to the capacity (8). For example, Head = 8 refers to the first element of the array (index = 8 & (8 - 1) = 8 & 7 = 0), and Head = 9 refers to the second element.
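The index calculation can be sketched as follows. The names RING_CAPACITY, RING_MASK, is_power_of_two and seq_to_index are assumptions made for this example; they do not appear in q3.h.

```c
#include <stdint.h>

#define RING_CAPACITY 8u                     /* must be a power of two */
#define RING_MASK     (RING_CAPACITY - 1u)   /* 8 - 1 = 7 = 0b111 */

/* A non-zero n is a power of two when clearing its lowest set bit leaves 0. */
static inline int is_power_of_two(uint32_t n)
{
    return n != 0 && (n & (n - 1u)) == 0;
}

/* Map a monotonically increasing sequence number to an array subscript.
 * For a power-of-two capacity this equals sequence % RING_CAPACITY. */
static inline uint32_t seq_to_index(uint32_t sequence)
{
    return sequence & RING_MASK;
}
```

With a capacity of 8, seq_to_index(8) is 0 and seq_to_index(9) is 1, matching the Head = 8 and Head = 9 cases described above.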

The actual number of elements in the queue is Length = Head - Tail. (Head - Tail) must always be <= Capacity (the maximum capacity); when it equals Capacity the queue is full, and when (Head - Tail) == 0 the queue is empty. It is called a RingBuffer because it behaves like a ring: when the cursor moves past the end of the array, it returns to the beginning of the array.

Which end is called the head and which the tail is not important; swapping them is only a logical change. However, this definition is more intuitive. Just think about it: if you stood the figure above upright, which end would you call the head and which the tail? Another reason is that q3.h defines them this way, whereas the mq.c file written by Yunfeng defines them the other way round. The choice really does not matter, but without settling on one the discussion would be inconvenient.

Boundary Problems

Now for the boundary conditions. The full check in q3.h is not quite accurate: if ((mask + tail - head) < 1U) return -1; written this way, the actual maximum length of the queue is Capacity - 1 rather than Capacity. That is not the biggest problem, though. From the above we know that the actual length of the RingBuffer is Length = Head - Tail. As long as Head and Tail are unsigned integers, this formula still holds even after Head wraps around from 4294967295 to 0. For example, with Head = 2 and Tail = 4294967295, Length = 2 - 4294967295 = -4294967293, and 4294967293 = 0xFFFFFFFD. Computers store negative numbers in two's complement, that is, invert the bits and add 1: inverting 0xFFFFFFFD gives 0x00000002, and adding 1 gives 3. So the actual length is 3, which is correct. Therefore, once the definition of RingQueue is fixed and Head and Tail are unsigned integers, the formula Length = Head - Tail always holds.
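The wraparound argument can be checked directly in C, because unsigned subtraction is defined modulo 2^32. This is a small sketch; the function name ring_length is an assumption for illustration.

```c
#include <stdint.h>

/* Number of elements between tail and head.
 * The result is reduced modulo 2^32, so it stays correct
 * after head has wrapped past 4294967295 and tail has not. */
static inline uint32_t ring_length(uint32_t head, uint32_t tail)
{
    return (uint32_t)(head - tail);
}
```

ring_length(2u, 4294967295u) evaluates to 3, the same result as the worked example above.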

So let's see what (mask + tail - head) < 1U means. tail - head is equal to -length, so the condition is (mask - length) < 1U. Because both sides are unsigned integers, it can hold only when (mask - length) == 0 and in no other case. In other words, the condition holds only when length == mask, where mask = capacity - 1. So q3.h considers the queue full when length == capacity - 1, and the maximum actual length of a q3.h queue is only capacity - 1 elements.

There is a second problem, which you may have spotted as well: only length == mask counts as full. While length < mask everything is fine, but if length ever exceeded mask, the queue would really be full and q3.h would not notice; it would still think push() is allowed. Why does this not go wrong in practice? Analysing the push() code of q3.h gives the answer. head and tail increase monotonically, and the thread may be interrupted at any moment, so the head and tail values it has read can only be less than or equal to the current real values. If the head it read is not equal to the current real head, the CAS fails, so we can treat head as always equal to the current real value. That leaves a single case to consider: the tail it read may be smaller than the current real tail. Because length = head - tail, the computed length is then greater than the real length. That is, the queue is not full, yet q3.h may consider it full, and push() fails. A push() that fails on a queue that is not full does no harm; at worst the caller simply calls push() again. Logically, though, the check is not rigorous.

From the above analysis, the queue should be judged full when the actual queue length >= the maximum queue capacity, that is: if ((head - tail) >= capacity) return -1;. Because mask = capacity - 1, this can be simplified to: if ((head - tail) > mask) return -1;.
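The difference between the two full checks can be made concrete with a short sketch. The function names q3_is_full and strict_is_full are assumptions for this example; only the two expressions come from the text.

```c
#include <stdint.h>

/* Full check as written in q3.h: true only when length == mask. */
static inline int q3_is_full(uint32_t head, uint32_t tail, uint32_t mask)
{
    return (mask + tail - head) < 1u;
}

/* Stricter check: true whenever length >= capacity (capacity = mask + 1). */
static inline int strict_is_full(uint32_t head, uint32_t tail, uint32_t mask)
{
    return (head - tail) > mask;
}
```

With mask = 7, a length of 7 makes q3_is_full report full while strict_is_full still accepts one more element, and a length of 8 is reported as full only by strict_is_full.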

Similarly, the check for an empty queue in q3.h is: if ((tail - head) < 1U) return NULL;. It has the same kind of problem: because the operands are unsigned integers, it is equivalent to if ((tail - head) == 0) return NULL;. Again, this does not actually cause trouble in q3.h, but the more rigorous check is: if ((tail == head) || (tail > head && (head - tail) > mask)) return NULL;.
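The two empty checks can be compared in the same way. This is only a sketch: the function names are assumptions, and the case where tail is one step ahead of head is a hypothetical inconsistent snapshot used to show where the two checks differ.

```c
#include <stdint.h>

/* Empty check as written in q3.h: true only when tail == head. */
static inline int q3_is_empty(uint32_t tail, uint32_t head)
{
    return (tail - head) < 1u;
}

/* Stricter check: also treats "tail has run past head" as empty.
 * When tail > head only because head has wrapped around,
 * (head - tail) is a small valid length and the queue is not empty. */
static inline int strict_is_empty(uint32_t tail, uint32_t head, uint32_t mask)
{
    return (tail == head) || (tail > head && (head - tail) > mask);
}
```

For tail = 6 and head = 5 the q3.h check would allow a pop, while the stricter check returns empty. For tail = 4294967295 and head = 2 (head has wrapped, length 3) both checks correctly report a non-empty queue.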

struct queue

Let's take a look at q3.h. The address is:

https://github.com/shines77/RingQueue/blob/master/douban/q3.h

In q3.h, struct queue is defined as follows:

struct queue {
    struct {
        uint32_t mask;
        uint32_t size;
        volatile uint32_t head;
        volatile uint32_t tail;
    } p;
    char pad[CACHE_LINE_SIZE - 4 * sizeof(uint32_t)];
    struct {
        uint32_t mask;
        uint32_t size;
        volatile uint32_t head;
        volatile uint32_t tail;
    } c;
    char pad2[CACHE_LINE_SIZE - 4 * sizeof(uint32_t)];
    void        *msgs[0];
};

To be honest, this way of writing it is not very accurate and is hard to understand. Strictly speaking, struct p should be called head and struct c should be called tail, and the head and tail fields inside p and c should be called first and second. Each pair belongs together, somewhat like std::pair, which is why I call them first and second. (Note: in disruptor, first and second are called next and cursor. next is the first position where data can be pushed; cursor is the position of the most recent successfully published data. Here we keep my names.)

In disruptor, next and cursor are as follows:

Therefore, a more accurate definition of struct queue would be:

struct queue {
    struct {
        uint32_t mask;
        uint32_t size;
        volatile uint32_t first;
        volatile uint32_t second;
    } head;
    char pad1[CACHE_LINE_SIZE - 4 * sizeof(uint32_t)];
    struct {
        uint32_t mask;
        uint32_t size;
        volatile uint32_t first;
        volatile uint32_t second;
    } tail;
    char pad2[CACHE_LINE_SIZE - 4 * sizeof(uint32_t)];
    void        *msgs[0];
};

Here mask and size are actually constants, so they do not need to be stored in these structs. However, this is not too important, so we ignore it for now.

With that, push() and pop() in q3.h can be rewritten as follows, with the boundary checks corrected as well:

static inline int
push(struct queue *q, void *m)
{
    uint32_t head, tail, mask, next;
    int ok;

    mask = q->head.mask;

    do {
        head = q->head.first;
        tail = q->tail.second;
        if ((head - tail) > mask)
            return -1;
        next = head + 1;
        ok = __sync_bool_compare_and_swap(&q->head.first, head, next);
    } while (!ok);

    q->msgs[head & mask] = m;
    asm volatile ("":::"memory");

    while (unlikely((q->head.second != head)))
        _mm_pause();

    q->head.second = next;

    return 0;
}

static inline void *
pop(struct queue *q)
{
    uint32_t tail, head, mask, next;
    int ok;
    void *ret;

    mask = q->tail.mask;

    do {
        tail = q->tail.first;
        head = q->head.second;
        if ((tail == head) || (tail > head && (head - tail) > mask))
            return NULL;
        next = tail + 1;
        ok = __sync_bool_compare_and_swap(&q->tail.first, tail, next);
    } while (!ok);

    ret = q->msgs[tail & mask];
    asm volatile ("":::"memory");

    while (unlikely((q->tail.second != tail)))
        _mm_pause();

    q->tail.second = next;

    return ret;
}

This should be clearer and easier to understand. The statement asm volatile ("":::"memory"); is a compiler memory barrier. If you are not familiar with it, the general idea is enough: when the compiler optimizes, no read or write that comes before the barrier may be moved after it, and no read or write that comes after the barrier may be moved before it.
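The role of the compiler barrier can be shown with a tiny "write the data, then announce it" sketch. The names COMPILER_BARRIER, slot, ready, publish and consume are assumptions for illustration and are not part of q3.h. The barrier constrains only the compiler; x86 hardware already keeps stores in program order, but on weakly ordered CPUs such as ARM a hardware fence (for example __sync_synchronize()) would be needed as well.

```c
#include <stddef.h>
#include <stdint.h>

/* Compiler-only barrier: emits no instruction, but the compiler may not
 * move memory reads or writes across it in either direction. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

static void *slot;               /* hypothetical one-element mailbox */
static volatile uint32_t ready;  /* hypothetical "data is published" flag */

static inline void publish(void *msg)
{
    slot = msg;            /* 1. write the payload */
    COMPILER_BARRIER();    /* 2. the store above may not sink below this point */
    ready = 1;             /* 3. only then announce it */
}

static inline void *consume(void)
{
    if (!ready)
        return NULL;       /* nothing published yet */
    COMPILER_BARRIER();    /* the load of slot may not be hoisted above the check */
    return slot;
}
```

In push() the barrier plays the same role: the write to q->msgs[head & mask] must not be moved after the update of head.second.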

This file has been uploaded to: https://github.com/shines77/RingQueue/blob/master/douban/q3_new.h

push()

Next, let's analyze how push() works. The code is as follows:

static inline int
push(struct queue *q, void *m)
{
    uint32_t head, tail, mask, next;
    int ok;

    mask = q->head.mask;

    do {
        head = q->head.first;
        tail = q->tail.second;
        if ((head - tail) > mask)
            return -1;
        next = head + 1;
        ok = __sync_bool_compare_and_swap(&q->head.first, head, next);
    } while (!ok);

    q->msgs[head & mask] = m;
    asm volatile ("":::"memory");

    while (unlikely((q->head.second != head)))
        _mm_pause();

    q->head.second = next;

    return 0;
}

Let's look at head.first and head.second inside head:

As with next and cursor in disruptor mentioned above, head.first is the first position where data can be pushed, and head.second marks the most recent successfully committed data (in the code, head.second holds the sequence number immediately after the last committed one). Through the do while() loop and the CAS operation, each thread obtains a unique sequence number from head.first. Because CAS is atomic, only one thread can obtain a given value of head.first. It is a bit like withdrawing money at a bank: you first take a number from a machine, and the bank then serves customers in number order. We call this number the sequence. Suppose you take number 5, the bank has finished with every customer up to and including number 2 (that is head.second), customers 3 and 4 are being served, and 5 is the number you have just taken.

There are differences from a real bank. Here, as soon as you take a number, a window starts serving you immediately, and the service time at each window (each thread) is not fixed: window 4 may finish first, then window 3, and finally window 5, the number you took. The other difference is that no matter which of windows 3, 4 and 5 finishes first, completion must be confirmed in sequence order. Even if window 4 finishes first, it cannot confirm until window 3 has confirmed, and if window 5 finishes first, it must wait for window 4 to confirm. Only in this order can head.second be advanced to the most recent successfully committed data. If head.second were moved out of order, its value would become meaningless.

The in-order commit is implemented by while (unlikely((q->head.second != head))) _mm_pause();. As long as head.second differs from the sequence number you took, you must wait until it equals your number. In this way data is committed in order (in the order in which the sequence numbers were handed out). A word about the boundary check here. The local tail can be considered less than or equal to the real-time tail.first (tail = tail.second, tail.second <= tail.first, and the local tail may also be smaller than the real-time tail.second), while the local head can be considered always equal to the real-time head.first (if it were not, the CAS would fail and the loop would retry). Therefore the computed (head - tail) is greater than or equal to the real-time value, so push() may decide the queue is full when it is not and return early. This is harmless; at worst, call push() again.
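The "take a number, then confirm in order" pattern can be isolated in a small sketch. The globals first and second mirror head.first and head.second; claim and try_commit are hypothetical names, and try_commit is a non-blocking stand-in for the spin loop so that the ordering rule can be observed from a single thread.

```c
#include <stdint.h>

static volatile uint32_t first;   /* next sequence number to hand out */
static volatile uint32_t second;  /* every sequence below this is committed */

/* Take a number: the CAS loop gives each caller a unique sequence. */
static inline uint32_t claim(void)
{
    uint32_t seq;
    do {
        seq = first;
    } while (!__sync_bool_compare_and_swap(&first, seq, seq + 1));
    return seq;
}

/* One attempt at the commit step. push() spins until this would succeed.
 * Returns 1 if it was this sequence's turn and the commit was recorded,
 * 0 if an earlier sequence has not committed yet. */
static inline int try_commit(uint32_t seq)
{
    if (second != seq)
        return 0;          /* not our turn: an earlier number is still pending */
    second = seq + 1;      /* publish: everything up to seq is now committed */
    return 1;
}
```

After three claims (0, 1, 2), try_commit(1) fails until try_commit(0) has succeeded, which is exactly the "window 4 must wait for window 3" rule from the bank analogy.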

The data itself is written by q->msgs[head & mask] = m;. You write the data into the slot given by the sequence number you took. Because each sequence number is unique, as long as the queue neither overflows nor underflows, every sequence number has its own storage location in the array.

Now let's analyze how the cache lines are used. In the first loop, do while() + CAS, the CAS operation locks the cache line that holds head.first (which invalidates it in the other cores). In the later commit-confirmation loop, while (unlikely((q->head.second != head))), there is a memory reference to head.second, and according to the definitions in q3.h and q3_new.h the two fields are on the same cache line (not guaranteed 100%, but most compilers now align to 8 bytes by default, so the chance that they share a line is close to 100%). The statement that follows, q->head.second = next;, writes head.second and thereby also invalidates the cached head.first used by the do while() + CAS loops of other threads. This is a design mistake in q3.h: although Cache Line Padding was considered, False Sharing still exists. disruptor avoids these problems well, because it applies Cache Line Padding to every sequence number variable, that is, to the Sequence class. In fact False Sharing is not a very serious issue, but the more threads there are and the shorter the time spent in the contended region, the more visible its adverse effect becomes.
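Whether two fields share a cache line can be checked from their offsets. The sketch below copies the q3.h layout into a struct named q3_layout (a name made up here, with the msgs array left out) and assumes 64-byte cache lines and a struct instance that itself starts on a cache-line boundary.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64   /* assumption: 64-byte cache lines */

/* Same field layout as struct queue in q3.h, without the msgs array. */
struct q3_layout {
    struct {
        uint32_t mask;
        uint32_t size;
        volatile uint32_t head;
        volatile uint32_t tail;
    } p;
    char pad[CACHE_LINE_SIZE - 4 * sizeof(uint32_t)];
    struct {
        uint32_t mask;
        uint32_t size;
        volatile uint32_t head;
        volatile uint32_t tail;
    } c;
    char pad2[CACHE_LINE_SIZE - 4 * sizeof(uint32_t)];
};

/* Index of the cache line that a byte offset falls into,
 * counted from the start of a cache-line-aligned struct. */
static inline size_t cache_line_of(size_t offset)
{
    return offset / CACHE_LINE_SIZE;
}
```

Here offsetof(struct q3_layout, p.head) and offsetof(struct q3_layout, p.tail) fall on the same line, which is the false sharing described above, while the c fields start on the next line.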

From the point of view of the loops, the first do while() + CAS loop can be regarded as a spin lock with a spin count of 0, while the later commit-confirmation loop is a real spin: it exits only when its condition is met. Both loops depend on, or wait for, other threads before they can exit, so we generally regard them as two spin operations. This was also mentioned in the second article, on spin locks.

Another problem with q3.h is that a livelock may occur. There is no outright deadlock, but push() and pop() run extremely slowly, far below normal efficiency. The cause is not yet clear; think about it if you have time. The conditions under which this happens are as follows (I have partly forgotten these two descriptions and will correct them if they turn out to be wrong): with CPU affinity disabled, it becomes slow when PUSH_CNT + POP_CNT is greater than the actual number of CPU cores; with CPU affinity enabled, under the same condition it is even slower than the former. In all other cases the execution efficiency is normal.

pop()

Now let's look at pop(). The code is as follows:

static inline void *
pop(struct queue *q)
{
    uint32_t tail, head, mask, next;
    int ok;
    void *ret;

    mask = q->tail.mask;

    do {
        tail = q->tail.first;
        head = q->head.second;
        if ((tail == head) || (tail > head && (head - tail) > mask))
            return NULL;
        next = tail + 1;
        ok = __sync_bool_compare_and_swap(&q->tail.first, tail, next);
    } while (!ok);

    ret = q->msgs[tail & mask];
    asm volatile ("":::"memory");

    while (unlikely((q->tail.second != tail)))
        _mm_pause();

    q->tail.second = next;

    return ret;
}

tail.first and tail.second inside tail:

As above, tail.first is the first position from which data can be popped, and tail.second marks the most recent successfully popped data. The statement that pops the data is ret = q->msgs[tail & mask]; and the statement that finally confirms the pop is q->tail.second = next;. The rest of the analysis is the same as for push().
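To try the two functions without the rest of the repository, the following self-contained sketch condenses them into one file. It is not the code from q3_new.h: the names ring, seq_pair, ring_push, ring_pop and QSIZE are assumptions, the capacity is fixed at 8, the cache-line padding is left out, and cpu_relax() is a plain compiler barrier standing in for _mm_pause().

```c
#include <stddef.h>
#include <stdint.h>

#define QSIZE 8u                                   /* capacity, a power of two */
#define QMASK (QSIZE - 1u)
#define unlikely(x) __builtin_expect(!!(x), 0)
#define barrier()   __asm__ __volatile__("" ::: "memory")
#define cpu_relax() barrier()                      /* stand-in for _mm_pause() */

struct seq_pair { volatile uint32_t first, second; };

struct ring {
    struct seq_pair head;    /* producers: first = claimed, second = committed */
    struct seq_pair tail;    /* consumers: first = claimed, second = committed */
    void *msgs[QSIZE];
};

static inline int ring_push(struct ring *q, void *m)
{
    uint32_t head, tail, next;
    do {
        head = q->head.first;
        tail = q->tail.second;
        if ((head - tail) > QMASK)                 /* length >= capacity: full */
            return -1;
        next = head + 1;
    } while (!__sync_bool_compare_and_swap(&q->head.first, head, next));

    q->msgs[head & QMASK] = m;                     /* fill the claimed slot */
    barrier();
    while (unlikely(q->head.second != head))       /* wait for earlier pushes */
        cpu_relax();
    q->head.second = next;                         /* commit in sequence order */
    return 0;
}

static inline void *ring_pop(struct ring *q)
{
    uint32_t tail, head, next;
    void *ret;
    do {
        tail = q->tail.first;
        head = q->head.second;
        if ((tail == head) || (tail > head && (head - tail) > QMASK))
            return NULL;                           /* nothing committed to pop */
        next = tail + 1;
    } while (!__sync_bool_compare_and_swap(&q->tail.first, tail, next));

    ret = q->msgs[tail & QMASK];                   /* read the claimed slot */
    barrier();
    while (unlikely(q->tail.second != tail))       /* wait for earlier pops */
        cpu_relax();
    q->tail.second = next;                         /* release the slot in order */
    return ret;
}
```

On a zero-initialized struct ring, eight calls to ring_push succeed and the ninth returns -1. ring_pop then returns the eight pointers in FIFO order, followed by NULL once the queue is empty.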

Problem

Looking at push() and pop() together, you will find that the four variables head.first, head.second, tail.first and tail.second are chained together and affect one another. push() reads head.first and tail.second at the start, and pop() reads tail.first and head.second at the start. But head.first and head.second are on one cache line, and tail.first and tail.second are on another. Whichever value is updated, and whichever thread enters the CAS operation, the result is False Sharing: caches are invalidated and the threads slow each other down. It is very much like Cao Cao's navy at the Battle of Red Cliffs, with all the ships chained together ...... (and fire spreading along the chained ships). This is a big design mistake in q3.h. To improve efficiency, put these four variables on four different cache lines, which works much better. disruptor is designed very carefully in this respect.
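One way to put the four variables on different cache lines is sketched below. It assumes 64-byte cache lines and the GCC/Clang aligned attribute; the type names padded_seq and ring_seqs are made up for this example and are not taken from q3.h or from disruptor.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64   /* assumption: 64-byte cache lines */

/* One sequence number that occupies a whole cache line by itself. */
struct padded_seq {
    volatile uint32_t value;
    char pad[CACHE_LINE_SIZE - sizeof(uint32_t)];
};

/* The four sequence variables, each on its own cache line.
 * The aligned attribute makes every instance start on a line boundary. */
struct ring_seqs {
    struct padded_seq head_first;    /* claimed by producers */
    struct padded_seq head_second;   /* committed by producers */
    struct padded_seq tail_first;    /* claimed by consumers */
    struct padded_seq tail_second;   /* committed by consumers */
} __attribute__((aligned(CACHE_LINE_SIZE)));
```

With this layout, a write to head_second no longer invalidates the line that other producers are spinning on in their CAS loop on head_first. The cost is 256 bytes for four 32-bit counters.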

The article on the principles of disruptor referred to above is: http://blog.codeaholics.org/2011/the-disruptor-lock-free-publishing/. It is not the only disruptor article I have read; I have read many, but this one helped me most in writing this article. For more on the principles of disruptor, search Google or Baidu. disruptor versions change quickly, so the code or structure described in a given article is not necessarily the same as in the latest version, but the principle is roughly the same. I have seen various versions and various talks, and you need to piece them together yourself.

Sleep

Now, let's take a rest. Good night, Earth .......

RingQueue

The GitHub address of RingQueue is: https://github.com/shines77/RingQueue. I dare say it is a pretty good hybrid spin lock. You can download it and take a look. It supports Makefile, CodeBlocks, Visual Studio 2008, 2010, 2013 and CMake, and runs on Windows, MinGW, Cygwin, Linux, Mac OS X and so on. ARM may not be supported, since I have no environment to test it on.

(To be continued ...... Coming soon ......)
