Can sharing a non-volatile variable between threads lead to an endless while loop?


Background

We all know that variables shared between threads should be marked with the volatile keyword. But if volatile is omitted, can a thread end up spinning forever? Consider the following pseudocode:

static int flag = 1;

void thread1() {
    while (flag > 0) {
        // wait or do something
    }
}

void thread2() {
    // do something
    flag = -1;
}

Thread 1 and thread 2 run concurrently. After thread 2 sets the flag and exits, can thread 1 spin forever because of caching or similar effects?

The first real-world pitfall: unreliable compilers

Let's run the code directly:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int vvv = 1;

void* thread1(void *arg) {
    sleep(2);
    printf("sss\n");
    vvv = -1;
    return NULL;
}

int main() {
    pthread_t t;
    int re = pthread_create(&t, NULL, &thread1, NULL);
    if (re < 0) {
        perror("thread");
    }
    while (vvv > 0) {
        //sleep(1);
    }
    return 0;
}

The main function starts a thread, thread1, which sleeps briefly and then sets vvv = -1. Meanwhile, the main thread spins in a while loop as long as vvv > 0.

Ideally, this is what happens: the main thread spins; after 2 seconds thread1 prints "sss" and exits; then the main thread exits too.


Save it as thread-study.c and compile with gcc at -O3:

gcc thread-study.c -O3 -pthread -gstabs

Now run ./a.out. You will find that after the console prints "sss", the program keeps waiting. Check the CPU usage: one core is pegged at 100%, which indicates the main thread is spinning in an endless loop.

At this point it looks as though the vvv value read by the main thread is stale because of caching, and that is what causes the endless loop.

But is it true?

Testing shows that only at -O0 (i.e., completely unoptimized) does the program exit normally; at -O1, -O2, and -O3 it loops forever.

Let's look at the -O3 assembly (generated with gcc -S thread-study.c). The main function looks like this:

Comments have been added manually for readability.

main:
.LFB56:
    .cfi_startproc
    subq    $24, %rsp
    .cfi_def_cfa_offset 32
    xorl    %ecx, %ecx
    xorl    %esi, %esi
    movl    $_Z7thread1Pv, %edx
    movq    %rsp, %rdi
    call    pthread_create          // int re = pthread_create(&t, NULL, &thread1, NULL);
    testl   %eax, %eax
    js      .L9
.L4:
    movl    _ZL3vvv(%rip), %eax     // while(vvv > 0){
    testl   %eax, %eax
    jle     .L5
.L6:
    jmp     .L6
    .p2align 4,,10
    .p2align 3
.L5:
    xorl    %eax, %eax
    addq    $24, %rsp
    .cfi_remember_state
    .cfi_def_cfa_offset 8
    ret
.L9:
    .cfi_restore_state
    movl    $.LC1, %edi
    call    perror                  // perror("thread");
    jmp     .L4
    .cfi_endproc

The code at the .L6 label is strange:

.L6:
    jmp     .L6

This is obviously an endless loop that never even attempts to read vvv. So what about the .L4 label? The code at .L4 does read vvv and test it, but it is not part of any loop. Why?

Debugging at the assembly level with gdb confirms that the main thread really is stuck in this loop:

   0x0000000000400609 <+25>:    mov    0x200a51(%rip),%eax        # 0x601060 <_ZL3vvv>
   0x000000000040060f <+31>:    test   %eax,%eax
   0x0000000000400611 <+33>:    jle    0x400618
=> 0x0000000000400613 <+35>:    jmp    0x400613
   0x0000000000400615 <+37>:    nopl   (%rax)

A jmp instruction that jumps to itself is of course an endless loop; it corresponds to the .L6 block in the assembly above.

In effect, the generated code is equivalent to:

if (vvv > 0) {
    for (;;) { }   // spin forever, never re-reading vvv
}
return 0;

So the code GCC generates never re-reads vvv at all. This optimization is actually permitted by the language specification: without volatile, the compiler may assume the variable is not modified by anyone else and hoist the read out of the loop. Still, I personally dislike optimizations that violate intuition this badly.
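
As an aside, if we want to stay in C instead of patching assembly, a compiler-level barrier is already enough to defeat this particular hoisting. A minimal sketch using GCC's inline-asm memory clobber (my addition, not part of the original experiment):

static int vvv = 1;

void spin(void) {
    while (vvv > 0) {
        // An empty asm statement with a "memory" clobber tells GCC that
        // memory may have changed, forcing vvv to be reloaded each pass.
        __asm__ __volatile__("" ::: "memory");
    }
}

Note that this constrains only the compiler; it emits no extra CPU instructions.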

Our problem is still not solved, though. Next, let's modify the assembly by hand so the loop really re-reads the variable: simply retarget the jmp at .L6 to .L4:

.L4:
    movl    _ZL3vvv(%rip), %eax
    testl   %eax, %eax
    jle     .L5
.L6:
    jmp     .L4
    .p2align 4,,10
    .p2align 3

This is the code we actually expected.

Now build and run the modified code:

gcc thread-study.s -o test -pthread -gstabs -O3
./test

The process exits after 2 seconds.

So the main thread does not keep reading a stale copy of the shared variable: once the loop actually re-reads vvv, it sees the update, as expected.

Add volatile

Add volatile to the variable "vvv", that is:

volatile static int vvv = 1;

Recompile and run again: the program now behaves correctly and exits after 2 seconds.

The generated assembly now looks like this:

.L5:
    movl    _ZL3vvv(%rip), %eax
    testl   %eax, %eax
    setg    %al
    testb   %al, %al
    jne     .L5

This is the assembly we expect: vvv is reloaded on every iteration.

However, something still seems off. What exactly is special about volatile? The generated assembly contains no special instructions at all, so how does it "prevent" the thread from caching the shared variable?

A common claim on the internet is that with the volatile keyword, every read must come from memory.

This statement is both right and wrong. volatile prevents compiler optimizations: the variable will not be cached in a register or optimized away. But volatile does nothing to stop the CPU from serving the read out of its cache.
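
Because volatile's effect is purely on the compiler, it can also be applied per access through a cast, in the style of the Linux kernel's ACCESS_ONCE macro. A minimal sketch (READ_ONCE_INT is my own name):

// Force a fresh load of x at each use, without declaring x itself volatile.
#define READ_ONCE_INT(x) (*(volatile int *)&(x))

static int vvv = 1;

void spin_wait(void) {
    while (READ_ONCE_INT(vvv) > 0) {
        // The volatile-qualified access stops the compiler from hoisting
        // the load out of the loop; the CPU cache still serves the read.
    }
}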

What exactly is this so-called "cache"?

A CPU has registers plus several levels of cache: L1, L2, and L3. Let's look at where a thread-shared variable can end up: in a register, or in one of these cache levels.

volatile prevents the compiler from keeping the variable in a register, so every read of the shared variable is issued as a memory access instruction.

CPU Cache

The CPU cache holds copies of memory data. An instruction such as

movl _ZL3vvv(%rip), %eax

first looks in the CPU cache; only on a miss is the data fetched from memory over the bus.

Modern CPUs have multiple cores. Typically, each core has its own private L1 and L2 caches, while the L3 cache is shared among cores.

The question then becomes: if thread A modifies data in its core's cache, can thread B keep reading stale data forever?

MESI Protocol

Since per-core caches can hold inconsistent copies of the same data, a mechanism is needed to keep them coherent. The classic cache coherence protocol is MESI.

MESI uses a write-back policy: when a cache line is updated in one core, the change is made only in that core's cache and is not synchronously propagated to the other cores or to memory.

In the MESI protocol, each cache line can be in one of four states:

Modified: the line has been modified and differs from memory; this cache holds the only copy.
Exclusive: the line matches memory, and this cache holds the only copy.
Shared: the line matches memory; copies may exist in multiple caches, and the line can transition to Invalid at any time.
Invalid: the line is invalid (it will not be used again).
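
Purely as a mnemonic, the four states can be jotted down in code (illustrative only, not part of any real implementation):

// Illustrative summary of MESI cache-line states.
enum mesi_state {
    MESI_MODIFIED,   // dirty: differs from memory; this cache holds the only copy
    MESI_EXCLUSIVE,  // clean: matches memory; this cache holds the only copy
    MESI_SHARED,     // clean: matches memory; other caches may also hold copies
    MESI_INVALID     // unusable: must be refetched before the next use
};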

The state transitions in MESI are complicated, but they match intuition. For our purposes, only two rules matter:

Before modifying the contents of a cache line that is in the Shared state, a core must broadcast a Request For Ownership (RFO) to the other cores, which set their copies of the line to Invalid.

When a line is in the Modified state, the cache controller snoops other cores' accesses to the corresponding memory address and responds with the data from this cache line; the line is also written back to memory, and its state changes to Shared.

Therefore, it cannot happen that one core modifies data in its cache while the other cores remain completely unaware of it.

In other words, it cannot be the case that thread A updates the value in its cache while thread B reads stale data forever. Considering how fast communication inside the CPU is, I would estimate that once thread A modifies the shared variable, thread B can read the new value almost instantly.

Another pitfall: out-of-order CPU execution

Many modern CPUs can execute instructions out of order. As we saw, the assembly generated after adding volatile contains nothing special, so volatile is powerless against CPU reordering. For example:

volatile static int flag = -1;

void thread1() {
    ...
    jobA();
    flag = 1;
}

void thread2() {
    ...
    while (1) {
        if (flag > 0)
            jobB();
    }
}

With these two threads, jobB() may actually run before jobA()!

Because the CPU can execute instructions out of order, thread1 may effectively perform flag = 1 before jobA() has completed.

So how do we prevent this? The problem comes from the CPU, and so does the solution.

GCC provides built-in functions for atomic memory access, see:

http://gcc.gnu.org/onlinedocs/gcc-4.6.2/gcc/Atomic-Builtins.html

type __sync_fetch_and_add(type *ptr, type value, ...)
type __sync_fetch_and_sub(type *ptr, type value, ...)
type __sync_fetch_and_or(type *ptr, type value, ...)
type __sync_fetch_and_and(type *ptr, type value, ...)
type __sync_fetch_and_xor(type *ptr, type value, ...)
type __sync_fetch_and_nand(type *ptr, type value, ...)

These functions imply a full memory barrier.

For example, adding a memory barrier to the loop discussed earlier:

while (true) {
    __sync_fetch_and_add(&vvv, 0);
    if (vvv < 0)
        break;
}
Check the generated assembly code:

.L4:
    lock addl   $0, _ZL3vvv(%rip)
    movl    _ZL3vvv(%rip), %eax
    shrl    $31, %eax
    testb   %al, %al
    je      .L5
    jmp     .L8
.L5:
    jmp     .L4
We can see that a lock addl instruction has been added.

This lock is actually an instruction prefix. It ensures that the cache line being operated on is held in the Exclusive state, and it also enforces instruction ordering. The prefix may be implemented by locking the bus, though modern CPUs typically lock only the affected cache line, which is much cheaper.
Incidentally, Java's volatile is implemented on x86 with the same lock add trick; that topic deserves its own write-up some other time.
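
Applying the same idea to the jobA()/flag pattern above, the writer can place a full barrier between the payload work and the flag store. A self-contained sketch using GCC's __sync_synchronize() built-in (publisher, consumer, and data are my names):

static volatile int flag = 0;
static int data = 0;

// Writer thread: do the work ("jobA"), then raise the flag.
void publisher(void) {
    data = 42;
    __sync_synchronize();   // full barrier: data is globally visible before flag
    flag = 1;
}

// Reader thread: wait for the flag, then consume the data.
int consumer(void) {
    while (flag == 0) {
        // spin until the writer raises the flag
    }
    __sync_synchronize();   // matching barrier on the read side
    return data;            // now guaranteed to observe 42
}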

Aside: some scenarios don't need volatile

Setting aside the discussion above, some scenarios can do without volatile entirely. For example, this code that hands out resources more or less at random:

Resource ramdonArray[10];
int pos = 0;

Resource getResource() {
    return ramdonArray[pos++ % 10];
}

Here pos is not volatile, yet it is fine to call getResource() from multiple threads: callers just want some resource, and it does no real harm if two threads occasionally grab the same slot.
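
If the selection ever did need to be strictly race-free, the __sync built-ins above could make the increment atomic. A sketch (getResourceAtomic is my name; Resource, ramdonArray, and pos come from the pseudocode above):

Resource getResourceAtomic() {
    // __sync_fetch_and_add atomically increments pos and returns the old value.
    int i = __sync_fetch_and_add(&pos, 1);
    return ramdonArray[i % 10];
}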

C11 and C++11

Why didn't C11 and C++11 upgrade volatile to the same semantics it has in Java/C#? My guess is backward compatibility, which is a pity.

C++11 provides std::atomic, with semantics similar to Java's volatile. C11 likewise specifies atomics (stdatomic.h), although at the time of writing compiler support was thin, so in practice one fell back on GCC's built-ins or small inline-assembly macros.

http://en.cppreference.com/w/cpp/atomic
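
For completeness, here is how the spin-wait from the earlier experiment could look with C11 atomics, assuming a toolchain that ships stdatomic.h (my sketch, not from the original article):

#include <stdatomic.h>

static atomic_int vvv = 1;

// atomic_load is re-issued on every iteration, and the default
// sequentially-consistent ordering supplies the needed barriers.
void wait_for_exit(void) {
    while (atomic_load(&vvv) > 0) {
        // spin
    }
}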

Notes on GCC optimization

Incidentally, in the code discussed above, if the while loop body contained more code, GCC might not be able to prove the hoisting safe, and might not perform the optimization at all.

A related optimization note:

In most languages (especially dynamic ones), the first loop below is much more efficient than the second, because the length is read once instead of on every iteration.

// 1
int len = array.length;
for (int i = 0; i < len; ++i) { }

// 2
for (int i = 0; i < array.length; ++i) { }


Summary:

Back to the original question: will sharing a non-volatile variable between threads lead to an endless while loop?

In truth, it depends on several parties: the compiler, the CPU, and the language specification.

For code the compiler has not optimized away, the CPU's cache coherence protocol (typically MESI) ensures that no endless loop occurs. That is not thanks to volatile; it is just the CPU's normal machinery.

When writing multi-threaded synchronization code, take care to insert memory barriers in the appropriate places.

References:

http://en.wikipedia.org/wiki/Volatile_variable
http://en.wikipedia.org/wiki/MESI
http://en.wikipedia.org/wiki/Write-back#WRITE-BACK
http://en.wikipedia.org/wiki/Bus_snooping
http://en.wikipedia.org/wiki/CPU_cache#Multi-level_caches
http://blog.jobbole.com/36263/ (Every programmer should understand the CPU cache)
http://stackoverflow.com/questions/4232660/which-is-a-better-write-barrier-on-x86-lockaddl-or-xchgl
http://stackoverflow.com/questions/8891067/what-does-the-lock-instruction-mean-in-x86-assembly
http://gcc.gnu.org/onlinedocs/gcc-4.6.2/gcc/Atomic-Builtins.html
http://en.cppreference.com/w/cpp/atomic
