[Original] Locks vs. atomic instructions, and how to build the fastest multi-threaded queue?

Source: Internet
Author: User
Tags: cas, java, reference

Lock

Locks and semaphores are familiar to most people, especially the common mutex. There are many kinds of locks: mutexes, spin locks, read-write locks, sequence locks, and so on. Here we will look at just a few of them.

Mutex (mutual exclusion lock)

This is the most common. On Win32 there are CreateMutex / WaitForSingleObject / ReleaseMutex; on Linux, pthread_mutex_lock / pthread_mutex_unlock; in C#, lock and Monitor; in Java, Lock. These are all mutexes. Everyone knows what a mutex does: it lets only one thread run a given piece of code at a time.

Spin lock

Not as commonly used. On Linux, the pthread_spin_* family of functions are spin locks (many spin locks posted online are hand-written with atomic operations). Their role is similar to a mutex.

Semaphore

On Windows: CreateSemaphore, OpenSemaphore, ReleaseSemaphore, WaitForSingleObject; Linux has a matching semaphore family, and C# has AutoResetEvent and Semaphore. Semaphores are used a lot. A semaphore has two states, blocked and signaled, and its role is to enforce ordering across multi-threaded code!

First, the principle behind these locks. (Why do I count the semaphore as a lock too? See below.)

First the mutex. A mutex is actually implemented with atomic operations.

For example, let variable a be 0 when unlocked and 1 when locked. When the first thread changes a from 0 to 1 (an atomic operation) and succeeds, it has acquired the lock. If another thread then tries to acquire the lock while a is 1 (or two threads attempt the atomic operation at the same time, in which case one of them fails), its acquisition fails. When the first thread is done, it releases the lock by setting a = 0 (again atomically).

The distinguishing feature of a mutex is what happens when acquisition fails: the current thread is put to sleep and added, together with its execution context, to a wait list that the kernel maintains for that mutex; every later thread that fails to acquire the lock is appended to the same list. When the current owner is done with the lock and releases it, the kernel takes the next waiting thread off the list and wakes it up to continue execution. When the kernel's list is empty, nobody is contending for the lock, and its state is simply set back to unlocked.
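To make this concrete, here is a minimal sketch in C using the POSIX pthread mutex (the thread count, iteration count, and the name run_mutex_demo are arbitrary choices of mine): threads that lose the race sleep inside pthread_mutex_lock exactly as described above, so no increment is lost.

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

// Each thread increments the shared counter; the mutex ensures
// only one thread runs the critical section at a time.
static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);   // losers sleep on the kernel's wait list
        counter++;
        pthread_mutex_unlock(&lock); // wakes one waiter, if any
    }
    return NULL;
}

// Spawn n threads (n <= 16) and return the final counter value.
long run_mutex_demo(int n)
{
    pthread_t tids[16];
    counter = 0;
    for (int i = 0; i < n; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    for (int i = 0; i < n; i++)
        pthread_join(tids[i], NULL);
    return counter;
}
```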

Next, the spin lock. A spin lock is very simple: it is similar to a mutex, except that it never sleeps. When acquisition fails, it just keeps looping (while (!acquired) ...) until it succeeds. For this reason a spin lock is unsuitable in most scenarios: the whole time it waits for the lock, the CPU sits at 100%!

Finally, the semaphore. Why do I put the semaphore in the lock category too?

Because the semaphore is also implemented with atomic operations! Just like the mutex, the semaphore keeps a linked list: while waiting for a signal, the system puts the current thread to sleep and stores the thread and its execution context in the semaphore's list. When the kernel receives the signal, it wakes the waiting threads on that semaphore. That's the semaphore!
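The ordering guarantee mentioned above can be sketched with a POSIX semaphore (the names run_sem_demo and the value 42 are mine, purely for illustration): the consumer sleeps in sem_wait until the producer posts, so it is guaranteed to run after the producer's work.

```c
#include <pthread.h>
#include <semaphore.h>

static sem_t ready;
static int produced = 0;
static int observed = -1;

static void *producer(void *arg)
{
    produced = 42;    // do the work first
    sem_post(&ready); // then signal: atomically wakes a waiter
    return NULL;
}

static void *consumer(void *arg)
{
    sem_wait(&ready);    // sleeps until the producer posts
    observed = produced; // guaranteed to see the producer's write
    return NULL;
}

// Returns the value the consumer observed after the handshake.
int run_sem_demo(void)
{
    pthread_t p, c;
    sem_init(&ready, 0, 0); // initial count 0: consumer must wait
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    sem_destroy(&ready);
    return observed;
}
```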

Having read the above, do you see which scenarios call for a mutex and which for a spin lock?

Atomic operation

What exactly is an atomic operation?

According to Baidu Encyclopedia, an atomic operation is one that cannot be interrupted by the thread scheduler: once it starts, it runs to completion without any context switch (switching to another thread).

So atomic operations are what guarantee the correctness of memory operations across multiple threads! How, then, are atomic operations implemented?

First, Intel CPUs. Anyone familiar with assembly knows the Intel instruction set has a LOCK prefix. If an instruction is preceded by LOCK, then on a multi-core machine, when one core reaches that instruction, the processor locks the bus; once the core finishes executing the instruction, the bus is released again! This is the most fundamental lock of all!

For example: lock cmpxchg dword ptr [rcx], edx. Here the cmpxchg instruction is the one being locked!

The Intel instruction reference can be consulted at http://www.intel.cn/content/www/cn/zh/processors/architectures-software-developer-manuals.html

From the IA-32 manual, Volume 3:

The HLT instruction (halt processor) stops the processor until it receives an enabled interrupt (such as NMI or SMI, which are normally enabled), a debug exception, or the BINIT#, INIT#, or RESET# signal. The processor generates a special bus cycle to indicate that it has entered halt mode. Hardware may respond to this signal in several ways: an indicator light on the front panel lights up, an NMI interrupt is generated to record diagnostic information, or the reset initialization process is invoked (note that the BINIT# pin was introduced with the Pentium Pro processor). If non-wake events (such as A20M# interrupts) are pending during the halt, they are handled after the wake event is processed.

When modifying a memory operand, the LOCK prefix invokes a locked read-modify-write operation (atomic). This mechanism is used for reliable communication between processors in multiprocessor systems, as follows: in the Pentium and earlier IA-32 processors, the LOCK prefix causes the processor to assert the LOCK# signal while executing the current instruction, which always produces an explicit bus lock. In the Pentium 4, Intel Xeon, and P6 family processors, the lock operation is handled with either a cache lock or a bus lock. If the memory access is cacheable and affects only a single cache line, a cache lock is invoked and the actual memory area on the system bus and in system memory is not locked. Meanwhile, other Pentium 4, Intel Xeon, or P6 family processors on the bus write back any modified data and invalidate their caches to ensure system memory consistency. If the memory access is uncacheable and/or crosses a cache-line boundary, the processor asserts the LOCK# signal and does not respond to bus control requests during the lock operation.

IA-32 processors provide a LOCK# signal that is asserted automatically during certain critical memory operations to lock the system bus. While this output signal is asserted, bus control requests from other processors or bus agents are blocked. Software can specify other occasions where lock semantics are needed by prepending the LOCK prefix to an instruction. On the Intel386, Intel486, and Pentium processors, explicitly locked instructions cause the LOCK# signal to be asserted; it is up to hardware designers to make the LOCK# signal available in the system hardware so that memory access between processors can be controlled. For the Pentium 4, Intel Xeon, and P6 family processors, if the memory area being accessed is cached inside the processor, the LOCK# signal is usually not asserted; instead, the lock is applied only to the processor's cache (see section 7.1.4, "Effects of a LOCK Operation on Internal Processor Caches").

You can refer to Intel's IA-32 manual, Volume 3, Chapter 7, Section 1!

Of course, Intel has other ways of guaranteeing atomic operations as well!

Then ARM CPUs. ARM relies mainly on two instructions to guarantee atomic operations: LDREX and STREX.

LDREX

LDREX loads data from memory.

If the physical address has the Shared TLB attribute, LDREX tags that physical address as being exclusively accessed by the current processor, and clears any exclusive-access tag this processor holds for any other physical address.

Otherwise, it tags the executing processor as having an outstanding exclusive access to the physical address, which has not yet completed.

STREX

STREX stores data to memory only under certain conditions. The conditions are as follows:

If the physical address does not have the Shared TLB attribute, and the executing processor has an outstanding tagged access that has not yet completed, the store is performed, the tag is cleared, and the value 0 is returned in Rd.

If the physical address does not have the Shared TLB attribute, and the executing processor does not have an outstanding tagged access, the store is not performed and the value 1 is returned in Rd.

If the physical address has the Shared TLB attribute and has been tagged for exclusive access by this processor, the store is performed, the tag is cleared, and the value 0 is returned in Rd.

If the physical address has the Shared TLB attribute but is not tagged for exclusive access by this processor, the store is not performed and the value 1 is returned in Rd.
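You normally don't write LDREX/STREX by hand: on ARM, the compiler expands atomic builtins into exactly this kind of load-exclusive / store-exclusive retry loop. A portable sketch using the GCC/Clang __atomic builtins (the function name atomic_increment is mine):

```c
// On ARM the compiler turns this into an LDREX/STREX retry loop
// like the one described above; on x86 it becomes lock cmpxchg.
int atomic_increment(int *p)
{
    int old, desired;
    do {
        old = __atomic_load_n(p, __ATOMIC_RELAXED); // LDREX: load and mark exclusive
        desired = old + 1;
    } while (!__atomic_compare_exchange_n(p, &old, desired, 0,
                                          __ATOMIC_SEQ_CST,
                                          __ATOMIC_SEQ_CST)); // STREX: store only if still exclusive
    return desired;
}
```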

Reference: http://blog.csdn.net/duanlove/article/details/8212123

Atomic CAS Operations

First, look at a piece of code:

    int compare_and_swap(int *reg, int oldval, int newval)
    {
        int old_reg_val = *reg;
        if (old_reg_val == oldval)
            *reg = newval;
        return old_reg_val;
    }

CAS is compare-and-swap: first compare, then change. That is, when *reg equals oldval, set *reg to newval. As ordinary (non-atomic) code this is useless in a multi-threaded setting, but if this code executes as one atomic operation, it becomes very powerful, and the mutex is built on exactly this kind of CAS.

Above we already saw the Intel instruction lock cmpxchg. cmpxchg is exactly the CAS function: it compares and exchanges operands. The whole function above becomes a single Intel instruction, and with the lock prefix in front of cmpxchg, it is a truly atomic, truly powerful CAS! Magical, isn't it?

Win32 exposes this as the InterlockedCompareExchange function; on Intel CPUs it is implemented with exactly this instruction: lock cmpxchg!

Under Linux (GCC) there are __sync_bool_compare_and_swap and __sync_val_compare_and_swap!
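A quick sketch of the two possible outcomes using the GCC builtin (the values and the name cas_demo are arbitrary): the first CAS sees the expected value and swaps; the second holds a stale expected value and fails.

```c
// Demonstrates both outcomes of a CAS. Returns 1 when the first
// swap succeeded, the second failed, and the value ended as 11.
int cas_demo(void)
{
    int val = 10;
    int ok1 = __sync_bool_compare_and_swap(&val, 10, 11); // expected matches: swap, returns true
    int ok2 = __sync_bool_compare_and_swap(&val, 10, 12); // val is now 11: no swap, returns false
    return ok1 && !ok2 && val == 11;
}
```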

Under .NET there is Interlocked.CompareExchange! For Java, see the sun.misc.Unsafe class!

What makes CAS operations so powerful?

If you want to increment a shared variable from multiple threads, you would normally lock, like this:

    int num = 0;

    void Add()
    {
        lock();
        num = num + 1;
        unlock();
    }

But what if you don't lock, and use CAS instead?

    int num = 0;

    void Add()
    {
        int temp;
        do {
            temp = num;                              // read the latest value
        } while (CAS(&num, temp, temp + 1) != true); // retry until the swap succeeds
    }

Notice the do/while around the CAS result: if the swap succeeds, the +1 is done; if the CAS fails because someone else modified num first, we go around the loop again, temp picks up the latest num, and the CAS is retried!

With one thread operating on num alone, nothing goes wrong. With two threads, one performs the CAS and num becomes num+1; the second, still holding the old value, gets false back from its CAS, so it re-reads temp from the latest num and tries again. This is the core value of CAS: no lock!
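The lock-free Add() above can be exercised for real with the GCC builtin standing in for CAS; a sketch with arbitrary thread and iteration counts (run_cas_demo is my name for the harness). Because failed swaps are retried, no increment is lost.

```c
#include <pthread.h>

static volatile int num = 0;

// The Add() loop from the text, with __sync_bool_compare_and_swap as CAS.
static void *add_worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        int temp;
        do {
            temp = num; // read the latest value
        } while (!__sync_bool_compare_and_swap(&num, temp, temp + 1)); // retry on failure
    }
    return NULL;
}

// Run n threads (n <= 16) hammering the counter; return the final value.
int run_cas_demo(int n)
{
    pthread_t tids[16];
    num = 0;
    for (int i = 0; i < n; i++)
        pthread_create(&tids[i], NULL, add_worker, NULL);
    for (int i = 0; i < n; i++)
        pthread_join(tids[i], NULL);
    return num;
}
```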

CAS is, in effect, still a kind of lock: an optimistic lock! And like the spin lock, it loops!

Which of the two approaches wins depends on the thread count. If the number of threads is below the number of CPU cores, CAS wins: it completes the add operations in the shortest possible time!! But with too many threads, the mutex should win, because the mutex sleeps! The drawback of a CAS lock is obvious: it cannot handle heavy contention, and busy-waiting drives the CPU to 100%.

Here is the code for a mutex-style lock (written by myself):

    int i = 0; // 0 = unlocked, 1 = locked

    // Try to acquire the lock: a failed CAS means acquisition failed.
    // Returns true when the lock is acquired; on failure the thread
    // would sleep and wait for the system to wake it.
    BOOL Lock()
    {
        return cas(&i, 0, 1);
    }

    BOOL Unlock()
    {
        return cas(&i, 1, 0);
    }
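Note that Lock() above is a try-lock. A blocking variant built on the same idea simply spins until the CAS from 0 to 1 succeeds; a sketch using the GCC builtin in place of the cas() helper above (names spin_lock/spin_unlock are mine):

```c
static volatile int lock_word = 0; // 0 = unlocked, 1 = locked

void spin_lock(void)
{
    // Busy-wait until we win the 0 -> 1 transition. This busy
    // loop is exactly why CAS locks can burn 100% CPU.
    while (!__sync_bool_compare_and_swap(&lock_word, 0, 1))
        ;
}

void spin_unlock(void)
{
    // Release: 1 -> 0. Only the holder should call this.
    __sync_bool_compare_and_swap(&lock_word, 1, 0);
}
```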

CAS Lock-free queue

Since CAS has the same nature as a spin lock, why use CAS at all? In real use, the extreme case of threads fighting over a single variable is rare; typically a thread accesses the variable, runs some business logic, and only then accesses it again. So CAS works in most concurrency scenarios. That said, CAS is generally not used in place of a lock in ordinary business-thread programming: CAS programming has many traps, and it will burn your brain!

Speaking of concurrent, multi-threaded queues: many experts have studied them. Most people use a single thread or a mutex, and neither is ideal! The mutex adds the sleep-and-wake step, which costs time. With no lock and no sleep, CAS can squeeze the most out of a multicore CPU!

So let me simply post the CAS ring queue I wrote. It's very simple!

    // cas_queue.h
    #pragma once
    #ifndef _cas_queue
    #define _cas_queue

    #ifndef c_bool
    #define c_bool
    typedef int cbool;
    #define FALSE 0
    #define TRUE  1
    #endif

    #define QUEUE_SIZE 65536

    #ifdef __cplusplus
    extern "C" {
    #endif

    /* compare and swap: compare_and_swap(ptr, outvalue, newvalue); returns bool */
    cbool compare_and_swap(volatile long *ptr, long outvalue, long newvalue);
    void  cas_queue_init(int queue_size);
    void  cas_queue_free();
    cbool cas_queue_try_enqueue(void *p);
    cbool cas_queue_try_dequeue(void **p);

    #ifdef __cplusplus
    }
    #endif
    #endif

    // cas_queue.c
    #include "cas_queue.h"
    #include <stdlib.h>
    #include <string.h>
    #ifdef _MSC_VER
    #include <windows.h>
    #endif

    volatile long read_index  = 0;
    volatile long write_index = 0;
    void **ring_queue_buffer_head;
    int    ring_queue_size = QUEUE_SIZE;
    cbool  is_load = 0;

    cbool compare_and_swap(volatile long *ptr, long outvalue, long newvalue)
    {
    #ifdef _MSC_VER
        /* InterlockedCompareExchange64 did not work for me, so the
           32-bit version is used on both 32- and 64-bit builds. */
        long return_outvalue = InterlockedCompareExchange(ptr, newvalue, outvalue);
        return return_outvalue == outvalue;
    #else
        /* Linux/GCC */
        return __sync_bool_compare_and_swap(ptr, outvalue, newvalue);
    #endif
    }

    void cas_queue_init(int queue_size)
    {
        if (queue_size > 0)
            ring_queue_size = queue_size;
        int size = sizeof(void **) * ring_queue_size;
        ring_queue_buffer_head = malloc(size);
        memset(ring_queue_buffer_head, 0, size);
        is_load = 1;
        read_index = 0;
        write_index = 0;
    }

    void cas_queue_free()
    {
        is_load = 0;
        free(ring_queue_buffer_head);
    }

    cbool cas_queue_try_enqueue(void *p)
    {
        if (!is_load)
            return FALSE;
        long index;
        do {
            /* queue full: indices differ but map to the same slot */
            if (read_index != write_index &&
                read_index % ring_queue_size == write_index % ring_queue_size)
                return FALSE;
            index = write_index;
        } while (compare_and_swap(&write_index, index, index + 1) != TRUE);
        ring_queue_buffer_head[index % ring_queue_size] = p;
        return TRUE;
    }

    cbool cas_queue_try_dequeue(void **p)
    {
        if (!is_load)
            return FALSE;
        long index;
        do {
            /* queue empty */
            if (read_index == write_index)
                return FALSE;
            index = read_index;
        } while (compare_and_swap(&read_index, index, index + 1) != TRUE);
        *p = ring_queue_buffer_head[index % ring_queue_size];
        return TRUE;
    }

I tested it specifically: with 4 threads enqueuing and dequeuing 800,000 messages at the same time, it finished in only 150 milliseconds!! Of course, with too many threads and heavy contention it will certainly slow down!

