Atomic Operations


An In-Depth Analysis of the Implementation of volatile
Introduction

synchronized and volatile both play an important role in multithreaded concurrent programming. volatile is a lightweight synchronized that guarantees the "visibility" of shared variables in multiprocessor development. Visibility means that when one thread modifies a shared variable, another thread can read the modified value.

It is less expensive than synchronized in some cases. This article delves into how Intel processors implement volatile at the hardware level, which helps us use volatile variables correctly.
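For intuition before the hardware details, here is a minimal sketch (the class and field names are our own, not from the original article) of the visibility guarantee: the volatile write to the flag is guaranteed to become visible to the reader thread, which could otherwise spin forever on a stale cached value.

public class VisibilityDemo {
    // without volatile, the reader below may never see the update
    private static volatile boolean ready = false;

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(new Runnable() {
            public void run() {
                while (!ready) {
                    // spin until the volatile write becomes visible
                }
                System.out.println("saw ready == true");
            }
        });
        reader.start();
        Thread.sleep(100);
        ready = true; // volatile write: guaranteed visible to the reader
    }
}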

Terminology definitions

Shared variable (shared variable): A variable that can be accessed by multiple threads. Shared variables include all instance variables, static variables, and array elements; they are all stored in heap memory. volatile applies only to shared variables.

Memory barrier (memory barriers): A set of processor instructions used to restrict the order of memory operations.

Cache line (cache line): The smallest unit of storage that can be allocated in the cache. When the processor fills a cache line, it loads the entire line, which requires multiple main-memory read cycles.

Atomic operation (atomic operations): An operation, or a series of operations, that cannot be interrupted.

Cache line fill (cache line fill): When the processor recognizes that an operand read from memory is cacheable, it reads the entire cache line into the appropriate cache (L1, L2, L3, or all of them).

Cache hit (cache hit): If the memory location of a cache line fill is still the address of the processor's next access, the processor reads the operand from the cache instead of from memory.

Write hit (write hit): When the processor writes an operand back to a cached memory area, it first checks whether the address is in a cache line; if a valid cache line exists, the processor writes the operand back to the cache instead of to memory. This is called a write hit.

Write miss (write misses the cache): A valid cache line is written to a memory area that is not currently cached.

Official definition of volatile

The third edition of the Java Language Specification defines volatile as follows: the Java programming language allows threads to access shared variables; to ensure that a shared variable is updated accurately and consistently, a thread should ensure that it uses the variable under an exclusive lock. The Java language also provides volatile, which in some cases is more convenient than a lock: if a field is declared volatile, the Java thread memory model ensures that all threads see a consistent value for the variable.

Why use volatile

A volatile variable is cheaper to use and execute than synchronized because it does not cause thread context switching and scheduling.

The implementation principle of volatile

So how does volatile guarantee visibility? On an x86 processor, we can use a tool to obtain the assembly instructions generated by the JIT compiler and see what the CPU does for a volatile write.

Java code:

    instance = new Singleton(); // instance is a volatile variable

Assembly code:

    0x01a3de1d: movb $0x0,0x1104800(%esi);
    0x01a3de24: lock addl $0x0,(%esp);

When a shared variable modified by volatile is written, a second line of assembly code is added. Consulting the IA-32 Architecture Software Developer's Manual shows that this lock-prefixed instruction causes two things on a multi-core processor: the data in the current processor's cache line is written back to system memory, and that write-back invalidates the data cached at the same memory address in every other CPU.
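For reference, here is a rough sketch of how such output can be reproduced (assuming a HotSpot JVM with the hsdis disassembler plugin installed; the Singleton class below is illustrative): run a class that performs the volatile write in a hot loop under the PrintAssembly diagnostic flags.

public class Singleton {
    private static volatile Singleton instance;

    public static Singleton getInstance() {
        if (instance == null) {
            synchronized (Singleton.class) {
                if (instance == null) {
                    instance = new Singleton(); // the volatile write of interest
                }
            }
        }
        return instance;
    }

    public static void main(String[] args) {
        // loop enough times for the JIT to compile getInstance()
        for (int i = 0; i < 1000000; i++) {
            getInstance();
        }
    }
}

Run with: java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly Singleton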

To improve processing speed, the processor does not communicate with memory directly; it first reads data from system memory into its internal cache (L1, L2, or others) before operating on it, but after the operation it does not know when the data will be written back to memory. If a write is performed on a variable declared volatile, however, the JVM sends a lock-prefixed instruction to the processor, which writes the cache line containing the variable back to system memory. Yet even after the write-back, the values cached in other processors may be stale, which would cause problems in subsequent computations. Therefore, under multiprocessors, a cache coherence protocol keeps each processor's cache consistent: every processor checks whether its cached values are stale by sniffing the data propagated on the bus. When a processor discovers that the memory address corresponding to one of its cache lines has been modified, it sets that cache line to the invalid state; when it next operates on that data, it is forced to re-read the data from system memory into its cache.

These two behaviors are described in detail in the Multiprocessor Management chapter (Chapter 8) of Volume 3 of the IA-32 Software Developer's Architecture Manual.

1. The lock-prefixed instruction causes the processor's cache to be written back to memory. The LOCK prefix causes the processor's LOCK# signal to be asserted while the instruction executes. In a multiprocessor environment, the LOCK# signal ensures that the processor can use any shared memory exclusively while the signal is asserted (it locks the bus, so other CPUs cannot access the bus, and no bus access means no access to system memory). On recent processors, however, the LOCK# signal generally locks the cache rather than the bus, since locking the bus is relatively expensive. Section 8.1.4 details the effect of the lock operation on the processor cache: on the Intel 486 and Pentium processors, the LOCK# signal is always asserted on the bus during a lock operation, but on P6 and more recent processors, the LOCK# signal is not asserted if the accessed memory area is already cached inside the processor. Instead, the processor locks the cache line for that memory area, writes it back to memory, and relies on the cache coherence mechanism to guarantee that the modification is atomic. This is known as "cache locking", and the cache coherence mechanism prevents a memory area cached by two or more processors from being modified simultaneously.

2. Writing one processor's cache back to memory invalidates other processors' caches of that data. IA-32 and Intel 64 processors use the MESI (Modified, Exclusive, Shared, Invalid) control protocol to maintain coherence between the internal cache and other processors' caches. Operating in a multi-core system, IA-32 and Intel 64 processors can sniff other processors' accesses to system memory and to their internal caches, and they use this sniffing technology to keep their internal caches, system memory, and the data cached by other processors coherent across the bus. For example, on Pentium and P6 family processors, if sniffing detects that another processor intends to write to a memory address that is currently in the shared state, the sniffing processor invalidates its cache line, forcing a cache line fill the next time the same address is accessed.

Optimizing the use of volatile

Doug Lea, the renowned Java concurrency expert, added a new queue collection class, LinkedTransferQueue, to the JDK 7 concurrency package; in it, he appends bytes to volatile variables to optimize enqueue and dequeue performance.

Appending bytes can optimize performance. This approach looks magical, but understanding the processor architecture reveals the mystery. Let's look at the LinkedTransferQueue class: it uses an inner class, PaddedAtomicReference, to define the queue's head node and tail node, and relative to its parent class AtomicReference this inner class does only one thing, which is to pad the shared variable out to 64 bytes. We can work it out: an object reference occupies 4 bytes, and it appends 15 variables (60 bytes in total); together with the parent class's value variable, that makes 64 bytes.

/** head of the queue */
private transient final PaddedAtomicReference<QNode> head;

/** tail of the queue */
private transient final PaddedAtomicReference<QNode> tail;

static final class PaddedAtomicReference<T> extends AtomicReference<T> {
    // enough padding for 64 bytes with 4-byte refs
    Object p0, p1, p2, p3, p4, p5, p6, p7, p8, p9, pa, pb, pc, pd, pe;

    PaddedAtomicReference(T r) {
        super(r);
    }
}

public class AtomicReference<V> implements java.io.Serializable {
    private volatile V value;
    // ... other code omitted
}

Why does appending to 64 bytes improve the efficiency of concurrent programming? Because the L1, L2, and L3 cache lines of the Intel Core i7, Core Solo, and Pentium M processors are 64 bytes wide, and the processor does not support partially filled cache lines. If the queue's head node and tail node together occupy less than 64 bytes, the processor reads them both into the same cache line. Under multiple processors, each processor then caches the same head and tail nodes; when one processor tries to modify the head node, it locks the entire cache line, and the cache coherence mechanism prevents the other processors from accessing the tail node in their own caches. Since enqueue and dequeue operations constantly modify the head node and tail node, this seriously hurts enqueue and dequeue efficiency on multiprocessors. By appending bytes up to 64, Doug Lea fills the cache line, preventing the head and tail nodes from being loaded into the same cache line, so that modifying one does not lock the other.

Should you then always append to 64 bytes when using volatile variables? No. This approach should not be used in two scenarios. First: processors whose cache lines are not 64 bytes wide, such as the P6 family and Pentium processors, whose L1 and L2 cache lines are 32 bytes wide. Second: shared variables that are not written frequently. Appending bytes requires the processor to read more bytes into the cache, which itself costs some performance; if the shared variable is rarely written, the chance of cache line locking is very small, and there is no need to append bytes to avoid mutual locking.

The principle of atomic operation

1. Introduction

Atom originally means "the smallest particle that cannot be further divided", while an atomic operation means "an operation, or a series of operations, that cannot be interrupted". Implementing atomic operations on multiple processors becomes somewhat more complicated. In this article, let's talk about how atomic operations are implemented in Intel processors and in Java.
2. Terminology definition
Cache line (cache line): the minimum unit of cache operation.

Compare and Swap (CAS): a CAS operation takes two values as input, an old value (the expected value before the operation) and a new value. During the operation, the current value is compared with the old value; if it has not changed, it is exchanged for the new value, otherwise no exchange happens.

CPU pipeline (CPU pipeline): the CPU works like an assembly line in industrial production. Inside the CPU, 5-6 circuit units with different functions form an instruction-processing pipeline, and an x86 instruction is split into 5-6 steps executed by these circuit units, so that one instruction can complete in one CPU clock cycle, increasing the CPU's computation speed.

Memory order violation (memory order violation): memory order violations are generally caused by false sharing. False sharing means that multiple CPUs simultaneously modify different parts of the same cache line, causing one CPU's operation to be invalid; when such a conflict occurs, the CPU must flush its pipeline.

3. How the processor implements atomic operations

The 32-bit IA-32 processor implements atomic operations between multiple processors using bus locking or cache locking.

3.1 The processor automatically guarantees the atomicity of basic memory operations

First, the processor automatically guarantees the atomicity of basic memory operations. The processor guarantees that reading or writing a single byte in system memory is atomic: when one processor reads a byte, other processors cannot access that byte's memory address. The Pentium 6 and newer processors can also automatically guarantee that a single processor's 16-, 32-, and 64-bit operations within one cache line are atomic. But complex memory operations are not automatically atomic, such as accesses that cross bus widths, cross multiple cache lines, or cross page tables. For these, the processor provides two mechanisms, bus locking and cache locking, to guarantee the atomicity of complex memory operations.

3.2 Using a bus lock to guarantee atomicity

The first mechanism guarantees atomicity with a bus lock. If multiple processors simultaneously perform read-modify-write operations on a shared variable (i++ is the classic read-modify-write), the shared variable is operated on by several processors at once, so the read-modify-write is not atomic, and the shared variable's value after the operation may not match expectations. For example, if i = 1 and we perform i++ twice, we expect the result to be 3, but the result may be 2, as shown in the figure below.

(Example 1)

The reason is that multiple processors may read the variable i from their respective caches at the same time, each increment it, and each write the result back to system memory. To make a read-modify-write on a shared variable atomic, we must ensure that while CPU1 is reading and rewriting the shared variable, CPU2 cannot operate on the cache holding that shared variable's memory address.

The processor uses a bus lock to solve this problem. A bus lock is the LOCK# signal provided by the processor: when one processor asserts this signal on the bus, the other processors' requests are blocked, so that processor can use the shared memory exclusively.

3.3 Using cache locks to guarantee atomicity

The second mechanism guarantees atomicity through cache locking. Often we only need the operation on a single memory address to be atomic, but a bus lock locks all communication between the CPU and memory, so while it is held, no other processor can operate on data at any other memory address. Bus locking is therefore expensive, and recent processors use cache locking instead of bus locking for optimization in some situations.

Frequently used memory is cached in the processor's L1, L2, and L3 caches, so an atomic operation can often be performed entirely inside the processor's internal cache without asserting a bus lock; the Pentium 6 and more recent processors use this "cache locking" approach to implement complex atomic operations. "Cache locking" means that if the memory area is cached in the processor's cache line and is locked during the LOCK operation, then when the locked operation writes back to memory, the processor does not assert the LOCK# signal on the bus; instead it modifies the memory address internally and lets the cache coherence mechanism guarantee the atomicity of the operation, because the cache coherence mechanism prevents a memory area cached by two or more processors from being modified simultaneously. When another processor writes back data for a cache line that has been locked, that cache line becomes invalid; in Example 1, when CPU1 modifies i in its cache line using a cache lock, CPU2 cannot simultaneously cache the cache line holding i.

However, there are two situations in which the processor does not use cache locking. First, the processor falls back to a bus lock when the data being operated on cannot be cached inside the processor, or when the operation spans multiple cache lines. Second, some processors do not support cache locking: the Intel 486 and Pentium processors assert a bus lock even when the locked memory area is in the processor's cache line.

Intel processors expose these two mechanisms through the many instructions that accept a LOCK prefix, such as the bit test-and-modify instructions BTS, BTR, and BTC, the exchange instructions XADD and CMPXCHG, and other arithmetic and logical instructions such as ADD and OR. The memory area operated on by these instructions is locked, so other processors cannot access it at the same time.

4. How Java implements atomic operations

In Java, atomic operations can be implemented with locks and with spin (looping) CAS.

4.1 Implementing atomic operations with spin CAS

CAS operations in the JVM are implemented with the CMPXCHG instruction provided by the processor, mentioned in the previous section. The basic idea of spin CAS is to loop the CAS operation until it succeeds. The following code implements a CAS-based thread-safe counter method, safeCount, and a non-thread-safe counter, count.

   
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class Counter {
    private AtomicInteger atomicI = new AtomicInteger(0);
    private int i = 0;

    public static void main(String[] args) {
        final Counter cas = new Counter();
        List<Thread> ts = new ArrayList<Thread>(600);
        long start = System.currentTimeMillis();
        for (int j = 0; j < 100; j++) {
            Thread t = new Thread(new Runnable() {
                @Override
                public void run() {
                    for (int i = 0; i < 10000; i++) {
                        cas.count();
                        cas.safeCount();
                    }
                }
            });
            ts.add(t);
        }
        for (Thread t : ts) {
            t.start();
        }
        // wait for all threads to finish
        for (Thread t : ts) {
            try {
                t.join();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        System.out.println(cas.i);
        System.out.println(cas.atomicI.get());
        System.out.println(System.currentTimeMillis() - start);
    }

    /** thread-safe counter implemented with CAS */
    private void safeCount() {
        for (;;) {
            int i = atomicI.get();
            boolean suc = atomicI.compareAndSet(i, ++i);
            if (suc) {
                break;
            }
        }
    }

    /** non-thread-safe counter */
    private void count() {
        i++;
    }
}

Some concurrency frameworks in Java's concurrent package also use spin CAS to implement atomic operations, such as the xfer method of the LinkedTransferQueue class. Although CAS solves atomicity efficiently, it still has three major problems: the ABA problem, long spin times, and the fact that it can only guarantee atomic operations on a single shared variable.

1. The ABA problem. CAS checks whether the value has changed before updating it, and updates it if it has not. But if a value was A, became B, and then became A again, the CAS check will conclude that the value has not changed when in fact it has. The solution to the ABA problem is to use version numbers: append a version number to the variable and increment it on every update, so that A-B-A becomes 1A-2B-3A.
Starting with Java 1.5, the JDK's atomic package provides the class AtomicStampedReference to address the ABA problem. Its compareAndSet method first checks whether the current reference equals the expected reference and whether the current stamp equals the expected stamp; if both are equal, it atomically sets the reference and the stamp to the given new values.

public boolean compareAndSet(
        V   expectedReference, // expected reference
        V   newReference,      // new reference
        int expectedStamp,     // expected stamp
        int newStamp)          // new stamp
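For illustration, here is a minimal sketch (the variable names are our own) of how the stamp defeats an A-B-A sequence that a plain CAS would miss:

import java.util.concurrent.atomic.AtomicStampedReference;

public class AbaDemo {
    public static void main(String[] args) {
        // initial value "A" with stamp 0
        AtomicStampedReference<String> ref =
                new AtomicStampedReference<String>("A", 0);

        int[] stampHolder = new int[1];
        String value = ref.get(stampHolder); // read value and stamp together
        int stamp = stampHolder[0];

        // another thread changes A -> B -> A, bumping the stamp each time
        ref.compareAndSet("A", "B", ref.getStamp(), ref.getStamp() + 1);
        ref.compareAndSet("B", "A", ref.getStamp(), ref.getStamp() + 1);

        // the value is "A" again, but the stale stamp makes the CAS fail
        boolean ok = ref.compareAndSet(value, "C", stamp, stamp + 1);
        System.out.println(ok); // false: the stamp went from 0 to 2
    }
}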

2. Long spin times carry a high execution cost. If a spin CAS is unsuccessful for a long time, it imposes a very large execution cost on the CPU. If the JVM could support the pause instruction provided by the processor, efficiency would improve somewhat. The pause instruction does two things: first, it delays the pipelined instruction (de-pipeline) so the CPU does not consume too many execution resources (the delay depends on the implementation version, and on some processors it is zero); second, it prevents the CPU pipeline from being flushed (CPU pipeline flush) due to a memory order violation when exiting the loop, thereby improving CPU execution efficiency.

3. Only atomic operations on a single shared variable can be guaranteed. When operating on one shared variable, we can use spin CAS to guarantee atomicity, but when operating on several shared variables, spin CAS cannot guarantee the atomicity of the combined operation. In that case you can use a lock, or use a trick: merge the multiple shared variables into one shared variable to operate on. For example, given two shared variables i = 2 and j = a, merge them into ij = 2a and then CAS on ij. Starting with Java 1.5, the JDK provides the AtomicReference class to guarantee atomicity between referenced objects, so you can put multiple variables into one object and perform CAS operations on it.
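As a sketch of that trick (the State class and method names below are our own invention, not JDK code), multiple values can be wrapped in one immutable object and swapped atomically through an AtomicReference:

import java.util.concurrent.atomic.AtomicReference;

public class MultiVarCas {
    // immutable holder: one reference swap updates both fields at once
    static final class State {
        final int i;
        final String j;
        State(int i, String j) { this.i = i; this.j = j; }
    }

    private final AtomicReference<State> state =
            new AtomicReference<State>(new State(2, "a"));

    /** atomically update i and j together with a spin CAS */
    public void update(int newI, String newJ) {
        for (;;) {
            State old = state.get();
            State next = new State(newI, newJ);
            if (state.compareAndSet(old, next)) {
                return;
            }
        }
    }
}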

4.2 Implementing atomic operations with the lock mechanism

The lock mechanism ensures that only the thread that acquires the lock can operate on the locked memory area. The JVM implements several locking mechanisms: biased locks, lightweight locks, and mutex locks. Interestingly, except for biased locks, the JVM's lock implementations all use spin CAS: when a thread wants to enter a synchronized block, it uses a spin CAS to acquire the lock, and it uses a spin CAS to release the lock when it exits the block. For details, see the article "Synchronized in Java SE 1.6".

5. References

Intel 64 and IA-32 Architectures Software Developer's Manual
"Synchronized in Java SE 1.6"
"An In-Depth Analysis of the Implementation of volatile"

About the author

Fang Tengfei, nicknamed Qingying, is a senior development engineer at Taobao who focuses on concurrent programming; he currently works in the advertising technology department on the development and design of the wireless advertising alliance. Personal blog: http://ifeve.com Weibo: http://weibo.com/kirals. You are welcome to exchange technical ideas with me through Weibo.

Source: http://www.infoq.com/cn/articles/atomic-operation


Linux Atomic Operations

An atomic operation is an operation that will not be interrupted by any other task or event before it completes; in other words, its unit of execution cannot be divided into smaller units. The term "atomic" borrows the physics notion of an indivisible particle of matter.

Atomic operations require hardware support and are therefore architecture-dependent. Their API and atomic type definitions live in the file include/asm/atomic.h in the kernel source tree, and they are implemented in assembly language, because C cannot express such operations.

Atomic operations are primarily used to implement resource counts; many reference counts (refcnt) are implemented with atomic operations. The atomic type is defined as follows:

typedef struct {
    volatile int counter;
} atomic_t;


The volatile modifier on the field tells GCC not to optimize accesses to this data: every access must go to memory rather than to a register.

The atomic operation API includes:

atomic_read(atomic_t *v);

This function atomically reads the atomic variable v and returns its value.

atomic_set(atomic_t *v, int i);

This function sets the value of the atomic variable v to i.

void atomic_add(int i, atomic_t *v);

This function atomically adds i to the atomic variable v.

void atomic_sub(int i, atomic_t *v);

This function atomically subtracts i from the atomic variable v.

int atomic_sub_and_test(int i, atomic_t *v);

This function atomically subtracts i from the atomic variable v and tests the result; it returns true if the result is 0, and false otherwise.

void atomic_inc(atomic_t *v);

This function atomically increments the atomic variable v by 1.

void atomic_dec(atomic_t *v);

This function atomically decrements the atomic variable v by 1.

int atomic_dec_and_test(atomic_t *v);

This function atomically decrements the atomic variable v by 1 and tests the result; it returns true if the result is 0, and false otherwise.

int atomic_inc_and_test(atomic_t *v);

This function atomically increments the atomic variable v by 1 and tests the result; it returns true if the result is 0, and false otherwise.

int atomic_add_negative(int i, atomic_t *v);

This function atomically adds i to the atomic variable v and tests the result; it returns true if the result is negative, and false otherwise.

int atomic_add_return(int i, atomic_t *v);

This function atomically adds i to the atomic variable v and returns the new value.

int atomic_sub_return(int i, atomic_t *v);

This function atomically subtracts i from the atomic variable v and returns the new value.

int atomic_inc_return(atomic_t *v);

This function atomically increments the atomic variable v by 1 and returns the new value.

int atomic_dec_return(atomic_t *v);

This function atomically decrements the atomic variable v by 1 and returns the new value.

Atomic operations are commonly used to implement reference counting for resources. In the IP fragmentation code of the TCP/IP protocol stack, for example, reference counting is used: the fragment queue structure struct ipq describes an IP fragment, and its field refcnt is the reference counter, of type atomic_t. When an IP fragment is created (in the function ip_frag_create), atomic_set is used to set the count to 1, and whenever the IP fragment is referenced, atomic_inc is used to increment the reference count by 1.

When the IP fragment is no longer needed, the function ipq_put releases it: ipq_put uses atomic_dec_and_test to decrement the reference count by 1 and test whether it has reached 0, and if so, it frees the IP fragment. The function ipq_kill removes the IP fragment from the ipq queue and decrements the reference count of the removed fragment by 1 (implemented with atomic_dec).
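As a minimal sketch of that reference-counting pattern (the structure and function names below are illustrative, not the kernel's actual IP-fragment code):

#include <asm/atomic.h>
#include <linux/slab.h>

struct my_obj {
    atomic_t refcnt;
    /* ... payload ... */
};

static struct my_obj *my_obj_create(void)
{
    struct my_obj *obj = kmalloc(sizeof(*obj), GFP_KERNEL);
    if (obj)
        atomic_set(&obj->refcnt, 1);  /* the creator holds one reference */
    return obj;
}

static void my_obj_hold(struct my_obj *obj)
{
    atomic_inc(&obj->refcnt);         /* take an additional reference */
}

static void my_obj_put(struct my_obj *obj)
{
    /* drop one reference; free the object when the last one is gone */
    if (atomic_dec_and_test(&obj->refcnt))
        kfree(obj);
}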

Atomic Operation Function Prototypes

Atomic operations are performed in a single step, are not interrupted during execution, and do not sleep; they are the smallest unit of execution. Given these properties, atomic operations can be used to solve race problems.
Further synchronization mechanisms are built on top of atomic operations.
There are three kinds of atomic operations: integer atomic operations, 64-bit atomic operations, and bit atomic operations.

1 Integer atomic operations (Atomic Integer Operations)

To use atomic operations, define an atomic variable and then operate on it atomically through the interfaces provided by the kernel.
The integer atomic variable structure is as follows:

#include <linux/types.h>

typedef struct {
    int counter;
} atomic_t;

As you can see, an integer atomic variable is essentially a 32-bit integer variable.
The operation interfaces for integer atomic variables are listed below; their implementations are architecture-specific.

#include <asm/atomic.h>

ATOMIC_INIT(i)                              // initializer that defines an atomic variable with value i
int atomic_read(atomic_t *v)                // read the value of v
void atomic_set(atomic_t *v, int i)         // set the value of v to i
void atomic_add(int i, atomic_t *v)         // add i to v
void atomic_sub(int i, atomic_t *v)         // subtract i from v
void atomic_inc(atomic_t *v)                // add 1 to v
void atomic_dec(atomic_t *v)                // subtract 1 from v
int atomic_sub_and_test(int i, atomic_t *v) // subtract i from v; return true if the result is 0
int atomic_add_negative(int i, atomic_t *v) // add i to v; return true if the result is negative
int atomic_add_return(int i, atomic_t *v)   // add i to v and return the result
int atomic_sub_return(int i, atomic_t *v)   // subtract i from v and return the result
int atomic_inc_return(atomic_t *v)          // add 1 to v and return the result
int atomic_dec_return(atomic_t *v)          // subtract 1 from v and return the result
int atomic_dec_and_test(atomic_t *v)        // subtract 1 from v; return true if the result is 0
int atomic_inc_and_test(atomic_t *v)        // add 1 to v; return true if the result is 0
2 64-bit atomic operations (64-bit Atomic Operations)

The 64-bit atomic variable structure:

typedef struct {
    u64 __aligned(8) counter;
} atomic64_t;

The 64-bit atomic operation interface parallels the integer interface; just change "atomic" in each interface name to "atomic64".

3 Bit atomic operations (Atomic Bitwise Operations)

The bit atomic operation interface:

#include <asm/bitops.h>

void set_bit(int nr, void *addr)            // set bit nr of addr to 1
void clear_bit(int nr, void *addr)          // set bit nr of addr to 0
void change_bit(int nr, void *addr)         // toggle bit nr of addr
int test_and_set_bit(int nr, void *addr)    // set bit nr of addr to 1 and return the bit's previous value
int test_and_clear_bit(int nr, void *addr)  // set bit nr of addr to 0 and return the bit's previous value
int test_and_change_bit(int nr, void *addr) // toggle bit nr of addr and return the bit's previous value
int test_bit(int nr, void *addr)            // return the value of bit nr of addr
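As a small usage sketch (the flag name and functions below are illustrative), bit atomic operations are commonly applied to flag words, for example to claim a busy flag without a lock:

#include <asm/bitops.h>

#define DEV_BUSY 0  /* bit number of an illustrative "busy" flag */

static unsigned long dev_flags;

static int try_claim_device(void)
{
    /* atomically set the busy bit; a previous value of 1 means it was taken */
    if (test_and_set_bit(DEV_BUSY, &dev_flags))
        return 0;   /* already busy */
    return 1;       /* claimed */
}

static void release_device(void)
{
    clear_bit(DEV_BUSY, &dev_flags);  /* atomically clear the busy bit */
}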

