First, the origin of the problem
Our program logic often encounters such sequences of operations:
1. Read the value of a variable from memory into a register
2. Modify the value of the variable (that is, modify the value in the register)
3. Write the value in the register back to the variable in memory
If this sequence runs serially (within a single thread, with nothing interleaved), everything is fine. The world, however, does not always cooperate. On a multi-CPU architecture, when two kernel control paths running on two CPUs execute the sequence above concurrently, the following scenario may occur:
| CPU1 actions    | CPU2 actions    |
| read operation  |                 |
|                 | read operation  |
| modify          | modify          |
| write operation |                 |
|                 | write operation |
Multiple CPUs and the memory chip are interconnected via a bus, and at any moment only one bus master (a CPU, a DMA controller, and so on) can access a slave device (here, the RAM chip). The memory reads from the two CPUs are therefore serialized, and each CPU obtains the same old value. After modifying it, both CPUs want to write the new value back to memory. The hardware arbiter again forces the write-backs to be serialized: CPU1 gains access first and writes back, and then CPU2 completes its write-back. CPU1's change to memory is thus overwritten by CPU2's write, and the final result is wrong.
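The interleaving above can be replayed deterministically in ordinary C. This is only a sketch: the two "registers" are local variables, "memory" stands in for the shared variable, and the function names are hypothetical rather than anything from the kernel.

```c
#include <stdio.h>

/*
 * Deterministic replay of the two-CPU interleaving. Each CPU increments
 * the shared variable once, yet only one increment survives.
 */
int replay_lost_update(int initial)
{
    int memory = initial;   /* the shared variable in RAM */
    int reg_cpu1;           /* CPU1's register (hypothetical name) */
    int reg_cpu2;           /* CPU2's register (hypothetical name) */

    reg_cpu1 = memory;      /* CPU1: read */
    reg_cpu2 = memory;      /* CPU2: read, sees the same old value */
    reg_cpu1 += 1;          /* CPU1: modify */
    reg_cpu2 += 1;          /* CPU2: modify */
    memory = reg_cpu1;      /* CPU1: write */
    memory = reg_cpu2;      /* CPU2: write, overwriting CPU1's update */

    return memory;
}

/* Print what two racing increments of 0 leave behind. */
void show_lost_update(void)
{
    printf("two increments of 0 give %d\n", replay_lost_update(0));
}
```

Calling replay_lost_update(0) returns 1 rather than 2, which is exactly the lost update described above.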
The problem is not limited to multi-CPU systems. On a single CPU, the interleaving of multiple kernel control paths causes the same error. A concrete example is as follows:
| System call control path | Interrupt handler control path |
| read operation           |                                |
|                          | read operation                 |
|                          | modify                         |
|                          | write operation                |
| modify                   |                                |
| write operation          |                                |
On the system call control path, right after the read completes, the hardware raises an interrupt and the interrupt handler starts to run. In this scenario, the value written back by the interrupt handler control path is overwritten by the write-back on the system call control path, and the result is again incorrect.
Second, the countermeasures
For variables on which multiple kernel control paths perform read-modify-write operations, the kernel provides a special type, atomic_t, defined as follows:
typedef struct {
int counter;
} atomic_t;
As the definition shows, atomic_t is really just a counter of type int. Still, there is a reason for defining a special type: the kernel defines a set of atomic_xxx interface functions that only accept parameters of type atomic_t. This ensures that the atomic_xxx functions only operate on atomic_t data. Conversely, a variable defined as atomic_t (which you expect to manipulate through the atomic_xxx functions) will not be accepted by ordinary, non-atomic operations.
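The type-safety argument can be tried out in userspace. The sketch below assumes GCC or Clang (it uses the __atomic builtins in place of the kernel's per-architecture code), and the names my_atomic_t, my_atomic_add, my_atomic_read and type_safety_demo are hypothetical.

```c
/* Userspace stand-in for atomic_t: an int wrapped in a struct. */
typedef struct {
    int counter;
} my_atomic_t;

/* Accepts only a my_atomic_t pointer, never a bare int pointer. */
void my_atomic_add(int i, my_atomic_t *v)
{
    __atomic_fetch_add(&v->counter, i, __ATOMIC_SEQ_CST);
}

int my_atomic_read(my_atomic_t *v)
{
    return __atomic_load_n(&v->counter, __ATOMIC_SEQ_CST);
}

int type_safety_demo(void)
{
    my_atomic_t refs = { 0 };
    int plain = 0;

    my_atomic_add(2, &refs);        /* fine: the types match */
    /* my_atomic_add(2, &plain);       diagnosed by the compiler:
     *                                 int * is not my_atomic_t * */
    /* refs++;                         compile error: a struct has no ++ */
    plain++;                        /* plain ints keep their operators */

    return my_atomic_read(&refs) + plain;
}
```

Uncommenting either of the two commented-out lines makes the compiler complain, which is the point of wrapping the counter in a struct.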
The specific interface API functions are organized as follows:
| Interface function | Description |
| static inline void atomic_add(int i, atomic_t *v) | Add i to the atomic variable v |
| static inline int atomic_add_return(int i, atomic_t *v) | As above, but also return the latest value of v |
| static inline void atomic_sub(int i, atomic_t *v) | Subtract i from the atomic variable v |
| static inline int atomic_sub_return(int i, atomic_t *v) | As above, but also return the latest value of v |
| static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new) | Compare old with the value of the atomic variable ptr; if they are equal, assign new to the atomic variable. Returns the old value of the atomic variable ptr |
| atomic_read | Get the value of an atomic variable |
| atomic_set | Set the value of an atomic variable |
| atomic_inc(v) | Add one to the atomic variable |
| atomic_inc_return(v) | As above, but also return the latest value of v |
| atomic_dec(v) | Subtract one from the atomic variable |
| atomic_dec_return(v) | As above, but also return the latest value of v |
| atomic_sub_and_test(i, v) | Subtract i from the atomic variable v and test whether the latest value of v equals 0 |
| atomic_add_negative(i, v) | Add i to the atomic variable v and test whether the latest value of v is negative |
| static inline int atomic_add_unless(atomic_t *v, int a, int u) | Add a to the atomic variable v as long as v is not equal to u. Returns a non-zero value if v was not equal to u (the addition was performed), otherwise returns 0 |
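The return-value conventions of atomic_cmpxchg and atomic_add_unless are easy to misread, so here is a userspace model of them built on C11 <stdatomic.h>. It is a sketch of the semantics listed in the table, not the kernel's code; model_atomic_t, model_cmpxchg and model_add_unless are hypothetical names, and building add_unless on a compare-and-exchange loop is one possible approach.

```c
#include <stdatomic.h>

/* Hypothetical userspace stand-in for atomic_t. */
typedef struct {
    atomic_int counter;
} model_atomic_t;

/* Like atomic_cmpxchg: store new_val only if the current value equals
 * old_val; always return the value that was found in the variable. */
int model_cmpxchg(model_atomic_t *v, int old_val, int new_val)
{
    int expected = old_val;

    atomic_compare_exchange_strong(&v->counter, &expected, new_val);
    /* unchanged on success, replaced by the observed value on failure */
    return expected;
}

/* Like atomic_add_unless: add a unless the value is u; return non-zero
 * if the addition was performed. */
int model_add_unless(model_atomic_t *v, int a, int u)
{
    int cur = atomic_load(&v->counter);

    while (cur != u) {
        int seen = model_cmpxchg(v, cur, cur + a);

        if (seen == cur)
            return 1;   /* nobody changed the value in between */
        cur = seen;     /* lost a race: retry with the fresh value */
    }
    return 0;
}
```

With the counter at 7, model_add_unless(&v, 1, 7) leaves it untouched and returns 0, while model_add_unless(&v, 1, 0) bumps it to 8 and returns non-zero.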
Third, the ARM implementation
We take atomic_add as an example to describe the implementation details of atomic operations in the Linux kernel:
#if __LINUX_ARM_ARCH__ >= 6                                  /* (1) */
static inline void atomic_add(int i, atomic_t *v)
{
    unsigned long tmp;
    int result;

    prefetchw(&v->counter);                                  /* (2) */
    __asm__ __volatile__("@ atomic_add\n"                    /* (3) */
    "1: ldrex   %0, [%3]\n"                                  /* (4) */
    "   add     %0, %0, %4\n"                                /* (5) */
    "   strex   %1, %0, [%3]\n"                              /* (6) */
    "   teq     %1, #0\n"                                    /* (7) */
    "   bne     1b"
    : "=&r" (result), "=&r" (tmp), "+Qo" (v->counter)        /* %0, %1, %2 */
    : "r" (&v->counter), "Ir" (i)                            /* %3, %4 */
    : "cc");
}
#else
#ifdef CONFIG_SMP
#error SMP not supported on pre-ARMv6 CPUs
#endif
static inline int atomic_add_return(int i, atomic_t *v)
{
    unsigned long flags;
    int val;

    raw_local_irq_save(flags);      /* disable interrupts on this CPU */
    val = v->counter;
    v->counter = val += i;
    raw_local_irq_restore(flags);   /* restore the saved interrupt state */

    return val;
}
#define atomic_add(i, v) (void) atomic_add_return(i, v)
#endif
(1) CPUs before ARMv6 do not support SMP; later ARM architectures do (for example, the familiar ARMv7-A). ARM's atomic operations are therefore split into two camps: SMP-capable CPUs of ARMv6 and later, and single-core CPUs before ARMv6. On UP (uniprocessor) systems, atomic operations are implemented by disabling CPU interrupts.
(2) This line is a cache prefetch. Loading the memory that is about to be operated on into the cache before the strex instruction executes can significantly improve performance.
(3) For completeness, here is a recap of the syntax for embedding assembly in C code. The format is asm(code : output operand list : input operand list : clobber list). The output and input operand lists are the interface between the C code and the embedded assembly, and the clobber list describes which registers the assembly code modifies. Why is a clobber list needed? Our C code is processed by GCC, and when GCC meets embedded assembly it passes the assembly text on to gas for further processing. GCC therefore needs to know which registers the embedded assembly changes, otherwise serious trouble can follow. For example, suppose GCC keeps the value of some variable in a register while compiling the C code. If the embedded assembly modifies that register without telling GCC, GCC will not reload the variable; it assumes the register still holds the variable's value and uses the clobbered register directly. All we can do then is wait quietly for the program to crash. Fortunately, registers that appear in the output and input operand lists do not need to be listed as clobbers (GCC allocates them itself and naturally knows that the embedded assembly modifies them). Most clobber lists are therefore empty, or contain only cc, which tells GCC that the embedded assembly updates the condition code register.
With this, you can match each part of the syntax against the atomic_add code above. The @ symbol marks the rest of the line as a comment.
__volatile__ is there mainly to prevent compiler optimization. When C code is compiled with an optimization option (-O), the compiler may optimize embedded assembly that is not declared __volatile__, so the generated code may differ from the assembly you wrote. Writing __asm__ __volatile__ (embedded assembly) tells the compiler not to move or alter the embedded assembly at will.
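To make the four parts of the syntax concrete, here is a small sketch that adds two integers with one embedded instruction. It assumes GCC or Clang; asm_add is a hypothetical helper, the instruction is chosen per architecture, and unknown architectures fall back to plain C.

```c
/*
 * asm(code : output operands : input operands : clobbers).
 * "+r" marks sum as both read and written; "r" asks the compiler to put
 * b in a general-purpose register of its choice.
 */
int asm_add(int a, int b)
{
    int sum = a;

#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("addl %1, %0"      /* code (AT&T syntax)      */
                         : "+r" (sum)       /* output operand list: %0 */
                         : "r" (b)          /* input operand list:  %1 */
                         : "cc");           /* clobber list: flags     */
#elif defined(__aarch64__)
    /* add does not touch the flags; cc is listed conservatively */
    __asm__ __volatile__("add %w0, %w0, %w1"
                         : "+r" (sum) : "r" (b) : "cc");
#elif defined(__arm__)
    /* add does not touch the flags; cc is listed conservatively */
    __asm__ __volatile__("add %0, %0, %1"
                         : "+r" (sum) : "r" (b) : "cc");
#else
    sum += b;                               /* portable fallback */
#endif
    return sum;
}
```

Neither sum nor b appears in the clobber list: the compiler allocated those registers itself, so it already knows the assembly touches them.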
(4) Let us first look at the two assembly instructions ldrex and strex. Everyone is familiar with ldr and str; the ex suffix stands for exclusive. These instructions are provided by the ARM architecture (from ARMv6 onward, which matches the #if condition in the code above) to implement synchronization.
LDREX <Rt>, [<Rn>]
<Rn> is the base register, which holds the memory address. The LDREX instruction takes the memory address from the base register and loads the memory content into <Rt> (the destination register). So far this is identical to LDR, so where does exclusive come in? While executing this instruction, the CPU also releases two "watchdogs" to observe accesses to that specific address (the address held in [<Rn>]): one is called the local monitor and the other the global monitor.
STREX <Rd>, <Rt>, [<Rn>]
As with LDREX, <Rn> is the base register holding the memory address. The STREX instruction takes the memory address from the base register and stores <Rt> (the source register) to memory. <Rd> receives the result of the memory update: 0 means the update succeeded, 1 means it failed. Whether STREX succeeds depends on the state of the local monitor and the global monitor. For non-shareable memory (memory that is not shared between CPUs and is only ever accessed by one CPU), releasing only that CPU's local monitor is enough. The following table describes this case:
| Thread 1 | Thread 2 | Status of local monitor |
|          |          | Open Access state |
| LDREX    |          | Exclusive Access state |
|          | LDREX    | Exclusive Access state |
|          | modify   | Exclusive Access state |
|          | STREX    | Open Access state |
| modify   |          | Open Access state |
| STREX    |          | Executing STREX in the Open Access state causes the instruction to fail |
|          |          | Stays in the Open Access state until the next LDREX |
Initially the local monitor is in the Open Access state. After thread 1 executes LDREX, the local monitor moves to the Exclusive Access state (marking that the local CPU has performed an LDREX on address xxx). At this point an interrupt occurs, and the interrupt handler (thread 2 in the table) executes LDREX again. The local monitor state stays unchanged until the handler's STREX executes successfully, at which point the local monitor moves back to the Open Access state (clearing the LDREX mark on address xxx). When control returns to thread 1, executing STREX in the Open Access state fails (no LDREX, no STREX), which indicates that another kernel control path cut in.
For shareable memory, all the local monitors and the global monitor in the system must work together to achieve exclusive access. The concept is similar and is not described here.
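The local monitor behaviour for the non-shareable case can be modelled as a two-state machine. The sketch below replays the table row by row; it is a simplified software model (a single watched word, no global monitor), and every name in it is hypothetical.

```c
/* Hypothetical model of one CPU's local monitor; not real hardware code. */
enum monitor_state { OPEN_ACCESS, EXCLUSIVE_ACCESS };

struct sim_cpu {
    enum monitor_state monitor;
    int memory;                 /* the single word being watched */
};

/* LDREX: load the word and tag the address for exclusive access. */
int sim_ldrex(struct sim_cpu *cpu)
{
    cpu->monitor = EXCLUSIVE_ACCESS;
    return cpu->memory;
}

/* STREX: store only in the Exclusive Access state. Returns 0 on success
 * and 1 on failure; the monitor is in Open Access state afterwards. */
int sim_strex(struct sim_cpu *cpu, int value)
{
    if (cpu->monitor != EXCLUSIVE_ACCESS)
        return 1;
    cpu->memory = value;
    cpu->monitor = OPEN_ACCESS;
    return 0;
}

/* Replay the table: thread 2 (the interrupt handler) runs a complete
 * LDREX/STREX pair between thread 1's LDREX and STREX. */
void replay_table(struct sim_cpu *cpu, int *status1, int *status2)
{
    int t1 = sim_ldrex(cpu);        /* thread 1: LDREX  */
    int t2 = sim_ldrex(cpu);        /* thread 2: LDREX  */

    t2 += 1;                        /* thread 2: modify */
    *status2 = sim_strex(cpu, t2);  /* thread 2: STREX succeeds */
    t1 += 1;                        /* thread 1: modify */
    *status1 = sim_strex(cpu, t1);  /* thread 1: STREX fails */
}
```

Starting from a word holding 10, thread 2's status comes back 0, thread 1's comes back 1, and memory holds 11: thread 1's stale value never reaches memory.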
With the general principle covered, let us return to the concrete implementation.
"1: ldrex %0, [%3]\n"
Here %3 is "r" (&v->counter) in the input operand list. r is a constraint that tells GCC to pick a general-purpose register to hold the operand. %0 corresponds to "=&r" (result) in the output operand list. = means the operand is write-only, and & marks it as an earlyclobber operand. What does that mean? When processing embedded assembly the compiler tries to use as few registers as possible, so unless an output operand is marked with &, an input operand and an output operand may be assigned the same register. The & therefore guarantees that %3 and %0 use different registers.
(5) After step (4), the output operand %0 holds the old value of the atomic_t variable, and the operation here simply adds i to that old value. %4 corresponds to "Ir" (i). On ARM, the I constraint denotes an immediate with a specific restriction: it must be a 32-bit value that can be produced by rotating an integer between 0 and 255. This follows from how ARM data-processing instructions encode immediates. Each instruction is 32 bits wide, of which 12 bits represent the immediate: 8 bits are the actual data and 4 bits specify the rotation (a right rotation by twice the field's value). See the ARM Architecture Reference Manual (ARM ARM) for details.
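The immediate restriction can be checked with a few lines of C. This sketch assumes the ARM-mode encoding in which the 4-bit field rotates the 8-bit value right by twice its value (so only even rotations 0, 2, ..., 30 are possible); arm_immediate_ok and rotl32 are hypothetical helper names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (0 <= n < 32). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    return n == 0 ? x : (x << n) | (x >> (32 - n));
}

/*
 * True if value can be written as an 8-bit number rotated right by an
 * even amount. Rotating left by the same amount undoes the encoding,
 * so it is enough to find a rotation that leaves only the low 8 bits.
 */
bool arm_immediate_ok(uint32_t value)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        if (rotl32(value, rot) <= 0xFFu)
            return true;
    }
    return false;
}
```

For instance, 0xFF000000 and 0x3FC pass the check, while 0x101 (nine bits wide) and 0x1FE (eight bits wide, but needing an odd rotation) do not. When the value of i is not a suitable immediate, the r alternative in "Ir" lets the compiler place it in a register instead.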
(6) This step stores the modified value back to the atomic_t variable. The status indicating whether the store succeeded is saved in operand %1, that is, "=&r" (tmp).
(7) Check whether the memory update completed successfully. If it did, all is well; if not (another kernel control path cut in), jump back to label 1 and start a new read-modify-write sequence.
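Steps (4) to (7) form a retry loop, which can be simulated in plain C. The sketch below models a single CPU with one memory word and a local monitor flag; the pending_irqs parameter is an invented knob that injects an "interrupt handler" between the load and the store, and all names are hypothetical.

```c
/* Hypothetical single-CPU model: one word plus a local monitor flag. */
struct sim_word {
    int value;
    int exclusive;      /* 1 while an LDREX tag is outstanding */
};

static int sim_ldrex(struct sim_word *w)
{
    w->exclusive = 1;
    return w->value;
}

/* Returns 0 on success, 1 on failure, like the status STREX writes to %1. */
static int sim_strex(struct sim_word *w, int value)
{
    if (!w->exclusive)
        return 1;
    w->value = value;
    w->exclusive = 0;
    return 0;
}

/* Stand-in for an interrupt handler doing its own atomic increment. */
static void sim_interrupt(struct sim_word *w)
{
    int tmp = sim_ldrex(w);

    (void)sim_strex(w, tmp + 1);
}

/*
 * Same shape as the assembly loop: ldrex, add, strex, retry while the
 * status is non-zero. Returns the number of attempts that were needed.
 */
int sim_atomic_add(int i, struct sim_word *w, int pending_irqs)
{
    int attempts = 0;
    int result;
    int tmp;

    do {
        attempts++;
        result = sim_ldrex(w);          /* 1: ldrex %0, [%3]      */
        result += i;                    /*    add   %0, %0, %4    */
        if (pending_irqs > 0) {         /* an interrupt cuts in   */
            pending_irqs--;
            sim_interrupt(w);
        }
        tmp = sim_strex(w, result);     /*    strex %1, %0, [%3]  */
    } while (tmp != 0);                 /*    teq %1, #0 ; bne 1b */

    return attempts;
}
```

Starting from 10, sim_atomic_add(5, &w, 1) needs two attempts and leaves 16 in the word: the handler's increment and the caller's addition both survive, unlike the plain read-modify-write sequence from the first section.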
Linux kernel synchronization-atomic operation