Intel System Programming Guide, Chapter 8, Section 8.1: Locked atomic operations

Last Update:2018-12-03 Source: Internet

Author: User

The 32-bit IA-32 processors support locked atomic operations on locations in system memory. These operations are typically used to manage shared data structures (such as semaphores, segment descriptors, system segments, and page tables) in which two or more processors may try to modify the same field (Translator's note: a field here is equivalent to a data member of a struct variable in C) or flag at the same time. The processor uses three interdependent mechanisms to carry out locked atomic operations:

1. Guaranteed atomic operations

2. Bus locking, using the LOCK# signal and the LOCK instruction prefix.

3. Cache coherency protocols that ensure that atomic operations can be carried out on cached data structures (cache lock). This mechanism is present in the Pentium 4, Intel Xeon, and P6 family processors.

These mechanisms are interdependent in the following ways. Certain basic memory transactions (such as reading or writing a byte in system memory) are always guaranteed to be handled atomically. That is, once started, the processor guarantees that the operation will complete before another processor or bus agent is allowed to access the memory location. The processor also supports bus locking for selected memory operations (such as read-modify-write operations on a shared area of memory) that typically need to be handled atomically but are not automatically handled this way. (Translator's note: When performing a read-modify-write operation on a variable shared between cores or threads, make sure the operation is atomic, that is, it cannot be interrupted partway through, and other threads that try to operate on the shared variable are held off by the bus lock. The processor does not do this automatically; the software must use atomic instructions.) Because frequently used memory locations are often cached in a processor's L1 or L2 caches, atomic operations can often be carried out inside a processor's caches without asserting the bus lock. Here the processor's cache coherency protocols ensure that other processors caching the same memory locations are managed properly while atomic operations are performed on cached memory locations.
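
As a rough illustration of the read-modify-write case described above, here is a minimal C11 sketch. The names `shared_counter`, `counter_increment`, and `counter_read` are made up for this example. On x86 a compiler will typically implement `atomic_fetch_add` with a LOCK-prefixed instruction, but that is a property of the compiler and target, not something the text above promises.

```c
#include <stdatomic.h>

/* Hypothetical shared counter; the name is illustrative only.
 * Objects with static storage duration start out zero-initialized. */
static atomic_int shared_counter;

/* A plain `x++` on shared memory is a separate load, add, and store,
 * so another thread can slip in between the steps.
 * atomic_fetch_add performs the whole read-modify-write as one atomic
 * step and returns the value held before the addition. */
int counter_increment(void)
{
    return atomic_fetch_add(&shared_counter, 1);
}

/* Atomic read of the current value. */
int counter_read(void)
{
    return atomic_load(&shared_counter);
}
```

Several threads could call `counter_increment()` concurrently without losing updates; with a plain `int` and `++` that would not be guaranteed.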

NOTE: Where there are contested lock accesses, software may need to implement algorithms that ensure fair access to resources in order to prevent lock starvation. The hardware provides no resource that guarantees fairness to participating agents. Managing the fairness of semaphores and mutual-exclusion lock functions is the responsibility of software.
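
One common way for software to provide the fairness mentioned in the note is a ticket lock, which serves waiters in arrival order. The sketch below is only one possible approach and is not taken from the Intel text; the type and function names are hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical ticket lock: waiters are served strictly in the order
 * in which they took a ticket, so no waiter can be starved. */
typedef struct {
    atomic_uint next_ticket;  /* next ticket number to hand out */
    atomic_uint now_serving;  /* ticket currently allowed to hold the lock */
} ticket_lock;

void ticket_lock_init(ticket_lock *l)
{
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

/* Take a ticket with one atomic read-modify-write, then wait for our turn.
 * Returns the ticket number that was drawn. */
unsigned ticket_lock_acquire(ticket_lock *l)
{
    unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != my_ticket) {
        /* spin until our ticket comes up */
    }
    return my_ticket;
}

/* Hand the lock to the holder of the next ticket. */
void ticket_lock_release(ticket_lock *l)
{
    atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
}
```

A plain test-and-set spinlock gives no ordering among waiters; the ticket counter is what adds the first-come, first-served guarantee here.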

The mechanisms for handling locked atomic operations have evolved with the complexity of IA-32 processors. More recent IA-32 processors (such as the Pentium 4, Intel Xeon, and P6 family processors) and Intel 64 provide a more refined locking mechanism than earlier processors. These mechanisms are described in the following sections.

8.1.1 Guaranteed atomic operations

The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:

1. Reading or writing a byte

2. Reading or writing a word aligned on a 16-bit boundary

3. Reading or writing a doubleword aligned on a 32-bit boundary

The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:

1. Reading or writing a quadword aligned on a 64-bit boundary

2. 16-bit accesses to uncached memory locations that fit within a 32-bit data bus

The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:

1. Unaligned 16-bit, 32-bit, and 64-bit accesses to cached memory that fit within a cache line
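
The conditions in the lists above come down to two address checks: whether an access is naturally aligned, and whether it stays inside one cache line. The helpers below are a small sketch of those checks. The 64-byte line size and the function names are assumptions made for the example; the text above does not give a line size.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. This is common on recent x86 parts
 * but is not stated in the text above. */
#define CACHE_LINE_SIZE 64u

/* True when an access of `size` bytes at `addr` is naturally aligned.
 * `size` must be a power of two (1, 2, 4, or 8 here). */
bool is_naturally_aligned(uintptr_t addr, size_t size)
{
    return (addr & (size - 1)) == 0;
}

/* True when the bytes [addr, addr + size) all lie in one cache line. */
bool fits_in_cache_line(uintptr_t addr, size_t size)
{
    return size != 0 &&
           (addr / CACHE_LINE_SIZE) == ((addr + size - 1) / CACHE_LINE_SIZE);
}
```

For example, an 8-byte access at address 0x1004 is not naturally aligned but still fits in one 64-byte line, while the same access at 0x103C crosses into the next line.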

Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors provide bus control signals that permit external memory subsystems to make split accesses atomic; however, unaligned data accesses seriously degrade processor performance and should be avoided.

An x87 instruction or an SSE instruction that accesses data larger than a quadword may be implemented using multiple memory accesses. If such an instruction stores to memory, some of the accesses may complete (writing to memory) while another causes the operation to fault for architectural reasons (for example, because a page-table entry is marked "not present"). In this case, the effects of the completed accesses may be visible to software even though the instruction as a whole caused a fault. If TLB invalidation has been delayed (see Section 4.10.4.4), such page faults may occur even if all accesses are to the same page.

8.1.2 Bus locking

Intel 64 and IA-32 processors provide a LOCK# signal that is asserted automatically during certain critical memory operations to lock the system bus or an equivalent link (for example, a dedicated channel between a processor and a particular destination processor). While this output signal is asserted, requests from other processors or bus agents for control of the bus are blocked. Software can specify other occasions on which the LOCK semantics are to be followed by placing the LOCK prefix before an instruction.

In the case of the Intel386, Intel486, and Pentium processors, explicitly locked instructions result in the assertion of the LOCK# signal. It is the responsibility of the hardware designer to make the LOCK# signal available in system hardware to control memory accesses among processors.

For the P6 and more recent processor families, if the memory area being accessed is cached internally in the processor, the LOCK# signal is generally not asserted; instead, locking is applied only to the processor's caches (see Section 8.1.4).

8.1.2.1 Automatic locking

The processor automatically follows the LOCK semantics in the following operations:

1. When executing an XCHG instruction that references memory.

2. When setting the B (busy) flag of a TSS descriptor: When switching to a task, the processor tests and sets the busy flag in the type field of the TSS descriptor. To ensure that two processors do not switch to the same task simultaneously, the processor follows the LOCK semantics while testing and setting this flag.

3. When updating segment descriptors: When loading a segment descriptor, the processor sets the accessed flag in the descriptor if the flag is clear. During this operation, the processor follows the LOCK semantics so that the descriptor will not be modified by another processor while it is being updated. For this to be effective, operating-system procedures that update descriptors should use the following steps:

-- Use a locked operation to modify the access-rights byte to indicate that the segment descriptor is not present, and specify a value for the type field that indicates that the descriptor is being updated.

-- Update the fields of the segment descriptor. (This operation may require several memory accesses; therefore, locked operations cannot be used.)

-- Use a locked operation to modify the access-rights byte to indicate that the segment descriptor is valid and present.

The Intel386 processor always updates the accessed flag in the segment descriptor, whether or not it is clear. The Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors update this flag only if it is not already set.

4. When updating page-directory and page-table entries: The processor uses locked cycles to set the accessed and dirty flags in the page-directory and page-table entries.

5. When acknowledging interrupts: After an interrupt request, an interrupt controller may use the data bus to send the interrupt vector for the interrupt to the processor. The processor follows the LOCK semantics during this time to ensure that no other data appears on the data bus while the interrupt vector is being transmitted.
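
Because an XCHG that references memory is always locked, an atomic exchange is the classic building block for a simple test-and-set spinlock. The following C11 sketch shows the idea using `atomic_exchange_explicit`; the type and function names are invented for this example, and whether the compiler actually emits XCHG for it depends on the compiler and target.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical exchange-based spinlock. */
typedef struct {
    atomic_int locked;   /* 0 = free, 1 = held */
} xchg_spinlock;

/* One attempt: swap in 1 and look at what was there before.
 * If the old value was 0, the lock was free and is now ours. */
bool xchg_try_lock(xchg_spinlock *l)
{
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

/* Keep trying until the exchange observes a free lock. */
void xchg_lock(xchg_spinlock *l)
{
    while (!xchg_try_lock(l)) {
        /* spin until the current holder releases the lock */
    }
}

/* Release by storing 0; the release ordering publishes earlier writes. */
void xchg_unlock(xchg_spinlock *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

This lock makes no fairness promise: whichever waiter's exchange lands first after a release wins.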

8.1.2.2 Software-controlled bus locking

To explicitly force the LOCK semantics, software can use the LOCK prefix with the following instructions when they are used to modify a memory location. An invalid-opcode exception (#UD) is generated when the LOCK prefix is used with any other instruction, or when no write operation is made to memory (that is, when the destination operand is in a register). (Translator's note: The LOCK prefix can be used only with the specific instructions listed below, and the destination operand of these instructions must be a memory operand; otherwise, the instruction is invalid.)

1. The bit test and modify instructions (BTS, BTR, and BTC).

2. The exchange instructions (XADD, CMPXCHG, and CMPXCHG8B).

3. The XCHG instruction, for which the LOCK prefix is automatically assumed.

4. The following single-operand arithmetic and logical instructions: INC, DEC, NOT, and NEG.

5. The following two-operand arithmetic and logical instructions: ADD, ADC, SUB, SBB, AND, OR, and XOR.
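
From C, these locked instructions are usually reached through the C11 atomic operations rather than written by hand. The sketch below shows three such operations with comments naming the instruction each one loosely corresponds to. The `demo_` function names are hypothetical, and the instruction a given compiler actually selects is an assumption, not a guarantee.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Set bit `bit` (0..31) and report whether it was already set.
 * Comparable in effect to LOCK BTS or LOCK OR. */
bool demo_test_and_set_bit(_Atomic uint32_t *word, unsigned bit)
{
    uint32_t mask = UINT32_C(1) << bit;
    return (atomic_fetch_or(word, mask) & mask) != 0;
}

/* Add `delta` and return the previous value.
 * Comparable in effect to LOCK XADD. */
uint32_t demo_fetch_add(_Atomic uint32_t *word, uint32_t delta)
{
    return atomic_fetch_add(word, delta);
}

/* Store `desired` only if the word still holds `expected`.
 * Comparable in effect to LOCK CMPXCHG. Returns true on success. */
bool demo_cas(_Atomic uint32_t *word, uint32_t expected, uint32_t desired)
{
    return atomic_compare_exchange_strong(word, &expected, desired);
}
```

In every case the destination is a memory object, which matches the rule above that the LOCK prefix is valid only when the instruction writes to memory.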

A locked instruction is guaranteed to lock only the area of memory defined by the destination operand, but the system may interpret it as a lock on a larger memory area.

Software should access semaphores (shared memory used for signalling between multiple processors) using identical addresses and operand lengths. For example, if one processor accesses a semaphore using a word access, other processors should not access the semaphore using a byte access.
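
One simple way to follow this rule in C is to wrap the semaphore in a single atomic object of fixed width and route every access through it. The sketch below is a hypothetical non-blocking counting semaphore built that way; none of the names come from the Intel text.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Every access goes through this one 32-bit atomic object, so all
 * participants use the same address and the same operand width. */
typedef struct {
    _Atomic uint32_t count;
} demo_semaphore;

void demo_sem_init(demo_semaphore *s, uint32_t initial)
{
    atomic_init(&s->count, initial);
}

/* Take one unit if any is available; never blocks. */
bool demo_sem_try_wait(demo_semaphore *s)
{
    uint32_t seen = atomic_load(&s->count);
    while (seen != 0) {
        /* On failure, `seen` is refreshed with the current count. */
        if (atomic_compare_exchange_weak(&s->count, &seen, seen - 1)) {
            return true;
        }
    }
    return false;
}

/* Return one unit to the semaphore. */
void demo_sem_post(demo_semaphore *s)
{
    atomic_fetch_add(&s->count, 1);
}
```

Casting the address of `count` to a narrower type and accessing part of it separately would be exactly the mixed-width access the paragraph above warns against.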

Note: Do not implement semaphores using the WC (write-combining) memory type. Do not perform non-temporal stores to a cache line that contains a location used to implement a semaphore.

The integrity of a bus lock is not affected by the alignment of the memory field. The LOCK semantics are followed for as many bus cycles as are needed to update the entire operand. However, it is recommended that locked accesses be aligned on their natural boundaries for better system performance.

Locked operations are atomic with respect to all other memory operations and all externally visible events. Only instruction fetches and page-table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor.

8.1.3 Handling self-modifying and cross-modifying code

The act of a processor writing data into a currently executing code segment with the intent of executing that data as code is called self-modifying code. IA-32 processors exhibit model-specific behavior when executing self-modified code, depending on how far ahead of the current execution pointer the code has been modified.

As processor microarchitectures become more complex and start to execute code speculatively ahead of the retirement point (as in the P6 and more recent processor families), the rules about which code should execute, the version before or after the modification, become blurred. To write self-modifying code and ensure that it is compatible with current and future versions of the IA-32 architecture, use one of the following coding options:

(* Option 1 *)

Store modified code (as data) into code segment; jump to new code or an intermediate location; execute new code;

(* Option 2 *)

Store modified code (as data) into code segment; execute a serializing instruction (such as the CPUID instruction); execute new code;

Programs intended to run on the Pentium or Intel486 processors are not required to use one of these options, but doing so is recommended to ensure compatibility with the P6 and more recent processor families.

Self-modifying code executes at a lower level of performance than non-self-modifying or normal code. The degree of the performance loss depends on the frequency of modification and on specific characteristics of the code.

The act of one processor writing data into the currently executing code segment of a second processor, with the intent of having the second processor execute that data as code, is called cross-modifying code. As with self-modifying code, IA-32 processors exhibit model-specific behavior when executing cross-modifying code, depending on how far ahead of the executing processor's current execution pointer the code has been modified.

To write cross-modifying code and ensure that it is compatible with current and future versions of the IA-32 architecture, the following processor synchronization algorithm must be implemented:

(* Action of the modifying processor *)

Memory_Flag <- 0; (* Set Memory_Flag to a value other than 1 *)

Store modified code (as data) into code segment;

Memory_Flag <- 1;

(* Action of the executing processor *)

WHILE (Memory_Flag != 1)

Wait for code to update;

ELIHW;

Execute serializing instruction; (* For example, the CPUID instruction *)

Begin executing modified code;

(Programs intended to run on the Intel486 processor are not required to use this algorithm, but doing so is recommended to ensure compatibility with newer processors.)

Like self-modifying code, cross-modifying code executes at a lower level of performance than normal code, depending on the frequency of modification and on specific characteristics of the code.

The restrictions on self-modifying code and cross-modifying code also apply to the Intel 64 architecture.
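
The flag handshake in the algorithm above can be modelled in C11 roughly as follows. This is only a sketch of the Memory_Flag protocol: the "code" is an ordinary byte buffer that is never executed, the serializing instruction is left out because portable C cannot express it, and all names (`patch_area`, `modifier_publish`, and so on) are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PATCH_AREA_SIZE 16

/* Stand-in for the code bytes being modified; plain data in this sketch. */
static unsigned char patch_area[PATCH_AREA_SIZE];
static atomic_int memory_flag;   /* mirrors Memory_Flag in the pseudocode */

/* Modifying side: clear the flag, store the new bytes, then set the flag.
 * The release store keeps the byte stores ordered before the flag update. */
void modifier_publish(const unsigned char *bytes, size_t len)
{
    atomic_store_explicit(&memory_flag, 0, memory_order_relaxed);
    memcpy(patch_area, bytes, len < PATCH_AREA_SIZE ? len : PATCH_AREA_SIZE);
    atomic_store_explicit(&memory_flag, 1, memory_order_release);
}

/* Executing side: one poll of the flag. A real executing processor would
 * loop on this, then execute a serializing instruction (for example CPUID)
 * before jumping to the modified code. */
bool executor_update_ready(void)
{
    return atomic_load_explicit(&memory_flag, memory_order_acquire) == 1;
}

/* Read back one published byte; valid once executor_update_ready() is true. */
unsigned char executor_peek(size_t index)
{
    return patch_area[index % PATCH_AREA_SIZE];
}
```

The acquire load pairs with the release store, so once the executing side sees the flag set it also sees the complete new bytes as data. That does not replace the serializing instruction, which is what makes the processor refetch the modified bytes as code.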

8.1.4 Effects of a LOCK operation on internal processor caches

For the Intel486 and Pentium processors, the LOCK# signal is always asserted on the bus during a LOCK operation, even if the area of memory being locked is cached in the processor.

For the P6 and more recent processor families, if the area of memory being locked during a LOCK operation is cached, as write-back memory, in the processor that is performing the LOCK operation and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus. Instead, it modifies the memory location internally and allows its cache coherency mechanism to ensure that the operation is carried out atomically. This operation is called "cache locking." The cache coherency mechanism automatically prevents two or more processors that have cached the same area of memory from simultaneously modifying data in that area.
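
Since cache locking applies only when the locked operand lies entirely within one cache line, software can help by aligning frequently locked variables so they never straddle a line. The sketch below does this with `_Alignas`; the 64-byte line size, the type name, and the function name are assumptions for illustration.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines; the text above gives no line size. */
#define ASSUMED_LINE_SIZE 64u

/* The alignment pads the struct to a full line, so the counter can never
 * be split across two cache lines and does not share a line with
 * neighbouring objects of this type. */
typedef struct {
    _Alignas(ASSUMED_LINE_SIZE) _Atomic uint32_t value;
} line_aligned_counter;

_Static_assert(sizeof(line_aligned_counter) == ASSUMED_LINE_SIZE,
               "counter should occupy exactly one assumed cache line");

/* Atomic increment; returns the value held before the increment. */
uint32_t line_counter_bump(line_aligned_counter *c)
{
    return atomic_fetch_add(&c->value, 1);
}
```

Giving each counter its own line also avoids unrelated data contending for the same line, at the cost of some padding.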

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
