Intel System Programming Guide, Chapter 8: 8.10 Management of Idle and Blocked Conditions

Source: Internet
Author: User

When a logical processor in an MP system (including multi-core processors and processors supporting Intel Hyper-Threading Technology) is idle (no work to do) or blocked (waiting on a lock or semaphore), the HLT, PAUSE, or MONITOR/MWAIT instructions can be used for additional management of the core execution engine resources.

 

8.10.1 HLT Instruction

The HLT instruction stops the execution of the logical processor that executes it and places that logical processor in a halted state until further notice. When a logical processor is halted, the active logical processors continue to have full access to the shared resources within the physical package. Shared resources that were being used by the halted logical processor become available to the active logical processors, allowing them to execute more efficiently. When the halted logical processor resumes execution, the shared resources are again shared among all active logical processors. (See Section 8.10.6.3.)

 

8.10.2 PAUSE Instruction

The PAUSE instruction can improve the performance of processors that support Intel Hyper-Threading Technology when executing "spin-wait loops" and other routines in which one thread accesses a shared lock or semaphore in a tight polling loop. When executing a spin-wait loop, the processor can suffer a severe performance penalty on exiting the loop, because it detects a possible memory order violation and flushes the core processor's pipeline. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation and prevent the pipeline flush. In addition, the PAUSE instruction de-pipelines the spin-wait loop so that it does not consume execution resources excessively or consume power needlessly. (See Section 8.10.6.1.)

 

8.10.3 Detecting Support for the MONITOR/MWAIT Instructions

Streaming SIMD Extensions 3 introduced two instructions (MONITOR and MWAIT) to help multithreaded software improve thread synchronization. In the initial implementation, MONITOR and MWAIT are available to software at ring 0. The instructions are conditionally available at privilege levels greater than 0. Use the following steps to detect the availability of MONITOR and MWAIT:

1. Use CPUID to query the MONITOR bit (CPUID.1.ECX[3] = 1).

2. If CPUID indicates support, execute MONITOR inside a try/except exception handler and trap for an exception. If an exception occurs, MONITOR and MWAIT are not supported at a privilege level greater than 0. See Example 8-23.

 

Example 8-23. Verifying MONITOR/MWAIT Support

boolean monitor_mwait_works = true;
try {
    _asm {
        xor ecx, ecx
        xor edx, edx
        mov eax, memarea
        monitor
    }
    // use monitor
} except (unwind) {
    // if we get here, MONITOR/MWAIT is not supported
    monitor_mwait_works = false;
}
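The first detection step (the CPUID feature bit) can also be sketched in portable C. This is a minimal sketch, not the manual's code: the function names `ecx_reports_monitor` and `cpu_reports_monitor` are hypothetical, and the query path assumes a GCC or Clang toolchain on x86 that provides `<cpuid.h>`. On any other target the query simply reports that it cannot tell.

```c
#include <stdint.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>              /* __get_cpuid(), GCC/Clang on x86 only */
#define HAVE_CPUID_H 1
#endif

/* CPUID.01H:ECX bit 3 is the MONITOR/MWAIT feature flag. */
#define CPUID1_ECX_MONITOR (UINT32_C(1) << 3)

/* Pure decode step: returns 1 if the given ECX value has the flag set. */
int ecx_reports_monitor(uint32_t ecx)
{
    return (ecx & CPUID1_ECX_MONITOR) != 0;
}

/* Returns 1 if the flag is set, 0 if it is clear, and -1 if CPUID
 * cannot be queried in this build (assumption: non-x86 or other compiler). */
int cpu_reports_monitor(void)
{
#ifdef HAVE_CPUID_H
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return -1;
    return ecx_reports_monitor(ecx);
#else
    return -1;
#endif
}
```

A set flag only says that the processor implements the instructions. Whether they may be executed above ring 0 still has to be probed with the exception-handler step.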

 

8.10.4 MONITOR/MWAIT Instruction

Operating systems usually implement idle loops to handle thread synchronization. In a typical idle-loop scenario, there can be several "busy loops" that use a set of memory locations. An affected processor waits in a loop and polls a memory location to determine whether there is work available to execute. Posting work is typically a write to memory (the work queue of the waiting processor). The time needed to initiate a work request and get it scheduled is on the order of a few bus cycles.

From a resource-sharing perspective (logical processors sharing execution resources), using the HLT instruction in an OS idle loop is desirable but has implications. Executing HLT on an idle logical processor puts that processor in a non-execution state. Another processor that posts work for the halted logical processor must then wake it up with an inter-processor interrupt. Posting and servicing such an interrupt introduces a delay in the servicing of new work requests.

In a shared-memory configuration, exits from busy loops usually occur because of a state change that applies to a specific memory location; such a change tends to be triggered when another agent (typically a processor) writes to that memory location.

 

MONITOR/MWAIT complement the use of HLT and PAUSE to allow efficient partitioning and un-partitioning of shared resources among logical processors that share physical resources. MONITOR sets up an effective address range that is monitored for write-to-memory activity; MWAIT places the processor in an optimized state (which may vary between implementations) until a write to the monitored address range occurs.

In the initial implementation of MONITOR and MWAIT, they are available at CPL = 0 only.

Both instructions rely on the state of the processor's monitor hardware. The monitor hardware can be either armed (by executing the MONITOR instruction) or triggered (by a variety of events, including a store to the monitored memory region). If the monitor hardware is in the triggered state when MWAIT is executed, MWAIT behaves as a NOP and execution continues at the next instruction in the stream. The state of the monitor hardware is not architecturally visible except through the behavior of MWAIT.

Multiple events other than a write to the triggering address range can wake up a processor that has executed MWAIT. These include events that would lead to voluntary or involuntary context switches, such as:

1. External interrupts, including NMI, SMI, INIT, BINIT, MCERR, and A20M#

2. Faults and aborts (including machine check)

3. Architectural TLB invalidations, including writes to CR0, CR3, CR4, and certain MSR writes; execution of LMSW (occurring before MWAIT is issued but after the monitor is set)

4. Voluntary transitions due to fast system calls and far calls (occurring before MWAIT is issued but after the monitor is set)

Power-management events (such as Thermal Monitor 2 or chipset-driven STPCLK# assertion) do not clear the monitor event pending flag. Faults do not clear the monitor event pending flag either.

Software should not allow voluntary context switches between MONITOR and MWAIT in the instruction flow. Note that executing MWAIT does not re-arm the monitor hardware, which means that MONITOR/MWAIT need to be executed in a loop. Note also that an exit from the MWAIT state may be caused by a condition other than a write to the triggering address, so software should explicitly check the triggering data location to determine whether the write has occurred. Software should also check the value at the triggering address after executing the MONITOR instruction (and before executing the MWAIT instruction). This check identifies any write to the triggering address that occurred while MONITOR was executing.

The address range supplied to the MONITOR instruction must be of the write-back caching type. Only stores of the write-back memory type to the monitored address range trigger the monitor hardware. If the address range is not in write-back memory, the address monitor hardware may not be set up properly, or the monitor hardware may not be armed. Software is also responsible for ensuring that:

1. Writes that are not intended to cause an exit from a busy loop do not write to a location within the address region being monitored by the monitor hardware,

2. Writes that are intended to cause an exit from a busy loop are written to locations within the monitored address region.

Failing to do so leads to more false wakeups (an exit from the MWAIT state that is not due to a write to the intended data location). These have negative performance implications. Software may need to use padding to prevent false wakeups. CPUID provides a mechanism for determining the size of the data locations to monitor as well as the size of the padding.
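The arm, re-check, wait, re-check discipline described in this section can be sketched in C as follows. Because MONITOR and MWAIT are ring-0 instructions in the initial implementation, the sketch takes them as function pointers rather than executing them. The names `monitor_fn`, `mwait_fn`, `wait_for_work`, and the `sim_*` helpers are hypothetical and exist only so the loop can be exercised in user mode; a kernel would supply real wrappers around the instructions.

```c
#include <stdint.h>

/* Hypothetical hooks standing in for the MONITOR and MWAIT instructions. */
typedef void (*monitor_fn)(const volatile void *addr);
typedef void (*mwait_fn)(void);

/* Blocks until *trigger becomes non-zero and returns the value seen. */
uint32_t wait_for_work(const volatile uint32_t *trigger,
                       monitor_fn arm, mwait_fn wait)
{
    for (;;) {
        uint32_t v = *trigger;
        if (v != 0)
            return v;           /* the write already happened */

        arm(trigger);           /* MWAIT does not re-arm, so arm every pass */

        v = *trigger;           /* catch a write that raced with arming */
        if (v != 0)
            return v;

        wait();                 /* may return for unrelated reasons; the
                                   loop re-checks the trigger location */
    }
}

/* User-mode simulation: the first wakeup is spurious, the second one
 * follows a write of 7 to the trigger location. */
static volatile uint32_t sim_work_queue;
static int sim_arm_count;
static int sim_wait_count;

static void sim_monitor(const volatile void *addr)
{
    (void)addr;
    sim_arm_count++;
}

static void sim_mwait(void)
{
    sim_wait_count++;
    if (sim_wait_count == 2)
        sim_work_queue = 7;
}

uint32_t run_simulation(void)
{
    sim_work_queue = 0;
    sim_arm_count = 0;
    sim_wait_count = 0;
    return wait_for_work(&sim_work_queue, sim_monitor, sim_mwait);
}
```

In the simulation the loop arms twice and waits twice before returning 7, which illustrates why a wakeup alone is never treated as proof that the write occurred.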

 

8.10.5 MONITOR/MWAIT Address Range Determination

To use the MONITOR/MWAIT instructions in a multiprocessor system, software should know the length of the region monitored by MONITOR/MWAIT and the coherence line size used for cache-snoop traffic. This information can be queried with the CPUID monitor leaf function (EAX = 05H). You will need the smallest and largest monitor line sizes:

1. To avoid missed wakeups: make sure that the data structure used to monitor writes fits within the smallest monitor line size. Otherwise, the processor may not wake up after a write that was intended to trigger an exit from MWAIT.

2. To avoid false wakeups: use the largest monitor line size to pad the data structure used to monitor writes. Software must ensure that, beyond this data structure, no unrelated data variable exists in the triggering area for MWAIT. Padding may be needed to avoid this situation.

These two values bear no relationship to the cache line size of the system, and software should not make any assumptions to that effect. Within a single-cluster system, the two parameters should by default be the same (the size of the monitor triggering area is the same as the system coherence line size).

 

Based on the monitor line sizes returned by CPUID, the OS should dynamically allocate structures with appropriate padding. If an OS must use static data structures, it should attempt to adapt the data structure and use a dynamically allocated data buffer for thread synchronization. When the latter technique is not possible, consider not using MONITOR/MWAIT with static data structures.
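The sizing rules can be sketched in C as below. CPUID leaf 05H reports the smallest monitor line size in EAX[15:0] and the largest in EBX[15:0], both in bytes. Everything else in the sketch is an assumption made for illustration: the names `monitor_sizes`, `decode_monitor_leaf`, `monitor_block_size`, and `alloc_monitor_block` are hypothetical, and C11 `aligned_alloc` stands in for whatever allocator a real OS would use.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Monitor line sizes as reported by CPUID leaf 05H, in bytes. */
struct monitor_sizes {
    uint32_t smallest;   /* CPUID.05H:EAX[15:0] */
    uint32_t largest;    /* CPUID.05H:EBX[15:0] */
};

/* Extracts the two sizes from raw EAX and EBX register values. */
struct monitor_sizes decode_monitor_leaf(uint32_t eax, uint32_t ebx)
{
    struct monitor_sizes s = { eax & 0xFFFFu, ebx & 0xFFFFu };
    return s;
}

/* Returns the padded block size for a payload of the given size, or 0
 * if the payload does not fit in the smallest monitor line (a write to
 * it could then be missed) or if the reported sizes are unusable. */
size_t monitor_block_size(size_t payload, struct monitor_sizes s)
{
    if (payload == 0 || s.smallest == 0 || s.largest == 0 ||
        payload > s.smallest)
        return 0;
    /* Pad up to a whole number of largest-size lines. */
    return ((payload + s.largest - 1) / s.largest) * s.largest;
}

/* Allocates a block that starts on a largest-line boundary and covers
 * the whole line, so no unrelated variable can share the trigger area.
 * The caller releases it with free(). */
void *alloc_monitor_block(size_t payload, struct monitor_sizes s)
{
    size_t size = monitor_block_size(payload, s);
    /* aligned_alloc needs a power-of-two alignment. */
    if (size == 0 || (s.largest & (s.largest - 1)) != 0)
        return NULL;
    return aligned_alloc(s.largest, size);
}
```

With both sizes reported as 64 bytes, a 4-byte work-queue flag gets a 64-byte block aligned to 64 bytes, while a 65-byte payload is rejected because it cannot fit inside the smallest monitor line.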

 

Setting up the data structure correctly for MONITOR/MWAIT on a multi-cluster system requires interaction between the processors, the chipset, and the BIOS (the system coherence line size may depend on the chipset used in the system, and this size may differ from the processor's monitor triggering area). The BIOS is responsible for setting the correct system coherence line size through the IA32_MONITOR_FILTER_LINE_SIZE MSR. Depending on the relative magnitude of the monitor triggering area and the value written to the IA32_MONITOR_FILTER_LINE_SIZE MSR, the smaller of the two is reported as the smallest monitor line size and the larger is reported as the largest monitor line size.

 

8.10.6.1 Use the PAUSE Instruction in Spin-Wait Loops

Intel recommends that a PAUSE instruction be placed in all spin-wait loops that run on Intel processors supporting Intel Hyper-Threading Technology and on multi-core processors.

Software routines that use spin-wait loops include multiprocessor synchronization primitives (spin locks, semaphores, and mutex variables) and idle loops. Such routines keep the processor core busy executing a load-compare-branch loop while a thread waits for a resource to become available. Including a PAUSE instruction in such a loop greatly improves efficiency (see Section 8.10.2). The following routine gives an example of a spin-wait loop that uses a PAUSE instruction:

spin_lock:
    cmp lockvar, 0      // check if lock is free
    je get_lock
    pause               // short delay
    jmp spin_lock
get_lock:
    mov eax, 1
    xchg eax, lockvar   // try to get lock
    cmp eax, 0          // test if successful
    jne spin_lock
critical_section:
    <critical section code>
    mov lockvar, 0      // release the lock
    ...
continue:

The spin-wait loop above uses a "test, test-and-set" technique to determine the availability of the synchronization variable. This technique is recommended when writing spin-wait loops.

In IA-32 processor generations earlier than the Pentium 4 processor, the PAUSE instruction is treated as a NOP instruction.
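The same "test, test-and-set" pattern can be written in C11 atomics. This is a minimal sketch under stated assumptions, not a production lock: the names `tts_lock`, `tts_lock_acquire`, `tts_trylock`, `tts_unlock`, and `cpu_relax` are hypothetical, and `_mm_pause()` from `<immintrin.h>` is assumed to be available on x86 builds (other targets fall back to a no-op).

```c
#include <stdatomic.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()     /* emits the PAUSE instruction */
#else
#define cpu_relax() ((void)0)       /* assumption: no hint on other targets */
#endif

typedef struct {
    atomic_int locked;              /* 0 = free, 1 = held */
} tts_lock;

/* Single attempt: returns 1 if the lock was taken, 0 if it was held. */
int tts_trylock(tts_lock *l)
{
    return atomic_exchange_explicit(&l->locked, 1,
                                    memory_order_acquire) == 0;
}

void tts_lock_acquire(tts_lock *l)
{
    for (;;) {
        /* Test: spin on a plain load until the lock looks free. */
        while (atomic_load_explicit(&l->locked,
                                    memory_order_relaxed) != 0)
            cpu_relax();
        /* Test-and-set: the exchange corresponds to XCHG above. */
        if (tts_trylock(l))
            return;
    }
}

void tts_unlock(tts_lock *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

Spinning on the load rather than on the exchange keeps the waiting processor from issuing a locked write on every iteration; the exchange is attempted only once the lock appears free.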

 

8.10.6.2 Potential Usage of MONITOR/MWAIT in C0 Idle Loops

An operating system may implement different handlers for different idle states. A typical OS idle loop on an ACPI-compatible OS is shown in Example 8-24:

Example 8-24. A Typical OS Idle Loop

// workqueue is a memory location indicating there is a thread
// ready to run. A non-zero value for workqueue is assumed to
// indicate the presence of work to be scheduled on the processor.
// The idle loop is entered with interrupts disabled.
WHILE (1) {
    IF (workqueue) THEN {
        // schedule work at workqueue
    }
    ELSE {
        // no work to do: wait in the appropriate C-state handler,
        // depending on idle time accumulated
        IF (idletime >= idletimethreshhold) THEN {
            // call appropriate C1, C2, C3 state handler; C1 handler
            // shown below
        }
    }
}
// C1 handler uses a halt instruction
VOID c1handler()
{
    STI
    HLT
}

 

The MONITOR and MWAIT instructions may be considered for use in the C0 idle-state loop, if they are supported.

Example 8-25. An OS Idle Loop with MONITOR/MWAIT in the C0 Idle Loop

// workqueue is a memory location indicating there is a thread
// ready to run. A non-zero value for workqueue is assumed to
// indicate the presence of work to be scheduled on the processor.
// The following example assumes that the necessary padding has been
// added surrounding workqueue to eliminate false wakeups.
// The idle loop is entered with interrupts disabled.
WHILE (1) {
    IF (workqueue) THEN {
        // schedule work at workqueue
    }
    ELSE {
        // no work to do: wait in the appropriate C-state handler,
        // depending on idle time accumulated
        IF (idletime >= idletimethreshhold) THEN {
            // call appropriate C1, C2, C3 state handler; C1
            // handler shown below
            MONITOR workqueue   // setup of EAX with workqueue
                                // linear address; ECX, EDX = 0
            // wait only if no work was posted while the monitor
            // was being armed
            IF (workqueue == 0) THEN {
                MWAIT
            }
        }
    }
}
// C1 handler uses a halt instruction
VOID c1handler()
{
    STI
    HLT
}

 

8.10.6.3 Halt Idle Logical Processors

If one of two logical processors is idle or in a spin-wait loop of long duration, explicitly halt that logical processor with a HLT instruction.

In an MP system, the operating system can place idle processors in a loop that continuously checks the run queue for runnable software tasks. Logical processors that execute idle loops consume a significant amount of the core's execution resources, which could otherwise be used by the other logical processors in the physical package. For this reason, halting idle logical processors optimizes performance. If all logical processors within a physical package are halted, the processor enters a power-saving state.

 

8.10.6.4 Potential Usage of MONITOR/MWAIT in C1 Idle Loops

An operating system may also consider replacing HLT with MONITOR/MWAIT in its C1 idle loop. An example is shown in Example 8-26:

Example 8-26. An OS Idle Loop with MONITOR/MWAIT in the C1 Idle Loop

// workqueue is a memory location indicating there is a thread
// ready to run. A non-zero value for workqueue is assumed to
// indicate the presence of work to be scheduled on the processor.
// The following example assumes that the necessary padding has been
// added surrounding workqueue to eliminate false wakeups.
// The idle loop is entered with interrupts disabled.
WHILE (1) {
    IF (workqueue) THEN {
        // schedule work at workqueue
    }
    ELSE {
        // no work to do: wait in the appropriate C-state handler,
        // depending on idle time accumulated
        IF (idletime >= idletimethreshhold) THEN {
            // call appropriate C1, C2, C3 state handler; C1
            // handler shown below
        }
    }
}
// C1 handler uses MONITOR/MWAIT in place of a halt instruction
VOID c1handler()
{
    MONITOR workqueue   // setup of EAX with workqueue linear address;
                        // ECX, EDX = 0
    // wait only if no work was posted while the monitor was being armed
    IF (workqueue == 0) THEN {
        STI
        MWAIT           // EAX, ECX = 0
    }
}

 

8.10.6.5 Guidelines for Scheduling Threads on Logical Processors Sharing Execution Resources

Because the logical processors share execution resources, the order in which threads are dispatched to logical processors can affect the overall efficiency of a system. The following guidelines are recommended for scheduling threads for execution.

1. Dispatch threads to one logical processor per processor core before dispatching threads to the other logical processor that shares execution resources in the same core. (Note: when dispatching threads, spread them across all processor cores of a physical processor, one logical processor per core, rather than concentrating them on the two logical processors of a single core.)

2. In an MP system with two or more physical packages, distribute threads over all the physical processors rather than concentrating them on one or two physical processors.

3. Use processor affinity to assign a thread to a specific processor core or package, depending on the cache-sharing topology. This practice increases the chance that the processor's caches will still contain some of the thread's code and data when it is dispatched for execution after being suspended.
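Guidelines 1 and 2 amount to an ordering over logical processors, which the following C sketch makes concrete. It is an illustration only: the `logical_cpu` structure and `dispatch_order` function are hypothetical, the topology is assumed to be already known as small non-negative package, core, and sibling indices, and real schedulers weigh many more factors.

```c
#include <stddef.h>

/* Hypothetical topology record for one logical processor. */
struct logical_cpu {
    int package;   /* physical package index */
    int core;      /* core index within the package */
    int smt;       /* sibling index within the core (0 or 1 with HT) */
};

/* Writes into order[] the indices of cpus[] in preferred dispatch
 * order and returns the number written (n for valid input). The first
 * sibling of every core is used before any second sibling, and
 * consecutive picks alternate between packages. order[] must have
 * room for n entries. */
size_t dispatch_order(const struct logical_cpu *cpus, size_t n,
                      size_t *order)
{
    int max_pkg = 0, max_core = 0, max_smt = 0;
    size_t out = 0;

    for (size_t i = 0; i < n; i++) {
        if (cpus[i].package > max_pkg) max_pkg = cpus[i].package;
        if (cpus[i].core > max_core)   max_core = cpus[i].core;
        if (cpus[i].smt > max_smt)     max_smt = cpus[i].smt;
    }

    /* Outer loop on sibling index: one logical processor per core first.
     * Inner loop on package: spread across physical packages. */
    for (int smt = 0; smt <= max_smt; smt++)
        for (int core = 0; core <= max_core; core++)
            for (int pkg = 0; pkg <= max_pkg; pkg++)
                for (size_t i = 0; i < n; i++)
                    if (cpus[i].smt == smt && cpus[i].core == core &&
                        cpus[i].package == pkg)
                        order[out++] = i;
    return out;
}
```

For two packages with two dual-threaded cores each, every core receives one thread before any core receives a second, and the first two threads land on different packages.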

 

8.10.6.6 Eliminate Execution-Based Timing Loops

Intel discourages the use of timing loops that depend on a processor's execution speed to measure time. There are several reasons:

1. Timing loops cause problems when they are calibrated on an IA-32 processor running at one clock speed and then executed on a processor running at another clock speed.

2. Routines for calibrating execution-based timing loops produce unpredictable results when run on an IA-32 processor that supports Intel Hyper-Threading Technology. This is because execution resources are shared between the logical processors within a physical package.

To avoid the problems described above, timing loop routines must use a timing mechanism that does not depend on the execution speed of the logical processors in the system. The following sources are generally available:

1. A high-resolution system timer (for example, an Intel 8254).

2. A high-resolution timer within the processor (such as the local APIC timer or the time-stamp counter).
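As an illustration, a delay that is bounded by a timer rather than by an iteration count might look like the C sketch below. It assumes a POSIX system where `clock_gettime` with `CLOCK_MONOTONIC` is available as the OS-level view of such a timer; the function names `monotonic_ns` and `delay_ns` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Current monotonic time in nanoseconds (assumes POSIX clock_gettime). */
static uint64_t monotonic_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * UINT64_C(1000000000) +
           (uint64_t)ts.tv_nsec;
}

/* Waits until at least ns nanoseconds have elapsed and returns the
 * time actually waited. The loop exits on the clock reading, so its
 * duration does not depend on how fast the iterations execute. */
uint64_t delay_ns(uint64_t ns)
{
    uint64_t start = monotonic_ns();
    uint64_t now;

    do {
        now = monotonic_ns();
    } while (now - start < ns);

    return now - start;
}
```

Because the exit condition is a clock reading, the delay holds whether the loop body runs quickly or slowly, for example when a sibling logical processor is competing for execution resources.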

 

8.10.6.7 Place Locks and Semaphores in Aligned, 128-Byte Blocks of Memory

When software uses locks or semaphores to synchronize processes, threads, or other code sections, Intel recommends that only one lock or semaphore be present within a cache line (or within a 128-byte sector, if 128-byte sectors are supported). On processors based on the Intel NetBurst microarchitecture (which support a 128-byte sector consisting of two cache lines), following this recommendation means that each lock or semaphore should be contained in a 128-byte block of memory that begins on a 128-byte boundary. This practice minimizes the bus traffic required to service locks.
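In C11 this layout can be requested directly from the compiler. The sketch below is one possible way to do it; the names `padded_lock`, `lock_table`, `LOCK_BLOCK_SIZE`, and `lock_is_aligned` are hypothetical, and the 128-byte figure is taken from the recommendation above.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>

#define LOCK_BLOCK_SIZE 128

/* Aligning the only member to 128 bytes gives the whole structure a
 * 128-byte alignment, and the compiler pads its size up to 128 bytes,
 * so two locks can never share a block. */
struct padded_lock {
    alignas(LOCK_BLOCK_SIZE) atomic_int value;
};

static_assert(sizeof(struct padded_lock) == LOCK_BLOCK_SIZE,
              "each lock must occupy exactly one 128-byte block");
static_assert(alignof(struct padded_lock) == LOCK_BLOCK_SIZE,
              "each lock must start on a 128-byte boundary");

/* Every element of an array of locks gets its own aligned block. */
struct padded_lock lock_table[4];

/* Returns 1 if the lock starts on a 128-byte boundary. */
int lock_is_aligned(const struct padded_lock *l)
{
    return ((uintptr_t)l % LOCK_BLOCK_SIZE) == 0;
}
```

The static assertions turn an accidental change to the structure, such as adding an unrelated field that grows it past one block, into a compile-time error.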
