Intel 64 and IA-32 Architectures Optimization Reference Manual - 3.7 Prefetching


3.7 Prefetching

Recent Intel processor families introduce several prefetch mechanisms to accelerate the movement of data or code and improve performance:

● Hardware instruction prefetch.

● Software prefetch for data.

● Hardware prefetch for cache lines of data or instructions.

3.7.1 Hardware Instruction Prefetching and Software Prefetching

In processors based on Intel NetBurst microarchitecture, the hardware instruction prefetcher reads instructions, 32 bytes at a time, into the 64-byte instruction streaming buffers. Instruction prefetching for Intel Core microarchitecture is discussed in Section 2.2.2.

Software prefetching requires a programmer to use PREFETCH hint instructions and to anticipate some suitable timing and location of cache misses.

Software prefetch instructions in Intel Core microarchitecture can prefetch across page boundaries and can perform one-to-four page walks. A software prefetch instruction issued this way retires after the page walk completes, the DCU miss is detected, and the cache fill is allocated. Software prefetch instructions can trigger all hardware prefetchers in the same manner as regular loads.

A software prefetch operates the same way as a load from memory, with the following exceptions (a short illustrative sketch follows this list):

● A software prefetch instruction retires after virtual-to-physical address translation is completed.

● If an exception, such as a page fault, would be required to prefetch the data, the software prefetch instruction retires without prefetching the data.
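To make the hint concrete, below is a minimal C sketch using the SSE prefetch intrinsic _mm_prefetch, which compiles to the PREFETCH family of hint instructions. The node structure and do_work function are hypothetical stand-ins introduced for illustration, not part of the manual's examples:

#include <xmmintrin.h>   /* _mm_prefetch and the _MM_HINT_* constants */

struct node {
    struct node *next;
    int payload;
};

/* Hypothetical stand-in for real per-element computation. */
static int do_work(int x) { return (x + x) & 6; }

/* Walk a linked list, hinting the next element one iteration ahead so
   its cache-miss latency can overlap with the current element's work. */
static int walk(struct node *n)
{
    int acc = 0;
    while (n) {
        if (n->next)
            _mm_prefetch((const char *)n->next, _MM_HINT_T0);  /* PREFETCHT0 */
        acc += do_work(n->payload);
        n = n->next;
    }
    return acc;
}

Consistent with the exceptions listed above, if n->next points to an unmapped page the hint is simply dropped rather than faulting.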

3.7.2 Software and Hardware Prefetching in Prior Microarchitectures

Pentium 4 and Intel Xeon processors based on Intel NetBurst microarchitecture introduced hardware prefetching in addition to software prefetching. The hardware prefetcher operates transparently, fetching data and instruction streams from memory without requiring programmer intervention. Subsequent microarchitectures continue to improve and add features to the hardware prefetch mechanisms. Earlier implementations of hardware prefetching focused on prefetching data and instructions from memory to L2; more recent implementations provide additional features to prefetch data from L2 to L1.

In Intel NetBurst microarchitecture, the hardware prefetcher can track 8 independent streams.

The Pentium M processor also provides a hardware prefetcher for data. It can track 12 separate streams in the forward direction and 4 streams in the backward direction. The processor's PREFETCHNTA instruction also fetches 64 bytes into the first-level data cache without polluting the second-level cache.

Intel Core Solo and Intel Core Duo processors provide more advanced hardware prefetchers for data than those of the Pentium M processor. Key differences are summarized in Table 2-23.

Although the hardware prefetcher operates transparently (requiring no intervention by the programmer), it operates most efficiently if the programmer tailors data access patterns to suit its characteristics (it favors small-stride cache miss patterns). Optimizing data access patterns to suit the hardware prefetcher is highly recommended, and should take precedence over using software prefetch instructions.

The hardware prefetcher is best for small-stride data access patterns, whether forward or backward, with a cache-miss stride not exceeding 64 bytes. This holds for data accesses whose addresses are either known or unknown at the time the load operations are issued. Software prefetching can complement the hardware prefetcher if used carefully.

There is a trade-off to make between hardware and software prefetching. This pertains to application characteristics such as regularity and stride of accesses. Bus bandwidth, issue bandwidth (the latency of loads on the critical path), and whether access patterns are suitable for non-temporal prefetch will also have an impact.

For a detailed description of how to use prefetching, see Chapter 7.

Tuning Suggestion 2: If a load is found to miss frequently, either insert a prefetch before it or, if issue bandwidth is a concern, move the load up to execute earlier.

3.7.3 Hardware Prefetching for First-Level Data Cache

The hardware prefetching mechanism for the L1 cache in Intel Core microarchitecture is discussed in Section 2.2.4.2. A similar L1 prefetch mechanism is also available to processors based on Intel NetBurst microarchitecture with a CPUID signature of family 15 and model 6.

Example 3-54 depicts a technique to trigger hardware prefetch. The code traverses a linked list and performs some computational work on two members of each element that reside in two different cache lines. Each element is 192 bytes in size. The total size of all elements is larger than can fit in the L2 cache.

Example 3-54. Using DCU Hardware Prefetch

Original code:

mov ebx, DWORD PTR [First]
xor eax, eax
scan_list:
mov eax, [ebx+4]
mov ecx, 60
do_some_work_1:
add eax, eax
and eax, 6
sub ecx, 1
jnz do_some_work_1
mov eax, [ebx+64]
mov ecx, 30
do_some_work_2:
add eax, eax
and eax, 6
sub ecx, 1
jnz do_some_work_2
mov ebx, [ebx]
test ebx, ebx
jnz scan_list

Modified sequence to trigger prefetch:

mov ebx, DWORD PTR [First]
xor eax, eax
scan_list:
mov eax, [ebx+4]
mov eax, [ebx+4]
mov eax, [ebx+4]
mov ecx, 60
do_some_work_1:
add eax, eax
and eax, 6
sub ecx, 1
jnz do_some_work_1
mov eax, [ebx+64]
mov ecx, 30
do_some_work_2:
add eax, eax
and eax, 6
sub ecx, 1
jnz do_some_work_2
mov ebx, [ebx]
test ebx, ebx
jnz scan_list

In the modified sequence, the additional instructions that load data from one member can trigger the DCU hardware prefetch mechanism to prefetch data in the next cache line, allowing the work on the second member to complete sooner.

Software can gain from the first-level data cache prefetcher in two cases:

● If data is not in the second-level cache, the first-level data cache prefetcher enables early triggering of the second-level cache prefetcher.

● If data is in the second-level cache but not in the first-level data cache, then the first-level data cache prefetcher fetches consecutive cache lines into the first-level data cache earlier.

In some cases, software should pay attention to the potential side effects of unneeded DCU hardware prefetches. If a large data structure with many members spanning many cache lines is accessed in a way that only a few of its members are actually referenced, but there are multiple pair accesses to the same cache line, the DCU hardware prefetcher will fetch cache lines that are not needed. In Example 3-55, the references to "Pts" and "AuxPts" will trigger DCU prefetches of additional cache lines that are not needed. If a significant negative performance impact from DCU hardware prefetches is detected on a portion of the code, software can try to reduce the size of that contemporaneous working set to less than half the size of the L2 cache.

Example 3-55. Avoid Causing DCU Hardware Prefetch to Fetch Unneeded Cache Lines

while (CurrBond != NULL)
{
    MyATOM *a1 = CurrBond->At1;
    MyATOM *a2 = CurrBond->At2;
    if (a1->CurrStep <= a1->LastStep &&
        a2->CurrStep <= a2->LastStep)
    {
        a1->CurrStep++;
        a2->CurrStep++;

        double ux = a1->Pts[0].x - a2->Pts[0].x;
        double uy = a1->Pts[0].y - a2->Pts[0].y;
        double uz = a1->Pts[0].z - a2->Pts[0].z;
        a1->AuxPts[0].x += ux;
        a1->AuxPts[0].y += uy;
        a1->AuxPts[0].z += uz;
        a2->AuxPts[0].x += ux;
        a2->AuxPts[0].y += uy;
        a2->AuxPts[0].z += uz;
    }
    CurrBond = CurrBond->Next;
}

To fully benefit from these prefetchers, organize and access the data using one of the following methods:

Method 1:

● Organize the data so consecutive accesses can usually be found in the same 4-KByte page.

● Access the data in constant strides, forward or backward, so that the IP (instruction pointer) prefetcher can prefetch forward or backward.

Method 2:

● Organize the data in consecutive cache lines.

● Access the data in increasing addresses, in sequential cache lines.

Example 3-56 contains an access pattern over sequential cache lines that can benefit from the first-level cache prefetcher.

Example 3-56. Technique For Using L1 Hardware Prefetch

unsigned int *p1, j, a, b;
for (j = 0; j < num; j += 16)
{
    a = p1[j];
    b = p1[j+1];
    // Use these two values
}

By elevating the load operations from memory to the beginning of each iteration, it is likely that a significant part of the latency of the cache line transfer from memory to the second-level cache will proceed in parallel with the transfer of the line to the first-level cache.

The IP prefetcher uses only the lowest 8 bits of the address to distinguish a specific address. If the code size of a loop is bigger than 256 bytes, two loads may appear similar in their lowest 8 bits, and the IP prefetcher will be restricted. Therefore, if a loop is bigger than 256 bytes, make sure that no two loads have the same lowest 8 bits in their addresses in order to benefit from the IP prefetcher.

3.7.4 Hardware Prefetching for Second-Level Cache

The Intel Core microarchitecture contains two second-level cache prefetchers:

● Streamer - loads data or instructions from memory to the second-level cache. To use the streamer, organize the data or instructions in blocks of 128 bytes, aligned on 128 bytes. The first access to one of the two cache lines in this block, while it is in memory, triggers the streamer to prefetch the pair line. To software, the L2 streamer's functionality is similar to the adjacent cache line prefetch mechanism found in processors based on Intel NetBurst microarchitecture.

● Data prefetch logic (DPL) - the DPL and the L2 streamer are triggered only by the writeback memory type. They prefetch only within the page boundary (4 KBytes). Both L2 prefetchers can be triggered by software prefetch instructions and by prefetch requests from the DCU prefetcher. The DPL can also be triggered by read-for-ownership (RFO) operations. The L2 streamer can also be triggered by DPL requests for L2 cache misses.

Software can gain from organizing data to exploit both the instruction-pointer-based and the cache-line-stride prefetchers. For example, in matrix calculations, columns can be prefetched by the IP-based prefetcher, while rows can be prefetched through the DPL and the L2 streamer.
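As a rough illustration of that point, the following C sketch (a hypothetical example; N and the function name are chosen arbitrarily) walks one row of a row-major matrix sequentially, a pattern suited to the streamer and DPL, and one column at a constant stride, a pattern the IP-based prefetcher can learn. Note that both L2 prefetchers stay within 4-KByte page boundaries:

#define N 256   /* arbitrary matrix dimension for illustration */

static float sum_row_then_col(const float m[N][N], int r, int c)
{
    float s = 0.0f;
    int i, j;
    for (j = 0; j < N; j++)
        s += m[r][j];   /* sequential cache lines: streamer/DPL friendly */
    for (i = 0; i < N; i++)
        s += m[i][c];   /* constant stride of N*sizeof(float) bytes: IP prefetcher */
    return s;
}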

3.7.5 Cacheability Instructions

SSE2 provides additional cacheability instructions that extend those provided in SSE. The new cacheability instructions include:

● New streaming store instructions.

● New cache line flush instruction.

● New memory fencing instructions.

For more information, see Chapter 7.
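As one illustration of how these instructions are reached from C, the sketch below fills a 16-byte-aligned buffer with MOVNTDQ streaming stores via the SSE2 intrinsics, then issues SFENCE so the weakly-ordered stores become globally visible. stream_fill is a hypothetical helper name, not a library API; the cache line flush instruction is likewise exposed as the _mm_clflush intrinsic:

#include <stddef.h>
#include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_clflush; SSE: _mm_sfence */

/* Fill a 16-byte-aligned buffer with streaming (non-temporal) stores,
   writing to memory without polluting the cache hierarchy. */
static void stream_fill(void *dst, int value, size_t bytes)
{
    __m128i v = _mm_set1_epi32(value);
    char *d = (char *)dst;
    size_t i;
    for (i = 0; i + 16 <= bytes; i += 16)
        _mm_stream_si128((__m128i *)(d + i), v);  /* MOVNTDQ */
    _mm_sfence();  /* SFENCE: order the streaming stores */
}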

3.7.6 REP Prefix and Data Movement

The REP prefix is commonly used with string move instructions in library functions for memory, such as memcpy (using REP MOVSD) or memset (using REP STOS). These string/move instructions with a REP prefix are implemented in MS-ROM and have several implementation variants with different performance levels.

The specific variant of the implementation is chosen at execution time based on data layout, alignment, and the counter (ECX) value. For example, MOVSB/STOSB with a REP prefix should be used with a counter value less than or equal to three for best performance.

String move/store instructions have multiple data granularities. For efficient data movement, larger data granularities are preferable. This means better efficiency can be achieved by decomposing an arbitrary counter value into a number of doubleword moves plus a single-byte move with a count value less than or equal to 3, as illustrated in the sketch below.
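A minimal sketch of that decomposition, assuming GCC/Clang inline assembly on x86 (copy_rep is a hypothetical helper; a real library routine would also handle alignment and overlapping regions):

#include <stddef.h>

static void copy_rep(void *dst, const void *src, size_t n)
{
    size_t dwords = n >> 2;   /* number of 4-byte chunks */
    size_t tail   = n & 3;    /* 0..3 remaining bytes */
    __asm__ volatile("rep movsl"              /* REP MOVSD */
                     : "+D"(dst), "+S"(src), "+c"(dwords)
                     : : "memory");
    __asm__ volatile("rep movsb"              /* byte move with count <= 3 */
                     : "+D"(dst), "+S"(src), "+c"(tail)
                     : : "memory");
}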

Because software can use SIMD data movement instructions to move 16 bytes at a time, the following paragraphs discuss general guidelines for designing and implementing high-performance library functions such as memcpy(), memset(), and memmove(). Four factors are to be considered:

● Throughput per iteration - if two pieces of code have approximately the same path length, efficiency favors choosing the instruction that moves larger pieces of data per iteration. Also, smaller code size per iteration will generally reduce overhead and improve throughput. Sometimes this involves comparing the relative overhead of an iterative loop structure versus using the REP prefix for iteration.

● Address alignment - data movement instructions with the highest throughput usually have alignment restrictions, or they operate more efficiently if the destination address is aligned to its natural data size. Specifically, 16-byte moves need the destination address aligned to a 16-byte boundary, and 8-byte moves perform better if the destination address is aligned to an 8-byte boundary. Frequently, moving at doubleword granularity performs better with addresses that are 8-byte aligned.

● REP string move versus SIMD move - implementing a general-purpose memory function using SIMD extensions usually requires prologue code to ensure the availability of the SIMD instructions, plus prologue code to adjust the alignment of data movement at run time. The overhead of this prologue code must be taken into account when comparing a REP string implementation against a SIMD approach.

● Cache eviction - if the amount of data to be processed by a memory routine approaches half the size of the last-level cache, temporal locality of the cache may suffer. Using streaming store instructions (such as MOVNTQ and MOVNTDQ) can minimize the effect of flushing the cache. The threshold for starting to use a streaming store depends on the size of the last-level cache; determine it using the deterministic cache parameter leaf of CPUID, as sketched after this list.
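A hedged sketch of that CPUID query, assuming the GCC/Clang cpuid.h helper; it walks the deterministic cache parameters leaf (CPUID.EAX=04H) and returns the size of the last cache level enumerated:

#include <cpuid.h>   /* __get_cpuid_count, GCC/Clang specific */

static unsigned long long last_level_cache_bytes(void)
{
    unsigned long long size = 0;
    unsigned i, a, b, c, d;
    for (i = 0; ; i++) {
        if (!__get_cpuid_count(4, i, &a, &b, &c, &d) || (a & 0x1f) == 0)
            break;   /* cache type 0 means no more cache levels */
        /* size = ways * partitions * line size * sets */
        size = (unsigned long long)(((b >> 22) & 0x3ff) + 1)
             * (((b >> 12) & 0x3ff) + 1)
             * ((b & 0xfff) + 1)
             * ((unsigned long long)c + 1);
    }
    return size;   /* last subleaf enumerated is the last-level cache */
}

A streaming-store threshold would then be, for example, half of this value.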

When implementing a memset()-type library function using streaming stores, one must also consider that the application can benefit from this technique only if it does not need to reference the target addresses immediately afterward. This assumption is easily upheld when testing a streaming-store implementation on a micro-benchmark configuration, but violated in a full-scale application.

When applying these general heuristics to the design of general-purpose, high-performance library routines, the following guidelines are useful when optimizing for an arbitrary counter value N and address alignment. Different techniques may be necessary for optimal performance, depending on the magnitude of N:

● When N is less than some small count (where the small count threshold varies between microarchitectures; empirically, 8 may be a good value when optimizing for Intel NetBurst microarchitecture), each case can be coded directly without the overhead of a loop. For example, 11 bytes can be handled using two MOVSD instructions issued explicitly and one MOVSB with a REP counter of 3.

● When N is not small but still less than some threshold value (which may vary for different microarchitectures, but can be determined empirically), a SIMD implementation using run-time CPUID checks and an alignment prologue will likely deliver less throughput due to the overhead of the prologue. A REP string implementation should favor a REP string of doublewords. To improve address alignment, a small prologue of MOVSB/STOSB with a count less than 4 can be used to peel off the non-aligned data moves before starting to use MOVSD/STOSD.

● When N is less than half the size of the last-level cache, throughput considerations may favor either of the following:

-- An approach using a REP string with the largest data granularity, because a REP string has little overhead for loop iteration, and the branch misprediction overhead in the prologue/epilogue code that handles address alignment is amortized over many iterations.

-- An iterative approach using the instruction with the largest data granularity, where the overhead for SIMD feature detection, iteration overhead, and the prologue/epilogue for alignment control can be minimized. The trade-off between these two approaches may depend on the microarchitecture.

Example 3-57 shows a memset() implementation using STOSD for an arbitrary count value, with the destination address aligned to a doubleword boundary, in 32-bit mode.

● When N is larger than half the size of the last-level cache, using 16-byte-granularity streaming stores with prologue and epilogue code for address alignment will likely be more efficient, if the destination addresses will not be referenced immediately afterward.

Example 3-57. REP STOSD with Arbitrary Count Size and 4-Byte-Aligned Destination

A C code example of memset():

void memset(void *dst, int c, size_t size)
{
    char *d = (char *)dst;
    size_t i;
    for (i = 0; i < size; i++)
        *d++ = (char)c;
}

Equivalent implementation using REP STOSD:

push edi
movzx eax, byte ptr [esp+12]
mov ecx, eax
shl ecx, 8
or ecx, eax          ; replicate the byte into a word
mov eax, ecx
shl ecx, 16
or eax, ecx          ; replicate the word into a doubleword
mov edi, [esp+8]     ; 4-byte aligned destination
mov ecx, [esp+16]    ; byte count
shr ecx, 2           ; doubleword count
cmp ecx, 1
jle _main
test edi, 4
jz _main
stosd                ; peel off one doubleword
dec ecx
_main:               ; 8-byte aligned
rep stosd
mov ecx, [esp+16]
and ecx, 3           ; remaining count <= 3
rep stosb            ; optimal with count <= 3
pop edi
ret

Memory routines in the runtime library generated by Intel compilers are optimized across a wide range of address alignments, counter values, and microarchitectures. In most cases, applications should take advantage of the default memory routines provided by Intel compilers.

In some situations, the byte count of the data is known by the context (rather than by a parameter passed in from a call), and one can take a simpler approach than a general-purpose library routine. For example, if the byte count is also small, using REP MOVSB/STOSB with a count less than four can ensure good address alignment and loop unrolling to finish the remaining data; using MOVSD/STOSD can reduce the overhead associated with iteration.

Using a REP prefix with string move instructions can provide high performance in the situations described above. However, using a REP prefix with string scan instructions (SCASB, SCASW, SCASD, SCASQ) or compare instructions (CMPSB, CMPSW, CMPSD, CMPSQ) is not recommended for high performance. Consider using SIMD instructions instead.

3.7.7 Enhanced REP MOVSB and STOSB Operation (ERMSB)

Beginning with processors based on Intel microarchitecture code name Ivy Bridge, REP string operations using MOVSB and STOSB can provide both flexible and high-performance REP string operations for common software tasks like memory copy and set operations. Processors that provide enhanced MOVSB/STOSB operation are enumerated by the CPUID feature flag: CPUID:(EAX=7H, ECX=0H):EBX.ERMSB[bit 9] = 1.
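Checking that flag from C might look like the following sketch (assuming the GCC/Clang cpuid.h helper):

#include <cpuid.h>   /* __get_cpuid_count, GCC/Clang specific */

/* Returns nonzero if CPUID.(EAX=07H, ECX=0H):EBX.ERMSB[bit 9] is set. */
static int has_ermsb(void)
{
    unsigned a, b, c, d;
    if (!__get_cpuid_count(7, 0, &a, &b, &c, &d))
        return 0;
    return (b >> 9) & 1;
}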

3.7.7.1 Memcpy Considerations

The interface of the standard library function memcpy introduces several factors (such as the length and the alignment of the source and destination buffers) that interact with the microarchitecture to determine the performance characteristics of a library implementation. The two common approaches to implementing memcpy are driven by small code size versus maximum throughput. The former generally uses REP MOVSD+B (see Section 3.7.6), while the latter uses SIMD instruction sets and has to deal with additional data alignment restrictions.

For processors supporting enhanced REP MOVSB/STOSB, implementing memcpy with REP MOVSB provides even more compact code size and better throughput than using the combination of REP MOVSD+B. For Intel microarchitecture code name Ivy Bridge, implementing memcpy using ERMSB may not attain the same throughput as 256-bit or 128-bit AVX alternatives, depending on length and alignment factors.

Figure 3-4 depicts the relative performance of memcpy implemented using ERMSB versus REP MOVSD+B on a third-generation Intel Core processor, when both the source and destination addresses are aligned on a 16-byte boundary and the source region does not overlap the destination region. Using ERMSB always delivers better performance than REP MOVSD+B. If the length is a multiple of 64, it can produce even higher performance; for example, copying 65-128 bytes takes 40 cycles, while copying 128 bytes needs only 35 cycles.

If an application wishes to bypass the standard memcpy library with its own custom implementation, and has the freedom to manage the buffer length allocation for both the source and destination, it may be worthwhile to make the length of its memory copy operations a multiple of 64, to take advantage of the code size and performance benefits of ERMSB.
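As a trivial sketch of that guideline (round_up_64 is a hypothetical helper; the application must allocate both buffers with enough padding for the rounded length):

#include <stddef.h>

/* Round a copy length up to the next multiple of 64 bytes. */
static size_t round_up_64(size_t n)
{
    return (n + 63) & ~(size_t)63;
}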

The performance characteristics of implementing a general-purpose memcpy library function using SIMD registers vary more than those of an equivalent implementation using general-purpose registers, depending on the length, the instruction set selection among SSE2, 128-bit AVX, and 256-bit AVX, the relative source/destination alignment, and the memory address alignment granularity/boundary.

Therefore, comparing memcpy performance characteristics between ERMSB and SIMD is highly dependent on the particular SIMD implementation. The remainder of this section discusses the relative performance of memcpy using ERMSB versus an unpublished, optimized 128-bit AVX implementation, to illustrate the hardware capability of Intel microarchitecture code name Ivy Bridge.

Table 3-3 shows the relative performance of memcpy using enhanced REP MOVSB versus a 128-bit AVX implementation, when both the source and destination addresses are 16-byte aligned and the source and destination regions do not overlap. For memcpy of lengths less than 128 bytes, using ERMSB is slower than 128-bit AVX, because of the internal start-up overhead in the REP string.

With address misalignment, memcpy performance is generally reduced relative to the 16-byte-aligned scenario (see Table 3-4).

3.7.7.2 Memmove Considerations

When there is overlap between the source and destination regions, software may need to use memmove instead of memcpy to ensure correctness. In a memmove() implementation, it is possible to use REP MOVSB in conjunction with the direction flag (DF) to handle situations where the latter part of the source region overlaps with the beginning of the destination region. However, setting DF to force REP MOVSB to copy bytes from high towards low addresses will cause severe performance degradation.

When using ERMSB to implement memmove, one can detect the above situation and handle first the rear chunks of the source region that would otherwise be overwritten, copying each chunk to the non-overlapping portion of the destination using REP MOVSB with DF = 0. After the overlapping rear chunks are copied, the rest of the source region can also be processed with DF = 0, as sketched below.
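The chunking idea can be sketched in C as follows; memcpy stands in for a forward (DF = 0) REP MOVSB copy, BLK is an arbitrary chunk size, and move_overlapped is a hypothetical helper, not the manual's code:

#include <string.h>
#include <stddef.h>

#define BLK 4096   /* arbitrary chunk size */

static void move_overlapped(char *dst, const char *src, size_t n)
{
    size_t dist, step, chunk;
    if (dst <= src || dst >= src + n) {
        /* No forward overlap: a plain forward copy is correct. With a
           real forward REP MOVSB this is also safe when dst < src
           overlaps, where memcpy's behavior would be undefined. */
        memcpy(dst, src, n);
        return;
    }
    dist = (size_t)(dst - src);          /* overlap distance */
    step = dist < BLK ? dist : BLK;      /* chunk <= dist: no intra-chunk overlap */
    while (n > 0) {                      /* handle the rear chunks first */
        chunk = n < step ? n : step;
        n -= chunk;
        memcpy(dst + n, src + n, chunk); /* each chunk copied forward (DF = 0) */
    }
}

Because each rear chunk is written only over source bytes that have already been copied, correctness is preserved while every underlying copy runs forward.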

3.7.7.3 Memset Considerations

The considerations of code size and throughput also apply to memset() implementations. For processors supporting ERMSB, using REP STOSB again gives a more compact code size and significantly better performance than the combination of STOSD+B described in Section 3.7.6.

When the destination buffer is 16-byte aligned, memset() using ERMSB can perform better than SIMD approaches. When the destination buffer is misaligned, memset() performance using ERMSB can degrade about 20% relative to the aligned case, on processors based on Intel microarchitecture code name Ivy Bridge. In contrast, SIMD implementations of memset() experience a smaller degradation when the destination is misaligned.
