7.4 Prefetch
This section discusses the mechanics of the software prefetch instructions. In general, software prefetch instructions should be used to supplement the practice of tuning an access pattern to suit the automatic hardware prefetch mechanism.
7.4.1 Software data prefetch
The prefetch instruction hides the latency of data access in performance-critical sections of application code by allowing data to be fetched before it is actually used. Prefetch instructions do not change the user-visible semantics of a program, although they may affect its performance. A prefetch only provides a hint to the hardware and generally does not generate exceptions or faults.
Prefetch loads either non-temporal data or temporal data into the specified cache level. Both the data access type and the cache level are specified as hints. Depending on the implementation, the instruction fetches 32 or more aligned bytes (including the specified address byte) into the cache level specified by the instruction.
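The "aligned bytes" wording above means that a prefetch of any address brings in the whole aligned block containing it. The following is a minimal sketch of that arithmetic. The 64-byte block size is an assumption for illustration (the text only states "32 or more"), and the helper names `block_base` and `blocks_touched` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed block size in bytes; the text only guarantees "32 or more".
   It must be a power of two for the masking below to work. */
#define ASSUMED_LINE_BYTES 64u

/* Start of the aligned block that contains addr. */
static uintptr_t block_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(ASSUMED_LINE_BYTES - 1u);
}

/* Number of aligned blocks overlapped by the byte range [addr, addr + len). */
static size_t blocks_touched(uintptr_t addr, size_t len)
{
    if (len == 0)
        return 0;
    uintptr_t first = block_base(addr);
    uintptr_t last  = block_base(addr + len - 1u);
    return (size_t)((last - first) / ASSUMED_LINE_BYTES) + 1u;
}
```

Two addresses with the same `block_base` are covered by a single prefetch, so issuing one prefetch per block rather than one per element avoids redundant requests.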
Prefetch is implementation-specific; applications need to be tuned for each implementation to maximize performance.
Note: Using the prefetch instruction is recommended only when the data does not fit in the cache. Software prefetch should be restricted to memory addresses that are managed or owned within the context of the application. Prefetching addresses that are not mapped to physical pages can incur a non-deterministic performance penalty. For example, specifying a null pointer (0L) as a prefetch address can cause a long delay.
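One way to follow the note above is to guard each prefetch so that it is only issued for an address the program actually owns. The sketch below assumes an x86 compiler that provides `_mm_prefetch` in `<xmmintrin.h>`; the `struct node` type and `sum_list` function are hypothetical and exist only for this illustration.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Hypothetical node type used only for this illustration. */
struct node {
    struct node *next;
    long value;
};

/* Sum a list, prefetching the next node only when it exists, so that
   a null pointer is never passed to the prefetch. */
static long sum_list(const struct node *head)
{
    long total = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        if (n->next != NULL)
            _mm_prefetch((const char *)n->next, _MM_HINT_T0);
        total += n->value;
    }
    return total;
}
```

Prefetching only one node ahead leaves little time to hide latency; the point here is the guard, not the prefetch distance.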
Prefetch provides a hint to the hardware; it does not generate exceptions or faults except in a few special cases (see section 7.4.3). However, excessive use of prefetch instructions may waste memory bandwidth and result in a performance penalty due to resource constraints.
Nevertheless, prefetch can reduce the overhead of memory transactions by preventing cache pollution and by using caches and memory efficiently. This is especially important for applications that share critical system resources, such as the memory bus. See an example in section 7.7.2.1.
Prefetch is mainly designed to improve application performance by hiding memory latency in the background. If segments of an application access data in a predictable way (for example, using arrays with known strides), they are good candidates for using prefetch to improve performance.
Use prefetch instructions in the following situations:
● Predictable memory access patterns
● Time-consuming innermost loops
● Places where the execution pipeline may stall if the data is not available
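The situations listed above often coincide in a loop that walks an array with a known stride. The following is a minimal sketch, assuming an x86 compiler that provides `_mm_prefetch` in `<xmmintrin.h>`. The function name `sum_strided` and the distance `PREFETCH_AHEAD` are hypothetical; the right distance depends on the implementation and has to be tuned by measurement.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Assumed tuning value: how many loop iterations ahead to prefetch. */
#define PREFETCH_AHEAD 64

/* Sum every stride-th element of a[0..n), requesting future elements early. */
static double sum_strided(const double *a, size_t n, size_t stride)
{
    double total = 0.0;
    if (stride == 0)
        return total;  /* a zero stride would never advance */
    for (size_t i = 0; i < n; i += stride) {
        size_t ahead = i + PREFETCH_AHEAD * stride;
        if (ahead < n)  /* stay inside the array the caller owns */
            _mm_prefetch((const char *)&a[ahead], _MM_HINT_T0);
        total += a[i];
    }
    return total;
}
```

With a stride of 1 this issues a prefetch on every iteration, which requests the same block several times; unrolling the loop so that one prefetch covers one block is a common refinement.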
7.4.2 Prefetch instructions: implementation on the Pentium 4 processor
The Streaming SIMD Extensions include four variants of the prefetch instruction, one non-temporal and three temporal. They correspond to two types of operations: temporal and non-temporal.
Note: At the time of a prefetch, if the data is already in a cache level that is closer to the processor than the cache level specified by the instruction, no data movement occurs.
The non-temporal instruction is:
● PREFETCHNTA -- fetches data into the second-level cache, minimizing cache pollution.
The temporal instructions are:
● PREFETCHT0 -- fetches data into all cache levels. On the Pentium 4 processor, that is the second-level cache.
● PREFETCHT1 -- this instruction is identical to PREFETCHT0.
● PREFETCHT2 -- this instruction is identical to PREFETCHT0.
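From C, these variants are usually reached through the `_mm_prefetch` intrinsic, whose second argument selects the hint. The sketch below assumes an x86 compiler with `<xmmintrin.h>`; the function names and the `LINE_BYTES` and `AHEAD_BYTES` values are hypothetical and chosen only for illustration.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch and the _MM_HINT_* constants */

/* Intrinsic hint    Instruction     Kind
   _MM_HINT_NTA      PREFETCHNTA     non-temporal
   _MM_HINT_T0       PREFETCHT0      temporal
   _MM_HINT_T1       PREFETCHT1      temporal
   _MM_HINT_T2       PREFETCHT2      temporal */

#define LINE_BYTES   64u   /* assumed block size */
#define AHEAD_BYTES 512u   /* assumed prefetch distance */

/* Data that is read once: request it with the non-temporal hint. */
static unsigned long sum_read_once(const unsigned char *buf, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++) {
        /* One request per block, and only for addresses inside buf. */
        if (i % LINE_BYTES == 0 && i + AHEAD_BYTES < n)
            _mm_prefetch((const char *)buf + i + AHEAD_BYTES, _MM_HINT_NTA);
        total += buf[i];
    }
    return total;
}

/* Data that will be reused soon: request it with a temporal hint. */
static void warm_for_reuse(const unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; i += LINE_BYTES)
        _mm_prefetch((const char *)buf + i, _MM_HINT_T0);
}
```

Because the three temporal variants behave identically on the Pentium 4 processor according to the list above, the choice between `_MM_HINT_T0`, `_MM_HINT_T1`, and `_MM_HINT_T2` only matters on other implementations.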
7.4.3 Prefetch and load instructions
The Pentium 4 processor has a decoupled execution and memory architecture that allows instructions to execute independently of memory accesses (if there are no data or resource dependencies). A program or compiler can use dummy load instructions to imitate the prefetch function. However, preloading is not exactly equivalent to using prefetch instructions. Prefetch provides better performance than preloading.
Currently, prefetch provides better performance than preloading because a prefetch:
● Has no destination register; it only updates cache lines.
● Does not stall normal instruction retirement.
● Does not affect the functional behavior of the program.
● Has no cache line split accesses.
● Does not cause exceptions except when the LOCK prefix is used. The LOCK prefix is not a valid prefix for use with prefetch.
● Does not complete its own execution if doing so would cause a fault.
Currently, the advantages of prefetch over preloading instructions are processor-specific. This may change in the future.
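The difference between the two approaches can be sketched in C as follows, assuming an x86 compiler that provides `_mm_prefetch` in `<xmmintrin.h>`. The function names `preload` and `prefetch_hint` are hypothetical.

```c
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Preload: a dummy load. It is a real memory read, so it needs a
   destination register and faults if the address is invalid. */
static void preload(const int *p)
{
    const volatile int *vp = p;  /* volatile keeps the compiler from dropping the read */
    (void)*vp;
}

/* Prefetch: a hint with no destination register. */
static void prefetch_hint(const int *p)
{
    _mm_prefetch((const char *)p, _MM_HINT_T0);
}
```

Neither function changes the value stored at `p`; they differ only in how the request is made to the memory system.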
In the following cases a prefetch does not fetch the data:
● The prefetch causes a DTLB (data translation lookaside buffer) miss. This applies to Pentium 4 processors with a CPUID signature corresponding to family 15, model 0, 1, or 2. On Pentium 4 processors with a CPUID signature corresponding to family 15, model 3, prefetch resolves the DTLB miss and fetches the data.
● An access to the specified address would cause a fault or exception.
● The memory subsystem runs out of request buffers between the first-level cache and the second-level cache.
● The prefetch targets an uncacheable memory region (such as USWC or UC).
● The LOCK prefix is used. This causes an invalid opcode exception.
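The family and model numbers in the first case above come from the processor signature that CPUID leaf 1 returns in EAX. The following sketch decodes that value and applies the rule stated above. The function names are hypothetical, and the result -1 marks processors that this section does not describe.

```c
#include <stdint.h>

/* Display family from the CPUID leaf 1 EAX value. */
static unsigned cpu_family(uint32_t eax)
{
    unsigned base = (eax >> 8) & 0xFu;
    unsigned ext  = (eax >> 20) & 0xFFu;
    return base == 0xFu ? base + ext : base;
}

/* Display model from the CPUID leaf 1 EAX value. */
static unsigned cpu_model(uint32_t eax)
{
    unsigned base_family = (eax >> 8) & 0xFu;
    unsigned base = (eax >> 4) & 0xFu;
    unsigned ext  = (eax >> 16) & 0xFu;
    if (base_family == 0x6u || base_family == 0xFu)
        return (ext << 4) | base;
    return base;
}

/*  1: prefetch is dropped on a DTLB miss (family 15, model 0, 1, or 2)
    0: prefetch resolves the DTLB miss    (family 15, model 3)
   -1: processor not covered by the text */
static int prefetch_dropped_on_dtlb_miss(uint32_t eax)
{
    if (cpu_family(eax) != 15u)
        return -1;
    unsigned model = cpu_model(eax);
    if (model <= 2u)
        return 1;
    if (model == 3u)
        return 0;
    return -1;
}
```

With GCC or Clang, the EAX value can typically be obtained through `__get_cpuid` from `<cpuid.h>`; that call is left out here so the decoding logic stays independent of the machine it runs on.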