Intel System Programming Guide, Chapter 11: Methods of Caching Available

Last Update:2018-12-03 Source: Internet

Author: User

Tags prefetch intel core 2 duo

The processor allows any region of system memory to be cached in the L1, L2, and L3 caches. For individual pages or regions of system memory, it allows the type of caching (also called the memory type) to be specified (see Section 11.5). The memory types currently defined for the Intel 64 and IA-32 architectures are (see Table 11-2):

1. Strong Uncacheable (UC) -- System memory locations are not cached. All reads and writes appear on the system bus and are executed in program order without reordering. No speculative memory accesses, page-table walks, or prefetches of speculated branch targets are made. This type of cache control is useful for memory-mapped I/O devices. When used with normal RAM, it greatly reduces processor performance.

Note: The behavior of floating-point (FP) and SSE/SSE2 operations on operands in UC memory is implementation dependent. In some implementations, accesses to UC memory may occur more than once. To ensure predictable behavior, use loads and stores of general-purpose registers to access UC memory that may have read or write side effects.

2. Uncacheable (UC-) -- Has the same characteristics as the strong uncacheable (UC) memory type, except that this memory type can be overridden by programming the MTRRs for the WC memory type. This memory type is available in processor families starting with the Pentium III processors and can only be selected through the PAT.

3. Write Combining (WC) -- System memory locations are not cached (as with uncacheable memory) and coherency is not enforced by the processor's bus coherency protocol. Speculative reads are allowed. Writes may be delayed and combined in the write combining buffer (WC buffer) to reduce memory accesses. If the WC buffer is partially filled, the writes may be delayed until the next serializing event occurs, such as an SFENCE or MFENCE instruction, CPUID execution, a read or write to uncached memory, an interrupt, or execution of a LOCK instruction. This type of cache control is appropriate for video frame buffers, where the order of writes is unimportant as long as the writes update memory so that they can be seen on the graphics display. See Section 11.3.1. This memory type is available in the Pentium Pro and Pentium II processors by programming the MTRRs, or in processor families starting with the Pentium III processors by programming the MTRRs or by selecting it through the PAT.

4. Write-through (WT) -- Reads from and writes to system memory are cached. Reads come from cache lines on cache hits; read misses cause cache fills. Speculative reads are allowed. All writes are written to a cache line (when possible) and through to system memory. When writing through to memory, invalid cache lines are never filled, and valid cache lines are either filled or invalidated. Write combining is allowed. This type of cache control is appropriate for frame buffers, or when there are devices on the system bus that access system memory but do not snoop memory accesses. It enforces coherency between the caches in the processors and system memory.

5. Write-back (WB) -- Reads from and writes to system memory are cached. Reads come from cache lines on cache hits; read misses cause cache fills. Speculative reads are allowed. Write misses cause cache line fills (in processor families starting with the P6 family), and writes are performed entirely in the cache when possible. Write combining is allowed. The write-back memory type reduces bus traffic by eliminating many unnecessary writes to system memory. Writes to a cache line are not immediately forwarded to system memory; instead, they are accumulated in the cache. The modified cache lines are written to system memory later, when a write-back operation is performed. Write-back operations are triggered when cache lines need to be deallocated, for example when new cache lines are being allocated in a cache that is already full. They are also triggered by the mechanisms used to maintain cache consistency. This type of cache control provides the best performance, but it requires that all devices that access system memory on the system bus be able to snoop memory accesses to ensure system memory and cache coherency.

6. Write Protected (WP) -- Reads come from cache lines when possible, and read misses cause cache fills. Writes are propagated to the system bus and cause the corresponding cache lines on all processors on the bus to be invalidated. (Translator's note: for example, suppose the data at address 0x1000 is held in the L1 caches of both processor 0 and processor 1. When one of the processors performs a WP write to this address, the cache lines for address 0x1000 in both processors are invalidated.) Speculative reads are allowed. This memory type is available in processor families starting with the P6 family processors by programming the MTRRs (see Table 11-6).
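The six memory types can be written down as a small lookup, sketched below. The numeric encodings do not appear in the text above; they are the values used by the MTRR and PAT encoding tables in the Intel manual. The identifiers (`mem_type`, `mem_type_name`, and so on) are made up for this illustration.

```c
#include <stdbool.h>

/* Memory type encodings as used by the MTRRs and the PAT.
   Encodings 0x02 and 0x03 are reserved. */
enum mem_type {
    MEM_UC       = 0x00, /* strong uncacheable */
    MEM_WC       = 0x01, /* write combining */
    MEM_WT       = 0x04, /* write-through */
    MEM_WP       = 0x05, /* write protected */
    MEM_WB       = 0x06, /* write-back */
    MEM_UC_MINUS = 0x07  /* uncacheable (UC-), selectable through the PAT only */
};

/* Short name for an encoding, or "reserved" for unknown values. */
const char *mem_type_name(unsigned encoding)
{
    switch (encoding) {
    case MEM_UC:       return "UC";
    case MEM_WC:       return "WC";
    case MEM_WT:       return "WT";
    case MEM_WP:       return "WP";
    case MEM_WB:       return "WB";
    case MEM_UC_MINUS: return "UC-";
    default:           return "reserved";
    }
}

/* True for the types whose reads are served from cache lines. */
bool mem_type_caches_reads(enum mem_type t)
{
    return t == MEM_WT || t == MEM_WP || t == MEM_WB;
}

/* UC- is the only type in the list that the MTRRs cannot select. */
bool mem_type_selectable_by_mtrr(enum mem_type t)
{
    return t != MEM_UC_MINUS;
}
```

For example, `mem_type_name(0x06)` yields "WB", and `mem_type_selectable_by_mtrr(MEM_UC_MINUS)` is false because UC- exists only as a PAT entry.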

11.3.1 Buffering of Write Combining Memory Locations

Writes to the WC memory type are not cached in the usual sense of the word. They are held in an internal write combining buffer (WC buffer) that is separate from the internal L1, L2, and L3 caches and from the store buffer. The WC buffer is not snooped and therefore does not provide data coherency. Writes to WC memory are buffered to give software a short window of time in which to supply more modified data to the WC buffer, while remaining as non-intrusive to software as possible. Buffering writes to WC memory also causes data to be collapsed: multiple writes to the same memory location leave only the last data written in that location, and the other writes are lost.
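The collapsing behavior can be pictured with a toy software model of a single 64-byte write-combining (WC) buffer. This is only an illustration of the idea; the real buffer is a hardware structure whose organization is not architecturally defined, and every name here (`wc_buf`, `wc_buf_store`, and so on) is hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define WC_BUF_BYTES 64u   /* assumed buffer size for this model */

/* Toy model of one WC buffer: data bytes plus a per-byte valid mask. */
struct wc_buf {
    uint8_t  data[WC_BUF_BYTES];
    uint64_t valid;    /* bit i set: byte i has been written */
    unsigned stores;   /* how many stores software issued */
};

void wc_buf_init(struct wc_buf *b)
{
    memset(b, 0, sizeof *b);
}

/* Record a one-byte store. A second store to the same offset simply
   overwrites the first, which is the "collapsing" described above.
   Returns 0 on success, -1 if the offset is outside the buffer. */
int wc_buf_store(struct wc_buf *b, unsigned off, uint8_t value)
{
    if (off >= WC_BUF_BYTES)
        return -1;
    b->data[off] = value;
    b->valid |= UINT64_C(1) << off;
    b->stores++;
    return 0;
}

/* Number of distinct bytes that would eventually reach memory. */
unsigned wc_buf_valid_bytes(const struct wc_buf *b)
{
    unsigned n = 0;
    for (uint64_t m = b->valid; m != 0; m >>= 1)
        n += (unsigned)(m & 1u);
    return n;
}
```

Storing 0x11 and then 0x22 to offset 3 counts as two stores, but only one valid byte (holding 0x22) remains in the buffer; the first value never reaches memory.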

The size and structure of the WC buffer are not architecturally defined. For the Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, and Intel Xeon processors, the WC buffer is made up of several 64-byte WC buffers. For the P6 family processors, the WC buffer is made up of several 32-byte WC buffers.

When software begins writing to WC memory, the processor begins filling the WC buffers, one at a time. When one or more WC buffers have been filled, the processor may choose to evict the buffers to system memory. The protocol for evicting the WC buffers is implementation dependent, and software should not rely on it for system memory coherency. When using the WC memory type, software must be sensitive to the fact that the writing of data to system memory is being delayed, and must deliberately empty the WC buffers (with SFENCE or MFENCE) when system memory coherency is required.
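A minimal user-space sketch of the "write, then deliberately drain" pattern follows. Ordinary programs cannot change a page's memory type, so this sketch uses SSE2 non-temporal stores, which Intel describes as following the write-combining protocol, and then issues SFENCE. The function name `wc_style_fill` is invented for the example, and the plain-store fallback is there only so that the code still builds where SSE2 is unavailable.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si32, _mm_sfence */
#endif

/* Fill dst[0..n-1] with value using non-temporal stores, then fence so
   that the buffered writes are drained before the function returns. */
void wc_style_fill(int32_t *dst, size_t n, int32_t value)
{
#if defined(__SSE2__)
    for (size_t i = 0; i < n; i++)
        _mm_stream_si32((int *)&dst[i], value);  /* may sit in a WC buffer */
    _mm_sfence();                                /* drain pending WC writes */
#else
    /* Fallback without SSE2: ordinary stores, no WC behavior involved. */
    for (size_t i = 0; i < n; i++)
        dst[i] = value;
#endif
}
```

Without the fence, another agent reading the destination could see stale data for an implementation-dependent length of time, which is the delay the paragraph above warns about.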

Once the processor has started to evict data from the WC buffer into system memory, it decides which type of bus transaction to use based on how much of the buffer contains valid data. If the buffer is full (that is, all bytes are valid), the processor executes a burst-write transaction on the bus. This causes all 32 bytes (P6 family processors) or 64 bytes (Pentium 4 and more recent processors) to be transmitted on the data bus in a single burst transaction. If one or more of the WC buffer's bytes are invalid (for example, have not been written by software), the processor transmits the data to memory using "partial write" transactions, one chunk at a time, where a "chunk" is 8 bytes.

This results in a maximum of four partial write transactions (for P6 family processors) or eight partial write transactions (for the Pentium 4 and more recent processors) for one WC buffer of data sent to memory.
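The transaction counts above can be reproduced with a short calculation. The sketch below assumes that a partially valid buffer costs one partial write per 8-byte chunk that contains at least one valid byte; the text only states the maximum, so this per-chunk rule is a simplifying assumption, and the names (`wc_eviction`, `wc_plan_eviction`) are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define WC_CHUNK_BYTES 8u

struct wc_eviction {
    bool     burst;         /* true: one burst write of the whole buffer */
    unsigned transactions;  /* number of bus write transactions */
};

/* valid_mask: bit i set means byte i of the buffer holds written data.
   buf_bytes:  32 (P6 family) or 64 (Pentium 4 and later).
   Any other buffer size yields an empty plan. */
struct wc_eviction wc_plan_eviction(uint64_t valid_mask, unsigned buf_bytes)
{
    struct wc_eviction plan = { false, 0 };

    if (buf_bytes != 32 && buf_bytes != 64)
        return plan;

    uint64_t full = (buf_bytes == 64) ? UINT64_MAX : UINT64_C(0xFFFFFFFF);
    valid_mask &= full;

    if (valid_mask == full) {        /* every byte valid: single burst */
        plan.burst = true;
        plan.transactions = 1;
        return plan;
    }

    /* Otherwise count the chunks that contain any valid byte. */
    for (unsigned off = 0; off < buf_bytes; off += WC_CHUNK_BYTES)
        if ((valid_mask >> off) & 0xFFu)
            plan.transactions++;

    return plan;
}
```

A 64-byte buffer with a single byte missing gives eight partial writes instead of one burst, which is why code that writes WC memory usually tries to fill whole buffers.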

The WC memory type is weakly ordered by definition. Once the eviction of a WC buffer has started, the data is subject to the weak ordering semantics of its definition. Ordering is not maintained between the successive allocation and deallocation of WC buffers (for example, writes to WC buffer 1 followed by writes to WC buffer 2 may appear as buffer 2 followed by buffer 1 on the system bus). When a WC buffer is evicted to memory as partial writes, there is no guaranteed ordering between successive partial writes (for example, a partial write for chunk 2 may appear on the bus before or after the partial write for chunk 1).

The only elements of WC propagation to the system bus that are guaranteed are those provided by transaction atomicity. For example, on a P6 family processor, a completely full WC buffer is always propagated as a single 32-byte burst transaction (the Intel text reads "32-bit"; 32 bytes is evidently intended), using any chunk order. In a WC buffer eviction where the data is evicted as partial writes, all data contained in the same chunk (aligned on an 8-byte boundary) is propagated simultaneously. Likewise, for more recent processors, starting with those based on the Intel NetBurst microarchitecture, a full WC buffer is always propagated as a single burst transaction, using any chunk order within the transaction. For partial buffer propagations, all data contained in the same chunk is propagated simultaneously.

11.3.2 Choosing a Memory Type

The simplest system memory model does not use memory-mapped I/O with read or write side effects, does not include a frame buffer, and uses the write-back memory type for all memory. An I/O agent can perform direct memory access (DMA) to write-back memory, and the cache protocol maintains cache coherency. (Translator's note: in the broad sense, a "side effect" means that an operation does not give consistent results; for example, the input is the same each time, yet the output of the operation can differ.)

A system can use strong uncacheable memory for other memory-mapped I/O, and should always use strong uncacheable memory for memory-mapped I/O with read side effects. (Translator's note: if memory-mapped I/O used write-through or write-back caching, a read that hit in the cache would take its data directly from the cache instead of going to the memory-mapped I/O device, which breaks devices whose reads have side effects.)

Dual-ported memory can be considered a write side effect, which makes relatively prompt writes desirable, because those writes cannot be observed at the other port until they reach the memory agent. A system can use strong uncacheable, uncacheable, write-through, or write-combining memory for frame buffers or for dual-ported memory that contains the pixel values displayed on a screen. Frame buffer memory is typically large (a few megabytes) and is usually written by the processor more often than it is read. Using strong uncacheable memory for a frame buffer generates a very large amount of bus traffic, because operations on the entire buffer are implemented using partial writes rather than line writes. Using write-through memory for a frame buffer can displace almost all other useful cache lines in the processor's L2 and L3 caches and L1 data cache. Therefore, systems should use write-combining memory for frame buffers whenever possible.

Software can use page-level cache control to assign appropriate effective memory types when it will not access data structures in ways that benefit from write-back caching. For example, software may read a large data structure once and not access it again until the structure has been rewritten by another agent. Such a large data structure should be marked as uncacheable; otherwise, reading it will evict cache lines that the processor is going to reference again.

A similar example is a write-only data structure that software writes to (to export data to another agent) but never reads. Such a structure can be marked as uncacheable, because software never reads the values that it writes (although as uncacheable memory it will be written using partial writes, whereas as write-back memory it would be written using line writes, which might not occur until the other agent reads the structure and triggers implicit write-backs).
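The recommendations in this section so far can be condensed into a small decision function. This is a reading aid, not a policy taken from Intel: the usage categories and all identifiers are invented for the sketch, and where the text says only "uncacheable" the sketch picks UC.

```c
/* Memory types relevant to the choices discussed in this section. */
enum mt_choice { MT_UC, MT_WC, MT_WT, MT_WB };

/* Hypothetical usage categories, one per case discussed above. */
enum region_use {
    USE_GENERAL_RAM,            /* ordinary memory, DMA kept coherent by snooping */
    USE_MMIO_READ_SIDE_EFFECTS, /* device registers whose reads change state */
    USE_MMIO_OTHER,             /* other memory-mapped I/O */
    USE_FRAME_BUFFER,           /* large, written far more often than read */
    USE_READ_ONCE_BULK,         /* big structure read once, then rewritten by another agent */
    USE_WRITE_ONLY_EXPORT       /* written for another agent, never read back */
};

/* Suggest a memory type for a region, following the text's advice. */
enum mt_choice suggest_mem_type(enum region_use use)
{
    switch (use) {
    case USE_MMIO_READ_SIDE_EFFECTS:
    case USE_MMIO_OTHER:
        return MT_UC;   /* strong uncacheable for memory-mapped I/O */
    case USE_FRAME_BUFFER:
        return MT_WC;   /* write combining whenever possible */
    case USE_READ_ONCE_BULK:
    case USE_WRITE_ONLY_EXPORT:
        return MT_UC;   /* avoid evicting lines that will be reused */
    case USE_GENERAL_RAM:
    default:
        return MT_WB;   /* best performance for everything else */
    }
}
```

A frame buffer could also legally be UC or WT; the function returns WC because the text names it as the preferred choice.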

On the Pentium III, Pentium 4, and more recent processors, new instructions are provided that give software greater control over the caching, prefetching, and write-back characteristics of data. These instructions allow software to use weakly ordered or processor-ordered memory types to improve processor performance, and to force strong ordering on memory reads or writes when necessary. They also give software greater control over the caching of data. For a description of these instructions and their intended use, see Section 11.5.5.
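As one illustration of that finer control, the sketch below sums a large array that is read only once and issues a non-temporal prefetch hint ahead of the read position, so that the streamed data is less likely to displace useful cache lines. It assumes a compiler that provides `_mm_prefetch` in `<xmmintrin.h>` when SSE is enabled, and a 64-byte line (16 four-byte elements); the function name and the prefetch distance are arbitrary choices, and the hint changes performance only, never the result.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE__)
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */
#endif

#define ELEMS_PER_LINE 16u  /* assumes 64-byte lines and 4-byte elements */

/* Sum an array that will not be reused soon, hinting one line ahead. */
int64_t sum_read_once(const int32_t *src, size_t n)
{
    int64_t total = 0;

    for (size_t i = 0; i < n; i++) {
#if defined(__SSE__)
        /* Once per line, prefetch the next line with the NTA hint. */
        if (i % ELEMS_PER_LINE == 0 && i + ELEMS_PER_LINE < n)
            _mm_prefetch((const char *)&src[i + ELEMS_PER_LINE],
                         _MM_HINT_NTA);
#endif
        total += src[i];
    }
    return total;
}
```

Whether the hint actually helps depends on the processor and the access pattern, so it is worth measuring before relying on it.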

11.3.3 Code Fetches in Uncacheable Memory

Programs may execute code from uncacheable (UC) memory, but the implications are different from those of accessing data in UC memory. When fetching code, the processor never transitions speculatively from cacheable code to UC code. It also never speculatively fetches branch targets that lie in UC code.

The processor may fetch the same UC cache line multiple times in order to decode an instruction once. It may decode consecutive UC instructions in a cache line without fetching between each instruction. It may also fetch additional cache lines from the same or a consecutive 4-KByte page in order to decode one non-speculative UC instruction (this can be true even when the instruction is contained entirely within one line).

Because of the above, and because cache line sizes may change in future processors, software should avoid placing memory-mapped I/O with read side effects in the same page as, or in a page subsequent to, a page used to execute UC code.
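That placement rule reduces to simple page arithmetic. The check below treats "subsequent page" as the single 4-KByte page following the last page of the UC code range, which is one interpretation of the text; the function name is hypothetical, and the caller is assumed to pass a range that does not wrap around the address space.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_4K UINT64_C(4096)

/* Returns true if mmio_addr lies in a page occupied by the UC code range
   [code_start, code_start + code_len) or in the page right after it. */
bool mmio_conflicts_with_uc_code(uint64_t code_start, uint64_t code_len,
                                 uint64_t mmio_addr)
{
    if (code_len == 0)
        return false;   /* no code, no conflict */

    uint64_t first_page = code_start / PAGE_SIZE_4K;
    uint64_t last_page  = (code_start + code_len - 1) / PAGE_SIZE_4K;
    uint64_t mmio_page  = mmio_addr / PAGE_SIZE_4K;

    return mmio_page >= first_page && mmio_page <= last_page + 1;
}
```

For UC code at 0x1000 to 0x10FF, a device register at 0x2000 is flagged because it sits in the following page, while one at 0x3000 is not.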

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
