The misconception of CPU cache flushing


Even from senior engineers I often hear talk of how certain operations cause the CPU cache to be "flushed". This appears to be a common misconception about how the CPU cache works and how the cache subsystem interacts with the execution cores. This article explains what the CPU cache actually does and how the cores that execute program instructions interact with it. I will use a recent Intel x86 CPU as the example; other CPUs use similar techniques to achieve the same goal.

Most common modern systems are designed as multiprocessors that share memory. In a shared-memory system, a single memory resource is accessed by two or more independent CPU cores. The latency from a core to main memory varies greatly, roughly 10-100 nanoseconds. Within 100 ns, a 3 GHz CPU can execute up to 1200 instructions: each Sandy Bridge core can process up to 4 instructions per clock cycle, so 3 cycles per ns × 4 instructions per cycle × 100 ns ≈ 1200 instructions. To avoid the cores stalling on direct accesses to main memory, the CPU uses a cache subsystem, which lets it process instructions far more efficiently. Some caches are small, very fast, and private to each core; others are slower, larger, and shared across cores. Together with the registers and main memory, these caches form our non-persistent memory hierarchy.

When you design an important algorithm, remember that a cache miss may cost you the opportunity to execute roughly 500 instructions! And that is only on a single-socket system; on a multi-socket system the cost can effectively double, because memory accesses may require cross-socket communication.
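To get a feel for this cost, here is a rough sketch (in Java; the array size, class name, and use of System.nanoTime are illustrative choices, and a harness such as JMH would give more trustworthy numbers). It chases a chain of indices through a large array: the sequential chain is prefetcher-friendly, while the shuffled chain makes almost every load a cache miss that pays the full trip to main memory:

```java
import java.util.Random;

// A rough sketch, not a rigorous benchmark; run with a large heap (e.g. -Xmx1g).
public final class CacheMissDemo {
    private static final int SIZE = 1 << 24; // 16M ints (~64 MB), larger than a typical L3 cache

    public static void main(String[] args) {
        System.out.println("sequential walk: " + walk(chain(false)) + " ms");
        System.out.println("random walk:     " + walk(chain(true)) + " ms");
    }

    // Builds a single cycle over all indices, either in natural order or shuffled.
    private static int[] chain(boolean shuffled) {
        int[] order = new int[SIZE];
        for (int i = 0; i < SIZE; i++) order[i] = i;
        if (shuffled) {
            Random random = new Random(42);
            for (int i = SIZE - 1; i > 0; i--) {
                int j = random.nextInt(i + 1);
                int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
            }
        }
        int[] next = new int[SIZE];
        for (int i = 0; i < SIZE; i++) {
            next[order[i]] = order[(i + 1) % SIZE];
        }
        return next;
    }

    // Pointer chasing: every load depends on the previous one, so each miss's
    // latency is fully exposed rather than hidden by out-of-order execution.
    private static long walk(int[] next) {
        long start = System.nanoTime();
        int index = 0;
        for (int i = 0; i < SIZE; i++) index = next[index];
        long millis = (System.nanoTime() - start) / 1_000_000;
        if (index < 0) throw new AssertionError(); // keep the result live
        return millis;
    }
}
```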

Memory system

Figure 1. For a Sandy Bridge core, the memory hierarchy can be decomposed roughly as follows:

1. Registers: within each core there are 160 entries for integer registers and 144 entries for floating-point registers. These registers are accessed in a single clock cycle and form the fastest memory available to the execution core. Compilers allocate local variables and function arguments to these registers. When Hyper-Threading is enabled, these registers are shared between the co-operating hyperthreads.

2. Memory Ordering Buffers (MOB): the MOB consists of a 64-entry load buffer and a 36-entry store buffer. These buffers track in-flight operations while they wait on the cache subsystem. The store buffer is a fully associative queue that can be searched for existing store operations queued while waiting on the L1 cache. These buffers let the processor run asynchronously while data is transferred to and from the cache subsystem. When the processor issues asynchronous reads and writes, the results can come back out of order. The MOB is used to disambiguate load and store ordering so that it complies with the published memory model.

3. Level 1 Cache: the L1 is a core-local cache, split into a separate 32 KB data cache and 32 KB instruction cache. Access takes 3 clock cycles, and this latency can effectively be hidden by the core while instructions are pipelined, provided the data is already in the L1 cache.

4. Level 2 Cache: the L2 is a core-local cache designed to buffer between the L1 and the shared L3 cache. It is 256 KB in size and acts primarily as an efficient memory-access staging area between the L1 and L3. The L2 holds both data and instructions and has a latency of 12 clock cycles.

5. Level 3 Cache: the L3 is shared by all cores on the same socket. It is divided into 2 MB segments, each connected to a ring bus on the socket; every core is also connected to this ring. Addresses are hashed onto segments to achieve greater throughput. Latency can be up to 38 clock cycles depending on the cache size, and each additional hop around the ring costs an extra cycle. Depending on the number of segments, the L3 can reach 20 MB. The L3 is inclusive of all data held in the L1 and L2 caches on the same socket. This inclusiveness costs space, but it allows the L3 to intercept snoop requests on behalf of the L1 and L2 caches, relieving the burden on each core's private caches.

6. Main memory: on a complete cache miss, the latency to a socket's DRAM channels averages about 65 ns. The exact latency depends on many factors; for example, a subsequent access to data in the same DRAM row greatly reduces latency, while queuing effects and collisions with memory refresh cycles significantly increase it. Each socket aggregates 4 memory channels to increase throughput, and latency can be hidden by pipelining requests on the individual channels.

7. NUMA: on a multi-socket server, memory access is non-uniform (Non-Uniform Memory Access). Non-uniform means that the memory to be accessed may hang off another socket, and reaching it over the QPI bus costs an additional ~40 ns. Sandy Bridge is a huge step forward for 2-socket systems compared with previous generations: the QPI link speed is raised from 6.4 GT/s to 8.0 GT/s, and two links can be aggregated, eliminating the bottleneck of earlier systems. On Nehalem and Westmere, QPI could only use about 40% of the bandwidth the memory controller could deliver for a single socket, which made accessing remote memory a bottleneck. In addition, QPI links can now forward prefetch requests, which previous generations could not.

Associativity levels

Caches are effectively hardware-based hash tables. The hash function is usually a simple masking of some low-order bits of the address to form the cache index. A hash table needs a mechanism to resolve collisions on the same slot. The associativity level is the number of slots, known as ways, within each set that can hold a hashed memory address. The level of associativity is a trade-off between the amount of data stored, power consumption, and lookup time. (Proofreading note: higher associativity means more ways per set and therefore fewer conflict evictions, at the cost of extra power and lookup circuitry.)

For Sandy Bridge, the L1D and L2 are 8-way set associative; the L3 is 12-way set associative.
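As a simplified illustration (this models the straightforward "mask the low-order bits" indexing described above, not the exact hardware hash), the sketch below computes which set of a Sandy Bridge-style L1 data cache a given address would map to, assuming 32 KB capacity, 8 ways, and 64-byte lines, which gives 64 sets:

```java
public final class CacheSetIndex {
    // Assumed Sandy Bridge-style L1D geometry: 32 KB, 8-way, 64-byte lines.
    static final int LINE_SIZE = 64;
    static final int WAYS = 8;
    static final int CACHE_SIZE = 32 * 1024;
    static final int SETS = CACHE_SIZE / (LINE_SIZE * WAYS); // 64 sets

    // The low 6 bits select the byte within the line; the next 6 bits select the set.
    static int setIndex(long address) {
        return (int) ((address / LINE_SIZE) % SETS);
    }

    public static void main(String[] args) {
        // Addresses exactly 4 KB apart map to the same set, so at most 8 such
        // lines can coexist in an 8-way L1 before one of them is evicted.
        System.out.println(setIndex(0x1000)); // 0
        System.out.println(setIndex(0x2000)); // 0
        System.out.println(setIndex(0x1040)); // 1
    }
}
```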

Cache coherence

Because some caches are local to a core, we need a way to keep them coherent, so that every core's view of memory stays consistent. In mainstream systems the memory subsystem has to track the "source of truth" for the data. If data exists only in a cache it can never be stale; when data exists in both a cache and main memory, the master copy is the one in the cache. This style of memory management is called write-back: cached data is written back to main memory only when its cache line is evicted because a new line takes its place. On the x86 architecture each cache block is 64 bytes in size and is called a cache line. Other processors may use different cache-line sizes. A larger cache line reduces effective latency but demands more bandwidth (proofing note: data-bus bandwidth).
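A quick way to see the 64-byte granularity is the sketch below (machine dependent and not a rigorous benchmark; the array size and strides are my own choices). Summing one int out of every 16 touches exactly one int per cache line, yet it costs nearly as much as summing every int, because each access still pulls a whole line in from memory:

```java
// A rough sketch, machine dependent; needs a heap of a few hundred MB.
public final class CacheLineDemo {
    private static final int[] DATA = new int[32 * 1024 * 1024]; // 128 MB, far larger than the L3

    public static void main(String[] args) {
        System.out.println("stride 1  (every int):        " + time(1) + " ms");
        System.out.println("stride 16 (one int per line): " + time(16) + " ms");
    }

    private static long time(int stride) {
        long start = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < DATA.length; i += stride) {
            sum += DATA[i]; // each access brings in the whole 64-byte line
        }
        if (sum == 42) System.out.println(sum); // keep the loop from being optimized away
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```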

To keep the caches coherent, the cache controller tracks the state of every cache line, and the number of possible states is small. Intel uses the MESIF protocol, AMD uses MOESI. Under MESIF, a cache line is in one of the following five states (a simplified sketch of the transitions follows the list):

Modified: indicates that the cache line is dirty and must be written back to main memory at a later stage. When it is written back, the state transitions to Exclusive.

Exclusive: indicates that the cache line is held exclusively by the current core and is consistent with main memory. When it is written to, the state changes to Modified. To reach this state, a Request-For-Ownership (RFO) message must be sent, which consists of a read plus a broadcast that invalidates all other copies.

Shared: indicates that the cache line is a consistent copy of main memory.

Invalid: indicates an invalid cache line.

Forward: a specialized Shared state, used to designate which cache should respond to requests from other caches in a NUMA system.
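As a purely illustrative model (greatly simplified, ignoring the bus messages themselves and the NUMA details of the Forward state; the method names are my own), the sketch below captures the transitions mentioned above:

```java
// A drastically simplified model of the MESIF transitions described above.
enum CacheLineState {
    MODIFIED, EXCLUSIVE, SHARED, INVALID, FORWARD;

    // A Request-For-Ownership (a read plus a broadcast invalidate of all other
    // copies) is needed before writing unless the line is already held exclusively.
    boolean writeNeedsRfo() {
        return this == SHARED || this == FORWARD || this == INVALID;
    }

    // After a local write (and the RFO, if one was needed) the line is dirty.
    CacheLineState afterLocalWrite() {
        return MODIFIED;
    }

    // Writing the dirty data back to main memory leaves the line Exclusive.
    CacheLineState afterWriteBack() {
        return this == MODIFIED ? EXCLUSIVE : this;
    }

    // Another core's RFO invalidates our copy.
    CacheLineState afterRemoteRfo() {
        return INVALID;
    }
}
```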

To move from one state to another, a series of messages is exchanged between the caches to make the state change take effect. Prior to the Nehalem generation of Intel CPUs and the Opteron generation of AMD CPUs, the cache-coherence traffic between sockets had to share the memory bus, which greatly limited scalability. Today, the memory controller's traffic travels over a separate bus; Intel's QPI and AMD's HyperTransport buses are used for cache-coherence communication between sockets.

The cache controller for each L3 cache segment is connected to the ring bus network on the socket. Each core, each L3 cache segment, the QPI controller, the memory controller, and the integrated graphics subsystem are all attached to this ring. The ring is made up of four independent lanes: request, snoop, acknowledge, and a 32-byte data lane per clock cycle. Because the L3 cache is inclusive of every cache line held in the L1 and L2 caches, it can quickly identify, when snooping for changes, which core may have modified a line. The cache controller for each L3 segment keeps track of which core could hold a modified copy of its cache lines.

If a core wants to read data that is not held in a Shared, Exclusive, or Modified state in its own caches, it must issue a read on the ring bus. The data is then either read from main memory (if it is not cached anywhere), read from the L3 cache (if the copy there is clean), or snooped from another core (if that core holds a modified copy). In any case, the coherence protocol guarantees that a read never returns a stale copy from the cache subsystem.
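The cost of this coherence traffic is easy to provoke from user space. The sketch below (timings are machine dependent, and the slot indices and iteration count are arbitrary choices of mine) has two threads increment counters that either share a 64-byte cache line or sit on separate lines; in the shared case the line ping-pongs between the cores in the Modified state via RFO messages:

```java
import java.util.concurrent.atomic.AtomicLongArray;

// A rough sketch, machine dependent; a harness such as JMH would give better numbers.
public final class FalseSharingDemo {
    private static final long ITERATIONS = 50_000_000L;
    private static final AtomicLongArray SLOTS = new AtomicLongArray(32);

    public static void main(String[] args) throws InterruptedException {
        // Slots 0 and 1 are 8 bytes apart, so they almost certainly share a 64-byte line;
        // slots 0 and 16 are 128 bytes apart, so they never do.
        System.out.println("same cache line:       " + run(0, 1) + " ms");
        System.out.println("different cache lines: " + run(0, 16) + " ms");
    }

    private static long run(int slotA, int slotB) throws InterruptedException {
        Thread a = new Thread(() -> { for (long i = 0; i < ITERATIONS; i++) SLOTS.incrementAndGet(slotA); });
        Thread b = new Thread(() -> { for (long i = 0; i < ITERATIONS; i++) SLOTS.incrementAndGet(slotB); });
        long start = System.nanoTime();
        a.start(); b.start();
        a.join(); b.join();
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```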

Concurrent programming

If our caches are always coherent, why do we worry about visibility when writing concurrent programs? Because cores, in their quest for better performance, may make data modifications appear out of order to other threads. There are two main reasons for this.

First, when the compiler generates code, it may, for performance, keep a variable in a register for a long time, for example a variable reused inside a loop. If such a variable needs to be visible across cores, it must not be register-allocated. In C this can be achieved with the "volatile" qualifier. Keep in mind that C/C++ volatile does not guarantee that the compiler will not reorder our instructions; for that, memory barriers are needed.
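The same hoisting problem exists in Java, where the JIT may keep a non-volatile field in a register across loop iterations. The following sketch (a standard illustration rather than something from the original article; the class and field names are my own, and whether the loop actually hangs without volatile depends on the JIT) would likely never terminate if `running` were not declared volatile:

```java
public final class VisibilityDemo {
    // Without volatile the JIT is free to read this field once, keep it in a
    // register, and never notice the other thread's update.
    private static volatile boolean running = true;

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            long iterations = 0;
            while (running) {   // re-read from memory on every pass because of volatile
                iterations++;
            }
            System.out.println("stopped after " + iterations + " iterations");
        });
        worker.start();

        Thread.sleep(1000);
        running = false;        // visible to the worker because the field is volatile
        worker.join();
    }
}
```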

The second major ordering problem is that a thread can write a variable and then, when it reads it shortly afterwards, see the value in its own store buffer, which may be older than the latest value in the cache subsystem. This is never a problem for programs that follow the Single Writer Principle, but it is a big problem for algorithms such as the Dekker and Peterson lock algorithms. To overcome this and guarantee that the latest value is observed, the thread must not satisfy the read from its local store buffer. A barrier instruction can be used to prevent the subsequent read from getting ahead of a store made by another thread. Writing a volatile variable in Java, in addition to never being register-allocated, is accompanied by a full barrier instruction. On x86, this barrier instruction significantly impacts the issuing thread, which makes no progress until the store buffer is drained. On other processors, barriers can be implemented more efficiently; the Azul Vega, for example, simply places a marker in the store buffer as a search boundary.
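The Dekker-style problem can be shown with the classic store/load litmus test below (a common illustration, not from the original article; the loop count is arbitrary, and OpenJDK's jcstress is the proper tool for running such tests). With x and y as plain ints, both threads can read 0 because each write may still be sitting in the writer's store buffer; declaring them volatile adds the full barrier described above and rules that outcome out:

```java
// A classic litmus test sketch; may need many runs to witness the reordering.
public final class StoreBufferDemo {
    static volatile int x, y; // change both to plain ints to allow the (0, 0) outcome

    public static void main(String[] args) throws InterruptedException {
        int witnessed = 0;
        for (int run = 0; run < 50_000; run++) {
            x = 0; y = 0;
            final int[] r = new int[2];

            Thread t1 = new Thread(() -> { x = 1; r[0] = y; });
            Thread t2 = new Thread(() -> { y = 1; r[1] = x; });
            t1.start(); t2.start();
            t1.join(); t2.join();

            if (r[0] == 0 && r[1] == 0) witnessed++; // possible only without the barriers
        }
        System.out.println("(0, 0) observed " + witnessed + " times");
    }
}
```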

When following the Single Writer Principle, the j.u.c.atomic(Int|Long|Reference).lazySet() methods can be used instead of writing a volatile variable; they preserve the ordering needed between Java threads while avoiding the store barrier.
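A minimal sketch of that single-writer pattern is shown below (the class, buffer size, and method names are illustrative; this is roughly how ring-buffer style queues publish a sequence). The data is written with a plain store, and lazySet() then publishes the sequence with release ordering, avoiding the full store barrier a volatile write would incur:

```java
import java.util.concurrent.atomic.AtomicLong;

// A minimal single-writer sketch; one producer thread only, names are illustrative.
public final class SingleWriterQueue {
    private final long[] buffer = new long[1024];
    private final AtomicLong published = new AtomicLong(-1);

    // Called by exactly one producer thread.
    public void publish(long value) {
        long next = published.get() + 1;
        buffer[(int) (next & (buffer.length - 1))] = value; // 1. write the data
        published.lazySet(next);                            // 2. publish the sequence cheaply
    }

    // May be called by a consumer thread.
    public long readLatest() {
        long seq = published.get();
        if (seq < 0) throw new IllegalStateException("nothing published yet");
        return buffer[(int) (seq & (buffer.length - 1))];
    }
}
```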

The misconception

Back to the "refresh cache" myth as part of the concurrency algorithm, I think we would never "flush" the CPU cache on a program in user space. I believe the source of this misunderstanding is because some concurrency algorithms need to refresh, flag, or empty the store buffer so that the next read operation can see the latest value. To achieve this, we need a memory barrier rather than a flush cache.

Another possible source of the misconception is that, depending on the address-indexing policy, the L1 cache or the TLB may need to be flushed on a context switch. On ARM, before ARMv6, address-space tags were not used on TLB entries, so the entire L1 cache had to be flushed on a context switch. Many processors require an L1 instruction-cache flush for similar reasons; in many cases it is simply because instruction caches are not required to be kept coherent. Context switching is expensive: besides polluting the L2 cache, a context switch can also trigger TLB and/or L1 cache flushes. Intel x86 processors only require a TLB flush on a context switch.

(Proofreading note: the TLB, or translation lookaside buffer, caches page-table entries; it is sometimes called the "fast table". Because the page tables themselves live in main memory and walking them is expensive, the TLB was introduced to speed up address translation.)

Original article. When reproducing, please credit: reproduced from the Concurrent Programming Network – ifeve.com
