CPU Cache and Multithreading


The CPU cache structure

CPU speed is much higher than memory speed; if only the CPU and memory are considered, program performance is often limited by the speed of memory access. To bridge this speed gap, caches were added to the CPU. Like the classic storage pyramid, the caches form a pyramid of their own: from bottom to top, access gets faster and capacity gets smaller. Most processors today have two or three levels of cache; from bottom to top these are the L3, L2, and L1 caches. A cache can be divided into an instruction cache and a data cache: the instruction cache holds program code, and the data cache holds program data.
A single-core CPU contains a single set of L1, L2, and L3 caches. If the CPU contains multiple cores (a multi-core CPU), each core has its own L1 (and often L2) cache, while the L3 (or L2) cache is shared among the cores. In a multi-CPU system, each CPU is independent and has its own caches; caches are not shared across CPUs. (Figure: cache structure of a single-CPU, dual-core processor.)

Cache lines
The minimum unit of data exchange between the cache and memory is the cache line, typically 64 bytes. That is, every transfer between the cache and memory moves an entire aligned, contiguous 64-byte block of memory.
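
To make the 64-byte granularity concrete, here is a minimal Java sketch (mine, not from the original text) that walks an int array with stride 1 and with stride 16 (16 x 4 = 64 bytes, assuming the typical line size above). Both loops touch every cache line once, so the stride-16 loop is nowhere near 16 times faster even though it does a sixteenth of the work.

    // Sketch: observe cache-line granularity by timing strided array walks.
    // Assumes 64-byte cache lines (16 ints of 4 bytes each).
    public class CacheLineDemo {
        static final int SIZE = 16 * 1024 * 1024; // 16M ints = 64 MB
        static int[] data = new int[SIZE];

        static long walk(int stride) {
            long start = System.nanoTime();
            for (int i = 0; i < SIZE; i += stride) {
                data[i]++; // each 16-int stride lands on a new cache line
            }
            return System.nanoTime() - start;
        }

        public static void main(String[] args) {
            walk(1); walk(16); // warm-up
            System.out.println("stride  1: " + walk(1) / 1_000_000 + " ms");
            System.out.println("stride 16: " + walk(16) / 1_000_000 + " ms");
        }
    }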

Cache access
In a structure without caches, the CPU uses virtual addresses during execution; a virtual address is translated by the MMU into a physical address and mapped into memory. In a cached structure, when the CPU accesses data it first checks whether the data is already in the cache. A cache entry can be looked up either by the data's virtual address in the process, or by its physical address in memory.
If lookup uses virtual addresses from the process address space, note that many different processes may contain the same virtual address, so the cache line actually needs to carry an ASID (address space identifier), a hardware version of the process ID. This way, when the CPU runs different processes, it can look up the cache with virtual address + ASID.
If the cache is looked up by physical address instead, the MMU must translate the address between the CPU and the cache, which slows down cache lookup. For this reason the L1 cache is generally not indexed by physical address but is accessed directly by the CPU via virtual addresses, while the L2 and L3 caches are looked up by physical memory address.
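
One way to picture the virtual address + ASID scheme is a lookup keyed by the pair. The toy model below is my own illustration (names invented; Java 16+ for the record type); real hardware uses fixed-size set-associative arrays, not a hash map.

    import java.util.HashMap;
    import java.util.Map;

    // Toy model: a virtually-indexed cache tagged with an ASID, so the same
    // virtual address in two different processes maps to two distinct lines.
    public class VirtualCacheModel {
        static final int LINE_SIZE = 64;

        record Key(int asid, long virtualLine) {}

        private final Map<Key, byte[]> lines = new HashMap<>();

        byte[] lookup(int asid, long virtualAddr) { // hit needs ASID + address
            return lines.get(new Key(asid, virtualAddr / LINE_SIZE));
        }

        void fill(int asid, long virtualAddr, byte[] line) {
            lines.put(new Key(asid, virtualAddr / LINE_SIZE), line);
        }

        public static void main(String[] args) {
            VirtualCacheModel cache = new VirtualCacheModel();
            cache.fill(1, 0x1000, new byte[LINE_SIZE]);
            System.out.println(cache.lookup(1, 0x1000) != null); // true: same process
            System.out.println(cache.lookup(2, 0x1000) != null); // false: other ASID
        }
    }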

The TLB (translation lookaside buffer)
The MMU maps virtual addresses to physical addresses by walking the page table. The TLB sits inside the MMU and caches recent translations, so a virtual address can be converted to a physical address without going through the page table, which is much faster.
Each TLB entry contains information about one page: a valid bit, the virtual page number, a modified (dirty) bit, protection bits, and the physical page number of the page. These fields correspond one-to-one with the page table entry.
When the MMU translates an address, it first checks (in parallel across all entries) whether the virtual page number is in the TLB. If it is, and the access does not violate the read/write protection bits, the physical page number is returned directly from the TLB. If it is not in the TLB, a regular page table lookup is performed, one entry is evicted from the TLB, and it is replaced with the page just looked up.
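
The hit/miss/refill flow can be sketched as a tiny direct-mapped TLB in front of a page-table map. This is my illustration, not the article's: the sizes, names, and trivial modulo placement are invented, while the valid bit, page numbers, and refill-on-miss mirror the fields described above.

    import java.util.HashMap;
    import java.util.Map;

    // Toy TLB: a small direct-mapped translation cache in front of a page table.
    public class TlbModel {
        static final int PAGE_SHIFT = 12;  // 4 KB pages
        static final int TLB_ENTRIES = 16;

        static final class Entry {
            boolean valid;
            long virtualPage;
            long physicalPage;
        }

        private final Entry[] tlb = new Entry[TLB_ENTRIES];
        private final Map<Long, Long> pageTable = new HashMap<>(); // vpn -> ppn

        { for (int i = 0; i < TLB_ENTRIES; i++) tlb[i] = new Entry(); }

        long translate(long virtualAddr) {
            long vpn = virtualAddr >>> PAGE_SHIFT;
            Entry e = tlb[(int) (vpn % TLB_ENTRIES)];
            if (e.valid && e.virtualPage == vpn) {       // TLB hit
                return (e.physicalPage << PAGE_SHIFT) | (virtualAddr & 0xFFF);
            }
            Long ppn = pageTable.get(vpn);               // miss: walk the page table
            if (ppn == null) throw new IllegalStateException("page fault");
            e.valid = true; e.virtualPage = vpn; e.physicalPage = ppn; // evict + refill
            return (ppn << PAGE_SHIFT) | (virtualAddr & 0xFFF);
        }

        public static void main(String[] args) {
            TlbModel mmu = new TlbModel();
            mmu.pageTable.put(0x42L, 0x99L);
            System.out.printf("0x%x%n", mmu.translate((0x42L << 12) | 0x10)); // 0x99010
        }
    }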

Caching in multithreaded scenarios

In single-threaded mode, a piece of memory corresponds to the cache of only one CPU core and is accessed by only one thread. The cache is exclusive to that thread, so there are no access conflicts.
In multithreaded mode on a single-core CPU, multiple threads in a process access the process's shared data at the same time, and the CPU loads the block of memory into its cache. When different threads access the same physical address, they map to the same cache location, so the cache stays valid even when threads switch. And because only one thread executes at any moment, there are no cache access conflicts. However, because the CPU has internal registers, non-atomic statements such as a++ can still cause problems under multiple threads, as the expansion below shows.

a++ expands to three register-level steps:

    reg = load(&a);   reg++;   store(&a, reg);

When a thread switch occurs, the registers the thread was using are saved as part of its context. If the switch happens between the load and the store, both threads can read the old value of a, increment it independently, and write back the same result, so one increment is lost.
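
The lost update is easy to reproduce in Java (a sketch of mine, not from the article): two threads bump a plain int 100,000 times each, alongside a java.util.concurrent.atomic.AtomicInteger whose getAndIncrement() makes the load-increment-store a single atomic step.

    import java.util.concurrent.atomic.AtomicInteger;

    // Demonstrates that a++ is not atomic: the plain counter usually
    // ends up below 200,000, while the atomic one never does.
    public class LostUpdateDemo {
        static int plain = 0;
        static final AtomicInteger atomic = new AtomicInteger();

        public static void main(String[] args) throws InterruptedException {
            Runnable task = () -> {
                for (int i = 0; i < 100_000; i++) {
                    plain++;                  // load, increment, store: can interleave
                    atomic.getAndIncrement(); // one atomic read-modify-write
                }
            };
            Thread t1 = new Thread(task), t2 = new Thread(task);
            t1.start(); t2.start();
            t1.join(); t2.join();
            System.out.println("plain  = " + plain);        // often < 200000
            System.out.println("atomic = " + atomic.get()); // always 200000
        }
    }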
In multithreaded mode on a single CPU with multiple cores, each core has at least its own L1 cache. When multiple threads access shared memory in the process and the threads execute on separate cores, each core keeps a copy of the shared data in its own cache. Because multiple cores can run in true parallel, several threads may write their own cached copies at the same time.
The CPU therefore enforces cache coherence: each processor (core) sniffs the data propagated on the bus to check whether its own cached values have gone stale, and when a processor finds that the memory behind one of its cache lines has been modified, it invalidates that line. This is why, in multi-core multithreaded scenarios, variables are often declared volatile: a write to a volatile variable requires the cache to be flushed to system memory immediately after the update. For a non-volatile variable, the CPU modifies only the cache, and the cache writes the data back to memory at some point of its own choosing (we cannot know when). The write to memory also causes the other processors (cores) to invalidate their own copies of that memory and re-read it from memory the next time they need it.
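
A common Java demonstration of the difference (this sketch is mine; whether the non-volatile variant actually hangs depends on the JIT and the hardware): with volatile, the reader thread is guaranteed to see the writer's update; without it, the reader may spin forever on a stale cached value.

    // Visibility demo: with `volatile` the reader sees the writer's update;
    // without it, the loop may never terminate (JIT/CPU dependent).
    public class VolatileVisibility {
        static volatile boolean ready = false; // try removing `volatile`
        static int payload = 0;

        public static void main(String[] args) throws InterruptedException {
            Thread reader = new Thread(() -> {
                while (!ready) { /* spin until the write becomes visible */ }
                System.out.println("saw payload = " + payload);
            });
            reader.start();
            Thread.sleep(100); // give the reader time to start spinning
            payload = 42;      // ordinary write...
            ready = true;      // ...published by the volatile write
            reader.join();
        }
    }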

False sharing
Because the cache exchanges data with memory in units of a cache line, typically 64 bytes, 64 contiguous bytes of memory are loaded into one cache line together. When multiple threads modify variables that are logically independent of each other, if those variables happen to share the same cache line, the threads inadvertently hurt each other's performance. This is false sharing.

As the (omitted) figure shows, x and y are placed in the same cache line. When Core1 modifies x, the cache line must be locked for inter-core visibility and the copy of the line in Core2 is invalidated; likewise, when Core2 modifies y, the line is locked again and the copy in Core1 is invalidated. Even though the two cores never touch each other's variable, they keep bouncing the cache line back and forth, which hurts efficiency.
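
A manual-padding sketch of the effect (mine; the field names are invented). Two threads increment two volatile longs fifty million times each: adjacent fields likely share one 64-byte line and bounce it between cores, while seven padding longs push them onto separate lines, typically making the run several times faster. Note that the JVM is free to reorder fields, so plain padding like this is not guaranteed; JDK internals use the @Contended annotation (enabled with -XX:-RestrictContended) for a reliable version.

    // False-sharing demo: two threads hammer two volatile longs.
    public class FalseSharingDemo {
        static class AdjacentPair {
            volatile long x, y;              // likely the same cache line
        }
        static class PaddedPair {
            volatile long x;
            long p1, p2, p3, p4, p5, p6, p7; // 56 bytes of padding
            volatile long y;                 // likely a different line
        }

        static long race(Runnable incX, Runnable incY) throws InterruptedException {
            Thread t1 = new Thread(incX), t2 = new Thread(incY);
            long start = System.nanoTime();
            t1.start(); t2.start(); t1.join(); t2.join();
            return (System.nanoTime() - start) / 1_000_000;
        }

        public static void main(String[] args) throws InterruptedException {
            final int N = 50_000_000;
            AdjacentPair a = new AdjacentPair();
            PaddedPair p = new PaddedPair();
            System.out.println("adjacent: " + race(
                () -> { for (int i = 0; i < N; i++) a.x++; },
                () -> { for (int i = 0; i < N; i++) a.y++; }) + " ms");
            System.out.println("padded:   " + race(
                () -> { for (int i = 0; i < N; i++) p.x++; },
                () -> { for (int i = 0; i < N; i++) p.y++; }) + " ms");
        }
    }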
