The cache is a small, fast memory that holds copies of the most recently accessed data from main memory. That description is fairly accurate, but knowing the details of how the cache works will help you improve the performance of your programs.
In this blog I'll use code examples to illustrate how caching works and how the cache affects program performance.
The examples are written in C#, which does not affect the analysis or the conclusions. This is a translation of the original article: Gallery of Processor Cache Effects.
I. Memory access and performance
In the following piece of code, how much faster do you think loop 2 will run compared to loop 1?
int[] arr = new int[64 * 1024 * 1024];

// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;

// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
Clearly, loop 2 performs only about 6% of the iterations of loop 1. However, the experiment shows that the two loops take almost the same amount of time: about 80 ms and 78 ms respectively.
What causes this? The running time of both loops is dominated by the memory accesses to the array, not by the integer multiplication. In short, memory access is what costs time.
In fact, loop 1 and loop 2 end up performing the same amount of memory access, as the next section explains.
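To reproduce the measurement, the two loops can be timed with System.Diagnostics.Stopwatch. The following is a minimal sketch of such a harness; the class name and output format are my own additions, not part of the original post.

using System;
using System.Diagnostics;

class MemoryAccessDemo
{
    static void Main()
    {
        int[] arr = new int[64 * 1024 * 1024];

        // Loop 1: multiply every element.
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < arr.Length; i++) arr[i] *= 3;
        sw.Stop();
        Console.WriteLine($"Loop 1: {sw.ElapsedMilliseconds} ms");

        // Loop 2: multiply every 16th element (one element per 64-byte cache line).
        sw = Stopwatch.StartNew();
        for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
        sw.Stop();
        Console.WriteLine($"Loop 2: {sw.ElapsedMilliseconds} ms");
    }
}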
II. The impact of cache lines
In the following piece of code, we vary the step size of the for loop.
for (int i = 0; i < arr.Length; i += K) arr[i] *= 3;
The following shows the running time of the loop for different values of the step K:
As you can see, the running time of the loop barely changes while the step size is in the range 1 to 16. But from 16 onward, every time we double the step size, the running time is roughly halved.
The reason for this behavior is that the CPU does not access memory byte by byte. Instead, each access fetches a chunk of memory, typically 64 bytes, called a cache line. So when you read a particular memory address, the entire cache line containing that address is loaded into the cache. If the next access falls within the same cache line, there is no additional memory access overhead: it is a cache hit, and the data does not have to be fetched from main memory.
Since 16 ints occupy exactly 64 bytes (one cache line), every step size from 1 to 16 touches exactly the same set of cache lines, and therefore causes the same amount of memory traffic. Once the step size becomes 32, however, only every other cache line is touched, so the number of memory accesses drops to arr.Length/32 instead of arr.Length/16.
Understanding cache lines is a great help when optimizing a program. For example, data alignment determines whether a single operation touches one cache line or two. Based on the example above, an operation on misaligned data that straddles two cache lines can be roughly twice as slow.
Summary: each memory read fetches an entire cache line, not a single byte.
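Here is a minimal sketch of how the step-size measurement could be run; the helper name MeasureStep and the particular set of step values are my own choices, not the original author's.

using System;
using System.Diagnostics;

class StepDemo
{
    static long MeasureStep(int[] arr, int k)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < arr.Length; i += k) arr[i] *= 3;
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        int[] arr = new int[64 * 1024 * 1024];

        // Steps 1..16 touch the same cache lines; 32, 64, ... touch fewer and fewer.
        foreach (int k in new[] { 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024 })
            Console.WriteLine($"K = {k,4}: {MeasureStep(arr, k)} ms");
    }
}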
III. L1 & L2 cache sizes
Computers today have two or three levels of cache, usually called L1, L2, and L3. If you want to know the sizes of the caches at each level, you can use the Sysinternals tool CoreInfo, or call the Windows API GetLogicalProcessorInformation. Both report the cache sizes as well as the cache line size.
On my machine, CoreInfo reports a 32 KB L1 data cache, a 32 KB L1 instruction cache, and a 4 MB L2 cache. The L1 caches are private to each core, while each L2 cache is shared between a pair of cores:
Logical Processor to Cache Map:
*---  Data Cache          0, Level 1,  32 KB, Assoc  8, LineSize 64
*---  Instruction Cache   0, Level 1,  32 KB, Assoc  8, LineSize 64
-*--  Data Cache          1, Level 1,  32 KB, Assoc  8, LineSize 64
-*--  Instruction Cache   1, Level 1,  32 KB, Assoc  8, LineSize 64
**--  Unified Cache       0, Level 2,   4 MB, Assoc 16, LineSize 64
--*-  Data Cache          2, Level 1,  32 KB, Assoc  8, LineSize 64
--*-  Instruction Cache   2, Level 1,  32 KB, Assoc  8, LineSize 64
---*  Data Cache          3, Level 1,  32 KB, Assoc  8, LineSize 64
---*  Instruction Cache   3, Level 1,  32 KB, Assoc  8, LineSize 64
--**  Unified Cache       1, Level 2,   4 MB, Assoc 16, LineSize 64
Let's run an experiment to verify these numbers. In the following piece of code, we start at index 0 and access every 16th element of the array. When the index reaches the end, the loop wraps back to the start. By varying the array size, we expect performance to start dropping once the array no longer fits in a given cache level.
int steps = 64 * 1024 * 1024; // Arbitrary number of steps
int lengthMod = arr.Length - 1;
for (int i = 0; i < steps; i++)
{
    arr[(i * 16) & lengthMod]++; // (x & lengthMod) is equal to (x % arr.Length)
}
The results are as follows:
As you can see, performance starts to drop once the array size exceeds 32 KB, and again once it exceeds 4 MB: 32 KB is the size of the L1 data cache, and 4 MB is the size of the L2 cache.
Summary: the capacity of L1 and L2 is limited. Once L1 is full, loading another cache line into it means evicting a cache line that is already there.
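A minimal sketch of the whole experiment follows; the range of array sizes and the helper name MeasureSize are my own choices, while the inner loop follows the snippet above.

using System;
using System.Diagnostics;

class CacheSizeDemo
{
    static long MeasureSize(int sizeInBytes)
    {
        // One int is 4 bytes, so the array length is sizeInBytes / 4 (kept a power of two).
        int[] arr = new int[sizeInBytes / 4];
        int lengthMod = arr.Length - 1;
        int steps = 64 * 1024 * 1024; // arbitrary number of steps

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < steps; i++)
        {
            arr[(i * 16) & lengthMod]++; // (x & lengthMod) == (x % arr.Length) for power-of-two lengths
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        // Powers of two from 1 KB to 64 MB.
        for (int size = 1024; size <= 64 * 1024 * 1024; size *= 2)
            Console.WriteLine($"{size / 1024,8} KB: {MeasureSize(size)} ms");
    }
}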
IV. Instruction-level parallelism
Now let's look at something a little different. Which of the following two loops do you expect to run faster?
int steps = 256 * 1024 * 1024;
int[] a = new int[2];

// Loop 1
for (int i = 0; i < steps; i++) { a[0]++; a[0]++; }

// Loop 2
for (int i = 0; i < steps; i++) { a[0]++; a[1]++; }
The results show that loop 2 is about twice as fast as loop 1, at least on the machine I'm using. Why? Let's look at how the loop bodies execute.
In loop 1, the two increments of a[0] depend on each other: the second one reads the value that the first one has just written, so within each iteration the operations form a single dependency chain.
In loop 2, the increment of a[0] and the increment of a[1] touch different memory locations and are independent of each other, so they can be executed at the same time.
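The diagrams in the original post essentially show each a[x]++ broken into a load, an increment, and a store. Written out as plain statements (my own reconstruction of those diagrams, not a runnable benchmark), the two loop bodies look like this:

int[] a = new int[2];
int x, y;

// Loop 1 body: the second increment reads the value the first one just stored,
// so the two operations form a single dependency chain.
x = a[0]; x++; a[0] = x;
y = a[0]; y++; a[0] = y;

// Loop 2 body: the two increments touch different elements and are independent,
// so the processor can execute them at the same time.
x = a[0]; x++; a[0] = x;
y = a[1]; y++; a[1] = y;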
Why is this?
With the development of modern processors, CPU computing power has improved enormously, especially with respect to parallelism. A single CPU core can now access two different locations in its L1 cache at the same time, or execute two simple arithmetic instructions at the same time. In loop 1 the CPU cannot take advantage of this instruction-level parallelism, because both increments access the same memory location; in loop 2 it can.
Comments:
Instruction-level parallelism means that a single CPU core can execute multiple instructions at the same time.
In general, if a group of neighboring instructions in a program are independent of each other, that is, they do not compete for the same functional unit, do not wait for each other's results, and do not access the same memory location, then they can be executed in parallel inside the processor. In a word, instructions that do not interfere with each other can execute in parallel.
Instruction-level parallelism relies on CPU techniques such as pipelining, multiple-instruction issue, superscalar execution, out-of-order execution, and very long instruction words (VLIW).
Addendum: quite a few people on Reddit wondered about compiler optimizations, specifically whether {a[0]++; a[0]++;} would be optimized into {a[0] += 2;}. In fact, neither the C# compiler nor the CLR JIT performs this optimization.
V. Cache associativity
Think about this: can any chunk of memory be mapped to any cache slot, or only to a subset of the slots? A memory chunk is the same size as a cache slot, i.e. one cache line (the exact size depends on the CPU architecture; most current processors use 64 bytes).
How is the mapping between cache slots and memory chunks established? There are three possible schemes:
1. Direct Mapped Cache
Each memory chunk can only be mapped to a specific cache slot.
Specifically, the cache slot for each memory chunk is computed as chunk_index % cache_slots. This is obviously a many-to-one relationship; for example, with 1024 slots, chunks 0, 1024, 2048, and so on all compete for slot 0. The implicit problem is that two memory chunks that map to the same cache slot cannot be in the cache at the same time (when one comes in, the other has to go out).
Each memory chunk has exactly one candidate location in the cache.
2. N-way set Associative cache
Each memory chunk can be mapped to any one of N specific cache slots (an N-way cache for short). For example, in a 16-way cache, each memory chunk can be mapped to 16 different cache slots.
Each memory chunk has multiple candidate locations in the cache.
3. Fully Associative cache
Each memory chunk can be mapped to any cache slot, so the cache can be managed much like a hash table.
The direct mapped cache scheme is prone to conflicts: when multiple values compete for the same cache slot, they keep evicting each other and the cache hit rate drops. The fully associative scheme is complex and expensive to implement in hardware. The N-way set associative scheme is the typical compromise, trading implementation simplicity against hit rate.
For example, the 4 MB L2 cache on my machine is 16-way set associative. All 64-byte memory chunks are partitioned into sets, and the chunks in the same set compete for 16 cache slots.
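To make this concrete, here is a small sketch of the arithmetic for this particular geometry (4 MB, 16-way, 64-byte lines); the variable names and the demo wrapper are mine.

using System;

class SetMappingDemo
{
    static void Main()
    {
        const int cacheSize = 4 * 1024 * 1024; // 4 MB L2
        const int lineSize  = 64;              // bytes per cache line
        const int ways      = 16;              // 16-way set associative

        int slots = cacheSize / lineSize;      // 65,536 cache slots
        int sets  = slots / ways;              // 4,096 sets

        // The set index of a byte address is (address / lineSize) % sets, so two
        // addresses that differ by lineSize * sets = 262,144 bytes (256 KB) land
        // in the same set and compete for its 16 slots.
        long a = 0;
        long b = a + (long)lineSize * sets;
        Console.WriteLine((a / lineSize) % sets == (b / lineSize) % sets); // True
    }
}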
To make the effect of cache associativity visible, I need to repeatedly access more than 16 elements that fall into the same set. The following code does this:
public static long UpdateEveryKthByte(byte[] arr, int K)
{
    Stopwatch sw = Stopwatch.StartNew();
    const int rep = 1024 * 1024; // Number of iterations - arbitrary

    int p = 0;
    for (int i = 0; i < rep; i++)
    {
        arr[p]++;
        p += K;
        if (p >= arr.Length) p = 0;
    }

    sw.Stop();
    return sw.ElapsedMilliseconds;
}
As the code shows, we step through the array arr in increments of K, wrapping back to the start once we reach the end of the array, and stop after the loop has iterated 2^20 times.
We call UpdateEveryKthByte() with arrays of different sizes (in 1 MB increments) and with different step values K, and record the running time of each call.
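A possible driver for this experiment, assuming the UpdateEveryKthByte method above is in scope; the range of array sizes and the step values below are my own picks for illustration.

static void Main()
{
    // Array sizes from 1 MB to 8 MB in 1 MB increments; the steps are multiples
    // of 64 bytes so that every access starts a new cache line.
    for (int mb = 1; mb <= 8; mb++)
    {
        byte[] arr = new byte[mb * 1024 * 1024];
        foreach (int k in new[] { 64, 128, 256, 512, 1024 })
            Console.WriteLine($"size = {mb} MB, K = {k,4}: {UpdateEveryKthByte(arr, k)} ms");
    }
}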
VI. Cache line false sharing
On multi-core CPUs, caches face another problem: cache coherence.
Most machines now have three levels of cache, where L1 and L2 are private to each core and L3 is shared. To simplify the explanation, assume there are only two levels: a private L1 and a shared L2.
For example, suppose there is a variable in memory with val == 1, and two threads T1 and T2 running on core 1 and core 2 respectively, each holding a copy of the variable in its core's private L1 cache.
If T1 now changes val to 2, the copies of val held by T2 and in main memory clearly become stale: the caches are no longer consistent. The hardware resolves this by synchronizing the value of val in memory and in core 2's L1 cache. (How cache coherence is actually implemented is beyond the scope of this post.)
Let's look at the following code to see how much overhead keeping the caches coherent can add.
private static int[] s_counter = new int[1024];

private void UpdateCounter(int position)
{
    for (int j = 0; j < 100000000; j++)
    {
        s_counter[position] = s_counter[position] + 3;
    }
}
In an experiment on a 4-core machine, if four different threads call this method with the parameters 0, 1, 2, and 3, the four threads take 4.3 s to finish.
If the parameters passed in are 16, 32, 48, and 64 instead, the four threads finish in only 0.28 s.
What is the reason? In the first case, the four threads' elements almost certainly sit in the same cache line, so every update of an array element invalidates the copies of that cache line held by the other cores, even though no two threads ever touch the same element. This is what "false sharing" means.
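A sketch of the experiment follows. The Task-based harness, the Run helper, and making UpdateCounter static so it can be called from Main are my own choices; only the update loop follows the code above.

using System;
using System.Diagnostics;
using System.Threading.Tasks;

class FalseSharingDemo
{
    private static int[] s_counter = new int[1024];

    private static void UpdateCounter(int position)
    {
        for (int j = 0; j < 100000000; j++)
        {
            s_counter[position] = s_counter[position] + 3;
        }
    }

    static long Run(params int[] positions)
    {
        var sw = Stopwatch.StartNew();
        Task[] tasks = Array.ConvertAll(positions,
            p => Task.Run(() => UpdateCounter(p)));
        Task.WaitAll(tasks);
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        // Positions 0..3 share one 64-byte cache line (false sharing);
        // positions 16, 32, 48, 64 fall in different cache lines.
        Console.WriteLine($"same cache line:       {Run(0, 1, 2, 3)} ms");
        Console.WriteLine($"different cache lines: {Run(16, 32, 48, 64)} ms");
    }
}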
VII. The diversity of hardware
Even if you know the general behavior of caches, there are still many differences between hardware implementations, and different CPUs reward different optimizations.
For example, on some processors the L1 cache can handle two accesses in parallel if they touch cache lines in different cache banks, but must handle them serially if the cache lines belong to the same bank.
The following code:
private static int A, B, C, D, E, F, G;

private static void Weirdness()
{
    for (int i = 0; i < 200000000; i++)
    {
        <something>
    }
}
Filling in <something> with three different statements gives the following results:
<something>              Time
A++; B++; C++; D++;      719 ms
A++; C++; E++; G++;      448 ms
A++; C++;                518 ms
What causes these results? I'm curious, but I don't know. If anyone can explain it clearly, I'd be happy to learn.
In short, different vendors ship different hardware, and each piece of hardware has its own complexity.
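For completeness, here is a minimal harness for reproducing one of the measurements above; the Stopwatch timing, the Main wrapper, and the choice of which variant to fill in are my own reconstruction.

using System;
using System.Diagnostics;

class WeirdnessDemo
{
    private static int A, B, C, D, E, F, G;

    private static void Weirdness()
    {
        for (int i = 0; i < 200000000; i++)
        {
            // One of the three variants from the table above:
            A++; C++; E++; G++;
        }
    }

    static void Main()
    {
        var sw = Stopwatch.StartNew();
        Weirdness();
        sw.Stop();
        Console.WriteLine($"{sw.ElapsedMilliseconds} ms");
    }
}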