Original post: https://coolshell.cn/articles/10249.html
The CPU cache has always been an important topic for understanding computer architecture, and also a technical difficulty in designing concurrent programs, yet the related reference material tends to be voluminous, dry, and hard to get into. Recently someone online recommended a blog post by Microsoft engineer Igor Ostrovsky, "Gallery of Processor Cache Effects". The article not only explains the principles of the CPU cache with seven simple source-code examples, it also backs them up with quantitative charts. This kind of case-based teaching is exactly my cup of tea, so I could not resist translating it in haste for fellow readers.
Original address: Gallery of Processor Cache Effects
Most readers know that a cache is a small, fast memory that stores recently accessed memory locations. That description is accurate, but knowing the "boring" details of how processor caches work helps a great deal in understanding program performance.
In this blog post, I will use code examples to illustrate various aspects of how caches work and their impact on the performance of real-world programs.
The examples below are written in C#, but the choice of language has little effect on how the programs behave or on the conclusions drawn from them.
Example 1: Memory accesses and performance
How much faster do you expect Loop 2 to run, compared to Loop 1?
int[] arr = new int[64 * 1024 * 1024];

// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;

// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
The first loop multiplies every value in the array by 3; the second loop multiplies only every 16th value. The second loop does only about 6% of the work of the first, yet on modern machines the two loops take about the same time: 80 and 78 milliseconds respectively on my machine.
The reason both loops take the same amount of time has to do with memory. Their running time is dominated by the memory accesses to the array, not by the integer multiplications. And, as the second example will explain, the hardware performs the same number of main-memory accesses for both loops.
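If you want to reproduce the measurement yourself, a minimal timing harness might look like the sketch below. The harness, class name, and use of Stopwatch are my additions; the two loops are the ones from the article.

using System;
using System.Diagnostics;

class Example1
{
    static void Main()
    {
        int[] arr = new int[64 * 1024 * 1024];

        // Loop 1: multiply every element.
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < arr.Length; i++) arr[i] *= 3;
        sw.Stop();
        Console.WriteLine("Loop 1: {0} ms", sw.ElapsedMilliseconds);

        // Loop 2: multiply one element per 64-byte cache line.
        sw.Restart();
        for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
        sw.Stop();
        Console.WriteLine("Loop 2: {0} ms", sw.ElapsedMilliseconds);
    }
}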
Example 2: Impact of cache lines
Let's explore this example further. We will try different loop steps, not just 1 and 16.
for (int i = 0; i < arr.Length; i += K) arr[i] *= 3;
Here are the running times of this loop for different step values (K):
Notice that while the step is in the range from 1 to 16, the running time of the loop hardly changes. But from 16 onward, every time the step doubles, the running time halves.
The reason behind this is that today's CPUs do not access memory byte by byte. Instead, they fetch memory in chunks of (typically) 64 bytes, called cache lines. When you read a particular memory location, the entire cache line is fetched from main memory into the cache, and accessing other values within that same cache line is cheap.
Since 16 integers take up 64 bytes (one cache line), for-loops with a step between 1 and 16 must touch the same number of cache lines: all of the cache lines in the array. But once the step is 32, we touch only roughly every other cache line, and with a step of 64 only every fourth.
Understanding cache lines can be important for certain kinds of program optimization. For example, the alignment of data may determine whether an operation touches one or two cache lines. As the example above shows, in the misaligned case the operation can be twice as slow.
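As a sketch of how the step-size experiment could be driven (my addition; MeasureStep is a hypothetical helper and the range of K values is an assumption):

using System;
using System.Diagnostics;

class Example2
{
    // Time one pass over the array with the given step, in milliseconds.
    static long MeasureStep(int[] arr, int step)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < arr.Length; i += step) arr[i] *= 3;
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        int[] arr = new int[64 * 1024 * 1024];
        for (int k = 1; k <= 1024; k *= 2)
            Console.WriteLine("K = {0,4}: {1} ms", k, MeasureStep(arr, k));
    }
}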
Example 3: L1 and L2 cache sizes
Today's computers come with two or three levels of cache, usually called L1, L2, and possibly L3 (translator's note: if you are unfamiliar with multi-level caches, you may want to read up on the topic first). If you want to know the sizes of the different caches, you can use the SysInternals tool CoreInfo, or the Windows API call GetLogicalProcessorInformation. Both will tell you the cache line size as well as the sizes of the caches themselves.
On my machine, CoreInfo reports that I have a 32KB L1 data cache, a 32KB L1 instruction cache, and a 4MB L2 data cache. The L1 caches are per-core, while each L2 cache is shared between a pair of cores.
Logical Processor to Cache Map:
*--- Data Cache          0, Level 1,   32 KB, Assoc   8, LineSize  64
*--- Instruction Cache   0, Level 1,   32 KB, Assoc   8, LineSize  64
-*-- Data Cache          1, Level 1,   32 KB, Assoc   8, LineSize  64
-*-- Instruction Cache   1, Level 1,   32 KB, Assoc   8, LineSize  64
**-- Unified Cache       0, Level 2,    4 MB, Assoc  16, LineSize  64
--*- Data Cache          2, Level 1,   32 KB, Assoc   8, LineSize  64
--*- Instruction Cache   2, Level 1,   32 KB, Assoc   8, LineSize  64
---* Data Cache          3, Level 1,   32 KB, Assoc   8, LineSize  64
---* Instruction Cache   3, Level 1,   32 KB, Assoc   8, LineSize  64
--** Unified Cache       1, Level 2,    4 MB, Assoc  16, LineSize  64
(Translator's note: the author's platform is a quad-core machine, so the L1 caches are numbered 0-3, with one data and one instruction cache per core; the L2 caches are unified and shared, one per pair of cores, numbered 0-1. The Assoc (associativity) field is explained in a later example.)
Let's verify these numbers with an experiment. We iterate over an array of integers, incrementing every 16th value, which is a cheap way of touching every cache line. When we reach the last value, we start over from the beginning. We test different array sizes, and we should see a sharp drop in performance whenever the array overflows one of the cache levels.
int steps = 64 * 1024 * 1024; // Arbitrary number of steps
int lengthMod = arr.Length - 1;
for (int i = 0; i < steps; i++)
{
    arr[(i * 16) & lengthMod]++; // (x & lengthMod) is equal to (x % arr.Length)
}
Here is the resulting running-time chart:
You can see distinct drops in performance after 32KB and after 4MB, which are exactly the sizes of the L1 and L2 caches on my machine.
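A possible driver for this experiment (my addition; the range of array sizes is an assumption, and the sizes must be powers of two for the & lengthMod trick to work):

using System;
using System.Diagnostics;

class Example3
{
    static long Measure(int[] arr)
    {
        int steps = 64 * 1024 * 1024;    // arbitrary number of steps
        int lengthMod = arr.Length - 1;  // valid because Length is a power of two
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < steps; i++)
            arr[(i * 16) & lengthMod]++; // touch one value per cache line, wrapping around
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        // Array sizes from 4 KB to 64 MB, doubling each time.
        for (int length = 1024; length <= 16 * 1024 * 1024; length *= 2)
            Console.WriteLine("{0,10} bytes: {1} ms", length * sizeof(int), Measure(new int[length]));
    }
}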
Example 4: Instruction-level parallelism
Now let's look at something different. Which of the following two loops would you expect to be faster?
int steps = 256 * 1024 * 1024;
int[] a = new int[2];

// Loop 1
for (int i = 0; i < steps; i++) { a[0]++; a[0]++; }

// Loop 2
for (int i = 0; i < steps; i++) { a[0]++; a[1]++; }
It turns out that the second loop is about twice as fast as the first, at least on the machines I tested. Why? It comes down to the dependencies between the operations in the two loop bodies.
In the first loop, the operations depend on each other (translator's note: each operation must wait for the previous one to complete):
In the second case, however, the dependencies are different:
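The original post illustrates the two dependency chains with a diagram; roughly, one iteration of each loop expands to something like this (my paraphrase of the diagram, with x and y standing in for CPU registers):

// Loop 1: every step depends on the result of the previous one.
x = a[0]; x++; a[0] = x;
y = a[0]; y++; a[0] = y;   // must wait for the store to a[0] above

// Loop 2: the two chains are independent and can overlap.
x = a[0]; x++; a[0] = x;
y = a[1]; y++; a[1] = y;   // no dependency on the a[0] chain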
Modern processors have a certain amount of parallelism across different parts of an instruction stream (translator's note: thanks to pipelining; the Pentium, for example, has the U and V pipelines, described at the end of this post). That is what allows the CPU to access two memory locations in L1 at the same time, or to perform two simple arithmetic operations at once. In the first loop the processor cannot exploit this instruction-level parallelism; in the second loop it can.
[Original UPDATE]: Many people on Reddit asked about compiler optimizations, e.g. whether {a[0]++; a[0]++;} could be optimized to {a[0] += 2;}. In fact, neither the C# compiler nor the CLR JIT performs this optimization for array accesses. I compiled all the tests in Release mode (i.e. with optimizations), but I inspected the JIT-generated assembly to verify that optimizations were not skewing the results.
Example 5: Cache associativity
One key decision in cache design is whether each chunk of main memory can be stored in any cache slot, or only in some of them (translator's note: a slot here is a cache line).
There are three possible ways of mapping memory chunks to cache slots:
- Direct mapping (direct mapped cache)
Each memory chunk can be stored in only one particular cache slot. A simple scheme maps chunk index chunk_index to slot (chunk_index % cache_slots). Two memory chunks that map to the same slot cannot be in the cache at the same time. (Translator's note: chunk_index can be computed as physical address / cache line size in bytes.)
- N-way set associative cache
Each memory chunk can be stored in any one of N particular slots in the cache. For example, in a 16-way cache, each memory chunk can be stored in 16 different cache slots. In general, chunks whose indices share the same low-order bits will share the same group of 16 slots. (Translator's note: "the same low-order address bits" means addresses spaced a fixed stride apart.)
- Fully associative cache
Each memory chunk can be stored in any cache slot. Effectively, the cache operates like a hash table.
Direct-mapped caches can suffer from collisions: when multiple values compete for the same cache slot they keep evicting each other, and the hit rate plummets. On the other hand, fully associative caches are complicated and costly to implement in hardware. N-way set associativity is the typical compromise used in processor caches: it offers a good trade-off between circuit simplicity and a high hit rate.
(This figure was added by the translator. Direct mapping and full associativity can be viewed as the two extremes of N-way set associativity: with N = 1 it is direct mapping, and with N at its maximum it is fully associative. The reader can picture the direct-mapped case as described in the reference material.)
For example, the 4MB L2 cache on my machine is 16-way set associative. All 64-byte memory chunks are partitioned into sets, and chunks that map to the same set compete for the 16 slots of that set in the L2 cache.
The L2 cache has 65,536 cache lines (translator's note: 4MB / 64), and with 16 lines per set that gives 4,096 sets. Which set a chunk belongs to is determined by the low 12 bits of its chunk index (2^12 = 4,096). As a consequence, cache lines whose physical addresses differ by a multiple of 262,144 bytes (4,096 * 64) compete for the same set of slots, and my machine can keep at most 16 such lines cached at once. (Translator's note: to see this, extend the 2-way associative picture: one chunk index covers 64 bytes; chunk 0 maps to any slot of set 0, chunk 1 to any slot of set 1, and so on up to chunk 4095, which maps to any slot of set 4095. Chunk 4096 has the same low 12 bits as chunk 0, so chunks 4096, 8192, ... compete with chunk 0 for the slots of set 0. Their addresses are 262,144 bytes apart, and at most 16 of them can be cached at a time; beyond that, one of the chunks gets evicted.)
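As a small worked example (my addition), the set arithmetic described above can be written out as follows; the 4 MB size, 16 ways, and 64-byte line size are taken from the text:

static class L2SetIndex
{
    const int LineSize  = 64;                          // bytes per cache line
    const int Ways      = 16;                          // 16-way set associative
    const int CacheSize = 4 * 1024 * 1024;             // 4 MB L2 cache
    const int Sets      = CacheSize / LineSize / Ways; // = 4,096 sets

    // Two addresses compete for the same 16 slots when their set indices are
    // equal, which happens when they differ by a multiple of
    // Sets * LineSize = 262,144 bytes.
    public static int SetIndex(long physicalAddress)
    {
        return (int)((physicalAddress / LineSize) % Sets);
    }
}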
To make the effects of cache associativity visible, I need to repeatedly access more than 16 elements from the same set. I demonstrate this with the following method:
public static long UpdateEveryKthByte(byte[] arr, int K)
{
    Stopwatch sw = Stopwatch.StartNew();
    const int rep = 1024 * 1024; // Number of iterations – arbitrary
    int p = 0;
    for (int i = 0; i < rep; i++)
    {
        arr[p]++;
        p += K;
        if (p >= arr.Length) p = 0;
    }
    sw.Stop();
    return sw.ElapsedMilliseconds;
}
This method steps through the array K bytes at a time and wraps back to the beginning when it reaches the end. It stops after running long enough (2^20 iterations).
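A sketch (my addition) of how the data for the chart might be gathered by sweeping array sizes and step values; the exact ranges are assumptions, and the snippet assumes it lives in the same class as UpdateEveryKthByte:

// Array sizes in 1 MB increments, steps in multiples of 64 bytes.
for (int mb = 1; mb <= 32; mb++)
{
    byte[] arr = new byte[mb * 1024 * 1024];
    for (int k = 64; k <= 576; k += 64)
        Console.Write("{0,6}", UpdateEveryKthByte(arr, k));
    Console.WriteLine("   // {0} MB array", mb);
}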
I ran UpdateEveryKthByte() with different array sizes (in 1MB increments) and different step values. Here is the resulting chart, with blue indicating longer running times and white indicating shorter ones:
The blue regions (longer times) are the cases where, as we repeatedly iterate over the array, the updated values cannot all stay in the cache at the same time. The light blue areas correspond to about 80 milliseconds and the white areas to about 10 milliseconds.
Let's explain the blue parts of the chart:
1. Why are there vertical lines? A vertical line marks a step value that touches too many memory locations (more than 16) within the same set. For those steps, my machine cannot keep all of the touched values in its 16-way associative cache at the same time.
Some of the bad step values are powers of two: 256 and 512. For example, consider traversing an 8MB array with step 512: there are 32 elements spaced 262,144 bytes apart, and every pass of the loop updates all 32 of them, since 512 divides 262,144 evenly. (Translator's note: the step here is measured in bytes.)
Since 32 is greater than 16, those 32 elements keep competing for the 16 slots of one set in the cache.
(Translator's note: why is the vertical line for step 512 darker than the one for step 256? With the same number of iterations, step 512 hits the competing chunk indices more often than step 256. For instance, crossing a 262,144-byte boundary takes 512 steps at step size 512 but 1,024 steps at step size 256, so out of 2^20 iterations, step 512 makes roughly 2,048 accesses to competing chunks while step 256 makes about 1,024. The worst step values are multiples of 262,144, because then every iteration causes a cache line eviction.)
Some step values that are not powers of two are simply unlucky and end up accessing a disproportionate number of elements from the same set; these also show up as blue lines.
2. Why do the vertical lines stop at the 4MB array length? Because for arrays of 4MB or less, a 16-way associative cache behaves no worse than a fully associative cache.
A 16-way associative cache can hold at most 16 cache lines that are 262,144 bytes apart, and within 4MB there is no group of 17 or more cache lines aligned on 262,144-byte boundaries, since 16 * 262,144 = 4,194,304.
3. Why is there a blue triangle in the upper left corner? In the triangular region we cannot keep all the needed data in the cache at the same time, not because of associativity, but simply because of the L2 cache size.
For example, consider traversing a 16MB array with step 128: we update every 128th byte, which means touching every other 64-byte memory chunk. To cache every other cache line of a 16MB array we would need 8MB of cache, but my machine has only 4MB (translator's note: so evictions, and therefore delays, are inevitable).
Even if the 4MB cache on my machine were fully associative, it still could not hold 8MB of data.
4. Why does the triangle fade out on its left side? Note the 0-64 byte section on the left: that is exactly one cache line! As examples 1 and 2 showed, additional accesses to the same cache line cost almost nothing. For example, with a step of 16 bytes it takes 4 steps to reach the next cache line, so four memory accesses incur the cost of only one miss.
Since the number of iterations is the same in every test case, cheaper steps result in shorter running times.
Extending the chart to larger parameters gives the following picture:
Cache associativity is interesting to understand and can certainly be demonstrated, but compared with the other issues discussed in this article it is rarely the first thing you need to worry about when programming.
Example 6: False sharing of cache lines (false sharing)
On multi-core machines, caches run into another problem: coherence. Different cores have fully or partially separate caches. On my machine the L1 caches are separate (as is common), and there are two pairs of cores, each pair sharing an L2 cache. The details vary from machine to machine, but a modern multi-core machine will have a multi-level cache hierarchy in which the faster, smaller caches are private to individual cores.
When one core modifies a value in its own cache, the other cores can no longer use their old copy of that value, because its memory location is invalidated in their caches. And since caches operate at the granularity of cache lines rather than individual bytes, the entire cache line is invalidated in all of those caches!
To demonstrate the problem, consider the following example:
private static int[] s_counter = new int[1024];

private void UpdateCounter(int position)
{
    for (int j = 0; j < 100000000; j++)
    {
        s_counter[position] = s_counter[position] + 3;
    }
}
On my quad-core machine, if I call UpdateCounter from four different threads with the arguments 0, 1, 2, 3, it takes 4.3 seconds for all of them to finish.
On the other hand, if I pass in 16, 32, 48, 64, the whole operation takes only 0.28 seconds!
Why? In the first case, the four values are very likely to end up in the same cache line. Every time one core increments its counter, it invalidates the cache line holding all four counters, so the next time the other cores access their own counters (translator's note: the array is a shared private static field; each thread only updates its own element) they miss the cache. This multi-threaded behaviour effectively disables caching and cripples performance.
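A sketch (my addition) of the four-thread experiment, using Tasks rather than raw threads; it assumes RunThreads lives in the same class as the UpdateCounter method above:

using System.Diagnostics;
using System.Threading.Tasks;

// Positions 0,1,2,3 fall within one 64-byte cache line; positions 16,32,48,64
// are 64 bytes apart (16 ints * 4 bytes) and land on separate cache lines.
private long RunThreads(int[] positions)
{
    var sw = Stopwatch.StartNew();
    Task[] tasks = new Task[positions.Length];
    for (int t = 0; t < positions.Length; t++)
    {
        int pos = positions[t];                  // capture a copy of the loop variable
        tasks[t] = Task.Run(() => UpdateCounter(pos));
    }
    Task.WaitAll(tasks);
    sw.Stop();
    return sw.ElapsedMilliseconds;
}

// Usage:
//   RunThreads(new[] { 0, 1, 2, 3 });       // counters share a cache line: slow
//   RunThreads(new[] { 16, 32, 48, 64 });   // one counter per cache line: fast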
Example 7: Hardware complexity
Even when you understand the basics of how caches work, the hardware can still surprise you at times. Processors differ in the optimizations, heuristics, and subtle details of how they work.
On some processors, the L1 cache can process two accesses in parallel if they hit different banks, but accesses to the same bank must be handled serially. Processors can also surprise you with clever optimizations. For example, the false-sharing example above did not run well on several machines without fine-tuning, yet my home machine was able to optimize the simplest case and reduce the cache invalidations.
Here is an odd example of "hardware weirdness":
private static int A, B, C, D, E, F, G;

private static void Weirdness()
{
    for (int i = 0; i < 200000000; i++)
    {
        // do something...
    }
}
When I put three different statements inside the loop body, I get the following running times:
Operation              Time
A++; B++; C++; D++;    719 ms
A++; C++; E++; G++;    448 ms
A++; C++;              518 ms
Incrementing the fields A, B, C, D takes longer than incrementing A, C, E, G. Even stranger, incrementing only A and C takes longer than incrementing A, C, E, and G!
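For anyone who wants to reproduce the measurement, a self-contained version of the benchmark might look like this (my reconstruction; the original article only shows the loop skeleton and the result table):

using System;
using System.Diagnostics;

class Example7
{
    private static int A, B, C, D, E, F, G;

    static void Main()
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 200000000; i++) { A++; B++; C++; D++; }
        sw.Stop();
        Console.WriteLine("A++; B++; C++; D++;  {0} ms", sw.ElapsedMilliseconds);

        sw.Restart();
        for (int i = 0; i < 200000000; i++) { A++; C++; E++; G++; }
        sw.Stop();
        Console.WriteLine("A++; C++; E++; G++;  {0} ms", sw.ElapsedMilliseconds);

        sw.Restart();
        for (int i = 0; i < 200000000; i++) { A++; C++; }
        sw.Stop();
        Console.WriteLine("A++; C++;            {0} ms", sw.ElapsedMilliseconds);

        // Read the fields afterwards so the JIT cannot discard the loops.
        Console.WriteLine(A + B + C + D + E + F + G);
    }
}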
I am not sure of the reasons behind these numbers, but I suspect they are related to memory banks. If anyone can explain them, I am all ears.
The lesson of this example is that it is very hard to fully predict hardware behaviour. You can predict a lot, but in the end it is crucial to measure and verify your assumptions.
A reply regarding the 7th example
Goz: I asked an Intel engineer about the final example and got the following reply:
"It is clear that this involves the termination of instructions in the execution unit, the speed at which the machine handles storage-hit-load, and how to quickly and gracefully handle the cyclic unfolding of exploratory execution (for example, if there are multiple loops due to internal conflicts). But that means you need a very detailed line tracker and simulator to figure it out. It is extremely difficult to predict the chaotic order in the assembly line on paper, even if it is the person who designs the chip. For the layman, no way, sorry! ”
P.S. Personal thoughts: the principle of locality and pipeline parallelism
Program execution exhibits locality in time and space. Temporal locality means that once a value has been brought into the cache it is likely to be referenced again soon; spatial locality means that values near it in memory are brought into the cache along with it and are likely to be used too. If you pay attention to the principle of locality when programming, you will be rewarded with better performance.
For example, in C you should minimise references to static variables, because static variables live in the global data segment: in a function that is called repeatedly, referencing them may require bringing them into the cache again and again, whereas local variables allocated on the stack can usually be found in the cache on every call, because the stack is reused with very high frequency.
Another example: the code inside a loop body should be as compact as possible, because code lives in the instruction cache, and the L1 instruction cache is only a few kilobytes in size; if a piece of code that is executed many times is larger than the L1 instruction cache, the caching advantage is lost.
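As an illustration of spatial locality (my addition, not part of the original article), compare traversing a two-dimensional array in row-major versus column-major order; .NET stores rectangular arrays row by row, so the first traversal stays within cache lines while the second keeps jumping across them:

using System;
using System.Diagnostics;

class LocalityDemo
{
    static void Main()
    {
        const int N = 4096;
        int[,] m = new int[N, N];

        // Row-major: consecutive accesses fall within the same cache line.
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                m[i, j]++;
        sw.Stop();
        Console.WriteLine("row-major:    {0} ms", sw.ElapsedMilliseconds);

        // Column-major: each access jumps N * 4 bytes to a different cache line.
        sw.Restart();
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                m[i, j]++;
        sw.Stop();
        Console.WriteLine("column-major: {0} ms", sw.ElapsedMilliseconds);
    }
}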
As for CPU pipeline parallelism: put simply, the Intel Pentium processor has two pipelines, U and V. Each pipeline can read and write the cache independently, so two instructions can be executed in a single clock cycle. The two pipelines are not equivalent, though: the U pipeline can handle the full instruction set, while the V pipeline can only handle simple instructions.
CPU instructions are usually divided into four categories. The first category contains the common simple instructions, such as mov, nop, push, pop, add, sub, and, or, xor, inc, dec, cmp and lea; these can execute in either pipeline, and as long as they have no dependencies on each other they can be executed in parallel.
The second category of instructions must cooperate with the other pipeline. Some carry and shift operations, for instance, can run in the U pipeline while other instructions run concurrently in the V pipeline, but if they end up in the V pipeline, the U pipeline has to stall.
The third category consists of the jump instructions, such as jmp, call and conditional branches. In contrast to the second category, they can pair with the U pipeline only when they run in the V pipeline; otherwise they monopolise the CPU.
The fourth category contains the remaining complex instructions. They are rarely used in practice, because all of them monopolise the CPU.
If you program at the assembly level and want to achieve instruction-level parallelism, you have to pay attention to how instructions pair up: use the first category as much as possible, avoid the fourth, and reduce sequential dependencies between instructions.
Resources
The CPU cache article on Wikipedia (Chinese edition).
A teaching demo of cache mapping and hit-rate calculation written by teachers and students at Shanghai Jiaotong University; it simulates cache mapping and hit probability under the different associativity schemes, and is quite visual and intuitive.
"CPU Cache and Memory Ordering", a home-made slide deck by NetEase database expert @何_登成; it is packed with information!
A public computer-architecture teaching slide deck from Nanjing University. Friendly hint: changing the number after "lecture" in the URL switches to other lectures ;-)
(End of full text)