Seven examples for learning about CPU caches

Source: Internet
Author: User
Tags: intel, pentium

CPU caching has long been essential knowledge for understanding computer architecture, and it is also one of the technical difficulties in concurrent program design, yet the reference material on it is as vast as a sea of stars, reads dryly, and is hard to get into. Recently someone online recommended a blog post by Microsoft engineer Igor Ostrovsky, "Gallery of Processor Cache Effects." The article not only explains the principles of the CPU cache with 7 of the simplest possible source code examples, it also backs them up with quantitative charts. I find this kind of case-based teaching exactly to my taste, so I could not resist translating it for you, the reader.

Original article: Gallery of Processor Cache Effects

Most readers know that a cache is a small, fast memory that stores recently accessed memory locations. That description is reasonable and accurate, but knowing more of the "annoying" details of how processor caches work helps a great deal in understanding program performance.

In this blog post, I'll use code examples to illustrate various aspects of how caches work and their impact on the performance of real-world programs.

The examples below are written in C#, but the choice of language has little effect on the measurements or on the conclusions drawn.

Example 1: Memory accesses and performance

How much faster do you think Loop 2 will run compared to Loop 1?

int[] arr = new int[64 * 1024 * 1024];

// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;

// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;

The first loop multiplies every value in the array by 3; the second multiplies only every 16th value by 3. The second loop does only about 6% of the work of the first, yet on modern machines the two run in almost the same time: 80 and 78 milliseconds respectively on my machine.

The reason both loops take the same time has to do with memory. The running time of these loops is dominated by the memory accesses to the array, not by the integer multiplications. And, as the second example will show, the hardware performs the same number of main memory accesses for both loops.
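For readers who want to reproduce the numbers, here is a minimal timing sketch (the Stopwatch harness is my addition, not part of the original post; it assumes using System; and using System.Diagnostics;):

int[] arr = new int[64 * 1024 * 1024];

Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;        // Loop 1: touches every element
sw.Stop();
Console.WriteLine("Loop 1: {0} ms", sw.ElapsedMilliseconds);

sw.Restart();
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;    // Loop 2: touches every 16th element
sw.Stop();
Console.WriteLine("Loop 2: {0} ms", sw.ElapsedMilliseconds);

Absolute times will of course differ from machine to machine; the point is that the two numbers come out nearly equal.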

Example 2: Impact of cache lines

Let's explore this effect further. We'll try other step values, not just 1 and 16:

for (int i = 0; i < arr.Length; i += K) arr[i] *= 3;

The following figure shows the running time of this loop for different step values (K):

Notice that the running time hardly changes while the step is in the 1 to 16 range. But from 16 on, each time the step doubles, the running time halves.

The reason behind this is that today's CPUs do not access memory byte by byte. Instead, they fetch memory in chunks of (typically) 64 bytes, called cache lines. When you read a particular memory location, the entire cache line containing it is brought from main memory into the cache, and accessing other values within that same cache line afterwards is very cheap.

Since 16 ints occupy 64 bytes (one cache line), for loops with a step between 1 and 16 must touch the same number of cache lines: all of the cache lines in the array. Once the step is 32, we touch only every other cache line, and with a step of 64, only every fourth.

Understanding cache lines can matter for certain kinds of program optimization. For example, whether data is aligned may determine whether an operation touches one cache line or two. As the example above shows, in the misaligned case the operation can be twice as slow.
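A minimal sketch of how the stride experiment can be driven (the harness and the particular K values are my additions; it assumes using System; and using System.Diagnostics;):

int[] arr = new int[64 * 1024 * 1024];
for (int K = 1; K <= 1024; K *= 2)
{
    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < arr.Length; i += K) arr[i] *= 3;
    sw.Stop();
    Console.WriteLine("K = {0,4}: {1} ms", K, sw.ElapsedMilliseconds);
}

Even though the number of iterations shrinks as K grows, the measured time stays roughly flat up to K = 16, which is exactly the cache line effect described above.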

Example 3: L1 and L2 cache sizes

Today's computers come with two or three levels of cache, usually called L1, L2, and possibly L3. If you don't know what a level-2 cache is, it is worth reading an introduction first. If you want to know the sizes of the caches on your machine, you can use the SysInternals tool CoreInfo or the Windows API call GetLogicalProcessorInformation. Both will tell you the cache line size as well as the sizes of the caches themselves.

On my machine, CoreInfo reports a 32KB L1 data cache, a 32KB L1 instruction cache, and a 4MB L2 data cache. Each L1 cache is private to one core, while each L2 cache is shared between a pair of cores:

Logical Processor to Cache Map:
*---  Data Cache          0, Level 1,   32 KB, Assoc   8, LineSize  64
*---  Instruction Cache   0, Level 1,   32 KB, Assoc   8, LineSize  64
-*--  Data Cache          1, Level 1,   32 KB, Assoc   8, LineSize  64
-*--  Instruction Cache   1, Level 1,   32 KB, Assoc   8, LineSize  64
**--  Unified Cache       0, Level 2,    4 MB, Assoc  16, LineSize  64
--*-  Data Cache          2, Level 1,   32 KB, Assoc   8, LineSize  64
--*-  Instruction Cache   2, Level 1,   32 KB, Assoc   8, LineSize  64
---*  Data Cache          3, Level 1,   32 KB, Assoc   8, LineSize  64
---*  Instruction Cache   3, Level 1,   32 KB, Assoc   8, LineSize  64
--**  Unified Cache       1, Level 2,    4 MB, Assoc  16, LineSize  64

(Translator's note: the author's platform is a quad-core machine, so the L1 caches are numbered 0 to 3, one data and one instruction cache per core; the two L2 unified caches are each shared by a pair of cores and numbered 0 to 1. The Assoc (associativity) field is explained in a later example.)

Let's verify these numbers with an experiment. We traverse an array of integers, incrementing every 16th value, a frugal way to touch every cache line. When we reach the last value, we loop back to the beginning. We'll try different array sizes, and we should see drops in performance at the points where an array no longer fits in a given cache level.

int steps = 64 * 1024 * 1024;   // arbitrary number of steps
int lengthMod = arr.Length - 1;
for (int i = 0; i < steps; i++)
{
    arr[(i * 16) & lengthMod]++;   // (x & lengthMod) is equal to (x % arr.Length)
}

The following figure is the running time chart:

You can see clear performance drops after 32KB and after 4MB, exactly the sizes of the L1 and L2 caches on my machine.
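For reference, a sketch of a driver over different array sizes (the power-of-two sizes from 1KB to 64MB are my assumption; power-of-two lengths are required for the & lengthMod trick; assumes using System; and using System.Diagnostics;):

for (int size = 1024; size <= 64 * 1024 * 1024; size *= 2)
{
    int[] arr = new int[size / 4];        // size is in bytes; an int is 4 bytes
    int lengthMod = arr.Length - 1;       // valid because arr.Length is a power of two
    int steps = 64 * 1024 * 1024;
    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < steps; i++)
        arr[(i * 16) & lengthMod]++;      // touch one int per cache line
    sw.Stop();
    Console.WriteLine("{0,10} bytes: {1} ms", size, sw.ElapsedMilliseconds);
}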

Example 4: Instruction-level parallelism

Now let's look at something different. Which of these two loops do you think is faster?

int steps = 256 * 1024 * 1024;
int[] a = new int[2];

// Loop 1
for (int i = 0; i < steps; i++) { a[0]++; a[0]++; }

// Loop 2
for (int i = 0; i < steps; i++) { a[0]++; a[1]++; }

It turns out that the second loop is about twice as fast as the first, at least on the machine I tested. Why? It comes down to the dependencies between the operations in the two loop bodies.

In the first loop, the operations depend on each other (each increment depends on the result of the previous one):

But in the second example, the dependencies are different:

Modern processors can exploit a degree of parallelism between independent instructions (translator's note: thanks to pipelining; the Pentium processor, for example, has two pipelines, U and V, described later). This lets the CPU access two memory locations in L1 at the same time, or perform two simple arithmetic operations at once. In the first loop the processor cannot discover any such instruction-level parallelism, but in the second loop it can.

[Original update]: Many people on Reddit asked about compiler optimizations, e.g. whether { a[0]++; a[0]++; } could be optimized to { a[0] += 2; }. In fact, neither the C# compiler nor the CLR JIT performs this optimization on array accesses. I compiled all the tests in Release mode (i.e. with optimizations), but I inspected the JIT-generated assembly to verify that optimizations were not skewing the results.
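The same dependency-chain effect shows up elsewhere too. The following sketch is my own illustration, not from the original article: summing with a single accumulator forms one long dependent chain, while two accumulators give the processor independent work it can overlap (whether this yields a measurable speedup depends on the machine and on whether memory bandwidth is already the bottleneck):

int[] data = new int[16 * 1024];   // small enough to stay in cache

long sumSingle = 0;
for (int i = 0; i < data.Length; i++)
    sumSingle += data[i];          // each addition depends on the previous one

long sum0 = 0, sum1 = 0;
for (int i = 0; i < data.Length; i += 2)
{
    sum0 += data[i];               // two independent chains:
    sum1 += data[i + 1];           // the CPU can work on both at once
}
long sumTwo = sum0 + sum1;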

Example 5: Cache associativity

One of the key decisions in cache design is whether each chunk of main memory can be stored in any cache slot, or only in some of them (translator's note: a slot here means a cache line).

There are three possible ways to map memory chunks to cache slots:

Direct-mapped cache

Each memory chunk can be stored in only one particular slot in the cache. One simple scheme is to map a chunk to a slot via (chunk_index % cache_slots). Two memory chunks that map to the same slot cannot be in the cache at the same time. (Translator's note: chunk_index can be computed as physical address / cache line size in bytes.)

N-way set associative cache

Each memory chunk can be stored in any one of N particular slots in the cache. For example, in a 16-way cache, each memory chunk can go into any of 16 different slots. Typically, chunks whose indices share the same low-order bits all share the same set of 16 slots. (Translator's note: chunks with the same low-order address bits recur at a fixed stride through memory.)

Fully associative cache

Each memory chunk can be stored in any slot in the cache. Effectively, the cache operates like a hash table.

Direct-mapped caches suffer from conflicts: when multiple values compete for the same slot, they keep evicting each other and the hit rate plummets. A fully associative cache, on the other hand, is complicated and expensive to implement in hardware. N-way set associativity is the typical choice for processor caches, a good trade-off between circuit simplicity and a high hit rate.

(This figure was added by the translator. Direct mapping and full associativity can be seen as the two extremes of N-way set associativity: with N = 1 it is direct-mapped, and when N equals the total number of slots it is fully associative. The reader can picture the direct-mapped case from the figure; see the references for the precise definitions.)

For example, the 4MB L2 cache on my machine is 16-way set associative. All 64-byte memory chunks are partitioned into sets, and the chunks that map to the same set compete for the 16 slots of that set in the L2 cache.

The L2 cache has 65,536 cache lines (translator's note: 4MB / 64 bytes), and each set consists of 16 cache lines, so we get 4,096 sets. Which set a chunk belongs to thus depends on the low 12 bits of its chunk index (2^12 = 4,096). Consequently, cache lines whose physical addresses differ by a multiple of 262,144 bytes (4,096 * 64) map to the same set and compete for the same slots; my machine has at most 16 such slots per set. (Translator's note: refer to the 2-way associativity figure above. One chunk index corresponds to 64 bytes: chunk 0 maps to any slot in set 0, chunk 1 to any slot in set 1, and so on, up to chunk 4095 in set 4095. Chunk 0 and chunk 4096 share the same low 12 bits, so chunk 4096, chunk 8192, and so on compete with chunk 0 for the slots of set 0. Their addresses differ by multiples of 262,144 bytes, and at most 16 of them can be cached at once before one gets evicted.)
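To make the arithmetic concrete, here is a small sketch (my addition, using the numbers above: 64-byte lines and 4,096 sets) that computes which L2 set a given physical address falls into:

static int L2Set(long address)
{
    long chunkIndex = address / 64;    // cache line (chunk) index
    return (int)(chunkIndex % 4096);   // the low 12 bits of the chunk index select the set
}

// Addresses that differ by 262,144 bytes (4,096 * 64) land in the same set and therefore
// compete for the same 16 ways, e.g. L2Set(0) == L2Set(262144) == L2Set(524288) == 0.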

To make the effects of cache associativity apparent, I need to repeatedly access more than 16 elements from the same set. I demonstrate this with the following method:

public static long UpdateEveryKthByte(byte[] arr, int K)
{
    Stopwatch sw = Stopwatch.StartNew();
    const int rep = 1024 * 1024;   // number of iterations (arbitrary)
    int p = 0;
    for (int i = 0; i < rep; i++)
    {
        arr[p]++;
        p += K;
        if (p >= arr.Length) p = 0;
    }
    sw.Stop();
    return sw.ElapsedMilliseconds;
}

This method steps through the array K bytes at a time, wrapping back to the start when it reaches the end. It stops after a fixed, sufficiently large number of iterations (2^20).
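A driver along the following lines produces the data behind the chart below (the exact ranges of sizes and steps are my assumption; assumes using System;):

for (int mb = 1; mb <= 64; mb++)                       // array sizes in 1MB increments
{
    byte[] arr = new byte[mb * 1024 * 1024];
    for (int K = 64; K <= 2048; K += 64)               // byte steps to try
    {
        long ms = UpdateEveryKthByte(arr, K);
        Console.WriteLine("size {0,2} MB, step {1,4}: {2} ms", mb, K, ms);
    }
}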

I call UpdateEveryKthByte() with different array sizes (in 1MB increments) and different steps K. Here is the resulting chart; blue stands for longer running times, white for shorter ones:

The blue regions (longer times) are the cases where, as we keep iterating over the array, the updated values cannot all stay in the cache at the same time. The bright blue regions correspond to roughly 80 milliseconds, the white regions to roughly 10 milliseconds.

Let's explain the blue parts of the chart:

1. Why are there vertical lines? A vertical line shows up at step values that touch too many memory locations from the same set (more than 16). At those steps, my machine cannot keep all the touched values in its 16-way associative cache at the same time.

Some of the bad steps are powers of two: 256 and 512, for example. Consider traversing an 8MB array with a step of 512: there are 32 elements spaced 262,144 bytes apart, and every pass over the array updates all 32 of them, because 512 divides 262,144 evenly (translator's note: the step here is measured in bytes).

Since 32 is greater than 16, those 32 elements keep competing for the same 16 slots in the cache.

(Translator's note: why is the vertical line at step 512 darker than the one at step 256? With the same total number of iterations, step 512 hits the contested chunk indices twice as often as step 256. For example, crossing a 262,144-byte boundary takes 512 steps of size 512 but 1,024 steps of size 256, so with 2^20 iterations step 512 hits the competing chunks 2,048 times while step 256 hits them only 1,024 times. The worst case is a step that is a multiple of 262,144, because then every single iteration causes a cache line eviction.)

Some steps that are not powers of two are simply unlucky and end up accessing a disproportionate number of elements from the same set; they show up as vertical blue lines as well.

2. Why do the vertical lines stop at an array length of 4MB? For arrays of 4MB or less, a 16-way associative cache is just as good as a fully associative one.

A 16-way associative cache can hold at most 16 cache lines that are 262,144 bytes apart, and within a 4MB array there is no way to have 17 or more cache lines all aligned on 262,144-byte boundaries, because 16 * 262,144 = 4,194,304.

3. Why is there a blue triangle in the upper left corner? In the triangle region, we cannot keep all of the necessary data in the cache at once, not because of associativity, but simply because of the L2 cache size.

For example, consider traversing a 16MB array with a step of 128: every 128th byte is updated, which means we touch every other 64-byte cache line. To keep every other cache line of a 16MB array in the cache, we would need an 8MB cache, but my machine has only 4MB (translator's note: so conflicts, and hence delays, are unavoidable).

Even if the 4MB cache on my machine were fully associative, it still could not hold 8MB of data.

4. Why does the leftmost part of the triangle fade out? Notice the 0 to 64 byte range on the left, exactly one cache line! As examples 1 and 2 showed, additional accesses to the same cache line cost almost nothing. With a step of 16 bytes, for example, it takes 4 steps to reach the next cache line, so only 1 out of every 4 memory accesses carries real cost.

Since all test cases perform the same number of iterations, the runs with cheaper steps simply finish sooner.

Here is the chart extended to a wider range of parameters:

Cache associativity is interesting to understand and can be demonstrated, but compared with the other issues discussed in this article, it is certainly not the first thing you should worry about when programming.

Example 6: False sharing of cache lines

On multi-core machines, caches run into another problem: coherence. Different cores have fully or partially separate caches. On my machine, the L1 caches are separate (as is common), and there are two pairs of cores, each pair sharing an L2 cache. The details vary from machine to machine, but a modern multi-core machine will have a multi-level cache hierarchy in which the faster, smaller caches are private to individual cores.

When one core modifies a value in its own cache, the other cores can no longer use their old copies, because the corresponding memory location is invalidated in all of the other caches. And since caches operate at the granularity of cache lines rather than individual bytes, the entire cache line is invalidated in all of those caches!

To demonstrate the problem, consider the following example:

private static int[] s_counter = new int[1024];

private void UpdateCounter(int position)
{
    for (int j = 0; j < 100000000; j++)
    {
        s_counter[position] = s_counter[position] + 3;
    }
}

On my quad-core machine, if I call UpdateCounter with the arguments 0, 1, 2, 3 from four different threads, it takes 4.3 seconds for all of them to finish.

If, on the other hand, I pass in 16, 32, 48, 64, the whole operation takes 0.28 seconds!

Why? In the first case, the four values are very likely to end up on the same cache line. Every time one core increments its counter, it invalidates the cache line holding all four counters, so the next time the other cores access their own counters (note: each thread uses a different array element) they miss the cache. This multi-threaded interaction effectively disables caching and cripples the program's performance.
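For the record, a sketch of how the four-thread test might be set up (the Thread-based harness is mine; the original post only shows UpdateCounter; assumes using System; and that RunTest lives in the same class as UpdateCounter):

private void RunTest(int[] positions)
{
    var sw = System.Diagnostics.Stopwatch.StartNew();
    var threads = new System.Threading.Thread[positions.Length];
    for (int t = 0; t < positions.Length; t++)
    {
        int pos = positions[t];   // capture a copy for the lambda
        threads[t] = new System.Threading.Thread(() => UpdateCounter(pos));
        threads[t].Start();
    }
    foreach (var th in threads) th.Join();
    sw.Stop();
    Console.WriteLine("{0} ms", sw.ElapsedMilliseconds);
}

// RunTest(new[] { 0, 1, 2, 3 });       // four counters on one cache line: slow
// RunTest(new[] { 16, 32, 48, 64 });   // counters on separate cache lines: fast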

Example 7: Hardware complexity

Even when you understand the basics of how caches work, the hardware can still surprise you. Different processors have different optimizations, heuristics, and subtle quirks at work.

On some processors, the L1 cache can handle two accesses in parallel as long as they go to different banks, while accesses to the same bank must be processed serially. Processors' clever optimizations can surprise you, too: in the false sharing example, which used to perform poorly on some machines, my home machine manages to optimize the simplest case and reduce the cache invalidations.

Here is a strange example of "hardware weirdness":

private static int A, B, C, D, E, F, G;

private static void Weirdness()
{
    for (int i = 0; i < 200000000; i++)
    {
        // do something ...
    }
}

When I substitute three different statements for "do something", I get the following running times:

Operation                 Time
a++; b++; c++; d++;       719 ms
a++; c++; e++; g++;       448 ms
a++; c++;                 518 ms

Incrementing the fields a, b, c, d takes longer than incrementing a, c, e, g. And what's even stranger, incrementing just a and c takes longer than incrementing a, c, e and g!

I'm not sure what is behind these numbers, but I suspect it is related to memory banks. If anyone can explain them, I'm all ears.

The lesson of this example: it is very hard to fully predict hardware behavior. You can predict a lot, but in the end it is important to measure and verify your assumptions.

A reply to the 7th example

Goz: I asked an Intel engineer about the final example and got the following answer:

"Obviously this involves how the instructions in the execution unit terminate, how fast the machine handles the storage-hit-load, and how to handle the loop expansion of the exploratory execution quickly and gracefully (for example, whether it loops multiple times due to internal conflicts)." But that means you need a very meticulous assembly line tracker and a simulator to figure it out. It is extremely difficult to predict the order in the assembly line on paper, even for those who design the chip. For the layman, no way, sorry! "

P.S. Personal takeaways: the principle of locality and pipeline parallelism

Program execution exhibits locality in time and space. Temporal locality means that once a value has been brought into the cache, it is likely to be referenced again soon; spatial locality means that values adjacent to it in memory are brought in along with it and are likely to be referenced as well. If you consciously exploit locality when programming, you will be rewarded with better performance.
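A classic illustration of spatial locality (my example, not from the original article): summing a 2D array row by row follows the memory layout and is cache-friendly, while summing column by column keeps jumping across cache lines:

int n = 4096;
int[,] m = new int[n, n];   // C# multidimensional arrays are stored in row-major order
long sum = 0;

for (int i = 0; i < n; i++)         // row-major traversal: sequential memory accesses
    for (int j = 0; j < n; j++)
        sum += m[i, j];

for (int j = 0; j < n; j++)         // column-major traversal: each access lands on a
    for (int i = 0; i < n; i++)     // different cache line, so far fewer cache hits
        sum += m[i, j];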

For example, in C you should minimize references to static variables, because static variables are stored in the global data segment and, in a function that is called repeatedly, may have to be brought back into the cache again and again. Local variables, in contrast, are allocated on the stack, which is reused constantly, so the CPU can usually find them in the cache each time the function is called.

Similarly, keep loop bodies as compact as possible, because code lives in the instruction cache, and the L1 instruction cache is only a few kilobytes in size. If a piece of code that must be executed repeatedly is larger than the L1 instruction cache, the caching advantage is lost.

A quick note on CPU pipeline parallelism: the Intel Pentium processor has two pipelines, U and V, each of which can read and write the cache independently, so two instructions can execute simultaneously within one clock cycle. The two pipelines are not equivalent, though: the U pipeline can handle the full instruction set, while the V pipeline can handle only simple instructions.

CPU instructions are typically divided into four categories for pairing purposes. The first category consists of common simple instructions such as mov, nop, push, pop, add, sub, and, or, xor, inc, dec, cmp and lea; these can execute in either pipeline and, as long as they have no dependencies on each other, can be fully paired.

The second category must coordinate with the other pipeline, for example some carry and shift operations. If such an instruction is in the U pipeline, other instructions can still run in the V pipeline; if it is in the V pipeline, the U pipeline stalls.

The third category consists of jump-like instructions such as cmp, call and conditional branches. They are the opposite of the second category: when they execute in the V pipeline they can pair with an instruction in the U pipeline; otherwise they monopolize the CPU.

The fourth category is the remaining complex instructions; they are rarely used, since they can only monopolize the CPU.

If you program at the assembly level and want instruction-level parallelism, you have to pay attention to instruction pairing: prefer the first category of instructions, avoid the fourth, and reduce the dependencies between neighboring instructions.
