About the CPU Cache: What Programmers Need to Know

This article describes some CPU cache knowledge that every programmer or IT practitioner should know.

This article may be reprinted, but please retain this paragraph and place it at the top of the reprinted article. Author: Lu Junyi (Cenalulu). Original article address: http://cenalulu.github.io/linux/all-about-cpu-cache/

Let's take a look at a mind map of all the concepts in this article.

Why do we need a CPU cache?

As manufacturing processes have improved over the past few decades, CPU frequency has kept increasing. Main memory, however, is still mainly DRAM, and constrained by manufacturing process and cost, its access speed has seen no qualitative breakthrough. As a result, the gap between the CPU's processing speed and the memory's access speed has grown to several orders of magnitude. In this situation, a traditional CPU connected to memory directly through the FSB would spend much of its time waiting on memory accesses, leaving a large amount of computing resources idle and reducing overall CPU throughput. At the same time, because memory accesses concentrate on hot data, inserting a layer of faster but more expensive SRAM between the CPU and memory as a cache turns out to be very cost-effective.

Why do we need a multi-level CPU cache?

As technology has developed, the volume of hot data keeps growing, and simply increasing the size of the first-level cache gives very poor cost-effectiveness. So a second-level cache (L2 cache) appeared, sitting between the first-level cache (L1 cache) and main memory in both access speed and cost. Here is an explanation excerpted from "What Every Programmer Should Know About Memory":

Soon after the introduction of the cache, the system got more complicated. The speed difference between the cache and the main memory increased again, to a point that another level of cache was added, bigger and slower than the first-level cache. Only increasing the size of the first-level cache was not an option for economical reasons.

In addition, because program instructions and program data differ in access behavior and hotspot distribution, the L1 cache is split into two dedicated caches, L1i (i for instruction) and L1d (d for data). The following diagram shows the response-time gap between the cache levels and how slow main memory is by comparison!

What is a cache line?

A cache line can be understood simply as the minimum unit of storage in the CPU cache. The cache line size of current mainstream CPUs is 64 bytes. Assuming we have a 512-byte L1 cache, then with a 64-byte cache line, the number of cache lines this L1 cache can hold is 512/64 = 8. See the figure below for details:

To get a better understanding of the cache line, we can also do the following interesting experiment on our own computer.

The following C code receives a parameter from the command line and uses it as the size of an array it creates. It then accesses the array elements sequentially in a loop, one billion accesses in total, and finally prints the total execution time for each array size.

#include "stdio.h" #include <stdlib.h> #include <sys/time.h>LongTimediff(clock_tT1,clock_tT2){LongElapsed;Elapsed=((Double)T2-T1)/Clocks_per_sec*1000;ReturnElapsed;}IntMain(Intargc,Char*Argv[])#*******{IntArray_size=Atoi(Argv[1]);IntRepeat_times=1000000000;LongArray[Array_size];For(IntI=0;I<Array_size;I++){Array[I]=0;}IntJ=0;IntK=0;IntC=0;clock_tStart=Clock();While(J++<Repeat_times){If( k==array_size) {k =0} c = array[k< Span class= "o" >++} clock_t end =clock  (); printf ( "%lu\n" timediff (start endreturn 0;}        

If we plot this data as a line chart, we will find that the total execution time has a significant inflection point once the array size exceeds 64 bytes (of course, since the author ran the test on his own Mac notebook, it can be disturbed by other running programs and therefore fluctuates). The reason is that when the array is smaller than 64 bytes, it is very likely to fall entirely within a single cache line, so accessing one element fills the whole cache line and the following elements benefit from being served out of the cache. When the array is larger than 64 bytes, at least two cache lines are needed, and the loop's accesses trigger two cache-line fills; because filling a cache line takes much longer than a cached access, the extra fills are magnified over a billion iterations and show up clearly in the total execution time. If the reader is interested, the program can be compiled on Linux or a Mac with gcc cache_line_size.c -o cache_line_size and run with ./cache_line_size; the results are shown in the figure below.
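
If you just want to check your machine's cache line size without a timing experiment, glibc on Linux exposes it through sysconf. The snippet below is a minimal sketch assuming Linux with glibc: _SC_LEVEL1_DCACHE_LINESIZE and _SC_LEVEL1_DCACHE_SIZE are glibc extensions and may return 0 or be missing on other platforms; on a Mac the same number can be read with "sysctl hw.cachelinesize".

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* glibc extensions; treat the results as hints, not guarantees. */
    long line_size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    long l1d_size  = sysconf(_SC_LEVEL1_DCACHE_SIZE);

    printf("L1d cache line size: %ld bytes\n", line_size);
    printf("L1d cache size     : %ld bytes\n", l1d_size);
    return 0;
}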

How does the concept of the cache line help us as programmers? Let's look at a commonly used loop optimization in C. Of the following two pieces of code, the first is always faster than the second. The reason should be easy to understand after carefully reading the introduction to cache lines above: C stores two-dimensional arrays row by row, so the first loop walks memory sequentially and reuses each cache line it loads, while the second jumps across rows and wastes most of every cache line it touches.

for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        int num;
        //code
        arr[i][j] = num;
    }
}

for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        int num;
        //code
        arr[j][i] = num;
    }
}
How does the CPU cache store data, and how would you design its storage rules?

Let's try to answer that question first:

Suppose we have a 4MB area to use as a cache, and the unique identifier of each cached object is the physical memory address where it resides. Each cached object is 64 bytes, and the total size of all objects that could be cached (that is, the total size of physical memory) is 4GB. How would we design this cache?

If, like the author, you did not study fundamentals such as digital circuits very hard in college, the most obvious design that comes to mind is a hash table: design the cache as a hash array, use the hash of the memory address as the array index, and store the cached object as the array value. On every access, hash the address and then operate on the corresponding slot in the cache. A design like this is common in high-level languages and is clearly efficient there, because the time to compute a hash (on the order of ten thousand CPU cycles) is negligible compared to the program's other operations (millions of CPU cycles). For the CPU cache, however, the whole point of the design is to fetch data within a few dozen CPU cycles; if an access itself cost on the order of ten thousand cycles, it would be no better than fetching the data directly from memory. And of course the more important reason is that implementing a hash of memory addresses in hardware is extremely expensive.
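
Just to make the naive design concrete, here is a minimal sketch of the "hash array" cache described above as it might look in software. The table size, hash function, and struct layout are illustrative assumptions; the point of the paragraph is precisely that this approach does not carry over to hardware:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define OBJ_SIZE   64                     /* 64-byte cached objects          */
#define TABLE_SIZE 65536                  /* 4MB cache / 64B = 65536 slots   */

struct entry {
    int      valid;
    uint64_t addr;                        /* physical address = unique key   */
    uint8_t  data[OBJ_SIZE];
};

static struct entry table[TABLE_SIZE];

/* Illustrative multiplicative hash of the address; cheap in software,
 * but hashing on every memory access is far too expensive in hardware. */
static uint64_t hash_addr(uint64_t addr) {
    return ((addr / OBJ_SIZE) * 2654435761ULL) % TABLE_SIZE;
}

static const uint8_t *cache_lookup(uint64_t addr) {
    struct entry *e = &table[hash_addr(addr)];
    return (e->valid && e->addr == addr) ? e->data : NULL;
}

static void cache_insert(uint64_t addr, const uint8_t *src) {
    struct entry *e = &table[hash_addr(addr)];
    e->valid = 1;
    e->addr  = addr;
    memcpy(e->data, src, OBJ_SIZE);
}

int main(void) {
    uint8_t obj[OBJ_SIZE] = {42};
    cache_insert(0x7f0000, obj);
    printf("0x7f0000 cached? %s\n", cache_lookup(0x7f0000) ? "yes" : "no");
    return 0;
}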

Why can't the cache be made fully associative?

Fully associative literally means fully associated. In the context of the CPU cache it means: if data from any memory address can be cached in any cache line, then we call this cache fully associative. From the definition we can conclude that, given a memory address, to know whether it is in the cache we would need to traverse every cache line and compare its tag against the memory address of the cached content. But the cache exists precisely to fetch data in as few CPU cycles as possible, so designing a fully associative cache that is both large and fast is almost impossible.
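
To make the cost concrete, here is a minimal sketch of what a fully associative lookup means logically, assuming a 4MB cache with 64-byte lines (65536 lines); real hardware performs these comparisons with parallel comparators rather than a loop, and the expense of all those comparators is exactly why a large fully associative cache is impractical:

#include <stdio.h>
#include <stdint.h>

#define NUM_LINES 65536                   /* 4MB / 64B lines (illustrative) */
#define LINE_SIZE 64

struct cache_line {
    int      valid;
    uint64_t tag;                         /* here: the line-aligned address */
    uint8_t  data[LINE_SIZE];
};

static struct cache_line cache[NUM_LINES];

/* Fully associative: the data could be in ANY line, so every line's tag
 * has to be compared against the requested address. */
static struct cache_line *lookup(uint64_t addr) {
    uint64_t tag = addr / LINE_SIZE;
    for (int i = 0; i < NUM_LINES; i++) {
        if (cache[i].valid && cache[i].tag == tag) {
            return &cache[i];             /* hit */
        }
    }
    return NULL;                          /* miss */
}

int main(void) {
    /* Nothing has been cached yet, so this lookup walks all 65536 lines. */
    printf("0x1000 cached? %s\n", lookup(0x1000) ? "yes" : "no");
    return 0;
}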

Why can't the cache be made direct mapped?

At the other extreme from fully associative is the direct mapped cache: given a memory address, it maps to exactly one cache line, which keeps the design simple and the lookup fast. So why don't caches use this mode? Imagine a 32-bit CPU with a 1MB L2 cache and 64-byte cache lines. The whole L2 cache is then divided into 1M/64 = 16384 cache lines, which we number starting from 0. At the same time, a 32-bit CPU can address 2^32 = 4GB of memory, so under this mapping the memory is likewise divided into 16384 chunks of 4G/16384 = 256KB each; that is, every 256KB chunk of memory shares one cache line. For each cache line's utilization to approach 100% in this mode, the operating system would have to allocate and access memory almost uniformly across the whole address space. Contrary to that wish, in order to reduce memory fragmentation and simplify implementation, the operating system tends to use memory in a contiguous, concentrated way. The result is that the low-numbered cache lines (say numbers 0 through 1000) are allocated and used all the time, while the cache lines above number 16000 sit almost permanently idle because the memory that maps to them is rarely touched by any process. In that situation, the valuable 1MB L2 cache might not even reach 50% utilization.
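
Here is a minimal sketch of the mapping just described, using the article's 1MB / 16384-line example in which every contiguous 256KB chunk of physical memory is tied to one fixed cache line (the numbers and the function are purely illustrative):

#include <stdio.h>
#include <stdint.h>

#define CHUNK_SIZE (256 * 1024)   /* 4GB of memory / 16384 lines = 256KB per line */

/* Under the partitioning described above, the cache line number is simply
 * the number of the 256KB chunk that the address falls into. */
static unsigned direct_mapped_line(uint32_t addr) {
    return addr / CHUNK_SIZE;
}

int main(void) {
    /* Addresses the OS tends to hand out early sit close together in low
     * memory and therefore keep hitting the low-numbered lines, while the
     * high-numbered lines stay idle. */
    printf("line of 0x00010000: %u\n", direct_mapped_line(0x00010000U)); /* 0     */
    printf("line of 0x00050000: %u\n", direct_mapped_line(0x00050000U)); /* 1     */
    printf("line of 0xFFFF0000: %u\n", direct_mapped_line(0xFFFF0000U)); /* 16383 */
    return 0;
}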

What is an N-way set associative cache?

To avoid the drawbacks of those two designs, the N-way set associative cache appeared. Its principle is to group every N cache lines into a set and to divide the cache up by sets. For a 4MB, 16-way L2 cache on a 64-bit system, a memory address is thus split into three parts (see the figure below): the low 6 bits give the offset within the cache line, the middle 12 bits give the set number (set index), and the remaining high 46 bits serve as the unique identifier (tag) of the memory address; a small sketch after the list below shows this decomposition. Compared with the previous two designs, this one has the following two benefits:

    • Given a memory address, it corresponds to exactly one set, so we only need to compare the 16 lines in that set to decide whether the object is cached (in a fully associative cache the number of comparisons grows linearly with the cache size).
    • Only every 2^18 (256K) * 16 (ways) = 4MB of continuous hotspot data will produce a conflict within a set (in direct mapped mode, 512K of continuous hotspot data would already produce conflicts).
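
Here is a minimal sketch of that address decomposition for the 4MB, 16-way, 64-byte-line example above. The 6/12/46 bit split comes straight from the text; the example address and variable names are only illustrative:

#include <stdio.h>
#include <stdint.h>

/* 4MB, 16-way cache with 64-byte lines:
 *   64-byte line             -> 6 offset bits
 *   4MB / 16 / 64B = 4096 sets -> 12 set-index bits
 *   the remaining high bits  -> 46-bit tag on a 64-bit address */
#define OFFSET_BITS 6
#define SET_BITS    12

int main(void) {
    uint64_t addr = 0x00007f8a12345678ULL;  /* an arbitrary example address */

    uint64_t offset = addr & ((1ULL << OFFSET_BITS) - 1);
    uint64_t set    = (addr >> OFFSET_BITS) & ((1ULL << SET_BITS) - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + SET_BITS);

    printf("offset = %llu, set = %llu, tag = 0x%llx\n",
           (unsigned long long)offset,
           (unsigned long long)set,
           (unsigned long long)tag);
    return 0;
}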

Why does the set index of an N-way set associative cache come from the low bits rather than the high bits?

Here is an explanation excerpted from "How Misaligning Data Can Increase Performance 12x by Reducing Cache Misses":

The vast majority of accesses are close together, so moving the set index bits upwards would cause more conflict misses. You might be able to get away with a hash function that isn't simply the least significant bits, but most proposed schemes hurt about as much as they help while adding extra complexity.

In other words, memory accesses are usually long and contiguous, or at least close together within the same program, which means the high bits of these memory addresses are all the same. If the high bits of the address were used as the set index, then over a short period a large number of memory accesses would all share the same set index and fall into the same set, causing cache conflicts that drive the L2 and L3 hit rates down and hurt the overall efficiency of the program.

How does understanding the N-way set associative storage mode help us?

Having understood the concept of N-way set associativity, it is not hard to draw the following conclusion: 2^(6 bits <cache line offset> + 12 bits <set index>) = 2^18 = 256K. That is, within a contiguous block of memory, cache objects whose addresses are 256KB apart map to the same cache set, which means they compete for a cache pool (a 16-way set) that has only 16 slots. And if the program over-uses the so-called optimization trick of "memory alignment", this competition becomes even more intense and the efficiency loss can become very visible. For concrete measurements, see the article "How Misaligning Data Can Increase Performance 12x by Reducing Cache Misses". Here we borrow a test result chart from "Gallery of Processor Cache Effects" to illustrate the performance penalty of memory alignment under extreme conditions.

The chart is actually a variant of the first test we ran in the earlier section. The vertical axis represents the size of the array being tested, and the horizontal axis represents the step (index interval) between successive array element accesses. The color indicates the length of the response time: the deeper the blue, the longer the response time. Many conclusions can be drawn from this figure; here we only care about the performance loss tied to memory alignment. Interested readers can read the original analysis for the other conclusions it supports.

It is not hard to see that at every step of 1024 in the figure, i.e. every 1024*4 = 4096 bytes (the elements are 4-byte ints), there is a particularly obvious blue vertical line. In other words, as soon as we walk memory with a 4KB step (memory accesses aligned to 4KB), the actual efficiency is very low no matter how large the hot data is. Following the analysis above: with 4KB-aligned accesses, a 240MB array contains 61440 accessible elements; for a 16-way L2 cache in which addresses 256KB apart fall into the same set, those elements land in only 256K/4K = 64 distinct sets, so each set receives 61440/64 = 960 elements competing for just 16 slots. The cache hit rate is only around 1%, so the efficiency is naturally very low.
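
The arithmetic in the paragraph above can be checked with a few lines of code. This minimal sketch assumes the same 12-bit set index model as before and a 4KB-stride scan of a 240MB array starting at a 256KB-aligned base; it simply counts how many distinct sets are touched and how many elements share each set:

#include <stdio.h>
#include <stdint.h>

#define NUM_SETS 4096                              /* 12-bit set index */

int main(void) {
    uint64_t array_bytes = 240ULL * 1024 * 1024;   /* 240MB array */
    uint64_t stride      = 4096;                   /* 4KB step    */
    long per_set[NUM_SETS] = {0};

    for (uint64_t addr = 0; addr < array_bytes; addr += stride) {
        per_set[(addr >> 6) & 0xFFF]++;            /* set index = address bits 6..17 */
    }

    long used = 0, per_used_set = 0;
    for (int s = 0; s < NUM_SETS; s++) {
        if (per_set[s] > 0) {
            used++;
            per_used_set = per_set[s];
        }
    }
    /* Prints: accesses: 61440, sets used: 64, elements per used set: 960 */
    printf("accesses: %llu, sets used: %ld, elements per used set: %ld\n",
           (unsigned long long)(array_bytes / stride), used, per_used_set);
    return 0;
}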

In addition to this example, interested readers can also look at another experiment showing how page alignment leads to inefficiency: http://evol128.is-programmer.com/posts/35453.html

To learn more about memory address alignment under current CPU cache architectures, you can read the following two articles in detail:

    • How Misaligning Data Can Increase Performance 12x by Reducing Cache Misses
    • Gallery of Processor Cache Effects
Cache eviction policy

Finally, let's briefly mention CPU cache eviction policies. The common ones are mainly LRU and Random. In general, LRU gives a better cache hit rate than Random, so CPU caches usually choose LRU as their eviction policy. That said, some experiments show that when the cache size is very large, Random can achieve a higher hit rate.
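
To make the LRU idea concrete, here is a minimal sketch of LRU replacement within a single 16-way set, using a simplified age-counter model. The struct and counters are purely illustrative; real CPUs typically use cheaper pseudo-LRU approximations:

#include <stdio.h>
#include <stdint.h>

#define WAYS 16

struct way {
    int      valid;
    uint64_t tag;
    unsigned age;                      /* 0 = most recently used */
};

/* Access one set: refresh the line's age on a hit, otherwise evict the
 * least recently used way (the one with the largest age). Returns 1 on hit. */
static int access_set(struct way set[WAYS], uint64_t tag) {
    for (int i = 0; i < WAYS; i++)
        set[i].age++;                  /* every line gets one tick older */

    for (int i = 0; i < WAYS; i++) {
        if (set[i].valid && set[i].tag == tag) {
            set[i].age = 0;            /* hit: now the most recently used */
            return 1;
        }
    }

    int victim = 0;                    /* miss: pick an empty way, else the oldest */
    for (int i = 0; i < WAYS; i++) {
        if (!set[i].valid) { victim = i; break; }
        if (set[i].age > set[victim].age) victim = i;
    }
    set[victim].valid = 1;
    set[victim].tag   = tag;
    set[victim].age   = 0;
    return 0;
}

int main(void) {
    struct way set[WAYS] = {{0}};
    for (uint64_t t = 0; t < 17; t++)  /* 17 distinct tags overflow the 16 ways */
        access_set(set, t);
    printf("tag 0 still cached? %s\n",
           access_set(set, 0) ? "yes" : "no (it was the LRU line)");
    return 0;
}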

Summary

The CPU cache is transparent to the programmer: all of its operations and policies are carried out inside the CPU. However, knowing and understanding how the CPU cache is designed and how it works helps us make better use of it and write more cache-friendly programs.

Reference
    1. Gallery of Processor Cache Effects
    2. How Misaligning Data Can Increase Performance 12x by Reducing Cache Misses
    3. Introduction to Caches
