CPU Performance Exploration: The Linux Cache Mechanism

Source: Internet
Author: User
Tags: intel, pentium



Before reading this article you should have basic knowledge of the memory hierarchy, at least enough to understand the principle of locality. To learn more about cache fundamentals, refer to the memory hierarchy chapter of "In-Depth Understanding of Computer Systems" (Computer Systems: A Programmer's Perspective).

Read the article with this question in mind: the cache is invisible to programmers and completely controlled by hardware, so why does the Linux kernel have a header file, cache.h, that defines some cache-related structures?

1. Cache Overview

Cache, translated as "high-speed buffer memory", exists to exploit the principle of locality and reduce the number of times the CPU accesses main memory. Simply put, the instructions and data the CPU is currently accessing may be accessed again later, and the memory area near those instructions and data may also be accessed soon. Therefore, when such an area is accessed for the first time it is copied into the cache, and later accesses to instructions or data in that area no longer need to go to main memory.

2. Cache structure

Assume the memory capacity is M and each memory address is m bits wide; the addressing range is then 000...00 to FFF...F (m bits).

A memory address is divided into the following three fields:

(Figure: a memory address split into the tag, set index, and block offset fields; see "In-Depth Understanding of Computer Systems", p. 305 of the English beta draft.)

What are the tag, set index, and block offset fields used for? Let's look at the logical structure of the cache:

(Figure: the logical structure of the cache.)

Comparing the two figures, the parameters can be related as follows:

B = 2^b

S = 2^s

Now let's explain the meaning of each parameter:

A cache is divided into S sets; each set has E cache lines, and each cache line contains B storage units. On modern processors a storage unit is generally a byte (usually 8 bits), which is also the smallest addressable unit. Therefore, in a memory address, the middle s bits determine which set the unit is mapped to, and the lowest b bits determine the offset of the unit within the cache line. The valid flag is usually a single bit that indicates whether the cache line is valid (it is, of course, invalid when no memory is mapped into the cache line). The tag is the high t bits of the memory address; because multiple memory addresses may map to the same cache line, the tag is used to verify that the cache line really holds the memory unit the CPU is accessing.
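To make the tag / set index / block offset split concrete, here is a minimal user-space sketch (not from the original article). The values s = 7 and b = 5 are assumptions chosen to match the Pentium L1 parameters listed later; the example address is arbitrary.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const unsigned s = 7;                 /* S = 2^7 = 128 sets (assumed)      */
    const unsigned b = 5;                 /* B = 2^5 = 32-byte lines (assumed) */
    uint64_t addr = 0x12345678;           /* arbitrary example address         */

    uint64_t offset = addr & ((1ULL << b) - 1);         /* lowest b bits       */
    uint64_t set    = (addr >> b) & ((1ULL << s) - 1);  /* middle s bits       */
    uint64_t tag    = addr >> (s + b);                  /* remaining high bits */

    printf("tag=%#llx set=%llu offset=%llu\n",
           (unsigned long long)tag, (unsigned long long)set,
           (unsigned long long)offset);

    /* Two addresses that differ by 2^(s+b) have the same set index and,
     * in a direct-mapped cache, compete for the same cache line. */
    uint64_t addr2 = addr + (1ULL << (s + b));
    printf("same set? %d\n",
           (int)(((addr2 >> b) & ((1ULL << s) - 1)) == set));
    return 0;
}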

When the tag matches and the valid bit passes the check, we call it a cache hit: the unit is simply taken from the cache and placed in a CPU register.

When the tag or valid check fails, the memory unit to be accessed (or a small group of contiguous units, such as a 4-byte int or an 8-byte double) is not in the cache and has to be fetched from memory. This is a cache miss. When a miss occurs, the unit is fetched from memory, loaded into the cache, and placed in a CPU register for the next step of processing. Note the following point, which is important for understanding the Linux cache mechanism:

When a unit is fetched from memory into the cache, a cache-line-sized memory area is fetched in one go and stored in the corresponding cache line.

For example, suppose we want to fetch the unit at address (t, s, b) and a cache miss occurs. The system will then fetch the whole block from (t, s, 00...0) to (t, s, FF...F) and place it in the corresponding cache line.

Here's a look at the cache mapping mechanism:

When E = 1, there is only one cache line per set (a direct-mapped cache). Two memory units whose addresses differ by 2^(s+b) are then mapped to the same cache line. (Think about why.)

When 1 < E < C/B, each set has E cache lines (a set-associative cache). Different addresses whose middle s bits are the same are mapped to the same set; which cache line within the set an address ends up in depends on the replacement algorithm.

When E = C/B, S = 1 and every memory unit can be mapped to any cache line (a fully associative cache). Few processors use such a cache, because this mapping mechanism requires expensive and complex hardware.

Regardless of the mapping, whenever a cache miss occurs, a cache-line-sized memory area is fetched into the corresponding cache line.

In modern processors the cache is generally divided into two or three levels: L1, L2, L3. The L1 cache is generally private to each CPU and not shared between CPUs. The L2 cache is generally shared by several CPUs, or may sit on the motherboard. The L1 cache may further be split into an instruction cache and a data cache, which lets the CPU fetch instructions and data at the same time.

Now look at the cache parameters of a real-world processor, using an Intel Pentium processor as an example:

             E    B      S             C
L1 i-cache   4    32 B   128           16 KB
L1 d-cache   4    32 B   128           16 KB
L2           4    32 B   1024~16384    128 KB~2 MB
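(A quick sanity check, added here and not part of the original table: the total capacity should satisfy C = S * E * B. For the L1 caches, 128 * 4 * 32 B = 16 KB; for the L2 cache, 1024 * 4 * 32 B = 128 KB and 16384 * 4 * 32 B = 2 MB, matching the last column.)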

3. The cost of a cache miss

The cache may be divided into L1, L2, and L3 levels; the farther a level is from the CPU, the longer its access time, but also the cheaper it is per byte.

On an L1 cache hit, the access time is one or two CPU cycles.

On an L1 miss that hits in the L2 cache, the access time is 5~10 CPU cycles.

When the unit has to be fetched from main memory, the access time may be 25~100 CPU cycles.

So we always want the cache hit rate to be as high as possible.

4. The false sharing problem

So far we have not mentioned how the cache is kept consistent with memory.

In fact, a cache line carries other flag bits as well, one of which marks whether the cache line has been written. We call it the modified bit. When modified = 1, the cache line has been written by the CPU, which means its contents may differ from the corresponding storage units in memory. So if a cache line has been written, its contents should at some point be written back to memory to keep the data consistent.

The question is: when do we write it back to memory? Certainly not every time the modified bit is set to 1; that would read or write memory on every store and greatly reduce the cache's performance. In practice, most systems write the contents of a cache line back to memory in this case:

when the cache line is evicted and its modified bit is 1.

Most systems are now slowly moving from single-processor to multiprocessor environments, with 2, 4, or even 16 CPUs integrated in one machine. This raises a new problem.

Typically, on Intel processors the L1 cache is private to each CPU.

Let's look at the following example:

The system is dual-core, i.e. it has 2 CPUs (for example, Intel Pentium processors). Each CPU's L1 cache is private and invisible to the other CPU, and each cache line holds 8 storage units.

Our program contains an array char arr[8]. In either CPU's L1 cache the whole array is of course mapped into a single cache line, because the mapping is implemented in hardware and the same memory region always maps to the same cache line.

Two threads write to the array. Thread 0 runs on CPU 0 and thread 1 runs on CPU 1, and assume the writes happen in the following order:

First, CPU 0 writes arr[0];

then CPU 1 writes arr[1];

then CPU 0 writes arr[2];

......

finally, CPU 1 writes arr[7].

Under the multiprocessor cache coherence protocol, the following happens:

When CPU 0 writes arr[0], all 8 char elements of the array are loaded into one cache line of CPU 0, and that cache line's modified bit is set to 1.

When CPU 1 writes arr[1], the array must also be loaded into a cache line of CPU 1. But the array has already been modified in CPU 0's cache, so CPU 0 first writes the contents of its cache line back to memory, and the array is then loaded from memory into CPU 1's cache line. CPU 1's write also puts CPU 0's corresponding cache line into the invalid state. Note that, because the mapping mechanism is the same, the cache line in CPU 0 and the cache line in CPU 1 are logically the same line (the same row under direct mapping, the same set under set-associative mapping).

When CPU 0 writes arr[2], its cache line is in the invalid state, so CPU 1 must transfer the array data in its cache line to CPU 0; and when CPU 0 performs the write, CPU 1's corresponding cache line is in turn put into the invalid state.

......

In this way the cache's performance is badly damaged. Under a multi-core processor this program may perform no better than it would on a single-core processor.
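The scenario above can be reproduced with a short user-space program. This is a minimal sketch, not from the original article; the iteration count and build command are assumptions, and volatile is used only to keep the compiler from optimising the per-iteration writes away.

#include <pthread.h>
#include <stdio.h>

#define ITERATIONS 100000000L

static volatile char arr[8];             /* the whole array fits in one cache line */

static void *writer(void *idx)
{
    long i, slot = (long)idx;
    for (i = 0; i < ITERATIONS; i++)
        arr[slot]++;                     /* each thread touches only its own byte */
    return NULL;
}

int main(void)                           /* build with: gcc -O2 -pthread demo.c */
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, writer, (void *)0);
    pthread_create(&t1, NULL, writer, (void *)1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("%d %d\n", arr[0], arr[1]);
    return 0;
}

Even though the two threads never touch the same byte, each write invalidates the other core's copy of the line, so the line keeps bouncing between the two L1 caches.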

When multiple CPUs access the same chunk of memory at the same time, that memory is "shared", and a conflict arises that requires a coherence protocol to coordinate the accesses. The smallest memory area on which such "sharing" occurs is a cache line. Therefore, when two or more CPUs access a memory area the size of one cache line, a conflict can occur; this is called "sharing". There is, however, a "false sharing" scenario in which nothing is actually shared. For example, two processors each access one word, and the two words happen to lie in the same cache-line-sized area. From the application logic's point of view the two processors do not share memory at all, because they access different contents (different words). Yet because of the existence and granularity of the cache line, the two CPUs end up accessing the same cache line block when accessing these two different words, creating a de facto "sharing". Obviously this "false sharing" caused by the cache line granularity is something we do not want, since it wastes system resources. (This passage is adapted from: http://software.intel.com/zh-cn/blogs/2010/02/26/false-sharing/)

There are two good remedies for the false sharing problem:

1. Increase the spacing between array elements so that elements accessed by different threads fall on different cache lines. This is a typical trade of space for time.
2. Have each thread work on a local copy of its element of the global array, and write the result back to the global array only when the thread finishes.

The Linux cache mechanism we are going to discuss is related to the first approach; a padded version of the example above is sketched below.
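A minimal sketch of the first remedy, assuming 64-byte cache lines (an assumption; real code should use the line size of the target CPU). Each thread gets its own cache-line-sized, cache-line-aligned slot, so the two threads never write to the same line:

#include <pthread.h>
#include <stdio.h>

#define LINE_SIZE 64                     /* assumed cache line size */

struct padded_byte {
    volatile char value;
    char pad[LINE_SIZE - 1];             /* keep neighbouring slots one line apart */
} __attribute__((aligned(LINE_SIZE)));

static struct padded_byte slot[2];       /* one slot per thread */

static void *writer(void *idx)
{
    long i, n = (long)idx;
    for (i = 0; i < 100000000L; i++)
        slot[n].value++;                 /* same work as before, no false sharing */
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, writer, (void *)0);
    pthread_create(&t1, NULL, writer, (void *)1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("%d %d\n", slot[0].value, slot[1].value);
    return 0;
}

On a multi-core machine this padded version usually runs several times faster than the previous sketch, even though it performs exactly the same number of writes.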

5. Cache-Friendly Code

Cache-friendly code, simply put, should:

      1. reduce the cache miss rate;
      2. in a multi-core environment, reduce or even eliminate the chance of false sharing.

Here is a typical single-core example:

Cache-friendly code:

int sumarrayrows(char a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

On a typical machine, C arrays are stored in row-major order. Assume the cache line size is B bytes, the total cache capacity is C bytes, and the mapping is direct, so there are C/B cache lines. The array a[M][N] occupies M*N bytes. With this traversal order, a cache miss occurs only when the (n*B)-th element is read (0 <= n < M*N/B), so at most M*N/B cache misses occur, and the miss rate is at most (M*N/B)/(M*N) = 1/B.

Cache-unfriendly code:

int sumarraycols(char a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

This code traverses the array column-first, and the analysis is a bit more complicated. We will only look at a relatively simple case:

    1. When N = B and M*N > C, let E here denote the total number of cache lines, i.e. E = C/B (under direct mapping each set holds one line). Consider what happens. While accessing a[0][0] through a[E-1][0], every access causes a cache miss; then, while accessing a[E][0] through a[M-1][0], cache lines 0 through M-E-1 are overwritten, so accessing a[0][0] through a[M-1][0] always causes cache misses. While accessing a[0][1] through a[M-1][1] there are two phases: rows 0 through M-E-1 were overwritten, so they miss again, while rows M-E through E-1, i.e. the accesses a[M-E][1] through a[E-1][1], were not overwritten and therefore hit (rows E through M-1 of each later column miss for the same reason, having been evicted in turn). In total there are M + 2*(M-E)*(N-1) cache misses, and the miss rate works out to 2 - 2E/M - 1/N + 2E/(M*N). When M = 2E this reaches 1, i.e. every access misses.

If M and N are given large enough values, a test will show that the column-first program takes much longer to run than the row-first program.
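The claim can be checked with a small benchmark like the following sketch (not from the article; the array dimensions and the build command are assumptions chosen so the array is far larger than the caches):

#include <stdio.h>
#include <time.h>

#define M 8192
#define N 8192

static char a[M][N];                     /* 64 MB, much larger than any cache level */

static long sum_rows(void)               /* cache-friendly: row-major traversal */
{
    long i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

static long sum_cols(void)               /* cache-unfriendly: column-major traversal */
{
    long i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

int main(void)                           /* build with: gcc -O1 sum.c */
{
    clock_t t0 = clock();
    long s1 = sum_rows();
    clock_t t1 = clock();
    long s2 = sum_cols();
    clock_t t2 = clock();
    printf("rows: %ld in %.2fs, cols: %ld in %.2fs\n",
           s1, (double)(t1 - t0) / CLOCKS_PER_SEC,
           s2, (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}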

In multi-core environments, false sharing problems occur whenever different threads or processes access different contents of the same cache line. Such problems are more subtle and harder to spot.

6. GCC Attribute
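The original leaves this section empty, so only a brief illustration is given here of the two GCC attributes that the macros in the next section rely on; the type and variable names below are made up for illustration.

/* __attribute__((aligned(n))) forces the start address of a type or variable
 * to be a multiple of n; __attribute__((section("name"))) places a variable
 * in a named linker section instead of the default .data/.bss. */
struct hot_counters {
    long hits;
    long misses;
} __attribute__((aligned(64)));                          /* objects start on a 64-byte boundary */

struct hot_counters stats;                               /* stats is 64-byte aligned */

int boot_flag __attribute__((section(".mydata"))) = 1;   /* placed in section ".mydata" */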

7. Interpreting the header file <linux/cache.h>

The full code is not reproduced here; only the relevant macros are shown.

A. The L1_CACHE_ALIGN(x) macro

#define L1_CACHE_ALIGN(x) ALIGN(x, L1_CACHE_BYTES)

/* linux/kernel.h */
#define __ALIGN_KERNEL(x, a)         __ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1)
#define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask))

This macro rounds x up to the next L1 cache line boundary, i.e. it returns the first cache-line-aligned address at or above x.
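To make the rounding arithmetic concrete, here is a user-space re-creation of the same trick (a sketch, not kernel code; the 64-byte line size and the ALIGN_UP name are assumptions):

#include <stdio.h>

#define L1_CACHE_BYTES 64                 /* assumed L1 line size */
#define ALIGN_UP(x, a)  (((x) + ((a) - 1)) & ~((a) - 1))   /* a must be a power of two */

int main(void)
{
    printf("%lu\n", (unsigned long)ALIGN_UP(100UL, L1_CACHE_BYTES));   /* prints 128 */
    printf("%lu\n", (unsigned long)ALIGN_UP(128UL, L1_CACHE_BYTES));   /* prints 128 */
    return 0;
}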

B. The ____cacheline_aligned macro

#define SMP_CACHE_BYTES L1_CACHE_BYTES
#define ____cacheline_aligned __attribute__((__aligned__(SMP_CACHE_BYTES)))

This macro uses a GCC attribute to align the data structure it decorates, so that the structure's starting address falls on a cache line boundary.

C. The __cacheline_aligned macro

#define __cacheline_aligned \
    __attribute__((__aligned__(SMP_CACHE_BYTES), \
                   __section__(".data..cacheline_aligned")))

This places the data in the cacheline_aligned sub-section of the data segment and aligns its starting address to a cache line boundary. The section itself is defined in arch/xxx/kernel/vmlinux.lds.S; interested readers can check the code themselves.

The B and C macros look similar, differing only by two leading underscores. The difference is that the former only aligns the data and is used for local data declarations, while the latter is applied to global data and additionally places it in the dedicated sub-section of the .data segment.

Some key data structures in multiprocessor architectures are declared with __cacheline_aligned, for example:

/* linux+v3.1.1/arch/ia64/kernel/numa.c#L27 */
u16 cpu_to_node_map[NR_CPUS] __cacheline_aligned;

This prevents false sharing when each CPU reads and writes its own entry in the map.
