Principle of CPU Cache

Overview

Today I would like to share some things about the CPU cache. Once you understand how the CPU cache works, you can reason by analogy, and the many caching technologies you meet later will be much easier to learn, because most of them follow similar principles.


Basic Description

As we all know, the CPU operates much faster than memory can be read or written, which leaves the CPU waiting a long time for data to arrive or to be written out to memory. In a computer system, the CPU cache is the component that reduces the average time the processor needs to access memory. In the pyramid-shaped storage hierarchy it sits on the second level from the top, just below the CPU registers. Its capacity is much smaller than main memory, but its speed can approach the processor frequency. Ordinary memory (DRAM) stores data as electric charge that leaks away, and reading a value also drains the charge, so the cells must be refreshed at a fixed frequency to recharge them. The CPU cache, by contrast, uses static RAM (SRAM), which needs no refreshing and can be read again and again, which is why it is so fast; the trade-off is that its circuitry is much more complex than DRAM's.

When the processor makes a memory access request, it first checks whether the requested data is already in the cache. If it is present (a hit), the data is returned directly without accessing memory; if it is not, the corresponding data is first loaded from memory into the cache and then returned to the processor. The CPU keeps the data from the hot regions of RAM in the cache, and when the cache is full a replacement policy decides which cached data to evict. Examples include the LRU algorithm (least recently used, which evicts the block that has gone unused the longest) and the MRU algorithm (most recently used, which relies on additional flag bits to carry out the replacement and requires more complex circuitry than LRU).
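To make hit, miss and replacement concrete, here is a minimal Python sketch of the bookkeeping (not the hardware) of a small cache with an LRU replacement policy; the read_from_memory helper and the four-block capacity are assumptions made only for the illustration:

    from collections import OrderedDict

    def read_from_memory(address):
        # Hypothetical backing store; stands in for a slow DRAM access.
        return f"data@{address:#06x}"

    class LRUCache:
        def __init__(self, capacity=4):
            self.capacity = capacity
            self.blocks = OrderedDict()           # address -> data, ordered by recency

        def access(self, address):
            if address in self.blocks:            # hit: return directly
                self.blocks.move_to_end(address)  # mark as most recently used
                return self.blocks[address]
            data = read_from_memory(address)      # miss: fetch from memory
            if len(self.blocks) >= self.capacity: # cache full: evict the LRU block
                self.blocks.popitem(last=False)
            self.blocks[address] = data
            return data

    cache = LRUCache()
    cache.access(0x0020)   # miss, loaded from memory
    cache.access(0x0020)   # hit, served from the cache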

Caching works primarily because a running program exhibits locality in its memory accesses; that is, over any period of time the program executes only a limited part of its code, and correspondingly the storage it touches is confined to a limited region of memory. Locality has two forms: spatial locality (once a program accesses a storage unit, nearby storage units are likely to be accessed soon afterwards) and temporal locality (if an instruction is executed, it is likely to be executed again shortly; put more simply, data that has just been accessed is likely to be accessed again soon). By exploiting this locality, a cache can achieve a very high hit rate.
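A small illustration of the two kinds of locality; the array size and loop shapes are arbitrary choices for the example:

    data = list(range(1024))          # 1024 consecutive integers in memory

    # Spatial locality: consecutive elements sit next to each other, so once
    # a cache block has been loaded, the next few accesses hit the same block.
    total = 0
    for i in range(len(data)):
        total += data[i]

    # Temporal locality: the same few values are reused over and over, so they
    # stay resident in the cache across iterations.
    weights = [1, 2, 3, 4]
    weighted = 0
    for i in range(len(data)):
        weighted += data[i] * weights[i % 4]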

Storage structure of the cache

In terms of structure, a direct-mapped cache consists of several cache blocks. Each cache block stores a group of storage units with consecutive memory addresses. Each cache block has an index, which is usually taken from the low-order part of the memory address, excluding the lowest bits that form the intra-block offset and byte offset. Because this is a many-to-one mapping, storing a piece of data also requires recording exactly where in memory it came from, so every cache block carries a tag. Concatenating the tag value with the block's index recovers the memory address of the cache block; adding the block offset yields the full memory address of any individual piece of data.
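The address arithmetic this describes can be sketched as follows; the geometry is parameterized, and the default values (4 blocks of 16 bytes) merely echo the example used later in this article:

    def split_address(address, num_blocks=4, block_size=16):
        """Split an address into (tag, index, block_offset) for a direct-mapped cache."""
        offset_bits = block_size.bit_length() - 1   # 16-byte blocks -> 4 offset bits (word + byte)
        index_bits = num_blocks.bit_length() - 1    # 4 blocks -> 2 index bits
        block_offset = address & (block_size - 1)
        index = (address >> offset_bits) & (num_blocks - 1)
        tag = address >> (offset_bits + index_bits)
        return tag, index, block_offset

    def rebuild_address(tag, index, block_offset, num_blocks=4, block_size=16):
        """Concatenate tag, index and offset back into the original address."""
        offset_bits = block_size.bit_length() - 1
        index_bits = num_blocks.bit_length() - 1
        return (tag << (offset_bits + index_bits)) | (index << offset_bits) | block_offset

    tag, index, offset = split_address(0x0528)
    assert rebuild_address(tag, index, offset) == 0x0528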



Operation Process

The following briefly describes the workflow of a hypothetical direct-mapped cache. This cache has four cache blocks in total, each block holding 16 bytes (4 words), for 64 bytes of storage in all. It uses a write-back policy to keep the data consistent.

[Figure: operation flow of the example direct-mapped CPU cache (original image hosted on Wikimedia Commons)]

When the system starts, there is no data in the cache. Afterwards, data is gradually loaded into, or swapped out of, the cache. Suppose that at some later point the cache and memory hold the layout shown in the figure above. If the processor now executes a data-read instruction, the control logic follows this process:

    1. The address is divided, from high to low, into four parts: tag, index, intra-block offset, and byte offset. The intra-block offset and the byte offset are two bits each; the byte offset is not used in the operations that follow.

    2. Use the index to locate the corresponding cache block.

    3. Compare the tag stored in that cache block with the tag portion of the address; the result is either a hit or a miss.

    4. On a hit, use the intra-block offset to select the requested word within the block and return it to the processor. On a miss, the corresponding block is first loaded from memory into that cache block (the block it displaces is written back to memory first if it has been modified, since this cache uses a write-back policy), and the word is then returned.

Given the cache contents shown in the figure, if at this moment the processor requests an address between 0x0020 and 0x0023, between 0x0004 and 0x0007, between 0x0528 and 0x052b, or between 0x05ec and 0x05ef, the access will hit. All other addresses miss.
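That read flow can be sketched in a few lines of Python for the four-block, 16-byte-block geometry. The cache here starts out empty rather than matching the figure, the toy memory backing store is an assumption for the illustration, and write-back of a displaced dirty block is omitted to keep the sketch short:

    BLOCK_SIZE = 16                       # bytes per cache block (4 words)
    NUM_BLOCKS = 4                        # direct-mapped: 4 blocks, 64 bytes in total

    valid = [False] * NUM_BLOCKS          # does the block hold real data yet?
    tags  = [0] * NUM_BLOCKS              # tag stored alongside each block
    lines = [bytearray(BLOCK_SIZE) for _ in range(NUM_BLOCKS)]

    def read(address, memory):
        """Split the address, index into the cache, compare tags, then hit or miss."""
        offset = address & (BLOCK_SIZE - 1)             # intra-block + byte offset
        index = (address // BLOCK_SIZE) % NUM_BLOCKS    # which cache block to examine
        tag = address // (BLOCK_SIZE * NUM_BLOCKS)      # remaining high-order bits
        if valid[index] and tags[index] == tag:         # hit: return the cached byte
            return lines[index][offset]
        base = address - offset                         # miss: load the whole block...
        lines[index][:] = memory[base:base + BLOCK_SIZE]
        tags[index], valid[index] = tag, True
        return lines[index][offset]                     # ...then complete the read

    memory = bytearray(range(256)) * 8                  # toy 2 KiB backing store
    print(read(0x0020, memory))                         # first access misses, block is loaded
    print(read(0x0021, memory))                         # same block: hit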

When the processor executes a data-write instruction, the control logic follows this process:

1. Use the index to locate the corresponding cache block.

2. Compare the tag stored in that cache block with the tag portion of the address; the result is either a hit or a miss.

3. On a hit, use the intra-block offset to locate the target word within the block, then rewrite that word directly.

4. On a miss, there are two possible strategies depending on the system design, called write allocate and no-write allocate. With write allocate, the missing block is first read into the cache exactly as on a read miss, and the data is then written into the freshly loaded block. With no-write allocate, the data is written directly to memory and the cache is left untouched.
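Continuing the same assumptions, here is a sketch of the write path; the WRITE_ALLOCATE flag switches between the two miss strategies, and the dirty flags record, for the write-back policy, that the cached copy is newer than memory (the setup repeats that of the read sketch so this runs on its own):

    BLOCK_SIZE, NUM_BLOCKS = 16, 4
    WRITE_ALLOCATE = True                 # set to False for the no-write-allocate strategy

    valid = [False] * NUM_BLOCKS
    tags  = [0] * NUM_BLOCKS
    dirty = [False] * NUM_BLOCKS          # write-back: modified since it was loaded?
    lines = [bytearray(BLOCK_SIZE) for _ in range(NUM_BLOCKS)]

    def write(address, value, memory):
        offset = address & (BLOCK_SIZE - 1)
        index = (address // BLOCK_SIZE) % NUM_BLOCKS
        tag = address // (BLOCK_SIZE * NUM_BLOCKS)
        if valid[index] and tags[index] == tag:      # hit: rewrite the word in place
            lines[index][offset] = value
            dirty[index] = True                      # memory is now stale (write-back)
        elif WRITE_ALLOCATE:                         # miss, write allocate:
            base = address - offset                  # read the block in as on a read miss...
            lines[index][:] = memory[base:base + BLOCK_SIZE]
            tags[index], valid[index] = tag, True
            lines[index][offset] = value             # ...then write into the loaded block
            dirty[index] = True
        else:                                        # miss, no-write allocate:
            memory[address] = value                  # bypass the cache, write to memory

    memory = bytearray(2048)
    write(0x0020, 0x7F, memory)                      # miss: block loaded, word rewritten
    write(0x0021, 0x3C, memory)                      # hit: only the cached copy changes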

With direct mapping, to make data lookup easy, a given piece of memory data may only be placed in one specific location in the cache: each memory block address maps to a unique cache block through a modulo operation. Note that this is a many-to-one mapping, so many memory block addresses must share the same cache block. Walking through the operation above shows that, although a direct-mapped cache is very simple in circuit logic, it has a serious conflict problem: because many different memory blocks share a single cache block, the current contents of that block must be evicted whenever a miss occurs for another block that maps to it. Frequent replacement of cache contents not only causes long delays, it also fails to exploit the temporal locality of the running program. This is why the group association technique appeared.
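A quick illustration of the conflict problem, with arbitrary block numbers and a four-block cache chosen only for the example: memory blocks whose numbers differ by a multiple of the number of cache blocks collide on the same cache block, so alternating between two such blocks evicts the other block on every access.

    NUM_BLOCKS = 4                                   # direct-mapped cache with 4 blocks

    def cache_block_for(memory_block):
        return memory_block % NUM_BLOCKS             # modulo mapping

    # Memory blocks 2, 6, 10, ... all land on cache block 2 and evict one another.
    for memory_block in (2, 6, 10, 14):
        print(memory_block, "->", cache_block_for(memory_block))

    # Alternating accesses to two conflicting blocks miss every single time,
    # even though only two blocks are ever used (a conflict-miss pattern).
    resident = None
    misses = 0
    for memory_block in [2, 6] * 4:
        if resident != memory_block:
            misses += 1                              # the other block was just evicted
            resident = memory_block
    print("misses:", misses)                         # 8 misses out of 8 accesses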



Group Association

A group-associative (set-associative) cache organizes its storage space into groups, each containing several data blocks. By establishing a correspondence between memory data and a group index, a block of memory can be loaded into any of the data blocks within its corresponding group.

For example, with a 2-way group-associative cache, the data of memory blocks numbered 0, 8, 16, 24, ... can be placed in either of the two blocks of group 0 in the cache; with a 4-way group-associative cache, the data of memory blocks numbered 0, 8, 16, 24, ... can be placed in any of the four data blocks of group 0.
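A small sketch of that mapping, assuming 8 groups (a group count consistent with the example block numbers above, but still only an assumption for the illustration):

    def group_for(memory_block, num_groups):
        # A memory block can go into any way of exactly one group.
        return memory_block % num_groups

    # With 8 groups, memory blocks 0, 8, 16 and 24 all map to group 0 ...
    print([group_for(b, 8) for b in (0, 8, 16, 24)])      # [0, 0, 0, 0]

    # ... and in a 2-way cache either of the two blocks (ways) of group 0
    # may hold any one of them, so two of these blocks can be cached at once.
    group0 = {}                                            # way number -> memory block
    for way, block in enumerate((0, 8)):
        group0[way] = block
    print(group0)                                          # {0: 0, 1: 8}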


[Figure: group (set) association in a CPU cache (original image hosted on Wikimedia Commons)]


With a group-associative cache, after the index selects the corresponding group, the tags of all cache blocks within that group must be compared against the tag of the address to decide whether the lookup hits. This increases circuit complexity to some degree, and the lookup therefore becomes somewhat slower.
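A minimal Python sketch of that lookup, assuming a 2-way group whose contents are placeholder (tag, data) pairs:

    def lookup_in_group(group, wanted_tag):
        """Compare the tag against every way in the selected group; a hit returns the data."""
        for way, entry in enumerate(group):
            if entry is not None:
                tag, block_data = entry
                if tag == wanted_tag:          # every way must be checked, hence more circuitry
                    return way, block_data     # hit
        return None                            # miss: no way in this group holds the tag

    group = [(0x14, b"payload-A"), (0x00, b"payload-B")]   # hypothetical 2-way group contents
    print(lookup_in_group(group, 0x00))                    # (1, b'payload-B') -> hit in way 1
    print(lookup_in_group(group, 0x3F))                    # None -> miss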

In addition, simply increasing the associativity without increasing the total cache size does not change the ratio of cache to memory. In the example in the figure above, with the 2-way arrangement, group 0 does have two cache blocks, but it now also has to serve memory blocks 1, 9, 17, 25, ... in addition to blocks 0, 8, 16, 24, ..., so each group remains the target of just as many memory blocks as before.

A direct-mapped cache can be thought of as a 1-way group-associative cache. An empirical rule shows that, for caches smaller than 128 KB, a 2-way group-associative cache needs only about half the capacity of a direct-mapped cache to achieve the same miss rate.

1-way: a given storage unit of RAM can be cached in only one specific storage unit of the CPU cache.

2-way: a given storage unit of RAM can be cached in either of two specific storage units of the CPU cache.

4-way: a given storage unit of RAM can be cached in any of four specific storage units of the CPU cache.

There is also a technique called fully associative caching, which can be seen as the extreme case of group association. In such a cache, a block of memory can be placed anywhere in the cache. This dispenses with the index entirely: data is found by matching the tag against every entry in the whole cache. Because such a lookup causes the longest circuit delay, it is used only in special situations, for example when the cache is very small.
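A correspondingly minimal sketch of a fully associative lookup; the entries are placeholders, and the point is that there is no index at all, so every tag must be compared:

    def fully_associative_lookup(entries, wanted_tag):
        """Scan every cache entry; any entry may hold any memory block."""
        for entry in entries:
            if entry is not None and entry[0] == wanted_tag:
                return entry[1]                       # hit
        return None                                   # miss

    # Hypothetical cache contents: (tag, data) pairs placed anywhere in the cache.
    entries = [(0x0001, b"A"), None, (0x0150, b"B"), (0x002A, b"C")]
    print(fully_associative_lookup(entries, 0x0150))  # b'B' -> hit, found by scanning
    print(fully_associative_lookup(entries, 0x0999))  # None -> miss after scanning everything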


Let's take a look at the replacement and writeback strategies mentioned above:

For a group-associative cache, when all the cache blocks in a group are occupied and another miss occurs in that group, one of the cache blocks must be chosen for replacement. Several policies exist for deciding which block to replace.

The FIFO algorithm replaces the cache block that has been resident in the group the longest.

The least-recently-used algorithm (LRU) tracks how each cache block is used and, based on that record, replaces the block that has gone unaccessed for the longest time. For associativity above 2-way, the cost of this algorithm can be very high.

An approximation of the least-recently-used algorithm is not-most-recently-used (NMRU) replacement. This algorithm records only which cache block was used most recently; at replacement time, one of the other blocks, that is, a block that was not most recently used, is replaced at random. Compared with LRU, this algorithm only requires the hardware to add a single usage bit to each cache block.

In addition, a purely random replacement policy can be used. Tests have shown that the performance of completely random replacement approaches that of LRU.
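Minimal Python sketches of these replacement policies operating on a single group; the group size and the way recency is recorded (list order, a single index) are simplifications chosen for the illustration, not a hardware design:

    import random

    def choose_victim_fifo(insertion_order):
        """FIFO: evict the way that was filled earliest."""
        return insertion_order[0]

    def choose_victim_lru(access_order):
        """LRU: evict the way whose last access is oldest (front of the recency list)."""
        return access_order[0]

    def choose_victim_nmru(most_recently_used_way, num_ways):
        """NMRU: remember only the most recently used way; evict any other way at random."""
        candidates = [w for w in range(num_ways) if w != most_recently_used_way]
        return random.choice(candidates)

    def choose_victim_random(num_ways):
        """Random: simply pick any way."""
        return random.randrange(num_ways)

    # Example for a 4-way group: ways were filled in the order 2, 0, 3, 1 and
    # accessed most recently in the order 0, 3, 2, 1 (way 1 is the most recent).
    print(choose_victim_fifo([2, 0, 3, 1]))   # 2
    print(choose_victim_lru([0, 3, 2, 1]))    # 0
    print(choose_victim_nmru(1, 4))           # one of 0, 2, 3
    print(choose_victim_random(4))            # any of 0..3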

In order to keep the data consistent with the lower levels of the storage hierarchy (such as main memory), the cached data must be propagated back in a timely manner. This updating is done by writing, and there are generally two write policies: write-back and write-through.

Write-back means that the contents of a cache block are written to memory only when the block is about to be replaced, that is, swapped out of the cache. If a write hits in the cache, memory is not updated right away. To reduce memory writes, a cache block usually carries a dirty bit that records whether the block has been modified since it was loaded (modern paged memory management similarly keeps a per-page dirty bit to record whether a page has been written). If a cache block has never been written before it is swapped out, the write-back operation can be skipped.
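A minimal sketch of the bookkeeping this implies when a block is evicted; the single-block state and the way the block's memory address is reconstructed from its tag are simplified assumptions made only for the illustration:

    def evict(index, tags, lines, dirty, memory, block_size=16):
        """Write a block back to memory only if it was modified while cached."""
        if dirty[index]:
            base = tags[index] * block_size            # simplified: tag stands in for the block's base address
            memory[base:base + block_size] = lines[index]
            dirty[index] = False                       # the block is clean again
        # A clean block needs no write-back; it can simply be overwritten.

    # Placeholder state for one block that has been modified in the cache.
    memory = bytearray(64)
    tags, lines, dirty = [1], [bytearray(b"\xAA" * 16)], [True]
    evict(0, tags, lines, dirty, memory)
    print(memory[16:32])                               # the modified block now sits in memory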

The advantage of write-back is that it saves a large number of write operations, mainly because several updates to different units within the same block can be folded into a single write operation.

Write-through means that whenever the cache receives a write instruction, the data is written straight through to memory; if the written address is also present in the cache, the cached copy must be updated at the same time. Because this design causes a large number of memory writes, a buffer is needed to reduce hardware conflicts. This buffer, called a write buffer, usually holds no more than 4 cache blocks. For the same purpose, a write buffer can also be used with a write-back cache. Write-through is easier to implement than write-back, and it makes data consistency easier to maintain.

When a write miss occurs, the cache can follow one of two policies, called write allocate and no-write allocate.

Write allocate means the required block is first read into the cache, just as on a read miss, and the data is then written into the block that has just been read in. With no-write allocate, the data is always written directly to memory.

You can use any combination of write policy and allocation policy when designing a cache. Different combinations behave differently on data writes, as shown in the following table.


Write policy      Allocation policy     Hit or miss     Data is written to
Write-back        Write allocate        Hit             Cache
Write-back        Write allocate        Miss            Cache
Write-back        No-write allocate     Hit             Cache
Write-back        No-write allocate     Miss            Memory
Write-through     Write allocate        Hit             Cache and memory
Write-through     Write allocate        Miss            Cache and memory
Write-through     No-write allocate     Hit             Cache and memory
Write-through     No-write allocate     Miss            Memory
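The table can also be read as a small function; the following Python sketch encodes exactly the eight rows above (the string labels are chosen just for the illustration):

    def write_destination(policy, allocation, hit):
        """Return where a write goes for a given write policy, allocation policy and hit/miss."""
        if policy == "write-through":
            # Write-through always updates memory; the cache is also updated
            # except on a miss under no-write allocate.
            if not hit and allocation == "no-write allocate":
                return "memory"
            return "cache and memory"
        # Write-back: memory is touched only on a miss under no-write allocate.
        if not hit and allocation == "no-write allocate":
            return "memory"
        return "cache"

    for policy in ("write-back", "write-through"):
        for allocation in ("write allocate", "no-write allocate"):
            for hit in (True, False):
                print(policy, allocation, "hit" if hit else "miss", "->",
                      write_destination(policy, allocation, hit))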



That is all for this sharing. In fact there is much more to CPU caches than what is covered here; the deeper details are beyond the scope of this article.

This article is from the "Long Way to Repair" blog; please be sure to keep this source: http://xiaojielinux.blog.51cto.com/10565265/1875073
