From the programmer's perspective (1)


These notes summarize the cache material in Computer Systems: A Programmer's Perspective (2nd edition).

Locality: programs tend to reference data or instructions that were recently referenced, or that are close to recently referenced ones.

Good temporal locality: a memory location that has been referenced is likely to be referenced again multiple times in the near future.

Good spatial locality: memory locations adjacent to a referenced location are likely to be referenced in the near future.

Locality applies to instructions as well as data; the difference is that instructions are seldom modified, while data is modified frequently.

On the first miss, the entire block containing the requested data is loaded into the cache. Temporal locality then makes repeated references to the same data hit, and spatial locality makes references to the neighboring data in the same block hit as well, as in the sketch below.
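As a minimal sketch of both kinds of locality (my own illustration, with hypothetical names): summing a vector reuses sum on every iteration (temporal locality) and walks the array sequentially (spatial locality), so after one miss loads a block, the remaining elements of that block are hits.

    #include <stdio.h>

    /* sum is referenced on every iteration: good temporal locality.
     * v is accessed with a stride-1 pattern: good spatial locality,
     * since a miss on v[i] loads a block whose neighbors then hit. */
    int sumvec(int v[], int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += v[i];
        return sum;
    }

    int main(void)
    {
        int v[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        printf("%d\n", sumvec(v, 8));
        return 0;
    }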

 

Storage at level k + 1 is divided into blocks, and storage at level k is divided into blocks of the same size; data moves between levels k and k + 1 in block-sized units. (Adjacent levels share a block size, but non-adjacent levels need not: levels lower in the hierarchy tend to use larger blocks, higher levels smaller ones.)

 

Cache organization:

When you access memory you supply an address; suppose the address has m bits. Then:

  1. The cache is divided into S = 2^s sets;
  2. Each set contains E cache lines (a block is the fixed-size unit of data transferred between levels; a cache line is the container for a block: besides the block itself it holds bookkeeping bits, e.g. a valid bit indicating whether the block in the line is valid, tag bits used to match addresses, and possibly a dirty bit indicating whether the block has been modified; a set is a group of lines);
  3. Each block holds B = 2^b bytes, addressed within the block by a b-bit offset;
  4. The remaining t = m - (s + b) bits form the tag, which identifies which block occupies a line within a set.

Note: 1. An m-bit address is divided into [tag | set index | block offset].

Why the fields are arranged this way:

If the set index occupied the high-order bits, contiguous blocks would map to the same set, so sequential access patterns would pile up in one set, raising the probability of conflict misses and hurting cache performance. The set index cannot occupy the low-order bits either, or the data in a cached block could not be exploited: for example, with 16-byte blocks and 4-byte integers read sequentially from address 0, one block load brings four integers into the cache at once, and with a middle set index the next three reads hit; with a low-order set index, adjacent integers would fall in different blocks, so each read would still go to memory and the cache would do no good. So the set index sits in the middle (see the sketch after the capacity formula below).

2. In fact, the set-index bits are not stored anywhere in the hardware: a line's set is implicit in its position within the cache.

A cache's organization can be summarized by the tuple (S, E, B, m).

Cache capacity C = S * E * B
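A minimal sketch of the address split, assuming 64-bit addresses and hypothetical parameters s = 2 (S = 4 sets) and b = 4 (B = 16-byte blocks); it also shows why middle-bit indexing spreads sequential blocks across different sets.

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical parameters: s = 2 (4 sets), b = 4 (16-byte blocks). */
    enum { S_BITS = 2, B_BITS = 4 };

    static void decode(uint64_t addr)
    {
        uint64_t offset = addr & ((1ULL << B_BITS) - 1);
        uint64_t set    = (addr >> B_BITS) & ((1ULL << S_BITS) - 1);
        uint64_t tag    = addr >> (B_BITS + S_BITS);
        printf("addr 0x%03llx -> tag 0x%llx, set %llu, offset %llu\n",
               (unsigned long long)addr, (unsigned long long)tag,
               (unsigned long long)set, (unsigned long long)offset);
    }

    int main(void)
    {
        /* Sequential blocks land in sets 0, 1, 2, 3: the middle bits
         * change fastest, so a sequential scan uses all sets. */
        for (uint64_t a = 0x00; a <= 0x30; a += 0x10)
            decode(a);
        return 0;
    }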

 

Cache hit: to read an object d that lives in a block at level k + 1, first search the blocks at level k; if d is found there, it is a cache hit.

Cache miss: the opposite: d is not found at level k.

On a cache miss, level k fetches from level k + 1 the block containing d; if the level-k cache is already full, an existing block must be overwritten. Overwriting a block is called replacing or evicting it, and the block pushed out is the victim block. The rule that decides which block to replace is the replacement policy.

 

Classification of cache misses:

Cold misses: unavoidable. When the level-k cache is empty (a cold cache), every access misses; such misses are called compulsory or cold misses. Once repeated accesses have warmed up the cache, cold misses no longer occur.

Conflict misses: several blocks at level k + 1 map to the same set at level k and keep evicting one another. Whether these occur is closely tied to whether the programmer writes cache-friendly code.

Programs typically execute in phases (for example, the inner and outer iterations of a loop). During each phase the program touches a relatively fixed set of cache blocks; this set of blocks is called the working set.

Capacity misses: when the working set exceeds the cache size, the resulting misses are capacity misses.

 

When a cache miss occurs, a placement policy for the level-k cache determines which level-k block receives the fetched block. Policies:

  1. Random placement (locating a block again costs too much, defeating the purpose of a cache);
  2. Restricted placement: a block at level k + 1 may be mapped only to one particular set of blocks at level k.

 

 

Cache mapping modes (the value of E):

A cache with a larger E has a lower probability of conflict misses, but the cost is higher (power consumption, process technology, circuit complexity...). Generally, E at level k + 1 is larger than E at level k.

1. E = 1: direct-mapped cache (each set has only one cache line)

A level-(k + 1) block can map only to the single line in its corresponding level-k set, so the replacement policy on a miss is trivial: the one line in the set is replaced. Direct-mapped caches have a high probability of conflict misses, and when the cache is not used carefully, thrashing can occur: the cache keeps loading and evicting the same blocks back and forth, as sketched below.
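A sketch of the classic thrashing pattern, assuming a tiny 32-byte direct-mapped cache with 16-byte blocks (2 sets) and x and y contiguous in memory, so x[i] and y[i] map to the same set; the cache size and layout are assumptions for illustration.

    /* Under the assumed cache, each x[i] access evicts the block holding
     * y[i] and vice versa, so every single reference misses: thrashing. */
    float dotprod(float x[8], float y[8])
    {
        float sum = 0.0f;
        for (int i = 0; i < 8; i++)
            sum += x[i] * y[i];   /* x and y alternately evict each other */
        return sum;
    }

    /* One common fix: pad the first array (e.g. declare float x[12]) so
     * that x[i] and y[i] fall into different sets, breaking the conflict. */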

2. E = C/B, S = 1: fully associative cache

Such a cache needs complex circuitry (the tag must be matched against every line in parallel, which is expensive), so full associativity is generally used only for small caches, such as TLBs.

3. 1 < E < C/B: E-way set associative cache

Here a level-(k + 1) block can map to any of the E lines in its level-k set. When a miss occurs in a set-associative cache, which line should be replaced? If the set has an empty or invalid line, use it directly; if the set is full, a policy chooses the victim: random, LFU (least frequently used), or LRU (least recently used), where the smarter policies reduce misses at the cost of extra bookkeeping hardware. A sketch of LRU victim selection follows.
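A minimal sketch (my own, with hypothetical structures) of line selection in one set of an E-way set-associative cache: prefer an invalid line, otherwise evict the least recently used one.

    #include <stdint.h>

    /* Hypothetical cache line: valid bit, tag, and a timestamp for LRU. */
    struct line {
        int      valid;
        uint64_t tag;
        uint64_t last_used;   /* updated on every hit */
    };

    /* Pick the victim line in a set of E lines: an invalid line needs
     * no eviction; otherwise take the least recently used line. */
    int choose_victim(struct line set[], int E)
    {
        int victim = 0;
        for (int i = 0; i < E; i++) {
            if (!set[i].valid)
                return i;                  /* empty line: place here */
            if (set[i].last_used < set[victim].last_used)
                victim = i;                /* older = less recently used */
        }
        return victim;
    }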

 

Reads were discussed above; the following summarizes writes.

If the word w being written is already cached at level k, it is a write hit. What policy should update w at level k + 1?

Policy:

1. Write-through: immediately write w's level-k block through to the corresponding level-(k + 1) block. Every write then generates bus traffic, so a write buffer can be added to hold the addresses and data awaiting update; the write buffer is managed by the memory controller and is a first-in, first-out structure that reduces the number of separate updates. Statistically, roughly one instruction in ten performs a write, so with a slow CPU this works well; but as CPU frequency rises, the memory system can no longer absorb writes at the CPU's average write rate.

2. Write-back: defer writing w's level-k block back to level k + 1 as long as possible; the updated block is written to the corresponding level-(k + 1) block only when the level-k block is evicted. This requires a dirty bit per line to indicate whether the block has been modified. Write-back can significantly reduce the number of transfers, and because transfers to lower levels take longer, lower-level caches tend to adopt write-back rather than write-through (though write-through plus a write buffer is also a way to use the cache effectively).

If the word w being written is not cached at level k, it is a write miss. What policy should update w at level k + 1?

1. Write-allocate: first fetch the corresponding block from level k + 1 into level k, then update the block now at level k. This exploits spatial locality, but every write miss first transfers a block from level k + 1 to level k.

2. No-write-allocate: write w directly to level k + 1, bypassing the level-k cache. A sketch of one combined write path follows.
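A sketch of the write path combining write-back with write-allocate, the pairing these two policies commonly form (write-through typically pairs with no-write-allocate). The structure and the two helper functions standing in for level-(k + 1) transfers are hypothetical placeholders, not any real cache's interface.

    #include <string.h>

    struct cline { int valid; int dirty; unsigned long tag; unsigned char data[64]; };

    /* Hypothetical stubs for the transfers to and from level k + 1. */
    static void load_block_from_next_level(struct cline *l, unsigned long tag)
    {
        (void)tag;
        memset(l->data, 0, sizeof l->data);    /* pretend to fetch the block */
    }
    static void write_block_to_next_level(const struct cline *l) { (void)l; }

    /* Write one byte through a single line: write-back + write-allocate. */
    static void cache_write(struct cline *l, unsigned long tag, int off,
                            unsigned char byte)
    {
        if (!(l->valid && l->tag == tag)) {    /* write miss              */
            if (l->valid && l->dirty)
                write_block_to_next_level(l);  /* evict the dirty victim  */
            load_block_from_next_level(l, tag);/* write-allocate: fetch   */
            l->valid = 1;
            l->tag   = tag;
            l->dirty = 0;
        }
        l->data[off] = byte;                   /* update the cached block */
        l->dirty = 1;                          /* defer the writeback     */
    }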

 

2012-11-18: notes from reading my advisor's paper:

When a write issued by the processor core targets an address that is not currently held in the cache (a write miss), the behavior falls under one of three allocation policies: read allocation, write allocation, and read/write allocation. Read allocation means the cache fills a line from external storage only on a read miss; write allocation means the line is filled on a write miss; with read/write allocation, the cache refill logic reloads data from main memory into the cache line on a miss of either kind.

 

Cache-friendly code:

When operating on a contiguous vector, touching every k-th element is called a stride-k reference pattern. As k grows, spatial locality worsens. This shows that how elements are laid out in memory affects the performance of our algorithms.

If the cache block size is B bytes, a stride-k pattern (stride measured in words) leads to an average miss rate of min(1, (wordsize × k) / B).
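For instance, with 4-byte words and B = 32, stride-1 misses on 4/32 = 12.5% of references, while stride-8 or larger misses every time. A sketch of the practical consequence, assuming a row-major C array (the function names are mine): traversing row by row is a stride-1 pattern, traversing column by column is a stride-N pattern.

    #define N 1024

    /* Row-major traversal: stride-1, miss rate about wordsize/B. */
    int sum_rows(int a[N][N])
    {
        int sum = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    /* Column-major traversal: stride-N, so once N * wordsize >= B
     * the miss rate reaches 1: every reference misses. */
    int sum_cols(int a[N][N])
    {
        int sum = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];
        return sum;
    }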

 
