One of the factors that must be considered during Cache Optimization


The cache is an important consideration in performance optimization. As instructions execute faster and faster, memory reads and writes become the most time-consuming operations. This is a reposted article, and a pretty good one.

0. Origin

This post adds some background on caches and then takes another look at page coloring.

1. Overview

The cache is used to buffer memory data. If the data the CPU wants to access is already in the cache, it is a "hit"; otherwise it is a "miss".

The speed at which the CPU accesses the cache lies between that of registers and memory (roughly an order of magnitude apart at each level), and so does the cost of implementing a cache.

CPU caches are now subdivided into several levels; the common ones are the L1, L2, and L3 caches. From L1 to L3, read/write latency increases while cost per byte decreases.

Modern systems use the hierarchy register -> L1 cache -> L2 cache -> L3 cache -> memory -> mass storage. This layered structure is a compromise design that resolves the tension between performance and price.

The theoretical basis for introducing a cache is the principle of program locality, which includes temporal locality and spatial locality. Temporal locality: data recently accessed by the CPU is likely to be accessed again soon. Spatial locality: data near recently accessed data is likely to be accessed in the near future. Cached data can therefore be fetched directly from the cache on the next access, improving speed by an order of magnitude.
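As a rough illustration of spatial locality, the sketch below counts the distinct cache lines touched by two access patterns, assuming a hypothetical 64-byte line size and a cache large enough that only cold misses occur:

```python
# Sketch: count cache-line fetches for two access patterns, assuming a
# hypothetical 64-byte line and only "cold" misses (cache never evicts).
LINE_SIZE = 64

def cold_misses(addresses, line_size=LINE_SIZE):
    """Number of distinct cache lines touched by a sequence of byte addresses."""
    return len({addr // line_size for addr in addresses})

N = 4096
sequential = range(N)              # touch every byte of a 4 KB buffer
strided    = range(0, N * 64, 64)  # one byte per line, 4096 lines apart

print(cold_misses(sequential))     # 64   -> spatial locality pays off
print(cold_misses(strided))        # 4096 -> every access fetches a new line
```

Sequential access touches each line once (64 fetches for 4 KB of data), while a line-sized stride pays one fetch per access.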

2. Structure

2.1 Cache lines

The cache is divided into lines of a fixed size (the line size); the line is the unit in which the cache stores and manages data. A typical cache line consists of a tag field, a status field, and a data field.

The tag field stores the high-order address bits of the cached data. After the CPU indexes into a set, it compares the corresponding address bits against the tags of all lines in that set to identify the specific line.

The data field holds line-size bytes of data; the line is the unit of data exchange between the cache and memory.

The status field contains control bits (such as valid, lock, and parity bits). Different cache types and different cache implementations have slightly different status fields; refer to the corresponding CPU manual for details.

2.2 Associativity

2.2.1 Set-associative

A set-associative cache groups several cache lines into a set: four lines per set is called 4-way set-associative, eight lines per set is 8-way set-associative, and so on.

For example, a 32 KB 4-way set-associative cache with a 16-byte line size has 32 * 1024 / (16 * 4) = 512 sets.

When the CPU accesses a set-associative cache, the address first indexes into a set, and then the tag is matched against every line in that set to select the right one.

Memory is logically divided into blocks of the cache line size, numbered from low addresses to high. This creates a mapping question between cache and memory: when the data in, say, the fourth memory block is accessed by the CPU, which cache line will hold it?

A set-associative cache first indexes the set, using the rule: cache set number = memory block number % total number of cache sets.

For a cache with only two sets and a 16-byte line size, indexing the set simply uses the 5th bit of the address (ADDR[4]): 0 selects the first set, 1 selects the second. The low 4 address bits (ADDR[3:0]) index the data within the line.

A memory block can then be mapped to any line within its set. When data is cached, a free line is used if one is available; if all lines in the set are occupied, a replacement algorithm (LRU, random, FIFO, or LFU) chooses a line to evict.
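To illustrate one of these policies, here is a minimal LRU sketch for a single set, using Python's OrderedDict to track access order (the tag strings and "hit"/"miss" return values are illustrative only, not any real hardware interface):

```python
from collections import OrderedDict

# Minimal LRU sketch for one cache set: tags are kept in access order;
# on a miss with a full set, the least recently used tag is evicted.
class CacheSet:
    def __init__(self, ways=4):
        self.ways = ways
        self.lines = OrderedDict()   # tag -> data, least recently used first

    def access(self, tag):
        if tag in self.lines:                 # hit: mark most recently used
            self.lines.move_to_end(tag)
            return "hit"
        if len(self.lines) >= self.ways:      # miss, set full: evict the LRU line
            self.lines.popitem(last=False)
        self.lines[tag] = None                # fill a line with the new tag
        return "miss"

s = CacheSet(ways=2)
print(s.access("A"))  # miss (cold)
print(s.access("B"))  # miss (cold)
print(s.access("A"))  # hit
print(s.access("C"))  # miss, evicts B (least recently used)
print(s.access("B"))  # miss again
```

Note how re-touching "A" before "C" arrives saves it from eviction and dooms "B" instead; that is the whole idea of LRU.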

2.2.2 Direct-mapped

For details, see the article on the MIPS cache architecture.

2.2.3 Fully associative

For details, see the article on the MIPS cache architecture.

3. How a Set-Associative Cache Works

A K-way set-associative cache typically indexes the set with a physical address (PA) or virtual address (VA), then matches the tag fields of all K lines in the set simultaneously against the PA or VA. If one matches, it is a hit, and the corresponding word in the hit line is sent to the register.

For the 32 KB 4-way cache above with a 16-byte line size, the address is divided as follows during a lookup:

ADDR[12:4] indexes the set: 9 bits can index 512 sets.
ADDR[3:0] indexes the byte within the line: 2^4 = 16 bytes.
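The address split can be checked with a short sketch (the example address 0x12345 is arbitrary):

```python
# Sketch: split an address for the 32 KB, 4-way, 16-byte-line cache above.
LINE_SIZE  = 16                                 # -> 4 offset bits, ADDR[3:0]
WAYS       = 4
CACHE_SIZE = 32 * 1024
NUM_SETS   = CACHE_SIZE // (LINE_SIZE * WAYS)   # 512 sets -> 9 index bits, ADDR[12:4]

def split(addr):
    offset = addr & (LINE_SIZE - 1)       # ADDR[3:0]
    index  = (addr >> 4) & (NUM_SETS - 1) # ADDR[12:4]
    tag    = addr >> (4 + 9)              # everything above ADDR[12]
    return tag, index, offset

print(NUM_SETS)          # 512
print(split(0x12345))    # (9, 52, 5): tag 9, set 52, byte 5 within the line
```

Any two addresses that agree in bits [12:4] compete for the same set, regardless of their tags; that fact drives the page-coloring discussion below.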

In a VA-indexed, PA-tagged cache, the virtual address indexes the set while the physical address (obtained from the TLB in parallel) is matched against the tags.

4. The Clash Between Paging and Cache Indexing

At the OS level we also have the concept of paging: physical memory is divided into page frames, commonly 4 KB in size. With 4 KB pages, the low 12 address bits index data within the page.

If the system uses a 4 KB page_size, the first page frame always maps to the first 256 sets of the cache, the second page frame to the last 256 sets, the third back to the first 256 sets, and so on (call the first 256 sets "red" and the last 256 "black"). If the workload keeps landing on page frames that map to the same 256 sets, trouble follows: in a system that pages frequently, the number of lines per set is fixed, so once the lines in those sets are used up the cache starts reusing lines, the original data is evicted, and the next access is a cache miss. So how can the OS allocate physical pages so that they spread evenly across the CPU's cache sets?

In layman's terms, the OS should hand out as many "red" pages as "black" pages when allocating. That is what page coloring does.
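The red/black mapping can be sketched as follows: each 4 KB page frame covers 4096 / 16 = 256 consecutive sets, so its color is just the frame number modulo 2:

```python
# Sketch: with 4 KB pages, 16-byte lines, and 512 sets, each page frame
# covers 4096 / 16 = 256 consecutive sets, so its color ("red" = first
# 256 sets, "black" = last 256) is the frame number modulo 2.
PAGE_SIZE = 4 * 1024
LINE_SIZE = 16
NUM_SETS  = 512
SETS_PER_PAGE = PAGE_SIZE // LINE_SIZE      # 256
NUM_COLORS    = NUM_SETS // SETS_PER_PAGE   # 2

def page_color(frame_number):
    return frame_number % NUM_COLORS        # 0 = red, 1 = black

print([page_color(f) for f in range(6)])    # [0, 1, 0, 1, 0, 1]
```

Consecutive physical frames alternate colors, so a balanced allocator just needs to hand out equal numbers of each.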

More formally, these are called color bits:

To see why, add up the span of the set-index bits and the intra-line offset bits of the address: 2^13 = 8 KB. This value is exactly the size of one way array of the cache: 32 KB / 4 = 8 KB (way_size).

So when working out a system's color bits, the way array is the useful concept: way_size is the span of the low address bits used for indexing, and accessing the cache through those low bits is like accessing one way of the cache.

When way_size > page_size, i.e. log2(way_size) > log2(page_size), the top log2(way_size) - log2(page_size) set-index bits lie above the page offset. For convenience of analysis, these extra bits are called color bits. For the cache above with a 4 KB page_size, ADDR[12] is the color bit. When the OS allocates pages, it should ensure there are equally many pages of each color. For example, if ADDR[13:12] were the color bits, there would be four colors (00, 01, 10, and 11) to allocate in equal numbers.

If log2(way_size) <= log2(page_size), data is naturally distributed evenly across the cache. This suggests another fix: increase the system's page size.
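One way an OS could balance colors is round-robin allocation from per-color free lists. The sketch below is purely hypothetical (no real kernel API is being modeled), just to show the idea:

```python
from collections import deque

# Hypothetical sketch of page-coloring allocation: keep one free list per
# color and hand out pages round-robin so every color is used equally.
class ColoredAllocator:
    def __init__(self, num_frames, colors=2):
        self.colors = colors
        # Frame f has color f % colors; bucket the free frames accordingly.
        self.free = [deque(f for f in range(num_frames) if f % colors == c)
                     for c in range(colors)]
        self.next_color = 0

    def alloc(self):
        # Start from the preferred color, falling back if its list is empty.
        for i in range(self.colors):
            c = (self.next_color + i) % self.colors
            if self.free[c]:
                self.next_color = (c + 1) % self.colors
                return self.free[c].popleft()
        raise MemoryError("out of page frames")

a = ColoredAllocator(num_frames=8, colors=2)
print([a.alloc() for _ in range(4)])   # [0, 1, 2, 3] -> colors 0, 1, 0, 1
```

Because the allocator alternates colors, successive allocations map to different halves of the cache instead of piling onto the same 256 sets.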

5. Summary

way_size = cache_size / ways

When log2(way_size) > log2(page_size), the system has color bits. The effects are:

1. A VA-indexed, PA-tagged cache may suffer from cache aliasing problems.
2. With a PA-indexed, PA-tagged cache, you can use page coloring to improve the cache hit rate, or increase page_size to remove the color bits.

Personally, I think that where feasible it is more reliable to increase the page size; larger pages help not only the cache but also improve the TLB hit rate (corrections welcome).

Reprinted: http://www.tektalk.org/2011/04/14/%e7%8e%b0%e4%bb%a3-cpu-%e4%b8%ad%e7%9a%84-cache-%e7%bb%93%e6%9e%84/
