8.6 Design Essentials of Cache


The cache is a delicate component, and making it work efficiently requires many trade-offs to be weighed carefully at design time. To understand how to design a good cache, we first need a few basic concepts.

Let's start with the cache's access process. When the CPU issues a read request, the cache checks whether the corresponding data is already stored inside it.

If it is, we call this a cache hit. One performance metric here is the hit rate: the percentage of all the CPU's memory requests that hit in the cache. On a hit, the cache returns the data to the CPU directly. The time it takes to return the data on a hit is called the hit time, another important performance parameter. In current CPUs, the first-level cache hit time is about 1 to 3 cycles, and the second-level cache hit time is about 5 to 20 cycles.

If the data is not in the cache, we call this a cache miss, and the corresponding metric is the miss rate. The miss rate and the hit rate always add up to 100%. On a miss, the cache issues a read request to main memory and then waits for the data to come back; the time spent waiting for main memory is called the miss penalty. Today it usually takes 100 to 300 clock cycles to get the data back from main memory.

To evaluate memory access performance, we often use the average memory access time, which is computed from the parameters just described: the average access time equals the hit time plus the miss rate multiplied by the miss penalty. To improve access performance we have to lower the average access time, which means reducing these three parameters; that is the main avenue for improving memory performance.
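As a minimal sketch (illustrative, not part of the original lecture), this rule can be written as a one-line Python function; the name and parameters are just descriptive:

    def amat(hit_time, miss_rate, miss_penalty):
        """Average memory access time in cycles: every access pays the
        hit time, and the fraction that misses also pays the miss penalty."""
        return hit_time + miss_rate * miss_penalty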

To reduce the hit time, the cache capacity should be kept small and its structure should not be too complicated. However, a small, simple cache misses more easily, which pushes the average access time back up. To reduce the miss penalty, we either improve the performance of main memory or add a second-level cache between the current cache and main memory, but then the same design problems reappear at the new cache level. So these three paths are not independent; they are intertwined and affect one another. Let's focus first on the hit rate.

Suppose the cache hit rate is 97%, the hit time is 3 cycles, and the miss penalty is 300 cycles. Then the average access time is 3 + 3% × 300 = 12 cycles. If we can raise the hit rate to 99% without affecting the hit time or the miss penalty, the average access time drops to 3 + 1% × 300 = 6 cycles. So although the hit rate only rose by 2 percentage points, which does not look like much, the access performance doubled, which is a very large gain. For today's caches, even a small improvement in hit rate can bring a substantial performance boost.
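Using the sketch above with the numbers from this example:

    print(amat(hit_time=3, miss_rate=0.03, miss_penalty=300))   # 12.0 cycles at a 97% hit rate
    print(amat(hit_time=3, miss_rate=0.01, miss_penalty=300))   # 6.0 cycles at a 99% hit rate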

Which factors will affect the hit rate?

Or, to turn the question around, what causes a cache to miss?

The first kind of miss is the compulsory miss. A data block that has never been accessed cannot already be in the cache, so the miss on the first access to a block is called a compulsory miss. Compulsory misses are hard to avoid.

The second kind is the capacity miss. If the set of data blocks the program needs exceeds the total capacity of the cache, misses will keep occurring no matter how cleverly we design the cache. Capacity misses can of course be mitigated by enlarging the cache, but a larger cache raises cost and may also lengthen the hit time, so it has to be weighed comprehensively.

The third kind is the conflict miss: the cache is not full, but because multiple memory blocks are mapped to the same cache line, they conflict over that location and cause misses. Let's look at how this happens and how to address it; the issue here is the cache's mapping policy.

Look again at the memory region, drawn as a sequence of 16-byte data blocks (so the lowest hexadecimal digit of each block's address is 0). The Address column is marked with the starting address of each block, and consecutive addresses differ by exactly 16. Suppose the cache has 8 entries and we use the placement scheme described earlier: the block at address 000h goes into entry 0, the block at address 080h must also go into entry 0, and the block at address 100h is stored in that same entry. In effect, memory is divided into groups of 8 blocks; block 0 of any group goes into entry 0, block 1 into entry 1, and so on. This mapping policy is called direct mapping. Its advantage is that the hardware is very simple: the address alone tells us which entry a block must occupy.

But its problem is just as obvious. Suppose a program keeps accessing two data items alternately; call them data A and data B. If data A lives in the block at address 000h and data B in the block at address 080h, then accessing A brings the block at 000h into the cache and hands A to the CPU. The CPU then needs B, finds that B is not in the cache, so the block at 080h is fetched and replaces the block in entry 0, and B is handed to the CPU. When the CPU accesses A again, the cache must fetch the block at 000h once more, overwriting entry 0 yet again. If the CPU keeps alternating between A and B, every cache access misses; memory performance is then worse than having no cache at all, while the other cache lines sit empty and contribute nothing. That is the weakness of this mapping policy.
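As a small illustrative sketch (not from the original material), here is how a direct-mapped cache with 8 entries and 16-byte blocks, as in the running example, picks an entry from an address; the function name is made up:

    BLOCK_SIZE = 16   # bytes per block, as in the example
    NUM_LINES = 8     # cache entries (table items)

    def direct_mapped_entry(address):
        """Drop the 4 block-offset bits, then take the block number modulo 8."""
        block_number = address // BLOCK_SIZE
        return block_number % NUM_LINES

    print(direct_mapped_entry(0x000))   # 0
    print(direct_mapped_entry(0x080))   # 0 -- conflicts with the block at 000h
    print(direct_mapped_entry(0x100))   # 0 -- also entry 0
    print(direct_mapped_entry(0x010))   # 1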

To solve this problem we can make an improvement that does not increase the total cache capacity: divide the 8 cache lines into sets of two, giving a two-way set-associative cache. Now alternating between data A and data B is no longer a problem: when the CPU accesses data A, the block at address 000h goes into one way of the set, and when it accesses data B, the block at address 080h goes into the other way. Repeated accesses to A and B then all hit in the cache, so access performance is very good. Of course, if the CPU alternates among data in three blocks (000h, 080h, 100h) that map to the same set, a two-way set-associative cache will again miss repeatedly. We can therefore slice further; that gives a four-way set-associative cache.
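To make the benefit concrete, here is a tiny illustrative simulator (not from the lecture; it assumes the same 8 lines and 16-byte blocks, with LRU replacement inside each set) showing that the alternating A/B trace misses every time when direct-mapped but hits after the first two accesses when two-way set-associative:

    from collections import OrderedDict

    def simulate(addresses, num_lines=8, ways=2, block_size=16):
        """Tiny set-associative cache model with LRU replacement in each set.
        Returns the number of hits for the given address trace."""
        num_sets = num_lines // ways
        sets = [OrderedDict() for _ in range(num_sets)]   # block number -> True
        hits = 0
        for addr in addresses:
            block = addr // block_size
            s = sets[block % num_sets]
            if block in s:
                hits += 1
                s.move_to_end(block)            # mark as most recently used
            else:
                if len(s) == ways:
                    s.popitem(last=False)       # evict the least recently used way
                s[block] = True
        return hits

    trace = [0x000, 0x080] * 10                 # alternate between data A and data B
    print(simulate(trace, ways=1))              # direct mapped: 0 hits out of 20
    print(simulate(trace, ways=2))              # two-way set-associative: 18 hits out of 20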

Can we keep subdividing without limit? Yes. If the cache has only 8 lines in total and we make all 8 lines one group, then any memory block can be placed in any line of the cache, without the address dictating a particular line. A cache of this structure is called fully associative. Its placement flexibility is obviously the highest, but its control logic also becomes very complex. When the CPU sends an address, the cache has to determine whether that address is present, which means comparing the tags of every entry that might hold it. A direct-mapped cache only needs to read and compare one tag; a two-way set-associative cache must compare two tags at the same time, a four-way cache four tags, and in the fully associative case the tags of all entries must be compared.

These comparisons require a large amount of hardware, which increases both delay and power consumption. If the cache is divided into too many ways, the miss rate may go down, but the hit time goes up, so the trade can end up not being worth it. Moreover, increasing the number of ways does not necessarily reduce the miss rate.

In a set-associative cache, the same memory block can be placed in several different locations. If all of those locations are already occupied, one line has to be chosen for replacement, and how well the replacement algorithm is designed has a great impact on performance. If the cache keeps evicting exactly the block that is about to be used again, performance will be poor.

There are several common cache replacement algorithms. The simplest is random replacement, whose performance is obviously not great. Then there is round-robin replacement, which replaces ways in a fixed, predetermined order: in a four-way set, if way 0 was replaced last time, way 1 is replaced this time, way 2 next time, then way 3. This is relatively simple to design in hardware, but its performance is also mediocre. A better-performing algorithm is least-recently-used replacement, called LRU. It needs extra hardware to record access history, and on a replacement it chooses the cache line that has not been accessed for the longest time. In practice this method performs better, but its hardware is quite complex. So both the mapping policy and the replacement algorithm have to be weighed between performance and implementation cost.
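As an illustrative sketch of the three victim-selection rules (the function names and state variables are made up, and real hardware implements these with counters and comparators rather than software):

    import random

    def victim_random(num_ways):
        # Random replacement: any way may be evicted.
        return random.randrange(num_ways)

    def victim_round_robin(next_way, num_ways):
        # Round-robin replacement: evict ways in a fixed rotating order.
        # Returns the victim and the updated pointer for next time.
        return next_way, (next_way + 1) % num_ways

    def victim_lru(last_used_time):
        # LRU: evict the way whose most recent access is the oldest.
        # last_used_time[i] records when way i was last accessed.
        return min(range(len(last_used_time)), key=lambda i: last_used_time[i])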

Let's take a look at some examples of cache design.

In the x86 family, the 486 was the first CPU to integrate the cache on the CPU chip, but it used a single cache shared by instructions and data. One obvious drawback of such a unified cache is that the locality of instructions and the locality of data interfere with each other. Instructions and data are generally stored in different regions of memory, each with its own locality. As a program runs and manipulates a large amount of data, the data quickly fills the cache and squeezes the instructions out. Then, when the next instruction executes, the instruction-fetch stage is likely to miss in the cache and has to wait on main memory, which takes a long time; the execution stage, which fetches the operands, often hits in the cache and is short, but the total execution time of the instruction is still very long.

So in the later Pentium, instructions and data were given two separate caches, so that their respective locality no longer interferes.

The first-level cache for most CPUs now takes this form.

Now look at the more advanced Core i7. It uses a multi-level cache structure. The first-level cache is split into instruction and data caches of 32 KB each, organized as 8-way set-associative, with a hit time of 4 cycles. That is why, in the CPU pipeline, accessing the cache also has to occupy several pipeline stages.

In this four-core i7, each processor core also has its own second-level cache. The L2 cache is no longer split into instruction and data parts, because with its larger capacity the interference between instructions and data is far less noticeable. But its hit time is also longer, about 11 cycles; the Core i7 pipeline has only around 16 stages in total, so there is no way for the pipeline to work directly against the L2 cache. This is also one of the reasons the first-level cache is not made very large: a bigger L1 would need even more pipeline stages to access.

Below the L2 cache there is a third-level cache, shared by all four cores, with a total capacity of 8 MB. The L3 cache uses a 16-way set-associative structure; its large capacity leads to a long hit time of 30 to 40 cycles, but this structure gives it a high hit rate, so main memory rarely has to be accessed.

Across the three cache levels, the hit times run from 4 cycles to 11 cycles to about 40 cycles, and main memory sits at 100 to 300 cycles. The multi-level cache plus main memory thus forms clearly separated tiers; each tier can focus on a different design goal, and together they complement one another to raise the performance of the whole system.
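As a rough sketch of how the tiers compose (the hit times below are the figures just mentioned, but the per-level miss rates and the 200-cycle memory latency are made-up placeholders, not measured i7 numbers):

    def multilevel_amat(levels, memory_latency):
        """Average access time of a cache hierarchy.
        `levels` is a list of (hit_time, miss_rate) pairs ordered L1, L2, L3;
        each level's miss penalty is the average access time of the level below."""
        time = memory_latency
        for hit_time, miss_rate in reversed(levels):
            time = hit_time + miss_rate * time
        return time

    # Hit times of 4, 11, and 40 cycles from the text; miss rates are hypothetical.
    print(multilevel_amat([(4, 0.05), (11, 0.20), (40, 0.10)], memory_latency=200))  # about 5.2 cycles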

Cache research has been going on for a long time and is still an active topic today. Earlier work focused on the single level of cache between the CPU and main memory, whereas current work focuses on hierarchies built from multiple cache levels. Either way, research on cache technology has given us the highest possible system performance at a controllable cost.
