Principle, Design, and Implementation of Cache in Computer Architecture

Source: Internet
Author: User

Preface

Although increases in CPU clock speed drive overall system performance, performance does not depend on the CPU alone. It is also determined by the system architecture, the instruction structure, the speed at which information is transferred between components, and the access speed of the storage devices, especially the speed at which the CPU can access memory.

If the CPU runs at high speed but memory access is comparatively slow, the CPU has to wait, processing speed drops, and CPU capability is wasted.

For example, on a Pentium III the execution time of one instruction is about 2 ns, while the access time of SDRAM memory is about 10 ns, five times slower. How, then, can the performance of the CPU and the PC be fully exploited?

How can we reduce the speed difference between CPU and memory? There are four methods:

One is to insert wait states into the basic bus cycle, but this wastes CPU capability.

The second is to use SRAM, with its short access time, as main memory. This eliminates the speed mismatch between the CPU and memory, but greatly increases system cost.

The third is to insert a small, fast SRAM between the slow DRAM and the fast CPU to buffer data, so that the CPU can access the data held in the SRAM at high speed. This is the cache method, and it does not add much to system cost.

The fourth is to adopt a new type of memory.

Currently, the third method is generally used. It is a very effective technique for improving the performance of PC systems without raising cost.

This article introduces the concept, working principle, structural design, and implementation of the cache in the PC and the CPU.


How cache works

The principle of cache is based on the locality of program access.

Analysis of a large number of typical running programs shows that, over any short interval, the addresses generated by a program are concentrated in a small region of the logical address space. Instruction addresses are largely sequential, and loops and subroutine segments are executed repeatedly, so accesses to these addresses are naturally clustered in time.

The clustering of data accesses is less pronounced than that of instructions, but the storage and access of arrays and the reuse of working variables still keep the accessed memory addresses relatively concentrated. This pattern, frequent access to addresses within a small local range and only rare access to addresses outside it, is called the locality of program access.
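As a simple illustration (not from the original article), the C fragment below touches memory in exactly the pattern locality describes: the loop instructions and the variables i and sum are reused on every iteration (temporal locality), and consecutive array elements sit at adjacent addresses (spatial locality). The identifiers are, of course, only illustrative.

```c
#include <stdio.h>

#define N 1024

int main(void)
{
    int a[N];
    long sum = 0;

    /* Temporal locality: the loop body and the variables i and sum
     * are reused on every iteration.
     * Spatial locality: a[0], a[1], a[2], ... occupy consecutive
     * addresses, so a cache line fetched for one element also
     * brings in its neighbours. */
    for (int i = 0; i < N; i++)
        a[i] = i;
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %ld\n", sum);
    return 0;
}
```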

Based on this locality principle, a small-capacity, high-speed memory can be placed between main memory and the CPU's general-purpose registers, and the instructions and data near the current instruction address copied into it for the CPU to use over a period of time. This greatly improves program execution speed. This small, fast memory between main memory and the CPU is called the high-speed buffer memory, or cache.

Based on this principle, the system continually reads into the cache a relatively large block of the instructions that follow, and are associated with, the instruction currently being executed, and exchanges data with the CPU at high speed, so that the two speeds are matched.

When the CPU requests data from memory, it usually accesses the cache first. Because the locality principle cannot guarantee that the requested data is actually in the cache, there is a hit rate: the probability that, at any given time, the CPU finds the data it needs in the cache.

The higher the hit rate, the more often the CPU gets its data directly from the cache. In general the cache capacity is much smaller than that of main memory, but it must not be too small, or the hit rate drops too low. It also need not be too large: a larger cache raises cost, and beyond a certain size the hit rate no longer increases significantly with capacity.

As long as the cache space and the main-memory space maintain an appropriate mapping relationship within a certain range, the cache hit rate remains quite high.

Generally, the space ratio of cache to main memory is about 4:1000; that is, a 128 KB cache can map 32 MB of memory, and a 256 KB cache can map 64 MB. In that case the hit rate is above 90%. Data that misses must be fetched by the CPU directly from main memory, and is at the same time copied into the cache for later accesses.
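As a rough illustration (using the 2 ns and 10 ns figures from the preface, which are only indicative): with a 90% hit rate the average access time is about 0.9 × 2 ns + 0.1 × 10 ns = 2.8 ns, compared with 10 ns if every access went to main memory. This simple estimate ignores the extra time spent probing the cache on a miss, a point taken up in the read policies described later.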



Basic cache structure

The cache is usually implemented with associative (content-addressable) memory. Each storage block of an associative memory carries additional identifying information, called a tag. When the associative memory is accessed, the requested address is compared with every tag simultaneously, so the block whose tag matches is accessed directly. The basic cache structures are as follows:

Fully Associative Cache

In a fully associative cache there is no fixed relationship between a stored block and any particular location, storage order, or address. A program may access many subroutines, stacks, and segments that lie in different parts of main memory.

The cache therefore holds many mutually unrelated data blocks, and must store the address of each block along with the block itself. When data is requested, the cache controller compares the requested address with all the stored addresses to confirm whether it is present.

The main advantage of this structure is that at any given time it can hold blocks from anywhere in main memory, so the hit rate is high. The disadvantage is that every request must be compared against all the addresses held in the cache, which takes considerable time, so lookup is slow.
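A minimal C sketch (not from the original article, with an assumed 64-line cache and 32-byte blocks) of how a fully associative lookup can be modelled: the requested address must be compared against the tag of every line. In hardware the comparisons happen in parallel; the loop here only models that behaviour.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES   64   /* assumed number of cache lines */
#define LINE_BYTES  32   /* assumed block (line) size     */

struct cache_line {
    bool     valid;
    uint32_t tag;        /* the full block address serves as the tag */
};

static struct cache_line cache[NUM_LINES];

/* Fully associative lookup: a block may sit in any line, so every
 * valid tag must be checked (done in parallel by real hardware). */
bool fully_associative_hit(uint32_t addr)
{
    uint32_t tag = addr / LINE_BYTES;   /* drop the byte offset */

    for (int i = 0; i < NUM_LINES; i++)
        if (cache[i].valid && cache[i].tag == tag)
            return true;                /* hit */
    return false;                       /* miss */
}
```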


Direct-Mapped Cache

A direct-mapped cache differs from a fully associative cache in that each address has to be compared only once.

In a direct-mapped cache, each main-memory block has only one possible location in the cache, so the number of address comparisons is reduced to one. Each block location in the cache is assigned an index field, and a tag field distinguishes which of the main-memory blocks that map to that location is currently stored there.

Direct mapping divides main memory into pages, each page the same size as the cache. The offset within a main-memory page maps directly to the same offset in the cache, and the cache tag memory stores the page address (page number) of the main-memory block held at each location.

As can be seen, a direct-mapped cache can be searched faster than a fully associative cache. Its disadvantage is that when main-memory blocks that map to the same cache location are accessed alternately, the cache controller must repeatedly swap blocks in and out.
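The following C sketch (again illustrative only, assuming 32-byte blocks and 256 lines) shows how a direct-mapped cache splits an address into offset, index, and tag, and why a single comparison suffices.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES  32    /* assumed bytes per block (line)       */
#define NUM_LINES   256   /* assumed number of lines in the cache */

struct cache_line {
    bool     valid;
    uint32_t tag;         /* page number of the main-memory block */
};

static struct cache_line cache[NUM_LINES];

/* Direct-mapped lookup: the index field selects exactly one line,
 * so only one tag comparison is needed. */
bool direct_mapped_hit(uint32_t addr)
{
    uint32_t offset = addr % LINE_BYTES;                /* byte within the block */
    uint32_t index  = (addr / LINE_BYTES) % NUM_LINES;  /* which cache line      */
    uint32_t tag    = addr / (LINE_BYTES * NUM_LINES);  /* which memory page     */

    (void)offset;   /* the offset would select the word within the block */

    return cache[index].valid && cache[index].tag == tag;
}
```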

Set-Associative Cache

A set-associative cache is a structure between the fully associative cache and the direct-mapped cache. It uses several groups (ways) of direct-mapped blocks: for a given index, a block may reside in any of several locations, which raises the hit rate and the system efficiency.
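A corresponding sketch for a set-associative cache (illustrative, assuming 4 ways and 64 sets): the index now selects a small set, and only the few ways in that set are compared.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES  32   /* assumed bytes per block (line) */
#define NUM_SETS    64   /* assumed number of sets         */
#define WAYS         4   /* assumed blocks (ways) per set  */

struct cache_line {
    bool     valid;
    uint32_t tag;
};

static struct cache_line cache[NUM_SETS][WAYS];

/* Set-associative lookup: the index selects one set; the tag is then
 * compared against only the few ways of that set. */
bool set_associative_hit(uint32_t addr)
{
    uint32_t index = (addr / LINE_BYTES) % NUM_SETS;
    uint32_t tag   = addr / (LINE_BYTES * NUM_SETS);

    for (int way = 0; way < WAYS; way++)
        if (cache[index][way].valid && cache[index][way].tag == tag)
            return true;
    return false;
}
```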



Consistency between cache and DRAM access

After a cache is added between the CPU and main memory, the question arises of how data moves among the CPU, the cache, and main memory. There are two ways of reading and two ways of writing data.

Look-Through

In this mode the cache sits between the CPU and main memory: the CPU sends every data request to the cache, and the cache searches for the data itself. If the request hits, the cache cuts off the request to main memory and returns the data; if it misses, the request is forwarded to main memory.

The advantage of this method is that it reduces the number of requests that reach main memory; the disadvantage is that it adds delay to the CPU's accesses to main memory.

Look-Aside

In this mode, when the CPU issues a data request, it is not routed solely through the cache; the request is sent to the cache and to main memory at the same time. Because the cache is faster, on a hit it returns the data to the CPU and interrupts the request to main memory; on a miss the cache takes no action and the CPU simply completes its access to main memory directly.

Its advantage is that there is no added delay; the disadvantage is that every CPU access, hit or miss, occupies part of the main-memory bus time.
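The difference between the two read policies can be made concrete with a small timing model (a sketch with assumed latencies, not figures from the original article): under look-through a miss pays the cache probe time plus the memory time, while under look-aside a miss pays only the memory time, at the cost of occupying the memory bus on every access.

```c
#include <stdio.h>

/* Assumed, purely illustrative latencies in nanoseconds. */
#define T_CACHE  2.0   /* time to probe/read the cache  */
#define T_MEM   10.0   /* time for a main-memory access */

/* Look-through: every request goes to the cache first; only a miss
 * is forwarded to main memory, so a miss pays both latencies. */
double look_through_avg(double hit_rate)
{
    return hit_rate * T_CACHE + (1.0 - hit_rate) * (T_CACHE + T_MEM);
}

/* Look-aside: the request goes to the cache and main memory at the
 * same time; a miss costs only the memory latency, but every access
 * starts a memory cycle and so occupies the bus. */
double look_aside_avg(double hit_rate)
{
    return hit_rate * T_CACHE + (1.0 - hit_rate) * T_MEM;
}

int main(void)
{
    double h = 0.90;   /* assumed hit rate */
    printf("look-through: %.1f ns, look-aside: %.1f ns\n",
           look_through_avg(h), look_aside_avg(h));
    return 0;
}
```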

Write-Through

Every write the CPU issues to the cache is also written to main memory, which guarantees that the data in main memory is always updated in step with the cache.

This is simple to implement, but because main memory is slow it reduces the system's write speed and occupies bus time.

Write-Back (Copy-Back)

To overcome those disadvantages, the reduced write speed and the occupied bus time, and to keep accesses to main memory to a minimum, the write-back policy is used.

It works like this: data is normally written only to the cache, so the cached copy may be updated while the copy in main memory is left unchanged (and becomes stale). A flag (dirty bit) associated with the line's address records this. Only when the line in the cache is about to be replaced is the modified data written back to the corresponding location in main memory, after which the new data is accepted. This keeps the cache and main memory from conflicting.
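The two write policies can be sketched in C as follows (a deliberately tiny, single-line model with assumed sizes and names; real controllers keep a dirty bit per line and write back on replacement, which is what the evict function models).

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 32                 /* assumed block size          */

struct cache_line {
    bool     valid;
    bool     dirty;                   /* cached copy newer than DRAM */
    uint32_t tag;
    uint8_t  data[LINE_BYTES];
};

static struct cache_line cached;      /* toy model: a single line    */
static uint8_t memory[1 << 16];       /* toy 64 KB main memory; all
                                         addresses assumed to fit    */

/* Write-through: every store updates the cache AND main memory, so
 * the two never diverge, at the cost of a memory write each time.   */
void write_through(uint32_t addr, uint8_t value)
{
    if (cached.valid && cached.tag == addr / LINE_BYTES)
        cached.data[addr % LINE_BYTES] = value;
    memory[addr] = value;             /* always written to memory    */
}

/* Write-back (copy-back): a store only updates the cache and marks
 * the line dirty; main memory becomes stale for the moment.         */
void write_back_store(uint32_t addr, uint8_t value)
{
    if (cached.valid && cached.tag == addr / LINE_BYTES) {
        cached.data[addr % LINE_BYTES] = value;
        cached.dirty = true;          /* main memory is now stale    */
    } else {
        memory[addr] = value;         /* simplification: write around on a miss */
    }
}

/* Called when the line must be replaced: only now is the modified
 * block copied back to its location in main memory.                 */
void evict_line(void)
{
    if (cached.valid && cached.dirty)
        memcpy(&memory[cached.tag * LINE_BYTES], cached.data, LINE_BYTES);
    cached.valid = false;
    cached.dirty = false;
}
```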



Hierarchical System Design of Cache

The performance of a microprocessor can be estimated with the following formula:

Performance = k × f / (CPI + (1 − H) × N)

where k is a proportionality constant, f is the operating frequency, CPI is the number of cycles needed to execute each instruction, H is the cache hit rate, and N is the number of cycles consumed by a main-memory access on a miss.
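As a rough illustration with assumed numbers: take CPI = 1, N = 10 cycles per miss, and H = 0.90. The effective number of cycles per instruction is then 1 + (1 − 0.90) × 10 = 2; raising the hit rate to 0.95 lowers it to 1.5, roughly a one-third performance gain at the same clock frequency.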

Therefore, to improve processor performance, the operating frequency must be raised, the number of cycles per instruction reduced, the cache hit rate increased, and the memory-cycle count reduced. Issuing multiple instructions at once and using out-of-order execution reduce CPI; branch prediction and larger cache capacities raise H; and a high-speed bus interface together with non-blocking caches reduces N.

In the past, processor performance was improved mainly by raising the operating frequency and the degree of instruction-level parallelism. Going forward, the cache hit rate will be raised further, which is why hierarchical, non-blocking cache structures are designed.

The main advantage of the hierarchical cache structure is that, in a typical system, about 80% of memory accesses are satisfied inside the CPU by the level-1 cache and only 20% go to external memory; of that 20%, about 80% are in turn satisfied by the level-2 cache, so only about 4% (20% × 20%) of memory requests ever reach the DRAM.

The disadvantage of the hierarchical cache structure is that the high-speed cache chips are limited in number, take up circuit-board space, and require supporting logic, all of which adds cost. The result of weighing these factors is the multi-level cache in use today.

There are two L1 cache designs: split (separate instruction and data caches) and unified.

Intel, AMD, and the former DEC all design the L1 cache as separate instruction and data caches. This split cache structure reduces conflicts caused by contention for the cache and improves processor performance, allowing a data access and an instruction fetch to take place in the same clock cycle.

However, simply enlarging the first-level cache on the chip does not increase microprocessor performance proportionally; a second-level cache is also needed.

Structurally, the L1 cache generally uses write-back static RAM (SRAM). L1 cache capacities keep increasing.

The L2 cache is designed in two ways: built into the chip, or external.

For example, the AMD K6-III has a 256 KB L2 cache built in, running synchronously with the CPU. An external L2 cache is usually tightly coupled to the CPU and forms a non-blocking hierarchy with the on-chip first-level cache, using a separate front-side bus (external I/O bus) and back-side bus (L2 cache bus).

Obviously, as semiconductor integration improves, integrating the CPU and the second-level cache on a single chip will make the coupling between them even better.

With the L2 cache built in, an external large-capacity cache of 1 MB to 2 MB can also be used; this is called the L3 cache.



Implementation of Cache Technology in PC

Cache in the PC dates back to the 80386.

Conclusion

At present, one trend in PC systems is that CPU clock speeds keep rising and system architectures grow more advanced, while the structure and access time of main-memory DRAM improve only slowly. Cache technology therefore becomes ever more important, and caches in PC systems keep growing. Many users already treat the cache as an important indicator when evaluating and purchasing a PC. This article has traced the origins and workings of the cache, in the hope of giving readers a more systematic reference.

