A Deep Dive into Cache

Cache is a topic we often care about. The release of the K6-III introduced a brand-new cache structure: a tri-level cache design. So what does the cache do for a microcomputer system, and how does it work?

I. The Necessity of Cache

The so-called cache is a high-speed buffer memory located between the CPU and main memory, i.e. DRAM (dynamic RAM); it is small in capacity but fast in access speed. The main memory used by computers today is DRAM, which is cheap and offers large capacity, but because it stores information in capacitors, its access speed is hard to raise. Each instruction the CPU executes requires one or more accesses to main memory, and DRAM read/write speed is far below CPU speed, so to match speeds, wait states must be inserted into the CPU's instruction cycle; the high-speed CPU sits idle, which greatly reduces system efficiency. SRAM, built with the same manufacturing process as the CPU, is much faster than DRAM, but it is physically larger, consumes more power, and costs more, so building all of memory from SRAM is neither possible nor necessary. To resolve the contradiction between speed and cost, a hierarchical approach is used: a relatively small SRAM is placed between main memory and the CPU as a high-speed buffer memory. The cache holds copies of part of the contents of main memory (called a memory image). When the CPU reads or writes data, it accesses the cache first; because the cache is about as fast as the CPU, the CPU can execute instructions with zero wait states. Only when the data the CPU needs is not in the cache (a "miss", as opposed to a "hit") does the CPU access main memory. Today's large caches can raise the CPU's cache hit rate to 90%-98%, which greatly improves the speed at which the CPU obtains data and, with it, overall system performance.

II. The Feasibility of Cache

Analysis of a large number of typical programs shows that over any short interval, the addresses a program generates tend to concentrate within a very small range of the memory's logical address space. Instructions mostly execute sequentially, so instruction addresses are distributed contiguously; moreover, loops and subroutine segments are executed many times, so accesses to those addresses are naturally clustered in time. The clustering tendency of data accesses is less pronounced than that of instructions, but array processing and the repeated selection of working storage units also keep memory addresses relatively concentrated. This phenomenon of frequent access to addresses within a local range, with little access outside it, is called the locality of reference of programs. Based on this locality principle, placing a cache between main memory and the CPU, and loading into it the instructions or data near the currently executing address for the CPU to use over a period of time, is entirely feasible.
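To make the locality argument concrete, here is a minimal sketch (all parameters and names invented for the example; the cache uses the simple modulo placement rule introduced as "direct mapping" in Section IV) that simulates a tiny cache and compares the hit rate of loop-like sequential accesses with random accesses:

```python
import random

def hit_rate(addresses, num_lines=64, line_size=16):
    """Simulate a tiny direct-mapped cache and return the hit rate."""
    tags = [None] * num_lines          # one stored tag per cache line
    hits = 0
    for addr in addresses:
        block = addr // line_size      # block number in main memory
        index = block % num_lines      # which cache line the block maps to
        tag = block // num_lines       # remaining high-order bits
        if tags[index] == tag:
            hits += 1                  # hit: block already cached
        else:
            tags[index] = tag          # miss: load the block
    return hits / len(addresses)

# A loop scanning the same 4 KB region repeatedly (high locality)...
sequential = [i % 4096 for i in range(100_000)]
# ...versus uniformly random addresses over 1 MB (no locality).
scattered = [random.randrange(1 << 20) for _ in range(100_000)]

print(f"sequential: {hit_rate(sequential):.2%}")  # roughly 90%+ after warm-up
print(f"random:     {hit_rate(scattered):.2%}")   # close to 0%
```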
III. The Basic Working Principle of Cache

In the traditional Socket architecture, a two-level cache structure is usually adopted: a level-1 cache (L1 cache) integrated into the CPU, and a level-2 cache (L2 cache) mounted on the motherboard. In the Slot 1 architecture, the L2 cache sits on the same circuit board as the CPU and runs at half the core speed, which is faster than a Socket-style L2 cache running at the system bus frequency; and the higher its clock speed, the more demanding the cache's manufacturing process becomes. The CPU looks for data in the L1 cache first; if the data is not found there, it searches the L2 cache. If the data is in the L2 cache, the controller forwards it to the CPU and updates the L1 cache at the same time. If the data is in neither the L1 cache nor the L2 cache, the cache controller fetches it from main memory, supplies it to the CPU, and updates the caches. The K6-III is special: it carries 64 KB of L1 cache and 256 KB of full-core-speed L2 cache on the chip, so the cache on the motherboard in effect becomes an L3 cache. According to the relevant tests, a 512 KB-2 MB level-3 cache can improve system performance by roughly 2%-10%; the tri-level design is the most thorough solution yet, since the appearance of the PC, to the bottleneck between the high-speed CPU and low-speed memory. The future direction of cache development remains larger capacity and ever higher speed.

In a cache-main memory hierarchy, all instructions and data reside in main memory, while the cache holds copies of only some program blocks and data blocks; it is organized around blocks. Both the cache and main memory are divided into blocks, each consisting of several bytes. By the locality principle described above, the blocks held in the cache ensure that in most cases the content the CPU needs is already there, so CPU reads and writes take place mainly between the CPU and the cache. When the CPU accesses memory, it issues the address of the target unit; the address bus delivers it to the main-memory address register (MA) in the cache controller, and the main memory-cache address translation mechanism takes the address from MA and determines whether a copy of that unit's content already exists in the cache. If it does, the access hits: the memory address is translated at once into its cache address, and the cache is accessed instead of main memory. If the content the CPU wants is not in the cache (a miss), the CPU accesses main memory directly, and the entire block containing that unit (together with the block's address information) is transferred into the cache, so that subsequent accesses to nearby addresses can be served from the cache. If the cache is already full, the replacement control logic must evict existing blocks according to some replacement algorithm (policy) to make room. To improve system efficiency, therefore, the cache hit rate must be raised; it depends on a series of factors such as the address mapping scheme and the replacement algorithm. At the same time, the content of the cache must be kept consistent with that of main memory: if content in main memory changes after being copied into the cache, its image in the cache must be updated immediately, and conversely, when the CPU modifies content in the cache, the corresponding content in main memory must be updated as well.
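Before moving on, here is a minimal sketch of the L1 -> L2 -> main memory lookup order just described. It is illustrative only: plain dictionaries stand in for the real tag-comparison hardware, and all names are invented.

```python
# Dictionaries mapping a block number to its data stand in for real caches.
l1, l2 = {}, {}
main_memory = {}  # block number -> data

def read(block):
    """Return the data for `block`, filling the caches along the way."""
    if block in l1:                    # L1 hit: fastest path
        return l1[block]
    if block in l2:                    # L2 hit: promote the block into L1
        l1[block] = l2[block]
        return l1[block]
    data = main_memory.get(block, 0)   # miss everywhere: go to DRAM
    l2[block] = data                   # update both cache levels
    l1[block] = data
    return data

main_memory[42] = "data"
read(42)   # miss in L1 and L2: fetched from main memory, caches filled
read(42)   # now an L1 hit
```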
From this brief introduction we can see that the cache is itself a kind of memory, introduced to solve the speed-matching problem between the CPU and main memory, and that it cannot be accessed directly by the user. Next, we briefly discuss the problems of cache address mapping, replacement, and data consistency.

IV. Address Mapping

The so-called mapping problem is the question of how to determine which part of main memory the content in the cache is a copy of; that is, some function must be applied to locate main-memory addresses in the cache. This is called address mapping. After information has been loaded into the cache according to such a mapping, the main-memory address must be converted into the corresponding cache address when the program executes; this conversion is called address translation. The common address mapping schemes are direct mapping, fully associative mapping, and set-associative mapping.

1. Direct mapping. In direct mapping, the data of a given main-memory unit can be placed at only one location in the cache; if data from another main-memory unit maps to that same location, a conflict occurs. The mapping usually partitions the main-memory space into regions the size of the cache, and blocks with the same block number in each region map to the same block location in the cache. Suppose the cache is divided into 2^n blocks and main memory into 2^m blocks of the same size; the correspondence between main-memory blocks and cache blocks is then given by the mapping function j = i mod 2^n, where j is the block number in the cache and i is the block number in main memory (a sketch of this function follows the list). Direct mapping is the simplest address mapping scheme: address translation is fast, and it involves none of the replacement-policy problems of the other two schemes. Its weakness is a high probability of block conflicts. When a program alternates between two conflicting blocks, the cache hit rate drops sharply, because even if other cache blocks are idle, the fixed mapping prevents them from being used.

2. Fully associative mapping. In this scheme, any block of main memory can be mapped to any block location in the cache, so block conflicts occur only when every cache block is occupied; the conflict probability is low, and a high hit rate can be achieved. The implementation, however, is complex. On each access, the block address must be compared with all the address tags in the cache's block table to determine whether it hits, and bringing a block in raises a nontrivial replacement problem: deciding which cache location the incoming block occupies and whether displaced data must be transferred back to main memory. To keep this fast, all the comparison and replacement logic must be implemented in hardware.
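As promised under scheme 1, here is a minimal sketch of the direct-mapping function j = i mod 2^n, with invented parameters, showing both the placement rule and the conflict it causes:

```python
CACHE_BLOCKS = 2 ** 8          # 2^n cache blocks (illustrative value)
BLOCK_SIZE = 32                # bytes per block (illustrative value)

def cache_block_for(address: int) -> int:
    i = address // BLOCK_SIZE  # main-memory block number i
    return i % CACHE_BLOCKS    # cache block number j = i mod 2^n

# Two addresses exactly CACHE_BLOCKS * BLOCK_SIZE bytes apart always
# collide, even if the rest of the cache is empty.
a, b = 0x0000, CACHE_BLOCKS * BLOCK_SIZE
assert cache_block_for(a) == cache_block_for(b)  # same cache block: thrashing
```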
3. Set-associative mapping. This scheme is a compromise between direct mapping and fully associative mapping: the cache space is divided into a number of sets, the mapping from main memory to the sets is direct, and the mapping to the blocks within each set is fully associative. It is the general form of the two schemes above. If each set contains exactly one block, that is, the cache space is divided into 2^n sets, it degenerates into direct mapping; if a single set spans the entire cache, it becomes fully associative mapping. Set-associative mapping is simpler than fully associative mapping in its hit detection and replacement algorithms, its probability of block conflict is lower than that of direct mapping, and its hit rate likewise lies between the two.

V. Read and Write Operations

Two basic operations take place between the cache and memory: reads and writes. When the CPU issues a read command, there are two cases, depending on the main-memory address it generates. In one, the required data is already in the cache, so the CPU simply accesses the cache and reads the information from the corresponding unit onto the data bus. In the other, the required data has not yet been loaded into the cache; while the CPU reads the information from main memory, the cache's replacement logic copies the block containing that address from main memory into the cache. If the corresponding cache locations are already filled with blocks, old blocks must be evicted first. Two replacement policies are common (a sketch comparing them follows this list):

1. First in, first out (FIFO). The FIFO policy always replaces the block that entered the cache earliest. It does not need to track how each block is used over time, so it is relatively easy to implement. Its disadvantage is that a frequently used block, such as one containing a loop, may be evicted simply because it is the oldest.

2. Least recently used (LRU). The LRU policy replaces the block in the cache that has gone unused for the longest time, which requires recording the usage of each cache block continuously. The average hit rate of LRU is higher than that of FIFO, and as the set size grows, the LRU hit rate rises further.
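The following minimal sketch (invented access pattern and set size) simulates one fully associative set of four blocks under each policy. Block 0 plays the role of a loop block that is reused constantly; under FIFO it is periodically evicted anyway, while under LRU it survives, so LRU scores more hits.

```python
from collections import OrderedDict, deque

def simulate(policy, accesses, capacity=4):
    """Count hits for a tiny 4-block set under FIFO or LRU eviction."""
    if policy == "FIFO":
        cache, hits = deque(), 0
        for block in accesses:
            if block in cache:
                hits += 1                  # hit: FIFO order is unchanged
            else:
                if len(cache) == capacity:
                    cache.popleft()        # evict the earliest arrival
                cache.append(block)
        return hits
    cache, hits = OrderedDict(), 0         # LRU: ordered by recency of use
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            if len(cache) == capacity:
                cache.popitem(last=False)  # evict the least recently used
            cache[block] = True
    return hits

# Block 0 (a "loop" block) is touched on every other access;
# blocks 1..5 stream through and force evictions.
pattern = [0, 1, 0, 2, 0, 3, 0, 4, 0, 5] * 20
print("FIFO hits:", simulate("FIFO", pattern))
print("LRU  hits:", simulate("LRU", pattern))   # higher than FIFO
```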
When the CPU issues a write command, there are likewise two cases depending on the main-memory address. If the write misses the cache, the information is simply written into main memory; there is no need to transfer the contents of the addressed unit into the cache at the same time. If the write hits, there are usually three ways to keep the cache and main memory consistent (a sketch contrasting the first and third follows the list):

1. Write-through. When the CPU writes data into the cache, it writes the same data into main memory at the same time, which guarantees that the corresponding units of the cache and main memory stay consistent. The method is simple and reliable, but because every update must also be written to main memory, speed inevitably suffers.
2. Posted write (write buffering). When the CPU updates the cache, it does not update main memory directly; instead, the updated data is sent to a buffer for temporary storage and written into main memory at a suitable later moment. The CPU thus avoids stalling on the main-memory write, which raises speed to a certain extent. But because the buffer has only a limited capacity and can absorb only a limited number of writes, the CPU still has to wait during sustained runs of writes.

3. Write-back. The CPU writes data only into the cache and marks the block as modified; the block is written into main memory just once, when it is about to be replaced by an incoming block. This takes into account that written values are often intermediate results, for which writing to slow main memory every time is unnecessary. The method is fast and avoids redundant writes to main memory, but its control structure is more complex.
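As promised above, here is a minimal sketch (illustrative names and structures, not real controller logic) contrasting write-through, which pays a main-memory write on every update, with write-back, which sets a dirty flag and flushes a block only on eviction:

```python
cache = {}        # block -> (data, dirty flag)
memory = {}       # block -> data
mem_writes = 0    # count of (slow) main-memory writes

def write_through(block, data):
    """Update cache and main memory together: always consistent, always slow."""
    global mem_writes
    cache[block] = (data, False)
    memory[block] = data
    mem_writes += 1                 # every CPU write reaches DRAM

def write_back(block, data):
    """Update only the cache and mark the block dirty."""
    cache[block] = (data, True)     # main memory is now stale

def evict(block):
    """On replacement, a dirty block must be flushed to main memory once."""
    global mem_writes
    data, dirty = cache.pop(block)
    if dirty:
        memory[block] = data
        mem_writes += 1             # one write, however many updates occurred

# Ten updates to the same block: write-through would cost ten memory
# writes, write-back costs a single write at eviction time.
for value in range(10):
    write_back(7, value)
evict(7)
print(mem_writes)   # -> 1
```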

In addition, there is the non-cacheable block method: a region is set aside in main memory whose data is not managed by the cache controller and can never be brought into the cache, so the CPU reads and writes this region directly. Because the region has no relationship with the cache at all, no data-inconsistency problem can arise for it. Most BIOS setup programs in current microcomputer systems let the user set the starting address and size of the non-cacheable area, as the short sketch below illustrates.
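A minimal sketch of the idea, with an assumed region base and size (the values a BIOS setup program would let the user configure): accesses falling inside the window bypass the cache entirely.

```python
NC_BASE, NC_SIZE = 0x000A0000, 0x00020000   # assumed non-cacheable window

cache, memory = {}, {}

def is_cacheable(address: int) -> bool:
    """Addresses inside the configured window bypass the cache."""
    return not (NC_BASE <= address < NC_BASE + NC_SIZE)

def read(address: int):
    if is_cacheable(address):
        if address not in cache:            # normal path: fill on miss
            cache[address] = memory.get(address, 0)
        return cache[address]
    return memory.get(address, 0)           # bypass: straight to main memory

for addr in (0x00001000, 0x000A0000):
    print(hex(addr), "cacheable" if is_cacheable(addr) else "bypasses cache")
```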