Chapter 7 Caching
7.1 How programs run
To understand caching, you have to understand how computers run programs. For a deep understanding of this topic, you should study computer architecture. My goal in this chapter is to provide a simple model of program execution.
When a program starts, the code (or program text) is usually located on a hard disk. The operating system creates a new process to run the program, then the loader copies the text from storage into main memory and starts the program by calling main.
While the program is running, most of its data is stored in main memory, but some of the data is in registers, which are small units of storage on the CPU. These registers include:
The program counter (PC), which contains the address (in memory) of the next instruction in the program.
The instruction register (IR), which contains the machine code of the instruction currently executing.
The stack pointer (SP), which contains the address of the stack frame of the current function, where its arguments and local variables are stored.
General-purpose registers that hold the data the program is currently working with.
A status register, or flag register, that contains information about the current computation. For example, the flag register usually contains a bit that is set if the result of the previous operation was zero.
While a program is running, the CPU executes the following steps, called the "instruction cycle":
Fetch: The next instruction is fetched from memory and stored in the instruction register.
Decode: Part of the CPU, called the "control unit", decodes the instruction and sends signals to the other parts of the CPU.
Execute: Signals from the control unit cause the appropriate computation to occur.
Most computers can execute a few hundred different instructions, called the "instruction set". But most instructions fall into a few general categories:
Load: Transfers a value from memory to a register.
Arithmetic/logic: Loads operands from registers, performs a mathematical operation, and stores the result in a register.
Store: Transfers a value from a register to memory.
Jump/branch: Modifies the program counter, causing the flow of execution to jump to another location in the program. Branches are usually conditional, which means they check a flag in the flag register and jump only if it is set.
Some instruction sets, including the ubiquitous x86, provide instructions that combine a load and an arithmetic operation.
During each instruction cycle, one instruction is read from the program text. In addition, about half of the instructions in a typical program load or store data. And therein lies one of the fundamental problems of computer architecture: the "memory bottleneck".
On a current desktop computer, a typical CPU runs at 2 GHz, which means that it initiates a new instruction every 0.5 ns. But the time it takes to transfer data to or from main memory is about 10 ns. If the CPU has to wait 10 ns to fetch the next instruction and another 10 ns to load the data it needs, that is 20 ns of waiting, or 40 clock cycles at 0.5 ns per cycle, to complete a single instruction.
7.2 Cache Performance
The solution to this problem, or at least a partial solution, is caching. A "cache" is a small, fast memory located on the CPU. On current computers, a cache is typically 1–2 MiB, with an access time of 1–2 ns.
When the CPU reads data from main memory, it stores a copy in the cache. If the same data is read again, the CPU reads it from the cache and doesn't have to wait for main memory.
Eventually the cache gets full, and then, to bring in new data, we have to evict some old data. So if the CPU loads some data and then doesn't read it again for a while, that data might no longer be in the cache.
The performance of many programs is limited by the effectiveness of the cache. If the data the CPU needs is usually in the cache, the program can run at the full speed of the CPU. If the CPU frequently needs data that is not in the cache, the program is limited by the speed of memory.
The "hit ratio" H of the cache, which is the ratio of data found in the cache when memory is accessed. "Missing rate" m, which is the ratio of memory access required to access memory. If th is the time to handle cache hits, the TM is the time of the cache misses, the average time of each memory access is:
h * Th + m * Tm
Equivalently, we can define the "miss penalty" as the extra time required to process a cache miss, Tp = Tm - Th. Then the average access time is:
Th + m * Tp
When the miss rate is low, the average access time approaches Th; that is, the program can perform as if memory ran at the speed of the cache.
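As a quick worked example with illustrative numbers (assumptions for the sake of arithmetic, not measurements from this chapter): suppose Th = 1 ns, Tm = 10 ns, and the hit rate is 95%, so m = 0.05. Then the average access time is

0.95 * 1 ns + 0.05 * 10 ns = 1.45 ns

or, using the miss penalty Tp = 9 ns, 1 ns + 0.05 * 9 ns = 1.45 ns. Even a modest miss rate keeps the average much closer to cache speed than to memory speed.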
7.3 Locality
When a program reads a byte for the first time, the cache usually loads a "block" or "line" of data that includes the requested byte and some of its neighbors. If the program goes on to read one of those neighbors, it is already in the cache.
For example, suppose the block size is 64 B, you read a string with length 64, and the first byte of the string happens to fall at the beginning of a block. When you load the first byte, you incur a miss penalty, but after that the rest of the string is in the cache, so the hit rate after reading the whole string is 63/64. If the string spans two blocks, you incur two miss penalties, but even then the hit rate is 62/64, about 97%.
On the other hand, if the program jumps around unpredictably, reading data from scattered locations in memory and seldom accessing the same location twice, cache performance is poor.
The tendency of a program to use the same data more than once is called "temporal locality". The tendency to use data in nearby locations is called "spatial locality". Fortunately, many programs naturally exhibit both kinds of locality:
Most programs contain blocks of code with no jumps or branches. Within those blocks, instructions run sequentially, so the access pattern has spatial locality.
In a loop, the program executes the same instructions many times, so the access pattern has temporal locality.
The result of one instruction is often used as an operand of the next instruction, so the data access pattern has temporal locality.
When a program executes a function, its arguments and local variables are stored together on the stack; accessing these values has spatial locality.
One of the most common processing patterns is to read or write the elements of an array sequentially; this pattern also has spatial locality (see the short sketch after this list).
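As a small illustration of both kinds of locality (a sketch, not code from the original program), the loops below access an array sequentially, run the same few instructions on every iteration, and reuse the variable total over and over:

#include <stdio.h>

int main(void)
{
    int array[1024];
    long total = 0;

    // Spatial locality: elements are written sequentially, so most
    // accesses fall in a cache block loaded by an earlier access.
    for (int i = 0; i < 1024; i++) {
        array[i] = i;
    }

    // Temporal locality: the same loop instructions execute on every
    // iteration, and total is reused every time through the loop.
    for (int i = 0; i < 1024; i++) {
        total += array[i];
    }

    printf("%ld\n", total);
    return 0;
}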
In the next section we will explore the relationship between the program's access pattern and cache performance.
7.4 Measuring cache performance
When I was a graduate student at UC Berkeley, I was a teaching assistant for Brian Harvey's computer architecture class. One of my favorite exercises involved a program that iterates through an array, reading and writing elements, and measures the average time per access. By varying the size of the array, it is possible to infer the size of the cache, the block size, and some other attributes.
My modified version of this program is in the cache directory of the repository for this book.
The core part of the program is a loop:
iters = 0;
do {
    sec0 = get_seconds();

    for (index = 0; index < limit; index += stride)
        array[index] = array[index] + 1;

    iters = iters + 1;
    sec = sec + (get_seconds() - sec0);

} while (sec < 0.1);
The inner for loop traverses the array. limit determines how much of the array it traverses; stride determines how many elements it skips over. For example, if limit is 16 and stride is 4, the loop accesses elements 0, 4, 8, and 12.
sec keeps track of the total CPU time used by the inner loop. The outer loop runs until sec exceeds 0.1 seconds, which is long enough that we can compute the average time with sufficient precision.
get_seconds uses the system call clock_gettime, converts the result to seconds, and returns the result as a double:
double get_seconds() {
    struct timespec ts;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}
Figure 7.1: Average miss penalty as a function of array size and stride.
To isolate the time spent accessing the data, the program runs a second loop that is almost identical except that the inner loop does not touch the array; it always increments the same variable:
iters2 = 0;
do {
    sec0 = get_seconds();

    for (index = 0; index < limit; index += stride)
        temp = temp + index;

    iters2 = iters2 + 1;
    sec = sec - (get_seconds() - sec0);

} while (iters2 < iters);
The second loop runs the same number of iterations as the first. After each iteration, it subtracts the time it took from sec. When the loop is done, sec contains the total time for all of the array accesses, minus the time it took to increment temp. That difference is the total of the miss penalties incurred by all the accesses. Since each pass of the inner loop performs limit / stride accesses, the total number of accesses is iters * limit / stride. Dividing by that number gives the average miss penalty per access, in ns:
sec * 1e9 / iters / limit * stride
If you compile and run cache.c, you should see output like this:
size: 4096  stride: 8   read+write: 0.8633 ns
size: 4096  stride: 16  read+write: 0.7023 ns
size: 4096  stride: 32  read+write: 0.7105 ns
size: 4096  stride: 64  read+write: 0.7058 ns
If you have Python and matplotlib installed, you can use graph_data.py to plot the results. Figure 7.1 shows the results when I ran the program on a Dell Optiplex 7010. Notice that the array size and stride are reported in bytes, not in number of array elements.
Take a minute to consider this graph and see what you can infer about the cache. Here are some things to think about:
The program reads through the array many times, so it has plenty of temporal locality. If the entire array fits in the cache, the average miss penalty should be near zero.
When the stride is 4 bytes, we read every element of the array, so the program has plenty of spatial locality. If the block size is big enough to contain 64 elements, for example, the hit rate would be 63/64, even if the array does not fit entirely in the cache.
If the stride is equal to the block size (or greater), the spatial locality is effectively zero, because each time we read a block we access only one element. In that case we expect to see the maximum miss penalty.
In summary, we expect good cache performance if the array is smaller than the cache size or if the stride is smaller than the block size. Performance only degrades if the array is bigger than the cache and the stride is large.
In Figure 7.1, cache performance is good, for all strides, as long as the array is less than 2**22 bytes. We can infer that the cache size is near 4 MiB; in fact, according to the specs, it is 3 MiB.
When the stride is 8, 16, or 32 B, cache performance is good; at 64 B it starts to degrade, and for larger strides the average miss penalty is about 9 ns. We can infer that the block size is 128 B.
Many processors use "multi-level caches" that include a small, fast cache and a bigger, slower cache. In this example, the miss penalty seems to increase a little when the array size is bigger than 2**14 B, so it is possible that this processor also has a 16 KB cache with an access time of less than 1 ns.
7.5 Cache-friendly programming
Memory caching is implemented in hardware, so most of the time programmers don't need to know much about it. But if you know how caches work, you can write programs that use them more effectively.
For example, if you are working with a large array, it might be faster to traverse the array once, performing several operations on each element, than to traverse it several times, as the sketch below illustrates.
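Here is a hedged sketch of that idea (the function names and the particular operations are invented for illustration): both versions do the same work, but the first pulls the whole array through the cache twice, while the second touches each element only once.

#include <stddef.h>

// Two separate passes: the array passes through the cache twice.
void scale_then_offset(double *a, size_t n, double scale, double offset)
{
    for (size_t i = 0; i < n; i++)
        a[i] *= scale;
    for (size_t i = 0; i < n; i++)
        a[i] += offset;
}

// One combined pass: each element is loaded once, and both operations
// are applied while it is still in the cache.
void scale_and_offset(double *a, size_t n, double scale, double offset)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * scale + offset;
}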
If you are working with a 2-D array, it might be stored as an array of rows. If you traverse the elements, it is faster to go row by row, with a stride equal to the element size, than column by column, with a stride equal to the length of a row.
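For example (a sketch, with the array name and sizes made up for illustration): in C, a 2-D array declared as double grid[ROWS][COLS] is laid out row by row, so the two traversals below touch the same data with very different locality.

#define ROWS 1000
#define COLS 1000

double grid[ROWS][COLS];

// Row-wise traversal: consecutive accesses are adjacent in memory,
// so each cache block loaded serves several elements.
double sum_rowwise(void)
{
    double total = 0.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            total += grid[i][j];
    return total;
}

// Column-wise traversal: consecutive accesses are a whole row apart
// (COLS * sizeof(double) bytes), so spatial locality is poor.
double sum_columnwise(void)
{
    double total = 0.0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            total += grid[i][j];
    return total;
}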
Linked data structures don't always exhibit spatial locality, because the nodes aren't necessarily contiguous in memory. But if you allocate many nodes at the same time, they are usually co-located in the heap. Or, even better, if you allocate an array of nodes all at once, you know they will be contiguous.
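As a sketch of the array-of-nodes idea (the struct and function names are invented for illustration), the following allocates n linked nodes as one contiguous block, so traversing the list has spatial locality:

#include <stdlib.h>

typedef struct node {
    int value;
    struct node *next;
} Node;

// Allocate n nodes as a single contiguous block and link them in order.
// Unlike nodes allocated one at a time, these are guaranteed to be
// adjacent in memory.
Node *make_node_pool(int n)
{
    Node *pool = malloc(n * sizeof(Node));
    if (pool == NULL)
        return NULL;

    for (int i = 0; i < n; i++) {
        pool[i].value = 0;
        pool[i].next = (i < n - 1) ? &pool[i + 1] : NULL;
    }
    return pool;    // the whole list is later freed with one free(pool)
}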
Recursive strategies like mergesort often have good cache behavior because they break big arrays into smaller pieces and then work with the pieces. Sometimes these algorithms can be tuned to take advantage of cache behavior.
For applications where performance is critical, it is possible to design algorithms tailored to the cache size, block size, and other hardware characteristics. Algorithms like that are called "cache-aware". The obvious drawback of cache-aware algorithms is that they are hardware-specific.
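One common cache-aware technique, not discussed above but a standard example, is "blocking" (or "tiling"): processing the data in tiles small enough to fit in the cache. The sketch below applies it to a matrix transpose; the matrix size and tile size are placeholders that would be tuned to the actual hardware.

#define N 2048
#define TILE 64             // placeholder; tune to the cache and block size

double src[N][N], dst[N][N];

// Naive transpose: one of the two arrays is accessed column-wise,
// so it has poor spatial locality when N is large.
void transpose_naive(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            dst[j][i] = src[i][j];
}

// Blocked transpose: work on TILE x TILE tiles that fit in the cache,
// so both arrays get reasonable locality within each tile.
// (Assumes N is a multiple of TILE.)
void transpose_blocked(void)
{
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++)
                    dst[j][i] = src[i][j];
}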
7.6 Memory Hierarchy
At several points in this chapter, you might have wondered: "If caches are so much faster than main memory, why not make a really big cache and forget about main memory?"
Without going too far into computer architecture, there are two reasons: electronics and economics. Caches are fast because they are small and close to the CPU, which minimizes delays due to capacitance and signal propagation. If you make a cache big, it becomes slower.
Also, caches take up space on the processor chip, and a bigger chip is more expensive. Main memory is usually dynamic random-access memory (DRAM), which uses only one transistor and one capacitor per bit, so it can pack more memory into the same amount of space. But this way of implementing memory is slower than the way caches are implemented.
Also, main memory is usually packaged in a dual in-line memory module (DIMM) that includes 16 or more chips. Several small chips are cheaper than one big chip.
The trade-off between speed, size, and cost is the fundamental reason for caches. If there were a memory technology that was fast, big, and cheap, we wouldn't need anything else.
The same principle applies to storage as well as to memory. Solid-state (flash) drives are fast, but more expensive than hard drives, so they tend to be smaller. Tape drives are slower than hard drives, but they can store large amounts of data relatively cheaply.
The table below shows typical access times, sizes, and costs for each of these technologies.
Device      Access time    Typical size    Cost
Register    0.5 ns         256 B           ?
Cache       1 ns           2 MiB           ?
DRAM        10 ns          4 GiB           $10 / GiB
SSD         10 μs          100 GiB         $1 / GiB
HDD         5 ms           500 GiB         $0.25 / GiB
Tape        minutes        1–2 TiB         $0.02 / GiB
The number and size of registers depends on details of the architecture. Current computers have about 32 general-purpose registers, each of which can store one "word". On a 32-bit computer, a word is 32 bits, or 4 bytes; on a 64-bit computer, a word is 64 bits, or 8 bytes. So the total size of the register file is 100–300 bytes (32 * 4 B = 128 B or 32 * 8 B = 256 B).
It is hard to quantify the cost of registers and caches. They contribute to the cost of the chips they are on, but consumers don't see that cost directly.
For the other numbers in the table, I looked at the specifications for typical hardware for sale from online computer hardware stores. By the time you read this, these numbers will be obsolete, but they give you an idea of what the performance and cost gaps looked like at one point in time.
These technologies make up the "memory hierarchy". Each level in the hierarchy is bigger and slower than the one above it, and in a sense, each level acts as a cache for the one below it. You can think of main memory as a cache for programs and data that are stored permanently on SSDs and HDDs. And if you are working with very large datasets stored on tape, you could use hard drives to cache one subset of the data at a time.
7.7 Caching policies
The memory hierarchy suggests a framework for thinking about caching. At every level of the hierarchy, we have to address four fundamental questions of caching:
Who moves data up and down the hierarchy? At the top of the hierarchy, register allocation is usually done by the compiler. Hardware on the CPU manages the memory cache. Users implicitly move data from storage into memory when they run programs and open files, but the operating system also moves data from memory back to storage. At the bottom of the hierarchy, administrators move data explicitly between disk and tape.
What gets moved? In general, block sizes are small at the top of the hierarchy and bigger at the bottom. In a memory cache, a typical block size is 128 B. Pages in memory might be 4 KiB, but when the operating system reads a file from disk, it might read 10 or 100 blocks at a time.
When does data get moved? In the most basic cache, data gets moved into the cache when it is used for the first time. But many caches use some kind of "prefetching", meaning that data is loaded before it is explicitly requested. We have already seen one form of prefetching: loading an entire block when only part of it is requested.
Where in the cache does the data go? When the cache is full, we can't bring anything in without kicking something out. Ideally, we want to keep data that will be used again soon and replace data that won't.
The answers to these questions make up the "cache policy". Near the top of the hierarchy, cache policies tend to be simple because they have to be very fast and they are implemented in hardware. Near the bottom of the hierarchy, there is more time to make decisions, and well-designed policies can make a big difference.
Most cache policies are based on the principle that history repeats itself: if we have information about the recent past, we can use it to predict the immediate future. For example, if a block of data has been used recently, we expect it to be used again soon. This principle suggests a replacement policy called "least recently used", or LRU, which removes from the cache the block of data that has not been used for the longest time. For more on this topic, see the Wikipedia article on cache algorithms.
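As a minimal sketch of the LRU idea (a toy software cache with a handful of slots, not how a hardware cache is actually implemented; all names and sizes are invented for illustration):

#define NSLOTS 4

typedef struct {
    int key;            // which block of data this slot holds
    int valid;          // 1 if the slot is in use
    long last_used;     // logical time of the most recent access
} Slot;

static Slot slots[NSLOTS];
static long ticks = 0;

// Return the index of the slot holding key, loading it if necessary.
// On a miss with a full cache, evict the least recently used slot.
int lru_access(int key)
{
    ticks++;

    // First, look for a hit.
    for (int i = 0; i < NSLOTS; i++) {
        if (slots[i].valid && slots[i].key == key) {
            slots[i].last_used = ticks;     // hit: refresh its age
            return i;
        }
    }

    // Miss: choose a victim, preferring an empty slot if there is one,
    // otherwise the slot that was used least recently.
    int victim = -1;
    for (int i = 0; i < NSLOTS; i++) {
        if (!slots[i].valid) {
            victim = i;
            break;
        }
        if (victim == -1 || slots[i].last_used < slots[victim].last_used)
            victim = i;
    }

    // Fill the victim slot with the requested block.
    slots[victim].key = key;
    slots[victim].valid = 1;
    slots[victim].last_used = ticks;
    return victim;
}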
7.8 Paging
In systems with virtual memory, the operating system can move pages back and forth between memory and storage. As I mentioned in Section 6.2, this mechanism is called "paging", or sometimes "swapping".
Here is how the workflow goes:
Process A calls malloc to allocate a page. If there is no free space in the heap with the requested size, malloc calls sbrk to ask the operating system for more memory.
If there is a free page in physical memory, the operating system adds it to the page table for Process A, creating a new range of valid virtual addresses.
If there are no free pages, the paging system chooses a "victim page" belonging to Process B. It copies the contents of the page from memory to disk, then it modifies the page table for Process B to indicate that this page is "swapped out".
Once the data from Process B is written, the page can be reallocated to Process A. To prevent Process A from reading Process B's data, the page should be cleared.
At this point the call to sbrk can return, giving malloc additional space in the heap. Then malloc allocates the requested chunk and returns, and Process A can resume.
When Process A completes, or is interrupted, the scheduler might allow Process B to resume. When Process B accesses a page that has been swapped out, the memory management unit notices that the page is "invalid" and causes an interrupt.
When the operating system handles the interrupt, it sees that the page is swapped out, so it transfers the page back from disk to memory.
Once the page is swapped in, Process B can resume.
When paging works well, it can greatly improve the utilization of physical memory, allowing more processes to run in less space. Here's why:
Most processes don't use all of their allocated memory. Many parts of the text segment are never executed, or execute once and never again. Those pages can be swapped out without causing any problems.
If a program leaks memory, it might leave allocated space behind and never use it again. By swapping those pages out, the operating system can effectively plug the leak.
On most systems, some processes, like daemons, sit idle most of the time and only "wake up" to respond to events on occasion. While they are idle, these processes can be swapped out.
Also, there may be many processes running the same program. These processes can share the same text segment, avoiding the need to keep multiple copies in physical memory.
If you add up the total memory allocated to all processes, it can greatly exceed the size of physical memory, and yet the system can still work well.
Up to a point.
When a process accesses a page that is swapped out, it has to get the data back from disk, which can take several milliseconds. This delay is often noticeable. If you leave a window idle for a while and then switch back to it, it might run slowly, and you might hear the disk working while its pages are swapped in.
Occasional delays like that might be acceptable, but if you have too many processes using too much space, they start to interfere with each other. When Process A runs, it evicts the pages Process B needs; then when Process B runs, it evicts the pages Process A needs. When this happens, both processes slow to a crawl and the system can become unresponsive. This scenario, which we want to avoid, is called "thrashing".
In theory, operating systems could avoid thrashing by detecting an increase in paging and blocking or killing processes until the system becomes responsive again. But as far as I can tell, most systems don't do this, or don't do it well; they usually leave it to users to limit their use of physical memory or to try to recover when thrashing occurs.
Allen B. Downey
Original: Chapter 7 Caching
Translator: Dragon
License: CC BY-NC-SA 4.0