Chapter 6: The Memory Hierarchy
A memory system is a hierarchical structure of storage devices with different capacities, costs, and access times.
The CPU registers hold the most commonly used data.
Small, fast cache memories near the CPU act as staging areas for a subset of the data and instructions stored in the relatively slow main memory.
Main memory in turn stages data stored on large, slow disks, and those disks often serve as staging areas for data stored on the disks or tapes of other machines connected by a network.
6.1 Storage Technology
6.1.1 Random access memory
Random access memory (RAM) comes in two varieties: static and dynamic.
Static RAM (SRAM) is faster, but much more expensive than dynamic RAM (DRAM).
SRAM is used for cache memories, which can be either on or off the CPU chip.
DRAM is used for main memory and for the frame buffer of a graphics system.
1. Static RAM
(1) SRAM stores each bit in a bistable memory cell. Each cell is implemented with a six-transistor circuit. This circuit can stay indefinitely in either of two different voltage configurations, or states; any other state is unstable.
(2) Because the cell is bistable, an SRAM cell retains its value as long as it is powered. Even when a disturbance such as electrical noise perturbs the voltages, the circuit returns to the stable value once the disturbance is removed.
2. Dynamic RAM
(1) DRAM stores each bit as charge on a capacitor.
(2) Various sources of leakage current cause a DRAM cell to lose its charge within about 10 to 100 milliseconds, so the memory system must periodically refresh every bit by reading it out and rewriting it.
3. Conventional DRAMs
The cells (bits) in a DRAM chip are partitioned into d supercells, each consisting of w DRAM cells, so a d x w DRAM stores a total of dw bits. The supercells are organized as a rectangular array with r rows and c columns, where rc = d. Each supercell has an address of the form (i, j), where i is the row and j is the column (both zero-based).
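The (i, j) addressing can be illustrated with a small C sketch. It assumes supercells are numbered in row-major order, which the notes do not state; the type and function names are hypothetical and chosen only for this example.

```c
#include <assert.h>

/* Hypothetical helper type for illustration only. */
typedef struct {
    unsigned row; /* i, zero-based */
    unsigned col; /* j, zero-based */
} supercell_addr;

/* Map supercell number n (0 <= n < r*c) to its (i, j) address,
 * assuming row-major numbering of the r x c supercell array. */
supercell_addr supercell_address(unsigned n, unsigned r, unsigned c)
{
    supercell_addr a;
    assert(n < r * c);
    a.row = n / c;
    a.col = n % c;
    return a;
}

/* Total bits stored by a d x w DRAM: d supercells of w bits each. */
unsigned long dram_bits(unsigned long d, unsigned long w)
{
    return d * w;
}
```

For a 16 x 8 DRAM arranged as 4 rows by 4 columns, supercell 9 sits at (2, 1) and the chip holds 128 bits.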
4. Memory Module
(1) Dual inline memory module (DIMM): 168 pins; transfers data to and from the memory controller in 64-bit chunks.
(2) Single inline memory module (SIMM): 72 pins; transfers data to and from the memory controller in 32-bit chunks.
5. Enhanced DRAM
(1) Synchronous DRAM (SDRAM): conventional DRAMs communicate with the memory controller through a set of explicit control signals. SDRAM replaces many of these control signals with the rising edges of the same external clock signal that drives the memory controller.
(2) Double data-rate synchronous DRAM (DDR SDRAM): an enhancement of SDRAM that doubles the speed of the DRAM by using both edges of the clock as control signals.
6. Non-volatile memory
(1) DRAM and SRAM are volatile: they lose their information if power is lost.
(2) Programmable ROM (PROM): can be programmed exactly once. Each PROM memory cell has a fuse that can be blown only once by a high current.
(3) Erasable programmable ROM (EPROM): the cells are cleared to 0 by shining ultraviolet light through a transparent window on the chip. An EPROM can be erased and reprogrammed on the order of 1,000 times.
(4) Electrically erasable PROM (EEPROM): does not require a physically separate programming device, so it can be reprogrammed in place on the printed circuit board. An EEPROM can be reprogrammed on the order of 10^5 times.
(5) Flash memory: based on EEPROM; provides fast and durable nonvolatile storage for a wide range of electronic devices.
7. Accessing main memory
(1) A read transaction transfers data from main memory to the CPU; a write transaction transfers data from the CPU to main memory.
(2) A bus is a collection of parallel wires that carry address, data, and control signals.
6.1.2 Disk storage
- Disk Geometry
Each surface consists of a set of concentric rings called tracks. Each track is divided into a set of sectors, and each sector holds an equal number of data bits (typically 512 bytes) encoded in the magnetic material of the sector. Sectors are separated by gaps, which hold no data bits. Instead, the gaps store formatting bits that identify the sectors.
- Disk capacity
Disk capacity is determined by the following technical factors:
Recording density (bits/inch): the number of bits that can be squeezed into a one-inch segment of a track.
Track density (tracks/inch): the number of tracks that can be squeezed into a one-inch segment of the radius extending from the center of the platter.
Areal density (bits/square inch): the product of the recording density and the track density.
Side note: for units related to DRAM and SRAM capacity, usually K = 2^10, M = 2^20, and G = 2^30. For units related to the capacity of I/O devices such as disks and networks, usually K = 10^3, M = 10^6, and G = 10^9.
- Disk Operation
(1) At any point in time, all read/write heads are positioned on the same cylinder.
(2) The read/write head at the end of the actuator arm flies on a thin cushion of air about 0.1 microns above the disk surface, at a speed of about 80 km/h. The disk reads and writes data in sector-sized blocks.
(3) The access time of the sector is composed of three main parts:
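1. Seek time: to read the contents of a target sector, the drive arm first positions the read/write head over the track that contains the target sector. The time this takes is the seek time, which is typically comparable to the maximum rotation time.
2. Rotational latency: once the head is over the desired track, the drive waits for the first bit of the target sector to rotate under the read/write head. In the worst case the head just misses the target sector and has to wait for a full rotation; on average it waits half a rotation.
T_max rotation = (1 / RPM) x (60 s / 1 min)
T_avg rotation = (1/2) x T_max rotation
3. Transfer time: the time to read or write the bits of one sector, which depends on the rotational speed and the number of sectors per track.
T_avg transfer = (1 / RPM) x (1 / (average number of sectors per track)) x (60 s / 1 min)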
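The three components can be added up to estimate the average time to access one sector. The C sketch below assumes the rotational speed is given in RPM and reports times in milliseconds; the function names and the sample parameters used afterwards (7,200 RPM, 9 ms average seek, 400 sectors per track) are illustrative assumptions, not figures for any particular drive.

```c
/* All results are in milliseconds. */

/* Time for one full rotation at the given rotational speed. */
double t_max_rotation_ms(double rpm)
{
    return (1.0 / rpm) * 60.0 * 1000.0;
}

/* Average rotational latency: half of a full rotation. */
double t_avg_rotation_ms(double rpm)
{
    return 0.5 * t_max_rotation_ms(rpm);
}

/* Average time for one sector to pass under the head. */
double t_avg_transfer_ms(double rpm, double avg_sectors_per_track)
{
    return t_max_rotation_ms(rpm) / avg_sectors_per_track;
}

/* Estimated access time = seek + rotational latency + transfer. */
double t_access_ms(double t_avg_seek_ms, double rpm,
                   double avg_sectors_per_track)
{
    return t_avg_seek_ms
         + t_avg_rotation_ms(rpm)
         + t_avg_transfer_ms(rpm, avg_sectors_per_track);
}
```

With the sample parameters, t_access_ms(9.0, 7200.0, 400.0) comes to a little over 13 ms, almost all of it seek time and rotational latency; the transfer time is only about 0.02 ms.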
- Logical Disk Block
Modern disks have complex geometries, with multiple surfaces and different recording zones on those surfaces. To hide this complexity from the operating system, modern disks present a simpler view of their geometry as a sequence of B sector-sized logical blocks, numbered 0, 1, 2, ..., B-1. A small hardware/firmware device in the disk, called the disk controller, maintains the mapping between logical block numbers and actual (physical) disk sectors.
- Connecting to I/O devices
- Accessing disks
- Anatomy of a commercial disk
6.1.3 Solid-state disks
A solid-state disk (SSD) is a flash-based storage technology that, in some situations, is a very attractive alternative to the conventional rotating disk.
An SSD package consists of one or more flash memory chips and a flash translation layer. The flash chips replace the mechanical drive of a rotating disk, and the flash translation layer translates requests for logical blocks into accesses to the underlying physical device.
6.1.4 Storage Technology Trends
- From our discussion of storage technology, we can summarize a few important ideas.
- Different storage technologies have different price and performance trade-offs. SRAM is somewhat faster than DRAM, and DRAM is much faster than disk. On the other hand, fast storage is always more expensive than slow storage. SRAM costs more per byte than DRAM, and DRAM costs much more than disk. SSDs fall between DRAM and rotating disks.
- The price and performance properties of different storage technologies are changing at very different rates.
6.2 Locality
Locality comes in two forms: temporal locality and spatial locality. In a program with good temporal locality, a memory location that is referenced once is likely to be referenced again several times in the near future. In a program with good spatial locality, if a memory location is referenced once, the program is likely to reference a nearby memory location in the near future.
Program instructions are stored in memory, and the CPU must fetch (read) them.
An important property of code that distinguishes it from program data is that it is rarely modified at run time.
Some simple rules for qualitatively evaluating the locality of a program:
1. Programs that repeatedly reference the same variables have good temporal locality.
2. For programs with stride-k reference patterns, the smaller the stride, the better the spatial locality. Programs that hop around memory with large strides have poor spatial locality.
3. Loops have good temporal and spatial locality with respect to instruction fetches. The smaller the loop body and the greater the number of loop iterations, the better the locality.
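To make the stride rule concrete, the C sketch below sums the same vector with a stride-1 pattern and with a stride-k pattern. The function names are hypothetical. Both return the same sum; the difference lies only in the order in which memory is touched, and therefore in spatial locality.

```c
#include <stddef.h>

/* Stride-1 reference pattern: visits v[0], v[1], v[2], ...
 * sum is referenced on every iteration (good temporal locality). */
long sum_stride1(const int *v, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

/* Stride-k reference pattern over the same elements (k >= 1):
 * pass p visits v[p], v[p+k], v[p+2k], ...
 * The larger k is, the further apart consecutive references are. */
long sum_stride_k(const int *v, size_t n, size_t k)
{
    long sum = 0;
    for (size_t p = 0; p < k; p++)
        for (size_t i = p; i < n; i += k)
            sum += v[i];
    return sum;
}
```

The loop bodies are small and run many iterations, so both functions also have good locality with respect to instruction fetches.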
6.3 Memory Hierarchy
In a memory hierarchy, the faster and smaller storage at level k serves as a cache for the larger and slower storage at level k+1. The key concepts are:
(1) Cache hit: when a program needs a data object d from level k+1, it first looks for d in one of the blocks currently stored at level k. If d happens to be cached at level k, this is called a cache hit.
(2) Cache miss: if d is not cached at level k, this is called a cache miss. The level-k cache then fetches the block containing d from level k+1, possibly overwriting (replacing or evicting) an existing block, known as the victim block.
(3) Types of cache misses
Compulsory misses (cold misses): the level-k cache is empty (a cold cache), so every access misses. These misses are transient and do not occur in the steady state, once repeated memory accesses have warmed up the cache.
Conflict misses: a restrictive placement policy, for example one in which block i at level k+1 must be placed in block (i mod 4) at level k, causes conflict misses when several blocks in use map to the same level-k block.
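The effect of the (i mod 4) placement policy can be seen in a small simulation. The C sketch below is an illustration only, with hypothetical names: it models a level-k cache with four block slots and counts how many references in a sequence of level-(k+1) block numbers miss.

```c
#include <stddef.h>

#define NUM_BLOCKS 4 /* level-k cache holds 4 blocks, as in (i mod 4) */

/* Count misses for a sequence of level-(k+1) block numbers, where
 * block i may only be placed in cache slot (i mod NUM_BLOCKS). */
size_t count_misses(const unsigned *refs, size_t n)
{
    unsigned slot[NUM_BLOCKS];   /* block number held by each slot */
    int valid[NUM_BLOCKS] = {0}; /* 0 means the slot is still empty */
    size_t misses = 0;

    for (size_t r = 0; r < n; r++) {
        unsigned s = refs[r] % NUM_BLOCKS;
        if (!valid[s] || slot[s] != refs[r]) {
            misses++;          /* cold miss if empty, otherwise conflict miss */
            slot[s] = refs[r]; /* the previous occupant is the victim block */
            valid[s] = 1;
        }
    }
    return misses;
}
```

Alternating between blocks 0 and 8 misses on every reference, because both map to slot 0 and keep evicting each other even though the other three slots are empty. Alternating between blocks 0 and 1 misses only twice, on the two cold misses.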
Summary of memory hierarchy concepts
In summary, memory hierarchies based on caching work because slower storage is cheaper than faster storage and because programs tend to exhibit locality:
- Exploiting temporal locality: because of temporal locality, the same data objects are likely to be reused multiple times. Once a data object has been copied into the cache on the first miss, we can expect a number of subsequent hits on that object. Since the cache is faster than the storage at the next lower level, these subsequent hits are served much faster than the original miss.
- Exploiting spatial locality: blocks usually contain multiple data objects. Because of spatial locality, we can expect the cost of copying a block after a miss to be amortized by subsequent references to other objects within that block.
6.4 Cache Memories
6.4.1 Generic Cache Memory Organization
6.4.2 Direct-Mapped Caches
A cache with exactly one line per set (E = 1) is called a direct-mapped cache.
The process a cache goes through to determine whether a request is a hit, and then to extract the requested word, consists of three steps:
Set selection
Line matching
Word extraction
(1) Set selection in direct-mapped caches: the cache extracts the s set index bits from the address of the requested word w. These bits are interpreted as an unsigned integer that corresponds to a set number.
(2) Line matching in direct-mapped caches: a copy of w is contained in the line if and only if the valid bit is set and the tag in the cache line matches the tag in the address of w.
(3) Word extraction in direct-mapped caches: the block offset bits provide the offset of the first byte of the desired word within the block.
(4) Line replacement on misses in direct-mapped caches: the requested block is fetched from the next level in the memory hierarchy and stored in the cache line of the set indicated by the set index bits, replacing the current line.
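The steps above rely on splitting an address into tag, set index, and block offset fields. A minimal C sketch of that split follows, assuming the usual layout with the block offset in the low-order b bits, the set index in the next s bits, and the tag in the remaining high-order bits. The type and function names are hypothetical, and the sketch requires s + b < 64.

```c
#include <stdint.h>

/* Hypothetical container for the three address fields. */
typedef struct {
    uint64_t tag;    /* high-order bits, compared during line matching */
    uint64_t set;    /* s set index bits, used for set selection */
    uint64_t offset; /* b block offset bits, used for word extraction */
} addr_parts;

/* Split addr assuming the layout [ tag | set index | block offset ].
 * Requires s + b < 64. */
addr_parts split_address(uint64_t addr, unsigned s, unsigned b)
{
    addr_parts p;
    p.offset = addr & ((1ULL << b) - 1);
    p.set    = (addr >> b) & ((1ULL << s) - 1);
    p.tag    = addr >> (s + b);
    return p;
}
```

For a toy cache with 4 sets (s = 2) and 2-byte blocks (b = 1), the 4-bit address 1101 (decimal 13) splits into tag 1, set 2, and offset 1.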
6.4.3 Set Associative Caches
- 1 < E < C/B
- Each set holds more than one cache line.
Set selection in set associative caches: identical to set selection in a direct-mapped cache; the set index bits identify the set.
Line matching and word selection in set associative caches: each set can be viewed as a small associative memory, that is, an array of (key, value) pairs that takes a key as input and returns the value of the matching pair. The cache must search every line in the set for a valid line whose tag matches the tag in the address.
Line replacement on misses in set associative caches: the simplest policy is to choose the line to replace at random. More sophisticated policies draw on the principle of locality, such as least frequently used (LFU) and least recently used (LRU).
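As an illustration of line matching and replacement within one set, the C sketch below implements a least-recently-used policy using a per-line timestamp. This is only one simple way to approximate LRU in software; the names, the choice of two lines per set, and the caller-supplied timestamp are all assumptions made for the example.

```c
#include <stdint.h>

#define LINES_PER_SET 2 /* E in the text; 2 is assumed for the example */

typedef struct {
    int           valid;
    uint64_t      tag;
    unsigned long last_used; /* timestamp of the most recent access */
} cache_line;

typedef struct {
    cache_line lines[LINES_PER_SET];
} cache_set;

/* Look up tag in the set. Returns 1 on a hit and 0 on a miss.
 * On a miss, the block is placed in an empty line if one exists,
 * otherwise it replaces the least recently used line.
 * The caller must pass a strictly increasing value for now. */
int access_set(cache_set *set, uint64_t tag, unsigned long now)
{
    int victim = 0;

    /* Line matching: search every line for a valid, matching tag. */
    for (int i = 0; i < LINES_PER_SET; i++) {
        if (set->lines[i].valid && set->lines[i].tag == tag) {
            set->lines[i].last_used = now;
            return 1;
        }
    }

    /* Line replacement: prefer an empty line, else the LRU line. */
    for (int i = 0; i < LINES_PER_SET; i++) {
        if (!set->lines[i].valid) {
            victim = i;
            break;
        }
        if (set->lines[i].last_used < set->lines[victim].last_used)
            victim = i;
    }
    set->lines[victim].valid = 1;
    set->lines[victim].tag = tag;
    set->lines[victim].last_used = now;
    return 0;
}
```

Starting from an empty set, the tag sequence 10, 20, 10, 30 ends with tags 10 and 30 in the set: tag 20 is evicted because it was used less recently than tag 10.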
6.4.4 Fully Associative Caches
- E = C/B
- There is only one set, and it contains all of the cache lines.
- Because a fully associative cache must search many lines for a matching tag in parallel, it is difficult and expensive to build one that is both large and fast. Fully associative caches are therefore only appropriate for small caches, such as the translation lookaside buffer (TLB) in a virtual memory system, which caches page table entries.
6.4.5 Issues with Writes
- Write-back: defer the memory update as long as possible, writing the updated block back to memory only when the replacement algorithm evicts it from the cache.
- Handling write misses: write-allocate loads the corresponding block from memory into the cache and then updates the cache block.
6.4.6 Anatomy of a real cache hierarchy
- So far, we have assumed that caches hold only program data. In fact, caches can hold instructions as well.
- A cache that holds only instructions is called an i-cache.
- A cache that holds only program data is called a d-cache.
- A cache that holds both instructions and data is called a unified cache.
- Modern processors include separate i-caches and d-caches.
6.4.7 Performance Impact of Cache Parameters
Factors that affect performance
Impact of cache size: a larger cache tends to increase the hit rate, but it is harder to make a large memory run fast.
Impact of block size: larger blocks can help increase the hit rate by exploiting any spatial locality in the program. For a given cache size, however, larger blocks mean fewer cache lines, which can hurt the hit rate in programs with more temporal locality than spatial locality.
Impact of associativity: higher associativity (a larger value of E) reduces the cache's vulnerability to thrashing caused by conflict misses, but it comes at a higher cost.
Impact of write strategy: write-through caches are simpler to implement and can use a write buffer that works independently of the cache to update memory.
There are many metrics to measure the performance of the cache:
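- Miss rate: the fraction of memory references that miss, computed as number of misses / number of references.
- Hit rate: the fraction of memory references that hit, equal to 1 - miss rate.
- Hit time: the time to deliver a word in the cache to the CPU; on the order of several clock cycles.
- Miss penalty: the additional time required because of a miss.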
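These metrics combine naturally into a single estimate. The C sketch below computes the miss rate and hit rate from raw counts, and then an average access time using the commonly used estimate hit time + miss rate x miss penalty. That formula is not given in these notes and is included here as an assumption; the function names are hypothetical.

```c
/* Fraction of references that miss; 0.0 if there were no references. */
double miss_rate(unsigned long misses, unsigned long references)
{
    return references ? (double)misses / (double)references : 0.0;
}

/* Fraction of references that hit: 1 - miss rate. */
double hit_rate(unsigned long misses, unsigned long references)
{
    return references ? 1.0 - miss_rate(misses, references) : 0.0;
}

/* Assumed estimate of the average cycles per access:
 * every access pays the hit time, and misses also pay the penalty. */
double avg_access_cycles(double hit_time, double miss_rate_value,
                         double miss_penalty)
{
    return hit_time + miss_rate_value * miss_penalty;
}
```

Under this estimate, a cache with a 4-cycle hit time, a 25% miss rate, and a 100-cycle miss penalty averages 29 cycles per access, which shows how strongly the miss penalty dominates once misses are frequent.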
6.5 Writing Cache-Friendly Code
1. Temporal locality: repeated references to local variables are good, because the compiler can cache them in the register file.
2. Spatial locality: stride-1 reference patterns are good, because caches at all levels of the memory hierarchy store data as contiguous blocks.
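Both rules show up when summing a two-dimensional array. C stores arrays in row-major order, so the loop order decides the stride. The sketch below uses hypothetical function names and a small, arbitrary array size; on arrays this small the difference is not measurable, and the point is only the access pattern.

```c
#define ROWS 4
#define COLS 8

/* Stride-1 pattern: the inner loop walks along a row, which matches
 * the row-major layout of C arrays. sum is a local variable that the
 * compiler can keep in a register. */
long sum_array_rows(int a[ROWS][COLS])
{
    long sum = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += a[i][j];
    return sum;
}

/* Stride-COLS pattern: the inner loop walks down a column, so each
 * reference lands COLS ints away from the previous one. */
long sum_array_cols(int a[ROWS][COLS])
{
    long sum = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += a[i][j];
    return sum;
}
```

Both functions return the same value. For arrays much larger than the cache, the row-wise version would be expected to miss far less often, since each fetched block is fully used before moving on.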
6.6 Putting It Together: The Impact of Caches on Program Performance
6.6.1 Memory Mountain
The performance of a memory system cannot be described by a single number. Every computer has a unique memory mountain that characterizes the capabilities of its memory system. It is a mountain of temporal and spatial locality, and its elevations can vary by more than an order of magnitude. A well-written program runs on the peaks rather than in the valleys.
Objective: exploit temporal locality so that frequently used words are fetched from L1, and exploit spatial locality so that as many words as possible are accessed from a single L1 cache line.
Focus
Key exercises: 6.2,6.3,6.4,6.8,6.9,6.10,6.11,6.12, 6.13
Resources
1. The textbook "Computer Systems: A Programmer's Perspective" (Chinese edition title: "In-Depth Understanding of Computer Systems"), Chapter 6, "The Memory Hierarchy"; detailed study guide: http://group.cnblogs.com/topic/73069.html
2. Course Materials: https://www.shiyanlou.com/courses/413 Experiment Six, course invitation code: W7FQKW4Y
3. Run, think about, and read the code in the textbook. For a method of learning by reading code, see: Http://www.cnblogs.com/rocedu/p/4837092.html.
4. Baidu Encyclopedia
20135306 Sixth Chapter Study Summary