Chapter 6: The Memory Hierarchy
Learning time: 6 hours
Learning task: Chapter 6 of "In-Depth Understanding of Computer Systems"
6.1 Storage Technology
6.1.1 Random Access Memory
RAM comes in two varieties: static (SRAM) and dynamic (DRAM).
(1) SRAM: used for cache memories, which can be on or off the CPU chip.
(2) DRAM: used for main memory and for the frame buffer of a graphics system.
1. Static RAM
SRAM stores each bit in a bistable memory cell. Each cell is implemented with a six-transistor circuit. The circuit has the property that it can stay indefinitely in either of two distinct voltage configurations, or states.
2. Dynamic RAM
| | Transistors per bit | Relative access time | Persistent? | Sensitive? | Relative cost | Applications |
|---|---|---|---|---|---|---|
| SRAM | 6 | 1x | Yes | No | 100x | Cache memory |
| DRAM | 1 | 10x | No | Yes | 1x | Main memory, frame buffers |
3. Conventional DRAM
(1) The cells (bits) are partitioned into d supercells, each consisting of w DRAM cells.
(2) A d x w DRAM stores dw bits of information in total.
(3) The supercells are organized as a rectangular array of r rows and c columns, with rc = d. Each supercell has an address of the form (i, j).
(4) Row address i: RAS (row access strobe) request; column address j: CAS (column access strobe) request.
Note: RAS and CAS requests share the same DRAM address pins.
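The two-step RAS/CAS addressing above can be sketched in a few lines of Python; the 4 x 4, 8-bits-per-supercell chip here is a hypothetical example, not a configuration from the text.

```python
# Hypothetical d = 16, w = 8 DRAM organized as a 4 x 4 array of supercells.
r, c, w = 4, 4, 8
d = r * c                        # 16 supercells; the chip stores d * w = 128 bits

def supercell_to_ras_cas(n):
    """Map linear supercell number n to its (row, column) address pair.
    The controller sends the row address i first (RAS request), then the
    column address j (CAS request), over the same shared address pins."""
    i, j = divmod(n, c)
    return i, j

print(supercell_to_ras_cas(9))   # supercell 9 sits at row 2, column 1: (2, 1)
```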
4. Memory Module
(1) Dual in-line memory module (DIMM): 168 pins; transfers data to and from the memory controller in 64-bit chunks.
(2) Single in-line memory module (SIMM): 72 pins; transfers data in 32-bit chunks.
5. Enhanced DRAM
- Fast page mode DRAM (FPM DRAM)
- Extended data out DRAM (EDO DRAM)
- Synchronous DRAM (SDRAM)
- Double data-rate synchronous DRAM (DDR SDRAM)
- Rambus DRAM (RDRAM)
- Video RAM (VRAM)
6. Non-volatile memory
Read-only memory (ROM)
(1) ROMs are distinguished by the number of times they can be reprogrammed and by the mechanism used to reprogram them.
PROM (programmable ROM): can be programmed exactly once.
EPROM (erasable programmable ROM):
(1) Shining ultraviolet light through a window clears the EPROM cells to 0.
(2) The EPROM is programmed by a special device that writes 1s into it.
EEPROM (electrically erasable PROM):
(1) Does not require a physically separate programming device.
(2) Can be reprogrammed on the order of 10^5 times.
Flash memory:
Solid state disks (SSDs): flash-based disk drives.
7. Accessing main memory
(1) Bus: a collection of parallel wires that carry address, data, and control signals.
6.1.2 Disk Storage
1. Disk Construction
(1) A disk consists of platters. Each platter has two sides, or surfaces, coated with magnetic recording material. A spindle at the center of the platter spins it at a fixed rotational rate. A disk typically contains one or more such platters encased in a sealed container.
(2) Each surface consists of a collection of concentric rings called tracks. Each track is divided into a collection of sectors, each of which holds an equal number of data bits (typically 512 bytes). Sectors are separated by gaps, which store no data bits; the gaps store formatting bits that identify the sectors.
2. Disk capacity
(1) Capacity: the maximum number of bits that can be recorded on a disk.
(2) Determining factors:
- Recording density: the number of bits that can be squeezed into a one-inch segment of a track.
- Track density: the number of tracks that can be squeezed into a one-inch segment of the radius extending from the center of the platter.
- Areal density: the product of the recording density and the track density.
(3) Formula:
Capacity = (# bytes/sector) x (avg. # sectors/track) x (# tracks/surface) x (# surfaces/platter) x (# platters/disk)
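As a sanity check of the capacity formula, here is a small Python sketch with hypothetical parameters (512 bytes/sector, 300 sectors/track on average, 20,000 tracks/surface, 2 surfaces/platter, 5 platters):

```python
def disk_capacity(bytes_per_sector, avg_sectors_per_track,
                  tracks_per_surface, surfaces_per_platter, platters_per_disk):
    """Capacity = (# bytes/sector) x (avg # sectors/track) x (# tracks/surface)
    x (# surfaces/platter) x (# platters/disk)."""
    return (bytes_per_sector * avg_sectors_per_track * tracks_per_surface
            * surfaces_per_platter * platters_per_disk)

cap = disk_capacity(512, 300, 20000, 2, 5)
print(cap)   # 30720000000 bytes, i.e. 30.72 GB (disk makers use 1 GB = 10^9 bytes)
```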
3. Disk operation
(1) Disks read and write data in sector-sized blocks.
(2) Access time:
- Seek time: the time required to move the arm. Typically 3-9 ms; a single seek can take as long as 20 ms.
- Rotational latency: once the head is over the target track, the drive waits for the first bit of the target sector to pass under the head. Depends on the position of the platter and the rotational speed of the disk.
Maximum rotational latency: T(max rotation) = (1/RPM) x (60 secs/1 min)
Average rotational latency: T(avg rotation) = (1/2) x T(max rotation)
- Transfer time: the time to transfer one sector depends on the rotational speed and the number of sectors per track.
T(avg transfer) = (1/RPM) x (1/(avg. # sectors/track)) x (60 secs/1 min)
Note:
- The time to access the 512 bytes in a disk sector is dominated by the seek time and the rotational latency.
- Since the seek time is roughly equal to the rotational latency, a quick estimate of disk access time is 2 x the seek time.
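Putting the formulas above together, a short sketch for a hypothetical drive (7,200 RPM, 9 ms average seek time, 400 sectors per track):

```python
def avg_rotation_ms(rpm):
    # Half of a full rotation: T(avg rotation) = (1/2) x (1/RPM) x 60 x 1000 ms
    return 0.5 * (1 / rpm) * 60 * 1000

def avg_transfer_ms(rpm, avg_sectors_per_track):
    # T(avg transfer) = (1/RPM) x (1/(avg # sectors/track)) x 60 x 1000 ms
    return (1 / rpm) * (1 / avg_sectors_per_track) * 60 * 1000

seek = 9.0                                  # ms, assumed average seek time
rotation = avg_rotation_ms(7200)            # about 4.17 ms
transfer = avg_transfer_ms(7200, 400)       # about 0.02 ms
total = seek + rotation + transfer          # about 13.19 ms
# Seek time and rotational latency dominate; the transfer itself is negligible.
```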
4. Logical Disk Block
(1) The disk is presented as a sequence of B sector-sized logical blocks, numbered 0, 1, ..., B-1.
(2) A small hardware/firmware device in the disk, the disk controller, maintains the mapping between logical block numbers and actual (physical) disk sectors.
5. Connecting to I/O devices
(1) The system bus and the memory bus are CPU-specific.
(2) Devices attached to the I/O bus:
- Universal Serial Bus (USB) controller: a conduit for devices attached to the USB bus.
- Graphics card (adapter): contains the hardware and software logic responsible for painting pixels on the display on behalf of the CPU.
- Host bus adapter: connects one or more disks to the I/O bus, using a communication protocol defined by a particular host bus interface.
6. Accessing the disk
(1) The CPU initiates a disk read by writing a command, a logical block number, and a destination memory address to the memory-mapped address associated with the disk.
(2) Direct memory access (DMA): the disk controller reads the sector and performs a DMA transfer into main memory.
(3) When the DMA transfer is complete, the disk controller notifies the CPU with an interrupt.
6.2 Locality
Locality takes two forms: temporal locality and spatial locality.
6.2.1 Locality of References to Program Data
1. Sequential reference pattern: a stride-1 reference pattern.
2. C arrays are stored in memory in row-major order.
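A small sketch of what row-major storage means for stride; the 3 x 4 array of 4-byte ints is a made-up example:

```python
R, C, SIZE = 3, 4, 4     # hypothetical C array: int a[3][4], 4 bytes per int

def offset(i, j):
    # In row-major order, a[i][j] lives at byte offset (i*C + j) * SIZE.
    return (i * C + j) * SIZE

# Scanning by rows (j varies fastest) touches consecutive addresses: stride 1.
row_major = [offset(i, j) for i in range(R) for j in range(C)]
# Scanning by columns (i varies fastest) jumps C elements at a time: stride C.
col_major = [offset(i, j) for j in range(C) for i in range(R)]

print(row_major[:4])     # [0, 4, 8, 12]  -- good spatial locality
print(col_major[:4])     # [0, 16, 32, 4] -- poor spatial locality
```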
6.2.2 Locality of Instruction Fetches
1. An important property that distinguishes code from program data is that it is not modified at run time.
2. While a program is executing, the CPU only reads its instructions from memory; it never overwrites or modifies them.
6.2.3 Summary of Locality
- Programs that repeatedly reference the same variables have good temporal locality.
- For programs with a stride-k reference pattern, the smaller the stride, the better the spatial locality. Programs with a stride-1 reference pattern have good spatial locality; programs that jump around memory in large strides have very poor spatial locality.
- For instruction fetches, loops have good temporal and spatial locality. The smaller the loop body and the larger the number of loop iterations, the better the locality.
6.3 Memory Hierarchy
Storage technology: Access times vary widely between different storage technologies. Faster technologies cost more per byte than slower technologies and have smaller capacity. The speed gap between CPU and main memory is increasing.
Computer software: A well-written program tends to show good locality.
6.3.1 Caching in the Memory Hierarchy
The central idea of the memory hierarchy: for each k, the faster and smaller storage device at level k serves as a cache for the larger and slower storage device at level k+1.
1. Cache Hits
When a program needs a data object d from level k+1, it first looks for d in one of the blocks currently stored at level k. If d happens to be cached at level k, we call it a cache hit.
2. Cache Misses
(1) The data object d is not cached at level k.
(2) The level-k cache then fetches the block containing d from the level-(k+1) cache. If the level-k cache is already full, it may overwrite an existing block.
(3) The process of overwriting an existing block is known as replacing, or evicting, the block.
3. Types of Cache Misses
(1) Compulsory miss (cold miss)
The level-k cache is empty (a cold cache), so any access to a data object misses.
(2) Conflict miss
Because of a placement policy that restricts blocks from level k+1 to a small subset of the level-k blocks, the cache as a whole may not be full, yet the set the block maps to is, so the access misses.
(3) Capacity miss
When the size of the working set exceeds the size of the cache, the cache suffers capacity misses; the cache is simply too small to hold the working set.
4. Cache Management
Some form of logic must manage the cache; that logic can be hardware, software, or a combination of the two.
6.3.2 Summary of Memory Hierarchy Concepts
6.4 Cache Memory
Early computer systems had a memory hierarchy of only three levels: CPU registers, DRAM main memory, and disk storage.
L1 cache (level-1 cache): an SRAM cache memory located between the CPU register file and main memory; access takes about 2-4 clock cycles.
L2 cache: between the L1 cache and main memory; access takes about 10 clock cycles.
L3 cache: between the L2 cache and main memory; access takes about 30-40 clock cycles.
6.4.1 Generic Cache Memory Organization
1. A cache is an array of cache sets, characterized by the tuple (S, E, B, m):
S: the array contains S = 2^s cache sets.
E: each set contains E cache lines.
B: each line holds a block of B = 2^b data bytes.
m: each memory address has m bits, giving M = 2^m distinct addresses.
2. Tag bits and valid bit
(1) Valid bit: each line has a valid bit that indicates whether the line contains meaningful information.
(2) Tag bits: t = m - (b + s); they uniquely identify the block stored in the cache line.
(3) Set index bits: s.
(4) Block offset bits: b.
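These bit fields can be illustrated with a short sketch; the parameters (s = 4, b = 5, m = 32) are hypothetical, not taken from the text:

```python
s, b, m = 4, 5, 32        # 16 sets, 32-byte blocks, 32-bit addresses
t = m - (b + s)           # tag bits: t = m - (b + s) = 23

def split_address(addr):
    """Split an m-bit address into (tag, set index, block offset)."""
    block_offset = addr & ((1 << b) - 1)        # low b bits
    set_index = (addr >> b) & ((1 << s) - 1)    # next s bits
    tag = addr >> (b + s)                       # remaining t bits
    return tag, set_index, block_offset

tag, idx, off = split_address(0x12345678)
# Reassembling the three fields recovers the original address:
assert (tag << (b + s)) | (idx << b) | off == 0x12345678
```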
6.4.2 Direct-Mapped Caches
A cache with exactly one line per set (E = 1) is called a direct-mapped cache.
The process by which the cache determines whether a request hits, and then extracts the requested word, has three steps:
- Set selection
- Line matching
- Word extraction
1. Set selection
(1) The cache extracts the s set index bits from the middle of the address of w.
(2) The set index bits are interpreted as an unsigned integer that corresponds to a set number.
2. Row matching
Note that two conditions are together necessary and sufficient for a cache hit:
- The line has its valid bit set.
- The tag in the cache line matches the tag in the address of w.
3. Word selection
4. Line replacement on misses in direct-mapped caches
5. A direct-mapped cache in action
- The tag and index bits together uniquely identify each block in memory.
- Blocks that map to the same cache set are uniquely identified by the tag bits.
6. Conflict misses in direct-mapped caches
(1) Thrashing: the cache repeatedly loads and evicts the same sets of cache blocks.
(2) Cause: the blocks all map to the same cache set.
(3) Workaround: add B bytes of padding to the end of each array (B bytes is the block size; one line holds one block, so the padding effectively separates the rows), so that the arrays map to different sets.
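The thrashing scenario can be simulated in a few lines. The parameters here (a direct-mapped cache with 2 sets and 16-byte blocks, two float arrays placed 32 bytes apart) are invented for illustration; in this configuration x[i] and y[i] always map to the same set and keep evicting each other:

```python
SETS, BLOCK = 2, 16                  # hypothetical tiny direct-mapped cache

def locate(addr):
    """Return (set index, tag) for a byte address."""
    block_number = addr // BLOCK
    return block_number % SETS, block_number // SETS

cache = [None] * SETS                # one resident tag per set (E = 1)
misses = 0
for i in range(4):                   # alternating x[i], y[i] accesses
    for base in (0, 32):             # x starts at byte 0, y at byte 32
        idx, tag = locate(base + 4 * i)   # 4-byte floats
        if cache[idx] != tag:
            misses += 1
            cache[idx] = tag         # evict the resident block

print(misses)                        # 8: every single access misses (thrashing)
```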
6.4.3 Set-Associative Caches
E-way set-associative cache: 1 < E < C/B.
1. Set selection
The same as for a direct-mapped cache.
2. Line matching and word selection
(1) Key: the tag and valid bit.
(2) Value: the contents of the block.
3. Line replacement on misses
If there is an empty line, replace an empty line; if there is no empty line, apply a replacement policy:
(1) Random replacement.
(2) Least-frequently-used (LFU): replace the line that was referenced the fewest times within some past time window.
(3) Least-recently-used (LRU): replace the line that was accessed least recently.
6.4.4 Fully Associative Caches
1. Set selection
There is only one set, set 0 by default; there are no set index bits, and the address is divided into only a tag and a block offset.
2. Line matching and word selection
(1) The same as for a set-associative cache.
(2) Fully associative caches are suitable only for small caches.
6.4.5 Issues with Writes
1. Write-through: immediately write w's cache block back to the next lower level.
(1) Disadvantage: every write causes bus traffic.
2. Write-back: defer the update as long as possible, writing the updated block to the next lower level only when the replacement algorithm is about to evict it.
(1) Advantage: because of locality, bus traffic is significantly reduced.
(2) Disadvantage: added complexity; the cache must maintain an additional dirty bit for each line.
3. Handling write misses
(1) Write-allocate: load the corresponding block from the next lower level into the cache, then update the cache block.
(2) No-write-allocate: bypass the cache and write the word directly to the next lower level.
6.4.6 Anatomy of a Real Cache Hierarchy
1. Caches hold instructions as well as data.
- Holds instructions only: i-cache.
- Holds program data only: d-cache.
- Holds both instructions and data: unified cache.
6.4.7 Performance Impact of Cache Parameters
- Miss rate = number of misses / number of references.
- Hit rate = 1 - miss rate.
- Hit time: the time to deliver a word from the cache to the CPU on a hit.
- Miss penalty: the additional time required because of a miss.
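A quick sketch of how these metrics combine into an average access time; the cycle counts are hypothetical:

```python
def avg_access_time(miss_rate, hit_time, miss_penalty):
    """Average access time = hit time + miss rate x miss penalty (in cycles)."""
    return hit_time + miss_rate * miss_penalty

# With a 4-cycle hit time and a 100-cycle miss penalty, shaving the miss
# rate from 3% to 1% cuts the average access time noticeably:
print(avg_access_time(0.03, 4, 100))   # 7.0 cycles
print(avg_access_time(0.01, 4, 100))   # 5.0 cycles
```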
6.5 Writing Cache-Friendly Code
1. Basic approach:
- Make the common case go fast.
- Minimize the number of cache misses in each inner loop.
2. Key points:
- Repeated references to local variables are good (temporal locality).
- Stride-1 reference patterns are good (spatial locality).
6.6 The Memory Mountain
Every computer has a unique memory mountain that characterizes the capabilities of its memory system.
That is, the performance of the memory system is expressed as a mountain of temporal and spatial locality.
Goal: make programs run at the peaks rather than in the valleys.
Specifically: exploit temporal locality so that frequently used words are fetched from the L1 cache, and exploit spatial locality so that as many words as possible are accessed from a single L1 cache line.
Learning experience:
Although this chapter does not involve much hands-on code to compile and run, it is heavy on concepts and theory. Even though the book is well organized, I feel that reading and understanding it still takes a great deal of patience.
Resources
1. "In-Depth Understanding of Computer Systems" (book and PDF), Chapter 6
2. http://www.cnblogs.com/lwr-/p/4908540.htm (Simaoyang's blog)
20135223 He Weizin - Information Security System Design Basics, Week 7 Study Summary