Chapter 6: Memory Hierarchy
A memory system is a hierarchical structure of storage devices with different capacities, costs, and access times.
From top to bottom: CPU registers, cache memories, main memory, disk.
Section 1: Storage Technologies
I. Random Access Memory (RAM)
RAM falls into two classes:
- Static RAM (SRAM) — faster and more expensive; used for cache memories, both on and off the CPU chip
- Dynamic RAM (DRAM) — used for main memory and for the frame buffer of a graphics system
1. Conventional DRAM
(1) Supercells
- The bits in a chip are divided into d supercells, each consisting of w DRAM cells; a d×w DRAM stores dw bits of information in total.
- The supercells are organized as a rectangular array of r rows and c columns, so rc = d.
- Each supercell has an address of the form (i, j), where i denotes the row and j the column.
(2) Information flow
Information flows in and out of the chip through pins, each pin carrying a 1-bit signal.
(3) Memory controller
This circuit can transfer w bits at a time to or from each DRAM chip.
- RAS — Row Access Strobe — carries the row address i
- CAS — Column Access Strobe — carries the column address j
- RAS and CAS requests share the same DRAM address pins.
2. Memory modules
DRAM chips are packaged in memory modules that plug into expansion slots on the motherboard.
- 168-pin dual inline memory module (DIMM) — transfers data in 64-bit blocks.
- 72-pin single inline memory module (SIMM) — transfers data in 32-bit blocks.
Reading the contents of a memory module:
Main memory can be aggregated by connecting multiple memory modules to the memory controller. When the controller receives an address A, it selects the module k containing A, converts A to its (i, j) form, and sends (i, j) to module k.
Exercise 6.1 shows that the layout tends to be close to square. Note that the organization is d×w and that rc = d.
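The row/column organization above can be sketched as follows (a toy model, not the textbook's code; the near-square choice minimizes max(r, c), which is the point of Exercise 6.1):

```python
def supercell_address(a, c):
    """Convert a linear supercell number a into (i, j) form,
    given c columns per row (the rc = d organization)."""
    return (a // c, a % c)

def best_layout(d):
    """Among all factorizations d = r * c, pick the one that
    minimizes max(r, c) -- i.e., the most nearly square one."""
    return min(((r, d // r) for r in range(1, d + 1) if d % r == 0),
               key=lambda rc: max(rc))

print(best_layout(16))           # 16 supercells -> a 4 x 4 array
print(supercell_address(11, 4))  # supercell 11 in a 4-column array -> (2, 3)
```

Minimizing max(r, c) matters because the row and column addresses share the same pins, so a near-square array keeps the address width small.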
3. Enhanced DRAMs
- Fast Page Mode (FPM) DRAM: consecutive accesses to the same row are served directly from the row buffer. (With conventional DRAM, four requests to the same row each cause the whole row to be read and then discarded, so the row is re-read every time.)
- Extended Data Out (EDO) DRAM: allows the individual CAS signals to be spaced closer together in time.
- Synchronous DRAM (SDRAM): replaces many of the control signals with the rising edge of the same external clock signal that drives the memory controller — faster than the asynchronous variants.
- Double Data Rate (DDR) SDRAM: doubles the speed of the DRAM by using both clock edges as control signals. Variants are distinguished by the size of the prefetch buffer: DDR (2 bits), DDR2 (4 bits), DDR3 (8 bits).
- RDRAM
- Video RAM (VRAM): used in the frame buffers of graphics systems. The idea resembles FPM DRAM; the differences are:
  1. VRAM output is produced by shifting the entire contents of the internal buffer in sequence.
  2. VRAM allows concurrent reads and writes to the memory.
4. Nonvolatile memory — ROM
RAM loses its data when power is lost; it is volatile.
ROMs are nonvolatile and are collectively referred to as read-only memories.
(1) Classification
- PROM — programmable ROM; can be programmed exactly once
- EPROM — erasable programmable ROM; can be erased and reprogrammed on the order of 1,000 times
- EEPROM — electrically erasable PROM; can be reprogrammed on the order of 10^5 times
(2) Flash memory
Based on EEPROM, it provides fast and durable nonvolatile storage for a large class of electronic devices.
Found in: digital cameras, mobile phones, music players, PDAs, laptops, desktops, and server computer systems.
(3) Firmware
Programs stored in ROM devices are often referred to as firmware; when a computer system is powered up, it runs the firmware stored in ROM.
5. Accessing main memory
(1) Buses
A bus is a collection of parallel wires that carry address, data, and control signals.
Bus classification:
A. System bus — connects the CPU and the I/O bridge
B. Memory bus — connects the I/O bridge and main memory
C. I/O bus (see 6.1.2.4 for details)
The I/O bridge translates the electrical signals of the system bus into the electrical signals of the memory bus, and also connects the system bus and memory bus to the I/O bus.
II. Disk Storage
1. Disk geometry
- Platter
- Surface: two surfaces per platter
- Spindle: at the center of the platter; the platters rotate around it
- Rotational rate: usually 5,400–15,000 revolutions per minute (RPM)
- Track: concentric circles on a surface
- Sectors: each track is divided into a set of sectors
- Data bits: each sector contains an equal number of data bits, typically 512 bytes
- Gaps: store formatting bits that identify sectors
- Disk drive / disk / rotating disk: different names for the same device
- Cylinder: the set of tracks on all surfaces that are equidistant from the center of the spindle.
2. Disk capacity — the maximum number of bits that can be recorded on a disk
(1) Influencing factors:
- Recording density — bits per inch along a track
- Track density — tracks per inch along the radius
- Areal density — bits per square inch
Increasing the areal density increases the capacity.
(2) Modern high-capacity disks — multiple zone recording
The set of cylinders is partitioned into disjoint subsets (recording zones), each containing a contiguous collection of cylinders;
every track of every cylinder in a zone has the same number of sectors, determined by the number of sectors that the innermost track of the zone can hold.
Note: floppy disks still use the old-fashioned approach, in which every track has the same number of sectors.
(3) Capacity formula:
Capacity = (bytes/sector) × (average sectors/track) × (tracks/surface) × (surfaces/platter) × (platters/disk)
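The capacity formula can be checked numerically; the parameter values below are illustrative assumptions (512 bytes/sector, 300 sectors/track on average, 20,000 tracks/surface, 2 surfaces/platter, 5 platters):

```python
def disk_capacity(bytes_per_sector, avg_sectors_per_track,
                  tracks_per_surface, surfaces_per_platter, platters):
    """Disk capacity as the product of the five geometry factors."""
    return (bytes_per_sector * avg_sectors_per_track *
            tracks_per_surface * surfaces_per_platter * platters)

cap = disk_capacity(512, 300, 20000, 2, 5)
print(cap / 10**9, "GB")  # disk manufacturers count 10^9 bytes per GB
```

Note that disk manufacturers use powers of ten (GB = 10^9 bytes), unlike the powers of two used for DRAM capacities.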
3. Disk operation
Disks read and write data in sector-sized blocks.
The access time has three components:
(1) Seek time
— the time it takes to move the actuator arm.
Depends on the previous position of the read/write head and the speed at which the arm moves across the surface.
Typically 3–9 ms; can be as high as 20 ms.
(2) Rotational latency
— the time the drive waits for the first bit of the target sector to rotate under the read/write head.
Depends on the position of the surface and the rotational speed.
Maximum rotational latency = (1/RPM) × (60 s / 1 min)
The average rotational latency is half the maximum.
(3) Transfer time
Depends on the rotational speed and the number of sectors per track.
Average transfer time = (1/RPM) × (1 / (average sectors/track)) × (60 s / 1 min)
The average time to access a disk sector is the sum of the average seek time, the average rotational latency, and the average transfer time.
From the worked example on page 393 of the textbook, two conclusions follow:
1. The dominant components are the seek time and the rotational latency.
2. Twice the seek time is a simple and reasonable estimate of the disk access time.
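The three components can be computed directly; the parameter values here are illustrative assumptions in the spirit of the textbook example (7,200 RPM, 9 ms average seek, 400 sectors per track on average):

```python
def avg_access_time_ms(rpm, avg_seek_ms, avg_sectors_per_track):
    """Sum of average seek time, average rotational latency,
    and average transfer time, all in milliseconds."""
    full_rotation_ms = 60_000 / rpm            # time for one revolution
    avg_rotation_ms = full_rotation_ms / 2     # half a revolution on average
    avg_transfer_ms = full_rotation_ms / avg_sectors_per_track
    return avg_seek_ms + avg_rotation_ms + avg_transfer_ms

t = avg_access_time_ms(7200, 9.0, 400)
print(round(t, 2), "ms")  # seek time and rotational latency dominate
```

The transfer time (about 0.02 ms here) is negligible next to seek and rotation, which is exactly why twice the seek time works as a rough estimate.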
3. Logical disk blocks
The triple (surface, track, sector) uniquely identifies the corresponding physical sector.
Analogy: memory can be viewed as an array of bytes; a disk can be viewed as an array of blocks.
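The mapping the disk controller maintains from logical block numbers to triples can be sketched under an idealized constant geometry (real drives use recording zones and remap bad sectors, so this is only a model):

```python
def block_to_triple(b, tracks_per_surface, sectors_per_track):
    """Map logical block number b to a (surface, track, sector)
    triple under an idealized fixed-geometry disk."""
    blocks_per_surface = tracks_per_surface * sectors_per_track
    surface, rest = divmod(b, blocks_per_surface)
    track, sector = divmod(rest, sectors_per_track)
    return (surface, track, sector)

# With 10 tracks/surface and 100 sectors/track, block 1005 is
# sector 5 of track 0 on surface 1:
print(block_to_triple(1005, tracks_per_surface=10, sectors_per_track=100))
```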
4. Connecting I/O devices (the I/O bus)
The I/O bus connects the CPU, main memory, and the I/O devices.
See the diagram on page 395 for the specific process.
III. Solid State Disks
An SSD is a flash-based storage technology. It differs from rotating disks in that a solid state disk has no moving parts.
1. Composition
An SSD package consists of one or more flash memory chips and a flash translation layer:
- Flash memory chips — correspond to the mechanical drive of a rotating disk
- Flash translation layer (a hardware/firmware device) — corresponds to the disk controller
2. Reading and writing
(1) Sequential reads and writes
Speeds are comparable; sequential reads are slightly faster than sequential writes.
(2) Random reads and writes
Random writes are about an order of magnitude slower than random reads.
Reason: this is determined by the basic properties of the underlying flash memory.
A flash memory consists of a sequence of B blocks, each of which consists of P pages. Pages are typically 512 bytes to 4 KB in size; a block consists of 32–128 pages, so blocks are 16 KB to 512 KB in size.
Data is read and written in units of pages. A page can be written only after the entire block it belongs to has been erased, which is why random writes are slow.
3. Advantages
- Built from semiconductors, with no moving parts
- Faster random access than rotating disks
- Lower power consumption
- More rugged
4. Disadvantages
- More prone to wear-out
- More expensive per byte
IV. Storage Technology Trends
- Different storage technologies have different price and performance trade-offs
- The price and performance properties of different storage technologies change at dramatically different rates
- Increasing density (and thereby lowering cost) is easier than decreasing access time
- DRAM and disk performance lag behind CPU performance
Section 2: Locality
Principle of locality:
Well-written computer programs tend to reference data items that are near other recently referenced data items, or that were themselves recently referenced.
Two forms:
- Temporal locality
- Spatial locality
Applications:
1. Hardware level:
Cache memories are introduced to hold the most recently referenced instructions and data items, speeding up accesses to main memory.
2. Operating system level:
The system uses main memory as a cache for the most recently referenced blocks of the virtual address space, and also uses main memory to cache the most recently used disk blocks of the disk file system.
3. Application level:
Web browsers place the most recently referenced documents on the local disk.
I. Locality of References to Program Data
1. Stride-k reference patterns
Definition: visiting every k-th element of a contiguous vector is called a stride-k reference pattern.
The stride-1 reference pattern — sequential access to each element of a vector, sometimes called the sequential reference pattern — is a common and important source of spatial locality in programs.
In general, spatial locality decreases as the stride increases.
In the textbook example, one version sums a 2-D array in row-major order and the other in column-major order. C arrays are stored in row-major order in memory, so the first version has good spatial locality while the second has poor spatial locality.
Because the loops are executed many times, both versions also have good temporal locality.
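The two traversal orders can be sketched as follows (in the spirit of the textbook's sumarrayrows/sumarraycols, here rendered in Python; the point is the access pattern — in C the performance difference comes from the row-major storage layout):

```python
def sum_array_rows(a):
    """Row-major traversal: a stride-1 pattern over the underlying
    storage of a C array -- good spatial locality."""
    total = 0
    for row in a:
        for x in row:
            total += x
    return total

def sum_array_cols(a):
    """Column-major traversal: a stride-N pattern over row-major
    storage -- poor spatial locality."""
    total = 0
    for j in range(len(a[0])):
        for i in range(len(a)):
            total += a[i][j]
    return total

a = [[1, 2, 3], [4, 5, 6]]
print(sum_array_rows(a), sum_array_cols(a))  # both compute 21
```

Both functions compute the same sum; only the order in which memory is touched differs, and that order is what determines the cache behavior.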
II. Locality of Instruction Fetches
Program instructions are stored in memory, and the CPU must fetch (read) them.
An important property that distinguishes code from program data is that it is not modified at run time.
III. Summary of Locality
Simple principles for qualitatively evaluating the locality in a program:
- Programs that repeatedly reference the same variables have good temporal locality
- For programs with stride-k reference patterns, the smaller the stride, the better the spatial locality
- Loops have good temporal and spatial locality with respect to instruction fetches. The smaller the loop body and the greater the number of loop iterations, the better the locality.
Section 3: The Memory Hierarchy
The idea: each level of storage devices serves as a "cache" for the next level down.
I. Caching
Cache: a small, fast storage device that acts as a staging area for the data objects stored in a larger, slower device.
Caching: the process of using a cache is called caching.
Data is always copied back and forth between level k and level k+1 in block-sized transfer units. The block size is fixed between any particular pair of adjacent levels, but other pairs of levels can have different block sizes.
In general: the lower the level, the larger the blocks.
1. Cache hits
When a program needs a data object d from level k+1, it first looks for d in one of the blocks currently stored at level k. If d happens to be cached at level k, this is called a cache hit.
The program reads d directly from level k, which is faster than reading d from level k+1.
2. Cache misses
That is, the data object d is not cached at level k.
The level-k cache then fetches the block containing d from the level k+1 cache. If the level-k cache is full, fetching the block may overwrite an existing block.
Overwriting a block is called replacing or evicting it.
Replacement policies:
- Random replacement policy — evict a randomly chosen block
- Least recently used (LRU) replacement policy — evict the block whose last access lies furthest in the past.
3. Kinds of cache misses
(1) Compulsory misses/cold misses
When the level-k cache is empty (called a cold cache), any access to a data object misses.
These are typically transient events that do not occur in steady state, i.e., after the cache has been warmed up by repeated accesses. (Warming up can be understood as accessing memory repeatedly so that the cache is no longer empty.)
(2) Conflict misses
Because of a placement policy, blocks from level k+1 are restricted to a small subset of the positions at level k. This can cause misses even when the cache is not full, because the positions those blocks map to are all occupied.
(3) Capacity misses
When the size of the working set exceeds the size of the cache, the cache experiences capacity misses: the cache is simply too small to hold the working set.
4. Cache management
Some form of logic must manage the cache. The logic that manages the cache can be hardware, software, or a combination of the two.
II. Summary of Memory Hierarchy Concepts
Section 4: Cache Memories
L1 cache:
Located between the CPU register file and main memory; access takes about 2–4 clock cycles.
L2 cache:
Located between the L1 cache and main memory; access takes about 10 clock cycles.
L3 cache:
Located between the L2 cache and main memory; access takes about 30–40 clock cycles.
I. Generic Cache Memory Organization
A cache is an array of cache sets. Its organization can be described by the tuple (S, E, B, m):
- S: the array contains S = 2^s cache sets
- E: each set contains E cache lines
- B: each line consists of a data block of B = 2^b bytes
- m: each memory address has m bits, forming M = 2^m distinct addresses
In addition, each line has a tag and a valid bit:
- Valid bit: one per line, indicating whether the line contains meaningful information
- Tag bits: t = m - (b + s) of them, uniquely identifying the block stored in the cache line
- Set index bits: s
- Block offset bits: b
The cache organization thus partitions the m address bits into t tag bits, s set index bits, and b block offset bits.
1. Cache size/capacity C
Refers to the aggregate size of all the blocks, excluding the tag bits and valid bits, so:
C = S × E × B
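These parameter relationships can be checked with a small helper (a sketch; the example values m = 32, S = 64, E = 8, B = 64 are assumptions for illustration, not from the notes):

```python
def cache_params(m, S, E, B):
    """Given address width m, number of sets S, lines per set E, and
    block size B, derive s, b, the tag width t, and capacity C = S*E*B."""
    s = S.bit_length() - 1          # S = 2^s, so s = log2(S)
    b = B.bit_length() - 1          # B = 2^b, so b = log2(B)
    t = m - (s + b)                 # remaining address bits are the tag
    return {"s": s, "b": b, "t": t, "C": S * E * B}

print(cache_params(32, 64, 8, 64))  # a 32 KB, 8-way cache with 64-byte blocks
```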
2. How it works
The parameters s and b partition the m address bits into three fields. Then:
- The s set index bits first determine which set the word must be stored in
- The t tag bits then tell us which line (if any) in this set contains the word (a line contains it if and only if its valid bit is set and its tag matches the tag in the address)
- The b block offset bits give the offset of the word within the B-byte data block
Exercise 6.10: just remember the quantitative relationships among the parameters.
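Extracting the three fields from an address is a matter of shifts and masks; a minimal sketch (the field widths s = 2 and b = 4 are assumed for illustration):

```python
def split_address(addr, s, b):
    """Split an address into (tag, set_index, block_offset) fields,
    given s set-index bits and b block-offset bits."""
    block_offset = addr & ((1 << b) - 1)          # low b bits
    set_index = (addr >> b) & ((1 << s) - 1)      # next s bits
    tag = addr >> (s + b)                         # remaining high bits
    return (tag, set_index, block_offset)

# Address 0xA7 = 0b10_10_0111 with s = 2, b = 4:
# tag = 0b10 = 2, set index = 0b10 = 2, offset = 0b0111 = 7
print(split_address(0xA7, s=2, b=4))
```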
II. Direct-Mapped Caches
Caches are classified by E, the number of cache lines per set; a cache with E = 1 is called a direct-mapped cache. Taking it as an example:
The process by which the cache determines whether a request hits and then extracts the requested word has three steps:
1. Set selection
2. Line matching
3. Word extraction
1. Set selection
The cache extracts the s set index bits from the middle of the address of w.
Set index bits: an unsigned integer corresponding to a set number.
Analogy: the cache is like an array of sets, and the set index bits form the index into that array.
2. Line matching
Note that two conditions must both hold for a cache hit:
- The line has its valid bit set
- The tag in the cache line matches the tag in the address of w
3. Word selection
By the same analogy: a block is like an array of bytes, and the block offset is the index into that array.
As I understand it, this can also be compared to an array subscript or to the offset part of an effective address.
4. Line replacement on a cache miss
— replace the current line with the newly fetched line.
5. A direct-mapped cache in action
- The tag and index bits concatenated together uniquely identify each block in memory
- Blocks that map to the same cache set are distinguished uniquely by their tag bits
※ Note the sequence of reads the CPU performs on pages 413–414 of the textbook:
1. First use the index bits to determine which set the access targets.
2. Then check whether the line in that set is valid:
(1) If invalid, the cache misses; the cache fetches the requested block from memory or from the next lower level, stores it in that set, sets the valid bit to 1, and returns the requested value.
(2) If valid, compare the tags:
- If the tags match, the cache hits and returns the requested value.
- If not, replace the line and then return the value.
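The read procedure above can be sketched as a tiny direct-mapped cache simulator (an illustrative model, not the textbook's code; s = 2 and b = 4 are assumed):

```python
class DirectMappedCache:
    """Minimal direct-mapped cache model: E = 1 line per set."""
    def __init__(self, s, b):
        self.s, self.b = s, b
        self.valid = [False] * (1 << s)
        self.tags = [None] * (1 << s)

    def access(self, addr):
        """Return 'hit' or 'miss'; on a miss, load the block."""
        set_index = (addr >> self.b) & ((1 << self.s) - 1)
        tag = addr >> (self.s + self.b)
        if self.valid[set_index] and self.tags[set_index] == tag:
            return "hit"
        self.valid[set_index] = True   # fetch the block, set the valid bit
        self.tags[set_index] = tag     # record its tag (replaces old line)
        return "miss"

c = DirectMappedCache(s=2, b=4)
# 0x00 misses (cold), 0x04 hits (same 16-byte block),
# 0x40 maps to the same set as 0x00 and evicts it (conflict):
print([c.access(a) for a in [0x00, 0x04, 0x40, 0x00]])
```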
6. Conflict misses in direct-mapped caches
(1) Thrashing:
— the cache repeatedly loads and evicts the same sets of cache blocks.
(2) Cause:
The blocks map to the same cache set.
(3) Workaround:
Place B bytes of padding at the end of each array (B bytes is the length of one block, and one line holds one block, so the padding effectively separates the lines), so that the arrays map to different sets.
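The effect of padding can be seen by computing set indices directly (a sketch with assumed numbers: 32 sets of 16-byte blocks, i.e. a 512-byte cache, and two arrays whose bases lie exactly one cache size apart):

```python
def set_index(addr, s=5, b=4):
    """Set index of an address for a cache with 2^s sets, 2^b-byte blocks."""
    return (addr >> b) & ((1 << s) - 1)

x_base = 0x1000
y_base = x_base + 512          # 512 = S*B: exactly one cache size apart
# Without padding, x[i] and y[i] always land in the same set (thrashing):
print([set_index(x_base + i) == set_index(y_base + i) for i in (0, 4, 8)])
# With B = 16 bytes of padding before y, they land in different sets:
y_padded = y_base + 16
print([set_index(x_base + i) == set_index(y_padded + i) for i in (0, 4, 8)])
```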
Why index with the middle bits? See Exercise 6.12 on page 415 and the marginal note on page 416. If the high-order bits were used as the index, contiguous memory blocks would map to the same cache set, so at any moment the cache could hold only a block-sized chunk of an array's contents.
III. Set Associative Caches
E-way set associative cache: 1 < E < C/B.
1. Set selection
The same as for a direct-mapped cache.
2. Line matching and word selection
Each line is treated as a (key, value) pair: the key is the tag together with the valid bit, and the value is the block contents, which are returned on a match.
Important idea: any line in the set can contain any of the memory blocks that map to that set, so the cache must search every line in the set.
The criteria for a match are still the same two conditions:
1. The line is valid
2. The tag matches
3. Line replacement
If there is an empty line, the new line replaces it; if there is no empty line, a replacement policy is applied:
- Random replacement
- Least frequently used (LFU): replace the line that has been referenced the fewest times over some past window.
- Least recently used (LRU): replace the line whose last access lies furthest in the past.
IV. Fully Associative Caches (E = C/B)
1. Set selection
There is only one set, set 0 by default; there are no index bits, and the address is divided into just a tag and a block offset.
2. Line matching and word selection
The same as for set associative caches.
Because every line must be searched for a matching tag, fully associative caches are suitable only for small caches.
V. Writes
1. On a write hit, methods for updating the copy in the next lower level:
(1) Write-through: immediately write w's cache block to the next lower level.
Drawback: every write causes bus traffic.
(2) Write-back: defer the update, writing the block to the next lower level only when the replacement algorithm is about to evict it.
- Advantage: exploits locality; significantly reduces bus traffic
- Drawback: added complexity; an additional dirty bit must be maintained for each cache line
2. Handling write misses
(1) Write-allocate — usually paired with write-back:
Load the block from the next lower level into the cache, then update the cache block.
(2) No-write-allocate — usually paired with write-through:
Bypass the cache and write the word directly to the next lower level.
VI. Anatomy of a Real Cache Hierarchy:
Caches hold both data and instructions.
- Holds instructions only: i-cache
- Holds program data only: d-cache
- Holds both instructions and data: unified cache
VII. Performance Impact of Cache Parameters
1. Performance metrics:
- Miss rate = number of misses / number of references
- Hit rate = 1 - miss rate
- Hit time
- Miss penalty: the additional time required because of a miss
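These metrics combine into an average access time; a minimal sketch (the example values — a 4-cycle hit time, 25% miss rate, 100-cycle miss penalty — are assumptions for illustration):

```python
def avg_access_time(hit_time, miss_rate, miss_penalty):
    """Average access time: every access pays the hit time,
    and a miss additionally pays the miss penalty."""
    return hit_time + miss_rate * miss_penalty

print(avg_access_time(4, 0.25, 100), "cycles")  # 29.0 cycles on average
```

The formula makes the trade-offs in the next list concrete: a parameter change helps overall only if its effect on miss rate outweighs its effect on hit time and miss penalty.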
2. Specific effects (+ means increases, - means decreases):
- Larger cache size: hit rate +, hit time +
- Larger block size: spatial locality exploited +, hit rate +, number of cache lines -, temporal locality exploited -, miss penalty +
- Higher associativity: a larger E means thrashing -, price +, hit time +, miss penalty +, control logic +. The trade-off: when the miss penalty is low, use low associativity; when the miss penalty is high, use high associativity.
- Write strategy: the further down the hierarchy, the more likely a cache is to use write-back rather than write-through
Several concepts in this section are easily confused and lead to mistakes when working problems; keep the definitions above straight.
Section 5: Writing Cache-Friendly Code
1. Basic approach:
- Make the common case go fast
- Minimize the number of cache misses in each inner loop
2. Important rules of thumb:
- Repeated references to local variables are good (temporal locality)
- Stride-1 reference patterns are good (spatial locality)
Section 6: The Memory Mountain
Every computer has a unique memory mountain that characterizes the capabilities of its memory system.
— that is, the performance of the memory system is represented as a mountain of temporal and spatial locality.
Goal: make programs run at the peaks rather than in the valleys.
Aim: exploit temporal locality so that frequently used words are fetched from L1, and exploit spatial locality so that as many words as possible are accessed from a single L1 cache line.
20145225 "Information Security system Design Fundamentals" 7th Week Study Summary