Chapter 6: The Memory Hierarchy
Section 1: Storage Technologies
I. Random Access Memory (RAM)
1. Static RAM (SRAM): each bit is stored in a bistable memory cell; each cell is implemented with a six-transistor circuit.
Property: the cell can stay indefinitely in either of two distinct voltage configurations (states); any other state is unstable.
Feature: because of this bistability, an SRAM cell keeps its value as long as power is applied. Even when interference such as electronic noise perturbs the voltage, the circuit returns to the stable value once the interference is removed.
Application: SRAM is used for cache memories, either on or off the CPU chip.
2. Dynamic RAM (DRAM): each bit is stored as charge on a capacitor; the capacitance is about 30 × 10^-15 F.
Characteristics: particularly sensitive to interference; once the capacitor's voltage is disturbed, it never recovers. Exposure to light also changes the capacitor voltage.
Application: DRAM is used for main memory and for the frame buffers of graphics systems.
II. Nonvolatile Memory (ROM)
PROM: programmable ROM; can be programmed exactly once.
EPROM: erasable programmable ROM; cell contents are cleared with ultraviolet light; can be erased and reprogrammed on the order of 1,000 times.
EEPROM: electrically erasable PROM; can be reprogrammed in place on the printed circuit board; can be erased on the order of 10^5 times.
Flash memory: based on EEPROM. (Solid state disks, SSDs, are based on flash.)
III. Accessing Main Memory
1. Bus: a set of parallel wires that carry address, data, and control signals.
2. Read transaction: transfers data from main memory to the CPU.
Assembly statement: movl A, %eax
a. The CPU places address A on the memory bus.
b. Main memory reads A from the bus, retrieves word x, and places x on the bus.
c. The CPU reads word x from the bus and copies it into register %eax.
3. Write transaction: transfers data from the CPU to main memory.
Assembly statement: movl %eax, A
a. The CPU places address A on the memory bus; main memory reads the address and waits for the data word to arrive.
b. The CPU places the data word y on the bus.
c. Main memory reads the data word y from the bus and stores it at address A.
IV. Disk Storage
1. Disk geometry
2. Disk capacity: the maximum number of bits that can be recorded on a disk is called its maximum capacity, or simply its capacity.
Recording density (bits/in): the number of bits that can be squeezed into a one-inch segment of a track.
Track density (tracks/in): the number of tracks that fit in a one-inch segment of the radius extending from the center of the platter.
Areal density (bits/in^2): the product of the recording density and the track density.
Disk capacity = (# bytes/sector) × (average # sectors/track) × (# tracks/surface) × (# surfaces/platter) × (# platters/disk)
Example:
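With representative parameters (those used in the textbook's example: 512 bytes/sector, an average of 300 sectors/track, 20,000 tracks/surface, 2 surfaces/platter, 5 platters/disk):

Disk capacity = 512 × 300 × 20,000 × 2 × 5 = 30,720,000,000 bytes = 30.72 GB

(Disk manufacturers count GB as 10^9 bytes.)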
3. Disk operation
A disk reads and writes the bits stored on its magnetic surfaces using a read/write head attached to the end of an actuator arm. A seek moves the arm back and forth along the radial axis so that the drive can position the head over any track on the surface.
(1) Dynamic characteristics
At any point in time, all read/write heads are positioned on the same cylinder.
The head at the end of the arm flies on a thin cushion of air about 0.1 microns above the disk surface, at a speed of roughly 80 km/h.
Disks read and write data in sector-sized blocks.
(2) Access time = seek time + rotational latency + transfer time
A. Seek time: to read the contents of a target sector, the arm first positions the head over the track that contains it; the time this takes is the seek time, which is roughly comparable to the maximum rotation time.
The seek time Tseek depends on the previous position of the head and on the speed at which the arm moves across the surface.
B. Rotational latency: once the head is over the desired track, the drive waits for the first bit of the target sector to rotate under the head. This latency depends on the position of the surface when the head arrives at the track and on the rotational speed of the disk.
Tmax rotation = (1 / RPM) × (60 s / 1 min)
Tavg rotation = (1/2) × Tmax rotation
C. Transfer time: when the first bit of the target sector is under the head, the drive can begin reading or writing the contents of the sector. The transfer time depends on the rotational speed and on the number of sectors per track.
Tavg transfer = (1 / RPM) × (1 / (average # sectors/track)) × (60 s / 1 min)
Example:
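With representative parameters (those used in the textbook's example: rotational rate 7,200 RPM, Tavg seek = 9 ms, an average of 400 sectors/track):

Tmax rotation = (1 / 7,200 RPM) × (60 s / 1 min) × (1,000 ms / 1 s) ≈ 8.33 ms
Tavg rotation = (1/2) × 8.33 ms ≈ 4.17 ms
Tavg transfer = (1 / 7,200 RPM) × (1 / 400 sectors/track) × (60 s / 1 min) × (1,000 ms / 1 s) ≈ 0.02 ms
Taccess ≈ 9 + 4.17 + 0.02 ≈ 13.19 ms

Two observations: the access time is dominated by the seek time and the rotational latency, and reading the first byte of a sector is the expensive part; the remaining bytes are essentially free.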
V. Logical Disk Blocks
Main memory can be viewed as an array of bytes; a disk can be viewed as an array of blocks.
Modern disk geometry is complicated: a disk has multiple surfaces, and those surfaces have different recording zones. To hide this complexity from the operating system, modern disks present their geometry as a simple view: a sequence of B sector-sized logical blocks, numbered 0, 1, ..., B-1.
A small hardware/firmware device in the disk, called the disk controller, maintains the mapping between logical block numbers and actual (physical) sectors.
The firmware on the controller performs a fast table lookup that translates a logical block number into a (surface, track, sector) triple uniquely identifying the corresponding physical sector. The controller's hardware interprets this triple, moving the head to the appropriate surface, waiting for the sector to pass under the head, gathering the bits sensed by the head into a small buffer on the controller, and then copying them into main memory.
VI. Connecting I/O Devices
Input/output (I/O) devices such as graphics cards, monitors, mice, keyboards, and disks are connected to the CPU and main memory via an I/O bus.
The system bus and memory bus are specific to the CPU, whereas the I/O bus is designed to be independent of the underlying CPU.
The I/O bus is slower than the system bus and the memory bus, but it can accommodate a wide variety of third-party I/O devices.
Universal Serial Bus (USB): USB 2.0 has a maximum bandwidth of 60 MB/s; USB 3.0 has a maximum bandwidth of 600 MB/s.
Graphics Card (Adapter)
Host Bus Adapter
VII. Accessing Disks
The CPU issues commands to I/O devices using a technique called memory-mapped I/O: in a system with memory-mapped I/O, a block of addresses in the address space is reserved for communicating with I/O devices, and each such address is called an I/O port. When a device is attached to the bus, it is mapped to one or more ports.
Direct memory access (DMA): the device can perform read or write bus transactions on its own, without involving the CPU. Such a transfer is called a DMA transfer.
Example: reading a disk sector. (1) The CPU initiates the read by writing the command, the logical block number, and the destination memory address to the port associated with the disk. (2) The disk controller reads the sector and performs a DMA transfer into main memory. (3) When the transfer completes, the controller notifies the CPU with an interrupt.
VIII. Solid State Disks: A Flash-Based Storage Technology
An SSD package consists of one or more flash memory chips plus a flash translation layer. The flash chips replace the mechanical drive of a conventional rotating disk, and the flash translation layer (a hardware/firmware device) plays the role of the disk controller, translating requests for logical blocks into accesses to the underlying physical devices.
1. Performance characteristics
Sequential reads and writes (where the CPU accesses logical disk blocks in sequential order) perform comparably well, with sequential reads slightly faster than sequential writes.
When logical blocks are accessed in random order, writes are about an order of magnitude slower than reads.
This read/write asymmetry is determined by the properties of the underlying flash memory: a flash block must be erased before any of its pages can be rewritten, and erasing is slow.
2. Advantages
Built from semiconductors, with no moving parts.
Faster random access than rotating disks, lower power consumption, and more rugged.
3. Disadvantages: flash blocks wear out after repeated writes, and SSDs cost more per byte.
IX. Storage Technology Trends
Different storage technologies have different price and performance trade-offs.
The price and performance attributes of different technologies change at dramatically different rates (increasing density, and thereby lowering cost, is much easier than decreasing access time).
DRAM and disk performance lag behind CPU performance.
Section 2: Locality
I. The principle of locality: well-written computer programs tend to have good locality; that is, they tend to reference data items that are near other recently referenced data items, or the recently referenced items themselves.
(1) Two forms: temporal locality and spatial locality.
(2) Programs with good locality run faster than programs with poor locality.
Hardware level: the principle of locality allows computer designers to speed up main-memory accesses by introducing small, fast cache memories that hold the most recently referenced instructions and data items.
Operating system level: the principle of locality allows the system to use main memory as a cache of the most recently referenced chunks of the virtual address space, and as a cache of the most recently used disk blocks in the disk file system.
Application level: for example, a web browser caches recently referenced documents on the local disk.
II. Locality of References to Program Data
1. Stride-k reference pattern
Definition: visiting every kth element of a contiguous vector.
The stride-1 reference pattern: visiting each element of a vector sequentially, sometimes called the sequential reference pattern. It is a common and important source of spatial locality in programs.
In general, spatial locality decreases as the stride increases.
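A minimal C sketch of the pattern (vector length and stride are arbitrary):

    /* Visits every kth element of v: a stride-k reference pattern.
     * k = 1 is the sequential (stride-1) pattern, the best case for
     * spatial locality; as k grows, each cache block loaded
     * contributes fewer useful accesses, so locality degrades. */
    int sum_stride_k(int v[], int n, int k)
    {
        int sum = 0;
        for (int i = 0; i < n; i += k)
            sum += v[i];
        return sum;
    }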
2. Multidimensional arrays
In C, arrays are stored in memory in row-major order, so code that scans an array row by row has better locality than code that scans it column by column.
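The textbook's sumarrayrows/sumarraycols pair makes the contrast concrete; a sketch with arbitrary dimensions M and N:

    #define M 4
    #define N 8

    /* Row-major scan: memory is visited with stride 1 (good spatial
     * locality), matching how C lays out the array. */
    int sumarrayrows(int a[M][N])
    {
        int sum = 0;
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    /* Column-major scan: memory is visited with stride N (poor
     * spatial locality), although the result is identical. */
    int sumarraycols(int a[M][N])
    {
        int sum = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }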
III. Locality of Instruction Fetches
Program instructions are stored in memory, and the CPU must fetch (read) them.
An important property distinguishing code from program data is that code cannot be modified at run time: while a program executes, the CPU only reads its instructions from memory; it never overwrites or modifies them.
IV. Simple Rules for Evaluating the Locality of a Program
Programs that reference the same variables repeatedly have good temporal locality.
For programs with a stride-k reference pattern, the smaller the stride, the better the spatial locality.
Loops have good temporal and spatial locality with respect to instruction fetches: the smaller the loop body and the larger the number of iterations, the better the locality.
Section 3: The Memory Hierarchy
I. Storage Technology and Computer Software
Storage technology: access times vary widely across storage technologies; faster technologies cost more per byte and have smaller capacities than slower ones; and the speed gap between the CPU and main memory keeps widening.
Computer software: well-written programs tend to exhibit good locality.
Moving from the top of the hierarchy to the bottom, storage devices become slower, cheaper, and larger, and each level serves as a cache for the level below it.
II. Caching
Cache and caching: a cache is a small, fast storage device that acts as a staging area for the data objects stored in a larger, slower device. The process of using a cache is called caching.
The central idea of the memory hierarchy: for each k, the faster and smaller storage device at level k serves as a cache for the larger and slower storage device at level k+1. That is, each level is a "cache" for the next level down.
Data is always copied back and forth between level k and level k+1 in block-sized transfer units. The block size is fixed between any particular pair of adjacent levels, but other pairs of levels can have different block sizes.
1. Cache hits
When a program needs a data object d from level k+1, it first looks for d in one of the blocks currently stored at level k. If d happens to be cached at level k, we have a cache hit.
The program reads d directly from level k, which is faster than reading d from level k+1.
2. Cache misses
That is, the data object d is not cached at level k.
The level-k cache then fetches from level k+1 the block containing d. If the level-k cache is already full, fetching the block may overwrite an existing block. Overwriting an existing block is called replacing or evicting it; the evicted block is sometimes called the victim block.
Replacement policy: decides which block to replace.
Kinds of misses:
Compulsory misses/cold misses: the level-k cache is empty (a cold cache), so any access to a data object misses. These are usually transient events that do not occur in steady state, after the cache has been warmed up by repeated accesses.
Conflict misses: caused by a restrictive placement policy that maps blocks from level k+1 to a small subset of the positions at level k. The cache as a whole may not be full, but the positions those blocks map to are, so the accesses keep missing.
Capacity misses: when the size of the working set exceeds the size of the cache, the cache suffers capacity misses; the cache is simply too small to hold the working set.
III. Cache management: at every level, some form of logic must manage the cache; that logic can be hardware, software, or a combination of the two.
IV. Summary of Concepts
1. Exploiting temporal locality
The same object may be used multiple times. Once a data object has been copied into the cache on the first miss, we expect a series of subsequent hits on that object. Since the cache is faster than the storage at the next lower level, these later hits are served much faster than the original miss.
2. Exploiting spatial locality
A block usually contains multiple data objects. Because of spatial locality, we expect that later accesses to the other objects in the block will amortize the cost of copying the block in after a miss.
3. Caches are everywhere
Section 4: Cache Memories
I. Introduction
1. The memory hierarchy of early computer systems: CPU registers, DRAM main memory, and disk storage.
2. Caches inserted between the CPU and main memory:
L1 cache: between the CPU register file and main memory; access time about 2-4 clock cycles.
L2 cache: between the L1 cache and main memory; access time about 10 clock cycles.
L3 cache: between the L2 cache and main memory; access time about 30-40 clock cycles.
3. Typical bus structure for cache memories.
II. Generic Cache Memory Organization
A cache is an array of cache sets. The cache organization partitions the m address bits into t tag bits, s set index bits, and b block offset bits, and can be characterized by the tuple (S, E, B, m):
m: each memory address has m bits, giving M = 2^m distinct addresses.
S: the array contains S = 2^s cache sets.
E: each set contains E cache lines.
B: each line holds a data block of B = 2^b bytes.
Tag bits: t = m - (b + s); they uniquely identify the block stored in the cache line.
Valid bit: each line has one valid bit indicating whether the line contains meaningful information.
Cache size/capacity C: C = S × E × B. This counts only the data blocks, excluding the tag bits and valid bits.
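To make the bit partition concrete, here is a minimal C sketch that splits an address into its tag, set index, and block offset; the parameter values are illustrative assumptions, not from the text:

    #include <stdio.h>

    /* Assumed geometry: s = 2 set index bits, b = 2 block offset bits,
     * so S = 4 sets and B = 4-byte blocks. */
    enum { S_BITS = 2, B_BITS = 2 };

    unsigned block_offset(unsigned addr) { return addr & ((1u << B_BITS) - 1); }
    unsigned set_index(unsigned addr)    { return (addr >> B_BITS) & ((1u << S_BITS) - 1); }
    unsigned tag_of(unsigned addr)       { return addr >> (B_BITS + S_BITS); }

    int main(void)
    {
        unsigned addr = 0xAB;   /* an arbitrary address */
        printf("addr 0x%X -> tag %u, set %u, offset %u\n",
               addr, tag_of(addr), set_index(addr), block_offset(addr));
        return 0;
    }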
III. Direct-Mapped Caches
1. Caches are grouped into classes by E, the number of cache lines per set; a cache with E = 1 is called a direct-mapped cache.
The process by which a cache determines whether a request hits and then extracts the requested word has three steps: (1) set selection, (2) line matching, (3) word extraction.
Set selection: the cache extracts the s set index bits from the middle of the address of w.
Set index bits: an unsigned integer that corresponds to a set number.
The request hits if and only if two conditions hold:
the line has its valid bit set, and
the tag in the cache line matches the tag in the address of w.
Word extraction: the block offset bits give the location of the first byte of the desired word within the block.
2. Line replacement on a cache miss: the newly fetched line simply replaces the current line.
3. A direct-mapped cache in action
Concatenated together, the tag and index bits uniquely identify each block in memory.
With 8 memory blocks and 4 cache sets, multiple blocks map to the same cache set (because they share the same set index).
Blocks that map to the same cache set are uniquely distinguished by their tag bits.
4. The CPU performs a read as a series of steps (see the sketch after this list):
use the set index bits to determine which set to look in;
then check whether the line in that set is valid:
if it is not valid, the cache misses; the cache fetches the block from memory (or the next lower level), stores it in that set's line, sets the valid bit to 1, and returns the requested value;
if it is valid, compare the tags: on a match the cache hits and returns the desired value; otherwise it replaces the line and then returns the value.
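A compact C sketch of this read sequence for a direct-mapped cache; the geometry and the fetch_block helper are illustrative assumptions, and only reads are handled:

    #include <stdbool.h>

    /* Assumed geometry: S = 4 sets, B = 4-byte blocks, E = 1. */
    enum { NSETS = 4, BLOCK = 4 };

    struct line {
        bool valid;
        unsigned tag;
        unsigned char block[BLOCK];
    };

    static struct line cache[NSETS];

    /* fetch_block stands in for the next lower level: it copies the
     * block starting at blk_addr into buf. Returns true on a hit. */
    bool cache_read(unsigned addr, unsigned char *byte,
                    void (*fetch_block)(unsigned blk_addr, unsigned char *buf))
    {
        unsigned offset = addr % BLOCK;
        unsigned set    = (addr / BLOCK) % NSETS;    /* 1. set selection */
        unsigned tag    = addr / (BLOCK * NSETS);

        struct line *ln = &cache[set];
        bool hit = ln->valid && ln->tag == tag;      /* 2. line matching */
        if (!hit) {                                  /* miss: fetch, replace, set valid */
            fetch_block(addr - offset, ln->block);
            ln->valid = true;
            ln->tag   = tag;
        }
        *byte = ln->block[offset];                   /* 3. word extraction */
        return hit;
    }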
5. Conflict misses in direct-mapped caches
Cause: thrashing between blocks, i.e., distinct blocks that map to the same cache set keep evicting one another.
(Thrashing: the cache repeatedly loads and evicts the same sets of cache blocks.)
Workaround: put B bytes of padding at the end of each array (B bytes is the length of one block, and one line holds one block, so the padding effectively shifts the later array) so that the arrays map to different sets (see the sketch below).
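The textbook illustrates the problem with a dot product over two arrays that happen to map to the same sets; a sketch (the cache geometry that causes the conflict is assumed, as in the text's example):

    /* If x[i] and y[i] map to the same cache set, each access evicts
     * the block the previous access just loaded: thrashing. */
    float dotprod(float x[8], float y[8])
    {
        float sum = 0.0;
        for (int i = 0; i < 8; i++)
            sum += x[i] * y[i];
        return sum;
    }

    /* Workaround: pad the definition of x, e.g.
     *     float x[12], y[12];   // only x[0..7] and y[0..7] are used
     * The padding shifts y to different sets, so x[i] and y[i]
     * no longer conflict. */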
IV. Set-Associative Caches
1. Set selection: identical to set selection in a direct-mapped cache; the set index bits identify the set.
2. Line matching and word selection
Think of each set as a small associative memory: an array of (key, value) pairs that takes a key as input and returns the value from the matching pair. The cache must search every line in the set, looking for a valid line whose tag matches the tag of the address.
The form is (key, value): the key is the tag together with the valid bit, and the value is the contents of the block; a match on the key returns the value.
Any line in the set can contain any of the memory blocks that map to that set, which is why the cache must search every line in the set.
3. Line replacement on a miss in a set-associative cache: the simplest policy is random replacement; more sophisticated policies exploit locality:
Least-frequently-used (LFU) policy: replace the line that was referenced the fewest times over some past time window.
Least-recently-used (LRU) policy: replace the line whose last access lies furthest in the past.
V. Fully Associative Caches
1. Set selection: there is only one set, so there are no set index bits.
2. Line matching and word selection: the same as in a set-associative cache, but on a much larger scale, so fully associative caches are only suitable for small caches, such as the translation lookaside buffers (TLBs) in virtual memory systems.
VI. Writes
1. On a write hit, how to update the copy at the next lower level:
Write-through: immediately write w's updated cache block through to the next lower level.
Disadvantage: every write causes bus traffic.
Write-back: defer the write to the lower level until the updated block is about to be evicted by the replacement algorithm.
Advantage: exploits locality; significantly reduces bus traffic.
Disadvantage: added complexity; the cache must maintain an extra modification (dirty) bit for each cache line.
2. How to handle write misses (see the sketch below):
Write-allocate (usually paired with write-back): load the block from the next lower level into the cache, then update the cache block.
No-write-allocate (usually paired with write-through): bypass the cache and write the word directly to the next lower level.
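A minimal sketch of the write-back bookkeeping (the dirty flag is the extra modification bit mentioned above; the geometry and the eviction machinery are assumed):

    #include <stdbool.h>

    enum { BLOCK = 4 };   /* assumed block size */

    struct wb_line {
        bool valid;
        bool dirty;       /* modification bit for write-back */
        unsigned tag;
        unsigned char block[BLOCK];
    };

    /* Write-back + write-allocate, in outline:
     *  - write hit:  update the cached copy and set dirty; main
     *    memory is not touched yet.
     *  - write miss: fetch the block into the line (write-allocate),
     *    then update it and set dirty.
     *  - eviction:   if the victim line is dirty, write its block
     *    back to the next lower level before replacing it. */
    void cache_write_byte(struct wb_line *ln, unsigned offset, unsigned char v)
    {
        ln->block[offset] = v;   /* update only the cached copy */
        ln->dirty = true;        /* write back later, on eviction */
    }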
VII. Performance Impact of Cache Parameters
1. Performance metrics
Miss rate = # misses / # references
Hit rate = 1 - miss rate
Hit time: the time to deliver a word from the cache to the CPU, including set selection, line matching, and word extraction.
Miss penalty: the additional time required because of a miss.
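These metrics combine into an average access time. With illustrative numbers (assumed here, not from the text) of hit time = 4 cycles, miss penalty = 100 cycles, and miss rate = 5%:

Average access time = hit time + miss rate × miss penalty = 4 + 0.05 × 100 = 9 cycles

This is why miss rate, rather than hit rate, is the quantity worth tracking: cutting the miss rate from 5% to 2.5% lowers the average from 9 to 6.5 cycles, even though the hit rate "only" rises from 95% to 97.5%.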
2. Specific impact
Cache size: a larger cache can improve the hit rate, but making a large memory run fast is harder.
Block size: larger blocks can improve the hit rate by exploiting whatever spatial locality a program has, but for a fixed cache size they mean fewer cache lines, which hurts programs whose locality is mostly temporal.
Associativity: higher associativity (larger E) reduces the chance of thrashing due to conflict misses, but it comes at a higher cost.
Write policy: write-through caches are easier to implement and can use a write buffer independent of the cache to update memory, and read misses are cheaper because they never trigger a write-back. Write-back caches cause fewer transfers, leaving more memory bandwidth for I/O devices that perform DMA. The lower the level in the hierarchy, the more likely a cache is to use write-back rather than write-through.
Section 5: Writing Cache-Friendly Code
1. Basic approach for ensuring that code is cache friendly:
Make the common case go fast.
Minimize the number of cache misses in each inner loop.
2. Repeated references to local variables are good because the compiler can cache them in the register file (temporal locality).
3. Stride-1 reference patterns are good because caches at every level of the memory hierarchy store data as contiguous blocks (spatial locality); see the sketch below.
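Both rules together, in a minimal sketch (sumvec is the textbook's canonical cache-friendly example):

    /* Cache friendly: the accumulator sum is a local variable that
     * the compiler can keep in a register (temporal locality), and
     * v is scanned with a stride-1 pattern (spatial locality). */
    int sumvec(int v[], int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += v[i];
        return sum;
    }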
References:
1. Computer Systems: A Programmer's Perspective ("In-Depth Understanding of Computer Systems"), Chapter 6
2. Shiyanlou course: In-Depth Understanding of Computer Systems, Experiment 7
3. 20135317 Han Yuqi's blog: http://www.cnblogs.com/hyq20135317/p/4905723.html