5233 Yeung Kwong--Week 7 Study Report


Duration of study: 5 hours

Learning tasks: "Computer Systems: A Programmer's Perspective" ("In-Depth Understanding of Computer Systems"), Chapter 6: the storage technology and cache sections

(i) Storage technology

A memory system is a hierarchy of storage devices with different capacities, costs, and access times. The CPU registers hold the most frequently used data.

Small, fast cache memories sit close to the CPU; the storage devices lower in the hierarchy are slower, larger, and cheaper.

Basic Storage Technology

    1. SRAM memory
    2. DRAM memory
    3. ROM memory
    4. Rotating and solid state disks

Random access memory

RAM is classified as static (SRAM) or dynamic (DRAM). SRAM is faster and more expensive and is used for cache memories; DRAM is used for main memory and for the frame buffers of graphics systems.

**Static RAM**: stores each bit in a bistable memory cell, which holds its value as long as power is applied.

**Dynamic RAM**: stores each bit as charge on a capacitor. Once the capacitor's voltage is disturbed, the cell never recovers on its own, so the memory system must periodically refresh every bit by reading it out and then rewriting it.

**Conventional DRAM**: the cells (bits) in a DRAM chip are partitioned into d supercells. Each DRAM chip is connected to circuitry known as the memory controller, which transfers data in and out of the chip.

**Memory modules**: DRAM chips are packaged in memory modules; common packages include the dual inline memory module (DIMM) and the single inline memory module (SIMM).

**Enhanced DRAMs**:

    1. Fast page mode DRAM (FPM DRAM)
    2. Extended data out DRAM (EDO DRAM)
    3. Synchronous DRAM (SDRAM)
    4. Double data-rate synchronous DRAM (DDR SDRAM)
    5. Rambus DRAM (RDRAM)
    6. Video RAM (VRAM)

**Nonvolatile memory**: retains its information even when the power is turned off. Nonvolatile memories are referred to collectively as read-only memories (ROMs), and ROMs are distinguished by the number of times they can be reprogrammed (written to) and by the mechanism used for reprogramming.

    1. PROM: can be programmed exactly once
    2. Erasable programmable ROM (EPROM)
    3. Flash memory

**Accessing main memory**: data flows back and forth between the processor and DRAM main memory over shared electrical conduits called buses. A read transaction transfers data from main memory to the CPU; a write transaction transfers data from the CPU to main memory.

    1. A bus is a collection of parallel wires that carry address, data, and control signals.
    2. A typical computer system configuration: a CPU chip, an I/O bridge, and the DRAM memory modules that make up main memory.
    3. A system bus connects the CPU to the I/O bridge; a memory bus connects the I/O bridge to main memory.

Disk storage

**Disk geometry**: disks are constructed from platters, each of which has two sides, or surfaces. A rotating spindle in the center of the platter spins the platter at a fixed rotational rate.

**Disk capacity**: determined by the following technology factors:

    1. Recording density: the number of bits that can be squeezed into a one-inch segment of a track.
    2. Track density: the number of tracks that can be squeezed into a one-inch segment of the radius extending from the center of the platter.
    3. Areal density: the product of the recording density and the track density.

Disk capacity = (bytes/sector) * (average sectors/track) * (tracks/surface) * (surfaces/platter) * (platters/disk)
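To make the formula concrete, here is a minimal C sketch that plugs in purely illustrative parameters (all values below are assumptions, not taken from any particular drive):

```c
#include <stdio.h>

/* Hypothetical disk parameters, chosen only to illustrate the formula. */
#define BYTES_PER_SECTOR        512
#define AVG_SECTORS_PER_TRACK   300
#define TRACKS_PER_SURFACE    20000
#define SURFACES_PER_PLATTER      2
#define PLATTERS_PER_DISK         5

int main(void)
{
    long long capacity = (long long)BYTES_PER_SECTOR
                       * AVG_SECTORS_PER_TRACK
                       * TRACKS_PER_SURFACE
                       * SURFACES_PER_PLATTER
                       * PLATTERS_PER_DISK;
    /* Disk manufacturers count 1 GB as 10^9 bytes. */
    printf("Capacity: %lld bytes (%.2f GB)\n", capacity, capacity / 1e9);
    return 0;
}
```

With these numbers the formula gives 30,720,000,000 bytes, or 30.72 GB.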

**Disk operation**: disks read and write bits stored on the magnetic surface using a read/write head attached to the end of an actuator arm.

Disks read and write data in sector-size blocks. The access time for a sector has three main components: seek time, rotational latency, and transfer time.

    1. Seek time: to read the contents of a target sector, the arm first positions the read/write head over the track that contains the target sector. The time required to move the arm is called the seek time.
    2. Rotational latency: maximum rotational latency T_max-rotation = (1/RPM) * (60 secs / 1 min). The average rotational latency T_avg-rotation is half of T_max-rotation.
    3. Transfer time: T_avg-transfer = (1/RPM) * (1 / (average sectors/track)) * (60 secs / 1 min).
    4. Estimated total access time = T_avg-seek + T_avg-rotation + T_avg-transfer. Because the average seek time is roughly equal to the average rotational latency, twice the seek time gives an easy estimate of the disk access time. (A numeric sketch follows this list.)
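A numeric sketch of the access-time estimate in C, using assumed drive parameters (7,200 RPM, 9 ms average seek, 400 sectors per track; none of these values come from the text above):

```c
#include <stdio.h>

/* Assumed drive parameters, for illustration only. */
#define RPM                    7200.0
#define T_AVG_SEEK_MS             9.0
#define AVG_SECTORS_PER_TRACK   400.0

int main(void)
{
    double t_max_rotation = (1.0 / RPM) * 60.0 * 1000.0;   /* ms per full revolution */
    double t_avg_rotation = t_max_rotation / 2.0;          /* on average, half a turn */
    double t_avg_transfer = t_max_rotation / AVG_SECTORS_PER_TRACK;  /* one sector */
    double t_access = T_AVG_SEEK_MS + t_avg_rotation + t_avg_transfer;

    printf("seek %.2f ms + rotation %.2f ms + transfer %.2f ms = %.2f ms\n",
           T_AVG_SEEK_MS, t_avg_rotation, t_avg_transfer, t_access);
    return 0;
}
```

The output makes the point in item 4 concrete: the seek and rotational components dominate, while the transfer time (about 0.02 ms here) is almost negligible.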

**Logical disk blocks**

    1. Firmware on the controller performs a fast table lookup that translates a logical block number into a (surface, track, sector) triple that uniquely identifies the corresponding physical sector.
    2. The disk controller maintains the mapping between logical block numbers and actual (physical) disk sectors.

**Connecting I/O devices**

Devices are connected to the I/O bus in three different ways:

    1. Universal Serial Bus (USB)
    2. Graphics card (or adapter)
    3. Host bus adapter

Other devices, such as network adapters, can be attached to the I/O bus by plugging the adapter into an empty expansion slot on the motherboard, which provides a direct electrical connection to the bus.

Solid-state Drives

An SSD is a storage technology based on flash memory. An SSD package consists of one or more flash memory chips and a flash translation layer.

Storage Technology Trends

    1. Different storage technologies have different price and performance trade-offs.
    2. The price and performance properties of different storage technologies change at dramatically different rates.
    3. DRAM and disk performance lag behind CPU performance.

(ii) Locality

A well-written program has good locality: it tends to reference data items that are near other recently referenced data items, or to reference recently referenced data items themselves.

Locality takes two distinct forms: temporal locality and spatial locality.

Programs with good locality run faster than programs with poor locality.

Locality of references to program data

Stride-1 reference pattern (sequential reference pattern): a function that visits each element of a vector sequentially has a stride-1 reference pattern. Visiting every k-th element of a contiguous vector is called a stride-k reference pattern; as the stride increases, spatial locality decreases.
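A minimal sketch of the two patterns in C (in the spirit of the book's sumvec example; the stride-k variant is my own illustration):

```c
/* Stride-1 (sequential) reference pattern: elements of v are visited
 * in order, so spatial locality is good. The running sum is reused on
 * every iteration, giving good temporal locality as well. */
int sumvec(int v[], int n)
{
    int i, sum = 0;
    for (i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

/* Stride-k reference pattern: only every k-th element is visited.
 * As k grows, spatial locality decreases. */
int sumvec_stride(int v[], int n, int k)
{
    int i, sum = 0;
    for (i = 0; i < n; i += k)
        sum += v[i];
    return sum;
}
```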

Locality of instruction fetches

    1. Program instructions are stored in memory and must be fetched (read) by the CPU, so we can also evaluate the locality of a program with respect to its instruction fetches.

    2. An important property that distinguishes code from program data is that code is rarely modified at run time: while a program is executing, the CPU reads its instructions from memory and rarely overwrites or modifies them.

Summary of locality

    1. Programs that repeatedly reference the same variables have good temporal locality.
    2. For programs with a stride-k reference pattern, the smaller the stride, the better the spatial locality. Programs with a stride-1 reference pattern have good spatial locality; programs that jump around memory with large strides have poor spatial locality.
    3. For instruction fetches, loops have good temporal and spatial locality. The smaller the loop body and the greater the number of loop iterations, the better the locality. (A sketch contrasting two traversal orders follows this list.)
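A sketch contrasting the two traversal orders (modeled on the book's sumarrayrows/sumarraycols examples; the dimensions are arbitrary):

```c
#define M 64   /* arbitrary dimensions, for illustration only */
#define N 64

/* Row-major scan: stride-1 through memory, because C stores 2-D
 * arrays row by row. Good spatial locality. */
int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column-major scan: stride-N through memory, jumping N elements
 * between consecutive references. Poor spatial locality. */
int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
```

Both functions compute the same sum; only the reference pattern differs, and that difference alone can change the run time noticeably on a real machine.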

(iii) Memory hierarchy

Memory structure

    1. A cache is a small, fast storage device that acts as a staging area for the data objects stored in a larger, slower device. The process of using a cache is known as caching.

    2. The central idea of a memory hierarchy: for each k, the faster and smaller storage device at level k serves as a cache for the larger and slower storage device at level k+1.

    3. The storage at level k+1 is partitioned into contiguous chunks of data objects called blocks. Data is always copied back and forth between level k and level k+1 in block-size transfer units. While the block size is fixed between any particular pair of adjacent levels in the hierarchy, other pairs of levels can have different block sizes.

**Cache hit**: when a program needs a data object d from level k+1, it first looks for d in one of the blocks currently stored at level k; if d happens to be cached at level k, we have a cache hit.

**Cache miss**: when a program needs a data object d from level k+1, first looks for d in the blocks currently stored at level k, and level k has not cached d, we have a cache miss.

The process of overwriting an existing block is known as replacing or evicting the block; the evicted block is also called the victim block. The decision about which block to replace is governed by the cache's replacement policy.

**Kinds of cache misses**: if the cache at level k is empty, any access to a data object will miss.

    1. Cold cache: an empty cache. Misses of this kind are called compulsory misses, or cold misses. They are usually transient events that do not occur in steady state, once the cache has been warmed up by repeated accesses to memory.
    2. Whenever there is a miss, the cache at level k must implement a placement policy that determines where to put the block it has retrieved from level k+1.
    3. When the size of the working set exceeds the size of the cache, the cache experiences capacity misses.

**Cache management**: the essence of the memory hierarchy is that the storage device at each level is a cache for the next lower level, and at each level some form of logic must manage the cache; that is, something must partition the cache storage into blocks, transfer blocks between levels, decide whether a request is a hit or a miss, and handle it accordingly.

Summary of memory hierarchy concepts

**Exploiting temporal locality**: because of temporal locality, the same data object is likely to be used multiple times. Since the cache is faster than the storage at the next lower level, subsequent hits can be served much faster than the original miss.

**Exploiting spatial locality**: blocks usually contain multiple data objects. Because of spatial locality, we can expect that the cost of copying a block after a miss will be amortized by subsequent references to other objects within that block.

(iv) Cache memories

**Generic cache memory organization**

A cache's organization can be characterized by the tuple (S, E, B, m): S sets, E lines per set, B bytes per block, and m address bits. The size (capacity) C of a cache refers to the aggregate size of all its blocks; the tag bits and valid bits are not included. C = S * E * B.
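The hardware's view of an address follows directly from (S, E, B, m): the low b = log2(B) bits are the block offset, the next s = log2(S) bits are the set index, and the remaining t = m - (s + b) bits are the tag. A minimal C sketch (the geometry below is an assumption for illustration):

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical geometry: S = 2^4 = 16 sets, B = 2^6 = 64-byte blocks. */
#define S_BITS 4
#define B_BITS 6

/* Split an address into (tag, set index, block offset), mirroring
 * how the cache interprets it. */
static void decompose(uint64_t addr)
{
    uint64_t offset = addr & ((1ULL << B_BITS) - 1);
    uint64_t set    = (addr >> B_BITS) & ((1ULL << S_BITS) - 1);
    uint64_t tag    = addr >> (B_BITS + S_BITS);
    printf("addr 0x%llx -> tag 0x%llx, set %llu, offset %llu\n",
           (unsigned long long)addr, (unsigned long long)tag,
           (unsigned long long)set, (unsigned long long)offset);
}

int main(void)
{
    decompose(0x1234);   /* example address */
    return 0;
}
```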

**Direct-mapped caches**

Caches are grouped into different classes according to E, the number of cache lines per set. A cache with exactly one line per set (E = 1) is called a direct-mapped cache.

The process by which the cache determines whether a request is a hit and then extracts the requested word has three steps:

    1. Set selection
    2. Line matching
    3. Word extraction

**Set selection in direct-mapped caches**

The cache extracts the s set index bits from the middle of the address for w; these bits are interpreted as an unsigned integer that corresponds to a set number.

**Line matching in direct-mapped caches**

Determine whether a copy of the word w is stored in the cache line contained in set i. A copy of w is contained in the line if and only if the valid bit is set and the tag in the cache line matches the tag in the address of w. If the valid bit is not set or the tags do not match, we have a cache miss.

**Word selection in direct-mapped caches**

Determine where the desired word starts in the block. The block offset bits provide the offset of the first byte of the desired word.

**Line replacement on misses in direct-mapped caches**

If the cache misses, it retrieves the requested block from the next level in the memory hierarchy and stores the new block in one of the cache lines of the set indicated by the set index bits.

**Putting it together: a direct-mapped cache in action**

    1. The tag and index bits concatenated together uniquely identify each block in memory.
    2. Because there are eight memory blocks but only four cache sets, multiple blocks map to the same cache set (i.e., they have the same set index).
    3. Blocks that map to the same cache set are uniquely identified by their tag bits.

**Conflict misses in direct-mapped caches**

Even when a program has good spatial locality and the cache has enough room to hold its blocks, each reference can still cause a conflict miss, because the blocks map to the same cache set. (The dot-product sketch below illustrates this.)
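The book illustrates this with a dot-product function; a sketch follows (the thrashing behavior assumes a small direct-mapped cache in which the blocks of x and y happen to map to the same sets, as in the book's example):

```c
/* With x and y laid out back to back in memory, their blocks can map
 * to the same sets of a small direct-mapped cache. Each access to
 * x[i] then evicts the block holding y[i] and vice versa, so the
 * cache thrashes even though it has room for both arrays. */
float dotprod(float x[8], float y[8])
{
    float sum = 0.0;
    int i;

    for (i = 0; i < 8; i++)
        sum += x[i] * y[i];
    return sum;
}
```

The book's fix is padding: declaring x with a few extra elements (e.g., float x[12]) shifts y to different sets and eliminates the thrashing.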

Set associative caches

**Set selection in set associative caches**

Set selection is identical to set selection in a direct-mapped cache: the set index bits identify the set.

**Line matching and word selection in set associative caches**

Line matching is more involved than in a direct-mapped cache because the cache must check the tags and valid bits of multiple lines to determine whether the requested word is in the set.

An important idea: any line in the set can contain any of the memory blocks that map to that set, so the cache must search each line in the set for a valid line whose tag matches the tag in the address. If the cache finds such a line, the block offset selects a word from the block, as before. (A sketch of this search follows.)
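A software sketch of the search (real hardware compares all E tags in parallel; the associativity and types below are illustrative assumptions):

```c
#include <stdint.h>

#define E 8   /* lines per set, for illustration */

/* One cache line: a valid bit, a tag, and the data block (omitted). */
struct cache_line {
    int      valid;
    uint64_t tag;
    /* uint8_t block[B]; */
};

/* Search every line in the set for a valid line whose tag matches
 * the tag bits of the address. */
int find_line(const struct cache_line set[E], uint64_t tag)
{
    int i;
    for (i = 0; i < E; i++)
        if (set[i].valid && set[i].tag == tag)
            return i;   /* hit: index of the matching line */
    return -1;          /* miss */
}
```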

**Line replacement on misses in set associative caches**

On a miss, the cache must fetch the block containing the requested word from memory. If there is an empty line, it can be used; if there is no empty line, the cache must choose a non-empty line to replace, hoping that the CPU will not reference the replaced line soon. Two common policies:

    1. Least frequently used (LFU) policy
    2. Least recently used (LRU) policy

Fully associative caches

A fully associative cache consists of a single set that contains all of the cache lines (i.e., E = C/B).

**Set selection in fully associative caches**

There is only one set, so there are no set index bits in the address; the address is partitioned into only a tag and a block offset.

**Line matching and word selection in fully associative caches**

Line matching and word selection work the same way as in a set associative cache; the difference is mainly a question of scale. Because the cache circuitry must search many matching tags in parallel, fully associative caches are appropriate only for small caches.

Issues with writes

**Write hits**:

    1. Write-through: immediately write w's cache block to the next lower level as soon as the cached copy of w is updated. The disadvantage is that every write causes bus traffic.

    2. Write-back: defer the update to the lower level as long as possible, writing the updated block to the lower level only when the replacement algorithm is about to evict it. Write-back can significantly reduce bus traffic; the disadvantage is added complexity: the cache must maintain an additional dirty bit for each cache line that indicates whether the cache block has been modified.

**Write misses**:

    1. Write-allocate: load the corresponding block from the next lower level into the cache and then update the cache block. The disadvantage is that every miss causes a block transfer from the lower level to the cache.
    2. No-write-allocate: bypass the cache and write the word directly to the lower level. Write-through caches are typically no-write-allocate; write-back caches are typically write-allocate. (A sketch of the write-back/write-allocate pairing follows this list.)
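A sketch of the typical write-back/write-allocate pairing described above (the structure and helper functions are illustrative stand-ins, not a real cache's interface):

```c
#include <stdint.h>

struct line {
    int      valid;
    int      dirty;   /* the extra bit a write-back cache must keep */
    uint64_t tag;
    /* data block omitted */
};

/* Stand-ins for traffic to the next lower level of the hierarchy. */
static void flush_to_lower_level(struct line *ln) { (void)ln; /* write dirty block back */ }
static void fetch_from_lower_level(struct line *ln, uint64_t tag)
{
    ln->tag   = tag;   /* load the block containing the word */
    ln->valid = 1;
    ln->dirty = 0;
}

/* Handle a write to the word whose address has tag bits `tag`. */
static void write_word(struct line *ln, uint64_t tag)
{
    if (!(ln->valid && ln->tag == tag)) {      /* write miss */
        if (ln->valid && ln->dirty)
            flush_to_lower_level(ln);          /* evict the dirty victim */
        fetch_from_lower_level(ln, tag);       /* write-allocate */
    }
    /* On a hit (or once allocated): update only the cached copy and
     * mark it dirty; memory is updated lazily, at eviction time. */
    ln->dirty = 1;
}
```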

Anatomy of a real cache hierarchy

Caches can hold instructions as well as data. A cache that holds only instructions is called an i-cache; a cache that holds only program data is called a d-cache; a cache that holds both instructions and data is called a unified cache.

Performance impact of cache parameters

Metrics to measure cache performance:

    1. Miss rate
    2. Hit rate
    3. Hit time
    4. Miss penalty

**Impact of cache size**: a larger cache tends to increase the hit rate, but large memories are harder to make run fast, so a larger cache also tends to increase the hit time.

**Impact of block size**: larger blocks can exploit whatever spatial locality a program may have, helping to increase the hit rate. However, for a given cache size, larger blocks mean fewer cache lines, which can hurt the hit rate of programs with more temporal locality than spatial locality. Larger blocks also increase the miss penalty, since a larger block takes longer to transfer.

**Impact of associativity**: higher associativity (a larger value of E) decreases the vulnerability of the cache to thrashing due to conflict misses. But higher associativity is costly to implement and hard to make fast: it can increase the hit time and increase the miss penalty, because choosing a victim line becomes more complex.

**Impact of write strategy**: write-through caches are simpler to implement and can use a write buffer, independent of the cache, to update memory; read misses are also less expensive because they do not trigger a memory write. On the other hand, write-back caches result in fewer transfers. In general, the lower a cache is in the hierarchy, the more likely it is to use write-back.

Resources
    1. "Computer Systems: A Programmer's Perspective" ("In-Depth Understanding of Computer Systems")
    2. "Embedded Linux Application Development Standard Tutorial"
Experience

This week's material was generally easy to follow, but a few problems remain:

Locality section: for a loop like for (i = 0; i < m; i++) result[i] = x[i] + y[i];, I cannot see why its spatial locality is considered good. Doesn't it read x[0] first and then jump over to y[0]? If so, the reads are not contiguous.

Everything else was fine, and I now have a deeper understanding of computer storage.

