Architecture Review CH7: Thread-Level Parallelism

7.1 Multiprocessors and thread-level parallelism

7.1.1 Multiprocessor architecture
Thread-level parallelism means a multiprocessor supports the concurrent execution of multiple threads. Multiprocessor architectures fall broadly into two types:
- Symmetric shared-memory multiprocessor (SMP): also known as a centralized shared-memory architecture. It has a small number of cores sharing a single centralized memory, and all processors have equal access time to it (hence also called UMA, uniform memory access). The storage hierarchy of an SMP divides broadly into three tiers: shared main memory, shared cache, and private caches; the central topic of this chapter is coherence between the private caches and shared storage.
- Distributed shared memory (DSM): a multiprocessor using physically distributed memory, with the cores and their distributed memories connected by a high-speed interconnection network. Memory access times in DSM are non-uniform: a core clearly accesses its local memory faster than the memory of other nodes, and the access speed to other nodes' memory also depends on the network topology between nodes (hence also called NUMA, non-uniform memory access). "Shared memory" in DSM means a shared address space, and DSM must focus on the coherence of distributed shared storage.
7.1.2 Challenges of parallel processing
Parallel processing faces two important challenges:
- Limited parallelism in programs: by Amdahl's law, the speedup bottleneck is the proportion of the serial part
- Large remote-access latency in multiprocessors: latency between different cores on the same chip, and between cores on different chips
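The Amdahl's-law bound mentioned in the first bullet is easy to compute. A minimal sketch (function name is my own, not from the text):

```python
# Amdahl's law: with a fraction f of the work parallelizable over n
# processors, speedup = 1 / ((1 - f) + f / n).
def amdahl_speedup(parallel_fraction: float, n_processors: int) -> float:
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)

# Even with 95% parallel code, 100 processors give less than 17x:
print(round(amdahl_speedup(0.95, 100), 2))   # 16.81
# The serial fraction bounds the speedup: as n -> inf, speedup -> 1/(1-f)
print(round(1 / (1 - 0.95), 1))              # 20.0
```

This is exactly why the serial proportion, not the processor count, is the bottleneck: adding processors beyond a point barely helps.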
The strategies to address these two issues are:
- Better algorithms that expose more parallelism
- Better architectures and programming techniques that reduce or hide remote-access latency
7.2 Centralized shared-memory architecture and the snooping coherence protocol
Suppose processors A and B have both read x (each holds a copy of x in its cache). A then modifies x and writes it back to main memory, but at this point the copy in B's cache is still the unmodified x: the caches are now inconsistent.
7.2.1 Cache coherence policies and methods
There are two strategies for ensuring cache coherence:
- Snooping: if a cache holds a copy of a physical memory block, it tracks the shared state of that block itself; SMPs mainly use snooping cache-coherence protocols
- Directory: the shared state of each physical memory block is kept in a dedicated directory, and a cache queries the directory to learn a block's shared state; DSMs mostly use distributed-directory coherence protocols
There are two ways of implementing cache coherence:
- Write invalidate: when a processor writes to its copy of a shared physical memory block, all other caches holding a copy of that block are invalidated
- Write update: when a processor writes to its copy of a shared physical memory block, all other caches holding a copy of that block are updated with the written value
Because write update consumes considerable bandwidth (and the updates are often unnecessary), most implementations adopt write invalidate.
7.2.2 Snooping coherence protocol
A simple snooping coherence protocol assigns each block in a private cache a valid bit (valid or invalid) and a status bit (marking it shared or exclusive). A cache block is therefore in one of three states (for an invalid block the status bit is meaningless):
- Invalid: the cache block does not hold a valid copy of any physical memory block
- Shared: the valid physical memory block copy held in this cache block may be shared with other processors, which implies the physical memory block in main memory has not been modified; the shared block need not actually be present in any other processor's cache, but wherever it is present it is guaranteed to be consistent with this block
- Exclusive/modified: the valid physical memory block copy held in this cache block is the only copy; it is inconsistent with the physical memory block in main memory, and it is guaranteed that no other processor's cache holds a copy of the block
There are several key actions in the snooping protocol:
- A processor writes to a shared block: it writes directly, notifies the other processors to invalidate the block, and changes the state to exclusive
- A processor writes to an exclusive block: it writes directly, with no notification and no state change
- A processor reads another processor's exclusive block: the other processor, notified that someone is attempting to read its exclusive block, writes the block back to main memory and changes its state to shared; the requesting processor then reads the block and marks it shared
- A processor writes to another processor's exclusive block: the other processor, notified that someone is attempting to write its exclusive block, writes the block back to main memory and changes its state to invalid; the requesting processor then reads the block, writes to it, and marks it exclusive
Classifying the above notifications by request and source gives the complete set of actions:
| Source | Request | State of addressed cache block | Cache operation type | Action |
| --- | --- | --- | --- | --- |
| Processor | Read hit | Shared or exclusive | Normal hit | Read the locally cached data directly |
| Processor | Read miss | Invalid | Normal miss | Put a read miss on the bus, load the requested data into the cache, read it, and mark it shared |
| Processor | Read miss | Shared | Replacement | Put a read miss on the bus, load the requested data into the cache replacing the original shared block, read it, and mark it shared |
| Processor | Read miss | Exclusive | Replacement | Write back the exclusive block and mark it shared; put a read miss on the bus, load the requested data into the cache replacing the original block, read it, and mark it shared |
| Processor | Write hit | Shared | Coherence | Write, mark the block exclusive, and put an invalidate on the bus |
| Processor | Write hit | Exclusive | Normal hit | Write directly to the local cache |
| Processor | Write miss | Invalid | Normal miss | Put a write miss on the bus, load the requested data into the cache, write it, and mark it exclusive |
| Processor | Write miss | Shared | Replacement | Put a write miss on the bus, load the requested data into the cache replacing the original shared block, write it, and mark it exclusive |
| Processor | Write miss | Exclusive | Replacement | Write back the exclusive block and mark it invalid; put a write miss on the bus, load the requested data into the cache replacing the invalid block, write it, and mark it exclusive |
| Bus | Read miss | Shared | No action | Allow sharing; no action needed |
| Bus | Read miss | Exclusive | Coherence | Write back the exclusive block and mark it shared |
| Bus | Write miss | Shared | Coherence | Mark the shared block invalid |
| Bus | Write miss | Exclusive | Coherence | Write back the exclusive block and mark it invalid |
| Bus | Invalidate | Shared | Coherence | Mark the shared block invalid |
Note: the exclusive state above is the modified state of the MSI protocol (another name for this simple coherence protocol); the state is sometimes called exclusive, and sometimes it must be distinguished from modified.
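The per-block state machine in the table can be sketched as a small simulation. This is an illustrative model (class and method names are my own), tracking one block's state in each private cache, with states 'I'/'S'/'M' ('M' standing for the exclusive/modified state):

```python
# Minimal MSI snooping sketch for a single block: every processor
# read/write broadcasts a bus event that the other caches snoop.
class MSIBus:
    def __init__(self, n_caches):
        self.state = ['I'] * n_caches   # per-cache state for one block
        self.memory_dirty = False       # True while some cache holds 'M'

    def read(self, p):
        if self.state[p] == 'I':                 # read miss
            for q, s in enumerate(self.state):
                if s == 'M':                     # owner writes back ...
                    self.state[q] = 'S'          # ... and becomes shared
                    self.memory_dirty = False
            self.state[p] = 'S'
        # read hit in 'S' or 'M': no bus traffic

    def write(self, p):
        if self.state[p] != 'M':                 # write hit in S, or write miss
            for q in range(len(self.state)):
                if q != p:
                    self.state[q] = 'I'          # invalidate other copies
        self.state[p] = 'M'
        self.memory_dirty = True

bus = MSIBus(3)
bus.read(0); bus.read(1)      # both caches load the block
print(bus.state)              # ['S', 'S', 'I']
bus.write(0)                  # cache 0 writes: invalidates cache 1
print(bus.state)              # ['M', 'I', 'I']
bus.read(2)                   # cache 2 reads: owner writes back, both shared
print(bus.state)              # ['S', 'I', 'S']
```

The three calls at the bottom exercise exactly the "write to shared block", "bus write miss", and "read another processor's exclusive block" rows of the table.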
7.2.3 MSI protocol extensions

(1) MESI

MSI has a flaw: reading a block (read miss) and then modifying it (write hit) generates two bus transactions (I->S on the read miss, then S->M plus an invalidate on the write hit). Even when the block is in effect "exclusive" to this cache, the write hit after the read still places an invalidate request on the bus, and this read-then-write pattern is common in multiprogrammed workloads.
To reduce bus transactions, the MESI protocol was proposed for this situation. It adds an exclusive state indicating that the current cache holds the only copy of the block and that the block is clean (to distinguish it from the earlier use of "exclusive", call it the clean-exclusive state), i.e. the block is consistent with main memory.
A write to a block in the clean-exclusive state produces no bus invalidate request (it is already known that no other cache holds a copy, so an invalidate would be meaningless). The read-then-write sequence above therefore generates only one bus transaction. (Note: writes to a block in the modified state likewise send no invalidate request.)
From the definition of the clean-exclusive state we know only one situation can produce it: a read miss where no other cache holds a copy of the block, so the block is loaded from main memory.
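The saving can be made concrete by extending the earlier sketch with an 'E' state and a bus-transaction counter. All names here are illustrative, not from any real library:

```python
# MESI sketch: a read miss that finds no other copy loads the block in
# the clean-exclusive state 'E'; a later write hit in 'E' upgrades
# silently to 'M' with no bus invalidate. Under MSI the same
# read-then-write sequence would cost two bus transactions.
class MESIBus:
    def __init__(self, n_caches):
        self.state = ['I'] * n_caches
        self.bus_transactions = 0

    def read(self, p):
        if self.state[p] == 'I':                       # read miss
            self.bus_transactions += 1
            others = [q for q, s in enumerate(self.state)
                      if q != p and s != 'I']
            for q in others:                           # demote any E/M owner
                self.state[q] = 'S'
            self.state[p] = 'S' if others else 'E'     # alone: clean exclusive

    def write(self, p):
        if self.state[p] in ('I', 'S'):                # needs a bus invalidate
            self.bus_transactions += 1
            for q in range(len(self.state)):
                if q != p:
                    self.state[q] = 'I'
        self.state[p] = 'M'                            # E -> M is silent

bus = MESIBus(2)
bus.read(0)                 # miss, no other copy: loaded as 'E'
bus.write(0)                # silent E -> M upgrade
print(bus.state, bus.bus_transactions)   # ['M', 'I'] 1
```

With a second sharer present the same sequence costs an extra invalidate transaction, which is the MSI behaviour.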
My question: doesn't determining whether another cache holds a copy of the block require adding flag bits to the memory block (and in the complicated cases even flag bits cannot solve it)? And wouldn't this check itself generate further bus transactions? I think it would.
After being set straight by an expert: MESI also introduces cache-to-cache sharing. If another cache snoops a read miss and detects that it holds a copy of the corresponding block, it aborts the memory access and actively supplies the block copy itself.
But then I have a further question: if several caches hold a shared copy of the block, do they all abort the memory access and send their own copies? Isn't that extra overhead? Or does the bus somehow not descend into chaos (even though the data being sent is identical)? [PS: this is what happens when you skip lectures and court disaster.]
(2) MOESI

MOESI adds an owned state to MESI, indicating that this cache owns the block and shares it with other caches, while the copy of the block in main memory is stale.
MOESI targets the cache-to-cache sharing case: on a miss it does not fetch the copy from main memory but seeks a copy in another cache (the idea being that cache access is faster than main memory access).
In MOESI, when another cache tries to share a modified block held in cache A, the block is not written back; instead A marks it owned and supplies the block copy to the sharer, which marks it shared (only A marks it owned). This behaviour must then be maintained: on later misses, the cache that owns the block must actively supply a copy of it, and the owned block is written back to main memory only when it is replaced.
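The owned-state behaviour can be sketched the same way (illustrative names again); the point to notice is that the write-back to memory is deferred from the sharing moment to the eviction:

```python
# MOESI 'owned' sketch: when another cache read-misses on a block held
# in 'M', the owner supplies the copy directly (cache-to-cache), keeps
# it in 'O', and memory is NOT updated; the write-back happens only
# when the owned block is evicted.
class MOESIBlock:
    def __init__(self, n_caches):
        self.state = ['I'] * n_caches
        self.memory_writebacks = 0

    def read(self, p):
        if self.state[p] == 'I':
            for q, s in enumerate(self.state):
                if s == 'M':
                    self.state[q] = 'O'       # owner supplies data, no writeback
                elif s == 'E':
                    self.state[q] = 'S'
            self.state[p] = 'S'

    def write(self, p):
        for q in range(len(self.state)):
            if q != p:
                self.state[q] = 'I'           # invalidate all other copies
        self.state[p] = 'M'

    def evict(self, p):
        if self.state[p] in ('M', 'O'):       # dirty block: write back now
            self.memory_writebacks += 1
        self.state[p] = 'I'

blk = MOESIBlock(2)
blk.write(0)                 # cache 0 holds the block modified
blk.read(1)                  # cache-to-cache supply: 0 -> O, 1 -> S
print(blk.state, blk.memory_writebacks)   # ['O', 'S'] 0
blk.evict(0)                 # owned block replaced: write back at last
print(blk.memory_writebacks)              # 1
```

Under MESI the `read(1)` step would already have forced a write-back; MOESI saves it as long as the owner keeps the block.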
7.3 Distributed shared-memory architecture and the directory coherence protocol
The snooping coherence protocol is not used in DSM because:
- Limited bus scalability: as the number of processors grows they compete for the shared bus, and snooping bandwidth easily becomes the bottleneck
- Snooping is difficult on networks that are not buses or rings: coherence notifications must be broadcast, which consumes high bandwidth and is inefficient on networks with complex topologies
This motivates another kind of coherence protocol, the directory coherence protocol. The directory is itself distributed across the nodes: each node's directory records the state of every block in that node's memory, in one-to-one correspondence. Directory addressing matches the distributed memory addressing, and all directories in a DSM share the same address space.
A block has three states in the directory coherence protocol:
- Shared: one or more nodes cache the block, and the copy in memory is up to date (the directory must also record which nodes share the block)
- Uncached: no node holds a cached copy of the block
- Modified: exactly one node holds a cached copy of the block and has written to it, so the copy in memory is stale (the directory must also record which node modified the block)
The directory therefore records not only the cache state of each memory block but also a bit vector recording the sharing/modifying nodes.
Define three types of nodes:
- Local node: the node issuing the request
- Remote node: any node other than the local node
- Home node: the node where the target cache block's memory location, and its directory entry, reside
Understanding the relationships between these node types makes the directory coherence protocol easy to follow:
- The home node may be the local node, or it may be a remote node
- The target cache block may be obtained from the home node, or from some other node
- When the local node detects a cache miss, it queries the home node's directory to learn the cache block's state and the sharing/modifying nodes:
  - If the home directory shows shared and the miss is a read miss, a copy of the cache block is sent to the local node and the node is added to the sharer record
  - If the home directory shows shared and the miss is a write miss, a copy of the cache block is sent to the local node, all nodes in the sharer record are notified to invalidate the block, and the block is finally marked modified
  - If the home directory shows uncached, the block is marked modified or shared according to the local node's miss type (write or read) and the node's number is recorded
  - If the home directory shows modified, a message is first sent to the node that modified the block telling it to write the modified block back, and only then is a copy supplied to the local node
Finally, the complete state transitions and corresponding actions of the directory coherence protocol are given in the table:
[Note: P is the number of the requesting node, A the requested address, D the requested data]
| Message type | Source | Destination | Message content | Function of the message |
| --- | --- | --- | --- | --- |
| Read miss | Local cache | Home directory | P, A | Node P has a read miss at address A; request the data and add P to the sharer list |
| Write miss | Local cache | Home directory | P, A | Node P has a write miss at address A; request the data and record P as the exclusive node; the home directory then sends invalidates |
| Invalidate | Local cache | Home directory | A | The home directory sends invalidates to all (remote) caches holding A |
| Invalidate | Home directory | Remote cache | A | Invalidate the shared copy of address A |
| Fetch | Home directory | Remote cache | A | Fetch the block at address A, send it to the home directory, and mark the block shared |
| Fetch/invalidate | Home directory | Remote cache | A | Fetch the block at address A, send it to the home directory, and mark the block invalid |
| Data value reply | Home directory | Local cache | D | Return the data value from the home node |
| Data write-back | Remote cache | Home directory | A, D | Write back the data value for address A |
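The home-directory side of the table can be sketched for a single block. The state names match the three directory states above; the sharer set stands in for the bit vector; class and method names are illustrative:

```python
# Directory coherence sketch for one memory block on its home node.
class Directory:
    def __init__(self):
        self.state = 'uncached'
        self.sharers = set()       # node ids holding a copy (the bit vector)
        self.messages = []         # trace of coherence messages sent

    def read_miss(self, p):
        if self.state == 'modified':
            owner = next(iter(self.sharers))
            self.messages.append(('fetch', owner))      # owner writes back
            self.sharers = {owner}
        self.sharers.add(p)
        self.state = 'shared'
        self.messages.append(('data_value_reply', p))

    def write_miss(self, p):
        for q in sorted(self.sharers - {p}):
            kind = ('fetch_invalidate' if self.state == 'modified'
                    else 'invalidate')
            self.messages.append((kind, q))
        self.sharers = {p}
        self.state = 'modified'
        self.messages.append(('data_value_reply', p))

d = Directory()
d.read_miss(1); d.read_miss(2)      # nodes 1 and 2 share the block
print(d.state, sorted(d.sharers))   # shared [1, 2]
d.write_miss(3)                     # node 3 writes: invalidate both sharers
print(d.state, sorted(d.sharers))   # modified [3]
print([m for m in d.messages if m[0] == 'invalidate'])
# [('invalidate', 1), ('invalidate', 2)]
```

Note that all messages flow through the home directory, never over a broadcast bus, which is what lets the protocol scale past the bus bottleneck described at the start of 7.3.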
7.4 Synchronization
Some of the most important concepts of synchronization issues are:
- Atomic operation
- Critical section
- Mutual exclusion lock (mutex)
- Semaphore
- Deadlock
- Synchronization barrier
These are basic concepts from the operating systems course and are not reviewed here.
7.5 Memory consistency
Memory consistency refers to the convention governing the order in which reads and writes, issued concurrently by multiple processors to different storage locations, are seen to complete by each processor.
Cache coherence only guarantees that a modification to a single location in shared storage becomes visible to all readers; it does not address:
- When a written value must become visible
- The order of processors P1's and P2's accesses to different address locations
- The order in which P2's reads of different storage locations are seen relative to P1
The sequential consistency model requires that the global memory-access order, formed by serializing the reads and writes of all processors, conform to each program's original order: however the instruction streams are interleaved, the global order must preserve the local (program) order of every processor.
Sufficient conditions for sequential consistency:
- Each process issues memory operations in program order
- After issuing a write, a process waits for the write to complete before issuing its next operation
- After issuing a read, a process waits not only for the read to complete but also for the write that produced the read's value to complete, before issuing its next operation
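These conditions imply that every execution is some interleaving of the processors' program-order instruction streams. The classic litmus test makes this concrete: P1 runs {x=1; r1=y} and P2 runs {y=1; r2=x}, with x=y=0 initially. Enumerating all sequentially consistent interleavings (an illustrative sketch) shows that (r1, r2) = (0, 0) can never occur under sequential consistency, though weaker models with store buffers do allow it:

```python
P1 = [('store', 'x'), ('load', 'y', 'r1')]
P2 = [('store', 'y'), ('load', 'x', 'r2')]

def interleavings(a, b):
    """All merges of a and b that preserve each program's own order."""
    if not a: yield list(b); return
    if not b: yield list(a); return
    for rest in interleavings(a[1:], b): yield [a[0]] + rest
    for rest in interleavings(a, b[1:]): yield [b[0]] + rest

outcomes = set()
for schedule in interleavings(P1, P2):
    mem, regs = {'x': 0, 'y': 0}, {}
    for op in schedule:
        if op[0] == 'store':
            mem[op[1]] = 1            # stores write the value 1
        else:
            regs[op[2]] = mem[op[1]]  # loads read the current memory value
    outcomes.add((regs['r1'], regs['r2']))

print(sorted(outcomes))   # [(0, 1), (1, 0), (1, 1)] -- never (0, 0)
```

Intuitively: whichever load executes last must see the other program's store, since that store precedes the other load in its own program order.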
Note: I don't fully understand this part yet; an example will be added later.