Architecture Review 4--thread-level parallelism

Last Update:2015-06-22 Source: Internet

Author: User


Architecture Review CH7: Thread-level parallelism

7.1 Multiprocessors and thread-level parallelism

7.1.1 Multiprocessor architecture

Thread-level parallelism means that a multiprocessor executes multiple threads at the same time. Multiprocessor architectures fall broadly into two types:

Symmetric shared-memory multiprocessor (SMP): also known as the centralized shared-memory architecture. It has a small number of cores that share one centralized memory, and every processor has equal access to it (hence also UMA, uniform memory access). The SMP memory hierarchy has roughly three levels: shared main memory, shared cache, and private cache. The main topic of this chapter is coherence between the private caches and the shared storage.

Distributed shared memory (DSM): a multiprocessor built from physically distributed memories, where multiple cores and their memories are connected through a high-speed interconnection network. Memory access time in DSM is non-uniform: a core accesses the memory of its own node faster than the memory of another node, and the access time to another node also depends on the network topology between the nodes (hence also NUMA, non-uniform memory access). "Shared memory" in DSM means a shared address space, and the main concern in DSM is the coherence of the distributed shared storage.

7.1.2 Challenges of parallel processing

Parallel processing faces two important challenges:

Limited parallelism in programs: by Amdahl's law, the proportion of the serial part is the bottleneck of the speedup
Large remote access latency in multiprocessors: this covers latency between different cores on the same chip and between cores on different chips
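The first challenge can be made concrete with a small calculation. The following is a minimal sketch of Amdahl's law, speedup = 1 / ((1 - f) + f / n), where f is the fraction of execution time that can be parallelized and n is the number of processors. The function names `amdahl_speedup` and `speedup_limit` are illustrative and do not come from the original notes.

```python
def amdahl_speedup(parallel_fraction: float, n_processors: int) -> float:
    """Speedup predicted by Amdahl's law.

    parallel_fraction: share of the original run time that can be parallelized.
    n_processors: number of processors the parallel part is spread over.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


def speedup_limit(parallel_fraction: float) -> float:
    """Upper bound on the speedup as the processor count grows without limit."""
    serial_fraction = 1.0 - parallel_fraction
    return float("inf") if serial_fraction == 0.0 else 1.0 / serial_fraction


if __name__ == "__main__":
    # The serial 5% caps the speedup no matter how many processors are added.
    for n in (2, 8, 64, 1024):
        print(n, round(amdahl_speedup(0.95, n), 2))
    print("limit:", round(speedup_limit(0.95), 2))
```

With 95% of the program parallelizable, the speedup can never exceed 20, which is why the serial proportion is described as the bottleneck.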

The strategies for addressing these two problems are:

Use algorithms with better parallelism
Seek better architectures and programming techniques

7.2 Centralized shared-memory architecture and snooping coherence protocols

Suppose processors A and B have both read X, so each holds a copy of X in its own cache. A then modifies X and writes it back to main memory, but the copy of X in B's cache is still the old, unmodified value. The caches are now incoherent

7.2.1 Cache coherence policies and methods

There are two strategies for tracking sharing state to ensure cache coherence:

Snooping: every cache that holds a copy of a physical memory block tracks the sharing state of that block itself; SMP mainly uses snooping coherence protocols
Directory: the sharing state of each physical memory block is kept in one dedicated place, the directory, and a cache queries the directory to obtain the sharing state of a block; DSM mostly uses distributed directory coherence protocols

There are two ways of implementing cache coherence:

Write invalidate: when a processor writes to its copy of a shared physical memory block, the copies of that block in all other caches are invalidated
Write update: when a processor writes to its copy of a shared physical memory block, the copies of that block in all other caches are updated with the written value

Because write update consumes considerable bandwidth (and the updates are sometimes unnecessary), most designs use write invalidate
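The difference between the two methods, and the stale-copy problem they solve, can be sketched with a toy model. This is only an illustration under simplifying assumptions: memory is write-through, each cache is a plain dictionary, and `ToySystem`, `demo`, and the `messages` counter are hypothetical names introduced here. The extra policy "none" models caches with no coherence mechanism at all.

```python
class ToySystem:
    """Private caches in front of a write-through memory (illustrative model)."""

    def __init__(self, n_caches: int, policy: str):
        if policy not in ("none", "invalidate", "update"):
            raise ValueError("unknown policy")
        self.policy = policy
        self.memory = {}
        self.caches = [{} for _ in range(n_caches)]
        self.messages = 0  # coherence messages delivered to other caches

    def read(self, cpu: int, addr: str) -> int:
        cache = self.caches[cpu]
        if addr not in cache:  # miss: fetch the block from memory
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, cpu: int, addr: str, value: int) -> None:
        self.caches[cpu][addr] = value
        self.memory[addr] = value  # write-through keeps memory current
        for other, cache in enumerate(self.caches):
            if other == cpu or addr not in cache:
                continue
            if self.policy == "invalidate":
                del cache[addr]      # the other copy is dropped
                self.messages += 1
            elif self.policy == "update":
                cache[addr] = value  # the other copy receives the new value
                self.messages += 1


def demo(policy: str):
    """A writes X three times after both A and B cached it; B then reads X."""
    system = ToySystem(2, policy)
    system.memory["x"] = 1
    system.read(0, "x")
    system.read(1, "x")
    for value in (2, 3, 4):
        system.write(0, "x", value)
    return system.read(1, "x"), system.messages


if __name__ == "__main__":
    for policy in ("none", "invalidate", "update"):
        print(policy, demo(policy))
```

In this model B reads the stale value when there is no coherence mechanism, while both real policies return the latest value. Write invalidate needs a single message for the run of three writes, whereas write update sends one per write, which matches the bandwidth argument above.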

7.2.2 Snooping coherence protocol

A simple snooping coherence protocol gives each cache block in a private cache a valid bit (valid or invalid) and a state bit (shared or exclusive). A cache block is therefore in one of three states (the state bit of an invalid block is meaningless):

Invalid: the cache block does not hold a valid copy of a physical memory block
Shared: the valid copy held in this cache block may be shared with other processors, which means the block has not been modified and main memory is up to date. Other caches do not necessarily hold the block, but any copy that does exist is guaranteed to be identical to this one
Exclusive/modified: the valid copy held in this cache block is the only one; it differs from the block in main memory, and no other processor's cache may hold a copy of the block

The key actions in the snooping protocol are:

A processor writes a shared block: it writes directly, notifies the other processors that the block is invalidated, and changes the state to exclusive
A processor writes an exclusive block: it writes directly, with no notification and no state change
A processor reads a block that is exclusive in another processor: the other processor is notified that its exclusive block is about to be read, writes the block back to main memory, and changes its state to shared; the requesting processor then reads the block and marks it shared
A processor writes a block that is exclusive in another processor: the other processor is notified that its exclusive block is about to be written, writes the block back to main memory, and changes its state to invalid; the requesting processor then reads the block, writes it, and marks it exclusive

Classifying these notifications by request type and source gives the complete set of actions:

Source	Request	State of addressed cache block	Type of cache action	Action
Processor	Read hit	Shared or exclusive	Normal hit	Read the data directly from the local cache
Processor	Read miss	Invalid	Normal miss	Place a read miss on the bus, load the requested data into the cache, read it, and mark the block shared
Processor	Read miss	Shared	Replacement	Place a read miss on the bus, load the requested data into the cache in place of the original shared block, read it, and mark the block shared
Processor	Read miss	Exclusive	Replacement	Write back the original exclusive block, place a read miss on the bus, load the requested data into the cache in place of the written-back block, read it, and mark the new block shared
Processor	Write hit	Shared	Coherence	Write, mark the block exclusive, and place an invalidate on the bus
Processor	Write hit	Exclusive	Normal hit	Write directly to the local cache
Processor	Write miss	Invalid	Normal miss	Place a write miss on the bus, load the requested data into the cache, write it, and mark the block exclusive
Processor	Write miss	Shared	Replacement	Place a write miss on the bus, load the requested data into the cache in place of the original shared block, write it, and mark the block exclusive
Processor	Write miss	Exclusive	Replacement	Write back the original exclusive block, place a write miss on the bus, load the requested data into the cache in place of the written-back block, write it, and mark the new block exclusive
Bus	Read miss	Shared	No action	The block may stay shared; no action is needed
Bus	Read miss	Exclusive	Coherence	Write back the exclusive block and mark it shared
Bus	Write miss	Shared	Coherence	Mark the shared block invalid
Bus	Write miss	Exclusive	Coherence	Write back the exclusive block and mark it invalid
Bus	Invalidate	Shared	Coherence	Mark the shared block invalid

Note: the exclusive state above is the modified (M) state of the MSI protocol, which is another name for this simple coherence protocol. The state is sometimes just called exclusive, but sometimes it has to be distinguished from modified
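The transitions in the table can be sketched as a small state machine. This is a simplified model, not a full implementation: it tracks a single memory block across several private caches, so the replacement rows (a miss that evicts a different block from the same cache frame) are left out. `State`, `MsiBus`, and the transaction names are hypothetical names chosen for this sketch.

```python
from enum import Enum


class State(Enum):
    INVALID = "I"
    SHARED = "S"
    MODIFIED = "M"  # the "exclusive" state of the table


class MsiBus:
    """One memory block snooped by n private caches (illustrative model)."""

    def __init__(self, n_caches: int):
        self.states = [State.INVALID] * n_caches
        self.transactions = []  # (kind, requesting cpu) for each bus transaction
        self.writebacks = 0     # blocks written back to main memory

    def _broadcast(self, requester: int, kind: str) -> None:
        """Put a request on the bus and let every other cache react to it."""
        self.transactions.append((kind, requester))
        for cpu, state in enumerate(self.states):
            if cpu == requester:
                continue
            if state is State.MODIFIED:
                # Bus read miss or write miss on a modified block: write it back.
                self.writebacks += 1
                if kind == "read_miss":
                    self.states[cpu] = State.SHARED
                else:
                    self.states[cpu] = State.INVALID
            elif state is State.SHARED and kind in ("write_miss", "invalidate"):
                self.states[cpu] = State.INVALID

    def read(self, cpu: int) -> None:
        if self.states[cpu] is State.INVALID:  # read miss
            self._broadcast(cpu, "read_miss")
            self.states[cpu] = State.SHARED
        # A read hit on a shared or modified block needs no bus traffic.

    def write(self, cpu: int) -> None:
        state = self.states[cpu]
        if state is State.INVALID:             # write miss
            self._broadcast(cpu, "write_miss")
        elif state is State.SHARED:            # write hit on a shared block
            self._broadcast(cpu, "invalidate")
        self.states[cpu] = State.MODIFIED      # write hit on modified: silent


if __name__ == "__main__":
    bus = MsiBus(2)
    bus.read(0)
    bus.read(1)
    bus.write(0)
    print([s.value for s in bus.states], bus.transactions, bus.writebacks)
```

After both caches read the block and cache 0 writes it, cache 0 holds the block in the modified state and cache 1 has been invalidated, at a cost of three bus transactions and no write-back.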

7.2.3 Extensions of the MSI protocol

(1) MESI

MSI has a flaw: reading a block (read miss) and then modifying it (write hit) generates two bus transactions (I->S on the read miss, then S->M plus an invalidate on the write hit). Even when the block is held by only this one cache, the write hit after the read still places an invalidate on the bus. This pattern is common in multiprogrammed workloads

To reduce bus transactions in this case, the MESI protocol adds an exclusive (E) state, which means that only the current cache holds a copy of the block and that the block is clean, that is, identical to the block in main memory. (To distinguish it from the earlier use of "exclusive", I call it the clean exclusive state.)

Writing a block in the clean exclusive state does not place an invalidate on the bus (the block is known not to be in any other cache, so an invalidate would be pointless). The "read then write" sequence above therefore produces only one bus transaction. (Note: writing a block in the modified state does not produce an invalidate either.)

From the definition of the clean exclusive state, only one situation can produce it: on a read miss, no other cache holds a copy of the block, and the block is loaded from main memory
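The saving can be checked by counting bus transactions for the same access sequence under both protocols. The sketch below models one block and a handful of caches; it assumes that on a read miss the requester can learn whether any other cache holds the block, and it ignores write-backs and evictions. The function name `bus_transactions` is hypothetical.

```python
def bus_transactions(protocol: str, n_caches: int, ops) -> int:
    """Count bus transactions for ops on a single block.

    protocol: "MSI" or "MESI".
    ops: sequence of (cpu, "r") or (cpu, "w") accesses.
    """
    if protocol not in ("MSI", "MESI"):
        raise ValueError("unknown protocol")
    state = ["I"] * n_caches
    count = 0
    for cpu, op in ops:
        others = [c for c in range(n_caches) if c != cpu]
        if op == "r":
            if state[cpu] == "I":
                count += 1  # read miss goes on the bus
                held_elsewhere = any(state[c] != "I" for c in others)
                for c in others:
                    if state[c] in ("M", "E"):
                        state[c] = "S"  # another reader now exists
                if protocol == "MESI" and not held_elsewhere:
                    state[cpu] = "E"    # clean exclusive: sole, unmodified copy
                else:
                    state[cpu] = "S"
        else:
            if state[cpu] in ("I", "S"):
                count += 1  # write miss, or invalidate on a shared block
                for c in others:
                    state[c] = "I"
            # A write to an E or M block is silent: no other copy can exist.
            state[cpu] = "M"
    return count


if __name__ == "__main__":
    read_then_write = [(0, "r"), (0, "w")]
    print("MSI :", bus_transactions("MSI", 2, read_then_write))
    print("MESI:", bus_transactions("MESI", 2, read_then_write))
```

For a private read followed by a write, MSI needs two transactions and MESI one. When another cache has read the block in between, both protocols need the invalidate, so MESI brings no benefit in that case.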

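My question: doesn't deciding whether another cache holds a copy of the block require adding a flag bit to the memory block (and in complex cases even a flag bit would not solve it)? And does this check then generate extra bus transactions of its own? I think it does

After a pointer from someone more senior, I learned that MESI also introduces cache-to-cache sharing: if another cache snoops a read miss and finds that it holds a copy of the corresponding block, it aborts the memory access and supplies the copy itself

That raises a further question, though: if several caches hold the block in the S state, do they all abort the memory access and send their own copies? Isn't that extra overhead? And wouldn't the bus become chaotic (even though they all send the same data)? [PS: this is what happens when you don't listen in class]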

(2) MOESI

MOESI adds an owned (O) state to MESI. It indicates that this cache owns the block and shares it with other caches, and that the copy in main memory is stale

MOESI builds on cache-to-cache sharing: on a miss, the copy is requested from another cache rather than from main memory (the idea being that a cache access is faster than a main memory access)

In MOESI, when another cache requests a block that is modified in cache A, A does not write the block back but marks it owned; the requester obtains its copy from A and marks it shared (only A is marked owned). This arrangement has to be maintained afterwards: whenever a miss on the block occurs, the cache that owns it must supply the copy, and the block is written back to main memory only when the owned block is replaced.
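One way to see the effect of the owned state is to count writes to main memory for the same access sequence under MESI and MOESI. The sketch below is a rough model of a single block. It assumes that in MESI a modified block is written back whenever another cache misses on it, and that in MOESI a dirty block reaches memory only when its owner evicts it. The function name `memory_writebacks` and the "evict" operation are hypothetical.

```python
def memory_writebacks(protocol: str, n_caches: int, ops) -> int:
    """Count writes to main memory for ops on a single block.

    protocol: "MESI" or "MOESI".
    ops: sequence of (cpu, "r"), (cpu, "w"), or (cpu, "evict").
    """
    if protocol not in ("MESI", "MOESI"):
        raise ValueError("unknown protocol")
    state = ["I"] * n_caches
    writebacks = 0
    for cpu, op in ops:
        others = [c for c in range(n_caches) if c != cpu]
        if op == "r" and state[cpu] == "I":
            for c in others:
                if state[c] == "M":
                    if protocol == "MOESI":
                        state[c] = "O"   # keep the dirty block and supply it
                    else:
                        writebacks += 1  # MESI: memory is brought up to date
                        state[c] = "S"
                elif state[c] == "E":
                    state[c] = "S"
            alone = all(state[c] == "I" for c in others)
            state[cpu] = "E" if alone else "S"
        elif op == "w" and state[cpu] != "M":
            for c in others:
                if state[c] == "M" and protocol == "MESI":
                    writebacks += 1      # modified copy is written back first
                # Assumption: in MOESI the dirty data moves cache to cache.
                state[c] = "I"
            state[cpu] = "M"
        elif op == "evict":
            if state[cpu] in ("M", "O"):
                writebacks += 1          # a dirty block must reach memory
            state[cpu] = "I"
    return writebacks


if __name__ == "__main__":
    # Cache 0 produces a value three times; cache 1 reads it each time.
    ops = [(0, "w"), (1, "r")] * 3 + [(0, "evict")]
    print("MESI :", memory_writebacks("MESI", 2, ops))
    print("MOESI:", memory_writebacks("MOESI", 2, ops))
```

In this producer and consumer pattern MESI writes the block to memory on every hand-off, while MOESI writes it once, when the owner finally evicts it.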

7.3 Distributed shared-memory architecture and directory coherence protocol

Snooping coherence protocols are not used in DSM because:

Limited bus scalability: as the number of processors grows, they compete for the shared bus, and snooping bandwidth easily becomes the bottleneck
Snooping is difficult on a network that is not a bus or a ring: coherence notifications must be broadcast, which is bandwidth-hungry and inefficient on networks with complex topologies

This motivates another kind of coherence protocol, the directory coherence protocol. The directory is also distributed across the nodes, with each node's directory corresponding one-to-one to that node's memory: it records the state of every block in that memory. Directory addressing matches distributed memory addressing, and all the directories in the DSM share the same address space

A block is in one of three states in the directory coherence protocol:

Shared: one or more nodes have cached the block, and the block in memory is up to date (the directory also records which nodes share the block)
Uncached: no node holds a copy of the block
Modified: exactly one node holds a copy of the block and has written to it, so the block in memory is stale (the directory also records which node modified the block)

The directory therefore records not only the cache state of each memory block but also a bit vector identifying the sharing or modifying nodes.

Define three types of nodes:

Local node: the node that issues the request
Remote node: any node other than the local node
Home node: the node where the memory location of the target block and its directory entry reside

Understanding the relationships among these node types makes the directory coherence protocol easy to follow:

The home node may be the local node or a remote node
The target block may be obtained from the home node or from another node
When the local node detects a cache miss, it queries the directory of the home node to learn the state of the block and which nodes share or modified it:
If the directory shows shared and the request is a read miss, a copy of the block is sent to the local node and the local node is added to the sharer record
If the directory shows shared and the request is a write miss, a copy of the block is sent to the local node, every node in the sharer record is notified to invalidate the block, and the block is finally marked modified
If the directory shows uncached, a copy is sent to the local node, and the block is marked modified or shared, with the local node recorded, according to the type of miss (write or read)
If the directory shows modified, a message is sent to the node that modified the block telling it to write the block back, and only then is the copy provided to the local node

Finally, the table lists the complete set of messages in the directory coherence protocol and what each one does:

[Note: P is the number of the requesting node, A is the requested address, and D is the requested data]

Message type	Source	Destination	Message content	Function of the message
Read miss	Local cache	Home directory	P, A	Node P has a read miss at address A; request the data and add P to the sharer list
Write miss	Local cache	Home directory	P, A	Node P has a write miss at address A; request the data and record P as the exclusive owner, after which the home directory sends invalidates
Invalidate	Local cache	Home directory	A	Ask the home directory to send invalidates to all nodes (remote caches) that have cached the block at address A
Invalidate	Home directory	Remote cache	A	Invalidate the shared copy of the block at address A
Fetch	Home directory	Remote cache	A	Fetch the block at address A, send it to the home directory, and mark it shared
Fetch/invalidate	Home directory	Remote cache	A	Fetch the block at address A, send it to the home directory, and mark it invalid
Data value reply	Home directory	Local cache	D	Return the data value from the home node
Data write-back	Remote cache	Home directory	A, D	Write back the data value for address A
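The directory entry and the way the home node reacts to misses can be sketched as follows. This is a minimal model under stated assumptions: it covers only the home directory, represents the sharer record as an integer bit vector, and logs the messages the home node would send instead of sending them. `DirectoryEntry`, `HomeDirectory`, and the message names in the log are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DirectoryEntry:
    state: str = "uncached"  # "uncached", "shared", or "modified"
    sharers: int = 0         # bit vector: bit p set means node p holds a copy


class HomeDirectory:
    """Directory of one home node; logs the messages it would send."""

    def __init__(self):
        self.entries = {}
        self.log = []  # (message, destination node, address)

    def _entry(self, addr: int) -> DirectoryEntry:
        return self.entries.setdefault(addr, DirectoryEntry())

    @staticmethod
    def _nodes(bits: int):
        """Node numbers whose bit is set in the sharer vector."""
        return [p for p in range(bits.bit_length()) if (bits >> p) & 1]

    def read_miss(self, p: int, addr: int) -> None:
        entry = self._entry(addr)
        if entry.state == "modified":
            owner = self._nodes(entry.sharers)[0]
            # The owner writes the block back and keeps a shared copy.
            self.log.append(("fetch", owner, addr))
        entry.sharers |= 1 << p
        entry.state = "shared"
        self.log.append(("data_reply", p, addr))

    def write_miss(self, p: int, addr: int) -> None:
        entry = self._entry(addr)
        if entry.state == "modified":
            owner = self._nodes(entry.sharers)[0]
            # The owner writes the block back and drops its copy.
            self.log.append(("fetch_invalidate", owner, addr))
        elif entry.state == "shared":
            for node in self._nodes(entry.sharers):
                if node != p:
                    self.log.append(("invalidate", node, addr))
        entry.sharers = 1 << p  # the writer is now the only holder
        entry.state = "modified"
        self.log.append(("data_reply", p, addr))


if __name__ == "__main__":
    home = HomeDirectory()
    home.read_miss(0, 0x40)
    home.read_miss(2, 0x40)
    home.write_miss(1, 0x40)
    print(home.entries[0x40], home.log)
```

After nodes 0 and 2 read the block and node 1 writes it, the entry is in the modified state with only bit 1 set, and the log shows one invalidate for each former sharer.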

7.4 Synchronization

The most important concepts in synchronization are:

Atomic operation
Critical section
Mutex (mutual exclusion lock)
Semaphore
Deadlock
Barrier

These are basic concepts from the operating systems course and are not reviewed here.

7.5 Memory consistency

Memory consistency refers to a convention about the order in which each process sees the completion of the reads and writes that multiple processors perform concurrently on different memory locations

Cache coherence guarantees that a modification to a location in the shared memory space becomes visible to all readers. It does not cover:

When a written value becomes visible
The order of accesses by processors P1 and P2 to different addresses
The order in which P2's reads of different memory locations appear relative to P1

The sequential consistency model requires that the global memory access order, formed as if the reads and writes of all processors were executed serially, respect the order of the original programs. In other words, however the instruction streams are interleaved, the global order must preserve the local order of every program

Sufficient conditions for sequential consistency:

Each process issues its memory operations in program order
After issuing a write, a process waits for the write to complete before issuing its next operation
After issuing a read, a process waits not only for the read to complete but also for the write that produced the value being read to complete, before issuing its next operation
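A small experiment shows what sequential consistency allows. The sketch below enumerates every interleaving of two short programs that keeps each program's own order, runs each interleaving against a single shared memory, and collects the results. The two programs form a common litmus test chosen here as an illustration, and the names `interleavings`, `run`, `sc_outcomes`, `P1`, and `P2` are hypothetical.

```python
from itertools import combinations


def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's order."""
    total = len(a) + len(b)
    for positions in combinations(range(total), len(a)):
        slots_for_a = set(positions)
        next_a, next_b = iter(a), iter(b)
        yield [next(next_a) if i in slots_for_a else next(next_b)
               for i in range(total)]


def run(schedule):
    """Execute one global order against a single shared memory."""
    memory = {"x": 0, "y": 0}
    registers = {}
    for op, *args in schedule:
        if op == "store":
            addr, value = args
            memory[addr] = value
        else:  # "load"
            addr, register = args
            registers[register] = memory[addr]
    return registers["r1"], registers["r2"]


# P1 writes x and then reads y; P2 writes y and then reads x.
P1 = [("store", "x", 1), ("load", "y", "r1")]
P2 = [("store", "y", 1), ("load", "x", "r2")]


def sc_outcomes():
    """All (r1, r2) results reachable under sequential consistency."""
    return {run(schedule) for schedule in interleavings(P1, P2)}


if __name__ == "__main__":
    print(sorted(sc_outcomes()))
```

The result (r1, r2) = (0, 0) never appears: in any global order that respects both program orders, at least one store precedes the load that reads it. A machine that let a load complete before an earlier store from the same process could produce (0, 0), which is what the waiting conditions above rule out.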

Note: I did not fully understand this part when reading it; examples will be added later

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
