http://www.infoq.com/cn/articles/cache-coherency-primer
http://www.cnblogs.com/xybaby/p/6641928.html
This article is a translation of "Cache Coherency Primer", published by RAD Game Tools programmer Fabian "ryg" Giesen on his blog and shared with the InfoQ Chinese station by the author. There are two articles in this series; this is the first.
I plan to write some articles about organizing data in multi-core scenarios. I wrote the first one, but quickly realized there was a lot of background I needed to cover first. In this article, I will try to lay out that background.
Caches
This section is a quick introduction to CPU caches. I assume you already know the basic concept, but you may not be familiar with some of the details. (If you are, feel free to skip this part.)
On (most) modern CPUs, all memory accesses go through layers of caches. There are exceptions, such as memory-mapped I/O ports and write-combined memory, which bypass at least part of this machinery, but both are rare scenarios (meaning most user-mode code never encounters them), so I will ignore them in this article.
The CPU's read/write (and instruction fetch) units normally cannot access memory directly at all: this is physical structure; the CPU has no pins connected directly to memory. Instead, the CPU talks to the level-1 cache (L1 cache), and the L1 cache talks to memory. About 20 years ago, the L1 cache really did pass data directly to and from memory. Today, more cache levels have been added to the design: the L1 cache can no longer talk to memory directly; it talks to the L2 cache, and the L2 cache talks to memory. Or there may also be an L3 cache. You get the idea.
Caches are divided into "lines", each corresponding to a block of memory that is 32 bytes (older ARMs, and x86s and PowerPCs from the 1990s and early 2000s), 64 bytes (newer ARMs and x86s), or 128 bytes (newer Power ISA machines) in size. Each cache line knows which range of physical memory addresses it corresponds to, and in this article I will not distinguish between the physical cache line and the memory it represents; this sounds a bit sloppy, but please get used to the convention for convenience. Specifically, when I say "cache line", I mean a block of memory aligned to the cache line size, regardless of whether its contents are actually cached (that is, present in any cache level).
When the CPU sees an instruction that reads memory, it passes the memory address to the level-1 data cache (or L1D$, because the English word "cache" is pronounced like "cash"). The L1 data cache checks whether it holds the cache line corresponding to that address. If not, it loads the entire cache line from memory (or from a higher-level cache, if one exists). Yes, loading a whole cache line at once rests on the assumption that memory accesses tend to be localized: if we need the data at one address, we are very likely to access its neighboring addresses soon after. Once the cache line is loaded into the cache, the read instruction proceeds normally.
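To make cache-line granularity and spatial locality concrete, here is a minimal C++ sketch (my own example, not from the original article), assuming 64-byte lines, the common case on current x86 and ARM parts:

```cpp
#include <cstdint>
#include <cstdio>

constexpr std::uintptr_t kLineSize = 64;   // assumed cache line size

// Round an address down to the start of the cache line it belongs to.
std::uintptr_t line_base(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) & ~(kLineSize - 1);
}

int main() {
    alignas(64) int array[32] = {};
    // array[0] and array[15] occupy bytes 0..63, i.e. the same 64-byte line,
    // so touching either one pulls the other into the cache as well.
    // array[16] starts at byte 64 and lives in the next line.
    std::printf("%d %d\n",
                line_base(&array[0]) == line_base(&array[15]),
                line_base(&array[0]) == line_base(&array[16]));
}
```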
If all we ever did were reads, life would be very simple, because all cache levels obey the following rule, which I call:
Basic law: At any point in time, the contents of a cache line present in any cache level are identical to the contents of its corresponding memory.
Once we allow writes, things get a little more complicated. There are two basic write policies: write-through and write-back. Write-through is the simpler one: we pass the write straight through this cache to the next level (or directly to memory), and if the corresponding line is cached, we update the cached copy (or perhaps just discard it). It is as simple as that, and it still obeys the law above: a line present in the cache always matches its corresponding memory contents.
Write-back is a bit more involved. The cache does not pass the write on to the next level immediately; instead, it modifies only the data in the cache and marks the corresponding line as "dirty". A dirty line eventually triggers a write-back, meaning its contents are written to the corresponding memory or to the next cache level. After the write-back, the dirty line becomes "clean" again. When a dirty line is discarded, a write-back is always performed first. Write-back caches obey a slightly different law.
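To make the two policies concrete, here is a toy C++ model (invented purely for illustration; real caches are hardware, and all names here are my own assumptions) of a single cache line with a dirty bit:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kLineSize = 64;   // assumed cache line size

struct CacheLine {
    std::uint64_t tag   = 0;            // which aligned memory block this line holds
    bool          valid = false;
    bool          dirty = false;        // only meaningful for write-back
    std::uint8_t  data[kLineSize] = {};
};

// Write-through: update the cached copy (if we have one), then always pass the
// write on to memory immediately, so cache and memory never disagree.
void write_through(CacheLine& line, std::uint8_t* memory,
                   std::uint64_t addr, std::uint8_t value) {
    if (line.valid && line.tag == addr / kLineSize)
        line.data[addr % kLineSize] = value;
    memory[addr] = value;
}

// Write-back: modify only the cached copy and mark the line dirty; memory is
// left stale until the line is written back. (Assumes the line already holds
// the block containing `addr`.)
void write_back(CacheLine& line, std::uint64_t addr, std::uint8_t value) {
    line.data[addr % kLineSize] = value;
    line.dirty = true;
}

// Discarding a dirty line always triggers a write-back first.
void evict(CacheLine& line, std::uint8_t* memory) {
    if (line.valid && line.dirty)
        std::memcpy(memory + line.tag * kLineSize, line.data, kLineSize);
    line.valid = false;
    line.dirty = false;
}
```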
Write-back law: After all dirty cache lines are written back, the contents of a cache line present in any cache level are identical to the contents of its corresponding memory.
In other words, the write-back law drops the qualifier "at any point in time" and replaces it with a weaker condition: either a cache line's contents match memory (if the line is clean), or they will eventually be written back to memory (for dirty lines).
Write-through is simpler, but write-back has its advantages: it can filter out repeated writes to the same address, and if most of a cache line changes before it is written back, the system can issue one large memory transaction instead of many small ones, which is more efficient.
Some (mostly older) CPUs use only write-through, some use only write-back, and some use write-through for the L1 cache and write-back for the L2. This generates redundant traffic between the L1 and L2 caches, but still keeps the write-back benefit between the L2 cache and the lower cache levels or memory. My point is that there are a number of trade-offs involved, and different designs make different choices. No one says the cache line size has to be the same at every level, either; there are CPUs with 32-byte lines in the L1 cache but 128-byte lines in the L2, for example.
To simplify matters, I have omitted a few things: cache associativity, cache sets, write-allocate versus no-write-allocate (the write-through policy I described above is without write allocation, while the write-back policy is with it, which is the usual pairing), unaligned accesses, and virtually addressed caches. All of these are things you can look up if you are interested, but I am not going to cover them here.
Coherency protocols
As long as the system has only a single CPU core doing the work, everything is fine. If there are multiple cores, each with its own cache, we have a problem: what happens if the memory corresponding to a cache line in one CPU's cache is quietly changed by another CPU?
Well, the answer is simple: nothing happens. And that is bad, because if one CPU has a block of memory cached, we want to be notified when another CPU modifies that memory. When we have multiple sets of caches, we really need them to stay in sync. Or rather, the caches will not keep themselves in sync on their own; what we need is a set of rules that everyone follows so that the effect of synchronization is achieved.
Note that the root of the problem is that we have multiple sets of caches, not that we have multiple CPU cores. We could also solve it by having all the CPU cores share one set of caches: there is only a single L1 cache, and all processors must share it. In each instruction cycle, one lucky CPU gets to perform its memory operation through that L1 cache and run its instruction.
That works fine in itself. The only problem is that it is slow, because the processors spend their time queuing for the L1 cache (and processors do a lot of memory operations, at least one per load or store instruction). I point this out because it shows that the problem is not caused by having multiple cores, but by having multiple caches. We know that a single set of caches would work, it is just too slow; the next best thing is to use multiple sets of caches but make them behave as if they were a single cache. That is what cache coherency protocols are for: as the name says, they keep the contents of multiple caches coherent.
There are many cache coherency protocols, but most of the computer devices you deal with daily use a "snooping" protocol, which is what I will describe here. (There are also "directory-based" protocols, which have higher latency but scale better in systems with many processors.)
The basic idea behind snooping is that all memory transfers happen on a shared bus that all processors can see: the caches themselves are independent, but memory is a shared resource, and all memory accesses have to be arbitrated: within a given instruction cycle, only one cache may read or write memory. The idea of a snooping protocol is that the caches do not just talk to the bus when they perform their own memory transfers; they constantly snoop the traffic on the bus to keep track of what the other caches are doing. So when one cache reads or writes memory on behalf of its processor, the other processors are notified, which lets them keep their own caches in sync. As soon as one processor writes memory, the other processors immediately know that the corresponding lines in their own caches are now invalid.
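The following toy C++ sketch (my own model with invented names, nothing like a real hardware interface) captures that idea: every cache watches writes broadcast on the bus and invalidates its own copy of the affected line.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct SnoopingCache {
    // line address -> valid? (a real cache holds tags, data and state bits)
    std::unordered_map<std::uint64_t, bool> lines;

    // Called for every write some other cache puts on the bus.
    void snoop_write(std::uint64_t line_addr) {
        auto it = lines.find(line_addr);
        if (it != lines.end())
            it->second = false;       // our copy is now stale: invalidate it
    }
};

struct Bus {
    std::vector<SnoopingCache*> caches;

    // A write by one cache is visible to all others, which snoop and react.
    void broadcast_write(const SnoopingCache* writer, std::uint64_t line_addr) {
        for (SnoopingCache* c : caches)
            if (c != writer)
                c->snoop_write(line_addr);
    }
};
```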
With write-through caches this is straightforward, because writes are "published" as soon as they happen. But with write-back caches there is a problem: the data may not actually be written back to physical memory until long after the store instruction executed, and during that time another processor's cache might innocently write to the same memory address, causing a conflict. So in the write-back model it is not enough to broadcast memory writes to the other processors when they happen; we need to tell the other processors before we modify our local copy. Working out the details leads to the simplest common way of handling write-back caches, the protocol we usually call MESI. (Translator's note: MESI is an acronym for Modified, Exclusive, Shared, Invalid, the four cache line states; below, a single letter may be used to refer to the corresponding state.)
MESI and its derivatives
This section is called "MESI and its derivatives" because MESI has spawned a family of closely related coherency protocols. Let's start with plain MESI: MESI is an acronym for the four states a cache line can be in, and every cache line in a multi-core system is in one of these four states. I will explain them in reverse order, because that order is more logical:
- Invalid lines are lines that are either not present in the cache or whose contents are known to be stale. For caching purposes they are ignored: once a cache line is marked Invalid, the effect is the same as if it had never been loaded into the cache.
- Shared lines are copies that match the contents of main memory; cache lines in this state can only be read, not written. Multiple caches may hold a Shared line for the same memory address at the same time, hence the name.
- Exclusive lines are, like S-state lines, copies that match the contents of main memory. The difference is that while one processor holds a line in the E state, no other processor may hold it at the same time, hence "exclusive". This means that if other processors previously held the same line, their copies immediately become Invalid.
- Modified lines are dirty: they have been modified by the owning processor. If a line is Modified, its copies in all other processors' caches immediately become Invalid, just as with the E state. In addition, if a Modified line is discarded or marked Invalid, its contents are first written back to memory, just like a regular dirty line in a write-back cache.
If you compare these states with the write-back cache of a single-core system, you will find that the I, S, and M states already have counterparts: invalid/not loaded, clean, and dirty cache lines. So the only new thing here is the E state, which stands for exclusive access. This state solves the problem of having to tell the other processors before we start modifying a block of memory: a processor may only write to a line when the line is in the E or M state, that is, only when it owns the line exclusively. If a processor wants to write to a line it does not own exclusively, it must first send an "I want exclusive access" request on the bus, which tells all other processors to invalidate their copies of that line (if they have any). Only after exclusive access is granted may the processor start modifying the data, and at that point it knows the only copies of this cache line are in its own cache, so there can be no conflicts.
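Here is a rough C++ sketch of that write path (a toy state machine with invented names; the actual coherence logic lives in cache hardware, not in software):

```cpp
enum class MesiState { Modified, Exclusive, Shared, Invalid };

// Hypothetical bus request: asks every other cache to invalidate its copy of
// the line (and, if one holds it Modified, to supply the data / write it back).
void broadcast_read_for_ownership(/* line address, ... */) {
    // In a real system this is a bus transaction, not a function call.
}

// What a core must do before and after writing a cached line.
void write_line(MesiState& state /*, bytes to write */) {
    if (state != MesiState::Exclusive && state != MesiState::Modified) {
        broadcast_read_for_ownership();   // after this, no other cache holds a valid copy
        state = MesiState::Exclusive;     // we now own the line
    }
    // ... modify the cached data ...
    state = MesiState::Modified;          // line is dirty; eviction must write it back
}
```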
Conversely, if some other processor wants to read this cache line (we will know immediately, because we are snooping the bus), an Exclusive or Modified line must first revert to the Shared state. If it is a Modified line, its contents are also written back to memory first.
The MESI protocol is a proper state machine that responds both to requests from the local processor and to messages arriving over the bus. I am not going to go into further detail about the state diagram and the different kinds of transitions; you can find more in-depth coverage in books on hardware architecture if you are interested, but for this article that would be going too far. As a software developer, you really only need to understand the following two points:
First, in a multi-core system, getting read access to a cache line involves talking to the other processors and may cause them to perform memory transfers. Writing to a cache line is a multi-step process: before you can write anything, you first have to acquire both exclusive ownership of the line and a copy of its current contents (a so-called "Read For Ownership" request).
Second, although we do extra work to maintain coherency, the end result is actually a very strong guarantee. Namely, it obeys the following theorem, which I call:
MESI law: After all dirty cache lines (those in the M state) are written back, the contents of a cache line present in any cache level are identical to the contents of its corresponding memory. In addition, at any point in time, when a memory location is exclusively cached by one processor (the line is in the E or M state), it is not present in any other processor's cache.
Note that this is really just the write-back law we already saw, with the exclusivity rule added on top. My point is that the presence of MESI or of multiple cores does not inherently weaken our memory model at all.
OK, that is (roughly) plain MESI (and the CPUs that use it, e.g. ARMs). Other processors use extended variants of MESI. Popular extensions include an "O" (Owned) state, which, like the E state, is a means of keeping caches coherent, but which lets caches share the contents of dirty lines directly without writing them back to memory first ("dirty sharing"), yielding the MOESI protocol; and MERSI and MESIF, two names for the same idea, namely designating one processor to handle read requests for a given cache line. When multiple processors hold an S-state copy of a line at the same time, only the designated processor (whose copy is in the R or F state) responds to reads, rather than every processor that has the line; this reduces traffic on the bus. You can combine the R/F state with the O state, or add even more states. These are all optimizations: none of them changes the basic laws, and none of them changes the guarantees provided by the MESI protocol.
I am not an expert on this, and it is quite possible that there are protocols out there which provide weaker guarantees, but if so, I have not noticed them, and I do not see any popular processors using them. So for our purposes, we really can assume that the coherency protocol keeps the caches coherent. Not mostly coherent, not "coherent except for a short while after a write", but fully coherent. At this level, the state of memory is always consistent unless the hardware is broken. In technical terms, MESI and its derivatives in principle provide full sequential consistency, the strongest ordering guarantee in the C++11 memory model. Which begs the question: why do we need weaker memory models at all, and when do they come into play?
Memory models
Different architectures provide different memory models. As of the time of writing, ARM and POWER architecture machines have comparatively weak memory models: such CPUs have considerable latitude to reorder read and write operations, which can change the semantics of programs in a multi-core environment. Programs can restrict this with memory barriers: "do not reorder operations across this boundary." By contrast, x86 has a fairly strong memory model.
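As a concrete illustration (my own C++11 example, not from the original article), here is the classic "publish a value, then set a flag" pattern; the release/acquire pair acts as the barrier that keeps the writes and reads from being reordered across it on weakly ordered CPUs:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                      // ordinary, non-atomic data
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                   // write the data first
    ready.store(true, std::memory_order_release);   // publish: earlier writes may not move past this
}

void consumer() {
    while (!ready.load(std::memory_order_acquire))  // wait for the flag: later reads may not move before this
        ;
    assert(payload == 42);   // guaranteed only because of the release/acquire pairing
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```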
I am not going to delve into the details of memory models here; that quickly degenerates into piles of jargon and exceeds the scope of this article. But I do want to say a bit about "how they happen", namely how weak memory-ordering behavior arises even though the caches themselves are fully coherent (and, per the MESI discussion, in principle sequentially consistent), and why. As usual, it all comes down to performance.
The rule is this: you get full sequential consistency if both of the following conditions hold. First, the cache responds to bus events in the same cycle it receives them. Second, the processor sends its memory operations to the cache faithfully in program order, waiting for each one to finish before issuing the next. Of course, modern processors generally do neither:
- Caches do not respond to bus events immediately. If a message arrives on the bus telling a cache to invalidate a line, but the cache is busy with something else (say, transferring data to the CPU), the message may not be processed in that cycle; instead it goes into what is called an "invalidation queue", where it sits until the cache has time to deal with it.
- Processors generally do not send memory operations to the cache in program order. This is obviously the case for processors with out-of-order execution, but even in-order processors sometimes provide weaker ordering guarantees for memory operations (for example, when the required cache line is not present, the CPU cannot simply stop all other work while the line is being loaded).
- Writes are especially tricky, because they are a two-phase operation: before a write, we first have to acquire exclusive ownership of the cache line. If we do not currently own it, we need to negotiate with the other processors, and that takes time. Again, leaving the processor idle while this happens would waste resources. Instead, a write initiates the request for exclusive ownership and then goes into a queue of so-called "store buffers" (some people use "store buffer" for the whole queue; I use it for a single entry). The write waits in the queue until the cache is ready to process it; at that point the store buffer is "drained" and the entry is recycled to handle new writes.
These features mean that, by default, reads can fetch stale data (if a matching invalidation request is still sitting in the queue), writes actually finish later than their position in the code would suggest, and everything gets fuzzier once out-of-order execution enters the picture. Going back to memory models, there are essentially two camps:
Architectures with a weak memory model do the minimum amount of work in hardware and leave it to software developers to write correct code: the reorderings and the various buffering stages are officially permitted, meaning there are no guarantees. If you need a particular result to be guaranteed, you have to insert the appropriate memory barriers yourself; they prevent reordering and wait for pending operations in the queues to complete.
Architectures with a strong memory model do a lot more bookkeeping internally. For example, x86 keeps track of all pending memory operations that are not yet fully completed ("retired"); their details are kept on the chip in a structure called the MOB ("memory ordering buffer"). As part of an architecture that supports out-of-order execution, x86 can cancel operations that have not yet retired when something goes wrong, for example a page fault or a branch misprediction. I mentioned some of the details of how this interacts with the memory subsystem in an earlier article. The gist is that x86 processors actively watch for external events (such as cache invalidations) that would retroactively invalidate the results of some operations that have already executed but not yet retired. That is, x86 knows what its memory model is supposed to look like, and when something happens that conflicts with that model, the processor rolls back to an earlier state that is consistent with it. This is the "memory ordering machine clear" I mentioned in an earlier article. The net result is that x86 processors provide very strong guarantees for memory operations, although not full sequential consistency.
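As a worked example (mine, not the author's), the classic "store buffer" litmus test below shows the kind of reordering both camps have to deal with: each thread's store can sit in a store buffer while the other thread's load executes, so without stronger ordering both threads may observe 0. Even x86 allows this particular outcome, which is exactly why it is not fully sequentially consistent.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = -1, r2 = -1;

void thread1() {
    x.store(1, std::memory_order_relaxed);
    // std::atomic_thread_fence(std::memory_order_seq_cst); // barrier: forbids r1 == r2 == 0
    r1 = y.load(std::memory_order_relaxed);
}

void thread2() {
    y.store(1, std::memory_order_relaxed);
    // std::atomic_thread_fence(std::memory_order_seq_cst); // barrier: forbids r1 == r2 == 0
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread a(thread1), b(thread2);
    a.join();
    b.join();
    // Without the fences (or seq_cst accesses), r1 == r2 == 0 is a permitted outcome.
    std::printf("r1=%d r2=%d\n", r1, r2);
}
```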
Anyway, that is enough for one article. I wanted to have this written down on my blog so that future articles can simply refer back to it. Let's see how that goes. Thanks for reading!
Original article: http://fgiesen.wordpress.com/2014/07/07/cache-coherency/