Reference: http://www.anandtech.com/show/3851/everything-you-always-wanted-to-know-about-sdram-memory-but-were-afraid-to-ask/4
The process of moving data in and out of the memory array and across the memory bus is not overly complicated, although the massive parallelization of the actual effort can make it somewhat difficult to fully envision what's really happening without some pretty concise visual aids. We'll do our best to help you out there.
Both read and write access to DDR3 SDRAM is burst oriented; access starts at a selected location and continues in a pre-programmed sequence for a Burst Length (BL) of 8 bits, or 1 byte, per bank. This begins with the registration of an ACTIVATE (ACT) command, which is followed by one or more READ or WRITE commands.
Chip Select (S0#, S1#), one for each rank, either enables (low) or disables (high) the command decoder, which works like a mask to ensure commands are acted upon by the desired rank only.
The length of each read burst (tBURST) is always 4 clocks (4T), as DDR memory transmits data at twice the host clock rate (4 clocks x 2 transfers/clock = 8 transfers, or 8 bits per bank).
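The burst arithmetic above, and the burst-chop behavior described a little further on, can be checked with a few lines of Python. This is a minimal sketch assuming two transfers per memory clock (one on each clock edge) and a fixed DDR3 burst length of 8; the function names are hypothetical helpers, not part of any real tool.

```python
# Burst timing arithmetic for DDR memory (illustrative sketch).
# Assumption: a DDR bus moves two transfers per clock, one on each edge.
DDR_TRANSFERS_PER_CLOCK = 2
DDR3_BURST_LENGTH = 8


def tburst_clocks(burst_length: int = DDR3_BURST_LENGTH,
                  transfers_per_clock: int = DDR_TRANSFERS_PER_CLOCK) -> int:
    """Clocks the data bus is busy for one burst of `burst_length` transfers."""
    if burst_length % transfers_per_clock:
        raise ValueError("burst length must be a multiple of transfers per clock")
    return burst_length // transfers_per_clock


def read_slot(burst_chop: bool = False) -> tuple[int, int]:
    """Return (transfers delivered, clocks occupied) for one DDR3 read.

    A burst chop (BC4) delivers half the data, but the strobe is simply
    masked for the second half, so the slot still lasts a full BL8 burst.
    """
    delivered = DDR3_BURST_LENGTH // 2 if burst_chop else DDR3_BURST_LENGTH
    return delivered, tburst_clocks(DDR3_BURST_LENGTH)


print(tburst_clocks())             # 4 clocks (4T) for BL8
print(read_slot(burst_chop=True))  # (4, 4): half the data, same 4T
print(tburst_clocks(4))            # 2 clocks for a DDR2-style BL4 burst
```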
The address bits registered coincident with the ACTIVATE command are used to select the bank and page (row) to be accessed. For our hypothetical 2GB DIMMs described on page 2 of this article, bank selects BA0-BA2 indicate the bank and address inputs A0-A13 indicate the page. Three bits are needed to uniquely address all eight banks; likewise, 14 bits are needed to address all 16,384 (2^14) pages.
The address bits registered coincident with the READ or WRITE command are used to select the target starting column for the burst. A0-A9 select the column starting address (2^10 = 1,024 columns). A12 is also sampled during this operation to determine whether a Burst Chop (BC) of 4 bits has been commanded (A12 high). Even though a burst chop delivers only half the data of a regular read burst, the time period to complete the transfer is still the same: 4T. The SDRAM core simply masks the outgoing data clock strobe for the second half of the full read cycle.
Figure 3. Memory read and write operations can be broken down into a series of well-defined events
During a PRECHARGE command, A10 is sampled to determine whether the precharge is intended for one bank (A10 low; bank chosen by BA0-BA2) or for all banks (A10 high).
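To make the pin assignments concrete, here is a minimal Python sketch that decodes the address inputs for each command type using the example 2GB DIMM geometry (8 banks, 16,384 pages, 1,024 columns). It models only the pins mentioned above (for example, it ignores any other role A10 plays during READ/WRITE) and assumes A12 is always sampled as described; the function names are hypothetical.

```python
# Decoding the DDR3 address pins for the example 2GB DIMM (illustrative sketch).
# Geometry taken from the text: 8 banks, 16,384 pages per bank, 1,024 columns.
NUM_BANKS = 8
NUM_PAGES = 16_384
NUM_COLUMNS = 1_024


def bits_needed(count: int) -> int:
    """Fewest address bits that can name `count` distinct items."""
    return max(1, (count - 1).bit_length())


def decode_activate(ba: int, addr: int) -> dict:
    """ACTIVATE: BA0-BA2 select the bank, A0-A13 select the page (row)."""
    return {"bank": ba & (NUM_BANKS - 1), "page": addr & (NUM_PAGES - 1)}


def decode_read_write(ba: int, addr: int) -> dict:
    """READ/WRITE: A0-A9 give the starting column; A12 high requests a burst chop."""
    return {
        "bank": ba & (NUM_BANKS - 1),
        "column": addr & (NUM_COLUMNS - 1),
        "burst_chop": bool((addr >> 12) & 1),
    }


def decode_precharge(ba: int, addr: int) -> list[int]:
    """PRECHARGE: A10 high closes all banks; A10 low closes only the BA-selected bank."""
    if (addr >> 10) & 1:
        return list(range(NUM_BANKS))
    return [ba & (NUM_BANKS - 1)]


print(bits_needed(NUM_BANKS), bits_needed(NUM_PAGES), bits_needed(NUM_COLUMNS))  # 3 14 10
print(decode_read_write(2, (1 << 12) | 40))  # column 40 of bank 2, burst chop requested
print(decode_precharge(5, 1 << 10))          # all eight banks
```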
Data input/output pins DQ0-DQ63 provide the 64-bit wide data interface between the memory controller embedded in the CPU and each DIMM. Those with a triple-channel capable CPU, like the Intel Core i7-series processor, will come to understand why the memory bus width is reported as 192 bits: three independently operated channels, each with a 64-bit interface, make 192. Those of you running a Core 2 or a Core i3/i5 will have to make do with just two channels, for a total bus width of 128 bits.
Each channel can be populated with up to two DIMMs. This means there could be a maximum of four ranks per channel, assuming we install a matched pair of dual-rank modules. Installing more than one DIMM per channel does not double the memory bus bandwidth, as modules co-located in the same channel must compete for access to a shared 64-bit sub-bus; however, adding more modules does have the added benefit of doubling the number of pages that may be open concurrently (twice the ranks for twice the fun!).
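The channel arithmetic in the last two paragraphs reduces to a few multiplications. This Python sketch assumes eight banks per rank (matching the example DIMM) and one open page per bank; the helper names are hypothetical. Note that adding DIMMs changes the open-page count but not the 64-bit width of the channel.

```python
# Channel arithmetic from the text (illustrative sketch).
DQ_WIDTH_BITS = 64      # DQ0-DQ63 per channel
BANKS_PER_RANK = 8      # assumption: eight banks per rank, as in the example DIMM


def bus_width_bits(channels: int) -> int:
    """Reported memory bus width for independently operated 64-bit channels."""
    return channels * DQ_WIDTH_BITS


def ranks_per_channel(dimms: int, ranks_per_dimm: int) -> int:
    return dimms * ranks_per_dimm


def max_open_pages(ranks: int, banks_per_rank: int = BANKS_PER_RANK) -> int:
    """Pages that may be open at once: one per bank, per rank."""
    return ranks * banks_per_rank


print(bus_width_bits(3), bus_width_bits(2))                  # 192 128
one_dimm = ranks_per_channel(1, 2)                           # one dual-rank DIMM
two_dimms = ranks_per_channel(2, 2)                          # a matched pair: 4 ranks
print(max_open_pages(one_dimm), max_open_pages(two_dimms))   # 16 32
```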
Figure 3 attempts to provide a top-down look at the minimum cycle needed to first open a page in memory and then read data from the activated page; Figure 4 shows the same, only from a much more fundamental perspective; and Figure 5 provides a detailed accounting of the timing involved.
Figure 4. Now it all makes sense! (pun intended)
In this example we assume the bank has no open page and is therefore already in the proper precharged state for a new page access command. Step 1 activates the bank and selects the page (row); Step 2 selects the column; and Step 3 bursts the data out over the memory bus. A 1-bit row address and a 2-bit column address are all we need to read any data stored in our 2 x 4-bit x 1 (bank) memory array.
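The three steps can be walked through on the toy 2 x 4-bit x 1 array in Python. The stored bit values are made up for illustration; the point is that a 1-bit row address and a 2-bit column address are enough to reach every cell.

```python
# The 2 x 4-bit x 1 (bank) toy array from the example (illustrative sketch).
# The stored bit values are made up for the example.
ARRAY = [
    [1, 0, 1, 1],  # page (row) 0
    [0, 1, 1, 0],  # page (row) 1
]

ROW_ADDRESS_BITS = (len(ARRAY) - 1).bit_length()     # 1 bit picks one of 2 rows
COL_ADDRESS_BITS = (len(ARRAY[0]) - 1).bit_length()  # 2 bits pick one of 4 columns


def read_bit(row: int, column: int) -> int:
    # Step 1 (ACTIVATE): the row address opens one page into the sense amps.
    sense_amps = list(ARRAY[row])
    # Step 2 (READ): the column address selects one bit line from the open page.
    data = sense_amps[column]
    # Step 3: the selected bit is burst out over the memory bus (returned here).
    return data


print(ROW_ADDRESS_BITS, COL_ADDRESS_BITS)  # 1 2
print(read_bit(1, 2))                      # 1
```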
An ACTIVATE command prompts the routing of the specified page address to the row decoder, where it triggers the chosen word line to appear at the input of the sense amps. As previously stated, this takes a finite amount of time: the RAS-to-CAS (or Command) Delay (tRCD) is used to program the minimum wait time the memory controller allows to elapse before it issues the next command in the sequence. Attempting to set too low a timing can lead to inconclusive operation, often resulting in data corruption and other data access issues that ultimately lead to system crashes and application errors.
Next, the column address provided with the READ command selects the right bit line, beginning the process of disregarding those bits that were not addressed. The wait associated with these events is the CAS Latency (CL or tCAS).
The sense amps work by sensing the direction of the voltage swing induced on the sense line when the word line is activated. Activating the page gates on the switching element holding back the accumulated charge in a trench filled with dielectric material, which forms the capacitive storage element of the memory cell. When this happens, the sense line, starting from VREFDQ (½ VDDQ), swings either positive or negative, depending on the potential of the sampled memory cell. An increase in voltage encodes a 1, while a decrease encodes a 0.
Figure 5. Shown here is a pair of "back-to-back" reads. Our example Row Cycle time (tRC) lets us transfer two bytes of data with a minimum page open time of 24T using CL-tRCD-tRP-tRAS timings of 6-6-6-18
The sense amps are not comparators. Rather, each sense amp interfaces with a pair of memory cells, reducing by a factor of two the total number of amplifiers that would otherwise be needed to sense the entire array.
Following the read, any charge stored in the memory cells is obliterated. This is what is meant by a destructive read: not only do the sense amps cache the page for access, they hold the only known copy of that page of memory! Precharging the bank forces the sense amps to "write" the page back to the array and prepares the sense lines for the next page access by "precharging" them to ½ VDDQ. This accomplishes two things: (1) it returns all sense rails to a known, consistent potential, and (2) it sets the pre-sense line voltage at exactly half the full-scale value of VDDQ, ensuring that, whatever the potential stored in the cell, there will be a swing in voltage when the proper word line is activated.
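The activate/read/precharge life cycle described above can be modeled as a tiny state machine. This Python sketch follows the article's description (destructive read on ACTIVATE, write-back on PRECHARGE); it assumes DDR3's nominal 1.5 V VDDQ, and the 0.1 V swing is a made-up illustrative value, not a device parameter.

```python
# Destructive read and precharge write-back (illustrative sketch).
# Assumptions: DDR3's nominal 1.5 V VDDQ; the 0.1 V swing size is made up.
VDDQ = 1.5
PRECHARGE_LEVEL = VDDQ / 2   # sense lines rest at half of VDDQ


class Bank:
    """Cells hold 0/1 (None = charge destroyed); the sense amps act as a row buffer."""

    def __init__(self, pages):
        self.cells = [list(page) for page in pages]
        self.sense_amps = None
        self.open_page = None

    def activate(self, page: int) -> None:
        if self.open_page is not None:
            raise RuntimeError("bank must be precharged before a new ACTIVATE")
        latched = []
        for bit in self.cells[page]:
            # Charge sharing pulls the line up for a 1 or down for a 0;
            # the sense amp latches the direction of that swing.
            line_voltage = PRECHARGE_LEVEL + (0.1 if bit else -0.1)
            latched.append(1 if line_voltage > PRECHARGE_LEVEL else 0)
        self.sense_amps = latched
        self.cells[page] = [None] * len(latched)  # the read was destructive
        self.open_page = page

    def read(self, column: int) -> int:
        return self.sense_amps[column]

    def precharge(self) -> None:
        if self.open_page is None:
            return  # nothing open, nothing to restore
        # Write the only valid copy back to the array, then rest the lines at VDDQ/2.
        self.cells[self.open_page] = list(self.sense_amps)
        self.sense_amps = None
        self.open_page = None


bank = Bank([[1, 0, 1, 1], [0, 1, 1, 0]])
bank.activate(0)
print(bank.cells[0], bank.read(2))  # [None, None, None, None] 1
bank.precharge()
print(bank.cells[0])                # [1, 0, 1, 1]
```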
Every read/write memory transaction can be segmented by type into one of three performance bins, depending on the status of the bank/page to be accessed. These bins, in order from best to worst, are page-hit, page-empty, and page-miss. For the most part, anything we can do to increase the number of page-hit transactions or reduce the number of page-miss transactions is a good thing.
A page-hit access is defined as any read or write operation to an open page. That is, the bank containing the open page is already active and is immediately ready to service requests. Because the target page is already open, the nominal access latency for any memory transaction falling into this category is approximately tCAS (the CAS Latency of the device).
Figure 6. Page-hit timing (with Precharge and subsequent bank access)
Figure 6 shows the minimum read latency associated with a best-case page-hit scenario. For a CAS Latency of 6T, the memory controller waits only six short clocks before the start of data return. During a read with auto-precharge, the READ command executes as normal, except that the active bank begins precharging CAS Latency (CL) clock cycles before the end of the burst. This feature allows the precharge operation to be partially or completely hidden during periods of burst read cycles, depending on CL. When tuning our systems, we always seek to set tRTP such that tRTP + tRP equals CL + tBURST for exactly this reason. Put another way, if CL and tRP are the same, set tRTP to 4T for DDR3 (2T for DDR2).
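The tuning rule above is a one-line rearrangement: tRTP = CL + tBURST − tRP. This Python sketch applies it as stated in the text; it is the article's rule of thumb, and the sketch does not check any minimum a device datasheet may impose. The function names are hypothetical.

```python
# Page-hit latency and the tRTP tuning rule from the text (illustrative sketch).
def page_hit_latency(cl: int) -> int:
    """A page hit only waits for the CAS Latency before data starts to return."""
    return cl


def trtp_to_hide_precharge(cl: int, trp: int, tburst: int) -> int:
    """Solve tRTP + tRP = CL + tBURST for tRTP (all values in clocks)."""
    trtp = cl + tburst - trp
    if trtp < 1:
        raise ValueError("timings leave no room for a positive tRTP")
    return trtp


print(page_hit_latency(6))                            # 6 clocks
print(trtp_to_hide_precharge(cl=6, trp=6, tburst=4))  # 4T for DDR3 (BL8)
print(trtp_to_hide_precharge(cl=5, trp=5, tburst=2))  # 2T for DDR2 (BL4)
```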
Sequential reads to the same page make these types of transactions even more profitable, as each successive access can be scheduled a minimum of tBURST (4T) clocks after the last. This timing is captured as the CAS-to-CAS Delay (tCCD) and is commonly referred to as "back-to-back CAS delay" (B2B), as shown in Figure 7. This feature makes possible extremely high data transfer rates for total burst lengths of one page or less; in our case, 8KB.
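A rough feel for how efficient back-to-back reads are comes from counting clocks. This Python sketch assumes a 64-bit (8-byte) channel, BL8 bursts (64 bytes per READ), tCCD equal to tBURST (4T), and that every read hits the same open 8KB page; the helper names are hypothetical.

```python
# Streaming one open page with back-to-back reads (illustrative sketch).
# Assumptions: a 64-bit (8-byte) channel, BL8 bursts, and tCCD = tBURST = 4T.
BUS_BYTES = 8
BURST_LENGTH = 8
TBURST = 4


def reads_to_stream(page_bytes: int) -> int:
    bytes_per_read = BUS_BYTES * BURST_LENGTH   # 64 bytes per burst
    return -(-page_bytes // bytes_per_read)     # ceiling division


def back_to_back_clocks(num_reads: int, cl: int, tccd: int = TBURST) -> int:
    """Clocks from the first READ command until the last burst completes,
    when every read hits the same open page and reads are issued tCCD apart."""
    return cl + (num_reads - 1) * tccd + TBURST


reads = reads_to_stream(8 * 1024)            # one 8KB page
total = back_to_back_clocks(reads, cl=6)
print(reads, total, round(reads * TBURST / total, 3))  # 128 518 0.988
```

Only the first read pays the CAS Latency; after that the data bus stays busy almost continuously, which is why page-hit streams are so valuable.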
Figure 7. Triple burst chop read with precharge and subsequent bank access
Although not ideal, a page-empty access is still preferable to a miss. In this case, the bank being accessed is idle, with no page open. Common sense tells us any attempt to read or write data to a page in this bank first requires that we activate the bank. In other words, the nominal access latency now includes the time to open the page: the RAS-to-CAS (or Command) Delay (tRCD). With tRCD equal to CL, as in our example, this doubles the minimum access latency compared to the page-hit case! Twelve cycles (tRCD + CL) now elapse before the first word is returned. Figure 8 shows this in detail.
Figure 8. Page-empty timing. Page remains open
Finally, as if the relative penalty of a page-empty access wasn't bad enough, here comes the page-miss. A miss occurs any time a memory transaction must first close an open page in order to open an alternate page in the same bank. Only then can the specified data access take place. Closing the open page requires a precharge, adding the RAS Precharge delay (tRP) to an already lengthy operation. As you can see in Figure 9, with our example timings the nominal latency of this type of access is three times that of a page-hit operation!
Figure 9. Page-miss timing. Page remains open
The relative gain/loss ratio for each access type can be quickly assessed through a cursory review of the most basic device timings. Imagine a memory kit rated for operation at DDR3-1600, 6-6-6-18 (CL-tRCD-tRP-tRAS): with nothing more, we can estimate six cycles for a page-hit access (CL), twelve cycles for a page-empty access (tRCD + CL), and eighteen cycles for a page-miss access (tRP + tRCD + CL).
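The three bins and their nominal latencies map directly onto the state of the target bank. This Python sketch classifies an access from the bank's currently open page (None for an idle bank) and applies the nominal costs listed above; it ignores any precharge that may be hidden and all other timing constraints.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Timings:
    """Clock counts in CL-tRCD-tRP-tRAS order, as printed on memory kits."""
    cl: int
    trcd: int
    trp: int
    tras: int


def classify(open_page: Optional[int], target_page: int) -> str:
    if open_page is None:
        return "page-empty"   # idle bank: ACTIVATE, then READ
    if open_page == target_page:
        return "page-hit"     # page already held by the sense amps
    return "page-miss"        # PRECHARGE, ACTIVATE, then READ


def nominal_latency(kind: str, t: Timings) -> int:
    return {
        "page-hit": t.cl,
        "page-empty": t.trcd + t.cl,
        "page-miss": t.trp + t.trcd + t.cl,
    }[kind]


ddr3_1600 = Timings(cl=6, trcd=6, trp=6, tras=18)
for open_page, target in [(7, 7), (None, 7), (3, 7)]:
    kind = classify(open_page, target)
    print(kind, nominal_latency(kind, ddr3_1600))
# page-hit 6
# page-empty 12
# page-miss 18
```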
Normalized to the page-hit access latency, a page-empty access is twice as long, and a page-miss access is a whole three times as long. If we combine this with what we know about the inner workings of the SDRAM state machine, we see that page-hit and page-miss are really just subsets of the same bank state (active). Of course, a page-empty access necessarily implies an idle bank. The following proof rewards us with some powerful insight.
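One way to reach this insight, sketched under the nominal latencies used above (page-hit: tCAS; page-empty: tRCD + tCAS; page-miss: tRP + tRCD + tCAS) and ignoring any precharge that can be hidden, is to let n be the fraction of accesses to active banks that turn out to be page hits, and ask when their average latency equals that of a page-empty access:

```latex
\begin{aligned}
\bar{t}_{\text{active}} &= n\,t_{CAS} + (1-n)\,(t_{RP} + t_{RCD} + t_{CAS})
  = t_{CAS} + (1-n)\,(t_{RP} + t_{RCD}) \\
t_{\text{empty}} &= t_{RCD} + t_{CAS} \\
\bar{t}_{\text{active}} = t_{\text{empty}}
  \quad &\Longleftrightarrow \quad (1-n)\,(t_{RP} + t_{RCD}) = t_{RCD} \\
  &\Longleftrightarrow \quad n = \frac{t_{RP}}{t_{RP} + t_{RCD}}
\end{aligned}
```

Note that tCAS cancels out. With 6-6-6-18 timings, n = 6 / (6 + 6) = 0.5.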
The variable n also represents the percentage of accesses to banks with open pages that must result in a page-hit access if we are simply to match the nominal access latency that would be achieved if every read access were to an idle bank. The only things this depends on are the RAS Precharge delay (tRP) and the RAS-to-CAS (or Command) Delay (tRCD) of the device in question.
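A quick numerical check of that claim, under the same nominal-latency assumptions as the derivation: compute the break-even hit rate from tRP and tRCD alone, then confirm that at that rate the average open-bank latency equals the page-empty latency whatever the CAS Latency. The function names are hypothetical.

```python
# Break-even page-hit rate (illustrative sketch, nominal latencies only).
def break_even_hit_rate(trp: int, trcd: int) -> float:
    """Fraction of open-page accesses that must hit to match an all-page-empty policy."""
    return trp / (trp + trcd)


def average_active_latency(n: float, cl: int, trcd: int, trp: int) -> float:
    """Mean latency when a fraction n of open-bank accesses hit and the rest miss."""
    return n * cl + (1 - n) * (trp + trcd + cl)


# CL shifts both sides equally, so only tRP and tRCD move the break-even point.
for cl, trcd, trp in [(6, 6, 6), (9, 6, 6), (6, 6, 9)]:
    n = break_even_hit_rate(trp, trcd)
    print(f"CL={cl} tRCD={trcd} tRP={trp}: n={n:.2f}, "
          f"active={average_active_latency(n, cl, trcd, trp):.1f}, empty={trcd + cl}")
```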
You would think that by working to maximize n, performance would be maximized as well. And you'd be right. Let's take what we've learned thus far and step it up a notch. We promise, after this you'll never see memory timings in the same light again.
Before proceeding, we've prepared a video for those of you who would like to view a few simple animations meant to help visualize each transaction type.
What do you mean you've never heard of Adaptive Page Management (APM) technology? Well, that must be because Intel marketing doesn't seem to feel the need to bring it up.
Simply put, Intel's APM determines, based on the potential implications of pending memory transactions, whether closing open pages or allowing them to remain open longer is beneficial to overall memory performance. In response, the memory controller may (or may not) elect to issue commands to close pages, depending on the programmed operation.
Figure 10 provides the general flow of events required to manage such a process. In our explanation we intend to introduce the known register settings needed to adjust the functional control policy, but first we need to detail the necessary actions and the purpose of the design elements involved. A better understanding of the underlying logic will pay dividends as you attempt to dial in measurable performance improvements through experimentation.
Per Figure 10, the Transaction Queue stores memory transactions generated by the processor. Unlike a typical first-in, first-out (FIFO) queue, with a tail into which memory transactions are pushed and a head from which they are popped, this transaction queue is a plurality of storage elements that allows individual memory transactions to be removed from the list and dispatched toward the memory in a different order than the one in which they were originally added to the queue.
Figure 10. Generic method used by the memory controller to adaptively generate page close messages. Different system usage patterns would most likely necessitate changes to the base decision logic
Command re-ordering can improve perceived memory performance by grouping together reads/writes to a common physical page in memory, saving the time that would otherwise be needed to later re-open the same page should a concurrent access to the same bank force it to close early. After all, the minimum delay between sequential accesses to the same open page is equal to the CAS Latency (CL or tCAS) of the device. Accessing a bank (opening a page) increases the latency of the post-interleaved operation by the RAS-to-CAS (or Command) Delay (tRCD), approximately doubling the effective data access time.
One should also appreciate that there are varying degrees of freedom when shuffling transactions in time. Take the case of a read and a write to the same memory location: the memory controller is not allowed to move the dependent read either ahead of or behind the associated write, as the ordering must be maintained or coherency would be lost.
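The sketch below illustrates both ideas with a deliberately simple stand-in policy (group queued requests by bank and page, oldest group first). It is not Intel's scheduling logic, and it ignores fairness, starvation, and other limits a real controller would enforce. Because Python's `sorted` is stable and requests to the same location always share a page, a read never swaps places with the write it depends on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    op: str      # "R" or "W"
    bank: int
    page: int
    column: int


def group_by_page(queue: list[Request]) -> list[Request]:
    """Issue requests to the same (bank, page) back to back.

    Groups keep the position of their oldest request, and the sort is stable,
    so requests to the same location keep their original relative order.
    """
    first_seen: dict[tuple[int, int], int] = {}
    for index, req in enumerate(queue):
        first_seen.setdefault((req.bank, req.page), index)
    return sorted(queue, key=lambda req: first_seen[(req.bank, req.page)])


def activations(sequence: list[Request]) -> int:
    """Count ACTIVATEs needed, starting with every bank idle (one open page per bank)."""
    open_pages: dict[int, int] = {}
    count = 0
    for req in sequence:
        if open_pages.get(req.bank) != req.page:
            open_pages[req.bank] = req.page
            count += 1
    return count


queue = [
    Request("R", bank=0, page=1, column=0),
    Request("R", bank=0, page=2, column=0),
    Request("W", bank=0, page=1, column=3),
    Request("R", bank=0, page=1, column=3),  # depends on the write above
]
reordered = group_by_page(queue)
print([(r.op, r.page, r.column) for r in reordered])
# [('R', 1, 0), ('W', 1, 3), ('R', 1, 3), ('R', 2, 0)]
print(activations(queue), activations(reordered))  # 3 2
```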
The Address Decoder partially decodes the memory transactions stored in the Transaction Queue as needed to determine the bank and page selected by each queued request. From there, the bank select messages control the multiplexers used to route the contents of a bank register to a comparator, which checks whether the selected page is also the most recently opened page for that bank (as such, each bank register is large enough to store n bits, where each bank comprises 2^n pages). A match results in the creation of a page-hit result message.
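In software terms, the bank registers and comparator behave like a small per-bank lookup. This Python sketch uses a hypothetical address map (low 14 bits select the page, the next 3 bits the bank); real controllers use their own, generally more complex, mappings. With 2^14 pages per bank, each register needs 14 bits.

```python
# Bank registers and page-hit comparator (illustrative sketch).
# Hypothetical address map: low 14 bits -> page, next 3 bits -> bank.
PAGE_BITS = 14   # each bank holds 2**14 pages, so each register holds 14 bits
NUM_BANKS = 8


class PageStateLogic:
    def __init__(self) -> None:
        # One register per bank with the most recently opened page (None = none yet).
        self.bank_registers: list = [None] * NUM_BANKS

    @staticmethod
    def decode(address: int) -> tuple[int, int]:
        """Partially decode an address into (bank, page) under the assumed map."""
        page = address & ((1 << PAGE_BITS) - 1)
        bank = (address >> PAGE_BITS) & (NUM_BANKS - 1)
        return bank, page

    def page_hit(self, address: int) -> bool:
        bank, page = self.decode(address)          # bank select drives the multiplexer
        return self.bank_registers[bank] == page   # comparator output

    def record_open(self, address: int) -> None:
        bank, page = self.decode(address)
        self.bank_registers[bank] = page


logic = PageStateLogic()
addr = (3 << PAGE_BITS) | 100                          # bank 3, page 100
print(logic.page_hit(addr))                            # False: nothing recorded yet
logic.record_open(addr)
print(logic.page_hit(addr), logic.page_hit(addr + 1))  # True False
```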
Figure 11. Our ASUS Rampage III Extreme beta BIOS includes settings used to establish the boundary regions that define when each pre-programmed algorithm is active, the operating frequency of the policy adaptation feedback loop, and the maximum single-instance lifetime for each decision to allow a page to idle open just a little longer
Triggered by the Page State Logic, the Scheduler fetches pre-identified queued memory transactions for re-ordering based on the memory selects (both bank and page) and the associated page-hit results. An array of bank state registers tracks the actions performed upon each bank by storing a state word indicating, among other things, whether the Adaptive Page Close Logic decided to close the bank in response to a previous memory transaction to the same bank.
Finally, based on the policy instantiated by the Algorithm Selector, a page-close message either is or is not generated, based on the same page-hit results, bank state registers, and bank/page selects, in an effort to increase the number of subsequent page-hit accesses and/or decrease the number of page-miss accesses.
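As a concrete, much-simplified illustration of such a decision, here is a hypothetical static page-close policy in Python. It is not Intel's algorithm; it only shows how queued page-hit information for a bank can drive a close/keep-open choice. The last case is exactly the speculative "let it ride" guess that an adaptive policy would tune.

```python
# A static page-close decision (hypothetical policy, for illustration only).
def should_close(bank: int, open_page: int, pending: list[tuple[int, int]]) -> bool:
    """Decide whether to close `open_page` in `bank`.

    pending: (bank, page) pairs still waiting in the transaction queue.
    - A queued request will hit the open page: keep it open (future page-hit).
    - Only other pages of this bank are queued: close now, so the next access
      pays a page-empty penalty instead of a page-miss.
    - Nothing queued for this bank: leave it open (the speculative guess).
    """
    same_bank_pages = [page for b, page in pending if b == bank]
    if open_page in same_bank_pages:
        return False
    return bool(same_bank_pages)


print(should_close(0, 5, [(0, 5), (0, 7)]))  # False: a page hit is still queued
print(should_close(0, 5, [(0, 7)]))          # True: only a conflicting page is queued
print(should_close(0, 5, [(1, 7)]))          # False: no queued work for bank 0
```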
An immediate and tangible gain is achieved for every successfully re-ordered transaction, as a page-hit access is more efficient than a page-empty or, at worst, a page-miss. This has always been the case with Core i7 and is one of the architecture's well-known shining points. Switch off Adaptive Page Management (disable Adaptive Page Closing in the BIOS) and this is where the process ends. The page may stay open for some finite time or it may be closed right away; we're not sure, as there's really no way to know without some inside help.
The Adaptive Page Close Logic must now decide whether to collect all winnings and close the page, or let it ride and leave it open just a while longer. While another page-hit access may yield further gains, "guessing" wrong would cause a costly page-miss access in place of what would have been just a page-empty access. If only there were some way the system could measure the effectiveness of previous close decisions and then adjust policy to fit...
Surprise! The Page Manager, made up of the Page State Logic, the Adaptive Page Close Logic, and the Scheduler, does exactly this. How this effectiveness is measured, and how the result of that evaluation is used to adapt the decision-making process, is our next topic of discussion.
DRAM Memory Introduction (II)