Deep understanding of Intel Core microarchitecture



The Core 2 cache hierarchy

The level 1 cache is split into a 32 KB L1 instruction cache and a 32 KB L1 data cache. Both are 8-way set associative, write-back, with 64-byte lines. Each core has its own private L1 cache; the L2 cache and the bus interface are shared between the two cores. The L2 cache is 16-way set associative, also with 64-byte lines, and the data path between the L2 and L1 caches is 256 bits wide. The two cores' L1D caches can transfer data to each other. The L1 cache has several hardware prefetchers for data and instructions; the L2 prefetcher works from the access patterns and density observed by the L1 prefetchers, and an improved round-robin algorithm dynamically allocates L2 bandwidth between the two cores. The front-side bus interface uses a similar scheme to keep the cores balanced. L1 and L2 are accessed independently: a core can fetch data directly from L2 or from main memory without passing it up level by level. Intel's caches use a mainly-inclusive design, in contrast with AMD's exclusive design.

Alignment of data and instruction addresses matters on modern microarchitectures. When loading or storing N bytes (N = 2^m), the access completes in one operation if the address is an integer multiple of N; if the address modulo N is not 0, two or more clock cycles are needed, and the penalty is most serious when the access crosses a cache-line boundary. For example, reading a 4-byte int is efficient when the variable's physical address (note: physical, not linear) is an integer multiple of 4, because the whole value lies within one buffer. Aligning data on its natural size also reduces the probability of crossing a cache line to a minimum (the working principle of the Core microarchitecture's cache is discussed in detail later).

Since alignment is so important, how do we place data at aligned addresses? For global variables we can use __declspec(align(N)), or Intel-defined types such as __m128. For example:

    typedef __declspec(align(16)) int a16int;
    a16int i;

The address of the variable i is now 16-byte aligned.

Variables on the stack are more complicated. We can still use __declspec(align(N)) to align them, but note that the Microsoft Visual C++ compiler's default stack alignment is 4 bytes, although the official documentation says the compiler intelligently analyzes the data layout and performs any necessary alignment conversion. The default stack alignment is rather strange; my tests show it depends on the optimization options. In fact, the order of variables on the stack usually differs from the order in which they are defined, and struct byte alignment is not guaranteed; a double is 8-byte aligned only if __declspec(align(N)) is added manually. With /O2, my tests show the stack is naturally aligned, but this is only my personal observation; I found no documentation clearly stating that /O2 guarantees natural data alignment, so if alignment is explicitly required, add an explicit alignment attribute.

For dynamic memory, use _aligned_malloc or _mm_malloc to obtain aligned allocations. Plain malloc returns 8-byte-aligned blocks only when the requested size is at least 8 bytes; a smaller request, for example 7 bytes, may be aligned to only 4 bytes.
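Below is a minimal sketch of these techniques, assuming the MSVC toolchain (for __declspec and _aligned_malloc) and the SSE intrinsics header (for _mm_malloc); it demonstrates usage only, not a recommended allocation strategy.

    #include <cstdio>
    #include <malloc.h>     // _aligned_malloc / _aligned_free (MSVC CRT)
    #include <xmmintrin.h>  // _mm_malloc / _mm_free

    // Global with guaranteed 16-byte alignment (MSVC syntax).
    __declspec(align(16)) static int g_buffer[4];

    int main() {
        // Stack variable: request the alignment explicitly, since default
        // stack alignment is not guaranteed beyond 4 bytes.
        __declspec(align(16)) int local[4];

        // Heap: both allocators take (size, alignment).
        void* p1 = _aligned_malloc(64, 16);
        void* p2 = _mm_malloc(64, 16);

        std::printf("global %p stack %p heap %p / %p\n",
                    (void*)g_buffer, (void*)local, p1, p2);

        _aligned_free(p1);  // each allocator has its matching free
        _mm_free(p2);
        return 0;
    }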
Pentium Pro, 2 and 3 pipeline: instruction fetch

The fetch unit reads 16 bytes at a time, from a 16-byte-aligned address in the code cache, into a 32-byte buffer. The main purpose of this buffer is to allow decoding of instructions that span a 16-byte boundary. Code is then passed from the 32-byte buffer to the decoders in blocks called ifetch blocks, each at most 16 bytes. In most cases the fetch unit starts an ifetch block at an instruction boundary rather than at a 16-byte boundary; but to do so it must know where the instruction boundaries are, and when it does not, it starts the ifetch block at a 16-byte-aligned boundary.

The 32-byte buffer is not large enough to hold the code blocks both before and after a jump, so jumps incur a delay. If a jump instruction within an ifetch block spans a 16-byte boundary, the 32-byte buffer must hold two 16-byte chunks before the jump can be decoded, and during that clock cycle no new code block can be loaded into the buffer. If the first instruction at the jump target also spans a 16-byte boundary in memory, another cycle is lost, because that instruction can be decoded only after both of its chunks have arrived (the delay stalls the entire pipeline, because the fetch unit feeds everything behind it). The fetch unit can take 16 bytes per clock cycle, so if an ifetch block needs more than one cycle to decode, the spare cycles can be used to fetch ahead, which may offset earlier delays.

After a jump, the 32-byte buffer holds only 16 bytes, so the first ifetch block after a jump is the 16-byte-aligned block containing the target, rather than a block starting at the target instruction; only when two 16-byte chunks are available can the buffer deliver a block that starts at the first instruction after the jump. The buffer takes only 16 bytes from the 16-byte-aligned address after a jump because no instruction-length information is yet available there. So the first instruction after a jump generally does not start at the beginning of its ifetch block, and if the jump target's code spans a 16-byte boundary, the 16 bytes at the higher address must also arrive before a complete ifetch block can be produced; the buffer can only deliver it once it has been filled.

Instruction lengths range from 1 to 15 bytes, so we cannot assume that a 16-byte block contains a whole number of instructions: very likely the first or last instruction spans a 16-byte boundary and extends into the preceding or following 16-byte chunk. The fetch unit must therefore know where the last instruction in the current 16-byte block ends, so that it can generate the next ifetch block correctly; the trailing incomplete instruction is not decoded with the current block, so it costs no decode cycle there. This length information is produced by the instruction length decoder, located in IFU2, the second stage of the instruction fetch unit, which can determine the lengths of three instructions per clock cycle. For example, if an ifetch block contains 10 instructions, three clock cycles must pass before the next ifetch block can be generated.
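A tiny helper illustrating the boundary arithmetic involved (my own illustrative code, not from the article): an instruction starting at address addr with length len spans a 16-byte fetch boundary exactly when its first and last bytes fall in different 16-byte chunks.

    #include <cstdint>
    #include <cstdio>

    // True if the bytes [addr, addr + len) straddle a 16-byte boundary.
    bool crosses16(uint32_t addr, uint32_t len) {
        return (addr / 16) != ((addr + len - 1) / 16);
    }

    int main() {
        // A 5-byte instruction at 100Ch reaches 1010h: it crosses.
        std::printf("%d\n", crosses16(0x100C, 5));  // prints 1
        // The same instruction at 1000h stays inside one chunk.
        std::printf("%d\n", crosses16(0x1000, 5));  // prints 0
        return 0;
    }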
The decode unit

The instruction length decoder is a highly serial unit: before the start of the next instruction can be known, the ends of all preceding instructions must be found in sequence. This seriously limits the parallelism of a design that otherwise tries to fetch several instructions at a time, decode several at a time, and execute several uops at a time for the best possible speed. A simple instruction length decoder can find only one instruction length per cycle, but the PPro microarchitecture determines the lengths of three instructions per cycle and even feeds the results back to the fetch unit in the same cycle, so that the fetch unit can correctly generate ifetch blocks from the 32-byte buffer; a remarkable feat, and I believe it does this by tentatively decoding lengths at all 16 possible start positions in parallel.

When an ifetch block has been generated and sent to the decode unit, the instructions are translated into micro-operations (uops). Three decoders work in parallel, so up to three instructions can be decoded per clock cycle, and the instructions (up to 3) decoded together are called a decode group. The three decoders are named D0, D1 and D2. D0 can decode an instruction generating up to four uops in one clock cycle; D1 and D2 are weaker and can handle only instructions that generate a single uop and are no longer than 8 bytes. The first instruction in an ifetch block is always dispatched to D0; the following instructions go to D1 or D2 if those decoders can handle them, otherwise they must wait for D0 to finish its current work. For example:

    mov [esi], eax   ; 2 uops
    add ebx, [edi]   ; 2 uops
    sub eax, 1       ; 1 uop
    cmp ebx, ecx     ; 1 uop
    je  L1           ; 1 uop

The first instruction is dispatched to D0. The second instruction finds no decoder available, because it generates two uops and therefore also needs D0, so it is delayed one cycle until D0 has finished decoding the first instruction. In the second clock cycle, I2, I3 and I4 are dispatched to D0, D1 and D2, and in the third cycle I5 goes to D0. The best instruction combination on PPro is therefore one instruction of 2-4 uops followed by two 1-uop instructions, the so-called 4-1-1 pattern. Instructions of more than 4 uops must be decoded by D0 over two or more clock cycles, during which no other instructions can be decoded in parallel.

The most complicated issue is the block-boundary problem: the first instruction in an ifetch block always enters D0. If the code follows the 4-1-1 pattern, but an ifetch block unluckily begins at one of the 1-uop instructions, the 4-1-1 phase is destroyed: decoding a 1-4-1 group always takes two clock cycles, and if the stream keeps getting cut as 1-4-1, 1-4-1, 1-4-1, one cycle is lost every time. Simply shifting the pattern back by one instruction would fix it, but the fetch unit has absolutely no idea how many uops each instruction generates; I believe that is known only two stages later. The problem is very hard to solve precisely because the instruction boundaries cannot be seen in terms of uop counts. The best approach is to organize the code so that the decoders produce an average of three uops per clock cycle, since the RAT and RRF stages of the pipeline can process at most three uops per cycle. If we organize the code in the 4-1-1 pattern, the possible decode-group phases give: 1-1-4: 3 uops/c; 4-1-1: 6 uops/c; 1-4-1: 3 uops/c.
Averaged over these phases we reach 4 uops per cycle, which compensates for the one-clock delay at an ifetch boundary and maintains the decoders' sustainable throughput of 3 uops per cycle. Another approach is to fit as many instructions as possible into each ifetch block, so that 16-byte boundaries interrupt decoding less often; for example, using a pointer instead of an absolute address shortens the instruction.

Example 5.2. Instruction fetch blocks

    Address   Instruction                Length  uops  Expected decoder
    1000h         mov ecx, 1000             5     1    D0
    1005h   LL:   mov [esi], eax            2     2    D0
    1007h         mov [mem], 0             10     2    D0
    1011h         lea ebx, [eax+200]        6     1    D1
    1017h         mov byte ptr [esi], 0     3     2    D0
    101Ah         bsr edx, eax              3     2    D0
    101Dh         mov byte ptr [esi+1], 0   4     2    D0
    1021h         dec edx                   1     1    D1
    1022h         jnz LL                    2     1    D2

Suppose the first ifetch block starts at 1000h. It then ends before 1010h and contains an incomplete "mov [mem], 0" instruction, so the second ifetch block must start at that instruction, and the blocks come out as:

    ifetch1: 1000h ~ 100Fh = 16 bytes
    ifetch2: 1007h ~ 1016h = 16 bytes

The second ifetch block ends exactly at the end of the lea instruction, so the third fetch block is:

    ifetch3: 1017h ~ 1022h = 11 bytes

Now all the instructions have been fetched; let us count the clock cycles needed to decode the first loop iteration. The jump target LL is at 1005h, which does not span a 16-byte boundary, so if the branch prediction is correct, generation of the ifetch blocks is not delayed. The first ifetch block after the jump runs from 1005h to 1014h, and clearly contains an incomplete lea instruction at its end. Decoding this block takes two clock cycles, because both 2-uop instructions must wait for D0. The second ifetch block runs from 1011h to 1020h and needs four clock cycles to decode its four instructions, which is painful: the lea instruction is forced into D0 because it is the first instruction of the ifetch block, and each of the three 2-uop instructions after it also needs D0. The third ifetch block runs from 1021h to 1023h and needs one decode cycle, with its two instructions decoded in D0 and D1 respectively. The first loop iteration therefore takes a total of 7 decode cycles.
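To make the D0/D1/D2 scheduling rule concrete, here is a small simulator of my own (not from the article), which counts decode cycles for a stream of instructions described only by their uop counts. It assumes every instruction generates at most 4 uops and ignores the 8-byte length limit on D1/D2.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Simplified PPro decode model: each cycle, D0 takes the next
    // instruction whatever its uop count, then D1 and D2 each take one
    // following instruction only if it generates exactly 1 uop.
    int decodeCycles(const std::vector<int>& uops) {
        int cycles = 0;
        for (std::size_t i = 0; i < uops.size(); ) {
            ++cycles;
            ++i;                                                   // D0
            for (int d = 0; d < 2 && i < uops.size() && uops[i] == 1; ++d)
                ++i;                                               // D1 / D2
        }
        return cycles;
    }

    int main() {
        // The example above: 2,2,1,1,1 decodes in three cycles:
        // {2} / {2,1,1} / {1}.
        std::printf("%d\n", decodeCycles({2, 2, 1, 1, 1}));  // 3
        // A 4-1-1 group decodes in one cycle (6 uops/c)...
        std::printf("%d\n", decodeCycles({4, 1, 1}));        // 1
        // ...but the same instructions phased as 1-4-1 need two (3 uops/c).
        std::printf("%d\n", decodeCycles({1, 4, 1}));        // 2
        return 0;
    }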

An instruction prefix can also delay the decode unit:

1. If the instruction contains a 16- or 32-bit immediate, an operand-size prefix can cause several clocks of delay, because the prefix changes the length of the immediate and thereby the length of the instruction.

2. An address-size prefix delays decoding because it changes the interpretation of the instruction's mod-R/M byte. Instructions with implicit memory operands, such as the string instructions, are not delayed, however.

3. Segment-override prefixes cause no decoding penalty.

4. Repeat prefixes and the LOCK prefix do not delay decoding.

5. An instruction with more than one prefix is usually delayed, typically by one clock cycle per extra prefix.
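To see why an operand-size prefix is "length-changing" (case 1 above), compare the raw x86 encodings below, shown as byte arrays; this illustration is mine, not the article's.

    #include <cstdint>
    #include <cstdio>

    int main() {
        // mov eax, 12345678h : opcode B8 + 4-byte immediate = 5 bytes.
        const uint8_t mov_eax[] = { 0xB8, 0x78, 0x56, 0x34, 0x12 };
        // mov ax, 1234h : the 66h operand-size prefix shrinks the
        // immediate to 2 bytes, changing the total instruction length;
        // the length decoder discovers this only after parsing the prefix.
        const uint8_t mov_ax[] = { 0x66, 0xB8, 0x34, 0x12 };
        std::printf("%zu bytes vs %zu bytes\n",
                    sizeof(mov_eax), sizeof(mov_ax));
        return 0;
    }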

Register renaming

The register alias table (RAT) controls register renaming. Decoded uops pass through a queue into the RAT, then on to the ROB stage, and finally to the reservation station (RS). The RAT can process three uops per clock cycle, which means the whole pipeline cannot sustain more than 3 uops per cycle on average. There is essentially no limit on the number of registers that can be renamed: in general three registers can be renamed in one clock cycle, and the same register can even be renamed three times.

The principle of register renaming is simple. Although the number of registers a program can use directly is very limited, the microprocessor contains a large number of hidden registers, and the CPU can substitute them for the logical registers that appear in the program, so that uops can run in parallel to the greatest extent possible. Each time the program modifies a logical register, the processor allocates a fresh hidden register to that logical register. This stage also computes the addresses of relative jumps and returns the results to BTB0 for later use. The so-called hidden registers may well be the ROB entries themselves.
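A toy model of this idea, entirely my own sketch and not from the article: each write to a logical register allocates a fresh physical slot, so two unrelated uses of the same register no longer serialize.

    #include <cstdio>
    #include <map>
    #include <string>

    // Toy register renamer: every write to a logical register gets a
    // fresh physical slot, eliminating false dependencies between
    // independent uses of the same architectural register.
    struct Renamer {
        std::map<std::string, int> alias;  // logical -> physical slot
        int next = 0;
        int write(const std::string& r) { return alias[r] = next++; }
        int read(const std::string& r)  { return alias.count(r) ? alias[r] : -1; }
    };

    int main() {
        Renamer rat;
        // mov eax, [mem1] / add ebx, eax  -- first use of eax
        int eax1 = rat.write("eax");
        // mov eax, [mem2] / add ecx, eax  -- eax reused for unrelated data
        int eax2 = rat.write("eax");
        // The two chains now target different physical slots and can
        // execute in parallel.
        std::printf("eax renamed to p%d, then to p%d\n", eax1, eax2);
        return 0;
    }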

Reorder buffer read

When a hidden register has been renamed to stand in for a general-purpose register, the value of the general-purpose register must be written into the renamed hidden register. If the values are available, they are stored in the ROB entry. Each ROB entry can have up to two input registers and two output registers. The value of an input register comes from one of three places:

1. The value is available in the permanent register file: it is read directly into the ROB entry.

2. The value has been modified, but the modifying uop has not yet retired, i.e. has not been written back to the permanent register file: the value is read from the corresponding not-yet-retired ROB entry into the new ROB entry.

3. The value is not yet available because the uop it depends on has not executed: the entry can only wait, and the value is written into the ROB entry as soon as it becomes available.

Case 1 looks the simplest and least problematic, yet strangely it is the only one that can stall the ROB-read stage, because the permanent register file has only two read ports. The ROB-read stage can receive three uops from the RAT per clock, and each uop can have two input registers, so up to six registers may wait for input in a single clock cycle, and the ROB-read stage may need three clock cycles to complete the reads. When the queue between the RAT and the decoders fills up, the decoders and the fetch unit must stall as well; and since that queue holds only about ten entries, it fills up immediately.

By contrast, case 3 causes no latency in the ROB-read stage, and a register that has not yet passed the ROB write-back stage also causes no delay when read. A register needs at least three clock cycles to pass through renaming, ROB read and ROB write-back, so if a register is written by a uop in one triplet, reading it in the next three triplets costs nothing. If write-back is delayed by reordering, slow instructions, dependency chains, cache misses or anything else, there is correspondingly more time in which the register can be read for free.

Do not confuse decode groups with uop triplets. A decode group can generate one to six uops, and even if a group decodes into exactly three uops, there is no guarantee those three are transmitted to the RAT together. Moreover, the queue buffer between the decoders and the RAT is very short; we cannot assume that register-read delays never block the decoders, or that fluctuations in the decoders' uop flow never starve the RAT. Unless the queue between decoder and RAT is empty, it is difficult to predict which uops go through the RAT stage together. uops generated by the same instruction are not necessarily sent to the RAT together; the RAT simply takes three at a time from the queue. A predicted jump does not break up the queue; only a misprediction discards the uops in the queue, and translation starts over from the beginning, so three consecutive uops then enter the RAT together. Register-read stalls can be monitored with performance counter 0A2h; unfortunately, this stall condition cannot be separated from other stall conditions on that counter.
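A back-of-the-envelope helper for the two-read-port limit (my own sketch, not the article's): count the distinct registers in a triplet that must come from the permanent register file and divide by the two ports.

    #include <cstdio>
    #include <set>
    #include <string>
    #include <vector>

    // Cycles spent in ROB-read for one triplet, given the input registers
    // that must be fetched from the permanent register file (values still
    // in flight in the ROB are delivered without using the read ports).
    int robReadCycles(const std::vector<std::string>& prfReads) {
        std::set<std::string> distinct(prfReads.begin(), prfReads.end());
        int n = static_cast<int>(distinct.size());
        int cycles = (n + 1) / 2;          // two PRF read ports per cycle
        return cycles > 0 ? cycles : 1;    // the stage itself takes a cycle
    }

    int main() {
        // Four distinct registers / two ports = two cycles instead of one.
        std::printf("%d\n", robReadCycles({"eax", "ebx", "ecx", "edx"})); // 2
        // Two distinct registers fit through the ports in one cycle.
        std::printf("%d\n", robReadCycles({"eax", "ebx", "eax"}));        // 1
        return 0;
    }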
If three consecutive uops read more than two different registers, you certainly do not want them to enter the RAT stage together; the probability that they do is 1/3, and when it happens, one cycle is lost.

Out-of-order execution

The ROB holds 40 uops and 40 temporary registers, while the reservation station holds 20 uops. A uop is executed once its operands are ready and an execution unit is idle. Writes to memory cannot be executed out of order; see the discussion of speculative execution.

The PM pipeline

PM here is short for the Pentium M line, including the single-core and dual-core Core parts, but not Core 2. The PM architecture descends from PPro, P2 and P3. The main pipeline stages are: branch prediction, instruction fetch, decode, register renaming, filling the reorder buffer, the uop reservation station, out-of-order execution, writing results back to the reorder buffer, and retirement. Intel has not published the details of the PM pipeline, saying only that it is longer than PPro's, so the following conclusions are all Agner Fog's personal measurements and guesses. Judging by the delay after a branch misprediction, the whole PM pipeline is probably three to four stages longer than PPro's. Branch prediction in PM seems more complex than in PPro, taking about three stages, one more than PPro; the fetch unit is also more complex, since a jump across a 16-byte boundary no longer carries any penalty, so the IFU is estimated at three to four stages. There is a new stack-engine stage, probably added after decoding, because a stack-synchronization uop can be inserted after D1/D2, which generate only one uop, without any extra time. The other addition is uop fusion, which does not seem to require extra stages to split the fused uops; my guess is that the two share a single ROB entry that can be dispatched to two different execution ports, so there may be no need to re-merge the separated uops before retirement.

Core 2's powerful debut: instruction fetch and predecoding

Core 2 adds a queue between the branch predictor and instruction prefetch, mainly to hide the various delays of predicted branches. The instruction decode unit is split into a predecode unit and a decode unit. The predecoder detects where each instruction in an ifetch block starts and marks the instruction prefixes and other components. Its bandwidth is 16 bytes or 6 instructions per clock cycle, whichever is smaller; the other units in the pipeline usually handle 4 instructions per cycle, or 5 when macro-fusion occurs. Obviously, when 16 bytes contain fewer than four instructions, the predecoder is the bottleneck. Worse, the predecoder does not fetch the next 16 bytes until it has finished processing the current 16: if the first 16 bytes contain 7 instructions, 6 are processed in the first cycle and the remaining 1 in the second, and only then does it move on to the next 16 bytes. Any instruction that spans a 16-byte boundary is left to the next 16-byte block.

The loop buffer. The instruction queue in Core 2 can serve as a 64-byte circular code buffer, in which already-predecoded code can be reused by the decoders.
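A quick model of the predecoder bottleneck (my own sketch, not the article's): each 16-byte block costs ceil(n/6) cycles for its n instructions, and a new block starts only when the previous one is finished.

    #include <cstdio>
    #include <vector>

    // Cycles the Core 2 predecoder needs for a run of 16-byte blocks,
    // given the instruction count of each block: at most 6 instructions
    // per cycle, and no block is started before the previous one is done.
    int predecodeCycles(const std::vector<int>& insnsPerBlock) {
        int cycles = 0;
        for (int n : insnsPerBlock)
            cycles += (n + 5) / 6;  // ceil(n / 6)
        return cycles;
    }

    int main() {
        // 7 instructions in 16 bytes: 6 in the first cycle, 1 in the
        // second; two such blocks cost 4 cycles.
        std::printf("%d\n", predecodeCycles({7, 7}));    // 4
        // 4 instructions per block keep pace with the rest of the pipeline.
        std::printf("%d\n", predecodeCycles({4, 4, 4})); // 3
        return 0;
    }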
Therefore, if a loop's code fits within 64 bytes, it does not need to be predecoded again. The 64-byte buffer works like an L0 code cache and is organized as 4 lines of 16 bytes.

The decoders. Core 2 has four decoders. The first decoder can decode an instruction generating up to four uops in one cycle; the other three can handle only single-uop instructions. An instruction that generates more than four uops must be decoded by D0 using the microcode ROM over multiple cycles. The decoders can read two 16-byte blocks from the 64-byte buffer in one clock cycle, so in principle 32 bytes can be decoded at once; but since the predecoder's throughput is at most 16 bytes per cycle, the decoders can process 16 bytes in a cycle only if fewer than 16 bytes were processed in the previous cycle, leaving a backlog. The full 32 bytes per cycle can be reached only in a small loop held in the loop buffer, because there the predecode information is reused.

Intel's earlier processors could decode instructions with prefixes only at a limited rate, but this restriction is removed in Core 2. The only limit is that the instruction plus its prefixes cannot exceed 15 bytes; any decoder can decode an instruction with multiple prefixes within one cycle (subject, of course, to the uop-count rules). No instruction needs anything like 14 prefixes, but redundant prefixes can be used instead of NOPs as padding, for example to align a loop entry to 16 bytes.

uop fusion. Some instructions are decoded into two uops. With micro-op fusion, the two uops can be fused into one, which reduces the consumption of internal bandwidth; the dispatcher still treats the fused pair as two uops and can send them to different execution ports, but it remains a single uop elsewhere. There are two cases of micro-op fusion: read-modify fusion and store fusion. For example, add eax, [mem] consists of two uops, one reading memory and one performing the addition, and the two can be fused. mov [esi+edi], eax, a store, also consists of two uops, one computing the store address and one writing the data, and they can be fused as well. Core 2 fuses more uops than PM: for example, a read-modify-write instruction can use both kinds of fusion at once, and most XMM instructions can be fused. A fused uop has three input dependencies, while an ordinary uop has only two. But why does a store take two uops while a load takes only one?
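A small bookkeeping sketch (mine, not the article's) that counts uops in the fused and unfused domains for a short sequence, assuming read-modify and store instructions each decode into two uops that fuse into one:

    #include <cstdio>
    #include <vector>

    // One decoded instruction: how many uops it generates and whether
    // they fuse into a single slot in the fused domain.
    struct Insn {
        const char* text;
        int uops;    // uops in the unfused (execution) domain
        bool fused;  // true if they share one fused-domain slot
    };

    int main() {
        std::vector<Insn> seq = {
            {"add eax, [mem]",     2, true },  // read-modify: load + add
            {"mov [esi+edi], eax", 2, true },  // store: address + data
            {"inc ecx",            1, false},
        };
        int fusedDomain = 0, unfusedDomain = 0;
        for (const Insn& i : seq) {
            fusedDomain   += i.fused ? 1 : i.uops;  // decode/retire bandwidth
            unfusedDomain += i.uops;                // execution-port bandwidth
        }
        // 3 uops flow through decode and retirement, but 5 are
        // dispatched to the execution ports.
        std::printf("fused=%d unfused=%d\n", fusedDomain, unfusedDomain);
        return 0;
    }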
Some notes on #pragma pack(N) and __declspec(align(N)). pack is similar in concept to align: align specifies the boundary data must be aligned to, while pack specifies the maximum amount of padding the compiler may insert to achieve alignment. Note that pack affects only structure members. Consider the following example:

    #pragma pack(4)
    struct A { char c; double d1; short s; };

First, what MSDN says:

"Unless overridden with __declspec(align(#)), the alignment of a scalar structure member is the minimum of its size and the current packing."

That is, without __declspec(align(#)), a member of the struct is aligned to the smaller of its own size and the pack value.

"Unless overridden with __declspec(align(#)), the alignment of a structure is the maximum of the individual alignments of its member(s)."

That is, without __declspec(align(#)), the alignment of the struct itself is the largest alignment among all its members.

"A structure member is placed at an offset from the beginning of its parent structure which is the smallest multiple of its alignment greater than or equal to the offset of the end of the previous member."

That is, each member is placed at the smallest multiple of its alignment that is at or beyond the end of the previous member.

"The size of a structure is the smallest multiple of its alignment greater than or equal to the offset of the end of its last member."

That is, the size of the struct is the smallest multiple of its alignment that is greater than or equal to the offset of the end of its last member.

Applying this to the example: member c sits at offset 0, and d1, being 8 bytes, would want to sit at a multiple of 8, which would require seven bytes of padding before it. But we have limited the padding to at most four bytes with pack(4), so d1's alignment becomes min(8, 4) = 4: the compiler inserts three bytes of padding and places d1 at offset 4, an integer multiple of 4.

References:
http://msdn.microsoft.com/en-us/library/aa290049.aspx
http://msdn.microsoft.com/en-us/library/83ythb65(vs.80).aspx#vclrfalignexamples
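A minimal check of these rules (assuming MSVC-style packing semantics; GCC and Clang implement #pragma pack compatibly):

    #include <cstddef>
    #include <cstdio>

    #pragma pack(4)
    struct A {
        char   c;   // offset 0
        double d1;  // alignment min(8, 4) = 4 -> offset 4 (3 padding bytes)
        short  s;   // offset 12
    };
    #pragma pack()

    int main() {
        // Expected: offsets 0, 4, 12; size 16 (14 rounded up to the struct
        // alignment of 4, the largest member alignment under pack(4)).
        std::printf("c=%zu d1=%zu s=%zu size=%zu\n",
                    offsetof(A, c), offsetof(A, d1), offsetof(A, s),
                    sizeof(A));
        return 0;
    }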
