6. Pentium Pro, II and III pipeline
6.1. The pipeline in PPro, P2 and P3
The Pentium Pro, introduced in 1995, was the first Intel processor with out-of-order execution. Its microarchitecture design was quite successful, and it was developed further through many generations up to today's processors, with a minor detour to the less successful Pentium 4 (NetBurst) architecture.
The pipeline of the PPro, P2 and P3 was explained in various manuals and tutorials from Intel that are no longer available. I will therefore explain the pipeline here.
Figure 6.1. Pentium Pro pipeline
This pipeline is shown in Figure 6.1. The pipeline is divided into the following stages:
BTB0, 1: Branch prediction. Tells where to fetch the next instructions from.
IFU0, 1, 2: Instruction fetch unit.
ID0, 1: Instruction decoder.
RAT: Register alias table. Register renaming.
ROB Rd: μop reorder buffer read.
RS: Reservation station.
Port0, 1, 2, 3, 4: Ports connecting to the execution units.
ROB wb: Write-back of results to the reorder buffer.
RRF: Register retirement file.
Each stage in the pipeline takes at least one clock cycle. Branch prediction has been explained on page 14. The other stages in the pipeline are explained below (Literature: Intel Architecture Optimization Manual, 1997).
6.2. Instruction fetch
Instruction codes are fetched from the code cache in aligned 16-byte chunks into a double buffer that can hold two 16-byte chunks. The purpose of the double buffer is to make it possible to decode an instruction that crosses a 16-byte boundary (that is, an address divisible by 16). The code is passed from the double buffer to the decoders in blocks that I will call IFETCH blocks (instruction fetch blocks). An IFETCH block is up to 16 bytes long. In most cases, the instruction fetch unit makes each IFETCH block start at an instruction boundary rather than at a 16-byte boundary. However, the instruction fetch unit needs information from the instruction length decoder to know where the instruction boundaries are. If this information is not available in time, it may start an IFETCH block at a 16-byte boundary. This complication is discussed in more detail below.
The double buffer is not big enough to handle fetching around jumps without delay. If the IFETCH block that contains the jump instruction crosses a 16-byte boundary, the double buffer must hold two consecutive aligned 16-byte chunks of code in order to generate it. If the first instruction after the jump crosses a 16-byte boundary, the double buffer must load two new 16-byte chunks before a valid IFETCH block can be generated. This means that, in the worst case, the decoding of the first instruction after a jump is delayed by 2 clock cycles: there is a penalty of 1 clock cycle for a 16-byte boundary in the IFETCH block containing the jump instruction, and a further penalty of 1 clock cycle for a 16-byte boundary in the first instruction after the jump. The instruction fetch unit can fetch one 16-byte chunk per clock cycle. If it takes more than one clock cycle to decode an IFETCH block, the extra time can be used for fetching ahead. This can compensate for the penalties of 16-byte boundaries before and after jumps. The resulting delays are summarized in table 6.1 below.
If the double buffer has time to fetch only one 16-byte chunk after the jump, the first IFETCH block after the jump will be identical to this chunk, that is, aligned to a 16-byte boundary. In other words, the first IFETCH block after the jump will not begin at the first instruction but at the nearest preceding address divisible by 16. If the double buffer has had time to load two 16-byte chunks, the new IFETCH block can cross a 16-byte boundary and begin at the first instruction after the jump. These rules are summarized in the following table:
Decode groups in IFETCH block containing the jump | 16-byte boundary in this IFETCH block | 16-byte boundary in first instruction after the jump | Decoder delay (clocks) | Alignment of first IFETCH block after the jump
1         | 0 | 0 | 0 | by 16
1         | 0 | 1 | 1 | to instruction
1         | 1 | 0 | 1 | by 16
1         | 1 | 1 | 2 | to instruction
2         | 0 | 0 | 0 | to instruction
2         | 0 | 1 | 0 | to instruction
2         | 1 | 0 | 0 | by 16
2         | 1 | 1 | 1 | to instruction
3 or more | 0 | 0 | 0 | to instruction
3 or more | 0 | 1 | 0 | to instruction
3 or more | 1 | 0 | 0 | to instruction
3 or more | 1 | 1 | 0 | to instruction
Table 6.1. Instruction fetching around jumps
The first column of this table indicates the time it takes to decode all the instructions in an IFETCH block (decode groups are explained below).
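Table 6.1 can be written down as a small lookup, which is convenient when working through a piece of code by hand. The following Python sketch only restates the table; the names `TABLE_6_1` and `fetch_after_jump` are made up for this illustration.

```python
# Table 6.1 as a lookup:
# (decode groups, 16-byte boundary in jump block, 16-byte boundary in first
#  instruction after jump) -> (decoder delay in clocks, alignment of next block)
# "3 or more" decode groups are stored under the key 3.
TABLE_6_1 = {
    (1, 0, 0): (0, "by 16"),
    (1, 0, 1): (1, "to instruction"),
    (1, 1, 0): (1, "by 16"),
    (1, 1, 1): (2, "to instruction"),
    (2, 0, 0): (0, "to instruction"),
    (2, 0, 1): (0, "to instruction"),
    (2, 1, 0): (0, "by 16"),
    (2, 1, 1): (1, "to instruction"),
    (3, 0, 0): (0, "to instruction"),
    (3, 0, 1): (0, "to instruction"),
    (3, 1, 0): (0, "to instruction"),
    (3, 1, 1): (0, "to instruction"),
}


def fetch_after_jump(decode_groups, boundary_in_jump_block, boundary_in_target):
    """Return (delay, alignment) for the first IFETCH block after a jump."""
    if decode_groups < 1:
        raise ValueError("an IFETCH block contains at least one decode group")
    key = (min(decode_groups, 3),
           int(bool(boundary_in_jump_block)),
           int(bool(boundary_in_target)))
    return TABLE_6_1[key]


# Worst case: one decode group and a 16-byte boundary on both sides of the jump
print(fetch_after_jump(1, True, True))
```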
Instructions can have any length from 1 to 15 bytes. Therefore, we cannot be sure that a 16-byte IFETCH block contains a whole number of instructions. If an instruction extends past the end of an IFETCH block, it goes into the next IFETCH block, which begins at the first byte of this instruction. The instruction fetch unit therefore needs to know where the last complete instruction in each IFETCH block ends before it can generate the next IFETCH block. This information is generated by the instruction length decoder, which is in stage IFU2 of the pipeline (figure 6.1). The instruction length decoder can determine the lengths of three instructions per clock cycle. For example, if an IFETCH block contains 10 instructions, it takes 3 clock cycles before it is known where the last complete instruction in the IFETCH block ends and the next IFETCH block can be generated.
6.3. Instruction decoding
Instruction length decoding
The IFETCH blocks go to the instruction length decoder, which determines where each instruction begins. This is a very critical stage in the pipeline because it limits the degree of parallelism that can be achieved. To gain speed, we want to fetch more than one instruction per clock cycle, decode more than one instruction per clock cycle, and execute more than one μop per clock cycle. But decoding instructions in parallel is difficult when instructions have different lengths: you need to decode the first instruction to know how long it is and where the second instruction begins before you can start decoding the second. A simple instruction length decoder would therefore be able to handle only one instruction per clock cycle. The instruction length decoder in the PPro microarchitecture can determine the lengths of three instructions per clock cycle, and it even feeds this information back to the instruction fetch unit early enough for a new IFETCH block to be generated for the length decoder to work on in the next clock cycle. This is quite an impressive accomplishment, which I assume is obtained by tentatively decoding all 16 possible start addresses in parallel.
The 4-1-1 rule
After the instruction length decoder come the instruction decoders, which translate instructions into μops. There are three decoders working in parallel, so up to three instructions can be decoded per clock cycle. A group of up to three instructions decoded in the same clock cycle is called a decode group. The three decoders are called D0, D1 and D2. D0 can handle all instructions and generate up to 4 μops per clock cycle. D1 and D2 can handle only simple instructions that generate one μop each and are no more than 8 bytes long. The first instruction in an IFETCH block always goes to D0. The next two instructions go to D1 and D2 if possible. If an instruction that would go to D1 or D2 cannot be handled by these decoders, because it generates more than one μop or is more than 8 bytes long, it has to wait until D0 is vacant. The subsequent instructions are delayed as well. For example:
; Example 6.1a. Instruction decoding
mov [esi], eax     ; 2 uops, D0
add ebx, [edi]     ; 2 uops, D0
sub eax, 1         ; 1 uop,  D1
cmp ebx, ecx       ; 1 uop,  D2
je  L1             ; 1 uop,  D0
The first instruction in this example goes to decoder D0. The second instruction cannot go to D1 because it generates more than one μop. It is therefore delayed until the next clock cycle, when D0 is ready. The third instruction goes to D1 because the preceding instruction went to D0. The fourth instruction goes to D2. The last instruction goes to D0. The whole sequence takes 3 clock cycles to decode. Decoding can be improved by swapping the second and third instructions:
; Example 6.1b. Instructions reordered for improved decoding
mov [esi], eax     ; 2 uops, D0
sub eax, 1         ; 1 uop,  D1
add ebx, [edi]     ; 2 uops, D0
cmp ebx, ecx       ; 1 uop,  D1
je  L1             ; 1 uop,  D2
Now decoding takes only 2 clock cycles because the instructions are better distributed among the decoders.
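The decoder assignment in examples 6.1a and 6.1b can be reproduced with a few lines of Python. This is only a simplified model of the rules stated above: it ignores IFETCH block boundaries, it assumes that no instruction generates more than 4 μops, and the byte lengths in the test data are assumed values rather than figures from the examples. The function name `decode_groups` is made up for this sketch.

```python
def decode_groups(instrs):
    """Split a list of (uops, length_in_bytes) into decode groups, one per clock.

    Simplified model: IFETCH block boundaries are ignored and every
    instruction is assumed to generate at most 4 uops.
    """
    groups = []
    i = 0
    while i < len(instrs):
        group = [i]            # D0 accepts the next instruction, whatever it is
        i += 1
        # D1 and D2 accept only 1-uop instructions of at most 8 bytes
        while len(group) < 3 and i < len(instrs):
            uops, length = instrs[i]
            if uops != 1 or length > 8:
                break          # must wait for D0 in the next clock cycle
            group.append(i)
            i += 1
        groups.append(group)
    return groups


# uop counts from examples 6.1a and 6.1b; the byte lengths are assumed
# (all are short instructions, so the 8-byte limit never applies here)
example_6_1a = [(2, 2), (2, 2), (1, 3), (1, 2), (1, 2)]
example_6_1b = [(2, 2), (1, 3), (2, 2), (1, 2), (1, 2)]

print(len(decode_groups(example_6_1a)))  # clock cycles for example 6.1a
print(len(decode_groups(example_6_1b)))  # clock cycles for example 6.1b
```

Each inner list holds the indexes of the instructions decoded in one clock cycle, in the order D0, D1, D2, so the number of lists is the number of clock cycles spent in the decoders.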
The maximum decoding speed is obtained when instructions are ordered according to the 4-1-1 pattern: if every third instruction generates 4 μops and the next two instructions generate 1 μop each, the decoders can generate 6 μops per clock cycle. A 2-2-2 pattern gives the minimum decoding speed of 2 μops per clock cycle because all the 2-μop instructions go to D0. It is recommended that you order instructions according to the 4-1-1 rule, so that every instruction that generates 2, 3, or 4 μops is followed by two instructions that generate 1 μop each. An instruction that generates more than 4 μops must go to D0. It takes 2 or more clock cycles to decode, and no other instructions can be decoded in parallel.
IFETCH block boundaries
A further complication is that the first instruction in an IFETCH block always goes to D0. If the code has been scheduled according to the 4-1-1 rule and one of the 1-μop instructions intended for D1 or D2 happens to be first in an IFETCH block, that instruction goes to D0 and the 4-1-1 pattern is broken. This delays decoding by one clock cycle. The instruction fetch unit cannot adjust the IFETCH boundaries to the 4-1-1 pattern because, I suppose, the information about which instructions generate more than one μop only becomes available two stages further down the pipeline.
This problem is difficult to handle because it is difficult to guess where the IFETCH boundaries are. The best way to address it is to schedule the code so that the decoders can generate more than 3 μops per clock cycle. The RAT and RRF stages in the pipeline (figure 6.1) can handle no more than 3 μops per clock cycle. If instructions are ordered according to the 4-1-1 rule so that we can expect at least 4 μops per clock cycle, we can afford to lose one clock cycle at every IFETCH block and still maintain an average decoder throughput of no less than 3 μops per clock cycle.
Another remedy is to make instructions as short as possible in order to get more instructions into each IFETCH block. More instructions per IFETCH block means fewer IFETCH boundaries and thus fewer breaks in the 4-1-1 pattern. For example, you can use pointers instead of absolute addresses to reduce code size. For more advice on how to reduce the size of instructions, see manual 2: "Optimizing subroutines in assembly language".
In some cases it is possible to manipulate the code so that instructions intended for decoder D0 fall at the IFETCH boundaries. But it is usually quite difficult to determine where the IFETCH boundaries are, and it is probably not worth the effort. First, you need to make the code segment paragraph-aligned so that you know where the 16-byte boundaries are. Then you have to know where the first IFETCH block of the code you want to optimize begins. Look at the output listing from your assembler to see how long each instruction is. If you know where one IFETCH block begins, you can find where the next one begins as follows: make the IFETCH block 16 bytes long. If it ends at an instruction boundary, the next block begins there. If it ends in the middle of an instruction, the next block begins at the start of that instruction. Only the lengths of the instructions count here, not how many μops they generate or what they do. In this way you can work your way through the code and mark where each IFETCH block begins. The biggest problem is knowing where to start. Here are some guidelines:
· The first IFETCH block after a jump, call, or return can begin either at the first instruction or at the nearest preceding 16-byte boundary, according to table 6.1. If you align the first instruction to begin at a 16-byte boundary, you can be sure that the first IFETCH block begins there. For this reason, you may want to align important subroutine entries and loop entries by 16.
· If the combined length of two consecutive instructions is more than 16 bytes, you can be certain that the second one does not fit into the same IFETCH block as the first one, so there will always be an IFETCH block beginning at the second instruction. You can use this as a starting point for finding where subsequent IFETCH blocks begin.
· The first IFETCH block after a branch misprediction begins at a 16-byte boundary. As explained on page 14, a loop that repeats more than 5 times always has a misprediction when it exits. The first IFETCH block after such a loop will therefore begin at the nearest preceding 16-byte boundary.
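The procedure for marking IFETCH block boundaries, once a starting point is known, can be sketched in Python as follows. The sketch assumes that the first block begins at an instruction boundary and that the instruction lengths have been read from an assembler listing; the function name and the lengths in the sample call are hypothetical.

```python
def ifetch_block_starts(first_start, lengths):
    """Return the start address of each IFETCH block.

    first_start -- address of the first instruction, assumed to be where
                   the first IFETCH block begins
    lengths     -- instruction lengths in bytes, in program order
    """
    starts = [first_start]
    block_end = first_start + 16       # an IFETCH block is up to 16 bytes
    addr = first_start
    for length in lengths:
        if addr + length > block_end:  # instruction does not fit completely
            starts.append(addr)        # next block begins at this instruction
            block_end = addr + 16
        addr += length
    return starts


# Hypothetical instruction lengths, starting at address 2000h
blocks = ifetch_block_starts(0x2000, [3, 7, 4, 6, 2, 8, 1])
print([hex(a) for a in blocks])
```

In this made-up sequence the second block begins at the start of an instruction that would otherwise straddle the end of the first block, while the third block begins exactly where the second one ends, which covers both cases of the rule. Only the lengths enter the calculation, as stated above.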
I think you want an example now:
; Example 6.2. Instruction fetch blocks
address  instruction                  length  uops  expected decoder
1000h        mov ecx, 1000              5       1     D0
1005h    LL: mov [esi], eax             2       2     D0
1007h        mov [mem], 0               10      2     D0
1011h        lea ebx, [eax+200]         6       1     D1
1017h        mov byte ptr [esi], 0      3       2     D0
101Ah        bsr edx, eax               3       2     D0
101Dh        mov byte ptr [esi+1], 0    4       2     D0
1021h        dec edx                    1       1     D1
1022h        jnz LL                     2       1     D2
Let's assume that the first IFETCH block begins at address 1000h and ends at 1010h. This is before the end of the MOV [mem], 0 instruction, so the next IFETCH block begins at 1007h and ends at 1017h. This is at an instruction boundary, so the third IFETCH block begins at 1017h and covers the rest of the loop. The number of clock cycles it takes to decode this is the number of D0 instructions, which is 5 per iteration of the LL loop. The last IFETCH block contains three decode groups covering the last five instructions, and it has one 16-byte boundary (1020h).
Looking at table 6.1 above, we find that the first IFETCH block after the jump begins at the first instruction after the jump, which is the LL label at 1005h, and ends at 1015h. This is before the end of the LEA instruction, so the next IFETCH block goes from 1011h to 1021h, and the last one starts at 1021h and covers the rest. Now the LEA and DEC instructions both fall at the beginning of an IFETCH block, which forces them to go to D0. We now have 7 instructions in D0, so the loop takes 7 clock cycles to decode in the second iteration. The last IFETCH block contains only one decode group (DEC EDX / JNZ LL) and has no 16-byte boundary. According to table 6.1, the next IFETCH block after the jump begins at a 16-byte boundary, which is 1000h.
This gives the same situation as in the first iteration, so the loop alternately takes 5 and 7 clock cycles to decode. Since there are no other bottlenecks, the complete loop takes 6000 clock cycles to run 1000 iterations. If the starting address had been different, so that there was a 16-byte boundary in the first or the last instruction of the loop, it would take 8000 clock cycles. If you reorder the loop so that no D1 or D2 instructions fall at the beginning of an IFETCH block, you can make it take only 5000 clock cycles.
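The clock counts in this example can be checked with a short Python sketch. It takes the IFETCH block start addresses derived in the text as given and applies two rules: an instruction at the start of an IFETCH block goes to D0, and an instruction that generates more than one μop goes to D0. The function name `decode_clocks` is made up for this sketch, and the 8-byte limit for D1 and D2 is left out because all the 1-μop instructions in the loop are shorter than that.

```python
# (address, uops) for the loop body of example 6.2, starting at label LL
LOOP = [
    (0x1005, 2),  # mov [esi], eax
    (0x1007, 2),  # mov [mem], 0
    (0x1011, 1),  # lea ebx, [eax+200]
    (0x1017, 2),  # mov byte ptr [esi], 0
    (0x101A, 2),  # bsr edx, eax
    (0x101D, 2),  # mov byte ptr [esi+1], 0
    (0x1021, 1),  # dec edx
    (0x1022, 1),  # jnz LL
]


def decode_clocks(instrs, block_starts):
    """Count decode groups (one per clock) for one pass through instrs."""
    clocks = 0
    slots_used = 3  # forces the first instruction to open a new decode group
    for addr, uops in instrs:
        simple = (uops == 1)
        if slots_used == 3 or not simple or addr in block_starts:
            clocks += 1        # instruction goes to D0 in a new decode group
            slots_used = 1
        else:
            slots_used += 1    # instruction goes to D1 or D2
    return clocks


# IFETCH block starts as derived in the text for the two alternating cases
first = decode_clocks(LOOP, {0x1000, 0x1007, 0x1017})
second = decode_clocks(LOOP, {0x1005, 0x1011, 0x1021})
total = 500 * first + 500 * second   # 1000 iterations, alternating
print(first, second, total)
```

Under these assumptions the sketch gives 5 and 7 clock cycles for the two cases and 6000 clock cycles for 1000 iterations, in agreement with the figures above.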
The example above was deliberately constructed so that fetching and decoding are the only bottleneck. One thing you can do to improve decoding is to change the starting address of the procedure in order to avoid 16-byte boundaries where you don't want them. Remember to make the code segment paragraph-aligned so that you know where the boundaries are. It may also be possible to manipulate instruction lengths in order to put IFETCH boundaries where you want them, as explained in the chapter "Making instructions longer for the sake of alignment" in manual 2: "Optimizing subroutines in assembly language".
Instruction prefixes
Instruction prefixes can also incur penalties in the decoders. Instructions can have several kinds of prefixes, as listed in manual 2: "Optimizing subroutines in assembly language".
1. An operand size prefix gives a penalty of a few clock cycles if the instruction has an immediate operand of 16 or 32 bits, because the length of the operand is changed by the prefix. Examples (32-bit mode):
; Example 6.3a. Decoding instructions with operand size prefix
add bx, 9                  ; No penalty because immediate operand is 8 bits signed
add bx, 200                ; Penalty for 16-bit immediate. Change to ADD EBX, 200
mov word ptr [mem16], 9    ; Penalty because operand is 16 bits
The last instruction can be changed to:
; Example 6.3b. Decoding instructions with operand size prefix
mov eax, 9
mov word ptr [mem16], ax   ; No penalty because no immediate
2. An address size prefix gives a penalty whenever there is an explicit memory operand (even when there is no displacement), because the interpretation of the r/m bits in the instruction code is changed by the prefix. Instructions with only implicit memory operands, such as string instructions, have no penalty with an address size prefix.
3. Segment prefixes give no penalty in the decoders.
4. Repeat prefixes and lock prefixes give no penalty in the decoders.
5. There is always a penalty if an instruction has more than one prefix. This penalty is usually one clock cycle per prefix.
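The five rules can be collected into a small checker, for instance when scanning an assembler listing for instructions that are likely to be penalized in the decoders. The Python sketch below only encodes the rules as stated above and reports the reason for a penalty, not its size in clock cycles. The function name and the prefix labels are hypothetical.

```python
def prefix_penalty_reasons(prefixes, imm_bits=0, explicit_mem=False):
    """Return the reasons, if any, for a decoder penalty according to rules 1-5.

    prefixes     -- list of prefix kinds: "opsize", "addrsize", "segment",
                    "rep" or "lock" (hypothetical labels for this sketch)
    imm_bits     -- size of the immediate operand in bits, 0 if none
    explicit_mem -- True if the instruction has an explicit memory operand
    """
    reasons = []
    if "opsize" in prefixes and imm_bits in (16, 32):
        reasons.append("operand size prefix with 16- or 32-bit immediate")
    if "addrsize" in prefixes and explicit_mem:
        reasons.append("address size prefix with explicit memory operand")
    if len(prefixes) > 1:
        reasons.append("more than one prefix")
    return reasons


# Example 6.3a in 32-bit mode
print(prefix_penalty_reasons(["opsize"], imm_bits=8))    # add bx, 9
print(prefix_penalty_reasons(["opsize"], imm_bits=16))   # add bx, 200
print(prefix_penalty_reasons(["opsize"], imm_bits=16,
                             explicit_mem=True))         # mov word ptr [mem16], 9
# Example 6.3b
print(prefix_penalty_reasons(["opsize"], explicit_mem=True))  # mov word ptr [mem16], ax
```

An empty list means that none of the rules predicts a penalty. Segment, repeat and lock prefixes contribute only through the rule about multiple prefixes.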