Architecture Review: Instruction-Level Parallelism (Loop Unrolling and the Tomasulo Algorithm)

Source: Internet
Author: User

Architecture Review CH5 Instruction-Level Parallelism

5.1 The Concept of Instruction-Level Parallelism

5.1.1 Instruction-level parallelism

Instruction-level parallelism (ILP) refers to techniques, pipelining chief among them, that execute multiple instructions in parallel at the same time.

The main ways to implement ILP are:

    • Rely on hardware to discover and exploit parallelism dynamically at run time
    • Rely on software to discover parallelism statically at compile time
5.1.2 Dependences between instructions

Dependences between instructions limit the degree of instruction-level parallelism. They fall into three main classes: (true) data dependence, name dependence, and control dependence.

(1) Data dependence

Let instruction I precede instruction J. Instruction J is data dependent on instruction I in either of the following cases:

    • Instruction I produces a result that may be used by instruction J
    • Instruction J is data dependent on instruction K, and instruction K is data dependent on instruction I

Data dependence limits the attainable ILP. The most common way to overcome it is to eliminate the resulting data hazards by instruction scheduling, which reorders instructions without changing the dependences themselves.

(2) Name dependence

A name dependence occurs when two instructions use the same register or memory location (the name) but there is no flow of data between them through that name.

It is divided into the following two cases (instruction I precedes instruction J):

    • Antidependence: instruction J writes a register or memory location that instruction I reads
    • Output dependence: instruction I and instruction J write the same register or memory location

A name dependence is not a true data dependence, so it can be eliminated by the register renaming technique.

(3) Data hazards

A data hazard exists when two dependent instructions are close enough that overlapping their execution would change the order in which the operands are accessed. Data hazards fall into three classes:

    • RAW (read after write): J tries to read a location before I has written it and gets the old value
    • WAW (write after write): J writes a location before I does, so J's result is overwritten by I's late write (this arises when instruction I has a long latency, e.g. a floating-point or multiply instruction)
    • WAR (write after read): J writes a location before I has read it, so I incorrectly reads the new value (this cannot happen in a pipeline that always reads operands early and in order; it arises when J's write is moved ahead and I's read happens late in the pipeline)
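The three hazard classes can be illustrated with ordinary C variables standing in for registers; this is a minimal sketch (the variable names and values are illustrative, not from the source) of the in-order semantics that any overlapped execution must preserve:

```c
/* Each function pairs an earlier instruction I with a later instruction J
   and returns the value that program order requires. */
int raw_example(int x, int y) {
    int a = x * y;   /* I: writes a                                        */
    int b = a + 1;   /* J: reads a -> RAW: J must see I's new value        */
    return b;
}

int war_example(int x, int y) {
    int a = x * y;
    int c = a;       /* I: reads a                                         */
    a = 7;           /* J: writes a -> WAR: J must not clobber a before
                        I has read it                                      */
    return c;        /* must be the OLD value of a                         */
}

int waw_example(int x, int y) {
    int a = x * y;   /* I: writes a (imagine a slow FP multiply)           */
    a = 5;           /* J: writes a -> WAW: I's late result must not
                        overwrite J's                                      */
    return a;        /* program order says J's write is the final value    */
}
```

Out-of-order hardware may reorder the work internally, but the returned values must match these sequential results.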
(4) Control dependence

Control dependence means a branch instruction determines the execution order of an instruction I relative to it: the instructions that compute the branch condition must execute before the branch; an instruction that is control dependent on the branch cannot be moved before the branch; and an instruction that is not control dependent on the branch cannot be moved after it.

5.2 Instruction-level parallelism by software methods: ILP within a basic block

A basic block is a straight-line code sequence with no branch in except at the entry and no branch out except at the exit.

Consider the C language code:

for (i = 1; i <= 1000; i++) {
    x[i] = x[i] + s;
}

Its basic block corresponds to the assembler program:

Loop:   LD      F0,0(R1)
        ADDD    F4,F0,F2
        SD      0(R1),F4
        DADDI   R1,R1,#-8
        BNEZ    R1,Loop

Assume the following latencies between dependent instructions: a load followed by a dependent FP ALU operation costs 1 stall cycle, an FP ALU operation followed by a dependent store costs 2, and an integer operation followed by a dependent branch costs 1.

The cycle-by-cycle analysis of the basic block then gives 9 cycles in total:

1  Loop:  LD      F0,0(R1)
2         <stall>
3         ADDD    F4,F0,F2
4         <stall>
5         <stall>
6         SD      0(R1),F4
7         DADDI   R1,R1,#-8
8         <stall>
9         BNEZ    R1,Loop
5.2.1 Static Scheduling

Static scheduling reduces stalls by reordering instructions without changing the data dependences between them. Moving the DADDI decrement of R1 forward and using the branch delay slot (a delay slot of 1) compresses the basic block above to 6 cycles:

1  Loop:  LD      F0,0(R1)
2         DADDI   R1,R1,#-8
3         ADDD    F4,F0,F2
4         <stall>
5         BNEZ    R1,Loop
6         SD      8(R1),F4

Description

    • DADDI decrements R1 early, so the store address in SD becomes 8(R1) instead of 0(R1)
    • The instruction in the delay slot executes regardless of whether the branch is taken
5.2.2 Loop Unrolling

Static scheduling greatly improves the execution efficiency of the basic block (by 50% here), but one stall cycle still cannot be eliminated. This motivates another intra-block technique for removing stalls: loop unrolling.

Loop unrolling merges the basic blocks of several consecutive loop iterations into one larger basic block, providing independent instructions to fill the stall slots.

Unrolling the basic block above 4 times and scheduling the result gives:

1  Loop:  LD      F0,0(R1)
2         LD      F6,-8(R1)
3         LD      F10,-16(R1)
4         LD      F14,-24(R1)
5         ADDD    F4,F0,F2
6         ADDD    F8,F6,F2
7         ADDD    F12,F10,F2
8         ADDD    F16,F14,F2
9         SD      0(R1),F4
10        SD      -8(R1),F8
11        DADDI   R1,R1,#-32
12        SD      16(R1),F12
13        BNEZ    R1,Loop
14        SD      8(R1),F16

On average each iteration now needs only 14/4 = 3.5 cycles, a large performance gain!
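The same transformation can be written at the C level. This is a minimal sketch (the function name `add_scalar` and the cleanup loop are illustrative additions): the four independent adds in the unrolled body give a scheduler room to hide latency, exactly as in the assembly above.

```c
/* Four-way unrolled version of the x[i] = x[i] + s loop. */
void add_scalar(double *x, int n, double s) {
    int i;
    for (i = 0; i + 3 < n; i += 4) {  /* unrolled body: 4 iterations per pass */
        x[i]     += s;
        x[i + 1] += s;
        x[i + 2] += s;
        x[i + 3] += s;
    }
    for (; i < n; i++)                /* cleanup when n is not a multiple of 4 */
        x[i] += s;
}
```

Note the cleanup loop: hand-unrolled code must handle trip counts that are not a multiple of the unroll factor, a detail the assembly example sidesteps by assuming 1000 iterations.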

5.2.3 Code scheduling from the compiler's perspective

The optimally unrolled and scheduled code above was produced by hand. We could schedule it effectively because we know that R1 is the induction variable, and that 16(R1) after R1 has been decremented by 32 refers to the same memory word as -16(R1) before. The compiler cannot analyze memory references so easily, because it cannot determine, for example:

    • whether two references through different registers (say 0(R4) and 0(R6)) address the same location within one iteration
    • whether a reference such as 0(R4) addresses the same location in different iterations
5.2.4 Loop-carried dependence

The example above has no dependence across iterations, but consider the following loop:

for (i = 1; i <= 100; i++) {
    A[i] = A[i] + B[i];         // S1
    B[i+1] = C[i] + D[i];       // S2
}

Within one iteration S1 and S2 are independent, but S1 uses the B[i] produced by S2 in the previous iteration. This loop-carried dependence usually prevents the iterations from executing directly in parallel, so the code must be transformed:

A[1] = A[1] + B[1];
for (i = 1; i <= 99; i++) {
    B[i+1] = C[i] + D[i];       // S2
    A[i+1] = A[i+1] + B[i+1];   // S1
}
B[101] = C[100] + D[100];
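The transformation can be checked mechanically: both versions must leave A and B in the same final state. This is a minimal sketch (function names `original`/`transformed` and the array sizes, 101 and 102 to fit the 1..100 indexing, are assumptions for the test):

```c
/* Loop with a loop-carried dependence (S1 uses B[i] from the previous
   iteration's S2). */
void original(double A[101], double B[102], const double C[101],
              const double D[101]) {
    for (int i = 1; i <= 100; i++) {
        A[i] = A[i] + B[i];           /* S1 */
        B[i + 1] = C[i] + D[i];       /* S2 */
    }
}

/* Transformed loop: S1 now uses the B[i+1] produced in the SAME iteration,
   so the dependence between iterations is gone. */
void transformed(double A[101], double B[102], const double C[101],
                 const double D[101]) {
    A[1] = A[1] + B[1];               /* peeled first S1 */
    for (int i = 1; i <= 99; i++) {
        B[i + 1] = C[i] + D[i];           /* S2 */
        A[i + 1] = A[i + 1] + B[i + 1];   /* S1 */
    }
    B[101] = C[100] + D[100];         /* peeled last S2 */
}
```

Peeling the first S1 and the last S2 shifts the loop body so that each iteration is self-contained, which is what makes the remaining iterations independent.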
5.3 Instruction-level parallelism by hardware methods

With the static scheduling techniques discussed so far, once a dependence cannot be eliminated the pipeline stalls, and it does not flow again until the dependence is resolved.

Dynamic scheduling introduces the idea of dynamic issue: when the pipeline is about to stall, the hardware selects a later instruction that does not violate any dependence and lets it proceed. This is the familiar in-order issue, out-of-order execution, out-of-order completion.

Dynamic scheduling has many advantages:

    • Code compiled for one pipeline runs efficiently on different pipelines without recompiling for each microarchitecture
    • Dynamic scheduling handles dependences that the compiler's static scheduling cannot resolve (the memory-reference problem above)
    • It lets the processor tolerate unpredictable, dynamic delays such as those caused by cache misses, which static scheduling has no way to handle
5.3.1 Scoreboard Dynamic Scheduling

The scoreboard algorithm splits the ID stage into two steps:

    • Issue: decode the instruction and check for structural hazards
    • Read operands: read the operands once no data hazard remains

The main idea is to issue an instruction whenever there is no structural hazard; an issued instruction that must wait for its operands does not prevent later, independent instructions already in the pipeline from reading their operands and executing.

(1) The four scoreboard stages

    • Issue: if the functional unit required by the instruction is free and no other active instruction (executing or waiting) has the same destination register (avoiding WAW), the scoreboard issues the instruction to the unit and updates its internal state. On a structural hazard or a WAW hazard, issue stalls, and no later instruction issues until the hazard clears.
    • Read operands: a source operand is available if no earlier issued, still-active instruction will write it, or if the functional unit producing it has completed the write. When both operands are available, the scoreboard tells the functional unit to read them and begin execution (avoiding RAW).
    • Execute: the functional unit begins execution after receiving its operands. When the result is ready, it notifies the scoreboard that execution is complete, and the scoreboard records this.
    • Write result: once execution completes, the scoreboard checks for WAR hazards. If no WAR hazard exists, the unit writes its result; otherwise the write is stalled.
(2) Scoreboard hardware components

1. Instruction status: records which of the four stages each active instruction is in

2. Functional unit status: records the state of each functional unit (the unit that performs the operation); each FU has 9 fields:

    • Busy: whether the unit is in use
    • Op: the operation the unit is performing
    • Fi: destination register number
    • Fj, Fk: the two source register numbers
    • Qj, Qk: the functional units producing the source operands Fj, Fk
    • Rj, Rk: flags indicating whether Fj, Fk are ready and not yet read

3. Register result status: records which FU will write each register (the write still pending); the field is empty if no active instruction will write that register
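The three tables can be sketched as C data structures. This is a minimal sketch, not a full simulator: the field names follow the nine FU-status parameters listed above, while the sizes (`NUM_FU`, `NUM_REGS`), the `NO_FU` sentinel, and the `can_issue` helper are illustrative assumptions.

```c
#include <stdbool.h>

#define NUM_FU   5     /* e.g. integer, two FP multipliers, FP add, FP divide */
#define NUM_REGS 32
#define NO_FU    (-1)  /* sentinel: no functional unit */

enum stage { ISSUE, READ_OPERANDS, EXECUTE, WRITE_RESULT };

struct fu_status {
    bool busy;          /* is the functional unit in use?         */
    int  op;            /* operation it is performing             */
    int  fi;            /* destination register number            */
    int  fj, fk;        /* source register numbers                */
    int  qj, qk;        /* FUs producing Fj / Fk (NO_FU if ready) */
    bool rj, rk;        /* are Fj / Fk ready and not yet read?    */
};

struct scoreboard {
    enum stage instr_status[64];    /* stage of each active instruction        */
    struct fu_status fu[NUM_FU];    /* functional-unit status                  */
    int register_result[NUM_REGS];  /* FU that will write each register,
                                       NO_FU if none                           */
};

/* The issue check from the stage description above: the unit must be free
   (no structural hazard) and no active instruction may already have the
   same destination register (no WAW hazard). */
bool can_issue(const struct scoreboard *sb, int fu, int dest_reg) {
    return !sb->fu[fu].busy && sb->register_result[dest_reg] == NO_FU;
}
```

Everything the four stages need to check (unit free, operands ready, WAW, WAR) is answered by reading these tables, which is why they act as the "public scoreboard" described below.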

(3) Dynamic scheduling algorithm

With these three tables, the checks needed in the four stages can all be performed: whether an FU is free, whether the operands are ready (execution can start), whether a WAW hazard exists, and so on.

The algorithm is called the scoreboard algorithm because these three tables act like a public scoreboard: every operation in the pipeline consults the scoreboard's state before proceeding and writes its own status back into it.

The four stages and their scoreboard control are given below in pseudo-code form. Wait until is the condition an instruction must satisfy before moving to the next stage, and bookkeeping is the state that must be recorded when the stage completes:

Status | Wait until | Bookkeeping

Issue | !FU.Busy && Result[D] == NULL | FU.Busy = true; FU.Op = op; FU.Fi = 'D'; FU.Fj = 'S1'; FU.Fk = 'S2'; FU.Qj = Result[S1]; FU.Qk = Result[S2]; FU.Rj = (FU.Qj == NULL); FU.Rk = (FU.Qk == NULL); Result[D] = 'FU' (if a source operand is an immediate or an available integer register, the corresponding Rj/Rk is set to true directly)

Read operands | FU.Rj && FU.Rk | FU.Rj = false; FU.Rk = false (the operands have now been read)

Execute | functional unit signals completion | record the completion cycle

Write result | for every other functional unit f: (f.Fj != FU.Fi or !f.Rj) && (f.Fk != FU.Fi or !f.Rk), i.e. no unit still needs to read the old value of FU.Fi (no WAR) | notify every f with f.Qj == FU or f.Qk == FU that its operand is ready (set the corresponding Rj/Rk to true); Result[FU.Fi] = NULL; FU.Busy = false
5.3.2 Tomasulo Dynamic Scheduling

Another dynamic scheduling algorithm is the Tomasulo algorithm. Its main differences from the scoreboard are:

    • In the scoreboard, control and buffering are centralized; in Tomasulo they are distributed among the functional units
    • Tomasulo's buffers are called reservation stations (RS). A register field holds either a value or a pointer (tag) to the RS or load buffer (a special kind of RS) that will produce it. The RS thus implement register renaming, which avoids WAR and WAW hazards; and since there can be more reservation stations than registers, they enable optimizations a compiler cannot achieve
    • The register result status records the name of an RS rather than a register
    • When an FU finishes computing, the result is broadcast to all other RS over the common data bus (CDB), and the register result status is updated
    • Tomasulo can move past branches: it is not limited to the FP operations inside a single basic block

Tomasulo's performance is limited by the CDB, so a high-speed CDB is generally used.

(1) Register renaming

Why does register renaming avoid WAR and WAW hazards? Consider:

DIVD    F0,F2,F4
ADDD    F6,F0,F8
SD      F6,0(R1)
SUBD    F8,F10,F14
MULD    F6,F10,F8

Three name dependences exist:

    • SUBD's destination is F8 and ADDD's source is F8: a WAR hazard
    • MULD's destination is F6 and SD's source is F6: a WAR hazard
    • MULD's destination is F6 and ADDD's destination is F6: a WAW hazard

Renaming with two temporaries S and T gives:

DIVD    F0,F2,F4
ADDD    S,F0,F8
SD      S,0(R1)
SUBD    T,F10,F14
MULD    F6,F10,T

with every later use of F8 replaced by T. Then:

    • SUBD may write T before ADDD reads F8: the WAR hazard is eliminated
    • MULD may write F6 before SD reads S: the WAR hazard is eliminated
    • MULD may write F6 before ADDD writes S: the WAW hazard is eliminated
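The renamed sequence can be mirrored in C, with fresh local variables playing the role of the temporaries S and T. This is an illustrative sketch (the function name `renamed`, the `stored` out-parameter standing in for the SD memory write, and the argument values are all assumptions): after renaming, only the true (RAW) dependences remain, so the statements could be reordered or overlapped freely apart from those.

```c
/* C version of the renamed five-instruction sequence above. */
double renamed(double f2, double f4, double f8_in, double f10, double f14,
               double *stored /* out: the value SD writes to memory */) {
    double f0 = f2 / f4;      /* DIVD F0,F2,F4               */
    double s  = f0 + f8_in;   /* ADDD S,F0,F8   (F6 -> S)    */
    *stored   = s;            /* SD   S,0(R1)                */
    double t  = f10 - f14;    /* SUBD T,F10,F14 (F8 -> T)    */
    double f6 = f10 * t;      /* MULD F6,F10,T               */
    return f6;                /* only RAW dependences remain */
}
```

Because s and t are distinct storage locations, the compiler (or hardware) is free to execute SUBD and MULD before SD completes, which is exactly what the renaming in the reservation stations achieves dynamically.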
(2) Component structure

1. The RS structure is similar to the FU status of the scoreboard algorithm, except that register renaming removes the Fj/Fk register fields and the Rj/Rk ready flags:

    • Busy: whether the RS is in use
    • Op: the operation the RS is performing
    • A: memory address field; initially holds the immediate, then the computed effective address (loads and stores only)
    • Vj, Vk: the values of the source operands
    • Qj, Qk: the reservation stations producing the source operands (empty means the value is already in Vj/Vk)

2. Register result status: stores, for each register, the name of the RS whose pending write will update it
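The difference from the scoreboard tables is visible in the data structures. This is a minimal sketch (the `NO_RS` tag convention, the sizes, and the `ready` helper are illustrative assumptions): where the scoreboard kept register numbers (Fj/Fk) plus ready bits (Rj/Rk), a reservation station keeps captured values (Vj/Vk) plus producer tags (Qj/Qk).

```c
#include <stdbool.h>

#define NO_RS 0   /* tag 0 = operand value already captured in Vj/Vk */

struct rs_entry {
    bool   busy;   /* station in use?                                  */
    int    op;     /* operation to perform                             */
    double vj, vk; /* source operand values (valid when tag == NO_RS)  */
    int    qj, qk; /* tags of the stations producing Vj / Vk           */
    long   a;      /* immediate, then effective address (loads/stores) */
};

/* RegisterStatus.Qi: which station will write each register (NO_RS = none).
   Holding an RS tag here, rather than a register number, is what implements
   the renaming. */
int register_status[32];

/* An RS may start executing once both tags are clear, i.e. every RAW
   dependence has been satisfied by a CDB broadcast. */
bool ready(const struct rs_entry *e) {
    return e->busy && e->qj == NO_RS && e->qk == NO_RS;
}
```

When a producing station broadcasts on the CDB, every waiting station copies the value into Vj/Vk and clears the matching tag; `ready` then becomes true.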

(3) The three stages

    • Issue: if the appropriate RS is free (no structural hazard), issue the instruction and its available operands to it (the renaming done in the RS avoids WAR and WAW)
    • Execute: when both operands are ready, the RS begins execution; if an operand is not ready, the RS monitors the CDB to catch it as soon as it is produced (avoiding RAW)
    • Write result: broadcast the result on the CDB to all waiting RS and update the register result status
(4) Tomasulo flow control

The pseudo-code representation of the Tomasulo dynamic scheduling algorithm is as follows:

1. Issue stage:

// rs, rt: source registers; rd: destination register
// r: the reservation station or buffer allocated to the instruction
void issue() {
    if (op == FP operation) {
        wait until: RS[r].Busy == false;       // r is a free RS of the right kind
        if (RegisterStatus[rs].Qi != NULL) {   // first operand still being produced
            RS[r].Vj = NULL;
            RS[r].Qj = RegisterStatus[rs].Qi;
        } else {
            RS[r].Vj = Register[rs];
            RS[r].Qj = NULL;
        }
        if (RegisterStatus[rt].Qi != NULL) {   // second operand still being produced
            RS[r].Vk = NULL;
            RS[r].Qk = RegisterStatus[rt].Qi;
        } else {
            RS[r].Vk = Register[rt];
            RS[r].Qk = NULL;
        }
        RS[r].Busy = true;
        RegisterStatus[rd].Qi = r;   // rename the destination: avoids WAR and WAW
    }
    if (op == Load or Store) {
        wait until: RS[r].Busy == false;       // load buffers share the RS structure
        if (RegisterStatus[rs].Qi != NULL) {   // base register
            RS[r].Vj = NULL;
            RS[r].Qj = RegisterStatus[rs].Qi;
        } else {
            RS[r].Vj = Register[rs];
            RS[r].Qj = NULL;
        }
        RS[r].A = imm;
        RS[r].Busy = true;
        if (op == Load) {                      // load only
            RegisterStatus[rt].Qi = r;
        } else {                               // store only: capture the value to store
            if (RegisterStatus[rt].Qi != NULL) {
                RS[r].Vk = NULL;
                RS[r].Qk = RegisterStatus[rt].Qi;
            } else {
                RS[r].Vk = Register[rt];
                RS[r].Qk = NULL;
            }
        }
    }
}

2. Execute stage:

void execute() {
    if (op == FP operation) {
        wait until: RS[r].Qj == NULL && RS[r].Qk == NULL;  // both operands ready: avoids RAW
        compute result with operands Vj and Vk;
    }
    if (op == Load or Store) {
        wait until: RS[r].Qj == NULL && r is head of the load-store queue;
        RS[r].A = RS[r].Vj + RS[r].A;                      // effective address
        if (op == Load) {
            wait until: address computation of r is complete;
            read from Mem[RS[r].A];
        }
    }
}

3. Write result stage:

void write() {
    if (op == FP operation) {
        wait until: execution of r is complete && CDB is available;
        for all x in RegisterStatus par-do {   // hardware does this in parallel;
                                               // a serial loop can simulate it
            if (RegisterStatus[x].Qi == r) {
                Register[x] = result;
                RegisterStatus[x].Qi = NULL;
            }
        }
        for all x in RS par-do {
            if (RS[x].Qj == r) { RS[x].Vj = result; RS[x].Qj = NULL; }
            if (RS[x].Qk == r) { RS[x].Vk = result; RS[x].Qk = NULL; }
        }
        RS[r].Busy = false;
    }
    if (op == Store) {
        wait until: execution of r is complete && RS[r].Qk == NULL;
        Mem[RS[r].A] = RS[r].Vk;
        RS[r].Busy = false;
    }
}
5.3.3 Tomasulo processing of loops

The Tomasulo algorithm can overlap loop iterations; the key points are:

    • Register renaming: different iterations use different physical storage locations; renaming turns register names into dynamic pointers and effectively increases the number of registers
    • Integer unit first: resolving the loop branch early allows instructions from multiple iterations to be in flight at once
