Computer Organization, 7 Pipelined Processors: 7.5 Handling Data Hazards
In a program, we often read and modify the same variable repeatedly. For a pipelined processor this produces a large number of data hazards, and we have to handle them properly. In this section, let's look at several different solutions.
Let's first look at an example of a data hazard. The hazard arises because the second instruction, an addition, uses the result of the first instruction, a subtraction. In the pipeline, however, when the addition instruction reads the t0 register, the preceding subtraction instruction has not yet written its result to t0. So there is a data hazard here. The simplest way to resolve it is actually at the software level.
Suppose the pipeline of our processor does nothing about this kind of data hazard. Then we simply have to delay the addition instruction by hand when writing the program, so that it reads the register file only after the subtraction instruction has written to it. So how do we do that?
There is an instruction called NOP whose job is to do nothing. We insert two NOP instructions between the subtraction instruction and the addition instruction. The two NOPs simply pass through the pipeline and take up the corresponding time slots, so the data hazard no longer occurs. Because of the two NOPs, the addition instruction enters the pipeline two cycles later. By the time it needs to read the register file, the preceding subtraction has completed its write, so the addition reads the correct value of t0 and produces the correct result. So the simplest way to resolve this data hazard is to insert NOP instructions, but this method has quite a few problems.
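The software fix just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Instr` tuple, the `insert_nops` function, the `gap` parameter and the operands s0, s1, s2 and t1 are all hypothetical names that do not come from the lecture, and `gap=2` encodes the two-NOP distance that the lecture gives for its 5-stage pipeline.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record: mnemonic, destination, sources."""
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


NOP = Instr("nop", None, ())


def insert_nops(program: List[Instr], gap: int = 2) -> List[Instr]:
    """Pad the program so that a reader of a register sits at least
    `gap` slots behind the instruction that writes that register."""
    out: List[Instr] = []
    for instr in program:
        needed = 0
        # Look back over the last `gap` slots already emitted.
        for distance in range(1, gap + 1):
            if distance > len(out):
                break
            prev = out[-distance]
            if prev.dest is not None and prev.dest in instr.srcs:
                # A writer `distance` slots back needs this many more NOPs.
                needed = max(needed, gap - distance + 1)
        out.extend([NOP] * needed)
        out.append(instr)
    return out


# Operands other than t0 are made up for the example.
program = [
    Instr("sub", "t0", ("s0", "s1")),
    Instr("add", "t1", ("t0", "s2")),
]
print([i.op for i in insert_nops(program)])  # ['sub', 'nop', 'nop', 'add']
```

Note that `gap` has to be chosen to match one particular pipeline, which is exactly the weakness of doing this in software.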
First, how many NOP instructions to insert depends on the structure of the pipeline. Suppose this program runs correctly on a 5-stage pipeline. Some time later a newer processor comes out with an 8-stage pipeline, and the same program may produce wrong results on it. As pipelines get deeper, the number of cycles that must be waited to resolve a data hazard may grow considerably. So inserting NOPs is feasible, but it is not a good method. In general, we also want the hardware to hide such implementation details from software. Since inserting two NOP instructions solves the problem, we can try to do the same thing in hardware.
The NOP approach already gives us a template. Whenever we detect such a data hazard, the hardware drives every control signal in the pipeline to the value it would have when executing a NOP instruction. During those two cycles the pipeline is effectively stalled. The stage state produced by these NOP-like control signals is called a bubble. The bubble moves down the pipeline one stage per clock cycle, and its effect is the same as if a NOP instruction had been fetched in the first stage. The only difference is that the signals are generated by hardware.
Now there is a new problem. When the NOP instructions were inserted in software, the processor simply kept fetching and executing one instruction after another as usual. Now the hardware has to insert bubbles automatically, so it needs a way to detect that a data hazard has occurred. This is not difficult. If we look not at the code but at the five stages of the processor, how can we tell that a data hazard exists?
A data hazard means that the current instruction needs to read a register that an earlier instruction is going to write, and that write has not completed yet. So we first check, in the decode stage, the numbers of the registers to be read, which can be taken from the signals wired to the read ports of the register file. Then we look at the later stages. Each of them carries signals indicating whether its instruction writes a register and, if so, which one. So we only need to compare the register number to be written in each later stage with the register numbers being read in the decode stage. If any of them match, there is a data hazard, and as soon as one is detected we insert a bubble into the pipeline. This resolves data hazards in hardware. In real programs, however, writing a register and then using it right away is very common. If the pipeline stalled every time this happened, the performance cost would be too high. So being correct is not enough; we also want to be fast. We would still like the pipeline not to stall.
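The comparison described above amounts to a handful of equality checks. The following Python sketch models it under stated assumptions: `StageSignals`, `reg_write`, `dest` and `data_hazard` are hypothetical names, registers are identified by number, and the caller decides which later stages to pass in.

```python
from typing import Iterable, NamedTuple


class StageSignals(NamedTuple):
    """Hypothetical per-stage signals: does the instruction in this
    stage write a register, and if so which register number?"""
    reg_write: bool
    dest: int


def data_hazard(id_reads: Iterable[int],
                later_stages: Iterable[StageSignals]) -> bool:
    """True if a register read in the decode stage is still waiting to
    be written by an instruction in one of the later stages."""
    reads = set(id_reads)
    return any(s.reg_write and s.dest in reads for s in later_stages)


# The decode stage reads registers 8 and 18; the instruction in the
# execute stage is going to write register 8.
ex = StageSignals(reg_write=True, dest=8)
mem = StageSignals(reg_write=False, dest=0)  # e.g. a bubble
print(data_hazard([8, 18], [ex, mem]))  # True
```

A bubble carries `reg_write=False`, so it never triggers a further stall by itself.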
Let's go back to the example we analyzed at the beginning. The subtraction instruction starts writing the register only after 800 ps, while the addition instruction reads the register at 500 ps at the latest. We cannot make time run backwards, so we certainly cannot deliver a value that appears at 800 ps to the 500 ps point. But we can look at it differently. Is the result of the subtraction really produced only at 800 ps? The subtraction itself is performed by the ALU in the execute stage (EX), so the value to be written to t0 is ready by 600 ps at the latest. In terms of timing, then, the new value of t0 is available after 600 ps. As for the addition instruction, it truly needs the value of t0 only in its own execute stage, when the ALU takes t0 as one of its inputs, and that stage starts only after 600 ps. So we can feed the result of the subtraction directly to the addition as an input. This method is called data forwarding: the earlier instruction passes its result forward to the later instruction.
As we have just seen, at 600 ps the output of the ALU already holds the value of t0. After the rising clock edge at 600 ps, this value is latched into the pipeline register between the execute stage (EX) and the memory access stage (MEM). If we route it to the input of the ALU, the following addition can be carried out correctly.
Since the timing works out, let's look at how the hardware is modified.
Once the subtraction instruction has finished executing, its result has been saved in the pipeline register (1). Now the subtraction instruction enters the memory access stage, and the value of t0 passes through this stage on its way to the next pipeline register (2). At the same time, the addition instruction is in the execute stage and needs to send the value of t0 to one input of the ALU (3). Clearly the value it read from the register file in the previous stage (4) is not the latest one. The latest value is on a wire in the memory access stage. So we can tap that signal and run a new wire from it back to the input of the ALU (5).
Of course, we also need to add a multiplexer here (3). We have just discussed how to detect a data hazard in the pipeline, so we can use the result of that detection as the select signal of this multiplexer: when a data hazard occurs, it selects the forwarded signal. The addition instruction might also use t0 as its second source operand (S4), so the forwarded signal has to be routed to the other input of the ALU as well, and that input (below marker 3) also needs a multiplexer.
This is forwarding, and it has another name: bypassing. Forwarding and bypassing basically refer to the same thing, seen and described from different angles. Forwarding describes it in terms of the order in which instructions execute, while bypassing describes it in terms of the circuit structure. Normally the earlier instruction would write its result to the register file (6) before the later instruction picks it up. We have now built a new path that goes around the register file and delivers the data directly. From the hardware implementation point of view, then, this is a bypass.
So that is the relationship between forwarding and bypassing. Looking more closely, this point (7) is not the only place where we can set up a bypass; we can also build one at the next stage (8).
And under what circumstances will this bypass be used?
Let's go back to the same example. The first two instructions are the same as before, and we have added a third instruction, an AND operation, one of whose source operands is also t0. Looking at the timing, the AND instruction really starts its operation after 800 ps, by which time the earlier subtraction instruction has completed its memory access stage (MEM). The newest value of t0 is therefore held in the pipeline register between the memory access stage and the write-back stage. So we use the purple bypass line in the structure diagram to deliver the value of t0 to the input of the ALU, so that the AND instruction can execute on time.
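With both bypass paths in place, the multiplexer in front of each ALU input has three candidates to choose from. Here is a minimal Python sketch of that choice. The names `PipeReg` and `alu_operand` are hypothetical. Checking the EX/MEM register first is an assumption that the lecture does not spell out, made so that the most recently computed value wins when both pipeline registers target the same register.

```python
from typing import NamedTuple


class PipeReg(NamedTuple):
    """Hypothetical slice of a pipeline register: whether the
    instruction writes a register, which one, and the value it carries."""
    reg_write: bool
    dest: int
    value: int


def alu_operand(src: int, regfile_value: int,
                ex_mem: PipeReg, mem_wb: PipeReg) -> int:
    """Pick the value for one ALU input whose source register is `src`."""
    if ex_mem.reg_write and ex_mem.dest == src:
        return ex_mem.value      # first bypass: result of the previous instruction
    if mem_wb.reg_write and mem_wb.dest == src:
        return mem_wb.value      # second bypass: result from two instructions back
    return regfile_value         # no hazard: use the register file read


# Register number 8 stands for t0; all values are made up.
# The register file still holds a stale 99, while the subtraction
# result 5 sits in the EX/MEM pipeline register.
print(alu_operand(8, 99, PipeReg(True, 8, 5), PipeReg(False, 0, 0)))  # 5
```

The same function would be instantiated once per ALU input, matching the two multiplexers described in the lecture.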
What if the next instruction uses t0 as well? That instruction, labeled 3, enters the decode stage (ID) right after 800 ps and reads the registers in the second half of that cycle. By then the subtraction instruction has written the value of t0 to the register file. So if instruction 3 uses t0, it can read the value from the register file in the normal way, without forwarding. For arithmetic instructions like these, then, the two bypass paths we have set up are enough to resolve the data hazards. But there is one exception, which we will look at in a new example.
In this example, the first three instructions are the same as before. The fourth is a load instruction. It also uses t0, but as we have already seen, there is no data hazard at that point. The load instruction fetches a value from memory and stores it in the t1 register. Immediately after it, an OR instruction uses the value of t1. In other words, a load instruction is directly followed by an instruction that uses the load's destination register. This case also produces a data hazard, and it has a special name: the load-use hazard.
Can this hazard be resolved with forwarding? Actually it cannot, and let's analyze why. For the load instruction, when exactly does the value destined for t1 become available? For the arithmetic instructions above, the value to be written back is produced in the execute stage by the ALU. For a load instruction, however, the ALU computes the memory address, and the value to be written back to the register file is available only at the end of the memory access stage. So we get the value of t1 at 1400 ps. The following OR instruction, though, needs the value of t1 by 1200 ps at the latest for the ALU to compute correctly. That would require moving a value that appears at 1400 ps back to the earlier moment of 1200 ps, and we cannot go backwards in time. Signals can only travel forward along the time axis, so no matter how we modify the circuit, we cannot build a forwarding path for this case. So how do we resolve the load-use hazard? It sounds hard, but the answer is actually quite simple.
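The timing argument can be checked with a little arithmetic. The Python sketch below assumes that every stage takes 200 ps, which is inferred from the figures quoted in the lecture (600, 800, 1200, 1400 ps) rather than stated outright, and that one instruction enters the pipeline per cycle with no stalls. The names `stage_start`, `stage_end` and `can_forward` are hypothetical.

```python
STAGES = ("IF", "ID", "EX", "MEM", "WB")
STAGE_PS = 200  # assumed stage length, inferred from the lecture's numbers


def stage_start(issue_slot: int, stage: str) -> int:
    """Time in ps at which the instruction issued in `issue_slot`
    (0 for the first instruction) enters `stage`."""
    return (issue_slot + STAGES.index(stage)) * STAGE_PS


def stage_end(issue_slot: int, stage: str) -> int:
    """Time in ps at which that instruction leaves `stage`."""
    return stage_start(issue_slot, stage) + STAGE_PS


def can_forward(producer_slot: int, result_stage: str,
                consumer_slot: int) -> bool:
    """Forwarding works only if the result exists no later than the
    moment the consumer's execute stage begins."""
    return stage_end(producer_slot, result_stage) <= stage_start(consumer_slot, "EX")


# sub (slot 0) produces t0 in EX; add (slot 1) consumes it in EX.
print(stage_end(0, "EX"), stage_start(1, "EX"))   # 600 600
print(can_forward(0, "EX", 1))                    # True

# load (slot 3) produces t1 in MEM; or (slot 4) consumes it in EX.
print(stage_end(3, "MEM"), stage_start(4, "EX"))  # 1400 1200
print(can_forward(3, "MEM", 4))                   # False
```

Moving the consumer one slot later, as in `can_forward(3, "MEM", 5)`, makes the check succeed, which corresponds to delaying the OR instruction by one cycle.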
We fall back on our catch-all method. Since we cannot go back in time, the OR instruction has to wait one cycle, so that it needs the value of t1 only after 1400 ps, when the load instruction has finished reading the data memory. The value can then be delivered through the second bypass path we just set up, the one drawn as a purple line, which carries the value of t1 to the input of the ALU. And since the OR instruction has to be delayed by one cycle, we insert a bubble into the pipeline and let it stall once. So for this kind of hazard we need to combine pipeline stalling with data forwarding.
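A rough model of the load-use check might look like the Python sketch below. It assumes a pipeline with both bypass paths, so that a load followed immediately by a user of its destination register is the only case that costs a bubble. The `Instr` tuple, the mnemonic `lw`, the function names and every operand other than t0 and t1 are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record: mnemonic, destination, sources."""
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


def needs_load_use_stall(in_ex: Instr, in_id: Instr) -> bool:
    """Stall when the instruction in the execute stage is a load whose
    destination is read by the instruction in the decode stage."""
    return in_ex.op == "lw" and in_ex.dest is not None and in_ex.dest in in_id.srcs


def count_bubbles(program: List[Instr]) -> int:
    """Count bubbles for a straight-line program with full forwarding:
    one per load that is immediately followed by a user of its result."""
    return sum(needs_load_use_stall(prev, cur)
               for prev, cur in zip(program, program[1:]))


program = [
    Instr("sub", "t0", ("s0", "s1")),
    Instr("add", "t2", ("t0", "s2")),
    Instr("and", "t3", ("t0", "s3")),
    Instr("lw", "t1", ("t0",)),
    Instr("or", "t4", ("t1", "s4")),
]
print(count_bubbles(program))  # 1
```

Only the load followed by the OR instruction costs a cycle here; the other dependences on t0 are covered by the bypass paths or by the register file.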
Regrettably, this solution keeps the pipeline from reaching its maximum instruction throughput, but our first priority is to make sure instructions execute correctly. So we have to accept it.
At this point, we can handle the data hazards of a basic pipeline. If the pipeline is made deeper, or extended into a superscalar pipeline, new kinds of data hazards appear. There are many ingenious solutions for those as well. If you are interested, you can study them further.