"CPU microarchitecture Design" uses Verilog to design branch predictors based on saturation counters and BTB

In a pipelined microprocessor, the branch prediction unit (BPU) is an important component. It collects and analyzes the parameters and execution results of branch/jump instructions; when a new branch/jump instruction is processed, the BPU predicts its outcome from the accumulated statistics and the parameters of the current instruction, giving the pipeline a basis for deciding where to fetch next and thereby improving pipeline efficiency.

The motivation for the branch prediction mechanism and its practical significance are discussed below.

When the pipeline processes a branch/jump instruction, the target address often cannot be computed until the instruction reaches the execute stage. Before that, the processor does not know the address of the next instruction and therefore cannot continue fetching. One solution is to stall fetch (and the associated pipeline stages) once a branch instruction is recognized, and wait until the branch target address has been computed before resuming fetch. This wastes several pipeline clock cycles and reduces performance. The effect of stalling the pipeline in this way is shown below:

(Figure: pipeline stall caused by a branch instruction in a classic four-stage pipeline; the purple squares represent the branch instructions.) Because this is a simple, idealized pipeline in which the branch target address is obtained in the execute stage, one pipeline clock cycle is wasted.

To improve on this, consider a static branch prediction mechanism: always predict that the branch/jump instruction is not taken. The situation then becomes the following: the instructions after the branch flow into the pipeline by default, and after several pipeline stages the branch target address is computed. If the instructions already in the pipeline are indeed the ones at the actual target address, the pipeline keeps running without a stall; if not, the previously fetched instructions are invalid, the affected pipeline stages are flushed, and the processor must restart fetching from the correct address. This, too, reduces pipeline efficiency.

As can be seen, with static prediction the pipeline avoids stalling on a branch/jump instruction with a certain probability, which reduces the chance of bubble cycles. For high-performance designs, however, this method is still unsatisfactory.

To further improve efficiency, we consider a dynamic branch prediction mechanism. Dynamic prediction predicts both the "direction" and the "target address" of a branch/jump based on statistics recorded from the branch history. If the processor fetches along the predicted path and the prediction matches the actual outcome, the instructions that flowed into the pipeline are all useful and the pipeline keeps running. Fetching based on the predicted direction and target address is called speculative fetch, and executing the instructions fetched in this way is called speculative execution.

With the continuous improvement of prediction algorithms, the accuracy of modern branch predictors approaches 100%. Branch prediction effectively improves pipeline efficiency and is widely used in mainstream microprocessors.

I. Problems that branch prediction must solve

(1) Predict whether the branch is taken, i.e., the "direction" problem;

(2) Predict the target address of the branch instruction, i.e., the "target address" problem.

II. Design and implementation of the branch prediction unit

Common branch prediction mechanisms can be divided into one-level and two-level structures. For two-level predictors, common algorithms include gshare and gselect; these take the execution-history context of branch instructions into account and achieve relatively high accuracy, but they are comparatively complex to implement and are not discussed in this article. A one-level predictor organizes the saturating counters and BTB entries into one-dimensional vector tables and uses a hash of the branch/jump instruction's PC value to index them. This is the simplest predictor structure, and it is the one discussed in detail in this article.

 2.1 Prediction of Branch direction

In this design, a 2-bit saturating counter is used to record four states. Depending on the predicted result and the actual execution result, the counter value changes as shown in the state transition diagram below.

In the diagram, the taken and not-taken outcomes are each further subdivided into strong and weak variants, giving four states in total. Strongly taken and weakly taken are predicted as "taken", while strongly not taken and weakly not taken are predicted as "not taken". Each time the actual outcome disagrees with the current state, the counter moves one step in the opposite direction. In essence, this approach acts as a "damped" switch.
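As an illustration of this state machine, the following is a minimal sketch of a standalone 2-bit saturating counter in Verilog. The module and signal names here (sat_counter_2bit, update, actual_taken, predict_taken) are chosen for this example only and are not part of the predictor module given later:

// Minimal 2-bit saturating counter, for illustration only.
module sat_counter_2bit (
    input  wire clk,
    input  wire rst_n,
    input  wire update,        // asserted when a branch outcome has been resolved
    input  wire actual_taken,  // the resolved branch direction
    output wire predict_taken  // prediction for the next occurrence of the branch
);
    // 2'b11: strongly taken, 2'b10: weakly taken,
    // 2'b01: weakly not taken, 2'b00: strongly not taken
    reg [1:0] state;

    always @(posedge clk or negedge rst_n)
        if (!rst_n)
            state <= 2'b00;
        else if (update && actual_taken && state != 2'b11)
            state <= state + 2'b01;   // move one step toward strongly taken
        else if (update && !actual_taken && state != 2'b00)
            state <= state - 2'b01;   // move one step toward strongly not taken

    // The most significant bit of the counter gives the predicted direction.
    assign predict_taken = state[1];
endmodule

Because two consecutive mispredictions are needed to leave a strong state, a single atypical outcome (for example, the final iteration of a loop) does not immediately flip the prediction.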

2.2 Prediction of branch destination addresses

To simplify the design, this article mainly discusses a predictor based on branch target buffer (BTB) technology. The BTB uses a small cache to store the target addresses of recently executed branch instructions. For subsequent branch instructions, the predicted target address is read directly from the corresponding table entry. When a branch instruction is executed, its actual target address is written back into the BTB, providing the basis for the next prediction.

In this design, BTB entries are addressed using a hash of the branch instruction's PC value, so that each branch/jump instruction establishes a mapping to a BTB table entry.

The hash mapping rule for PC values is defined as follows:

  f : PC → hash,  f(PC) = PC & 1111111111b  (i.e., the lower 10 bits of the PC value are taken as the corresponding hash value)

 2.3 Overall Design Framework

  In summary, the overall structure of the branch predictor is shown below; as can be seen, the predictor has an extremely simple structure:

III. Hardware description language implementation

  Based on the above discussion, it is straightforward to implement such a branch predictor in Verilog HDL.

It is worth noting that the saturating counter is only incremented or decremented when it will not overflow or underflow, and that the most significant bit of the counter value determines the prediction: 1 means taken, 0 means not taken.

module bpu
#(
    parameter PCW  = 32,                 // width of the valid PC
    parameter BTBW = 10                  // width of the BTB address
)
(/*AUTOARG*/
    // Outputs
    pre_taken_o, pre_target_o,
    // Inputs
    clk, rst_n, pc_i, set_i, set_pc_i, set_taken_i, set_target_i
);

    // Ports
    input                 clk;
    input                 rst_n;
    input  [PCW-1:0]      pc_i;          // PC of the current branch instruction
    input                 set_i;
    input  [PCW-1:0]      set_pc_i;
    input                 set_taken_i;
    input  [PCW-1:0]      set_target_i;
    output reg            pre_taken_o;
    output reg [PCW-1:0]  pre_target_o;

    // Local parameters
    localparam SCS_STRONGLY_TAKEN     = 2'b11;
    localparam SCS_WEAKLY_TAKEN       = 2'b10;
    localparam SCS_WEAKLY_NOT_TAKEN   = 2'b01;
    localparam SCS_STRONGLY_NOT_TAKEN = 2'b00;

    wire             bypass;
    wire [BTBW-1:0]  tb_entry;
    wire [BTBW-1:0]  set_tb_entry;

    // PC address hash mapping
    assign tb_entry     = pc_i[BTBW-1:0];
    assign set_tb_entry = set_pc_i[BTBW-1:0];
    assign bypass       = set_i && (set_pc_i == pc_i);

    // Saturating counters
    reg [1:0] counter [(1<<BTBW)-1:0];
    integer i;
    always @(posedge clk or negedge rst_n)
        if (!rst_n) begin
            for (i = 0; i < (1<<BTBW); i = i + 1)      // reset counter entries
                counter[i] <= 2'b00;
        end
        else if (set_i && set_taken_i && counter[set_tb_entry] != SCS_STRONGLY_TAKEN)
            counter[set_tb_entry] <= counter[set_tb_entry] + 2'b01;
        else if (set_i && !set_taken_i && counter[set_tb_entry] != SCS_STRONGLY_NOT_TAKEN)
            counter[set_tb_entry] <= counter[set_tb_entry] - 2'b01;

    always @(posedge clk)
        pre_taken_o <= bypass ? set_taken_i : counter[tb_entry][1];

    // BTB vectors
    reg [PCW-1:0] btb [(1<<BTBW)-1:0];
    integer j;
    always @(posedge clk or negedge rst_n)
        if (!rst_n) begin
            for (j = 0; j < (1<<BTBW); j = j + 1)      // reset BTB entries
                btb[j] <= {PCW{1'b0}};
        end
        else if (set_i)
            btb[set_tb_entry] <= set_target_i;

    always @(posedge clk)
        pre_target_o <= bypass ? set_pc_i : btb[tb_entry];

endmodule
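As a usage illustration, below is a minimal testbench sketch for this module. It is not an exhaustive verification, and the instance name, stimulus values and timing are assumptions made for this example: it reports two resolved "taken" outcomes for a branch at address 0x100 and then checks the prediction for the same PC.

`timescale 1ns/1ps

module tb_bpu;
    reg         clk   = 1'b0;
    reg         rst_n = 1'b0;
    reg  [31:0] pc_i         = 32'h0;
    reg         set_i        = 1'b0;
    reg  [31:0] set_pc_i     = 32'h0;
    reg         set_taken_i  = 1'b0;
    reg  [31:0] set_target_i = 32'h0;
    wire        pre_taken_o;
    wire [31:0] pre_target_o;

    // Device under test; PCW is set to match the 32-bit PCs used here.
    bpu #(.PCW(32), .BTBW(10)) dut (
        .clk(clk), .rst_n(rst_n),
        .pc_i(pc_i), .set_i(set_i), .set_pc_i(set_pc_i),
        .set_taken_i(set_taken_i), .set_target_i(set_target_i),
        .pre_taken_o(pre_taken_o), .pre_target_o(pre_target_o)
    );

    always #5 clk = ~clk;   // 100 MHz clock

    initial begin
        #12 rst_n = 1'b1;

        // Report two resolved "taken" outcomes for the branch at 0x100,
        // moving its counter from 2'b00 to 2'b10 and filling its BTB entry.
        @(negedge clk);
        set_i = 1'b1; set_pc_i = 32'h100;
        set_taken_i = 1'b1; set_target_i = 32'h200;
        @(negedge clk);
        @(negedge clk);
        set_i = 1'b0;

        // Ask for a prediction for the same PC.
        pc_i = 32'h100;
        @(negedge clk);
        #1 $display("taken=%b target=%h", pre_taken_o, pre_target_o);
        // Expected: taken=1 target=00000200

        #20 $finish;
    end
endmodule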

A few points about the implementation are worth explaining:

(1) The BPU completes a prediction and an update in every clock cycle. The pc_i port receives the PC value of the instruction to be predicted, and the set_i port indicates whether the predictor should be updated when the next clock edge arrives.

(2) To handle a particular instruction-flow corner case, the implementation adds a bypass mechanism. For a specific processor implementation the bypass may never be triggered, in which case this part of the implementation can be removed.

(3) For a branch instruction that is not yet recorded in the BTB, the BPU predicts a target address of 0 by default (the BTB entries are reset to 0). For a specific processor implementation, consider using the PC of the instruction following the branch as the predicted target address instead; a possible sketch of this change is shown below.
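One way to realize this suggestion is to add a per-entry valid bit to the BTB and fall back to the sequential next PC on a miss. The following fragment is only a sketch meant to replace the pre_target_o logic inside the module above; the btb_valid array is a hypothetical addition that is not part of the original design, and the sketch assumes the PC counts whole instructions, so the next instruction is simply pc_i + 1.

// Hypothetical per-entry valid bits for the BTB (not part of the original module).
reg [(1<<BTBW)-1:0] btb_valid;

always @(posedge clk or negedge rst_n)
    if (!rst_n)
        btb_valid <= {(1<<BTBW){1'b0}};
    else if (set_i)
        btb_valid[set_tb_entry] <= 1'b1;   // mark an entry once it has been written

// On a BTB miss, predict the sequential next instruction instead of address 0.
always @(posedge clk)
    pre_target_o <= bypass              ? set_pc_i      :
                    btb_valid[tb_entry] ? btb[tb_entry] :
                                          pc_i + 1'b1;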

IV. Summary

Through the above discussion, we have proposed a design for a branch prediction unit based on saturating counters and a BTB, and implemented a prototype of the branch predictor in Verilog. The implementation occupies a relatively small area, which makes it suitable for microprocessor experiments and verification, but the design also has some shortcomings:

(1) When hashing the PC value, only the low 10 bits of the PC are used as the BTB index, so branch instructions whose PCs differ only in the upper bits alias to the same entry and corrupt each other's recorded state, reducing prediction accuracy (a possible mitigation is sketched after this list);

(2) The design only considers the global state of each branch instruction and does not track the specific execution context (history) of the branch;

(3) For conditional branch instructions, and for jump instructions using register-direct or register-indirect addressing, the operands are stored in registers whose values change frequently, so the accuracy of the predicted target address for such branches is reduced.
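Regarding shortcoming (1), a common mitigation is to store a partial tag (some upper PC bits) alongside each entry and trust a prediction only when the tag matches. The fragment below is one possible shape of such a change inside the module above; TAGW, btb_tag and tag_hit are hypothetical names introduced for this example and are not part of the design presented earlier.

// Hypothetical partial-tag extension (not part of the original module).
localparam TAGW = 8;   // example tag width: PC bits [BTBW+TAGW-1 : BTBW]

reg  [TAGW-1:0] btb_tag [(1<<BTBW)-1:0];
wire            tag_hit;

assign tag_hit = (btb_tag[tb_entry] == pc_i[BTBW+TAGW-1:BTBW]);

always @(posedge clk)
    if (set_i)
        btb_tag[set_tb_entry] <= set_pc_i[BTBW+TAGW-1:BTBW];

// Trust the direction prediction only when the tag matches;
// otherwise fall back to predicting not taken.
always @(posedge clk)
    pre_taken_o <= bypass ? set_taken_i : (tag_hit && counter[tb_entry][1]);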

===================================================

This blog post is for reference only; it inevitably contains mistakes or omissions, and corrections and suggestions are welcome.

"CPU microarchitecture Design" uses Verilog to design branch predictors based on saturation counters and BTB

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.