Branch Prediction and Branch Predication


Branch Prediction and Branch Predication are both techniques for reducing the cost that conditional branch statements impose on hardware execution efficiency. Branch prediction is used in CPUs, where the goal is the highest possible single-thread execution efficiency. Branch predication is used in SPMD-style computing devices, which are designed with throughput as the primary goal; the GPU is the representative of this class of hardware.

Branch Prediction

Branch prediction aims to keep the CPU pipeline fully loaded most of the time, so that the running thread executes at the highest efficiency. A CPU pipeline can be roughly divided into four stages: instruction fetch, instruction decode, execute, and write-back. When the code contains a conditional branch statement, the instruction that follows the branch cannot be determined until the branch condition has been evaluated. Until that result is available, the pipeline cannot fetch or decode the next instruction (because it does not know which instruction comes next); in other words, the pipeline stalls. To avoid this stall and improve execution efficiency, the CPU's branch-prediction hardware uses the program's execution history to predict where the next instruction is and feeds it into the pipeline before the condition is actually computed. When the prediction turns out to be correct, the pipeline does not stall and execution efficiency is at its highest.
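To make the cost concrete, here is a back-of-envelope cycle count for the four-stage pipeline above. The figures are illustrative assumptions, not measurements: one branch per five instructions, two bubble cycles per unresolved branch, and 90% predictor accuracy.

#include <cstdio>

int main()
{
    const int kStages = 4;        // fetch, decode, execute, write-back
    const int kInstructions = 100;
    const int kBranches = 20;     // assumed: one in five instructions is a branch
    const int kBubbleCycles = 2;  // assumed: fetch + decode slots lost per stall

    // Without prediction, every branch stalls the front end until the
    // execute stage resolves it.
    int cyclesNoPrediction =
        (kStages - 1) + kInstructions + kBranches * kBubbleCycles;

    // With a predictor that is right 90% of the time (assumed accuracy),
    // only mispredicted branches pay the bubble cycles.
    const double accuracy = 0.9;
    double cyclesWithPrediction =
        (kStages - 1) + kInstructions + kBranches * kBubbleCycles * (1.0 - accuracy);

    std::printf("cycles without prediction: %d\n", cyclesNoPrediction);
    std::printf("cycles with prediction:    %.1f\n", cyclesWithPrediction);
    return 0;
}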

If a CPU had no branch prediction, conditional branch statements would always execute at the lowest efficiency, because the pipeline would stall on every branch. With branch prediction, the pipeline has a good chance of not stalling, which improves execution efficiency. In particular, a program written to cooperate with the branch predictor can run very efficiently. For example, for code that iterates over an int array of a few hundred elements and operates only on elements greater than 100, execution is clearly faster on a sorted array than on an unsorted one. Pseudocode:

for (int i : intArray)
{
    if (i > 100)
    {
        // operate on i
    }
}

When this code runs on a sorted array, branch-prediction failures (that is, pipeline stalls) occur only in the few iterations around the point where i starts to exceed 100; most of the time the pipeline is fully loaded. If the array is unsorted, prediction failures are common and the code runs noticeably slower.
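This effect is easy to reproduce. Below is a minimal, self-contained C++ sketch of the sorted-versus-unsorted comparison; the array size, value range, and pass count are arbitrary choices, and actual timings vary by compiler, optimization flags, and CPU (an aggressive optimizer may even vectorize the branch away).

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

int main()
{
    // One million random values in [0, 200), so "v > 100" is roughly a coin flip.
    std::vector<int> data(1 << 20);
    std::mt19937 rng(42);
    for (int &v : data) v = static_cast<int>(rng() % 200);

    auto timeSum = [](const std::vector<int> &a, const char *label)
    {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = 0;
        for (int pass = 0; pass < 50; ++pass)
            for (int v : a)
                if (v > 100)        // the branch being predicted
                    sum += v;
        auto t1 = std::chrono::steady_clock::now();
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("%s: sum=%lld, %lld ms\n", label, sum, ms);
    };

    timeSum(data, "unsorted");            // mispredictions are frequent
    std::sort(data.begin(), data.end());
    timeSum(data, "sorted");              // mispredicts mostly near the boundary
    return 0;
}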

Finally, branch prediction requires dedicated hardware support, which inevitably increases hardware complexity and power consumption.

Branch Predication

First, a word on SPMD and the corresponding hardware structure. SPMD stands for Single Program Multiple Data and can be understood as an extension of SIMD. With SIMD, the processing of multiple data items is confined to a single instruction; with SPMD, the programmer writes a whole program that processes multiple data items as needed, and that program is executed in parallel by many threads. A GPU shader program is an example of SPMD.
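As a rough illustration of the difference in plain C++ (both functions are hypothetical, written only to contrast the two models):

// SIMD: multiple data, but only within one instruction at a time;
// conceptually this loop is a single 4-wide vector add.
void simdAdd(int a[4], const int b[4])
{
    for (int lane = 0; lane < 4; ++lane)
        a[lane] += b[lane];
}

// SPMD: a whole program written once and executed in parallel by many
// threads; each thread selects its data by thread id, as a shader does.
void spmdKernel(int threadid, int *data)
{
    int v = data[threadid];
    data[threadid] = v * v + threadid;
}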

In hardware, multiple ALUs are bound together and share one PC (program counter) to form an SPMD execution unit; the SMX units of NVIDIA GPUs and the CUs of AMD GPUs are SPMD execution units. The number of ALUs in an SPMD unit determines the theoretical maximum number of parallel threads. To simplify the hardware and reduce power consumption, an SPMD unit does not provide a branch-prediction module for each ALU. More importantly, since all ALUs share one PC, a per-ALU branch predictor would bring no improvement in execution efficiency anyway.

A conditional branch causes the execution paths of the ALUs in an SPMD unit to diverge. Because these ALUs share one PC, they cannot execute different paths in parallel. To implement conditional branches on SPMD hardware, the compiler must apply an equivalent transformation to the branching code, converting the divergent execution paths into one uniform path. See the following example:

A. The original code, in which the execution paths diverge:

int condition = threadid + 100;
int result = 0;
if (condition > 220)
{
    result += 10;
}
else
{
    result -= 10;
}

B. One possible transformed version:

int result = 0;
int condition = threadid + 100;
int temp1 = result + 10;
int temp2 = result - 10;
result = (condition > 220) ? temp1 : temp2;

The transformed code follows the same execution path on every ALU, and no conditional branch remains. Both branch bodies are executed, and branch predication then determines, based on the condition, which branch's result to accept.
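A toy C++ model of this per-lane selection (the lane count and thread ids are made up for illustration; real hardware keeps a per-lane predicate bit rather than a C++ bool):

#include <cstdio>

int main()
{
    const int kLanes = 4;                        // illustrative lane count
    int threadid[kLanes] = {100, 110, 130, 140}; // hypothetical thread ids

    for (int lane = 0; lane < kLanes; ++lane)
    {
        int result = 0;
        int condition = threadid[lane] + 100;
        int temp1 = result + 10;                 // "if" side, always computed
        int temp2 = result - 10;                 // "else" side, always computed
        bool predicate = condition > 220;        // per-lane predicate bit
        result = predicate ? temp1 : temp2;      // predication selects a result
        std::printf("lane %d: condition=%d, result=%d\n", lane, condition, result);
    }
    return 0;
}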

Branch predication requires instruction-set support. Instructions that support predication execute different logic depending on the value of some status register. Examples include the x86 cmov family of instructions, and GPU instructions from some vendors that either execute or act as a NOP depending on a status-register value (the legendary instruction-level polymorphism?).
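For instance, a branchless select such as the one below is the pattern that mainstream x86-64 compilers commonly lower to cmp plus cmov when optimizing; this is a sketch, and actual code generation depends on the compiler and flags.

// Often compiled to a conditional move (cmov) instead of a conditional
// jump, so the pipeline never has to guess a branch target.
int selectBranchless(int condition, int temp1, int temp2)
{
    return condition > 220 ? temp1 : temp2;
}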

Code transformed for branch predication takes longer to execute than the original conditional branch, because every branch path is executed. For this reason a technique was proposed that lets SPMD code keep its conditional branches while preserving execution efficiency: dynamic warp formation.

Dynamic warp formation

A warp is a thread group. It is the minimum parallel granularity and the minimum scheduling unit of an SPMD execution unit (NVIDIA calls it a warp, AMD a wavefront). An NVIDIA warp, for example, is 32 threads: 32 ALUs in the SPMD unit execute the same instruction in parallel. You can also think of a warp as a hardware thread whose execution width is 32 ALUs, whereas the execution width of a CPU thread is usually 1 ALU. Furthermore, a GPU ALU can be as wide as 4 x 64 bits, so the data width of one warp reaches 32 x 4 x 64 bits, and a GPU can execute many warps at the same time. This is why GPU throughput and CPU throughput are not on the same order of magnitude.
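Spelling out the width arithmetic from the paragraph above (the 32-lane warp is NVIDIA's figure; the 4 x 64-bit per-lane ALU width is the assumption used in the text):

#include <cstdio>

int main()
{
    const int kWarpLanes = 32;       // threads per NVIDIA warp
    const int kAluBits = 4 * 64;     // assumed per-lane ALU width, 256 bits
    const int kWarpBits = kWarpLanes * kAluBits;

    // 32 * 256 = 8192 bits, i.e. 1024 bytes of data per warp per issue.
    std::printf("per-warp data width: %d bits (%d bytes)\n",
                kWarpBits, kWarpBits / 8);
    return 0;
}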

In the following, warp denotes a thread group, on the premise that 32 threads form a warp. Recall the situation where SPMD code reaches a conditional branch. If paths A and B appear, the 32 threads in the warp execute both path A and path B, and each thread then accepts either A's or B's result according to its own branch condition. Either way, some of the computed results are meaningless to some of the threads.

Splitting a single warp into several sub-warps by branch path and running them in parallel gains nothing, because each sub-warp must be narrower than 32 lanes. Applied across multiple warps, however, the idea works. Suppose warp1, warp2, and warp3 execute the same code and reach the same conditional branch, which produces path A and path B. The sub-warps split out by path are regrouped into new, full warps that execute in parallel; once the branch paths finish, the threads are restored to their original warps. In this way conditional branches are handled efficiently, with no meaningless work. This is dynamic warp formation; for more detail, search for the original paper.
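A sketch of the regrouping idea only (the warp count and per-thread branch outcomes are invented; real hardware must also respect register-file layout and scheduling constraints):

#include <cstdio>
#include <vector>

int main()
{
    const int kWarpSize = 32;
    const int kWarps = 3;                   // warp1, warp2, warp3 as above

    // Group thread ids by the branch path each thread takes.
    std::vector<int> pathA, pathB;
    for (int t = 0; t < kWarps * kWarpSize; ++t)
    {
        bool takesA = (t % 3) != 0;         // assumed per-thread branch outcome
        (takesA ? pathA : pathB).push_back(t);
    }

    // Pack each path's threads into full warps: only the last warp per path
    // can be partially filled, instead of every warp running both paths.
    auto report = [&](const char *name, const std::vector<int> &threads)
    {
        int count = static_cast<int>(threads.size());
        std::printf("%s: %d threads -> %d full warps + %d leftover lanes\n",
                    name, count, count / kWarpSize, count % kWarpSize);
    };
    report("path A", pathA);
    report("path B", pathB);
    return 0;
}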
