AMD OpenCL University Course (10)


GPU Threads and Scheduling

This section describes how workgroups in OpenCL are scheduled and executed on hardware devices. We will also discuss what happens when work-items in the same workgroup diverge, that is, when they need to execute different instructions, and how this affects performance. Learning OpenCL parallel programming is not only about the OpenCL spec, but also about the characteristics of OpenCL hardware devices. At this stage we mainly look at GPU architecture features, so that we can write algorithms that are optimized for the hardware.

At the time of writing, the OpenCL specification is at version 1.1. As hardware develops, we believe OpenCL will support more parallel computing features; OpenCL-based parallel computing is just getting started.

1. Workgroups to Hardware Threads

In OpenCL, a kernel function is executed by work-items (threads; I may use these two terms interchangeably). At the hardware level, a workgroup is mapped to a CU (compute unit), which performs the actual computation. A CU generally consists of multiple SIMT (single instruction, multiple threads) PEs (processing elements). These PEs carry out the individual work-item computations: they execute the same instruction but operate on different data, completing the final calculation in SIMD fashion.

Due to hardware restrictions, such as the number of PEs in a CU, the threads in a workgroup are in fact not all executed at the same time. Instead, there is a scheduling unit: the threads of a workgroup are divided into groups by this scheduling unit, and each group is then scheduled onto the hardware for execution. This scheduling unit is called a warp on NV hardware and a wavefront, or wave for short, on AMD hardware.
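
Before looking at scheduling in more detail, it may help to see the hardware limits that govern this mapping. The following host-side sketch is mine (the function name print_cu_limits is made up, and it assumes a valid cl_device_id has already been obtained); the clGetDeviceInfo queries themselves are standard OpenCL 1.1.

#include <stdio.h>
#include <CL/cl.h>

void print_cu_limits(cl_device_id dev)
{
    cl_uint num_cus    = 0;         // number of compute units (CUs)
    size_t  max_wg     = 0;         // maximum work-items per workgroup
    size_t  max_dim[3] = {0, 0, 0}; // per-dimension limits (assumes 3 dimensions, the usual case)

    clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(num_cus), &num_cus, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(max_wg), &max_wg, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_ITEM_SIZES, sizeof(max_dim), max_dim, NULL);

    printf("compute units: %u\n", num_cus);
    printf("max workgroup size: %zu\n", max_wg);
    printf("max work-item sizes: %zu x %zu x %zu\n", max_dim[0], max_dim[1], max_dim[2]);
}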

The figure in the original course (not reproduced here) shows how the threads of a workgroup are divided into different waves. The threads in a wave execute the same instruction in lockstep, but each thread has its own register state and can take different control branches. For example, consider a control statement:

if (...)
{
    ... // Branch A
}
else
{
    ... // Branch B
}

Assume that among the 64 threads in a wave, the odd-numbered threads take branch A and the even-numbered threads take branch B. Because the threads in a wave must execute the same instruction, this control statement is split into two passes (branch predication is set up in the compilation phase). In the first pass the odd threads execute branch A while the even threads perform a null operation; in the second pass the even threads execute branch B while the odd threads perform a null operation. The hardware keeps a 64-bit execution mask register: in the first pass it is 01...0101, and in the second pass it is inverted to 10...1010. Each thread executes or idles according to its bit in the mask register. It follows that for kernels with many branches, too much divergence between threads will hurt program performance.
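
The following hypothetical kernels (the names odd_even_branch and odd_even_select and the float arithmetic are my own placeholders, not from the course) sketch the odd/even branch above in two equivalent forms: in the first, the wave executes both branches under the mask just described; in the second, the same result is written as a selection between two precomputed values, which is essentially the predication discussed in section 5. For a short branch like this, both forms cost roughly the time of branch A plus branch B.

__kernel void odd_even_branch(__global float* out, __global const float* in)
{
    int tid = get_global_id(0);
    if (tid % 2 == 0)
        out[tid] = in[tid] * 2.0f;   // branch A
    else
        out[tid] = in[tid] + 1.0f;   // branch B
}

__kernel void odd_even_select(__global float* out, __global const float* in)
{
    int tid = get_global_id(0);
    float a = in[tid] * 2.0f;        // branch A result
    float b = in[tid] + 1.0f;        // branch B result
    out[tid] = (tid % 2 == 0) ? a : b;  // selection instead of a divergent branch
}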

2. AMD Wave Scheduling

The thread scheduling unit on AMD GPUs is the wave, and each wave contains 64 threads. The instruction issue unit issues VLIW bundles of up to five slots; each stream core (SC) executes one such VLIW instruction, so the 16 stream cores execute 16 VLIW instructions in one clock cycle. Thus a quarter of a wave is processed in each clock cycle, and four consecutive clock cycles are needed to execute the whole wave.

We also have the following points worth understanding:

    • In the case of a RAW (read-after-write) hazard, the whole wave must stall for four clock cycles. If other waves are available during that time, the ALUs execute them to hide the latency; after eight clock cycles, once the waiting wave is ready, the ALUs continue executing it.
    • Two waves can completely hide the RAW latency: while the first wave executes, the second wave waits for its data during scheduling; when the first wave finishes executing, the second wave can start executing immediately.
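
These cycle counts can be turned into a small back-of-the-envelope calculation. The sketch below is my own (the constant names are made up, and the figures are simply the ones quoted in this section): issuing one wave keeps the ALUs busy for 64 / 16 = 4 cycles, and a wave that hits a RAW hazard stalls for 4 cycles, so one extra wave is enough to fill the gap.

#include <stdio.h>

int main(void)
{
    const int wave_size        = 64; // threads per AMD wavefront
    const int stream_cores     = 16; // stream cores issuing per clock cycle
    const int raw_stall_cycles = 4;  // stall cycles caused by a RAW hazard

    int cycles_per_wave = wave_size / stream_cores;               // = 4
    int waves_to_hide   = 1 + raw_stall_cycles / cycles_per_wave; // = 2

    printf("cycles to issue one wave: %d\n", cycles_per_wave);
    printf("waves needed to hide the RAW stall: %d\n", waves_to_hide);
    return 0;
}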

3. NV Warp Scheduling

A workgroup is divided into warps of 32 threads, which are scheduled and executed by an SM. In each warp, half of the threads (a half-warp) are issued for execution at a time, and these half-warps can be interleaved. The number of warps available depends on the resources used by each block. Apart from the different sizes, waves and warps have similar hardware characteristics.

4. Occupancy

In each CU, the number of waves that can be active at the same time is limited. The limit depends on the number of registers used by each thread and on the amount of local memory used, because the total register file and local memory of a CU are fixed.

We use occupancy to measure the number of active waves in a CU. The more waves that are active at the same time, the better latency can be hidden. We will discuss occupancy in detail in the performance optimization section.
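
As a rough illustration of how these limits interact, the sketch below uses made-up numbers (all constants and names are hypothetical; real figures come from the vendor documentation and profiling tools): the number of active waves per CU is the smaller of the limit imposed by registers and the limit imposed by local memory.

#include <stdio.h>

int main(void)
{
    // assumed per-CU resources (hypothetical values)
    const int regs_per_cu      = 16384;  // total registers in one CU
    const int local_mem_per_cu = 32768;  // bytes of local memory in one CU
    const int wave_size        = 64;     // threads per wave (AMD)

    // assumed per-kernel usage (hypothetical values)
    const int regs_per_thread  = 32;     // registers used by each thread
    const int local_per_group  = 8192;   // bytes of local memory per workgroup
    const int waves_per_group  = 4;      // workgroup of 256 threads = 4 waves

    int waves_by_regs = regs_per_cu / (regs_per_thread * wave_size);             // = 8
    int waves_by_lds  = (local_mem_per_cu / local_per_group) * waves_per_group;  // = 16

    int active_waves = waves_by_regs < waves_by_lds ? waves_by_regs : waves_by_lds;
    printf("active waves per CU limited to %d\n", active_waves);
    return 0;
}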

5. Control Flow and Branch Predication

As mentioned earlier, this is how if-else branches are executed: when the threads in a wave diverge, a mask is used to control each thread's execution path. Predication is chosen based on the following considerations:

    • The branch code is relatively short.
    • This method is more efficient than conditional jump instructions.
    • During compilation, the compiler can replace a switch or if-else statement with predication.

Predication can be defined as follows: based on the judgment condition, the condition code (predicate) of each thread is set to true or false.

__kernel
void test()
{
    int tid = get_local_id(0);
    if (tid % 2 == 0)
        do_some_work();
    else
        do_other_work();
}

For example, the code above can be predicated:

Predicate = true for threads 0, 2, 4 ....

Predicate = false for threads 1, 3, 5 ....

The following is an example of control flow divergence.

    • In Case 1, all odd-numbered threads execute do_other_work() and all even-numbered threads execute do_some_work(); within each wave, the instructions of both the if and the else code must still be issued.
    • In Case 2, the first wave executes the if branch and the remaining waves execute the else branch; in this case, each wave issues the instructions of only one branch.

With predication, the instruction execution time is the sum of the execution times of the if and else code.
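
Case 1 corresponds to the test() kernel shown above. A hypothetical sketch of Case 2 follows (the helper functions work_a and work_b, the data layout, and the assumption that the wave size is 64 are my own): because the branch condition is uniform within each wave, every wave issues only one side of the branch.

// placeholders standing in for do_some_work()/do_other_work() above
float work_a(float x) { return x * 2.0f; }
float work_b(float x) { return x + 1.0f; }

__kernel void case2(__global float* data)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    if (lid < 64)                 // the first wave takes the if path
        data[gid] = work_a(data[gid]);
    else                          // the remaining waves take the else path
        data[gid] = work_b(data[gid]);
}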

6. Warp Voting

Warp voting is a mechanism for implicit synchronization among the threads in a warp.

For example, the threads of a warp may write to the same local memory address at the same time; during concurrent execution, the warp voting mechanism can be used to ensure these accesses are handled in a well-defined order. For more details about warp voting, refer to the CUDA documentation.

In OpenCL programming, because hardware devices differ, we must optimize for each specific piece of hardware. This is one of the challenges of OpenCL programming. For example, since warp and wave sizes differ, we must tune the workgroup size for our own platform: if we choose 32, then on an AMD GPU half of the 64 threads in a wave may sit idle; if we choose 64, then on NV GPUs resource contention may increase, for example in the allocation of registers and local memory. We will not discuss hybrid CPU devices here. The road of OpenCL parallel programming is still long, and we look forward to the emergence of new OpenCL architectures.
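
Rather than hard-coding 32 or 64, the host can ask the runtime for the scheduling granularity. The following is a minimal sketch (the function name is mine, and it assumes a built cl_kernel and a cl_device_id are already available): in OpenCL 1.1, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE typically reports the warp size on NV GPUs and the wave size on AMD GPUs, so the workgroup size can be chosen as a multiple of it.

#include <stdio.h>
#include <CL/cl.h>

size_t preferred_wg_multiple(cl_kernel kernel, cl_device_id dev)
{
    size_t multiple = 0;
    clGetKernelWorkGroupInfo(kernel, dev,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(multiple), &multiple, NULL);
    printf("preferred work-group size multiple: %zu\n", multiple);
    return multiple;
}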
