CUDA 5: GPU Architecture
The SM (Streaming Multiprocessor) is a central component of the GPU architecture: the hardware concurrency of the GPU is determined by its SMs.
Taking the Fermi architecture as an example, it includes the following main components:
- CUDA cores
- Shared Memory / L1 Cache
- Register File
- Load/Store Units
- Special Function Units
- Warp Scheduler
Each SM in the GPU is designed to support hundreds of concurrently executing threads, and a GPU contains many SMs, so the GPU as a whole can keep thousands of threads in flight. When a kernel is launched, its thread blocks are distributed among the SMs for execution. Different blocks may land on different SMs, but all threads of a given block are guaranteed to execute on the same SM.
CUDA uses the Single Instruction Multiple Thread (SIMT) architecture to manage and execute threads. Threads are grouped into units of 32 called warps. All threads in a warp execute the same instruction in parallel, but each thread has its own instruction address counter and status registers and operates on its own data.
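To make the warp grouping concrete, here is a minimal sketch (the kernel name `whoAmI` is made up for illustration) in which each thread derives the warp it belongs to and its lane within that warp from the built-in `warpSize`:

```cpp
#include <cstdio>

// Each thread computes its warp index within the block and its
// lane (position) within that warp. warpSize is 32 on current GPUs.
__global__ void whoAmI() {
    int tid  = threadIdx.x;      // thread index within the block
    int warp = tid / warpSize;   // which warp of the block
    int lane = tid % warpSize;   // position within the warp (0..31)
    if (lane == 0)               // one line of output per warp
        printf("block %d, warp %d starts at thread %d\n", blockIdx.x, warp, tid);
}

int main() {
    whoAmI<<<2, 128>>>();        // 2 blocks x 128 threads = 4 warps per block
    cudaDeviceSynchronize();     // wait for the kernel and flush device printf
    return 0;
}
```

Launched with 128-thread blocks, each block is split into four warps, and only lane 0 of each warp prints.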
SIMT is similar to SIMD (Single Instruction, Multiple Data): both broadcast the same instruction to multiple execution units for parallel execution. A major difference is that SIMD requires all elements of a vector to execute together in lockstep, whereas SIMT allows the threads of a warp to execute independently. SIMT has three features that SIMD does not (see the sketch after this list):
- Each thread has its own instruction address counter
- Each thread has its own status register
- Each thread can have its own independent execution path
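As an illustration of the last point, the following hypothetical kernel branches on the thread index, so the two halves of every warp take different paths; the hardware then runs both paths one after the other, masking off the inactive lanes in each:

```cpp
// Threads in the same warp take different branches (warp divergence):
// even lanes execute the first path while odd lanes are masked off,
// then the roles are swapped for the second path.
__global__ void divergent(float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0)
        out[tid] = 100.0f;   // even lanes run first...
    else
        out[tid] = 200.0f;   // ...then odd lanes run
}
```

Branching on a warp-aligned condition instead, e.g. `(tid / warpSize) % 2`, keeps each warp on a single path and avoids the divergence penalty.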
A block is scheduled onto exactly one SM; once a block is assigned to an SM, it stays on that SM until execution finishes, and one SM can hold several blocks at the same time. The software and hardware terms correspond as follows: a thread runs on a CUDA core, a thread block runs on an SM, and a grid runs on the whole device.
Note that the parallelism is mostly logical: not all threads can physically execute at the same instant. As a result, threads in the same block may progress at different rates.
Sharing data between parallel threads can produce a race condition: when multiple threads access the same data in an unordered way, the result is undefined. CUDA provides an API to synchronize the threads within a block, ensuring that all of them reach a given point before any proceeds further. However, there is no primitive for synchronizing threads across different blocks.
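The block-level API mentioned above is `__syncthreads()`. Here is a minimal sketch (assuming a single block of 256 threads; the kernel name is made up) where the barrier is required to avoid a shared-memory race:

```cpp
// Reverse a 256-element array in shared memory. The barrier guarantees
// every thread has written its slot before any thread reads another's;
// without it, the reads below would race with the writes above.
__global__ void reverseBlock(int *data) {
    __shared__ int tile[256];            // assumes blockDim.x == 256
    int t = threadIdx.x;
    tile[t] = data[t];                   // each thread writes one slot
    __syncthreads();                     // wait for all writes in the block
    data[t] = tile[blockDim.x - 1 - t];  // now safe to read any slot
}
```

Note that `__syncthreads()` only synchronizes the threads of one block, consistent with the lack of inter-block synchronization noted above.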
Warps can be scheduled in any order, and the number of active warps is limited by SM resources. When a warp stalls, the SM can switch to another warp that is ready to run. Switching between concurrent warps costs nothing, because the hardware resources for every resident thread and block are already allocated, so the state of the newly scheduled warp is already held on the SM.
The SM can be seen as the heart of the GPU, and its registers and shared memory are scarce resources. CUDA partitions these resources among all threads resident on the SM, so these limits place a strict cap on the number of active warps per SM, which in turn caps parallelism. Mastering some hardware knowledge therefore helps when tuning CUDA performance.
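These per-device resources can be inspected at run time with `cudaGetDeviceProperties`; a small sketch that prints the quantities discussed above for device 0:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Query the per-SM resources that bound the number of active warps.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Device:                %s (compute capability %d.%d)\n",
           prop.name, prop.major, prop.minor);
    printf("SM count:              %d\n",  prop.multiProcessorCount);
    printf("Registers per block:   %d\n",  prop.regsPerBlock);
    printf("Shared mem per block:  %zu bytes\n", prop.sharedMemPerBlock);
    printf("Max threads per SM:    %d\n",  prop.maxThreadsPerMultiProcessor);
    printf("Warp size:             %d\n",  prop.warpSize);
    return 0;
}
```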
Fermi Architecture
Fermi was the first complete GPU computing architecture. Its main features:
- 512 accelerator cores (including ALU and FPU)
- 16 SM, each containing 32 CUDA Cores
- A 384-bit GDDR5 memory interface (six 64-bit partitions), supporting up to 6 GB of on-board global memory
- A GigaThread engine that schedules and distributes thread blocks to the SMs
- 768KB L2 cache
- Each SM has 16 load/store units, so source and destination addresses can be calculated for 16 threads (a half-warp) per clock cycle
- Special Function Units (SFUs) execute transcendental instructions such as sine, cosine, and square root
- Each SM has two warp schedulers and two instruction dispatch units. When a block is assigned to an SM, its threads are divided into warps, and the two schedulers select ready warps and issue their instructions
- On Fermi (compute capability 2.x), each SM can have up to 48 warps resident at once, i.e. 1536 threads in total
Each SM consists of the following parts:
- CUDA cores
- Warp schedulers and instruction dispatch units
- Shared memory, register file, L1 cache
Kepler Architecture
Kepler is faster and more power-efficient than Fermi, delivering better performance.
- 15 SMs
- Six 64-bit memory controllers
- Each SM has 192 single-precision CUDA cores, 64 double-precision units, 32 SFUs, and 32 load/store (LD/ST) units
- The register file grows to 64 K registers per SM
- Each Kepler SM contains four warp schedulers and eight instruction dispatch units, so one SM can execute four warps concurrently
- On Kepler K20X (compute capability 3.5), each SM can schedule 64 warps at once, for a total of 2048 threads
Dynamic Parallelism
Dynamic parallelism is a new feature of Kepler. It allows the GPU to launch new grids dynamically: any kernel can launch other kernels. This makes kernel recursion and data dependencies between kernels directly expressible. The scattering of light in a 3D scene, for example, can be implemented this way.
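A hedged sketch of a device-side launch (kernel names are made up; on the Kepler-era toolkit this requires compute capability 3.5 and compiling with `nvcc -arch=sm_35 -rdc=true -lcudadevrt`):

```cpp
#include <cstdio>

__global__ void childKernel() {
    printf("child: thread %d\n", threadIdx.x);
}

__global__ void parentKernel() {
    if (threadIdx.x == 0) {
        childKernel<<<1, 4>>>();     // a kernel launched from inside a kernel
        cudaDeviceSynchronize();     // device-side wait for the child grid
    }                                // (Kepler-era API; deprecated in newer toolkits)
}

int main() {
    parentKernel<<<1, 1>>>();
    cudaDeviceSynchronize();         // host waits for parent and child
    return 0;
}
```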
Hyper-Q
Hyper-Q is another new feature of Kepler. It adds more hardware connections between the CPU and GPU, so the CPU can run more tasks on the GPU at the same time, which raises GPU utilization and reduces CPU idle time. Fermi relied on a single hardware work queue to pass tasks from the CPU to the GPU, so when one task blocked, all tasks behind it stalled. Hyper-Q solves this: Kepler provides 32 work queues between the GPU and CPU.
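As a minimal sketch of how an application exposes independent work to Hyper-Q's multiple queues, the following launches one kernel per CUDA stream (the stream count and `busyKernel` are illustrative):

```cpp
#include <cuda_runtime.h>

// A kernel that just keeps an SM busy for a while.
__global__ void busyKernel(float *p) {
    float v = (float)threadIdx.x;
    for (int i = 0; i < 10000; ++i) v = v * 0.999f + 0.001f;
    p[threadIdx.x] = v;
}

int main() {
    const int nStreams = 8;              // illustrative; Kepler offers up to 32 queues
    cudaStream_t streams[nStreams];
    float *buf[nStreams];
    for (int i = 0; i < nStreams; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&buf[i], 32 * sizeof(float));
        busyKernel<<<1, 32, 0, streams[i]>>>(buf[i]);  // one launch per stream
    }
    cudaDeviceSynchronize();             // wait for all streams
    for (int i = 0; i < nStreams; ++i) {
        cudaStreamDestroy(streams[i]);
        cudaFree(buf[i]);
    }
    return 0;
}
```

On Fermi these launches would funnel through one hardware queue; on Kepler, Hyper-Q lets them be accepted and executed concurrently.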
Comparison of Main Parameters of Different Architectures