CUDA (5): GPU Architecture

GPU Architecture

The SM (Streaming Multiprocessor) is a key building block of the GPU architecture: the concurrency that GPU hardware can deliver is determined by its SMs.

Taking the Fermi architecture as an example, each SM includes the following main components:

  • CUDA cores
  • Shared Memory / L1 Cache
  • Register File
  • Load/Store Units
  • Special Function Units
  • Warp Scheduler

Each SM in the GPU is designed to support hundreds of threads executing concurrently, and each GPU contains many SMs, so a GPU can execute thousands of threads in parallel. When a kernel is launched, its thread blocks are distributed among these SMs for execution. Different blocks may be assigned to different SMs, but all threads of a given block execute on one and the same SM.
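
To make this mapping concrete, here is a minimal sketch (the kernel name whoAmI and the 8×128 launch configuration are purely illustrative): every thread of one block runs on whichever SM that block was assigned to.

```cuda
#include <cstdio>

// Each thread reports which block it belongs to. All threads that
// print the same block index ran on the same SM.
__global__ void whoAmI(void)
{
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void)
{
    // 8 blocks of 128 threads: the hardware scheduler assigns the
    // 8 blocks to available SMs in an unspecified order.
    whoAmI<<<8, 128>>>();
    cudaDeviceSynchronize();   // wait for the kernel to finish
    return 0;
}
```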

CUDA manages and executes threads using the Single Instruction Multiple Thread (SIMT) architecture. Threads are grouped into units of 32 called warps. All threads in a warp execute the same instruction in parallel, but each thread has its own instruction address counter and status registers and executes that instruction on its own data.

SIMT is similar to SIMD (Single Instruction, Multiple Data): both broadcast the same instruction to multiple execution units for parallel execution. A major difference is that SIMD requires all elements of a vector to execute together in lockstep, while SIMT allows the threads in a warp to execute independently of one another. SIMT has three features that SIMD does not (a sketch of the resulting warp divergence follows the list below):

  • Each thread has its own instruction address counter.
  • Each thread has its own status registers.
  • Each thread can follow its own independent execution path.
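
A minimal sketch of that independence (the kernel is illustrative): when threads of one warp take different branches, the warp diverges and the hardware executes each path in turn, masking off the threads that did not take the current branch.

```cuda
// Threads in the same warp take different branches here; the warp
// runs both branches one after the other (warp divergence).
__global__ void divergent(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid % 2 == 0) {
        out[tid] = 100.0f;   // even-numbered threads of the warp
    } else {
        out[tid] = 200.0f;   // odd-numbered threads of the warp
    }
}
```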

A block is scheduled onto exactly one SM. Once a block has been assigned to an SM, it remains on that SM until its execution ends, and one SM can hold multiple blocks at the same time. The software and hardware terms correspond as follows:

  • Thread — executes on a CUDA core
  • Thread block — executes on an SM
  • Grid — executes on the device (the whole GPU)

Note that the threads are only logically parallel; not all of them can execute physically at the same time. As a result, threads in the same block may progress at different rates.

Sharing data between parallel threads can lead to race conditions: when multiple threads access the same data in an unordered way, the behavior is undefined. CUDA provides an API to synchronize the threads within a block, ensuring that they all reach a given point of execution before moving on to the next step. However, no primitive is provided for synchronization across different blocks.
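
CUDA's barrier for threads within a block is __syncthreads(). A minimal sketch (the kernel is illustrative and assumes a single block of exactly 128 threads): the barrier keeps reads from racing ahead of the writes of other threads in the block.

```cuda
// Reverse the elements handled by one block via shared memory.
__global__ void reverseInBlock(int *d_out, const int *d_in)
{
    __shared__ int tile[128];       // visible to all threads in the block
    int t = threadIdx.x;

    tile[t] = d_in[t];              // each thread writes one element
    __syncthreads();                // barrier: no thread reads below until
                                    // every thread has finished its write
    d_out[t] = tile[blockDim.x - 1 - t];
}
```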

Warps on an SM can be executed in any order, and the number of active warps is limited by SM resources. When a warp stalls, the SM can schedule another warp that is ready to run. Switching between concurrent warps costs nothing, because hardware resources are already partitioned among all the threads and blocks resident on the SM, so the state of a newly scheduled warp is already stored there.

The SM can be seen as the heart of the GPU. Registers and shared memory are scarce SM resources, and CUDA partitions them among all the threads resident on the SM. These limited resources therefore place strict limits on the number of active warps per SM, which in turn limits parallelism. Some knowledge of the hardware thus helps when tuning CUDA performance.
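
How much of those scarce resources a given kernel consumes can be queried after compilation. A minimal sketch using the runtime's cudaFuncGetAttributes (the kernel itself is just a placeholder):

```cuda
#include <cstdio>

__global__ void kernel(float *p) { p[threadIdx.x] *= 2.0f; }

int main(void)
{
    // Query the compiled kernel's per-thread register count and static
    // shared-memory usage -- the two SM resources that bound how many
    // warps can be resident at once.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, kernel);

    printf("registers per thread  : %d\n", attr.numRegs);
    printf("static shared memory  : %zu bytes\n", attr.sharedSizeBytes);
    printf("max threads per block : %d\n", attr.maxThreadsPerBlock);
    return 0;
}
```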

Fermi Architecture

Fermi was the first complete GPU computing architecture. Its main features:

  • 512 accelerator cores (each containing an ALU and an FPU)
  • 16 SMs, each containing 32 CUDA cores
  • A 384-bit GDDR5 DRAM interface built from six 64-bit memory partitions, supporting up to 6 GB of on-board global memory
  • A GigaThread engine (left side of the figure) that distributes thread blocks to the SM schedulers
  • 768 KB of L2 cache
  • 16 load/store units per SM, so source and destination addresses can be calculated for 16 threads (a half-warp) per clock cycle
  • Special function units (SFUs) that execute transcendental instructions such as sine and cosine
  • Two warp schedulers and two instruction dispatch units per SM; when a block is assigned to an SM, its threads are partitioned into warps
  • On Fermi (compute capability 2.x), each SM can have up to 48 warps, 1536 threads, resident at the same time

[Figure: Fermi architecture]

Each SM consists of the following parts (a run-time query of some of these resources is sketched after the list):

  • CUDA cores
  • Warp schedulers and instruction dispatch units
  • Shared memory, register file, and L1 cache
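
These per-SM resources can be inspected at run time. A minimal sketch using the runtime's cudaGetDeviceProperties (device 0 and the printed fields are just a selection):

```cuda
#include <cstdio>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);      // properties of device 0

    printf("SM count              : %d\n", prop.multiProcessorCount);
    printf("warp size             : %d\n", prop.warpSize);
    printf("registers per block   : %d\n", prop.regsPerBlock);
    printf("shared mem per block  : %zu bytes\n", prop.sharedMemPerBlock);
    printf("compute capability    : %d.%d\n", prop.major, prop.minor);
    return 0;
}
```
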
Kepler Architecture

Kepler is faster and more efficient than Fermi, with better performance. Its main features:

  • 15 SMs
  • Six 64-bit memory controllers
  • Per SM: 192 single-precision CUDA cores, 64 double-precision units, 32 SFUs, and 32 load/store units (LD/ST)
  • Register file enlarged to 64 K registers per SM
  • Each Kepler SM contains four warp schedulers and eight instruction dispatch units, so each SM can execute four warps at the same time
  • On Kepler K20X (compute capability 3.5), each SM can have 64 warps, 2048 threads, scheduled at the same time

Dynamic Parallelism

Dynamic parallelism is a new feature of Kepler. It allows the GPU to launch a new grid dynamically: any kernel can launch other kernels. This directly enables kernel recursion and data dependencies between kernels; for example, the scattering of light in D3D can be implemented this way.
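
A minimal sketch of a device-side launch (the kernel names are illustrative; this requires compute capability 3.5+ and relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true):

```cuda
#include <cstdio>

__global__ void childKernel(void)
{
    printf("child: block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

// The parent launches a new grid directly from the device, with no
// round trip through the CPU.
__global__ void parentKernel(void)
{
    if (threadIdx.x == 0) {
        childKernel<<<2, 4>>>();            // device-side launch
        cudaDeviceSynchronize();            // wait for the child grid
    }
}

int main(void)
{
    parentKernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```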

Hyper-Q

Hyper-Q is another new feature of Kepler. It adds hardware connections between the CPU and GPU, so the CPU can keep more tasks running on the GPU at the same time, increasing GPU utilization and reducing CPU idle time. Fermi relies on a single hardware work queue to pass tasks from the CPU to the GPU, so when one task blocks, all tasks behind it in the queue stall; Hyper-Q solves this problem by providing 32 work queues between the GPU and the CPU.
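
A minimal sketch of how a host program exposes independent work to those queues through CUDA streams (busyKernel and the stream count are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void busyKernel(float *p)
{
    p[threadIdx.x] += 1.0f;
}

int main(void)
{
    const int nStreams = 4;
    cudaStream_t streams[nStreams];
    float *buf[nStreams];

    for (int i = 0; i < nStreams; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&buf[i], 256 * sizeof(float));
        // Kernels launched into distinct streams are independent; with
        // Hyper-Q they can reach the GPU concurrently instead of
        // serializing behind one another in a single hardware queue.
        busyKernel<<<1, 256, 0, streams[i]>>>(buf[i]);
    }

    cudaDeviceSynchronize();
    for (int i = 0; i < nStreams; ++i) {
        cudaFree(buf[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}
```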

Comparison of the main parameters of different architectures

The per-SM figures quoted above, side by side (Fermi compute capability 2.x vs. Kepler 3.5):

  Parameter                     Fermi (2.x)    Kepler (3.5)
  CUDA cores per SM             32             192
  Warp schedulers per SM        2              4
  Max resident warps per SM     48             64
  Max resident threads per SM   1536           2048
  Register file per SM          32 K           64 K
