CUDA Programming Basics

CUDA Programming Model
    1. The CUDA programming model uses the CPU as the host and the GPU as a co-processor (device). The CPU is responsible for logic-oriented transaction processing and serial computation, while the GPU focuses on highly threaded parallel processing tasks. The CPU and GPU each have their own memory address space.
    2. Once the parallel portions of a program have been identified, that part of the computation can be handed over to the GPU.
    3. Kernel: a C function that runs on the GPU is called a kernel. A kernel function is not a complete program but one parallel-executable step of the whole CUDA program. When it is called, it is executed N times by N different CUDA threads.
    4. A complete CUDA program consists of a series of parallel kernel steps on the device side and serial processing steps on the host side.
    5. A kernel function has two levels of parallelism: the blocks within the grid execute in parallel, and the threads within each block execute in parallel. A minimal sketch of this structure is given after this list.
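A minimal sketch of this model, assuming an illustrative kernel name (scale), array size (N), and launch configuration that are not from the original text: the host runs the serial steps, launches a grid of blocks each containing many threads, and the kernel body is executed once per thread.

    // Sketch of the two-level model: a grid of blocks, each block a group of threads.
    // Kernel name, sizes, and data are illustrative assumptions.
    #include <cuda_runtime.h>

    __global__ void scale(float* data, float factor, int n) {
        // Each thread derives one element index from its block ID and thread ID.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int N = 1024;
        float* d_data;
        cudaMalloc(&d_data, N * sizeof(float));    // serial host step
        scale<<<N / 256, 256>>>(d_data, 2.0f, N);  // parallel device step: 4 blocks x 256 threads
        cudaDeviceSynchronize();                   // serial host step: wait for the kernel
        cudaFree(d_data);
        return 0;
    }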

 

Hardware Mapping and Computing Units
    1. Computing cores: the GPU contains multiple streaming multiprocessors (SMs), and the SM is the computing core. Each SM contains eight scalar stream processors (SPs) plus a small number of other computing units. An SP is only an execution unit, not a complete processing core; a complete processing core must include instruction fetch, decode, and dispatch logic as well as execution units. The eight SPs belonging to the same SM share one set of fetch, decode, and dispatch units, and also share a block of shared memory.
    2. A CUDA kernel is executed in units of blocks. Threads in the same block need to share data, so they must be dispatched to the same SM, and each thread in the block is issued to an SP for execution.
    3. A thread block is assigned to exactly one SM, but one SM can hold several active thread blocks awaiting execution at the same time; that is, an SM can keep the contexts of multiple blocks simultaneously. Placing multiple thread blocks on one SM hides latency and makes better use of the execution units: when one block performs a high-latency operation such as synchronization or an access to device memory, another block can be switched in and occupy the GPU's execution resources.
    4. Factors that limit the number of active thread blocks in an SM include: the number of active thread blocks in an SM cannot exceed 8; the total number of warps across all active thread blocks cannot exceed 24 on compute capability 1.0/1.1 devices; and the total registers and shared memory used by all active thread blocks cannot exceed the SM's resource limits. These limits can be queried at run time, as sketched after this list.
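These limits vary by device; a minimal sketch of querying them at run time with cudaGetDeviceProperties, assuming device 0 (the fields printed are standard cudaDeviceProp members):

    // Query the per-device limits discussed above (sketch; device 0 assumed).
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("Device: %s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
        printf("SM count:                %d\n", prop.multiProcessorCount);
        printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
        printf("Warp size:               %d\n", prop.warpSize);
        printf("Registers per block:     %d\n", prop.regsPerBlock);
        printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        return 0;
    }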

 

Thread Hierarchy
  1. CUDA threads are organized into a grid. Each grid consists of several thread blocks, and each thread block is composed of several threads.
  2. threadIdx: CUDA provides the built-in dim3 variables threadIdx and blockIdx. threadIdx is a vector with three components, so threads can be identified by a one-, two-, or three-dimensional thread index, forming a one-, two-, or three-dimensional thread block. The relationship between a thread index and its thread ID is straightforward:
    1. For a one-dimensional block, the thread ID is threadIdx.x;
    2. For a two-dimensional block of size (Dx, Dy), the thread ID is (threadIdx.x + threadIdx.y * Dx);
    3. For a three-dimensional block of size (Dx, Dy, Dz), the thread ID is (threadIdx.x + threadIdx.y * Dx + threadIdx.z * Dx * Dy).
  3. The number of threads in a block cannot exceed 512 (on the compute capability 1.x devices discussed here).
  4. Threads in the same block can communicate with each other. In CUDA, intra-block communication works as follows: threads in the same block exchange data through shared memory, and barrier synchronization ensures that the data is shared correctly between threads. Specifically, you call the __syncthreads() function at the point in the kernel where synchronization is required; a sketch is given after this list.
  5. Threads in a block do not all execute instructions in lockstep. For example, some threads in a block may already have executed the 20th instruction while other threads have only reached the 8th. If the 21st statement reads data shared through shared memory, the data from the threads that have only reached the 8th statement may not have been updated yet, and handing it to other threads would produce a wrong result. Calling __syncthreads() for barrier synchronization ensures that the program continues only after every thread in the block has reached the 21st instruction.
  6. The number of threads in each thread block, the size of the shared memory, and the number of registers are all limited by the hardware resources of the processing core. The reasons are:
    1. In the GPU, the shared memory must be physically close to the execution units and reside in the same processing core, so that shared-memory latency is minimized and the threads in a thread block can cooperate effectively.
    2. To implement the __syncthreads() function at very low cost in hardware, all threads of a block must be handled by the same processing core.
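A minimal sketch of intra-block cooperation through shared memory and __syncthreads(), as referenced in item 4 above; the kernel name (blockReverse) and block size are illustrative assumptions, not from the original text:

    // Each thread writes one element into shared memory, the block synchronizes,
    // and only then does any thread read an element written by another thread.
    #define BLOCK_SIZE 256

    __global__ void blockReverse(float* data) {
        __shared__ float tile[BLOCK_SIZE];   // visible to every thread of this block
        int t = threadIdx.x;                 // one-dimensional thread index

        tile[t] = data[t];                   // write phase
        __syncthreads();                     // barrier: all writes are now visible

        data[t] = tile[BLOCK_SIZE - 1 - t];  // read phase: safe only after the barrier
    }

Launched as blockReverse<<<1, BLOCK_SIZE>>>(d_data), this reverses a 256-element array within a single block.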

 

 

Definition and Call of Kernel Functions
    1. A kernel function must be defined with the __global__ function type qualifier and can only be called from host-side code. When it is called, the execution configuration of the kernel must be declared. For example:
    2. // Kernel definition
       __global__ void vecAdd(float* A, float* B, float* C) {
           int i = threadIdx.x;
           C[i] = A[i] + B[i];
       }
       int main() {
           vecAdd<<<1, N>>>(A, B, C);  // kernel call: one block of N threads
       }
    3. You must allocate enough device memory for the arrays and variables used by the kernel before calling it; otherwise an error occurs during GPU computation. A fuller host-side sketch is given after this list.
    4. The threads running on the device side execute in parallel, and each thread executes the kernel function once, following its instructions in order. Each thread has its own block ID and thread ID to distinguish it from other threads. The block ID and thread ID can only be accessed through built-in variables inside the kernel. These built-in variables are provided by dedicated registers on the device; they are read-only and can only be referenced inside the kernel function on the GPU side.
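A fuller host-side sketch of the vecAdd example above, showing the device allocations and copies required by item 3 (the buffer names and the size N are illustrative assumptions):

    #include <cstdlib>
    #include <cuda_runtime.h>

    #define N 256

    __global__ void vecAdd(float* A, float* B, float* C) {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

    int main() {
        size_t bytes = N * sizeof(float);
        float* h_A = (float*)malloc(bytes);
        float* h_B = (float*)malloc(bytes);
        float* h_C = (float*)malloc(bytes);
        for (int i = 0; i < N; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

        // Allocate device memory before the kernel is called (item 3 above).
        float *d_A, *d_B, *d_C;
        cudaMalloc(&d_A, bytes);
        cudaMalloc(&d_B, bytes);
        cudaMalloc(&d_C, bytes);
        cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

        vecAdd<<<1, N>>>(d_A, d_B, d_C);   // one block of N threads
        cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
        free(h_A); free(h_B); free(h_C);
        return 0;
    }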
