CUDA learning notes

Source: Internet
Author: User
Tags: nvcc

After studying CUDA on and off for a week, I caught a cold (such is the charm of CUDA!). Time to review and take some notes.


CPU code: data preparation and device initialization before a kernel starts, plus the serial operations between kernels. Ideally, the serial CPU code does nothing more than wind up the previous kernel and launch the next one.


CUDA parallel computing function (kernel): the parallel execution step of the whole CUDA program.


A kernel has two levels of parallelism: blocks within the grid run in parallel, and threads within each block run in parallel.
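A minimal sketch of this two-level structure (the kernel name and sizes here are my own illustration, not from the original note): each thread combines its block index and thread index to find the one element it works on.

```cuda
#include <cuda_runtime.h>

// Hypothetical example kernel: every thread adds 1 to one array element.
__global__ void addOne(float *data, int n)
{
    // Two-level index: which block in the grid, which thread in the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard against a partial last block
        data[i] += 1.0f;
}

int main()
{
    const int n = 1024;
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    // Grid of 4 blocks, 256 threads per block: 4 * 256 = 1024 threads total.
    addOne<<<4, 256>>>(d, n);
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}
```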


Kernel: organized as a grid, whose unit is the block.


Grid: a collection of parallel blocks;

Blocks cannot communicate with each other and have no guaranteed execution order;

Currently, one kernel has only one grid. In the future, DX11-generation hardware is expected to adopt an MIMD architecture, allowing multiple different grids in one kernel.


Block: threads of the same block need to share data, so a block must be issued to a single SM. (At any given time, one SM can host multiple active blocks.)

Each thread in a block is mapped to an SP;

When the number of blocks is several times the number of processing cores, the GPU's computing capability can be fully utilized; if the number is too small, the speed advantage over a conventional CPU implementation does not show.


Thread: has its own private registers and local memory;

Threads in the same block can communicate with each other through shared memory and the synchronization mechanism.
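As a sketch of that cooperation (my own illustration; it assumes the block size is exactly 256 and a power of two), a block-local sum where every thread writes to shared memory and `__syncthreads()` keeps the steps in lockstep:

```cuda
// Hypothetical example: each block reduces 256 input elements to one sum.
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float buf[256];          // visible to all threads of this block
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                    // wait until every thread has written

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();                // all threads must reach this barrier
    }
    if (tid == 0)
        out[blockIdx.x] = buf[0];       // one partial sum per block
}
```

Note that `__syncthreads()` only synchronizes within a block; there is no equivalent barrier across blocks, which is exactly why blocks cannot communicate.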


Actual scheduling unit: the warp (thread bundle), whose size is determined by the hardware; on the Tesla architecture a warp is 32 threads. Warps are carved out by thread ID within a block, e.g., threads 0–31 form one warp.

With a warp size of 32, each time a warp instruction is issued, the 8 SPs in the SM execute it over four cycles.
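The warp/lane arithmetic can be written out directly (an illustrative kernel of my own; `warpSize` is the built-in CUDA constant, 32 on the hardware described above):

```cuda
// Hypothetical example: record each thread's warp number and lane.
__global__ void warpInfo(int *warpIds, int *lanes)
{
    int tid  = threadIdx.x;
    int warp = tid / warpSize;   // threads 0-31 -> warp 0, 32-63 -> warp 1, ...
    int lane = tid % warpSize;   // position of the thread within its warp
    warpIds[tid] = warp;
    lanes[tid]   = lane;
}
```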



Keys to CUDA programming:

Avoid branches within a warp as much as possible: if a warp diverges, the SM must issue the instructions of every branch path to each SP, and each SP then decides whether to execute them. The execution time becomes the sum of all branch paths.
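The contrast can be sketched with two toy kernels (my own illustration, not from the original note). The first splits even/odd threads, so every warp takes both paths; the second branches at warp granularity, so each warp takes exactly one path:

```cuda
// Divergent: even and odd lanes of the SAME warp take different paths,
// so the warp executes both branches serially.
__global__ void divergent(float *x)
{
    if (threadIdx.x % 2 == 0)
        x[threadIdx.x] *= 2.0f;
    else
        x[threadIdx.x] += 1.0f;
}

// Non-divergent: the condition is uniform across each warp (granularity
// warpSize), so every warp takes exactly one of the two paths.
__global__ void uniform(float *x)
{
    if ((threadIdx.x / warpSize) % 2 == 0)
        x[threadIdx.x] *= 2.0f;
    else
        x[threadIdx.x] += 1.0f;
}
```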

Optimize memory access: ideally, memory transfers are fully overlapped with computation, so all GPU cores keep computing while data is in flight. This requires a sensible partitioning of the program.
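One common way to achieve that overlap is CUDA streams: split the data into chunks so that one chunk's copy overlaps another chunk's computation. A sketch under my own assumptions (the kernel `work` is hypothetical, `hIn`/`hOut` must be pinned host memory from `cudaMallocHost`, and `n` is assumed even):

```cuda
#include <cuda_runtime.h>

// Hypothetical worker kernel: double each element.
__global__ void work(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Process two halves in separate streams so the copies of one half
// overlap with the kernel of the other.
void processOverlapped(float *hIn, float *hOut, float *dIn, float *dOut, int n)
{
    cudaStream_t s[2];
    int half = n / 2;
    size_t bytes = half * sizeof(float);

    for (int c = 0; c < 2; ++c) {
        cudaStreamCreate(&s[c]);
        cudaMemcpyAsync(dIn + c * half, hIn + c * half, bytes,
                        cudaMemcpyHostToDevice, s[c]);
        work<<<(half + 255) / 256, 256, 0, s[c]>>>(dIn + c * half,
                                                   dOut + c * half, half);
        cudaMemcpyAsync(hOut + c * half, dOut + c * half, bytes,
                        cudaMemcpyDeviceToHost, s[c]);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```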



CUDA software system:

CUDA C: a programming method for writing device code in the C language, consisting of some extensions to C and a runtime library.

nvcc compiler: separates the host code and device code in the source file;

The host code is output as a C file and can be compiled by high-performance compilers such as ICC or GCC; alternatively, at the final stage of compilation, nvcc can hand it to another compiler to produce .obj or .o files.

The device code is compiled by nvcc into PTX code or binary code.

PTX code: similar to assembly, it is an instruction sequence designed as input for the dynamic (JIT) compiler.

The JIT compiler can run the same PTX on graphics cards with different machine languages, ensuring compatibility.

JIT output is affected by hardware and other factors, so it carries some uncertainty.

Independent software vendors who need deterministic code can compile it into CUDA binary code (cubin) to avoid the uncertainty of the JIT process.


CUDA runtime API: a wrapper around the driver API that makes programming easier.

It is shipped in the cudart library;

Its functions are prefixed with cuda;

CUDA driver API: a handle-based low-level interface.

It is shipped in the nvcuda library;

Its functions are prefixed with cu;
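A small host-side sketch showing the runtime-API naming convention in practice (the buffer size is arbitrary; all calls here are standard runtime-API functions):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const int n = 256;
    float *dBuf = nullptr;

    // Runtime-API calls, all prefixed with "cuda". The driver-API
    // counterparts (cuInit, cuMemAlloc, ...) are prefixed with "cu".
    cudaError_t err = cudaMalloc(&dBuf, n * sizeof(float));
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemset(dBuf, 0, n * sizeof(float));
    cudaFree(dBuf);
    return 0;
}
```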

