CUDA learning notes

Source: Internet
Author: User
Tags: nvcc

After studying CUDA on and off for a week, I caught a cold (such is the charm of CUDA!). Time for a review and some notes.

 

CPU code: data preparation and device initialization before a kernel launches, plus the serial operations between kernels. Ideally, the CPU serial code does nothing more than clean up after the previous kernel and launch the next one.
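
A minimal sketch of that host-side flow (the kernel name myKernel and the data layout are illustrative assumptions):

    #include <cuda_runtime.h>

    __global__ void myKernel(float *d_data, int n);  // assumed to be defined elsewhere

    void runOnDevice(float *h_data, int n) {
        float *d_data;
        size_t bytes = n * sizeof(float);
        cudaMalloc((void **)&d_data, bytes);                        // device-side allocation
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // data preparation
        myKernel<<<(n + 255) / 256, 256>>>(d_data, n);              // launch the kernel
        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // collect results
        cudaFree(d_data);                                           // clean up before the next kernel
    }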

 

CUDA parallel computing function (kernel): the step of the whole CUDA program that executes in parallel on the device.

 

A kernel has two levels of parallelism: the blocks within the grid run in parallel, and the threads within each block run in parallel.

 

Kernel: organized as a grid, whose basic unit is the block.
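
A minimal sketch of this two-level hierarchy (kernel name and launch sizes are illustrative):

    // Each thread derives a global index from its block and thread coordinates.
    __global__ void addOne(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // grid level + block level
        if (i < n)
            data[i] += 1.0f;
    }

    // Launch a grid of 16 blocks, each holding 128 threads:
    // addOne<<<16, 128>>>(d_data, n);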

 

Grid: a collection of parallel blocks;

Blocks cannot communicate with each other, and there is no guaranteed execution order among them;

Currently, one kernel has only one grid; in the future (the DX11 generation), an MIMD architecture is expected to allow multiple different grids within one kernel.

 

Block: threads in the same block need to share data, so a block must be issued to a single SM. (At any point in time, one SM can have multiple active blocks.)

Each thread in a block is dispatched to an SP;

The GPU's computing capability is fully utilized when the number of blocks is several times the number of processing cores; with too few blocks, the GPU cannot show its speed advantage over the traditional CPU approach.
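
As a hedged sketch, the SM count can be queried at run time and the grid sized to several times that number (the factor 4 below is an illustrative assumption, not a fixed rule):

    #include <cuda_runtime.h>

    // Pick several times as many blocks as there are SMs so every SM stays busy.
    int pickBlockCount(int device) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, device);
        return prop.multiProcessorCount * 4;  // tune the multiplier per workload
    }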

 

Thread: has its own private registers and local memory;

Threads in the same block can communicate with each other through shared memory and the synchronization mechanism.
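
A minimal sketch of intra-block communication through shared memory and a barrier (assumes a launch with 128 threads per block and a data size that is a multiple of 128):

    __global__ void shareWithinBlock(const float *in, float *out) {
        __shared__ float tile[128];                     // visible to all threads of this block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];
        __syncthreads();                                // barrier: all writes now visible block-wide
        out[i] = tile[(threadIdx.x + 1) % blockDim.x];  // read another thread's element
    }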

 

Actual execution unit: the warp (thread bundle), whose size is determined by the hardware; on the Tesla architecture a warp is 32 threads. Warps are formed from consecutive thread IDs within a block, e.g. threads 0-31 form one warp.

With a warp size of 32, each time a warp instruction is issued, the eight SPs in an SM execute it over four cycles (8 SPs × 4 cycles = 32 threads).
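
A small sketch of how a thread's warp and lane follow from its thread ID within the block (warp size 32, as on Tesla):

    __global__ void warpInfo(int *warpIds, int *laneIds) {
        int tid = threadIdx.x;
        warpIds[tid] = tid / 32;  // threads 0-31 form warp 0, threads 32-63 form warp 1, ...
        laneIds[tid] = tid % 32;  // this thread's position within its warp
    }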

 

 

Keys to CUDA programming:

Avoid branches within a warp as much as possible: if a warp diverges, the SM must issue the instructions of every branch to every SP, and each SP then decides whether to execute them; the total execution time becomes the sum of all the branch paths.
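
A hedged sketch of the problem and one branch-free rewrite (the arithmetic itself is only illustrative):

    // Divergent: threads of one warp take different paths, so the warp pays
    // for both paths and the execution time is roughly their sum.
    __global__ void divergent(float *data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i % 2 == 0)
            data[i] *= 2.0f;  // path A
        else
            data[i] += 1.0f;  // path B
    }

    // Branch-free: every thread executes the same instruction stream.
    __global__ void uniform(float *data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float even = (float)(1 - (i & 1));  // 1.0 for even i, 0.0 for odd
        data[i] = data[i] * (1.0f + even) + (1.0f - even);
    }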

Optimize memory access: ideally, memory transfers happen entirely in the background while all GPU cores keep computing. This requires a rational division of the program.
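
A sketch of overlapping transfers with computation using two streams (the kernel and buffer names are assumptions, and the host buffers would need to be page-locked, e.g. via cudaMallocHost, for the async copies to actually overlap):

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // While one stream's chunk is still copying, the other stream's kernel runs.
    cudaMemcpyAsync(d_a, h_a, half, cudaMemcpyHostToDevice, s0);
    cudaMemcpyAsync(d_b, h_b, half, cudaMemcpyHostToDevice, s1);
    kernel<<<grid, block, 0, s0>>>(d_a);
    kernel<<<grid, block, 0, s1>>>(d_b);
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);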

 

 

CUDA software system:

CUDA C: a way to write device code in the C language, consisting of a set of extensions to C plus a runtime library.

nvcc compiler: separates the host code and the device code in the source file;

The host code is output as a C file and can be compiled by a high-performance compiler such as ICC or GCC, or handed to another compiler at the final stage of compilation to produce .obj or .o files.

The device code is compiled by nvcc into PTX code or binary code.

PTX code: similar to assembly language; it is an instruction sequence designed as input for the dynamic just-in-time (JIT) compiler.

The JIT compiler can translate the same PTX for graphics cards with different machine languages, which ensures compatibility.

The JIT output is affected by the hardware and other factors, so it carries some uncertainty.

Independent software vendors who need deterministic code can compile it into CUDA binary code (cubin) to avoid the uncertainty of the JIT step.
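
As a hedged illustration, the outputs above map to nvcc invocations like these (flags from the classic CUDA toolchain; sm_10 is just an example target):

    nvcc -ptx kernel.cu -o kernel.ptx      # device code as PTX, JIT-compiled at load time
    nvcc -cubin -arch=sm_10 kernel.cu      # device code as a fixed cubin binary
    nvcc -c kernel.cu -o kernel.o          # full compile; host code goes to the host C compiler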

 

CUDA runtime API: a wrapper around the driver API that makes programming easier.

Shipped in the cudart package;

Functions are prefixed with cuda;

CUDA driver API: a handle-based, lower-level interface.

Shipped in the nvcuda package;

Functions are prefixed with cu;
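
A minimal sketch contrasting the two APIs by allocating device memory each way (error handling omitted):

    #include <cuda_runtime.h>  // runtime API: cuda* prefix, cudart
    #include <cuda.h>          // driver API:  cu* prefix, nvcuda

    void allocBothWays(void) {
        // Runtime API: one call; the context is managed implicitly.
        float *d_ptr;
        cudaMalloc((void **)&d_ptr, 1024 * sizeof(float));

        // Driver API: handle-based, with explicit initialization and context.
        CUdevice dev;
        CUcontext ctx;
        CUdeviceptr dptr;
        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);
        cuMemAlloc(&dptr, 1024 * sizeof(float));
    }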

 
