CUDA Basic Concepts


CUDA Computational Model

A CUDA computation has two parts: the serial part executes on the host, namely the CPU, while the parallel part executes on the device, namely the GPU.

Compared to traditional C, CUDA adds some extensions, including libraries and keywords.

CUDA code is submitted to the NVCC compiler, which separates it into host code and device code.

The host code is plain C, passed on to GCC, ICC, or another host compiler for processing;

The device code section is handled by a just-in-time (JIT) compiler, which finishes compilation before the code is run. Device code is first compiled into PTX, a bytecode format analogous to Java bytecode, which is then translated into the ISA that runs on the GPU (the coprocessor).
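The split described above happens within a single .cu file. A minimal sketch (the file name, kernel name, and compile command are illustrative assumptions, not from the original text):

```cuda
// Compile with: nvcc hello.cu -o hello   (file name illustrative)

// Device code: NVCC compiles this function to PTX; the driver then
// JIT-translates the PTX into the ISA of the GPU the program runs on.
__global__ void empty_kernel(void) { }

// Host code: plain C, handed to the host compiler (GCC, ICC, ...).
int main(void) {
    empty_kernel<<<1, 1>>>();   // launch the kernel on the device
    cudaDeviceSynchronize();    // wait for the device to finish
    return 0;
}
```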

Parallel thread pattern on device

The parallel thread array has a three-level structure: grid, block, thread.

Each grid contains some number of blocks, and each block contains some number of threads.

Here we need to mention the concept of SPMD: single program, multiple data, meaning the same program processes different pieces of data. The threads that execute on the device side are of this type: all threads in a grid execute the same program (they share the same instruction stream). But each thread must fetch its own data from shared storage, which requires an indexing mechanism. CUDA's index formula is as follows:

i = blockIdx.x * blockDim.x + threadIdx.x

blockIdx identifies the block, blockDim is the size of the block in that dimension, and threadIdx identifies the thread within the block.

Note the .x suffix: CUDA's thread arrays can be multidimensional, and blockIdx and threadIdx can have up to 3 dimensions. This is very convenient for processing images and spatial data.
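As a minimal sketch (the kernel name, array, and launch sizes are illustrative), the index formula is typically used inside a kernel like this:

```cuda
// Each thread computes its own global index from its position in the
// block and the block's position in the grid, then touches one element.
__global__ void add_one(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // guard: the grid may contain more threads than elements
        data[i] += 1.0f;
}

// Host-side launch (illustrative sizes): 4 blocks of 256 threads each
// cover n = 1000 elements; the i < n guard skips the excess threads.
//     add_one<<<4, 256>>>(dev_data, 1000);

// For images, a 2D grid works the same way per dimension:
//     int x = blockIdx.x * blockDim.x + threadIdx.x;
//     int y = blockIdx.y * blockDim.y + threadIdx.y;
```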

The memory model on the device

The memory model on the device is as follows:

Each thread has its own registers and local memory space.

The threads within the same block share one copy of shared memory.

In addition, all threads in a grid (including threads in different blocks) share the global memory, constant memory, and texture memory.

Different grids have their own global memory, constant memory, and texture memory.

Each grid thus has a common device-side storage, and each thread has its own registers. The host code is responsible for allocating this device memory for the grid, as well as for transferring data between host and device. The device code interacts only with device memory and its local registers.
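A minimal sketch of this division of labor, with illustrative names and sizes: the host allocates device memory and copies data in both directions, while kernels operate only on the device-side buffer.

```cuda
#include <stdio.h>

int main(void) {
    const int n = 256;
    float host_buf[256];
    for (int i = 0; i < n; ++i) host_buf[i] = (float)i;

    // Host code allocates storage on the device...
    float *dev_buf;
    cudaMalloc(&dev_buf, n * sizeof(float));

    // ...and explicitly moves data between host and device.
    cudaMemcpy(dev_buf, host_buf, n * sizeof(float), cudaMemcpyHostToDevice);
    // (kernel launches would operate on dev_buf here)
    cudaMemcpy(host_buf, dev_buf, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dev_buf);
    return 0;
}
```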

Function identifiers

CUDA's functions are divided into three types: __global__, __device__, and __host__.

Note that each qualifier uses double underscores. A __global__ function is the entry point for device computation: it is called from host C code and executes on the device. A __device__ function executes on the device and can only be called from device code.

A __host__ function is a traditional C function and is also the default function type. The reason this qualifier exists is that __device__ and __host__ can sometimes be combined on one function, letting the compiler know that two versions of the function need to be compiled: one for the host and one for the device.
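A minimal sketch showing all the qualifiers together (function names are illustrative):

```cuda
// __device__: runs on the device, callable only from device code.
__device__ float square(float x) { return x * x; }

// __host__ __device__: the compiler emits two versions of this
// function from one definition, one for the CPU and one for the GPU.
__host__ __device__ float twice(float x) { return 2.0f * x; }

// __global__: the entry point, called from host code, runs on the device.
__global__ void kernel(float *out) {
    out[threadIdx.x] = square(twice((float)threadIdx.x));
}

// __host__ (the default): an ordinary C function running on the CPU.
int main(void) {
    float *dev_out;
    cudaMalloc(&dev_out, 32 * sizeof(float));
    kernel<<<1, 32>>>(dev_out);
    cudaDeviceSynchronize();
    cudaFree(dev_out);
    return 0;
}
```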
