CUDA Computational Model
A CUDA program is computed in two parts: the serial part executes on the host, namely the CPU, while the parallel part executes on the device, namely the GPU.
Compared with traditional C, CUDA adds some extensions, including libraries and keywords.
CUDA code is submitted to the NVCC compiler, which separates it into host code and device code.
The host code is plain C and is handed off to GCC, ICC, or another host compiler;
the device code is compiled into PTX, an intermediate representation comparable to Java bytecode, and before the code runs a just-in-time compiler translates the PTX into the native ISA of the target GPU or co-processor.
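As a minimal sketch of this split (program and kernel names are illustrative), NVCC would send main below to the host compiler, while the kernel goes through PTX and the JIT step:

```cuda
#include <cstdio>

// Device code: compiled by NVCC to PTX, then JIT-compiled to the GPU's ISA.
__global__ void hello_kernel(void) {
    printf("hello from device thread %d\n", threadIdx.x);
}

int main(void) {
    // Host code: plain C/C++, handed off to the host compiler (e.g. GCC).
    hello_kernel<<<1, 4>>>();  // launch 1 block of 4 threads on the device
    cudaDeviceSynchronize();   // wait for the kernel to finish
    return 0;
}
```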
Parallel thread pattern on device
The parallel thread array has a three-level grid-block-thread structure:
Each grid contains a number of blocks, and each block contains a number of threads.
Here we need to mention the concept of SPMD: single program, multiple data, meaning that the same program processes different data. The threads that execute on the device side are of this type: all threads in a grid execute the same program. But each thread needs to fetch its own data from shared storage, which requires a mechanism for locating that data. CUDA's positioning formula is as follows:
i = blockIdx.x * blockDim.x + threadIdx.x
blockIdx identifies the block, blockDim is the size of the block along that dimension, and threadIdx identifies the thread inside the block.
Note the .x suffix: CUDA's thread arrays can be multidimensional, and both blockIdx and threadIdx can have up to 3 dimensions. This is a great convenience for processing images and spatial data.
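The positioning formula is typically the first line of a kernel. As a sketch, a hypothetical element-wise vector addition in which each thread uses the formula to pick out its own element:

```cuda
// SPMD: every thread runs this same code, but computes a different
// global index i and therefore processes a different element.
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // guard: the grid may hold more threads than elements
        c[i] = a[i] + b[i];
}
```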
The memory model on the device
The memory model on the device is as follows:
Each thread has its own registers and local memory.
All threads in the same block share one copy of shared memory.
In addition, all threads in a grid (including threads in different blocks) share the global memory, constant memory, and texture memory.
Different grids have their own global memory, constant memory, and texture memory.
Each grid thus has storage shared by all of its threads, and each thread has its own registers. The host code is responsible for allocating the grid's shared storage and for transferring data between host and device; the device code interacts only with that storage and its local registers.
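This division of labor can be sketched with the CUDA runtime calls cudaMalloc, cudaMemcpy, and cudaFree; the kernel scale and the helper run_on_device are hypothetical names:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *d, int n);   // assumed kernel, defined elsewhere

// Host side: allocate device storage, copy input over, launch the
// kernel, copy the result back, and release the device storage.
void run_on_device(float *host_buf, int n) {
    float *dev_buf;
    size_t bytes = n * sizeof(float);

    cudaMalloc(&dev_buf, bytes);                                   // allocate on device
    cudaMemcpy(dev_buf, host_buf, bytes, cudaMemcpyHostToDevice);  // host -> device

    scale<<<(n + 255) / 256, 256>>>(dev_buf, n);                   // device code runs here

    cudaMemcpy(host_buf, dev_buf, bytes, cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(dev_buf);
}
```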
Function Qualifiers
CUDA's functions are divided into three types: __global__, __device__, and __host__.
Note that each qualifier is written with double underscores on both sides. A __global__ function is the entry point in the C code through which computation is invoked on the device.
A __host__ function is a traditional C function and is also the default function type. The reason this qualifier exists is that __device__ and __host__ can sometimes be combined, telling the compiler that two versions of the function need to be compiled.
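The three qualifiers side by side, as a minimal sketch (function names are illustrative):

```cuda
// __global__: entry point, called from the host, runs on the device.
__global__ void kernel_entry(float *out) { out[threadIdx.x] = 0.0f; }

// __device__: runs on the device, callable only from device code.
__device__ float square(float x) { return x * x; }

// __host__ __device__ combined: the compiler emits two versions,
// one callable from the host and one callable from the device.
__host__ __device__ float twice(float x) { return 2.0f * x; }
```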
CUDA Basic Concepts