Introduction to CUDA C Programming: Programming Model


This section describes the main concepts of the CUDA programming model.

2.1. Kernels

CUDA C extends the C language by allowing programmers to define C functions, called kernels, that are executed N times in parallel by N different CUDA threads.

A kernel is declared with the __global__ specifier, and the number of CUDA threads that execute it is given in the <<<...>>> execution configuration when the kernel is called. Each thread that executes the kernel has a unique thread ID, which can be accessed inside the kernel through the built-in threadIdx variable.

For example, the following kernel adds two vectors A and B of length N and stores the result in C.

__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

int main()
{
    ...
    // Launch a kernel on the GPU with one thread for each element.
    addKernel<<<1, N>>>(c, a, b);
    ...
}

 

Each of the N threads executes addKernel on one element of the arrays.

2.2. Thread Hierarchy

threadIdx is a 3-component vector, so threads can be identified using a one-dimensional, two-dimensional, or three-dimensional thread index, forming a one-dimensional, two-dimensional, or three-dimensional thread block.

The relationship between a thread's index and its thread ID is as follows: for a one-dimensional block, they are the same; for a two-dimensional block of size (Dx, Dy), the thread ID of a thread with index (x, y) is x + y * Dx; for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread with index (x, y, z) is x + y * Dx + z * Dx * Dy.
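As a small illustrative helper (not part of the original text), a kernel can compute this linear thread ID from the built-in variables; blockDim, introduced further below, holds the block dimensions (Dx, Dy, Dz):

// Illustrative sketch: linear thread ID of the calling thread within its
// block, following the formula x + y * Dx + z * Dx * Dy.
__device__ unsigned int linearThreadId()
{
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}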

For example, the following kernel adds two N x N matrices A and B and stores the result in C.

  

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation with one block of N * N * 1 threads
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}

Because all threads of a block are expected to reside on the same processor core and share its limited memory resources, the number of threads per block is limited. On current GPUs, a thread block may contain up to 1024 threads.

However, a kernel can be executed by multiple equally-sized thread blocks, so the total number of threads equals the number of thread blocks multiplied by the number of threads per block.
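As a sketch of what such a launch can look like for the one-dimensional addKernel example (assumed code, using the blockIdx and blockDim variables introduced below; the block size of 256 is arbitrary):

// Assumed 1D variant of addKernel that spans multiple thread blocks.
__global__ void addKernelBlocks(int *c, const int *a, const int *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                // guard for the last, partially filled block
        c[i] = a[i] + b[i];
}

// Launch with enough blocks so that numBlocks * threadsPerBlock >= n;
// the total thread count is numBlocks * threadsPerBlock.
// int threadsPerBlock = 256;
// int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
// addKernelBlocks<<<numBlocks, threadsPerBlock>>>(d_c, d_a, d_b, n);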

Thread blocks in turn are organized into a one-dimensional, two-dimensional, or three-dimensional grid, as shown in Figure 6.

Figure 6. Grid of thread blocks

When calling a kernel, the <<<...>>> execution configuration specifies the number of threads per block and the number of blocks per grid; both can be of type int or dim3. Inside the kernel, the built-in variable blockIdx gives the index of the block within the grid, and blockDim gives the dimensions of the block.

Example: extend the previous matrix addition example to multiple thread blocks.

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}

The thread block size here is 16 x 16, for a total of 256 threads per block. Threads within the same block can cooperate by sharing data through shared memory, and the __syncthreads() function synchronizes their accesses to that shared memory by acting as a barrier at which all threads of the block must wait.
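As a minimal sketch of this idea (an assumed example, not taken from the text above): each thread of a block stages one element in shared memory, the block synchronizes with __syncthreads(), and each thread then reads a value written by a different thread.

// Assumed illustration of shared memory and __syncthreads(): reverse the
// elements of one block-sized tile. BLOCK_SIZE is a hypothetical constant
// and must match the number of threads per block at launch.
#define BLOCK_SIZE 256

__global__ void reverseTile(float *out, const float *in)
{
    __shared__ float tile[BLOCK_SIZE];   // visible to all threads of the block

    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[t] = in[base + t];              // each thread writes one element
    __syncthreads();                     // wait until the whole tile is loaded

    out[base + t] = tile[blockDim.x - 1 - t];  // read an element written by another thread
}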

2.3. Memory Hierarchy

During execution, CUDA threads may access data from multiple memory spaces, as shown in Figure 7. Each thread has its own private local memory. Each thread block has shared memory that is visible to all threads of the block. All threads also have access to the same global memory.

In addition, there are two read-only memory spaces accessible by all threads: the constant and texture memory spaces. The global, constant, and texture memory spaces are optimized for different memory usages, and texture memory offers several addressing modes.
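The snippet below (an illustrative sketch, not part of the original text) shows how these spaces typically appear in CUDA C source: __constant__ and __device__ declare constant-memory and global-memory variables, __shared__ declares per-block shared memory, and ordinary automatic variables live in per-thread registers or local memory.

// Illustrative declarations only (assumed example, assuming blocks of at
// most 128 threads).
__constant__ float coeff[16];    // constant memory: read-only from kernels
__device__   float scale;        // global memory: visible to all threads

__global__ void memorySpaces(float *out)
{
    __shared__ float buffer[128];  // shared memory: one copy per thread block
    float tmp;                     // automatic variable: per-thread registers/local memory

    int t = threadIdx.x;
    buffer[t] = coeff[t % 16] * scale;
    __syncthreads();
    tmp = buffer[t];
    out[blockIdx.x * blockDim.x + t] = tmp;  // result goes back to global memory
}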

 

Figure 7. Memory hierarchy

2.4. Heterogeneous Programming

As shown in Figure 8, the CUDA programming model assumes that CUDA threads execute on a device that is physically separate from the host running the C program: kernels execute on the GPU, while the rest of the program executes on the CPU. The model also assumes that the host and the device each maintain their own memory space in DRAM. Therefore, a program manages the global, constant, and texture memory spaces visible to kernels through calls to the CUDA runtime (described in the programming interface chapter); this includes device memory allocation and deallocation, as well as copying data between host and device memory.
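A minimal host-side sketch of that workflow, assuming the addKernel from section 2.1 (error checking omitted; the size and the variable names are illustrative):

// Assumed host-side sketch: allocate device memory, copy inputs to the
// device, launch the kernel, copy the result back, and free device memory.
const int n = 1024;                     // illustrative size (one block of n threads)
size_t bytes = n * sizeof(int);
int h_a[n], h_b[n], h_c[n];             // host arrays (assumed to be filled elsewhere)
int *d_a, *d_b, *d_c;                   // device (global memory) pointers

cudaMalloc((void **)&d_a, bytes);
cudaMalloc((void **)&d_b, bytes);
cudaMalloc((void **)&d_c, bytes);

cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);   // host -> device
cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

addKernel<<<1, n>>>(d_c, d_a, d_b);                    // parallel part runs on the GPU

cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);   // device -> host

cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);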

Serial code executes on the host, while parallel kernels execute on the GPU.

Figure 8. Heterogeneous programming

2.5. Compute Capability

The compute capability of a device is defined by a major revision number and a minor revision number.

Devices with the same major revision number share the same core architecture: the major revision number is 5 for the Maxwell architecture, 3 for Kepler, 2 for Fermi, and 1 for Tesla.

The minor revision number corresponds to incremental improvements to the core architecture, possibly including new features.
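For reference, the compute capability of an installed device can be queried at run time through the CUDA runtime (a small sketch; querying device 0 is an assumption):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0 (assumed)
    // prop.major and prop.minor hold the major and minor revision numbers.
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}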

A listing gives the compute capability of all CUDA-enabled GPUs, and the compute capabilities section provides the detailed technical specifications of each compute capability.

 
