CUDA 2, CUDA


CUDA Introduction

CUDA is a parallel computing platform with a C-like programming model. It lets us implement parallel algorithms much as we would write ordinary C code, and with an NVIDIA GPU those parallel programs can run on many devices, including desktops, laptops, and tablets. Familiarity with the C language will help you pick up CUDA quickly.

CUDA Programming

CUDA programming lets your program run on a heterogeneous system, namely the CPU plus the GPU. The two have their own memory spaces and are connected by the PCI-Express bus, so we should first distinguish between two terms:

  • Host: the CPU and its memory (host memory)
  • Device: GPU and its memory (device memory)

In code, host memory variables are usually given the prefix h_, and device memory variables the prefix d_.

The kernel is the key concept in CUDA programming. It is the code that runs on the GPU, marked with the __global__ qualifier.

The host can perform most operations independently of the device. When a kernel is launched, control immediately returns to the CPU, which can go on with other work; CUDA programming is therefore asynchronous. A typical CUDA program contains serial code supplemented by parallel code: the serial code runs on the host and the parallel code runs on the device. The host code is standard C and the device code is CUDA C. All of the code can live in a single source file, or be split across multiple files and libraries. The NVIDIA CUDA compiler (nvcc) compiles both the host and device code and produces the executable.

Next, the typical processing flow of a CUDA program: allocate device memory and copy input data from the host to the device, launch the kernel so the GPU computes on that data, copy the results back from the device to the host, and finally release the device memory.

Memory operations

A CUDA program divides the system into host and device, each with its own memory, and a kernel can only operate on device memory. To manage device memory, CUDA provides memory operation functions that mirror the standard C library:

  • cudaMalloc — allocate device memory (like malloc)
  • cudaMemcpy — copy data between host and device (like memcpy)
  • cudaMemset — initialize device memory (like memset)
  • cudaFree — release device memory (like free)

To keep the learning curve gentle, the CUDA C style is very similar to that of standard C, for example:

cudaError_t cudaMalloc(void **devPtr, size_t size)

Let's take a look at cudaMemcpy. Its function prototype is:

cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind)

The possible values of cudaMemcpyKind are:

  • cudaMemcpyHostToHost
  • cudaMemcpyHostToDevice
  • cudaMemcpyDeviceToHost
  • cudaMemcpyDeviceToDevice

Their meanings are self-explanatory, so I will not explain them further.
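As a minimal sketch of how these functions fit together (the buffer size and variable names here are illustrative assumptions, not from the original post):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
  const int n = 16;                       // illustrative element count
  size_t nBytes = n * sizeof(float);

  float h_data[n];                        // host buffer with some input data
  for (int i = 0; i < n; i++) h_data[i] = (float)i;

  float *d_data = NULL;
  cudaMalloc((void **)&d_data, nBytes);   // allocate device memory

  // copy host -> device, then device -> host
  cudaMemcpy(d_data, h_data, nBytes, cudaMemcpyHostToDevice);
  cudaMemcpy(h_data, d_data, nBytes, cudaMemcpyDeviceToHost);

  cudaFree(d_data);                       // release device memory
  return 0;
}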

The return type cudaError_t reports whether the call succeeded: cudaSuccess is returned on success, and an error code otherwise (cudaMalloc, for example, returns cudaErrorMemoryAllocation when the allocation fails). You can use const char *cudaGetErrorString(cudaError_t error) to convert the code into a human-readable message.
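As a small sketch of this error-handling pattern (the deliberately oversized allocation is my own example, not from the original post):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
  float *d_buf = NULL;
  // request an absurdly large allocation so the call is likely to fail
  cudaError_t err = cudaMalloc((void **)&d_buf, (size_t)1 << 60);
  if (err != cudaSuccess) {
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  cudaFree(d_buf);
  return 0;
}

The CHECK macro in the code analysis section below wraps exactly this check around any runtime call.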

Organizing threads

Mastering how to organize threads is an important part of CUDA programming. CUDA threads are divided into two layers: Grid and Block.

 

A grid is made up of all the threads launched by a single kernel, and all threads in a grid share the same global memory. A grid consists of multiple blocks, and each block consists of multiple threads. Both grids and blocks can be one-, two-, or three-dimensional; a common configuration is a two-dimensional grid of two-dimensional blocks.

Here we will introduce several built-in CUDA variables:

  • blockIdx: the block index within the grid; blockIdx.x is the x coordinate of the block.
  • threadIdx: the thread index within the block, analogous to blockIdx.
  • blockDim: the block dimensions, i.e., the number of threads per block in each direction (blockDim.x, blockDim.y, blockDim.z).
  • gridDim: the grid dimensions, i.e., the number of blocks in each direction, analogous to blockDim.

These variables combine to give each thread a unique global index, as shown in the sketch after this list.
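For example (a sketch of my own, not from the original post), a one-dimensional kernel typically combines blockIdx, blockDim, and threadIdx to compute that global index; only the kernel is shown here, since a full launch appears in the code analysis section below:

// illustrative kernel: each thread scales one element of the array
__global__ void scaleArray(float *data, int n, float factor) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // block offset + thread offset within the block
  if (i < n) {                                    // guard threads that fall past the end of the array
    data[i] *= factor;
  }
}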

Generally, the grid is organized as 2D and the block as 3D. Both grid and block are declared with the dim3 type, for example:

dim3 block(3);
dim3 grid((nElem + block.x - 1) / block.x);  // follow-up blog posts will explain why grid is written like this

Note that dim3 is only visible to the host, and its corresponding device type is uint3.
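To make the grid calculation concrete (the value of nElem below is an illustrative assumption), the expression (nElem + block.x - 1) / block.x is a ceiling division that guarantees enough blocks to cover every element:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
  int nElem = 1000;                            // illustrative problem size
  dim3 block(256);                             // 256 threads per block
  dim3 grid((nElem + block.x - 1) / block.x);  // ceil(1000 / 256) = 4 blocks

  printf("grid.x = %u, block.x = %u, total threads = %u\n",
         grid.x, block.x, grid.x * block.x);   // 4, 256, 1024 (>= 1000)
  return 0;
}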

Launching a CUDA kernel

The calling syntax for a CUDA kernel is:

kernel_name<<<grid, block>>>(argument list);

Grid and block are the dim3 variables described above. Together they configure the total number of threads a kernel launches and how those threads are organized. For example:

kernel_name<<<4, 8>>>(argument list);

This line launches a one-dimensional grid of four blocks, where each block is one-dimensional with eight threads, so there are 4 * 8 = 32 threads in total.

Note: unlike calling a C function, every CUDA kernel launch is asynchronous. As soon as the kernel is launched, control returns to the CPU.
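As a sketch of what this asynchrony means in practice (the kernel and sizes here are illustrative assumptions, not from the original post), the host can call cudaDeviceSynchronize when it needs to wait for the device to finish:

#include <cuda_runtime.h>
#include <stdio.h>

// illustrative kernel: each of the 32 threads writes one element
__global__ void fillOnes(float *data) {
  data[threadIdx.x] = 1.0f;
}

int main(void) {
  float *d_data;
  cudaMalloc((void **)&d_data, 32 * sizeof(float));

  fillOnes<<<1, 32>>>(d_data);   // returns immediately; the GPU works in the background
  printf("Kernel launched, host continues with other work...\n");

  cudaDeviceSynchronize();       // block until all preceding device work has finished
  printf("Kernel finished.\n");

  cudaFree(d_data);
  return 0;
}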

Function type qualifiers

CUDA uses function type qualifiers to state where a function runs and from where it may be called:

  • __global__: runs on the device and is launched from the host; it is the kernel entry point and must return void.
  • __device__: runs on the device and can only be called from device code.
  • __host__: runs on the host and can only be called from host code (this is the default when no qualifier is given).

__device__ and __host__ can be used in combination, so the same function is compiled for both the host and the device; a short sketch appears after the restrictions list below.

Kernel restrictions:

  • It can access device memory only.
  • It must return void.
  • It does not support a variable number of arguments.
  • It does not support static variables.
  • It does not support function pointers.
  • It executes asynchronously with respect to the host.
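A minimal sketch of these qualifiers in combination (the function names are my own invention, not from the original post): a __host__ __device__ helper is compiled for both sides and reused by a __global__ kernel and by ordinary host code:

#include <cuda_runtime.h>
#include <stdio.h>

// compiled for both host and device thanks to the combined qualifiers
__host__ __device__ float square(float x) {
  return x * x;
}

// kernel: runs on the device, launched from the host, returns void
__global__ void squareArray(float *data) {
  int i = threadIdx.x;
  data[i] = square(data[i]);   // device-side call to the helper
}

int main(void) {
  printf("square(3.0f) on the host = %f\n", square(3.0f));   // host-side call to the same helper

  float h_data[8] = {0, 1, 2, 3, 4, 5, 6, 7};
  float *d_data;
  cudaMalloc((void **)&d_data, sizeof(h_data));
  cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);

  squareArray<<<1, 8>>>(d_data);
  cudaMemcpy(h_data, d_data, sizeof(h_data), cudaMemcpyDeviceToHost);

  printf("h_data[7] squared on the device = %f\n", h_data[7]);  // expect 49.0
  cudaFree(d_data);
  return 0;
}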
Code Analysis
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <math.h>

#define CHECK(call) \
{ \
  const cudaError_t error = call; \
  if (error != cudaSuccess) \
  { \
    printf("Error: %s:%d, ", __FILE__, __LINE__); \
    printf("code:%d, reason: %s\n", error, cudaGetErrorString(error)); \
    exit(1); \
  } \
}

// compare the host and device results element by element
void checkResult(float *hostRef, float *gpuRef, const int N) {
  double epsilon = 1.0E-8;
  bool match = 1;
  for (int i = 0; i < N; i++) {
    if (fabs(hostRef[i] - gpuRef[i]) > epsilon) {
      match = 0;
      printf("Arrays do not match!\n");
      printf("host %5.2f gpu %5.2f at current %d\n", hostRef[i], gpuRef[i], i);
      break;
    }
  }
  if (match) printf("Arrays match.\n\n");
}

void initialData(float *ip, int size) {
  // generate different seed for random number
  time_t t;
  srand((unsigned) time(&t));
  for (int i = 0; i < size; i++) {
    ip[i] = (float)(rand() & 0xFF) / 10.0f;
  }
}

void sumArraysOnHost(float *A, float *B, float *C, const int N) {
  for (int idx = 0; idx < N; idx++)
    C[idx] = A[idx] + B[idx];
}

__global__ void sumArraysOnGPU(float *A, float *B, float *C) {
  int i = threadIdx.x;
  C[i] = A[i] + B[i];
}

int main(int argc, char **argv) {
  printf("%s Starting...\n", argv[0]);

  // set up device
  int dev = 0;
  cudaSetDevice(dev);

  // set up data size of vectors
  int nElem = 32;
  printf("Vector size %d\n", nElem);

  // malloc host memory
  size_t nBytes = nElem * sizeof(float);
  float *h_A, *h_B, *hostRef, *gpuRef;
  h_A = (float *)malloc(nBytes);
  h_B = (float *)malloc(nBytes);
  hostRef = (float *)malloc(nBytes);
  gpuRef = (float *)malloc(nBytes);

  // initialize data at host side
  initialData(h_A, nElem);
  initialData(h_B, nElem);
  memset(hostRef, 0, nBytes);
  memset(gpuRef, 0, nBytes);

  // malloc device global memory
  float *d_A, *d_B, *d_C;
  cudaMalloc((float**)&d_A, nBytes);
  cudaMalloc((float**)&d_B, nBytes);
  cudaMalloc((float**)&d_C, nBytes);

  // transfer data from host to device
  cudaMemcpy(d_A, h_A, nBytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, nBytes, cudaMemcpyHostToDevice);

  // invoke kernel at host side
  dim3 block(nElem);
  dim3 grid(nElem / block.x);
  sumArraysOnGPU<<<grid, block>>>(d_A, d_B, d_C);
  printf("Execution configuration <<<%d, %d>>>\n", grid.x, block.x);

  // copy kernel result back to host side
  cudaMemcpy(gpuRef, d_C, nBytes, cudaMemcpyDeviceToHost);

  // add vector at host side for result checks
  sumArraysOnHost(h_A, h_B, hostRef, nElem);

  // check device results
  checkResult(hostRef, gpuRef, nElem);

  // free device global memory
  cudaFree(d_A);
  cudaFree(d_B);
  cudaFree(d_C);

  // free host memory
  free(h_A);
  free(h_B);
  free(hostRef);
  free(gpuRef);

  return(0);
}

Compile command: $ nvcc sum.cu -o sum

Run: $ ./sum

Output:

./sum Starting...
Vector size 32
Execution configuration <<<1, 32>>>
Arrays match.

Download Code: CodeSamples.zip
