CUDA Programming (II): CUDA Initialization and Kernel Functions
CUDA Initialization
As mentioned last time, once CUDA is installed successfully, creating a new project is simple: just select the NVIDIA CUDA project template when creating a new project. We first create a new MyCudaTest project, delete the sample kernel.cu, and create a new CUDA C/C++ file. Let's first look at how to initialize CUDA, so I named the file InitCuda.cu.
First, we want to use CUDA's runtime API, so we need to include cuda_runtime.h:
#include <stdio.h>

// CUDA Runtime API
#include <cuda_runtime.h>
Next, we write a function that calls the runtime API to initialize CUDA.
// CUDA initialization
bool InitCUDA()
{
    int count;

    // Get the number of CUDA-enabled devices
    cudaGetDeviceCount(&count);

    // No compliant hardware
    if (count == 0) {
        fprintf(stderr, "There is no device.\n");
        return false;
    }

    int i;
    for (i = 0; i < count; i++) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
            if (prop.major >= 1) {
                break;
            }
        }
    }

    if (i == count) {
        fprintf(stderr, "There is no device supporting CUDA 1.x.\n");
        return false;
    }

    cudaSetDevice(i);
    return true;
}
This program first calls cudaGetDeviceCount to get the number of CUDA-capable GPUs. If the machine has no CUDA device, the count may still be returned as 1: device 0 is then just an emulation device that does not support most CUDA features (it does not support CUDA 1.0 or above). So to really determine whether there is a CUDA-capable device on the system, we need to call cudaGetDeviceProperties for each device and examine its parameters, including the supported CUDA version (prop.major and prop.minor together give the compute capability supported by the device; for example, 6.5 means prop.major is 6 and prop.minor is 5).
Besides the CUDA version supported by the device, cudaGetDeviceProperties also returns the device name, the size of its memory, the maximum number of threads, the clock frequency of the execution units, and so on. Refer to NVIDIA's CUDA Programming Guide for details.
After finding a device that supports CUDA 1.0 or above, you can call cudaSetDevice to make it the current device.
Here we call the InitCUDA function in main. Since we are using Visual Studio, pressing Ctrl+F5 compiles and runs the program; if there is a CUDA-capable device on the system, "CUDA initialized." should be displayed.
int main()
{
    if (!InitCUDA()) {
        return 0;
    }

    printf("CUDA initialized.\n");

    return 0;
}
Complete program:
#include <stdio.h>

// CUDA Runtime API
#include <cuda_runtime.h>

// CUDA initialization
bool InitCUDA()
{
    int count;

    // Get the number of CUDA-enabled devices
    cudaGetDeviceCount(&count);

    // No compliant hardware
    if (count == 0) {
        fprintf(stderr, "There is no device.\n");
        return false;
    }

    int i;
    for (i = 0; i < count; i++) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
            if (prop.major >= 1) {
                break;
            }
        }
    }

    if (i == count) {
        fprintf(stderr, "There is no device supporting CUDA 1.x.\n");
        return false;
    }

    cudaSetDevice(i);
    return true;
}

int main()
{
    if (!InitCUDA()) {
        return 0;
    }

    printf("CUDA initialized.\n");

    return 0;
}
CUDA Kernel Functions
Having completed the CUDA initialization check, we can use CUDA for some simple computation. Here we are going to compute the sum of squares of a series of numbers.
So we first write a function to generate random numbers:
#define DATA_SIZE 1048576

int data[DATA_SIZE];

// Generate a large number of random numbers between 0 and 9
void GenerateNumbers(int *number, int size)
{
    for (int i = 0; i < size; i++) {
        number[i] = rand() % 10;
    }
}

// Generate the random numbers (called in main):
// GenerateNumbers(data, DATA_SIZE);
This function fills the array with random numbers between 0 and 9; next, we will square them and sum the results.
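As a quick host-side sanity check, a minimal standalone sketch of the same generator (shown here without the global data array, and with a caller-supplied buffer) confirms that rand() % 10 always yields values in the range 0..9; seeding with srand makes runs reproducible:

```c
#include <stdlib.h>

/* Same idea as GenerateNumbers above: fill a caller-supplied buffer
   with random digits. rand() returns a non-negative int, so the
   remainder modulo 10 is always in 0..9. */
void GenerateNumbers(int *number, int size)
{
    for (int i = 0; i < size; i++) {
        number[i] = rand() % 10;
    }
}
```

Calling srand with a fixed seed before GenerateNumbers makes the sequence repeatable across runs on the same C library, which is handy when comparing GPU and CPU results during debugging.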
So how do we get this job done on the graphics card? First, it is obvious that these numbers cannot stay in main memory; they have to be copied into the GPU's memory. Let's look at the data-copying part first.
Host & Device Architecture:
Last time I covered some basics of the CUDA architecture; here is a quick refresher. In CUDA, a program is divided into two parts: the host side and the device side. The host side is the part that executes on the CPU, and the device side is the part that executes on the GPU. Device-side functions are also known as "kernels". Usually the host program copies the data into the GPU's memory, lets the GPU execute the device-side program, and, once it finishes, copies the results back from GPU memory into main memory.
We need to copy the generated data into device memory to do the computation on the graphics card, so we first allocate a suitably sized chunk of GPU memory and then copy the random numbers into it from main memory.
// Generate random numbers
GenerateNumbers(data, DATA_SIZE);

/* Copy the data into GPU memory */
int *gpudata, *result;

// cudaMalloc allocates GPU memory (result will hold the computed sum)
cudaMalloc((void**)&gpudata, sizeof(int) * DATA_SIZE);
cudaMalloc((void**)&result, sizeof(int));

// cudaMemcpy copies the generated random numbers into GPU memory
//   cudaMemcpyHostToDevice - copy from main memory to GPU memory
//   cudaMemcpyDeviceToHost - copy from GPU memory to main memory
cudaMemcpy(gpudata, data, sizeof(int) * DATA_SIZE, cudaMemcpyHostToDevice);
The comments explain most of this. cudaMalloc and cudaMemcpy are used much like the ordinary malloc and memcpy, except that cudaMemcpy takes an extra parameter indicating the direction of the copy. Here we use cudaMemcpyHostToDevice because we are copying from main memory to GPU memory; to copy from GPU memory back to main memory, use cudaMemcpyDeviceToHost.
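To make the malloc/memcpy analogy concrete, here is a plain-C sketch of the same "allocate, copy in, copy out, free" round trip, with an ordinary heap buffer standing in for GPU memory (no GPU is involved here; round_trip is a name chosen for illustration):

```c
#include <stdlib.h>
#include <string.h>

/* Plain-C stand-in for the CUDA round trip: "device" memory is just a
   heap buffer. Returns a newly allocated copy of the input (caller frees). */
int *round_trip(const int *host_src, int n)
{
    int *dev = malloc(sizeof(int) * n);        /* ~ cudaMalloc             */
    memcpy(dev, host_src, sizeof(int) * n);    /* ~ cudaMemcpyHostToDevice */

    int *host_dst = malloc(sizeof(int) * n);
    memcpy(host_dst, dev, sizeof(int) * n);    /* ~ cudaMemcpyDeviceToHost */

    free(dev);                                 /* ~ cudaFree               */
    return host_dst;
}
```

The CUDA versions differ mainly in that the "device" pointer refers to memory the CPU cannot dereference directly, which is exactly why the explicit direction parameter of cudaMemcpy exists.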
Now that the data has been copied from main memory to GPU memory, we can finish the computation on the graphics card. How do we get a program to run on the GPU? The answer is the kernel function.
CUDA kernel functions:
In CUDA, a function to be executed on the GPU is declared with the __global__ qualifier, so to turn an ordinary function into a kernel we just prefix it with __global__:
// __global__ function (executed on the GPU): compute the sum of squares
__global__ static void sumOfSquares(int *num, int *result)
{
    int sum = 0;
    for (int i = 0; i < DATA_SIZE; i++) {
        sum += num[i] * num[i];
    }
    *result = sum;
}
There are some restrictions on programs executed on the GPU. The first and most obvious one is that a kernel cannot have a return value, which is why the result is written through the result pointer. There are other restrictions, which we will get to later.
To execute a kernel function:
After writing the kernel function, we need CUDA to execute it.
In CUDA, to execute a kernel function, use the following syntax:
function_name<<<number of blocks, number of threads, shared memory size>>>(parameters...);
Here we are not going parallel yet; we simply want to finish the computation on the GPU, so we use one block, one thread, and no shared memory: blocks = 1, threads = 1, shared memory = 0.
sumOfSquares<<<1, 1, 0>>>(gpudata, result);
After the calculation, don't forget to copy the result back from GPU memory to main memory, and then free the GPU memory.
int sum;

// cudaMemcpy copies the result from GPU memory back to main memory
cudaMemcpy(&sum, result, sizeof(int), cudaMemcpyDeviceToHost);

// Free GPU memory
cudaFree(gpudata);
cudaFree(result);
Finally, we print the result and we're done:
printf("GPUsum: %d \n", sum);
We then use the CPU to verify that the result above is correct; this step is still very necessary:
sum = 0;
for (int i = 0; i < DATA_SIZE; i++) {
    sum += data[i] * data[i];
}
printf("CPUsum: %d \n", sum);
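The CPU check can be packaged as a small reference function that is easy to verify against values computed by hand, for example 1² + 2² + 3² = 14. A minimal sketch (sumOfSquaresCPU is a hypothetical name chosen for illustration):

```c
/* CPU reference implementation: sum of squares of an int array.
   Used to cross-check the value computed by the GPU kernel. */
int sumOfSquaresCPU(const int *num, int size)
{
    int sum = 0;
    for (int i = 0; i < size; i++) {
        sum += num[i] * num[i];
    }
    return sum;
}
```

Note that with DATA_SIZE values in 0..9, the sum of squares is at most 81 * 1048576, which still fits comfortably in a 32-bit int, so int is a safe accumulator type here.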
Complete program:
#include <stdio.h>
#include <stdlib.h>

// CUDA Runtime API
#include <cuda_runtime.h>

#define DATA_SIZE 1048576

int data[DATA_SIZE];

// Generate a large number of random numbers between 0 and 9
void GenerateNumbers(int *number, int size)
{
    for (int i = 0; i < size; i++) {
        number[i] = rand() % 10;
    }
}

// CUDA initialization
bool InitCUDA()
{
    int count;

    // Get the number of CUDA-enabled devices
    cudaGetDeviceCount(&count);

    if (count == 0) {
        fprintf(stderr, "There is no device.\n");
        return false;
    }

    int i;
    for (i = 0; i < count; i++) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
            if (prop.major >= 1) {
                break;
            }
        }
    }

    if (i == count) {
        fprintf(stderr, "There is no device supporting CUDA 1.x.\n");
        return false;
    }

    cudaSetDevice(i);
    return true;
}

// __global__ function (executed on the GPU): compute the sum of squares
__global__ static void sumOfSquares(int *num, int *result)
{
    int sum = 0;
    for (int i = 0; i < DATA_SIZE; i++) {
        sum += num[i] * num[i];
    }
    *result = sum;
}

int main()
{
    // CUDA initialization
    if (!InitCUDA()) {
        return 0;
    }

    // Generate random numbers
    GenerateNumbers(data, DATA_SIZE);

    /* Copy the data into GPU memory */
    int *gpudata, *result;

    // cudaMalloc allocates GPU memory (result will hold the computed sum)
    cudaMalloc((void**)&gpudata, sizeof(int) * DATA_SIZE);
    cudaMalloc((void**)&result, sizeof(int));

    // cudaMemcpy copies the generated random numbers into GPU memory
    //   cudaMemcpyHostToDevice - copy from main memory to GPU memory
    //   cudaMemcpyDeviceToHost - copy from GPU memory to main memory
    cudaMemcpy(gpudata, data, sizeof(int) * DATA_SIZE, cudaMemcpyHostToDevice);

    // Kernel launch syntax in CUDA:
    //   function_name<<<number of blocks, number of threads, shared memory size>>>(parameters...);
    sumOfSquares<<<1, 1, 0>>>(gpudata, result);

    /* Copy the result from GPU memory back to main memory */
    int sum;
    cudaMemcpy(&sum, result, sizeof(int), cudaMemcpyDeviceToHost);

    // Free GPU memory
    cudaFree(gpudata);
    cudaFree(result);

    printf("GPUsum: %d \n", sum);

    // Verify on the CPU
    sum = 0;
    for (int i = 0; i < DATA_SIZE; i++) {
        sum += data[i] * data[i];
    }
    printf("CPUsum: %d \n", sum);

    return 0;
}
Run result:
Summary:
This time we introduced CUDA initialization and how to run a program on the graphics card: copy the data from main memory to GPU memory, write a kernel function for the computation, launch the kernel to complete the calculation on the GPU, and then, of course, don't forget to copy the result back to main memory and free the GPU memory.
Overall, the skeleton of a CUDA program is now in place, but the top priority of GPU computing, parallel acceleration, has not yet been introduced. Before accelerating anything, though, there is one very important thing to consider: whether our program has actually become faster. That means measuring the program's run time, and for this we will use the clock function that CUDA provides on the GPU, which, together with the clock frequency of the execution units, gives the execution time. The next post will focus mainly on this function. I hope this helps your learning!
Reference: "CUDA in Layman's Terms" (深入浅出谈CUDA)