CUDA from Beginner to Proficient (0): Preface
At my supervisor's request, I first came into contact with CUDA programming in a 2012 master's course on high-performance computing, and then applied the technology to a real project, speeding up the processing program by more than 1,000 times. Clearly, GPU-based parallel computing is an ideal choice when speed is what you are after. I graduate in less than a year, and so that these skills do not disappear with graduation, I plan to start a CUDA column this summer. It will go from beginner to proficient, step by step, and along the way I will share some design experience and lessons learned, in the hope of giving some guidance to fellow students learning CUDA. My ability is limited and mistakes are unavoidable, so discussion is welcome.
PS: Applying for a column apparently requires posting at least 15 original posts first... Never mind, I will write enough posts, apply then, and move them over.
CUDA from Beginner to Proficient (1): Environment Setup
NVIDIA introduced CUDA (Compute Unified Device Architecture) in 2006, making it possible to use its GPUs for general-purpose computing and extending parallel computing from large clusters to ordinary graphics cards. It allows users to run fairly large parallel programs on a notebook with a GeForce graphics card.
The advantage of using a graphics card is that, compared to a large cluster, it consumes very little power and is inexpensive, yet its performance is outstanding. Take my notebook as an example: it has a GeForce 610M, and testing it with the deviceQuery program gives the following hardware parameters:
Its computing power reaches up to 48 cores x 0.95 GHz = 45.6 GFLOPS. The CPU parameters of the notebook are as follows:
The CPU's computing power (4 cores) is 2.5 GHz x 4 = 10 GFLOPS. By this rough estimate, the graphics card's computing performance is about 4.5 times that of the 4-core i5 CPU, so we can make full use of this resource to accelerate time-consuming applications.
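The estimate above can be reproduced in code. The following is a minimal sketch, not the deviceQuery sample itself: it reads the multiprocessor count and clock rate from the CUDA runtime and multiplies them the same way as the arithmetic above (one operation per core per clock). The number of cores per multiprocessor is not reported by the runtime, so the value 48 here is an assumption that matches the 48 cores quoted for the GeForce 610M; other GPUs need a different value.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaError_t status = cudaGetDeviceCount(&deviceCount);
    if (status != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(status));
        return 1;
    }
    if (deviceCount == 0) {
        fprintf(stderr, "No CUDA-capable device found.\n");
        return 1;
    }

    const int device = 0;  // first device; change on a multi-GPU system
    cudaDeviceProp prop;
    status = cudaGetDeviceProperties(&prop, device);
    if (status != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(status));
        return 1;
    }

    int smCount = 0;
    int clockKHz = 0;  // the runtime reports the clock rate in kHz
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, device);
    cudaDeviceGetAttribute(&clockKHz, cudaDevAttrClockRate, device);

    // ASSUMPTION: cores per multiprocessor is not available from the runtime.
    // 48 is chosen to match the 48 cores quoted for the GeForce 610M.
    const int coresPerSM = 48;

    double clockGHz = clockKHz / 1.0e6;
    // Same rough estimate as in the text: cores x clock, one op per core per clock.
    double estimateGflops = smCount * coresPerSM * clockGHz;

    printf("Device %d: %s (compute capability %d.%d)\n",
           device, prop.name, prop.major, prop.minor);
    printf("Multiprocessors: %d, clock: %.2f GHz\n", smCount, clockGHz);
    printf("Rough estimate: %.1f GFLOPS (assuming %d cores per multiprocessor)\n",
           estimateGflops, coresPerSM);
    return 0;
}
```

This is only a back-of-the-envelope figure; real throughput depends on the instruction mix and memory access pattern.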
As the saying goes, to do a good job one must first sharpen one's tools. To program the GPU with CUDA, we need to prepare the following tools:
1. Hardware platform, that is, the graphics card. If you are not using an NVIDIA graphics card, then sorry, cards from other vendors do not support CUDA.
2. Operating system. I have used Windows XP, and Windows 7 is no problem either; this blog uses Windows 7.
3. C compiler. VS2008 is recommended, to be consistent with this blog.
4. CUDA compiler NVCC. The CUDA Toolkit can be downloaded free of charge from the official website. The latest version is 5.0, which is the version this blog uses.
5. Other tools (such as Visual Assist, for code highlighting).
Once everything is ready, start installing the software. Installing VS2008 takes quite a while; the full version is recommended (the NVIDIA website says the Express edition also works), and the process needs no detailed explanation. CUDA Toolkit 5.0 contains everything required: the NVCC compiler, design documents, sample routines, the CUDA runtime libraries, the CUDA header files, and more.
After installation, you will find this icon on the desktop:
Yes, that is the one. Double-click it to run, and you will see many sample routines. Find the Simple OpenGL sample and run it to see the effect:
Click Run at the spot marked with the yellow line on the right, and you will see a beautiful three-dimensional sine surface. Dragging with the left mouse button changes the viewing angle, and dragging with the right button zooms. If this runs successfully, your environment is basically set up.
Problems you may encounter:
1. You use Remote Desktop Connection to log on to another server that has a CUDA-capable graphics card, but your remote session cannot run CUDA programs. This is because a remote desktop session uses your local graphics resources and cannot see the graphics card on the server side, so you get an error saying there is no CUDA-capable graphics card. Workarounds: (a) install two graphics cards in the remote server, one for display only and the other for computation; (b) do not log in through the graphical interface, but use a command-line interface such as Telnet instead.
2. There are two or more CUDA-capable graphics cards, and you need to decide which one the program runs on. You have to control this in the program by choosing the card that meets certain conditions, such as a higher clock frequency, more video memory, or a higher compute capability. See later posts for details.
OK, that is all for now. In the next section we will show how to program the GPU in VS2008.
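The device selection described in problem 2 above can be sketched as follows. This is a minimal illustration, not the approach from the later posts: the helper name pickBestDevice and the selection rule (highest compute capability, ties broken by the larger global memory) are assumptions chosen for the example. Only standard runtime calls are used (cudaGetDeviceCount, cudaGetDeviceProperties, cudaSetDevice).

```cuda
#include <cstdio>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: returns the index of the preferred device, or -1 if none.
// Selection rule (an assumption for this sketch): highest compute capability,
// with ties broken by the larger amount of global memory.
static int pickBestDevice() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        return -1;
    }

    int best = -1;
    int bestMajor = 0;
    int bestMinor = 0;
    size_t bestMem = 0;

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) {
            continue;  // skip devices we cannot query
        }
        bool newer = prop.major > bestMajor ||
                     (prop.major == bestMajor && prop.minor > bestMinor);
        bool sameVersionMoreMem = prop.major == bestMajor &&
                                  prop.minor == bestMinor &&
                                  prop.totalGlobalMem > bestMem;
        if (best < 0 || newer || sameVersionMoreMem) {
            best = d;
            bestMajor = prop.major;
            bestMinor = prop.minor;
            bestMem = prop.totalGlobalMem;
        }
    }
    return best;
}

int main() {
    int device = pickBestDevice();
    if (device < 0) {
        fprintf(stderr, "No usable CUDA device found.\n");
        return 1;
    }

    cudaError_t status = cudaSetDevice(device);
    if (status != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed: %s\n", cudaGetErrorString(status));
        return 1;
    }

    cudaDeviceProp prop;
    status = cudaGetDeviceProperties(&prop, device);
    if (status != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(status));
        return 1;
    }
    printf("Using device %d: %s (compute capability %d.%d, %lu MB)\n",
           device, prop.name, prop.major, prop.minor,
           (unsigned long)(prop.totalGlobalMem / (1024 * 1024)));
    return 0;
}
```

Any other criterion, such as clock frequency, can be swapped into the comparison in the same way.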
CUDA from Beginner to Proficient (2): First CUDA Program
Picking up where we left off: now that the sample routine runs successfully, the next step is to understand how each part of the routine is implemented. Of course, we start simple. Most programming languages begin with a Hello World example, but our graphics card cannot talk; it can only do simple arithmetic such as addition and subtraction. So I think the most suitable Hello World for a CUDA program is vector addition.
Open VS2008, select File->New->Project, and in the dialog box that pops up, set the options as follows:
Then click OK to go directly to the project view.
In the project we see a single .cu file with the following contents:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size);

__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

int main()
{
    const int arraySize = 5;
    const int a[arraySize] = { 1, 2, 3, 4, 5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    int c[arraySize] = { 0 };

    // Add vectors in parallel.
    cudaError_t cudaStatus = addWithCuda(c, a, b, arraySize);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }

    printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",
        c[0], c[1], c[2], c[3], c[4]);

    // cudaThreadExit must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaThreadExit();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaThreadExit failed!");
        return 1;
    }

    return 0;
}

// Helper function for using CUDA to add vectors in parallel.
cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size)
{
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    cudaError_t cudaStatus;

    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
        goto Error;
    }

    // Allocate GPU buffers for three vectors (two input, one output).
    cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    // Copy input vectors from host memory to GPU buffers.
    cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // Launch a kernel on the GPU with one thread for each element.
    addKernel<<<1, size>>>(dev_c, dev_a, dev_b);

    // cudaThreadSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaThreadSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaThreadSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto Error;
    }

    // Copy output vector from GPU buffer to host memory.
    cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

Error:
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);

    return cudaStatus;
}
As can be seen, a CUDA program is not very different from a C program. It just adds some library functions whose names begin with "cuda" and one specially declared function:
__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}
This function runs on the GPU and is called a kernel function (in English, "kernel function"). Take care to distinguish it from the kernel functions of an operating system.
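To make the role of threadIdx.x in the kernel concrete, here is a minimal sketch that is separate from the project template: a hypothetical kernel named writeIndex in which every thread stores its own index. The kernel name and the array size of 8 are assumptions made for illustration. It is launched with one block of n threads, the same launch shape as addKernel<<<1, size>>> in the listing.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread writes its own index into the output array.
__global__ void writeIndex(int *out)
{
    int i = threadIdx.x;  // a different value in every thread of the block
    out[i] = i;
}

int main()
{
    const int n = 8;  // number of threads, chosen arbitrarily for this sketch
    int host[n] = { 0 };
    int *dev = 0;

    cudaError_t status = cudaMalloc((void**)&dev, n * sizeof(int));
    if (status != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(status));
        return 1;
    }

    // One block of n threads; the same kernel body runs once per thread.
    writeIndex<<<1, n>>>(dev);
    status = cudaGetLastError();  // reports launch errors
    if (status != cudaSuccess) {
        fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(status));
        cudaFree(dev);
        return 1;
    }

    // cudaMemcpy waits for the kernel to finish before copying the result back.
    status = cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (status != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(status));
        return 1;
    }

    for (int i = 0; i < n; ++i) {
        printf("%d ", host[i]);
    }
    printf("\n");
    return 0;
}
```

The vector addition kernel works the same way: thread i reads a[i] and b[i] and writes c[i], so no loop over the elements is needed.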
Press F7 to build directly, and we get the following output:
1>------ Build started: Project: cuda_helloworld, Configuration: Debug Win32 ------
1>Compiling with CUDA Build Rule...
1>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\\bin\nvcc.exe" -g -gencode=