A while ago, I finished both the ant colony algorithm and the improved K-means algorithm, and then turned to CUDA programming. After reading an introduction to CUDA, I assumed it would be easy to pick up for anyone who knows C; in fact, you also need some knowledge of GPU architecture to write a good program. Having read the book "CUDA for GPU High-Performance Computing", I feel it reads more like a manuscript compiled from many earlier documents. Because it collects the wisdom of many contributors, the explanations are good, but the ordering is not. Still, having it is better than nothing, and after going through it I have some confidence in CUDA programming. I recommend taking a look at it first.
Reading a book and actually writing a program are two different things. I set up the environment in the previous article, but I still didn't know how to create a CUDA project or where to start writing code. Fortunately, CUDA provides an SDK containing many examples for reference, and my first CUDA program starts from there.
The CUDA SDK examples all live under the src directory, and each example has its own subdirectory, such as deviceQuery. Each subdirectory also contains a Makefile for compiling that single project. To compile all the examples at once, run sudo make in the root directory of the CUDA SDK; the compiled executables appear under <cuda_sdk_home>/bin/linux/release, and you can run them to view the results.
This is the running result of deviceQuery:
We can use these examples to create our own projects. The SDK includes a template example: delete its .cu and .cpp files and clear out the obj directory, and it becomes an empty CUDA project. Write your program under src, change the target name in the Makefile, and compile; the generated executable again ends up under <cuda_sdk_home>/bin/linux/release. Here is a test program that performs vector addition:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda_runtime.h>
#include <cutil.h>

#define VEC_SIZE 16

// Kernel function: each thread adds one pair of elements
__global__ void vecAdd(float *d_a, float *d_b, float *d_c)
{
    int index = threadIdx.x;
    d_c[index] = d_a[index] + d_b[index];
}

int main()
{
    // Size of the memory to allocate
    size_t size = VEC_SIZE * sizeof(float);

    // Allocate host memory
    float *h_a = (float *)malloc(size);
    float *h_b = (float *)malloc(size);
    float *h_c = (float *)malloc(size);

    // Initialization
    for (int i = 0; i < VEC_SIZE; ++i)
    {
        h_a[i] = 1.0f;
        h_b[i] = 2.0f;
    }

    // Copy the host data to the device
    float *d_a;
    cudaMalloc((void **)&d_a, size);
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);

    float *d_b;
    cudaMalloc((void **)&d_b, size);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    // Allocate device memory for the result
    float *d_c;
    cudaMalloc((void **)&d_c, size);

    // Launch one block of 16 threads
    dim3 dimBlock(16);
    vecAdd<<<1, dimBlock>>>(d_a, d_b, d_c);

    // Copy the result back to host memory
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

    // Print the result
    for (int j = 0; j < VEC_SIZE; ++j)
    {
        printf("%f\t", h_c[j]);
    }

    // Free device and host memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}