In high-performance computing, the GPU's architecture makes it well suited to parallel workloads: computation-heavy games, graphics, and image-processing algorithms can all be significantly accelerated on the GPU. NVIDIA graphics cards are widely used in PCs today, and CUDA is NVIDIA's general-purpose parallel computing architecture for them.
Building on "Getting Started with CUDA", we will now try CUDA for the first time.
1. Project Setup
There is not much to say here. Create an empty Win32 console application and set the project properties (see the previous blog post).
2. Program Initialization
First, add the header file
#include <stdio.h>        // C standard input/output
#include <stdlib.h>
#include <cuda_runtime.h> // use the CUDA Runtime API
Define the CUDA initialization function InitCUDA(): if a CUDA-capable device is found it returns true, otherwise false.
// CUDA initialization
bool InitCUDA()
{
    int count;
    // get the number of CUDA devices; with no real device this may still
    // return 1, where device 0 is an emulation device that does not support CUDA
    cudaGetDeviceCount(&count);

    if (count == 0) // no device
    {
        fprintf(stderr, "There is no device.\n");
        return false;
    }

    int i;
    for (i = 0; i < count; i++)
    {
        cudaDeviceProp prop; // device properties
        if (cudaGetDeviceProperties(&prop, i) == cudaSuccess) // query the compute device
        {
            if (prop.major >= 1) // CUDA compute capability
            {
                break;
            }
        }
    }

    if (i == count)
    {
        fprintf(stderr, "There is no device supporting CUDA 1.x.\n");
        return false;
    }

    cudaSetDevice(i); // set the device to be used for GPU execution
    return true;
}
The entry point of a CUDA program is, of course, still main().
int main()
{
    if (!InitCUDA())
    {
        return 0;
    }
    printf("CUDA initialized.\n");
}
This completes a simple, runnable initialization program. Next we will use the GPU to compute the sum of squares of an array. To show the GPU's parallel computing power, the array length is set to a fairly large value.
3. Generate the Array
Define an array, then fill in the value of each element.
#define DATA_SIZE 1048576 // data length

int data[DATA_SIZE];

// generate the array element values
void GenerateNumbers(int *number, int size)
{
    for (int i = 0; i < size; i++)
    {
        number[i] = rand() % 10; // random number in 0~9
    }
}
4. The Program Running on the Graphics Card
A program executed on the graphics device is written much like ordinary C, although code running on the GPU must follow some additional programming rules; see the relevant documentation for details. __global__ is a function type qualifier: it marks a kernel function, which executes on the device (GPU) and can only be called from the host (CPU).
Note that all variables the device computes on are passed as pointers (to device memory). The meaning and parameters of each CUDA API function can be checked in the headers and documentation.
Note that the loop is not written as for (i = 0; i < DATA_SIZE / (BLOCK_NUM * THREAD_NUM); i++) with each thread handling a contiguous chunk. Because the threads in a block execute together and share memory access, array reads are fastest when they are interleaved: at each step thread0, thread1, ..., thread255 read consecutive elements, so the reads stay contiguous across threads, which greatly improves performance. This kind of data layout matters a great deal in parallel computing.
// function executed on the display chip
// note: __global__ has two underscores on each side
__global__ static void sumOfSquares(int *num, int *result, clock_t *time)
{
    const int tid = threadIdx.x; // thread index within the block
    const int bid = blockIdx.x;  // block index
    int sum = 0;
    int i;
    if (tid == 0)
        time[bid] = clock(); // start time of this block
    // interleaved access: consecutive threads read consecutive elements
    for (i = bid * THREAD_NUM + tid; i < DATA_SIZE; i += BLOCK_NUM * THREAD_NUM)
    {
        sum += num[i] * num[i];
    }
    result[bid * THREAD_NUM + tid] = sum; // partial sum of thread tid in block bid
    if (tid == 0)
        time[bid + BLOCK_NUM] = clock(); // end time of this block
}
5. Allocate Video Memory and Execute the Parallel Computation
The array elements must be brought to the GPU for the computation, so first copy the array from host memory into video memory, where the graphics card can read it. On the card we use 32 blocks, each with 256 threads, for the parallel computation.
After the video memory is allocated, the kernel function can be called to perform the parallel computation. When it finishes, copy the results from video memory back to host memory, then release the allocated video memory. Note that the data lengths (byte counts) on both sides of a copy must match.
Each block writes its threads' partial results, which are copied back to host memory, and the final sum is then computed on the CPU.
The relationship between the CPU, GPU, and their memories is shown below. The numbers of blocks in the GPU and threads in each block are illustrative, not the actual counts.
cudaMalloc allocates space in video memory.
#define BLOCK_NUM 32   // number of blocks
#define THREAD_NUM 256 // number of threads in each block

int main()
{
    // ... CUDA initialization
    GenerateNumbers(data, DATA_SIZE); // generate the random data

    int *gpudata, *result; // pointers to video memory
    clock_t *time;         // computation time (in GPU clock cycles)

    cudaMalloc((void**)&gpudata, sizeof(int) * DATA_SIZE); // allocate video memory on the device
    cudaMalloc((void**)&result, sizeof(int) * BLOCK_NUM * THREAD_NUM);
    cudaMalloc((void**)&time, sizeof(clock_t) * BLOCK_NUM * 2);
    // copy data between host and device
    cudaMemcpy(gpudata, data, sizeof(int) * DATA_SIZE, cudaMemcpyHostToDevice); // host -> device

    // kernel_name<<<number of blocks, threads per block, shared memory size>>>(arguments...)
    sumOfSquares<<<BLOCK_NUM, THREAD_NUM, 0>>>(gpudata, result, time);

    int sum[THREAD_NUM * BLOCK_NUM];
    clock_t time_used[BLOCK_NUM * 2]; // run times
    cudaMemcpy(&sum, result, sizeof(int) * THREAD_NUM * BLOCK_NUM, cudaMemcpyDeviceToHost);
    cudaMemcpy(&time_used, time, sizeof(clock_t) * BLOCK_NUM * 2, cudaMemcpyDeviceToHost);
    cudaFree(gpudata); // release the video memory
    cudaFree(result);
    cudaFree(time);

    // add up the partial sums computed by each thread
    int final_sum = 0;
    for (int i = 0; i < THREAD_NUM * BLOCK_NUM; i++)
    {
        final_sum += sum[i];
    }

    clock_t min_start, max_end;
    min_start = time_used[0];
    max_end = time_used[BLOCK_NUM];
    // total GPU running time: earliest block start to latest block end
    for (int i = 0; i < BLOCK_NUM; i++)
    {
        if (min_start > time_used[i])
            min_start = time_used[i];
        if (max_end < time_used[i + BLOCK_NUM])
            max_end = time_used[i + BLOCK_NUM];
    }
    printf("sum: %d  time: %d\n", final_sum, (int)(max_end - min_start));
6. CPU Verification
Next, write a verification routine to check that the result computed on the GPU is correct.
// following the code above
// compute and verify on the CPU
final_sum = 0;
for (int i = 0; i < DATA_SIZE; i++)
{
    final_sum += data[i] * data[i];
}
printf("(CPU) sum: %d\n", final_sum);
7. Compile and run
The result is as follows:
Here, the number after "time" is the running time on the GPU, measured in GPU clock cycles: the run took 1099014 cycles. With this machine's GPU clocked at 800 MHz, the elapsed time is 1099014 / (800 * 1000) ms, about 1.37 ms.
Note: careful readers will notice that this post does not demonstrate just how efficient GPU computing can be. Indeed, it is only a first experience with CUDA parallel computing, intended to leave a general impression of GPU computation.