Written in front
The content is divided into two parts: the first part is a translation of the section "Timing Your Kernel" from the CUDA programming model chapter of "Professional CUDA C Programming", and the second part is my own experience. My experience is limited, so additions are very welcome.
In CUDA we chase speedup, and to measure speedup you need accurate times, so timing functions are essential.
Timing usually falls into two situations: (1) timing an interface function as a whole, generally used to compute the speedup; (2) timing the kernel launches, memory copies, and other pieces inside an interface function, generally used to optimize the code.
For situation (1) there are two methods: CPU timing functions and GPU timing functions.
For situation (2) there are three tools: Nsight, nvvp, and nvprof.
This post describes the two methods for situation (1) in detail. For situation (2), I do not use Nsight, so I only briefly introduce the use of nvvp and nvprof.

CPU Timing Functions
One issue to keep in mind when using a CPU timing function is that kernel execution is asynchronous, so a synchronization call must be added after the kernel launch to get the correct time.
The sample code is as follows:
#include <sys/time.h>   // for gettimeofday

double cpuSecond() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return ((double)tp.tv_sec + (double)tp.tv_usec * 1.e-6);
}

double iStart = cpuSecond();
kernel_name<<<grid, block>>>(argument list);
cudaDeviceSynchronize();   // synchronize so the full kernel execution time is measured
double iElaps = cpuSecond() - iStart;
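For reference, here is a minimal self-contained sketch of this pattern. The kernel doubleElements, the array size, and the launch configuration are my own illustrative choices, not from the book:

#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

// Hypothetical kernel, used only to have something to time.
__global__ void doubleElements(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

double cpuSecond() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return ((double)tp.tv_sec + (double)tp.tv_usec * 1.e-6);
}

int main() {
    const int n = 1 << 24;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    double iStart = cpuSecond();
    doubleElements<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();   // without this, only the (tiny) launch overhead is timed
    double iElaps = cpuSecond() - iStart;
    printf("doubleElements elapsed %f sec\n", iElaps);

    cudaFree(d_data);
    return 0;
}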
GPU Timing Functions
With GPU timing functions you do not need to handle the synchronization problem yourself; you can time directly with the CUDA event functions. The sample code is as follows:
cudaEvent_t start, stop;
float elapsedTime = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
kernel_name<<<grid, block>>>(argument list);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                       // wait until the stop event has been recorded
cudaEventElapsedTime(&elapsedTime, start, stop);  // elapsed time is reported in milliseconds
std::cout << elapsedTime << std::endl;
cudaEventDestroy(start);
cudaEventDestroy(stop);
How to Get Accurate Timing
Normally, the first execution of a kernel is slower than the second, because the GPU needs to warm up on the first launch. The time measured for the first kernel launch is therefore not accurate. I have summarized three ways to get an accurate timing:

1. Loop the timed section 100 times and average, so the error from the first launch is reduced by a factor of 100. The advantage of this method is that it is simple and crude. The drawbacks are also obvious: (1) the execution time of the program increases greatly, especially for larger programs; (2) you have to watch out for memory problems, since C++ memory is managed manually by the programmer, and while it is easy to write a program that runs once without a memory problem, it is much harder to write code that loops 100 times without one; (3) the timing is still not especially accurate, even though the error has been reduced 100-fold.
2. Execute a warmup function before timing, for example using the vectorAdd from the CUDA samples as the warmup function. The advantage of this approach is that program execution time stays short; the disadvantage is that you need to add an extra function to the program, and because the GPU executes in parallel, two runs of the same kernel never take exactly the same time. So method 3 is recommended.
3. Execute the warmup function first, then loop the timed section about 10 times and average (a sketch follows below).
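As a rough sketch of method 3, assuming the cpuSecond() helper from above; warmup_kernel, kernel_name, grid, block, and the argument lists are placeholders for whatever your program actually launches:

// Method 3 sketch: warm up once, then average NREPEAT timed launches.
const int NREPEAT = 10;

warmup_kernel<<<grid, block>>>(/* arguments */);   // e.g. a vectorAdd used purely as warmup
cudaDeviceSynchronize();

double iStart = cpuSecond();
for (int i = 0; i < NREPEAT; ++i) {
    kernel_name<<<grid, block>>>(/* arguments */);
}
cudaDeviceSynchronize();                           // wait for all NREPEAT launches to finish
double iElaps = (cpuSecond() - iStart) / NREPEAT;  // average time per launch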
The Use of nvvp and nvprof

nvprof is a command-line profiler that has existed since CUDA 5.0; with nvprof alone you can already see some of the execution details of your code. The simplest usage is as follows:
$ nvprof ./sumArraysOnGPU-timer
You will get output similar to the following:
./sumArraysOnGPU-timer Starting...
Using Device 0: Tesla M2070
==17770== NVPROF is profiling process 17770, command: ./sumArraysOnGPU-timer
Vector size 16777216
sumArraysOnGPU <<<16384, 1024>>> Time elapsed 0.003266 sec
Arrays match.
...
For more information about nvprof's parameters, you can use the help command:
$ nvprof --help
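For example (these are standard nvprof options, listed here as a hint rather than taken from the book excerpt), a per-kernel GPU trace and metric collection can be requested roughly like this:

$ nvprof --print-gpu-trace ./sumArraysOnGPU-timer
$ nvprof --metrics achieved_occupancy ./sumArraysOnGPU-timer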
The NVIDIA Visual Profiler (nvvp) is a graphical profiler, and it is the one I have been using. For a simple text tutorial, see the links at the end of the post.
OpenCUDA: an open-source project of CUDA image algorithms, with detailed comments in the code; everyone is welcome to learn from it together.
I privately take on various CUDA-related outsourcing work (debugging, optimization, development of image algorithms, etc.). If you are interested, please contact me, and note your purpose when adding me as a friend.