CUDA from Getting Started to Mastering (10): Profiling and Visual Profiler


Once you are past the basics, the next thing to learn is how to optimize your code. Our earlier examples deliberately ignored performance so we could focus on the fundamental concepts rather than on details. Starting with this section, we turn our attention to performance and optimize the code step by step; after all, making execution faster is the whole point of parallel processing.

There are many ways to time code. On the host side, C provides timing calls such as clock() (or GetSystemTime() on Windows): read the time before and after the work and take the difference between the two readings. A sketch of that approach follows. For measuring execution time on the device, however, CUDA provides a dedicated event API, described afterwards.
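A minimal host-side timing sketch (the kernel name myKernel and its launch configuration are placeholders; a synchronization call is needed because kernel launches return immediately):

    clock_t begin = clock();                         // host time before the launch
    myKernel<<<dimG, dimB>>>(/* parameter list */);
    cudaDeviceSynchronize();                         // wait for the kernel to finish
    clock_t end = clock();                           // host time after completion
    printf("Host-measured time: %.3f ms\n",
           1000.0 * (end - begin) / CLOCKS_PER_SEC);

Note that this measures wall-clock time on the host, so it includes launch overhead; the CUDA event API below times the device work itself.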

Keep the reference manual, "CUDA_Toolkit_Reference_Manual", open so you can look up any API you do not know. Around the kernel launch, we do the following:

    cudaEvent_t start, stop;                          // event objects
    cudaEventCreate(&start);                          // create start event
    cudaEventCreate(&stop);                           // create stop event
    cudaEventRecord(start, stream);                   // record start event
    myKernel<<<dimG, dimB, size_smem, stream>>>(/* parameter list */);  // launch kernel

    cudaEventRecord(stop, stream);                    // record stop event
    cudaEventSynchronize(stop);                       // wait until the device has reached the stop event
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop);  // elapsed time between the two events (in ms)


The kernel's execution time is now stored in the variable elapsedTime, and with this value we can evaluate the performance of the algorithm. Here is an example of how to use this timing facility.

The previous example was tiny, with only 5 elements; the workload is too small to time reliably. So we enlarge the array to 1024 elements and repeat the computation 1000 times, making the measurement less susceptible to random disturbances. We use this example to compare the performance of thread parallelism and block parallelism. The code is as follows:

#include "cuda_runtime.h" #include "device_launch_parameters.h" #include <stdio.h> cudaerror_t addwithcuda (int *
c, const int *a, const int *B, size_t size);
    __global__ void addkernel_blk (int *c, const int *a, const int *b) {int i = blockidx.x;
C[i] = a[i]+ B[i];
    } __global__ void Addkernel_thd (int *c, const int *a, const int *b) {int i = threadidx.x;
C[i] = a[i]+ B[i];
    } int main () {const int arraySize = 1024;
    int A[arraysize] = {0};
	int B[arraysize] = {0};
		for (int i = 0;i<arraysize;i++) {A[i] = i;
	B[i] = arraysize-i;
    } int C[arraysize] = {0};
    Add vectors in parallel.
	cudaerror_t Cudastatus;
	int num = 0;
	Cudadeviceprop prop;
	Cudastatus = Cudagetdevicecount (&num);
	for (int i = 0;i<num;i++) {cudagetdeviceproperties (&prop,i);
    } Cudastatus = Addwithcuda (c, a, b, arraySize);
        if (cudastatus! = cudasuccess) {fprintf (stderr, "Addwithcuda failed!");
    return 1; }//Cudathreadexit muSt is called before exiting in order for profiling and//tracing tools such as Nsight and Visual Profiler to show com
    Plete traces.
    Cudastatus = Cudathreadexit ();
        if (cudastatus! = cudasuccess) {fprintf (stderr, "Cudathreadexit failed!");
    return 1;
		} for (int i = 0;i<arraysize;i++) {if (C[i]! = (A[i]+b[i])) {printf ("Error in%d\n", I);
}} return 0;
}//Helper function for using the CUDA to add vectors in parallel.
    cudaerror_t Addwithcuda (int *c, const int *a, const int *B, size_t size) {int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;

    cudaerror_t Cudastatus;
    Choose which GPU to run on, the change this on a MULTI-GPU system.
    Cudastatus = Cudasetdevice (0);  if (cudastatus! = cudasuccess) {fprintf (stderr, "Cudasetdevice failed!
        Do you have a cuda-capable GPU installed? ");
    Goto Error;
    }//Allocate GPU buffers for three vectors (both input, one output). Cudastatus = Cudamalloc((void**) &dev_c, size * sizeof (int));
        if (cudastatus! = cudasuccess) {fprintf (stderr, "Cudamalloc failed!");
    Goto Error;
    } Cudastatus = Cudamalloc ((void**) &dev_a, size * sizeof (int));
        if (cudastatus! = cudasuccess) {fprintf (stderr, "Cudamalloc failed!");
    Goto Error;
    } Cudastatus = Cudamalloc ((void**) &dev_b, size * sizeof (int));
        if (cudastatus! = cudasuccess) {fprintf (stderr, "Cudamalloc failed!");
    Goto Error;
    }//Copy input vectors from the host memory to GPU buffers.
    Cudastatus = cudamemcpy (Dev_a, a, size * sizeof (int), cudamemcpyhosttodevice);
        if (cudastatus! = cudasuccess) {fprintf (stderr, "cudamemcpy failed!");
    Goto Error;
    } cudastatus = cudamemcpy (Dev_b, b, size * sizeof (int), cudamemcpyhosttodevice);
        if (cudastatus! = cudasuccess) {fprintf (stderr, "cudamemcpy failed!");
    Goto Error;
	} cudaevent_t Start,stop; CudaeventCreate (&start);
	Cudaeventcreate (&stop);
	Cudaeventrecord (start,0);
		for (int i = 0;i<1000;i++) {//Addkernel_blk<<<size,1>>> (Dev_c, dev_a, Dev_b);
	Addkernel_thd<<<1,size>>> (Dev_c, dev_a, Dev_b);
	} cudaeventrecord (stop,0);
	Cudaeventsynchronize (stop);
	FLOAT TM;
	Cudaeventelapsedtime (&tm,start,stop);
    printf ("GPU Elapsed time:%.6f ms.\n", TM);
    Cudathreadsynchronize waits for the kernel to finish, and returns//any errors encountered during the launch.
    Cudastatus = Cudathreadsynchronize (); if (cudastatus! = cudasuccess) {fprintf (stderr, "Cudathreadsynchronize returned error code%d after launching a
        Ddkernel!\n ", cudastatus);
    Goto Error;
    }//Copy output vector from the GPU buffer to host memory.
    Cudastatus = cudamemcpy (c, dev_c, size * sizeof (int), cudamemcpydevicetohost);
        if (cudastatus! = cudasuccess) {fprintf (stderr, "cudamemcpy failed!");
    Goto Error; } ErRor:cudafree (Dev_c);
    Cudafree (dev_a);    
    Cudafree (Dev_b);
return cudastatus;
 }
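If you build from the command line rather than inside Visual Studio, the source can be compiled with nvcc (the file name vectoradd.cu here is just a stand-in for wherever you saved the code):

    nvcc vectoradd.cu -o vectoradd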


addKernel_blk performs the vector addition with block parallelism, while addKernel_thd performs it with thread parallelism. Running each version separately gives the results shown in the figures below:

Thread parallelism:

Block parallelism:

The performance differs by a factor of nearly 16. Therefore, when choosing a parallelization strategy: if the problem size is not very large, thread parallelism is more appropriate; a large problem should be divided across multiple thread blocks, and each block should not contain too few threads. A block with only 1 thread, as in this article's addKernel_blk, is a huge waste of hardware resources. An ideal scheme is to decompose the problem into N thread blocks of, say, 512 threads each, which is usually far more efficient than pure thread parallelism or pure block parallelism alone (see the sketch below). This is also the essence of CUDA programming.
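A minimal sketch of that combined scheme (the kernel name addKernel_grid, the extra length parameter, and the block size of 512 are illustrative choices, not part of the example above):

    // Combined block + thread parallelism: one thread per element,
    // 512 threads per block (an illustrative choice).
    __global__ void addKernel_grid(int *c, const int *a, const int *b, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
        if (i < n)                                      // guard the tail block
            c[i] = a[i] + b[i];
    }

    // Launch with enough 512-thread blocks to cover n elements:
    // addKernel_grid<<<(n + 511) / 512, 512>>>(dev_c, dev_a, dev_b, n);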

The above method of analyzing performance is rough: it tells us only the approximate total running time, with no deeper insight into how long each part of the device code takes. So we face a question: if the code is to be optimized, which part should we optimize? Should we adjust the number of threads, or use shared memory instead? The best tool for answering such questions is Visual Profiler. The following is excerpted from "CUDA_Profiler_Users_Guide":

"Visual Profiler is a graphical profiling tool that shows the CPU and GPU activity in your application and leverages the analytics engine to help you find opportunities for optimization." ”

In fact, in addition to the visual interface, NVIDIA provides a command-line profiler: nvprof. For beginners the graphical approach is easier to use, so this section uses Visual Profiler.
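For reference, the command-line profiler simply wraps the application's own command line (reusing the hypothetical executable name from the build note above):

    nvprof ./vectoradd

When the application exits, nvprof prints a summary of the time spent in each kernel and each memory copy.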

Open Visual Profiler, which can be found in the CUDA Toolkit's program menu. The main interface is as follows:

We click File->new Session and the New dialog box appears, as shown in the following image:

In the File field, enter the EXE of the application we want to profile; the remaining fields can be left empty (if the program needs command-line arguments, fill them in on the third line). Click Next to see the following image:

The first row sets a time-out for the application's execution and can be left blank. Check the next three boxes, which enable profiling of the application, analysis of concurrent kernels, and running the profiler right away.

Click Finish; the profiler runs our application and collects performance data.

In the figure above, the CPU and GPU rows show the hardware and the work executed on it; clicking a segment highlights that portion of the timeline for easier observation, and the pane on the right shows its run-time details. As the timeline shows, cudaMalloc occupies a large share of the total time. Below, the analysis engine flags several opportunities for performance improvement:

- Low compute utilization: computation accounts for only 1.8% of the total time. No surprise, since element-wise addition has inherently low arithmetic complexity.
- Low memory copy/compute overlap: no overlap at all; the pattern is strictly copy, then compute, then copy back.
- Small memory copies: the amount of data per transfer is too small, like buying a single diary on Taobao and paying more for shipping than for the goods.
- Low memory copy throughput: only 1.55 GB/s.

These hints are very helpful for further optimizing the program; the stream sketch below illustrates what the second point is about.
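A rough sketch of copy/compute overlap using two CUDA streams, as a preview of the optimization topic rather than part of the example above. It assumes the host buffers h_a, h_b, h_c were allocated with cudaMallocHost (asynchronous copies require page-locked memory) and that the device supports overlap (the deviceOverlap field of cudaDeviceProp):

    // Split the work in half and pipeline each half through its own stream,
    // so the copies of one half can overlap the computation of the other.
    cudaStream_t stream[2];
    for (int s = 0; s < 2; s++)
        cudaStreamCreate(&stream[s]);

    int half = arraySize / 2;
    for (int s = 0; s < 2; s++)
    {
        int off = s * half;
        cudaMemcpyAsync(dev_a + off, h_a + off, half * sizeof(int),
                        cudaMemcpyHostToDevice, stream[s]);
        cudaMemcpyAsync(dev_b + off, h_b + off, half * sizeof(int),
                        cudaMemcpyHostToDevice, stream[s]);
        addKernel_thd<<<1, half, 0, stream[s]>>>(dev_c + off, dev_a + off, dev_b + off);
        cudaMemcpyAsync(h_c + off, dev_c + off, half * sizeof(int),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();  // wait for both streams to drain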

Now click the Details tab, right next to the Analysis window. The results are as follows:

This window shows the execution time of each kernel launch, along with parameters such as the grid size, thread block size, number of registers used, static and dynamic shared memory sizes, and the execution of the memory copy functions. This is a more precise way of measuring time than the cudaEvent functions above: you see the execution time of each step directly, down to ns resolution.

Next to Details there is a Console tab; click it to take a look.

This is actually the command-line window, showing the program's output. Notice that with profiling enabled, the total execution time becomes longer (the original thread-parallel version of the program runs in only about 4 ms). This is a kind of "uncertainty principle" at work: if we want to measure the fine-grained times precisely, the total time will no longer be accurate; if we want the total time to be accurate, the fine-grained detail must be given up.

The remaining Settings tab holds the session parameters we configured when creating the session, and will not be covered in detail.

After this section, you should have some ideas about where CUDA performance can be improved; in the next section we will discuss how to optimize a CUDA program.
