CUDA Learning Notes (1): Image Transpose with CUDA + OpenCV, Using Shared Memory to Optimize a CUDA Program

Last Update:2018-08-16 Source: Internet

Author: User

Tags visual studio 2010


Original article; if you reproduce it, please credit the source ...

I. Background of the problem

I recently had to prepare a learning and sharing report on CUDA. I wanted the report to include an example that uses CUDA for image processing and uses shared memory to avoid uncoalesced global memory accesses, thereby improving performance. However, I was unsure how a CUDA program should read an image. Online I found a blog post, "Second CUDA program: image stretching", whose code covers the image I/O part, but it needs the "cutil_inline.h" header (said to be a helper header the developers used when writing the sample routines). Unfortunately, "cutil.h", "cutil_inline.h" and related headers were removed starting with CUDA 5.0, and I have CUDA 6.5 installed, so I could not use that post's method for reading images.

Later I found an image processing sample among the CUDA samples, but to be honest it is rather complicated for a beginner like me, so I gave up on studying it. Someone also told me that CUDA has the NPP library, which supports working with images, but I do not know how to call it.

So I settled on another approach: read and display the images with OpenCV functions, and do the image processing in a CUDA kernel function. Since both OpenCV and CUDA development on Windows are done in Visual Studio, this approach is feasible.

II. Experimental process

1. Experimental platform: Visual Studio 2010, CUDA 6.5, OpenCV 2.4.9

2. OpenCV development environment configuration

To use OpenCV library functions in VC, the OpenCV development environment must be configured first. For installation and environment configuration, refer to the blog post "OpenCV introductory tutorial 1: installing OpenCV (OpenCV 3.0, OpenCV 2.4.8, OpenCV 2.4.9)".

3. Code

This article uses a CUDA + OpenCV environment to transpose an image. The image that is read in is transposed on both the CPU and the GPU, and the GPU implementation comes in two versions, one using global memory and one using shared memory. The experimental results show that the shared memory version avoids uncoalesced accesses and therefore runs faster.
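To make "uncoalesced" concrete, the following host-side C++ sketch counts how many aligned memory segments the first warp of a naive transpose would touch when it reads and when it writes. The 32-thread warp is standard, but the 128-byte segment size and the helper names (`segmentsTouched`, `naiveReadSegments`, `naiveWriteSegments`) are assumptions made for illustration; real transaction sizes depend on the GPU architecture.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Number of distinct aligned segments of `segmentBytes` bytes touched by one
// warp, given the byte address accessed by each lane. Fewer segments means
// the accesses coalesce better.
template <typename AddressOf>
std::size_t segmentsTouched(int warpSize, std::size_t segmentBytes, AddressOf addressOf) {
    assert(warpSize > 0 && segmentBytes > 0);
    std::set<std::size_t> segments;
    for (int lane = 0; lane < warpSize; ++lane) {
        segments.insert(addressOf(lane) / segmentBytes);
    }
    return segments.size();
}

// Reads of the first warp in a naive transpose: thread `tid` reads src[tid].
std::size_t naiveReadSegments() {
    return segmentsTouched(32, 128, [](int tid) {
        return static_cast<std::size_t>(tid);
    });
}

// Writes of the same warp: thread `tid` writes dst[j * height + i],
// where (i, j) is the source row and column of pixel `tid`.
std::size_t naiveWriteSegments(int width, int height) {
    return segmentsTouched(32, 128, [=](int tid) {
        int i = tid / width;
        int j = tid % width;
        return static_cast<std::size_t>(j) * height + i;
    });
}
```

For a 512 x 512 single-byte image, the 32 reads fall into one segment, while the 32 writes are 512 bytes apart and land in 32 different segments. Staging a tile in shared memory is one way to make both the global reads and the global writes run along rows.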

(1) The CPU, GPU global memory and GPU shared memory versions of the image transpose function are defined in the header file "imageTranspose.h":

#ifndef _IMAGETRANSPOSE_CU
#define _IMAGETRANSPOSE_CU

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <Windows.h>   // for timing
#include <time.h>

#define W              // block dimension (the value is missing in the source)
#define N 1024         // grid size

// Image data is processed directly in global memory
__global__ static void gpuImageTranspose_global(unsigned char *imageDataSrc, unsigned char *imageDataDst, int Width, int Height)
{
	int tid = threadIdx.x + blockDim.x * blockIdx.x;   // get the thread ID

	// Out of bounds: the thread may have no corresponding pixel
	if (tid >= Width * Height)
		return;

	int i, j;
	i = tid / Width;   // source row
	j = tid % Width;   // source column

	// Transpose
	imageDataDst[j * Height + i] = imageDataSrc[tid];
}

// Image data is staged in shared memory for processing
__global__ static void gpuImageTranspose_shared(unsigned char *imageDataSrc, unsigned char *imageDataDst, int Width, int Height)
{
	__shared__ unsigned char tile[W][W];   // shared memory that holds one tile of image data

	// Coordinates of the pixel this thread handles in the input matrix
	int x = threadIdx.x + blockIdx.x * W;
	int y = threadIdx.y + blockIdx.y * W;
	int index_in = x + y * Width;

	// This bounds check is critical, otherwise the output is wrong.
	// x and y are tested separately so that a thread outside the image neither
	// reads nor writes, and no thread returns before __syncthreads().
	if (x < Width && y < Height)
		tile[threadIdx.y][threadIdx.x] = imageDataSrc[index_in];   // global memory -> shared memory

	__syncthreads();   // thread synchronization

	// Coordinates of the pixel this thread handles in the output matrix
	x = threadIdx.x + blockIdx.y * W;
	y = threadIdx.y + blockIdx.x * W;
	int index_out = x + y * Height;

	// Shared memory -> global memory; swapping the tile coordinates completes the transpose
	if (x < Height && y < Width)
		imageDataDst[index_out] = tile[threadIdx.x][threadIdx.y];
}

// The CPU performs the image transpose
void cpuImageTranspose(unsigned char *imageDataSrc, unsigned char *imageDataDst, int Width, int Height)
{
	int i, j;

	if (imageDataSrc == NULL || imageDataDst == NULL || Width <= 0 || Height <= 0)
		return;

	// Traverse the image data to complete the transpose.
	// The source listing breaks off inside this loop; the body below mirrors
	// the indexing of the global memory kernel.
	for (i = 0; i < Height; i++)
		for (j = 0; j < Width; j++)
			imageDataDst[j * Height + i] = imageDataSrc[i * Width + j];
}

#endif
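The index arithmetic of the shared memory kernel can be checked without a GPU. The sketch below is a plain C++ emulation, not the kernel itself: each loop iteration over (`bx`, `by`) plays the role of one thread block, the first inner loop stands for the code before `__syncthreads()`, and the second for the code after it. `TILE` is a hypothetical stand-in for the block dimension W, and both function names are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int TILE = 4;  // illustrative tile width, standing in for W

// Reference transpose: dst[j * height + i] = src[i * width + j].
std::vector<unsigned char> naiveTranspose(const std::vector<unsigned char>& src, int width, int height) {
    std::vector<unsigned char> dst(src.size());
    for (int i = 0; i < height; ++i)
        for (int j = 0; j < width; ++j)
            dst[static_cast<std::size_t>(j) * height + i] = src[static_cast<std::size_t>(i) * width + j];
    return dst;
}

// Tile-by-tile transpose that follows the kernel's two phases.
std::vector<unsigned char> tiledTranspose(const std::vector<unsigned char>& src, int width, int height) {
    assert(width > 0 && height > 0);
    assert(src.size() == static_cast<std::size_t>(width) * height);
    std::vector<unsigned char> dst(src.size());
    const int blocksX = (width + TILE - 1) / TILE;
    const int blocksY = (height + TILE - 1) / TILE;
    for (int by = 0; by < blocksY; ++by) {
        for (int bx = 0; bx < blocksX; ++bx) {
            unsigned char tile[TILE][TILE] = {};
            // Phase 1: load the tile row by row from the input.
            for (int ty = 0; ty < TILE; ++ty)
                for (int tx = 0; tx < TILE; ++tx) {
                    int x = tx + bx * TILE, y = ty + by * TILE;
                    if (x < width && y < height)
                        tile[ty][tx] = src[static_cast<std::size_t>(y) * width + x];
                }
            // Phase 2: write row by row to the output, reading the tile transposed.
            for (int ty = 0; ty < TILE; ++ty)
                for (int tx = 0; tx < TILE; ++tx) {
                    int x = tx + by * TILE, y = ty + bx * TILE;  // output column, output row
                    if (x < height && y < width)
                        dst[static_cast<std::size_t>(y) * height + x] = tile[tx][ty];
                }
        }
    }
    return dst;
}
```

Comparing `tiledTranspose` with `naiveTranspose` on an image whose sides are not multiples of `TILE` exercises the per-coordinate bounds checks, which is where a single `index >= Width * Height` test would let edge threads through.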


(2) The main function is defined in the file "imageTranspose.cu". It calls the image transpose functions to process the image. The following listing transposes the image using the CPU and GPU global memory:


#include <cv.h>   // use OpenCV
#include ...


(3) To optimize the CUDA program by using GPU shared memory for the image transpose, the "imageTranspose.cu" file is modified accordingly:


int main()
{
	IplImage *imgSrc = cvLoadImage("Lena.jpg", CV_LOAD_IMAGE_GRAYSCALE);
	int Width = imgSrc->width;
	int Height = imgSrc->height;

	// Output images with width and height swapped
	IplImage *imgDst_gpu_shared = cvCreateImage(cvSize(Height, Width), IPL_DEPTH_8U, 1);
	IplImage *imgDst_cpu = cvCreateImage(cvSize(Height, Width), IPL_DEPTH_8U, 1);

	unsigned char *pSrcData = (unsigned char*)(imgSrc->imageData);
	unsigned char *pDstData_shared = (unsigned char*)(imgDst_gpu_shared->imageData);
	unsigned char *cDstData = (unsigned char*)(imgDst_cpu->imageData);

	// Allocate device memory for the source and destination image arrays
	unsigned char *device_imgDataSrc = NULL;
	unsigned char *device_imgDataDst_shared = NULL;
	cudaMalloc((void**)&device_imgDataSrc, sizeof(unsigned char) * Width * Height);
	cudaMalloc((void**)&device_imgDataDst_shared, sizeof(unsigned char) * Height * Width);

	// Copy the source image array to device memory
	cudaMemcpy(device_imgDataSrc, pSrcData, sizeof(unsigned char) * Width * Height, cudaMemcpyHostToDevice);

	// Launch parameters for the shared memory version
	dim3 dimGrid_shared(N / W, N / W);   // the maximum number of blocks allowed per grid is 65535
	dim3 dimBlock_shared(W, W);          // the number of threads per block (W * W) is limited by the device

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);
	cudaEventRecord(start, 0);
	gpuImageTranspose_shared<<<dimGrid_shared, dimBlock_shared>>>(device_imgDataSrc, device_imgDataDst_shared, imgSrc->width, imgSrc->height);
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(start);
	cudaEventSynchronize(stop);
	float gpuTime_shared = 0;
	cudaEventElapsedTime(&gpuTime_shared, start, stop);
	printf("gpu_time_shared = %f\n", gpuTime_shared);

	// Copy the result back to host memory
	cudaMemcpy(pDstData_shared, device_imgDataDst_shared, sizeof(unsigned char) * Width * Height, cudaMemcpyDeviceToHost);

	// CPU processing of the image
	clock_t t1 = clock();
	cpuImageTranspose(pSrcData, cDstData, imgSrc->width, imgSrc->height);
	clock_t t2 = clock();
	float time_cpu = 0;
	time_cpu = (float)(t2 - t1);
	printf("cpu_time = %f\n", time_cpu * 1000 / CLOCKS_PER_SEC);   // time unit: ms

	cvNamedWindow("SRC");
	cvShowImage("SRC", imgSrc);
	cvNamedWindow("dst_cpu");
	cvShowImage("dst_cpu", imgDst_cpu);
	cvNamedWindow("dst_gpu_shared");
	cvShowImage("dst_gpu_shared", imgDst_gpu_shared);
	cvWaitKey();

	cudaFree(device_imgDataSrc);
	cudaFree(device_imgDataDst_shared);
	cvDestroyAllWindows();
	cvReleaseImage(&imgSrc);
	cvReleaseImage(&imgDst_cpu);
	cvReleaseImage(&imgDst_gpu_shared);
	return 0;
}
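The listing above launches a fixed grid of N / W blocks per dimension regardless of the image size. A common alternative, sketched here in host-side C++ as an assumption rather than as the author's method, is to derive the grid from the image dimensions with a ceiling division. `GridSize`, `blocksFor` and `gridFor` are hypothetical names, and the tile width 16 in the usage note is only an example value.

```cpp
#include <cassert>

// Plain stand-in for the two grid dimensions that would be passed to dim3.
struct GridSize {
    int x;
    int y;
};

// Smallest number of `tile`-wide blocks that covers `extent` pixels.
int blocksFor(int extent, int tile) {
    assert(extent > 0 && tile > 0);
    return (extent + tile - 1) / tile;
}

// Grid that covers a width x height image with tile x tile blocks.
GridSize gridFor(int width, int height, int tile) {
    return GridSize{blocksFor(width, tile), blocksFor(height, tile)};
}
```

With a tile width of 16, a 512 x 512 image needs a 32 x 32 grid, and a 500 x 300 image needs 32 x 19 blocks. The last row and column of blocks then hang over the image edge, which is why the kernel still needs its bounds checks.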

4. Experimental results

(1) The classic test image "lena.jpg" is used as the input image:




(2) CPU and GPU global memory: results and timing comparison









(3) GPU shared memory: results and timing







5. Analysis of experimental results

Judging from the output images, the transposed results are correct. The running times rank as GPU shared memory < GPU global memory < CPU. What remains puzzling is that the CPU time differs between the two runs, and I do not know the reason.
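The article leaves the varying CPU time unexplained. Plausible contributors, offered here only as a guess, are the coarse resolution of `clock()` for a sub-millisecond workload, cache state, and OS scheduling. The C++ sketch below shows one way to get steadier numbers: run the transpose once as a warm-up, then time it several times with `std::chrono::steady_clock` and keep the best run. The function names and the repeat count are assumptions for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// CPU transpose with the same indexing as the article: dst[j * height + i] = src[i * width + j].
void cpuTranspose(const std::vector<unsigned char>& src, std::vector<unsigned char>& dst,
                  int width, int height) {
    for (int i = 0; i < height; ++i)
        for (int j = 0; j < width; ++j)
            dst[static_cast<std::size_t>(j) * height + i] = src[static_cast<std::size_t>(i) * width + j];
}

// Best wall-clock time in milliseconds over `repeats` timed runs.
// The result of the last run is left in `dst`.
double bestTimeMs(const std::vector<unsigned char>& src, std::vector<unsigned char>& dst,
                  int width, int height, int repeats) {
    assert(repeats > 0);
    assert(src.size() == static_cast<std::size_t>(width) * height);
    dst.resize(src.size());
    cpuTranspose(src, dst, width, height);  // warm-up run, not timed
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        cpuTranspose(src, dst, width, height);
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (best < 0.0 || ms < best) best = ms;
    }
    return best;
}
```

Reporting the minimum (or a median) over several runs makes two program executions easier to compare than a single `clock()` difference.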

