Method One: Using the GPU module provided by OpenCV
OpenCV already provides many GPU-accelerated functions, and its GPU module can be used to accelerate most common image-processing operations.
For basic usage, see: http://www.cnblogs.com/dwdxdy/p/3244508.html
The advantage of this method is its simplicity: GpuMat manages the data transfer between CPU and GPU, so you do not need to set kernel launch parameters and can focus only on the processing logic.
The disadvantage is being limited by the development and update pace of the OpenCV library: when the application needs a custom operation for which OpenCV provides no corresponding function, it is hard to meet the requirement, and you must implement a parallel version of the custom operation yourself. In addition, for some special needs the parallel functions OpenCV does provide are not optimally tuned, and in a specific application further optimization may be needed to improve performance.
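As a minimal sketch of this method, assuming OpenCV 2.x built with CUDA support (the choice of gpu::cvtColor is only illustrative):

// method one sketch: GpuMat handles the CPU <-> GPU transfers
#include <opencv2/opencv.hpp>
#include <opencv2/gpu/gpu.hpp>
using namespace cv;

int main() {
    Mat src = imread("lena.jpg");
    gpu::GpuMat d_src, d_dst;
    d_src.upload(src);                          // CPU -> GPU
    gpu::cvtColor(d_src, d_dst, CV_BGR2GRAY);   // GPU-accelerated color conversion
    Mat dst;
    d_dst.download(dst);                        // GPU -> CPU
    imshow("gpu result", dst);
    waitKey(0);
    return 0;
}

Apart from upload and download, the code never touches the CUDA API directly.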
Method Two: Programming with the CUDA API directly
With the CUDA Runtime API or the CUDA Driver API, you implement the parallel-accelerated operations yourself. In the process you must manage the data transfers between CPU and GPU, the kernel launch parameters, and the optimization of the kernel functions.
The advantage is that the whole processing pipeline is under the user's control, so a wider range of parallel-accelerated operations can be implemented.
The disadvantage is that it is complex to use and requires writing more code, as well as familiarity with the relevant CUDA documentation and API. The following is a simple sample program:
// method two sample: the CUDA runtime API used directly
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <opencv2/opencv.hpp>
using namespace cv;

// simple error-checking macro (the original sample relied on a similar helper)
#define CUDA_SAFE_CALL(call) do { cudaError_t err = (call); \
    if (err != cudaSuccess) { fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err)); exit(1); } } while (0)

// kernel: swap the red and blue channels of a packed 3-channel image
__global__ void swap_rb_kernel(const uchar3* src, uchar3* dst, int width, int height) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (x < width && y < height) {
        uchar3 v = src[y * width + x];
        dst[y * width + x].x = v.z;
        dst[y * width + x].y = v.y;
        dst[y * width + x].z = v.x;
    }
}

// caller: sets the launch configuration and invokes the kernel
void swap_rb_caller(const uchar3* src, uchar3* dst, int width, int height) {
    dim3 block(32, 8);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    swap_rb_kernel<<<grid, block, 0>>>(src, dst, width, height);
    cudaThreadSynchronize();   // wait for the kernel to finish
}

int main() {
    Mat image = imread("lena.jpg");
    imshow("src", image);
    size_t memSize = image.cols * image.rows * sizeof(uchar3);
    uchar3* d_src = NULL;
    uchar3* d_dst = NULL;
    CUDA_SAFE_CALL(cudaMalloc((void**)&d_src, memSize));
    CUDA_SAFE_CALL(cudaMalloc((void**)&d_dst, memSize));
    CUDA_SAFE_CALL(cudaMemcpy(d_src, image.data, memSize, cudaMemcpyHostToDevice));
    swap_rb_caller(d_src, d_dst, image.cols, image.rows);
    CUDA_SAFE_CALL(cudaMemcpy(image.data, d_dst, memSize, cudaMemcpyDeviceToHost));
    imshow("gpu", image);
    waitKey(0);
    CUDA_SAFE_CALL(cudaFree(d_src));
    CUDA_SAFE_CALL(cudaFree(d_dst));
    return 0;
}
The code above uses cudaMalloc, cudaMemcpy, and cudaFree to manage the allocation, transfer, and release of device memory.
Note: if image.data contains padding bytes for row alignment (i.e., the image rows are not stored contiguously), the program above will not process the image correctly.
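One common way to handle padded rows is pitched device memory; the following is a hedged sketch (the kernel is omitted, and any kernel used with it must index rows by the returned pitch):

// handling padded host rows with pitched device memory (illustrative sketch)
#include <cuda_runtime.h>
#include <opencv2/opencv.hpp>
using namespace cv;

int main() {
    Mat image = imread("lena.jpg");
    uchar3* d_src = NULL;
    size_t d_pitch = 0;
    // device rows are padded to an alignment boundary; the actual row
    // stride in bytes is returned in d_pitch
    cudaMallocPitch((void**)&d_src, &d_pitch, image.cols * sizeof(uchar3), image.rows);
    // cudaMemcpy2D honors both the host row stride (image.step) and the
    // device pitch, so padded images are transferred correctly row by row
    cudaMemcpy2D(d_src, d_pitch, image.data, image.step,
                 image.cols * sizeof(uchar3), image.rows, cudaMemcpyHostToDevice);
    // ... launch a kernel that addresses row y at (uchar3*)((char*)d_src + y * d_pitch) ...
    cudaFree(d_src);
    return 0;
}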
Method Three: Using the interfaces OpenCV provides, combined with CUDA API programming
Some interfaces already provided by OpenCV handle the basic plumbing of CUDA programming, which simplifies development; you only customize the kernel functions (or extend the kernels OpenCV provides) for your own business needs. This makes full use of OpenCV's features while still meeting different business requirements; it is both easy to use and easy to extend. The following is a simple sample program:
// swap_rb.cu
#include <opencv2/core/cuda_devptrs.hpp>
using namespace cv;
using namespace cv::gpu;

// custom kernel: swap the red and blue channels
__global__ void swap_rb_kernel(const PtrStepSz<uchar3> src, PtrStep<uchar3> dst) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (x < src.cols && y < src.rows) {
        uchar3 v = src(y, x);
        dst(y, x) = make_uchar3(v.z, v.y, v.x);
    }
}

// calling function: sets the kernel's launch configuration
void swap_rb_caller(const PtrStepSz<uchar3>& src, PtrStep<uchar3> dst, cudaStream_t stream) {
    dim3 block(32, 8);
    dim3 grid((src.cols + block.x - 1) / block.x, (src.rows + block.y - 1) / block.y);
    swap_rb_kernel<<<grid, block, 0, stream>>>(src, dst);
    if (stream == 0)
        cudaDeviceSynchronize();
}
// swap_rb.cpp
#include <opencv2/gpu/gpu.hpp>
#include <opencv2/gpu/stream_accessor.hpp>
using namespace cv;
using namespace cv::gpu;

void swap_rb_caller(const PtrStepSz<uchar3>& src, PtrStep<uchar3> dst, cudaStream_t stream);

// entry function: validates input, then wraps the kernel's calling function
void swap_rb(const GpuMat& src, GpuMat& dst, Stream& stream = Stream::Null()) {
    CV_Assert(src.type() == CV_8UC3);
    dst.create(src.size(), src.type());
    cudaStream_t s = StreamAccessor::getStream(stream);
    swap_rb_caller(src, dst, s);
}
// main.cpp
#include <iostream>
#include <opencv2/opencv.hpp>
#include <opencv2/gpu/gpu.hpp>
using namespace cv;
using namespace cv::gpu;

void swap_rb(const GpuMat& src, GpuMat& dst, Stream& stream = Stream::Null());

int main() {
    Mat image = imread("lena.jpg");
    imshow("src", image);
    GpuMat gpuMat, output;
    gpuMat.upload(image);      // CPU -> GPU
    swap_rb(gpuMat, output);   // GPU processing
    output.download(image);    // GPU -> CPU
    imshow("gpu", image);
    waitKey(0);
    return 0;
}
The swap_rb.cu file defines the kernel function and its calling function; the calling function sets the kernel's launch parameters.
The swap_rb.cpp file defines the entry function of the parallel operation, i.e., the function the main program calls to perform it. It mainly wraps the kernel's calling function, validates the input parameters, and can select among different kernels according to the input parameters.
The main.cpp file is the main program, which handles data input, business processing, and data output.
Summary
Simplicity and controllability in programming are a trade-off: the more convenient the programming model, the less control it gives you. In practice, one should look for the balance point between simplicity and control and choose the method appropriate to the application's demands; in general, method three is recommended.