CPU and GPU Implementations of the Julia Set
The main objective is to learn how to write CUDA programs by contrasting a CPU version and a GPU version; the Julia set algorithm itself has a certain difficulty, but it is not the focus. Since the GPU program is also an image program, OpenCV is used by default to display the results. First, CPU implementation (julia_cpu.cpp). julia_cpu uses the CPU to compute the Julia set:
#include"StdAfx.h"
#include<iostream>
#include"OPENCV2/CORE/CORE.HPP"
#include"OPENCV2/HIGHGUI/HIGHGUI.HPP"
#include"OPENCV2/IMGPROC/IMGPROC.HPP"
using namespaceStd
usingnamespaceCv
#DefineDIM 512
structCucomplex
{
floatR
floatI
Cucomplex (floatAfloatb): R (a), I (b) {}
floatMagnitude2 (void){returnR*r+i*i;}
Cucomplexoperator*(Constcucomplex& a)
{
returnCucomplex (R*A.R-I*A.I,I*A.R+R*A.I);
}
Cucomplexoperator+(Constcucomplex& a)
{
returnCucomplex (R+A.R,I+A.I);
}
};
int julia(int x, int y)
{
    const float scale = 1.5;
    float jx = scale * (float)(DIM / 2 - x) / (DIM / 2);
    float jy = scale * (float)(DIM / 2 - y) / (DIM / 2);
    cuComplex c(-0.8, 0.156);
    cuComplex a(jx, jy);
    for (int i = 0; i < 200; i++)
    {
        a = a * a + c;                 // iterate z = z^2 + c
        if (a.magnitude2() > 1000)
        {
            return 0;                  // diverged: not in the set
        }
    }
    return 1;                          // stayed bounded: in the set
}
int _tmain(int argc, _TCHAR* argv[])
{
    Mat src = Mat(DIM, DIM, CV_8UC3); // create the canvas
    for (int x = 0; x < src.rows; x++)
    {
        for (int y = 0; y < src.cols; y++)
        {
            for (int c = 0; c < 3; c++)
            {
                src.at<Vec3b>(x, y)[c] = julia(x, y) * 255;
            }
        }
    }
    imshow("src", src);
    waitKey();
    return 0;
}
The implementation here mainly illustrates the Julia algorithm, which is itself an iterated recurrence with a certain computational cost per pixel.
Second, GPU implementation. To gain a deeper understanding of the technique, I ran a series of experiments. It is important to note that GPU compilation is very slow and I found no way to speed it up. The other troublesome part is reading matrix data in and out; because material on this is still scarce, many things remain unclear.
The first experiment links CUDA and OpenCV together (test1.cu below). CUDA mainly does mathematical operations; it has no inherent connection to OpenCV. In general the computation itself runs in CUDA, while OpenCV performs the related conversions and displays the results. The function here reads a monochrome image and inverts all of its pixels. The code was written by adapting an existing template and adjusting parameters step by step against known data, which is the fastest approach and also keeps errors under control. Note that inside a CUDA kernel you cannot call any OpenCV functions. For now I can only achieve this much; how to introduce multidimensional arrays still needs more research. The main difficulty is array handling: at the moment I only manage one-dimensional arrays, and multidimensional ones overflowed as soon as I tried them, so everything is flattened as in the sketch below.
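As a minimal sketch of that flattening workaround (my own, not from the original template): a cv::Mat already stores pixels row-major, so a continuous single-channel Mat can go to the device in a single cudaMemcpy without any element-by-element loop. The name dev_p is mine, and this assumes src.isContinuous() holds:
unsigned char* dev_p;
checkCudaErrors(cudaMalloc((void**)&dev_p, N * N));
// Mat is row-major, so one flat copy replaces the per-element loop below:
checkCudaErrors(cudaMemcpy(dev_p, src.data, N * N, cudaMemcpyHostToDevice));
// Inside a kernel launched with dim3 grid(N, N), the same convention recovers
// the flat index of pixel (x, y):
//   int offset = x + y * gridDim.x;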
1) CUDA and OpenCV linked together (test1.cu)
#include"StdAfx.h"
#include<iostream>
#include"OPENCV2/CORE/CORE.HPP"
#include"OPENCV2/HIGHGUI/HIGHGUI.HPP"
#include"OPENCV2/IMGPROC/IMGPROC.HPP"
#include<stdio.h>
#include<assert.h>
#include<cuda_runtime.h>
#include#includeusingnamespaceStd
usingnamespaceCv
#DefineN 250
// test1's kernel
__global__ void test1Kernel(int* t)
{
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + y * gridDim.x;
    t[offset] = 255 - t[offset];   // invert the pixel
}
int main(void)
{
    // Step 0. Data and memory initialization
    Mat src = imread("opencv-logo.png", 0);   // read as grayscale
    resize(src, src, Size(N, N));
    int* dev_t;
    int t[N * N];
    Mat dst = Mat(N, N, CV_8UC3);
    for (int i = 0; i < N * N; i++)
    {
        t[i] = (int)src.at<uchar>(i / N, i % N);   // flatten row by row (uchar, not char, for CV_8U)
    }
    checkCudaErrors(cudaMalloc((void**)&dev_t, sizeof(int) * N * N));
    // Step 1. Copy data from the CPU to the GPU
    checkCudaErrors(cudaMemcpy(dev_t, t, sizeof(int) * N * N, cudaMemcpyHostToDevice));
    // Step 2. GPU computation
    dim3 grid(N, N);
    test1Kernel<<<grid, 1>>>(dev_t);
    // Step 3. Copy data from the GPU back to the CPU
    checkCudaErrors(cudaMemcpy(t, dev_t, sizeof(int) * N * N, cudaMemcpyDeviceToHost));
    // Step 4. Display the result
    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < N; j++)
        {
            int offset = i * N + j;
            for (int c = 0; c < 3; c++)
            {
                dst.at<Vec3b>(i, j)[c] = t[offset];
            }
        }
    }
    // Step 5. Free resources
    checkCudaErrors(cudaFree(dev_t));
    imshow("dst", dst);
    waitKey();
    return 0;
}
2) CUDA computes Fibonacci numbers, with some thoughts on implementing a CNN. Is CUDA suitable for Fibonacci? As with Julia, where each point is independent, a problem fits if it can be split into independent blocks. A single Fibonacci computation therefore does not parallelize, but computing a whole array of terms does, and practicing that parallel way of thinking is valuable; a sketch follows below. The experiment also showed that recursion did not work on the device here, which is worth keeping in mind when designing later computations. Parallel design is never a simple problem: it has a very steep learning curve, demands real experience, and has a very large market. A CNN, however, really is a typical fit. It needs no serial dependency: after a large number of parallel evaluations you simply choose the best parameters, so a CNN can serve as a typical combination of the image domain and CUDA.
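A minimal sketch of the "array of Fibonacci terms" idea: each thread computes its own term iteratively instead of recursively, so every element is independent. The kernel name and launch shape here are my own, not from the original experiment:
// Sketch only: one block per term, iterative to avoid device recursion.
__global__ void fibKernel(unsigned long long* out, int n)
{
    int idx = blockIdx.x;
    if (idx >= n) return;
    unsigned long long a = 0, b = 1;
    for (int i = 0; i < idx; i++)   // after idx steps, a == fib(idx)
    {
        unsigned long long next = a + b;
        a = b;
        b = next;
    }
    out[idx] = a;
}
// Launch, e.g.: fibKernel<<<n, 1>>>(dev_out, n);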
3) CUDA implements Julia (test3.cu). Building on the previous experiments, this went very smoothly.
#include"StdAfx.h"
#include<iostream>
#include"OPENCV2/CORE/CORE.HPP"
#include"OPENCV2/HIGHGUI/HIGHGUI.HPP"
#include"OPENCV2/IMGPROC/IMGPROC.HPP"
#include<stdio.h>
#include<assert.h>
#include<cuda_runtime.h>
#include#includeusingnamespaceStd
usingnamespaceCv
#DefineN 250
struct cuComplex
{
    float r;
    float i;
    __device__ cuComplex(float a, float b) : r(a), i(b) {}
    __device__ float magnitude2(void)
    {
        return r * r + i * i;
    }
    __device__ cuComplex operator*(const cuComplex& a)
    {
        return cuComplex(r * a.r - i * a.i, i * a.r + r * a.i);
    }
    __device__ cuComplex operator+(const cuComplex& a)
    {
        return cuComplex(r + a.r, i + a.i);
    }
};
__device__ int julia(int x, int y)
{
    const float scale = 1.5;
    float jx = scale * (float)(N / 2 - x) / (N / 2);
    float jy = scale * (float)(N / 2 - y) / (N / 2);
    cuComplex c(-0.8, 0.156);
    cuComplex a(jx, jy);
    for (int i = 0; i < 200; i++)
    {
        a = a * a + c;
        if (a.magnitude2() > 1000)
        {
            return 0;
        }
    }
    return 1;
}
// Leftover from the Fibonacci experiment: this recursive form did not work on the device.
__device__ int fblx(int offset)
{
    if (offset == 0 || offset == 1)
    {
        return offset;
    }
    else
    {
        return fblx(offset - 1) + fblx(offset - 2);
    }
}
// test3's kernel
__global__ void juliaKernel(int* t)
{
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + y * gridDim.x;
    int juliaValue = julia(x, y);
    t[offset] = juliaValue * 255;
}
int main(void)
{
    // Step 0. Data and memory initialization
    int* dev_t;
    int t[N * N];
    Mat dst = Mat(N, N, CV_8UC3);
    for (int i = 0; i < N * N; i++)
    {
        t[i] = 0;
    }
    checkCudaErrors(cudaMalloc((void**)&dev_t, sizeof(int) * N * N));
    // Step 1. Copy data from the CPU to the GPU
    checkCudaErrors(cudaMemcpy(dev_t, t, sizeof(int) * N * N, cudaMemcpyHostToDevice));
    // Step 2. GPU computation
    dim3 grid(N, N);
    juliaKernel<<<grid, 1>>>(dev_t);
    // Step 3. Copy data from the GPU back to the CPU
    checkCudaErrors(cudaMemcpy(t, dev_t, sizeof(int) * N * N, cudaMemcpyDeviceToHost));
    // Step 4. Display the result
    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < N; j++)
        {
            int offset = i * N + j;
            printf("%d is %d\n", offset, t[offset]);
            for (int c = 0; c < 3; c++)
            {
                dst.at<Vec3b>(i, j)[c] = t[offset];
            }
        }
    }
    // Step 5. Free resources
    checkCudaErrors(cudaFree(dev_t));
    imshow("dst", dst);
    waitKey();
    return 0;
}
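One natural next step when adjusting parameters on this template: the kernels above launch one thread per block (<<<grid, 1>>>), which leaves most of each warp idle. A hedged sketch of the same Julia computation with a 16x16 thread block, my own variation rather than anything from the original code, reusing the julia() device function and N above:
// Sketch: same per-pixel work, but many threads per block.
__global__ void juliaKernelTiled(int* t)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= N || y >= N) return;   // guard the edge when N is not a multiple of 16
    int offset = x + y * N;
    t[offset] = julia(x, y) * 255;
}
// Launch, e.g.:
//   dim3 threads(16, 16);
//   dim3 blocks((N + 15) / 16, (N + 15) / 16);
//   juliaKernelTiled<<<blocks, threads>>>(dev_t);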
Third, summary. CUDA programming is a new field. Although the documentation claims it is not complicated, applying it at scale cannot avoid complexity. So start from the existing examples and first get things running; then think about merging them and forming your own tools, which is where the productivity lies. I believe that before long I will be able to use CUDA's computing power to tackle and solve problems I could not handle before. I wish you success, and comments are welcome.