Learning the CUDA libraries through NVIDIA's matrixMul example.
Skipping the boilerplate; let's go straight to the core code.
This example implements matrix multiplication: C = A * B.
// Use a larger block size for Fermi and above
int block_size = 32;
// Original:
dim3 dimsA(5 * 2 * block_size, 5 * 2 * block_size, 1);
dim3 dimsB(5 * 4 * block_size, 5 * 2 * block_size, 1);
// Reduce sizes to avoid running out of memory:
// dim3 dimsA(32, 32, 1);
// dim3 dimsB(32, 32, 1);
This defines the size of the matrices. dim3 is a built-in CUDA structure with three fields, x, y, and z, representing width, height, and depth:
struct __device_builtin__ dim3
{
unsigned int x, y, z;
};
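Incidentally, dim3's constructors default any unspecified component to 1, so the trailing 1 in dimsA and dimsB could be omitted. For example:

dim3 dimsA(320, 320);   // equivalent to dim3 dimsA(320, 320, 1)
dim3 threads(32, 32);   // z defaults to 1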
Here z is 1: the matrix is a flat 2-D plane, so the z component can be ignored.
Matrix A: width 5*2*32 = 320, height 5*2*32 = 320.
Matrix B: width 5*4*32 = 640, height 5*2*32 = 320.
The commented-out lines above show how to shrink the width and height to avoid running out of memory.
Then we call:
int matrix_result = matrixMultiply(argc, argv, block_size, dimsA, dimsB);
Now for the calculation itself. Inside matrixMultiply, the code proceeds as follows:
unsigned int size_A = dimsA.x * dimsA.y;
unsigned int mem_size_A = sizeof(float) * size_A;
float *h_A = (float *)malloc(mem_size_A);
unsigned int size_B = dimsB.x * dimsB.y;
unsigned int mem_size_B = sizeof(float) * size_B;
float *h_B = (float *)malloc(mem_size_B);
// Initialize host memory
const float valB = 0.01f;
constantInit(h_A, size_A, 1.0f);
constantInit(h_B, size_B, valB);
size_A and size_B are the number of elements in matrices A and B respectively.
mem_size_A and mem_size_B are the number of bytes each matrix requires, since every element is a float.
h_A and h_B are the starting addresses of the allocated host memory.
The data for A and B is then initialized: every element of A is set to 1.0f and every element of B to 0.01f (constantInit carries out the assignment).
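constantInit is a small helper in the sample; a minimal sketch consistent with what it does:

void constantInit(float *data, int size, float val)
{
    for (int i = 0; i < size; ++i)
    {
        data[i] = val;  // fill every element with the same constant
    }
}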
// Allocate device memory
CUdeviceptr d_A, d_B, d_C;
char *ptx, *kernel_file;
size_t ptxSize;
These variables are defined for later use.
kernel_file = sdkFindFilePath("matrixMul_kernel.cu", argv[0]);
compileFileToPTX(kernel_file, 0, NULL, &ptx, &ptxSize);
CUmodule module = loadPTX(ptx, argc, argv);
sdkFindFilePath locates the .cu file. A .cu file is ordinary C-style syntax, only the suffix differs; it is where the algorithm itself is implemented. compileFileToPTX then compiles the .cu file into code the GPU can understand (PTX), essentially a translation step, and loadPTX loads the result as a module.
argc and argv are needed because argv can specify which device to use: if you have several graphics cards, you may want to pick a particular one; otherwise the default is selected.
A note on compileFileToPTX and loadPTX: they are wrappers around the lower-level SDK, and you could also call the underlying API directly. Their implementations live in nvrtc_helper.h.
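For reference, here is roughly what those two helpers do under the hood, as a minimal sketch using the NVRTC and driver APIs directly (error checks omitted; `source` is assumed to hold the text of the .cu file):

#include <nvrtc.h>
#include <cuda.h>

nvrtcProgram prog;
nvrtcCreateProgram(&prog, source, "matrixMul_kernel.cu", 0, NULL, NULL);
nvrtcCompileProgram(prog, 0, NULL);               // compile the source to PTX

size_t ptxSize;
nvrtcGetPTXSize(prog, &ptxSize);
char *ptx = (char *)malloc(ptxSize);
nvrtcGetPTX(prog, ptx);                           // fetch the PTX text
nvrtcDestroyProgram(&prog);

CUmodule module;
cuModuleLoadDataEx(&module, ptx, 0, NULL, NULL);  // load the PTX as a module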
// Allocate host matrix C
dim3 dimsC(dimsB.x, dimsA.y, 1);
unsigned int mem_size_C = dimsC.x * dimsC.y * sizeof(float);
float *h_C = (float *)malloc(mem_size_C);
This allocates host memory for the result of the calculation. Since C = A * B, C takes its width from B and its height from A; nothing more needs saying here.
checkCudaErrors(cuMemAlloc(&d_A, mem_size_A));
checkCudaErrors(cuMemAlloc(&d_B, mem_size_B));
checkCudaErrors(cuMemAlloc(&d_C, mem_size_C));
// Copy host memory to device
checkCudaErrors(cuMemcpyHtoD(d_A, h_A, mem_size_A));
checkCudaErrors(cuMemcpyHtoD(d_B, h_B, mem_size_B));
cuMemAlloc allocates the device memory here, for matrices A and B and the result C. Be careful at this point: make sure there is enough video memory, and free anything you no longer need.
Next, the host data for A and B (h_A, h_B) is copied into device memory (d_A, d_B).
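If you are worried about running out of device memory, the driver API lets you query it before allocating; a small sketch (assumes the same headers and helpers as the surrounding code):

size_t freeMem, totalMem;
checkCudaErrors(cuMemGetInfo(&freeMem, &totalMem));
if (freeMem < mem_size_A + mem_size_B + mem_size_C)
{
    fprintf(stderr, "Not enough device memory\n");
    exit(EXIT_FAILURE);
}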
// Setup execution parameters
dim3 threads(block_size, block_size);
dim3 grid(dimsB.x / threads.x, dimsA.y / threads.y);
This defines the execution parameters in three dimensions (x, y, z):
threads(x = block_size, y = block_size, z = 1), i.e. threads(32, 32, 1);
grid(x = B's width / threads.x, y = A's height / threads.y, z = 1), i.e. grid(640/32, 320/32, 1) = (20, 10, 1).
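The plain integer division works here only because the matrix dimensions are exact multiples of block_size. For arbitrary sizes you would round the grid up (and bounds-check inside the kernel); the common pattern:

dim3 grid((dimsB.x + threads.x - 1) / threads.x,   // ceiling division
          (dimsA.y + threads.y - 1) / threads.y);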
CUfunction kernel_addr;
if (block_size == 16)
{
    checkCudaErrors(cuModuleGetFunction(&kernel_addr, module, "matrixMulCUDA_block16"));
}
else
{
    checkCudaErrors(cuModuleGetFunction(&kernel_addr, module, "matrixMulCUDA_block32"));
}
cuModuleGetFunction looks up the function in the compiled module by name and stores its handle in kernel_addr.
void matrixMulCUDA_block32(float *C, float *A, float *B, int wA, int wB)
This is the kernel that actually runs, defined in the .cu file; it is worth reading through once to get a feel for it.
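The real kernel tiles A and B through shared memory for efficiency. Purely for intuition, a naive version of the same computation might look like this (a hypothetical sketch, not the sample's code):

extern "C" __global__ void matrixMulNaive(float *C, float *A, float *B,
                                          int wA, int wB)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // row of C

    float sum = 0.0f;
    for (int k = 0; k < wA; ++k)
        sum += A[row * wA + k] * B[k * wB + col];     // dot product of a row of A and a column of B

    // No bounds check needed here: in this sample the grid exactly covers C.
    C[row * wB + col] = sum;
}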
void *arr[] = {(void *)&d_C, (void *)&d_A, (void *)&d_B, (void *)&dimsA.x, (void *)&dimsB.x};
This array holds the arguments the kernel needs, one pointer per parameter.
int nIter = 300;

for (int j = 0; j < nIter; j++)
{
    checkCudaErrors(cuLaunchKernel(kernel_addr,
                                   grid.x, grid.y, grid.z,          /* grid dim */
                                   threads.x, threads.y, threads.z, /* block dim */
                                   0, 0,                            /* shared mem, stream */
                                   &arr[0],                         /* arguments */
                                   0));
    checkCudaErrors(cuCtxSynchronize());
}
This runs the actual computation; the kernel is launched 300 times so that performance can be averaged over many iterations.
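A loop like this is typically bracketed with driver-API events for timing; a sketch (start/stop are names chosen here for illustration):

CUevent start, stop;
checkCudaErrors(cuEventCreate(&start, CU_EVENT_DEFAULT));
checkCudaErrors(cuEventCreate(&stop, CU_EVENT_DEFAULT));
checkCudaErrors(cuEventRecord(start, NULL));
// ... the nIter launch loop from above ...
checkCudaErrors(cuEventRecord(stop, NULL));
checkCudaErrors(cuEventSynchronize(stop));
float msecTotal = 0.0f;
checkCudaErrors(cuEventElapsedTime(&msecTotal, start, stop));  // total ms for nIter launches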
Let's take a closer look at the cuLaunchKernel function:
CUresult CUDAAPI cuLaunchKernel(CUfunction f,
                                unsigned int gridDimX,
                                unsigned int gridDimY,
                                unsigned int gridDimZ,
                                unsigned int blockDimX,
                                unsigned int blockDimY,
                                unsigned int blockDimZ,
                                unsigned int sharedMemBytes,
                                CUstream hStream,
                                void **kernelParams,
                                void **extra);
This launches the kernel f on a grid of gridDimX x gridDimY x gridDimZ thread blocks, where each block contains blockDimX x blockDimY x blockDimZ threads.
sharedMemBytes specifies how much dynamic shared memory is available to each block.
The parameters of f can be passed in one of two forms:
1. Through kernelParams. If f has N parameters, kernelParams is a pointer to an array of N pointers, kernelParams[0] through kernelParams[N-1]. Each entry must point to a region of memory from which the actual parameter will be copied; in other words, the launch needs an address, not a value. For example, if f(int x) takes an int x, then kernelParams[0] = &x rather than x itself; pay particular attention to this (see the snippet below). The number, sizes, and offsets of the parameters do not need to be specified, because that information is taken directly from the kernel's image: the compiled kernel carries metadata describing its own parameter layout, which the driver reads.
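A minimal illustration of form 1, with a hypothetical kernel f taking a single int:

int x = 42;
void *params[] = { (void *)&x };   // pass the address of x, not its value
checkCudaErrors(cuLaunchKernel(f, 1, 1, 1,   /* grid dim */
                               1, 1, 1,      /* block dim */
                               0, NULL,      /* shared mem, stream */
                               params, NULL));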
2. Alternatively, the parameters can be packed by the program into a single buffer and passed through the extra parameter; in that case the size and layout of each parameter must be managed by hand. The main purpose of extra is to let cuLaunchKernel accept additional, less commonly used options: it is an array of name/value pairs that must be terminated with NULL or CU_LAUNCH_PARAM_END.
For example:
void *config[] = {
    CU_LAUNCH_PARAM_BUFFER_POINTER, argBuffer,
    CU_LAUNCH_PARAM_BUFFER_SIZE, &argBufferSize,
    CU_LAUNCH_PARAM_END
};
status = cuLaunchKernel(f, gx, gy, gz, bx, by, bz, sh, s, NULL, config);
Only one of the two forms may be used; specifying kernel arguments through both kernelParams and extra causes the launch to fail with an error.
The documentation notes that cuLaunchKernel sets the same persistent function state as the following sequence of deprecated calls:
 * Calling ::cuLaunchKernel() sets persistent function state that is
 * the same as function state set through the following deprecated APIs:
 *   ::cuFuncSetBlockShape(),
 *   ::cuFuncSetSharedSize(),
 *   ::cuParamSetSize(),
 *   ::cuParamSeti(),
 *   ::cuParamSetf(),
 *   ::cuParamSetv().
When cuLaunchKernel is called, any block shape, parameter information, or shared-memory size previously set through those APIs is overwritten.
 * \param f              - Kernel to launch
 * \param gridDimX       - Width of grid in blocks
 * \param gridDimY       - Height of grid in blocks
 * \param gridDimZ       - Depth of grid in blocks
 * \param blockDimX      - X dimension of each thread block
 * \param blockDimY      - Y dimension of each thread block
 * \param blockDimZ      - Z dimension of each thread block
 * \param sharedMemBytes - Dynamic shared-memory size per thread block in bytes
 * \param hStream        - Stream identifier
 * \param kernelParams   - Array of pointers to kernel parameters
 * \param extra          - Extra options
// Copy result from device to host
checkCudaErrors(cuMemcpyDtoH(h_C, d_C, mem_size_C));
This copies the result straight back from video memory into host RAM.
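Since every element of A is 1.0f and every element of B is 0.01f, every element of C should come out to dimsA.x * valB = 320 * 0.01f = 3.2f. A simplified correctness check along those lines (the actual sample uses a relative-error test):

bool correct = true;
for (unsigned int i = 0; i < dimsC.x * dimsC.y; i++)
{
    if (fabs(h_C[i] - dimsA.x * valB) > 1e-5)  // expected value: 3.2
    {
        printf("Error at element %u: %f\n", i, h_C[i]);
        correct = false;
        break;
    }
}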
free(h_A);
free(h_B);
free(h_C);
checkCudaErrors(cuMemFree(d_A));
checkCudaErrors(cuMemFree(d_B));
checkCudaErrors(cuMemFree(d_C));
And finally, the resources are released.