Improvement of the OpenCL-based image integral image algorithm


Complex algorithms are not necessarily inefficient, and simple algorithms are not necessarily cheap. Programming in the OpenCL environment differs in some ways from traditional CPU programming habits. The differences seem trivial, but details often decide success or failure, and on a many-core GPU these seemingly insignificant differences are greatly magnified, producing a huge performance gap between the same algorithm on the GPU and on the CPU.
I previously wrote an article, "OpenCL-based image integral image algorithm implementation", which introduces the basic principle of the integral image algorithm in OpenCL (readers unfamiliar with the integral image concept can refer to that article first) and provides kernel code based on that principle. But after two months of practical testing, the original algorithm, based on prefix sums plus a matrix transpose, proved very inefficient on the GPU.
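As a quick reminder of the concept: the integral image I of a source image S is defined by I(x, y) = sum of S(i, j) over all i <= x, j <= y. The snippet below is my own minimal serial CPU reference (the names are illustrative, not code from either article); it is useful as a ground-truth baseline when validating any GPU implementation:

```c
#include <assert.h>

/* Serial reference integral image:
   dst[y][x] = sum of src[j][i] for all i <= x, j <= y.
   Uses the recurrence I(x,y) = S(x,y) + I(x-1,y) + I(x,y-1) - I(x-1,y-1). */
void integral_image_ref(const unsigned char *src, float *dst, int width, int height)
{
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            float v = (float)src[y * width + x];
            if (x > 0)          v += dst[y * width + (x - 1)];
            if (y > 0)          v += dst[(y - 1) * width + x];
            if (x > 0 && y > 0) v -= dst[(y - 1) * width + (x - 1)];
            dst[y * width + x] = v;
        }
}
```

Whatever kernel pipeline is used, the GPU output must match this result (up to the chosen integral variant: plain sum, squared sum, or non-zero count).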
Why? Fundamentally, the previous algorithm does not follow the divide-and-conquer principle that parallel computing requires: each work-item processes a whole row of data at a time. It is simple to write, but slow in real execution.
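To make the cost concrete, here is a plain-C emulation of what a single work-item did in the old scheme (illustrative names, not the original kernel): the loop is a serial dependency chain of `width` additions, and with one work-item per row the GPU gets only `height` threads to fill thousands of cores.

```c
#include <assert.h>

/* Emulates one work-item of the old row-at-a-time approach: a full serial
   prefix sum over a single row. Nothing inside this loop can run in parallel. */
static void scan_row(const float *row, float *out, int width)
{
    float sum = 0;
    for (int x = 0; x < width; ++x) {
        sum += row[x];   /* each iteration depends on the previous one */
        out[x] = sum;
    }
}
```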
The CodeXL GPU performance counters recorded the following results for the original algorithm: the total execution time of a single integral image computation is around 1.6 ms.

Note: to improve efficiency, the kernel timed here is already an improved version of the code in the previous article: the prefix-sum computation and the matrix transpose are combined into a single kernel named prefix_sum_col_and_transpose. Without that improvement the algorithm is several times slower still.

So, borrowing ideas from the OpenCLIPP integral image algorithm, I rewrote my code. The new algorithm works like this:
The entire computation is divided into 5 steps (kernels).
Step 1 (integral_block): divide the whole image into 4x4 blocks and compute the local integral image of each block independently.

Step 2 (integral_scan_v): a longitudinal scan computes the prefix sums of the last group of data of each 4x4 block from the previous step into the matrix vert.

Step 3 (integral_combine_v): combining the results of the previous two steps, connect the previously unrelated 4x4 blocks in the vertical direction.

Step 4 (integral_scan_h): a transverse scan computes the prefix sums of the last group of data of each 4x4 block from the previous step into the matrix horiz.

Step 5 (integral_combine_h): combining the results of the previous two steps, connect the 4x4 blocks in the transverse direction, producing the complete integral image.
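The five steps above can be modeled on the CPU before reading the kernels. The sketch below is my own plain-C emulation (block size 4; width and height assumed to be multiples of 4; steps 2+3 and steps 4+5 are each fused into one loop for brevity) that reproduces the same carry-propagation idea:

```c
#include <assert.h>

#define B 4  /* block size used by the algorithm */

/* Emulates the 5-step pipeline on a w x h image (w, h multiples of B). */
static void integral_5step(const float *src, float *dst, int w, int h)
{
    /* Step 1: local integral image inside each BxB block */
    for (int by = 0; by < h; by += B)
        for (int bx = 0; bx < w; bx += B)
            for (int y = by; y < by + B; ++y)
                for (int x = bx; x < bx + B; ++x) {
                    float v = src[y * w + x];
                    if (x > bx)           v += dst[y * w + x - 1];
                    if (y > by)           v += dst[(y - 1) * w + x];
                    if (x > bx && y > by) v -= dst[(y - 1) * w + x - 1];
                    dst[y * w + x] = v;
                }
    /* Steps 2+3: scan the last column of each block (the "vert" values)
       and add the carry to every block to its right */
    for (int y = 0; y < h; ++y) {
        float carry = 0;
        for (int bx = 0; bx < w; bx += B) {
            float last = dst[y * w + bx + B - 1]; /* pre-combine value */
            for (int x = bx; x < bx + B; ++x)
                dst[y * w + x] += carry;
            carry += last;
        }
    }
    /* Steps 4+5: scan the last row of each block (the "horiz" values)
       and add the carry to every block below it */
    for (int x = 0; x < w; ++x) {
        float carry = 0;
        for (int by = 0; by < h; by += B) {
            float last = dst[(by + B - 1) * w + x]; /* pre-combine value */
            for (int y = by; y < by + B; ++y)
                dst[y * w + x] += carry;
            carry += last;
        }
    }
}
```

Step 1 is embarrassingly parallel over blocks, and steps 2-5 only read one value per block, which is what makes the scheme GPU-friendly despite having five passes.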

Compared with the previous algorithm, this one has no time-consuming matrix transpose over the whole image, but it is split into 5 steps and is more complex. What about the actual execution? To my surprise: the 5 kernels together take about 0.63 ms, compared with 1.6 ms for the original algorithm, an improvement of roughly 2.5x.

The complete kernel code follows:

/////////////////////////////////////////////////////////////////////////////////
//! @file  : integral_gpu.cl
//! @date  : 2016/05/08
//! @author: guyadong
//! @brief : Calculates the integral sum scan of an image
/////////////////////////////////////////////////////////////////////////////////
#include "common_types.h"
#ifndef CL_DEVICE_LOCAL_MEM_SIZE
#error not defined CL_DEVICE_LOCAL_MEM_SIZE by compiler with options -D
#endif
#ifndef SRC_TYPE
#error not defined SRC_TYPE by compiler with options -D
#endif
#ifndef DST_TYPE
#error not defined DST_TYPE by compiler with options -D
#endif
#ifndef INTEG_TYPE
#error not defined INTEG_TYPE by compiler with options -D
#endif
#define V_TYPE 4
#define SHIFT_NUM 2
#define LOCAL_BUFFER_SIZE (CL_DEVICE_LOCAL_MEM_SIZE/sizeof(DST_TYPE))
#define _KERNEL_NAME(s,d,t) prefix_sum_col_and_transpose_##s##_##d##_##t
#define KERNEL_NAME(s,d,t) _KERNEL_NAME(s,d,t)
#define _KERNEL_NAME_INTEGRAL_BLOCK(s,d,t) integral_block_##s##_##d##_##t
#define KERNEL_NAME_INTEGRAL_BLOCK(s,d,t) _KERNEL_NAME_INTEGRAL_BLOCK(s,d,t)
#define _KERNEL_NAME_SCAN_V(s) integral_scan_v_##s
#define KERNEL_NAME_SCAN_V(s) _KERNEL_NAME_SCAN_V(s)
#define _KERNEL_NAME_COMBINE_V(s) integral_combine_v_##s
#define KERNEL_NAME_COMBINE_V(s) _KERNEL_NAME_COMBINE_V(s)
#define _KERNEL_NAME_SCAN_H(s) integral_scan_h_##s
#define KERNEL_NAME_SCAN_H(s) _KERNEL_NAME_SCAN_H(s)
#define _KERNEL_NAME_COMBINE_H(s) integral_combine_h_##s
#define KERNEL_NAME_COMBINE_H(s) _KERNEL_NAME_COMBINE_H(s)
#define _KERNEL_NAME_SCAN_V_ KERNEL_NAME_SCAN_V(DST_TYPE)
#define _KERNEL_NAME_SCAN_H_ KERNEL_NAME_SCAN_H(DST_TYPE)
#define _KERNEL_NAME_COMBINE_V_ KERNEL_NAME_COMBINE_V(DST_TYPE)
#define _KERNEL_NAME_COMBINE_H_ KERNEL_NAME_COMBINE_H(DST_TYPE)
#define VECTOR_SRC VECTOR(SRC_TYPE,V_TYPE)
#define VECTOR_DST VECTOR(DST_TYPE,V_TYPE)
#define VLOAD FUN_NAME(vload,V_TYPE)
#if INTEG_TYPE == INTEG_SQUARE
#define COMPUTE_SRC(src) ((src)*(src))
#define _KERNEL_NAME_ KERNEL_NAME(SRC_TYPE,DST_TYPE,INTEG_SQUARE)
#define _KERNEL_NAME_INTEGRAL_BLOCK_ KERNEL_NAME_INTEGRAL_BLOCK(SRC_TYPE,DST_TYPE,INTEG_SQUARE)
#elif INTEG_TYPE == INTEG_COUNT
#define COMPUTE_SRC(src) ((DST_TYPE)0 != (src) ? (DST_TYPE)(1) : (DST_TYPE)(0))
#define _KERNEL_NAME_ KERNEL_NAME(SRC_TYPE,DST_TYPE,INTEG_COUNT)
#define _KERNEL_NAME_INTEGRAL_BLOCK_ KERNEL_NAME_INTEGRAL_BLOCK(SRC_TYPE,DST_TYPE,INTEG_COUNT)
#elif INTEG_TYPE == INTEG_DEFAULT
#define COMPUTE_SRC(src) (src)
#define _KERNEL_NAME_ KERNEL_NAME(SRC_TYPE,DST_TYPE,INTEG_DEFAULT)
#define _KERNEL_NAME_INTEGRAL_BLOCK_ KERNEL_NAME_INTEGRAL_BLOCK(SRC_TYPE,DST_TYPE,INTEG_DEFAULT)
#else
#error unknown INTEG_TYPE by compiler with options -D
#endif
/////////////////////////////////////////////////////////////////////////////////
//! @brief: Calculates the integral of an image
/////////////////////////////////////////////////////////////////////////////////
#define __SWAP(a,b) swap=a,a=b,b=swap;
// 4x4 matrix transpose
inline void transpose(VECTOR_DST m[V_TYPE]){
    DST_TYPE swap;
    __SWAP(m[0].s1, m[1].s0);
    __SWAP(m[0].s2, m[2].s0);
    __SWAP(m[0].s3, m[3].s0);
    __SWAP(m[1].s2, m[2].s1);
    __SWAP(m[1].s3, m[3].s1);
    __SWAP(m[2].s3, m[3].s2);
}
// Compute a 4x4 local integral image
__kernel void _KERNEL_NAME_INTEGRAL_BLOCK_(
        __global SRC_TYPE *sourceImage,
        __global VECTOR_DST *dest,
        __constant integ_param *param){
    int pos_x = get_global_id(0)*V_TYPE, pos_y = get_global_id(1)*V_TYPE;
    if(pos_x >= param->width || pos_y >= param->height)
        return;
    int count_x = min(V_TYPE, param->width  - pos_x);
    int count_y = min(V_TYPE, param->height - pos_y);
    VECTOR_DST sum;
    VECTOR_DST matrix[V_TYPE];
    // Load the data from the source matrix and convert it to the vector data
    // type of the destination matrix (VECTOR_DST),
    // e.g. the source matrix is uchar and the destination matrix is float
    matrix[0] = 0 < count_y
        ? ( count_x == V_TYPE
            ? VCONVERT(VECTOR_DST)(VLOAD(0, sourceImage + (pos_y+0)*param->src_width_step + pos_x))
            : ( count_x == 1
                ? (VECTOR_DST)(VCONVERT(VECTOR_DST)(VLOAD(0, sourceImage + (pos_y+0)*param->src_width_step + param->width - V_TYPE)).w, 0, 0, 0)
                : ( count_x == 2
                    ? (VECTOR_DST)(VCONVERT(VECTOR_DST)(VLOAD(0, sourceImage + (pos_y+0)*param->src_width_step + param->width - V_TYPE)).zw, 0, 0)
                    : (VECTOR_DST)(VCONVERT(VECTOR_DST)(VLOAD(0, sourceImage + (pos_y+0)*param->src_width_step + param->width - V_TYPE)).yzw, 0))))
        : 0;
    matrix[1] = 1 < count_y
        ? ( count_x == V_TYPE
            ? VCONVERT(VECTOR_DST)(VLOAD(0, sourceImage + (pos_y+1)*param->src_width_step + pos_x))
            : ( count_x == 1
                ? (VECTOR_DST)(VCONVERT(VECTOR_DST)(VLOAD(0, sourceImage + (pos_y+1)*param->src_width_step + param->width - V_TYPE)).w, 0, 0, 0)
                : ( count_x == 2
                    ? (VECTOR_DST)(VCONVERT(VECTOR_DST)(VLOAD(0, sourceImage + (pos_y+1)*param->src_width_step + param->width - V_TYPE)).zw, 0, 0)
                    : (VECTOR_DST)(VCONVERT(VECTOR_DST)(VLOAD(0, sourceImage + (pos_y+1)*param->src_width_step + param->width - V_TYPE)).yzw, 0))))
        : 0;
    matrix[2] = 2 < count_y
        ? ( count_x == V_TYPE
            ? VCONVERT(VECTOR_DST)(VLOAD(0, sourceImage + (pos_y+2)*param->src_width_step + pos_x))
            : ( count_x == 1
                ? (VECTOR_DST)(VCONVERT(VECTOR_DST)(VLOAD(0, sourceImage + (pos_y+2)*param->src_width_step + param->width - V_TYPE)).w, 0, 0, 0)
                : ( count_x == 2
                    ? (VECTOR_DST)(VCONVERT(VECTOR_DST)(VLOAD(0, sourceImage + (pos_y+2)*param->src_width_step + param->width - V_TYPE)).zw, 0, 0)
                    : (VECTOR_DST)(VCONVERT(VECTOR_DST)(VLOAD(0, sourceImage + (pos_y+2)*param->src_width_step + param->width - V_TYPE)).yzw, 0))))
        : 0;
    matrix[3] = 3 < count_y
        ? ( count_x == V_TYPE
            ? VCONVERT(VECTOR_DST)(VLOAD(0, sourceImage + (pos_y+3)*param->src_width_step + pos_x))
            : ( count_x == 1
                ? (VECTOR_DST)(VCONVERT(VECTOR_DST)(VLOAD(0, sourceImage + (pos_y+3)*param->src_width_step + param->width - V_TYPE)).w, 0, 0, 0)
                : ( count_x == 2
                    ? (VECTOR_DST)(VCONVERT(VECTOR_DST)(VLOAD(0, sourceImage + (pos_y+3)*param->src_width_step + param->width - V_TYPE)).zw, 0, 0)
                    : (VECTOR_DST)(VCONVERT(VECTOR_DST)(VLOAD(0, sourceImage + (pos_y+3)*param->src_width_step + param->width - V_TYPE)).yzw, 0))))
        : 0;
    sum = 0;
    // vertical prefix sum over the 4x4 matrix
    sum += COMPUTE_SRC(matrix[0]), matrix[0] = sum;
    sum += COMPUTE_SRC(matrix[1]), matrix[1] = sum;
    sum += COMPUTE_SRC(matrix[2]), matrix[2] = sum;
    sum += COMPUTE_SRC(matrix[3]), matrix[3] = sum;
    // transpose the matrix
    transpose(matrix);
    sum = 0;
    // horizontal prefix sum over the 4x4 matrix
    sum += matrix[0], matrix[0] = sum;
    sum += matrix[1], matrix[1] = sum;
    sum += matrix[2], matrix[2] = sum;
    sum += matrix[3], matrix[3] = sum;
    // the second transpose restores the matrix to its normal orientation
    transpose(matrix);
    // write the computed results to the destination matrix
    if(0 < count_y) dest[((pos_y+0)*param->dst_width_step + pos_x)/V_TYPE] = matrix[0];
    if(1 < count_y) dest[((pos_y+1)*param->dst_width_step + pos_x)/V_TYPE] = matrix[1];
    if(2 < count_y) dest[((pos_y+2)*param->dst_width_step + pos_x)/V_TYPE] = matrix[2];
    if(3 < count_y) dest[((pos_y+3)*param->dst_width_step + pos_x)/V_TYPE] = matrix[3];
}
#undef __SWAP
// The output of the first kernel (the 4x4-block local integral images) is the
// input matrix (dest).
// Compute the prefix sums of the last data element of each 4x4 block and store
// them in vert
__kernel void _KERNEL_NAME_SCAN_V_(
        __global DST_TYPE *dest,
        __constant integ_param *param,
        __global DST_TYPE *vert,
        int vert_step){
    int gid_y = get_global_id(0);
    if(gid_y >= param->height)
        return;
    DST_TYPE sum = 0;
    int dst_width_step = param->dst_width_step;
    for(int x = V_TYPE-1, end_x = param->width; x < end_x; x += V_TYPE){
        sum += dest[gid_y*dst_width_step + x];
        vert[gid_y*vert_step + (x/V_TYPE)] = sum;
    }
}
// The output of the first kernel (the 4x4-block local integral images) is the
// input matrix (dest).
// The block prefix sums computed by the second kernel are the input matrix (vert).
// Each 4x4 block of dest is incremented by the accumulated value of the blocks
// before it, taken from vert; the result is written to dest_out
__kernel void _KERNEL_NAME_COMBINE_V_(
        __global VECTOR_DST *dest,
        __constant integ_param *param,
        __global DST_TYPE *vert,
        int vert_step,
        __global VECTOR_DST *dest_out){
    int gid_x = get_global_id(0), gid_y = get_global_id(1);
    if(gid_x*V_TYPE >= param->width || gid_y >= param->height)
        return;
    int dest_index = (gid_y*param->dst_width_step)/V_TYPE + gid_x;
    VECTOR_DST m = dest[dest_index];
    m += (VECTOR_DST)(gid_x >= 1 ? vert[gid_y*vert_step + gid_x - 1] : 0);
    dest_out[dest_index] = m;
}
// The output of the previous kernel (the 4x4-block local integral images) is
// the input matrix (dest).
// Compute the prefix sums of the last data row of each 4x4 block and store
// them in horiz
__kernel void _KERNEL_NAME_SCAN_H_(
        __global VECTOR_DST *dest,
        __constant integ_param *param,
        __global VECTOR_DST *horiz,
        int horiz_step){
    int gid_x = get_global_id(0);
    if(gid_x*V_TYPE >= param->width)
        return;
    VECTOR_DST sum = 0;
    int dst_width_step = param->dst_width_step;
    for(int y = V_TYPE-1, end_y = param->height; y < end_y; y += V_TYPE){
        sum += dest[y*dst_width_step/V_TYPE + gid_x];
        horiz[(y/V_TYPE)*horiz_step/V_TYPE + gid_x] = sum;
    }
}
// The output of the third kernel is the input matrix (dest).
// The block prefix sums computed by the fourth kernel are the input matrix (horiz).
// Each 4x4 block of dest is incremented by the accumulated value of the blocks
// above it, taken from horiz; the result is written to dest_out.
// dest_out is the final integral image
__kernel void _KERNEL_NAME_COMBINE_H_(
        __global VECTOR_DST *dest,
        __constant integ_param *param,
        __global VECTOR_DST *horiz,
        int horiz_step,
        __global VECTOR_DST *dest_out){
    int gid_x = get_global_id(0), gid_y = get_global_id(1);
    if(gid_x*V_TYPE >= param->width || gid_y >= param->height)
        return;
    VECTOR_DST m;
    int dest_index = (gid_y*param->dst_width_step)/V_TYPE + gid_x;
    m = dest[dest_index];
    m += gid_y >= V_TYPE ? horiz[((gid_y/V_TYPE)-1)*horiz_step/V_TYPE + gid_x] : (DST_TYPE)0;
    dest_out[dest_index] = m;
}
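One detail worth calling out: integral_block computes each 4x4 local integral image as a vertical prefix sum, a transpose, a second prefix sum, and a transpose back, so all arithmetic operates on whole 4-element vectors. Below is a plain-C emulation of just this trick (the names are my own; the swap in transpose_tile mirrors the kernel's __SWAP macro):

```c
#include <assert.h>

#define TILE 4

/* Swap elements across the diagonal, like the kernel's six __SWAP calls */
static void transpose_tile(float m[TILE][TILE])
{
    for (int i = 0; i < TILE; ++i)
        for (int j = i + 1; j < TILE; ++j) {
            float swap = m[i][j];
            m[i][j] = m[j][i];
            m[j][i] = swap;
        }
}

/* Row y accumulates row y-1: a prefix sum that touches whole rows only */
static void prefix_rows(float m[TILE][TILE])
{
    for (int y = 1; y < TILE; ++y)
        for (int x = 0; x < TILE; ++x)
            m[y][x] += m[y - 1][x];
}

/* 2D local integral image of one tile via two scans and two transposes */
static void local_integral_tile(float m[TILE][TILE])
{
    prefix_rows(m);     /* vertical prefix sum   */
    transpose_tile(m);
    prefix_rows(m);     /* horizontal prefix sum */
    transpose_tile(m);  /* restore orientation   */
}
```

In the kernel the "rows" are 4-element vectors held in registers, so each prefix step is a single vector add, and the transpose costs only six scalar swaps.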

