Fast implementation of arbitrary convolution in image processing (the conv2 function in MATLAB).
Convolution is arguably the most basic operation in image processing. Many common algorithms, such as mean blur, Gaussian blur, sharpening, and Sobel, Laplacian, and Prewitt edge detection, can be expressed as convolutions. However, because the convolution matrices of these algorithms have special structure, they are generally not implemented as a direct convolution; instead, the amount of computation is reduced by algorithm-specific optimizations. In some cases, though, the elements of the convolution matrix are irregular or otherwise constrained and cannot be optimized by conventional means, so only the direct method remains. It is therefore worth studying how to quickly convolve an image with an arbitrary matrix.
The most relevant paper I have found so far is "Reshuffling: A Fast Algorithm for Filtering with Arbitrary Kernels". The article claims to speed up the original implementation by about 40%, but it remains unclear how that original implementation was written.
Several functions in MATLAB are related to image convolution; for example, imfilter can implement convolution, and conv2 performs convolution directly. Their speed is quite impressive: for a 3000×3000 grayscale image with a 15×15 convolution matrix, the running time on an i5 CPU is only about … ms, which is quite powerful.
celerychen2009's blog also discusses conv2 and reports an optimized version that is much faster than MATLAB's. For details, see http://blog.csdn.net/celerychen2009/article/details/38852105.
Since MATLAB's code is accelerated by the IPL library, the Conv2 function I have written cannot match it yet; for an arbitrary kernel my speed is about half that of MATLAB.
Here I briefly record the optimizations I used in my convolution implementation.
A direct implementation of convolution requires four nested loops. In schematic form:
for (Y = 0; Y < Height; Y++)
{
    for (X = 0; X < Width; X++)
    {
        Index = .....;
        Sum = 0;
        for (XX = 0; XX < ConvW; XX++)
        {
            for (YY = 0; YY < ConvH; YY++)
            {
                Index1 = .....;
                Index2 = .....;
                Sum += Conv[Index1] * Pixel[Index2];
            }
        }
        Dest[Index] = Sum / Weight;
    }
}
When the convolution matrix is large, the amount of computation becomes huge; moreover, memory is accessed very frequently and with poor locality, so cache misses are severe and efficiency is extremely low.
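To put numbers on this: the naive loop performs Width × Height × ConvW × ConvH multiply-accumulates. For the 3000×3000 image and 15×15 kernel used in the tests below, that is 3000 × 3000 × 15 × 15 ≈ 2 × 10^9 multiply-adds, and each one reads from two locations that are far apart in memory.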
My optimization methods mainly include the following:
1. SSE is used for the multiplications. Because SSE can process four single-precision floating-point numbers at a time, it significantly increases speed.
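As a minimal sketch of this idea (not the final code, which follows below; the function name is mine, and both pointers are assumed to be 16-byte aligned):

#include <xmmintrin.h>

// Multiply four pairs of floats with one SSE instruction pair and sum the products.
float Dot4(const float *A, const float *B)
{
    __m128 Prod = _mm_mul_ps(_mm_load_ps(A), _mm_load_ps(B));   // 4 multiplications at once
    const float *P = (const float *)&Prod;
    return P[0] + P[1] + P[2] + P[3];                           // add up the 4 lanes
}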
2. Through an appropriate arrangement of the data, the pixels around each sampling point are gathered into one contiguous block, so that moving from one pixel to the next does not require a lot of scattered fetching from memory.
The implementation process is as follows:
1. To take advantage of SSE, first adjust the convolution matrix: the number of elements in each row is padded up to the smallest multiple of 4 that is not less than the original width. In addition, the memory holding the new convolution matrix must satisfy the 16-byte alignment required by the SSE load instructions.
The implementation code is as follows:
float *Conv16 = (float *)_mm_malloc(PadConvLine * ConvH * sizeof(float), 16);   // 16-byte aligned copy of the convolution matrix, for SSE
for (Y = 0; Y < ConvH; Y++)
{
    memcpy(Conv16 + Y * PadConvLine, Conv->Data.F + Y * ConvW, ConvW * sizeof(float));    // copy one row of convolution data
    memset(Conv16 + Y * PadConvLine + ConvW, 0, (PadConvLine - ConvW) * sizeof(float));   // zero the padded tail of the row
}
Here PadConvLine = Pad4(ConvW), and the Pad4 macro is:

#define Pad4(bits) (((bits) + 3) / 4 * 4)

For example, Pad4(15) = 16, while Pad4(16) = 16 stays unchanged.
Note that the memory allocated by _mm_malloc is uninitialized (it holds random values), so the padded part must be explicitly filled with 0; otherwise the convolution result will be corrupted.
If we likewise gather the image data to be convolved into a buffer with the same size and layout as the expanded convolution matrix (also 16-byte aligned), the following SSE code can perform the multiply-accumulate:
float MultiplySSE(float *Kernel, float *Conv, int Length)
{
    int Block;
    const float *Data;                          // used to read back the four lanes of an SSE register
    float Sum = 0;
    if (Length > 16)                            // four-way unrolled SSE; testing shows this is faster
    {
        const int BlockWidth = 4 * 4;           // one SSE register handles 4 floats, and the loop is unrolled 4 times
        Block = Length / BlockWidth;            // number of blocks
        float *KernelP = Kernel, *ConvP = Conv; // running pointers for the batch processing
        __m128 Sum0 = _mm_setzero_ps();         // accumulators, initialized to 0
        __m128 Sum1 = _mm_setzero_ps();
        __m128 Sum2 = _mm_setzero_ps();
        __m128 Sum3 = _mm_setzero_ps();
        for (int I = 0; I < Block; I++)
        {
            Sum0 = _mm_add_ps(Sum0, _mm_mul_ps(_mm_load_ps(KernelP), _mm_load_ps(ConvP)));            // packed single-precision multiply and add
            Sum1 = _mm_add_ps(Sum1, _mm_mul_ps(_mm_load_ps(KernelP + 4), _mm_load_ps(ConvP + 4)));
            Sum2 = _mm_add_ps(Sum2, _mm_mul_ps(_mm_load_ps(KernelP + 8), _mm_load_ps(ConvP + 8)));
            Sum3 = _mm_add_ps(Sum3, _mm_mul_ps(_mm_load_ps(KernelP + 12), _mm_load_ps(ConvP + 12)));
            KernelP += BlockWidth;
            ConvP += BlockWidth;
        }
        Sum0 = _mm_add_ps(Sum0, Sum1);          // merge (0 and 1)
        Sum2 = _mm_add_ps(Sum2, Sum3);          // merge (2 and 3)
        Sum0 = _mm_add_ps(Sum0, Sum2);          // merge (0 and 2)
        Data = (const float *)&Sum0;
        Sum = Data[0] + Data[1] + Data[2] + Data[3];
        Kernel += Block * BlockWidth;           // advance past the processed part so the remainder continues from the right place
        Conv += Block * BlockWidth;
        Length = Length - Block * BlockWidth;   // remaining element count
    }
    if (Length != 0)
    {
        const int BlockWidth = 4;               // the padding scheme guarantees the remainder is a multiple of 4
        Block = Length / BlockWidth;
        float *KernelP = Kernel, *ConvP = Conv;
        __m128 Sum0 = _mm_setzero_ps();
        for (int I = 0; I < Block; I++)
        {
            Sum0 = _mm_add_ps(Sum0, _mm_mul_ps(_mm_load_ps(KernelP), _mm_load_ps(ConvP)));
            KernelP += BlockWidth;
            ConvP += BlockWidth;
        }
        Data = (const float *)&Sum0;
        Sum += Data[0] + Data[1] + Data[2] + Data[3];
    }
    return Sum;
}
When the number of elements in the (expanded) convolution matrix is greater than 16, four-way unrolled SSE multiplication is used. In my tests there is little difference between two-way and four-way on an i3 CPU, but on an i5 the four-way version is noticeably faster, so four ways are used. One-way SSE is certainly slower than two-way. If the number of elements is at most 16, or if it exceeds 16 but is not divisible by 16, the earlier padding guarantees that the leftover count is a multiple of 4, so it can be handled by a single one-way SSE loop. This is a small coding trick.
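As a usage sketch (the harness below is my own, for illustration; the buffers must come from _mm_malloc so the aligned loads in MultiplySSE are legal, and a small tolerance is needed because SSE sums the products in a different order than a scalar loop):

#include <assert.h>
#include <math.h>

// Scalar reference for checking MultiplySSE; Length is a multiple of 4 by construction.
float MultiplyScalar(const float *Kernel, const float *Conv, int Length)
{
    float Sum = 0;
    for (int I = 0; I < Length; I++) Sum += Kernel[I] * Conv[I];
    return Sum;
}

// Example check:
//     float *A = (float *)_mm_malloc(Length * sizeof(float), 16);
//     float *B = (float *)_mm_malloc(Length * sizeof(float), 16);
//     /* ... fill A and B ... */
//     assert(fabs(MultiplySSE(A, B, Length) - MultiplyScalar(A, B, Length)) < 1e-3f);
//     _mm_free(A); _mm_free(B);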
2. Next, the question raised above: how to quickly obtain the data to be convolved. In the original four-fold loop, the inner two loops do this gathering, but there are actually several problems. First, during sampling the coordinates of some sample points inevitably fall outside the valid range of the source image, so they must be bounds-checked, which is time-consuming. Second, in order to use SSE, the sampled data must be stored in memory with the same size and layout as the expanded convolution matrix. I will first post my code and then explain the implementation details:
IS_RET __stdcall Conv2(TImage *Src, TMatrix *Conv, TImage *Dest, EdgeMode Edge)
{
    if (Src == NULL || Dest == NULL || Conv == NULL) return IS_RET_ERR_PARA;
    if (Src->Width != Dest->Width || Src->Height != Dest->Height || Src->BitCount != Dest->BitCount || Src->Stride != Dest->Stride) return IS_RET_ERR_PARA;
    if (Src->Scan0 == NULL || Dest->Scan0 == NULL || Conv->Data.F == NULL) return IS_RET_ERR_MEM;
    if (Conv->Width < 1 || Conv->Height < 1) return IS_RET_ERR_PARA;
    int Width = Src->Width, Height = Src->Height, Stride = Src->Stride;
    int ConvW = Conv->Width, ConvH = Conv->Height;
    unsigned char *PtSrc = Src->Scan0, *PtDest = Dest->Scan0;
    if (Src->BitCount == 24)
    {
        // (the 24-bit color case is not shown in this post)
    }
    else
    {
        int Left = ConvW / 2, Top = ConvH / 2, Right = ConvW - Left - 1, Bottom = ConvH - Top - 1, ExpHeight = Height + ConvH - 1;   // the center element of the kernel needs no expansion: for a kernel of width 3, only one extra pixel is needed on each side
        int PadConvLine = Pad4(ConvW), Length = PadConvLine * ConvH;
        int X, Y, IndexD, IndexE, IndexK, ExpStride;
        float *CurKer, Inv, Sum = 0;
        unsigned char *PtExp;
        TImage *Expand;
        IS_RET Ret = GetPadImage(Src, &Expand, Left, Right, Top, Bottom, Edge);   // build the expanded image: costs memory, but speeds things up and simplifies the code
        if (Ret != IS_RET_OK) return Ret;
        PtExp = Expand->Scan0;
        ExpStride = Expand->Stride;
        for (X = 0; X < ConvH * ConvW; X++) Sum += Conv->Data.F[X];
        Inv = (Sum == 0 ? 1 : 1 / Sum);                                           // if the sum of the kernel elements is 0, use 1 instead
        float *Conv16 = (float *)_mm_malloc(PadConvLine * ConvH * sizeof(float), 16);       // 16-byte aligned copy of the convolution matrix, for SSE
        float *Kernel = (float *)_mm_malloc(PadConvLine * ExpHeight * sizeof(float), 16);   // 16-byte aligned buffer for the data to be convolved, for SSE
        for (Y = 0; Y < ConvH; Y++)
        {
            memcpy(Conv16 + Y * PadConvLine, Conv->Data.F + Y * ConvW, ConvW * sizeof(float));      // copy one row of convolution data
            memset(Conv16 + Y * PadConvLine + ConvW, 0, (PadConvLine - ConvW) * sizeof(float));     // zero the padded tail of the row
        }
        for (Y = 0; Y < ExpHeight; Y++)                 // gather the data to be convolved for the first column of pixels
        {
            IndexE = Y * ExpStride;
            CurKer = Kernel + Y * PadConvLine;
            for (X = 0; X < ConvW; X++) CurKer[X] = PtExp[IndexE++];
        }
        for (X = 0; X < Width; X++)
        {
            if (X != 0)                                 // for every column except the first, update the gathered data
            {
                memmove(Kernel, Kernel + 1, (PadConvLine * ExpHeight - 1) * sizeof(float));   // shift the whole buffer left by one element (the regions overlap, so memmove rather than memcpy)
                IndexK = ConvW - 1;
                IndexE = IndexK + X;
                for (Y = 0; Y < ExpHeight; Y++)
                {
                    Kernel[IndexK] = PtExp[IndexE];     // only the newly exposed element of each row needs refreshing
                    IndexK += PadConvLine;
                    IndexE += ExpStride;
                }
            }
            CurKer = Kernel;
            IndexD = X;
            for (Y = 0; Y < Height; Y++)                // walk down the column
            {
                PtDest[IndexD] = Clamp((int)(MultiplySSE(Conv16, CurKer, Length) * Inv + 0.5));   // calling the function directly here loses nothing measurable; note that it will not be inlined
                CurKer += PadConvLine;
                IndexD += Stride;
            }
        }
        _mm_free(Conv16);
        _mm_free(Kernel);
        FreeImage(Expand);
    }
    return IS_RET_OK;
}
The first problem is solved very simply, by trading space for time: create an image of size (Width + ConvW - 1) × (Height + ConvH - 1), fill its border (the Left/Right/Top/Bottom pixels computed above) according to the chosen edge mode, for example by replicating the edge pixels or mirroring the image near the edge, and copy the original image into the center. Sampling then always stays inside this expanded image, so the branch cost of coordinate checks disappears. This is what the GetPadImage function implements.
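The source of GetPadImage is not shown in this post; a minimal sketch of the replicate-edge variant, for an 8-bit grayscale buffer with no stride padding (the function name and flat layout are mine, for illustration only), might look like this:

// Pad a Width x Height grayscale buffer by replicating the nearest edge pixel
// (one plausible EdgeMode behavior); Exp must hold (Width+Left+Right) * (Height+Top+Bottom) bytes.
void PadReplicate(const unsigned char *Src, int Width, int Height,
                  unsigned char *Exp, int Left, int Right, int Top, int Bottom)
{
    int ExpWidth = Width + Left + Right;
    for (int Y = 0; Y < Height + Top + Bottom; Y++)
    {
        int SrcY = Y - Top;
        if (SrcY < 0) SrcY = 0;                     // clamp the row index into the source
        if (SrcY >= Height) SrcY = Height - 1;
        for (int X = 0; X < ExpWidth; X++)
        {
            int SrcX = X - Left;
            if (SrcX < 0) SrcX = 0;                 // clamp the column index into the source
            if (SrcX >= Width) SrcX = Width - 1;
            Exp[Y * ExpWidth + X] = Src[SrcY * Width + SrcX];
        }
    }
}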
The second problem requires some implementation technique. We allocate a block of PadConvLine × (Height + ConvH - 1) floats and fill it with the data needed to convolve the first column of pixels of the source image (the loop over ExpHeight just before the main X loop in the listing above). With this data, computing the convolution results for the first column is easy: after each output pixel, advance the starting point of the gathered data by PadConvLine elements and call MultiplySSE again. For the second column of pixels, the gathered data must be updated before working down that column, and the update is also very simple: shift the existing data left by one element, which a single memmove does quickly (the source and destination overlap, so memcpy is not safe here), then fill in the one new element per row and call MultiplySSE as before. Repeat this for every column.
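As a rough cost estimate for this update scheme (derived from the figures above): per column, the shift is one sequential memmove of PadConvLine × ExpHeight − 1 floats, followed by only ExpHeight scattered refreshes. For the 15×15 kernel on the 3000×3000 test image, PadConvLine = Pad4(15) = 16 and ExpHeight = 3014, so each column costs one contiguous copy of about 48 K floats plus 3014 single-element reads, and the gathered data stays laid out exactly as MultiplySSE expects.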
After coding and testing: for a 3000×3000 grayscale image and a 15×15 kernel, the average time on an i5 CPU is 360 ms, about half the speed of MATLAB.
Finally, many people say that convolution can be implemented quickly with the FFT, with a cost independent of the kernel size. I agree with the second half of that sentence, but the first half is definitely problematic: at least for kernels smaller than about 50×50, FFT-based convolution will not beat the direct implementation. You have to realize that the amount of computation behind an FFT is actually very large.
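A back-of-envelope operation count supports this. Direct convolution of a 3000×3000 image with a 15×15 kernel costs about 9 × 10^6 × 225 ≈ 2 × 10^9 multiply-adds. FFT-based convolution needs two forward 2-D FFTs and one inverse on a grid of roughly N = 9 × 10^6 samples, plus a pointwise complex product; taking the usual estimate of about 5 N log2 N real flops per FFT, that is already about 3 × 5 × 9 × 10^6 × 23 ≈ 3 × 10^9 flops. So for kernels this small the FFT route has no advantage at all; its independence from kernel size only pays off once the kernel becomes large.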
*************************** As a rule I do not provide source code, but I will try to describe the corresponding algorithm clearly in words or point to reference documents ***************************
*************************** Only results written out through your own effort are truly your own; in everything, rely on yourself ***************************
*************************** Author: laviewpbt    Date: 2014.11.27    Contact QQ: 33184777    Please keep this notice when reprinting ***************************