Fast realization of convolution---arbitrary convolution kernels for image processing


Convolution is one of the most basic operations in image processing. Algorithms such as mean blur, Gaussian blur, sharpening, and Sobel, Laplace, and Prewitt edge detection can all be expressed as convolutions. Because the convolution matrices of these algorithms have special structure, they are usually not implemented directly; instead, optimizations exploit that structure to reduce the computation. In some cases, however, the convolution matrix elements are irregular, or there are special requirements, and no conventional optimization applies, so the convolution must be computed in the general way. It is therefore worthwhile to study how to quickly implement convolution of an image with an arbitrary matrix.

So far, the only article on optimizing arbitrary-kernel convolution that I have found, through friends' sharing or my own searching, is "Reshuffling: a fast algorithm for filtering with arbitrary kernels". The article claims to improve the speed of the baseline program by about 40%, but it does not explain how that baseline program is written.

In MATLAB, several functions are related to image convolution; for example, imfilter can perform convolution, and conv2 works as well. Their speed is quite impressive: for a 3000*3000 grayscale image and a 15*15 convolution matrix, the running time on an i5 CPU is only about 170 ms.

Celery's blog also mentions an optimized conv2 of his that is comparable to MATLAB's, and even faster; see his post for details.

Since the MATLAB code is accelerated with the IPL library, the conv2-equivalent function I have currently written is not its equal: for arbitrary kernels it runs at about half of MATLAB's speed.

Here I briefly record the optimizations I used in the convolution process.

The original convolution implementation requires four nested loops; expressed simply:

```cpp
for (Y = 0; Y < Height; Y++)
{
    for (X = 0; X < Width; X++)
    {
        Index = ...;
        Sum = 0;
        for (XX = 0; XX < ConvW; XX++)
        {
            for (YY = 0; YY < ConvH; YY++)
            {
                Index1 = ...;
                Index2 = ...;
                Sum += Conv[Index1] * Pixel[Index2];
            }
        }
        Dest[Index] = Sum / Weight;
    }
}
```

When the convolution matrix is large, the amount of computation becomes very large, and because the memory accesses in this program are frequent and scattered, cache misses are severe, so efficiency is very low.

My optimization mainly involves the following aspects:

First: use SSE for the multiplications. Since SSE can process 4 single-precision floating-point numbers in one operation, this gives a significant speedup.

Second: with appropriate preprocessing, the image elements covered by the convolution matrix around each sampling point are gathered into contiguous memory, so that moving from pixel to pixel does not require a large amount of scattered fetching from memory.

Specifically, the implementation process is as follows:

1. To exploit SSE, first adjust the convolution matrix: pad the number of elements per row up to the smallest multiple of 4 not less than the original width, and lay the new convolution matrix out in memory according to the 16-byte alignment that the SSE load functions require.

The implementation code is as follows:

```cpp
float *Conv16 = (float *)_mm_malloc(PadConvLine * ConvH * sizeof(float), 16);   // holds the 16-byte aligned convolution matrix, for SSE
for (Y = 0; Y < ConvH; Y++)
{
    memcpy(Conv16 + Y * PadConvLine, Conv->Data.F + Y * ConvW, ConvW * sizeof(float));  // copy the convolution matrix data
    memset(Conv16 + Y * PadConvLine + ConvW, 0, (PadConvLine - ConvW) * sizeof(float)); // zero the redundant (padded) part
}
```

Here PadConvLine = Pad4(ConvW), and the Pad4 macro is: #define Pad4(bits) (((bits) + 3) / 4 * 4)

Note that the memory allocated by _mm_malloc contains random values, so the extended (padded) part must be filled with 0, or the convolution results will be corrupted.

Now, if we also have the image data that a given output pixel needs for its convolution laid out the same way (this gathered buffer has exactly the padded size of the convolution matrix and should likewise be 16-byte aligned), the multiply-accumulate can be done with the following SSE code:

```cpp
float MultiplySSE(float *Kernel, float *Conv, int Length)
{
    int Block;
    const float *Data;                        // pointer used to merge the four lanes of an SSE variable
    float *KernelP = Kernel, *ConvP = Conv;   // pointers for the SSE batch processing
    float Sum = 0;
    if (Length > 16)                          // worth doing four-way SSE; tests show this is faster
    {
        const int BlockWidth = 4 * 4;         // block width: an SSE register processes 4 floats at a time, unrolled 4 times
        Block = Length / BlockWidth;          // number of blocks
        __m128 Sum0 = _mm_setzero_ps();       // accumulators, initialized to 0
        __m128 Sum1 = _mm_setzero_ps();
        __m128 Sum2 = _mm_setzero_ps();
        __m128 Sum3 = _mm_setzero_ps();
        for (int I = 0; I < Block; I++)
        {
            Sum0 = _mm_add_ps(Sum0, _mm_mul_ps(_mm_load_ps(KernelP), _mm_load_ps(ConvP)));            // packed single-precision multiply and add
            Sum1 = _mm_add_ps(Sum1, _mm_mul_ps(_mm_load_ps(KernelP + 4), _mm_load_ps(ConvP + 4)));
            Sum2 = _mm_add_ps(Sum2, _mm_mul_ps(_mm_load_ps(KernelP + 8), _mm_load_ps(ConvP + 8)));
            Sum3 = _mm_add_ps(Sum3, _mm_mul_ps(_mm_load_ps(KernelP + 12), _mm_load_ps(ConvP + 12)));
            KernelP += BlockWidth;
            ConvP += BlockWidth;
        }
        Sum0 = _mm_add_ps(Sum0, Sum1);        // pairwise merge (0 and 1)
        Sum2 = _mm_add_ps(Sum2, Sum3);        // pairwise merge (2 and 3)
        Sum0 = _mm_add_ps(Sum0, Sum2);        // final merge
        Data = (const float *)&Sum0;
        Sum = Data[0] + Data[1] + Data[2] + Data[3];
        Length = Length - Block * BlockWidth; // remaining element count; KernelP and ConvP already point past the processed part
    }
    if (Length != 0)
    {
        const int BlockWidth = 4;             // the padding has already ensured the count is a multiple of 4
        Block = Length / BlockWidth;
        __m128 Sum0 = _mm_setzero_ps();
        for (int I = 0; I < Block; I++)
        {
            Sum0 = _mm_add_ps(Sum0, _mm_mul_ps(_mm_load_ps(KernelP), _mm_load_ps(ConvP)));
            KernelP += BlockWidth;
            ConvP += BlockWidth;
        }
        Data = (const float *)&Sum0;
        Sum += Data[0] + Data[1] + Data[2] + Data[3];
    }
    return Sum;
}
```

When the number of elements in the (expanded) convolution matrix is greater than 16, we use 4-way parallel SSE multiplication. In my tests, 2-way and 4-way SSE show no big difference on an i3 CPU, while on an i5 the 4-way version is a clear improvement, so 4-way SSE is used; 1-way SSE is, of course, still slower than 2-way. If the number of elements is less than 16, or greater than 16 but not divisible by 16, then thanks to the earlier padding the number of remaining elements is guaranteed to be a multiple of 4, so a single-way SSE loop handles them. This is also a small coding technique.

2. How do we quickly obtain the image data to be convolved, mentioned above? Observe the original four nested loops: the inner two loops fetch the data for the convolution, and there are several problems with this. First, since convolution sampling near the border produces coordinates outside the valid range of the original image, they have to be checked, which costs time. Second, to use SSE, the sampled data must be laid out in memory with the same padded size as the expanded convolution matrix. Here I first post my code and then explain the implementation:

```cpp
IS_RET __stdcall Conv2(TImage *Src, TMatrix *Conv, TImage *Dest, EdgeMode Edge)
{
    if (Src == NULL || Dest == NULL || Conv == NULL) return IS_RET_ERR_PARA;
    if (Src->Width != Dest->Width || Src->Height != Dest->Height ||
        Src->BitCount != Dest->BitCount || Src->Stride != Dest->Stride)
        return IS_RET_ERR_PARA;
    if (Src->Scan0 == NULL || Dest->Scan0 == NULL || Conv->Data.F == NULL) return IS_RET_ERR_MEM;
    if (Conv->Width < 1 || Conv->Height < 1) return IS_RET_ERR_PARA;

    int Width = Src->Width, Height = Src->Height, Stride = Src->Stride;
    int ConvW = Conv->Width, ConvH = Conv->Height;
    unsigned char *PtSrc = Src->Scan0, *PtDest = Dest->Scan0;

    if (Src->BitCount == 24)    // presumably the color path; the constant is garbled in the source and the branch is empty there
    {
    }
    else
    {
        int Left = ConvW / 2, Top = ConvH / 2, Right = ConvW - Left - 1, Bottom = ConvH - Top - 1;
        int ExpHeight = Height + ConvH - 1;     // the kernel's center element needs no expansion: for a kernel of width 3, one pixel of expansion on each side is enough
        int PadConvLine = Pad4(ConvW), Length = PadConvLine * ConvH;
        int X, Y, IndexD, IndexE, IndexK, ExpStride;
        float *CurKer, Inv, Sum = 0;
        unsigned char *PtExp;
        TImage *Expand;

        IS_RET Ret = GetPadImage(Src, &Expand, Left, Right, Top, Bottom, Edge);   // the expanded copy speeds things up and simplifies the code, at the cost of some memory
        if (Ret != IS_RET_OK) return Ret;
        PtExp = Expand->Scan0;
        ExpStride = Expand->Stride;

        for (X = 0; X < ConvH * ConvW; X++) Sum += Conv->Data.F[X];
        Inv = (Sum == 0 ? 1 : 1 / Sum);     // if the convolution weights sum to 0, use 1 instead

        float *Conv16 = (float *)_mm_malloc(PadConvLine * ConvH * sizeof(float), 16);       // 16-byte aligned convolution matrix, for SSE
        float *Kernel = (float *)_mm_malloc(PadConvLine * ExpHeight * sizeof(float), 16);   // 16-byte aligned buffer of gathered data to be convolved, for SSE

        for (Y = 0; Y < ConvH; Y++)
        {
            memcpy(Conv16 + Y * PadConvLine, Conv->Data.F + Y * ConvW, ConvW * sizeof(float));   // copy the convolution matrix data
            memset(Conv16 + Y * PadConvLine + ConvW, 0, (PadConvLine - ConvW) * sizeof(float));  // zero the redundant (padded) part
        }
        for (Y = 0; Y < ExpHeight; Y++)     // gather the data that every pixel of the first column will sample
        {
            IndexE = Y * ExpStride;
            CurKer = Kernel + Y * PadConvLine;
            for (X = 0; X < ConvW; X++) CurKer[X] = PtExp[IndexE++];
        }
        for (X = 0; X < Width; X++)
        {
            if (X != 0)     // not the first column: the gathered data must be updated
            {
                memcpy(Kernel, Kernel + 1, (PadConvLine * ExpHeight - 1) * sizeof(float));   // shift the whole buffer forward by one element
                IndexK = ConvW - 1;
                IndexE = IndexK + X;
                for (Y = 0; Y < ExpHeight; Y++)
                {
                    Kernel[IndexK] = PtExp[IndexE];   // refresh only the newly entering element
                    IndexK += PadConvLine;
                    IndexE += ExpStride;
                }
            }
            CurKer = Kernel;
            IndexD = X;
            for (Y = 0; Y < Height; Y++)    // walk down the column
            {
                PtDest[IndexD] = Clamp((int)(MultiplySSE(Conv16, CurKer, Length) * Inv + 0.5));   // calling the function here costs no measurable speed; note that a changed function may no longer be inlined
                CurKer += PadConvLine;
                IndexD += Stride;
            }
        }
        _mm_free(Conv16);
        _mm_free(Kernel);
        FreeImage(Expand);
        return IS_RET_OK;
    }
}
```

For the first problem, the solution is simple: trade space for time. Create a new image of size (Width + ConvW - 1, Height + ConvH - 1), fill the surrounding border pixels with edge-replicated or otherwise extrapolated values according to the chosen edge mode, and copy the original image into the middle. Sampling is then done from this well-extended image instead of the original, which avoids the coordinate checks and the if-jumps they cause. The GetPadImage function implements this.

The second problem requires some implementation technique. Allocate a block of memory of size PadConvLine * (Height + ConvH - 1), and gather into it the data that every pixel in the first column of the original image needs for its convolution; this is done by the two loops before the main column loop in the code above. With this buffer, computing the convolution results of the first column is very simple: moving down one row just advances the starting point of the data to be convolved by PadConvLine elements, and a call to the MultiplySSE function above yields the result. To compute the convolution values of the second column of pixels, the gathered data for the whole column must be updated. The update is simple: shift the existing data forward by one element, which memcpy does quickly, then fill in the newly entering elements. Then call MultiplySSE again, and so on, column by column.

After coding and testing: for a 3000*3000 grayscale image with a 15*15 kernel, the average result on an i5 CPU is 360 ms, about half the speed of MATLAB.

Finally, many people say that the FFT can implement convolution quickly, with a cost independent of the kernel size. I agree with the second half of that statement, but the first half is definitely problematic: at least while the kernel is smaller than about 50*50, an FFT-based convolution will not beat a direct implementation. The constant factor in the FFT computation is, after all, quite large.

