SSE Image Algorithm Optimization Series 10: SSE Optimization of a Simple Skin Tone Detection Algorithm

Source: Internet
Author: User

Efficient skin tone detection code is needed in many situations. The C++ version I commonly use is as follows:

    // IM_Max and IM_Min are two-argument max/min helpers defined elsewhere in the project.
    void IM_GetRoughSkinRegion(unsigned char *Src, unsigned char *Skin, int Width, int Height, int Stride)
    {
        for (int Y = 0; Y < Height; Y++)
        {
            unsigned char *LinePS = Src + Y * Stride;    // start of row Y in the source image
            unsigned char *LinePD = Skin + Y * Width;    // start of row Y in the skin mask
            for (int X = 0; X < Width; X++)
            {
                int Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2];
                if (Red >= 60 && Green >= 40 && Blue >= 20 && Red >= Blue && (Red - Green) >= 10 &&
                    IM_Max(IM_Max(Red, Green), Blue) - IM_Min(IM_Min(Red, Green), Blue) >= 10)
                    LinePD[X] = 255;                     // skin pixel
                else
                    LinePD[X] = 16;                      // non-skin pixel
                LinePS += 3;                             // move to the next pixel
            }
        }
    }

This code is already quite efficient: a typical 1080P image containing a face is processed in about 4.0 ms, and the detection quality under normal lighting is acceptable.

4.0 ms is fast, but in many real-time scenarios every millisecond saved per frame helps overall smoothness. So does the algorithm have any room left for a speedup? The conventional C-level optimization would be loop unrolling, but in my measurements it made little difference.

So let's see what SIMD instructions can achieve.

I hesitated before deciding to use SIMD. The algorithm itself is very simple, just a combination of conditional tests, and SSE is poorly suited to branching. In C, the && operator short-circuits: in this example, as soon as one condition fails, the remaining conditions are neither computed nor tested, and my code already places the cheap conditions first and the more expensive ones last. To get the same result with SSE, we have to evaluate every condition and then AND the individual results together; the process cannot be cut short in the middle (it is possible in code, but it would certainly be slower). I could not tell in advance how the cost of always evaluating everything would weigh against the gain from SSE's data-level parallelism, so I started the implementation without much confidence.
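To make the contrast concrete, here is a minimal scalar sketch of the "evaluate everything, then AND" style that SIMD forces on us. It is not taken from the original project: the name IsSkinBranchless is hypothetical, and the thresholds 60/40/20 are assumptions made for the illustration.

```cpp
#include <algorithm>

// Hypothetical helper: evaluates all six conditions without short-circuiting
// and combines them with bitwise AND, the way the SIMD version has to.
// The thresholds 60/40/20 are assumptions made for this sketch.
inline unsigned char IsSkinBranchless(int Blue, int Green, int Red)
{
    int MaxV = std::max(std::max(Red, Green), Blue);
    int MinV = std::min(std::min(Red, Green), Blue);
    // Each comparison yields 0 or 1; '&' (unlike '&&') never skips an operand.
    int All = (Red >= 60) & (Green >= 40) & (Blue >= 20) & (Red >= Blue)
            & ((Red - Green) >= 10) & ((MaxV - MinV) >= 10);
    // Map 1 -> 255 (skin) and 0 -> 16 (non-skin) without a branch.
    return static_cast<unsigned char>(16 + All * 239);
}
```

Every pixel now costs the same amount of work regardless of which condition fails, which is the property the SSE version ends up with as well.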

Since this article exists, the SSE version has of course been implemented. Let's analyze how it is done and which intrinsics it may need.

First, we need to extract the R/G/B components into separate SSE variables. How to do this was already covered in "SSE Image Algorithm Optimization Series 8: Simulation and SSE optimization of the natural saturation (Vibrance) algorithm" (that article comes with source code and can serve as an SSE image-processing primer; the Vibrance algorithm can also be used for simple skin tone adjustment).

Next, look at the first three conditions, the fixed lower bounds on Red, Green and Blue (for example Blue >= 20). They need a comparison on unsigned char values, but SSE only provides comparison intrinsics for signed char. The article "A few missing SSE intrinsics" has the answer; the comparison can be implemented with the following macro:

    #define _mm_cmpge_epu8(a, b) _mm_cmpeq_epi8(_mm_max_epu8(a, b), a)

The fourth condition, Red >= Blue, can be implemented with the same macro.
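A small sketch of why the max-based trick is needed. The helpers CmpGeU8 and CmpGtS8 are hypothetical names introduced only for this illustration; they broadcast two byte values, compare them, and return the first byte of the resulting mask.

```cpp
#include <emmintrin.h>  // SSE2

// Unsigned "a >= b" per byte: max(a, b) equals a exactly when a >= b.
#define _mm_cmpge_epu8(a, b) _mm_cmpeq_epi8(_mm_max_epu8(a, b), a)

// Hypothetical helper: 255 if a >= b as unsigned bytes, else 0.
inline unsigned char CmpGeU8(unsigned char a, unsigned char b)
{
    __m128i va = _mm_set1_epi8(static_cast<char>(a));
    __m128i vb = _mm_set1_epi8(static_cast<char>(b));
    __m128i mask = _mm_cmpge_epu8(va, vb);
    return static_cast<unsigned char>(_mm_cvtsi128_si32(mask) & 0xFF);
}

// For contrast: the signed comparison that SSE2 does provide.
// It interprets the byte 200 as -56, so "200 > 100" comes out false.
inline unsigned char CmpGtS8(unsigned char a, unsigned char b)
{
    __m128i mask = _mm_cmpgt_epi8(_mm_set1_epi8(static_cast<char>(a)),
                                  _mm_set1_epi8(static_cast<char>(b)));
    return static_cast<unsigned char>(_mm_cvtsi128_si32(mask) & 0xFF);
}
```

With these helpers, CmpGeU8(200, 100) gives 255 while CmpGtS8(200, 100) gives 0, which is exactly the failure the unsigned macro avoids for pixel values above 127.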

Now consider the fifth condition, (Red - Green) >= 10. If Red - Green were computed directly, both values would have to be widened to a 16-bit type to accommodate a possibly negative result. With the saturating subtraction _mm_subs_epu8, however, Red - Green is clamped to 0 whenever Red < Green, so (Red - Green) >= 10 correctly evaluates to false; when Red >= Green the difference is not clamped, which is exactly what we want. That solves the problem.
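The following sketch checks that reasoning on single values. RedMinusGreenGe10 is a hypothetical helper name used only here; it broadcasts the two values, applies the saturating subtraction, and returns the first byte of the comparison mask.

```cpp
#include <emmintrin.h>  // SSE2

// Hypothetical helper: per-byte test "(Red - Green) >= 10" using saturating
// subtraction, so no widening to 16 bits is needed.
inline unsigned char RedMinusGreenGe10(unsigned char Red, unsigned char Green)
{
    __m128i r = _mm_set1_epi8(static_cast<char>(Red));
    __m128i g = _mm_set1_epi8(static_cast<char>(Green));
    __m128i diff = _mm_subs_epu8(r, g);   // clamps to 0 when Red < Green
    __m128i ten = _mm_set1_epi8(10);
    // diff >= 10 (unsigned)  <=>  max(diff, 10) == diff
    __m128i mask = _mm_cmpeq_epi8(_mm_max_epu8(diff, ten), diff);
    return static_cast<unsigned char>(_mm_cvtsi128_si32(mask) & 0xFF);
}
```

A wrapping subtraction (_mm_sub_epi8) would turn 50 - 200 into 106 and wrongly pass the test; the saturating form clamps it to 0, so the comparison fails as it should.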

The last condition, IM_Max(IM_Max(Red, Green), Blue) - IM_Min(IM_Min(Red, Green), Blue) >= 10, is also simple. First use _mm_max_epu8 and _mm_min_epu8 to obtain the maximum and minimum of the B/G/R components. Since Max >= Min always holds, _mm_subs_epu8 produces the correct difference without any clamping.

Note that SSE's byte comparison intrinsics return only 0 or 255 per byte, so ANDing the results of the six conditions directly gives the combined result: pixels that satisfy all conditions get 255, and all others get 0.
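As a small illustration of combining masks, the sketch below ANDs just two of the conditions for 16 pixels at once. The function name RedDominantMask and the use of separate per-channel input arrays are assumptions made to keep the example short.

```cpp
#include <emmintrin.h>  // SSE2

// Hypothetical helper: for 16 pixels given as separate channel arrays, writes
// 255 where both "Red >= Blue" and "(Red - Green) >= 10" hold, else 0.
inline void RedDominantMask(const unsigned char *Red, const unsigned char *Green,
                            const unsigned char *Blue, unsigned char *Mask)
{
    __m128i r = _mm_loadu_si128(reinterpret_cast<const __m128i *>(Red));
    __m128i g = _mm_loadu_si128(reinterpret_cast<const __m128i *>(Green));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(Blue));
    // Condition 1: Red >= Blue; each byte becomes 0 or 255.
    __m128i c1 = _mm_cmpeq_epi8(_mm_max_epu8(r, b), r);
    // Condition 2: (Red - Green) >= 10 via saturating subtraction.
    __m128i d = _mm_subs_epu8(r, g);
    __m128i c2 = _mm_cmpeq_epi8(_mm_max_epu8(d, _mm_set1_epi8(10)), d);
    // 255 & 255 = 255; any combination involving a 0 gives 0.
    _mm_storeu_si128(reinterpret_cast<__m128i *>(Mask), _mm_and_si128(c1, c2));
}
```

Because every mask byte is either all ones or all zeros, the bitwise AND behaves exactly like a logical AND per pixel.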

In the C++ version, pixels that fail the test are set to 16 (or some other nonzero value). How do we handle that? By the same reasoning: 255 ORed with any number is still 255, while 0 ORed with a number gives that number. So ORing the result above with the constant 16 yields the correct output. The main code is as follows:

    // Load 48 bytes = 16 interleaved BGR pixels.
    Src1 = _mm_loadu_si128((__m128i *)(LinePS + 0));
    Src2 = _mm_loadu_si128((__m128i *)(LinePS + 16));
    Src3 = _mm_loadu_si128((__m128i *)(LinePS + 32));

    // Gather the 16 blue bytes (a shuffle index of -1 zeroes the output byte).
    Blue = _mm_shuffle_epi8(Src1, _mm_setr_epi8(0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));
    Blue = _mm_or_si128(Blue, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14, -1, -1, -1, -1, -1)));
    Blue = _mm_or_si128(Blue, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 4, 7, 10, 13)));

    // Gather the 16 green bytes.
    Green = _mm_shuffle_epi8(Src1, _mm_setr_epi8(1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));
    Green = _mm_or_si128(Green, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1)));
    Green = _mm_or_si128(Green, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14)));

    // Gather the 16 red bytes.
    Red = _mm_shuffle_epi8(Src1, _mm_setr_epi8(2, 5, 8, 11, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));
    Red = _mm_or_si128(Red, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, 1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1)));
    Red = _mm_or_si128(Red, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15)));

    Max = _mm_max_epu8(_mm_max_epu8(Blue, Green), Red);    // IM_Max(IM_Max(Red, Green), Blue)
    Min = _mm_min_epu8(_mm_min_epu8(Blue, Green), Red);    // IM_Min(IM_Min(Red, Green), Blue)

    Result = _mm_cmpge_epu8(Blue, _mm_set1_epi8(20));                                            // Blue >= 20
    Result = _mm_and_si128(Result, _mm_cmpge_epu8(Green, _mm_set1_epi8(40)));                    // Green >= 40
    Result = _mm_and_si128(Result, _mm_cmpge_epu8(Red, _mm_set1_epi8(60)));                      // Red >= 60
    Result = _mm_and_si128(Result, _mm_cmpge_epu8(Red, Blue));                                   // Red >= Blue
    Result = _mm_and_si128(Result, _mm_cmpge_epu8(_mm_subs_epu8(Red, Green), _mm_set1_epi8(10)));  // (Red - Green) >= 10
    Result = _mm_and_si128(Result, _mm_cmpge_epu8(_mm_subs_epu8(Max, Min), _mm_set1_epi8(10)));    // Max - Min >= 10
    Result = _mm_or_si128(Result, _mm_set1_epi8(16));                                            // 0 -> 16, 255 stays 255
    _mm_storeu_si128((__m128i *)(LinePD + 0), Result);
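The fragment above handles one block of 16 pixels. A minimal sketch of how it could be wrapped into a full function follows; this is an assumption about the surrounding loop, not the author's actual project code. The names GetRoughSkinRegionSSE, SkinScalar, CmpGeU8 and SKIN_SSSE3 are hypothetical, the thresholds 60/40/20 are assumed, and the leftover pixels of each row (when Width is not a multiple of 16) are simply handled by the scalar rule.

```cpp
#include <algorithm>
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8

#if defined(__GNUC__)
#define SKIN_SSSE3 __attribute__((target("ssse3")))  // build without -mssse3 on GCC/Clang
#else
#define SKIN_SSSE3
#endif

// Unsigned per-byte "a >= b" (SSE2 instructions only).
static inline __m128i CmpGeU8(__m128i a, __m128i b)
{
    return _mm_cmpeq_epi8(_mm_max_epu8(a, b), a);
}

// Scalar rule for the pixels left over after the 16-pixel blocks.
// The thresholds 60/40/20 are assumptions of this sketch.
static inline unsigned char SkinScalar(int Blue, int Green, int Red)
{
    int MaxV = std::max(std::max(Red, Green), Blue);
    int MinV = std::min(std::min(Red, Green), Blue);
    bool IsSkin = Red >= 60 && Green >= 40 && Blue >= 20 && Red >= Blue &&
                  (Red - Green) >= 10 && (MaxV - MinV) >= 10;
    return IsSkin ? 255 : 16;
}

// Hypothetical wrapper: Src is 24-bit BGR, Skin is 8-bit with Width bytes per row.
SKIN_SSSE3 void GetRoughSkinRegionSSE(const unsigned char *Src, unsigned char *Skin,
                                      int Width, int Height, int Stride)
{
    const char N = -1;  // shuffle index with the high bit set: output byte is zeroed
    const __m128i B1 = _mm_setr_epi8(0, 3, 6, 9, 12, 15, N, N, N, N, N, N, N, N, N, N);
    const __m128i B2 = _mm_setr_epi8(N, N, N, N, N, N, 2, 5, 8, 11, 14, N, N, N, N, N);
    const __m128i B3 = _mm_setr_epi8(N, N, N, N, N, N, N, N, N, N, N, 1, 4, 7, 10, 13);
    const __m128i G1 = _mm_setr_epi8(1, 4, 7, 10, 13, N, N, N, N, N, N, N, N, N, N, N);
    const __m128i G2 = _mm_setr_epi8(N, N, N, N, N, 0, 3, 6, 9, 12, 15, N, N, N, N, N);
    const __m128i G3 = _mm_setr_epi8(N, N, N, N, N, N, N, N, N, N, N, 2, 5, 8, 11, 14);
    const __m128i R1 = _mm_setr_epi8(2, 5, 8, 11, 14, N, N, N, N, N, N, N, N, N, N, N);
    const __m128i R2 = _mm_setr_epi8(N, N, N, N, N, 1, 4, 7, 10, 13, N, N, N, N, N, N);
    const __m128i R3 = _mm_setr_epi8(N, N, N, N, N, N, N, N, N, N, 0, 3, 6, 9, 12, 15);
    const int BlockEnd = Width / 16 * 16;  // pixels covered by full 16-pixel blocks

    for (int Y = 0; Y < Height; Y++)
    {
        const unsigned char *LinePS = Src + Y * Stride;
        unsigned char *LinePD = Skin + Y * Width;
        int X = 0;
        for (; X < BlockEnd; X += 16, LinePS += 48)
        {
            __m128i Src1 = _mm_loadu_si128((const __m128i *)(LinePS + 0));
            __m128i Src2 = _mm_loadu_si128((const __m128i *)(LinePS + 16));
            __m128i Src3 = _mm_loadu_si128((const __m128i *)(LinePS + 32));
            // De-interleave BGR into one register per channel.
            __m128i Blue = _mm_or_si128(_mm_or_si128(_mm_shuffle_epi8(Src1, B1),
                               _mm_shuffle_epi8(Src2, B2)), _mm_shuffle_epi8(Src3, B3));
            __m128i Green = _mm_or_si128(_mm_or_si128(_mm_shuffle_epi8(Src1, G1),
                                _mm_shuffle_epi8(Src2, G2)), _mm_shuffle_epi8(Src3, G3));
            __m128i Red = _mm_or_si128(_mm_or_si128(_mm_shuffle_epi8(Src1, R1),
                              _mm_shuffle_epi8(Src2, R2)), _mm_shuffle_epi8(Src3, R3));
            __m128i Max = _mm_max_epu8(_mm_max_epu8(Blue, Green), Red);
            __m128i Min = _mm_min_epu8(_mm_min_epu8(Blue, Green), Red);
            // Evaluate all six conditions and AND the 0/255 masks together.
            __m128i Result = CmpGeU8(Blue, _mm_set1_epi8(20));
            Result = _mm_and_si128(Result, CmpGeU8(Green, _mm_set1_epi8(40)));
            Result = _mm_and_si128(Result, CmpGeU8(Red, _mm_set1_epi8(60)));
            Result = _mm_and_si128(Result, CmpGeU8(Red, Blue));
            Result = _mm_and_si128(Result, CmpGeU8(_mm_subs_epu8(Red, Green), _mm_set1_epi8(10)));
            Result = _mm_and_si128(Result, CmpGeU8(_mm_subs_epu8(Max, Min), _mm_set1_epi8(10)));
            Result = _mm_or_si128(Result, _mm_set1_epi8(16));  // 0 -> 16, 255 stays 255
            _mm_storeu_si128((__m128i *)(LinePD + X), Result);
        }
        for (; X < Width; X++, LinePS += 3)  // scalar tail for the remaining pixels
            LinePD[X] = SkinScalar(LinePS[0], LinePS[1], LinePS[2]);
    }
}
```

Each block reads exactly 48 bytes, so the vector loop never touches memory past the 16 pixels it processes; only the scalar tail touches the last few pixels of a row. Hoisting the shuffle masks out of the loop is a stylistic choice here; compilers typically do this themselves.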

Speed test (total time for 100 runs in a loop, 1920*1080 images):

    Environment         About half skin tone    Full-image skin tone    No skin tone
    Standard C          400 ms                  550 ms                  360 ms
    SSE optimization    70 ms                   70 ms                   70 ms


As can be seen, although the SSE version in theory performs more computation than the plain C version, it has two advantages. First, it is much faster, with a maximum speedup of about 8 times. Second, its running time is independent of the image content.

This result surprised me. SSE's ability to process 16 bytes at a time is clearly no joke, and the result also shows that branches in ordinary C code are costly.

The complete project is available at: Http://files.cnblogs.com/files/Imageshop/GetSkinArea.rar

Combining skin tone detection with the integral image and mean-variance denoising algorithms I studied earlier, I wrote a complete beautification algorithm in pure SSE. It processes a single 1080P frame in roughly 25 ms (single core), which is 3 to 4 times faster than the pure C version.

Http://files.cnblogs.com/files/Imageshop/SSE_Optimization_Demo.rar is a demo of image processing routines that I optimized entirely with SSE; interested readers can take a look.
