Image Processing - Error Diffusion: Faster Speed or Better Results



HouSisong@GMail.com

2010.01.05

 

(2010.01.06: added some discussion of the parallelism of the Error Diffusion algorithm)

 


Tags: Error Diffusion, true-color to high-color conversion, color levels, gradients, halftoning

Abstract: When converting an image between color formats, the differing color ranges introduce quantization errors;
the Error Diffusion algorithm reduces the resulting visual error by passing each pixel's error on to neighboring pixels.
Part 1: a simple implementation; Part 2: simple speed optimization; Part 3 (this article): faster speed or better results.

 

(Test source code download: http://cid-10fa89dec380323f.skydrive.live.com/self.aspx/.Public/ErrorDiffuse.zip )

Body:
The code is C++; compiler: VC2005.
Test platform: CPU: i7-920 (3.44GHz); memory: DDR3 1333 (triple channel); compiler: VC2005.

(Please read the first and second articles in this series before this one.)

 

A: Faster speed

The error-diffusion coefficients used in the previous article were:

    * 2
  1 1      (/4)

They are a compromise between speed and quality. Some real-time scenarios can tolerate an even cheaper diffusion;
in that case, consider the coefficients:

    * 1
  0 1 0    (/2)

or even:

    * 1    (/1)

Here the entire error is passed to the pixel on the right, which makes the implementation much simpler:

 

// diffusion template:
//   * 1    (/1)
void cvsPic32To16_ErrorDiffuse_Line_Fast(UInt16* pDst,const Color32* pSrc,long width){
    TErrorColor hErr;
    hErr.dR=0; hErr.dG=0; hErr.dB=0;
    for (long x=0; x<width; ++x)
    {
        long cB=(pSrc[x].b+hErr.dB);
        long cG=(pSrc[x].g+hErr.dG);
        long cR=(pSrc[x].r+hErr.dR);
        long rB=BestRGB16_555Color_Table[cB];  // the table covers out-of-range indices
        long rG=BestRGB16_555Color_Table[cG];
        long rR=BestRGB16_555Color_Table[cR];
        pDst[x]=rB|(rG<<5)|(rR<<10);
        hErr.dB=(cB-getC8Color(rB));
        hErr.dG=(cG-getC8Color(rG));
        hErr.dR=(cR-getC8Color(rR));
    }
}

void cvsPic32To16_ErrorDiffuse_Fast(const TPicRegion_RGB16_555& dst,const TPixels32Ref& src){
    UInt16* pDst=(UInt16*)dst.pdata;
    const Color32* pSrc=src.pdata;
    const long width=src.width;
    for (long y=0; y<src.height; ++y){
        cvsPic32To16_ErrorDiffuse_Line_Fast(pDst,pSrc,width);
        (UInt8*&)pDst += dst.byte_width;
        (UInt8*&)pSrc += src.byte_width;
    }
}

Speed test:
//////////////////////////////////////////////////////////////
// cvsPic32To16_ErrorDiffuse_Fast                422.83 FPS
//////////////////////////////////////////////////////////////


Result (worse than the previous article's):


A simple rewrite as an MMX implementation:

const UInt64 csMMX_erdf_mul_w =0x41CE41CE41CE41CE; // 0x41CE = 16846 = 255*(1<<11)/((1<<5)-1);
const UInt64 csMMX_erdf_mask_w=0x0000000000F8F8F8; // 0xF8 = ((1<<5)-1)<<3;

inline void cvsPic32To16_ErrorDiffuse_Line_MMX(UInt16* pDst,const Color32* pSrc,long width){
    __asm {
        // push esi
        // push edi
        // push ebx
        mov   ecx, width
        mov   edi, pDst
        mov   esi, pSrc
        lea   edi, [edi+ecx*2]   // 2 = sizeof(UInt16)
        lea   esi, [esi+ecx*4]   // 4 = sizeof(Color32)
        neg   ecx
        pxor  mm6, mm6           // hErr = 0
        pxor  mm7, mm7           // mm7  = 0
        movq  mm3, csMMX_erdf_mul_w
        movq  mm4, csMMX_erdf_mask_w
    loop_begin:
        movd  mm0, [esi+ecx*4]
        punpcklbw mm0, mm7
        paddw mm0, mm6           // now: cB, cG, cR
        movq  mm6, mm0
        packuswb mm0, mm7
        pand  mm0, mm4
        psrlq mm0, 3             // >>3
        movd  eax, mm0           // 00000000 000rrrrr 000ggggg 000bbbbb
        punpcklbw mm0, mm7
        movzx ebx, ah
        movzx edx, al
        psllw mm0, 5
        shr   eax, 16
        shl   ebx, 5
        pmulhw mm0, mm3
        shl   eax, 10
        or    edx, ebx
        psubw mm6, mm0
        or    eax, edx
        mov   word ptr [edi+ecx*2], ax  // pDst[x] = rB|(rG<<5)|(rR<<10);

        inc   ecx
        jnz   loop_begin
        emms
        // pop ebx
        // pop edi
        // pop esi
    }
}

void cvsPic32To16_ErrorDiffuse_MMX(const TPicRegion_RGB16_555& dst,const TPixels32Ref& src){
    UInt16* pDst=(UInt16*)dst.pdata;
    const Color32* pSrc=src.pdata;
    const long width=src.width;
    if (width<=0) return;
    for (long y=0; y<src.height; ++y){
        cvsPic32To16_ErrorDiffuse_Line_MMX(pDst,pSrc,width);
        (UInt8*&)pDst += dst.byte_width;
        (UInt8*&)pSrc += src.byte_width;
    }
}

 

Speed test:
//////////////////////////////////////////////////////////////
// cvsPic32To16_ErrorDiffuse_MMX                 662.98 FPS
//////////////////////////////////////////////////////////////


Result:


 

B: Better Results

To improve color-conversion quality, halftoning techniques are generally used to optimize the output;
among them, the Error Diffusion algorithm has long been one of the most effective.
Let's implement the classic Floyd-Steinberg error-diffusion coefficients:

// Floyd-Steinberg
//     * 7
//   3 5 1    (/16)
void cvsPic32To16_ErrorDiffuse_Line_FS(UInt16* pDst,const Color32* pSrc,long width,
                                       TErrorColor* pHLineErr0,TErrorColor* pHLineErr1){
    TErrorColor hErr;
    hErr.dR=0; hErr.dG=0; hErr.dB=0;
    pHLineErr1[-1].dB=0; pHLineErr1[-1].dG=0; pHLineErr1[-1].dR=0;
    pHLineErr1[ 0].dB=0; pHLineErr1[ 0].dG=0; pHLineErr1[ 0].dR=0;
    for (long x=0; x<width; ++x)
    {
        long cB=pSrc[x].b+((hErr.dB+pHLineErr0[x].dB)>>4);
        long cG=pSrc[x].g+((hErr.dG+pHLineErr0[x].dG)>>4);
        long cR=pSrc[x].r+((hErr.dR+pHLineErr0[x].dR)>>4);
        long rB=BestRGB16_555Color_Table[cB];
        long rG=BestRGB16_555Color_Table[cG];
        long rR=BestRGB16_555Color_Table[cR];
        pDst[x]=rB|(rG<<5)|(rR<<10);
        pHLineErr1[x+1].dB=(cB-getC8Color(rB));
        pHLineErr1[x+1].dG=(cG-getC8Color(rG));
        pHLineErr1[x+1].dR=(cR-getC8Color(rR));
        pHLineErr1[x-1].dB+=(pHLineErr1[x+1].dB*3);
        pHLineErr1[x-1].dG+=(pHLineErr1[x+1].dG*3);
        pHLineErr1[x-1].dR+=(pHLineErr1[x+1].dR*3);
        pHLineErr1[x  ].dB+=(pHLineErr1[x+1].dB*5);
        pHLineErr1[x  ].dG+=(pHLineErr1[x+1].dG*5);
        pHLineErr1[x  ].dR+=(pHLineErr1[x+1].dR*5);
        hErr.dB=(pHLineErr1[x+1].dB*7);
        hErr.dG=(pHLineErr1[x+1].dG*7);
        hErr.dR=(pHLineErr1[x+1].dR*7);
    }
}

void cvsPic32To16_ErrorDiffuse_FS(const TPicRegion_RGB16_555& dst,const TPixels32Ref& src){
    UInt16* pDst=(UInt16*)dst.pdata;
    const Color32* pSrc=src.pdata;
    const long width=src.width;
    TErrorColor* _hLineErr=new TErrorColor[(width+2)*2];
    for (long x=0; x<(width+2)*2; ++x){
        _hLineErr[x].dR=0;
        _hLineErr[x].dG=0;
        _hLineErr[x].dB=0;
    }
    TErrorColor* hLineErr0=&_hLineErr[1];
    TErrorColor* hLineErr1=&_hLineErr[1+(width+2)];
    for (long y=0; y<src.height; ++y){
        cvsPic32To16_ErrorDiffuse_Line_FS(pDst,pSrc,width,hLineErr0,hLineErr1);
        std::swap(hLineErr0,hLineErr1);
        (UInt8*&)pDst += dst.byte_width;
        (UInt8*&)pSrc += src.byte_width;
    }
    delete [] _hLineErr;
}

 

Speed test:
//////////////////////////////////////////////////////////////
// cvsPic32To16_ErrorDiffuse_FS                  198.81 FPS
//////////////////////////////////////////////////////////////


See the results:

 

Zoom in and compare these images, paying attention to the details;

the Error Diffusion algorithm restores the image well, but one drawback is that it tends to produce fine local line patterns;

the particles formed by the diffused error build up into small visible textures (because the diffusion-template coefficients are fixed);

To overcome this, consider the following improvement plans, alone or in combination:

A. Randomly perturb the quantization threshold, so that the choice of nearest color is slightly randomized;

B. Build a large table of diffusion coefficients and select a different diffusion template according to the current error values;

C. Adjust the template coefficients dynamically according to the color error already accumulated: for example, if the error at the current point is large, reduce the diffusion coefficients (or reorder them from large to small), or adjust the threshold accordingly;

D. Post-process local textures iteratively: run error diffusion first, then flip processed points that meet certain conditions from their current value (e.g. brightened) to the opposite one (dimmed); two adjacent points can also be processed as a pair, brightening one and dimming the other. The flip criterion is whether the change better reproduces the original image, judged by some evaluation model, e.g. a Gaussian-weighted value/gradient comparison (or any other model);

E. If the target color gamut differs greatly from the source's, first apply a linear mapping of the gamut, sacrificing some color fidelity to preserve better overall contrast;

F. When the target color table is very small (e.g. converting to pure black and white), first sharpen the image to strengthen edges and enhance local contrast;

G. In some cases, reproducing brightness, gradients, and contrast may matter more than reproducing color;
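Plan A is simple enough to sketch in a few lines. The idea: add small zero-mean noise to the value just before quantization, so the effective threshold jitters per pixel and the fixed-template textures break up, while the diffused error is still computed from the un-jittered value so the noise does not accumulate. This is my own illustrative sketch, not the article's code; the function name and the noise amplitude (+-4) are assumptions.

```cpp
#include <random>
#include <vector>

// Quantize an 8-bit gray row to 5 bits with right-only error diffusion,
// jittering the value before quantization (improvement plan A).
std::vector<int> diffuseRowJitter(const std::vector<int>& src, unsigned seed) {
    std::minstd_rand rng(seed);                   // deterministic LCG
    std::vector<int> dst(src.size());
    int err = 0;
    for (size_t x = 0; x < src.size(); ++x) {
        int c = src[x] + err;
        int jitter = int(rng() % 9) - 4;          // noise in [-4, 4]
        int cj = c + jitter;
        if (cj < 0) cj = 0; else if (cj > 255) cj = 255;
        int q = cj >> 3;                          // level chosen at a jittered threshold
        dst[x] = q;
        err = c - ((q << 3) | (q >> 2));          // diffuse the *un-jittered* error
    }
    return dst;
}
```

Because the error feedback still tracks the true value, the row's average reconstruction stays close to the source even though individual quantization decisions are randomized.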

 

 

For more elaborate algorithms, compare the overall views and enlarged details produced by these methods:

(I wrote a simple tool to generate these; its output uses the 555-color mode. The program can be downloaded here: http://bbs.meizu.com/thread-1440271-1-1.html
The tool pre-processes images to optimize their display on devices that do not support true-color output.)

 

 

 

Another example: converting an image to a small, fixed color table:

Source image:

Error diffusion optimized for color reproduction first:

 

Error diffusion driven by a visual model, with brightness reproduction first (the effect is much better):

 

Gradient reproduction as the top priority (an image with a strong artistic feel); note how smooth it is! I have not seen this done elsewhere, so it may well be original:

 

C: Parallel Error Diffusion Algorithm

Error diffusion generally processes the image row by row from top to bottom, and the algorithm itself is ill-suited to parallel processing (except for the * 1 (/1) template, whose rows are independent and therefore trivially parallel);

One common suggestion is to stagger the cores: each core processes its row a fixed delay behind the core working on the row above, which satisfies the data dependencies. In practice, though, a multi-tasking OS cannot guarantee that each core makes progress on schedule; the core on the lower row can easily overrun the progress of the row above it, corrupting the result. Guaranteeing progress with locks is unattractive: the code gets complicated, and as the number of cores grows it may even slow down. (And a GPU has far more parallel cores than a CPU.) Some alternatives:

A. If quality requirements are modest, split the image into blocks and hand them to multiple cores in parallel; afterwards, optionally apply some special correction along the block borders;

B. Split the image into blocks of a fixed size so the blocks can run fully in parallel; within each block, assign every point a fixed processing order and a diffusion-coefficient direction. Using these orders and directions, the points can be grouped into a few batches whose members are mutually independent, which yields sufficient concurrency;

C. An iterative algorithm, mainly for systems with very many cores (such as GPUs), launching one core per pixel (maximum parallel granularity): each iteration takes the surrounding points' error values times coefficients (which may change across iterations) plus the current value as input, and produces the optimal target value under a visual model together with a new error value (input value = target value + error value); iteration stops after a fixed number of rounds or once the visual-model difference falls below a threshold. (I have not tested this algorithm.)
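Plan A of this section is easy to sketch: split the image into horizontal bands and dither each band on its own thread; errors are simply not carried across band boundaries, which is the quality compromise mentioned. This is my own sketch with assumed names; the simple per-band routine (right-only diffusion with a crude row-to-row carry) stands in for any of the line functions above.

```cpp
#include <functional>
#include <thread>
#include <vector>

// Stand-in per-band pass: quantize 8-bit gray to 5 bits, diffusing error
// to the right and carrying the leftover error into the next row of the band.
static void diffuseBand(const std::vector<int>& src, std::vector<int>& dst,
                        int width, int y0, int y1) {
    int err = 0;                               // carried across rows inside the band only
    for (int y = y0; y < y1; ++y) {
        for (int x = 0; x < width; ++x) {
            int c = src[y*width + x] + err;
            if (c < 0) c = 0; else if (c > 255) c = 255;
            int q = c >> 3;                    // 5-bit output level
            dst[y*width + x] = q;
            err = c - ((q << 3) | (q >> 2));   // 5->8 reconstruction
        }
    }
}

void diffuseParallel(const std::vector<int>& src, std::vector<int>& dst,
                     int width, int height, int nThreads) {
    std::vector<std::thread> pool;
    int rowsPerBand = (height + nThreads - 1) / nThreads;
    for (int t = 0; t < nThreads; ++t) {
        int y0 = t * rowsPerBand;
        int y1 = y0 + rowsPerBand; if (y1 > height) y1 = height;
        if (y0 >= y1) break;
        pool.emplace_back(diffuseBand, std::cref(src), std::ref(dst),
                          width, y0, y1);      // bands share no error state
    }
    for (auto& th : pool) th.join();
}
```

Since the bands share no state, no locks are needed; the cost is a possible visible seam at each band boundary, which the border correction mentioned in plan A would then patch up.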
