A fast RGB-to-YUV conversion algorithm: the speedup may surprise you!

Last Update:2015-04-27 Source: Internet

Author: User

Tags bmp image

A friend once recommended a PDF about code optimization called "Let Your Software Fly". It left a deep impression on me. To spread the word, and to fix the material in my own memory, I summarized it in a Word document. What follows is a detailed summary of its contents; I hope it helps others.

Speed depends on the algorithm

The same task done with different methods gives different results. A car engine lets you move faster than a horse-drawn carriage but cannot break the sound barrier; a turbine engine easily breaks the sound barrier but cannot leave the Earth; with a rocket engine, you can reach Mars.

How fast code runs depends on the following:

1. The complexity of the algorithm itself: MPEG is more complex than JPEG, and JPEG is more complex than BMP image encoding.

2. The CPU's own speed and architecture.

3. The CPU's bus bandwidth.

4. How you write your own code.

This article is mainly about optimizing your own code to make the software run faster.

First, the requirement.

We have an image pattern recognition project that first needs to convert RGB color images to grayscale images.

The equation for image conversion is as follows:

Y = 0.299 * R + 0.587 * G + 0.114 * B;

The image is 640*480 at 24 bits per pixel; the RGB data is already laid out in memory in RGBRGB order.

Note that the first optimization has already been made quietly.

The following is the definition of input and output:

#define XSIZE 640
#define YSIZE 480
#define IMGSIZE (XSIZE * YSIZE)

typedef struct RGB
{
    unsigned char R;
    unsigned char G;
    unsigned char B;
} rgb;

struct RGB in[IMGSIZE];      // raw data to be processed
unsigned char out[IMGSIZE];  // results of the calculation

First optimization

Optimization principle: an image is a two-dimensional array, but I store it in a one-dimensional array, because compilers generally handle one-dimensional arrays more efficiently than two-dimensional ones.

First, write the code:

Y = 0.299 * R + 0.587 * G + 0.114 * B;

void Calc_lum ()
{
    int i;
    for (i = 0; i < IMGSIZE; i++)
    {
        double r, g, b, y;
        unsigned char yy;
        r = in[i].R;
        g = in[i].G;
        b = in[i].B;
        y = 0.299 * r + 0.587 * g + 0.114 * b;
        yy = y;    // truncate to an 8-bit value
        out[i] = yy;
    }
}

This is probably the simplest way to write it, and nothing looks wrong. Let's compile it and run it.

The first test run

The code was compiled with VC6.0 and with GCC, producing two builds: one running on a PC and one on my embedded system.

How fast is it?

On the PC, which has a hardware floating-point unit and a high enough clock frequency, the run takes 20 seconds.

My embedded system has neither advantage: the compiler breaks the floating-point operations down into integer operations, and the run takes about 120 seconds.
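The article does not say how these times were measured. One simple way to take such measurements, sketched here under that caveat, is the standard clock() function from <time.h>. The names time_calls and dummy_work are hypothetical, and dummy_work is only a stand-in for a routine such as Calc_lum.

```c
#include <time.h>

/* Hypothetical helper: run fn() `repeats` times and return the elapsed
   CPU time in seconds, measured with the standard clock() function. */
static double time_calls(void (*fn)(void), int repeats)
{
    clock_t start = clock();
    for (int k = 0; k < repeats; k++)
        fn();
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Stand-in workload so the sketch is self-contained; `volatile` keeps
   the compiler from removing the loop. */
static volatile unsigned long sink;

static void dummy_work(void)
{
    for (unsigned long n = 0; n < 100000UL; n++)
        sink += n;
}
```

Calling time_calls(dummy_work, 10) returns the seconds spent on ten runs. Repeating the routine several times tends to give steadier numbers than timing a single run.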

Remove floating-point arithmetic

Even before running the code above, I knew it would be slow, because it is full of floating-point operations. If we can avoid floating point, it will be much faster.

Y = 0.299 * R + 0.587 * G + 0.114 * B;

How can this formula be rewritten with fixed-point integer arithmetic?

How can 0.299 * R be simplified?

Y = 0.299 * R + 0.587 * G + 0.114 * B;

Y = D + E + F;

D = 0.299 * R;

E = 0.587 * G;

F = 0.114 * B;

Let's start by simplifying D.

R, G and B are all integers in the range 0~255. The awkward part is the coefficient, but it can be written as a fraction: 0.299 = 299/1000;

So D = (R * 299)/1000;

Y = (R * 299 + G * 587 + B * 114)/1000;

How fast is this version?

On the embedded system it takes 45 seconds;

On the PC it takes 2 seconds.
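To make the step from floating point to integers concrete, here is a minimal sketch that puts the two formulas side by side for a single pixel. The function names lum_float, lum_int1000 and max_diff are hypothetical, not from the original document.

```c
/* Reference: floating-point luminance, truncated as in the first version. */
static unsigned char lum_float(unsigned char r, unsigned char g, unsigned char b)
{
    return (unsigned char)(0.299 * r + 0.587 * g + 0.114 * b);
}

/* Integer version: coefficients scaled by 1000, one division at the end.
   The largest intermediate value is 255 * 1000 = 255000, so the sum is
   computed in long, which is guaranteed to hold at least 32 bits. */
static unsigned char lum_int1000(unsigned char r, unsigned char g, unsigned char b)
{
    return (unsigned char)(((long)r * 299 + (long)g * 587 + (long)b * 114) / 1000);
}

/* Hypothetical check: largest difference between the two versions
   over every possible 24-bit input. */
static int max_diff(void)
{
    int worst = 0;
    for (int r = 0; r < 256; r++)
        for (int g = 0; g < 256; g++)
            for (int b = 0; b < 256; b++) {
                int diff = lum_int1000(r, g, b) - lum_float(r, g, b);
                if (diff < 0) diff = -diff;
                if (diff > worst) worst = diff;
            }
    return worst;
}
```

The integer version is exact, while the floating-point sum can land a hair below a whole number before truncation, so the two results can differ by one gray level but not more.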

Simplifying 0.299 * R further

Y = 0.299 * R + 0.587 * G + 0.114 * B;

Y = (R * 299 + G * 587 + B * 114)/1000;

This equation still looks a little heavy; we can get rid of the division.

The expression for D above can be rewritten like this:

0.299 = 299/1000 ≈ 1224/4096

So D = (R * 1224)/4096

Y = (R * 1224)/4096 + (G * 2404)/4096 + (B * 467)/4096

This simplifies to:

Y = (R * 1224 + G * 2404 + B * 467)/4096

Because 4096 is a power of two (2^12), the division by 4096 can be replaced with a shift: shifting a number right by 12 bits divides it by 4096.

void Calc_lum ()
{
    int i;
    for (i = 0; i < IMGSIZE; i++)
    {
        int r, g, b, y;
        r = 1224 * in[i].R;
        g = 2404 * in[i].G;
        b = 467 * in[i].B;
        y = r + g + b;
        y = y >> 12;    // the shift replaces the division
        out[i] = y;
    }
}

This code compiles and runs 20% faster.

That is a real improvement, but it is still too slow: nobody will accept 20 seconds to process one image.
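One detail of the 12-bit constants is worth a look: 1224 + 2404 + 467 = 4095, not 4096, so pure white (255, 255, 255) comes out as 254. The sketch below shows the article's constants next to a variant that is an assumption of mine, not part of the original: 0.299 * 4096 rounded to 1225 so that the constants sum to 4096, plus 2048 added before the shift to round to nearest. The function names are hypothetical, and int is assumed to be at least 32 bits wide.

```c
/* Luminance with the article's 12-bit fixed-point constants
   (assumes int is at least 32 bits wide). */
static unsigned char lum_shift12(unsigned char r, unsigned char g, unsigned char b)
{
    return (unsigned char)((r * 1224 + g * 2404 + b * 467) >> 12);
}

/* Variant (an assumption, not from the article): constants rounded to
   nearest so they sum to 4096, and 2048 (half of 4096) added before
   the shift so the result is rounded instead of truncated. */
static unsigned char lum_shift12_rounded(unsigned char r, unsigned char g, unsigned char b)
{
    return (unsigned char)((r * 1225 + g * 2404 + b * 467 + 2048) >> 12);
}
```

For pure white, lum_shift12 gives 254 and lum_shift12_rounded gives 255. Whether that one gray level matters depends on the application; for pattern recognition it may well not.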

Take a closer look at this equation!

Y = 0.299 * R + 0.587 * G + 0.114 * B;

Y = D + E + F;

D = 0.299 * R;

E = 0.587 * G;

F = 0.114 * B;

There is something to exploit in the RGB values: R, G and B always lie between 0 and 255, so can we precompute every possible value of D, E and F and then look the results up in a table?

We use 3 arrays to hold the 256 possible values each of D, E and F, and then ...

Lookup table initialization

int d[256], f[256], e[256];

void Table_init ()
{
    int i;
    for (i = 0; i < 256; i++)
    {
        d[i] = i * 1224;
        d[i] = d[i] >> 12;
        e[i] = i * 2404;
        e[i] = e[i] >> 12;
        f[i] = i * 467;
        f[i] = f[i] >> 12;
    }
}

void Calc_lum ()
{
    int i;
    for (i = 0; i < IMGSIZE; i++)
    {
        int r, g, b, y;
        r = d[in[i].R];    // table lookup
        g = e[in[i].G];
        b = f[in[i].B];
        y = r + g + b;
        out[i] = y;
    }
}

This time I broke out in a cold sweat: the execution time dropped from 30 seconds to 2 seconds! When I tested the code on the PC, it finished before I could blink. A 15-fold speedup. Cool, isn't it?
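The table version is not bit-for-bit identical to the shift version. Each table entry is truncated on its own, so the three fractional parts are dropped separately instead of once after the sum, and the table result can be up to 2 gray levels lower. The sketch below makes this visible; the names lum_table, lum_shift and max_table_error are hypothetical.

```c
static int d[256], e[256], f[256];

/* Same initialization as in the article: each term is shifted separately. */
static void table_init(void)
{
    for (int i = 0; i < 256; i++) {
        d[i] = (i * 1224) >> 12;
        e[i] = (i * 2404) >> 12;
        f[i] = (i * 467) >> 12;
    }
}

/* Table version: three lookups and two additions. */
static int lum_table(int r, int g, int b)
{
    return d[r] + e[g] + f[b];
}

/* Shift version: one truncation after the sum. */
static int lum_shift(int r, int g, int b)
{
    return (r * 1224 + g * 2404 + b * 467) >> 12;
}

/* Hypothetical check: how far the table result falls below the shift
   result, taken over all inputs. Call table_init() first. */
static int max_table_error(void)
{
    int worst = 0;
    for (int r = 0; r < 256; r++)
        for (int g = 0; g < 256; g++)
            for (int b = 0; b < 256; b++) {
                int diff = lum_shift(r, g, b) - lum_table(r, g, b);
                if (diff > worst) worst = diff;
            }
    return worst;
}
```

For example, the pixel (3, 1, 0) gives 1 with the shift version and 0 with the tables. For grayscale pattern recognition this loss is probably acceptable, but it is a trade of accuracy for speed rather than a free win.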

Further optimization

Many 32-bit CPUs in embedded systems have at least two ALUs. Can we keep both of them busy?

void Calc_lum ()
{
    int i;
    for (i = 0; i < IMGSIZE; i += 2)    // process 2 pixels per iteration
    {
        int r, g, b, y, r1, g1, b1, y1;
        r = d[in[i].R];    // table lookup; work for the first ALU
        g = e[in[i].G];
        b = f[in[i].B];
        y = r + g + b;
        out[i] = y;
        r1 = d[in[i + 1].R];    // table lookup; work for the second ALU
        g1 = e[in[i + 1].G];
        b1 = f[in[i + 1].B];
        y1 = r1 + g1 + b1;
        out[i + 1] = y1;
    }
}

For two ALUs to work in parallel, the data they process must have no dependencies: the input of one ALU must not be the output of the other.

The result: 1 second.
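The loop above relies on the pixel count being even, which holds for 640*480. A more general sketch, with names and signature that are my own assumptions, takes the buffers and the pixel count as parameters, keeps the two computations independent, and handles a leftover pixel when the count is odd.

```c
#include <stddef.h>

typedef struct { unsigned char R, G, B; } rgb_t;  /* hypothetical type name */

static unsigned short d[256], e[256], f[256];

static void table_init(void)
{
    for (int i = 0; i < 256; i++) {
        d[i] = (unsigned short)((i * 1224) >> 12);
        e[i] = (unsigned short)((i * 2404) >> 12);
        f[i] = (unsigned short)((i * 467) >> 12);
    }
}

/* Two pixels per iteration. y0 and y1 do not depend on each other,
   so a CPU with two ALUs is free to compute them side by side. */
static void calc_lum_unrolled(const rgb_t *in, unsigned char *out, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        int y0 = d[in[i].R] + e[in[i].G] + f[in[i].B];
        int y1 = d[in[i + 1].R] + e[in[i + 1].G] + f[in[i + 1].B];
        out[i] = (unsigned char)y0;
        out[i + 1] = (unsigned char)y1;
    }
    if (i < n)  /* odd pixel count: one pixel is left over */
        out[i] = (unsigned char)(d[in[i].R] + e[in[i].G] + f[in[i].B]);
}
```

Call table_init() once, then calc_lum_unrolled(in, out, n) per image. Whether unrolling by hand still pays off depends on the compiler and the target; many optimizing compilers unroll such loops on their own.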

Look at this code again

int d[256], f[256], e[256];    // lookup tables

void Table_init ()
{
    int i;
    for (i = 0; i < 256; i++)
    {
        d[i] = i * 1224;
        d[i] = d[i] >> 12;
        e[i] = i * 2404;
        e[i] = e[i] >> 12;
        f[i] = i * 467;
        f[i] = f[i] >> 12;
    }
}

This seems fast enough, but after repeated experiments we found ways to go faster still!

We can change

int d[256], f[256], e[256];    // lookup tables

to

unsigned short d[256], f[256], e[256];    // lookup tables

This works because the compiler handles the int type and the unsigned short type with different efficiency.
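One plausible part of the effect is size: smaller tables take up less of the data cache. Taking the idea one step further, which is my own extension and not something the article does, the largest entries are 76, 149 and 29, so the luminance tables even fit in single bytes. The sketch below uses uint8_t from <stdint.h>; the names d8, e8, f8, table_init8 and table_bytes are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte-sized lookup tables (an assumption beyond the article, which
   uses unsigned short). Every entry is at most 149, so it fits. */
static uint8_t d8[256], e8[256], f8[256];

static void table_init8(void)
{
    for (int i = 0; i < 256; i++) {
        d8[i] = (uint8_t)((i * 1224) >> 12);
        e8[i] = (uint8_t)((i * 2404) >> 12);
        f8[i] = (uint8_t)((i * 467) >> 12);
    }
}

/* Total memory used by the three tables, in bytes. */
static size_t table_bytes(void)
{
    return sizeof d8 + sizeof e8 + sizeof f8;
}
```

With 4-byte int the three tables occupy 3072 bytes, with 2-byte unsigned short 1536 bytes, and with bytes 768. Whether narrower elements actually run faster depends on the CPU and compiler, so it is worth measuring on the real target.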

One more change

inline void Calc_lum ()
{
    int i;
    for (i = 0; i < IMGSIZE; i += 2)    // process 2 pixels per iteration
    {
        int r, g, b, y, r1, g1, b1, y1;
        r = d[in[i].R];    // table lookup; work for the first ALU
        g = e[in[i].G];
        b = f[in[i].B];
        y = r + g + b;
        out[i] = y;
        r1 = d[in[i + 1].R];    // table lookup; work for the second ALU
        g1 = e[in[i + 1].G];
        b1 = f[in[i + 1].B];
        y1 = r1 + g1 + b1;
        out[i + 1] = y1;
    }
}

Declaring the function inline lets the compiler embed it in the calling function, which removes the overhead of the function call.

Speed this time: 0.5 seconds.
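A note for readers on newer compilers: under C99 rules, a function defined with plain inline and no external definition elsewhere can fail to link when the compiler decides not to inline it. The usual idiom is static inline. The sketch below applies it to a per-pixel helper; lum_pixel is a hypothetical name, not from the original.

```c
static unsigned short d[256], e[256], f[256];

static void table_init(void)
{
    for (int i = 0; i < 256; i++) {
        d[i] = (unsigned short)((i * 1224) >> 12);
        e[i] = (unsigned short)((i * 2404) >> 12);
        f[i] = (unsigned short)((i * 467) >> 12);
    }
}

/* `static inline` gives the function internal linkage, so the program
   links whether or not the compiler chooses to inline the call. */
static inline unsigned char lum_pixel(unsigned char r, unsigned char g, unsigned char b)
{
    return (unsigned char)(d[r] + e[g] + f[b]);
}
```

A caller would then write out[i] = lum_pixel(in[i].R, in[i].G, in[i].B); inside its loop. Note that inline is only a hint: the compiler still makes the final decision.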

In fact, we can even leave the Earth!

The following measures should make it faster still:

1. Place the table data in the CPU's high-speed data cache.

2. Write the function Calc_lum () in assembly language.

In fact, the CPU has a great deal of untapped potential

1. Do not complain about your CPU. Remember the saying: "With enough thrust, even a brick can fly!"

2. For the same requirement, different ways of writing the code change the run time from 120 seconds to 0.5 seconds. The CPU's potential is huge; it all depends on how you dig it out.

3. I think that if Microsoft's engineers optimized code the way I do, I could probably run Windows XP on a 489!

The above is an excerpt from "Let Your Software Fly". Below, following the expert's approach, I summarize the RGB-to-YUV conversion algorithm.

Y = 0.299R + 0.587G + 0.114B
U = -0.147R - 0.289G + 0.436B
V = 0.615R - 0.515G - 0.100B

#define SIZE 256
#define XSIZE 640
#define YSIZE 480
#define IMGSIZE (XSIZE * YSIZE)

typedef struct RGB
{
    unsigned char R;
    unsigned char G;
    unsigned char B;
} rgb;

struct RGB in[IMGSIZE];          // raw data to be processed
unsigned char out[IMGSIZE * 3];  // results: Y plane, then U plane, then V plane

unsigned short y_r[SIZE], y_g[SIZE], y_b[SIZE], u_r[SIZE], u_g[SIZE], u_b[SIZE], v_r[SIZE], v_g[SIZE], v_b[SIZE];    // lookup tables

void Table_init ()
{
    int i;
    for (i = 0; i < SIZE; i++)
    {
        y_r[i] = (i * 1224) >> 12;    // lookup tables for Y
        y_g[i] = (i * 2404) >> 12;
        y_b[i] = (i * 467) >> 12;
        u_r[i] = (i * 602) >> 12;     // lookup tables for U
        u_g[i] = (i * 1183) >> 12;
        u_b[i] = (i * 1785) >> 12;
        v_r[i] = (i * 2519) >> 12;    // lookup tables for V
        v_g[i] = (i * 2109) >> 12;
        v_b[i] = (i * 409) >> 12;
    }
}

inline void Calc_lum ()
{
    int i;
    for (i = 0; i < IMGSIZE; i += 2)    // process 2 pixels per iteration
    {
        // U and V can be negative; stored in an unsigned char they wrap modulo 256
        out[i] = y_r[in[i].R] + y_g[in[i].G] + y_b[in[i].B];    // Y
        out[i + IMGSIZE] = u_b[in[i].B] - u_r[in[i].R] - u_g[in[i].G];    // U
        out[i + 2 * IMGSIZE] = v_r[in[i].R] - v_g[in[i].G] - v_b[in[i].B];    // V
        out[i + 1] = y_r[in[i + 1].R] + y_g[in[i + 1].G] + y_b[in[i + 1].B];    // Y
        out[i + 1 + IMGSIZE] = u_b[in[i + 1].B] - u_r[in[i + 1].R] - u_g[in[i + 1].G];    // U
        out[i + 1 + 2 * IMGSIZE] = v_r[in[i + 1].R] - v_g[in[i + 1].G] - v_b[in[i + 1].B];    // V
    }
}

According to the expert, this algorithm should be very fast and can be used as-is in future work. ^_^
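One caveat before reusing it directly: U and V are signed. With these coefficients U spans roughly -111 to 111 and V roughly -157 to 157, so writing them straight into an unsigned char wraps around. A common convention, used here as an assumption of mine and not something the article does, is to add 128 and clamp to 0~255. The sketch below keeps the same tables and planar output layout; the names rgb_t, yuv_table_init, clamp_u8 and rgb_to_yuv_planar are hypothetical.

```c
#include <stddef.h>

typedef struct { unsigned char R, G, B; } rgb_t;  /* hypothetical type name */

static unsigned short y_r[256], y_g[256], y_b[256];
static unsigned short u_r[256], u_g[256], u_b[256];
static unsigned short v_r[256], v_g[256], v_b[256];

/* Same 12-bit constants as the article's Table_init. */
static void yuv_table_init(void)
{
    for (int i = 0; i < 256; i++) {
        y_r[i] = (unsigned short)((i * 1224) >> 12);
        y_g[i] = (unsigned short)((i * 2404) >> 12);
        y_b[i] = (unsigned short)((i * 467) >> 12);
        u_r[i] = (unsigned short)((i * 602) >> 12);
        u_g[i] = (unsigned short)((i * 1183) >> 12);
        u_b[i] = (unsigned short)((i * 1785) >> 12);
        v_r[i] = (unsigned short)((i * 2519) >> 12);
        v_g[i] = (unsigned short)((i * 2109) >> 12);
        v_b[i] = (unsigned short)((i * 409) >> 12);
    }
}

/* Saturate an int to the 0..255 range of an unsigned char. */
static unsigned char clamp_u8(int v)
{
    return (unsigned char)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Planar output: n bytes of Y, then n bytes of U, then n bytes of V.
   U and V are stored with a +128 offset (an assumption, see above). */
static void rgb_to_yuv_planar(const rgb_t *in, unsigned char *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int y = y_r[in[i].R] + y_g[in[i].G] + y_b[in[i].B];
        int u = u_b[in[i].B] - u_r[in[i].R] - u_g[in[i].G];
        int v = v_r[in[i].R] - v_g[in[i].G] - v_b[in[i].B];
        out[i] = (unsigned char)y;
        out[i + n] = clamp_u8(u + 128);
        out[i + 2 * n] = clamp_u8(v + 128);
    }
}
```

Call yuv_table_init() once, then rgb_to_yuv_planar(in, out, n) with an output buffer of 3 * n bytes. Black comes out as Y = 0, U = 128, V = 128, while pure red saturates V at 255, which shows that clamping still loses the extreme values; scaled chroma coefficients would avoid that, at the cost of departing from the formulas above.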

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
