Let Your Software Fly: Algorithm Optimization

Source: Internet
Author: User
Tags: bmp, image

Reposted from the web:

Today I happened to come across an article called "Let Your Software Fly". Reading it was quite a surprise, and it left me with a few thoughts of my own:


How fast code runs depends on the following factors:

1. The complexity of the algorithm itself. For example, MPEG is more complex than JPEG, and JPEG is more complex than BMP image encoding.

2. The speed and architecture of the CPU itself.

3. The bus bandwidth of the CPU.

4. The way you write your own code.

This article mainly describes how to optimize your own code to make the software run faster.

Let's look at my requirements first.

We have an image pattern recognition project in which the RGB color image must first be converted to a black-and-white (grayscale) image.

The image conversion formula is as follows: Y = 0.299 * R + 0.587 * G + 0.114 * B;

The image size is 640*480 at 24 bits per pixel. The RGB image is already laid out in memory in RGBRGB order.

I have already quietly made the first optimization.

The following is the definition of input and output:

#define XSIZE 640
#define YSIZE 480
#define IMGSIZE (XSIZE * YSIZE)

typedef struct RGB
{
    unsigned char R;
    unsigned char G;
    unsigned char B;
} RGB;

struct RGB in[IMGSIZE];      // Raw data to be processed
unsigned char out[IMGSIZE];  // Results of the calculation

Optimization One:

Optimization principle: the image is a 2D array, but I store it in a one-dimensional array. The compiler handles one-dimensional arrays more efficiently than two-dimensional arrays.

First, write the code:

Y = 0.299 * R + 0.587 * G + 0.114 * B;

void Calc_lum()
{
    int i;
    for (i = 0; i < IMGSIZE; i++)
    {
        double r, g, b, y;
        unsigned char yy;

        r = in[i].R;
        g = in[i].G;
        b = in[i].B;
        y = 0.299 * r + 0.587 * g + 0.114 * b;
        yy = y;        // Truncate the floating-point result to 8 bits
        out[i] = yy;
    }
}

This is probably the simplest way to write it, and I really can't see anything wrong with it. Fine, let's compile it and run it.

The first test run

This code was compiled with VC 6.0 and with GCC, producing two versions that ran on the PC and on my embedded system, respectively.

How fast is it?

On the PC, thanks to the hardware floating-point unit and a sufficiently high CPU clock, the run takes 20 seconds.

My embedded system has neither of these two advantages. The compiler breaks the floating-point operations down into integer operations, and the run takes about 120 seconds.

  

Optimization Two: removing floating-point arithmetic

Even before running the code above, I knew it would be slow, because it contains a lot of floating-point operations. If we can avoid floating-point arithmetic, it will be much faster.

Y = 0.299 * R + 0.587 * G + 0.114 * B;

How can this formula be substituted with fixed-point integer operations?

How can 0.299 * r be simplified?

Y = 0.299 * R + 0.587 * G + 0.114 * B;

Y = D + E + F;

D = 0.299 * R;

E = 0.587 * G;

F = 0.114 * B;

Let's simplify formula D first!

The RGB values range from 0 to 255 and are all integers. The coefficient is the troublesome part, but it can be expressed as: 0.299 = 299/1000;

So D = (R * 299)/1000;

Y = (R * 299 + G * 587 + B * 114)/1000;

With this change, how fast does it run?

On the embedded system it takes 45 seconds;

On the PC it takes 2 seconds;
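The text gives only the formula for this step, so here is a minimal sketch of what the integer-only version might look like for a single pixel. The helper name `lum_int` is my own and not from the article, and `long` is used on the assumption that `int` may be only 16 bits on a small embedded target.

```c
#include <assert.h>

/* Integer-only luminance: Y = (299*R + 587*G + 114*B) / 1000.
   lum_int is a hypothetical helper name, not part of the original code. */
static unsigned char lum_int(unsigned char r, unsigned char g, unsigned char b)
{
    /* The largest possible sum is 255 * 1000 = 255000, which does not fit
       in 16 bits, so the arithmetic is done in long. */
    long sum = 299L * r + 587L * g + 114L * b;
    long y = sum / 1000;

    assert(y >= 0 && y <= 255);   /* the three weights add up to exactly 1000 */
    return (unsigned char)y;
}
```

For example, `lum_int(255, 0, 0)` computes 76245 / 1000 and returns 76, and pure white returns 255.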

Optimization Three: converting the division into a shift operation

Y = (R * 299 + G * 587 + B * 114)/1000;

This formula still does more work than it needs to: the division can be eliminated.

The preceding formula D can be written like this:

0.299 = 299/1000 ≈ 1224/4096

So D = (R * 1224)/4096

Y = (R * 1224)/4096 + (G * 2404)/4096 + (B * 467)/4096

Then simplify to:

Y = (R * 1224 + G * 2404 + B * 467)/4096

Because 4096 is a power of two (2^12), the division by 4096 can be replaced with a shift operation: shifting right by 12 bits divides a number by 4096.
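As a quick check of the arithmetic, the sketch below compares the shift form with the plain division for the constants 1224, 2404 and 467. The function names are hypothetical, and the code assumes `int` is at least 32 bits, since the largest sum is 255 * 4095 = 1044225. One thing worth knowing: the three constants add up to 4095 rather than 4096, so pure white comes out as 254 rather than 255.

```c
#include <assert.h>

/* 0.299, 0.587 and 0.114 scaled by 4096, as given in the text */
enum { KR = 1224, KG = 2404, KB = 467, SHIFT = 12 };

/* Shift form: right shift by 12 bits instead of dividing by 4096 */
static int lum_shift(int r, int g, int b)
{
    int sum = KR * r + KG * g + KB * b;   /* assumes int holds at least 32 bits */
    assert(sum >= 0);                     /* the shift matches division only for non-negative sums */
    return sum >> SHIFT;
}

/* Division form, kept only for comparison */
static int lum_div(int r, int g, int b)
{
    return (KR * r + KG * g + KB * b) / 4096;
}
```

For inputs in the range 0 to 255 both functions return the same value; the shift simply avoids the division instruction (or the library call that replaces it on CPUs without a hardware divider).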

void Calc_lum()
{
    int i;
    for (i = 0; i < IMGSIZE; i++)
    {
        int r, g, b, y;

        r = 1224 * in[i].R;
        g = 2404 * in[i].G;
        b = 467 * in[i].B;
        y = r + g + b;
        y = y >> 12;   // This removes the division operation
        out[i] = y;
    }
}

Compiled and run, this code is another 20% faster.

That is a lot faster, but still too slow: 20 seconds to process one image is something nobody on Earth could accept.

Optimization Four: turning a finite set of calculation results into a lookup table

There is something to exploit in the RGB values: each one is always greater than or equal to 0 and less than or equal to 255. Can we precompute all the values of D, E and F, and then use table lookups for the calculation?

We use 3 arrays to store the 256 possible values of D, E and F, and then ...

Lookup table initialization:

int d[256], f[256], e[256];

void Table_init()
{
    int i;
    for (i = 0; i < 256; i++)
    {
        d[i] = i * 1224;
        d[i] = d[i] >> 12;
        e[i] = i * 2404;
        e[i] = e[i] >> 12;
        f[i] = i * 467;
        f[i] = f[i] >> 12;
    }
}

void Calc_lum()
{
    int i;
    for (i = 0; i < IMGSIZE; i++)
    {
        int r, g, b, y;

        r = d[in[i].R];   // Table lookup
        g = e[in[i].G];
        b = f[in[i].B];
        y = r + g + b;
        out[i] = y;
    }
}

This time I broke into a cold sweat: the execution time dropped from 30 seconds to 2 seconds! When I tested this code on the PC, it finished before I could even blink. A 15-fold speedup. Cool, right?
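One side effect of the table approach is worth checking. Each table entry is shifted (truncated) on its own, so the sum of three entries can come out slightly lower than shifting the full sum once. Each of the three truncations loses less than one, so the shortfall can never exceed 2 gray levels. The sketch below measures it over every possible RGB value; the function names are hypothetical and `int` is assumed to be at least 32 bits.

```c
#include <assert.h>

static int d[256], e[256], f[256];

/* Same initialization as in the text: each entry is shifted separately */
static void init_tables(void)
{
    int i;
    for (i = 0; i < 256; i++) {
        d[i] = (i * 1224) >> 12;
        e[i] = (i * 2404) >> 12;
        f[i] = (i * 467) >> 12;
    }
}

/* Largest amount by which the table sum falls below the single-shift result */
static int max_table_shortfall(void)
{
    int r, g, b, worst = 0;
    for (r = 0; r < 256; r++)
        for (g = 0; g < 256; g++)
            for (b = 0; b < 256; b++) {
                int direct = (1224 * r + 2404 * g + 467 * b) >> 12;
                int table = d[r] + e[g] + f[b];
                assert(table <= direct);   /* truncating each term can only lose value */
                if (direct - table > worst)
                    worst = direct - table;
            }
    return worst;
}
```

Call `init_tables()` first and then `max_table_shortfall()`. For a recognition front end a difference this small is usually harmless, but it is a trade-off to be aware of.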

Optimization Five: converting a single serial path into multiple parallel paths

Many 32-bit CPUs in embedded systems have at least 2 ALUs. Can we get both ALUs running?

void Calc_lum()
{
    int i;
    for (i = 0; i < IMGSIZE; i += 2)   // Process 2 pixels in parallel
    {
        int r, g, b, y, r1, g1, b1, y1;

        r = d[in[i].R];        // Table lookup; this part goes to the first ALU
        g = e[in[i].G];
        b = f[in[i].B];
        y = r + g + b;
        out[i] = y;

        r1 = d[in[i + 1].R];   // Table lookup; this part goes to the second ALU
        g1 = e[in[i + 1].G];
        b1 = f[in[i + 1].B];
        y1 = r1 + g1 + b1;     // Separate variable, so the two halves share nothing
        out[i + 1] = y1;
    }
}

The data processed by the 2 ALUs must not have data dependencies. That is to say, the input of one ALU cannot be the output of the other ALU; only then can they run in parallel.

The result: 1 second.
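The same idea can be written so that it also copes with an odd number of pixels. This is only a sketch under my own assumptions: `Pixel`, `init_tables` and `lum_unrolled` are hypothetical names, and whether the two halves really execute in parallel depends on the CPU and the compiler.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { unsigned char R, G, B; } Pixel;   /* hypothetical stand-in for struct RGB */

static int d[256], e[256], f[256];

static void init_tables(void)
{
    int i;
    for (i = 0; i < 256; i++) {
        d[i] = (i * 1224) >> 12;
        e[i] = (i * 2404) >> 12;
        f[i] = (i * 467) >> 12;
    }
}

/* Two pixels per iteration; y0 and y1 do not depend on each other */
static void lum_unrolled(const Pixel *src, unsigned char *dst, size_t n)
{
    size_t i;
    assert(src != NULL && dst != NULL);
    for (i = 0; i + 1 < n; i += 2) {
        int y0 = d[src[i].R] + e[src[i].G] + f[src[i].B];
        int y1 = d[src[i + 1].R] + e[src[i + 1].G] + f[src[i + 1].B];
        dst[i] = (unsigned char)y0;
        dst[i + 1] = (unsigned char)y1;
    }
    if (i < n) {   /* leftover pixel when n is odd */
        dst[i] = (unsigned char)(d[src[i].R] + e[src[i].G] + f[src[i].B]);
    }
}
```

With 640*480 pixels the count is even, so the tail branch never runs for the image in this article; it is there only so the function stays correct for other sizes.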

Optimization Six: changing the data type or bit width

Look at this code:

int d[256], f[256], e[256];   // Lookup tables

void Table_init()
{
    int i;
    for (i = 0; i < 256; i++)
    {
        d[i] = i * 1224;
        d[i] = d[i] >> 12;
        e[i] = i * 2404;
        e[i] = e[i] >> 12;
        f[i] = i * 467;
        f[i] = f[i] >> 12;
    }
}

At this point it seems fast enough, but after repeated experiments we found that there is a way to go even faster!

Change

int d[256], f[256], e[256];   // Lookup tables

to

unsigned short d[256], f[256], e[256];   // Lookup tables

This works because the compiler does not handle the int type and the unsigned short type with the same efficiency.

Then one more change:
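The text does not say why the narrower type is faster. One plausible contributor, which is my assumption and not something the article measures, is that the tables take up less memory and so put less pressure on the cache. The sketch below builds the tables as `unsigned short` and shows that every entry fits comfortably: the largest value is e[255] = 149. The names ending in `16` are hypothetical.

```c
#include <assert.h>

static unsigned short d16[256], e16[256], f16[256];   /* hypothetical names */

static void table_init16(void)
{
    int i;
    for (i = 0; i < 256; i++) {
        /* The product is computed in long so it also works where int is 16 bits;
           only the shifted result is stored in the table. */
        d16[i] = (unsigned short)((i * 1224L) >> 12);
        e16[i] = (unsigned short)((i * 2404L) >> 12);
        f16[i] = (unsigned short)((i * 467L) >> 12);
        assert(e16[i] <= 149);   /* largest entry in any of the three tables */
    }
}
```

On a typical target with 32-bit `int` and 16-bit `short`, the three tables shrink from 3072 bytes to 1536 bytes. The actual sizes depend on the platform, so measure on your own hardware before relying on this.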

inline void Calc_lum()
{
    int i;
    for (i = 0; i < IMGSIZE; i += 2)   // Process 2 pixels in parallel
    {
        int r, g, b, y, r1, g1, b1, y1;

        r = d[in[i].R];        // Table lookup; this part goes to the first ALU
        g = e[in[i].G];
        b = f[in[i].B];
        y = r + g + b;
        out[i] = y;

        r1 = d[in[i + 1].R];   // Table lookup; this part goes to the second ALU
        g1 = e[in[i + 1].G];
        b1 = f[in[i + 1].B];
        y1 = r1 + g1 + b1;     // Separate variable, so the two halves share nothing
        out[i + 1] = y1;
    }
}

Declare the function inline so that the compiler embeds its body into the calling function, which removes the overhead of the CPU making a function call.
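A note for anyone trying this with a C99 compiler: a plain `inline` definition with no matching external definition can fail to link if the compiler decides not to inline the call. Writing `static inline` avoids that. The sketch below shows the pattern on a small per-pixel helper; `lum_pixel`, `convert` and `init_tables` are hypothetical names, and the input is assumed to be packed RGBRGB bytes as described earlier.

```c
#include <assert.h>

static unsigned short d[256], e[256], f[256];

static void init_tables(void)
{
    int i;
    for (i = 0; i < 256; i++) {
        d[i] = (unsigned short)((i * 1224L) >> 12);
        e[i] = (unsigned short)((i * 2404L) >> 12);
        f[i] = (unsigned short)((i * 467L) >> 12);
    }
}

/* static inline: the compiler may paste the body into each caller, and the
   function stays private to this file, so there is nothing for the linker to miss */
static inline unsigned char lum_pixel(unsigned char r, unsigned char g, unsigned char b)
{
    return (unsigned char)(d[r] + e[g] + f[b]);
}

/* rgb points at packed RGBRGB... bytes; gray receives one byte per pixel */
static void convert(const unsigned char *rgb, unsigned char *gray, int npixels)
{
    int i;
    assert(npixels >= 0);
    for (i = 0; i < npixels; i++)
        gray[i] = lum_pixel(rgb[3 * i], rgb[3 * i + 1], rgb[3 * i + 2]);
}
```

`inline` is only a hint. Whether the call really disappears depends on the compiler and the optimization level, so check the generated code or measure.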

The speed this time: 0.5 seconds.

Main idea: trading space for time.
