A friend once recommended a PDF about code optimization called "Let Your Software Fly". It left a deep impression on me, so to spread the word, and to cement my own understanding, I summarized it as a document. Here is a detailed summary of its contents; I hope it is useful to others.
Speed depends on the algorithm
The same task done by different methods gives very different results. Think of engines: a car engine moves you faster than a horse-drawn carriage, but never past the speed of sound; a turbojet engine breaks the sound barrier easily, but cannot leave the Earth; with a rocket engine, you can reach Mars.
How fast code runs depends on the following:
1. The inherent complexity of the algorithm: MPEG encoding is more complex than JPEG, and JPEG more complex than BMP.
2. The CPU's own speed and design.
3. The CPU's bus bandwidth.
4. How you write your own code.
This article focuses on the last item: how to optimize your own code to make your software faster.
First, look at the requirements.
We have an image pattern-recognition project that needs to convert RGB color images into grayscale images first.
The equation for image conversion is as follows:
Y = 0.299 * R + 0.587 * G + 0.114 * B;
The image is 640 * 480 * 24 bit; the RGB data is already laid out in memory in RGBRGB... order.
I have already, quietly, made the first optimization.
Here are the input and output definitions:
#define XSIZE 640
#define YSIZE 480
#define IMGSIZE (XSIZE * YSIZE)
typedef struct RGB
{
unsigned char R;
unsigned char G;
unsigned char B;
}rgb;
struct RGB in[IMGSIZE];        /* raw data to be converted */
unsigned char out[IMGSIZE];    /* result of the conversion */
First optimization
The optimization principle: the image is conceptually a 2D array, but I store it in a one-dimensional array, because compilers generate more efficient code for one-dimensional array accesses than for two-dimensional ones.
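For contrast, here is a minimal sketch of the two access patterns (my own illustration, not from the original PDF; the loops just sum the red channel). With 2D indexing, the compiler must form base + (y * XSIZE + x) for every access; with a flat array, the index simply increments:

/* Illustration only: the same pixels traversed two ways. */
unsigned sum_2d(struct RGB img[YSIZE][XSIZE])
{
    unsigned s = 0;
    int x, y;
    for (y = 0; y < YSIZE; y++)
        for (x = 0; x < XSIZE; x++)
            s += img[y][x].R;   /* address = base + (y * XSIZE + x) * sizeof(struct RGB) */
    return s;
}

unsigned sum_1d(struct RGB img[IMGSIZE])
{
    unsigned s = 0;
    int i;
    for (i = 0; i < IMGSIZE; i++)
        s += img[i].R;          /* address simply steps by sizeof(struct RGB) */
    return s;
}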
First, the straightforward version:
Y = 0.299 * R + 0.587 * G + 0.114 * B;
void Calc_lum()
{
    int i;
    for (i = 0; i < IMGSIZE; i++)
    {
        double r, g, b, y;
        unsigned char yy;
        r = in[i].R;
        g = in[i].G;
        b = in[i].B;
        y = 0.299 * r + 0.587 * g + 0.114 * b;
        yy = (unsigned char)y;
        out[i] = yy;
    }
}
This is about the simplest way to write it, and I really can't see anything wrong with it. Fine, compile it and run it.
The first run
The code was compiled with VC6.0 and with GCC, producing two builds: one running on a PC and one on my embedded system.
How fast is it?
On the PC, with its hardware floating-point unit and a high enough clock rate, the conversion takes 20 seconds.
My embedded system has neither advantage; the compiler decomposes every floating-point operation into integer operations, and the conversion takes about 120 seconds.
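The article does not say how the timings were taken; here is a minimal sketch of one way to measure them (my own addition, not from the PDF), using the standard C clock() function. On the embedded target a hardware timer would be used instead:

#include <stdio.h>
#include <time.h>

/* Minimal timing harness (illustration only): measures one call to
   Calc_lum() in CPU seconds. For the faster versions below, call the
   function in a loop and divide by the iteration count. */
int main(void)
{
    clock_t t0, t1;
    t0 = clock();
    Calc_lum();
    t1 = clock();
    printf("Calc_lum took %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    return 0;
}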
Remove the floating-point arithmetic
Even before running the code above, I knew it would be slow, because it is full of floating-point operations. Avoid floating point, and it will be much faster.
Y = 0.299 * R + 0.587 * G + 0.114 * B;
How can this formula be recast as fixed-point integer arithmetic?
How can 0.299 * R be simplified?
Y = 0.299 * R + 0.587 * G + 0.114 * B;
Y = D + E + F;
D = 0.299 * R;
E = 0.587 * G;
F = 0.114 * B;
Let's simplify just the term D!
R, G, and B range over 0~255, all integers. The coefficient is the troublesome part, but it can be written as a fraction: 0.299 = 299/1000.
So D = (R * 299)/1000;
Y = (R * 299 + G * 587 + B * 114)/1000;
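The article gives only the formula for this step; a minimal reconstruction of the loop at this stage (my sketch, not code from the PDF) would be:

/* Intermediate version (reconstruction): all-integer arithmetic,
   three multiplies and one divide per pixel, no floating point. */
void Calc_lum()
{
    int i;
    for (i = 0; i < IMGSIZE; i++)
    {
        int y = (in[i].R * 299 + in[i].G * 587 + in[i].B * 114) / 1000;
        out[i] = (unsigned char)y;
    }
}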
Written this way, how fast can it go?
Speed on the embedded system: 45 seconds.
Speed on the PC: 2 seconds.
How 0.299 * R can be simplified further
Y = 0.299 * R + 0.587 * G + 0.114 * B;
Y = (R * 299 + G * 587 + B * 114)/1000;
The next form looks slightly more complicated, but it lets us cut out the division operation entirely.
The term D above can be rewritten like this:
0.299 = 299/1000 ≈ 1224/4096
So D = (R * 1224) / 4096
Y = (R*1224)/4096 + (G*2404)/4096 + (B*467)/4096
which simplifies to:
Y = (R*1224 + G*2404 + B*467) / 4096
Because 4096 is a power of two, the division by 4096 can be replaced with a shift operation: shifting right by 12 bits divides a number by 4096.
void Calc_lum()
{
    int i;
    for (i = 0; i < IMGSIZE; i++)
    {
        int r, g, b, y;
        r = 1224 * in[i].R;
        g = 2404 * in[i].G;
        b = 467 * in[i].B;
        y = r + g + b;
        y = y >> 12;    /* the shift replaces the division by 4096 */
        out[i] = y;
    }
}
Compiled and measured, this version is about 20% faster.
That is much faster, but still too slow: 20 seconds per image is something no one on Earth can accept.
Take a closer look at the formula!
Y = 0.299 * R + 0.587 * G + 0.114 * B;
Y = D + E + F;
D = 0.299 * R;
E = 0.587 * G;
F = 0.114 * B;
The RGB values give us an opening: R, G, and B are always between 0 and 255, so we can precompute every possible value of D, E, and F, and replace the arithmetic with table lookups.
We use three arrays to store the 256 possible values of each of D, E, and F, and then...
Lookup-table initialization
int d[256],f[256],e[256];
void Table_init()
{
    int i;
    for (i = 0; i < 256; i++)
    {
        d[i] = (i * 1224) >> 12;
        e[i] = (i * 2404) >> 12;
        f[i] = (i * 467) >> 12;
    }
}
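One note of my own: Table_init() must be called once before the first call to Calc_lum(); until then the global tables are all zeros and every pixel would come out black.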
void Calc_lum()
{
    int i;
    for (i = 0; i < IMGSIZE; i++)
    {
        int r, g, b, y;
        r = d[in[i].R];    /* table lookup */
        g = e[in[i].G];
        b = f[in[i].B];
        y = r + g + b;
        out[i] = y;
    }
}
This time I broke out in a cold sweat: the execution time dropped from 30 seconds to 2 seconds! Testing the code on the PC, the run finished before I could blink. A 15x improvement, cool, right?
Continue to optimize
Many 32-bit embedded CPUs have at least two ALUs. Can we get both of them working?
void Calc_lum()
{
    int i;
    for (i = 0; i < IMGSIZE; i += 2)    /* process two pixels per iteration */
    {
        int r, g, b, y, r1, g1, b1, y1;
        r = d[in[i].R];       /* table lookups for the first ALU */
        g = e[in[i].G];
        b = f[in[i].B];
        y = r + g + b;
        out[i] = y;
        r1 = d[in[i + 1].R];  /* table lookups for the second ALU */
        g1 = e[in[i + 1].G];
        b1 = f[in[i + 1].B];
        y1 = r1 + g1 + b1;
        out[i + 1] = y1;
    }
}
For the two ALUs to run in parallel, the data they process must have no dependencies between them; that is, no input of one ALU may be the output of the other.
The result: 1 second.
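To make the dependency rule concrete, here is a toy example of my own (not from the PDF):

/* Illustration only: dependent vs. independent operations. */
int demo(int x, int y, int z, int w)
{
    int a, b, c, d;
    a = x + y;
    b = a + z;      /* depends on a, so it must wait: serial */
    c = x + y;
    d = w + z;      /* c and d are independent: can issue to two ALUs at once */
    return a + b + c + d;
}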
View this Code
int d[256], f[256], e[256];    /* lookup tables */
void Table_init()
{
    int i;
    for (i = 0; i < 256; i++)
    {
        d[i] = (i * 1224) >> 12;
        e[i] = (i * 2404) >> 12;
        f[i] = (i * 467) >> 12;
    }
}
At this point it seems fast enough, but repeated experiments show it can go faster still!
Change
int d[256], f[256], e[256];    /* lookup tables */
to
unsigned short d[256], f[256], e[256];    /* lookup tables */
This is because the compiler handles the int type and the unsigned short type with different efficiency.
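A side effect worth noting (my own arithmetic, assuming a 4-byte int and a 2-byte unsigned short): the three int tables occupy 3 * 256 * 4 = 3072 bytes, while the unsigned short tables occupy 3 * 256 * 2 = 1536 bytes, half the footprint, so they sit more comfortably in a small data cache.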
One more change:
inline void Calc_lum()
{
    int i;
    for (i = 0; i < IMGSIZE; i += 2)    /* process two pixels per iteration */
    {
        int r, g, b, y, r1, g1, b1, y1;
        r = d[in[i].R];       /* table lookups for the first ALU */
        g = e[in[i].G];
        b = f[in[i].B];
        y = r + g + b;
        out[i] = y;
        r1 = d[in[i + 1].R];  /* table lookups for the second ALU */
        g1 = e[in[i + 1].G];
        b1 = f[in[i + 1].B];
        y1 = r1 + g1 + b1;
        out[i + 1] = y1;
    }
}
Declaring the function inline lets the compiler embed it in its caller, removing the overhead of the function call itself.
The speed now: 0.5 seconds.
In fact, we can still fly out of the Earth!
With the following measures, it should get even faster:
1. Place the table data in the CPU's high-speed data cache;
2. Rewrite Calc_lum() in assembly language.
In fact, the CPU has enormous untapped potential
1. Don't complain about your CPU. Remember the saying: "With enough thrust, even a brick can fly!"
2. The same requirement, written differently, went from 120 seconds to 0.5 seconds: the CPU's potential is huge! It is up to you to dig it out.
3. I suspect that if Microsoft's engineers optimized code the way I do, Windows XP could probably run on a 486!
The above is my excerpt of "Let Your Software Fly". Below, following the expert's approach, I summarize the RGB-to-YCbCr conversion algorithm.
Y = 0.299R + 0.587G + 0.114B
U = -0.147R - 0.289G + 0.436B
V = 0.615R - 0.515G - 0.100B
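The magic constants used in Table_init() below are just these coefficients scaled by 4096 (my derivation; the values agree with the code to within a unit of rounding):
0.299 * 4096 ≈ 1224, 0.587 * 4096 ≈ 2404, 0.114 * 4096 ≈ 467 (Y)
0.147 * 4096 ≈ 602, 0.289 * 4096 ≈ 1183, 0.436 * 4096 ≈ 1785 (U)
0.615 * 4096 ≈ 2519, 0.515 * 4096 ≈ 2109, 0.100 * 4096 ≈ 409 (V)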
#define SIZE 256
#define XSIZE 640
#define YSIZE 480
#define IMGSIZE (XSIZE * YSIZE)
typedef struct RGB
{
unsigned char R;
unsigned char G;
unsigned char B;
}rgb;
struct RGB in[IMGSIZE];           /* raw data to be converted */
unsigned char out[IMGSIZE * 3];   /* results: Y, U, and V planes */
unsigned short y_r[SIZE], y_g[SIZE], y_b[SIZE],
               u_r[SIZE], u_g[SIZE], u_b[SIZE],
               v_r[SIZE], v_g[SIZE], v_b[SIZE];    /* lookup tables */
void Table_init()
{
    int i;
    for (i = 0; i < SIZE; i++)
    {
        y_r[i] = (i * 1224) >> 12;    /* lookup tables for Y */
        y_g[i] = (i * 2404) >> 12;
        y_b[i] = (i * 467) >> 12;
        u_r[i] = (i * 602) >> 12;     /* lookup tables for U */
        u_g[i] = (i * 1183) >> 12;
        u_b[i] = (i * 1785) >> 12;
        v_r[i] = (i * 2519) >> 12;    /* lookup tables for V */
        v_g[i] = (i * 2109) >> 12;
        v_b[i] = (i * 409) >> 12;
    }
}
inline void Calc_lum()
{
    int i;
    for (i = 0; i < IMGSIZE; i += 2)    /* process two pixels per iteration */
    {
        /* U and V are signed; the +128 offset (standard in YCbCr) keeps
           them inside the 0..255 range of an unsigned char. */
        out[i] = y_r[in[i].R] + y_g[in[i].G] + y_b[in[i].B];                                     /* Y */
        out[i + IMGSIZE] = 128 + u_b[in[i].B] - u_r[in[i].R] - u_g[in[i].G];                     /* U */
        out[i + 2 * IMGSIZE] = 128 + v_r[in[i].R] - v_g[in[i].G] - v_b[in[i].B];                 /* V */
        out[i + 1] = y_r[in[i + 1].R] + y_g[in[i + 1].G] + y_b[in[i + 1].B];                     /* Y */
        out[i + 1 + IMGSIZE] = 128 + u_b[in[i + 1].B] - u_r[in[i + 1].R] - u_g[in[i + 1].G];     /* U */
        out[i + 1 + 2 * IMGSIZE] = 128 + v_r[in[i + 1].R] - v_g[in[i + 1].G] - v_b[in[i + 1].B]; /* V */
    }
}
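A minimal driver of my own (not from the original) showing the intended call order, build the tables once, then convert:

/* Sketch of intended usage (my addition): in[] is assumed to be
   filled with RGB data before the call. */
int main(void)
{
    Table_init();   /* build the nine lookup tables once at startup */
    Calc_lum();     /* convert one frame: Y, U, V planes written to out[] */
    return 0;
}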
Going by the expert's account, this algorithm should be very fast and can be reused directly in future projects. ^_^
An optimized RGB-to-YUV conversion, its speed may surprise you!