A friend once recommended a PDF about code optimization, "Let your software fly". Reading it left a deep impression on me. To spread the word, and to fix it in my own memory, I have summarized it as a Word document. The summary follows in detail; I hope it helps others.
Speed depends on the algorithm
The same task can be done in different ways, with different results. For example, a car engine can take you faster than a horse-drawn carriage, but not past the speed of sound. A turbine engine can easily break the sound barrier, but cannot leave the earth. With a rocket engine, you can reach Mars.
How fast code runs depends on the following:
1. The complexity of the algorithm itself. For example, MPEG is more complex than JPEG, and JPEG is more complex than BMP.
2. CPU speed and design architecture
3. CPU bus bandwidth
4. How you write your own code
This article describes how to optimize your code to speed up your software.
First, the requirement
In an image pattern recognition project, color images in RGB format must first be converted into grayscale (black and white) images.
The conversion formula is as follows:
Y = 0.299 * R + 0.587 * G + 0.114 * B;
The image is 640 * 480 pixels at 24 bits per pixel. The pixel data is stored in memory in RGBRGB... order.
I have quietly completed the first optimization.
The input and output are defined as follows:
#define xsize 640
#define ysize 480
#define imgsize (xsize * ysize)
typedef struct RGB
{
    unsigned char r;
    unsigned char g;
    unsigned char b;
} RGB;
struct RGB in[imgsize];      // raw data to be processed
unsigned char out[imgsize];  // computed result
First Optimization
Optimization principle: the image is a 2D array, but I store it in a one-dimensional array. Compilers generally handle one-dimensional arrays more efficiently than two-dimensional ones.
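To illustrate the flat storage described above, here is a minimal sketch of how a pixel at column x, row y could be located in a one-dimensional, row-major array. The helper name pixel_index is hypothetical and does not appear in the original article; the article's own loops simply walk the array from start to end and never need it.

```c
#include <stddef.h>

#define xsize 640   /* image width in pixels, as in the article */
#define ysize 480   /* image height in pixels, as in the article */

/* Hypothetical helper (not from the article): map 2D pixel coordinates
   to an offset in a flat, row-major array of xsize * ysize elements. */
static size_t pixel_index(size_t x, size_t y)
{
    return y * xsize + x;
}
```

For example, pixel_index(0, 1) is the first pixel of the second row, at offset 640.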
First, write the code for the formula:
Y = 0.299 * R + 0.587 * G + 0.114 * B;
void calc_lum(void)
{
    int i;
    for (i = 0; i < imgsize; i++)
    {
        double r, g, b, y;
        unsigned char yy;
        r = in[i].r;
        g = in[i].g;
        b = in[i].b;
        y = 0.299 * r + 0.587 * g + 0.114 * b;
        yy = y;  // truncate to 8 bits
        out[i] = yy;
    }
}
This is probably the simplest way to write the code, and I can't see anything wrong with it. Let's compile it and run it.
First test run
The code was compiled with vc6.0 and with GCC to produce two versions, which were run on the PC and on my embedded system respectively.
How fast?
On the PC, thanks to a hardware floating-point unit and a high CPU clock frequency, the run takes 20 seconds.
My embedded system has neither advantage. The compiler breaks floating-point operations down into integer operations, and the run takes about 120 seconds.
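The article does not show how these times were measured. As a minimal sketch, assuming the standard C clock() function is available on the target, a run could be timed like this. The names time_call and dummy_work are hypothetical; dummy_work only stands in for calc_lum() so that the sketch is self-contained.

```c
#include <time.h>

/* Hypothetical helper (not from the article): run fn once and return
   the CPU time it used, in seconds, as reported by clock(). */
static double time_call(void (*fn)(void))
{
    clock_t start = clock();
    fn();
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Stand-in workload; in the article this would be calc_lum(). */
static volatile unsigned long sink;

static void dummy_work(void)
{
    unsigned long i;
    for (i = 0; i < 1000000UL; i++)
        sink += i;
}
```

Calling time_call(calc_lum) before and after each change gives comparable numbers, provided the compiler and its flags stay the same between runs.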
Remove floating point operations
Even before running the code above, I knew it would be slow, because it performs a lot of floating-point operations. Avoiding floating point altogether should make it much faster.
Y = 0.299 * R + 0.587 * G + 0.114 * B;
How can this formula be replaced by fixed-point integer operations?
How can 0.299 * R be simplified?
Y = D + E + F;
D = 0.299 * R;
E = 0.587 * G;
F = 0.114 * B;
Let's simplify the formula for D first!
R, G, and B are all integers in the range 0 to 255. The coefficients are the awkward part, but each can be written as a fraction: 0.299 = 299/1000;
So D = (R * 299) / 1000;
Y = (R * 299 + G * 587 + B * 114) / 1000;
How fast is it now?
On the embedded system it takes 45 seconds;
On the PC it takes 2 seconds;
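The integer formula above can be checked against the floating-point one for a single pixel. The sketch below is only an illustration: the function names lum_float and lum_div1000 are hypothetical, and it assumes int is at least 32 bits wide, since the intermediate sum can reach 255000.

```c
/* Floating-point reference: the article's original formula, truncated
   to an integer the same way the assignment yy = y does. */
static unsigned char lum_float(unsigned char r, unsigned char g, unsigned char b)
{
    double y = 0.299 * r + 0.587 * g + 0.114 * b;
    return (unsigned char)y;
}

/* Integer version: coefficients scaled by 1000, one division at the end.
   Assumes int is at least 32 bits (the sum can reach 255 * 1000). */
static unsigned char lum_div1000(unsigned char r, unsigned char g, unsigned char b)
{
    int y = (r * 299 + g * 587 + b * 114) / 1000;
    return (unsigned char)y;
}
```

For pure red (255, 0, 0) both functions return 76. For some inputs the two can differ by one, because the floating-point sum may land just below an integer before it is truncated.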
How can 0.299 * R be simplified?
Y = 0.299 * R + 0.587 * G + 0.114 * B;
Y = (R * 299 + G * 587 + B * 114) / 1000;
This formula still looks a little complicated. The division can be eliminated as well.
The formula for D can be rewritten as follows:
0.299 = 299/1000 ≈ 1224/4096
So D = (R * 1224) / 4096
Y = (R * 1224) / 4096 + (G * 2404) / 4096 + (B * 467) / 4096
Simplified:
Y = (R * 1224 + G * 2404 + B * 467) / 4096
Because 4096 is a power of 2 (2^12), the division can be replaced by a shift: shifting right by 12 bits divides a number by 4096.
void calc_lum(void)
{
    int i;
    for (i = 0; i < imgsize; i++)
    {
        int r, g, b, y;
        r = 1224 * in[i].r;
        g = 2404 * in[i].g;
        b = 467 * in[i].b;
        y = r + g + b;
        y = y >> 12;  // the division is replaced by a shift here
        out[i] = y;
    }
}
After compiling, this code is another 20% faster.
Although it is somewhat faster, it is still too slow. Taking 20 seconds to process one image is unacceptable.
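One detail of the scaled coefficients is worth a quick check. 0.299 * 4096 is 1224.704, so 1224 is a truncated value, and 1224 + 2404 + 467 adds up to 4095 rather than 4096. The sketch below compares the article's coefficients with a variant that rounds the first one to 1225. That variant is an assumption of mine, not something the article proposes, and both function names are hypothetical.

```c
/* Fixed-point luminance with the article's coefficients (scaled by 4096).
   Shifting right by 12 bits divides the non-negative sum by 4096. */
static unsigned char lum_shift(unsigned char r, unsigned char g, unsigned char b)
{
    int y = (1224 * r + 2404 * g + 467 * b) >> 12;
    return (unsigned char)y;
}

/* Variant (an assumption, not from the article): 1224.704 rounded to 1225,
   so that the three coefficients sum to exactly 4096. */
static unsigned char lum_shift_rounded(unsigned char r, unsigned char g, unsigned char b)
{
    int y = (1225 * r + 2404 * g + 467 * b) >> 12;
    return (unsigned char)y;
}
```

With the article's coefficients, pure white (255, 255, 255) maps to 254; with the rounded variant it maps to 255. Whether that one-level difference matters depends on the application.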
Let's take a closer look at this formula!
Y = 0.299 * R + 0.587 * G + 0.114 * B;
Y = D + E + F;
D = 0.299 * R;
E = 0.587 * G;
F = 0.114 * B;
RGB values are bounded: each one is always between 0 and 255 inclusive. Can we precompute the values of D, E, and F, and then use a lookup table for the calculation?
We use three arrays to store the 256 possible values of D, E, and F, and then...
Lookup table initialization
int D[256], F[256], E[256];
void table_init(void)
{
    int i;
    for (i = 0; i < 256; i++)
    {
        D[i] = i * 1224;
        D[i] = D[i] >> 12;
        E[i] = i * 2404;
        E[i] = E[i] >> 12;
        F[i] = i * 467;
        F[i] = F[i] >> 12;
    }
}
void calc_lum(void)
{
    int i;
    for (i = 0; i < imgsize; i++)
    {
        int r, g, b, y;
        r = D[in[i].r];  // table lookup
        g = E[in[i].g];
        b = F[in[i].b];
        y = r + g + b;
        out[i] = y;
    }
}
This result made me break out in a cold sweat: the execution time dropped from 30 seconds to 2 seconds! Testing the same code on the PC, it finishes in the blink of an eye. A 15-fold speedup: how nice is that?
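The lookup tables shift each term before adding, whereas the previous version added first and shifted once. Each table entry is therefore truncated separately, so the table result can come out slightly lower. The sketch below measures that gap; max_table_loss is a hypothetical helper, not part of the article, and the tables are redefined here only to keep the example self-contained.

```c
static int D[256], E[256], F[256];  /* per-channel lookup tables */

static void table_init(void)
{
    int i;
    for (i = 0; i < 256; i++) {
        D[i] = (i * 1224) >> 12;
        E[i] = (i * 2404) >> 12;
        F[i] = (i * 467) >> 12;
    }
}

/* Hypothetical check (not from the article): the largest amount by which
   the table result falls below the single-shift result, sampling every
   step-th value of each channel. */
static int max_table_loss(int step)
{
    int r, g, b, worst = 0;
    table_init();
    for (r = 0; r < 256; r += step)
        for (g = 0; g < 256; g += step)
            for (b = 0; b < 256; b += step) {
                int single = (1224 * r + 2404 * g + 467 * b) >> 12;
                int table = D[r] + E[g] + F[b];
                if (single - table > worst)
                    worst = single - table;
            }
    return worst;
}
```

Because three values are truncated separately, the loss can never exceed 2 gray levels. If that matters, one option is to keep the unshifted products in the tables and shift once after the addition, at the cost of wider table entries.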
Further optimization
Many 32-bit embedded CPUs have at least two ALUs. Can we keep both of them busy?
void calc_lum(void)
{
    int i;
    for (i = 0; i < imgsize; i += 2)  // two pixels per iteration
    {
        int r, g, b, y, r1, g1, b1, y1;
        r = D[in[i].r];  // table lookup, work for the first ALU
        g = E[in[i].g];
        b = F[in[i].b];
        y = r + g + b;
        out[i] = y;
        r1 = D[in[i + 1].r];  // table lookup, work for the second ALU
        g1 = E[in[i + 1].g];
        b1 = F[in[i + 1].b];
        y1 = r1 + g1 + b1;
        out[i + 1] = y1;
    }
}
The data processed by the two ALUs must not depend on each other. That is, the input of one ALU must not be the output of the other.
Result: 1 second.
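The two-pixel loop above relies on the pixel count being even, which holds for 640 * 480. As a sketch of how the same idea might be written for an arbitrary pixel count, the version below adds a final step for a leftover pixel. The names rgb_px and calc_lum_n are hypothetical, and the tables are repeated only to keep the example self-contained.

```c
#include <stddef.h>

/* Hypothetical stand-in for the article's struct RGB. */
struct rgb_px { unsigned char r, g, b; };

static int D[256], E[256], F[256];  /* per-channel lookup tables */

static void table_init(void)
{
    int i;
    for (i = 0; i < 256; i++) {
        D[i] = (i * 1224) >> 12;
        E[i] = (i * 2404) >> 12;
        F[i] = (i * 467) >> 12;
    }
}

/* Two pixels per iteration. The two halves share no intermediate data,
   so a CPU with two ALUs is free to overlap them. */
static void calc_lum_n(const struct rgb_px *in, unsigned char *out, size_t n)
{
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        int y0 = D[in[i].r] + E[in[i].g] + F[in[i].b];
        int y1 = D[in[i + 1].r] + E[in[i + 1].g] + F[in[i + 1].b];
        out[i] = (unsigned char)y0;
        out[i + 1] = (unsigned char)y1;
    }
    if (i < n)  /* leftover pixel when n is odd */
        out[i] = (unsigned char)(D[in[i].r] + E[in[i].g] + F[in[i].b]);
}
```

Whether the two halves actually execute in parallel depends on the CPU and the compiler; the source code only removes the dependency that would prevent it.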
Look at this code
int D[256], F[256], E[256];  // lookup tables
void table_init(void)
{
    int i;
    for (i = 0; i < 256; i++)
    {
        D[i] = i * 1224;
        D[i] = D[i] >> 12;
        E[i] = i * 2404;
        E[i] = E[i] >> 12;
        F[i] = i * 467;
        F[i] = F[i] >> 12;
    }
}
It looks fast enough, but after some experimenting we found there is still a way to speed it up! Change
int D[256], F[256], E[256];  // lookup tables
to
unsigned short D[256], F[256], E[256];  // lookup tables
This is because the compiler does not handle the int and unsigned short types equally efficiently.
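The article does not explain why the narrower type is faster with this compiler. One possible factor, offered here only as an assumption, is the memory footprint of the tables. The sketch below shows two facts that can be checked directly: how many bytes the three tables occupy for a given element size, and the largest value any entry can hold. Both function names are hypothetical.

```c
#include <stddef.h>

/* Largest value any of the three tables can hold: the biggest
   coefficient (2404) applied to the biggest input (255). */
static int max_table_entry(void)
{
    return (255 * 2404) >> 12;
}

/* Bytes occupied by three 256-entry tables of a given element size. */
static size_t table_bytes(size_t elem_size)
{
    return 3 * 256 * elem_size;
}
```

With a 2-byte unsigned short the tables take 1536 bytes in total. Since the largest entry is 149, the values would also fit in an unsigned char; whether that is faster still would have to be measured on the target.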
Another change
static inline void calc_lum(void)  // static, so the inline definition links cleanly in C99
{
    int i;
    for (i = 0; i < imgsize; i += 2)  // two pixels per iteration
    {
        int r, g, b, y, r1, g1, b1, y1;
        r = D[in[i].r];  // table lookup, work for the first ALU
        g = E[in[i].g];
        b = F[in[i].b];
        y = r + g + b;
        out[i] = y;
        r1 = D[in[i + 1].r];  // table lookup, work for the second ALU
        g1 = E[in[i + 1].g];
        b1 = F[in[i + 1].b];
        y1 = r1 + g1 + b1;
        out[i + 1] = y1;
    }
}
Declaring the function inline lets the compiler embed it in the calling function, which removes the overhead of the function call.
Result this time: 0.5 seconds.
In fact, we can fly out of the earth!
The following measures should make it faster still:
1. Place the table data in the CPU's high-speed data cache;
2. Write the function calc_lum() in assembly language.
In fact, the CPU has great potential.
1. Don't complain about your CPU. Remember the saying: "With enough power, even bricks can fly!"
2. The same requirement, written in different ways, went from 120 seconds to 0.5 seconds. That shows how much potential the CPU has; it all depends on how you tap it.
3. I think that if Microsoft's engineers optimized code the way I do, I could probably run Windows XP on a 489!
The above is an excerpt from "Let your software fly". Below, following the same approach, I summarize the conversion from RGB to YCbCr.
Y = 0.299R + 0.587G + 0.114B
U = -0.147R - 0.289G + 0.436B
V = 0.615R - 0.515G - 0.100B
#define size 256
#define xsize 640
#define ysize 480
#define imgsize (xsize * ysize)
typedef struct RGB
{
    unsigned char r;
    unsigned char g;
    unsigned char b;
} RGB;
struct RGB in[imgsize];          // raw data to be processed
unsigned char out[imgsize * 3];  // computed result: Y plane, then U plane, then V plane
unsigned short Y_R[size], Y_G[size], Y_B[size],
               U_R[size], U_G[size], U_B[size],
               V_R[size], V_G[size], V_B[size];  // lookup tables
void table_init(void)
{
    int i;
    for (i = 0; i < size; i++)
    {
        Y_R[i] = (i * 1224) >> 12;  // lookup tables for Y
        Y_G[i] = (i * 2404) >> 12;
        Y_B[i] = (i * 467) >> 12;
        U_R[i] = (i * 602) >> 12;   // lookup tables for U
        U_G[i] = (i * 1183) >> 12;
        U_B[i] = (i * 1785) >> 12;
        V_R[i] = (i * 2519) >> 12;  // lookup tables for V
        V_G[i] = (i * 2109) >> 12;
        V_B[i] = (i * 409) >> 12;
    }
}
static inline void calc_lum(void)
{
    int i;
    // U and V can be negative; stored in an unsigned char they wrap modulo 256
    for (i = 0; i < imgsize; i += 2)  // two pixels per iteration
    {
        out[i] = Y_R[in[i].r] + Y_G[in[i].g] + Y_B[in[i].b];                // Y
        out[i + imgsize] = U_B[in[i].b] - U_R[in[i].r] - U_G[in[i].g];      // U
        out[i + 2 * imgsize] = V_R[in[i].r] - V_G[in[i].g] - V_B[in[i].b];  // V
        out[i + 1] = Y_R[in[i + 1].r] + Y_G[in[i + 1].g] + Y_B[in[i + 1].b];                // Y
        out[i + 1 + imgsize] = U_B[in[i + 1].b] - U_R[in[i + 1].r] - U_G[in[i + 1].g];      // U
        out[i + 1 + 2 * imgsize] = V_R[in[i + 1].r] - V_G[in[i + 1].g] - V_B[in[i + 1].b];  // V
    }
}
Following the expert's approach, this algorithm should be very fast and can be used directly in the future. ^_^
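To see what the table-based conversion produces for a single pixel, here is a small sketch that returns Y, U, and V as signed integers. The struct yuv type and the function rgb_to_yuv are hypothetical names introduced for illustration, and the sketch assumes int is wider than unsigned short, so that the subtractions are carried out in signed arithmetic.

```c
/* Hypothetical result type: signed fields so U and V keep their sign. */
struct yuv { int y, u, v; };

static unsigned short Y_R[256], Y_G[256], Y_B[256],
                      U_R[256], U_G[256], U_B[256],
                      V_R[256], V_G[256], V_B[256];

static void table_init(void)
{
    int i;
    for (i = 0; i < 256; i++) {
        Y_R[i] = (i * 1224) >> 12;  Y_G[i] = (i * 2404) >> 12;  Y_B[i] = (i * 467) >> 12;
        U_R[i] = (i * 602) >> 12;   U_G[i] = (i * 1183) >> 12;  U_B[i] = (i * 1785) >> 12;
        V_R[i] = (i * 2519) >> 12;  V_G[i] = (i * 2109) >> 12;  V_B[i] = (i * 409) >> 12;
    }
}

/* Convert one pixel with the lookup tables. Assumes int is wider than
   unsigned short, so table entries are promoted to signed int. */
static struct yuv rgb_to_yuv(unsigned char r, unsigned char g, unsigned char b)
{
    struct yuv p;
    p.y = Y_R[r] + Y_G[g] + Y_B[b];
    p.u = U_B[b] - U_R[r] - U_G[g];
    p.v = V_R[r] - V_G[g] - V_B[b];
    return p;
}
```

For pure blue (0, 0, 255) this gives Y = 29, U = 111, V = -25. Stored in an unsigned char, -25 becomes 231, so code that reads the U and V planes back has to account for the sign, for example by adding an offset of 128 before storing, as the usual 8-bit Cb/Cr representation does.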
Note: "Let your software fly" is available at:
http://www.chituwang.com/FileDetail.aspx?ArticleID=13
For more articles, see:
http://www.chituwang.com/video/index.aspx
This article is from the csdn blog; when reproducing it, please credit the source: http://blog.csdn.net/wxzking/archive/2009/07/11/4339650.aspx