Program Optimization

Back in the school lab, I mostly just wrote an algorithm, let the program churn for a while, and took whatever results came out. After starting work, algorithm efficiency became much more important: the code has to go into products sold to customers, and in many cases it has to run in real time on embedded devices, which is a huge amount of work. Along the way I have picked up a bit of knowledge about program optimization, and that is the topic of this post.

The program optimization discussed here means optimizing for runtime efficiency. It generally proceeds in the following three steps:

**1. Algorithm optimization**

**2. Code optimization**

**3. Instruction optimization**

**Algorithm Optimization**

Algorithm optimization is the first and most important step. It starts with analyzing the algorithm's time complexity, that is, how processing time grows with the size of the input. A good algorithm can reduce the complexity by several orders of magnitude, so its average running time is generally lower than that of higher-complexity alternatives (which does not mean it is faster on every input).

Take sorting as an example: quicksort runs in O(n log n) time on average, while insertion sort runs in O(n²). Statistically, quicksort is faster than insertion sort, and the gap widens as the input length n grows. However, if the input is already in ascending (or descending) order, a naive quicksort can actually be slower.

Therefore, to implement the same function, prefer the algorithm with the lower time complexity. For example, consider a two-dimensional Gaussian convolution of an MxN image with a PxQ kernel:

1. Direct 2-D convolution: O(MNPQ)

2. Two 1-D convolutions (the Gaussian kernel is separable): O(MN(P+Q))

3. 1-D convolutions implemented via FFT: O(MN logMN)

4. Recursive Gaussian filtering: O(MN) (see the paper "Recursive implementation of the Gaussian filter"; source code is available in GIMP)

The four approaches above are increasingly efficient, and the last one is generally used.
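As a sketch of the separable approach (the O(MN(P+Q)) variant): a 2-D Gaussian kernel factors into the product of two 1-D kernels, so the filter can be applied as a horizontal pass followed by a vertical pass. This is an illustrative implementation, not GIMP's; the function names and the clamp-to-edge border handling are my own choices.

```c
/* Separable 2-D convolution done as two 1-D passes.
   src/dst are MxN row-major images; kernel has K taps, K odd. */

static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Horizontal 1-D pass. */
void conv1d_rows(const float *src, float *dst, int M, int N,
                 const float *kernel, int K)
{
    int r = K / 2;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            float s = 0.0f;
            for (int k = -r; k <= r; k++)
                s += kernel[k + r] * src[i * N + clampi(j + k, 0, N - 1)];
            dst[i * N + j] = s;
        }
}

/* Vertical 1-D pass. */
void conv1d_cols(const float *src, float *dst, int M, int N,
                 const float *kernel, int K)
{
    int r = K / 2;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            float s = 0.0f;
            for (int k = -r; k <= r; k++)
                s += kernel[k + r] * src[clampi(i + k, 0, M - 1) * N + j];
            dst[i * N + j] = s;
        }
}
```

Each output pixel now costs P+Q multiply-adds instead of P*Q, which is where the complexity reduction comes from.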

In other cases the algorithm itself is complicated and its time complexity is hard to reduce, yet its efficiency still does not meet requirements. Then you need to understand the algorithm well enough to modify it: either preserve the algorithm's output while improving efficiency, or give up some quality in the results in exchange for speed. The right approach depends on the actual situation.

**Code optimization**

Code optimization usually goes hand in hand with algorithm optimization and is mainly about concrete coding techniques. For the same algorithm and functionality, different ways of writing the code can produce huge differences in program efficiency. In general, code optimization focuses on analyzing and restructuring loops, following the principles below:

**A. Move multiplications (and divisions) and redundant calculations out of loops**

The principle is to hoist as many operations as possible out of the loop, and to replace unavoidable multiplications and divisions inside the loop with additions. In the following example, grayscale image data is stored in an array BYTE Image[M*N], and we sum the pixel values of a sub-block (rows R1 to R2, columns C1 to C2). The simple, brute-force version is:

```c
int sum = 0;
for (int i = R1; i < R2; i++)
{
    for (int j = C1; j < C2; j++)
    {
        sum += Image[i * N + j];
    }
}
```

But there is another way of writing:

```c
int sum = 0;
BYTE *pTemp = Image + R1 * N;
for (int i = R1; i < R2; i++, pTemp += N)
{
    for (int j = C1; j < C2; j++)
    {
        sum += pTemp[j];
    }
}
```

We can count the operations in the two versions. Let R = R2 - R1 and C = C2 - C1. In the first version, i++ executes R times, while j++ and the sum += Image[i * N + j] statement each execute R*C times, for a total of 3RC + R additions and RC multiplications (one i * N per inner iteration). By the same analysis, the second version performs 2RC + 2R + 1 additions and only 1 multiplication. Which one performs better is obvious.

**B. Avoid excessive dependencies and branches inside loops so the CPU can pipeline**

CPU pipelining is well documented online and worth reading up on. If the computation or branching logic inside a loop is too complex, the CPU cannot pipeline it, and the loop degenerates into N repeated stretches of sequential code.

In addition, a loop's II (initiation interval) is an important metric: roughly, the number of instruction cycles needed before the next loop iteration can start. The smaller the II, the shorter the execution time. (The original post showed a diagram of a simple CPU pipeline here.)

Let's take a look at the following code:

```c
for (int i = 0; i < N; i++)
{
    if (i < 100) a[i] += 5;
    else if (i < 200) a[i] += 10;
    else a[i] += 20;
}
```

This code does something very simple: it adds a different constant to different ranges of array a. But every iteration has to evaluate up to three branches, which is inefficient and may prevent pipelining. It can be rewritten as three separate loops with no branches inside the loop bodies. Although the amount of code grows, when the array is large (N is big) the rewritten version has a clear efficiency advantage:

```c
for (int i = 0; i < 100; i++)
{
    a[i] += 5;
}
for (int i = 100; i < 200; i++)
{
    a[i] += 10;
}
for (int i = 200; i < N; i++)
{
    a[i] += 20;
}
```

As for dependencies inside a loop, consider the following program:

```c
for (int i = 0; i < N; i++)
{
    int x = f(a[i]);
    int y = g(x);
    int z = h(x, y);
}
```

Here f, g, and h are all functions. In this code, x depends on a[i], y depends on x, and z depends on x and y; each computation must wait for the previous one to finish, which is quite unfavorable to the CPU pipeline, so this style should be avoided where possible. Separately, the C99 restrict keyword can be applied to pointer variables: it tells the compiler that the memory the pointer points to is modified only through that pointer, so the compiler can optimize without worrying about aliasing. The VC compiler does not seem to support the standard keyword (it offers __restrict as an extension), while on a DSP, adding restrict can improve the efficiency of some loops by as much as 90%.
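A minimal sketch of restrict in use (C99; the function and its name are illustrative). The qualifier promises the compiler that the three arrays do not overlap, so it may keep values in registers and pipeline or vectorize the loop instead of reloading memory after every store:

```c
/* restrict: dst, a, b are promised not to alias each other. */
void add_arrays(float *restrict dst,
                const float *restrict a,
                const float *restrict b, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];   /* loads/stores of different iterations may overlap */
}
```

Without restrict, the compiler must assume a store through dst could change a[i+1] or b[i+1] and generate more conservative code.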

**C. Fixed Point**

The idea of fixed-point arithmetic is to convert floating-point operations into integer operations. On a PC the difference is, in my experience, not large, but on many fixed-point DSPs its effect cannot be underestimated. The approach is to multiply the data by a large constant so that all operations become integer calculations. For example, if I only care about the third digit after the decimal point, multiplying all data by 10000 lets integer arithmetic deliver the required precision.
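As a sketch, here is a binary fixed-point variant of the same idea, in Q16.16 format (16 fractional bits) rather than the decimal-by-10000 scaling in the text; the type and macro names are my own. All arithmetic is done on 32-bit integers, and products need a 64-bit intermediate plus a shift back:

```c
/* Q16.16 fixed point: value v represents v / 65536.0 */
typedef int fixed_t;
#define FIX_SHIFT 16
#define FIX_ONE   (1 << FIX_SHIFT)

fixed_t fix_from_int(int v)   { return v << FIX_SHIFT; }
int     fix_to_int(fixed_t v) { return v >> FIX_SHIFT; }

/* Multiply: widen to 64 bits, then shift the extra scale factor back out. */
fixed_t fix_mul(fixed_t a, fixed_t b)
{
    return (fixed_t)(((long long)a * b) >> FIX_SHIFT);
}
```

Additions and subtractions work directly on fixed_t values; only multiplication and division need the rescaling step.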

**D. Trade space for time**

The classic way to trade space for time is the lookup table. Some computations are quite time-consuming, but their arguments take values in a limited range. In that case, the function value for every possible argument can be computed in advance and stored in a table; at run time, each value is simply indexed by the argument. For example:

```c
// direct computation
for (int i = 0; i < N; i++)
{
    double z = sin(a[i]);
}

// lookup-table computation
double aSinTable[360] = {0, ..., 1, ..., 0, ..., -1, ..., 0};
for (int i = 0; i < N; i++)
{
    double z = aSinTable[a[i]];
}
```

The lookup version costs an extra array double aSinTable[360] of space, but it runs much faster.

**E. Pre-allocated memory**

Pre-allocating memory matters mainly when data is processed in a loop. For example, video processing needs some scratch buffer for each frame; if that buffer is allocated and freed for every frame, algorithm efficiency inevitably suffers, as in the first version below:

```c
// process one frame
void Process(BYTE *pimg)
{
    malloc
    ...
    free
}

// process a whole video
for (int i = 0; i < N; i++)
{
    BYTE *pimg = readimage();
    Process(pimg);
}
```

```c
// process one frame
void Process(BYTE *pimg, BYTE *pBuffer)
{
    ...
}

// process a whole video
malloc pBuffer
for (int i = 0; i < N; i++)
{
    BYTE *pimg = readimage();
    Process(pimg, pBuffer);
}
free pBuffer
```

The first version mallocs and frees for every frame; in the second, the caller allocates the buffer once and passes it in, so no allocation or release happens inside the loop. This is of course only a simple illustration, and real situations are much more complicated, but the overall idea carries over.
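The pseudocode above can be made concrete as follows. This is a runnable sketch, not the original author's API: ProcessFrame, ProcessVideo, and the halving "work" inside the frame are all illustrative.

```c
#include <stdlib.h>
#include <string.h>

typedef unsigned char BYTE;

/* Process one frame using caller-provided scratch space. */
void ProcessFrame(const BYTE *pImg, BYTE *pBuffer, size_t size)
{
    memcpy(pBuffer, pImg, size);          /* stand-in for real per-frame work */
    for (size_t i = 0; i < size; i++)
        pBuffer[i] = (BYTE)(pBuffer[i] / 2);
}

/* Drive nFrames frames with a single allocation. */
int ProcessVideo(const BYTE *frames, int nFrames, size_t frameSize, BYTE *out)
{
    BYTE *pBuffer = (BYTE *)malloc(frameSize);   /* allocated once */
    if (!pBuffer) return -1;
    for (int i = 0; i < nFrames; i++) {
        ProcessFrame(frames + (size_t)i * frameSize, pBuffer, frameSize);
        memcpy(out + (size_t)i * frameSize, pBuffer, frameSize);
    }
    free(pBuffer);                               /* freed once */
    return 0;
}
```

The allocation cost is paid once per video instead of once per frame, which also avoids heap fragmentation on long runs.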

**Instruction Optimization**

A program that has been through algorithm and code optimization is generally efficient enough. When special requirements demand squeezing out even more time, instruction optimization takes the stage. Instruction optimization generally means using a specific instruction set to implement certain operations quickly; its other core idea is packed (SIMD) operation. On PCs, Intel's instruction sets include MMX, SSE, and SSE2/3/4; on DSPs it depends on the model, as different models support different instruction sets. The Intel intrinsics can be used through the Intel compiler; after installing icc, its help documentation describes all the instructions in detail.

For example, the MMX intrinsic __m64 _mm_add_pi8(__m64 m1, __m64 m2) adds the eight 8-bit integers packed in m1 to the eight packed in m2, storing each result in the corresponding lane of the return value. So adding two arrays of length N, which normally needs N add instructions, needs only N/8 of these instructions, since each one processes 8 elements.

To average two BYTE arrays, i.e., z[i] = (x[i] + y[i]) / 2, you can either compute the mean directly or use an MMX instruction. The two versions are:

```c
#define N 800
BYTE x[N], y[N], z[N];
// ... initialize x and y ...

// compute the mean directly
for (int i = 0; i < N; i++)
{
    z[i] = (x[i] + y[i]) >> 1;
}

// compute the mean with MMX; here N is a multiple of 8,
// leftover-data handling is not considered
// (note: _mm_avg_pu8 rounds up, i.e. (a + b + 1) >> 1, while >> 1 truncates)
__m64 m64X, m64Y, m64Z;
for (int i = 0; i < N; i += 8)
{
    m64X = *(__m64 *)(x + i);
    m64Y = *(__m64 *)(y + i);
    m64Z = _mm_avg_pu8(m64X, m64Y);
    *(__m64 *)(z + i) = m64Z;
}
```

Points to note when using instruction optimization:

A. Overflow. For example, when two 8-bit values are added, the sum may overflow. If overflow is impossible, eight elements can be processed per instruction; otherwise throughput must be sacrificed by widening the data and using instructions that process four elements at a time.

B. Leftover data. Packed instructions generally process 4, 8, or 16 elements at a time; if the length of the data is not an integer multiple of that count, the remaining elements must be handled separately.
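Point B can be sketched in portable C. The bulk loop below stands in for the packed instruction (with real SIMD, its whole body would be one instruction per iteration); the scalar tail loop handles the n % 8 leftover elements. The function name is illustrative.

```c
/* Add 5 to every element: bulk loop 8 at a time, scalar tail for the rest. */
void add_five(unsigned char *a, int n)
{
    int i = 0;
    for (; i + 8 <= n; i += 8) {      /* "packed" bulk: n / 8 iterations */
        a[i]     += 5; a[i + 1] += 5; a[i + 2] += 5; a[i + 3] += 5;
        a[i + 4] += 5; a[i + 5] += 5; a[i + 6] += 5; a[i + 7] += 5;
    }
    for (; i < n; i++)                /* scalar tail: n % 8 iterations */
        a[i] += 5;
}
```

The same bulk/tail split applies regardless of whether the packed width is 4, 8, or 16.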

**Supplement: how to locate program hotspots**

A program's hotspot is its most time-consuming part, and optimization generally focuses on the hotspot. So how do we find it?

There are two main methods. The first is analysis: reason about the algorithm and inspect the code structure. The hotspot is usually the biggest loop, which is why the optimization techniques above all target loop structures.

The other method is to use tools. On x86 you can use VTune to locate hotspots; on a DSP you can use the profile function of CCS to find the time-consuming functions. Going a step further, you can inspect the compiler-generated asm files and analyze each loop's structure: whether it pipelines, its II value, and its variable dependencies and amount of computation, and then optimize accordingly. (VTune had just been uninstalled when this was written, so no screenshot; the original post showed a loop from an asm file generated by CCS here.)
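When no profiler is at hand, a crude way to confirm a suspected hotspot is to time it directly with the standard C clock(); a minimal sketch (candidate_loop is a made-up stand-in for whatever region of the real program is under suspicion):

```c
#include <time.h>

volatile long long g_sink;   /* volatile so the work is not optimized away */

/* Stand-in for a suspected hotspot. */
void candidate_loop(void)
{
    for (long long i = 0; i < 1000000; i++)
        g_sink += i;
}

/* Time one call of fn in seconds, using the portable clock() API. */
double time_seconds(void (*fn)(void))
{
    clock_t t0 = clock();
    fn();
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

clock() has coarse resolution, so in practice the candidate region is usually run many times inside the timed call; a real profiler remains the better option when available.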

Finally, some code runs much faster when built with the Intel compiler than with the VC compiler; the biggest gap I have encountered is a factor of 10. I have not yet measured gcc's code-generation quality.