In fact, most compilers can perform some simple optimizations by themselves. GCC, for example, optimizes a program when it is given the -O2 or -O3 option. But a compiler's optimizations are always limited, because it must be careful that an optimization does not change the behavior of the program. The programmer should therefore keep optimization in mind when writing the program itself. In my opinion, this is also a good programming habit.
Several simple optimization measures:
1. Code motion
A calculation that is executed many times (for example, inside a loop) but whose result never changes should be moved to a part of the code that is executed only once, before the loop. To give a rather extreme example:
#include <string.h>

/* Convert string to lowercase: slow */
void lower(char *s)
{
    int i;
    /* strlen(s) is re-evaluated on every iteration */
    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
Because C strings are null-terminated, strlen must step through the string one character at a time until it reaches the null character, and here it is called again on every iteration of the loop. If s is a very long string, this function therefore incurs a lot of unnecessary overhead!
Therefore, look for calculations inside a loop body whose result does not change, and move them in front of the loop to avoid repeating them.
Optimized code:
#include <string.h>

/* Convert string to lowercase: faster */
void lower(char *s)
{
    int i;
    int len = strlen(s);   /* computed once, before the loop */
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
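As a self-contained sketch of the same idea, the version below hoists the strlen call out of the loop and uses size_t for the index, which is the type strlen returns. The name lower_hoisted is hypothetical and used only for illustration. Like the original, it assumes an ASCII-compatible character set.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical name, for illustration only.
 * Converts ASCII upper-case letters in s to lower case, in place. */
void lower_hoisted(char *s)
{
    assert(s != NULL);

    /* strlen walks the whole string, so call it once, before the loop. */
    size_t len = strlen(s);

    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');   /* 'A' - 'a' is -32 in ASCII, so this adds 32 */
    }
}
```

With the call in the loop condition, the work grows roughly with the square of the string length, because each iteration rescans the string. With the call hoisted, it grows linearly. A compiler cannot always do this hoisting on its own, because the loop body writes to s and the compiler would have to prove that those writes never change the length of the string.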
2. Eliminate unnecessary memory references
In C, reading or writing through a pointer generally means that the CPU takes the address from a register and then reads or writes memory, while a local variable inside a function can usually be kept in one of the CPU's general-purpose registers. Accessing main memory can be dozens of times slower than accessing a general-purpose register. A small example:
for (i = 0; i < len; i++) {
    *dest = *dest + data[i];
}
This loop body reads and writes from main memory every time, and is optimized as follows:
int acc = *dest;   /* start from the current value so the result matches the loop above */
for (i = 0; i < len; i++) {
    acc = acc + data[i];
}
*dest = acc;
This way the memory behind the pointer is written only once, and acc can be kept in a general-purpose register while the loop runs, which speeds up the loop.
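The sketch below puts both variants side by side as complete functions. The names sum_through_pointer and sum_in_local are hypothetical, and int data is assumed only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names, for illustration only. */

/* Reads and writes *dest on every iteration. */
void sum_through_pointer(const int *data, int len, int *dest)
{
    assert(data != NULL && dest != NULL);
    for (int i = 0; i < len; i++)
        *dest = *dest + data[i];
}

/* Accumulates in a local variable and writes *dest once at the end. */
void sum_in_local(const int *data, int len, int *dest)
{
    assert(data != NULL && dest != NULL);
    int acc = *dest;
    for (int i = 0; i < len; i++)
        acc = acc + data[i];
    *dest = acc;
}
```

The two functions give the same result as long as dest does not point into data. If it does, they can differ: with data = {1, 2, 3} and dest pointing at data[2], the first version leaves 12 in data[2] and the second leaves 9. This possibility of aliasing is one reason a compiler may not make the transformation by itself, so the programmer has to decide that the overlapping case cannot occur.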
3. Loop unrolling
Loop unrolling, as the name implies, means processing two or more elements in each iteration instead of one, which reduces the number of iterations. Loop unrolling improves performance in two ways. First, it reduces the number of operations that do not directly contribute to the program's result, such as loop index calculations and conditional branches. Second, it opens the way to further code changes that reduce the number of operations on the critical path of the computation. Compare the following two functions: the first is the ordinary loop, the second is the unrolled version.
/* Ordinary function that adds all elements of v */
void combine1(vec_ptr v, data_t *dest)
{
    int i = 0;
    long int length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;
    for (i = 0; i < length; i++) {
        acc = acc + data[i];
    }
    *dest = acc;
}
/* Unroll loop by 2 */
void combine2(vec_ptr v, data_t *dest)
{
    int i;
    long int length = vec_length(v);
    long int limit = length - 1;
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;
    /* Combine two elements per iteration */
    for (i = 0; i < limit; i += 2) {
        acc = (acc + data[i]) + data[i+1];
    }
    /* Finish any remaining element */
    for (; i < length; i++) {
        acc = acc + data[i];
    }
    *dest = acc;
}
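In the second function, the first loop stops at limit = length - 1, so data[i+1] never reads past the end of the array, and the final loop picks up the last element when length is odd, so nothing is missed. The function is faster because the number of iterations, and with it the loop overhead, is roughly halved.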
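The boundary handling can be tried out with a self-contained sketch that works on a plain array instead of the vector type. It assumes data_t is long and IDENT is 0 (summation). The name sum_unrolled2 is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions for this sketch: data_t is long and IDENT is 0 (summation).
 * sum_unrolled2 is a hypothetical name. */
typedef long data_t;
#define IDENT 0

data_t sum_unrolled2(const data_t *data, long length)
{
    assert(data != NULL || length == 0);

    long i;
    long limit = length - 1;
    data_t acc = IDENT;

    /* Main loop: two elements per iteration, so about half as many
     * index updates and loop tests. */
    for (i = 0; i < limit; i += 2)
        acc = (acc + data[i]) + data[i + 1];

    /* Clean-up loop: runs at most once, when length is odd. */
    for (; i < length; i++)
        acc = acc + data[i];

    return acc;
}
```

For an array holding 1, 2, 3, 4, 5, the main loop handles the pairs (1, 2) and (3, 4) and the clean-up loop adds the 5, giving 15. With a length of 0 or 1 the main loop is skipped entirely.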
4. Improve parallelism
Inside the CPU, the program runs as machine instructions, but they are not executed strictly one at a time in sequence. The CPU is pipelined and can execute several instructions that do not depend on one another at the same time. This is a feature of the hardware, and all we have to do is take advantage of it.
Let's analyze the statement in the inner loop of combine2: acc = (acc + data[i]) + data[i+1]; In each iteration, the addition of data[i+1] must wait for the result of (acc + data[i]), because the second addition depends on the first. This is obviously not conducive to parallel execution. It can be improved as follows.
/* Unroll loop by 2, 2-way parallelism */
void combine3(vec_ptr v, data_t *dest)
{
    int i;
    long int length = vec_length(v);
    long int limit = length - 1;
    data_t *data = get_vec_start(v);
    data_t acc0 = IDENT;
    data_t acc1 = IDENT;
    /* Combine two elements per iteration, one per accumulator */
    for (i = 0; i < limit; i += 2) {
        acc0 = acc0 + data[i];
        acc1 = acc1 + data[i+1];
    }
    /* Finish any remaining element */
    for (; i < length; i++) {
        acc0 = acc0 + data[i];
    }
    *dest = acc0 + acc1;
}
This code splits acc into acc0 and acc1, so the two additions in each iteration are independent of each other and can proceed in parallel. The two partial results are added together at the end, which improves the program's performance.
Code optimization often reduces readability, so the trade-off should be considered carefully, and explanatory comments should be added where necessary.
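A self-contained sketch of the two-accumulator idea on a plain array is given below, again assuming that data_t is long and IDENT is 0. The name sum_2x2 is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions for this sketch: data_t is long and IDENT is 0 (summation).
 * sum_2x2 is a hypothetical name. */
typedef long data_t;
#define IDENT 0

data_t sum_2x2(const data_t *data, long length)
{
    assert(data != NULL || length == 0);

    long i;
    long limit = length - 1;
    data_t acc0 = IDENT;   /* sum of the elements at even indices */
    data_t acc1 = IDENT;   /* sum of the elements at odd indices */

    /* The two additions do not depend on each other, so the CPU
     * is free to overlap them. */
    for (i = 0; i < limit; i += 2) {
        acc0 = acc0 + data[i];
        acc1 = acc1 + data[i + 1];
    }

    /* Clean-up loop: a leftover element always has an even index. */
    for (; i < length; i++)
        acc0 = acc0 + data[i];

    return acc0 + acc1;
}
```

For integer addition this regrouping gives the same result as the single-accumulator loop. With floating-point data the additions are performed in a different order, so the result may differ slightly because of rounding, and that is worth checking before applying the technique.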
Program performance optimization using C as an example: reading notes on Chapter 5 of "Computer Systems: A Programmer's Perspective"