I. Code Motion
A computation that is performed many times inside a loop, but whose result never changes, can be moved outside the loop.
Example:
Before optimization:
void lower1(char *s) {
    int i;
    for (i = 0; i < strlen(s); ++i)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
After optimization:
void lower2(char *s) {
    int i;
    int len = strlen(s);
    for (i = 0; i < len; ++i)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
In the pre-optimized version, every loop test calls strlen to recompute the length of s, so the actual complexity becomes O(n²); the optimized version computes the length only once, which is why it performs better.
II. Reduce Function Calls
Example:
Before optimization:
void sum1(vec_ptr v, data_t *dest) {
    int i;
    int len = vec_length(v);
    *dest = 0;
    for (i = 0; i < len; ++i) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}
After optimization:
data_t *get_vec_start(vec_ptr v) {
    return v->data;
}
void sum2(vec_ptr v, data_t *dest) {
    int i;
    int len = vec_length(v);
    data_t *data = get_vec_start(v);
    *dest = 0;
    for (i = 0; i < len; ++i)
        *dest += data[i];
}
The pre-optimized version calls get_vec_element once per iteration to fetch the corresponding element; the optimized version calls get_vec_start once, before the loop, to obtain the starting address of the underlying array, and the loop then accesses memory directly without any function call.
III. Reduce Memory Access
Example:
Before optimization:
void sum2(vec_ptr v, data_t *dest) {
    int i;
    int len = vec_length(v);
    data_t *data = get_vec_start(v);
    *dest = 0;
    for (i = 0; i < len; ++i)
        *dest += data[i];
}
After optimization:
void sum3(vec_ptr v, data_t *dest) {
    int i;
    int len = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t acc = 0;
    for (i = 0; i < len; ++i)
        acc += data[i];
    *dest = acc;
}
The pre-optimized version reads the value at dest, adds data[i], and writes the result back to dest on every iteration. These reads and writes are wasted work: the value read from dest at the start of each iteration is exactly the value the previous iteration wrote back. The optimized version introduces a temporary variable acc, accumulates the results in the loop, and writes acc back to dest only once, after the loop ends.
The difference is clearly visible in the assembly the compiler generates for the two versions (the listings are not reproduced here):
Before optimization: the second and fourth instructions of the loop body read from and write to *dest, respectively.
After optimization: the compiler keeps acc directly in a register, so the loop body performs no memory reads or writes for the accumulator.
IV. Loop Unrolling
Loop unrolling reduces the number of loop iterations and improves performance in two ways. First, it reduces work that does not contribute directly to the result, such as updating the loop counter and executing branch instructions. Second, it exposes opportunities for further optimizations that exploit the machine's characteristics.
Example:
For the pre-optimization code, see sum3 above.
After optimization:
void sum4(vec_ptr v, data_t *dest) {
    int i;
    int len = vec_length(v);
    int limit = len - 3;
    data_t *data = get_vec_start(v);
    data_t acc = 0;
    for (i = 0; i < limit; i += 4) {
        acc = acc + data[i] + data[i + 1];
        acc = acc + data[i + 2] + data[i + 3];
    }
    for (; i < len; ++i)
        acc += data[i];
    *dest = acc;
}
With the loop unrolled, each iteration accumulates 4 elements, reducing the iteration count and thus the total execution time. (Used alone, this optimization gives almost no improvement for floating-point addition, but integer addition benefits noticeably, because the compiler can reassociate the integer code.)
The compiler can also perform this optimization for you: at higher optimization levels it will unroll loops automatically. With GCC, the -funroll-loops option requests it explicitly.
V. Improving Parallelism
Most modern processors use pipelining and superscalar execution, which provide instruction-level parallelism. We can exploit this feature to optimize the code further.
5.1 Using Multiple Accumulators
Optimized code:
void sum5(vec_ptr v, data_t *dest) {
    int i;
    int len = vec_length(v);
    int limit = len - 1;
    data_t *data = get_vec_start(v);
    data_t acc0 = 0;
    data_t acc1 = 0;
    for (i = 0; i < limit; i += 2) {
        acc0 += data[i];
        acc1 += data[i + 1];
    }
    for (; i < len; ++i)
        acc0 += data[i];
    *dest = acc0 + acc1;
}
Combining loop unrolling with multiple accumulators reduces the iteration count on the one hand, and on the other hand removes the dependence between the two additions in each iteration, so instruction-level parallelism lets them execute simultaneously. Together these significantly reduce execution time. Increasing the unrolling factor and the number of accumulators improves performance further, up to the throughput limit of the machine's instruction execution.
5.2 Reassociation Transformation
Besides using multiple accumulators to exploit the machine's instruction-level parallelism explicitly, we can reassociate the operations to break the sequential dependency and gain the same benefit.
In sum4, acc = acc + data[i] + data[i+1] associates as acc = (acc + data[i]) + data[i+1];
we change it to acc = acc + (data[i] + data[i+1]);
The code is as follows:
void sum6(vec_ptr v, data_t *dest) {
    int i;
    int len = vec_length(v);
    int limit = len - 3;
    data_t *data = get_vec_start(v);
    data_t acc = 0;
    for (i = 0; i < limit; i += 4) {
        acc = acc + (data[i] + data[i + 1]);
        acc = acc + (data[i + 2] + data[i + 3]);
    }
    for (; i < len; ++i)
        acc += data[i];
    *dest = acc;
}
Increasing the unrolling factor further improves performance, again up to the throughput limit of the machine's instruction execution. (The integer-multiplication speedup mentioned under loop unrolling comes from the compiler applying this transformation implicitly; because floating-point arithmetic is not associative, the compiler will not apply it there, but the programmer can apply it explicitly after confirming that the program's results remain correct.)