I. Code Motion
A computation that is performed many times inside a loop, but whose result never changes, can be moved outside the loop.
Example:
Before optimization:
void lower1(char *s) {
    int i;
    for (i = 0; i < strlen(s); ++i)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
After optimization:
void lower2(char *s) {
    int i;
    int len = strlen(s);
    for (i = 0; i < len; ++i)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
In the pre-optimized version, every loop test calls strlen to recompute the length of s, so the actual complexity becomes O(n²); the optimized version computes the length only once, which is why it performs better.
II. Reduce Function Calls
Example:
Before optimization:
void sum1(vec_ptr v, data_t *dest) {
    int i;
    int len = vec_length(v);
    *dest = 0;
    for (i = 0; i < len; ++i) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}
After optimization:
data_t *get_vec_start(vec_ptr v) {
    return v->data;
}
void sum2(vec_ptr v, data_t *dest) {
    int i;
    int len = vec_length(v);
    data_t *data = get_vec_start(v);
    *dest = 0;
    for (i = 0; i < len; ++i)
        *dest += data[i];
}
The pre-optimized version calls get_vec_element once per iteration to fetch the corresponding element; the optimized version calls get_vec_start once, before the loop, to obtain the starting address of the underlying array, and the loop then accesses memory directly without any function call.
III. Reduce Memory Access
Example:
Before optimization:
void sum2(vec_ptr v, data_t *dest) {
    int i;
    int len = vec_length(v);
    data_t *data = get_vec_start(v);
    *dest = 0;
    for (i = 0; i < len; ++i)
        *dest += data[i];
}
After optimization:
void sum3(vec_ptr v, data_t *dest) {
    int i;
    int len = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t acc = 0;
    for (i = 0; i < len; ++i)
        acc += data[i];
    *dest = acc;
}
The pre-optimized version reads the value at dest, adds data[i], and writes the result back to dest on every iteration. These reads and writes are wasted work: the value read from dest at the start of each iteration is exactly the value the previous iteration wrote back. The optimized version introduces a temporary variable acc, accumulates the results in the loop, and writes acc back to dest only once, after the loop ends.
The difference is clearly visible in the assembly the compiler generates for the two versions (the listings are not reproduced here):
Before optimization: the second and fourth instructions of the loop body read from and write to *dest, respectively.
After optimization: the compiler keeps acc directly in a register, so the loop body performs no memory reads or writes for the accumulator.
IV. Loop Unrolling
Loop unrolling reduces the number of loop iterations and improves performance in two ways. First, it reduces work that does not contribute directly to the result, such as updating the loop counter and executing branch instructions. Second, it exposes opportunities for further optimizations that exploit the machine's characteristics.
Example:
For the pre-optimization code, see sum3 above.
After optimization:
void sum4(vec_ptr v, data_t *dest) {
    int i;
    int len = vec_length(v);
    int limit = len - 3;
    data_t *data = get_vec_start(v);
    data_t acc = 0;
    for (i = 0; i < limit; i += 4) {
        acc = acc + data[i] + data[i + 1];
        acc = acc + data[i + 2] + data[i + 3];
    }
    for (; i < len; ++i)
        acc += data[i];
    *dest = acc;
}
With the loop unrolled, each iteration accumulates 4 elements, reducing the iteration count and thus the total execution time. (Used alone, this optimization gives almost no improvement for floating-point addition, but integer addition benefits noticeably, because the compiler can reassociate the integer code.)
The compiler can also perform this optimization for you: at higher optimization levels it will unroll loops automatically. With GCC, the -funroll-loops option requests it explicitly.
V. Improving Parallelism
Most modern processors use pipelining and superscalar execution, which provide instruction-level parallelism. We can exploit this feature to optimize the code further.
5.1 Using Multiple Accumulators
Optimized code:
void sum5(vec_ptr v, data_t *dest) {
    int i;
    int len = vec_length(v);
    int limit = len - 1;
    data_t *data = get_vec_start(v);
    data_t acc0 = 0;
    data_t acc1 = 0;
    for (i = 0; i < limit; i += 2) {
        acc0 += data[i];
        acc1 += data[i + 1];
    }
    for (; i < len; ++i)
        acc0 += data[i];
    *dest = acc0 + acc1;
}
Combining loop unrolling with multiple accumulators reduces the iteration count on the one hand, and on the other hand removes the dependence between the two additions in each iteration, so instruction-level parallelism lets them execute simultaneously. Together these significantly reduce execution time. Increasing the unrolling factor and the number of accumulators improves performance further, up to the throughput limit of the machine's instruction execution.
5.2 Reassociation Transformation
Besides using multiple accumulators to exploit the machine's instruction-level parallelism explicitly, we can reassociate the operations to break the sequential dependency and gain the same benefit.
In sum4, acc = acc + data[i] + data[i+1] associates as acc = (acc + data[i]) + data[i+1];
we change it to acc = acc + (data[i] + data[i+1]);
The code is as follows:
void sum6(vec_ptr v, data_t *dest) {
    int i;
    int len = vec_length(v);
    int limit = len - 3;
    data_t *data = get_vec_start(v);
    data_t acc = 0;
    for (i = 0; i < limit; i += 4) {
        acc = acc + (data[i] + data[i + 1]);
        acc = acc + (data[i + 2] + data[i + 3]);
    }
    for (; i < len; ++i)
        acc += data[i];
    *dest = acc;
}
Increasing the unrolling factor further improves performance, again up to the throughput limit of the machine's instruction execution. (The integer-multiplication speedup mentioned under loop unrolling comes from the compiler applying this transformation implicitly; because floating-point arithmetic is not associative, the compiler will not apply it there, but the programmer can apply it explicitly after confirming that the program's results remain correct.)