Optimizing Loop Invariants

Source: Internet
Author: User

In the late 1990s, one of my colleagues was writing a program for processing medical images. It used a lot of trigonometric functions, so it ran slowly (machines were slow back then, too; clock speeds were perhaps on the order of a hundred MHz), taking more than 20 seconds to process one image. He asked me whether there was any way to make it run faster.
I took a look at his code and made a simple modification that immediately made it more than three times faster. The reason: his code contained some loop invariants that could easily be optimized away. For example, his code looked like this:
for (x = 0; x < s_x; x++) {
    for (y = 0; y < s_y; y++) {
        for (z = 0; z < s_z; z++) {
            sum += sin(x) * A[x][y][z] + sin(y) * B[x][y][z] + sin(z) * C[x][y][z];
        }
    }
}

In code like this, x and y do not change across iterations of the inner z loop, so there is no need to recompute sin(x) and sin(y) every time. But because sin(x) and sin(y) are function calls, the compiler does not necessarily know that the repeated calls are redundant. If you leave it entirely to the compiler, it may fail to eliminate these repeated calls, and the program will naturally run slowly. But if we rewrite the code above:
for (x = 0; x < s_x; x++) {
    double sinx = sin(x);
    for (y = 0; y < s_y; y++) {
        double siny = sin(y);
        for (z = 0; z < s_z; z++) {
            sum += sinx * A[x][y][z] + siny * B[x][y][z] + sin(z) * C[x][y][z];
        }
    }
}

then we have manually hoisted the loop invariants (with respect to the z loop) out of the z loop, eliminating the repeated function calls and thereby improving the speed.
Of course, some compilers can now perform this optimization themselves. For a function like sin(), the compiler may recognize certain pure library functions (such as the trigonometric functions) in advance; it knows they have no side effects, so repeated calls with the same argument can be eliminated. But in many other cases the compiler still cannot do this analysis, so the issue deserves attention when writing a program, in order to produce higher-quality code.
For example, for the following functions:
int sqr_sum(double *err, double a[], int size_a, double b[], int size_b) {
    int i;

    if (err == NULL || size_a != size_b)
        return 0;

    *err = 0;
    for (i = 0; i < size_a; i++)
        *err += (a[i] - b[i]) * (a[i] - b[i]);
    return 1;
}
This is very common code, but it is not efficient enough. The main reason is that *err is accessed through memory repeatedly inside the loop.
Expanded, the body of the loop looks roughly like this:

    load *err
    load a[i]
    load b[i]
    compute ...
    store *err
Since err is a pointer to double, the compiler cannot prove that err does not point into array a[] or b[], so all four memory accesses may touch the same address. In that case the compiler cannot reorder these reads and writes, and further optimization is blocked.
However, if we rewrite the code:
int sqr_sum(double *err, double a[], int size_a, double b[], int size_b) {
    int i;
    double local_err;

    if (err == NULL || size_a != size_b)
        return 0;

    local_err = 0;
    for (i = 0; i < size_a; i++)
        local_err += (a[i] - b[i]) * (a[i] - b[i]);
    *err = local_err;
    return 1;
}
This version performs much better. First, the compiler can keep the local variable local_err in a register, so accesses to local_err no longer go through memory at all; this reduces the number of memory accesses, improving access speed, and also reduces the instruction count.
Second, since the compiler knows that local_err does not overlap the memory of arrays a[] and b[], the memory locations accessed by the statements in this loop are guaranteed to be distinct, so they can be executed in parallel. On processors that support SSE, several of these operations can be executed by a single SSE instruction. Likewise, on multi-core machines the work can be split across cores: the first core accumulates the first half, the second core the second half, and the partial sums are combined once at the end.

For more information about compiler optimization, see:
http://bbs.emath.ac.cn/thread-173-1-1.html
