Code optimization-optimize Division

Source: Internet
Author: User
Tags integer division



Tag:CodeOptimization, division, Newton iteration, subtraction instead of division, Division Optimization

Note:ArticleMany of the data may have different results in different CPUs or different system environments. The data is for reference only.
X86 series CPUs can complete basic commands such as bit operations, addition, and subtraction within one CPU cycle (the current CPU can still be executed in disorder, so that the average CPU cycle of the command is smaller); the current CPU, multiplication is also very fast (requires several CPU cycles, each cycle may start a new multiplication command (x87 )), however, the Division as a basic instruction is more than expected. It is a very slow operation, and the division between integers and floating points is slow; I tested the division of the Intel P5 proyang CPU floating point number, which is about 37 CPU cycles, the integer division is 80 CPU cycles, and the amd2200 + floating point division is about 21 CPU cycles, the division of integers is 40 CPU cycles. (Division is ineffective if FPU precision is changed) (the remainder and Division operations on x86 are implemented using the same CPU command. It is said that many Integer Division operations on CPUs are completed using the floating-point delimiters of the mathematical coprocessor. One inference is, the floating point division and integer division cannot be executed in parallel.(PS: Intel's P4 imul command may have a delay of 14 cycles (or 15-18) before the result can be obtained)

This article provides some methods or replacements for Division optimization.Algorithm(Warning: Some substitution algorithms cannot be completely equivalent !)

1. Try to use less Division
For example, if (x/y> Z )...
Change to: If (Y> 0) & (x> y * z) | (Y <0) & (x <y * z )))...

(Less use for remainder)
For example: + index; If (index> = count) Index = index % count; // assert (index <count * 2 );
Changed to: + index; If (index> = count) Index = index-count;

2. use subtraction instead of Division
if we know that the divisor is a small multiple of the divisor, we can use subtraction instead of Division.
example: uint32 x = 200;
uint32 y = 70;
uint32 z = x/y;
changed to: uint z = 0;
while (x> = y)
{< br> X-= y; ++ Z;

A Division completed by subtraction and shift (if you do not have Division instructions available :)
Uint32 Div (uint64 U, uint32 Z) // return U/Z
Uint32 x = (uint32) (U> 32 );
Uint32 y = (uint32) U;
// Y storage Vendor X storage Remainder

For (INT I = 0; I <32; ++ I)
Uint32 t = (int32) x)> 31;
X = (x <1) | (Y> 31 );
Y = Y <1;
If (X | T)> = z)
X-= z;
++ Y;
Return y;
(This function has been tested; Z = 0 needs to be processed by yourself; For signed division, you can use the absolute value method (of course it is not easy)
Write out the fully equivalent signed Division :); if the 64-bit length of S is not required and only 32-bit is required, you can simplify this function, but there is not much improvement)

3. replace division with shift (many compilers can optimize it automatically)
It is required that the divisor be a constant of the power of 2. (likewise, for some applications, such a number can be preferentially selected for the divisor)
For example, uint32 x = 213432575;
Uint32 y = X/8;
Changed to: Y = x> 3;

For signed integers;
For example, int32 x = 213432575;
Int32 y = X/8;
Change to: If (x> = 0) y = x> 3;
Else y = (x + (1 <3-1)> 3;

4. Merge Division (alternative methods are not equivalent, and many compilers will not help you with this optimization)
This method is applicable when division cannot be avoided by other methods;
For example, double X = A/B/C;
Change to: Double X = A/(B * C );

For example, double X = A/B + C/B;
Change to: Double X = (a + C)/B;

for example, double X = A/B;
Double Y = C/d;
double z = E/F;
changed to double TMP = 1.0/(B * D * F );
double X = A * TMP * D * F;
Double Y = C * TMP * B * F;
double z = E * TMP * B * D;
5. make full use of the time occupied by Division
when the CPU performs division, you do not have to wait for this result (that is, the command inserted later does not use this division result ), insert multiple simple integers
commands (excluding integer division, and the result cannot be a global or external variable) to save the time occupied by division;
(this method is useful when division is inevitable)

6. Use the lookup method instead of division.
(Applicable to the case where the possible value range of the divisor and the divisor is small, otherwise the space consumption is too large)
For example, uint8 X; uint8 y;
Uint8 z = x/y;
Change to uint8 z = table [x] [Y]; // Where table is a pre-calculated table, table [I] [J] = I/J;
// You need to process the division by zero based on your application.
Or: uint8 z = table [x <8 + Y]; // Where table [I] = (I> 8) /(I & (1 <8-1 ));

For example, uint8 X;
Uint8 z = x/17;
Change to uint8 z = table [X]; // Where table [I] = I/17;

7. Use multiplication instead of Division
(The alternative method is not equivalent. Many compilers will not help you with this optimization)
For example, double X = y/21.3432575;
Change to: Double X = y * (1/21. 3432575); // If the compiler cannot be optimized (1/21. 3432575), calculate the result in advance.

For integers, you can use a method similar to the number of points to process the reciprocal.
(This alternative method is not equivalent)
For example, uint32 X, Y; X = y/213;
Change to: x = y * (1 <16)/213)> 16;
In some applications, y * (1 <16)/213) may exceed the value range. In this case, you can use int64 to expand the value range.
Uint32 x = (uint64) y) * (1 <31)/213)> 31;
You can also use floating point numbers to expand the value range.
Uint32 x = (uint32) (y * (1.0/213); (warning: floating point numbers are forcibly converted to Integers in many advanced languages.
A very slow operation. The optimization method will be provided in the next few articles)

For this method, some Division has an exact equivalent optimization method:
Unsigned number divided by 3: uint32 X, Y; X = y/3;
Reasoning: Y (1 <33) + 1 y
X = y/3 => X = [-] => X = [-+ ---------] => X = [--------- * -------] // [] indicates an integer.
3 3 3*(1 <33) 3 (1 <33)
Y 1
(Verifiable: 0 <= --------- <-)
3*(1 <33) 3
That is: x = (uint64 (y) * m)> 33; where the magic number M = (1 <33) + 1)/3 = 2863311531 = 0 xaaaaaaaaab;

Unsigned number divided by 5: uint32 X, Y; X = y/5; (Y <(1 <31 ))
Reasoning: Y 3 * Y (1 <33) + 3 Y
X = y/5 => X = [-] => X = [-+ ---------] => X = [--------- * -------]
5 5 5*(1 <33) 5 (1 <33)
3 * Y 1
(Verifiable: 0 <= --------- <-)
5*(1 <33) 5
That is: x = (uint64 (y) * m)> 33; where the number of magic M = (1 <33) + 3)/5 = 1717986919 = 0x66666667;

Unsigned number divided by 7: uint32 X, Y; X = y/7;
Reasoning: omitted
That is: x = (uint64 (y) * m)> 33 + Y)> 3; where the magic number M = (1 <35) + 3) /7-(1 <32) = 613566757 = 0x24924925;

There are other alternative formulas for this completely equivalent substitution, which are not discussed in the alternative algorithm with signed Division. Many numbers haveCompletely equivalent substitution algorithms, and those magic numbers are also well-regulated. "aggressive" compilers should help users deal with this optimization (as we can see in vc6! The method for laziness is to directly view the assembly code generated by vc6 :);

8. If division is known to be an integer multiple of the divisor, an alternative algorithm or an improved algorithm can be obtained;
Here we only discuss the situation where the divisor is a constant;
For example, for (32-bit Division): x = y/a; // It is known that Y is a multiple of A, and assume that A is an odd number.
(If A is an even number, first convert it to a = A0 * (1 <k); y/a = (y/A0)> K; A0 is an odd number)
Obtain a' so that (a' * A) % (1 <32) = 1;
So: x = y/a => X = (y/a) * (A * A') % (1 <32 )) => X = (y * A') % (1 <32); // The actual Perform a remainder command
(This algorithm supports both signed and unsigned division)
For example, uint32 y;
Uint32 x = y/7; // It is known that Y is a multiple of 7.
Change to uint32 x = (uint32) (y * m); // where M = (5*(1 <32) + 1)/7

9. Approximate Calculation Division (this alternative method is not equivalent)
Optimize the operation of divisor 255 (257 or 1026, and so on) (or, and the formula for pushing and exporting is slightly different)
For example, in color processing: uint8 color = colorsum/255;
Changed to: uint8 color = colorsum/256; that is, color = colorsum> 8;
This error is often acceptable in color processing.
To reduce the error, you can change it to uint8 color = (colorsum + (colorsum> 8)> 8;
Derivation: X/255 = (x + x/255)/(255 + 1) = (x + a)> 8; A = x/255;
Change a to a = x> 8 (a small error is introduced). Then, the formula X/255 equals (x + (x> 8)> 8 Formulas
Likewise, X/255 is approximately equal to (x + (x> 8) + (x> 16)> 8 and other more accurate formulas (Please export the error items to determine whether the accuracy is sufficient)

For example:X/255 is written as (x + (x> 8) + 1)> 8, which is equivalent to 0 <= x <65535

10. Newton Iteration Method for Division
(Many internal division commands of the CPU are implemented using this method or similar methods)
Assume that there is a function y = f (x); evaluate the value of X when 0 = f (x; (Here we will not discuss the situation of multiple solutions, escape, fall into an endless loop or fall into a chaotic state)

(Reference Image)
Evaluate the function's guiding function to Y = f' (x); // you can understand the meaning of this function: the slope of the Function Y = f (x) at the X point;
A. Obtain an appropriate x Initial Value x0 and obtain Y0 = f (x0 );
B. Use (x0, y0) as a straight line with the slope of f' (x0) and submit it to X1 on the X axis;
C. Use the obtained X1 as the initial value for iteration;
After many iterations, it is considered that x1 will be very close to the solution of equation 0 = f (x), which is the Newton Iteration;
Simplify the above process and obtain the Newton iteration formula: x (n + 1) = x (n)-y (x (n)/Y' (x (n ))

Here we use the Newton iteration formula to calculate the reciprocal. (use the reciprocal to obtain the Division: Y = x/a = x * (1/))
Let a = 1/X; have the equation Y = A-1/X;
Evaluate the exported y' = 1/x ^ 2;
X (n + 1) = x (n)-y (x (n)/Y' (x (n ));
There is an iterative formula x_next = (2-A * X) * X; // it can be proved that this formula is a Level 2 convergence formula; that is, the effective accuracy of the calculated solution increases exponentially;

Proof of convergence: x = (1/a) + dx; // dx is a small amount
Then x_next-(1/a) = (2-A * (1/A + dx) * (1/A + dx)-1/
= (-A) * DX ^ 2 // ^ indicates the exponential Operator
Certificate completed.

The program can use this method to implement division and determine the number of iterations according to its own precision requirements;
(For the initial iteration values, you can use the lookup table method or use the IEEE floating point format to obtain an approximate calculation expression. There is an rcpps (rcpss) in the SSE instruction set) the command can also be used to calculate the approximate reciprocal value. With the initial value, the above iteration formula can be used to obtain more accurate results)

Appendix: using Newton Iteration Method to implement square operation
// The square operation can be expressed as Y = x ^ 0.5 = 1/(1/x ^ 0.5); Evaluate 1/x ^ 0.5 first
1/A ^ 0.5,
Let a = 1/x ^ 2; there is an equation Y = A-1/x ^ 2;
Evaluate the exported y' = 2/x ^ 3;
X (n + 1) = x (n)-y (x (n)/Y' (x (n ));
There is an iterative formula x_next = (3-a * x * X) * x * 0.5; // it can be proved that the formula is a Level 2 convergence formula. // The process of the method is omitted from the proof above.


Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.