Note: A lot of data in this article may have different results in new CPUs, different CPUs, or different system environments, and may not be comprehensive)

X86 series CPUs can complete basic commands such as bit operations, addition, and subtraction within one CPU cycle (the current CPU can still be executed in disorder, so that the average CPU cycle of the command is smaller); the current CPU, multiplication is also very fast (several CPU cycles are required, each cycle may start a new multiplication command ), however, the Division as a basic instruction is more than expected. It is a very slow operation, and the division between integers and floating points is slow; the division of the Intel P5 proyang CPU floating point is almost 37 CPU cycles, the division of integers is 80 CPU cycles, and the Division of AMD2200 + floating point is almost 21 CPU cycles, the division of integers is 40 CPU cycles. (Division is ineffective if FPU precision is changed) (the remainder and Division operations on x86 are implemented using the same CPU command. It is said that many Integer Division operations on CPUs are completed using the floating-point delimiters of the mathematical coprocessor. One inference is, the floating point division and integer division cannot be executed in parallel. (ps: intel's p4 imul command may have a delay of 14 cycles (or 15-18) before the result can be obtained)

This article provides some division optimization methods or substitution algorithms (warning: Some substitution algorithms cannot be completely equivalent !)

1. Try to use less Division

For example, if (x/y> z )...

Change to: if (y> 0) & (x> y * z) | (y <0) & (x <y * z )))...

(Less use for remainder)

For example: + index; if (index> = count) index = index % count; // assert (index <count * 2 );

Changed to: + index; if (index> = count) index = index-count;

2. Use subtraction instead of Division

If we know that the divisor is a small multiple of the divisor, we can use subtraction instead of division.

For example, uint32 x = 200;

Uint32 y = 70;

Uint32 z = x/y;

Changed to: uint z = 0;

While (x> = y)

{

X-= y; ++ z;

}

A Division completed by subtraction and shift (if you do not have Division instructions available :)

Uint32 div (uint64 u, uint32 z) // return u/z

{

Uint32 x = (uint32) (U> 32 );

Uint32 y = (uint32) U;

// Y storage Vendor X storage Remainder

For (INT I = 0; I <32; ++ I)

{

Uint32 t = (int32) x)> 31;

X = (x <1) | (Y> 31 );

Y = Y <1;

If (X | T)> = z)

{

X-= z;

++ Y;

}

}

Return y;

}

(This function has been tested; z = 0 needs to be processed by yourself; For signed division, you can use the absolute value method (of course it is not easy)

Write out the fully equivalent signed Division :); if the 64-bit length of s is not required and only 32-bit is required, you can simplify this function, but there is not much improvement)

3. replace division with shift (many compilers can optimize it automatically)

It is required that the divisor be a constant of the power of 2. (likewise, for some applications, such a number can be preferentially selected for the divisor)

For example, uint32 x = 213432575;

Uint32 y = X/8;

Changed to: Y = x> 3;

For signed integers;

For example, int32 x = 213432575;

Int32 y = X/8;

Change to: If (x> = 0) y = x> 3;

Else y = (x + (1 <3-1)> 3;

4. Merge Division (alternative methods are not equivalent, and many compilers will not help you with this optimization)

This method is applicable when division cannot be avoided by other methods;

For example, double X = A/B/C;

Change to: Double X = A/(B * C );

For example, double X = A/B + C/B;

Change to: Double X = (a + C)/B;

For example, double X = A/B;

Double Y = C/d;

Double Z = E/F;

Change to: Double TMP = 1.0/(B * D * F );

Double X = A * TMP * D * F;

Double Y = C * TMP * B * F;

Double Z = E * TMP * B * D;

5. Make full use of the time occupied by Division

When the CPU performs division, it does not have to wait for the result (that is, the instruction inserted later does not use the division result), but inserts multiple simple integers.

Command (excluding integer division, and the result cannot be a global or external variable), which saves the time occupied by division;

(This method is useful when division is inevitable)

6. Use the lookup method instead of division.

(Applicable to the case where the possible value range of the divisor and the divisor is small, otherwise the space consumption is too large)

For example, uint8 X; uint8 y;

Uint8 z = x/y;

Change to uint8 z = table [x] [y]; // Where table is a pre-calculated table, table [I] [j] = I/j;

// You need to process the division by zero based on your application.

Or: uint8 z = table [x <8 + y]; // Where table [I] = (I> 8) /(I & (1 <8-1 ));

For example, uint8 x;

Uint8 z = x/17;

Change to uint8 z = table [x]; // Where table [I] = I/17;

7. Use multiplication instead of Division

(The alternative method is not equivalent. Many compilers will not help you with this optimization)

For example, double x = y/21.3432575;

Change to: double x = y * (1/21. 3432575); // If the compiler cannot be optimized (1/21. 3432575), calculate the result in advance.

For integers, you can use a method similar to the number of points to process the reciprocal.

(This alternative method is not equivalent)

For example, uint32 x, y; x = y/213;

Change to: x = y * (1 <16)/213)> 16;

In some applications, y * (1 <16)/213) may exceed the value range. In this case, you can use int64 to expand the value range.

Uint32 x = (uint64) y) * (1 <31)/213)> 31;

You can also use floating point numbers to expand the value range.

Uint32 x = (uint32) (y * (1.0/213); (warning: floating point numbers are forcibly converted to Integers in many advanced languages.

A very slow operation. The optimization method will be provided in the next few articles)

For this method, some Division has an exact equivalent optimization method:

Unsigned number divided by 3: uint32 x, y; x = y/3;

Reasoning: y (1 <33) + 1 y

X = y/3 => x = [-] => x = [-+ ---------] => x = [--------- * -------] // [] indicates an integer.

3 3 3*(1 <33) 3 (1 <33)

Y 1

(Verifiable: 0 <= --------- <-)

3*(1 <33) 3

That is: x = (uint64 (y) * m)> 33; where the magic number M = (1 <33) + 1)/3 = 2863311531 = 0 xaaaaaaaaab;

Unsigned number divided by 5: uint32 X, Y; X = y/5; (Y <(1 <31 ))

Reasoning: y 3 * y (1 <33) + 3 y

X = y/5 => x = [-] => x = [-+ ---------] => x = [--------- * -------]

5 5 5*(1 <33) 5 (1 <33)

3 * Y 1

(Verifiable: 0 <= --------- <-)

5*(1 <33) 5

That is: x = (uint64 (y) * m)> 33; where the number of magic M = (1 <33) + 3)/5 = 1717986919 = 0x66666667;

Unsigned number divided by 7: uint32 X, Y; X = y/7;

Reasoning: omitted

That is: x = (uint64 (y) * M)> 33 + y)> 3; where the magic number M = (1 <35) + 3) /7-(1 <32) = 613566757 = 0x24924925;

There are other substitution formulas for this completely equivalent substitution, which are not discussed. There is no discussion about alternative algorithms with signed Division, and many numbers have completely equivalent substitution algorithms, the magic numbers are also regular; the "aggressive" compiler should help the user deal with this optimization (as we can see in vc6! The method for laziness is to directly view the assembly code generated by vc6 :);

8. If division is known to be an integer multiple of the divisor, an alternative algorithm or an improved algorithm can be obtained;

Here we only discuss the situation where the divisor is a constant;

For example, for (32-bit Division): x = y/a; // It is known that y is a multiple of a, and assume that a is an odd number.

(If a is an even number, first convert it to a = a0 * (1 <k); y/a = (y/a0)> k; a0 is an odd number)

Obtain a' so that (a' * a) % (1 <32) = 1;

So: x = y/a => x = (y/a) * (a * A') % (1 <32 )) => x = (y * A') % (1 <32); // you do not need to perform a remainder command here.

(This algorithm supports both signed and unsigned division)

For example, uint32 y;

Uint32 x = y/7; // It is known that y is a multiple of 7.

Change to uint32 x = (uint32) (y * M); // where M = (5*(1 <32) + 1)/7

9. Approximate Calculation Division (this alternative method is not equivalent)

Optimize the operation of divisor 255 (257 or 1026, and so on) (or, and the formula for pushing and exporting is slightly different)

For example, in color processing: uint8 color = colorsum/255;

Changed to: uint8 color = colorsum/256; that is, color = colorsum> 8;

This error is often acceptable in color processing.

To reduce the error, you can change it to uint8 color = (colorsum + (colorsum> 8)> 8;

Derivation: x/255 = (x + x/255)/(255 + 1) = (x + A)> 8; A = x/255;

Change a to a = x> 8 (A small error is introduced). Then, the formula x/255 equals (x + (x> 8)> 8 Formulas

Likewise, x/255 is approximately equal to (x + (x> 8) + (x> 16)> 8 and other more accurate formulas (Please export the error items to determine whether the accuracy is sufficient)

10. Newton Iteration Method for Division

(Many internal division commands of the CPU are implemented using this method or similar methods)

Assume that there is a function y = f (x); evaluate the value of x when 0 = f (x; (Here we will not discuss the situation of multiple solutions, escape, fall into an endless loop or fall into a chaotic state)

(Reference Image)

Evaluate the function's guiding function to y = f' (x); // you can understand the meaning of this function: the slope of the Function y = f (x) at the x point;

A. Obtain an appropriate x Initial Value x0 and obtain y0 = f (x0 );

B. Use (x0, y0) as a straight line with the slope of f' (x0) and submit it to x1 on the X axis;

C. Use the obtained x1 as the initial value for iteration;

After many iterations, it is considered that x1 will be very close to the solution of equation 0 = f (x), which is the Newton Iteration;

Simplify the above process and obtain the Newton iteration formula: x (n + 1) = x (n)-y (x (n)/Y' (x (n ))

Here we use the Newton iteration formula to calculate the reciprocal. (use the reciprocal to obtain the Division: Y = x/a = x * (1/))

1/,

Let a = 1/X; have the equation Y = A-1/X;

Evaluate the exported y' = 1/x ^ 2;

X (n + 1) = x (n)-y (x (n)/Y' (x (n ));

There is an iterative formula x_next = (2-A * X) * X; // it can be proved that this formula is a Level 2 convergence formula; that is, the effective accuracy of the calculated solution increases exponentially;

Proof of convergence: x = (1/a) + dx; // dx is a small amount

Then x_next-(1/a) = (2-A * (1/A + dx) * (1/A + dx)-1/

= (-A) * DX ^ 2 // ^ indicates the exponential Operator

Certificate completed.

The program can use this method to implement division and determine the number of iterations according to its own precision requirements;

(For the initial iteration values, you can use the lookup table method or use the IEEE floating point format to obtain an approximate calculation expression. There is an RCPPS (RCPSS) in the SSE instruction set) the command can also be used to calculate the approximate reciprocal value. With the initial value, the above iteration formula can be used to obtain more accurate results)

Appendix: using Newton Iteration Method to implement square operation

// The square operation can be expressed as y = x ^ 0.5 = 1/(1/x ^ 0.5); Evaluate 1/x ^ 0.5 first

1/a ^ 0.5,

Let a = 1/x ^ 2; there is an equation y = A-1/x ^ 2;

Evaluate the exported y' = 2/x ^ 3;

X (n + 1) = x (n)-y (x (n)/Y' (x (n ));

There is an iterative formula x_next = (3-a * x) * x * 0.5; // it can be proved that the formula is a Level 2 convergence formula. // The process of the method is omitted from the proof above.

11. Fast Division using a fast Multiplier

This is a fast division algorithm. Currently, it is best to use hardware for implementation. Please refer to the relevant materials.

<Code optimization-optimization division> author: HouSisong@263.net