Abstract: The current CPU multiplication is very fast (about one CPU cycle, or two or three cycles are required, but each cycle can start a new multiplication command ), however, Division as a basic instruction is more than expected. It is a very slow operation, and division between integers and floating points is slow. This article will provide some methods or alternative algorithms for division;

<Code optimization-optimization division> author: HouSisong@263.net

(Note: A lot of data in this article may have different results in new CPUs, different CPUs, or different system environments, and may not be comprehensive)

X86 series CPUs can complete basic commands such as bit operations, addition, and subtraction within one CPU cycle (the current CPU can still be executed in disorder, so that the average CPU cycle of the command is smaller); the current

CPU, multiplication is also very fast (about one CPU cycle, or two or three cycles are required, but each cycle can start a new multiplication command ), however, Division as a basic instruction is beyond the expectation of many people,

It is a very slow operation, and integer and floating point division are both slow; I tested the Intel P5 proyang CPU floating point division is almost 37 CPU cycles, the division of integers is 80 CPU cycles,

Amd2200 + floating point number division is almost 21 CPU cycles, while integer division is 40 CPU cycles. (Changing FPU operation precision is not effective for Division) (SSE instruction set low-path ticket precision

Number division command divps 18 CPU cycles, four single precision number division command divss 36 CPU cycles)

(The remainder and Division operations on x86 are implemented using the same CPU command. It is said that many Integer Division operations on CPUs are completed using the floating-point delimiters of the mathematical coprocessor. One inference is, floating Point division and

Integer Division cannot be executed in parallel)

This article provides some division optimization methods or substitution algorithms (warning: Some substitution algorithms cannot be completely equivalent !)

1. Try to use less Division

For example, if (x/y> Z )...

Change to: If (Y> 0) & (x> y * z) | (Y <0) & (x

(Less use for remainder)

For example: + index; If (index> = count) Index = index % count;

// Assert (Index

Changed to: + index; If (index> = count) Index = index-count;

2. Use subtraction instead of Division

If we know that the divisor is a small multiple of the divisor, we can use subtraction instead of division.

For example, uint32 x = 200;

Uint32 y = 70;

Uint32 z = x/y;

Changed to: uint z = 0;

While (x> = y)

{

X-= y; ++ Z;

}

A Division completed by subtraction and shift (if you do not have Division instructions available :)

Uint32 Div (uint64 S, uint32 Z) // return U/Z

{

Uint32 x = (uint32) (S> 32 );

Uint32 y = (uint32) S;

// Y storage Vendor X storage Remainder

For (INT I = 0; I <32; ++ I)

{

Uint32 t = (int32) x)> 31;

X = (x <1) | (Y> 31 );

Y = Y <1;

If (X | T)> = z)

{

X-= z;

++ Y;

}

}

Return y;

}

(This function has been tested; Z = 0 needs to be processed by yourself; For signed division, you can use the absolute value method (of course it is not easy)

Write out the fully equivalent signed Division :); if the 64-bit length of S is not required and only 32-bit is required, you can simplify this function, but there is not much improvement)

3. replace division with shift (many compilers can optimize it automatically)

It is required that the divisor be a constant of the power of 2. (likewise, for some applications, such a number can be preferentially selected for the divisor)

For example, uint32 x = 213432575;

Uint32 y = X/8;

Changed to: Y = x> 3;

For signed integers;

For example, int32 x = 213432575;

Int32 y = X/8;

Change to: If (x> = 0) y = x> 3;

Else y = (x + (1 <3-1)> 3;

4. Merge Division (alternative methods are not equivalent, and many compilers will not (and should not) Help You With this optimization)

This method is applicable when division cannot be avoided by other methods;

For example, double X = A/B/C;

Change to: Double X = A/(B * C );

For example, double X = A/B + C/B;

Change to: Double X = (a + C)/B;

For example, double X = A/B;

Double Y = C/d;

Double Z = E/F;

Change to: Double TMP = 1.0/(B * D * F );

Double X = A * TMP * D * F;

Double Y = C * TMP * B * F;

Double Z = E * TMP * B * D;

5. Make full use of the time occupied by Division

When the CPU performs division, it does not have to wait for the result (that is, the instruction inserted later does not use the division result), but inserts multiple simple integers.

Command (excluding integer division, and the result cannot be a global or external variable), which saves the time occupied by division;

(This method is useful when division is inevitable)

6. Use the lookup method instead of division.

(Applicable to the case where the possible value range of the divisor and the divisor is small, otherwise the space consumption is too large)

For example, uint8 X; uint8 y;

Uint8 z = x/y;

Change to uint8 z = table [x] [Y]; // Where table is a pre-calculated table, table [J] = I/J;

// You need to process the division by zero based on your application.

Or: uint8 z = table [x <8 + Y];

// Table = (I> 8)/(I & (1 <8-1 ));

For example, uint8 X;

Uint8 z = x/17;

Change to uint8 z = table [X]; // where Table = I/17;

7. Use multiplication instead of Division

(The alternative method is not equivalent. Many compilers won't (and shouldn't) Help You With this optimization)

For example, double X = y/21.3432575;

Change to: Double X = y * (1/21. 3432575); // If the compiler cannot be optimized (1/21. 3432575), calculate the result in advance.

For integers, you can use a method similar to the number of points to process the reciprocal.

(This alternative method is not equivalent)

For example, uint32 X, Y; X = y/213;

Change to: x = y * (1 <16)/213)> 16;

In some applications, y * (1 <16)/213) may exceed the value range. In this case, you can use int64 to expand the value range.

Uint32 x = (uint64) y) * (1 <31)/213)> 31;

You can also use floating point numbers to expand the value range.

Uint32 x = (uint32) (y * (1.0/213); (warning: floating point numbers are forcibly converted to Integers in many advanced languages.

A very slow operation. The optimization method will be provided in the next few articles)

For this method, some Division has an exact equivalent optimization method:

Unsigned number divided by 3: uint32 X, Y; X = y/3;

Reasoning: Y (1 <33) + 1 y

X = y/3 => X = [-] => X = [-+ ---] => X = [--- * ---] // [] indicates the integer.

3 3 3*(1 <33) 3 (1 <33)

Y 1

(Verifiable: 0 <= --------- <-)

3*(1 <33) 3

That is, x = (uint64 (y) * m)> 33;

The magic number M = (1 <33) + 1)/3 = 2863311531 = 0 xaaaaaaab;

Unsigned number divided by 5: uint32 X, Y; X = y/5; (Y <(1 <31 ))

Reasoning: Y 3 * Y (1 <33) + 3 Y

X = y/5 => X = [-] => X = [-+ ---] => X = [--- * ---]

5 5 5*(1 <33) 5 (1 <33)

3 * Y 1

(Verifiable: 0 <= --------- <-)

5*(1 <33) 5

That is, x = (uint64 (y) * m)> 33;

Magic M = (1 <33) + 3)/5 = 1717986919 = 0x66666667;

Unsigned number divided by 7: uint32 X, Y; X = y/7;

Reasoning: omitted

That is, x = (uint64 (y) * m)> 33) + Y)> 3;

Magic M = (1 <35) + 3)/7-(1 <32) = 613566757 = 0x24924925;

There are other substitution formulas for this completely equivalent substitution, which are not discussed. There is no discussion about alternative algorithms with signed Division, and many numbers have completely equivalent substitution algorithms, the magic numbers are also regular; the "aggressive" compiler should help the user deal with this optimization (as we can see in vc6!

The method for laziness is to directly view the assembly code generated by vc6 :);

8. If division is known to be an integer multiple of the divisor, an alternative algorithm or an improved algorithm can be obtained;

Here we only discuss the situation where the divisor is a constant;

For example, for (32-bit Division): x = y/a; // It is known that Y is a multiple of A, and assume that A is an odd number.

(If A is an even number, first convert it to a = A0 * (1 <k); y/a = (y/A0)> K; A0 is an odd number)

Obtain a' so that (a' * A) % (1 <32) = 1; // the existence of a' can be deduced by using a as an odd number and Y as a multiple of.

So: x = y/a => X = (y/a) * (A * A') % (1 <32 )) => X = (y * A') %

(1 <32); // you do not need to perform a remainder command here.

(This algorithm supports both signed and unsigned division)

For example, uint32 y;

Uint32 x = y/7; // It is known that Y is a multiple of 7.

Change to uint32 x = (uint32) (y * m); // where M = (5*(1 <32) + 1)/7

9. Approximate Calculation Division (this alternative method is not equivalent)

Optimize the operation of divisor 255 (257 or 1026, and so on) (or, and the formula for pushing and exporting is slightly different)

For example, in color processing: uint8 color = colorsum/255;

Changed to: uint8 color = colorsum/256; that is, color = colorsum> 8;

This error is often acceptable in color processing.

To reduce the error, you can change it to uint8 color = (colorsum + (colorsum> 8)> 8;

Derivation: X/255 = (x + x/255)/(255 + 1) = (x + a)> 8; A = x/255;

Change a to a = x> 8

(A small error is introduced). The formula X/255 is approximately equal to (x + (x> 8)> 8.

Likewise, X/255 is approximately equal to (x + (x> 8) + (x> 16)> 8 and other more accurate formulas (Please export the error items to determine whether the accuracy is sufficient)

10. Newton Iteration Method for Division

(Many internal division commands of the CPU are implemented using this method or similar methods)

Assume that there is a function y = f (x); evaluate the value of X when 0 = f (x; (Here we will not discuss the situation of multiple solutions, escape, fall into an endless loop or fall into a chaotic state)

(Reference Image)

Evaluate the function's guiding function to Y = f' (x); // you can understand the meaning of this function: the slope of the Function Y = f (x) at the X point;

A. Obtain an appropriate x Initial Value x0 and obtain Y0 = f (x0 );

B. Use (x0, y0) as a straight line with the slope of f' (x0) and submit it to X1 on the X axis;

C. Use the obtained X1 as the initial value for iteration;

After many iterations, it is considered that x1 will be very close to the solution of equation 0 = f (x), which is the Newton Iteration;

Simplify the above process and obtain the Newton iteration formula: x (n + 1) = x (n)-y (x (n)/Y' (x (n ))

Here we use the Newton iteration formula to calculate the reciprocal. (use the reciprocal to obtain the Division: Y = x/a = x * (1/))

1/,

Let a = 1/X; have the equation Y = A-1/X;

Evaluate the exported y' = 1/x ^ 2;

X (n + 1) = x (n)-y (x (n)/Y' (x (n ));

There is an iterative formula x_next = (2-A * X) * X; // it can be proved that this formula is a Level 2 convergence formula; that is, the effective accuracy of the calculated solution increases exponentially.

Proof of convergence: x = (1/a) + dx; // dx is a small amount

Then x_next-(1/a) = (2-A * (1/A + dx) * (1/A + dx)-1/

= (-A) * DX ^ 2 // ^ indicates the exponential Operator

Certificate completed.

The program can use this method to implement division and determine the number of iterations according to its own precision requirements;

(The selection of initial values is not discussed)

Appendix: using Newton Iteration Method to implement square operation

// The square operation can be expressed as Y = x ^ 0.5 = 1/(1/x ^ 0.5); Evaluate 1/x ^ 0.5 first

1/A ^ 0.5,

Let a = 1/x ^ 2; there is an equation Y = A-1/x ^ 2;

Evaluate the exported y' = 2/x ^ 3;

X (n + 1) = x (n)-y (x (n)/Y' (x (n ));

There is an iterative formula x_next = (3-a * x * X) * x * 0.5; // it can be proved that the formula is a Level 2 convergence formula. // The process of the method is omitted from the proof above.

11. Fast Division using a fast Multiplier

This is a fast division algorithm. Currently, it is best to use hardware for implementation. Please refer to the relevant materials.