Introduction

In a previous article [ThoughtWorks Code Challenge -- FizzBuzzWhizz Game, General High-Speed Version (C & C++)], I mentioned that the compiler optimizes division when the divisor is a constant. Today I have sorted this out: first, so that one can understand how it is implemented; second, because if you ever want to write a compiler, this theory will be useful. It also serves as a note to myself.

Example

Let's first look at an example of this compiler optimization: integer division with a constant divisor (unsigned integers for now; signed integers are discussed later). Given unsigned int a, b, c, consider integer divisions such as a/10, b/5, c/3. Let's see how `unsigned int a, p; p = a / 10;` is implemented. The following is the result after optimization by VS 2013; the disassembly is as follows:

The test code is as follows:

As we can see, the compiler does not compile this into an integer division (div) instruction; instead it multiplies by 0xCCCCCCCD (a mul instruction) and then shifts edx right by 3 bits (shr edx, 3). The value in edx is the quotient we want.

The principle is as follows: value/10 ≈ [[(value × 0xCCCCCCCD) / 2^32] / 2^3], where the square brackets denote rounding down (floor).
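This can be reproduced in plain C (a minimal sketch; `div10` is a name chosen here, not taken from the original test code). Shifting the 64-bit product right by 32 bits selects the high half (edx), and 3 more bits complete the optimization:

```c
#include <stdint.h>

/* Divide by 10 using the multiply-and-shift trick described above. */
static uint32_t div10(uint32_t a)
{
    uint64_t prod = (uint64_t)a * 0xCCCCCCCDu; /* edx:eax = a * m */
    return (uint32_t)(prod >> 35);             /* (prod >> 32) >> 3 */
}
```

For example, `div10(4294967295u)` yields 429496729, matching `4294967295u / 10u`.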

The first thing to know is that on Intel x86, the product of two 32-bit unsigned integers (uint32_t) is actually a 64-bit integer (uint64_t). The result of mul is stored in the eax and edx registers; eax and edx are each 32 bits and together form a 64-bit integer, usually written edx:eax, where eax is the low half and edx the high half. Most 32-bit CPUs have this property: the bit length of an integer multiplication result is twice the bit length of the operands. Of course, there may be CPUs where, for special reasons, the product has the same bit length as the multiplicands, but such CPUs should be very rare. Only CPUs with this property can implement the constant integer division optimization this way.

As for why the div instruction is not used directly: even on today's mature and popular x86-family CPUs, integer and floating-point division are still slow instructions. CPUs generally implement division in one of two ways. The first is trial subtraction, which resembles how humans do long division, except the CPU works in binary and uses conditional tests to decide when to stop; it cannot be parallelized, and its running time is indeterminate: it may finish early (an exact division) or run until the required precision is reached. The second is multiplying by the reciprocal of the divisor, i.e. x × (1.0/divisor), turning the division into a multiplication; multiplication is comparatively easy to implement in hardware, needing only shifters and adders, and can be processed in parallel. However, computing 1.0/divisor by series expansion involves precision issues and takes a long time, and with this method an integer division must be converted to floating point and the result converted back to an integer. Integer and floating-point division can use two separate schemes or share one. As for efficiency, the second method is faster but consumes more power because of the large number of multipliers. I am not sure which method x86 actually uses, but Intel's manual says the div instruction on the Core 2 architecture takes roughly 13-84 cycles.
Whether newer CPUs have faster division instructions, I am not sure; judging from the implementation principles, division will not approach the speed of multiplication unless there is a major breakthrough in divider design, since integer and floating-point multiplication on Core 2 take only about 3 clock cycles. The key point is that, at present, the latency of a division instruction is variable, sometimes short, sometimes long, and on average many times that of a multiplication instruction.

Principle

So, what is the principle behind this optimization? How are the constants obtained? Can the optimization be applied to every constant uint32_t divisor? The answer is yes (for a few divisors, such as 7, the compiler needs a slightly longer multiplier plus an extra correction step, but the idea is the same). The precondition is that the divisor must be a constant; if it is a variable, the compiler has no choice but to use the div instruction.

Because this blog platform's LaTeX support is poor (inconvenient, and the colors are hard to read), I put the derivation on [this site]. The following is the derivation:

(When $c = 2^k$, the division can be implemented directly with a shift, so we do not discuss that case.)
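Since the derivation itself lives on an external page, here is a brief sketch of the argument, reconstructed to match condition (1) used below:

```latex
Let $c$ be the constant divisor with $2^k < c < 2^{k+1}$, let
$m = \lceil 2^{32+k}/c \rceil$, and let the rounding error be
$e = m - 2^{32+k}/c$, so $0 \le e < 1$. For any dividend
$0 \le a < 2^{32}$:
\[
  \frac{a \cdot m}{2^{32+k}}
  = \frac{a}{c} + \frac{a \cdot e}{2^{32+k}}
  < \frac{a}{c} + \frac{e}{2^k}.
\]
Write $a/c = q + r/c$ with $0 \le r \le c-1$, so the fractional part of
$a/c$ is at most $1 - 1/c$. Because $m$ is rounded up, the left-hand
side is never below $a/c$, so the floor cannot undershoot $q$; and if
the added error satisfies $e/2^k < 1/c$, i.e. condition (1):
\[
  0 \le e < 2^k/c, \tag{1}
\]
then the sum never reaches $q + 1$, hence
$\lfloor a \cdot m / 2^{32+k} \rfloor = q = \lfloor a/c \rfloor$.
```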

That is the derivation of the principle. Now let us analyze the error:

Here is an explanation of why the total error $E$ must satisfy the condition $E < 1/c$:

For example, when $a/10 = q + 0.0$: if the total error $E \ge 1.0$, the computed quotient will be larger than the true $q$ by 1;

A more extreme case: when $a/10 = q + 0.9$, if the total error $E \ge 0.1$, the computed quotient will be larger than the true $q$ by 1.

In summary, as long as the total error $E$ satisfies $E < 1/c$, the rounded-down quotient is guaranteed to equal the true $q$.

In fact, if rounding up does not satisfy condition (1), then rounding down is bound to satisfy it. Why?

Although the result of rounding up is greater than the true value and the result of rounding down is less, we call the deviation an error in either case; the error analysis for rounding down is similar to the above.

Let the error for rounding up be $e$ and the error for rounding down be $e'$; then $e + e' = 1$ (if the round-up error is 0.2, the round-down error is necessarily 0.8).

And if $0 \le e < 2^k/c$ is not satisfied, then $0 \le (1 - e) < 2^k/c$ must hold (here $c$ satisfies $2^k < c < 2^{k+1}$, so $0.5 < 2^k/c < 1$).

That is, $0 \le e' < 2^k/c$: if $e$ does not satisfy condition (1), then $e'$ must satisfy it.

Even in the most extreme case, $e = 0.5$: since $0.5 < 2^k/c < 1$, $e = e' = 0.5$ is bound to satisfy condition (1).

Practice

Knowing the principle, let us work out the /10 optimization by hand. The divisor is c = 10. First find a k satisfying 2^k < c < 2^(k+1): here k = 3, because 2^3 < 10 < 2^4. Let the required multiplication factor be m; then m = [2^(32+k)/c], rounded up, i.e. m = [2^(32+3)/10] = [3435973836.8] = 3435973837, which in hexadecimal is 0xCCCCCCCD, with error e = 0.2 (because we rounded up: 3435973837 - 3435973836.8 = 0.2). Check condition (1): 0 ≤ 0.2 < 2^3/10, i.e. 0 ≤ 0.2 < 0.8, which holds. So the multiplication factor 0xCCCCCCCD is valid, and the right-shift count is k = 3 bits, exactly as the disassembly earlier in the article showed.

Let's look at the compiler's coefficients for other values of c. The multiplication factor for /100 is 0x51EB851F with a right shift of 5 bits; for /3 it is 0xAAAAAAAB with a right shift of 1 bit; for /5 it is 0xCCCCCCCD with a right shift of 2 bits, the same factor as /10 with one bit less shift (in other words, /10 is just /5 with a different shift). The factor for /1000 is 0x10624DD3 with a right shift of 6 bits, and for /125 it is the same 0x10624DD3 with a right shift of 3 bits.
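These (factor, shift) pairs can be searched for programmatically. Below is a sketch (the name `find_magic` is mine, and the smallest-k search strategy is an assumption; real compilers may proceed differently), which reproduces all of the constants listed above. Note that for a few divisors, such as 7, no 32-bit factor satisfies condition (1); the function reports failure in that case, and compilers handle such divisors with a longer multiplier plus an extra correction step not covered here.

```c
#include <stdint.h>

/*
 * For a constant divisor c (1 < c < 2^31), find the smallest k and the
 * factor m = ceil(2^(32+k) / c) satisfying condition (1), i.e.
 * e = m - 2^(32+k)/c < 2^k/c, which in integer form is
 * m*c - 2^(32+k) < 2^k. Then a/c == (uint32_t)(((uint64_t)a * m) >> (32+k))
 * for every uint32_t a. Returns 1 on success, 0 if no 32-bit factor exists.
 */
static int find_magic(uint32_t c, uint32_t *m_out, int *k_out)
{
    for (int k = 0; k < 31; ++k) {
        uint64_t pow = 1ull << (32 + k);
        uint64_t m = (pow + c - 1) / c;      /* ceil(2^(32+k) / c) */
        if (m > 0xFFFFFFFFull)
            return 0;                        /* factor no longer fits in 32 bits */
        if (m * c - pow < (1ull << k)) {     /* condition (1) in integers */
            *m_out = (uint32_t)m;
            *k_out = k;
            return 1;
        }
    }
    return 0;
}
```

Notice that /1000 and /125 yield the same factor with shifts differing by 3, matching the factorization 1000 = 2^3 × 125.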

In fact we notice that, for a divisor c, if c is divisible by $2^n$, we can first factor it as $c = 2^n \cdot c'$, apply the optimization above to $c'$, and then add n to the right-shift count. This step is optional, but it helps keep the multiplication coefficient small.

Optimization of 32-bit signed integers

This is very similar to the unsigned 32-bit division optimization. Again taking /10 as the example, the optimized result is:

We can see that the optimization multiplies by the 32-bit integer 0x66666667, arithmetically shifts the high half (edx) right by 2 bits, then logically shifts it right by 31 bits (which extracts the sign bit), and adds that bit to the shifted value: the correction is 1 when the quotient is negative and 0 when it is positive. When the dividend a is non-negative, the reasoning is exactly the same as for the 32-bit unsigned case above; the coefficient 0x66666667 is determined by the non-negative range 0 to 2^31-1 of a 32-bit signed integer, and the derivation is similar to the unsigned case.

When the dividend a is negative, the situation is slightly more complex. First, let's look at how integer division rounds: it truncates toward zero. Writing the truncated value of a as [a]: when a ≥ 0, [a] is the largest integer less than or equal to a; when a < 0, [a] is the smallest integer greater than or equal to a. So for a negative number, [-0.5] is the smallest integer greater than or equal to -0.5, that is, [-0.5] = 0; likewise [-1.5] = -1 and [-11.3] = -11.

Take /10 as the example. If $a = -5$, then in floor terms $a/10 = -5/10 = q + r = (-1.0) + 0.5$, while the truncated value is $[a/10] = [-5/10] = [-0.5] = 0$. If $a = -15$, then $a/10 = -15/10 = q + r = (-2.0) + 0.5$, while $[-15/10] = [-1.5] = -1$. We see that, to keep the remainder $r$ in the range $0 \le r < 1.0$, the truncated value $[a/10]$ is 1 larger than the floored value $q$. The multiply-and-shift sequence produces exactly the floored $q$ of the unsigned derivation, but what we need is $[a/10]$, so we just add 1 to $q$ when the result is negative. That is what the assembly above does: edx's sign bit is extracted by the logical shift right of 31 bits and then added to the shifted edx, and the accumulated result is the final value of $[a/10]$. (One subtlety: when a negative $a$ divides evenly, rounding $m$ upward makes the product undershoot the exact integer slightly, so the arithmetic shift yields $q - 1$ and adding the sign bit again gives the correct quotient.)
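The whole signed sequence can be mirrored in C (a sketch; `div10s` is a name of my choosing). The arithmetic shift produces the floored quotient, one less than the truncated quotient for negative inputs, and adding the sign bit restores truncation toward zero:

```c
#include <stdint.h>

/* Signed divide-by-10 via multiply-and-shift, mirroring the disassembly:
 * imul by 0x66666667, sar edx by 2, then add edx's sign bit.
 * (Right-shifting a negative value is arithmetic on mainstream compilers.) */
static int32_t div10s(int32_t a)
{
    int64_t prod = (int64_t)a * 0x66666667LL; /* edx:eax = a * m */
    int32_t q = (int32_t)(prod >> 34);        /* arithmetic: (>>32) >> 2 */
    return q + (int32_t)((uint32_t)q >> 31);  /* +1 iff q is negative */
}
```

For example, `div10s(-5)` yields 0 and `div10s(-15)` yields -1, matching C's truncating division.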

Conclusion: when the divisor is a constant, unless the computation genuinely requires a 32-bit signed integer, prefer a 32-bit unsigned integer, because constant division of a 32-bit signed integer costs two more instructions than the unsigned version: one shift and one add.

Conclusion

After the derivation above, we know how to obtain the multiplication coefficient and the shift count. But why do some compilers also optimize the unsigned constant division /10 as value/10 ≈ [[(value × 0x66666667)/2^32]/2^2]? The answer is that the (coefficient, shift) pair is not unique: any pair works as long as, over the actual range of the dividend a, the total error stays below 1/c. For 0x66666667 with a 2-bit shift, that bound holds for dividends up to 2^31-1 but not over the full 0 to 2^32-1 range, so a compiler may pick this pair when it knows the value fits in 31 bits (for example, a non-negative signed integer).
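A quick check (a sketch; the failing value was found by working the error bound backwards) confirms that the 0x66666667 pair is only safe below 2^31:

```c
#include <stdint.h>

/* Divide by 10 using the alternative pair (m = 0x66666667, k = 2). */
static uint32_t div10_alt(uint32_t a)
{
    return (uint32_t)(((uint64_t)a * 0x66666667u) >> 34);
}
```

For example, `div10_alt(0x7FFFFFFF)` gives the correct 214748364, while `div10_alt(4294967289u)` returns 429496729 instead of the true quotient 429496728: the accumulated error crosses 1/c once the dividend exceeds 2^31.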

/constant and %constant are equivalent: this optimization is commonly exploited in the itoa() function. In practice the forms /10, /5, /3 rarely appear alone; they mostly come as %10, %5, %3. If /10 and %10 are both needed, the two operations can share one constant-division optimization, because %10 is just the remainder of /10: once q = a/10 is obtained, a%10 is just the computation $r = a - c \cdot q$, costing only one extra multiply and one subtract instruction. Windows' itoa() requires a radix (base) argument, and decimal conversion must pass radix = 10; because radix is a variable, its implementation uses a div instruction internally, which makes it a poor function.
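As a sketch of the combined /10 and %10 trick inside an itoa-style routine (the function names here are mine, not Windows' implementation):

```c
#include <stdint.h>

static uint32_t div10(uint32_t a)
{
    return (uint32_t)(((uint64_t)a * 0xCCCCCCCDu) >> 35);
}

/* Convert an unsigned 32-bit value to decimal. Each step computes the
 * quotient once via multiply-and-shift, then derives the remainder with
 * one multiply and one subtract instead of a second division. */
static char *u32toa10(uint32_t v, char *buf)
{
    char tmp[10];                     /* uint32_t has at most 10 digits */
    int n = 0;
    do {
        uint32_t q = div10(v);
        uint32_t r = v - q * 10u;     /* r = v % 10 */
        tmp[n++] = (char)('0' + r);
        v = q;
    } while (v != 0);
    char *p = buf;
    while (n > 0)                     /* digits came out least-significant first */
        *p++ = tmp[--n];
    *p = '\0';
    return buf;
}
```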

There are more tricks of this kind. For example, sprintf() implementations generally convert a number to octal, decimal, or hexadecimal through a base variable, even when the call amounts to base = 8, base = 10, or base = 16. The compiler can optimize base = 8 and base = 16 into shifts, but strangely, base = 10 still ends up as a div; this is true in every version of Visual Studio, including the recent VS 2013 and 2014, so efficiency suffers. I have not verified GCC or Clang; my guess is that Clang can optimize this correctly (possibly with preconditions), which is worth investigating. In fact the problem can be solved by modifying the code: replace the variable base with a literal constant in each branch. Since the divisor is then explicitly a constant, the compiler knows exactly what to do, with no ambiguity.
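A sketch of that workaround (all names are mine, not from any real sprintf() implementation): the digit loop is a small inline function, and each switch branch passes a literal base, so after inlining every division and modulo has a constant divisor:

```c
#include <stdint.h>

/* Digit loop; when `base` is a compile-time constant at the call site,
 * the compiler can replace the / and % below with shifts (8, 16) or
 * multiply-and-shift (10). */
static inline char *emit_digits(uint32_t v, char *buf, uint32_t base)
{
    static const char digits[] = "0123456789abcdef";
    char tmp[32];
    int n = 0;
    do {
        tmp[n++] = digits[v % base];
        v /= base;
    } while (v != 0);
    char *p = buf;
    while (n > 0)
        *p++ = tmp[--n];
    *p = '\0';
    return buf;
}

static char *utoa_base(uint32_t v, char *buf, uint32_t base)
{
    switch (base) {
    case 8:  return emit_digits(v, buf, 8);    /* shifts */
    case 10: return emit_digits(v, buf, 10);   /* multiply-and-shift */
    case 16: return emit_digits(v, buf, 16);   /* shifts */
    default: return emit_digits(v, buf, base); /* genuine div */
    }
}
```

The output is unchanged; only the generated code differs per branch.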

Jimi

Here is a small advertisement for one of my projects. I call it Jimi; the GitHub address is https://github.com/shines77/jimi/. The goal is a high-performance, practical, extensible C++ class library with friendly interfaces, closer in feel to Java and C#. The initial inspiration came from OSGi.NET (http://www.iopenworks.com); I intended to build an extensible, freely combinable C++ class library like OSGi (Open Service Gateway Initiative). I originally wanted to call it Jimu, but it did not sound good, so I changed it to the present name. Because implementing a Java-style OSGi in C++ is not currently feasible (due to efficiency and language limitations), the direction has since changed. It contains some of the itoa() and sprintf() optimization techniques mentioned above; many things are still incomplete and I have done no promotion, but there are many practical tricks in it waiting for you to discover.

References

[emath BBS] Discussion and implementation of fast algorithms for multiplying and dividing big numbers by 10, by liangbch

http://bbs.emath.ac.cn/thread-521-3-1.html

[CSDN Forum] Use shift instruction to achieve a number divided by 10

http://bbs.csdn.net/topics/320096074

How does the compiler implement constant integer division optimization for 32-bit integers? [C++]