How does the compiler optimize division of 32-bit integers by an integer constant? [C/C++]


Introduction

In my previous article, [ThoughtWorks code challenge: FizzBuzzWhizz general-purpose high-speed edition (C/C++ & C#)], I mentioned that the compiler optimizes division when the divisor is a constant. Today we will look at how that is implemented; if you ever want to write a compiler, you can use this theory. This article also serves as a note to myself.

Example

Let's look at an example of this compiler optimization. We consider integer division where the divisor is a constant (signed integers are discussed later): given unsigned int a, b, c, divisions such as a / 10, b / 5, and c / 3. Let's first see how the compiler implements unsigned int a, p; p = a / 10;. This is the optimized output of VS 2013, and the disassembly is as follows:

 

The test code is as follows:

We can see that the compiler did not emit a DIV instruction for the integer division. Instead it multiplies by 0xCCCCCCCD (a MUL instruction) and shifts EDX right by three bits (SHR edx, 0x03). The value left in EDX is the quotient we want.

The principle is: Value/10 ~= [[(Value * 0xCCCCCCCD) / 2^32] / 2^3], where square brackets denote taking the integer part.
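The multiply-and-shift formula above can be written out in C as a minimal sketch. The function name div10_mulshift is made up for illustration; the 64-bit product stands in for EDX:EAX, and the code is an assumption about how to mirror the MUL/SHR sequence, not the compiler's actual output.

```c
#include <stdint.h>

/* Sketch of Value/10 computed as ((Value * 0xCCCCCCCD) / 2^32) / 2^3.
   The function name is hypothetical. */
uint32_t div10_mulshift(uint32_t value)
{
    uint64_t product = (uint64_t)value * 0xCCCCCCCDu; /* MUL: 64-bit result (EDX:EAX) */
    uint32_t high = (uint32_t)(product >> 32);        /* keep EDX, the upper 32 bits  */
    return high >> 3;                                 /* SHR edx, 3                   */
}
```

For any 32-bit unsigned input this should agree with value / 10, for example div10_mulshift(99) compared with 99 / 10.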

First, we need to know that on Intel x86 the result of multiplying two 32-bit unsigned integers (uint32_t) is actually a 64-bit integer (uint64_t), stored in the EAX and EDX registers. EAX and EDX are both 32 bits wide; together they form a 64-bit integer, usually written EDX:EAX, where EAX is the low half of the uint64_t and EDX is the high half. Most 32-bit CPUs have this feature, that is, the integer multiplication result is twice as wide as the operands. Some CPUs may, for special reasons, produce a result no wider than the operands, but such CPUs should be rare. Only a CPU with this feature can optimize integer division by a constant in this way.
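To make the EDX:EAX split concrete, here is a small C sketch that widens to 64 bits and then separates the two halves. The helper names mul_hi32 and mul_lo32 are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Upper 32 bits of the 64-bit product: what MUL leaves in EDX.
   Hypothetical helper name. */
uint32_t mul_hi32(uint32_t x, uint32_t y)
{
    return (uint32_t)(((uint64_t)x * y) >> 32);
}

/* Lower 32 bits of the 64-bit product: what MUL leaves in EAX.
   Hypothetical helper name. */
uint32_t mul_lo32(uint32_t x, uint32_t y)
{
    return (uint32_t)((uint64_t)x * y);
}
```

The constant-division trick only ever uses the high half; the low half is discarded.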

The reason for not using the DIV instruction directly is that even on today's mature and widespread x86 CPUs, integer and floating-point division are still slow instructions. A CPU can implement integer or floating-point division in two ways. One is the trial-quotient method, which closely resembles long division done by hand, except that the CPU tries quotient digits in binary and uses conditional tests to decide when to stop. The other is to multiply by the reciprocal of the divisor, that is, x * (1.0/divisor), which turns the division into a multiplication. Multiplication is easier to implement in a CPU because it needs only shifters and adders, and these can work in parallel.

The first method, like division by hand, cannot be parallelized, and its running time is not fixed: it may finish quickly (when the division comes out exactly) or it may have to continue until a suitable precision is reached. The second method computes 1.0/divisor with an infinite series, which raises precision issues, so its running time is not fixed either. In addition, with this method an integer division must first be converted to a floating-point division and the result converted back to an integer. With the trial-quotient method, integer and floating-point division can be implemented as two separate schemes or merged into one.

As for efficiency, the second method is faster but consumes more power than the first because it uses many multipliers. I am not sure which method x86 actually uses, but the Intel manual says that a DIV instruction takes about 13 to 84 cycles on the Core 2 architecture. I do not know whether division is faster on newer CPUs. In terms of the implementation principle, however, division will never be as fast as multiplication unless there is a major breakthrough in divider design: integer or floating-point multiplication on Core 2 takes only three clock cycles, and it is very hard for division to approach that. The key point is that the latency of a division instruction is currently variable, sometimes short and sometimes long, and on average it is N times that of a multiplication instruction.

Principle

So what is the principle behind this optimization? How is it derived? Can it be applied to integer division by any constant uint32_t divisor? The answer is yes. But the precondition is that the divisor is a constant; if it is a variable, the compiler can only fall back to the DIV instruction.

Because cnblogs (Blog Park) does not support LaTeX well (it is inconvenient and the colors look poor), I put the derivation on [this website]. The derivation is as follows:

(When $c = 2^k$, the division can be implemented directly with a shift, so we do not discuss that case.)
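The derivation itself is not reproduced in this copy. The following is a reconstruction that is consistent with formula (1) and with the symbols used in the rest of the article ($a$ the dividend, $c$ the divisor, $k$, $m$, $e$, $E$). It is a sketch and may differ in detail from the author's original.

Choose $k$ with $2^k < c < 2^{k+1}$ and round the scaled reciprocal up:

$$ m = \left\lceil \frac{2^{32+k}}{c} \right\rceil = \frac{2^{32+k}}{c} + e, \qquad 0 \le e < 1 $$

Multiplying by the dividend and dividing by $2^{32+k}$ gives, for $0 \le a < 2^{32}$:

$$ \frac{a \cdot m}{2^{32+k}} = \frac{a}{c} + E, \qquad E = \frac{a \cdot e}{2^{32+k}} \le \frac{e}{2^k} $$

The integer part is unchanged as long as $E < 1/c$, which is guaranteed when

$$ 0 \le e < \frac{2^k}{c} \qquad (1) $$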

The above is the derivation of the principle. Let's analyze the error below:

Here is why the total error $E$ must satisfy $E < 1/c$:

In the easy case, when $a/10 = q + 0.0$, the computed q becomes 1 larger than the true q only when the total error $E \ge 1.0$.

In the hardest case, when $a/10 = q + 0.9$, the computed q becomes 1 larger than the true q as soon as the total error $E \ge 0.1$.

To sum up, as long as the total error $E$ satisfies $E < 1/c$, the q value after truncation equals the true q value.

 

In fact, if rounding up does not satisfy formula (1), then rounding down must satisfy formula (1). Why?

Rounding up makes the coefficient larger than the exact value and rounding down makes it smaller. Either way we call the difference an error, and the error analysis for rounding down is similar to the one above. (One caveat: with rounding down the product underestimates $a/c$, so an exact multiple of $c$ lands just below the true quotient; a practical implementation needs an extra correction, for example incrementing the dividend before multiplying.)

Let the error from rounding up be $e$ and the error from rounding down be $e'$. Then $e + e' = 1$ (for example, if the rounding-up error is 0.2, the rounding-down error is 0.8).

If $0 \le e < 2^k/c$ is not satisfied, then $e \ge 2^k/c$, so $0 \le 1 - e \le 1 - 2^k/c < 2^k/c$. (Here $c$ satisfies $2^k < c < 2^{k+1}$, so $0.5 < 2^k/c < 1$, and therefore $1 - 2^k/c < 0.5 < 2^k/c$.)

That is, $0 \le e' < 2^k/c$: if $e$ does not satisfy formula (1), $e'$ must satisfy formula (1).

Even in the most extreme case, $e = e' = 0.5$, formula (1) is satisfied because $0.5 < 2^k/c < 1$.
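As an illustration of the rounding-down case, take c = 7 with k = 2. Rounding $2^{34}/7$ up gives $e = 5/7$, which is not below $2^k/c = 4/7$, while rounding down gives $e' = 2/7$, which is. The sketch below is not taken from the article or from any particular compiler's output. It uses a different, well-known correction (incrementing the dividend before the multiply) because a rounded-down coefficient underestimates the quotient. All names are hypothetical.

```c
#include <stdint.h>

/* Rounded-down coefficient for c = 7, k = 2: floor(2^34 / 7) = 0x92492492. */
#define M7_DOWN ((uint32_t)(((uint64_t)1 << 34) / 7u))

/* No correction: the product is slightly too small, so exact multiples
   of 7 come out one below the true quotient. */
uint32_t div7_down_naive(uint32_t a)
{
    return (uint32_t)(((uint64_t)a * M7_DOWN) >> 34);
}

/* Dividend incremented first; done in 64 bits so a + 1 cannot wrap. */
uint32_t div7_down_fixed(uint32_t a)
{
    return (uint32_t)((((uint64_t)a + 1u) * M7_DOWN) >> 34);
}
```

The naive version returns 0 for an input of 7, while the corrected version returns 1.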

Practice

Now that we understand the principle, we can work out the optimization for /10. For the divisor c = 10, first find k such that 2^k < c < 2^(k+1); k is 3 because 2^3 < 10 < 2^(3+1). Let the required multiplication coefficient be m and set m = [2^(32+k)/c], rounding up here: m = [2^(32+3)/10] = [3435973836.8] = 3435973837, which is 0xCCCCCCCD in hexadecimal. The rounding error is e = 0.2 (rounding up enlarged the value by 3435973837 - 3435973836.8 = 0.2). Now check formula (1): 0 <= 0.2 < 2^3/10, that is, 0 <= 0.2 < 0.8, which is true. So the coefficient 0xCCCCCCCD is valid and the right shift is k = 3, consistent with what was shown earlier in the article.
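The same calculation can be sketched in C for an arbitrary divisor. The function name magic_round_up and its interface are assumptions made for this illustration. It assumes c >= 3 and that c is not a power of two, and it checks condition (1) in integer form: since $e \cdot c = m \cdot c - 2^{32+k}$ is an integer, $e < 2^k/c$ is the same as $e \cdot c < 2^k$.

```c
#include <stdint.h>

/* Hypothetical helper. For c >= 3 that is not a power of two:
   finds k with 2^k < c < 2^(k+1), stores m = ceil(2^(32+k) / c) and k,
   and returns 1 if the rounded-up m satisfies condition (1), else 0. */
int magic_round_up(uint32_t c, uint64_t *m, unsigned *k)
{
    unsigned shift = 0;
    while (((uint64_t)1 << (shift + 1)) < c)
        ++shift;                                  /* 2^shift < c < 2^(shift+1) */

    uint64_t pow2 = (uint64_t)1 << (32 + shift);  /* 2^(32+k), at most 2^63    */
    uint64_t rem = pow2 % c;                      /* nonzero for such c        */
    uint64_t e_times_c = c - rem;                 /* e * c as an integer       */

    *m = pow2 / c + 1;                            /* ceiling, because rem != 0 */
    *k = shift;
    return e_times_c < ((uint64_t)1 << shift);    /* e < 2^k / c               */
}
```

For c = 10 this yields m = 0xCCCCCCCD and k = 3 with the condition satisfied; for c = 7 the condition fails, which is the case where rounding up alone is not enough.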

Let's look at the coefficients the compiler uses for other values of c. For /100 the coefficient is 0x51EB851F with a right shift of 5. For /3 it is 0xAAAAAAAB with a right shift of 1. For /5 it is 0xCCCCCCCD with a right shift of 2, which is the same as /10 with one less bit of shift (or, put the other way, /10 is /5 with a different shift). For /1000 the coefficient is 0x10624DD3 with a right shift of 6, and for /125 it is 0x10624DD3 with a right shift of 3.

We see that if the divisor $c$ is divisible by $2^N$, we can first write $c = 2^N \cdot c'$, apply the above optimization to $c'$, and then add N to the right-shift count. This step is optional, but it helps narrow the range of the multiplication coefficient.
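A short C sketch of this factoring, using 1000 = 2^3 * 125. The coefficient 0x10624DD3 is ceil(2^35 / 125); the function names are hypothetical, and the code illustrates the idea rather than reproducing compiler output.

```c
#include <stdint.h>

/* a / 125: multiply by ceil(2^35 / 125) = 0x10624DD3, take the high
   32 bits, then shift right by 3. Hypothetical name. */
uint32_t div125(uint32_t a)
{
    return (uint32_t)(((uint64_t)a * 0x10624DD3u) >> 32) >> 3;
}

/* a / 1000 = a / (2^3 * 125): same coefficient, shift 3 + 3 = 6. */
uint32_t div1000(uint32_t a)
{
    return (uint32_t)(((uint64_t)a * 0x10624DD3u) >> 32) >> 6;
}
```

Both functions share one coefficient; only the shift count differs by N = 3.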

32-Bit Signed Integer Optimization

This is very similar to the division optimization for 32-bit unsigned integers. The optimized result is as follows:

We can see that the optimized code multiplies by the 32-bit integer 0x66666667, arithmetically shifts the high half (EDX) right by 2 bits, then logically shifts the high half right by 31 bits (which extracts its sign bit) and adds that value to EDX (the value is 1 when the dividend is negative and 0 when it is positive). When the dividend a is non-negative this is easy to understand: the reasoning is the same as for the 32-bit unsigned case above, except that 0x66666667 is derived for a in the range 0 ~ 2^31-1. The derivation is similar to the unsigned one.
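The signed sequence described above can be sketched in C as follows. The function name is hypothetical. The sketch assumes that right-shifting a negative signed value is an arithmetic shift, which is implementation-defined in C but is what mainstream compilers do; it mirrors the described instructions rather than reproducing the compiler's listing.

```c
#include <stdint.h>

/* Sketch of signed a / 10 via multiply, arithmetic shift, and sign fix-up.
   Assumes >> on negative signed values is an arithmetic shift. */
int32_t sdiv10_mulshift(int32_t a)
{
    int64_t product = (int64_t)a * 0x66666667;  /* IMUL: signed 64-bit product       */
    int32_t high = (int32_t)(product >> 32);    /* EDX, the upper 32 bits            */
    high >>= 2;                                 /* SAR edx, 2                        */
    uint32_t sign = (uint32_t)high >> 31;       /* SHR 31: 1 if negative, else 0     */
    return high + (int32_t)sign;                /* add the sign bit back into EDX    */
}
```

The result should match C's truncating division, so an input of -15 gives -1 and an input of -5 gives 0.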

When the dividend a is negative, things are slightly more complicated. First, recall how rounding to an integer is defined here. When a >= 0, [a] is the largest integer less than or equal to a. When a < 0, [a] is the smallest integer greater than or equal to a. For example, [-0.5] is the smallest integer greater than or equal to -0.5, that is, [-0.5] = 0. Likewise, [-1.5] = -1 and [-11.3] = -11.

Take /10 as an example. If $a = -5$, then $a/10 = -5/10 = q + r = (-1.0) + 0.5$, while $[a/10] = [-0.5] = 0$. If $a = -15$, then $a/10 = -15/10 = q + r = (-2.0) + 0.5$, while $[-15/10] = [-1.5] = -1$. So, with the remainder $r$ kept in the range $0 \le r < 1$, the definition of rounding for negative numbers makes $[a/10]$ exactly 1 greater than $q$. This $q$ is what the multiply and arithmetic shift produce, just as in the unsigned case, but what we want is $[a/10]$, so we only need to add 1 to $q$. (For an exact multiple such as $a = -10$, the rounded-up coefficient puts the product slightly below $-1$, so $q = -2$ and adding 1 still gives the correct $-1$.) That is what shifting EDX right by 31 bits to get the sign and adding it to the original EDX does: the final result is $[a/10]$.

Conclusion: when the divisor is a constant, prefer 32-bit unsigned integers unless the value really must be a 32-bit signed integer, because constant division of a 32-bit signed integer takes two more instructions than the unsigned version: one shift and one ADD.

Conclusion

After the derivation above, we know how to obtain the multiplication coefficient and the shift count (a right shift) used by the optimization. But why do some compilers optimize /10 for 32-bit unsigned integers to Value/10 ~= [[(Value * 0x66666667) / 2^32] / 2^2]? The principle is simple: as long as the error stays below 1/c for every dividend a in the range being handled, the final result is correct no matter which coefficient is used. Note that with 0x66666667 and a shift of 2 the total error is 0.6 * a / 2^34, which is below 1/10 only for a < 2^34/6, so this coefficient covers 0 ~ 2^31-1 but not the whole range 0 ~ 2^32-1.
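A small C sketch makes the range condition concrete for the coefficient 0x66666667 with a shift of 2. The function name div10_alt is made up for this illustration. Since 0x66666667 = (2^34 + 6) / 10, the total error is 6a / (10 * 2^34), which grows with a.

```c
#include <stdint.h>

/* Value/10 approximated as ((Value * 0x66666667) / 2^32) / 2^2.
   Hypothetical name; exact only while the total error stays below 1/10. */
uint32_t div10_alt(uint32_t value)
{
    return (uint32_t)(((uint64_t)value * 0x66666667u) >> 32) >> 2;
}
```

For 2147483647 (2^31 - 1) this returns the exact quotient 214748364. For 4294967289 it returns 429496729, one more than the true quotient 429496728, because the error at that input is about 0.15.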

/constant and %constant are equivalent. This optimization is used a lot in functions such as itoa(). In practice, /10, /5, and /3 rarely appear alone; they usually appear together with %10, %5, and %3. If /10 and %10 are performed together, the two operations can share a single constant-division optimization, because %10 is just the remainder of /10: once /10 is known, %10 is only $r = a - c \cdot q$, which costs one more multiplication and one subtraction. On Windows, the itoa() function takes a radix (base) parameter, and radix = 10 must be passed for decimal. Because radix is a variable, the function uses the DIV instruction internally, which makes it a poor choice for performance.
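As a sketch of sharing one constant division between /10 and %10, here is a minimal decimal conversion routine in C. The name u32_to_dec and its interface are hypothetical; this is not the CRT's itoa(), and whether the division is actually lowered to multiply-and-shift depends on the compiler and its optimization settings.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the decimal digits of value into buf
   (which must hold at least 11 bytes) and returns the digit count. */
size_t u32_to_dec(uint32_t value, char *buf)
{
    char tmp[10];                       /* a uint32_t has at most 10 digits   */
    size_t n = 0;

    do {
        uint32_t q = value / 10u;       /* constant divisor                   */
        uint32_t r = value - q * 10u;   /* r = a - c * q, no second division  */
        tmp[n++] = (char)('0' + r);
        value = q;
    } while (value != 0);

    for (size_t i = 0; i < n; ++i)      /* digits were produced in reverse    */
        buf[i] = tmp[n - 1 - i];
    buf[n] = '\0';
    return n;
}
```

Because the divisor is the literal 10 rather than a radix parameter, the compiler is free to apply the constant-division optimization inside the loop.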

There are many similar tricks. For example, sprintf() implementations usually convert numbers through a variable base, with base = 8, base = 10, or base = 16. The compiler can optimize base = 8 or 16 into shifts, but strangely, when the base is 10 the DIV instruction is used. This is the case in all versions of Visual Studio, including the latest VS 2013 and 2014, and it hurts efficiency. I have not checked GCC and clang; my guess is that clang may optimize it properly (possibly under certain conditions). It is worth studying if you have time. In fact, this can be solved by changing the code: write a separate code path for each base with the base as a constant, so that we do not rely on the compiler's optimization. Because every divisor is then a constant, the compiler knows exactly what to do and there is no ambiguity.
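One way to replace the variable base with constants is a switch with one branch per supported base. The sketch below is only an illustration of that idea; the function name div_by_base is hypothetical and the code is not taken from any sprintf() implementation.

```c
#include <stdint.h>

/* Hypothetical helper: one digit-extraction step for a runtime base.
   Each case divides by a literal, so the divisor is a constant there. */
uint32_t div_by_base(uint32_t value, uint32_t base)
{
    switch (base) {
    case 8:  return value / 8u;    /* power of two: can become a shift       */
    case 10: return value / 10u;   /* constant: can become multiply + shift  */
    case 16: return value / 16u;   /* power of two: can become a shift       */
    default: return value / base;  /* variable divisor; base must be nonzero */
    }
}
```

Only the default branch still needs a real division, and it is reached only for unusual bases.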

Jimi

Here is a small advertisement: I recommend a project of mine called Jimi, on GitHub at https://github.com/shines77/jimi/. The goal is a C/C++ class library that is high-performance, practical, scalable, and interface-friendly, closer in feel to Java and C#. The first inspiration was OSGi.NET, http://www.iopenworks.com; the original intention was a C++ class library whose parts could be combined freely, as scalable as OSGi (Open Service Gateway Initiative). I originally wanted to call it Jimu, but that did not sound nice, so I changed it to the current name. Implementing OSGi the way Java does is currently not feasible in C++ (because of efficiency and language limitations), so the direction has changed. It includes some of the optimization techniques for itoa() and sprintf() mentioned above. There is still much that is incomplete and it has not been promoted yet, but it contains many practical techniques waiting for you to discover.
