Problems and Solutions for Cross-Platform C/C++ Floating-Point Computation

Source: Internet
Author: User
Tags: centos, truncated

Background: simply put, a recent project written in C# involved floating-point computation; the ins and outs are omitted here. Consider the following code.

float p3x = 80838.0f;
float p2y = -2499.0f;
double v321 = p3x * p2y;
Console.WriteLine(v321);

Run it and the result is -202014162. No problem there; could C# really produce anything else? Impossible. So start Visual Studio and try the code above: the result is -202014162 again. All done, then? Apparently not! Change the build target from AnyCPU to x64 (the server environment is 64-bit!) and the result becomes -202014160. Yes, -202014160. I didn't believe it, but running it again still gives -202014160. Then I figured it out: because of floating-point rounding error, -202014160 is actually the "reasonable" result.

Why is this normal? Because p3x and p2y are both float. Although v321 is double, the two floats are multiplied first and only the product is converted to double. A float has only about 7 significant decimal digits, so there is no way to represent a nine-digit value in the hundreds of millions exactly.

But why does changing the CPU target give different results? Let's try the same thing in C/C++.

#include <iostream>
using namespace std;

int main()
{
    float p3x = 80838.0f;
    float p2y = -2499.0f;
    double v321 = p3x * p2y;
    std::cout.precision(15);
    std::cout << v321 << std::endl;

    return 0;
}

The C++ code above produces the following results on different platforms:

Windows, 32-bit and 64-bit: -202014160
64-bit Linux (CentOS 6, gcc 4.4.7): -202014160
32-bit Linux (Ubuntu 12.04, gcc 4.6.3): -202014162

The "reasonable" result is -202014160, while the mathematically correct result is -202014162; the gap is caused by insufficient float precision (why -202014160 is "reasonable" is explained at the end of this article). Multiplying two double values yields a result that is both correct and reasonable: declare p3x and p2y as double in the C++ program above and you get -202014162, because double is double-precision while float is single-precision, and double has enough significand bits to hold the value exactly. What we do not yet understand is why 32-bit Linux computes the "correct" number rather than the "reasonable" one.

Like C++, C# gives different results on 32-bit and 64-bit (in Debug mode; more on this later). Let's look at the assembly generated for the C++/C# code (using gdb's `disassemble /m main` command; below, only the assembly for the line that multiplies the two floats and converts to double is shown).

Compiled with g++ on Linux

// C++ on 32-bit Ubuntu 12.04
double v321 = p3x * p2y;
   0x0804860f <+27>:  flds   0x18(%esp)
   0x08048613 <+31>:  fmuls  0x1c(%esp)
   0x08048617 <+35>:  fstpl  0x10(%esp)

......

// C++ on 64-bit CentOS 6
double v321 = p3x * p2y;
   0x000000000040083c <+24>:  movss    -0x20(%rbp),%xmm0
   0x0000000000400841 <+29>:  mulss    -0x1c(%rbp),%xmm0
   0x0000000000400846 <+34>:  unpcklps %xmm0,%xmm0
   0x0000000000400849 <+37>:  cvtps2pd %xmm0,%xmm0
   0x000000000040084c <+40>:  movsd    %xmm0,-0x18(%rbp)

Compiled with Visual Studio on Windows


// C#, AnyCPU build, Windows, VS2012
double v321 = p3x * p2y;
00000049  fld   dword ptr [ebp-40h]
0000004c  fmul  dword ptr [ebp-44h]
0000004f  fstp  qword ptr [ebp-4Ch]

// C#, 64-bit build, Windows 7, VS2012
double v321 = p3x * p2y;
009B43B8  movss    xmm0, dword ptr [p3x]
009B43BD  mulss    xmm0, dword ptr [p2y]
009B43C2  cvtss2sd xmm0, xmm0
009B43C6  movsd    mmword ptr [v321], xmm0

From the assembly above we can see that on both Linux and Windows, and in both C++ and C#, the 32-bit and 64-bit builds use different floating-point instructions: the 32-bit code uses fld/fmul/fstp, while the 64-bit code uses movss/mulss/movsd. So this is clearly platform-related.

Digging further, the fld/fmul/fstp instructions are executed by the FPU (floating-point unit); to be precise, they are x87 FPU instructions. The x87 FPU performs floating-point arithmetic in 80-bit registers and truncates the result to 32 or 64 bits only when storing a float or double, so by default the FPU minimizes the precision loss caused by rounding. The IEEE-754 floating-point standard recommends that implementers provide such extended-precision formats, and the x87 FPU in Intel x86 processors supports this extension.

When the FPU is not used, the SSE instructions operate on 128-bit registers, but a float still occupies only 32 bits and the computation is also carried out in 32 bits. This is the root cause of the behavior above; see the appendix at the end of this article for a detailed analysis.

With this in mind, look at the g++ man page: there is a compiler option called -mfpmath. On 32-bit targets it defaults to 387, i.e. x87 FPU instructions; on 64-bit targets it defaults to sse, i.e. SSE instructions. So for the example in this article, compiling the 64-bit build with -mfpmath=387 yields the "correct" result instead of the "reasonable" one.

In VS2012, C++ has a code-generation option /fp:[precise|fast|strict]. In this example, a 32-bit Release build with precise or strict produces the reasonable result (-202014160), while fast produces the correct result (-202014162); fast gives different results in Debug and Release (optimization happens only in Release). For 64-bit you can test the Debug/Release results yourself and inspect the code VS generates. (Chen Hao's note: on my machine, VS2012 emits the same SSE-based assembly no matter how the /fp parameter is set; Release differs, but the Release disassembly is so strange that it barely lines up with the source. I haven't developed on Windows for many years; my VS experience stops at VC6/VC2005.)

So when code migrates from x87 FPU instructions to SSE instructions, this kind of floating-point precision problem can appear, and in scientific computing with many iterations it can get much worse. The issue is not simply "32-bit vs. 64-bit"; it depends mainly on how the language's compiler implements floating point. More recent language standards, such as C99 and Fortran 2003, specify long double to expose extended double precision, which can eliminate more of these precision problems.

Next, let's change the program to use long double (note: the variable types below are changed to long double):

#include <iostream>
using namespace std;

int main()
{
    long double p3x = 80838.0;
    long double p2y = -2499.0;
    long double v321 = p3x * p2y;
    std::cout.precision(15);
    std::cout << v321 << std::endl;

    return 0;
}

Using gdb's `disassemble /m main`, you will see the following assembly for the computation (note the fmulp instruction):

// 32-bit Linux
long double v321 = p3x * p2y;
   0x08048633 <+63>:  fldt   0x10(%esp)
   0x08048637 <+67>:  fldt   0x20(%esp)
   0x0804863b <+71>:  fmulp  %st,%st(1)
   0x0804863d <+73>:  fstpt  0x30(%esp)

// 64-bit Linux
long double v321 = p3x * p2y;
   0x0000000000400818 <+52>:  fldt   -0x30(%rbp)
   0x000000000040081b <+55>:  fldt   -0x20(%rbp)
   0x000000000040081e <+58>:  fmulp  %st,%st(1)
   0x0000000000400820 <+60>:  fstpt  -0x10(%rbp)

We can see that the 32-bit and 64-bit builds now use the same assembly instructions. (Of course, I don't have that many physical machines; I only tested on VMware Player virtual machines, so the examples above may not hold everywhere. Besides, C/C++ behavior depends heavily on the compiler and platform.) The reason is that we used the extended-precision data type long double. (Note: with double or float, 32-bit Linux compiles to x87 FPU instructions while 64-bit Linux compiles to SSE instructions.)

Now back to C#. C#'s floating point conforms to the IEEE-754 standard, and its language specification notes that floating-point operations may be performed with higher precision than the result type (just as the results above exceeded float precision). It also says that if the hardware supports extended floating-point precision, all floating-point operations may use that precision for performance. For example, in x * y / z, the intermediate x * y may exceed double's range, but dividing by z may pull the final result back into range; with the FPU's extended precision this yields an exact double value instead of an infinity.

So for C#, you obviously cannot find a "solution" via compiler options as with C/C++ (and in fact, solving this with compiler flags is a pseudo-solution anyway).

Moreover, this is not a problem that should be solved by tweaking compiler options, because it is not really an FPU-versus-SSE problem: x87 is legacy technology and SSE is the sensible present. If you care about the accuracy of your floating-point computations, the right fix is not a compiler flag but a data type with more precision bits, such as double or long double.

Also, when writing code, make sure the production, test, and development environments are consistent (including OS architecture and compiler options). This is especially true for C/C++, where compiler flags hide many pitfalls and some of them can mask bugs in your program; otherwise inexplicable problems will appear. (The problem in this article was caused by exactly such a mismatch between the development and production environments, and it took a long time to track down.) Whenever floating-point arithmetic is involved, remember that it may be the culprit, and be especially careful with the float and double types.

Reference:

[1] C# Language Specification, "Floating point types"
[2] Stack Overflow: "Is floating point math consistent in C#? Can it be?"
[3] "The FPU Instruction Set"

Appendix

An explanation of the floating-point computation 80838.0f × -2499.0f = -202014160.0

In the computer, a 32-bit float is represented by a 1-bit sign (s), an 8-bit exponent (E), and a 23-bit significand (M).
32-bit float value = (-1)^s × (1 + M) × 2^(E-127), where the number is first normalized to 1.xxxxx × 2^e; E stores the biased exponent e + 127, and M stores the xxxxx fraction (the leading 1 is implicit, which saves one bit).

80838.0f = 1 0011 1011 1100 0110.0 (binary) = 1.0011101111000110 × 2^16
Significand M = 0011 1011 1100 0110 0000 000 (23 bits)
Exponent E = 16 + 127 = 143 = 1000 1111
Sign s = 0
Internal representation of 80838.0f = 0 [1000 1111] [0011 1011 1100 0110 0000 000]
= 0100 0111 1001 1101 1110 0011 0000 0000
= 47 9d e3 00  // while debugging, memory may display as 00 e3 9d 47, because the machine is little-endian: the low byte is stored at the low address and the high byte at the high address

-2499.0f = -1001 1100 0011.0 (binary) = -1.00111000011 × 2^11
Significand M = 0011 1000 0110 0000 0000 000 (23 bits)
Exponent E = 11 + 127 = 138 = 1000 1010
Sign s = 1
Internal representation of -2499.0f = 1 [1000 1010] [0011 1000 0110 0000 0000 000]
= 1100 0101 0001 1100 0011 0000 0000 0000
= c5 1c 30 00

80838.0 × -2499.0 = ?

First, the unbiased exponent: e = 16 + 11 = 27
Biased exponent: E = e + 127 = 154 = 1001 1010
The product of the significands is 1.1000 0001 0100 1111 1011 1010 010 (you can work this out by hand)
A float can keep only 23 fraction bits, so the tail is truncated, leaving: 1000 0001 0100 1111 1011 101
Internal representation of the product = 1 [1001 1010] [1000 0001 0100 1111 1011 101]
= 1100 1101 0100 0000 1010 0111 1101 1101
= cd 40 a7 dd

Result = -1.1000 0001 0100 1111 1011 101 × 2^27
= -1100 0000 1010 0111 1101 1101 0000 (binary)
= -202014160
Converting this to double afterwards does not change the value: it is still -202014160.

With the x87 FPU, the significand above is not truncated to 23 bits, so:
FPU result = -1.1000 0001 0100 1111 1011 1010 010 × 2^27
= -1100 0000 1010 0111 1101 1101 0010 (binary)
= -202014162

If there are any errors or omissions in this article, corrections are welcome.
