Original URL: http://blog.sina.com.cn/s/blog_827d041701017ctm.html

**Question raised: why does 12.0f - 11.9f give 0.10000038 instead of 0.1?**

An explanation from MSDN:

http://msdn.microsoft.com/zh-cn/c151dt3s.aspx

**Why floating-point numbers may lose precision**

Floating-point decimal values generally do not have an exact binary representation. This is a side effect of how the CPU represents floating-point data. For this reason, you may experience some loss of precision, and some floating-point operations may produce unexpected results.

The cause of this behavior is one of the following:

Binary representations of decimal numbers may not be accurate.

Type mismatch between the numbers used (for example, mixing float and double types).

To work around this behavior, most programmers either ensure that the value is larger or smaller than needed, or obtain and use a binary-coded decimal (BCD) library that maintains precision.
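In Java, the standard decimal-precision library is `java.math.BigDecimal`. As a minimal sketch (the class name is illustrative), the example below contrasts plain float subtraction with BigDecimal constructed from strings, which keeps the decimal values exact:

```java
import java.math.BigDecimal;

public class BigDecimalDemo {
    public static void main(String[] args) {
        // float subtraction loses precision
        System.out.println(12.0f - 11.9f);   // prints 0.10000038

        // BigDecimal built from strings represents 12.0 and 11.9 exactly
        BigDecimal a = new BigDecimal("12.0");
        BigDecimal b = new BigDecimal("11.9");
        System.out.println(a.subtract(b));   // prints 0.1
    }
}
```

Note that `new BigDecimal(11.9)` (from a double) would inherit the binary rounding error; the string constructor is what preserves the decimal value.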

Now let's analyze in detail why a floating point operation causes loss of precision.

**1. The problem of representing decimals in binary**

First of all, we need to answer the following two questions:

(1) How decimal integers are converted to binary numbers

The algorithm is simple: divide by 2 repeatedly and collect the remainders. For example, to represent 11 as a binary number:

11 / 2 = 5, remainder 1

5 / 2 = 2, remainder 1

2 / 2 = 1, remainder 0

1 / 2 = 0, remainder 1

The quotient is now 0, so we stop. Reading the remainders from bottom to top, 11 in binary is: 1011

A question arises here: the process only terminates when the quotient reaches 0, so is every integer guaranteed to reach 0 by repeated division by 2? In other words, can every integer be converted to binary without an infinite loop? Yes: repeatedly halving any integer eventually yields 0, so **integers can always be represented exactly** in binary. Decimals, however, cannot always be.
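The repeated-division procedure above can be sketched in Java (the class and method names are my own for illustration):

```java
public class IntToBinary {
    // Convert a non-negative integer to its binary string by repeated
    // division by 2, collecting the remainders from bottom to top.
    static String toBinary(int n) {
        if (n == 0) return "0";
        StringBuilder sb = new StringBuilder();
        while (n > 0) {
            sb.append(n % 2);  // remainder is the next (low-order) bit
            n /= 2;
        }
        // remainders were collected low-to-high, so reverse them
        return sb.reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(toBinary(11)); // prints 1011
    }
}
```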

(2) How decimal decimals are converted into binary numbers

The algorithm is to multiply by 2 repeatedly, taking the integer part each time, until no fractional part remains. For example, to represent 0.9 as a binary number:

0.9 × 2 = 1.8, take integer part 1

0.8 (fractional part of 1.8) × 2 = 1.6, take integer part 1

0.6 × 2 = 1.2, take integer part 1

0.2 × 2 = 0.4, take integer part 0

0.4 × 2 = 0.8, take integer part 0

0.8 × 2 = 1.6, take integer part 1

0.6 × 2 = 1.2, take integer part 1

......... Reading the integer parts from top to bottom, 0.9 in binary is: 0.11100110011001100110...

Note: the calculation above loops; the fractional part **never reaches zero**, so the algorithm never terminates. Clearly, the binary **representation of a decimal fraction is sometimes inherently inexact**. The reason is simple: can decimal represent 1/3 exactly? Likewise, binary cannot represent 1/10 exactly. This is why floating-point arithmetic "loses" precision.
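The multiply-by-2 procedure can also be sketched in Java (names are my own; a bit limit is needed precisely because fractions like 0.9 never terminate):

```java
public class FracToBinary {
    // Convert a fraction 0 <= x < 1 to a binary string by repeatedly
    // multiplying by 2 and taking the integer part. Doubling and
    // subtracting 1 are exact in binary floating point, so the emitted
    // bits are exactly the expansion of the double nearest x.
    static String toBinary(double x, int maxBits) {
        StringBuilder sb = new StringBuilder("0.");
        for (int i = 0; i < maxBits && x != 0; i++) {
            x *= 2;
            if (x >= 1) { sb.append('1'); x -= 1; }  // integer part 1
            else        { sb.append('0'); }           // integer part 0
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBinary(0.9, 16)); // prints 0.1110011001100110
    }
}
```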

**2. How a float is stored in memory**

As we all know, the Java float type occupies 4 bytes (32 bits) in memory. The 32-bit structure of a float is as follows.

**Float memory storage structure**

**bit 31: sign of the number | bit 30: sign of the exponent | bits 29-23: exponent value | bits 22-0: significant digits**

**The sign bit is "0" for positive and "1" for negative. There are 24 significant digits in total, of which the leading "1" is implicit and not stored.**

**The steps to convert a float into its memory storage format are:**

(1) Convert the absolute value of the real number into binary, using the methods for the integer and fractional parts described above.

(2) Shift the decimal point of the binary number left or right by n bits until exactly one significant digit remains to its left.

(3) Starting from the first digit to the right of the decimal point, take 23 digits and place them in bits 22 through 0.

(4) If the real number is positive, put "0" in bit 31; otherwise put "1".

(5) If the point was shifted left, the exponent is positive; put "1" in bit 30. If the point was shifted right or n = 0, put "0" in bit 30.

(6) If the point was shifted left, convert n − 1 to binary, pad it with "0" on the left to 7 bits, and place it in bits 29 through 23. If the point was shifted right or n = 0, convert n to binary, pad it with "0" on the left to 7 bits, invert all the bits, and place the result in bits 29 through 23.

Example: the memory storage format of 11.9

(1) Converted to binary, 11.9 is approximately "1011.1110011001100110011001100...".

(2) Shift the decimal point left 3 bits so that one significant digit remains to its left: "1.01111100110011001100110". Keep 24 significant digits and truncate the excess on the right (**error is introduced here**).

(3) We now have 24 significant digits. Remove the leftmost "1" to get "01111100110011001100110", 23 bits in total, and place it in bits 22 through 0 of the float storage structure.

(4) Since 11.9 is positive, put "0" in bit 31, the sign bit.

(5) Since we shifted the decimal point left, put "1" in bit 30, the exponent sign bit.

(6) Since we shifted the decimal point left 3 bits, compute 3 − 1 = 2, convert it to binary, and pad it to 7 bits to get 0000010; place it in bits 29 through 23.

So 11.9 is finally represented as: 0 1 0000010 01111100110011001100110
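The hand-derived bit pattern can be checked with `Float.floatToIntBits`. Note that this prints the standard IEEE 754 split (1 sign bit, 8 exponent bits, 23 mantissa bits); the article's "exponent sign bit plus 7 exponent bits" view describes the same 8 bits, so the patterns agree bit for bit. The class name is my own:

```java
public class FloatBits {
    public static void main(String[] args) {
        // Raw IEEE 754 bit pattern of 11.9f, zero-padded to 32 bits
        int bits = Float.floatToIntBits(11.9f);
        String s = String.format("%32s", Integer.toBinaryString(bits))
                         .replace(' ', '0');
        // sign (1 bit) | exponent (8 bits) | mantissa (23 bits)
        System.out.println(s.substring(0, 1) + " "
                + s.substring(1, 9) + " " + s.substring(9));
        // prints 0 10000010 01111100110011001100110
    }
}
```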

Another example: the memory storage format of 0.2356

(1) 0.2356 is converted to binary after about 0.00111100010100000100100000.

(2) Move the decimal point to the right by three bits to get 1.11100010100000100100000.

(3) Take the 23 significant digits to the right of the decimal point, i.e. 11100010100000100100000, and place them in bits 22 through 0.

(4) Since 0.2356 is positive, put "0" in bit 31.

(5) Since we shifted the decimal point right, put "0" in bit 30.

(6) Since the decimal point was shifted right 3 bits, convert 3 to binary and pad it with "0" on the left to 7 bits, getting 0000011; invert it to get 1111100, and place it in bits 29 through 23.

So 0.2356 is finally represented as: 0 0 1111100 11100010100000100100000

**Steps to convert the binary float format in memory back to decimal:**

(1) Write out the binary number in bits 22 through 0, and prepend a "1" on the left to obtain 24 significant digits. Place the decimal point to the right of that leading "1".

(2) Extract the value represented by bits 29 through 23. If bit 30 is "0", invert those bits to obtain n. If bit 30 is "1", add 1 to the value to obtain n.

(3) Shift the decimal point left by n bits (if bit 30 is "0") or right by n bits (if bit 30 is "1") to obtain the binary representation of the real number.

(4) Convert that binary number to decimal, and attach a positive or negative sign according to whether bit 31 is "0" or "1".
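The reverse conversion can be checked with `Float.intBitsToFloat`, feeding it the 32-bit pattern derived for 11.9 above (the class name is my own):

```java
public class DecodeDemo {
    public static void main(String[] args) {
        // The 32-bit pattern derived for 11.9: sign, exponent, mantissa
        int bits = Integer.parseInt(
                "01000001001111100110011001100110", 2);
        // Reinterpret the bits as a float
        System.out.println(Float.intBitsToFloat(bits)); // prints 11.9
    }
}
```

It prints 11.9 because `Float.toString` shows the shortest decimal that uniquely identifies the stored (slightly inexact) value.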

**3. Floating-point subtraction**

Floating-point subtraction is more complicated than fixed-point arithmetic. It proceeds in four broad steps:

(1) Check for zero operands.

If either of the two floating-point numbers is 0, the result is known immediately, without running the remaining steps.

(2) Compare the exponents and align them.

To add or subtract two floating-point numbers, first check whether their **exponents** are the same, i.e. whether the decimal points are aligned. If the exponents are equal, the points are aligned and the mantissas can be added or subtracted directly. If the exponents differ, the points are not aligned, and the two numbers must first be brought to the same exponent; this process is called **alignment**.

How to align (assume the exponents of the two floats are E*x* and E*y*):

Change E*x* or E*y* by shifting a mantissa until the exponents are equal. Since floating-point numbers are normalized, shifting a mantissa left would drop its most significant bits and cause a large error, while shifting it right drops only the least significant bits and causes a small error. The alignment rule is therefore to shift a mantissa right, increasing its exponent accordingly so that the value is unchanged. The exponent that gets increased must be the smaller one, so alignment always **adjusts the smaller exponent toward the larger one**: the mantissa of the number with the smaller exponent is shifted right (equivalent to shifting its decimal point left), adding 1 to its exponent for each bit shifted, until the two exponents are equal. The number of bits shifted equals the exponent difference ΔE.

(3) Add or subtract the mantissas (the significant digits).

Once the exponents are aligned, the significant digits can be summed. Both addition and subtraction are carried out as additions, exactly as in fixed-point arithmetic.

(4) Normalize and round the result.

(Details omitted.)

**4. Computing 12.0f - 11.9f**

The memory storage format of 12.0f is: 0 1 0000010 10000000000000000000000

The memory storage format of 11.9f is: 0 1 0000010 01111100110011001100110

The exponents of the two numbers are identical, so we only need to subtract the significant digits.

Result of 12.0f - 11.9f: 0 1 0000010 00000011001100110011010

Converting the result back to decimal: 0.00011001100110011010 = **0.10000038**
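The hand calculation can be confirmed directly; the hardware performs the alignment, mantissa subtraction, and rounding described above (the class name is my own):

```java
public class SubDemo {
    public static void main(String[] args) {
        // Both operands share the exponent 2^3, so only the
        // significant digits are subtracted; the rounded result is
        // the nearest float to the exact difference of the two floats.
        System.out.println(12.0f - 11.9f); // prints 0.10000038
    }
}
```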

Detailed analysis

Improper use of float or double can cause a loss of precision. The problem can be understood through the following code:


public class FloatDoubleTest {

    public static void main(String[] args) {

        float f = 20014999;

        double d = f;

        double d2 = 20014999;

        System.out.println("f=" + f);

        System.out.println("d=" + d);

        System.out.println("d2=" + d2);

    }

}


The results are as follows:

f=2.0015E7

d=2.0015E7

d2=2.0014999E7

The output shows that double can represent 20014999 exactly, while float cannot; it only gets an approximate value. The result is surprising: a number as small as 20014999 cannot be represented exactly as a float. With this question in mind, I studied float and double and share the results here, hoping to help you understand floating-point numbers in Java.

About float and double in Java

The Java language supports two basic floating-point types: float and double. Java's floating-point types are based on the IEEE 754 standard, which defines formats for 32-bit single-precision and 64-bit double-precision binary floating-point numbers.

IEEE 754 represents floating-point numbers in base-2 scientific notation. A 32-bit float uses 1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa, i.e. the fractional part. The exponent is a signed integer and can be positive or negative. The fractional part is a binary (base-2) fraction. A 64-bit double uses 1 bit for the sign, 11 bits for the exponent, and 52 bits for the mantissa. The two layouts are shown below:

Float (32-bit):

Double (64-bit):

Both are divided into three parts:

(1) A single sign bit s that directly encodes the sign S.

(2) A k-bit exponent E, stored in biased (excess) representation.

(3) An n-bit fraction (mantissa), stored as an unsigned value.
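The three fields can be extracted with bit masks; a minimal sketch using -6.25f (= -1.1001 × 2^2), with names of my own choosing:

```java
public class FieldDemo {
    public static void main(String[] args) {
        int bits = Float.floatToIntBits(-6.25f);
        int sign     = (bits >>> 31) & 0x1;   // 1 sign bit
        int exponent = (bits >>> 23) & 0xFF;  // 8 exponent bits, biased by 127
        int mantissa = bits & 0x7FFFFF;       // 23 mantissa bits
        System.out.println("sign=" + sign
                + " exp=" + (exponent - 127)
                + " mantissaBits=" + Integer.toBinaryString(mantissa));
        // prints sign=1 exp=2 mantissaBits=10010000000000000000000
    }
}
```

Subtracting the bias 127 recovers the true exponent 2, and the mantissa bits 1001... are the fraction of 1.1001 with the implicit leading 1 removed.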

So why can't float represent 20014999 exactly?

Combining the float and double formats above, the answer emerges from analyzing the binary representation of 20014999.

The following program derives the binary representations of 20014999 as a double and as a float.


public class FloatDoubleTest3 {

    public static void main(String[] args) {

        double d = 20014999;

        long l = Double.doubleToLongBits(d);

        System.out.println(Long.toBinaryString(l));

        float f = 20014999;

        int i = Float.floatToIntBits(f);

        System.out.println(Integer.toBinaryString(i));

    }

}


The output results are as follows:

double:100000101110011000101100111100101110000000000000000000000000000

float:1001011100110001011001111001100

The output is analyzed as follows. For the double, padding the left side with the sign bit 0 gives exactly 64 bits. According to the double format, it splits into sign, exponent, and mantissa as follows:

0 10000010111 0011000101100111100101110000000000000000000000000000

For the float, padding the left side with the sign bit 0 gives exactly 32 bits. According to the float format, it likewise splits into sign, exponent, and mantissa:

0 10010111 00110001011001111001100

In each line, the first part is the sign bit, the second the exponent, and the third the mantissa.

Comparing the two: the sign bit is 0 in both, and the exponents, in biased representation, encode the same value. The only difference is the mantissa.

The mantissa of the double is 0011000101100111100101110000000000000000000000000000; omitting the trailing zeros, at least 24 bits are needed to represent it exactly.

The mantissa of the float is 00110001011001111001100, 23 bits in total.

Why? The reason is obvious: a float mantissa holds only 23 bits, so the 24-bit value 001100010110011110010111 is rounded to the 23-bit value 00110001011001111001100. As a float, 20014999 therefore becomes 20015000.

In other words, 20014999 is well within float's range, yet the IEEE 754 float format has no way to represent it exactly and can only approximate it by rounding.
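The rounding effect is easy to confirm directly (the class name is my own):

```java
public class RoundDemo {
    public static void main(String[] args) {
        // 20014999 needs 24 mantissa bits; float has 23, so the value
        // is rounded to the nearest representable float, 20015000.
        float f = 20014999;
        System.out.println((int) f);        // prints 20015000
        System.out.println(f == 20015000f); // prints true
    }
}
```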

[Repost] Loss of float and double precision in Java programs