"Go" why floating-point numbers may lose precision

Source: Internet
Author: User

Reposted from: pencil

Why can floating-point numbers lose precision?
Floating-point decimal values generally do not have an exact binary representation. This is a side effect of how the CPU represents floating-point data. For this reason, you may experience some loss of precision, and some floating-point operations may produce unexpected results.

This behavior is the result of one of the following:
1. The binary representation of the decimal number may not be exact.
2. There is a type mismatch between the numbers used (for example, mixing single-precision float and double-precision double values).
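
Both causes can be observed directly. The following is a minimal Java sketch (the class and method names are only illustrative): the first method shows decimal literals that have no exact binary form, and the second shows a float/double mismatch.

```java
class PrecisionDemo {
    // Cause 1: 0.1, 0.2 and 0.3 have no exact binary representation,
    // so the rounded sum differs from the rounded literal 0.3.
    static boolean sumIsExact() {
        return 0.1 + 0.2 == 0.3;
    }

    // Cause 2: the same decimal literal rounded to float and to double
    // yields two different binary values.
    static boolean floatEqualsDouble() {
        float f = 0.1f;
        double d = 0.1;
        return f == d; // f is widened to double before the comparison
    }

    public static void main(String[] args) {
        System.out.println(sumIsExact());        // false
        System.out.println(floatEqualsDouble()); // false
        System.out.println(0.1 + 0.2);           // 0.30000000000000004
    }
}
```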

To resolve this behavior, most programmers either allow a margin in comparisons, checking that a value is greater or less than what is needed rather than exactly equal to it, or obtain and use a binary-coded decimal (BCD) library that maintains precision.
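
Both remedies can be sketched in Java. The tolerance 1e-9 below is an arbitrary assumption chosen for illustration, and java.math.BigDecimal is used as a stand-in for a decimal library: it keeps decimal digits exactly, but it is not literally a BCD implementation.

```java
import java.math.BigDecimal;

class PrecisionFixes {
    // Remedy 1: compare with a tolerance instead of testing exact equality.
    // EPSILON is an illustrative assumption; choose it to suit the data.
    static final double EPSILON = 1e-9;

    static boolean nearlyEqual(double a, double b) {
        return Math.abs(a - b) < EPSILON;
    }

    // Remedy 2: do the arithmetic in decimal. The String constructor keeps
    // the decimal digits exactly as written.
    static BigDecimal decimalSum(String a, String b) {
        return new BigDecimal(a).add(new BigDecimal(b));
    }

    public static void main(String[] args) {
        System.out.println(nearlyEqual(0.1 + 0.2, 0.3)); // true
        System.out.println(decimalSum("0.1", "0.2"));    // 0.3
    }
}
```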

Detailed analysis: why do floating-point operations lose precision?
1. The binary representation of decimal numbers
First, we need to answer the following two questions:
(1) How is a decimal integer converted to binary?
The algorithm is simple: divide by 2 repeatedly and record the remainders. For example, to represent 11 in binary:
11/2 = 5 remainder 1
5/2 = 2 remainder 1
2/2 = 1 remainder 0
1/2 = 0 remainder 1
0 End

11 in binary (reading the remainders from bottom to top): 1011

One point to note: the algorithm ends as soon as the quotient reaches 0. Is every integer, when divided by 2 repeatedly, guaranteed to reach 0 eventually? In other words, could the conversion of some integer to binary loop forever? It never does: an integer can always be represented exactly in binary, but a fractional number cannot necessarily.
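
The division procedure can be written as a short Java sketch. The class and method names are illustrative, and the input is assumed to be non-negative.

```java
class IntToBinary {
    // Repeatedly divide by 2 and collect the remainders; the remainders
    // read from last to first are the binary digits.
    static String toBinary(int n) {
        if (n == 0) {
            return "0";
        }
        StringBuilder digits = new StringBuilder();
        while (n > 0) {
            digits.append(n % 2); // remainder is the next digit, low to high
            n /= 2;               // the quotient shrinks, so the loop must end
        }
        return digits.reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(toBinary(11)); // 1011
    }
}
```

Because the quotient gets strictly smaller on every pass, the loop always terminates, which is the point made above about integers.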

(2) How is a decimal fraction converted to binary?
The algorithm is to multiply by 2 repeatedly, taking the integer part of each product as the next digit, until no fractional part remains. For example, to represent 0.9 in binary:
0.9*2=1.8 take integer part 1
0.8*2=1.6 take integer part 1
0.6*2=1.2 take integer part 1
0.2*2=0.4 take integer part 0
0.4*2=0.8 take integer part 0
0.8*2=1.6 take integer part 1
0.6*2=1.2 take integer part 1
.........

0.9 in binary (reading the integer parts from top to bottom): 0.1110011001100...

Note: the calculation above falls into a loop, that is, multiplying by 2 can never eliminate the fractional part, so the algorithm would run forever. Clearly, the binary representation of a fractional number sometimes cannot be exact. The reason is simple: can the decimal system represent 1/3 exactly? In the same way, the binary system cannot represent 1/10 exactly. This explains why floating-point subtraction can "lose" precision.
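
The multiplication procedure can be sketched in Java as follows. To keep the arithmetic exact, this sketch takes the fraction as a numerator and a denominator (0.9 is passed as 9/10) instead of a double; that choice, the names, and the maxDigits cut-off are assumptions made for illustration.

```java
class FractionToBinary {
    // Converts numerator/denominator (0 <= value < 1) to binary by repeated
    // multiplication by 2, using exact integer arithmetic.
    // Stops after maxDigits digits because the expansion may never end.
    static String toBinary(long numerator, long denominator, int maxDigits) {
        StringBuilder digits = new StringBuilder("0.");
        for (int i = 0; i < maxDigits && numerator != 0; i++) {
            numerator *= 2;
            if (numerator >= denominator) { // integer part is 1
                digits.append('1');
                numerator -= denominator;
            } else {                        // integer part is 0
                digits.append('0');
            }
        }
        return digits.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBinary(9, 10, 13)); // 0.1110011001100
        System.out.println(toBinary(5, 8, 13));  // 0.101
    }
}
```

For 9/10 the remainder cycles through 8, 6, 2, 4 and back to 8, so only the digit limit stops the loop. For 5/8 the remainder reaches 0 after three digits and the expansion is exact.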

2. How the float type is stored in memory

As we all know, Java's float type occupies 4 bytes in memory. The 32-bit structure of a float is as follows:
———————————————————————————————
| Bit     | 31                     | 30                | 29-23           | 22-0               |
| Meaning | sign bit of the number | exponent sign bit | exponent digits | significant digits |
———————————————————————————————
A sign bit of 0 indicates a positive number and 1 a negative number. There are 24 significant bits, but the leading one is always 1 and is not stored, so only 23 of them appear in memory.
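
The four fields can be inspected in Java with Float.floatToIntBits, which returns the 32 bits of a float as an int. The sketch below (names are illustrative) prints them in the order bit 31, bit 30, bits 29-23, bits 22-0. In IEEE 754 terms, bits 30-23 together form one 8-bit biased exponent; splitting off bit 30 is this article's way of reading it.

```java
class FloatFields {
    // Splits a float into the four fields of the layout described here:
    // bit 31, bit 30, bits 29-23, bits 22-0.
    static String fields(float value) {
        int bits = Float.floatToIntBits(value);
        String all = String.format("%32s", Integer.toBinaryString(bits)).replace(' ', '0');
        return all.substring(0, 1) + " " + all.substring(1, 2) + " "
                + all.substring(2, 9) + " " + all.substring(9);
    }

    public static void main(String[] args) {
        // -6.5 is -1.101 (binary) times 2^2
        System.out.println(fields(-6.5f)); // 1 1 0000001 10100000000000000000000
    }
}
```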

The steps to convert a real number to the float storage format are:

(1) Convert the absolute value of the real number to binary, using the methods for the integer and fractional parts described above.
(2) Move the binary point of this number left or right by n places until it sits immediately to the right of the first significant digit.
(3) Starting from the first digit to the right of the binary point, take 23 digits and place them in bits 22 to 0.
(4) If the real number is positive, put "0" in bit 31; otherwise put "1".
(5) If the point was moved left, the exponent is positive, so put "1" in bit 30. If the point was moved right or n=0, put "0" in bit 30.
(6) If the point was moved left, subtract 1 from n, convert the result to binary, pad it on the left with "0" to seven bits, and place it in bits 29 to 23. If the point was moved right or n=0, convert n to binary, pad it on the left with "0" to seven bits, invert every bit, and place the result in bits 29 to 23.

Example: the memory storage format of 11.9
(1) 11.9 converted to binary is approximately "1011.1110011001100110011001100...".
(2) Move the binary point three places to the left, to the right of the first significant digit: "1.01111100110011001100110". Keep 24 significant digits and cut off the excess on the right (this is where the error is introduced).
(3) There are now 24 significant digits. Remove the leftmost "1" to get "01111100110011001100110", 23 bits in total, and place them in bits 22 to 0 of the float storage structure.
(4) Since 11.9 is positive, put "0" in bit 31, the sign bit.
(5) Since the point was moved to the left, put "1" in bit 30, the exponent sign bit.
(6) Since the point was moved 3 places to the left, subtract 1 from 3 to get 2, convert it to binary, and pad it to 7 bits to get 0000010. Place this in bits 29 to 23.
The final representation of 11.9 is: 0 1 0000010 01111100110011001100110

Another example: the memory storage format of 0.2356
(1) 0.2356 converted to binary is approximately 0.00111100010100000100100000.
(2) Move the binary point three places to the right to get 1.11100010100000100100000.
(3) Take the 23 significant digits to the right of the binary point, i.e. 11100010100000100100000, and place them in bits 22 to 0.
(4) Since 0.2356 is positive, put "0" in bit 31.
(5) Since the point was moved to the right, put "0" in bit 30.
(6) Since the point was moved 3 places to the right, convert 3 to binary and pad it on the left with "0" to seven bits to get 0000011; invert every bit to get 1111100, and place this in bits 29 to 23.

The final representation of 0.2356 is: 0 0 1111100 11100010100000100100000
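
The six storage steps and the two worked examples can be checked with a small Java sketch. It is a simplified illustration, not how the JVM or the hardware performs the conversion: it assumes a finite, nonzero value inside the normal float range, and it cuts off excess bits as the examples do, whereas IEEE 754 hardware rounds to nearest (for these two values the result is the same). All names are hypothetical.

```java
class FloatEncodeSteps {
    // Follows steps (1)-(6) and returns the four fields as text.
    static String encode(double value) {
        double x = Math.abs(value);
        if (x == 0.0 || Double.isNaN(x) || Double.isInfinite(x)) {
            throw new IllegalArgumentException("only finite, nonzero values are handled");
        }
        int n = 0; // > 0: point moved left n places; < 0: moved right -n places
        while (x >= 2.0) { x /= 2.0; n++; } // step (2): move the point left
        while (x < 1.0)  { x *= 2.0; n--; } // step (2): move the point right

        // Step (3): drop the leading 1 and take the next 23 binary digits.
        StringBuilder mantissa = new StringBuilder();
        x -= 1.0;
        for (int i = 0; i < 23; i++) {
            x *= 2.0;
            if (x >= 1.0) { mantissa.append('1'); x -= 1.0; }
            else          { mantissa.append('0'); }
        }

        String sign = value < 0 ? "1" : "0";          // step (4)
        String expSign = n > 0 ? "1" : "0";           // step (5)
        int expBits = n > 0 ? n - 1 : (~(-n)) & 0x7F; // step (6)
        String exp = String.format("%7s", Integer.toBinaryString(expBits)).replace(' ', '0');
        return sign + " " + expSign + " " + exp + " " + mantissa;
    }

    public static void main(String[] args) {
        System.out.println(encode(11.9));   // 0 1 0000010 01111100110011001100110
        System.out.println(encode(0.2356)); // 0 0 1111100 11100010100000100100000
    }
}
```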

The steps to convert the binary float format stored in memory back to decimal are:
(1) Write out the binary digits in bits 22 to 0 and add a "1" on the far left to get 24 significant digits. Place the binary point to the right of that leftmost "1".
(2) Take the value n represented by bits 29 to 23. When bit 30 is "0", invert every bit of n. When bit 30 is "1", increase n by 1.
(3) Move the binary point n places to the left (when bit 30 is "0") or n places to the right (when bit 30 is "1") to get the binary representation of the real number.
(4) Convert this binary number to decimal, and add a positive or negative sign according to whether bit 31 is "0" or "1".
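
The four decoding steps can be sketched in Java and compared with the built-in Float.intBitsToFloat. The sketch assumes the bit pattern of a normal float; zero, subnormal, infinite and NaN patterns are not handled. Math.scalb(d, k) returns d times 2 to the power k, which is used here to move the binary point. Class and method names are illustrative.

```java
class FloatDecodeSteps {
    // Follows steps (1)-(4) for the bit pattern of a normal float.
    static double decode(int bits) {
        // Step (1): 23 stored bits with the implicit leading 1 restored,
        // read as the binary number 1.xxxx (hence the division by 2^23).
        int significand = (bits & 0x7FFFFF) | 0x800000;
        double value = significand / (double) (1 << 23);

        // Step (2): recover n from bits 29-23, depending on bit 30.
        int n = (bits >>> 23) & 0x7F;
        boolean bit30 = ((bits >>> 30) & 1) == 1;
        n = bit30 ? n + 1 : (~n) & 0x7F;

        // Step (3): move the point right (bit 30 = 1) or left (bit 30 = 0).
        value = Math.scalb(value, bit30 ? n : -n);

        // Step (4): apply the sign from bit 31.
        return (bits >>> 31) == 1 ? -value : value;
    }

    public static void main(String[] args) {
        // Stored form of 11.9: sign, exponent sign, exponent digits, mantissa
        int bits = 0b0_1_0000010_01111100110011001100110;
        System.out.println(decode(bits) == Float.intBitsToFloat(bits)); // true
    }
}
```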

3. Floating-point subtraction
Floating-point subtraction is more complicated than the fixed-point operation. The process is broadly divided into four steps:
(1) Check for zero operands;
If either of the two floating-point operands is 0, the result is known immediately and the remaining series of operations can be skipped.
(2) Compare the exponents and align them;
To add or subtract two floating-point numbers, first check whether their exponents are the same, that is, whether their binary points are aligned. If the exponents are equal, the points are aligned and the mantissas can be added or subtracted directly. If the exponents differ, the points are not aligned, and the two exponents must first be made equal. This process is called exponent alignment.

How to align (assuming the exponents of the two floating-point numbers are Ex and Ey):
Ex or Ey is changed by shifting the mantissa until the two are equal. Because floating-point numbers are stored normalized, shifting the mantissa left would lose the most significant bit and cause a large error, whereas shifting it right loses only the least significant bit and causes a small error. The alignment rule is therefore to shift the mantissa right, increasing the exponent by the same amount so that the value is unchanged. Since the exponent that is increased must end up equal to the other one, it has to be the smaller of the two. So alignment always brings the smaller exponent up to the larger: the mantissa of the number with the smaller exponent is shifted right (equivalent to moving the binary point left), and its exponent is increased by 1 for each bit shifted, until the two exponents are equal. The number of bits shifted equals the difference between the two exponents.

(3) Add or subtract the mantissas (the significant digits);
Once the exponents are aligned, the mantissas can be combined. Both addition and subtraction are carried out as addition, in exactly the same way as fixed-point addition and subtraction.
(4) The result is normalized and rounded.
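
The effect of alignment can be seen in a simplified Java sketch of x - y. It is only an illustration under stated assumptions: both operands are positive normal floats with x >= y, each is treated as a 24-bit integer significand times a power of two, and bits shifted out during alignment are simply dropped, whereas real hardware keeps extra guard bits and rounds. The names are hypothetical.

```java
class AlignedSubtract {
    // Simplified x - y for positive normal floats with x >= y.
    static float subtract(float x, float y) {
        int bx = Float.floatToIntBits(x);
        int by = Float.floatToIntBits(y);
        long mx = (bx & 0x7FFFFF) | 0x800000; // restore the implicit leading 1
        long my = (by & 0x7FFFFF) | 0x800000;
        int ex = ((bx >>> 23) & 0xFF) - 127 - 23; // value = mx * 2^ex
        int ey = ((by >>> 23) & 0xFF) - 127 - 23; // value = my * 2^ey

        // Step (2): align the smaller exponent to the larger one by shifting
        // the smaller number's significand right; shifted-out bits are lost.
        int shift = ex - ey;
        my = shift >= 24 ? 0 : my >> shift;

        // Step (3): subtract the aligned significands.
        long diff = mx - my;
        if (diff == 0) {
            return 0.0f;
        }

        // Step (4): normalize so the leading 1 is back in bit 23.
        while (diff < 0x800000) { diff <<= 1; ex--; }
        return (float) Math.scalb((double) diff, ex);
    }

    public static void main(String[] args) {
        System.out.println(subtract(1.5f, 0.25f));                // 1.25
        System.out.println(subtract(33554432f, 1f) == 33554432f); // true: the 1 was shifted out
        System.out.println(33554432f - 1f == 33554432f);          // true with built-in float arithmetic too
    }
}
```

In the second call the exponents differ by 25, so every bit of the smaller operand is shifted out during alignment and the subtraction leaves the larger operand unchanged.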