[In-depth] floating point storage and precision loss issues

Source: Internet
Author: User
Document directory
  • Float storage Parsing
  • Floating-point Subtraction
  • 12.0f-11.9f Calculation

This article uses float as an example to explain how floating-point numbers are stored and how precision is lost.
First, consider this snippet:

float f = 12.0f - 11.9f;
System.err.println(f);

The result is: 0.10000038 instead of the expected 0.1.
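
One way to see where the extra digits come from is to print the exact values that the two literals are stored as. The sketch below relies on java.math.BigDecimal, whose double constructor is exact, and on the fact that widening a float to a double does not change its value. The class name ExactFloatValue and the helper exact are only illustrative.

```java
import java.math.BigDecimal;

public class ExactFloatValue {
    // Exact decimal expansion of the value a float actually holds.
    static String exact(float f) {
        // float -> double widening is lossless, and BigDecimal(double) is exact.
        return new BigDecimal(f).toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(exact(12.0f));         // 12
        System.out.println(exact(11.9f));         // 11.8999996185302734375
        System.out.println(exact(12.0f - 11.9f)); // 0.1000003814697265625
    }
}
```

12.0f is stored exactly, but 11.9f is stored as a value slightly below 11.9, so the difference comes out slightly above 0.1.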

Float storage Parsing

1. A float occupies 4 bytes (32 bits) in Java. As used in the steps below, the bits are laid out as follows: bit 31 is the sign of the number, bit 30 is the sign of the exponent, bits 29 to 23 are the remaining seven exponent bits, and bits 22 to 0 are the mantissa. In IEEE 754 terms, bits 30 to 23 together form an 8-bit exponent stored with a bias of 127; the rules in steps (5) and (6) are an equivalent way of computing it.

How a float is stored and computed in a computer:
(1) Convert the absolute value of the real number to binary. For the integer part, divide by 2 repeatedly and collect the remainders; for the fractional part, multiply by 2 repeatedly and collect the integer digits.
(2) Move the binary point of the binary number to the left or right until it sits immediately to the right of the first significant digit. Call the number of places moved n.
(3) Starting from the first digit to the right of the binary point, take 23 digits and place them in bits 22 to 0.
(4) If the real number is positive, put "0" in bit 31; otherwise, put "1".
(5) If n was obtained by moving the point to the left, the exponent is positive and bit 30 is set to "1". If n was obtained by moving the point to the right, or n = 0, bit 30 is set to "0".
(6) If n was obtained by moving the point to the left, subtract 1 from n, convert the result to binary, pad it with "0" on the left to seven bits, and place it in bits 29 to 23. If n was obtained by moving the point to the right, or n = 0, convert n to binary, pad it with "0" on the left to seven bits, invert every bit, and place the result in bits 29 to 23.
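
Steps (4) to (6) can be sketched as a small helper that assembles the 32 bits from a sign, a shift count, and a 23-digit mantissa string. The class name FloatAssemble, the method assemble, and the convention that a positive shift means the point moved left are assumptions made for this illustration. The sketch ignores zero, subnormal, infinite, and NaN values.

```java
public class FloatAssemble {
    // shift > 0: the binary point moved left by 'shift' places.
    // shift <= 0: the point moved right by -shift places (or did not move).
    static int assemble(boolean negative, int shift, String mantissa23) {
        int signBit = negative ? 1 : 0;          // step (4)
        int expSignBit;
        int exp7;
        if (shift > 0) {
            expSignBit = 1;                      // step (5): left shift, positive exponent
            exp7 = shift - 1;                    // step (6): n - 1
        } else {
            expSignBit = 0;                      // step (5): right shift or n = 0
            exp7 = ~(-shift) & 0x7F;             // step (6): invert the seven bits of n
        }
        int mantissa = Integer.parseInt(mantissa23, 2); // step (3): bits 22 to 0
        return (signBit << 31) | (expSignBit << 30) | (exp7 << 23) | mantissa;
    }

    public static void main(String[] args) {
        int bits = assemble(false, 3, "01111100110011001100110");
        System.out.println(Integer.toHexString(bits));             // 413e6666
        System.out.println(bits == Float.floatToIntBits(11.9f));   // true
    }
}
```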

Calculate the binary representation of 11.9 following the steps above:
(1) Convert 11.9 to binary: 1011.1110 0110 0110 0110 0110 0110...
(2) Move the binary point three places to the left, so that it sits to the right of the first significant digit: 1.011 1110 0110 0110 0110 0110... Keep 24 significant digits and discard the extra digits on the right (this is where the error is introduced).
(3) Of the 24 significant digits, drop the leading "1" to get the 23 digits 011 11100110011001100110, and place them in bits 22 to 0 of the float storage structure.
(4) Because 11.9 is positive, put "0" in bit 31, the sign bit.
(5) Because the binary point was moved to the left, put "1" in bit 30, the exponent sign bit.
(6) Because the binary point was moved three places to the left, subtract 1 from 3 to get 2, convert it to binary, and pad it to seven bits to get 0000010. Place this in bits 29 to 23.
The resulting representation of 11.9 is: 0100 0001 0011 1110 0110 0110 0110 0110.
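
Step (1), the conversion to binary, can be sketched as follows. The example takes the number as a decimal string and uses java.math.BigDecimal so that the repeated multiplication by 2 is exact. The class name BinaryExpansion and the method toBinary are hypothetical names chosen for this illustration, and only non-negative inputs are handled.

```java
import java.math.BigDecimal;

public class BinaryExpansion {
    // Binary expansion of a non-negative decimal string,
    // with fracBits digits after the binary point (extra digits are cut off).
    static String toBinary(String decimal, int fracBits) {
        BigDecimal value = new BigDecimal(decimal);
        long intPart = value.longValue();
        BigDecimal frac = value.subtract(BigDecimal.valueOf(intPart));

        // Integer part: divide by 2 and collect the remainders, lowest digit first.
        StringBuilder intDigits = new StringBuilder();
        do {
            intDigits.insert(0, intPart % 2);
            intPart /= 2;
        } while (intPart > 0);

        // Fractional part: multiply by 2 and collect the integer digit each time.
        StringBuilder fracDigits = new StringBuilder();
        BigDecimal two = BigDecimal.valueOf(2);
        for (int i = 0; i < fracBits; i++) {
            frac = frac.multiply(two);
            if (frac.compareTo(BigDecimal.ONE) >= 0) {
                fracDigits.append('1');
                frac = frac.subtract(BigDecimal.ONE);
            } else {
                fracDigits.append('0');
            }
        }
        return intDigits + "." + fracDigits;
    }

    public static void main(String[] args) {
        System.out.println(toBinary("11.9", 24)); // 1011.111001100110011001100110
    }
}
```

The fractional digits repeat the group 0110 forever, which is why 11.9 cannot be stored exactly in any finite number of binary digits.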

Another example is the binary representation of 23.172001:
(1) 23.172001 in binary: 0001 0111.0010 1100 0000 1000 010...
(2) Move the binary point four places to the left: 1.0111 0010 1100 0000 1000 010...
(3) The 23 digits after the binary point: 0111 0010 1100 0000 1000 010 (these 23 digits are the last 23 bits of the float binary code).
(4) Bit 31: 0
(5) Bit 30: 1
(6) The point was moved four places to the left, and 4 - 1 = 3 is 11 in binary. Padded with zeros on the left to seven bits, this gives 0000011.
The final binary representation of 23.172001 in the computer is: 0100 0001 1011 1001 0110 0000 0100 0010
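
Both worked examples can be checked against what Java actually stores. The sketch below uses the standard Float.floatToIntBits and Integer.toBinaryString methods and splits the 32 bits into the four fields used in this article. The class name FloatFields and the method fields are illustrative names only.

```java
public class FloatFields {
    // Splits a float's 32 bits into: sign | exponent sign | 7 exponent bits | 23 mantissa bits.
    static String fields(float f) {
        int bits = Float.floatToIntBits(f);
        // toBinaryString drops leading zeros, so pad back to 32 digits.
        String s = String.format("%32s", Integer.toBinaryString(bits)).replace(' ', '0');
        return s.substring(0, 1) + " " + s.substring(1, 2) + " "
             + s.substring(2, 9) + " " + s.substring(9);
    }

    public static void main(String[] args) {
        System.out.println(fields(11.9f));      // 0 1 0000010 01111100110011001100110
        System.out.println(fields(23.172001f)); // 0 1 0000011 01110010110000001000010
    }
}
```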

2. To convert a float's binary representation in memory back to decimal:
(1) Write out bits 22 to 0 and add a "1" on the left to get 24 significant digits. Place the binary point to the right of that leading "1".
(2) Obtain the value n represented by bits 29 to 23. When bit 30 is "0", invert every bit of n. When bit 30 is "1", add 1 to n.
(3) Move the binary point n places to the left (when bit 30 is "0") or n places to the right (when bit 30 is "1") to obtain the real number in binary.
(4) Convert the binary real number to decimal, and attach a positive or negative sign according to whether bit 31 is "0" or "1".
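
The four decoding steps can be sketched as below. The class name FloatDecode and the method decode are hypothetical. The sketch returns a double so that the result is exact, and it deliberately ignores the special bit patterns (zero, subnormal numbers, infinity, NaN) that the steps above do not cover.

```java
public class FloatDecode {
    static double decode(int bits) {
        // Step (1): 23 stored digits with a leading 1 in front, as a 24-bit integer.
        int significand = (1 << 23) | (bits & 0x7FFFFF);

        // Step (2): n from bits 29 to 23, adjusted according to bit 30.
        int n = (bits >>> 23) & 0x7F;
        boolean exponentPositive = ((bits >>> 30) & 1) == 1;
        int pointShift = exponentPositive
                ? n + 1              // point moves right by n + 1
                : -(~n & 0x7F);      // point moves left by the inverted n

        // Step (3): the integer significand has its point 23 places too far right.
        double value = Math.scalb((double) significand, pointShift - 23);

        // Step (4): apply the sign from bit 31.
        return (bits >>> 31) == 1 ? -value : value;
    }

    public static void main(String[] args) {
        System.out.println(decode(0x413E6666) == 11.9f); // true
    }
}
```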

Floating-point Subtraction

Floating-point addition and subtraction are more complex than their fixed-point counterparts. The operation is completed in four steps:
(1) Check for a zero operand.
If one of the two floating-point numbers to be added or subtracted is 0, the result is known immediately and the remaining steps can be skipped.
(2) Compare the exponents and align them.
To add or subtract two floating-point numbers, first check whether the two numbers have the same exponent, that is, whether their binary points are aligned. If the exponents are the same, the points are aligned and the mantissas can be added or subtracted directly. If the exponents differ, the points are not aligned, and the exponents of the two numbers must first be made equal. This process is called exponent alignment.
How to align (assume that the exponents of the two floating-point numbers are ex and ey):
Shift a mantissa and adjust ex or ey until the two are equal. Because floating-point numbers are normalized, shifting a mantissa to the left would lose its most significant bit and cause a large error. Shifting a mantissa to the right loses only the least significant bits, so the error is small. Alignment therefore always shifts a mantissa to the right and increases its exponent by 1 for each place shifted, so that the value stays the same. Since the exponent that is increased must end up equal to the other one, it has to be the smaller of the two. In other words, the smaller exponent is always aligned to the larger one: the mantissa of the number with the smaller exponent is shifted to the right (equivalent to moving the binary point to the left), and 1 is added to its exponent for each shift, until the two exponents are equal. The number of places shifted is Δe = |ex - ey|.
(3) Add or subtract the mantissas (significant digits).
Once the exponents are aligned, the mantissas can be combined. Both addition and subtraction are carried out as additions, in the same way as fixed-point addition and subtraction.
(4) Normalize the result and round it.
Omitted.
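
Steps (1) to (3) can be sketched on integer significands as below. This is a simplified illustration, not how hardware does it: it assumes two positive, normal floats with x >= y, simply drops the bits shifted out during alignment (real implementations keep extra guard bits and round), and returns a double, which takes care of normalization implicitly. The class name AlignAndSubtract and the method subtract are hypothetical.

```java
public class AlignAndSubtract {
    // x - y for positive normal floats with x >= y.
    static double subtract(float x, float y) {
        // Step (1): a zero operand needs no further work.
        if (y == 0f) return x;
        if (x == 0f) return -y;

        // Step (2): compare exponents; ex >= ey because x >= y > 0.
        int ex = Math.getExponent(x);
        int ey = Math.getExponent(y);
        long mx = (Float.floatToIntBits(x) & 0x7FFFFF) | 0x800000; // 24-bit significand
        long my = (Float.floatToIntBits(y) & 0x7FFFFF) | 0x800000;
        int delta = ex - ey;
        // Shift the significand with the smaller exponent to the right by delta places.
        my = delta >= 24 ? 0 : my >> delta;

        // Step (3): combine the aligned significands.
        long diff = mx - my;

        // The significand is an integer scaled by 2^23, so undo that scaling.
        return Math.scalb((double) diff, ex - 23);
    }

    public static void main(String[] args) {
        System.out.println(subtract(12.0f, 11.9f) == 12.0f - 11.9f); // true
        System.out.println(subtract(12.0f, 1.5f));                   // 10.5
    }
}
```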

12.0f-11.9f Calculation

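12.0f in memory: 0 1 0000010 10000000000000000000000
11.9f in memory: 0 1 0000010 01111100110011001100110
The exponent bits of the two numbers are exactly the same, so only the significant digits need to be subtracted. With the leading "1" restored, the subtraction is 1.10000000000000000000000 - 1.01111100110011001100110 = 0.00000011001100110011010.
12.0f - 11.9f result (before normalization, so the leading digit is 0 here): 0 1 0000010 00000011001100110011010

The exponent is 3, so moving the binary point three places to the right gives the result 0.00011001100110011010 in binary, which is 0.10000038 in decimal.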
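
The normalization step that was omitted earlier can be sketched for this particular result. The example uses the standard IEEE 754 view of the exponent (an 8-bit field with a bias of 127), which is equivalent to the bit 30 and bits 29 to 23 rules used in this article. The class name NormalizeDifference and the method normalizedBits are hypothetical, and the sketch assumes a positive difference that fits in 24 bits and a result in the normal range.

```java
public class NormalizeDifference {
    // Normalizes a positive significand difference and packs it as float bits.
    // Assumes 0 < diff < 2^24 and that the result stays in the normal range.
    static int normalizedBits(int diff, int exponent) {
        // Places needed to bring the leading 1 up to bit 23.
        int shift = Integer.numberOfLeadingZeros(diff) - 8;
        int significand = diff << shift;
        int e = exponent - shift;           // each left shift lowers the exponent by 1
        return ((e + 127) << 23) | (significand & 0x7FFFFF);
    }

    public static void main(String[] args) {
        int diff = 0xC00000 - 0xBE6666;     // 24-bit significands of 12.0f and 11.9f
        int bits = normalizedBits(diff, 3); // both operands have exponent 3
        System.out.println(Integer.toHexString(bits));                   // 3dcccd00
        System.out.println(bits == Float.floatToIntBits(12.0f - 11.9f)); // true
    }
}
```

Here the difference is shifted left by seven places, so the stored exponent becomes 3 - 7 = -4.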
