Analysis of float memory storage and precision loss

Last Update:2018-12-03 Source: Internet

Author: User

Tags decimal to binary


Question: why does 12.0f - 11.9f evaluate to 0.10000038?

Let's take a closer look at why floating-point operations lose precision.

1. Binary representation of decimal numbers

First, we need to answer the following two questions:

(1) How to convert a decimal integer to binary

The algorithm is simple: divide by 2 repeatedly and record each remainder. For example, converting 11 to binary:

11 / 2 = 5 remainder 1

5 / 2 = 2 remainder 1

2 / 2 = 1 remainder 0

1 / 2 = 0 remainder 1

The quotient is 0, so the division ends. Reading the remainders from bottom to top gives the binary representation of 11: 1011

Note that the division ends when the quotient reaches 0. Does repeatedly dividing an integer by 2 always reach 0? In other words, could the integer conversion ever loop forever? It cannot: each division makes the quotient strictly smaller, so it must eventually reach 0. Integers can always be represented exactly in binary, but fractions cannot always be.
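
The repeated division above can be sketched in Java as follows. This is only an illustration: the class and method names are made up for this example, and the method assumes a non-negative input.

```java
public class IntToBinary {
    // Repeatedly divide by 2 and collect the remainders. The remainders
    // come out lowest bit first, so the string is reversed at the end
    // (the "from bottom up" reading in the worked example).
    static String toBinary(int n) {
        if (n == 0) {
            return "0";
        }
        StringBuilder bits = new StringBuilder();
        while (n > 0) {
            bits.append(n % 2); // the remainder is the next binary digit
            n /= 2;             // the quotient feeds the next division
        }
        return bits.reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(toBinary(11)); // prints 1011
    }
}
```

The loop always terminates because the quotient shrinks on every pass, which is the same argument made in the text.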

(2) How to convert a decimal fraction to binary

The algorithm is to multiply by 2 repeatedly, taking the integer part of each product as the next binary digit, until the fractional part becomes 0. For example, converting 0.9 to binary:

0.9 * 2 = 1.8, integer part 1

0.8 (the fractional part of 1.8) * 2 = 1.6, integer part 1

0.6 * 2 = 1.2, integer part 1

0.2 * 2 = 0.4, integer part 0

0.4 * 2 = 0.8, integer part 0

0.8 * 2 = 1.6, integer part 1

0.6 * 2 = 1.2, integer part 1

......

Reading the integer parts from top to bottom gives the binary representation of 0.9: 0.1110011001100......

Note: the calculation above is a cycle (0.8, 0.6, 0.2, 0.4, then 0.8 again), so multiplying by 2 will never eliminate the fractional part and the algorithm runs forever. Clearly, the binary representation of a decimal fraction is sometimes not exact. The principle is simple: can 1/3 be written exactly in the decimal system? In the same way, the binary system cannot represent 1/10 exactly. This is the root cause of the precision loss seen in floating-point subtraction.
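
The multiply-by-2 procedure can be sketched in Java as follows. This is an illustration only: the class name, method name, and the maxBits cutoff are assumptions made for this example, and BigDecimal is used so that the decimal input itself is held exactly. The input is assumed to be a decimal string between 0 and 1.

```java
import java.math.BigDecimal;

public class FractionToBinary {
    // Multiply by 2 repeatedly; the integer part of each product is the
    // next binary digit. Stops when the fraction reaches 0 or after
    // maxBits digits, because some fractions never terminate.
    static String toBinary(String decimalFraction, int maxBits) {
        BigDecimal two = BigDecimal.valueOf(2);
        BigDecimal x = new BigDecimal(decimalFraction);
        StringBuilder bits = new StringBuilder("0.");
        for (int i = 0; i < maxBits && x.signum() != 0; i++) {
            x = x.multiply(two);
            if (x.compareTo(BigDecimal.ONE) >= 0) {
                bits.append('1');
                x = x.subtract(BigDecimal.ONE); // keep only the fractional part
            } else {
                bits.append('0');
            }
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBinary("0.9", 16));   // prints 0.1110011001100110
        System.out.println(toBinary("0.625", 16)); // prints 0.101
    }
}
```

For 0.9 the loop only stops because of the cutoff, while 0.625 terminates on its own after three digits.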

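2. Float storage in memory

As we all know, Java's float type occupies 4 bytes in memory. The 32-bit structure of a float is as follows:

Float memory storage structure

Bits (4 bytes)   31         30                  29-23           22-0

Meaning          sign bit   exponent sign bit   exponent bits   significant digits

A sign bit of 0 indicates a positive number, and 1 indicates a negative number. There are 24 significant bits, but only 23 of them are stored: the leading bit of a normalized number is always 1, so it is left implicit. Bits 30 to 23 together form what IEEE 754 defines as a single 8-bit exponent stored with a bias of 127; for ordinary (normalized) values, the "exponent sign plus seven bits" description used in this article produces the same bit patterns.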

To convert a float value to its memory storage format, follow these steps:

(1) Convert the absolute value of the real number to binary. The conversion of the integer part and the fractional part was covered above.
(2) Shift the binary point of the binary real number to the left or right until it sits immediately to the right of the first significant digit (the leading 1). Let n be the number of places shifted.
(3) Put the 23 digits that follow the binary point into bits 22 to 0.
(4) If the real number is positive, put "0" in bit 31; otherwise, put "1".
(5) If the point was shifted to the left, the exponent is positive and "1" goes in bit 30. If the point was shifted to the right, or n = 0, "0" goes in bit 30.
(6) If the point was shifted to the left, subtract 1 from n, convert the result to binary, pad it on the left with "0" to seven digits, and put it in bits 29 to 23. If the point was shifted to the right, or n = 0, convert n to binary, pad it on the left with "0" to seven digits, invert every bit, and put the result in bits 29 to 23.

Example: the memory storage format of 11.9

(1) Convert 11.9 to binary, which is approximately "1011.1110011001100110011001100...".

(2) Shift the binary point three places to the left, so that it sits to the right of the first significant digit: "1.01111100110011001100110". Keep 24 significant bits and truncate the rest on the right (this is where the error is introduced).

(3) There are now 24 significant bits. Drop the leftmost "1" to get the 23 bits "01111100110011001100110", and put them in bits 22 to 0 of the float storage structure.

(4) Because 11.9 is positive, put "0" in bit 31, the sign bit.

(5) Because the binary point was shifted to the left, put "1" in bit 30, the exponent sign bit.

(6) Because the binary point was shifted three places to the left, subtract 1 from 3 to get 2, convert it to binary, and pad it to 7 digits to get 0000010. Put this in bits 29 to 23.

The stored value of 11.9 is: 0 1 0000010 01111100110011001100110.

Another example: the memory storage format of 0.2356
(1) Convert 0.2356 to binary, which is approximately 0.00111100010100000100100000.
(2) Shift the binary point three places to the right to get 1.11100010100000100100000.
(3) Take the 23 significant digits to the right of the binary point, 11100010100000100100000, and put them in bits 22 to 0.
(4) Because 0.2356 is positive, put "0" in bit 31.
(5) Because the binary point was shifted to the right, put "0" in bit 30.
(6) Because the binary point was shifted three places to the right, convert 3 to binary and pad it on the left with "0" to seven bits to get 0000011. Invert every bit to get 1111100, and put it in bits 29 to 23.

The stored value of 0.2356 is: 0 0 1111100 11100010100000100100000.
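
The two worked examples can be checked against what the JVM actually stores. The sketch below uses the standard Float.floatToIntBits method to read the raw 32 bits and prints them in the same four groups used in this article (bit 31, bit 30, bits 29-23, bits 22-0). The class and method names are made up for this illustration.

```java
public class FloatFields {
    // Formats the raw bits of a float as
    // "sign exponentSign exponent(7 bits) mantissa(23 bits)".
    static String fields(float f) {
        int bits = Float.floatToIntBits(f);
        // Left-pad to 32 binary digits, since toBinaryString drops leading zeros.
        String s = String.format("%32s", Integer.toBinaryString(bits)).replace(' ', '0');
        return s.substring(0, 1) + " " + s.substring(1, 2) + " "
                + s.substring(2, 9) + " " + s.substring(9);
    }

    public static void main(String[] args) {
        System.out.println(fields(11.9f));   // prints 0 1 0000010 01111100110011001100110
        System.out.println(fields(0.2356f)); // prints 0 0 1111100 11100010100000100100000
    }
}
```

One caveat: the hardware rounds to the nearest representable value rather than truncating. For these two inputs the first discarded bit is 0, so rounding and truncation give the same 23 bits.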

To convert a float stored in memory back to decimal:
(1) Write out the binary digits in bits 22 to 0, and add a "1" on the far left to get the 24 significant digits. Place the binary point to the right of that leftmost "1".
(2) Read the value n represented by bits 29 to 23. If bit 30 is "0", invert every bit of n. If bit 30 is "1", add 1 to n.
(3) Shift the binary point n places to the left (if bit 30 is "0") or n places to the right (if bit 30 is "1") to obtain the real number in binary.
(4) Convert the binary real number to decimal, and make it positive or negative depending on whether bit 31 is "0" or "1".
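
The decoding steps above can be sketched in Java as follows. This is a minimal illustration under stated assumptions: the class and method names are invented for the example, and it only handles ordinary normalized values (it does not treat zero, subnormal numbers, infinity, or NaN specially). Math.scalb multiplies by a power of two, which stands in for shifting the binary point.

```java
public class DecodeFloat {
    // Rebuilds the numeric value from raw float bits by following the
    // four decoding steps in the text. Normalized values only.
    static double decode(int bits) {
        int sign = bits >>> 31;                      // bit 31
        int bit30 = (bits >>> 30) & 1;               // exponent sign bit
        int n = (bits >>> 23) & 0x7F;                // bits 29-23
        int mantissa = (bits & 0x7FFFFF) | 0x800000; // step 1: restore the leading 1

        // Step 2: bit 30 = 1 means n + 1 places right (positive exponent),
        // bit 30 = 0 means (inverted n) places left (negative exponent).
        int exponent = (bit30 == 1) ? n + 1 : -(~n & 0x7F);

        // Step 3: the 24-bit mantissa has 23 digits after the point,
        // so scale by 2^(exponent - 23).
        double value = Math.scalb((double) mantissa, exponent - 23);

        // Step 4: apply the sign.
        return sign == 0 ? value : -value;
    }

    public static void main(String[] args) {
        int bits = Float.floatToIntBits(11.9f);
        System.out.println(decode(bits));
        System.out.println((double) Float.intBitsToFloat(bits)); // same value
    }
}
```

Both lines print the value that is really stored for 11.9f, which is slightly below 11.9 because of the truncated bits.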

3. Floating-point Subtraction

Floating-point addition and subtraction are more complex than their fixed-point counterparts. The operation takes four steps:
(1) Check for a zero operand.

If either of the two floating-point numbers being added or subtracted is 0, the result is known immediately and the remaining steps are skipped.

(2) Compare the exponents and align them.

Check whether the exponents of the two floating-point numbers are the same, that is, whether their binary points are aligned. If the two exponents are the same, the points are aligned and the mantissas can be added or subtracted directly. If the exponents differ, the points are not aligned, and the two exponents must first be made equal. This process is called alignment.

How to align (let the exponents of the two floating-point numbers be Ex and Ey):

Shift one mantissa to change Ex or Ey until the two are equal. Because floating-point numbers are normalized, shifting a mantissa to the left would lose its most significant bits and cause a large error. Shifting a mantissa to the right loses only its least significant bits, so the error is small. Alignment therefore always shifts a mantissa to the right, and increases the exponent by 1 for every place shifted so that the value stays the same. Since the exponent being increased has to end up equal to the other one, it must be the smaller of the two. The rule is thus that the smaller exponent is aligned to the larger one: the mantissa of the number with the smaller exponent is shifted right (equivalent to moving the binary point left) one place at a time, and its exponent is increased by 1 each time, until the two exponents are equal. The total number of places shifted equals the exponent difference ΔE.
(3) Add or subtract the mantissas (significant digits).

Once the exponents are aligned, the mantissas can be combined. Both addition and subtraction are carried out as addition, in the same way as fixed-point addition and subtraction.
(4) Normalize the result and round it.

(Details omitted.)

For more on floating-point addition and subtraction, see http://www.zzslxx.com/wmy/jy/Chap02/2.7.1.htm
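
The four steps can be sketched in Java for the subtraction case. This is a simplified illustration, not how the hardware is implemented: the class and method names are invented, it assumes x >= y > 0 with both operands and the result being normalized numbers, and it simply truncates the bits shifted out during alignment, whereas real hardware keeps extra guard bits and rounds. It works on the 8-bit biased exponent field (bits 30-23) as a whole.

```java
public class AlignAndSubtract {
    // Computes x - y by aligning exponents, subtracting mantissas, and
    // normalizing. Assumes x >= y > 0 and normalized operands and result.
    static float subtract(float x, float y) {
        int bx = Float.floatToIntBits(x);
        int by = Float.floatToIntBits(y);
        int ex = (bx >>> 23) & 0xFF;           // exponent field of x
        int ey = (by >>> 23) & 0xFF;           // exponent field of y
        int mx = (bx & 0x7FFFFF) | 0x800000;   // 24-bit mantissa of x
        int my = (by & 0x7FFFFF) | 0x800000;   // 24-bit mantissa of y

        // Step 2: align the smaller exponent to the larger one by
        // shifting the smaller number's mantissa right.
        int delta = ex - ey;
        my = (delta < 24) ? (my >>> delta) : 0; // low bits are lost here

        // Step 3: subtract the aligned mantissas.
        int m = mx - my;
        if (m == 0) {
            return 0.0f;
        }

        // Step 4: normalize by shifting left until the leading 1 is
        // back in bit 23, lowering the exponent once per shift.
        int e = ex;
        while ((m & 0x800000) == 0) {
            m <<= 1;
            e--;
        }
        return Float.intBitsToFloat((e << 23) | (m & 0x7FFFFF));
    }

    public static void main(String[] args) {
        System.out.println(subtract(12.0f, 0.75f)); // prints 11.25
        System.out.println(subtract(12.0f, 11.9f) == 12.0f - 11.9f); // prints true
    }
}
```

In the first call the exponents differ by 4, so the mantissa of 0.75 is shifted right four places before the subtraction. In the second call the exponents are already equal and only normalization is needed.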

4. Calculate 12.0f-11.9f

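12.0f memory storage format: 0 1 0000010 10000000000000000000000

11.9f memory storage format: 0 1 0000010 01111100110011001100110

The exponent bits of the two numbers are exactly the same, so only the significant digits need to be subtracted: 1.10000000000000000000000 - 1.01111100110011001100110 = 0.00000011001100110011010.

12.0f-11.9f result (before normalization): 0 1 0000010 00000011001100110011010

Restoring the result to decimal: the exponent is 3, so shift the binary point three places to the right to get 0.00011001100110011010 in binary, which is 0.10000038 in decimal. The extra 0.00000038 comes from the low-order bits of 11.9 that were truncated when it was stored.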
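
The calculation in this section can be reproduced in Java. The sketch below subtracts the two 24-bit mantissas as plain integers and scales the difference by the shared exponent. The class and method names are made up for this illustration; Float.floatToIntBits and Math.scalb are standard library methods.

```java
public class TwelveMinusElevenPointNine {
    // Returns the 24 significant bits of a normalized float,
    // including the implicit leading 1.
    static int mantissa(float f) {
        return (Float.floatToIntBits(f) & 0x7FFFFF) | 0x800000;
    }

    // Subtracts the mantissas directly (valid here because both numbers
    // have exponent 3), then scales by 2^(3 - 23) to restore the value.
    static double manualDifference() {
        int diff = mantissa(12.0f) - mantissa(11.9f);
        return Math.scalb((double) diff, 3 - 23);
    }

    public static void main(String[] args) {
        int diff = mantissa(12.0f) - mantissa(11.9f);
        System.out.println(Integer.toBinaryString(diff)); // prints 11001100110011010
        System.out.println(12.0f - 11.9f);                // prints 0.10000038
        System.out.println(manualDifference() == (double) (12.0f - 11.9f)); // prints true
    }
}
```

The binary string printed first is the same run of digits that appears after the leading zeros in the restored result above.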
