Float Type in memory

Last Update:2018-12-04 Source: Internet

Author: User

Tags bit set float number


Let's first look at how numbers are converted to binary in a computer:

Integer
We should all be familiar with how to convert an integer to binary: divide by 2 repeatedly, record each remainder, and then read the remainders in reverse order. For example, to compute the binary value of 9:
9 / 2 = 4 remainder 1
4 / 2 = 2 remainder 0
2 / 2 = 1 remainder 0
1 / 2 = 0 remainder 1
The calculation continues until the quotient is 0. Reading the remainders from bottom to top gives 1001, the binary value of 9.
This algorithm shows that repeatedly dividing an integer by 2 always reaches 0 eventually. Therefore, any integer can be represented exactly in binary.
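The repeated-division procedure can be sketched in a few lines of Python. This is only an illustration; the function name `int_to_binary` is made up for this example and assumes a non-negative integer.

```python
def int_to_binary(n):
    """Convert a non-negative integer to a binary string by repeated division by 2."""
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        n, r = divmod(n, 2)        # quotient and remainder of n / 2
        remainders.append(str(r))
    # Remainders are produced lowest digit first, so read them in reverse.
    return "".join(reversed(remainders))

print(int_to_binary(9))  # 1001
```

The loop always ends because the quotient shrinks at every step, which is the same reason integers have an exact binary form.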

Fraction
The algorithm for the fractional part is roughly the opposite of the integer one: multiply the fractional part by 2 repeatedly, take the integer part of each product, and arrange those digits in forward order. For example, to compute the binary value of 0.9:
0.9 * 2 = 1.8 take 1
0.8 * 2 = 1.6 take 1
0.6 * 2 = 1.2 take 1
0.2 * 2 = 0.4 take 0
0.4 * 2 = 0.8 take 0
0.8 * 2 = 1.6 take 1
... ...
The loop goes on forever, so the binary fraction repeats infinitely: 0.11100110011...
The algorithm stops only when the fractional part becomes exactly 0, which happens only for fractions that are sums of negative powers of 2, such as 0.5, 0.25, or 0.375. Unfortunately such fractions are rare, so most decimal fractions cannot be represented exactly in binary.
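The repeated-multiplication procedure can be sketched the same way. This is a minimal sketch, not a definitive implementation: the function name `frac_to_binary` and the `max_bits` cutoff are assumptions made for this example, and `fractions.Fraction` is used so that the demonstration itself does not suffer from the rounding it is trying to show.

```python
from fractions import Fraction

def frac_to_binary(x, max_bits):
    """Return up to max_bits binary digits of a fraction 0 <= x < 1 (given as a string)."""
    x = Fraction(x)                # exact decimal value, e.g. Fraction("0.9") == 9/10
    bits = []
    while x != 0 and len(bits) < max_bits:
        x *= 2
        bit = int(x)               # integer part of the product is the next digit
        bits.append(str(bit))
        x -= bit                   # keep only the fractional part
    return "".join(bits)

print(frac_to_binary("0.9", 11))    # 11100110011
print(frac_to_binary("0.375", 11))  # 011
```

For 0.9 the loop only ends because of the `max_bits` cutoff, while 0.375 (3/8) reaches a fractional part of 0 after three digits.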

------------------------ I am a split line ------------------------------

OK. With the above background, let's get to the topic: how the float type is represented in memory.
The float type is also called the single-precision floating-point type. IEEE 754-2008 defines its layout as follows:

 S  EEEEEEEE  FFFFFFFFFFFFFFFFFFFFFFF
31  30    23  22                    0

A float occupies 4 bytes, 32 bits in total:

Sign bit
The leftmost bit, S, is the sign bit: 0 means positive and 1 means negative.
Exponent
The E bits that follow are the exponent, 8 bits wide and also stored in binary.
Mantissa
The final F bits are the fractional part. The mantissa consists of these 23 stored fraction bits plus 1 implicit bit. (This is explained later.)

The exponent needs some explanation. Although it is stored as an 8-bit binary number, IEEE put some thought into its definition and uses a bias (offset) to encode it.

IEEE stipulates that for the float type the exponent bias is 127. That is, if the actual exponent is 0, the value stored in memory is the binary form of 0 + 127 = 127. This lets the 8-bit field represent negative exponents without a separate sign bit. We will see how this is used below.
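To see the three fields and the bias of 127 on a real value, the bits of a float can be pulled apart with Python's standard `struct` module. This is a small sketch; the helper name `float_fields` is an assumption made for this example.

```python
import struct

def float_fields(x):
    """Store x as a 32-bit float and return (sign, exponent field, fraction field)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # reinterpret 4 bytes as an integer
    sign = bits >> 31                # bit 31
    exponent = (bits >> 23) & 0xFF   # bits 30..23, stored with the bias of 127
    fraction = bits & 0x7FFFFF       # bits 22..0
    return sign, exponent, fraction

sign, exponent, fraction = float_fields(1.0)
print(sign, exponent, exponent - 127, fraction)  # 0 127 0 0
```

The value 1.0 has an actual exponent of 0, so the stored exponent field is 127, as described above.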

With that background, let's walk through how the computer converts a decimal real number into a binary float. Take the number 6.9 as an example.

First, convert the integer part and the fractional part to binary using the methods described above. The binary value of 6.9 is 110.1110011001100.... As you can see, the fractional part of 6.9 in binary repeats infinitely and cannot be represented exactly in a finite number of bits. This is one of the reasons floating-point calculations on computers are often inexact.

Next, move the binary point left (or right) until it sits just after the first significant digit, that is, immediately after the first 1. In this example, we need to shift the point 2 places to the left in 110.1110011001100....

The next step is interesting. We now have 1.101110011001100.... Starting from the first digit after the binary point, count off 23 digits and fill them into the mantissa part of the float layout above (where the row of F's is). Here that gives 10111001100110011001100. Precision is lost a second time at this point: everything beyond the 23rd digit after the point is dropped, which is unfortunate.

However, one thing may seem particularly odd: the 1 in front of the binary point is not stored at all. Look closely at the memory layout above and there is indeed no place for it. IEEE's reasoning is this: since everyone agrees to move the point to just after the first significant digit, the digit before the point is always 1, so storing it would be a waste. It is simply left out and treated as implicit. That is why I said earlier that the mantissa is 23 + 1 bits.

With the mantissa filled in, the exponent comes next. The exponent is the number of places we moved the binary point: a shift to the left counts as positive and a shift to the right as negative. Applying the bias described above, the stored exponent is 2 + 127 = 129, which is 10000001 as an 8-bit binary value.

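Finally, fill in the sign bit according to the sign of the number. Ours is positive, so the sign bit is 0. This gives the representation of 6.9 in memory:

0 10000001 10111001100110011001100

Note that this result comes from simply truncating the mantissa. In practice the discarded bits are rounded (see the notes below), and because the 24th fraction bit of 6.9 is 1, the pattern actually stored is 0 10000001 10111001100110011001101.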

To sum up, the steps for converting a real number to a binary float are:

A. Convert the integer part and the fractional part of the real number to binary.
B. Move the binary point left or right until it is just after the first significant digit.
C. Starting from the first digit after the binary point, take 23 digits and fill them into the mantissa.
D. Take the number of places the point was moved, positive for a left shift and negative for a right shift, add the bias 127, convert the sum to binary, and fill it into the exponent.
E. Fill in the sign bit according to the sign of the real number: 0 for positive, 1 for negative.
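Steps A to E can be sketched in Python as follows. This is an illustration under stated assumptions, not how any particular runtime does it: `encode_float_truncated` and `actual_float_bits` are hypothetical helper names, the input is a nonzero decimal string in the normal float range, and the mantissa is truncated exactly as in the walkthrough. The second helper uses the standard `struct` module to show what is really stored, for comparison.

```python
import struct
from fractions import Fraction

def encode_float_truncated(text):
    """Apply steps A-E to a nonzero decimal string, truncating the mantissa."""
    x = Fraction(text)
    if x == 0:
        raise ValueError("zero has no normalized form")
    sign = "1" if x < 0 else "0"           # step E
    x = abs(x)
    shift = 0                              # steps A and B: normalize to 1.xxx * 2**shift
    while x >= 2:
        x /= 2
        shift += 1
    while x < 1:
        x *= 2
        shift -= 1
    x -= 1                                 # drop the implicit leading 1
    bits = []
    for _ in range(23):                    # step C: 23 fraction digits, rest discarded
        x *= 2
        bit = int(x)
        bits.append(str(bit))
        x -= bit
    exponent = format(shift + 127, "08b")  # step D: add the bias
    return f"{sign} {exponent} {''.join(bits)}"

def actual_float_bits(value):
    """Bit pattern that struct really stores for a 32-bit float."""
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    s = format(raw, "032b")
    return f"{s[0]} {s[1:9]} {s[9:]}"

print(encode_float_truncated("6.9"))  # 0 10000001 10111001100110011001100
print(actual_float_bits(6.9))         # 0 10000001 10111001100110011001101
```

The two patterns differ only in the last mantissa bit, because the real conversion rounds the discarded digits instead of dropping them.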

To convert float binary data back to a decimal real number, simply reverse the steps above.
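The reverse direction can be sketched like this. The function name `decode_float` is made up for this example, and it handles only normalized numbers written as three space-separated bit groups.

```python
def decode_float(pattern):
    """Decode 'S EEEEEEEE FFF...F' (normalized numbers only) into a Python float."""
    sign, exponent, fraction = pattern.split()
    mantissa = 1 + int(fraction, 2) / 2**23      # restore the implicit leading 1
    value = mantissa * 2.0 ** (int(exponent, 2) - 127)  # remove the bias of 127
    return -value if sign == "1" else value

value = decode_float("0 10000001 10111001100110011001100")
print(value, value < 6.9)
```

Decoding the truncated pattern from the walkthrough gives a value slightly below 6.9, which makes the lost precision visible.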

------------------------ I am a split line ------------------------------

Notes:

Rounding the 23-bit mantissa
When the mantissa is cut to 23 bits, the 24th bit is rounded rather than simply dropped: a 24th bit of 0 is discarded and a 24th bit of 1 normally rounds up (an exact tie goes to the even mantissa). This matches the round-to-nearest mode that IEEE 754 defines as the default, and it is what the Java Virtual Machine does. If you are interested, try 0.7f-0.6f.
Rounding when shifting right during arithmetic
This issue also shows up in actual calculations. When the operands of an addition or subtraction are aligned, a mantissa is shifted right, and so far I have not been able to confirm whether the bits shifted out are rounded as well. If you are interested, try 9.6f-6.9f.
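The two suggested experiments can be tried outside the JVM as well. The sketch below emulates single-precision arithmetic in Python by rounding every operand and result to a 32-bit float with the standard `struct` module; the helper name `f32` is an assumption made for this example, and the emulation assumes that rounding a double-precision difference to float gives the same answer as a native float subtraction.

```python
import struct

def f32(x):
    """Round a Python float (a double) to the nearest 32-bit float value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def bits32(x):
    """32-bit pattern of x when stored as a float."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    return format(raw, "032b")

diff1 = f32(f32(0.7) - f32(0.6))   # emulates 0.7f - 0.6f
diff2 = f32(f32(9.6) - f32(6.9))   # emulates 9.6f - 6.9f

print(diff1 == f32(0.1))           # False
print(bits32(diff1), bits32(0.1))
print(diff2 == f32(2.7))           # False
print(bits32(diff2), bits32(2.7))
```

Neither difference matches the float nearest to the expected decimal answer, because the operands were already rounded when they were stored.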
All-zero exponent
An exponent field of all zeros marks a special float value. There are two cases:
- The mantissa is all zeros. The float is then 0, either +0 or -0 depending on the sign bit. The two compare equal on the JVM.
  This needs some explanation. Because of the implicit leading 1, a normalized float cannot represent 0. A special rule is therefore needed, namely that an all-zero exponent with an all-zero mantissa means 0.
- The mantissa is not all zeros. The float is then a denormalized (subnormal) number.
All-one exponent
An exponent field of all ones also marks a special float value. Again there are two cases:
- The mantissa is all zeros. The float is then positive infinity (+infinity) or negative infinity (-infinity), depending on the sign bit. Note that these two are not equal on the JVM.
- The mantissa is not all zeros. The float is then not a number (NaN). NaN is further divided into QNaN (quiet NaN) and SNaN (signalling NaN). The passage below explains the difference between the two NaNs. I do not know this area well enough to paraphrase it, so I quote the original text:
  A QNaN is a NaN with the most significant fraction bit set. QNaN's propagate freely through most arithmetic operations. These values pop out of an operation when the result is not mathematically defined.
  An SNaN is a NaN with the most significant fraction bit clear. It is used to signal an exception when used in operations. SNaN's can be handy to assign to uninitialized variables to trap premature usage.
  Semantically, QNaN's denote indeterminate operations, while SNaN's denote invalid operations.
  As the last sentence says, a QNaN is the result of an indeterminate operation, while an SNaN marks an outright invalid operation.
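The special cases in these notes can be summarized as a small classifier over the raw 32 bits. This is only a sketch: the names `classify` and `to_bits` are made up for this example, and the quiet/signalling distinction follows the quoted text (most significant fraction bit set or clear).

```python
import struct

def classify(bits):
    """Classify a 32-bit float pattern by its exponent and fraction fields."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0:                 # all-zero exponent
        return "zero" if fraction == 0 else "subnormal"
    if exponent == 0xFF:              # all-one exponent
        if fraction == 0:
            return "infinity"
        # top fraction bit set means quiet NaN, clear means signalling NaN
        return "quiet NaN" if fraction & 0x400000 else "signalling NaN"
    return "normal"

def to_bits(x):
    """Raw 32-bit pattern of x when stored as a float."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    return raw

print(classify(to_bits(6.9)))            # normal
print(classify(to_bits(-0.0)))           # zero
print(classify(to_bits(float("inf"))))   # infinity
print(classify(0x7FC00000))              # quiet NaN
print(0.0 == -0.0, float("inf") == float("-inf"))  # True False
```

The last line shows the same comparison behavior the notes describe for the JVM: the two zeros compare equal, the two infinities do not.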

------------------------ I am a split line -----------------------------

Okay, that should give a general picture of the float type. Once float is understood, the double type is easy: it works the same way, only the number of bits in the exponent and the mantissa differ.
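For comparison, a double uses 1 sign bit, 11 exponent bits with a bias of 1023, and 52 stored fraction bits. A Python float is itself a double, so its fields can be inspected directly with the standard `struct` module. The helper name `double_fields` is an assumption made for this sketch.

```python
import struct

def double_fields(x):
    """Return (sign, exponent field, fraction field) of x as a 64-bit double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                    # bit 63
    exponent = (bits >> 52) & 0x7FF      # 11 bits, stored with the bias of 1023
    fraction = bits & ((1 << 52) - 1)    # 52 stored fraction bits
    return sign, exponent, fraction

sign, exponent, fraction = double_fields(6.9)
print(sign, exponent - 1023, format(fraction, "052b"))
```

For 6.9 the actual exponent is again 2, and the fraction starts with the same digits as the float mantissa but continues for 52 bits.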

References:

IEEE Standard 754 floating point numbers: http://steve.hollasch.net/cgindex/coding/ieeefloat.htm

IEEE 754-1985: http://en.wikipedia.org/wiki/IEEE_754-1985

IEEE 754-2008: http://en.wikipedia.org/wiki/IEEE_754-2008

Java Theory and Practice: Where are your decimal points?: http://www.ibm.com/developerworks/cn/java/j-jtp0114/

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
