Let's start with how numbers are converted to binary inside a computer:

- Integer

We should all be familiar with converting an integer to binary: divide it by 2 repeatedly, record each remainder, then read the remainders in reverse order. For example, to compute the binary value of 9:

9 / 2 = 4 remainder 1

4 / 2 = 2 remainder 0

2 / 2 = 1 remainder 0

1 / 2 = 0 remainder 1

Keep going until the quotient is 0. Then read the remainders from bottom to top: the binary value of 9 is 1001.

From this algorithm we can see that repeatedly dividing an integer by 2 always reaches 0 eventually. Therefore, every integer can be represented exactly in binary.
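
The repeated division above can be sketched in a few lines of Java. This is only an illustration: the class and method names are made up for this post, and it assumes a non-negative input.

```java
final class IntToBinary {
    // Repeatedly divide by 2 and collect the remainders;
    // reading them in reverse order gives the binary digits.
    // Assumes n >= 0.
    static String toBinary(int n) {
        if (n == 0) {
            return "0";
        }
        StringBuilder remainders = new StringBuilder();
        while (n > 0) {
            remainders.append(n % 2); // remainder of this step
            n /= 2;                   // quotient feeds the next step
        }
        return remainders.reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(toBinary(9)); // 1001
    }
}
```

The loop always terminates because the quotient shrinks at every step, which is the same reason every integer has an exact binary form.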

- Fraction

Converting a fraction to binary is roughly the opposite of the integer case: multiply the fractional part by 2, take the integer part of the product as the next digit, repeat with the remaining fractional part, and write the digits in forward order. For example, to compute the binary value of 0.9:

0.9 * 2 = 1.8, take 1

0.8 * 2 = 1.6, take 1

0.6 * 2 = 1.2, take 1

0.2 * 2 = 0.4, take 0

0.4 * 2 = 0.8, take 0

0.8 * 2 = 1.6, take 1

... ...

The loop never ends, so the binary fraction repeats forever: 0.11100110011...

From this algorithm we can see that it stops only when the fractional part becomes exactly 0, which means the last step must have started from a fractional part of 0.5. Unfortunately, only fractions that are sums of negative powers of two (0.5, 0.25, 0.375, ...) ever get there, and they are a small minority. Therefore, most decimal fractions cannot be represented exactly in binary.
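
Here is a small Java sketch of the repeated multiplication. To keep the demonstration itself free of rounding, it takes the fraction as a numerator and a denominator (0.9 is passed as 9 and 10) and works in integers; that input format, like the class and method names, is just an assumption made for this example.

```java
final class FractionToBinary {
    // Produces up to `digits` binary digits of numerator/denominator,
    // assuming 0 <= numerator < denominator.
    // Stops early if the remaining fractional part reaches exactly 0.
    static String fractionBits(long numerator, long denominator, int digits) {
        StringBuilder bits = new StringBuilder("0.");
        long rest = numerator;
        for (int i = 0; i < digits && rest != 0; i++) {
            rest *= 2;                 // multiply the fractional part by 2
            if (rest >= denominator) { // integer part of the product is 1
                bits.append('1');
                rest -= denominator;   // keep only the fractional part
            } else {                   // integer part of the product is 0
                bits.append('0');
            }
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        System.out.println(fractionBits(9, 10, 11)); // 0.11100110011
        System.out.println(fractionBits(3, 8, 11));  // 0.011
    }
}
```

For 9/10 the loop only stops because of the digit limit, while 3/8 (0.375) ends by itself after three digits.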

------------------------ I am a split line ------------------------------

OK. With that background, let's get to the topic: how the float type is represented in memory.

The float type is also called the single-precision floating-point type. __IEEE 754-2008__ defines its layout as follows:

| S | eeeeeeee | fffffffffffffffffffffff |
|---|---|---|
| 31 | 30 to 23 | 22 to 0 |

A float takes 4 bytes in total, that is, 32 bits:

- Sign bit

The leftmost bit S is the sign: 0 is positive, and 1 is negative.
- Exponent

The E bits that follow are the exponent. It has 8 bits and is also stored in binary.
- Mantissa

The final F bits are the fractional part. The mantissa consists of these 23 stored fraction bits plus 1 implicit bit. (This will be explained later.)

Here we need to talk about the exponent. Although it is stored as an 8-bit binary value, IEEE took some care in defining it and uses a bias (offset) to encode it.

IEEE stipulates that for the float type the bias is 127. That is, if the actual exponent is 0, the value saved in memory is the binary form of 0 + 127 = 127. This lets the 8 bits cover negative exponents without a separate sign. We will see how it is used shortly.
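
To look at the three fields (S, E, and F above) of a real float, one option on the JVM is `Float.floatToIntBits`, which returns the 32 bits as an `int`. The sketch below separates the fields with shifts and masks; the class and helper names are invented for this example.

```java
final class FloatFields {
    // Bit 31: the sign.
    static int sign(float f) {
        return Float.floatToIntBits(f) >>> 31;
    }

    // Bits 30 to 23: the exponent as stored, that is, with the bias of 127 added.
    static int storedExponent(float f) {
        return (Float.floatToIntBits(f) >>> 23) & 0xFF;
    }

    // Bits 22 to 0: the 23 stored fraction bits.
    static int fraction(float f) {
        return Float.floatToIntBits(f) & 0x7FFFFF;
    }

    public static void main(String[] args) {
        // 1.0 is 1.0 * 2^0, so the stored exponent is 0 + 127.
        System.out.println(storedExponent(1.0f)); // 127
        // 0.5 is 1.0 * 2^-1, so the stored exponent is -1 + 127.
        System.out.println(storedExponent(0.5f)); // 126
    }
}
```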

After all that, let's walk through how the computer converts a decimal real number into a binary float. Take 6.9 as an example. -_-|!

First, convert the integer part and the fractional part to binary using the methods described above. The binary value of 6.9 comes out as 110.1110011001100.... Here we can already see that the fractional part of 6.9 repeats forever in binary, so it cannot be represented exactly in a finite number of bits. This is one of the reasons why floating-point calculations on computers are often inexact.

Next, move the binary point left (or right) until it sits just after the first significant digit, that is, just after the first 1. In this case, for 110.1110011001100... we need to move the point 2 places to the left.

Now comes the interesting part. We have 1.101110011001100.... Starting from the first digit after the point, count off 23 digits and fill them into the mantissa part of the float layout above (where the row of f's is). Here that gives 10111001100110011001100. This is where a second inaccuracy appears: everything beyond the 23rd digit after the point is cut off, which is terrible. (How the cut is rounded is covered in the notes at the end.)

However, one thing here may bother you: the 1 in front of the point is not stored. Look closely at the memory layout above and there is indeed no place for it. The reason is that IEEE figured: since we all agree to move the point to just after the first significant digit, there is always a 1 in front of the point, so storing it is a waste; simply leave it out, and everyone keeps this tacit understanding from then on. That's why I said earlier that the mantissa is 23 + 1 bits.

After the mantissa comes the exponent. The exponent is the number of places we moved the point: a move to the left counts as positive and a move to the right as negative. Then, applying the bias described above, the stored exponent is 2 + 127 = 129, which as an 8-bit binary value is 10000001.

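Finally, fill in the sign bit according to the sign of the number. Ours is positive, so it is 0. That gives the layout of 6.9 derived by these steps:

0 10000001 10111001100110011001100

Note that this pattern comes from simply cutting the mantissa at 23 bits. Real hardware rounds to nearest instead, and because the discarded bits start with 1100..., the last kept bit rounds up, so 6.9f is actually stored as 0 10000001 10111001100110011001101.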
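
The hand-derived pattern can be compared with what the JVM actually stores. The sketch below prints the 32 bits of a float split into sign, exponent, and mantissa; `layout` is a helper name made up for this example. The only difference from the truncated pattern is the last mantissa bit, which comes out as 1 because the value is rounded to nearest rather than cut off.

```java
final class SixPointNine {
    // Formats the 32 bits of a float as "S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF".
    static String layout(float f) {
        int bits = Float.floatToIntBits(f);
        // toBinaryString drops leading zeros, so pad back to 32 digits.
        String s = String.format("%32s", Integer.toBinaryString(bits)).replace(' ', '0');
        return s.substring(0, 1) + " " + s.substring(1, 9) + " " + s.substring(9);
    }

    public static void main(String[] args) {
        System.out.println(layout(6.9f)); // 0 10000001 10111001100110011001101
    }
}
```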

To sum up, converting a real number to a binary float works as follows:

A. Convert the integer part and the fractional part of the real number to binary.

B. Move the binary point left or right until it sits just after the first significant digit.

C. Starting from the first digit after the point, take 23 digits and fill them into the mantissa.

D. Take the number of places the point moved, positive for a move to the left and negative for a move to the right, add the bias 127, convert the sum to binary, and fill it into the exponent.

E. Fill in the sign bit according to the sign of the real number: 0 is positive, and 1 is negative.

To convert float binary data back to a decimal real number, just run the steps above in reverse.
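
Running the steps in reverse might look like the sketch below. It only handles ordinary (normalized) values, not the special all-zero and all-one exponents, and the class and method names are invented for this example. It puts the implicit leading 1 back, removes the bias of 127, and applies the sign.

```java
final class DecodeFloat {
    // Rebuilds the value of a normalized float from its three fields.
    // Assumes 1 <= storedExponent <= 254.
    static double decode(int sign, int storedExponent, int fraction) {
        // Put the implicit 1 back in front of the 23 fraction bits.
        double significand = 1.0 + fraction / (double) (1 << 23);
        // Remove the bias, then scale by that power of two.
        double value = Math.scalb(significand, storedExponent - 127);
        return sign == 0 ? value : -value;
    }

    public static void main(String[] args) {
        // Fields of 6.9f: sign 0, exponent 10000001 (129),
        // fraction 10111001100110011001101 (0x5CCCCD).
        System.out.println(decode(0, 129, 0x5CCCCD) == 6.9f); // true
    }
}
```

The JVM also offers `Float.intBitsToFloat` for going from the raw 32 bits straight back to a float.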

**------------------------ I am a split line ------------------------------**

**Notes:**

- Rounding the 23-bit mantissa

When the mantissa is cut to 23 bits, the 24th bit has to be rounded in (a discarded 0 rounds down, a discarded 1 rounds up); this is what the Java Virtual Machine does. It corresponds to the default rounding mode of IEEE 754, round to nearest, where an exact tie goes to the even mantissa. If you are interested, try 0.7f-0.6f.
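
A quick way to see the rounding on the JVM is to look at the stored fraction bits of 0.6f. By the multiply-by-2 method, 0.6 is 1.0011001100110011... times 2^-1, so cutting at 23 bits would give a fraction of 0x199999; the stored value is one higher. The sketch below (helper names are made up for this example) prints it, and also shows that 0.7f-0.6f does not land on 0.1f.

```java
final class RoundingDemo {
    // The 23 stored fraction bits of a float.
    static int fraction(float f) {
        return Float.floatToIntBits(f) & 0x7FFFFF;
    }

    static float difference() {
        return 0.7f - 0.6f;
    }

    public static void main(String[] args) {
        // Plain truncation would give 199999; the stored value is rounded up.
        System.out.println(Integer.toHexString(fraction(0.6f))); // 19999a
        System.out.println(difference());
        System.out.println(difference() == 0.1f); // false
    }
}
```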

- Rounding on the right during operations

This also comes up in actual operations. So far I have not been able to determine whether the same rounding is applied to the bits that fall off on the right during an operation. If you are interested, try 9.6f-6.9f.

- All-zero exponent

An all-zero exponent marks a special float value. There are two cases:

- The mantissa is all zeros. The float is 0, either +0 or -0 depending on the sign bit. The two compare as equal on the JVM.

This needs some explanation. Because of the implicit leading 1, the normal encoding cannot represent 0, so a special rule is needed: an all-zero exponent together with an all-zero mantissa means zero, as stated above.
- The mantissa is not all zeros. The float is then a denormalized (subnormal) number: the implicit leading bit is taken as 0 instead of 1, and the exponent is fixed at -126.
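
The two all-zero-exponent cases can be checked from Java with a couple of bit tests. This is a sketch with made-up helper names; `Float.MIN_VALUE` and `Float.MIN_NORMAL` are the JDK constants for the smallest positive subnormal and the smallest positive normalized float.

```java
final class ZeroExponent {
    // Exponent bits all zero and fraction bits all zero: +0 or -0.
    static boolean isZero(float f) {
        int bits = Float.floatToIntBits(f);
        return ((bits >>> 23) & 0xFF) == 0 && (bits & 0x7FFFFF) == 0;
    }

    // Exponent bits all zero but fraction bits not all zero: subnormal.
    static boolean isSubnormal(float f) {
        int bits = Float.floatToIntBits(f);
        return ((bits >>> 23) & 0xFF) == 0 && (bits & 0x7FFFFF) != 0;
    }

    public static void main(String[] args) {
        System.out.println(0.0f == -0.0f);                 // true
        System.out.println(isZero(-0.0f));                 // true
        System.out.println(isSubnormal(Float.MIN_VALUE));  // true
        System.out.println(isSubnormal(Float.MIN_NORMAL)); // false
    }
}
```

Note that the equality of +0 and -0 holds for the primitive `==` comparison; the two values still have different bit patterns, since only the sign bit differs.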

- All-one exponent

An all-one exponent marks a float that is not an ordinary number. Again there are two cases:

- The mantissa is all zeros. The float is positive infinity (+infinity) or negative infinity (-infinity) depending on the sign bit. Note that these two are not equal on the JVM.
- The mantissa is not all zeros. The float is not a number (NaN). NaN is further divided into QNaN (quiet NaN) and SNaN (signalling NaN). As for the difference between the two, the passage below explains it, but I lack the background in this area and dare not reword it, so I quote the original here:

A QNaN is a NaN with the most significant fraction bit set. QNaN's propagate freely through most arithmetic operations. These values pop out of an operation when the result is not mathematically defined.

An SNaN is a NaN with the most significant fraction bit clear. It is used to signal an exception when used in operations. SNaN's can be handy to assign to uninitialized variables to trap premature usage.

Semantically, QNaN's denote indeterminate operations, while SNaN's denote invalid operations.

From the last sentence we can see that a QNaN is the result of an indeterminate operation, while an SNaN marks an outright invalid operation.
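
The all-one-exponent cases can be told apart from the raw bits, as in the sketch below. The `classify` helper and its labels are made up for this example, and the quiet/signalling split simply follows the quoted description (most significant fraction bit set or clear). It takes the bits as an `int` because `Float.floatToIntBits` collapses every NaN to the single pattern 0x7fc00000.

```java
final class AllOnesExponent {
    // Classifies a raw 32-bit float pattern by its exponent and fraction bits.
    static String classify(int bits) {
        int exponent = (bits >>> 23) & 0xFF;
        int fraction = bits & 0x7FFFFF;
        if (exponent != 0xFF) {
            return "finite";
        }
        if (fraction == 0) {
            return (bits >>> 31) == 0 ? "+infinity" : "-infinity";
        }
        // Most significant fraction bit set: quiet; clear: signalling.
        return (fraction & 0x400000) != 0 ? "quiet NaN" : "signalling NaN";
    }

    public static void main(String[] args) {
        System.out.println(classify(Float.floatToIntBits(1.0f / 0.0f)));  // +infinity
        System.out.println(classify(Float.floatToIntBits(-1.0f / 0.0f))); // -infinity
        System.out.println(classify(Float.floatToIntBits(0.0f / 0.0f)));  // quiet NaN
        System.out.println(Float.POSITIVE_INFINITY == Float.NEGATIVE_INFINITY); // false
    }
}
```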

------------------------ I am a split line -----------------------------

Okay, that should give a general picture of the float type. Once float is understood, the double type is easy: it works the same way, only the exponent and the mantissa have different widths (11 and 52 bits, with a bias of 1023).
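
For comparison, the same field extraction for double might look like the sketch below. It assumes the standard double layout of 1 sign bit, 11 exponent bits with a bias of 1023, and 52 fraction bits; the class and method names are again invented for this example.

```java
final class DoubleFields {
    // Bit 63: the sign.
    static int sign(double d) {
        return (int) (Double.doubleToLongBits(d) >>> 63);
    }

    // Bits 62 to 52: the exponent as stored, with the bias of 1023 added.
    static int storedExponent(double d) {
        return (int) ((Double.doubleToLongBits(d) >>> 52) & 0x7FF);
    }

    // Bits 51 to 0: the 52 stored fraction bits.
    static long fraction(double d) {
        return Double.doubleToLongBits(d) & 0xFFFFFFFFFFFFFL;
    }

    public static void main(String[] args) {
        // 6.9 is 1.725 * 2^2, so the stored exponent is 2 + 1023.
        System.out.println(storedExponent(6.9)); // 1025
    }
}
```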

References:

__IEEE Standard 754 floating point numbers: http://steve.hollasch.net/cgindex/coding/ieeefloat.htm__

__IEEE 754-1985: http://en.wikipedia.org/wiki/IEEE_754-1985__

__IEEE 754-2008: http://en.wikipedia.org/wiki/IEEE_754-2008__

__Java Theory and Practice: Where are your decimal points?: http://www.ibm.com/developerworks/cn/java/j-jtp0114/__