**Floating Point Number**: a number belonging to a specific subset of the rational numbers, used in computers to approximate arbitrary real numbers. Specifically, the real number is obtained by multiplying an integer or fixed-point number (the significand, also called the mantissa) by an integer power of a base (usually 2 in computers). This representation is similar to scientific notation, which uses base 10.

A few days ago, while reading a C textbook, I came across this example:

- #include <stdio.h>
- int main(void) {
- int num = 9; /* num is an integer variable, set to 9 */
- float *pFloat = (float *)&num; /* pFloat points at num's memory, but treats it as a floating point number */
- printf("num value: %d\n", num); /* display num as an integer */
- printf("*pFloat value: %f\n", *pFloat); /* display the same memory as a floating point number */
- *pFloat = 9.0; /* store the floating point number 9.0 into num's memory */
- printf("num value: %d\n", num); /* display num as an integer */
- printf("*pFloat value: %f\n", *pFloat); /* display the same memory as a floating point number */
- }

The running result is as follows:

- num value: 9
- *pFloat value: 0.000000
- num value: 1091567616
- *pFloat value: 9.000000

This surprised me: num and *pFloat are the very same bits in memory, so why do they read so differently as a floating point number and as an integer?
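Strictly speaking, reading an int through a float pointer is not guaranteed behavior in standard C, so the output above depends on the compiler. Here is a minimal sketch of the same experiment done with memcpy, assuming float is a 32-bit IEEE 754 type. The function names bits_to_float and float_to_bits are made up for illustration and are not from the textbook.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret 32 raw bits as a float (no numeric conversion takes place). */
float bits_to_float(uint32_t bits)
{
    float f;
    assert(sizeof f == sizeof bits);   /* assumption: float is 32 bits wide */
    memcpy(&f, &bits, sizeof f);       /* copy the bytes unchanged */
    return f;
}

/* Reinterpret the 32 raw bits of a float as an unsigned integer. */
uint32_t float_to_bits(float f)
{
    uint32_t bits;
    assert(sizeof f == sizeof bits);
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Under that assumption, float_to_bits(9.0f) should give 1091567616, and bits_to_float(9) should give a tiny positive number that prints as 0.000000 with %f.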

To understand this result, you need to understand how floating point numbers are represented inside the computer. I did some reading, and below are my notes.

**1. Before discussing floating point numbers, let's take a look at how integers are represented inside the computer.**

- int num=9;

The statement above declares a variable of type int with the value 9 (1001 in binary). An ordinary 32-bit computer uses four bytes to store an int, so 9 is saved as 00000000 00000000 00000000 00001001, or 0x00000009 in hexadecimal.
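To see that bit pattern for yourself, a small helper can spell out the 32 bits of an integer. This is only a sketch; to_binary is a hypothetical name, and it groups the digits into four bytes the same way as the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write the 32 bits of v into out as four space-separated bytes.
   out needs room for 36 chars: 32 digits + 3 spaces + '\0'. */
void to_binary(uint32_t v, char out[36])
{
    int pos = 0;
    assert(out != NULL);
    for (int bit = 31; bit >= 0; bit--) {
        out[pos++] = ((v >> bit) & 1u) ? '1' : '0';
        if (bit % 8 == 0 && bit != 0)
            out[pos++] = ' ';          /* separator between bytes */
    }
    out[pos] = '\0';
}
```

Calling to_binary(9, buf) should fill buf with 00000000 00000000 00000000 00001001.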

So our question reduces to: why does 0x00000009, when read back as a floating point number, become 0.000000?

**2. According to the international standard IEEE 754, any binary floating point number V can be expressed in the following form:**

V = (-1)^s × M × 2^E

1) (-1)^s represents the sign bit. When s = 0, V is positive; when s = 1, V is negative.

2) M is the significand (the significant digits), which is greater than or equal to 1 and less than 2.

3) 2^E is the exponent part.

For example, decimal 5.0 is 101.0 in binary, which is equivalent to 1.01 × 2^2. Matching this against the form of V above gives s = 0, M = 1.01, E = 2.

Decimal -5.0 is -101.0 in binary, which is equivalent to -1.01 × 2^2. So s = 1, M = 1.01, E = 2.
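The two examples can be checked with a short sketch that repeatedly halves or doubles a number until it falls in the range 1 ≤ M < 2. The struct and the function name normalize are my own illustration, and the code only handles finite, nonzero inputs.

```c
#include <assert.h>

/* Hypothetical helper type: the three parts of V = (-1)^s * M * 2^E. */
struct parts {
    int s;      /* sign bit: 0 for positive, 1 for negative */
    double M;   /* significand, 1 <= M < 2 */
    int E;      /* true exponent */
};

struct parts normalize(double v)
{
    struct parts p;
    assert(v != 0.0 && v - v == 0.0);   /* reject zero, infinity and NaN */
    p.s = v < 0.0;
    p.M = p.s ? -v : v;
    p.E = 0;
    while (p.M >= 2.0) { p.M /= 2.0; p.E++; }   /* move the point left */
    while (p.M < 1.0)  { p.M *= 2.0; p.E--; }   /* move the point right */
    return p;
}
```

For 5.0 this should give s = 0, M = 1.25 and E = 2, where 1.25 in decimal is 1.01 in binary. For -5.0 only s changes, to 1.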

IEEE 754 stipulates that for 32-bit floating point numbers, the highest bit is the sign bit s, the next 8 bits are the exponent E, and the remaining 23 bits are the significand M.

For 64-bit floating point numbers, the highest bit is the sign bit s, the next 11 bits are the exponent E, and the remaining 52 bits are the significand M.
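The 32-bit layout can be taken apart with shifts and masks. The following is a sketch that assumes float is the 32-bit IEEE 754 format; struct fields and split_float are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The three stored fields of a 32-bit float, exactly as they sit in memory. */
struct fields {
    uint32_t s;   /* 1 bit  */
    uint32_t E;   /* 8 bits, still biased */
    uint32_t M;   /* 23 bits, without the implicit leading 1 */
};

struct fields split_float(float f)
{
    uint32_t bits;
    struct fields out;
    assert(sizeof f == sizeof bits);   /* assumption: float is 32 bits wide */
    memcpy(&bits, &f, sizeof bits);
    out.s = bits >> 31;                /* highest bit */
    out.E = (bits >> 23) & 0xFFu;      /* next 8 bits */
    out.M = bits & 0x7FFFFFu;          /* lowest 23 bits */
    return out;
}
```

For -5.0f this should give s = 1, E = 129 and M = 0x200000, which is the bit string 01 followed by 21 zeros.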

**3. IEEE 754 also has some special rules for the significand M and the exponent E.**

As mentioned above, 1 ≤ M < 2, so M can be written in the form 1.xxxxxx, where xxxxxx is the fractional part. IEEE 754 stipulates that when M is stored in the computer, its first digit is always assumed to be 1, so it can be dropped and only the xxxxxx part is saved. For example, when 1.01 is stored, only 01 is saved, and the leading 1 is added back when the number is read. The purpose is to gain one bit of precision. Taking a 32-bit floating point number as an example, only 23 bits are left for M, but by dropping the leading 1, 24 significant bits can be stored.

As for the exponent E, the situation is more complicated.

First, E is an unsigned integer (unsigned int). This means that if E is 8 bits wide, its range is 0 to 255; if E is 11 bits wide, its range is 0 to 2047. However, we know that the exponent in scientific notation can be negative, so IEEE 754 stipulates that the true value of E is the stored value minus a middle number (the bias). For an 8-bit E the bias is 127; for an 11-bit E it is 1023.

For example, the E of 2^10 is 10, so when it is saved as a 32-bit floating point number it must be stored as 10 + 127 = 137, that is, 10001001.
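The bias arithmetic is easy to check in code. This is a sketch for the 32-bit case only, again assuming float is the 32-bit IEEE 754 format; stored_exponent and true_exponent are names I made up.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { BIAS_32 = 127 };   /* middle number for the 8-bit exponent field */

/* True exponent -> value stored in the 8-bit E field. */
uint32_t stored_exponent(int true_e)
{
    assert(true_e >= -126 && true_e <= 127);   /* range of ordinary numbers */
    return (uint32_t)(true_e + BIAS_32);
}

/* Read the stored E field of a float and subtract the bias.
   Only meaningful when E is neither all 0s nor all 1s. */
int true_exponent(float f)
{
    uint32_t bits;
    assert(sizeof f == sizeof bits);
    memcpy(&bits, &f, sizeof bits);
    return (int)((bits >> 23) & 0xFFu) - BIAS_32;
}
```

stored_exponent(10) should return 137, and true_exponent(1024.0f) should return 10, since 1024 is 2^10.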

The exponent E then falls into one of three cases:

1) E is neither all 0s nor all 1s. The floating point number then follows the rules above: subtract 127 (or 1023) from the stored value of E to get the true exponent, and put the leading 1 back in front of the significand M.

2) E is all 0s. The true exponent is then 1-127 (or 1-1023), and the significand M is no longer given a leading 1 but is read as the fraction 0.xxxxxx. This is how ±0 and very small numbers close to 0 are represented.

3) E is all 1s. If the significand M is all 0s, the value is ±infinity (the sign depends on the sign bit s); if M is not all 0s, the value is not a number (NaN).
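The three cases can be written as a small classifier over the raw 32 bits. This is only an illustrative sketch; the enum and the function name classify_bits are hypothetical, and case 2 is split into zero and non-zero for clarity.

```c
#include <assert.h>
#include <stdint.h>

enum kind {
    KIND_NORMAL,      /* case 1: E neither all 0s nor all 1s */
    KIND_ZERO,        /* case 2: E all 0s, M all 0s */
    KIND_SUBNORMAL,   /* case 2: E all 0s, M not all 0s */
    KIND_INFINITY,    /* case 3: E all 1s, M all 0s */
    KIND_NAN          /* case 3: E all 1s, M not all 0s */
};

enum kind classify_bits(uint32_t bits)
{
    uint32_t E = (bits >> 23) & 0xFFu;   /* 8-bit exponent field */
    uint32_t M = bits & 0x7FFFFFu;       /* 23-bit significand field */

    if (E == 0)
        return M == 0 ? KIND_ZERO : KIND_SUBNORMAL;
    if (E == 0xFFu)
        return M == 0 ? KIND_INFINITY : KIND_NAN;
    return KIND_NORMAL;
}
```

For instance, 0x00000009 falls under case 2 and 0x41100000 under case 1.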

**4. That covers the rules for representing floating point numbers.**

Now let's return to the original question: why does 0x00000009, read back as a floating point number, become 0.000000?

First, split 0x00000009 into its fields: the first bit is the sign bit s = 0, the next 8 bits are the exponent E = 00000000, and the last 23 bits are the significand M = 000 0000 0000 0000 0000 1001.

Because the exponent E is all 0s, this is the second case from the previous section. Therefore, the floating point number V is written as:

V = (-1)^0 × 0.00000000000000000001001 × 2^(-126) = 1.001 × 2^(-146)

Clearly, V is a tiny positive number very close to 0, so printed with six decimal places it shows as 0.000000.
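The calculation can be double-checked by decoding the bits by hand and comparing with what the machine reports. The sketch below assumes float is the 32-bit IEEE 754 format; decode and as_float are hypothetical helper names, and decode deliberately leaves out the all-1s exponent case.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Decode 32 bits following the rules in the text (finite values only). */
double decode(uint32_t bits)
{
    int s = (int)(bits >> 31);
    int E = (int)((bits >> 23) & 0xFFu);
    double M = (bits & 0x7FFFFFu) / 8388608.0;   /* 0.xxxxxx, divisor is 2^23 */
    int e;
    double v;

    assert(E != 255);                /* infinity and NaN are not handled here */
    if (E == 0) {
        e = 1 - 127;                 /* E all 0s: no leading 1 */
    } else {
        M += 1.0;                    /* put the leading 1 back */
        e = E - 127;
    }
    v = M;
    for (; e > 0; e--) v *= 2.0;     /* multiply by 2^e step by step */
    for (; e < 0; e++) v /= 2.0;
    return s ? -v : v;
}

/* Let the hardware interpret the same bits, for comparison. */
float as_float(uint32_t bits)
{
    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

decode(0x00000009) and as_float(0x00000009) should agree, and printing either with %e instead of %f should show a value of roughly 1.26e-44 rather than 0.000000.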

**5. Let's look at the second part of the example.**

How is the floating point number 9.0 represented in binary? And what does that bit pattern equal when read as a decimal integer?

First, floating point 9.0 is 1001.0 in binary, that is, 1.001 × 2^3.

So the sign bit s = 0; the significand M is stored as 001 followed by 20 zeros, making up 23 bits; and the exponent E is stored as 3 + 127 = 130, that is, 10000010.

Therefore, written in the order s + E + M, the binary form is 0 10000010 001 0000 0000 0000 0000 0000. Read as a decimal integer, this 32-bit binary number is exactly 1091567616.
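Putting the three fields together is just a matter of shifting each one into place. Here is a sketch for 32-bit floats; the function name compose is made up, and frac stands for the 23 stored bits of M without the leading 1.

```c
#include <assert.h>
#include <stdint.h>

/* Assemble a 32-bit pattern from sign, true exponent and stored fraction. */
uint32_t compose(uint32_t s, int true_e, uint32_t frac)
{
    uint32_t E;
    assert(s <= 1u && frac <= 0x7FFFFFu);
    assert(true_e >= -126 && true_e <= 127);   /* ordinary numbers only */
    E = (uint32_t)(true_e + 127);              /* add the bias */
    return (s << 31) | (E << 23) | frac;
}
```

For 9.0, s = 0, the true exponent is 3, and frac is 001 followed by 20 zeros, which is 0x100000. compose(0, 3, 0x100000) should return 0x41100000, that is, 1091567616.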

*Original article address: http://www.ruanyifeng.com/blog/2010/06/ieee_floating-point_representation.html*