If a > 0, must 1 + a be greater than 1? In mathematics, the answer is yes. On a computer, the answer depends on the size of a and on the precision of the floating-point type. The following calculation can be done in MATLAB:
>> a = 1/2^52
a = 2.220446049250313e-016
>> 1+a > 1
ans = 1
>> a = 1/2^53
a = 1.110223024625157e-016
>> 1+a > 1
ans = 0
Clearly, when a equals 1/2^53, 1 + a > 1 is false.
1 Floating-point numbers
IEEE 754 defines single-precision and double-precision floating-point numbers, that is, float and double. A float has 32 bits and a double has 64 bits. Both consist of a sign bit, an exponent, and a mantissa.
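Bit positions of each field, with the field width in parentheses:

| Type | Sign bit | Exponent | Mantissa |
| --- | --- | --- | --- |
| Float | 31 (1) | 30-23 (8) | 22-0 (23) |
| Double | 63 (1) | 62-52 (11) | 51-0 (52) |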
The sign bit is 1 bit: 0 means positive and 1 means negative. If the exponent of a number is e, the exponent field stores bias + e. The bias is added so that negative exponents can be represented. Float's bias is 127 and double's bias is 1023. An exponent field of all 0s or all 1s has a special meaning and is not a normal exponent.
- The float exponent field has 8 bits and can take the values 1 to 254. Subtracting 127 gives an exponent range of -126 to 127.
- The double exponent field has 11 bits and can take the values 1 to 2046. Subtracting 1023 gives an exponent range of -1022 to 1023.
The exponent is base 2, and the mantissa is likewise binary. IEEE 754 requires floating-point numbers to be stored in normalized form, that is, with exactly one non-zero digit before the binary point. In binary the only non-zero digit is 1, so IEEE 754 stores only the digits after the binary point.
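The field layout described above can be checked with a short sketch. It is written in Python, whose float is an IEEE 754 double on common platforms; the function names decode_double and normalized_value are hypothetical and chosen only for illustration.

```python
import struct


def decode_double(x):
    """Split a Python float (an IEEE 754 double) into its three bit fields."""
    # Reinterpret the 8 bytes of the double as a 64-bit unsigned integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                  # bit 63
    exp_field = (bits >> 52) & 0x7FF   # bits 62-52, stores bias + e
    mantissa = bits & ((1 << 52) - 1)  # bits 51-0, digits after the binary point
    return sign, exp_field, mantissa


def normalized_value(sign, exp_field, mantissa):
    """Rebuild a normalized double from its fields (implicit leading 1)."""
    return (-1.0) ** sign * (1 + mantissa / 2**52) * 2.0 ** (exp_field - 1023)


if __name__ == "__main__":
    s, e, m = decode_double(-6.5)
    # Print the sign, the unbiased exponent, and the mantissa in hex.
    print(s, e - 1023, hex(m))
    print(normalized_value(s, e, m))
```

The rebuild formula only applies to normalized numbers, that is, to exponent fields from 1 to 2046.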
2 Error
Let's look at an example:
double a=0.2;
On a PC, the bytes stored in memory for a are:
9A 99 99 99 99 99 C9 3F
PCs are little-endian, that is, the low byte is stored first. Written with the high byte first, the value is:
3F C9 99 99 99 99 99 9A
We can see that the sign bit is 0. The exponent field is 0x3fc, that is, 1020; subtracting 1023 gives the exponent -3. The mantissa is 999999999999a. The complete number is therefore hexadecimal 1.999999999999a multiplied by 2^-3, that is:
a=(1+9*(1/16+1/16^2+...+1/16^12)+10/16^13)*2^-3
The sum (1/16 + ... + 1/16^12) can be computed with the geometric series formula a1*(1-q^n)/(1-q), where a1 = 1/16, q = 1/16, and n = 12. Therefore:
a=(1+9*(1-1/16^12)/15+10/16^13)*2^-3
Evaluating this formula with the Windows Calculator gives:
a=0.2000 0000 0000 0000 1110 2230 2462 5157
The calculator result is itself rounded, but it shows the error introduced when a double represents 0.2. This example shows that representing an arbitrary real number as a binary floating-point number of finite word length may introduce an error. Let e be the exponent of a real number a and n the number of mantissa bits. Then:
error < (1/2^n) * 2^e
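The worked example for 0.2 can be reproduced without a calculator. The following is a minimal sketch in Python, assuming Python's float is an IEEE 754 double; it uses exact rational arithmetic from the standard fractions module, and the variable names are illustrative only.

```python
import struct
from fractions import Fraction

a = 0.2

# Bytes as stored on a little-endian machine, low byte first.
little_endian = " ".join(f"{b:02X}" for b in struct.pack("<d", a))

# Evaluate the formula from the text exactly, using rational arithmetic.
formula = (
    1
    + 9 * sum(Fraction(1, 16**k) for k in range(1, 13))
    + Fraction(10, 16**13)
) * Fraction(1, 2**3)

# Fraction(a) is the exact value of the stored double.
stored = Fraction(a)

# Difference between the stored double and the real number 1/5.
error = stored - Fraction(1, 5)

print(little_endian)
print(formula == stored)
print(float(error))
# Error bound from the text with n = 52 and e = -3.
print(error < Fraction(1, 2**52) * Fraction(1, 2**3))
```

Because Fraction is exact, the comparison between the formula and the stored value is not itself affected by rounding.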
3 Precision
Machine precision ε can be defined as the smallest power of 2 that satisfies
fl(1+ε)>1
where fl(1+ε) is the floating-point representation of 1+ε. Equivalently, ε is the gap between 1 and the next larger floating-point number. The machine precision of double is therefore 1/2^52, and that of float is 1/2^23. MATLAB uses double. The value 1 + 1/2^53 lies exactly halfway between the doubles 1 and 1 + 1/2^52, and under the default round-to-nearest-even rule it is stored as 1, so 1 + 1/2^53 is not greater than 1.
For normalized numbers there is an implicit 1 before the binary point, so a float has 24 significant bits, which corresponds to about 7 significant decimal digits, and a double has 53 significant bits, which corresponds to about 16 significant decimal digits.
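The machine precision of double can also be found by experiment. This is a minimal sketch in Python, assuming Python's float is an IEEE 754 double with the default rounding mode; machine_epsilon is a hypothetical helper name.

```python
import sys


def machine_epsilon():
    """Halve a power of two until adding the next half to 1 leaves 1 unchanged."""
    eps = 1.0
    while 1.0 + eps / 2 > 1.0:
        eps /= 2
    return eps


eps = machine_epsilon()
print(eps == 2.0**-52)                # compare with the value given in the text
print(eps == sys.float_info.epsilon)  # compare with Python's own constant
print(1.0 + 2.0**-53 > 1.0)           # the case from the MATLAB session
```

The loop only tries powers of 2, which matches the definition of ε used in this section.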
4 Special floating-point numbers
An exponent field of all 0s or all 1s has a special meaning. Let's look at these special floating-point numbers:
- Exponent field and mantissa all 0: the value is 0. Depending on the sign bit, it is +0 or -0.
- Exponent field all 0, mantissa not all 0: these are denormalized numbers, that is, the mantissa is not assumed to have a 1 before the binary point. Put differently, these numbers are so close to 0 that the exponent cannot get any smaller, so they cannot be written in normalized form. For example, for the double 0000 0000 0000 0001 the mantissa is 0 0000 0000 0001, or 1/2^52, and the corresponding number is 1/(2^52) * 2^-1022, that is, 4.9406564584124654e-324.
- Exponent field all 1, mantissa all 0: the value is infinity, that is, INF. Depending on the sign bit, it is +INF or -INF.
- Exponent field all 1, mantissa not all 0: the value is NaN, that is, not a number. A NaN whose highest mantissa bit is 1 is called a qNaN (quiet NaN). A NaN whose highest mantissa bit is 0 is called an sNaN (signalling NaN). Generally, a qNaN represents the result of an indeterminate operation, and an sNaN is used to signal an invalid operation.
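The rules in this list can be written down as a small classifier over 64-bit patterns. The following Python sketch is illustrative only; the names classify and denormalized_value are hypothetical, and the patterns are passed as unsigned integers.

```python
EXP_MASK = 0x7FF
MANT_MASK = (1 << 52) - 1
QUIET_BIT = 1 << 51  # highest mantissa bit


def classify(bits):
    """Name the category of a 64-bit double pattern, following the list above."""
    exp_field = (bits >> 52) & EXP_MASK
    mantissa = bits & MANT_MASK
    if exp_field == 0:
        return "zero" if mantissa == 0 else "denormalized"
    if exp_field == EXP_MASK:
        if mantissa == 0:
            return "infinity"
        return "qNaN" if mantissa & QUIET_BIT else "sNaN"
    return "normalized"


def denormalized_value(bits):
    """Value of a denormalized pattern: mantissa/2^52 * 2^-1022, no implicit 1."""
    sign = -1.0 if bits >> 63 else 1.0
    return sign * (bits & MANT_MASK) / 2**52 * 2.0**-1022


if __name__ == "__main__":
    for pattern in (0x0000000000000001, 0x7FF0000000000000, 0x7FF8000000000000):
        print(f"{pattern:016x}", classify(pattern))
    print(denormalized_value(0x0000000000000001))
```

The sign bit does not affect the category, so classify gives the same answer for a pattern and for its negative counterpart.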
In a computer, a double is a 64-bit number. From 0x0000 0000 0000 0000 to 0xFFFF FFFF FFFF FFFF, each 64-bit pattern corresponds to a floating-point number or a NaN. I wrote a small program to print typical floating-point numbers in the order of their 64-bit unsigned integer representation. The first column of the table is the internal representation of the floating-point number, printed with the high byte first for readability. The second column is the corresponding floating-point number. The third column is a comment, which gives a MATLAB formula for computing the denormalized and normalized numbers from the internal representation. Note that in C/C++, 2^52 must be written as pow(2.0, 52.0).
| Internal representation (hex) | Floating-point number | Comment |
| --- | --- | --- |
| 0000 0000 0000 0000 | 0.0000000000000000e+000 | +0 |
| 0000 0000 0000 0001 | 4.9406564584124654e-324 | 1/(2^52)*2^-1022 |
| 000f ffff ffff ffff | 2.2250738585072009e-308 | .5*(1-.5^52)/(1-.5)*2^-1022 |
| 0010 0000 0000 0000 | 2.2250738585072014e-308 | 1.0*2^-1022 |
| 0010 0000 0000 0001 | 2.2250738585072019e-308 | (1+1/2^52)*2^-1022 |
| 001f ffff ffff ffff | 4.4501477170144023e-308 | (1+.5*(1-.5^52)/(1-.5))*2^-1022 |
| 0020 0000 0000 0000 | 4.4501477170144028e-308 | 1.0*2^-1021 |
| 3ff0 0000 0000 0000 | 1.0000000000000000e+000 | 1.0 |
| 3ff0 0000 0000 0001 | 1.0000000000000002e+000 | 1.0+1/(2^52) |
| 3fff ffff ffff ffff | 1.9999999999999998e+000 | 1+.5*(1-.5^52)/(1-.5) |
| 4000 0000 0000 0000 | 2.0000000000000000e+000 | 1.0*2^1 |
| 7fef ffff ffff ffff | 1.7976931348623157e+308 | (1+.5*(1-.5^52)/(1-.5))*2^1023 |
| 7ff0 0000 0000 0000 | 1.#INF000000000000e+000 | +INF |
| 7ff0 0000 0000 0001 | 1.#SNAN00000000000e+000 | sNaN |
| 7ff7 ffff ffff ffff | 1.#SNAN00000000000e+000 | sNaN |
| 7ff8 0000 0000 0000 | 1.#QNAN00000000000e+000 | qNaN |
| 7fff ffff ffff ffff | 1.#QNAN00000000000e+000 | qNaN |
| 8000 0000 0000 0000 | 0.0000000000000000e+000 | -0 |
| 8000 0000 0000 0001 | -4.9406564584124654e-324 | -(1/(2^52)*2^-1022) |
| 800f ffff ffff ffff | -2.2250738585072009e-308 | -(.5*(1-.5^52)/(1-.5)*2^-1022) |
| 8010 0000 0000 0000 | -2.2250738585072014e-308 | -(1.0*2^-1022) |
| 8010 0000 0000 0001 | -2.2250738585072019e-308 | -((1+1/2^52)*2^-1022) |
| 801f ffff ffff ffff | -4.4501477170144023e-308 | -((1+.5*(1-.5^52)/(1-.5))*2^-1022) |
| 8020 0000 0000 0000 | -4.4501477170144028e-308 | -(1.0*2^-1021) |
| bff0 0000 0000 0000 | -1.0000000000000000e+000 | -1.0 |
| bfff ffff ffff ffff | -1.9999999999999998e+000 | -(1+.5*(1-.5^52)/(1-.5)) |
| c000 0000 0000 0000 | -2.0000000000000000e+000 | -(1.0*2^1) |
| ffef ffff ffff ffff | -1.7976931348623157e+308 | -((1+.5*(1-.5^52)/(1-.5))*2^1023) |
| fff0 0000 0000 0000 | -1.#INF000000000000e+000 | -INF |
| fff0 0000 0000 0001 | -1.#SNAN00000000000e+000 | sNaN |
| fff7 ffff ffff ffff | -1.#SNAN00000000000e+000 | sNaN |
| fff8 0000 0000 0000 | -1.#IND000000000000e+000 | qNaN |
| ffff ffff ffff ffff | -1.#QNAN00000000000e+000 | qNaN |
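The author's program is not shown. As a rough sketch of how such rows could be produced, the following Python code reinterprets a few 64-bit patterns as doubles and prints them with 16 digits after the decimal point. The helper name bits_to_double and the selection of patterns are assumptions made for illustration.

```python
import struct


def bits_to_double(bits):
    """Reinterpret a 64-bit unsigned integer as an IEEE 754 double."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


# A few patterns chosen for illustration, in increasing integer order.
PATTERNS = [
    0x0000000000000001,
    0x000FFFFFFFFFFFFF,
    0x0010000000000000,
    0x3FF0000000000001,
    0x7FEFFFFFFFFFFFFF,
    0x7FF0000000000000,
]

for bits in PATTERNS:
    hex_digits = f"{bits:016x}"
    # Group the 16 hex digits in blocks of four, high byte first.
    grouped = " ".join(hex_digits[i:i + 4] for i in range(0, 16, 4))
    print(f"{grouped} | {bits_to_double(bits):.16e}")
```

Python formats exponents and special values differently from the table: it prints infinity as inf, for instance, so the output will not match the second column character for character.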
From the table, we can see that the internal representation of double is quite regular. In increasing order of the 64-bit pattern come +0, the positive denormalized numbers, the positive normalized numbers, positive infinity, the NaNs with sign bit 0, -0, the negative denormalized numbers, the negative normalized numbers, negative infinity, and the NaNs with sign bit 1.
The internal representation of double preserves the order of floating-point numbers. That is, if positive double a < positive double b, then the 64-bit unsigned integer corresponding to a is less than the 64-bit unsigned integer corresponding to b. For negative numbers the sign bit is set and the remaining bits encode the magnitude, so the order of the floating-point numbers is the reverse of the order of the corresponding integers. Float follows similar rules.
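The ordering property can be checked on a handful of values. This is a minimal sketch in Python, assuming Python's float is an IEEE 754 double; the helper double_to_bits and the sample values are chosen only for illustration.

```python
import struct


def double_to_bits(x):
    """Reinterpret an IEEE 754 double as a 64-bit unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


# Non-negative doubles in increasing order, from +0 up to +INF.
positives = [0.0, 5e-324, 2.2250738585072014e-308, 0.2, 1.0, 2.0,
             1.7976931348623157e308, float("inf")]
# Their negations, so the values are in decreasing order.
negatives = [-x for x in positives]

pos_bits = [double_to_bits(x) for x in positives]
neg_bits = [double_to_bits(x) for x in negatives]

# Increasing positive values give increasing integers.
print(pos_bits == sorted(pos_bits))
# Decreasing negative values also give increasing integers: the order is reversed.
print(neg_bits == sorted(neg_bits))
```

NaNs are left out of the sample because they do not take part in the usual ordering of values.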
5 Conclusion
Float and int are both 32 bits wide, but the mantissa of a float has only 23 bits (24 significant bits including the implicit 1). An int therefore has higher precision than a float, while a float has a larger value range: float sacrifices precision in exchange for a larger representable range. The mantissa of a double has 52 bits, which is more than the 32 bits of an int, so a double can represent any int without loss of precision. Double is the type most commonly used in scientific computing, and understanding its internal representation and limitations helps us use it better.
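The difference between float and double when storing integers can be seen with one value that needs 25 significant bits. The sketch below is in Python and emulates a 32-bit float by a round trip through the standard struct module; to_float32 is a hypothetical helper name.

```python
import struct


def to_float32(x):
    """Round a value to the nearest IEEE 754 single-precision number."""
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]


n = 2**24 + 1  # needs 25 significant bits

print(to_float32(n) == n)  # does a float hold n exactly?
print(float(n) == n)       # does a double hold n exactly?

# The extreme values of a 32-bit int survive a round trip through double.
for i in (-2**31, -2**31 + 1, 2**31 - 2, 2**31 - 1):
    assert int(float(i)) == i
```

Integers up to 2^24 in magnitude are still exact in a float, so the loss only shows up for larger values.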