Floating-point numbers in computer memory
Introduction to the floating-point concept
Over the history of computer systems, several methods have been proposed for representing real numbers. One is fixed-point representation, which fixes the radix point at a set position, as in 11001000.00110001: a 16-bit (2-byte) fixed-point number that uses the first 8 bits for the integer part and the last 8 bits for the fractional part. This method is intuitive, but the fixed position of the point also fixes the number of bits available to the integer and fractional parts, which makes it ill-suited to representing very large and very small numbers at the same time. In the end, most modern computers follow the IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754) and express real numbers in scientific notation, that is, with a mantissa (also called the significand), a base, an exponent, and a sign. For example, 123.45 can be written in decimal scientific notation as 1.2345 × 10 ^ 2, where 1.2345 is the mantissa, 10 is the base, and 2 is the exponent. Floating-point numbers use the exponent to let the radix point "float", which allows them to cover a much wider range of real numbers.
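As a rough illustration of the two representations described above, the following Python sketch decodes the 16-bit fixed-point pattern from the example and splits 123.45 into a decimal mantissa and exponent. The helper names `fixed_8_8_to_float` and `to_scientific` are made up for this illustration, and the fixed-point value is assumed to be unsigned.

```python
import math

def fixed_8_8_to_float(bits: str) -> float:
    """Decode a 16-bit fixed-point string such as '11001000.00110001'.

    Assumes 8 integer bits, 8 fractional bits, and an unsigned value.
    """
    raw = int(bits.replace(".", ""), 2)   # read all 16 bits as one integer
    return raw / 2**8                     # the point sits 8 bits from the right

def to_scientific(x: float) -> tuple:
    """Split a nonzero x into (mantissa, exponent) with x ~ mantissa * 10**exponent."""
    exponent = math.floor(math.log10(abs(x)))
    return x / 10**exponent, exponent

print(fixed_8_8_to_float("11001000.00110001"))  # 200.19140625
print(to_scientific(123.45))                    # mantissa close to 1.2345, exponent 2
```

The fixed-point decoder always divides by the same power of two, whereas the scientific form carries its scale in a separate exponent, which is what lets the point move.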
Format of floating-point numbers in memory
The IEEE 754 standard defines five floating-point formats, named after their base and the number of bits they occupy in memory. The five formats are binary32 (single precision), binary64 (double precision), binary128 (quadruple precision), decimal64, and decimal128.
Programming languages and compilers generally support the first two formats, which correspond to float and double respectively. For the binary 128-bit quadruple-precision format, compilers generally use an 80-bit representation instead, and only a few treat it as a true 128-bit type.
In the IEEE standard, a floating-point number is stored in memory by dividing the bits of a run of consecutive bytes of a given length into three consecutive fields of fixed width: the sign field, the exponent field, and the mantissa field. For float data, the memory layout is:
S (1 bit) | E (8 bits) | M (23 bits)
For the double type, the layout is:
S (1 bit) | E (11 bits) | M (52 bits)
As shown above, the float type occupies 1 + 8 + 23 = 32 bits in memory and the double type occupies 1 + 11 + 52 = 64 bits.
- The first field, S, is the sign: 1 means a negative number and 0 means a positive number.
- The second field is the exponent field. For single-precision float, the exponent field has 8 bits and can hold 256 values (0 to 255). The exponent determines the position of the binary point, and so the magnitude of the value. Because the exponent can be positive or negative, negative exponents are handled by adding a bias to the actual exponent and storing the sum in the exponent field. The bias is 127 for single precision and 1023 for double precision, so the actual exponent is recovered by subtracting the bias from the stored value. For example, a stored value of 64 in a single-precision exponent field stands for an actual exponent of 64 − 127 = −63. With the bias, the actual exponents that single precision can express range from −127 to +128 (both ends included). As we will see shortly, the actual exponents −127 (stored as all 0s) and +128 (stored as all 1s) are reserved for special values, so the effective exponent range is −126 to +127.
- The third field is the mantissa field, which is 23 bits long in single precision and 52 bits long in double precision. Take a single-precision mantissa field holding 00001001000101010101000. The exponent in the second field specifies the position of the binary point within this bit string; by default, the point sits just before the first bit. If the exponent were −1, the number would be .000001001000101010101000; if it were +1, the number would be 0.0001001000101010101000.
The goal of floating-point numbers is to represent real numbers with both high precision and a large range in as few bits as possible. The range is determined by the length of the exponent field, and the precision by the length of the mantissa field. This is why double has both higher precision and a larger range than float, at the cost of more memory.
The interpretation of the mantissa field just described does not maximize precision, however. In the example there are four 0s before the first 1 in the mantissa string. These four 0s are redundant: leading 0s reappear automatically when the binary point is moved to the left, so they can be dropped, freeing four more bits at the end of the mantissa field for a more precise value. In other words, the first bit of the mantissa can always be made 1. Since it is always 1, it does not need to be stored at all: the mantissa is taken to have the form 1.xxxx...xxx, and the stored bits begin right after the binary point. Under this interpretation, the mantissa field in the example above represents:
1.00001001000101010101000
so 23 stored bits carry 24 bits of information (the binary point itself takes up no storage).
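The three fields described above can be pulled out of a real value with a few bit operations. The sketch below uses Python's standard `struct` module to obtain the 32-bit pattern of a single-precision number. The function name `float32_fields` is an illustrative assumption, and the sample value 0.15625 is chosen only because it is exactly representable.

```python
import struct

def float32_fields(x: float):
    """Return (sign, stored_exponent, mantissa_bits) of x as a 32-bit float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # reinterpret float as uint32
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, still biased by 127
    mantissa = bits & 0x7FFFFF        # 23 bits, hidden leading 1 not included
    return sign, exponent, mantissa

sign, exponent, mantissa = float32_fields(0.15625)   # 0.15625 = 1.01b * 2**-3
print(sign, exponent - 127, format(mantissa, "023b"))
# prints: 0 -3 01000000000000000000000
```

Only the bits after the binary point (01 followed by zeros) are stored; the leading 1 of 1.01 is implied, and subtracting the bias of 127 from the stored exponent 124 gives the actual exponent −3.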
Converting a floating-point number to a real number
Now that we understand how floating-point numbers are stored in memory, let us work through an example to reinforce that understanding. Suppose memory holds the following 32-bit floating-point number:
1 10000010 10110000000000000000000
Explanation:
S: 1, so the number is negative.
E: 10000010 = 130, and 130 − 127 = 3, so the binary point of the mantissa shifts three places to the right.
M: 10110000000000000000000; with the omitted leading 1 and the binary point restored, the mantissa is 1.10110000000000000000000.
All three parts are now known. Shifting the binary point according to the exponent gives:
1101.10000000000000000000
Converting to decimal, the integer part is 1 × 2^3 + 1 × 2^2 + 0 × 2^1 + 1 × 2^0 = 13 and the fractional part is 1 × 2^-1 = 0.5.
Adding the integer and fractional parts and applying the sign gives the real number −13.5.
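The conversion steps above can be written out as a short Python sketch. The function name `decode_float32` is hypothetical, it handles only normalized numbers (exponent field neither all 0s nor all 1s), and the result is cross-checked against Python's `struct` module.

```python
import struct

def decode_float32(bit_string: str) -> float:
    """Decode a normalized 32-bit pattern given as 'S EEEEEEEE MMM...M'."""
    bits = bit_string.replace(" ", "")
    assert len(bits) == 32
    sign = -1.0 if bits[0] == "1" else 1.0
    exponent = int(bits[1:9], 2) - 127          # remove the bias
    mantissa = 1 + int(bits[9:], 2) / 2**23     # restore the hidden leading 1
    return sign * mantissa * 2.0**exponent

# Sign 1, exponent 10000010, mantissa 1011 followed by 19 zeros.
pattern = "1 10000010 1011" + "0" * 19
value = decode_float32(pattern)

# Cross-check: let struct reinterpret the same 32 bits as a float.
(check,) = struct.unpack(">f", int(pattern.replace(" ", ""), 2).to_bytes(4, "big"))
print(value, check)  # -13.5 -13.5
```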
Special values
We already know that the exponent field can express actual exponents in the range −127 to +128 (both ends included). Of these, −127 (stored as all 0s) and +128 (stored as all 1s) are reserved for special values.
Special values in floating-point numbers are mainly used to handle special cases or errors. For example, when a program takes the square root of a negative number, a special return value is used to flag the error: NaN (Not a Number). Without such a value, the only way to deal with the error would be to abort the computation abruptly. Besides NaN, the IEEE standard also defines ±0, ±∞, and denormalized numbers.
For single-precision floating-point numbers, all of these special values are encoded using the reserved exponents −127 and +128. A NaN has exponent 128 (the exponent field is all 1s) and a nonzero mantissa field. The IEEE standard does not prescribe a specific mantissa value, so NaN is in fact not a single value but a family of values. Implementations are free to choose the mantissa bits used to express NaN. For example, the Java constant Float.NaN may be represented as 01111111110000000000000000000000, in which the first bit of the mantissa field is 1 and the rest are 0 (not counting the hidden bit), though this depends on the system's hardware architecture. Java even lets programmers construct NaN values with specific bit patterns of their own (through the Float.intBitsToFloat() method), which can be used, for instance, to carry diagnostic information inside a NaN.
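To make the NaN encoding concrete, the Python sketch below checks the exponent and mantissa fields of a 32-bit pattern directly. The helper `is_nan_pattern` is an illustrative name; `0x7FC00000` is the pattern quoted above for Float.NaN, `0x7FC00001` is an arbitrary pattern with a different mantissa chosen for this example, and `0x7F800000` has an all-zero mantissa.

```python
import math
import struct

def is_nan_pattern(bits: int) -> bool:
    """True if a 32-bit pattern has exponent all 1s and a nonzero mantissa."""
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return exponent == 0xFF and mantissa != 0

for bits in (0x7FC00000, 0x7FC00001, 0x7F800000):
    # Reinterpret the same 32 bits as a float and ask the math module as well.
    (value,) = struct.unpack(">f", bits.to_bytes(4, "big"))
    print(hex(bits), is_nan_pattern(bits), math.isnan(value))
```

The first two patterns are both reported as NaN even though their mantissa bits differ, while the third, with an all-zero mantissa, is not a NaN.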
Like NaN, infinity has exponent 128, but its mantissa field must be all 0s. Infinity expresses overflow in a computation. For example, when two very large numbers are multiplied, both operands may be representable as floating-point numbers while the product is too large to be stored and must be rounded. Under the IEEE standard, the result is not rounded to the largest representable floating-point number (which could be so far from the true result as to be meaningless) but to infinity. The same holds for negative results, except that they are rounded to negative infinity, that is, infinity with the sign field set to 1.
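The overflow behaviour can be observed directly. The sketch below uses Python floats, which are double precision on typical platforms (an assumption about the runtime, not something stated in the text), and then prints the single-precision bit patterns of both infinities. The helper name `float32_bits` is made up for this example.

```python
import math
import struct

big = 1e308                      # close to the largest finite double
product = big * 10               # too large for a double: becomes +infinity
neg_product = -big * 10          # same on the negative side: -infinity
print(product, neg_product)      # inf -inf

def float32_bits(x: float) -> str:
    """Return the 32-bit pattern of x as 'S EEEEEEEE MMM...M'."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = format(bits, "032b")
    return f"{s[0]} {s[1:9]} {s[9:]}"

print(float32_bits(math.inf))    # sign 0, exponent all 1s, mantissa all 0s
print(float32_bits(-math.inf))   # sign 1, exponent all 1s, mantissa all 0s
```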
In the IEEE floating-point format, the 1 to the left of the binary point is hidden, whereas zero obviously requires the whole mantissa to be zero. Zero therefore cannot be expressed directly in this format and has to be handled as a special case.
In practice, zero is stored with the mantissa field all 0s and the exponent at emin − 1 = −127, so the exponent field is all 0s as well. Taking the sign field into account, there are two zeros, +0 and −0. Unlike positive and negative infinity, the IEEE standard specifies that positive and negative zero compare equal.
That zero comes in a positive and a negative form is admittedly confusing. It is the outcome of weighing several pros and cons from numerical analysis. A signed zero avoids losing sign information in operations, especially those involving infinity. For example, if zero were unsigned, the identity 1/(1/x) = x would no longer hold in general for x = ±∞: 1 divided by either positive or negative infinity would give the same zero, and 1 divided by that zero would give positive infinity, so the sign would be lost. The only way around this would be for infinity to have no sign either. But the sign of infinity records on which side of the number line the overflow occurred, and that information clearly cannot be thrown away. Signed zero creates problems of its own, though. For example, when x = y, the equation 1/x = 1/y fails if x and y are +0 and −0 respectively, because the two sides are then positive and negative infinity. One way to resolve this would be to treat the zeros as ordered, as is done for the infinities. But if the zeros were ordered, even a simple test such as if (x == 0) would become unreliable, since x could be ±0. Of the two evils the lesser was chosen: the two zeros compare equal.
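A few of these properties of signed zero can be checked in Python. Note one limitation of the sketch: Python raises `ZeroDivisionError` for float division by zero instead of returning infinity, so the full 1/(1/x) round trip is not shown; only the sign carried by the intermediate zero is.

```python
import math
import struct

pos_zero, neg_zero = 0.0, -0.0

print(pos_zero == neg_zero)               # True: the two zeros compare equal
print(struct.pack(">f", pos_zero).hex())  # 00000000
print(struct.pack(">f", neg_zero).hex())  # 80000000: only the sign bit differs

# Dividing 1 by an infinity keeps the sign in the resulting zero.
z_pos = 1.0 / math.inf
z_neg = 1.0 / -math.inf
print(math.copysign(1.0, z_pos), math.copysign(1.0, z_neg))  # 1.0 -1.0
```

Because the two zeros compare equal, a test like `x == 0` cannot tell them apart; recovering the sign takes something like `math.copysign`, as used here.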