http://www.cnblogs.com/kingwolfofsky/archive/2011/07/21/2112299.html
Floating-point numbers
1. What are floating-point numbers
In the course of computer history, several ways of representing real numbers have been proposed. A typical example, in contrast to floating-point numbers, is the fixed-point number. In this representation, the decimal point is fixed at the same position for every value. Currency amounts can be expressed this way: with four digits of precision and two digits after the decimal point, values such as 99.00 or 00.99 can be represented. Because the decimal point is fixed, the four digits can be used directly to express the value. The NUMBER data type in SQL is defined using fixed-point numbers. Another proposal was a rational representation, which expresses a real number as the ratio of two integers.
The disadvantage of fixed-point representation is its rigidity: the fixed position of the decimal point determines the number of digits in the integer and fractional parts, which makes it hard to express very large and very small numbers at the same time. In the end, most modern computer systems adopted the so-called floating-point representation. It uses scientific notation to express real numbers with a mantissa (sometimes called the significand; strictly speaking, "mantissa" is an informal synonym for significand), a base, an exponent, and a sign. For example, in decimal scientific notation 123.45 can be expressed as 1.2345 × 10^2, where 1.2345 is the mantissa, 10 is the base, and 2 is the exponent. The exponent lets the decimal point "float", so a much larger range of real numbers can be expressed flexibly.
2. IEEE floating-point numbers
A floating-point number is stored in a fixed number of contiguous bytes. The IEEE standard divides the bits of this fixed-length encoding into three fields of specified widths: a sign field, an exponent field, and a mantissa field, which store the sign, the exponent, and the mantissa of the binary floating-point value, respectively. The value is thus expressed by a mantissa together with an adjustable exponent, hence the name "floating point".
IEEE 754 specifies:

Two basic floating-point formats: single precision and double precision.
- The IEEE single-precision format has 24 bits of significand precision and occupies 32 bits in total.
- The IEEE double-precision format has 53 bits of significand precision and occupies 64 bits in total.

Two extended floating-point formats: single-precision extended and double-precision extended. The standard does not fix the exact precision and size of these formats, but it does specify minimums. For example, the IEEE double-precision extended format must have at least 64 bits of significand precision and occupy at least 79 bits in total.
See the following illustration for a specific format:
3. Floating point format
A floating-point format is a data structure that specifies the fields of a floating-point number, their layout, and their arithmetic interpretation. A floating-point storage format specifies how such a format is laid out in memory. The formats themselves are defined by the IEEE standard; the choice of storage format is left to the implementation.
Assembly-language software sometimes depends on the storage format in use, but higher-level languages usually deal only with the linguistic notion of floating-point data types. These types have different names in different high-level languages and correspond to the IEEE formats shown in the table below.
IEEE precision                      C/C++           Fortran (SPARC only)
single precision                    float           REAL or REAL*4
double precision                    double          DOUBLE PRECISION or REAL*8
double precision extended           long double     REAL*16
IEEE 754 explicitly defines the single-precision and double-precision floating-point formats, plus a set of extended formats for each of the two basic formats. The long double and REAL*16 types shown in the table correspond to the double-precision extended format defined by the IEEE standard.
3.1. Single precision format
The IEEE single-precision format consists of three fields: a 23-bit fraction f, an 8-bit biased exponent e, and a 1-bit sign s. These fields are stored contiguously in one 32-bit word (as shown in the following illustration).
Bits 0:22 contain the 23-bit fraction f, with bit 0 the least significant bit of the fraction and bit 22 the most significant.
The IEEE standard requires floating-point numbers to be normalized, which means the mantissa always has a 1 to the left of the binary point. Since this 1 is always there, it can be omitted when the mantissa is stored, freeing one bit to hold more of the mantissa. Thus the 23-bit fraction field actually expresses a 24-bit mantissa.
Bits 23:30 contain the 8-bit biased exponent e, with bit 23 the least significant bit of the biased exponent and bit 30 the most significant.
An 8-bit exponent field can hold 256 values, from 0 to 255. But exponents can be positive or negative, so to handle negative exponents, a bias is added to the actual exponent before it is stored in the exponent field. The bias for single precision is 127, so the actual exponents a single-precision number can express range from -127 to +128 (inclusive). In this article the minimum and maximum exponents are written Emin and Emax respectively. As described later, the actual exponent values -127 (stored as all 0s) and +128 (stored as all 1s) are reserved for special values.
Bit 31, the highest bit, contains the sign bit s: s = 0 indicates a positive number, s = 1 a negative one.
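As a quick illustration, the three fields can be pulled out of a float in Java with Float.floatToIntBits, which returns the raw IEEE single-precision bit pattern. This is only a sketch; the class name and the example value -6.25f are chosen for illustration:

```java
public class FloatFields {
    public static void main(String[] args) {
        int bits = Float.floatToIntBits(-6.25f);   // raw IEEE 754 bit pattern

        int sign     = (bits >>> 31) & 0x1;        // bit 31
        int exponent = (bits >>> 23) & 0xFF;       // bits 23..30 (biased by 127)
        int fraction = bits & 0x7FFFFF;            // bits 0..22

        // -6.25 = -1.1001 × 2^2, so sign = 1, biased exponent = 2 + 127 = 129
        System.out.println("sign     = " + sign);
        System.out.println("exponent = " + exponent + " (actual " + (exponent - 127) + ")");
        System.out.println("fraction = 0x" + Integer.toHexString(fraction));
    }
}
```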
3.2. Double precision format
The IEEE double-precision format consists of three fields: a 52-bit fraction f, an 11-bit biased exponent e, and a 1-bit sign s. These fields are stored contiguously in two 32-bit words (as shown in the following figure). On the SPARC architecture, the higher-addressed 32-bit word contains the 32 least significant bits of the fraction, whereas on the x86 architecture the lower-addressed 32-bit word contains them.
If f[31:0] denotes the 32 least significant bits of the fraction, then bit 0 is the least significant bit of the entire fraction and bit 31 is the most significant of these 32 bits. In the other 32-bit word, bits 0:19 contain the 20 most significant bits of the fraction, f[51:32], where bit 0 is the least significant of these 20 bits and bit 19 is the most significant bit of the entire fraction; bits 20:30 contain the 11-bit biased exponent e, with bit 20 the least significant bit of the exponent and bit 30 the most significant; and the highest bit, bit 31, contains the sign bit s.
The figure above renumbers the two consecutive 32-bit words as one 64-bit word, in which:
Bits 0:51 contain the 52-bit fraction f, with bit 0 the least significant bit of the fraction and bit 51 the most significant.
As in the single-precision case, the IEEE standard requires the number to be normalized, so the 1 to the left of the binary point is omitted from storage; the 52-bit fraction field therefore expresses a 53-bit mantissa.
Bits 52:62 contain the 11-bit biased exponent e, with bit 52 the least significant bit of the biased exponent and bit 62 the most significant.
An 11-bit exponent field can hold 2048 values, from 0 to 2047. Since exponents can be negative, a bias is again added to the actual exponent before storage. The bias for double precision is 1023, so the actual exponents a double-precision number can express range from -1023 to +1024 (inclusive). Again writing the minimum and maximum exponents as Emin and Emax, the actual exponent values -1023 (stored as all 0s) and +1024 (stored as all 1s) are, as described later, reserved for special values.
Bit 63, the highest bit, contains the sign bit s: s = 0 indicates a positive number, s = 1 a negative one.
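The layout can also be verified in the other direction: the sketch below takes a double apart with Double.doubleToLongBits and rebuilds its value from the sign, biased exponent, and fraction. The class name is only illustrative:

```java
public class DoubleFields {
    public static void main(String[] args) {
        long bits = Double.doubleToLongBits(9.625);

        long sign     = (bits >>> 63) & 0x1L;       // bit 63
        long exponent = (bits >>> 52) & 0x7FFL;     // bits 52..62 (biased by 1023)
        long fraction = bits & 0xFFFFFFFFFFFFFL;    // bits 0..51

        // Rebuild the value: (-1)^s × (1 + f / 2^52) × 2^(e - 1023)
        double rebuilt = (sign == 0 ? 1 : -1)
                * (1.0 + fraction / (double) (1L << 52))
                * Math.pow(2, exponent - 1023);

        System.out.println(rebuilt);   // prints 9.625
    }
}
```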
3.3. Double precision Extended Format (SPARC)
The quadruple-precision format of the SPARC floating-point environment conforms to the IEEE definition of a double-precision extended format. It occupies four 32-bit words and contains three fields: a 112-bit fraction f, a 15-bit biased exponent e, and a 1-bit sign s. These fields are stored contiguously, as shown in the figure.
The word at the highest address contains the 32 least significant bits of the fraction, denoted f[31:0]. The next two 32-bit words contain f[63:32] and f[95:64] respectively. In the following word, bits 0:15 contain the 16 most significant bits of the fraction, f[111:96], where bit 0 is the least significant of these 16 bits and bit 15 is the most significant bit of the entire fraction; bits 16:30 contain the 15-bit biased exponent e, with bit 16 the least significant bit of the biased exponent and bit 30 the most significant; and bit 31 contains the sign bit s.
The following figure renumbers the four consecutive 32-bit words as one 128-bit word, in which bits 0:111 store the fraction f, bits 112:126 store the 15-bit biased exponent e, and bit 127 stores the sign bit s.
3.4. Double Precision Extended Format (x86)
The double-precision extended format of this floating-point environment conforms to the IEEE definition of a double-precision extended format. It contains four fields: a 63-bit fraction f, a 1-bit explicit leading significand bit j, a 15-bit biased exponent e, and a 1-bit sign s.
In the x86 architecture family, these fields are stored contiguously in ten bytes at consecutive addresses. However, the UNIX System V Application Binary Interface Intel 386 Processor Supplement (Intel ABI) requires that double-extended parameters occupy three consecutive 32-bit words on the stack, with the 16 most significant bits of the highest-addressed word unused, as shown in the following figure.
The lowest-addressed 32-bit word contains the 32 least significant bits of the fraction, f[31:0], with bit 0 the least significant bit of the entire fraction and bit 31 the most significant of these 32 bits. In the middle-addressed 32-bit word, bits 0:30 contain the 31 most significant bits of the fraction, f[62:32] (bit 0 is the least significant of these 31 bits and bit 30 is the most significant bit of the entire fraction); bit 31 of this middle word contains the explicit leading significand bit j.
In the highest-addressed 32-bit word, bits 0:14 contain the 15-bit biased exponent e, with bit 0 the least significant bit of the biased exponent and bit 14 the most significant, and bit 15 contains the sign bit s. Although the 16 most significant bits of this word are unused by the x86 architecture family, they are essential for meeting the Intel ABI requirement described above.
4. Converting real numbers to floating-point numbers
4.1 Normalization of floating-point numbers
The same number can be expressed by many different floating-point forms; as in the example above, 123.45 can be written as 12.345 × 10^1, 0.12345 × 10^3, or 1.2345 × 10^2. Because of this diversity, a normalization is needed to achieve a unified representation. A normalized floating-point number has the following form:
±d.dd...d × β^e,  (0 ≤ d_i < β)
Here d.dd...d is the mantissa, β is the base, and e is the exponent. The number of digits in the mantissa is called the precision, written p in this article. Each digit d_i lies between 0 (inclusive) and the base (exclusive), and the digit to the left of the point is nonzero.
The value of a floating-point number in this normalized form can be computed from the following expression:
±(d_0 + d_1 × β^-1 + ... + d_(p-1) × β^-(p-1)) × β^e,  (0 ≤ d_i < β)
The expression above is straightforward for decimal floating-point numbers, that is, numbers whose base β equals 10. Inside a computer, numbers are expressed in binary. From the expression we can see that binary numbers can also have a "binary point" and the same kind of positional notation; β simply equals 2, and each digit d can only be 0 or 1. For example, the binary number 1001.101 equals 1×2^3 + 0×2^2 + 0×2^1 + 1×2^0 + 1×2^-1 + 0×2^-2 + 1×2^-3, which is the decimal value 9.625. Its normalized floating-point form is 1.001101 × 2^3.
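The positional expansion above can be checked digit by digit in Java (the class name is only illustrative):

```java
public class BinaryFraction {
    public static void main(String[] args) {
        // 1001.101 in binary, expanded positionally:
        double value = 1 * 8      // 1 × 2^3
                     + 0 * 4      // 0 × 2^2
                     + 0 * 2      // 0 × 2^1
                     + 1 * 1      // 1 × 2^0
                     + 1 * 0.5    // 1 × 2^-1
                     + 0 * 0.25   // 0 × 2^-2
                     + 1 * 0.125; // 1 × 2^-3
        System.out.println(value); // prints 9.625
    }
}
```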
4.2 Floating-point numbers at each precision
Taking the 9.625 above as an example, its normalized form is 1.001101 × 2^3.
In single-precision format (sign 0, biased exponent 3 + 127 = 130, i.e. 10000010), it is therefore stored as:
0 10000010 00110100000000000000000
Similarly, in double-precision format (biased exponent 3 + 1023 = 1026, i.e. 10000000010):
0 10000000010 0011010000000000000000000000000000000000000000000000
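These two bit patterns can be confirmed directly: Float.floatToIntBits and Double.doubleToLongBits return exactly the encodings spelled out above, 0x411A0000 and 0x4023400000000000 in hex (the class name is only illustrative):

```java
public class VerifyEncoding {
    public static void main(String[] args) {
        // 9.625 = 1.001101 × 2^3
        // single: 0 | 10000010 | 00110100000000000000000
        System.out.println(Integer.toHexString(Float.floatToIntBits(9.625f)));
        // prints 411a0000

        // double: 0 | 10000000010 | 001101 followed by 46 zeros
        System.out.println(Long.toHexString(Double.doubleToLongBits(9.625)));
        // prints 4023400000000000
    }
}
```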
5. Special Value
The preceding sections cover the basics of floating-point numbers, which should suffice for someone who rarely touches floating-point applications. But if your interest runs deeper, or you face a tricky floating-point problem, this section introduces some notable special cases.
We already know that the exponent field of a single-precision floating-point number can express actual exponents between -127 and +128 (inclusive), and that the values -127 (stored as all 0s) and +128 (stored as all 1s) are reserved for special treatment. This section details these special values as defined by the IEEE standard.
Special floating-point values are mainly used for special cases and error handling. For example, when a program takes the square root of a negative number, a special return value marks the error: NaN (Not a Number). Without such a special value, the computation could only be terminated abruptly on such errors. Besides NaN, the IEEE standard also defines ±0, ±∞, and denormalized numbers.
For single-precision floating-point numbers, all of these special values are encoded with the reserved exponent values -127 and +128. If we write the boundaries of the ordinary exponent range, -126 and +127, as Emin and Emax, the reserved values can be expressed as Emin - 1 and Emax + 1. With this notation, the special values of the IEEE standard are as follows:

    exponent            fraction f    value
    Emin ≤ e ≤ Emax     any           ±1.f × 2^e  (normalized number)
    Emin - 1            f = 0         ±0
    Emin - 1            f ≠ 0         ±0.f × 2^Emin  (denormalized number)
    Emax + 1            f = 0         ±∞
    Emax + 1            f ≠ 0         NaN

Here f denotes the fraction, the part of the mantissa to the right of the binary point. The first line is the ordinary normalized floating-point number described earlier. The remaining special values are introduced one by one below.
5.1 NaN
NaN is used to handle error conditions arising in computation, such as 0.0 divided by 0.0 or the square root of a negative number. As the table above shows, for single precision NaN is represented by an exponent of Emax + 1 = 128 (exponent field all 1s) together with a nonzero fraction field. The IEEE standard does not mandate a particular fraction value, so NaN is not one value but a family of values. Different implementations are free to choose the fraction bits used to express NaN; for example, the constant Float.NaN in Java may be represented as 01111111110000000000000000000000, where the first bit of the fraction field is 1 and the rest are 0 (not counting the hidden bit), though this depends on the system's hardware architecture. Java even allows programmers to construct NaN values with particular bit patterns (through the Float.intBitsToFloat() method); a programmer could, for instance, encode diagnostic information into such a custom NaN.
A custom NaN value is identified as NaN by the Float.isNaN() method, yet it is not equal to the Float.NaN constant. In fact, all NaN values are unordered: the numeric comparison operators <, <=, > and >= return false when either operand is NaN. The equality operator == returns false when either operand is NaN, even for two NaNs with identical bit patterns, while the operator != returns true if either operand is NaN. An interesting consequence of this rule is that x != x is true exactly when x is NaN.
The operations that can produce NaN include, for example: 0/0; ∞ - ∞; 0 × ∞; ∞/∞; the square root of a negative number; and remainder operations such as x REM 0.
In addition, any operation with NaN as an operand also produces NaN. The point of expressing these errors with a special NaN value is to avoid unnecessarily terminating the computation. For example, if a floating-point method called inside a loop can fail on certain inputs, NaN makes it possible for the loop simply to continue with the iterations that do not fail. You might think that since Java has an exception mechanism, the same effect could be achieved by catching and ignoring exceptions; but the IEEE standard was not designed for Java alone, different languages handle exceptions differently, and relying on them would hurt code portability. Moreover, not every language has a comparable exception or signal mechanism.
Note: in Java, unlike floating-point arithmetic, dividing the integer 0 by 0 throws a java.lang.ArithmeticException.
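A short Java sketch of the NaN rules described above (the class name is only illustrative):

```java
public class NanDemo {
    public static void main(String[] args) {
        float nan = 0.0f / 0.0f;               // produces NaN, no exception

        System.out.println(Float.isNaN(nan));  // true
        System.out.println(nan == Float.NaN);  // false: NaN is unequal to everything
        System.out.println(nan != nan);        // true: x != x holds only for NaN
        System.out.println(nan < 1.0f);        // false: ordered comparisons all fail

        // Integer 0 / 0, by contrast, would throw ArithmeticException.
    }
}
```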
5.2 Infinity
Like NaN, the special value infinity has an exponent of Emax + 1 = 128, but an infinity's fraction field must be zero. Infinity is used to express overflow in computation. For example, when two very large numbers are multiplied, even though both operands are themselves representable as floating-point numbers, the result may be too large to store and must be rounded. According to the IEEE standard, instead of rounding the result to the largest representable floating-point number (which could be far from the true result and therefore meaningless), it is rounded to infinity; the same holds for negative numbers, which round to negative infinity, i.e. an infinity whose sign bit is 1. With NaN in mind, it is easy to see that the special value infinity keeps an overflow error from having to terminate the computation.
Unlike NaN, infinity is ordered with the other floating-point values. From smallest to largest: negative infinity, finite negative nonzero values, positive and negative zero (introduced below), finite positive nonzero values, and positive infinity. Any nonzero value other than NaN divided by zero yields infinity, with the sign determined by the signs of the dividend and the zero divisor.
Recalling the introduction of NaN: when 0 is divided by 0 the result is NaN, not infinity. The reason is easy to see: as both dividend and divisor approach 0, their quotient could approach any value, so the IEEE standard decided that NaN is the more appropriate result here.
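The behavior of infinity described above can be observed directly in Java (the class name is only illustrative):

```java
public class InfinityDemo {
    public static void main(String[] args) {
        // Overflow rounds to infinity instead of the largest finite float.
        System.out.println(Float.MAX_VALUE * 2.0f);   // Infinity

        // Nonzero / 0 gives infinity, with the sign taken from the operands.
        System.out.println(1.0f / 0.0f);              // Infinity
        System.out.println(1.0f / -0.0f);             // -Infinity

        // 0 / 0 is NaN, not infinity.
        System.out.println(0.0f / 0.0f);              // NaN

        // Infinities are ordered with respect to finite values.
        System.out.println(Float.NEGATIVE_INFINITY < -1e38f);  // true
    }
}
```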
5.3 Signed 0
Because the 1 to the left of the binary point is hidden in the IEEE floating-point format, while zero obviously requires the mantissa to be zero, zero cannot be expressed directly in this format and must be treated specially.
In fact, zero is stored with a fraction field of all 0s and an exponent field of Emin - 1 = -127, which means the exponent field is also all 0s. Because of the sign field, there are two zeros: +0 and -0. Unlike positive and negative infinity, which are ordered, the IEEE standard stipulates that +0 and -0 are equal.
A zero with a sign is admittedly confusing. It is the outcome of weighing various pros and cons in numerical analysis. A signed zero avoids losing sign information in computations, particularly those involving infinities. For example, if 0 had no sign, the identity 1/(1/x) = x would fail for x = ±∞: 1 divided by negative infinity would give the same 0 as 1 divided by positive infinity, and 1 divided by that 0 would give positive infinity, so the sign would be gone. The only fix would be for infinity to lose its sign as well; but the sign of infinity expresses which side of the axis the overflow occurred on, information we clearly cannot give up. The sign of zero causes problems of its own: for instance, with x = +0 and y = -0 we have x == y, yet 1/x and 1/y are positive and negative infinity respectively, so the implication from x == y to 1/x == 1/y no longer holds. Another way out would be to make the two zeros ordered, like the infinities; but then even a simple test such as if (x == 0) would become ambiguous, since x might be either of ±0. Of the two evils, unordered (equal) zeros are the lesser.
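A small Java demonstration of these signed-zero trade-offs (the class name is only illustrative):

```java
public class SignedZeroDemo {
    public static void main(String[] args) {
        double posZero = 0.0, negZero = -0.0;

        // The standard says +0 and -0 compare equal ...
        System.out.println(posZero == negZero);   // true

        // ... yet the sign still matters once infinity is involved:
        System.out.println(1.0 / posZero);        // Infinity
        System.out.println(1.0 / negZero);        // -Infinity

        // 1/(1/x) == x survives for x = -Infinity only because -0.0 keeps the sign.
        double x = Double.NEGATIVE_INFINITY;
        System.out.println(1.0 / (1.0 / x) == x); // true
    }
}
```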
5.4 Denormalized numbers
Let's examine a special case. Take two single-precision binary floating-point numbers of very small magnitude, say 1.001 × 2^-125 and 1.0001 × 2^-125 (roughly the decimal values 2.6448623 × 10^-38 and 2.4979255 × 10^-38). Both are ordinary normalized floating-point numbers (the exponent -125 is greater than the allowed minimum -126, and the mantissas are unproblematic), stored respectively as 00000001000100000000000000000000 (0x01100000) and 00000001000010000000000000000000 (0x01080000).
Now consider the difference of these two floating-point numbers. It is easy to work out that the difference is 0.0001 × 2^-125, which in normalized form is 1.0 × 2^-129. The problem is that this exponent is smaller than the minimum allowed exponent, so the value cannot be stored as a normalized floating-point number; in the end it could only be flushed to zero. This special situation means that even the following seemingly reliable code could fail:
if (x != y) {
    z = 1 / (x - y);
}
Just as with the two carefully chosen floating-point numbers above: even when x is not equal to y, their difference may still be too small, be flushed to zero, and lead to a division by zero.
To solve such problems, the IEEE standard introduced denormalized floating-point numbers. The rule is that when a floating-point number's exponent is the minimum allowed exponent, Emin, the mantissa need not be normalized. For example, the difference above can then be expressed as the denormalized number 0.001 × 2^-126, whose exponent -126 equals Emin. Note the "need not": when the actual exponent is Emin and the exponent field holds the normal encoding of Emin, the number is still normalized, that is, stored with the hidden leading 1. To store denormalized numbers, the IEEE standard uses an approach similar to the special value zero: they are marked with the special exponent field value Emin - 1 (all 0s), but with a nonzero fraction field. The difference in our example can thus be saved as 00000000000100000000000000000000 (0x00200000), with no hidden leading bit.
With denormalized numbers, the hidden leading bit is given up in exchange for representing floating-point numbers of smaller magnitude. And since the spacing of representable values no longer jumps at the bottom of the range, the minimal-difference problem disappears: the difference of any two distinct floating-point numbers can itself be stored.
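The example from this section can be replayed in Java; the bit patterns below are the ones worked out above (the class name is only illustrative):

```java
public class DenormalDemo {
    public static void main(String[] args) {
        // The two carefully chosen floats from the text, built from their bit patterns.
        float x = Float.intBitsToFloat(0x01100000);  // 1.001  × 2^-125
        float y = Float.intBitsToFloat(0x01080000);  // 1.0001 × 2^-125

        // Thanks to gradual underflow the difference is a denormal, not zero,
        // so the guarded division in the text is actually safe.
        float diff = x - y;                          // exactly 2^-129
        System.out.println(diff == 0.0f);            // false
        System.out.println(Integer.toHexString(Float.floatToIntBits(diff)));

        // The smallest positive float is itself a denormal: exponent field 0, fraction 1.
        System.out.println(Float.intBitsToFloat(1) == Float.MIN_VALUE); // true
    }
}
```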
6. Range and Accuracy
Many decimal fractions cannot be represented exactly in a binary computer (for example, the seemingly simple 0.1). Because the fraction field has a limited number of bits, conversion proceeds only until enough mantissa bits have been produced to fill the field, and the remaining bits are rounded away. In other words, beyond the precision issues discussed earlier, the conversion from decimal to binary is itself not guaranteed to be exact, only approximate; in fact, only a small minority of decimal fractions have an exact binary floating-point representation. Add the accumulation of error in floating-point arithmetic, and the result is that many decimal calculations that look trivial produce unexpected results on a computer. This is the most common floating-point "inaccuracy" problem.
See the following Java example:
System.out.print("34.6 - 34.0 = " + (34.6f - 34.0f));
The output of this code is as follows:
34.6 - 34.0 = 0.5999985
The cause of this error is that 34.6 cannot be expressed exactly as a floating-point number and can only be stored as a rounded approximation. Arithmetic between this approximation and 34.0 naturally cannot produce an exact result.
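To see the rounded approximation itself, java.math.BigDecimal can print the exact value that a decimal literal actually stores (the class name is only illustrative):

```java
import java.math.BigDecimal;

public class InexactDecimal {
    public static void main(String[] args) {
        // 34.6f is stored as a rounded approximation; subtraction exposes the error.
        System.out.println(34.6f - 34.0f);     // 0.5999985, not 0.6

        // BigDecimal(double) preserves the exact stored value of the literal 0.1:
        // a long binary fraction close to, but not equal to, one tenth.
        System.out.println(new BigDecimal(0.1));
    }
}
```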
The range and precision of each storage format:

    format                              significant bits (binary)   smallest positive normal number   largest positive number   significant digits (decimal)
    single precision                    24                          1.175... × 10^-38                 3.402... × 10^+38         6-9
    double precision                    53                          2.225... × 10^-308                1.797... × 10^+308        15-17
    double precision extended (SPARC)   113                         3.362... × 10^-4932               1.189... × 10^+4932       33-36
    double precision extended (x86)     64                          3.362... × 10^-4932               1.189... × 10^+4932       18-21
7. Rounding
It is worth noting that for single-precision numbers, because the mantissa is only 24 bits (one of them hidden), the largest integer the mantissa can express exactly is 2^24 - 1 = 16,777,215. Since 2^24 = 16,777,216 is even, it too can be stored exactly, by halving it and adjusting the exponent accordingly. In contrast, 16,777,217 cannot be stored exactly. Consequently, the number of decimal digits a single-precision floating-point number can express reliably is no more than 8; in fact, numerical analysis of the relative error shows that the effective precision is roughly 7.22 decimal digits. Consider the following example:
    true value      stored value
    16,777,215      1.6777215E7
    16,777,216      1.6777216E7
    16,777,217      1.6777216E7
    16,777,218      1.6777218E7
    16,777,219      1.677722E7
    16,777,220      1.677722E7
    16,777,221      1.677722E7
    16,777,222      1.6777222E7
    16,777,223      1.6777224E7
    16,777,224      1.6777224E7
    16,777,225      1.6777224E7
By the standard's requirement, a value that cannot be stored exactly must be rounded to the nearest representable value. This is somewhat like the familiar decimal rounding: less than half rounds down, more than half rounds up. For binary floating-point numbers there is an extra rule: when the discarded part is exactly one half, the result is not simply rounded up; instead, of the two equally distant representable neighbors, the one whose last significand bit is zero is chosen. As the example above shows, odd halfway values are sometimes rounded down and sometimes up, always landing on an even neighbor. We can view this rounding error as a "half-digit" error, which is why, to avoid the confusion the figure 7.22 causes many readers, some articles state the precision of single-precision floating point as 7.5 digits. Hint: the rounding rule used here is usually called round-to-even. Compared with the rule of always rounding halves up, rounding to even helps, statistically, to reduce the accumulation of rounding error over long computations, which is why the IEEE standard adopted it.
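The table and the round-to-even rule can be reproduced with simple casts (the class name is only illustrative):

```java
public class RoundToEven {
    public static void main(String[] args) {
        // 2^24 - 1 and 2^24 are exact; odd values above 2^24 fall between two floats.
        System.out.println((float) 16_777_215); // 1.6777215E7 (exact)
        System.out.println((float) 16_777_216); // 1.6777216E7 (exact)
        System.out.println((float) 16_777_217); // 1.6777216E7 (tie, rounds to even)
        System.out.println((float) 16_777_219); // 1.677722E7  (tie, rounds to even)
    }
}
```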