# Floating point numbers (IEEE 754)

Source: http://www.cnblogs.com/kingwolfofsky/archive/2011/07/21/2112299.html

1. What are floating point numbers

In the course of computer development, several ways of representing real numbers have been proposed. A typical example is the fixed-point number, the counterpart of the floating-point number. In this representation, the decimal point sits at a fixed position within the digits. Currency amounts, for instance, fit this scheme: with four digits of precision and two digits after the decimal point, values such as 99.00 or 00.99 can be represented. Because the decimal point is fixed, the four digits can be used directly to express the value. The number data type in SQL is defined using fixed-point numbers. A rational-number representation has also been proposed, expressing a real number as the ratio of two integers.

The drawback of fixed-point representation is its rigidity: the fixed position of the decimal point dictates fixed numbers of integer and fractional digits, which makes it hard to represent very large and very small numbers at the same time. In the end, most modern computer systems adopted the so-called floating-point representation. It expresses a real number in scientific notation, using a mantissa (sometimes called the significand; "mantissa" is in fact an informal synonym for significand), a base, an exponent, and a sign. For example, 123.45 can be written in decimal scientific notation as 1.2345×10^2, where 1.2345 is the mantissa, 10 is the base, and 2 is the exponent. The exponent lets the decimal point "float", so a much wider range of real numbers can be expressed flexibly.

2. IEEE floating-point numbers

A floating-point number is stored in a computer in a finite number of contiguous bytes. The IEEE standard divides the bits of this fixed-length storage into three fields of specified widths: a sign field, an exponent field, and a mantissa field, which hold the sign, exponent, and mantissa of the binary floating-point number respectively. A value is thus expressed by a mantissa together with an adjustable exponent (hence "floating point").

IEEE 754 specifies:

- Two basic floating-point formats: single precision and double precision.
  - The IEEE single-precision format has 24 bits of significand precision and occupies 32 bits in total.
  - The IEEE double-precision format has 53 bits of significand precision and occupies 64 bits in total.
- Two extended floating-point formats: single-precision extended and double-precision extended. The standard does not fix the exact precision and size of these formats, but it does specify minimums. For example, the IEEE double-precision extended format must have at least 64 bits of significand precision and occupy at least 79 bits in total.

See the following illustration for a specific format:

3. Floating point format

A floating-point format is a data structure specifying the fields that make up a floating-point number, the layout of those fields, and their arithmetic interpretation. A floating-point storage format specifies how such a format is laid out in memory. The formats themselves are defined by the IEEE standard, but the choice of storage format is determined by the implementation.

Assembly-language software sometimes depends on the storage format in use, but higher-level languages typically deal only with the language-level floating-point data types. These types have different names in different high-level languages and correspond to the IEEE formats as shown in the following table.

| IEEE precision | C, C++ | Fortran (SPARC only) |
|---|---|---|
| Single precision | float | REAL or REAL*4 |
| Double precision | double | DOUBLE PRECISION or REAL*8 |
| Double precision extended | long double | REAL*16 |

IEEE 754 explicitly defines the single-precision and double-precision floating-point formats, as well as a set of extended formats for each of these two basic formats. The long double and REAL*16 types shown in the table correspond to the double-precision extended format defined by the IEEE standard.

3.1. Single precision format

The IEEE single-precision format consists of three fields: a 23-bit fraction f, an 8-bit biased exponent e, and a 1-bit sign s. These fields are stored contiguously in one 32-bit word (as shown in the following illustration).

- Bits 0:22 contain the 23-bit fraction f, with bit 0 the least significant bit of the fraction and bit 22 the most significant.

The IEEE standard requires floating-point numbers to be normalized, meaning the mantissa has a single 1 to the left of the binary point. Since this leading 1 is always present, it can be omitted when the mantissa is stored, freeing one bit for an extra bit of precision. The 23-bit fraction field therefore actually expresses a 24-bit mantissa.

- Bits 23:30 contain the 8-bit biased exponent e, with bit 23 the least significant bit of the biased exponent and bit 30 the most significant.

The 8-bit exponent field can hold 256 values, from 0 to 255. Exponents, however, can be negative. To handle negative exponents, a bias is added to the actual exponent to produce the value stored in the exponent field; the bias for single precision is 127. With this bias, the actual exponent values that a single-precision number can express range from -127 to +128 (inclusive). In this article the minimum and maximum ordinary exponents are written Emin and Emax respectively. The actual exponent values -127 (stored as all 0s) and +128 (stored as all 1s) are reserved for the special values described later.

- The highest bit, bit 31, contains the sign bit s: s = 0 indicates a positive number, s = 1 a negative number.
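The field layout just described can be inspected directly from Java, whose Float.floatToIntBits method returns the raw IEEE 754 bit pattern. A minimal sketch (the value -6.25f is an arbitrary example chosen here):

```java
public class FloatFields {
    public static void main(String[] args) {
        float x = -6.25f;                       // -6.25 = -1.5625 * 2^2
        int bits = Float.floatToIntBits(x);
        int sign     = (bits >>> 31) & 0x1;     // bit 31: sign
        int exponent = (bits >>> 23) & 0xFF;    // bits 23:30: biased exponent
        int fraction = bits & 0x7FFFFF;         // bits 0:22: fraction
        System.out.println(sign);               // 1, i.e. negative
        System.out.println(exponent - 127);     // 2, after removing the bias of 127
        System.out.println(Integer.toBinaryString(fraction)); // 1001 followed by 19 zeros
    }
}
```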

3.2. Double precision format

The IEEE double-precision format consists of three fields: a 52-bit fraction f, an 11-bit biased exponent e, and a 1-bit sign s. These fields are stored contiguously in two 32-bit words (as shown in the following figure). On the SPARC architecture, the higher-addressed 32-bit word contains the 32 least significant bits of the fraction, whereas on the x86 architecture the lower-addressed 32-bit word contains them.

If f[31:0] denotes the 32 least significant bits of the fraction, bit 0 is the least significant bit of the entire fraction and bit 31 is the most significant of these 32 bits. In the other 32-bit word, bits 0:19 contain the 20 most significant bits of the fraction, f[51:32], with bit 0 the least significant of these 20 bits and bit 19 the most significant bit of the entire fraction; bits 20:30 contain the 11-bit biased exponent e, with bit 20 the least significant bit and bit 30 the most significant; and the highest bit, bit 31, contains the sign bit s.

The figure above numbers the bits as if the two consecutive 32-bit words were a single 64-bit word, in which:

- Bits 0:51 contain the 52-bit fraction f, with bit 0 the least significant bit of the fraction and bit 51 the most significant.

As in the single-precision case, the IEEE standard requires the mantissa to be normalized, so the leading 1 to the left of the binary point is omitted when the mantissa is stored, freeing one bit. The 52-bit fraction field therefore actually expresses a 53-bit mantissa.

- Bits 52:62 contain the 11-bit biased exponent e, with bit 52 the least significant bit of the biased exponent and bit 62 the most significant.

The 11-bit exponent field can hold 2048 values, from 0 to 2047. As with single precision, a bias is added to the actual exponent to produce the stored value; the bias for double precision is 1023, so the actual exponent values that a double-precision number can express range from -1023 to +1024 (inclusive). The actual exponent values -1023 (stored as all 0s) and +1024 (stored as all 1s) are reserved for the special values described later.

- The highest bit, bit 63, contains the sign bit s: s = 0 indicates a positive number, s = 1 a negative number.
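The same decomposition works for double precision via Double.doubleToLongBits; only the field widths and the bias change. A sketch, using the value 9.625 that appears later in this article:

```java
public class DoubleFields {
    public static void main(String[] args) {
        double x = 9.625;                         // 9.625 = 1.001101 (binary) * 2^3
        long bits = Double.doubleToLongBits(x);
        long sign     = (bits >>> 63) & 0x1L;     // bit 63: sign
        long exponent = (bits >>> 52) & 0x7FFL;   // bits 52:62: biased exponent
        long fraction = bits & 0xFFFFFFFFFFFFFL;  // bits 0:51: fraction
        System.out.println(sign);                 // 0, i.e. positive
        System.out.println(exponent - 1023);      // 3, after removing the bias of 1023
    }
}
```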

3.3. Double precision Extended Format (SPARC)

The quadruple-precision format of the SPARC floating-point environment conforms to the IEEE definition of a double-precision extended format. It occupies four 32-bit words and contains three fields: a 112-bit fraction f, a 15-bit biased exponent e, and a 1-bit sign s. These fields are stored contiguously, as shown in Figure 2-3.

The highest-addressed 32-bit word contains the 32 least significant bits of the fraction, denoted f[31:0]. The next two 32-bit words contain f[63:32] and f[95:64] respectively. In the following word, bits 0:15 contain the 16 most significant bits of the fraction, f[111:96], with bit 0 the least significant of these 16 bits and bit 15 the most significant bit of the entire fraction; bits 16:30 contain the 15-bit biased exponent e, with bit 16 the least significant bit and bit 30 the most significant; and bit 31 contains the sign bit s.

The following figure numbers the bits as if the four consecutive 32-bit words were a single 128-bit word: bits 0:111 store the fraction f; bits 112:126 store the 15-bit biased exponent e; and bit 127 stores the sign bit s.

3.4. Double Precision Extended Format (x86)

The double-precision extended format of this floating-point environment conforms to the IEEE definition of a double-precision extended format. It contains four fields: a 63-bit fraction f, a 1-bit explicit leading significant bit j, a 15-bit biased exponent e, and a 1-bit sign s.

On the x86 architecture family, these fields are stored contiguously in ten bytes at consecutive addresses. However, the UNIX System V Application Binary Interface Intel 386 Processor Supplement (Intel ABI) requires double-precision extended parameters to occupy three consecutive 32-bit words on the stack, with the 16 most significant bits of the highest-addressed word unused, as shown in the following figure.

The lowest-addressed 32-bit word contains the 32 least significant bits of the fraction, f[31:0], with bit 0 the least significant bit of the entire fraction and bit 31 the most significant of these 32 bits. In the middle 32-bit word, bits 0:30 contain the 31 most significant bits of the fraction, f[62:32] (bit 0 being the least significant of these 31 bits and bit 30 the most significant bit of the entire fraction), and bit 31 contains the explicit leading significant bit j.

In the highest-addressed 32-bit word, bits 0:14 contain the 15-bit biased exponent e, with bit 0 the least significant bit of the biased exponent and bit 14 the most significant, and bit 15 contains the sign bit s. Although the 16 most significant bits of this word are unused by the x86 architecture family, they are essential for meeting the Intel ABI requirement described above.

4. Convert real numbers to floating-point numbers

4.1 Normalization of floating-point numbers

The same number can be written as a floating-point number in several ways; as in the earlier example, 123.45 can be expressed as 12.345×10^1, 0.12345×10^3, or 1.2345×10^2. Because of this diversity, a normalized form is needed to achieve a unique representation. The normalized floating-point representation has the following form:

±d.dd...d × β^e,  (0 ≤ d_i < β)

Here d.dd...d is the mantissa, β is the base, and e is the exponent. The number of digits in the mantissa is called the precision, written p in this article. Each digit d lies between 0 (inclusive) and the base (exclusive), and the digit to the left of the decimal point is nonzero.

The value of a floating-point number in this normalized form can be computed from the following expression:

±(d_0 + d_1×β^-1 + ... + d_(p-1)×β^-(p-1)) × β^e,  (0 ≤ d_i < β)

This expression is straightforward for decimal floating-point numbers, i.e. for base β equal to 10. Numbers inside a computer are represented in binary, and as the expression shows, binary numbers can also have a fractional point and an analogous representation: β is then 2, and each digit d can only be 0 or 1. For example, the binary number 1001.101 equals 1×2^3 + 0×2^2 + 0×2^1 + 1×2^0 + 1×2^-1 + 0×2^-2 + 1×2^-3, corresponding to decimal 9.625. Its normalized floating-point representation is 1.001101×2^3.
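The digit-by-digit expansion above can be checked mechanically; this small sketch simply evaluates each weighted binary digit of 1001.101:

```java
public class BinaryFraction {
    public static void main(String[] args) {
        // 1001.101 (binary) expanded digit by digit
        double v = 1 * 8 + 0 * 4 + 0 * 2 + 1 * 1    // integer part 1001
                 + 1 * 0.5 + 0 * 0.25 + 1 * 0.125;  // fractional part .101
        System.out.println(v);                       // 9.625
    }
}
```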

4.2 Floating-point numbers based on Precision

Taking 9.625 above as an example, its normalized representation is 1.001101×2^3. The number is positive (sign bit 0), the biased exponent is 3 + 127 = 130, and the fraction is the mantissa with the hidden leading 1 dropped. In the single-precision format it is therefore stored as:

0 10000010 00110100000000000000000

Similarly, in the double-precision format (biased exponent 3 + 1023 = 1026):

0 10000000010 0011010000000000000000000000000000000000000000000000
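These bit patterns can be fed back through Java's intBitsToFloat / longBitsToDouble to confirm that they decode to 9.625 (the underscores in the binary literals simply mark the sign/exponent/fraction field boundaries):

```java
public class EncodingCheck {
    public static void main(String[] args) {
        // single precision: sign 0, biased exponent 130, fraction 001101...
        int single = 0b0_10000010_00110100000000000000000;
        System.out.println(Float.intBitsToFloat(single));   // 9.625
        // double precision: sign 0, biased exponent 1026, fraction 001101...
        long dbl = 0b0_10000000010_0011010000000000000000000000000000000000000000000000L;
        System.out.println(Double.longBitsToDouble(dbl));   // 9.625
    }
}
```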

5. Special Value

The preceding sections cover the basics of floating-point numbers, which should suffice for anyone who does not work with floating-point computation directly. If your interest runs deeper, however, or you face a tricky floating-point problem, this section describes some notable special values.

We already know that the exponent field of a single-precision floating-point number can express actual exponent values from -127 to +128 (inclusive), and that -127 (stored as all 0s) and +128 (stored as all 1s) are reserved for special treatment. This section details the special values defined by the IEEE standard.

Special floating-point values are used mainly for special cases and error handling. For example, when a program takes the square root of a negative number, a special return value, NaN (Not a Number), marks the error; without such a special value, the computation could only be terminated abruptly. Besides NaN, the IEEE standard also defines ±0, ±∞, and denormalized numbers.

For single-precision floating-point numbers, all of these special values are encoded using the reserved exponent values -127 and +128. If Emin and Emax denote the bounds of the ordinary exponent range, i.e. -126 and +127, the reserved values can be written as Emin - 1 and Emax + 1 respectively. In these terms, the special values of the IEEE standard are as follows:

Here f denotes the fraction, the portion of the mantissa to the right of the binary point. The first row is the ordinary normalized floating-point number described earlier; the remaining special values are introduced below.

5.1 NaN

NaN is used to handle error conditions arising in computation, such as 0.0 divided by 0.0 or the square root of a negative number. As the table above shows, for single-precision floating-point numbers NaN is represented by an exponent of Emax + 1 = 128 (exponent field all 1s) together with a nonzero mantissa field. The IEEE standard does not mandate a specific mantissa value, so NaN is not one value but a whole family, and implementations are free to choose the mantissa bits used to express NaN. For example, the constant Float.NaN in Java may be represented as 01111111110000000000000000000000, where the first bit of the mantissa field is 1 and the rest are 0 (not counting the hidden bit), though this depends on the hardware architecture. In Java, programmers can even construct NaN values with a particular bit pattern (via the Float.intBitsToFloat() method); a specific bit pattern in such a custom NaN could, for example, carry diagnostic information.

Such a custom NaN value is identified as NaN by the Float.isNaN() method, yet it is not equal to the Float.NaN constant. In fact, all NaN values are unordered. The numeric comparison operators <, <=, > and >= return false when either operand is NaN. The equality operator == returns false when either operand is NaN, even for two NaNs with the same bit pattern, and the operator != returns true when either operand is NaN. An interesting consequence of these rules is that x != x is true when x is NaN.

Operations that can produce NaN include 0/0, ∞ − ∞, 0 × ∞, ∞/∞, and the square root of a negative number.

In addition, any operation with NaN as an operand also produces NaN. The point of expressing these errors with a special NaN value is to avoid terminating the computation unnecessarily. For example, if a floating-point method called in a loop can fail for some input parameters, NaN lets the loop simply carry on with the iterations that do not fail. You might think that, since Java has an exception-handling mechanism, the same effect could be achieved by catching and ignoring exceptions; but the IEEE standard was not designed for Java alone, different languages handle exceptions differently (which would make code harder to port), and not all languages have a comparable exception or signal mechanism.

Note: in Java, unlike floating-point division, the integer division 0/0 throws a java.lang.ArithmeticException.
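The unordered behaviour of NaN described above can be demonstrated in a few lines of Java (the bit pattern 0x7FC00123 is just one arbitrary member of the NaN family, chosen here as a stand-in for a custom diagnostic NaN):

```java
public class NaNDemo {
    public static void main(String[] args) {
        float nan = 0.0f / 0.0f;                  // floating-point 0/0 yields NaN, no exception
        System.out.println(nan == nan);           // false: NaN is unordered, even against itself
        System.out.println(nan != nan);           // true
        System.out.println(Float.isNaN(nan));     // true: the reliable way to test for NaN
        float custom = Float.intBitsToFloat(0x7FC00123);  // exponent all 1s, fraction nonzero
        System.out.println(Float.isNaN(custom));  // true: any such pattern is a NaN
    }
}
```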

5.2 Infinity

Like NaN, the special value infinity has an exponent of Emax + 1 = 128, but the mantissa field of an infinity must be zero. Infinity is used to express overflow in computation. For example, when two very large numbers are multiplied, both operands may themselves be representable as floating-point numbers while the result is too large to be stored as a floating-point number and must be rounded. According to the IEEE standard, instead of rounding to the largest representable floating-point number (which could be far from the true result and therefore meaningless), the result is rounded to infinity. The same applies to negative results, which round to negative infinity, i.e. an infinity with sign bit 1. After NaN, it is not hard to see the point: the special value infinity lets overflow errors, too, avoid terminating the computation.

Infinities are ordered with all floating-point numbers other than NaN: from smallest to largest, negative infinity, finite negative nonzero values, positive and negative zero (introduced below), finite positive nonzero values, and positive infinity. Dividing any nonzero value other than NaN by zero yields infinity, with the sign determined by the signs of the dividend and the zero divisor.

Recall from the discussion of NaN that 0 divided by 0 yields NaN, not infinity. The reason is easy to see: when both the divisor and the dividend approach 0, the quotient could be any value at all, so the IEEE standard decided that NaN is the more appropriate quotient in this case.
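The rules above are easy to verify in Java; a small sketch:

```java
public class InfinityDemo {
    public static void main(String[] args) {
        System.out.println(1.0f / 0.0f);             // Infinity: nonzero / 0
        System.out.println(1.0f / -0.0f);            // -Infinity: the sign of the zero decides
        System.out.println(Float.MAX_VALUE * 2.0f);  // Infinity: overflow rounds to infinity
        System.out.println(-Float.MAX_VALUE * 2.0f); // -Infinity
        System.out.println(0.0f / 0.0f);             // NaN, not infinity
    }
}
```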

5.3 Signed 0

Because the IEEE floating-point format hides the leading 1 to the left of the binary point, while zero obviously requires the mantissa to be all zero, zero cannot be expressed directly in this format and must be handled specially.

In fact, zero is stored with a mantissa field of all 0s and an exponent of Emin - 1 = -127, i.e. an exponent field of all 0s. Because of the sign bit, there are two zeros, +0 and -0. Unlike positive and negative infinity, which are ordered, the IEEE standard stipulates that +0 and -0 are equal.

That zero has a sign is admittedly confusing. It is the outcome of weighing the pros and cons of various numerical-analysis considerations. The signed zero avoids losing sign information in computation, particularly in operations involving infinity. For example, if zero were unsigned, the identity 1/(1/x) = x would no longer hold for x = ±∞: 1 divided by negative infinity would give the same zero as 1 divided by positive infinity, 1 divided by that zero would then give positive infinity, and the original sign would be lost. The only way out would be for infinity to be unsigned as well, but the sign of infinity records on which side of the axis the overflow occurred, information clearly too valuable to give up. The signed zero causes problems of its own: for example, given x = y, the identity 1/x = 1/y fails when x and y are +0 and -0 respectively, since the two sides become positive and negative infinity. This could instead be solved by making the two zeros ordered, but then even a simple test like if (x == 0) would become indeterminate, since x could be either of ±0. Of the two evils, keeping the two zeros equal is the lesser.
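A short Java sketch of the trade-off just described: the two zeros compare equal, yet their bit patterns differ and the sign survives division:

```java
public class SignedZeroDemo {
    public static void main(String[] args) {
        float pz = 0.0f, nz = -0.0f;
        System.out.println(pz == nz);      // true: +0 and -0 compare equal
        System.out.println(1.0f / pz);     // Infinity
        System.out.println(1.0f / nz);     // -Infinity: 1/x = 1/y fails for x = +0, y = -0
        System.out.println(Integer.toHexString(Float.floatToIntBits(pz))); // 0
        System.out.println(Integer.toHexString(Float.floatToIntBits(nz))); // 80000000: only the sign bit set
    }
}
```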

5.4 Non-normalized number

Let us examine a special case of floating-point numbers. Take two floating-point numbers of nearly the smallest magnitude, using single-precision binary floating-point as an example: 1.001×2^-125 and 1.0001×2^-125 (corresponding to decimal 2.6448623×10^-38 and 2.4979255×10^-38 respectively). Both are clearly normal floating-point numbers (their exponent, -125, is above the allowed minimum of -126, and their mantissas pose no problem), and they can be stored as 00000001000100000000000000000000 (0x01100000) and 00000001000010000000000000000000 (0x01080000) respectively.

Now consider the difference of these two floating-point numbers. It is easily computed as 0.0001×2^-125, which expressed as a normalized floating-point number is 1.0×2^-129. The problem is that this exponent is smaller than the minimum allowed, so the difference cannot be stored as a normalized floating-point number; without further measures it could only be flushed to zero. This special situation means that even the following seemingly reliable code can fail:

```java
if (x != y) {
    z = 1 / (x - y);
}
```

Just as we deliberately chose these two floating-point numbers to expose the problem: even when x is not equal to y, their difference may still be too small, be flushed to zero, and thus cause a division by zero.

To solve such problems, the IEEE standard introduces denormalized floating-point numbers. It specifies that when the exponent of a floating-point number equals the minimum allowed exponent, Emin, the mantissa need not be normalized. Note that this is "need not", not "must not": when the actual exponent is Emin and the mantissa happens to be normalized, the number is still stored in normalized form, with the hidden leading bit implied. For example, the difference in the example above can be expressed as the denormalized number 0.001×2^-126, whose exponent, -126, equals Emin. To store denormalized numbers, the IEEE standard takes an approach similar to the special handling of zero: they are marked by the special exponent field value Emin - 1, with a mantissa field that is not zero. The difference in the example is thus stored as 00000000000100000000000000000000 (0x00100000), with no implied leading bit.

With denormalized numbers, the implied leading bit is removed, and floating-point numbers of even smaller magnitude can be stored. Moreover, because the mantissa is no longer forced to be normalized, the minimal-difference problem disappears: the computed difference of two unequal floating-point numbers is never flushed to zero.
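The example from this section can be replayed in Java: the two carefully chosen values subtract to a nonzero denormalized result rather than being flushed to zero:

```java
public class DenormalDemo {
    public static void main(String[] args) {
        float x = Float.intBitsToFloat(0x01100000);  // 1.001  (binary) * 2^-125
        float y = Float.intBitsToFloat(0x01080000);  // 1.0001 (binary) * 2^-125
        float diff = x - y;                          // 1.0 * 2^-129, below the normal range
        System.out.println(diff == 0.0f);            // false: stored as a denormalized number
        System.out.println(Integer.toHexString(Float.floatToIntBits(diff))); // 100000, i.e. 0x00100000
        System.out.println(Float.MIN_VALUE);         // 1.4E-45, the smallest positive denormal
    }
}
```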

6. Range and Accuracy

Many decimal fractions cannot be represented exactly in a binary computer (even the simplest, such as 0.1), because the mantissa field has a limited number of bits: the conversion continues until the resulting digits fill the mantissa field, and the remaining bits are rounded off. In other words, besides the precision limits discussed earlier, the decimal-to-binary conversion itself is not guaranteed to be exact, only approximate; in fact only a small minority of decimal fractions have an exact binary floating-point representation. Add the accumulation of error in floating-point arithmetic, and many apparently simple decimal computations produce unexpected results on a computer. This is the most common floating-point complaint, the "inexactness" problem.

Consider the following Java example:

```java
System.out.print("34.6-34.0=" + (34.6f - 34.0f));
```

The output of this code is:

```
34.6-34.0=0.5999985
```

The reason for this error is that 34.6 cannot be expressed exactly as a floating-point number and can only be stored as a rounded approximation. Arithmetic between this approximation and 34.0 naturally cannot produce an exact result.
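The rounded approximation can be made visible with java.math.BigDecimal, whose double constructor exposes the exact value a binary floating-point number actually holds:

```java
import java.math.BigDecimal;

public class ExactValueDemo {
    public static void main(String[] args) {
        // the exact binary value stored for the decimal literal 0.1
        System.out.println(new BigDecimal(0.1));
        // the exact value stored for 34.6f (widened to double), slightly below 34.6
        System.out.println(new BigDecimal(34.6f));
        System.out.println(34.6f - 34.0f);   // 0.5999985
    }
}
```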

The range and precision of the storage formats:

| Format | Significant digits (binary) | Smallest positive normal number | Largest positive number | Significant digits (decimal) |
|---|---|---|---|---|
| Single precision | 24 | 1.175...×10^-38 | 3.402...×10^+38 | 6-9 |
| Double precision | 53 | 2.225...×10^-308 | 1.797...×10^+308 | 15-17 |
| Double extended (SPARC) | 113 | 3.362...×10^-4932 | 1.189...×10^+4932 | 33-36 |
| Double extended (x86) | 64 | 3.362...×10^-4932 | 1.189...×10^+4932 | 18-21 |

7. Rounding

It is worth noting that for single-precision numbers, since the mantissa has only 24 bits (one of them hidden), the largest mantissa value that can be expressed is 2^24 - 1 = 16,777,215. In particular, 16,777,216 is even, so it can still be stored exactly by dividing it by 2 and adjusting the exponent accordingly; 16,777,217, by contrast, cannot be stored exactly. From this we can see that single-precision floating-point numbers express no more than 8 significant decimal digits. In fact, relative-error analysis puts the effective precision at about 7.22 decimal digits. Consider the following example:

| True value | Stored value |
|---|---|
| 16,777,215 | 1.6777215E7 |
| 16,777,216 | 1.6777216E7 |
| 16,777,217 | 1.6777216E7 |
| 16,777,218 | 1.6777218E7 |
| 16,777,219 | 1.677722E7 |
| 16,777,220 | 1.677722E7 |
| 16,777,221 | 1.677722E7 |
| 16,777,222 | 1.6777222E7 |
| 16,777,223 | 1.6777224E7 |
| 16,777,224 | 1.6777224E7 |
| 16,777,225 | 1.6777224E7 |

According to the standard, values that cannot be stored exactly must be rounded to the nearest representable value. This resembles familiar decimal rounding: less than half rounds down, more than half rounds up. For binary floating-point numbers there is an extra rule: when the value to be rounded lies exactly halfway between the two nearest representable values, the one whose last significant bit is zero (i.e. even) is chosen. As the example above shows, a halfway value is sometimes rounded down and sometimes rounded up, always landing on the even neighbour. We can think of this rounding error as a "half digit" error; to avoid the confusion the figure 7.22 causes for many readers, some articles therefore quote 7.5 digits for the precision of single-precision floating-point numbers. Note: the rounding rule used here is often called round to even. Compared with the rule that simply rounds halves up, rounding to even tends to reduce the accumulation of rounding error in computation, which is why the IEEE standard adopted it.
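The table and the round-to-even rule can both be reproduced in Java (Math.rint applies the same half-to-even rule to decimal halves):

```java
public class RoundingDemo {
    public static void main(String[] args) {
        System.out.println(16777216f == 16777215f + 1.0f); // true: 2^24 is still exact
        System.out.println(16777217f == 16777216f);        // true: halfway, rounds to the even neighbour
        System.out.println(16777219f == 16777220f);        // true: halfway, rounds up to the even neighbour
        System.out.println(Math.rint(2.5));                // 2.0: half rounds to even
        System.out.println(Math.rint(3.5));                // 4.0
    }
}
```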
