# Detailed description of Floating Point Theory

Source: Internet
Author: User
Contents
1. What is a floating point number?
2. IEEE floating point number
3. Transformation between a real number and a floating point number
4. Special values
4.1. NaN
4.2. Infinity
4.3. Signed zero
4.4. Denormalized numbers
5. References
1. What is a floating point number?

Over the development of computer systems, several methods have been proposed to represent real numbers. A typical example is the fixed point number, as opposed to the floating point number. In this representation, the radix point is fixed at a certain position among the digits of the number. Currency values can be expressed this way: for example, 99.00 or 00.99 expresses a value with four digits of precision, two of which follow the decimal point. Because the position of the decimal point is fixed, four digits are enough to store the value. The NUMBER data type in SQL is defined as a fixed point type. Another proposed representation is the rational number, which expresses a real number as the ratio of two integers.

The disadvantage of the fixed point representation is that its form is too rigid. The fixed position of the radix point determines a fixed number of digits for the integer part and the fractional part, which makes it hard to represent very large and very small numbers at the same time. In the end, most modern computer systems adopted the floating point representation. It expresses real numbers in scientific notation, using a mantissa, a base, an exponent, and a sign. For example, 123.45 can be written as $1.2345 \times 10^2$ in decimal scientific notation, where 1.2345 is the mantissa, 10 is the base, and 2 is the exponent. The exponent lets the radix point float, so a floating point number can flexibly represent a much wider range of real numbers.

Tip: The mantissa is also known as the significand. "Mantissa" is actually an informal name for the significand.

The same value can be written as a floating point number in several ways. In the example above, 123.45 can be written as $12.345 \times 10^1$, $0.12345 \times 10^3$, or $1.2345 \times 10^2$. Because of this, a normalized form is needed to obtain a unique representation. A normalized floating point number has the following form:

$$\pm d.dd \ldots d \times \beta^{E}, \quad (0 \le d_i < \beta)$$

Here $d.dd \ldots d$ is the mantissa, $\beta$ is the base, and $E$ is the exponent. The number of digits in the mantissa is called the precision, written $p$ in this article. Each digit $d_i$ is at least 0 and less than the base. The digit to the left of the radix point must not be 0.

The value represented by a floating point number in this normalized form is given by the following expression:

$$\pm \left(d_0 + d_1 \beta^{-1} + \cdots + d_{p-1} \beta^{-(p-1)}\right) \beta^{E}, \quad (0 \le d_i < \beta)$$

For a decimal floating point number, that is, one whose base $\beta$ is 10, the expression above is straightforward. Numbers inside a computer are binary. The same expression shows that a binary number can also have a radix point and a fractional part: here $\beta$ equals 2 and each digit $d_i$ can only be 0 or 1. For example, the binary number 1001.101 equals $1 \times 2^3 + 0 \times 2^2 + 0 \times 2^1 + 1 \times 2^0 + 1 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3}$, which is 9.625 in decimal. Its normalized floating point form is $1.001101 \times 2^3$.
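The positional sum for the binary number 1001.101 can be checked with a few lines of Java. This is only a minimal sketch: the class name `BinaryPositional` and the helper `parseBinary` are made up for illustration and are not part of any library.

```java
public class BinaryPositional {
    // Evaluates a binary literal such as "1001.101" by summing digit * 2^position.
    // Hypothetical helper for illustration; it assumes the text contains only 0, 1 and at most one '.'.
    static double parseBinary(String text) {
        int point = text.indexOf('.');
        String intPart = point < 0 ? text : text.substring(0, point);
        String fracPart = point < 0 ? "" : text.substring(point + 1);

        double value = 0.0;
        // Integer digits: accumulate left to right, doubling at each step.
        for (char c : intPart.toCharArray()) {
            value = value * 2 + (c - '0');
        }
        // Fraction digits: weights 2^-1, 2^-2, 2^-3, ...
        double weight = 0.5;
        for (char c : fracPart.toCharArray()) {
            value += (c - '0') * weight;
            weight /= 2;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(parseBinary("1001.101")); // prints 9.625
    }
}
```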

2. IEEE floating point number

In a computer, a floating point number is stored in a finite number of consecutive bytes, in a specific format. The `float` and `double` types on the Java platform use the single-precision 32-bit and double-precision 64-bit formats defined in the IEEE 754 standard.

Note: The Java platform also supports two extended formats, float-extended-exponent and double-extended-exponent. They are not covered here; if you are interested, see the relevant references.

In the IEEE standard, the bits of a fixed number of consecutive bytes are divided into a sign field, an exponent field, and a mantissa field, each of a specific width. The values stored in these fields represent the sign, exponent, and mantissa of a given binary floating point number. A value is thus expressed through a mantissa and an adjustable exponent (hence the name "floating point"). From the most significant bit down, the layout is: sign (1 bit), exponent (8 bits in single precision, 11 in double), mantissa (23 bits in single precision, 52 in double).

The first field is the sign field: 0 indicates a positive number and 1 indicates a negative number.

The second field is the exponent field, which corresponds to the exponent in the binary scientific notation introduced earlier. It is 8 bits wide in single precision and 11 bits wide in double precision. Taking single precision as an example, an 8-bit field can hold 256 values, from 0 to 255. The exponent, however, can be positive or negative. To handle negative exponents, a bias is added to the actual exponent, and the sum is what gets stored in the exponent field. The bias is 127 for single precision and 1023 for double precision. For example, an actual single-precision exponent of 0 is stored as 127, and a stored value of 64 represents the actual exponent 64 − 127 = −63. With the bias, the range of actual exponents for single precision becomes −127 to 128 (inclusive). As we will see shortly, the actual exponents −127 (stored as all 0s) and +128 (stored as all 1s) are reserved for special values. The usable exponent range is therefore −126 to 127. In this article, the minimum and maximum usable exponents are written $E_{min}$ and $E_{max}$.

The third field is the mantissa field, 23 bits long in single precision and 52 bits long in double precision. Except for the special values discussed later, the IEEE standard requires floating point numbers to be normalized, which means the digit to the left of the binary point in the mantissa must be 1. When storing the mantissa we can therefore omit this leading 1, freeing one bit for more mantissa digits. In effect, the 23-bit mantissa field expresses a 24-bit mantissa. For example, in single precision the binary value 1001.101 (9.625 in decimal) is $1.001101 \times 2^3$, so the value stored in the mantissa field is 00110100000000000000000: the 1 to the left of the binary point is dropped and zeros are appended on the right.
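The three fields can be inspected directly from Java. The sketch below uses the standard `Float.floatToIntBits` method; the class `FloatFields` and its helper methods are illustrative names, not an established API.

```java
public class FloatFields {
    // Sign field: the top bit of the 32-bit pattern.
    static int sign(float f) {
        return Float.floatToIntBits(f) >>> 31;
    }

    // Exponent field: the next 8 bits, still including the bias of 127.
    static int biasedExponent(float f) {
        return (Float.floatToIntBits(f) >>> 23) & 0xFF;
    }

    // Mantissa field: the low 23 bits (the leading 1 is not stored).
    static int mantissaField(float f) {
        return Float.floatToIntBits(f) & 0x7FFFFF;
    }

    // Formats a field as a binary string padded with leading zeros to the given width.
    static String bits(int value, int width) {
        return String.format("%" + width + "s", Integer.toBinaryString(value)).replace(' ', '0');
    }

    public static void main(String[] args) {
        float f = 9.625f;
        System.out.println("sign     = " + sign(f));
        System.out.println("exponent = " + bits(biasedExponent(f), 8)
                + " (actual exponent " + (biasedExponent(f) - 127) + ")");
        System.out.println("mantissa = " + bits(mantissaField(f), 23));
    }
}
```

For 9.625 the mantissa line shows the same 23 bits as the worked example above, and the exponent field holds 3 + 127 = 130.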

It is worth noting that a single-precision number has only a 24-bit mantissa (one bit of which is hidden), so the largest integer that fits in the mantissa is $2^{24} - 1$ = 16,777,215. The next integer, 16,777,216, is even, so it can still be stored exactly by halving the mantissa and increasing the exponent by one. By contrast, 16,777,217 cannot be stored exactly. It follows that a single-precision floating point number carries no more than 8 significant decimal digits. In fact, numerical analysis of the relative error shows that the effective precision is about 7.22 decimal digits. See the following example:

| True value | Stored value |
|------------|--------------|
| 16,777,215 | 1.6777215E7 |
| 16,777,216 | 1.6777216E7 |
| 16,777,217 | 1.6777216E7 |
| 16,777,218 | 1.6777218E7 |
| 16,777,219 | 1.677722E7 |
| 16,777,220 | 1.677722E7 |
| 16,777,221 | 1.677722E7 |
| 16,777,222 | 1.6777222E7 |
| 16,777,223 | 1.6777224E7 |
| 16,777,224 | 1.6777224E7 |
| 16,777,225 | 1.6777224E7 |
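The values in this table can be reproduced by converting each integer to `float` and back. This is a small sketch; `FloatIntegerLimit` and `roundTrip` are illustrative names. It relies on the fact that Java converts `int` to `float` with round-to-nearest.

```java
public class FloatIntegerLimit {
    // Converts an int to float and back, exposing the rounding done by the first conversion.
    static int roundTrip(int value) {
        return (int) (float) value;
    }

    public static void main(String[] args) {
        // Walk across 2^24 = 16,777,216, where odd integers stop being representable.
        for (int v = 16_777_215; v <= 16_777_225; v++) {
            System.out.println(v + " -> " + roundTrip(v));
        }
    }
}
```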

According to the standard, a value that cannot be stored exactly must be rounded to the nearest value that can be stored. This is much like the familiar decimal rounding: round down when the discarded part is less than half, and round up when it is half or more. For binary floating point numbers there is one extra rule: when the value to be rounded lies exactly halfway, it is not simply rounded up. Instead, of the two equally distant storable values, the one whose last significant bit is 0 is chosen. In the example above, every odd number is rounded to an even number, sometimes up and sometimes down. We can regard this rounding error as an error of half a unit in the last place. Therefore, to avoid the confusion caused by the figure 7.22, some articles describe the precision of single-precision floating point numbers as 7.5 digits.

Tip: The rounding rule used here is sometimes called round to even. Compared with simply rounding halves up, rounding to even helps reduce the accumulation of rounding errors in a computation, which is why the IEEE standard adopts it.

3. Transformation between a real number and a floating point number

Now that we understand the IEEE representation of floating point numbers, let's do some conversion exercises between real numbers and floating point numbers to deepen that understanding. These exercises also reveal some surprising facts about floating point arithmetic.

Let's start with the easy direction, from a floating point number to a real number. Once the format is understood, this exercise is not difficult. Suppose we have 32 bits of data, written 0xC0B40000 in hexadecimal, and we know it is a single-precision floating point number. To obtain the real number it represents, we first convert it to binary:

| C | 0 | B | 4 | 0 | 0 | 0 | 0 |
|------|------|------|------|------|------|------|------|
| 1100 | 0000 | 1011 | 0100 | 0000 | 0000 | 0000 | 0000 |

Then, it is divided into the corresponding fields according to the floating point format:

`1   10000001 01101000000000000000000`

The sign field is 1, which indicates a negative number. The exponent field is 10000001, or 129, so the actual exponent is 2 (129 minus the bias of 127). The mantissa field is 01101 followed by zeros, so the actual binary mantissa is 1.01101 (with the implicit 1 restored to the left of the binary point). The real number represented is therefore:

$$-1.01101_2 \times 2^{2} = -\left(2^{0} + 2^{-2} + 2^{-3} + 2^{-5}\right) \times 2^{2} = -5.625$$
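The same decoding can be written out in Java and compared with the library method `Float.intBitsToFloat`. The class `DecodeFloat` and the helper `decodeNormalized` are illustrative names. The helper is a sketch that handles only normalized values, not the special values discussed later.

```java
public class DecodeFloat {
    // Rebuilds the value of a normalized single-precision bit pattern from its three fields.
    // Sketch only: zero, denormalized numbers, infinity and NaN are not handled.
    static double decodeNormalized(int bits) {
        int sign = bits >>> 31;
        int exponent = ((bits >>> 23) & 0xFF) - 127;             // remove the bias
        int fraction = bits & 0x7FFFFF;                          // the 23 stored bits
        double mantissa = 1.0 + fraction / (double) (1 << 23);   // restore the hidden 1
        double magnitude = mantissa * Math.pow(2, exponent);
        return sign == 0 ? magnitude : -magnitude;
    }

    public static void main(String[] args) {
        int bits = 0xC0B40000;
        System.out.println(decodeNormalized(bits));      // manual decoding, prints -5.625
        System.out.println(Float.intBitsToFloat(bits));  // library decoding, prints -5.625
    }
}
```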

Converting a real number to a floating point number is a little trickier. Suppose we need to express the real number −9.625 in single-precision floating point format. The method is to first write it as a binary floating point number and then convert that to the corresponding storage format.

First, convert the integer to the left of the decimal point to binary: 9 is 1001. The fractional part is handled by repeatedly multiplying it by the base 2: record the integer part of the product, then multiply the fractional part of the product by 2 again, and so on:

| Multiplication | Integer part |
|------------------|--------------|
| 0.625 × 2 = 1.25 | 1 |
| 0.25 × 2 = 0.5 | 0 |
| 0.5 × 2 = 1 | 1 |
| 0 | |

When the remaining fractional part is zero, the process ends. The column of digits on the right, read from top to bottom, is the binary fraction we need, namely 0.101. This gives the complete binary form 1001.101, which is $1.001101 \times 2^3$ as a normalized floating point number.
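The repeated doubling procedure can be sketched as a short Java method. `FractionToBinary` and `fractionBits` are illustrative names, and the `maxDigits` cap is an added assumption so that the loop also stops for fractions whose binary expansion does not terminate.

```java
public class FractionToBinary {
    // Repeatedly doubles the fraction; the integer part of each product is the next binary digit.
    // Stops when the fraction reaches zero or when maxDigits digits have been produced.
    static String fractionBits(double fraction, int maxDigits) {
        StringBuilder bits = new StringBuilder();
        while (fraction != 0.0 && bits.length() < maxDigits) {
            fraction *= 2;
            if (fraction >= 1.0) {
                bits.append('1');
                fraction -= 1.0;
            } else {
                bits.append('0');
            }
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        // 9.625: integer part 9 is 1001, fractional part 0.625 is .101
        System.out.println(Integer.toBinaryString(9) + "." + fractionBits(0.625, 23)); // 1001.101
        // 0.1 keeps producing digits until the cap is reached
        System.out.println("0." + fractionBits(0.1, 23));
    }
}
```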

Because the number is negative, the sign field is 1. The exponent is 3, so the exponent field is 3 + 127 = 130, which is 10000010 in binary. The mantissa drops the 1 to the left of the binary point, leaving 001101, and is padded with zeros on the right. The final result is:

`1 10000010 00110100000000000000000`

Finally, grouping the bits in fours gives the hexadecimal form of the floating point data:

| 1100 | 0001 | 0001 | 1010 | 0000 | 0000 | 0000 | 0000 |
|------|------|------|------|------|------|------|------|
| C | 1 | 1 | A | 0 | 0 | 0 | 0 |

The final result is 0xc11a0000.
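As a cross-check, the three fields can be packed by hand and compared with what `Float.floatToIntBits` reports for −9.625. This is a sketch; `EncodeFloat` and `encode` are illustrative names, and the helper assumes the caller passes an already normalized exponent and the 23 stored mantissa bits.

```java
public class EncodeFloat {
    // Packs the sign, the actual (unbiased) exponent and the 23 stored mantissa bits
    // into a single 32-bit pattern.
    static int encode(int sign, int exponent, int mantissaField) {
        return (sign << 31) | ((exponent + 127) << 23) | mantissaField;
    }

    public static void main(String[] args) {
        int mantissaField = 0b00110100000000000000000; // 001101 padded to 23 bits
        int bits = encode(1, 3, mantissaField);
        System.out.println(Integer.toHexString(bits));                           // c11a0000
        System.out.println(Integer.toHexString(Float.floatToIntBits(-9.625f)));  // c11a0000
    }
}
```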

Simple? Not so fast! As you may have noticed, the deliberately chosen example above hides a fact about the process of repeatedly multiplying the fractional part by 2: the process ends only when a product is exactly 1. It is easy to see that for many decimal fractions this never happens in a finite number of steps (the simplest example is 0.1). We already know that the mantissa field has a limited number of bits. The floating point approach is therefore to continue the process until enough digits have been produced to fill the mantissa field, and then round off the extra digits. In other words, in addition to the precision limit discussed earlier, the conversion from decimal to binary is not always exact and often yields only an approximation. In fact, only a small share of decimal fractions have an exact binary floating point representation. Together with the errors that accumulate during floating point operations, this is why many decimal calculations that look simple produce unexpected results on a computer. This is the most common form of floating point "inaccuracy". See the following Java example:

`System.out.print("34.6-34.0=" + (34.6f-34.0f));`

The output of this code is:

`34.6-34.0=0.5999985`

The cause of this error is that 34.6 cannot be represented exactly as a floating point number and is stored as a rounded approximation. An operation between this approximation and 34.0 naturally does not produce the exact result.
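To see the approximation itself, the stored value can be expanded exactly with `java.math.BigDecimal`, whose `BigDecimal(double)` constructor performs an exact conversion. The class `InexactDecimal` and the helper `exactValue` are illustrative names in this sketch.

```java
import java.math.BigDecimal;

public class InexactDecimal {
    // Returns the exact decimal expansion of the value actually stored in a float.
    static String exactValue(float f) {
        return new BigDecimal(f).toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(exactValue(34.6f));  // slightly below 34.6
        System.out.println(exactValue(34.0f));  // exactly 34
        System.out.println(34.6f - 34.0f);      // 0.5999985
    }
}
```

The first line shows that the stored value lies a little below 34.6, which is why the difference comes out a little below 0.6.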

4. Special values

With the introduction so far, you should have a basic understanding of floating point numbers, which is sufficient for anyone who does not work with them closely. However, if you are curious, or you are facing a tricky floating point application, this section describes some notable properties of floating point numbers.

We already know that the exponent field of a single-precision number can express values from −127 to 128 (inclusive). Of these, −127 (stored as all 0s) and +128 (stored as all 1s) are reserved for special values. This section describes the special values defined in the IEEE standard.

Special values are mainly used to handle special cases or errors. For example, when a program takes the square root of a negative number, a special return value marks the error: NaN (not a number). Without such a value, the computation could only be aborted abruptly when such an error occurs. Besides NaN, the IEEE standard also defines ±0, ±∞, and denormalized numbers.

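For single-precision floating point numbers, all these special values are encoded with the reserved exponent values −127 and 128. If $E_{min}$ and $E_{max}$ denote the bounds of the ordinary exponent values, that is, −126 and 127, the reserved exponent values can be written $E_{min} - 1$ and $E_{max} + 1$. In this notation, the special values of the IEEE standard are as follows:

| Exponent | Fraction | Represents |
|----------|----------|------------|
| $E_{min} \le e \le E_{max}$ | any | $1.f \times 2^{e}$ |
| $e = E_{min} - 1$ | $f = 0$ | ±0 |
| $e = E_{min} - 1$ | $f \ne 0$ | $0.f \times 2^{E_{min}}$ |
| $e = E_{max} + 1$ | $f = 0$ | ±∞ |
| $e = E_{max} + 1$ | $f \ne 0$ | NaN |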

Here $f$ denotes the fraction part of the mantissa, to the right of the binary point. The first row is the normalized floating point number introduced earlier. The remaining special values are described one by one below.

4.1. NaN

NaN is used to handle errors in a computation, such as dividing 0.0 by 0.0 or taking the square root of a negative number. As the table above shows, for a single-precision floating point number NaN is represented by the exponent $E_{max} + 1 = 128$ (exponent field all 1s) and a nonzero mantissa field. The IEEE standard does not require a specific mantissa field value, so NaN is actually not a single value but a family of values. Different implementations are free to choose the mantissa field value used for NaN. For example, the Java constant `Float.NaN` may be represented as 01111111110000000000000000000000, where the first bit of the mantissa field is 1 and the rest are 0 (not counting the hidden bit), but this depends on the hardware architecture of the system. Java even allows programmers to construct NaN values with specific bit patterns themselves (through the `Float.intBitsToFloat()` method). Programmers can, for example, use such custom bit patterns to carry diagnostic information.

A custom NaN value is recognized as NaN by the `Float.isNaN()` method, but it is not equal to the `Float.NaN` constant. In fact, all NaN values are unordered. The numeric comparison operators `<`, `<=`, `>` and `>=` return false when either operand is NaN. The equality operator `==` returns false when either operand is NaN, even if both are NaN with the same bit pattern. The operator `!=` returns true when either operand is NaN. An interesting consequence of this rule is that `x != x` is true when x is NaN.
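These comparison rules are easy to observe in Java. In the sketch below, the bit pattern 0x7FC00123 is an arbitrary example of a NaN with a nonzero payload, and `NaNComparisons` and `isNaNByComparison` are illustrative names.

```java
public class NaNComparisons {
    // True only for NaN, the only value that is not equal to itself.
    static boolean isNaNByComparison(float x) {
        return x != x;
    }

    public static void main(String[] args) {
        // Exponent field all 1s and a nonzero mantissa field: some member of the NaN family.
        float custom = Float.intBitsToFloat(0x7FC00123);

        System.out.println(Float.isNaN(custom));              // true
        System.out.println(custom == Float.NaN);              // false
        System.out.println(custom == custom);                 // false
        System.out.println(custom < 1.0f || custom >= 1.0f);  // false
        System.out.println(isNaNByComparison(custom));        // true
    }
}
```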

Operations that can generate NaN include those already mentioned: dividing zero by zero and taking the square root of a negative number.

In addition, any arithmetic operation with NaN as an operand also produces NaN. The point of representing these errors with the special value NaN is to avoid terminating a computation unnecessarily because of them. For example, if a floating point routine called in a loop can hit such an error for certain inputs, NaN allows the loop to simply continue with the iterations that have no errors. You may think that since Java has an exception handling mechanism, catching and ignoring exceptions would achieve the same effect. However, the IEEE standard was not designed for Java alone, and exception handling differs from language to language, which would make code harder to port. Besides, not all languages have a comparable exception or signal handling mechanism.

Note: In Java, unlike the floating point case, dividing the integer 0 by 0 throws a `java.lang.ArithmeticException`.
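The sketch below lists a few operations that yield NaN in Java and contrasts them with integer division. Besides the two cases named in the text, it includes infinity minus infinity and zero times infinity, which also produce NaN under IEEE 754. `NaNSources` and `intDivisionThrows` are illustrative names.

```java
public class NaNSources {
    // Reports whether dividing the given integer by integer zero throws ArithmeticException.
    static boolean intDivisionThrows(int dividend) {
        int divisor = 0;
        try {
            int unused = dividend / divisor;
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        float inf = Float.POSITIVE_INFINITY;
        System.out.println(0.0f / 0.0f);           // NaN
        System.out.println(Math.sqrt(-1.0));       // NaN
        System.out.println(inf - inf);             // NaN
        System.out.println(0.0f * inf);            // NaN
        System.out.println(Float.NaN + 1.0f);      // NaN (NaN propagates through arithmetic)
        System.out.println(intDivisionThrows(0));  // true
    }
}
```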

4.2. Infinity

Like NaN, the special value infinity has the exponent $E_{max} + 1 = 128$, but its mantissa field must be zero. Infinity is used to express overflow in a computation. For example, when two extremely large numbers are multiplied, both operands may be storable as floating point numbers while the result is too large to be stored and must be rounded. According to the IEEE standard, the result is not rounded to the largest storable floating point number (which could be too far from the actual result to be meaningful) but to infinity. The same holds for negative results, which are rounded to negative infinity, that is, infinity with the sign field set to 1. After the discussion of NaN, it is easy to see that the special value infinity means an overflow no longer has to terminate the computation.

Like every floating point value other than NaN, infinity is ordered. From smallest to largest, the order is: negative infinity, negative finite nonzero values, positive and negative zero (introduced later), positive finite nonzero values, and positive infinity. When any nonzero value other than NaN is divided by zero, the result is infinity, with its sign determined by the signs of the dividend and of the zero used as the divisor.

Recall from the discussion of NaN that when zero is divided by zero the result is not infinity but NaN. The reason is not hard to understand: when the dividend and the divisor both approach zero, the quotient can be any value, so the IEEE standard decided that NaN is the appropriate quotient in this case.
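Overflow to infinity, division by zero, and the ordering of infinity can all be observed with `float` arithmetic in Java. This is a minimal sketch; `InfinityDemo` and `overflowProduct` are illustrative names.

```java
public class InfinityDemo {
    // The product is too large for a float, so the result is rounded to infinity.
    static float overflowProduct() {
        return Float.MAX_VALUE * 2.0f;
    }

    public static void main(String[] args) {
        System.out.println(overflowProduct());        // Infinity
        System.out.println(-Float.MAX_VALUE * 2.0f);  // -Infinity
        System.out.println(1.0f / 0.0f);              // Infinity
        System.out.println(-1.0f / 0.0f);             // -Infinity
        System.out.println(1.0f / -0.0f);             // -Infinity
        System.out.println(0.0f / 0.0f);              // NaN, not infinity

        // Infinity takes part in ordinary ordering.
        System.out.println(Float.NEGATIVE_INFINITY < -Float.MAX_VALUE);  // true
        System.out.println(Float.MAX_VALUE < Float.POSITIVE_INFINITY);   // true
    }
}
```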

4.3. Signed zero

In the IEEE floating point format, the 1 to the left of the binary point is implicit, whereas zero obviously requires the whole mantissa to be zero. Zero therefore cannot be represented directly in this format and needs special treatment.

In fact, zero is stored with a mantissa field of all 0s and the exponent $E_{min} - 1 = -127$, that is, an exponent field of all 0s. Taking the sign field into account, there are two zeros, +0 and −0. Unlike positive and negative infinity, which are ordered, the IEEE standard specifies that positive zero and negative zero are equal.

That zero has a sign is indeed confusing. The choice is the result of weighing several pros and cons from numerical analysis. A signed zero avoids losing sign information in computations, especially those involving infinity. For example, if zero had no sign, the identity 1/(1/x) = x would no longer hold for x = ±∞: 1 divided by either positive or negative infinity would give the same zero, and 1 divided by that zero would give positive infinity, so the sign would be lost. The only way around this would be to make infinity unsigned as well. But the sign of infinity tells us on which side of the number line the overflow occurred, and that information clearly should not be thrown away. Signed zero brings problems of its own. For example, x = y no longer implies 1/x = 1/y: when x and y are +0 and −0, the two sides are positive infinity and negative infinity. Of course, another way to solve this would be to make the zeros ordered, as the infinities are. But if the zeros were ordered, even a simple test such as `if (x == 0)` would become uncertain, because x could be either of ±0. Of the two evils, the standard chose the lesser.
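The behaviour of the two zeros can be checked in Java as follows. This is a sketch; `SignedZero` and `reciprocalOfReciprocal` are illustrative names.

```java
public class SignedZero {
    // Computes 1/(1/x); with signed zeros the sign of an infinite x survives the round trip.
    static float reciprocalOfReciprocal(float x) {
        return 1.0f / (1.0f / x);
    }

    public static void main(String[] args) {
        float posZero = 0.0f;
        float negZero = -0.0f;

        System.out.println(posZero == negZero);  // true: the two zeros compare equal
        System.out.println(1.0f / posZero);      // Infinity
        System.out.println(1.0f / negZero);      // -Infinity

        System.out.println(reciprocalOfReciprocal(Float.POSITIVE_INFINITY));  // Infinity
        System.out.println(reciprocalOfReciprocal(Float.NEGATIVE_INFINITY));  // -Infinity

        // Only the sign bit distinguishes the two encodings.
        System.out.println(Integer.toHexString(Float.floatToIntBits(negZero)));  // 80000000
    }
}
```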

4.4. Denormalized numbers

Let's look at a special situation. Take two floating point numbers with extremely small absolute values, for example the single-precision binary numbers $1.001 \times 2^{-125}$ and $1.0001 \times 2^{-125}$ (about $2.6448623 \times 10^{-38}$ and $2.4979255 \times 10^{-38}$ in decimal). Both are clearly ordinary normalized floating point numbers (the exponent −125 is greater than the minimum allowed value −126, and the mantissas pose no problem), and according to IEEE 754 they are stored as 00000001000100000000000000000000 (0x1100000) and 00000001000010000000000000000000 (0x1080000).

Now consider the difference between these two numbers. It is easy to see that the difference is $0.0001 \times 2^{-125}$, which is $1.0 \times 2^{-129}$ as a normalized floating point number. The problem is that this exponent is smaller than the minimum allowed exponent, so the value cannot be stored as a normalized floating point number and can only be approximated by zero (flush to zero). Without special handling, even the following seemingly reliable code can fail:

`if (x != y) { z = 1 / (x - y); }`

As with the two carefully chosen numbers above, even though x is not equal to y, the difference x − y may be so small in absolute value that it is flushed to zero, resulting in a division by 0.

To solve this problem, the IEEE standard introduces denormalized floating point numbers. It specifies that when the exponent of a floating point number is the minimum allowed exponent, $E_{min}$, the mantissa need not be normalized. For example, the difference in the example above can be expressed as the denormalized floating point number $0.001 \times 2^{-126}$, whose exponent −126 equals $E_{min}$. Note that the rule says "need not", which also means "may": when the actual exponent of a floating point number is $E_{min}$ and its exponent field also encodes $E_{min}$, the number is still normalized, that is, it is stored with a hidden leading mantissa bit. To store denormalized numbers, the IEEE standard uses an approach similar to the one used for zero: the special exponent field value $E_{min} - 1$ serves as the marker, and in this case the mantissa field must not be zero. The difference in the example can thus be stored as 00000000000100000000000000000000 (0x100000), with no implicit mantissa bit.

With denormalized floating point numbers, the implicit leading bit restriction is lifted and floating point numbers with smaller absolute values can be stored. Moreover, because the implicit leading bit no longer applies, the problem with extremely small differences described above disappears: the difference between two distinct storable floating point numbers no longer underflows to zero.
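The example from this section can be replayed in Java, which implements IEEE 754 denormalized numbers for `float`. The sketch builds the two operands from their bit patterns; `GradualUnderflow` and `difference` are illustrative names.

```java
public class GradualUnderflow {
    // Subtracts two floats given by their bit patterns.
    static float difference(int xBits, int yBits) {
        return Float.intBitsToFloat(xBits) - Float.intBitsToFloat(yBits);
    }

    public static void main(String[] args) {
        int xBits = 0x01100000;  // 1.001 (binary) x 2^-125
        int yBits = 0x01080000;  // 1.0001 (binary) x 2^-125
        float diff = difference(xBits, yBits);  // 2^-129, stored as a denormalized number

        System.out.println(diff != 0.0f);             // true: the difference is not flushed to zero
        System.out.println(diff < Float.MIN_NORMAL);  // true: below the smallest normalized float
        System.out.println(Integer.toHexString(Float.floatToIntBits(diff)));  // 100000
    }
}
```

The printed bit pattern matches the encoding 0x100000 derived above: an exponent field of all 0s and a nonzero mantissa field.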
