Representation accuracy of IEEE 754 floating point numbers
Preface
I have seen many programmers on the Internet ask questions about floating-point accuracy. I once posted such a question on a forum myself, and many enthusiastic people answered it, but some of the answers contained small errors and misunderstandings. I have also written a numerical algorithm program; although it basically works, it has been plagued by floating-point accuracy problems. Afterwards I spent some time collecting material and studying the subject carefully, and I would like to share the result here, in the hope of giving a more detailed explanation of the representation accuracy of binary floating point numbers in the IEEE 754 standard and of related issues. Of course, any errors in this article are my own responsibility, which I hereby declare.
1. What is the IEEE 754 standard?
Almost all hardware and software that supports binary floating point numbers today claims in its documentation that its floating point implementation complies with the IEEE 754 standard. So what is the IEEE 754 standard?
The most authoritative source is the standard itself: ANSI/IEEE Std 754-1985, IEEE Standard for Binary Floating-Point Arithmetic. PDF copies are easy to find on the Internet; just search for it and download one. The standard is written in English and is only 23 pages long, so read it carefully if you have the patience. Here is an excerpt from its foreword:
This standard defines a family of commercially feasible ways for new systems to perform binary floating-point arithmetic.
Frankly, that sentence by itself says almost nothing.
William Kahan, a professor of mathematics at the University of California, Berkeley, helped Intel design the 8087 floating point coprocessor (FPU), on which the IEEE 754 standard is largely based, and he received the Turing Award in 1989 for this work. The IEEE 754 floating point format is indeed a work of genius. Professor Kahan's homepage: http://www.cs.berkeley.edu/~wkahan/.
Let's take a look at other documents.
2. What are the provisions of the IEEE 754 standard?
The following content comes from Sun's Numerical Computation Guide (Sun Studio 11). I read the Chinese edition and have added my own notes. To be honest, the Chinese translation is not very good; for example, "round" is translated as "round half up" (the schoolbook rule), which is misleading.
IEEE 754 rules:
A) two basic floating point formats: single precision and double precision.
The IEEE single-precision format has a 24-bit significand and occupies 32 bits in total. The IEEE double-precision format has a 53-bit significand and occupies 64 bits in total.
Note: The basic floating point formats are fixed. The corresponding decimal precision is about 7 significant digits for single precision and 15-16 for double precision (17 digits suffice to recover the exact binary value). The C/C++ types corresponding to the basic formats are float and double.
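A quick way to check these figures on your own machine is to read the constants in <float.h>. A minimal sketch (the commented values assume IEEE 754 float and double, which is what virtually every current platform provides):

```c
#include <stdio.h>
#include <float.h>

int main(void)
{
    /* Significand width in bits: 24 for IEEE single, 53 for IEEE double. */
    printf("FLT_MANT_DIG = %d, DBL_MANT_DIG = %d\n", FLT_MANT_DIG, DBL_MANT_DIG);
    /* Decimal digits guaranteed to survive a decimal -> binary -> decimal round trip. */
    printf("FLT_DIG = %d, DBL_DIG = %d\n", FLT_DIG, DBL_DIG);
    /* Storage size: 4 bytes (32 bits) and 8 bytes (64 bits). */
    printf("sizeof(float) = %zu, sizeof(double) = %zu\n",
           sizeof(float), sizeof(double));
    return 0;
}
```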
B) Two extended floating point formats: single-extended and double-extended.
The standard does not pin down the exact precision and size of the extended formats, but it does specify minimum precision and size. For example, the IEEE double-extended format must have at least a 64-bit significand and occupy at least 79 bits in total.
Note: Although the IEEE 754 standard does not prescribe a specific extended format, an implementer may choose any format that meets these minimums, and once implemented the format is fixed. For example, the x86 FPU uses an 80-bit extended precision format, while the Intel Itanium FPU uses an 82-bit one; both comply with IEEE 754. The C/C++ type for double-extended precision is long double. However, the Microsoft Visual C++ compilers (version 6.0 and later) do not support the extended format: long double is treated the same as double, i.e. the 64-bit basic double precision, so the x86 extended format can only be reached from other C/C++ compilers or from assembly language.
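You can probe how your own compiler treats long double with a couple of lines; a small sketch, bearing in mind that the answer is compiler- and platform-dependent (gcc on x86 typically reports a 64-bit significand stored in a 12- or 16-byte object, while Microsoft Visual C++ reports the same 53 bits as double):

```c
#include <stdio.h>
#include <float.h>

int main(void)
{
    /* 64 for the x86 80-bit extended format; 53 when long double
       is merely an alias for double (e.g. Microsoft Visual C++). */
    printf("LDBL_MANT_DIG = %d\n", LDBL_MANT_DIG);
    printf("sizeof(long double) = %zu bytes\n", sizeof(long double));
    return 0;
}
```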
C) Accuracy requirements for floating point operations: add, subtract, multiply, divide, square root, remainder, round a floating point number to an integer, convert between the different floating point formats, convert between floating point and integer formats, and compare.
The remainder and comparison operations must be exact. Every other operation must deliver its exact result to the destination, unless that result does not exist or does not fit in the destination format. In the latter case, the operation must follow the rules of the rounding mode described below, altering the exact result as little as possible, and deliver that rounded result to the destination.
Note: IEEE 754 does not require the results of the basic arithmetic operations (+, -, ×, /, and so on) to be exact, because with the fixed-length binary floating point formats of IEEE 754 the exact result can rarely be represented at all. A three-digit decimal addition illustrates the point:
Example 1: A = 3.51, B = 0.234, A + B = ?
Both A and B have three significant digits, but the exact result of A + B is 3.744, which has four. In a floating point format with only three digits of precision, the result of A + B cannot be represented exactly; it can only be approximated, and the value actually delivered depends on the rounding mode (see the discussion of rounding modes below). For the same reason, because the floating point format has a fixed length, the results of the other basic operations can almost never be represented exactly either.
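The same effect is easy to reproduce in binary. A single precision float has a 24-bit significand, so the exact sum 2^24 + 1 = 16777217 needs 25 bits and must be rounded; a minimal sketch assuming IEEE single precision:

```c
#include <stdio.h>

int main(void)
{
    float a = 16777216.0f;  /* 2^24, exactly representable in a 24-bit significand */
    float b = 1.0f;
    float sum = a + b;      /* exact result 16777217 needs 25 bits, so it is rounded */

    printf("a + b  = %.1f\n", sum);      /* prints 16777216.0 */
    printf("equal? = %d\n", sum == a);   /* prints 1: the added 1.0 was rounded away */
    return 0;
}
```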
D) Accuracy, monotonicity, and consistency requirements for conversions between decimal strings and binary floating point numbers in either of the two basic formats.
For operands within the specified ranges, these conversions must produce exact results whenever possible, or else alter the exact result minimally according to the rules of the prevailing rounding mode. For operands outside those ranges, the result of the conversion may differ from the exact result by no more than a specified amount that depends on the rounding mode.
Note: This rule covers the conversion between numbers written as decimal strings and binary floating point numbers, and it is where programmers are most easily fooled, because we are so used to decimal that we assume every decimal number must be exactly representable in binary. The main purpose of this article is to show exactly which decimal numbers a binary floating point number can represent precisely. If you have never thought about it before, the answer may surprise you; I will keep you in suspense for a moment longer!
E) five types of IEEE floating point exceptions and conditions used to indicate to the user the occurrence of these types of exceptions.
The five floating point exception types are: invalid operation, division by zero, overflow, underflow, and inexact.
Note: For floating point exceptions, see Professor Kahan's lecture notes on IEEE 754.
F) Four rounding directions:
Round to the nearest representable value, choosing the one with an even last digit when two representable values are equally near; round toward negative infinity (down); round toward positive infinity (up); and round toward 0 (truncation).
Note: The rounding modes are another common source of misunderstanding. The rule we all learned in school is round half up. The IEEE 754 standard, however, does not include that mode; its default is round to nearest, which differs from round half up only when the part to be dropped is exactly .5: the tie is broken toward the even digit. For example:
Example 2:
Round to nearest (ties to even): round(0.5) = 0; round(1.5) = 2; round(2.5) = 2.
Round half up (school rounding): round(0.5) = 1; round(1.5) = 2; round(2.5) = 3.
Main reason: Because the word length is finite, the floating point numbers that can be represented exactly are finite in number and therefore discrete. Between any two adjacent representable floating point numbers lie infinitely many real numbers that IEEE floating point cannot represent exactly, so how should those numbers be stored? The IEEE 754 answer is to approximate each real number by the representable floating point number nearest to it. A value whose dropped part is exactly .5, however, is equally far from both candidates, so neither is "nearest". Round half up always picks the larger one; a bank may be happy to pay the extra half cent of interest, but the rule is not statistically sound, and if round half up is used throughout a long summation the error can keep growing in one direction. It is fairer to round such ties up half of the time and down half of the time, and since the even and odd candidates each occur with probability 50%, rounding ties to the even digit achieves this balance over a large number of operations. Knuth gives an example explaining why the even digit is a better choice than the odd one. There is no standard C/C++ function that performs round-to-nearest-even directly (C99's rint() comes closest: it rounds according to the current FPU rounding mode, which defaults to round to nearest). In any case, round to nearest is the default mode of both IEEE 754 and the x86 FPU, so the result of every floating point operation is rounded to nearest unless the program explicitly switches to one of the other three modes.
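A small C99 sketch of the difference: rint() follows the FPU's current rounding direction, which defaults to round to nearest (ties to even), while round() implements the familiar round-half-away-from-zero rule. Compile with -lm:

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double v[] = { 0.5, 1.5, 2.5, 3.5 };
    for (int i = 0; i < 4; ++i) {
        /* rint(): IEEE default, ties go to the even integer.
           round(): school rule, ties go away from zero. */
        printf("rint(%.1f) = %.0f   round(%.1f) = %.0f\n",
               v[i], rint(v[i]), v[i], round(v[i]));
    }
    return 0;
}
```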
The other three rounding modes can be described briefly:
Round toward 0 (truncation): C/C++ integer conversion, e.g. (int)1.324 = 1, (int)-1.324 = -1.
Round toward negative infinity (down): the C/C++ function floor(), e.g. floor(1.324) = 1, floor(-1.324) = -2.
Round toward positive infinity (up): the C/C++ function ceil(), e.g. ceil(1.324) = 2, ceil(-1.324) = -1.
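The FPU's rounding direction can also be switched at run time through the C99 <fenv.h> interface. The sketch below rounds the same values under all four IEEE directions; it assumes the platform defines all four FE_* macros (x86 does), and it should be compiled with -lm and without aggressive optimization, since the compiler may otherwise constant-fold the rint() calls:

```c
#include <stdio.h>
#include <math.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON

int main(void)
{
    const int   modes[] = { FE_TONEAREST, FE_TOWARDZERO, FE_DOWNWARD, FE_UPWARD };
    const char *names[] = { "to nearest", "toward zero", "downward", "upward" };

    for (int i = 0; i < 4; ++i) {
        fesetround(modes[i]);             /* select the IEEE rounding direction */
        printf("%-12s rint(2.5) = %4.1f   rint(-2.5) = %4.1f\n",
               names[i], rint(2.5), rint(-2.5));
    }
    fesetround(FE_TONEAREST);             /* restore the default mode */
    return 0;
}
```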
The last two rounding modes are said to exist for interval arithmetic in numerical computation, but I have rarely heard of commercial software that actually uses interval arithmetic.
3. Converting decimal fractions to binary fractions
First, let us look at how decimal and binary numbers are converted into each other. A subscript will denote the base: $(\cdot)_{10}$ is a decimal number and $(\cdot)_2$ is a binary number. A decimal number with $n+1$ integer digits and $m$ fractional digits is written $(d_n d_{n-1} \cdots d_1 d_0 . d_{-1} d_{-2} \cdots d_{-m})_{10}$ and denotes:
Example 3: $(d_n d_{n-1} \cdots d_1 d_0 . d_{-1} d_{-2} \cdots d_{-m})_{10} = \sum_{i=-m}^{n} d_i \times 10^i$
Similarly, a binary number with $n+1$ integer digits and $m$ fractional digits is written $(b_n b_{n-1} \cdots b_1 b_0 . b_{-1} b_{-2} \cdots b_{-m})_2$ and denotes:
Example 4: $(b_n b_{n-1} \cdots b_1 b_0 . b_{-1} b_{-2} \cdots b_{-m})_2 = \sum_{i=-m}^{n} b_i \times 2^i$
Converting a binary number to a decimal number is easy: simply evaluate the sum in Example 4.
To convert a decimal number to binary, convert the integer part and the fractional part separately: repeatedly divide the integer part by 2 and collect the remainders, and repeatedly multiply the fractional part by 2 and collect the integer parts.
Example 5: convert $(13.125)_{10}$ to a binary number.
Integer part: $13 \div 2 = 6$ remainder 1, $6 \div 2 = 3$ remainder 0, $3 \div 2 = 1$ remainder 1, $1 \div 2 = 0$ remainder 1, so $(13)_{10} = (1101)_2$. Fractional part: $0.125 \times 2 = 0.25$ (integer part 0), $0.25 \times 2 = 0.5$ (integer part 0), $0.5 \times 2 = 1.0$ (integer part 1), so $(0.125)_{10} = (0.001)_2$.
Therefore, $(13.125)_{10} = (1101.001)_2$.
Note: the scanf() family of functions in C/C++ does not actually use this method.
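The hand method above is straightforward to turn into code. The sketch below converts the integer part by repeated division by 2 and the fractional part by repeated multiplication by 2, stopping after a fixed number of fraction bits; it is for illustration only and, as noted, is not how scanf() actually works:

```c
#include <stdio.h>

/* Print the binary expansion of a non-negative value such as 13.125,
   emitting at most max_frac_bits bits after the binary point. */
static void print_binary(double x, int max_frac_bits)
{
    unsigned long ip = (unsigned long)x;   /* integer part, e.g. 13  */
    double fp = x - (double)ip;            /* fractional part, 0.125 */

    /* Integer part: divide by 2 and collect the remainders. */
    char digits[64];
    int n = 0;
    do {
        digits[n++] = (char)('0' + ip % 2);
        ip /= 2;
    } while (ip > 0);
    while (n > 0)
        putchar(digits[--n]);              /* remainders are read in reverse */

    putchar('.');

    /* Fractional part: multiply by 2; the integer part is the next bit. */
    for (int i = 0; i < max_frac_bits && fp > 0.0; ++i) {
        fp *= 2.0;
        if (fp >= 1.0) { putchar('1'); fp -= 1.0; }
        else           { putchar('0'); }
    }
    putchar('\n');
}

int main(void)
{
    print_binary(13.125, 20);   /* prints 1101.001 */
    return 0;
}
```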
The key to whether a decimal number can be represented exactly by a binary floating point number is its fractional part. Let us check whether the simplest one-digit decimal fractions, 0.1 through 0.9, can be represented exactly, using the multiply-by-2-and-take-the-integer-part method shown above:
Start with 0.1: $0.1 \times 2 = 0.2$ (bit 0), $0.2 \times 2 = 0.4$ (bit 0), $0.4 \times 2 = 0.8$ (bit 0), $0.8 \times 2 = 1.6$ (bit 1), $0.6 \times 2 = 1.2$ (bit 1), and the fractional part is back to 0.2, so the digits repeat forever: $(0.1)_{10} = (0.0\overline{0011})_2$. An infinitely repeating binary fraction cannot be written with finitely many bits, so 0.1 cannot be represented exactly by an IEEE 754 floating point number. Because 0.2, 0.4, 0.6 and 0.8 all appear in the same cycle, those four numbers cannot be represented exactly either. Likewise, 0.3, 0.7 and 0.9 produce the same repeating pattern and cannot be represented exactly by an IEEE 754 floating point number.
Conclusion 1: Of the nine one-digit decimal fractions 0.1 through 0.9, only 0.5 can be represented exactly, because $(0.5)_{10} = (0.1)_2$.
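A quick experiment confirms this: print the double nearest to each one-digit fraction with far more digits than the default, and only 0.5 comes out clean. A minimal sketch, assuming IEEE 754 doubles:

```c
#include <stdio.h>

int main(void)
{
    for (int i = 1; i <= 9; ++i) {
        double d = i / 10.0;               /* the double nearest to 0.i */
        printf("0.%d is stored as %.20f\n", i, d);
    }
    /* Only the 0.5 line prints 0.50000000000000000000; every other
       line exposes the error of the nearest representable double. */
    return 0;
}
```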
We can extend this conclusion to the general situation:
Conclusion 2: Any decimal fraction whose last (non-zero) digit is 1, 2, 3, 4, 6, 7, 8 or 9 cannot be represented exactly by an IEEE 754 floating point number; it always carries a representation error.
Of course, if the fractional part can be represented exactly in binary and the whole number lies within the precision range of the floating point format, then the number can be represented exactly.
4. Basic rules for which decimal fractions a binary fraction can represent exactly
The conclusions above were obtained by converting decimal fractions to binary. We can also reason in the other direction, converting binary fractions to decimal: $2^{-1} = 0.5$, $2^{-2} = 0.25$, $2^{-3} = 0.125$, $2^{-4} = 0.0625$, $2^{-5} = 0.03125$, and so on. Continuing the calculation reveals a basic rule.
Conclusion 3: For a decimal fraction to be representable exactly by a binary floating point number, its last digit must be 5, because every negative power of 2 ends in the digit 5 (halving a number that ends in 5 yields another number ending in 5), and the smallest power of 2 in the sum determines the last decimal digit. Of course this is a necessary condition, not a sufficient one.
How many decimal fractions can an m-bit binary fraction represent exactly? Very few. The count goes as follows:
A 1-bit binary fraction can represent exactly one one-digit decimal fraction: $(0.1)_2 = 0.5$.
A 2-bit binary fraction can represent exactly two two-digit decimal fractions: $(0.01)_2 = 0.25$ and $(0.11)_2 = 0.75$.
A 3-bit binary fraction can represent exactly four three-digit decimal fractions: 0.125, 0.375, 0.625 and 0.875.
...
An m-bit binary fraction whose last bit is 1 corresponds to a decimal fraction with exactly m digits, and there are $2^{m-1}$ such binary fractions, while there are $9 \times 10^{m-1}$ decimal fractions with exactly m digits (non-zero last digit). So the proportion of m-digit decimal fractions that can be represented exactly is $2^{m-1} / (9 \times 10^{m-1})$, which shrinks rapidly as m grows. For the common single and double precision formats, m is 24 and 53 respectively, and the proportions are on the order of $10^{-17}$ and $10^{-37}$: small enough to be negligible.
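The claim is easy to verify by brute force for small m. The sketch below counts, for each m, how many m-digit decimal fractions k/10^m with a non-zero last digit reduce to a denominator that is a pure power of 2, i.e. are exactly representable as binary fractions; the count comes out as 2^(m-1):

```c
#include <stdio.h>

int main(void)
{
    long pow10 = 1, pow5 = 1;
    for (int m = 1; m <= 7; ++m) {
        pow10 *= 10;                       /* 10^m */
        pow5  *= 5;                        /* 5^m  */
        long count = 0;
        for (long k = 1; k < pow10; ++k) {
            if (k % 10 == 0) continue;     /* keep exactly m digits: last digit non-zero */
            /* k/10^m is a binary fraction iff 5^m divides k,
               because then the reduced denominator is a power of 2. */
            if (k % pow5 == 0) ++count;
        }
        printf("m = %d: %ld exact out of %ld m-digit decimal fractions\n",
               m, count, 9 * (pow10 / 10));
    }
    return 0;
}
```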
5. FAQ: how does the C/C++ library function printf() fool us?
Q: Since most decimal fractions cannot be represented exactly by binary floating point numbers, why does printf() appear to print their exact values?
A: First, IEEE 754 specifies the binary-to-decimal conversion precisely; see rule D above. Second, the printf() function prints only a handful of significant digits by default, so as long as the representation error is below what is printed, the output looks exact. Even so, we often see results like x.xxxx999999. Printing with printf("%.17lf", ...) lets a floating point number show its true face.
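A small demonstration of both effects; the exact digits shown in the comments assume IEEE 754 doubles and a correctly rounding printf, which is what common platforms provide:

```c
#include <stdio.h>

int main(void)
{
    double d = 0.3;          /* not exactly representable in binary */

    printf("%f\n", d);       /* default precision: prints 0.300000, looks exact */
    printf("%.17f\n", d);    /* prints 0.29999999999999999, the stored value    */
    printf("%.17f\n", 0.1);  /* prints 0.10000000000000001                      */
    return 0;
}
```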
6. IEEE 754-related standards
The conclusions in this article are based on the IEEE 754 standard. A related standard is IEEE 854, which covers decimal (radix-independent) floating point arithmetic but prescribes no concrete format, so it is rarely used. In addition, IEEE 754 has been under revision since 2000; the revision, known as IEEE 754r (http://754r.ucbtest.org/), merges IEEE 754 and IEEE 854. It has gone through working group ballots but has not yet been approved by IEEE, although approval is probably not far off. The main changes in the revision are:
A) The 16-bit and 128-bit binary floating point formats are added.
B) Decimal floating point formats are added, adopting the format proposed by IBM (http://www2.hursley.ibm.com/decimal/). Intel also proposed a format of its own, but it was not adopted and came away with nothing. (Standards have always been the product of games between corporate interests.)
7. Should I use a decimal floating point number?
Professor Kahan's opinion: decimal floating point must be adopted in order to avoid errors that surprise humans, that is, errors of the kind double d = 0.1; where the stored d is not exactly 0.1.
IBM's opinion: decimal floating point should be used in economic, financial, and other human-facing programs. However, without hardware support, decimal floating point arithmetic implemented in software is many times slower than binary floating point arithmetic implemented in hardware. Now that its format has been adopted by IEEE 754r, IBM will implement a decimal FPU in its next generation of POWER chips. (http://www2.hursley.ibm.com/decimal)
8. Further reading suggestions
This article has discussed the representation accuracy of binary floating point numbers. For computation accuracy, read David Goldberg's classic paper "What Every Computer Scientist Should Know About Floating-Point Arithmetic". Do not be put off by the word "scientist"; here it really means "beginner". The paper is also included as an appendix of the Numerical Computation Guide.
Summary
Exactness is the exception and error is the rule. When writing a numerical algorithm, the only thing we can really do is keep the error from growing; expecting anything more is asking too much.