In-Depth Understanding of Computer Systems (2.7): Binary Floating-Point Numbers and the IEEE Standard (Important)

Section 2.6 was our last battle with binary integer arithmetic. This time I will take you into the world of floating point, where there is no unsigned and no two's complement, but where there are surprises of every kind. Once you really enter this world you will find it interesting, and you will no longer feel, as before, that the details of floating point are useless and that it is enough to simply use it. Of course, some readers may find this material too hard and lose interest in learning it.

Like me, you may have found the IEEE standard daunting in the past, but I believe the introduction to floating point in these sections will give you a sense of epiphany.


Integer arithmetic covers a large part of what a computer needs for storing and operating on information, but it is not enough. If we built a supermarket inventory management system with integers alone, every price would have to be a whole number, which is hard to accept.

So sometimes we need finer-grained numeric representations, and this is where floating-point numbers come in. In the past, each computer manufacturer had its own conventions for representing floating-point numbers and for operating on them, which caused great trouble for program portability.

Demand drives innovation, and around 1985 the floating-point standard IEEE 754 came into being. Like Qin Shihuang, it unified the floating-point world: Qin Shihuang unified writing, currency, and so on, while IEEE 754 unified the floating-point standard.

Floating-point numbers do more than make numeric representation finer. They also represent numbers that integers cannot reach, such as values very close to 0 or very large values. Their value to computers is therefore considerable.

Binary fractions

Although the main subject of this section is the IEEE standard, let us first look at how binary represents fractional numbers, which helps in understanding the floating-point representation. Decimal fractions should be familiar to everyone. For 12345.6789, the value is given by the following formula.

$12345.6789 = 1 \times 10^{4} + 2 \times 10^{3} + 3 \times 10^{2} + 4 \times 10^{1} + 5 \times 10^{0} + 6 \times 10^{-1} + 7 \times 10^{-2} + 8 \times 10^{-3} + 9 \times 10^{-4}$

This should be common sense, and binary fractions work the same way. Consider the binary fraction 10010.1110. Its value is given by the following formula.

$10010.1110_2 = 1 \times 2^{4} + 0 \times 2^{3} + 0 \times 2^{2} + 1 \times 2^{1} + 0 \times 2^{0} + 1 \times 2^{-1} + 1 \times 2^{-2} + 1 \times 2^{-3} + 0 \times 2^{-4} = 16 + 2 + \frac{1}{2} + \frac{1}{4} + \frac{1}{8} = 18.875$

Seen this way, a binary fraction is evaluated exactly like a decimal fraction, except that the weights are integer powers of 2 rather than of 10.

The book gives the general formula for binary fractions. For a binary fraction b of the form $b_m \dots b_0 . b_{-1} \dots b_{-n}$, the value is computed as follows.

$b = \sum_{i=-n}^{m} 2^{i} \times b_i$
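As a minimal sketch of the positional rule described above, the following Python function evaluates a binary fraction string exactly. The function name `binary_fraction_value` is made up for this illustration and does not come from the book.

```python
from fractions import Fraction

def binary_fraction_value(bits):
    """Evaluate a binary string such as '10010.1110' exactly."""
    whole, _, frac = bits.partition(".")
    value = Fraction(0)
    # Integer part: the digit at position i (counted from 0 on the right) weighs 2**i.
    for i, digit in enumerate(reversed(whole)):
        value += int(digit) * Fraction(2) ** i
    # Fractional part: the j-th digit after the point weighs 2**-j.
    for j, digit in enumerate(frac, start=1):
        value += int(digit) * Fraction(1, 2 ** j)
    return value

print(binary_fraction_value("10010.1110"))         # 151/8
print(float(binary_fraction_value("10010.1110")))  # 18.875
```

Using `Fraction` keeps every intermediate result exact, so the check does not itself depend on floating-point rounding.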

One reminder: binary fractions are unlike integers. With enough bits, binary can represent every integer exactly, but it cannot represent every decimal fraction exactly. Even a decimal fraction as simple as 0.3 has no exact binary representation.
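The point about 0.3 can be checked with a small sketch. It assumes Python's `float`, which is an IEEE double on common platforms, and the helper name `is_exact_in_binary` is invented for this example.

```python
from fractions import Fraction

def is_exact_in_binary(decimal_text):
    """True if the decimal literal is stored without any rounding as a double."""
    # Fraction(float) gives the exact stored value; Fraction(str) gives the intended one.
    return Fraction(float(decimal_text)) == Fraction(decimal_text)

for text in ("0.5", "0.875", "0.3", "0.1"):
    print(text, is_exact_in_binary(text))

# The stored value is always some integer divided by a power of 2,
# and 3/10 cannot be written that way.
stored = Fraction(0.3)
print(stored.denominator & (stored.denominator - 1) == 0)  # True: power of 2
```

Only fractions whose denominator is a power of 2, such as 0.5 and 0.875, come through unchanged.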

IEEE Standards

The IEEE standard represents floating-point numbers in a form similar to scientific notation. Every floating-point number is written as $V = (-1)^{s} \times M \times 2^{E}$.

Here s is the sign bit: 0 means positive and 1 means negative. M is the mantissa, a binary fraction whose range is either 0 to $1 - \epsilon$ or 1 to $2 - \epsilon$ (where $\epsilon = 2^{-n}$ for an n-bit fraction field). E is the exponent (the "order code"), a binary integer that may be negative and that weights the mantissa by a power of 2.

There are two floating-point formats, single precision and double precision, which correspond to the float and double types in programming languages. A float is single precision: a 32-bit representation with 1 sign bit, 8 exponent bits, and 23 fraction bits. A double is double precision: a 64-bit representation with 1 sign bit, 11 exponent bits, and 52 fraction bits.

As these bit counts show, the range covered by double precision is much larger than that of single precision. Depending on the value of the exponent field, a floating-point value falls into one of three cases: normalized, denormalized (non-normalized), and special values. These three cases are the essence of floating point.
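To make the bit layout concrete, here is a hedged sketch that splits a value into its three fields using Python's standard `struct` module. The function names `single_fields` and `double_fields` are invented for this illustration.

```python
import struct

def single_fields(x):
    """Return (sign, exponent field, fraction field) of x stored as a 32-bit float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def double_fields(x):
    """Return (sign, exponent field, fraction field) of x stored as a 64-bit double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

print(single_fields(1.0))       # (0, 127, 0)
print(single_fields(-0.15625))  # (1, 124, 2097152)
print(double_fields(1.0))       # (0, 1023, 0)
```

The shifts and masks follow the 1 + 8 + 23 and 1 + 11 + 52 splits directly. Note that `struct.pack(">f", x)` rounds x to single precision first.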

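Below are the cases that the book's figure shows for single precision, numbered 1, 2, and 3, where case 3 (special values) is split into 3a and 3b. Take a look at them first, and then we will analyze them one by one.

1, normalized: the exponent field is neither all 0s nor all 1s.

2, denormalized: the exponent field is all 0s.

3a, infinity: the exponent field is all 1s and the fraction field is all 0s.

3b, NaN: the exponent field is all 1s and the fraction field is not all 0s.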

Normalized values

Normalized floating-point numbers are case 1 above. For single precision, this is the case where the exponent field is neither 0 nor 255.

For numbers in this range, the exponent field is read as a signed integer in "biased" form. Biasing means the stored field is the true exponent plus a fixed offset. For an exponent field of k bits, the offset is $\text{Bias} = 2^{k-1} - 1$. If e is the unsigned value of the exponent field, the true exponent is $E = e - \text{Bias}$. For example, with an 8-bit exponent field, Bias = 127. Since the exponent field of a normalized number ranges from 1 to 254 with 8 bits, the true exponent ranges from -126 to 127.

The fraction field is interpreted as a value f with $0 \le f < 1$. That is, if the fraction bits are $f_{n-1} \dots f_0$, then f has the binary representation $0.f_{n-1} \dots f_0$. This is only the fraction value. When computing the value of the floating-point number, 1 is added to it, so the true mantissa is $M = 1 + f$. In effect one binary digit is left out of the stored bits: by convention, the most significant bit of the mantissa is always taken to be 1.
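The normalized rule can be sketched as a small decoder. This is only an illustration under the definitions in this section; the function name `decode_normalized` and its parameter names are invented, and the defaults k = 8 and n = 23 assume single precision.

```python
from fractions import Fraction

def decode_normalized(sign, exp_field, frac_field, k=8, n=23):
    """Exact value of a normalized number from its three raw fields."""
    if not 0 < exp_field < 2 ** k - 1:
        raise ValueError("exponent field must be neither all 0s nor all 1s")
    bias = 2 ** (k - 1) - 1                       # 127 for k = 8
    true_exponent = exp_field - bias              # exponent field minus the bias
    mantissa = 1 + Fraction(frac_field, 2 ** n)   # implied leading 1 plus the fraction
    return (-1) ** sign * mantissa * Fraction(2) ** true_exponent

print(decode_normalized(0, 127, 0))         # 1
print(decode_normalized(1, 124, 2 ** 21))   # -5/32, that is -0.15625
print(decode_normalized(0, 1, 0) == Fraction(1, 2 ** 126))  # True
```

The last line uses the smallest legal exponent field, 1, which gives a true exponent of -126.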


Denormalized values

Denormalized (non-normalized) floating-point numbers are case 2 above, that is, the case where the exponent field is all 0s.

Following the normalized rule above, the exponent of a denormalized number would be fixed at $-\text{Bias}$. But here is a small trick: we define the exponent as $E = 1 - \text{Bias}$ instead. This makes the transition from denormalized to normalized numbers smooth, which we will look at in detail later.

As for the mantissa, the denormalized case differs from the normalized one in that no 1 is added: the true mantissa is $M = f$. This makes it possible to represent 0. Otherwise the mantissa would always be at least 1, and the value 0 could never be obtained.

Besides representing 0, denormalized numbers also serve to represent values very close to 0. Note also that floating point has two representations of 0. When all bits are 0, the value is +0.0. When the sign bit is 1 and all other bits are 0, the value is -0.0.
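The following sketch illustrates the denormalized rule and the two zeros for single precision. The helper names `decode_denormalized` and `single_from_bits` are invented for this example, and the bit patterns are built with the standard `struct` module.

```python
import math
import struct
from fractions import Fraction

def decode_denormalized(sign, frac_field, k=8, n=23):
    """Value when the exponent field is all 0s: no implied 1, exponent 1 - Bias."""
    bias = 2 ** (k - 1) - 1
    mantissa = Fraction(frac_field, 2 ** n)
    return (-1) ** sign * mantissa * Fraction(2) ** (1 - bias)

def single_from_bits(bits):
    """Read a 32-bit pattern as a single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(decode_denormalized(0, 0))  # 0
print(decode_denormalized(0, 1) == Fraction(single_from_bits(0x00000001)))  # True

pos_zero = single_from_bits(0x00000000)
neg_zero = single_from_bits(0x80000000)
print(pos_zero == neg_zero)          # True: the two zeros compare equal
print(math.copysign(1.0, pos_zero))  # 1.0
print(math.copysign(1.0, neg_zero))  # -1.0: the sign bit is still there
```

`math.copysign` is used because comparing with `==` cannot tell +0.0 and -0.0 apart.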

Special values

The special values are cases 3a and 3b above, that is, the cases where the exponent field is all 1s.

When the exponent field is all 1s and the fraction field is all 0s, the value is infinity: a sign bit of 0 gives positive infinity and a sign bit of 1 gives negative infinity. When the fraction field is not all 0s, the value is NaN, short for "not a number". JavaScript has a related function, isNaN, with a similar meaning: it tests whether an argument is not a number.
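As a rough sketch, the case analysis so far can be written as a classifier over 32-bit patterns. The function names `classify_single` and `single_from_bits` are invented for this illustration, and 0x7FC00000 is just one of many possible NaN patterns.

```python
import math
import struct

def classify_single(bits):
    """Name the case a 32-bit single-precision pattern falls into."""
    exp_field = (bits >> 23) & 0xFF
    frac_field = bits & 0x7FFFFF
    if exp_field == 0:
        return "denormalized"
    if exp_field == 0xFF:
        return "infinity" if frac_field == 0 else "NaN"
    return "normalized"

def single_from_bits(bits):
    """Read a 32-bit pattern as a single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

for bits in (0x3F800000, 0x00000001, 0x7F800000, 0xFF800000, 0x7FC00000):
    print(hex(bits), classify_single(bits), single_from_bits(bits))

nan = single_from_bits(0x7FC00000)
print(math.isnan(nan), nan == nan)  # True False
```

A NaN never compares equal to anything, including itself, which is why a dedicated test such as `math.isnan` is needed.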


Range of values

Let us now discuss the range of values for these three kinds of floating-point numbers. Assume a format with 1 sign bit s, k exponent bits, and n fraction bits. Below we work out some key values of such a format in each case.

Two remarks before we start. First, special values have no notion of a range because of their special nature, so they are outside this discussion. Second, positive and negative floating-point values correspond one to one, so we ignore the effect of the sign bit and only discuss the case where the sign bit is 0.


Range of denormalized values

For denormalized floating-point numbers the exponent field is fixed at k zeros, so the true exponent is $E = 1 - (2^{k-1} - 1) = 2 - 2^{k-1}$. From this we get a few important values.

1, when the fraction field is n zeros, the value is +0.0.

2, when the lowest fraction bit is 1 and the rest are 0, the value is the smallest nonzero value. The mantissa is $M = f = 2^{-n}$, so the value is $2^{-n} \times 2^{2 - 2^{k-1}} = 2^{-n + 2 - 2^{k-1}}$.

3, when the fraction field is n ones, the value is the largest denormalized value. The mantissa is $M = f = 1 - 2^{-n}$, so the value is $(1 - 2^{-n}) \times 2^{2 - 2^{k-1}}$. (Possibly because of a printing problem or a slip by the author, the book gives this value incorrectly as $(1 - 2^{-n}) \times 2^{-n + 2 - 2^{k-1}}$. The value given here is the correct one.)
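These three values can be checked against real double-precision bit patterns (k = 11, n = 52). The sketch below assumes Python's `float` is an IEEE double, and the helper name `double_from_fields` is invented for this example.

```python
import struct
from fractions import Fraction

K, N = 11, 52                  # double precision: exponent bits, fraction bits
E_DENORM = 2 - 2 ** (K - 1)    # 1 - Bias, which is -1022 here

def double_from_fields(exp_field, frac_field):
    """Build a non-negative double from raw exponent and fraction fields."""
    bits = (exp_field << N) | frac_field
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

zero = double_from_fields(0, 0)                 # fraction 0...00
smallest = double_from_fields(0, 1)             # fraction 0...01
largest = double_from_fields(0, (1 << N) - 1)   # fraction 1...11

print(zero)                                                   # 0.0
print(Fraction(smallest) == Fraction(2) ** (-N + E_DENORM))   # True
print(Fraction(largest)
      == (1 - Fraction(1, 2 ** N)) * Fraction(2) ** E_DENORM)  # True
```

Converting each float to a `Fraction` makes the comparison exact, so a match confirms the formula rather than a rounded approximation of it.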


Normalized range of values

For normalized floating-point numbers there are likewise three important values.

1, when the lowest bit of the exponent field is 1 and the rest are 0, and the fraction field is n zeros, the value is the smallest normalized value. The exponent here is exactly the same as the denormalized exponent, $E = 2 - 2^{k-1}$. The mantissa is easy to compute: since the fraction field is all 0s, $M = 1 + f = 1$. So the value is $2^{2 - 2^{k-1}}$.

It is worth stressing that the exponent of the smallest normalized value equals the exponent of the denormalized values precisely because we defined the denormalized exponent as $1 - \text{Bias}$ rather than $-\text{Bias}$. Since the two exponents are equal and the two mantissas differ by exactly $2^{-n}$ (the finest step that an n-bit fraction can represent), the transition from denormalized to normalized values is smooth. It also follows that the smallest normalized value is only slightly larger than the largest denormalized value.

2, when the highest bit of the exponent field is 0 and the rest are 1, and the fraction field is n zeros, the value is 1, because after subtracting the bias the exponent is $E = 0$ and the mantissa is $M = 1 + f = 1$.

3, when the lowest bit of the exponent field is 0 and the rest are 1, and the fraction field is n ones, the value is the largest normalized value. Here the exponent is $E = 2^{k-1} - 1$ and the mantissa is $M = 2 - 2^{-n}$, so the value is $(2 - 2^{-n}) \times 2^{2^{k-1} - 1}$, which can also be simplified to $(1 - 2^{-n-1}) \times 2^{2^{k-1}}$.
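The normalized values and the smooth transition can be checked on single-precision bit patterns (k = 8, n = 23). This is a hedged sketch, and the helper name `single_from_bits` is invented for this example.

```python
import struct
from fractions import Fraction

def single_from_bits(bits):
    """Exact value of a 32-bit pattern read as single precision."""
    return Fraction(struct.unpack(">f", struct.pack(">I", bits))[0])

largest_denorm = single_from_bits(0x007FFFFF)  # exponent 0000 0000, fraction 1...1
smallest_norm = single_from_bits(0x00800000)   # exponent 0000 0001, fraction 0...0
one = single_from_bits(0x3F800000)             # exponent 0111 1111, fraction 0...0
largest_norm = single_from_bits(0x7F7FFFFF)    # exponent 1111 1110, fraction 1...1

step = Fraction(1, 2 ** 149)  # 2**-n * 2**(2 - 2**(k-1)) for k = 8, n = 23

print(smallest_norm == Fraction(1, 2 ** 126))                     # True
print(one == 1)                                                   # True
print(largest_norm == (2 - Fraction(1, 2 ** 23)) * 2 ** 127)      # True
# Smooth transition: the spacing is the same on both sides of the boundary.
print(smallest_norm - largest_denorm == step)                     # True
print(single_from_bits(0x00800001) - smallest_norm == step)       # True
```

If the denormalized exponent had been -Bias instead of 1 - Bias, the last two differences would not be equal, since the denormalized values would be spaced half as far apart and would stop well short of the smallest normalized value.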

Range of values for single and double precision

The book gives a chart of the six kinds of values above for single and double precision. All of them can be computed from the formulas above, so I will not type them out one by one here. In that chart, exp is the bit representation of the exponent field and frac is the bit representation of the fraction field.

As you can see, the range of values that can be represented is quite large, which is exactly what scientific programs and other applications that work with very large numbers need.
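For readers who want the numbers, here is a sketch that evaluates the six formulas for any k and n and prints them for single and double precision. The function name `key_values` and the labels are invented for this illustration. The cross-check against `sys.float_info` assumes Python's `float` is an IEEE double.

```python
import sys
from fractions import Fraction

def key_values(k, n):
    """The six values discussed above for k exponent bits and n fraction bits."""
    bias = 2 ** (k - 1) - 1
    eps = Fraction(1, 2 ** n)
    low = Fraction(2) ** (1 - bias)   # weight shared by denormalized and smallest normalized
    high = Fraction(2) ** bias        # weight of the largest normalized value
    return {
        "zero": Fraction(0),
        "smallest denormalized": eps * low,
        "largest denormalized": (1 - eps) * low,
        "smallest normalized": low,
        "one": Fraction(1),
        "largest normalized": (2 - eps) * high,
    }

for name, (k, n) in {"single": (8, 23), "double": (11, 52)}.items():
    print(name)
    for label, value in key_values(k, n).items():
        print(f"  {label:22s} {float(value):.1e}")

double_values = key_values(11, 52)
print(float(double_values["smallest normalized"]) == sys.float_info.min)  # True
print(float(double_values["largest normalized"]) == sys.float_info.max)   # True
```

The values are kept as exact fractions and only converted to `float` for printing.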

An interesting exercise

Let us look together at an interesting exercise from the book. The problem reads as follows.

Problem: For a floating-point format with an n-bit fraction, give a formula for the smallest positive integer that cannot be represented exactly (because representing it exactly would require an (n+1)-bit fraction). Assume the exponent field length k is large enough that the range of representable exponents does not limit the answer.

Analysis: We can first rule out the denormalized range, since those values are all less than 1. Within the normalized range, the smallest value that needs an (n+1)-bit fraction has a fraction of n zeros followed by a 1 in the lowest position, that is, $M = 1 + f = 1 + 2^{-n-1}$. Choosing the exponent that just cancels the fractional part gives $E = n + 1$, so the answer is $(1 + 2^{-n-1}) \times 2^{n+1} = 2^{n+1} + 1$.

Let us check with the simplest example. For n = 1 the formula gives 5. With a 1-bit fraction, f can be 0 or 1/2, so the mantissa is 0 or 1/2 in the denormalized case and 1 or 3/2 in the normalized case. Whatever exponent we choose, 5, 7, 9, and so on cannot be represented: from 5 upward, the only representable integers are powers of 2 and 3 times a power of 2, such as $6 = \frac{3}{2} \times 2^{2}$ and 8. The values 1, 2, 3, and 4 are all representable, so the smallest value that cannot be represented exactly is 5.
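The answer can also be checked by brute force. The sketch below enumerates every integer of the form M times a power of 2 for small n, then checks the single-precision case (n = 23) by a round trip through 32 bits. The function names are invented for this example, and like the exercise it ignores any limit on the exponent.

```python
import struct
from fractions import Fraction

def representable_integers(n, limit):
    """Integers in [1, limit] equal to M * 2**E with M = 1 + f, f an n-bit fraction."""
    found = set()
    for frac_field in range(2 ** n):
        mantissa = 1 + Fraction(frac_field, 2 ** n)
        exponent = 0   # M < 2, so a negative exponent gives values below 1 or non-integers
        while mantissa * 2 ** exponent <= limit:
            value = mantissa * 2 ** exponent
            if value.denominator == 1:
                found.add(int(value))
            exponent += 1
    return found

def smallest_unrepresentable(n):
    limit = 2 ** (n + 2)
    ok = representable_integers(n, limit)
    return next(v for v in range(1, limit + 1) if v not in ok)

for n in range(1, 8):
    print(n, smallest_unrepresentable(n), 2 ** (n + 1) + 1)

def survives_single(v):
    """True if the integer v is unchanged after being stored as a 32-bit float."""
    return struct.unpack(">f", struct.pack(">f", float(v)))[0] == v

print(survives_single(2 ** 24), survives_single(2 ** 24 + 1))  # True False
```

For n = 1 the enumeration finds 1, 2, 3, 4, 6, and 8 up to the limit, so 5 is the first gap, in line with the formula.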

Article Summary

This time we mainly introduced the IEEE floating-point standard. This section is less difficult than the previous one, so readers should not find it too strenuous. The next section will be the last of the 2.X series, and it mainly covers rounding of floating-point numbers and floating-point operations.
