Floating-point data

I. Representation of non-integers

Besides integers, everyday calculation also depends on non-integers, that is, numbers with a fractional part. In the number system, integers and fractions together are called rational numbers, and rational and irrational numbers together are called real numbers (well, that is beside the point of this article, but I wanted to show that I was once a student in the math department...).

A non-integer is written in decimal notation with a decimal point ".". Each digit to the left of the point is multiplied by a non-negative integer power of 10, with exponents 0, 1, 2, and so on counting away from the decimal point. Each digit to the right is multiplied by a negative integer power of 10, with exponents -1, -2, -3, ... counting away from the point.

Example: 12.25 = 1 * 10 ^ 1 + 2 * 10 ^ 0 + 2 * 10 ^ (-1) + 5 * 10 ^ (-2)

Binary representation works the same way, except that the base changes from 10 to 2. For example: 11.01 = 1 * 2 ^ 1 + 1 * 2 ^ 0 + 0 * 2 ^ (-1) + 1 * 2 ^ (-2) = 3.25

However, just as decimal cannot represent the fraction 1/3 exactly, binary is also only approximate for some non-integers. The reason is that any number we write down has finite length: 1/3 = 0.3333333..., and we cannot keep writing 3s forever. Likewise, binary cannot represent the decimal value 0.4 exactly. I suspect that whatever base is used, some numbers will always fall through the gaps.

II. Storage of non-integers

To store a non-integer such as the binary number 11.101, a natural idea is to store the integer part and the fractional part separately. This is called fixed-point notation: it specifies that a fixed number of bits to the left of the binary point hold the integer part and a fixed number of bits to the right hold the fraction. This approach is straightforward but inefficient, and a number that exceeds the fixed-point range can only be truncated, so it cannot be represented exactly.

Note that shifting the binary point of a non-integer one place to the left is equivalent to dividing the number by 2, and shifting it one place to the right is equivalent to multiplying by 2. For example:

11.01 * 2 = 110.1; 11.01 / 2 = 1.101

Now recall decimal scientific notation. A binary number can likewise be written in the form x * 2 ^ y, where x is a fixed-point number with exactly one digit to the left of the binary point, and that digit must be 1, not 0 (if it were 0, we could always move the point to the right until we reached a 1). To the right of the point there is a fixed number of fractional digits. y is the exponent of 2: its magnitude is the number of places the point was moved, and its sign depends on whether the point moved left or right.

Moving the binary point to bring a number into this form is called normalization, and the resulting representation is called floating-point notation. To store a binary non-integer in this form, we only need three parts: the sign, the mantissa, and the exponent (the number of places the point was moved, with its direction).

III. Single and double precision

IEEE defines several formats for storing floating-point numbers; the two common ones are single precision and double precision. The single-precision format uses 32 bits (4 bytes) to store a real number (which may also be an integer): 1 sign bit, 8 exponent bits, and 23 mantissa bits. The double-precision format uses 64 bits (8 bytes): 1 sign bit, 11 exponent bits, and 52 mantissa bits.

Let's look at how these three parts are actually stored. The sign bit needs little explanation: it is either 0 or 1. The key is the exponent field, because it determines how the mantissa is interpreted. The stored exponent bits fall into three cases: neither all 0 nor all 1, all 0, and all 1.

1. Other cases (neither all 0 nor all 1)

Contrary to what you might imagine, the exponent in this case is not stored directly in two's complement form. It must still represent a signed value, however, because negative exponents are obviously needed.

Treat the bit pattern actually stored in the exponent field as an unsigned number with value e, and let E be the actual value of the exponent. Then E = e - bias, where bias is an offset of 2 ^ (k-1) - 1 and k is the number of bits in the exponent field. It follows that bias = 127 for single precision and bias = 1023 for double precision.

Since the bits of e are neither all 0 nor all 1, the range of E is -126 to +127 (single precision) or -1022 to +1023 (double precision).

In this case the mantissa field stores only the fractional part to the right of the binary point, and the digit to the left is implicitly 1. The actual value of the mantissa is M = 1 + f, where f is the value of the stored bit pattern interpreted as a binary fraction, with 0 <= f < 1.

2. All 0

In this case, E = 1 - bias = -126 or -1022, while M = f.

In the previous case we could not represent 0.0, because M is always greater than or equal to 1. Now we only need to set every bit of f to 0 to represent 0. One complication remains: there is still the sign bit, so IEEE specifies that +0.0 and -0.0 are not exactly the same.

3. All 1

In this case, if the mantissa bits are all 0, the floating-point number is interpreted as infinity, either positive or negative according to the sign bit. If the mantissa bits are not all 0, the number is interpreted as NaN (Not a Number); taking the square root of -1, for example, returns such a result.

IV. Operations and type conversion

1. Rounding

Because their length is limited, floating-point numbers can only approximate real numbers. When a real number needs more precision than a floating-point format can provide, we need a way to discard part of the value and obtain an approximation, for example using 2.0 to represent 1.99999.

IEEE defines four rounding modes: round up, round down, round toward 0, and round to even.

[Figure: the four rounding modes, from "Computer Systems: A Programmer's Perspective"]

2. Arithmetic

The most important point is that floating-point arithmetic is not associative! This holds for both addition and multiplication.

For example, suppose n is a very large floating-point number. Then 3.14 + (n - n) = 3.14, whereas (3.14 + n) - n = 0.

Otherwise there is nothing special about the arithmetic, except that the final result may be rounded.

3. Type conversion in C language

The C language provides two floating-point types, float and double. Unfortunately, they do not always correspond to single and double precision, because the C standard does not require the machine to use the IEEE formats. In most cases, however, they do correspond.

Here are the type conversion rules, using int as the integer type (I will fill in the experiments later):

int to float: the number cannot overflow, but it may be rounded;

int or float to double: the exact original value is preserved;

float or double to int: the value is rounded toward 0;

double to float: the value may overflow to infinity, or it may be rounded because float has less precision.

C Language Notes: Data Types (III)