A floating-point number is a representation of a number drawn from a given subset of the rational numbers, used to approximate arbitrary real numbers in a computer. Specifically, the value is obtained by multiplying an integer or fixed-point number (the mantissa) by some power of a radix (usually 2 in computers), much like scientific notation with base 10.
Floating-point calculations are operations in which floating-point numbers participate; they are often accompanied by approximation or rounding, because many values cannot be represented exactly.
A floating-point number a is represented by two numbers m and e: a = m * b^e. In any such system we choose a base b (the radix of the number system) and a precision p (how many digits are stored). m (the mantissa) is a p-digit number of the form ±d.ddd...ddd (each digit is an integer between 0 and b-1, inclusive). If the first digit of m is nonzero, m is called normalized. Some descriptions use a separate sign bit (s, standing for + or -) to indicate positive or negative, so that m itself must be positive. e is the exponent.
As you can see, a floating-point number is represented in the computer with the following structure:
mantissa part (a fixed-point fraction):  sign ±, mantissa M
exponent part (a fixed-point integer):   exponent sign ±, exponent E
This design can represent, in a fixed-length storage space, a much larger range of numbers than a fixed-point number can.
For example, a 4-digit decimal floating-point number with an exponent range of ±4 can represent 43210, 4.321, or 0.0004321, but does not have enough precision to represent 432.123 or 43212.3 exactly (they must be approximated as 432.1 and 43210). Of course, the number of digits actually used is usually far greater than 4.
In addition, floating-point notation usually includes some special values: +∞ and -∞ (positive and negative infinity) and NaN ("not a number"). Infinity is used for numbers too large to represent, while NaN indicates an illegal operation or a result that cannot be defined.
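In C++ these special values are available through std::numeric_limits; here is a minimal sketch of how they behave:

#include <cmath>
#include <iostream>
#include <limits>

int main()
{
    float inf = std::numeric_limits<float>::infinity();
    float nan = std::numeric_limits<float>::quiet_NaN();

    std::cout << inf << ' ' << -inf << '\n';      // inf -inf
    std::cout << (nan == nan) << '\n';            // 0: NaN compares unequal even to itself
    std::cout << std::isnan(0.0f * inf) << '\n';  // 1: 0 * infinity is undefined, so it yields NaN
}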
As we all know, all data in the computer is represented in binary, and floating-point numbers are no exception. However, the binary representation of a floating-point number is not as simple as that of a fixed-point number.
First, let's clarify a concept: a floating-point number is not necessarily a fraction, and a fixed-point number is not necessarily an integer. "Floating point" means that the position of the radix point is logically not fixed, while a fixed-point type can only represent values with a fixed radix-point position; whether to represent a number as floating-point or fixed-point depends on what meaning the user gives that number.
There are three built-in floating-point types in C++, namely:
float: single precision, 32-bit
double: double precision, 64-bit
long double: extended precision, typically 80-bit (ho, that should make it the longest built-in arithmetic type in C++!)
(Note that, unlike the integer types, standard C++ has no unsigned floating-point types.)
However, support differs slightly between compilers. As far as I know, many compilers do not provide the IEEE 80-bit extended format for long double and simply treat it as double, and perhaps a very few compilers treat it as 128-bit?! I have only heard of a 128-bit long double and have never seen one; if anyone knows the details, please let me know.
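A quick way to see what your own compiler does is to print the sizes; a minimal sketch:

#include <iostream>

int main()
{
    // Sizes are implementation-defined; long double is commonly
    // 8, 12, or 16 bytes depending on compiler and platform.
    std::cout << "float:       " << sizeof(float) << " bytes\n";
    std::cout << "double:      " << sizeof(double) << " bytes\n";
    std::cout << "long double: " << sizeof(long double) << " bytes\n";
}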
Below I will use only the float type (signed, single precision, 32-bit) to illustrate how floating-point numbers are represented in memory in C++. Let's start with the basics: the binary representation of a decimal fraction. (A decimal fraction is a decimal with no integer part, for those who didn't learn this in primary school.)
To express a decimal fraction in binary, it must first be normalized, that is, put into the form 1.xxxxx * (2^n) ("^" stands for exponentiation, so 2^n is the n-th power of 2). For a decimal fraction D, the formula for n is as follows:
n = floor(log2(D)); for a decimal fraction (0 < D < 1), the resulting n must be negative
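For instance, a minimal sketch that computes n and the normalized mantissa for the fraction 0.456 (used again below):

#include <cmath>
#include <cstdio>

int main()
{
    // Normalization step for a decimal fraction D, assuming 0 < D < 1.
    double D = 0.456;
    int n = static_cast<int>(std::floor(std::log2(D)));  // n = floor(log2(D)) = -2
    double m = D / std::pow(2.0, n);                     // mantissa, now in [1, 2)
    std::printf("%g = %g * 2^%d\n", D, m, n);            // 0.456 = 1.824 * 2^-2
}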
As the sketch shows, dividing D by 2^n yields the normalized value. Next comes the problem of converting the fraction to binary. To understand it better, first look at how a decimal fraction is expressed in base 10. Suppose we have a decimal fraction D whose digits after the decimal point form, in order, the sequence:
{k1, k2, k3, ..., kn}
Then D can also be written as:
D = k1/(10^1) + k2/(10^2) + k3/(10^3) + ... + kn/(10^n)
Extended to binary, the representation of a decimal fraction is:
D = b1/(2^1) + b2/(2^2) + b3/(2^3) + ... + bn/(2^n)
Now the question is how to obtain b1, b2, b3, ..., bn. The algorithm is complicated to describe in words, so let's work through numbers instead. One thing to declare first: the number 1/(2^n) is special, and I will call it the place value of bit n.
Take 0.456 as an example. Bit 1: 0.456 is less than the place value 0.5, so the bit is 0. Bit 2: 0.456 is greater than the place value 0.25, so the bit is 1, and 0.456 minus 0.25 leaves 0.206 for the next bit. Bit 3: 0.206 is greater than the place value 0.125, so the bit is 1, and 0.206 minus 0.125 leaves 0.081. Bit 4: 0.081 is greater than 0.0625, so the bit is 1, and 0.081 minus 0.0625 leaves 0.0185. Bit 5: 0.0185 is less than 0.03125, so the bit is 0 ... and so on.
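Here is a minimal sketch of this place-value algorithm in C++ (the function name is just illustrative):

#include <iostream>
#include <string>

// Compare the remaining fraction against the place values 1/2, 1/4, 1/8, ...
// and subtract the place value whenever it fits, as in the example above.
std::string fractionToBinary(double d, int bits)
{
    std::string result = "0.";
    double placeValue = 0.5;  // place value of bit 1, i.e. 1/(2^1)
    for (int i = 0; i < bits; ++i) {
        if (d >= placeValue) {
            result += '1';
            d -= placeValue;
        } else {
            result += '0';
        }
        placeValue /= 2;  // next place value
    }
    return result;
}

int main()
{
    std::cout << fractionToBinary(0.456, 10) << '\n';  // prints 0.0111010010
}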
If we keep computing until we have enough 1s and 0s in the bit sequence, we get an increasingly accurate binary representation of the decimal fraction. This is also where the precision problem comes from: many numbers cannot be represented exactly within a finite n, and we can only use a greater n to represent them more accurately. That's why, in many fields, programmers prefer double over float.
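A minimal sketch that makes the difference visible (printing more digits than either type actually stores):

#include <cstdio>

int main()
{
    // Neither type can store 0.456 exactly; double just gets closer.
    float  f = 0.456f;
    double d = 0.456;
    std::printf("float:  %.20f\n", f);  // accurate to roughly 7 significant digits
    std::printf("double: %.20f\n", d);  // accurate to roughly 15-16 significant digits
}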
To describe the memory structure of float, I use a struct with bit fields:
struct MyFloat
{
    bool bSign : 1;                 // sign, positive or negative, 1 bit
    char cExponent : 8;             // exponent, 8 bits
    unsigned long ulMantissa : 23;  // mantissa, 23 bits
};
The sign bit needs no further explanation: 1 means negative, 0 means positive.
The exponent is base 2 and ranges from -128 to 127. The stored exponent is the actual exponent plus 127; if the result exceeds 127, it wraps around from -128, the same behavior as addition and subtraction overflow on an x86 CPU. For example: 127 + 2 = -127; -127 - 2 = 127.
The mantissa omits the leading 1 (the first bit of the normalized form), so you need to add that 1 back to the front when reconstructing the value. The mantissa may contain both an integer part and a fraction part, or only one of them, depending on the size of the number. For a floating-point number with an integer part, there are two representations of the integer: when the integer is greater than 16777215 in decimal, scientific notation is used; when it is less than or equal, ordinary binary notation is used directly. The scientific notation here works the same way as for fractions.
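The 16777215 limit (2^24 - 1) is easy to demonstrate: above it, a float can no longer tell adjacent integers apart. A minimal sketch:

#include <iostream>

int main()
{
    float a = 16777215.0f;  // 2^24 - 1: still exactly representable
    float b = 16777216.0f;  // 2^24
    std::cout.precision(10);
    std::cout << a + 1.0f << '\n';  // 16777216: still exact
    std::cout << b + 1.0f << '\n';  // 16777216: the +1 is lost to rounding
}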
The fraction part uses scientific notation directly, except that the form is not x * (10^n) but x * (2^n). Now let's take the 32 bits apart:
0 00000000 00000000000000000000000
(sign bit) (exponent bits) (mantissa bits)
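To check this layout against a real value, here is a minimal sketch that copies a float's bits into an integer and splits them apart (shifting a copied uint32_t is more portable than the bit-field struct, whose layout is implementation-defined):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    float f = -6.5f;  // -6.5 = -1.101b * 2^2
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // reinterpret the 32 bits safely

    unsigned sign     = bits >> 31;           // 1 bit
    unsigned exponent = (bits >> 23) & 0xFF;  // 8 bits, stored with a bias of 127
    unsigned mantissa = bits & 0x7FFFFF;      // 23 bits, leading 1 omitted

    std::printf("sign=%u exponent=%d mantissa=0x%06X\n",
                sign, static_cast<int>(exponent) - 127, mantissa);
    // prints: sign=1 exponent=2 mantissa=0x500000 (.101b, i.e. 1.101b with the hidden 1)
}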