Reprint reference: http://share.onlinesjtu.com/mod/tab/view.php?id=176

http://www.ruanyifeng.com/blog/2010/06/ieee_floating-point_representation.html

http://baike.baidu.com/link?url=heUWO1s1ygzWlf_ZQ0nzfKcJQFmosbGJVzkOCvInmNRcISY_bSfsjaDaxgGjOlJcMwFKKIMf3z9Ml0hW6xQi7_

One problem with fixed-point representation is that it is difficult to represent very large and very small values at the same time. For example, the mass of the electron (9×10^-28 grams) and the mass of the sun (2×10^33 grams) cannot both be represented directly in a fixed-point machine, because the decimal point can only be fixed in one position, which limits the range of representable data.

To represent a wider range of data, it is customary in mathematics to use scientific notation, writing a number as a fraction multiplied by a power of 10.

For example, in a computer, the mass of the electron and the mass of the sun can be given different scale factors so that the absolute value of the numerical part is less than 1, namely:

9×10^-28 = 0.9×10^-27

2×10^33 = 0.2×10^34

The scale factors 10^-27 and 10^34 would have to be stored in a separate unit of the machine, and subsequent calculation results would have to be scaled by these factors. Obviously, this costs both storage space and computing time.

Floating-point notation represents both the significant digits and the range of a number in a computer. In this method, the range and the precision of a number are expressed separately: the scale factor lets the decimal point float freely within a certain range, and changing the value of the exponent part changes the position of the decimal point. Because the position of the decimal point can float, this is called floating-point notation.

The general representation of a floating-point number is:

A decimal number n can be written as: n = 10^E × M

A binary number n can be written as: n = 2^E × M

where M is called the mantissa of the floating-point number and is a fixed-point fraction; E is the exponent of the scale factor, called the exponent (or order) of the floating-point number, and is an integer. When a floating-point number is stored in a computer, the mantissa M is given first, in fixed-point fraction form, followed by the exponent E in integer form, often called the order code. The mantissa gives the significant digits and thus determines the precision of the floating-point number; the order code indicates the position of the decimal point in the data and thus determines the range of the floating-point number. Floating-point numbers are also signed; a signed floating-point number is laid out as shown in Figure 2-2.
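As a small sketch of this mantissa/exponent decomposition, Python's standard `math.frexp` splits a value into a fraction M and an integer exponent E with n = M × 2^E (its convention places M in [0.5, 1), matching the pure-fraction form above):

```python
import math

# Decompose 1.75 into a mantissa M and exponent E such that 1.75 = M * 2**E.
# math.frexp returns M in [0.5, 1), matching the "pure fraction" convention.
m, e = math.frexp(1.75)
print(m, e)              # 0.875 1, since 1.75 = 0.875 * 2**1
assert m * 2**e == 1.75  # the decomposition reconstructs the original value
```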

where S is the sign bit of the mantissa, placed in the highest-order position; E is the order code, immediately following the sign bit, occupying m bits; M is the mantissa, placed in the low-order part, occupying n bits.

**1. Normalized floating-point numbers**

The expression of the same floating-point number is not unique if the expression of the floating-point number is not explicitly specified. For example:

(1.75)₁₀ = (1.11)₂ = 1.11×2^0

= 0.111×2^1

= 0.0111×2^2

= 0.00111×2^3

To improve the precision of the representation, the significant digits of the mantissa must be used fully. When the mantissa is not 0, its most significant bit should be 1; otherwise the order code is adjusted while the decimal point is shifted accordingly, until the representation meets this requirement. This process is called normalization of a floating-point number.
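The normalization procedure can be sketched as a small loop (the function name `normalize` is ours, not from any standard; it normalizes to the IEEE convention 1 ≤ M < 2):

```python
def normalize(m, e):
    """Shift the binary point until 1 <= m < 2, adjusting the exponent e."""
    if m == 0:
        return 0.0, 0
    while m >= 2:      # mantissa too large: shift point left, raise exponent
        m /= 2
        e += 1
    while m < 1:       # leading bit not yet 1: shift point right, lower exponent
        m *= 2
        e -= 1
    return m, e

# 0.00111b * 2**3 and 1.11b * 2**0 denote the same number (0.00111b = 0.21875);
# normalization picks the unique form with a leading 1 before the binary point.
print(normalize(0.21875, 3))   # (1.75, 0)
```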

**2. IEEE-754 standard floating-point format**

Prior to the advent of the IEEE-754 standard, the industry had no unified floating-point standard; each computer manufacturer designed its own floating-point rules and calculation details.

To make software portable, the representation format of floating-point numbers needs a uniform standard. In 1985, the IEEE (Institute of Electrical and Electronics Engineers) published the IEEE-754 standard as a uniform format for floating-point representation. Today, almost all computers support this standard, which greatly improves the portability of scientific software.

The IEEE standard logically uses a triple {S, E, M} to represent a number n. It specifies a radix of 2; the sign bit S denotes positive with 0 and negative with 1; the mantissa M is stored in true (unsigned magnitude) form; and the order code E is stored in biased (excess) form. Since a normalized binary mantissa always has a most significant bit of 1, the standard stipulates that this bit is not stored: it is hidden to the left of the binary point, so the mantissa field actually represents the value 1.M (only M is stored), giving the mantissa one more bit of precision than is physically stored. To express both positive and negative exponents, the order code E is the true exponent e plus a fixed bias; this avoids signed exponents and preserves the original ordering of the data, which simplifies comparison operations.

Currently, most high-level languages store floating-point numbers according to the IEEE-754 standard. The standard specifies that single-precision floating-point numbers are stored in 4 bytes (32 bits) and double-precision floating-point numbers in 8 bytes (64 bits), as shown in Figure 2-3:

Single-precision format (32 bits): sign bit (S) 1 bit; order code (E) 8 bits, with an exponent bias of 127 (7FH); mantissa (M) 23 bits, a binary fraction with the binary point at the front of the mantissa field.

Double-precision format (64 bits): sign bit (S) 1 bit; order code (E) 11 bits, with an exponent bias of 1023 (3FFH); mantissa (M) 52 bits, a binary fraction with the binary point at the front of the mantissa field.
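The single-precision field layout can be inspected directly in Python using the standard `struct` module (the helper name `fields` is ours, for illustration):

```python
import struct

def fields(x):
    """Split a float into its IEEE-754 single-precision S, E, M fields."""
    bits, = struct.unpack('>I', struct.pack('>f', x))  # 32-bit pattern
    s = bits >> 31            # 1 sign bit
    e = (bits >> 23) & 0xFF   # 8-bit order code (biased exponent)
    m = bits & 0x7FFFFF       # 23-bit mantissa field
    return s, e, m

# -6.25 = -1.1001b * 2**2, so S=1, E = 2 + 127 = 129,
# and M holds the fraction bits .1001b = 0.5625 scaled by 2**23.
print(fields(-6.25))   # (1, 129, 4718592)
```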

In the IEEE-754 standard, the truth value of a normalized 32-bit floating-point number x can be expressed as:

x = (-1)^S × (1.M) × 2^(E-127), where e = E - 127    (Equation 2-9)

In the IEEE-754 standard, the truth value of a normalized 64-bit floating-point number x can be expressed as:

x = (-1)^S × (1.M) × 2^(E-1023), where e = E - 1023    (Equation 2-10)
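The single-precision formula above can be checked numerically: decode a 32-bit pattern by hand with (-1)^S × (1.M) × 2^(E-127) and confirm it matches what `struct` decodes (the helper name `decode_single` is ours; the sketch covers normalized numbers only):

```python
import struct

def decode_single(bits):
    """Apply x = (-1)**S * (1.M) * 2**(E-127) to a normalized 32-bit pattern."""
    s = bits >> 31
    e = (bits >> 23) & 0xFF
    m = bits & 0x7FFFFF
    return (-1)**s * (1 + m / 2**23) * 2.0**(e - 127)

# 0x40E00000 has S=0, E=129, mantissa fraction .11b = 0.75,
# so x = 1.75 * 2**(129-127) = 7.0.
pattern = 0x40E00000
print(decode_single(pattern))                                     # 7.0
print(struct.unpack('>f', struct.pack('>I', pattern))[0])         # 7.0
```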

Since double precision works on the same principle as single precision, only with more bits, the following mainly describes the single-precision (32-bit) representation.

Machine zero in a computer refers to: (1) if the mantissa of a floating-point number is all 0s, the computer treats the number as 0 regardless of its order code; (2) if the order code of a floating-point number is smaller than the minimum value of its representable range, the computer also treats the number as 0.

When the order code E is all 0s: if the mantissa M is also all 0s, the value of x is zero; combined with the sign bit S being 0 or 1, there are both a positive zero and a negative zero. If M is not all 0s, the exponent e of the floating-point number equals 1-127 (or 1-1023), and the significand no longer has the hidden leading 1; instead it reverts to the form 0.xxxxxx (a denormalized number). In short, this convention represents ±0 and numbers very close to 0.
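These all-0s-order-code cases can be observed by decoding chosen bit patterns (the helper name `bits_to_float` is ours):

```python
import struct

def bits_to_float(bits):
    """Interpret a 32-bit integer as an IEEE-754 single-precision float."""
    return struct.unpack('>f', struct.pack('>I', bits))[0]

print(bits_to_float(0x00000000))   # 0.0   (E all 0s, M all 0s, S=0: +0)
print(bits_to_float(0x80000000))   # -0.0  (same, but S=1: negative zero)

# E all 0s with M != 0 is a denormalized number: 0.M * 2**(1-127).
# The smallest one, M = 0...01, equals 2**-23 * 2**-126 = 2**-149.
print(bits_to_float(0x00000001) == 2.0**-149)   # True
```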

When the order code E is all 1s: if the mantissa M is all 0s, the value of x is infinity (∞); combined with the sign bit S being 0 or 1, there are +∞ and -∞. If M is not all 0s, the pattern indicates that the value is not a number (NaN).
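The all-1s-order-code cases can likewise be checked against concrete bit patterns (again with an illustrative `bits_to_float` helper):

```python
import math
import struct

def bits_to_float(bits):
    """Interpret a 32-bit integer as an IEEE-754 single-precision float."""
    return struct.unpack('>f', struct.pack('>I', bits))[0]

print(bits_to_float(0x7F800000))              # inf  (E all 1s, M all 0s, S=0)
print(bits_to_float(0xFF800000))              # -inf (same, but S=1)
print(math.isnan(bits_to_float(0x7FC00000)))  # True (E all 1s, M != 0: NaN)
```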

When E is neither all 0s nor all 1s, the floating-point number is interpreted by the rule above: subtract 127 (or 1023) from the order code E to obtain the true exponent e, then prepend the hidden leading 1 to the significand M.

Thus, in the 32-bit floating-point representation, since E all 0s and E all 1s (255) are reserved for the special cases of 0 and infinity, the range of the order code E becomes 1~254; this is also why the exponent bias is chosen as 127 (01111111B) rather than 128 (10000000B). For 32-bit normalized floating-point numbers, the true exponent e ranges over -126~+127, so the absolute value of representable numbers ranges roughly from 2^-126 to 2^128, i.e. about 10^-38 to 10^38.

Additional notes:

In the single-precision representation, the exponent (order code) is stored as 8 binary bits in an excess (biased) code. The true exponent would range over -127~+128; after adding the bias of 127, the stored range becomes 0~255. As discussed above, once the all-0s and all-1s order codes are removed, the range of E becomes 1~254, i.e. the true exponent e ranges over -126~+127.

Thus we calculate the maximum absolute value representable in single precision: the maximum mantissa is 1.11...1 (23 ones), i.e. 2 - 2^-23; multiplied by 2^127 (the largest exponent the order code can represent), the result is

(2 - 2^-23) × 2^127 = (1 - 2^-24) × 2^128 ≈ 3.4028234663853 × 10^38

The minimum absolute value of a normalized single-precision number is the minimum mantissa 1.0 times the minimum exponent 2^-126: 1.0 × 2^-126 ≈ 1.1754943508223 × 10^-38 (denormalized numbers can be smaller still, down to 2^-149).
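Both limits can be verified against the corresponding bit patterns, 0x7F7FFFFF (E=254, M all 1s) and 0x00800000 (E=1, M=0), with the same illustrative `bits_to_float` helper as above:

```python
import struct

def bits_to_float(bits):
    """Interpret a 32-bit integer as an IEEE-754 single-precision float."""
    return struct.unpack('>f', struct.pack('>I', bits))[0]

# Largest normalized single: E=254, M all 1s -> (2 - 2**-23) * 2**127.
print(bits_to_float(0x7F7FFFFF) == (2 - 2**-23) * 2**127)   # True

# Smallest normalized single: E=1, M=0 -> 1.0 * 2**-126.
print(bits_to_float(0x00800000) == 2.0**-126)               # True
```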

A number whose absolute value exceeds the maximum causes overflow, and a nonzero number whose absolute value is smaller than the minimum causes underflow.

The storage structure of floating-point numbers in computer memory and the calculation of the overflow critical values