**1 Representation of floating-point numbers**

A floating-point number is typically stored as three fields, in the order S, P, M,

where S is the sign bit, P is the exponent and M is the mantissa.

On the IBM PC, a single-precision floating-point number is 32 bits (4 bytes) and a double-precision floating-point number is 64 bits (8 bytes). The widths of S, P and M and the value they represent are given in the following table

| Precision | S (bits) | P (bits) | M (bits) | Value represented | Bias |
|---|---|---|---|---|---|
| Single | 1 | 8 | 23 | $(-1)^{S} \times 2^{P-127} \times 1.M$ | 127 |
| Double | 1 | 11 | 52 | $(-1)^{S} \times 2^{P-1023} \times 1.M$ | 1023 |

For a single-precision floating-point number, the binary layout is therefore as follows

| S | P | M |
|---|---|---|
| bit 31 | bits 30 to 23 | bits 22 to 0 |

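Here S is the sign bit, which takes only the values 0 and 1, for positive and negative respectively. P is the exponent, usually stored in excess (biased) code. An excess code differs from the two's complement code only in the sign bit; the remaining bits are the same. (Strictly, inverting the sign bit of an 8-bit two's complement number gives a bias of 128, whereas IEEE 754 single precision uses a bias of 127.) For a positive number, the sign-magnitude, ones' complement and two's complement codes are identical; for a negative number, the two's complement is obtained by inverting every bit of the absolute value and then adding 1. M is the mantissa, the fractional part of the normalized significand 1.M.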

For simplicity, this article deals only with single-precision floating-point numbers; double-precision numbers are stored and represented in the same way, only with wider fields.
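The field layout and the formula from the table can be checked with a short sketch. The helper names `Decompose` and `Recompose` and the sample value -6.5f are illustrative assumptions, not part of the original article; the code is written as a C# top-level program (C# 9 or later) and only handles normalized values.

```csharp
using System;

// Splits a float into its S, P and M fields (hypothetical helper).
(uint S, uint P, uint M) Decompose(float value)
{
    // Reinterpret the 4 bytes of the float as a 32-bit unsigned integer.
    uint bits = BitConverter.ToUInt32(BitConverter.GetBytes(value), 0);
    uint sign = bits >> 31;               // bit 31
    uint exponent = (bits >> 23) & 0xFF;  // bits 30 to 23
    uint mantissa = bits & 0x7FFFFF;      // bits 22 to 0
    return (sign, exponent, mantissa);
}

// Rebuilds a normalized value from (-1)^S * 2^(P-127) * 1.M (hypothetical helper).
double Recompose(uint s, uint p, uint m)
{
    double sign = s == 0 ? 1.0 : -1.0;
    double significand = 1.0 + m / 8388608.0; // 2^23 = 8388608
    return sign * Math.Pow(2.0, (int)p - 127) * significand;
}

var fields = Decompose(-6.5f);
Console.WriteLine($"S={fields.S} P={fields.P} M=0x{fields.M:X6}");      // S=1 P=129 M=0x500000
Console.WriteLine(Recompose(fields.S, fields.P, fields.M) == -6.5);     // True
```

Here -6.5 is 1.101 in binary times 2^2, so P = 2 + 127 = 129 and M holds the bits 101 followed by zeros.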

**2 Representation conventions for floating-point numbers**

Single-precision and double-precision floating-point numbers are both defined by the IEEE 754 standard, which includes some special conventions.

(1) When P = 0 and M = 0, the value is 0.

(2) When P = 255 and M = 0, the value is infinity; the sign bit determines whether it is positive or negative infinity.

(3) When P = 255 and M != 0, the value is NaN (not a number).
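These three conventions can be observed directly by printing the exponent and mantissa fields of the corresponding .NET constants. This is a minimal sketch; the helper names `Bits`, `ExponentField` and `MantissaField` are assumptions made for illustration, and the code is a C# top-level program (C# 9 or later).

```csharp
using System;

// Raw 32-bit pattern of a float, and its P and M fields (hypothetical helpers).
uint Bits(float value) => BitConverter.ToUInt32(BitConverter.GetBytes(value), 0);
uint ExponentField(float value) => (Bits(value) >> 23) & 0xFF;
uint MantissaField(float value) => Bits(value) & 0x7FFFFF;

// Expected per the conventions: zero has P=0, M=0; the infinities have
// P=255, M=0; NaN has P=255 and a non-zero M.
foreach (float v in new[] { 0f, float.PositiveInfinity, float.NegativeInfinity, float.NaN })
{
    Console.WriteLine($"{v}: P={ExponentField(v)} M=0x{MantissaField(v):X6} bits=0x{Bits(v):X8}");
}
```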

When using the .NET Framework, we typically encounter the following three constants

Console.WriteLine(float.MaxValue); // 3.402823E+38
Console.WriteLine(float.MinValue); // -3.402823E+38
Console.WriteLine(float.Epsilon); // 1.401298E-45
// If we convert them to double precision, their values are
Console.WriteLine(Convert.ToDouble(float.MaxValue)); // 3.40282346638529E+38
Console.WriteLine(Convert.ToDouble(float.MinValue)); // -3.40282346638529E+38
Console.WriteLine(Convert.ToDouble(float.Epsilon)); // 1.40129846432482E-45

So how are these values obtained?

According to the conventions above, the maximum value of the exponent field P is 11111110, that is 254: 255 is reserved for the special conventions, so 254 is the largest exponent of an ordinary number. The maximum value of the mantissa is 11111111111111111111111.

So the maximum value is: 0 11111110 11111111111111111111111.

That is, $2^{254-127} \times (1.11111111111111111111111)_2 = 2^{127} \times (1 + 1 - 2^{-23}) = 3.40282346638529 \times 10^{38}$

This agrees with the double-precision output above. The smallest (most negative) number is naturally -3.40282346638529E+38.

For the number closest to 0, IEEE 754 extends the range of values representable near 0 with denormalized numbers: when the exponent field P is 0, the exponent is taken as -126 and the mantissa is read without the implicit leading 1, so the smallest mantissa is M = $(0.00000000000000000000001)_2$. The binary representation of this number is: 0 00000000 00000000000000000000001

namely $2^{-126} \times 2^{-23} = 2^{-149} = 1.40129846432482 \times 10^{-45}$. This number agrees with the Epsilon value above.

The normalized number closest to 0, which keeps the full precision of the mantissa, is 0 00000001 00000000000000000000000.

namely: $2^{-126} \times (1 + 0) = 1.17549435082229 \times 10^{-38}$.
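The three bit patterns discussed in this section can be turned back into floats to confirm the values. This is a sketch under the assumption that a helper like `FromBits` (a name introduced here for illustration) is acceptable; it is a C# top-level program (C# 9 or later).

```csharp
using System;

// Builds a float from a raw 32-bit pattern (hypothetical helper).
float FromBits(uint bits) => BitConverter.ToSingle(BitConverter.GetBytes(bits), 0);

float max = FromBits(0x7F7FFFFF);        // 0 11111110 11111111111111111111111
float epsilon = FromBits(0x00000001);    // 0 00000000 00000000000000000000001
float minNormal = FromBits(0x00800000);  // 0 00000001 00000000000000000000000

Console.WriteLine(max == float.MaxValue);            // True
Console.WriteLine(epsilon == float.Epsilon);         // True
Console.WriteLine(minNormal == 1.17549435E-38f);     // True
```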

**3 Accuracy of floating-point numbers**

A floating-point number maps the infinite set of real numbers onto a finite 32-bit pattern, so in most cases it is only an approximation. Floating-point operations also propagate and accumulate these errors. Two floating-point numbers that appear equal at a given display precision may still compare unequal because their least significant digits differ.

Because a floating-point number may not be an exact approximation of a decimal number, mathematical or comparison operations on floating-point numbers may not give the same result as the same operations on decimal numbers.

A value involving a floating-point number may also fail to round-trip. A value round-trips if one operation converts the original floating-point number to another format, the inverse operation converts it back to a floating-point number, and the final floating-point number equals the original. A round trip can fail because one or more of the least significant digits are lost or changed in the conversion.
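The following sketch illustrates the three effects described above: a comparison that fails, a float and a double that approximate the same decimal differently, and a round trip through text that loses digits. The choice of values and of the "G7" format (7 significant digits) is an assumption made for the example; the code is a C# top-level program (C# 9 or later).

```csharp
using System;
using System.Globalization;

double a = 0.1, b = 0.2;
Console.WriteLine(a + b == 0.3);            // False: the binary sum is not the binary value of 0.3

float f = 0.1f;
Console.WriteLine((double)f == 0.1);        // False: float and double approximate 0.1 differently

// Round trip through text, keeping only 7 significant digits.
string text = float.MaxValue.ToString("G7", CultureInfo.InvariantCulture);
float back = float.Parse(text, CultureInfo.InvariantCulture);
Console.WriteLine(back == float.MaxValue);  // False: digits were lost in the conversion
```

A common way to cope with the first effect is to compare against a tolerance instead of testing for exact equality.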

**4 Representing floating-point numbers in binary**

**4.1 Converting a floating-point number without a fractional part to binary**

First, we use a floating-point number without a fractional part to show how to convert a floating-point number to its binary representation. Suppose the value to be converted is 45678.0f.

For a floating-point number with no fractional part, the integer part is converted directly to binary:

1011001001101110.0. Normalization requires the significand to have the form 1.M, and the leading 1 of this binary number already serves as that 1, so no extra bit has to be added.

Now move the binary point to the left until only one digit remains in front of it, giving 1.011001001101110. The point has moved 15 places, and each place corresponds to a factor of 2, so the original number equals $1.011001001101110 \times 2^{15}$. The mantissa and the exponent are now known. Because the leading 1 is implied by normalization, it is not stored and is removed. The binary mantissa becomes: 011001001101110.

Finally, pad the mantissa with zeros on the right until it has 23 bits: 01100100110111000000000.

Now for the exponent. According to the earlier definition, P - 127 = 15, so P = 142, which in binary is: 10001110.

45678.0f is positive, so the sign bit is 0. Putting the fields together in the format given earlier yields: 0 10001110 01100100110111000000000.

This is the binary representation of 45678.0f. To get the hexadecimal representation, simply convert this binary string 4 bits at a time into hexadecimal digits. Note, however, that x86 CPUs are little endian (the low-order byte comes first, the high-order byte last), so in memory the bytes are stored in the reverse order of the binary string above. It is also easy to check whether the CPU is little endian:

`BitConverter.IsLittleEndian;`
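A hand conversion like the one above can be checked by printing the stored bits. The sketch below prints the sign, exponent and mantissa fields of 45678.0f, the hexadecimal form, and the bytes in memory order; the variable names are illustrative, and the code is a C# top-level program (C# 9 or later).

```csharp
using System;

float value = 45678.0f;
byte[] bytes = BitConverter.GetBytes(value);      // bytes in memory order
uint bits = BitConverter.ToUInt32(bytes, 0);      // the same 32 bits as an integer

// 32-character binary string, split into the S, P and M fields.
string binary = Convert.ToString((long)bits, 2).PadLeft(32, '0');
Console.WriteLine($"{binary.Substring(0, 1)} {binary.Substring(1, 8)} {binary.Substring(9)}");
Console.WriteLine($"hex: {bits:X8}");
Console.WriteLine($"memory: {BitConverter.ToString(bytes)} (little endian: {BitConverter.IsLittleEndian})");
```

On a little-endian machine the memory line lists the bytes of the hexadecimal form in reverse order.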

**4.2 Converting a floating-point number with a fractional part to binary**

Floating-point numbers with a fractional part run into precision problems, as illustrated below. Suppose we want to convert the decimal number 123.456f.

For such a number, the integer part and the fractional part are handled separately. The integer part is converted as before, giving 1111011. The fractional part takes more work. Binary notation only has the digits 0 and 1, so a fraction can only be expressed in the following form:

$a_1 \times 2^{-1} + a_2 \times 2^{-2} + a_3 \times 2^{-3} + \dots + a_n \times 2^{-n}$

where each $a_i$ is 0 or 1. With enough terms, this form can approximate any fraction as closely as desired. But the mantissa has only 23 bits, which inevitably limits the precision.

In many cases, a decimal fraction can only be approximated. Take the fraction 0.456: how is it represented in binary? The usual method is repeated multiplication by 2.

Multiply the number by 2: the result 0.912 is less than 1, so the first digit is 0. Multiply by 2 again: the result 1.824 is greater than 1, so the second digit is 1, and 1 is subtracted to leave 0.824. Multiply by 2 again, and continue in this way until the number becomes 0 or enough digits have been produced.

In many cases, this yields more binary digits than fit, and the excess digits must be discarded. The rounding rule is "0 discard, 1 round up": if the first discarded bit is 1, 1 is added to the last kept bit. Keeping 24 significant bits in total (the leading 1 plus the 23 mantissa bits), we get the binary representation: 1111011.01110100101111001.

Now move the binary point to the left, 6 places in total. The significand is then 1.11101101110100101111001, and the exponent field is 6 + 127 = 133, which is 10000101 in binary. The complete binary representation is:

0 10000101 11101101110100101111001

In hexadecimal: 42 F6 E9 79

Because the CPU is little endian, it is stored in memory as: 79 E9 F6 42.
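The multiply-by-2 method for the fractional part can be sketched as follows. The function name `FractionToBinary` is an assumption made for illustration, and `decimal` is used so that the repeated doubling of 0.456 is exact; the code is a C# top-level program (C# 9 or later).

```csharp
using System;
using System.Text;

// Returns the first 'count' binary digits of a fraction in [0, 1),
// obtained by repeated multiplication by 2 (hypothetical helper).
string FractionToBinary(decimal fraction, int count)
{
    var digits = new StringBuilder();
    for (int i = 0; i < count && fraction != 0m; i++)
    {
        fraction *= 2;
        if (fraction >= 1m) { digits.Append('1'); fraction -= 1m; } // digit 1, keep the remainder
        else { digits.Append('0'); }                                // digit 0
    }
    return digits.ToString();
}

Console.WriteLine(FractionToBinary(0.456m, 20));  // 01110100101111000110

// The bits actually stored for 123.456f, for comparison with the hand conversion.
uint bits = BitConverter.ToUInt32(BitConverter.GetBytes(123.456f), 0);
Console.WriteLine($"{bits:X8}");                  // 42F6E979
```

The first 17 digits are 01110100101111000 and the next digit is 1, so rounding up gives the 17 fractional bits 01110100101111001 used in the text.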

**4.3 Converting a pure fraction to binary**

To convert a pure fraction to binary, it must first be normalized. For example, 0.0456 has to be written in the form $1.xxxx \times 2^{n}$. For a pure fraction X, the exponent n is given by the following formula:

$n = \lfloor \log_2 X \rfloor$

For 0.0456, $\log_2 0.0456 \approx -4.45$, so n = -5 and $0.0456 = 1.4592 \times 2^{-5}$. After converting to this form, the fractional part is processed by the method above, giving the binary representation

1.01110101100011100010001

Remove the leading 1 to get the mantissa

01110101100011100010001

The exponent field is -5 + 127 = 122, which is 01111010 in binary, so the complete representation is

0 01111010 01110101100011100010001

Finally, converted to hexadecimal, this is 3D 3A C7 11, which a little-endian CPU stores in memory as

11 C7 3A 3D
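The normalization of a pure fraction can be sketched numerically. This example assumes the exponent is taken as the floor of the base-2 logarithm, which yields n = -5 for 0.0456; the variable names are illustrative, and the code is a C# top-level program (C# 9 or later).

```csharp
using System;

double x = 0.0456;
int n = (int)Math.Floor(Math.Log(x, 2.0));   // exponent n such that 1 <= x / 2^n < 2
double significand = x / Math.Pow(2.0, n);   // the 1.xxxx part, about 1.4592
Console.WriteLine($"{x} = {significand} * 2^{n}");

// The bits actually stored for 0.0456f.
uint bits = BitConverter.ToUInt32(BitConverter.GetBytes(0.0456f), 0);
uint p = (bits >> 23) & 0xFF;
Console.WriteLine($"P = {p}, bits = {bits:X8}");  // P = 122, bits = 3D3AC711
```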

**5 Arithmetic on floating-point numbers**

**5.1 Addition and subtraction of floating-point numbers**

Let two floating-point numbers be $X = M_x \times 2^{E_x}$ and $Y = M_y \times 2^{E_y}$.

Computing X ± Y takes the following 5 steps:

(1) Exponent alignment: the smaller exponent is aligned to the larger one, and the mantissa of the operand with the smaller exponent is shifted right accordingly.

(2) Addition or subtraction of the mantissas.

(3) Normalization: the mantissa of the result must be that of a normalized floating-point number. In two's complement with a double sign bit (00 for positive numbers, 11 for negative numbers, 01 for positive overflow, 10 for negative overflow), it must have

the form 001xxx...xx or 110xxx...xx.

If it does not have this form, it is normalized by shifting left or right.

(4) Rounding: to preserve accuracy, the mantissa bits shifted out to the right during alignment or right normalization are rounded by the "0 discard, 1 round up" method.

(5) Checking the result: check whether the exponent has overflowed.

If the exponent underflows (the excess code is 00...0), the result is set to machine zero;

if the exponent overflows (it exceeds the maximum representable exponent), the overflow flag is set.

Now a concrete example illustrates the 5 steps above.

Example: given $x = 0.0110011 \times 2^{11}$ and $y = 0.1101101 \times 2^{-10}$ (all numbers here are binary), compute x + y.

First, the two numbers are converted to machine format. In a floating-point number, the exponent is usually represented in excess code and the mantissa in two's complement. x is first normalized to $0.1100110 \times 2^{10}$.

Note that the excess code of the exponent -10 is 0 110.

[X] float: 0 1 010 1100110

[Y] float: 0 0 110 1101101

(fields, from left to right: sign bit, exponent, mantissa)

(1) Exponent difference: $|\Delta E| = |1010 - 0110| = 0100$

(2) Alignment: Y has the smaller exponent, so the mantissa of Y is shifted right by 4 bits

[Y] float becomes 0 1 010 0000110, with the shifted-out bits 1101 saved temporarily

(3) Sum of the mantissas, in two's complement with a double sign bit

00 1100110

+00 0000110

00 1101100

(4) Normalization: the sum already meets the normalization requirement

(5) Rounding by "0 discard, 1 round up": the highest shifted-out bit is 1, so 1 is added to the mantissa, giving 1101101

So the final result in floating-point format is: 0 1 010 1101101

namely $x + y = +0.1101101 \times 2^{10}$
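The worked addition can be reproduced with a small simulation. This is only a sketch under several assumptions that are not in the original: the name `AddToyFloats` is hypothetical, mantissas are 7-bit integers standing for 0.mmmmmmm in binary, both operands are positive and already normalized, and the exponent overflow check of step (5) is left out. It is a C# top-level program (C# 9 or later).

```csharp
using System;

// Adds two positive, normalized toy floats 0.mmmmmmm (binary) * 2^e.
(int Mantissa, int Exponent) AddToyFloats(int mx, int ex, int my, int ey)
{
    // Step 1: align exponents; the operand with the smaller exponent is shifted right.
    if (ex < ey) { (mx, my) = (my, mx); (ex, ey) = (ey, ex); }
    int shift = Math.Min(ex - ey, 8);                      // beyond 8 bits nothing is left
    int guard = shift > 0 ? (my >> (shift - 1)) & 1 : 0;   // highest bit shifted out
    my >>= shift;

    // Step 2: add the mantissas.
    int sum = mx + my;
    int e = ex;

    // Step 3: normalize; a carry out of 7 bits needs one right shift.
    if (sum > 0x7F) { guard = sum & 1; sum >>= 1; e++; }

    // Step 4: round ("0 discard, 1 round up") with the highest shifted-out bit.
    sum += guard;
    if (sum > 0x7F) { sum >>= 1; e++; }                    // rounding may carry again
    return (sum, e);
}

var r = AddToyFloats(0b1100110, 2, 0b1101101, -2);
Console.WriteLine($"0.{Convert.ToString(r.Mantissa, 2).PadLeft(7, '0')} * 2^{r.Exponent}");
// prints: 0.1101101 * 2^2
```

The printed exponent 2 is the decimal value of the binary exponent 10 used in the text.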

**5.2 Multiplication and division of floating-point numbers**

(1) Exponent operation: the exponents are added (multiplication) or subtracted (division)

That is, $[E_x + E_y]_{\text{excess}} = [E_x]_{\text{excess}} + [E_y]_{\text{complement}}$

$[E_x - E_y]_{\text{excess}} = [E_x]_{\text{excess}} + [-E_y]_{\text{complement}}$

(2) Mantissa operation: the mantissas are multiplied or divided, and the result is rounded

Example: $x = 0.0110011 \times 2^{11}$, $y = 0.1101101 \times 2^{-10}$, compute x × y

Solution: [X] float: 0 1 010 1100110

[Y] float: 0 0 110 1101101

(1) Exponent addition

$[E_x + E_y]_{\text{excess}} = [E_x]_{\text{excess}} + [E_y]_{\text{complement}}$ = 1 010 + 1 110 = 1 000 (the carry out of the top bit is discarded)

1 000 is the excess code of 0

(2) The product of the mantissas in sign-magnitude form is:

0 10101101101110

(3) Normalization: the product already meets the normalization requirement, so no left shift is needed; the mantissa and the exponent are unchanged.

(4) Rounding: the product is cut to 7 bits; the highest discarded bit is 1, so by the rounding rule 1 is added, giving 1010111

So $x \times y = 0.1010111 \times 2^{0}$
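The multiplication example can be simulated in the same spirit. Again this is a sketch under assumptions not found in the original: `MultiplyToyFloats` is a hypothetical name, mantissas are 7-bit integers standing for 0.mmmmmmm in binary, both operands are positive and normalized, exponents are plain integers rather than excess codes, and exponent overflow is not checked. It is a C# top-level program (C# 9 or later).

```csharp
using System;

// Multiplies two positive, normalized toy floats 0.mmmmmmm (binary) * 2^e.
(int Mantissa, int Exponent) MultiplyToyFloats(int mx, int ex, int my, int ey)
{
    int e = ex + ey;                 // step 1: add the exponents
    int product = mx * my;           // step 2: 14-bit product of two 7-bit mantissas

    // Step 3: normalize; with normalized inputs at most one left shift is needed.
    if (product != 0 && product < (1 << 13)) { product <<= 1; e--; }

    // Step 4: keep the upper 7 bits and round with the highest discarded bit.
    int mantissa = product >> 7;
    int guard = (product >> 6) & 1;
    mantissa += guard;
    if (mantissa > 0x7F) { mantissa >>= 1; e++; }   // rounding may carry out
    return (mantissa, e);
}

var r = MultiplyToyFloats(0b1100110, 2, 0b1101101, -2);
Console.WriteLine($"0.{Convert.ToString(r.Mantissa, 2).PadLeft(7, '0')} * 2^{r.Exponent}");
// prints: 0.1010111 * 2^0
```

The intermediate product 1100110 × 1101101 is 10101101101110, the 14-bit value shown in step (2) of the text.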

/******************************************************************************************

* "Author" : Flyingbread

* "Date": March 2, 2007

* "Notice":

* 1. This is an original technical article, first published on the author's personal site at cnblogs (http://flyingbread.cnblogs.com/). Please credit the author and the source when quoting.

* 2. This article must be reproduced and quoted in full. No organization or individual is authorized to modify any of its content or to use it for commercial purposes.

* 3. This notice is part of the article and must be included in any reprint or quotation.

* 4. This article draws on a number of sources from the Internet, which are not listed individually; thanks to all of them.

******************************************************************************************/