Representation and operation of floating-point numbers


1 Representation of floating-point numbers

In general, a floating-point number can be represented in the following format:

S P M


where S is the sign bit, P is the exponent (order code), and M is the mantissa.

On the IBM PC, a single-precision floating-point number occupies 32 bits (4 bytes) and a double-precision floating-point number occupies 64 bits (8 bytes). The widths of S, P, and M in each format, together with the formula for the represented value, are given in the following table:

Type              S   P    M    Value formula                   Bias
Single precision  1   8    23   (-1)^S * 2^(P-127) * 1.M        127
Double precision  1   11   52   (-1)^S * 2^(P-1023) * 1.M       1023


Taking single precision as an example, the binary layout is as follows:

S (bit 31) P (bits 30 to 23) M (bits 22 to 0)

where S is the sign bit, which is 0 for a positive number and 1 for a negative number, and P is the exponent, which is stored in biased (excess) form. (In the textbook definition, the biased code and the two's complement code of the same value differ only in the sign bit; the remaining bits are identical. For a positive number, the sign-magnitude, ones' complement, and two's complement codes are the same; for a negative number, the two's complement is obtained by inverting every bit of the absolute value and then adding 1.) Note that IEEE 754 single precision uses a bias of 127, not the 128 that the textbook definition would give for an 8-bit field.
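The three fields can be pulled out of a real single-precision value with a few shifts and masks. The following is a minimal sketch in Python (not the article's C#), using the standard `struct` module; the helper name `float32_fields` is made up for this illustration.

```python
import struct

def float32_fields(value):
    """Split a number, stored as IEEE 754 single precision, into (S, P, M)."""
    # Pack as a big-endian float32, then reinterpret the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                 # bit 31
    exponent = (bits >> 23) & 0xFF    # bits 30 to 23, stored with a bias of 127
    mantissa = bits & 0x7FFFFF        # bits 22 to 0, the part after the implicit "1."
    return sign, exponent, mantissa

s, p, m = float32_fields(-6.5)        # -6.5 = -1.101 (binary) * 2^2
print(s, format(p, "08b"), format(m, "023b"))
# 1 10000001 10100000000000000000000
```

Here the exponent field 10000001 is 129, that is, 2 + 127.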

For simplicity, this article discusses only single-precision floating-point numbers, and double-precision floating-point numbers are stored and represented in the same way.

2 Representation conventions for floating-point numbers

Single-precision and double-precision floating-point numbers are both defined by the IEEE 754 standard, which reserves some bit patterns for special values. For single precision:

(1) When P = 0 and M = 0, the value is 0.

(2) When P = 255 and M = 0, the value is infinity; the sign bit determines whether it is positive or negative infinity.

(3) When P = 255 and M != 0, the value is NaN (not a number).
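These three conventions translate directly into a small classifier. The sketch below is in Python and the function names `classify_float32` and `bits_of` are hypothetical; it only distinguishes the special patterns listed above and lumps everything else together.

```python
import struct

def classify_float32(bits):
    """Classify a 32-bit pattern according to the special conventions."""
    sign = bits >> 31
    p = (bits >> 23) & 0xFF      # exponent field
    m = bits & 0x7FFFFF          # mantissa field
    if p == 0 and m == 0:
        return "-0" if sign else "+0"
    if p == 255 and m == 0:
        return "-infinity" if sign else "+infinity"
    if p == 255:
        return "NaN"
    return "ordinary number"     # everything else, including values near zero with P = 0

def bits_of(value):
    """Return the single-precision bit pattern of a Python float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    return bits

for value in (0.0, float("inf"), float("-inf"), float("nan"), 1.5):
    print(value, classify_float32(bits_of(value)))
```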

When working with the .NET Framework, we typically encounter the following three constants:

using System;

Console.WriteLine(float.MaxValue); // 3.402823E+38
Console.WriteLine(float.MinValue); // -3.402823E+38
Console.WriteLine(float.Epsilon); // 1.401298E-45
// Converted to double, their values are as follows
Console.WriteLine(Convert.ToDouble(float.MaxValue)); // 3.40282346638529E+38
Console.WriteLine(Convert.ToDouble(float.MinValue)); // -3.40282346638529E+38
Console.WriteLine(Convert.ToDouble(float.Epsilon)); // 1.40129846432482E-45

So how are these values calculated?

From the conventions above, the largest usable value of the exponent field P is 11111110, that is, 254. The value 255 is reserved for the special cases, so 254 is the largest exponent field of a number that can be represented exactly. The largest mantissa is 11111111111111111111111 (23 ones).

So the bit pattern of the maximum value is: 0 11111110 11111111111111111111111.

That is, 2^(254-127) * (1.11111111111111111111111)_2 = 2^127 * (1 + 1 - 2^-23) = 3.40282346638529E+38.

This agrees with the value printed above after conversion to double. The smallest (most negative) number is naturally -3.40282346638529E+38.

For the number closest to 0, IEEE 754 extends the range of values near zero with denormalized numbers: when the exponent field is 0, the exponent is taken to be -126 and the mantissa is interpreted as 0.M instead of 1.M. The smallest nonzero mantissa is then (0.00000000000000000000001)_2, and the bit pattern of the number is: 0 00000000 00000000000000000000001

That is, 2^-126 * 2^-23 = 2^-149 = 1.40129846432482E-45, which agrees with the Epsilon value above.

If we restrict ourselves to normalized numbers, which keep the full precision of the mantissa, the positive number closest to 0 is: 0 00000001 00000000000000000000000

That is, 2^-126 * (1 + 0) = 1.17549435082229E-38.
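The three limit values can be cross-checked by building the bit patterns directly and comparing them with the closed-form expressions. This is a Python sketch; `from_bits` is a hypothetical helper, and the hexadecimal constants are simply the bit patterns of the largest finite value, the smallest positive value, and the smallest positive normalized value written in hex.

```python
import struct

def from_bits(bits):
    """Interpret a 32-bit unsigned integer as an IEEE 754 single-precision value."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value

max_value = from_bits(0x7F7FFFFF)    # 0 11111110 11111111111111111111111
epsilon = from_bits(0x00000001)      # 0 00000000 00000000000000000000001
min_normal = from_bits(0x00800000)   # 0 00000001 00000000000000000000000

# Each pattern matches its formula exactly (all three values fit in a double).
print(max_value == 2.0**127 * (2 - 2.0**-23))   # True
print(epsilon == 2.0**-149)                     # True
print(min_normal == 2.0**-126)                  # True
print(max_value, epsilon, min_normal)
```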

3 Accuracy of floating-point numbers

A single-precision floating-point number maps the infinite set of real numbers onto a finite 32 bits, so in most cases the stored value is only an approximation. In addition, errors accumulate and propagate through floating-point operations. Two floating-point numbers that look equal at a given display precision may compare as unequal because their least significant digits differ.

Because a binary floating-point number may not be exactly equal to the decimal number it approximates, a mathematical or comparison operation on floating-point numbers may not give the same result as the same operation carried out on the decimal numbers.

A floating-point value may also fail to round-trip. A value round-trips when one operation converts the original floating-point number to another format, the inverse operation converts it back to a floating-point number, and the final number is equal to the original. A round trip can fail because one or more of the least significant bits are lost or changed during the conversion.
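Each of these effects is easy to reproduce. The sketch below uses Python, whose `float` is a double; the helper `to_float32` (a made-up name) rounds a value to single precision by packing and unpacking it with `struct`.

```python
import struct

def to_float32(value):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", value))[0]

# 0.1 has no finite binary expansion, so its float and double approximations differ.
print(to_float32(0.1) == 0.1)          # False
# 0.5 is a power of two and is stored exactly in both formats.
print(to_float32(0.5) == 0.5)          # True
# Rounding errors make two results that look equal compare as unequal.
print(0.1 + 0.2 == 0.3)                # False

# Round trip through text: 15 significant digits are not enough for this double.
x = 1.0 / 3.0
print(float(format(x, ".15g")) == x)   # False
print(float(repr(x)) == x)             # True
```

The last two lines show a failed and a successful round trip: the shorter text form drops low-order bits, while `repr` keeps enough digits to recover the value.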

4 Converting a floating-point number to binary

4.1 Converting a floating-point number with no fractional part

First, we use a floating-point number with no fractional part to illustrate how a floating-point number is converted to its binary representation. Suppose the value to be converted is 45678.0f.

For a number like this, the integer part is converted directly to binary:

1011001001101110.0

Normalization requires the mantissa to have the form 1.M, so we move the binary point to the left until only a single 1 remains in front of it, giving 1.011001001101110. The point has moved 15 places. Moving the point one place to the left divides by 2 and moving it to the right multiplies by 2, so the original number is equal to 1.011001001101110 * 2^15. This gives us both the mantissa and the exponent. Because the leading 1 is always present in a normalized number, it is not stored, and the mantissa bits become: 011001001101110.

Finally, pad the mantissa with zeros on the right until it is 23 bits long: 01100100110111000000000.

Now for the exponent: by the earlier definition, P - 127 = 15, so P = 142, which is 10001110 in binary.

45678.0f is positive, so the sign bit is 0. Putting the parts together in the format described earlier gives: 0 10001110 01100100110111000000000.

This is the binary representation of 45678.0f. To get the hexadecimal representation, simply split the bit string into groups of four and convert each group to a hexadecimal digit. Note, however, that x86 CPUs are little-endian (the low-order byte is stored first and the high-order byte last), so in memory the bytes appear in the reverse order of the bit string above. It is easy to check whether the CPU is little-endian:

BitConverter.IsLittleEndian;
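As a cross-check, the bytes of 45678.0f can be printed in both byte orders. This is a small Python sketch using the standard `struct` and `sys` modules (not the article's C#); `sys.byteorder` plays the same role as the .NET property.

```python
import struct
import sys

big_endian = struct.pack(">f", 45678.0)      # most significant byte first
little_endian = struct.pack("<f", 45678.0)   # byte order used in memory on x86

print(" ".join(f"{b:02X}" for b in big_endian))      # 47 32 6E 00
print(" ".join(f"{b:02X}" for b in little_endian))   # 00 6E 32 47
print(sys.byteorder)                                 # "little" on an x86 machine
```

Reading 47 32 6E 00 as bits gives sign 0, exponent field 10001110 (142), and mantissa 01100100110111000000000.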

4.2 Converting a floating-point number with a fractional part

Floating-point numbers with a fractional part raise the problem of precision, as illustrated below. Suppose the decimal number to be converted is 123.456f.

For a number like this, the integer part and the fractional part are processed separately. The integer part is handled as before and gives 1111011 in binary. The fractional part is more troublesome. Since binary has only the digits 0 and 1, a fraction can only be written in the following form:

a1*2^-1 + a2*2^-2 + a3*2^-3 + ... + an*2^-n

where each ai is 0 or 1. In theory, this form can represent exactly any fraction that has a finite binary expansion. But the mantissa has only 23 bits, so a loss of precision is unavoidable.

In many cases, a decimal fraction can only be approximated. Consider the decimal fraction 0.456: how do we express it in binary? The usual method is repeated multiplication by 2.

Multiply the number by 2: the result, 0.912, is less than 1, so the first bit is 0. Multiply by 2 again: the result, 1.824, is not less than 1, so the second bit is 1, and we subtract 1 to leave 0.824. Multiply by 2 again, and so on, until the remainder is 0 or enough bits have been produced.
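The multiply-by-2 procedure can be sketched in a few lines of Python. The function name `fraction_bits` is made up for this example, and `fractions.Fraction` is used so that the decimal input 0.456 is handled exactly rather than through a binary approximation.

```python
from fractions import Fraction

def fraction_bits(fraction, count):
    """Return the first `count` binary digits after the point of 0 <= fraction < 1."""
    bits = []
    for _ in range(count):
        fraction *= 2
        if fraction >= 1:
            bits.append("1")      # the doubled value reached 1: emit 1, keep the remainder
            fraction -= 1
        else:
            bits.append("0")
    return "".join(bits)

print(fraction_bits(Fraction("0.456"), 17))   # 01110100101111000
print(fraction_bits(Fraction("0.456"), 18))   # 011101001011110001
print(fraction_bits(Fraction(5, 8), 3))       # 101, since 0.625 = 1/2 + 1/8
```

The 18th bit of 0.456 is 1, so if only 17 fractional bits can be kept and the first discarded bit is rounded up, the last retained bit changes from 0 to 1.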

In many cases the binary expansion is longer than the 24 significant bits available (23 stored mantissa bits plus the implicit leading 1), and the excess bits must be discarded. The rounding rule is "0 is dropped, 1 is carried": if the first discarded bit is 1, add 1 to the last retained bit. In this way we obtain the binary representation: 1111011.01110100101111001.

Now move the binary point to the left, 6 places in all. The mantissa is then 1.11101101110100101111001, and the exponent field is 6 + 127 = 133, which is 10000101 in binary. The complete binary representation is:

0 10000101 11101101110100101111001

In hexadecimal: 42 F6 E9 79

Since the CPU is little-endian, it is stored in memory as: 79 E9 F6 42.
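The hand-assembled fields can be compared with what a library produces for 123.456f. In this Python sketch, `assemble` is a hypothetical helper that joins the sign, exponent, and mantissa bit strings into four bytes; `struct.pack` provides the reference encoding.

```python
import struct

def assemble(sign, exponent_bits, mantissa_bits):
    """Join the three bit fields and return the 4 bytes, most significant first."""
    word = int(f"{sign}{exponent_bits}{mantissa_bits}", 2)
    return word.to_bytes(4, "big")

by_hand = assemble(0, "10000101", "11101101110100101111001")
by_library = struct.pack(">f", 123.456)

print(by_hand.hex(), by_library.hex())   # 42f6e979 42f6e979
print(by_hand[::-1].hex())               # 79e9f642, the little-endian memory order
```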

4.3 Converting a pure fraction to binary

To convert a pure fraction (a number smaller than 1) to binary, it must first be normalized. For example, 0.0456 has to be written in the form 1.xxxx * 2^n. For a pure fraction X, the corresponding n is given by:
n = floor(log2 X)

0.0456 can be written as 1.4592 multiplied by 2 to the power -5, that is, 1.4592 * 2^-5. Once it is in this form, it is processed in the way described above for numbers with a fractional part, giving the binary representation

1.01110101100011100010001

Remove the leading 1 to get the mantissa

01110101100011100010001

The exponent field is -5 + 127 = 122, which is 01111010 in binary, so the complete representation is

0 01111010 01110101100011100010001

Finally, converted to hexadecimal this is 3D 3A C7 11, which is stored in little-endian memory as
11 C7 3A 3D
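The normalization step can be sketched in Python with `math.frexp`, which splits a number into a fraction in [0.5, 1) and a power of two; doubling the fraction gives the 1.xxxx form. The function name `normalize` is hypothetical, and `frexp` is used here instead of a logarithm to avoid rounding trouble near exact powers of two.

```python
import math
import struct

def normalize(x):
    """Write a positive x as significand * 2**n with 1 <= significand < 2."""
    fraction, exponent = math.frexp(x)   # x == fraction * 2**exponent, 0.5 <= fraction < 1
    return fraction * 2, exponent - 1

significand, n = normalize(0.0456)
print(significand, n)                    # the significand is about 1.4592, n is -5
print(n + 127)                           # 122, the stored exponent field
print(struct.pack(">f", 0.0456).hex())   # 3d3ac711
```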

5 Mathematical operations on floating-point numbers

5.1 Addition and subtraction of floating-point numbers

Let the two floating-point numbers be X = Mx * 2^Ex and Y = My * 2^Ey.

X ± Y is computed in the following 5 steps:

(1) Exponent alignment: the smaller exponent is aligned to the larger one.
(2) Mantissa addition or subtraction.
(3) Normalization: the result of the mantissa operation must be turned into a normalized floating-point number. For a two's complement mantissa with a double sign bit (00 for positive, 11 for negative, 01 for positive overflow, 10 for negative overflow), it must have the form 001xxx...xx or 110xxx...xx.
If it does not have this form, it is normalized by shifting left or right.
(4) Rounding: when the mantissa is shifted right during alignment or right normalization, the bits shifted out are usually rounded with the "0 is dropped, 1 is carried" method to preserve accuracy.
(5) Checking the result: test whether the exponent has overflowed.

If the exponent underflows (the biased code is 00...0), the result is set to machine zero;
if the exponent overflows (it exceeds the largest representable exponent), the overflow flag is set.

Now let us illustrate the 5 steps above with a concrete example.

Example: let x = 0.0110011 * 2^11 and y = 0.1101101 * 2^-10 (all digits, including the exponents, are binary); calculate x + y.

First we write the two numbers in machine form. For floating-point numbers, the exponent is usually stored in biased code and the mantissa in two's complement. Here the exponent field has 4 bits and the mantissa 7 bits, and x is first normalized to 0.1100110 * 2^10.

Note that the biased code of -10 is 0110.
[X] float: 0 1 010 1100110
[Y] float: 0 0 110 1101101
(sign bit, exponent, mantissa)

(1) Exponent difference: |ΔE| = |1010 - 0110| = 0100

(2) Alignment: Y has the smaller exponent, so the mantissa of Y is shifted right by 4 bits
[Y] float becomes 0 1 010 0000110, with the shifted-out bits 1101 saved temporarily

(3) Mantissa addition, in two's complement with a double sign bit
 00 1100110
+00 0000110
 00 1101100

(4) Normalization: the result already satisfies the normalization requirement

(5) Rounding with the "0 is dropped, 1 is carried" method: the first shifted-out bit is 1, so 1 is added to the mantissa

Therefore the final result in floating-point format is: 0 1 010 1101101

That is, x + y = +0.1101101 * 2^10
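The same steps can be traced in code. The following Python sketch is an illustration under stated assumptions, not a general implementation: it models a toy format with a 4-bit biased exponent and a 7-bit mantissa, handles positive normalized operands only, and rounds up when the first discarded bit is 1. The name `toy_add` and the constants are hypothetical.

```python
WIDTH = 7                 # mantissa bits after the binary point (toy format, an assumption)
MAX_EXPONENT = 0b1111     # largest value of the 4-bit biased exponent field

def toy_add(ex, mx, ey, my):
    """Add two positive, normalized toy floats given as (biased exponent, mantissa)."""
    # Alignment: make (ex, mx) the operand with the larger exponent.
    if ex < ey:
        ex, mx, ey, my = ey, my, ex, mx
    shift = ex - ey
    # Mantissa addition. Instead of shifting the smaller mantissa right and losing
    # bits, scale the larger one left so the shifted-out bits stay available.
    total = (mx << shift) + my
    extra = shift                      # number of low-order guard bits in `total`
    # Right normalization: the sum reached 1.0 or more, so shift right once.
    if total >> (WIDTH + extra):
        extra += 1
        ex += 1
    # Rounding: add the first discarded bit to the retained mantissa.
    if extra:
        round_bit = (total >> (extra - 1)) & 1
        total = (total >> extra) + round_bit
    if total >> WIDTH:                 # rounding carried out of the mantissa
        total >>= 1
        ex += 1
    # Result check: the exponent must still fit in its field.
    if ex > MAX_EXPONENT:
        raise OverflowError("exponent overflow")
    return ex, total

ex, m = toy_add(0b1010, 0b1100110, 0b0110, 0b1101101)
print(format(ex, "04b"), format(m, "07b"))   # 1010 1101101
```

Because both operands are positive and normalized, the sum never needs left normalization, which is why the sketch only handles the right shift.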

5.2 Multiplication and division of floating-point numbers

(1) Exponent arithmetic: the exponents are added (multiplication) or subtracted (division)
[Ex+Ey] biased = [Ex] biased + [Ey] complement
[Ex-Ey] biased = [Ex] biased + [-Ey] complement
(2) Mantissa processing: the result of multiplying or dividing the mantissas is rounded
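The two exponent rules can be checked numerically. This Python sketch assumes a 4-bit exponent field with the textbook bias of 8 (half the modulus); all function names are invented for the illustration, and the rules hold as long as the true result stays within the range -8 to 7.

```python
BITS = 4                      # width of the exponent field in the toy format (assumption)
MODULUS = 1 << BITS           # 16
BIAS = 1 << (BITS - 1)        # 8

def biased(e):
    """Biased (excess-8) code of a signed exponent."""
    return (e + BIAS) % MODULUS

def complement(e):
    """Two's complement code of a signed exponent."""
    return e % MODULUS

def biased_sum(ex, ey):
    """[Ex+Ey] biased = [Ex] biased + [Ey] complement, modulo 16."""
    return (biased(ex) + complement(ey)) % MODULUS

def biased_difference(ex, ey):
    """[Ex-Ey] biased = [Ex] biased + [-Ey] complement, modulo 16."""
    return (biased(ex) + complement(-ey)) % MODULUS

print(format(biased_sum(2, -2), "04b"))          # 1000, the biased code of 0
print(format(biased_difference(2, -2), "04b"))   # 1100, the biased code of 4
# The two codes of the same value differ only in the top (sign) bit.
print(format(biased(-2) ^ complement(-2), "04b"))   # 1000
```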

Example: x = 0.0110011 * 2^11, y = 0.1101101 * 2^-10 (binary); find x * y

Solution: [X] float: 0 1 010 1100110
[Y] float: 0 0 110 1101101

(1) Exponent addition
[Ex+Ey] biased = [Ex] biased + [Ey] complement = 1 010 + 1 110 = 1 000
1 000 is the biased code of 0

(2) The product of the sign-magnitude mantissas is:
0 10101101101110

(3) Normalization: the product is already normalized, so no left shift is needed and both the mantissa and the exponent are unchanged.

(4) Rounding: the first discarded bit is 1, so by the rounding rule 1 is added to the retained bits

So x * y = 0.1010111 * 2^0
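The multiplication can be traced in the same toy format. This Python sketch assumes a 4-bit excess-8 exponent, a 7-bit mantissa, and positive normalized operands, and it treats a negative biased exponent as underflow to machine zero; `toy_mul` and the constants are hypothetical names.

```python
BIAS = 8    # excess-8 exponent in a 4-bit field (toy format, an assumption)
WIDTH = 7   # mantissa bits after the binary point

def toy_mul(ex, mx, ey, my):
    """Multiply two positive, normalized toy floats given as (biased exponent, mantissa)."""
    # Exponent addition: subtracting the bias once keeps the sum in biased form.
    e = ex + ey - BIAS
    # Mantissa multiplication: two 7-bit fractions give a 14-bit fraction.
    product = mx * my
    # Left normalization: the product of two normalized mantissas is at least
    # 0.01 (binary), so at most one left shift is needed.
    if not product >> (2 * WIDTH - 1):
        product <<= 1
        e -= 1
    # Rounding: keep the top 7 bits and add the first discarded bit.
    m = (product >> WIDTH) + ((product >> (WIDTH - 1)) & 1)
    if m >> WIDTH:          # rounding carried out of the mantissa
        m >>= 1
        e += 1
    if e > 15:
        raise OverflowError("exponent overflow")
    if e < 0:
        return 0, 0         # exponent underflow: machine zero (assumed policy)
    return e, m

e, m = toy_mul(0b1010, 0b1100110, 0b0110, 0b1101101)
print(format(e, "04b"), format(m, "07b"))   # 1000 1010111
```

The intermediate product 1100110 * 1101101 is 10101101101110, which is already normalized, so only the rounding step changes the retained bits.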

/******************************************************************************************
* "Author": Flyingbread
* "Date": March 2, 2007
* "Notice":
* 1. This is an original technical article, first published on the author's personal site at Blog Garden (http://flyingbread.cnblogs.com/). Please credit the author and source when reproducing or quoting it.
* 2. This article must be reproduced or quoted in full. No organization or individual is authorized to modify its content or to use it for commercial purposes.
* 3. This notice is part of the article and must be included whenever the article is reproduced or quoted.
* 4. This article draws on a number of online sources that are not listed individually; thanks to their authors.
******************************************************************************************/
