Concepts about floating point numbers

Last Update:2018-12-06 Source: Internet

Author: User


Floating-point types include float and double. A float occupies 32 bits and a double occupies 64 bits. The binary storage format follows the IEEE 754 standard. Taking float as an example:

Sign bit: 0 for a positive number, 1 for a negative number. In a float, the sign bit is followed by 8 exponent bits and 23 mantissa (fraction) bits.

Taking the float value 123.456 as an example, let's analyze the binary storage format:

First, convert the decimal number 123.456 to binary: 1111011.01110100101111001

(How is the fractional part 0.456 converted to binary? Multiply by 2 repeatedly, taking the integer part of each product as the next binary digit.)
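The multiply-by-2 procedure can be sketched as follows. This is only an illustration and not code from the original article: the function name fraction_to_binary is hypothetical, and it truncates the result instead of rounding it.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns the first `digits` binary digits of a
// fraction in [0, 1), truncated (not rounded).
std::string fraction_to_binary(double frac, int digits)
{
    assert(frac >= 0.0 && frac < 1.0);
    std::string bits;
    for (int i = 0; i < digits; ++i)
    {
        frac *= 2.0;        // move the next binary digit in front of the point
        if (frac >= 1.0)
        {
            bits += '1';
            frac -= 1.0;    // drop the integer part and continue
        }
        else
        {
            bits += '0';
        }
    }
    return bits;
}
```

For example, fraction_to_binary(0.456, 19) yields 0111010010111100011. The 17-digit fraction used in this article ends in 1 rather than 0 because the digits that follow (11...) round the last kept bit up.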

1111011.01110100101111001 is 1.11101101110100101111001 multiplied by 2 to the power of 6.

First, this is a positive number, so the sign bit is 0.

The exponent is 6.

(How is the exponent 6 stored? The stored value is the exponent plus the bias 127: 6 + 127 = 133, which is 10000101 in binary.)

The mantissa is the fractional part of 1.11101101110100101111001, that is:

11101101110100101111001

To sum up, the binary storage format of 123.456 (sign, exponent, mantissa) is:
0 10000101 11101101110100101111001

Use a piece of code to verify:

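#include <cstdlib>
#include <iostream>

using namespace std;

// Print the 8 bits of one byte, most significant bit first.
void printbinary(const unsigned char val)
{
    for (int i = 7; i >= 0; i--)
        if (val & (1 << i))
            std::cout << "1";
        else
            std::cout << "0";
}

int main()
{
    float d = 123.456f;
    // View the float's storage as raw bytes.
    unsigned char *cp = reinterpret_cast<unsigned char *>(&d);
    // Start at the highest address so the most significant byte prints first.
    for (int i = static_cast<int>(sizeof(float)) - 1; i >= 0; --i)
    {
        printbinary(cp[i]);
    }
    std::cout << std::endl;
    system("pause"); // Windows only: keeps the console window open
}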

Note that x86 is a little-endian architecture: the low-order bytes of a value are stored at low memory addresses, and the high-order bytes are stored at high memory addresses. The loop above, for (int i = sizeof(float) - 1; i >= 0; --i), therefore prints the byte at the highest address first, that is, the most significant byte of the binary representation.
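If you are not sure about the byte order of your machine, a small check like the one below can tell you. It is only a sketch: byte_at and is_little_endian are hypothetical helper names, and the probe value 0x11223344 is arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Returns the byte stored at offset `index` from the start of `value`.
unsigned char byte_at(std::uint32_t value, std::size_t index)
{
    assert(index < sizeof value);
    unsigned char bytes[sizeof value];
    std::memcpy(bytes, &value, sizeof value);
    return bytes[index];
}

// True when the least significant byte sits at the lowest address.
bool is_little_endian()
{
    return byte_at(0x11223344u, 0) == 0x44;
}
```

On x86, is_little_endian() is expected to return true, which is why the verification program walks the bytes from the highest index down.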

Program execution result (shown here grouped into sign, exponent, and mantissa):

0 10000101 11101101110100101111001

This matches the analysis above.
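The three fields can also be pulled apart directly with shifts and masks instead of being read off a printed bit string. The following is a minimal sketch that assumes a 32-bit IEEE 754 float; FloatFields, decode_float and unbiased_exponent are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

struct FloatFields
{
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, stored with a bias of 127
    std::uint32_t fraction;  // 23 bits, the leading 1 is not stored
};

// Split a float into its sign, exponent, and fraction fields.
FloatFields decode_float(float value)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // copy the raw bit pattern
    FloatFields f;
    f.sign = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    return f;
}

// Actual exponent of a normal number: stored exponent minus the bias.
int unbiased_exponent(const FloatFields &f)
{
    assert(f.exponent != 0 && f.exponent != 255);  // normal numbers only
    return static_cast<int>(f.exponent) - 127;
}
```

For 123.456f this gives sign 0, stored exponent 133 (actual exponent 6), and fraction 0x76E979, which is 11101101110100101111001 in binary.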

The double type uses the same binary storage format as float, only wider: 1 sign bit, 11 exponent bits (bias 1023), and 52 mantissa bits.

The part above is reposted from another blog; you can check it yourself. Now, let's raise a question:

double a = 1.1;
int b = (int &)a;

Then output b. What is the result? 1? Of course not. The result is -1717986918. Why? Let's analyze it:

A double stores the fraction with 52 bits. For 1.1, those 52 bits are 0001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010.
Note the last four bits: the computer also has to round. Here the discarded bits begin with 1 (1001...), so the stored value is rounded up by adding 1 to the last stored bit, and 1001 + 1 becomes 1010.
On little-endian x86, (int &)a reads the low 32 bits of the double, 1001 1001 1001 1001 1001 1001 1001 1010, and interprets them as a signed integer. The first bit is the sign bit; it is 1, so the number is negative. Its two's complement is 0110 0110 0110 0110 0110 0110 0110 0110, which is 1717986918, so b is -1717986918.
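The same result can be reproduced without the reference cast, which relies on type punning. The sketch below copies the bit pattern with std::memcpy and assumes a 64-bit IEEE 754 double; double_bits and low_word_as_int are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Returns the raw 64-bit pattern of a double.
std::uint64_t double_bits(double value)
{
    static_assert(sizeof(double) == sizeof(std::uint64_t), "assumes a 64-bit double");
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

// Interprets the low 32 bits of the double as a signed 32-bit integer,
// which is what the cast reads on a little-endian machine.
std::int32_t low_word_as_int(double value)
{
    const std::uint32_t low = static_cast<std::uint32_t>(double_bits(value) & 0xFFFFFFFFu);
    std::int32_t result;
    std::memcpy(&result, &low, sizeof result);
    // Sanity check: the signed value carries the same bit pattern.
    assert(static_cast<std::uint32_t>(result) == low);
    return result;
}
```

For 1.1, double_bits returns 0x3FF199999999999A, and low_word_as_int returns -1717986918 (the low word 0x9999999A read as a signed integer).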

Another interesting thing is to look at the following code:

double a = 3.0, b = 10.0, ans;

ans = a / b;

Step through this in a debugger and look at the value of ans. Is it exactly 0.3? Of course not: it is 0.299999999... Why? C uses IEEE 754 floating-point numbers; for example, the double type is 64 bits wide. Many numbers, 0.3 among them, are infinitely repeating fractions in binary, so they have to be cut off to fit. Unless you use a third-party exact arithmetic library, the result is the same in C regardless of the compiler.

When you output such data, it is best to limit the number of digits printed with a width control such as %5.2f. In addition, never compare floating-point values directly with ==; prefer a test such as fabs(a - b) < 0.000001.
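Both pieces of advice, limiting the printed digits and comparing with a tolerance, can be sketched like this. The helper names format_fixed and nearly_equal are hypothetical, and the tolerance is whatever the caller considers close enough (0.000001 in the text).

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <string>

// Format a double with a fixed field width and number of decimals,
// equivalent to printf("%<width>.<precision>f", value).
std::string format_fixed(double value, int width, int precision)
{
    char buffer[64];
    std::snprintf(buffer, sizeof buffer, "%*.*f", width, precision, value);
    return std::string(buffer);
}

// Compare two doubles with an absolute tolerance instead of ==.
bool nearly_equal(double a, double b, double tolerance)
{
    assert(tolerance > 0.0);
    return std::fabs(a - b) < tolerance;
}
```

With these, format_fixed(3.0 / 10.0, 5, 2) produces " 0.30", and nearly_equal(0.1 + 0.2, 0.3, 0.000001) is true even though the two values are not bit-for-bit identical.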

Another method is to add a very small positive number to 3.0 before dividing. For example, if the smallest positive decimal your compiler supports is 0.000000001, add this value to 3.0 during the calculation and leave 10 unchanged. This solves only some of the problems. To be safe, check the limits defined in the float.h header of the standard library (cfloat in C++).
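As one way of using those limits, the sketch below builds a tolerance from DBL_EPSILON (a real macro in cfloat) instead of a hand-picked constant. The function name nearly_equal_scaled and the factor parameter are assumptions made for illustration, and scaling by the larger operand is only one of several reasonable choices.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>

// Compare two doubles with a tolerance proportional to their magnitude.
// `factor` says how many multiples of DBL_EPSILON are accepted.
bool nearly_equal_scaled(double a, double b, double factor)
{
    assert(factor > 0.0);
    const double scale = std::fmax(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= factor * DBL_EPSILON * scale;
}
```

For example, nearly_equal_scaled(0.1 + 0.2, 0.3, 4.0) is true, while nearly_equal_scaled(3.0, 3.000001, 4.0) is false. The same header also defines DBL_DIG, the number of decimal digits a double is guaranteed to preserve.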

A friend said that, in the beginning, both Excel and the Windows Calculator used IEEE 754 floating-point numbers, and that a double has at most 15 significant decimal digits.
Later, however, a complicated fraction-based algorithm was implemented in the Windows Calculator. You can do something similar yourself: use a pair p and q to store the numerator and denominator, and treat every value you encounter as a rational number (irrational numbers can only be truncated). This is for reference only.
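The p and q idea can be sketched as a small rational-number type. This is a hypothetical illustration and not the algorithm used by the Windows Calculator; the names Fraction, make_fraction, add and divide are made up here, and overflow of the 64-bit integers is not handled.

```cpp
#include <cassert>
#include <cstdint>

struct Fraction
{
    std::int64_t p;  // numerator
    std::int64_t q;  // denominator, always kept positive
};

// Greatest common divisor of |a| and |b| (Euclid's algorithm).
std::int64_t gcd_i64(std::int64_t a, std::int64_t b)
{
    if (a < 0) a = -a;
    if (b < 0) b = -b;
    while (b != 0)
    {
        const std::int64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Build a fraction in lowest terms with a positive denominator.
Fraction make_fraction(std::int64_t p, std::int64_t q)
{
    assert(q != 0);
    if (q < 0) { p = -p; q = -q; }
    const std::int64_t g = gcd_i64(p, q);  // never 0 because q != 0
    return Fraction{p / g, q / g};
}

Fraction add(const Fraction &a, const Fraction &b)
{
    return make_fraction(a.p * b.q + b.p * a.q, a.q * b.q);
}

Fraction divide(const Fraction &a, const Fraction &b)
{
    assert(b.p != 0);  // division by zero is not allowed
    return make_fraction(a.p * b.q, a.q * b.p);
}
```

With this representation, dividing 3/1 by 10/1 gives exactly 3/10, and 1/10 + 2/10 is exactly 3/10, with no rounding error as long as the integers do not overflow.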
