Float Analysis in C

Source: Internet
Author: User
Tags: bitset, decimal to binary

Two years ago I learned that floating-point numbers should not be compared for equality with ==, because of precision problems, but I never paid much attention to the details. I knew roughly how floating-point numbers are laid out, but not clearly. As a C++ enthusiast I want to get to the bottom of every problem, so I worked through the internal representation and implementation of floating-point numbers. Where it does no serious harm, I favor explanations that are easy to understand and remember.

First, let's review sign-magnitude (original code), ones' complement (inverse code), two's complement (complement code), and offset binary (shift code, also called excess or biased code). Offset binary is the same as two's complement except that the sign bit is inverted. For positive numbers, sign-magnitude, ones' complement, and two's complement are identical. For negative numbers, ones' complement is obtained from sign-magnitude by inverting every bit except the sign bit, and two's complement is obtained by adding 1 to the lowest bit of the ones' complement. To get the offset-binary form, first compute the two's complement and then invert the sign bit.

Floating-point numbers come in two types, float and double, which occupy 4 and 8 bytes respectively, that is, 32 and 64 bits. I use the 32-bit float as the main example; double works the same way.

In the IEEE 754 standard, the 32-bit float is laid out as follows:

Field           Bits
Sign (S)        1
Exponent (E)    8
Mantissa (M)    23

Three points deserve attention here:

A. The exponent is stored in biased (offset) form with a bias of 127. A stored value of 127 means an exponent of 0, values below 127 are negative exponents, and values above 127 are positive exponents. For example, 10000001 means the exponent is 129 - 127 = 2, so the scale factor is 2^2, and 01111110 means 126 - 127 = -1, so the scale factor is 2^(-1).

B. The mantissa field holds the digits after the binary point.

C. The leading 1 before the binary point is not stored, so a mantissa field of all zeros still represents 1.00...00.

Next, only a few questions remain to be explained. Take 123.456 as an example. Rounded to 24 significant bits, its binary form is N(2) = 1111011.01110100101111001. Moving the binary point 6 places to the left gives N(2) = 1.11101101110100101111001 * 2^6, which is the form the storage format needs.

Field           Value
Sign (S)        0
Exponent (E)    10000101
Mantissa (M)    11101101110100101111001

Note that the exponent is stored with the bias added: 6 + 127 = 133 = 10000101. The leading bit of a biased exponent is 1 when the stored value is 128 or more, that is, when the true exponent is positive. The mantissa field has one digit fewer than N(2) because the leading 1 is implied and not stored.

Converting decimal to binary is often inexact (4.0 converts without loss, while 1.0/3.0 inevitably loses something), and this is where the precision problem of floating-point numbers comes from. The 23 binary digits after the binary point determine roughly the first 7 decimal digits. Why? People are often confused at this point, but it is simple. The mantissa above is binary with 23 digits after the point, and the last of them has a weight of 1/2^23 = 0.000000119, so a change in the last bit moves the value by about 0.0000001 relative to the leading 1. In other words, a float has about 7 significant decimal digits counted from the left (including the digit contributed by the implied 1), and digits from the 8th onward are unreliable.

VC6 can nevertheless print 1.0/3.0 with up to 16 digits after the decimal point. This happens because 1.0 and 3.0 are double literals in C++, so the division is carried out in double; it does not mean that a float has 16 valid digits after the decimal point. Printing a double with 17 digits likewise does not make all 17 reliable. Digits beyond the precision are not cleaned up by the compiler or the hardware either: they are simply the decimal expansion of the binary value that was actually stored. That is why a double 1.0/3.0 prints as 0.3333333333333333 at 16 digits, while a float holding 1.0f/3.0f shows other digits after about the 7th place. You can try it like this:

float f = 123456789;
cout << f << endl; // prints 1.23457e+08 with the default precision of 6; the value actually stored is 123456792

Here is a question that is easy to forget: how is a decimal fraction converted to binary? It is simple: multiply the fractional part by 2 repeatedly, and each time write down the integer part of the result (0 or 1) as the next binary digit, then continue with the remaining fraction. The process often does not terminate, so the result has to be cut off. Therefore, when we convert N(2) = 1.11101101110100101111001 * 2^6 back to decimal, the result is most likely no longer exactly 123.456. That should settle the precision question; the value range comes next.

The exponent is an 8-bit biased code. For normal numbers its largest value is 127 and its smallest is -126, because the stored values 0 and 255 are reserved. Taking 127 as the power of 2 gives 2^127, which is about 1.7014*10^38, yet we know that the range of float is -3.4*10^38 to 3.4*10^38. The reason is that when all 24 bits of the mantissa (including the implied leading 1) are 1, the mantissa 1.11...11 is very close to 2, and 2 * 2^127 is about 3.4*10^38. That is where the range of float comes from.

Double is similar to float, but its internal layout is:

Field           Bits
Sign (S)        1
Exponent (E)    11
Mantissa (M)    52

The main difference is that its exponent has 11 bits with a bias of 1023, so the largest exponent is 1023, and 2^1023 is about 0.8988*10^308. The 53-bit mantissa (including the implied leading 1) is again about 2, so the range of double is -1.7*10^308 to 1.7*10^308. As for precision, the last mantissa bit has a weight of 1.0/2^52 = 2.2*10^(-16), which gives 15 digits after the decimal point plus the digit from the implied 1. So for a double, roughly the first 15 to 16 decimal digits from the left are reliable.

Sometimes we hear the term "fixed-point number". Microcontrollers (in some mobile phones, for example) generally use only fixed-point arithmetic. A common confusion is to think that float a = 23.4; is a fixed-point number while float a = 2.34e1; is a floating-point number. This is incorrect: these are just two ways of writing the same floating-point value. A fixed-point number is one whose radix point sits at a position fixed by convention. If the point is placed right after the sign bit, the integer part is always 0; such a pure fraction is a fixed-point number too, but it can only represent values smaller than 1.

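Next, a few functions in C/C++. In C++, cout prints floating-point values with 6 significant digits by default. There are two ways to change this: the setprecision manipulator or the cout.precision member function. Used correctly they have the same effect, but setprecision only works when it is inserted into the stream: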

float mm = 123.456789f;
cout << mm << endl; // 123.457: the default is 6 significant digits in total, not 6 digits after the decimal point
setprecision(10); // no effect when called on its own; it must be inserted into the stream, as in cout << setprecision(10)
cout << mm << endl; // still 123.457
cout.precision(4); // sets the total number of significant digits to 4
cout << mm << endl; // 123.5
// The behavior looks strange at first, but it comes from how the stream library is used, not from the hardware.

As for how 0 is actually represented, some people think that +0 is an absolute 0 while -0 may stand for an extremely small number. Reading the all-zero bit pattern with the normal rule (implied leading 1, exponent 0 - 127) would even suggest that it means 2^(-127). IEEE 754 rules this out: a stored exponent of 0 is a special case, and with an all-zero mantissa it encodes exactly zero. +0 and -0 differ only in the sign bit and compare equal. Here is a simple way to look at the bits yourself:

// requires <bitset>, <cstdint>, <cstring>, <iostream> and using namespace std
float fdigital = 0.0f;
uint32_t nmem; // temporary variable holding the raw memory of the float (unsigned long may be 64 bits wide)
// Copy the memory bit by bit into the temporary variable. nmem is not numerically equal to fdigital; it is a bitwise copy.
memcpy(&nmem, &fdigital, sizeof nmem);
cout << nmem << endl; // 0 here; for most other floats a large integer is printed

bitset<32> mybit(nmem); // the 32-bit memory representation of the float
cout << mybit << endl; // 00000000000000000000000000000000; with -0.0f only the leading sign bit becomes 1

The long string of 0s above really is zero, and -0.0f differs from it only in the sign bit. The technique itself is handy: assign any other floating-point number to fdigital, and the bitset output shows its memory representation.

There is a reason why the exponent is stored in biased form: it makes comparing exponents easy, and therefore comparing the magnitudes of two floating-point numbers. Note that the exponent of a normal number cannot be 11111111. IEEE 754 reserves the stored value 0xFF for infinity and NaN, which is what an overflow produces, and reserves the stored value 0 for zero and subnormal numbers. In short, once the bias is removed, the exponent of a normal number lies in the range -126 to 127.

Finally, a point that trips up even experienced programmers: remember that unsigned float and unsigned double do not exist, and writing them is an error.

My knowledge is limited, so criticism and corrections are welcome.
