Float Analysis in C

Last Update:2014-07-27 Source: Internet

Author: User

In C/C++, how are the floating-point types float and double stored in memory?

Assume that we have the following 32-bit pattern:
[8 bits of 0] [8 bits of 0] [8 bits of 0] 0 0 0 0 1 1 1 1

Read as an integer, we can quickly conclude that this is the memory layout of int i = 15.
Now assume instead that the lowest bit has weight 2^-1 and the highest bit has weight 2^30. Then the pattern no longer represents 15, but
2^-1 + 2^0 + 2^1 + 2^2 = 7.5.

Of course, the above is just a hypothesis. What does a real float look like in memory?

First, you need to know that a float occupies 32 bits in memory and a double occupies 64 bits.

A floating-point number in memory consists of three parts.

Sign bit
Exponent
Mantissa (also called the significand)

Sign bit

This is the highest bit of the floating-point number in memory. 0 indicates a positive number, and 1 indicates a negative number. The sign bit occupies 1 bit of the 32 bits of a float.

Exponent

An exponent is the power in expressions such as 10^5 or 2^6; the 5 and 6 in these two numbers are exponents. Of course, numbers are represented in binary in memory, so the exponent here is a power of 2. For example

0 0 0 0 0 1 1 0

It is easy to see that this exponent is 6. Taken at face value, it would indicate 2^6 = 64.
The exponent occupies 8 bits of the 32 bits of a float. These 8 bits are read as the bit pattern of an unsigned int, so the stored value is an integer from 0 to 255. But a real exponent can be positive or negative, and an unsigned field cannot represent -1, -2, and so on. Therefore, the IEEE 754 floating-point standard introduces the concept of a bias. The bias is 127 for the float type; that is to say, 127 has already been added to the value stored in the exponent field. Take the previous example:

0 0 0 0 0 1 1 0

It looks like the exponent 6, but in the float layout it actually means 6 - 127 = -121: the bias of 127 is subtracted from the stored value.
For 2^1, what bit pattern stores the exponent 1 in a float?
Is it simply

0 0 0 0 0 0 0 1

No. We need stored - 127 = 1 (this is how the exponent 1 in 2^1 is recovered), so
stored = 127 + 1 = 128. (The exponent 1 in 2^1 appears in the float as the bit pattern of 128.)

1 0 0 0 0 0 0 0

This is just an example to help you understand the exponent field; nobody would really ask such a question.

The double type requires 64 bits of memory. It is also composed of three parts: sign bit, exponent, and mantissa. However, the exponent occupies 11 of the 64 bits, and the bias is 1023.

Mantissa

The mantissa occupies 23 of the 32 bits of a float. Note that the lowest bit of the exponent field described above has weight 2^0, so the highest bit of the mantissa field has weight 2^-1.

0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

So what value does this mantissa field represent in a float? You might quickly answer
2^(-2) + 2^(-3) = 0.375. In fact it is 1.375.
Recall scientific notation from school: 5 = 5.0*10^0 and 0.75 = 7.5*10^(-1). Right?
In a float, the 23 stored bits hold only the digits after the point of the number written in (binary) scientific notation. In other words, the mantissa consists of two parts: the leading bit (the nonzero digit before the point) and the fraction bits (the precision). The 23 bits store only the fraction bits. In binary the only nonzero digit is 1, so the leading bit is 1 by default and is not stored. Therefore, the pattern above represents

1 . 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

That is why the 23 stored bits of a float can represent a 24-bit number.

In the 64-bit layout of the double type, the mantissa occupies 52 bits.

We use a table to show how a float is stored in memory.

sign    exponent                      fraction bits (.f)
S       <--------- 8 --------->       <------------------ 23 ------------------>
+/-     unsigned int (biased)         2^(-1), 2^(-2), 2^(-3), ..., 2^(-23)

The above table represents the following floating-point number.
(-1)^s * 1.f * 2^(exponent - 127)

Take this 32-bit pattern:

0 | 0 0 0 1 0 1 1 0 | 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
s | 2^7 ... 2^0     | 2^-1 ... 2^-23

If we tell you that this is the memory layout of a float, what is the floating-point number?
It can be obtained quickly: (-1)^0 * (1 + 2^-2 + 2^-3) * 2^(2^1 + 2^2 + 2^4 - 127) = 1.375 * 2^(-105).

That concludes the analysis of the float and double memory layout. An earlier post, http://chuansu.iteye.com/blog/1484742, covered conversions among int, short, and char. So what happens in conversions between float/double and int?

First, let's talk about sign-magnitude, ones' complement, two's complement, and offset (biased) codes. For positive numbers, the sign-magnitude, ones' complement, and two's complement codes are the same. For negative numbers, the ones' complement code inverts every bit of the sign-magnitude code except the sign bit, and the two's complement code adds 1 to the lowest bit of the ones' complement code. The offset code is the same as the two's complement code with the sign bit inverted, so to find an offset code, first find the two's complement code and then flip the sign bit. (For 8 bits this gives a bias of 128; as described below, IEEE 754 uses a bias of 127.)

Floating-point numbers come as float and double, which occupy 4 and 8 bytes respectively, that is, 32 and 64 bits. I use only the 32-bit case as an example; double works the same way.

In the IEEE 754 standard, the 32-bit float is defined as follows:

Sign bit (S): 1 bit

Exponent code (E): 8 bits

Mantissa (M): 23 bits

Here, we should pay attention to three points. A: the exponent code is an offset code with a bias of 127. A stored 127 is equivalent to 0, less than 127 is negative, and greater than 127 is positive. For example, 10000001 means the exponent is 129 - 127 = 2, so the scale is 2^2, and 01111110 means 2^(-1).

B: the mantissa field holds the digits after the binary point.

C: the leading 1 of the mantissa is omitted, so when the mantissa field is all 0, the significand is still 1.00...00.

Next, a worked example. Take 123.456: in binary, N(2) = 1111011.01110100101111001. Moving the binary point 6 places to the left gives N(2) = 1.11101101110100101111001 * 2^6. This form fits the representation format:

Sign bit (S): 0

Exponent code (E): 10000101 (6 + 127 = 133)

Mantissa (M): 11101101110100101111001

Note that the exponent code stores 6 + 127 = 133 rather than 6, and its first bit being 1 means the exponent is positive. The mantissa drops the first digit of N(2); that is, the first digit is 1 by default.

Because converting decimal to binary is often inexact (of course, values such as 4.0 convert without loss, while values such as 1.0/3.0 inevitably lose something), floating-point numbers have limited precision. In fact, the 23 binary digits after the point determine roughly the first 7 significant decimal digits. Why? People are often confused at this point, but it is very simple. The mantissa shown above is binary, with 23 digits after the point. The last digit has weight 1/2^23 = 0.000000119, that is, about 0.0000001. In other words, for a float about 7 significant digits counted from the left are reliable (24 bits including the default 1), and from the 8th digit on they are not.

VC6 will nevertheless print many more digits for 1.0/3.0. This comes from how the output is formatted; it does not mean that 16 digits after the point are valid. If you do not believe it, try 1.0/3.0 as a double, and you can also get 17 digits after the point. The digits beyond the 7th are simply the decimal expansion of the stored binary value. They are stable, which is why the output always continues with the same digits, but they say nothing about the true quotient. You can try it like this:

float f = 123456789; cout << f << endl; // the stored value is 123456792, not 123456789; the default format prints 1.23457e+08

One question was skipped above: how is a decimal fraction converted to binary? It is very easy: multiply the fractional part by 2 and write the integer part of the result (0 or 1) as the next binary digit, then repeat with the remaining fractional part. Therefore, when we convert N(2) = 1.11101101110100101111001 * 2^6 back to decimal, it is very likely no longer exactly 123.456. That should make the precision issue clear. The value range is as follows.

The exponent code is an 8-bit offset code, so the largest exponent is 127 and the smallest exponent of a normal number is -126 (the stored values 0 and 255 are reserved). Taking 127 as the power of 2 gives 2^127, which is about 1.7014*10^38, yet we know that the float range is -3.4*10^38 to 3.4*10^38. This is because when all 24 bits of the significand (the first is 1 by default) are 1, the significand 1.11...11 is very close to 2, and 2 * 2^127 is about 3.4*10^38. That is where the float range comes from.

Double is similar to float, but its internal layout is

Sign bit (S): 1 bit

Exponent code (E): 11 bits

Mantissa (M): 52 bits

The main difference is that its exponent code has 11 bits, so the largest scale is 2^1023, about 0.8988*10^308, and the 53-bit significand is close to 2. Therefore, the value range of double is -1.7*10^308 to 1.7*10^308. As for its precision, the last mantissa bit has weight 1.0/2^52 = 2.2*10^(-16), which is 15 digits after the point plus the default leading digit. Therefore, for a double, about 15 to 16 significant digits counted from the left are reliable.

Sometimes we hear the term "fixed-point number". Microcontrollers (such as those in mobile phones) generally use only fixed-point arithmetic. When confused, we might think that float a = 23.4; is a fixed-point number and float a = 2.34e1; is a floating-point number. In fact, this is incorrect. These are only two ways of writing the same floating-point number; both are floating point. A fixed-point format instead fixes the position of the point in advance. If the point is placed after the lowest digit, the format holds integers, whose fractional part is 0. A pure fraction can also be treated as a fixed-point number with the point before the highest digit, but then it can only represent values smaller than 1.

Then let's talk about a few functions in C/C++. In C++, output uses 6 significant digits by default, but this can be changed in two ways: the setprecision manipulator or cout.precision. In the code below, however, the effects differ:

float mm = 123.456789f;
cout << mm << endl; // 123.457: the default is 6 significant digits in total, not 5 digits after the point
setprecision(10); // on its own this statement changes nothing; the manipulator must be inserted into the stream, as in cout << setprecision(10)
cout << mm << endl; // 123.457, no different from the default
cout.precision(4); // sets the total number of significant digits
cout << mm << endl; // 123.5

In short, the effect looks quite strange at first. I personally think that although it seems unpredictable, it follows from the limited precision of the stored value and from the precision setting of the stream, so it is understandable.

As for how 0 is actually represented, some people think that +0 is an absolute 0 while -0 may represent an extremely small number. A good way to settle this is to look at the bits. Under the normal formula an all-zero pattern would read as 1.0 * 2^(-127), but IEEE 754 reserves the all-zero exponent code: together with an all-zero mantissa it denotes exactly 0, and +0 and -0 differ only in the sign bit. The code is as follows:

float fdigital = 0.0f;
uint32_t nmem; // temporary variable that receives the raw memory of the float (needs <cstdint>)
memcpy(&nmem, &fdigital, sizeof nmem); // bitwise copy (needs <cstring>); nmem holds the same bits as fdigital, not the same numeric value
cout << nmem << endl; // 0 for 0.0f; for most other floats this is a large integer

bitset<32> mybit(nmem); // the 32-bit memory representation of the float; finally we can see it directly
cout << mybit << endl; // 00000000000000000000000000000000; with -0.0f only the first bit becomes 1

This is a handy trick: set fdigital to any other floating-point number, and the bitset output shows its memory representation.

There is a reason for using an offset code for the exponent: it makes exponents easy to compare, so the magnitudes of two floating-point numbers can be compared like integers. Note that the exponent code of an ordinary number never reaches 11111111. IEEE 754 reserves the exponent code 0xFF for overflow results (infinity) and NaN, and reserves 0x00 for zero and subnormal numbers. In short, for normal numbers the exponent ranges from -126 to 127.

Finally, here is something that even experts sometimes get wrong. Remember that there is no such thing as an unsigned float or unsigned double; such declarations are incorrect.

My knowledge is limited, so criticism and corrections are welcome.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
