http://topic.csdn.net/u/20090326/18/c96364bc-38cc-42d2-962e-420600625720.html?seed=1902653946&r=79968420#r_79968420
In terms of storage layout and encoding scheme, double and float are the same. The difference is that a float is 32 bits and a double is 64 bits, so a double can store values with higher precision.
Any data is stored in memory as a sequence of binary digits (0 or 1). Each 1 or 0 is one bit, and a byte on an x86 CPU is 8 bits. For example, if a 16-bit (2-byte) short int variable holds the value 1000, its binary representation is 00000011 11101000. Because of the Intel CPU architecture, the bytes are stored in reverse order, so in memory it looks like this: 11101000 00000011. That is how the fixed-point number 1000 is laid out in memory.
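The byte reversal described above can be observed directly. The following is a minimal sketch, assuming Python's standard `struct` module as a convenient way to look at raw bytes; the helper name `bits` is made up for this illustration.

```python
import struct

def bits(data):
    """Render each byte as an 8-digit binary string."""
    return " ".join(f"{b:08b}" for b in data)

value = 1000

# ">h" packs a 16-bit signed integer with the most significant byte first.
big = struct.pack(">h", value)
# "<h" packs it the way a little-endian CPU such as x86 stores it.
little = struct.pack("<h", value)

print(bits(big))     # 00000011 11101000
print(bits(little))  # 11101000 00000011
```

The value itself is unchanged; only the order in which its two bytes appear in memory differs.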
Current C/C++ compilers follow the floating-point format defined by the IEEE for float and double operations. This format is a form of scientific notation made up of a sign, an exponent, and a mantissa. The base is fixed at 2; that is, a floating-point number is expressed as the mantissa multiplied by 2 raised to the exponent, with the sign attached. The specific layouts are as follows:
Type                       Sign  Exponent  Mantissa  Total bits
float                      1     8         23        32
double                     1     11        52        64
extended (temporary real)  1     15        64        80
In C, a floating-point constant without a suffix is a double by default, so the following uses double as the example:
64 bits in total, i.e. 8 bytes. From the highest to the lowest, the bits are numbered 63, 62, 61, ..., 0:
Bit 63, the highest, is the sign bit: 1 means the number is negative, 0 means it is positive;
Bits 62-52, 11 bits in total, are the exponent bits;
Bits 51-0, 52 bits in total, are the mantissa bits.
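The three fields listed above can be pulled apart with shifts and masks. This is a rough sketch, assuming Python's `struct` module to obtain the 64 raw bits; `double_fields` is a hypothetical helper name, and the exponent is returned exactly as stored, without removing the offset that the conversion example further on explains.

```python
import struct

def double_fields(x):
    """Return (sign, stored_exponent, mantissa) of an IEEE 754 double."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = raw >> 63                    # bit 63
    exponent = (raw >> 52) & 0x7FF      # bits 62-52, 11 bits
    mantissa = raw & ((1 << 52) - 1)    # bits 51-0, 52 bits
    return sign, exponent, mantissa

sign, exponent, mantissa = double_fields(-1.5)
print(sign, exponent, f"{mantissa:052b}")
```

For -1.5 the sign is 1, the stored exponent is 1023, and only the top mantissa bit is set.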
Using the IEEE floating-point representation, let's convert the double value 38414.4 to hexadecimal.
Handle the integer and fractional parts separately. The integer part converts directly to hexadecimal: 960E. The fractional part:
0.4 = 0.5*0 + 0.25*1 + 0.125*1 + 0.0625*0 + ......
In fact, this never ends! This is the famous floating-point precision problem. So, counting the integer part in front, it is enough to compute 53 bits in total (the hidden-bit technique: the leading 1 is not written to memory).
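The never-ending expansion is easy to reproduce with the usual multiply-by-two method. Below is a small sketch, assuming exact rational arithmetic from Python's `fractions` module so that no rounding sneaks in; `fraction_bits` is an illustrative name, not something from the original post.

```python
from fractions import Fraction

def fraction_bits(frac, count):
    """Return the first `count` binary digits after the point of 0 <= frac < 1."""
    digits = []
    for _ in range(count):
        frac *= 2
        bit = int(frac)        # 1 if doubling reached or passed 1, else 0
        digits.append(str(bit))
        frac -= bit            # keep only the part after the point
    return "".join(digits)

print(fraction_bits(Fraction(2, 5), 16))  # 0110011001100110
```

The remainder returns to 2/5 after every four steps, so the group 0110 repeats forever and the digits have to be cut off somewhere.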
If you are patient enough to work out all 53 bits by hand (rounding the last one), the result is: 38414.4 (10) = 1001011000001110.0110011001100110011001100110011001101 (2)
In scientific notation: 1.001...... multiplied by 2 to the 15th power. The exponent is 15!
Now look at the exponent field. Its 11 bits could represent a signed range of -1024 to 1023. Because the exponent can be negative, an offset of 1023 is added so that the stored value is never negative. Here, 15 + 1023 = 1038. In binary: 100 00001110
Sign bit: positive, so 0!
Combining them (the leading 1 of the mantissa is dropped):
01000000 11100010 11000001 11001100 11001100 11001100 11001100 11001101
In hexadecimal this is 40 E2 C1 CC CC CC CC CD. Stored in reversed byte order, the bytes are:
CD CC CC CC CC C1 E2 40
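A hand conversion like this is easy to get wrong, so it is worth checking against what the machine produces. The sketch below repeats the steps (sign, offset exponent, 53 rounded bits with the hidden bit dropped) and compares the result with the bytes reported by Python's `struct` module; the variable names and the `hex_bytes` helper are illustrative only.

```python
import struct
from fractions import Fraction

value = Fraction(384144, 10)          # exactly 38414.4

sign = 0                              # positive
exponent = 15                         # 38414.4 = 1.001... * 2^15
biased = exponent + 1023              # stored exponent, 1038

# Scale so the leading 1 lands on bit 52, then round to an integer:
# this gives the 53 significant bits, hidden bit included.
significand = round(value * 2 ** (52 - exponent))
mantissa = significand - (1 << 52)    # drop the hidden leading 1

raw = (sign << 63) | (biased << 52) | mantissa
manual = raw.to_bytes(8, "big")

def hex_bytes(data):
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

print(hex_bytes(manual))                      # 40 E2 C1 CC CC CC CC CD
print(hex_bytes(struct.pack("<d", 38414.4)))  # CD CC CC CC CC C1 E2 40
```

The first line is the value in reading order, the second is the little-endian order in which an x86 machine stores it.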
Here is some material I found; read it and you will understand:
Any data is stored in memory as a sequence of binary digits (1 or 0). Each 1 or 0 is one bit, and a byte on an x86 CPU is 8 bits. For example, if a 16-bit (2-byte) short int variable holds the value 1156, its binary representation is 00000100 10000100. Because the Intel CPU architecture is little-endian (see a text on computer organization for background), the bytes are stored in reverse order, so it looks like this: 10000100 00000100. That is how the fixed-point number 1156 is laid out in memory.
How are floating-point numbers stored? Current C/C++ compilers all follow the floating-point representation defined by the IEEE (Institute of Electrical and Electronics Engineers). This structure is a form of scientific notation made up of a sign (positive or negative), an exponent, and a mantissa. The base is fixed at 2; that is, a floating-point number is expressed as the mantissa multiplied by 2 raised to the exponent, with the sign attached. The float format is specified as follows:
Float
32 bits in total, 4 bytes.
From the highest to the lowest, the bits are numbered 31, 30, 29, ..., 0.
Bit 31 is the sign bit: 1 means the number is negative, 0 means it is positive.
Bits 30-23, 8 bits in total, are the exponent bits.
Bits 22-0, 23 bits in total, are the mantissa bits.
The 32 bits are split into four groups of 8 bits each: group A, group B, group C, and group D.
Each group is one byte, and the groups are stored in memory in reverse order, that is, DCBA.
Let's set the reversed storage aside for now, because it would only confuse readers. I will work in the natural order first and flip the bytes at the very end.
Now let's use the IEEE floating-point representation to convert the float value 123456.0f to hexadecimal, step by step. For a floating-point number with no fractional part like this one, convert the integer directly to binary: 1 11100010 01000000, which can also be written as 11110001001000000.0. Then move the binary point to the left until only one digit, the leading 1, remains in front of it: 1.11100010010000000. The point moved 16 places in total. In binary, moving the point one place to the left corresponds to adding 1 to the exponent in base-2 scientific notation, so the original number equals 1.11100010010000000 * (2 ^ 16). Now we have both the mantissa and the exponent. Obviously the leading digit is always 1, because you would not say you bought 0016 eggs when you bought 16 eggs, right? (Hey, don't throw those rotten eggs at me!) So do we need to keep this 1? (Audience: No!) Fine, let's drop it. The mantissa then becomes 11100010010000000; pad it with zeros on the right until it is 23 bits long:
11100010010000000000000 (phew, counting all those zeros nearly did me in)
Now look at the exponent. Its 8 bits can represent an unsigned integer in the range 0 to 255, or a signed integer in the range -128 to 127. Because the exponent can be negative, 127 is added to it before it is converted to binary. Here, 16 + 127 = 143,
in binary: 10001111
123456.0f is positive, so the sign bit is 0. Putting the pieces together in the format above:
0 10001111 11100010010000000000000
01000111 11110001 00100000 00000000
Converted to hexadecimal this is 47 F1 20 00, and flipped for storage it becomes 00 20 F1 47.
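The same walkthrough can be written out as a few lines of code. This is only a sketch of the steps above for the value whose binary form is 11110001001000000 (decimal 123456), using Python's `struct` module as a cross-check; `encode_float_integer` is a hypothetical helper, and it assumes a positive integer small enough (under 2^24) that no rounding is needed.

```python
import struct

def encode_float_integer(n):
    """Encode a positive integer n (< 2**24) as an IEEE 754 float by hand."""
    exponent = n.bit_length() - 1        # places the binary point moves left
    mantissa = n - (1 << exponent)       # drop the leading 1
    mantissa <<= 23 - exponent           # pad with zeros up to 23 bits
    biased = exponent + 127              # stored exponent
    sign = 0                             # positive
    raw = (sign << 31) | (biased << 23) | mantissa
    return raw.to_bytes(4, "big")        # groups A B C D in reading order

def hex_bytes(data):
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

abcd = encode_float_integer(123456)
print(hex_bytes(abcd))         # 47 F1 20 00
print(hex_bytes(abcd[::-1]))   # 00 20 F1 47  (DCBA, as stored on x86)
```

Reversing the four bytes at the end reproduces the DCBA order in which a little-endian machine keeps the value in memory.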
Now try converting 54321.0f to its binary representation yourself for practice!
With the above foundation, let's take another example with decimals to see why there is a precision problem.
Using the IEEE floating-point representation, let's convert the float value 123.456f to hexadecimal. For a number with a fractional part like this, the integer part and the fractional part must be handled separately. The integer part converts directly to binary: 1111011. The fractional part is a little more complicated and hard to explain directly; it may be easier to start from decimal.
For example, take the decimal fraction 0.57826. 5 is the tenths digit, with weight 1/10; 7 is the hundredths digit, with weight 1/100; 8 is the thousandths digit, with weight 1/1000, and so on. The denominators of these weights are 10 ^ 1, 10 ^ 2, 10 ^ 3, ...... Let the sequence of digits be {S1, S2, S3, ......, Sn}, here 5, 7, 8, 2, 6. The pure fraction can then be written as:
N = S1 * (1/(10 ^ 1)) + S2 * (1/(10 ^ 2)) + S3 * (1/(10 ^ 3)) + ...... + Sn * (1/(10 ^ n))
Generalizing this formula to a pure fraction in base B:
N = S1 * (1/(B ^ 1)) + S2 * (1/(B ^ 2)) + S3 * (1/(B ^ 3)) + ...... + Sn * (1/(B ^ n))
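The base-B formula translates almost word for word into code. Here is a minimal sketch using exact fractions from Python's standard library, so the result is not itself disturbed by floating-point rounding; `fraction_value` is a name chosen for this example.

```python
from fractions import Fraction

def fraction_value(digits, base):
    """Evaluate N = S1/B^1 + S2/B^2 + ... + Sn/B^n exactly."""
    return sum(Fraction(s, base ** i) for i, s in enumerate(digits, start=1))

# The decimal example from the text: digits 5, 7, 8, 2, 6 in base 10.
print(fraction_value([5, 7, 8, 2, 6], 10))   # 28913/50000, i.e. 0.57826

# The same formula in base 2: binary 0.0110 is 1/4 + 1/8.
print(fraction_value([0, 1, 1, 0], 2))       # 3/8
```

With base 2, each digit contributes 1/2, 1/4, 1/8 and so on, which is exactly how the bits after the binary point of a floating-point number are weighted.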