How floating-point numbers are stored in memory

First, let's look at how to convert a decimal fraction to a binary fraction. A computer has no notion of decimal data; it only knows 0 and 1. So inside the computer, a decimal fraction is represented as a binary fraction.

**Converting a decimal fraction to a binary fraction:**

The rule can be summarized as: **repeatedly multiply the fractional part by 2 and take the integer part of each result, reading the digits in the order they are produced.**

For example, to convert 0.2 to a binary fraction: 0.2 × 2 = 0.4, whose integer part is 0, so the first binary digit after the point is 0. Then 0.4 × 2 = 0.8, integer part 0, so the second digit is 0. Then 0.8 × 2 = 1.6, integer part 1, so the third digit is 1; drop the 1 and continue with 0.6. Then 0.6 × 2 = 1.2, integer part 1, so the fourth digit is 1. The computation continues in this way with the remaining fractional part, giving 0.0011... in binary.

As another example, here is how to write 8.25 in binary. The integer part 8 is 1000 in binary. For the fractional part, 0.25 × 2 = 0.5, integer part 0, so the first binary digit after the point is 0; then 0.5 × 2 = 1.0, so the second digit is 1 and nothing is left over. So 8.25 is 1000.01 in binary.
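The multiply-by-2 procedure above can be sketched in C roughly as follows. This is only an illustration and not part of the original notes: `frac_to_binary` is a hypothetical helper name, and because it works on a `double`, the digits it produces are those of the `double` nearest to the input.

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only.
 * Writes up to max_digits binary digits of frac (0 <= frac < 1) into out
 * as '0'/'1' characters and stops early once nothing is left over.
 * out must have room for max_digits + 1 characters. */
void frac_to_binary(double frac, size_t max_digits, char *out)
{
    size_t i = 0;

    while (i < max_digits && frac > 0.0) {
        frac *= 2.0;            /* multiply the fractional part by 2 */
        if (frac >= 1.0) {      /* the integer part is the next digit */
            out[i++] = '1';
            frac -= 1.0;        /* keep only the fractional part */
        } else {
            out[i++] = '0';
        }
    }
    out[i] = '\0';
}
```

With a 16-character buffer `buf`, calling `frac_to_binary(0.25, 8, buf)` leaves `01` in `buf`, matching the fractional part of 8.25 above. For 0.2 the loop never runs out of remainder, so the digits are simply cut off at the requested length.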

**Shift code:** To find the shift code (also called the biased or excess code) of a decimal number, **either invert the sign bit of its two's complement, or add the offset 128 (assuming 8 bits).**

For example, to find the shift code of 1: 1 + 128 = 129 = 1000 0001. Alternatively, take the two's complement first and then invert the sign bit: the two's complement of 1 is 0000 0001, and inverting its sign bit gives 1000 0001. The two methods give the same result.

Likewise, for the shift code of -1: -1 + 128 = 127 = 0111 1111. Alternatively, the two's complement of -1 is 1111 1111, and inverting its sign bit gives 0111 1111. Again the two methods agree.
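The two ways of obtaining an 8-bit shift code can be checked against each other with a small C sketch. The function names here are hypothetical, chosen only for this illustration.

```c
#include <stdint.h>

/* Method 1: add the offset 128 to the signed value. */
uint8_t shift_code_by_offset(int8_t value)
{
    return (uint8_t)(value + 128);
}

/* Method 2: take the two's-complement bit pattern and invert the sign bit. */
uint8_t shift_code_by_sign_flip(int8_t value)
{
    return (uint8_t)((uint8_t)value ^ 0x80u);
}
```

Both functions return 0x81 (1000 0001) for 1 and 0x7F (0111 1111) for -1, and they agree for every value from -128 to 127.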

Note that this way of computing the shift code differs from the way the shift code (the exponent code) is computed when a floating-point number is stored, as described below. If you search for "shift code" on Baidu, what you usually find is the definition above: **add 128 to the original value, or invert the sign bit of the two's complement.** The exponent code of a floating-point number is instead obtained by **adding 127 to the original value.**

**Why use a shift code rather than two's complement for the exponent?** To simplify comparison. The exponent must be able to be negative as well as positive in order to express both very large and very small numbers. The number as a whole already carries a sign; if the exponent were also stored as a signed two's-complement value, floating-point numbers could not be compared simply, because comparing two's-complement values means dealing with the sign bit first. For this reason the exponent is stored as a non-negative value: the stored value is the actual exponent plus a fixed bias (127 for 32-bit floats), which moves it into an unsigned range that is easy to compare. Because a shift code has no sign bit, the relative size of two values can be read directly from their codes. **A shift code has no sign bit: its highest bit is a value bit, not a sign bit.** For example, in 1000 0001 above the highest bit counts as a value bit, giving 129.
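One way to see the benefit of the biased exponent: for two positive floats, comparing their raw bit patterns as unsigned integers gives the same ordering as comparing the values. The sketch below assumes `float` is a 32-bit IEEE 754 single-precision value; `float_bits` and `positive_less_by_bits` are hypothetical names used only here.

```c
#include <stdint.h>
#include <string.h>

/* Copy the bit pattern of a float into a 32-bit unsigned integer.
 * Assumes float is 32 bits wide. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* For two POSITIVE floats, decide a < b from the bit patterns alone.
 * This works because the exponent code sits above the mantissa and
 * has no sign bit of its own. Returns 1 if a < b, otherwise 0. */
int positive_less_by_bits(float a, float b)
{
    return float_bits(a) < float_bits(b);
}
```

For instance, `positive_less_by_bits(0.5f, 8.25f)` is 1 even though the exponent of 0.5 is negative (-1) and the exponent of 8.25 is positive (3). With a two's-complement exponent this plain unsigned comparison would not work.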

**How C stores floating-point numbers**

Floating-point numbers (single-precision float and double-precision double) are stored in memory in binary scientific notation. They consist of three parts: sign bit + exponent code + mantissa. A float takes 4 bytes and a double takes 8 bytes. The number of bits in each part is as follows:

| Type | Sign bit | Exponent code | Mantissa | Total length |
|---|---|---|---|---|
| float | 1 | 8 | 23 | 32 |
| double | 1 | 11 | 52 | 64 |

Sign bit: 0 indicates a positive number and 1 a negative number. Note: **this sign bit is the sign of the mantissa (that is, of the number as a whole), not of the exponent.** Also note that all floating-point numbers are signed; you cannot declare an unsigned float. Most compilers report an error for this, although some only issue a warning.

Exponent code (exponent part): stores the exponent of the scientific notation, in shift-code form.

Mantissa: described in detail below.

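The storage layout of a float, from the highest bit to the lowest, is: 1 sign bit (bit 31), 8 exponent-code bits (bits 30 to 23), 23 mantissa bits (bits 22 to 0).

The storage layout of a double is: 1 sign bit (bit 63), 11 exponent-code bits (bits 62 to 52), 52 mantissa bits (bits 51 to 0).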

**About the mantissa:** **Note that the mantissa is stored in sign-magnitude (original code) form.** 8.25 is 1000.01 in binary, which is 1.00001 × 2^3 in binary scientific notation; 120.5 is 1111000.1 in binary, which is 1.1110001 × 2^6. Any nonzero number in binary scientific notation has the form 1.xxx × 2^n. Because the digit before the point is always 1, it can be omitted, and only the xxx part is stored as the mantissa. For example, 0.5 is 0.1 in binary; since the integer part must be 1, shift the point one place to the right to get 1.0 × 2^(-1). Dropping the integer part of 1.0 leaves 0, and padding with zeros to 23 bits gives the mantissa 00000000000000000000000.

Because the leading 1 is omitted, the 23 stored bits give an effective 24 bits of precision. How many decimal digits is that? The binary form of 9 is 1001, so 4 bits are enough for one decimal digit. By the same reasoning, 2^24 = 16777216, so 24 bits give a float roughly 6 to 7 significant decimal digits (significant digits in total, not six digits after the decimal point).

**Computing the exponent code:** For example, to find the exponent code for the exponent 3, **add 127**: 3 + 127 = 130 = 1000 0010 (note that the highest bit is a value bit too). For the exponent 6: 6 + 127 = 133 = 1000 0101.

To recover the original exponent from an exponent code, **subtract 127 from the code.** For example, 1000 0010 = 130 and 130 - 127 = 3; 1000 0101 = 133 and 133 - 127 = 6.

**An example of how floating-point data is stored in memory.** Knowing that the leading 1 of the mantissa is omitted and how the exponent code is computed, we can walk through a storage example. Take 8.25. First convert it to binary: 1000.01 = 1.00001 × 2^3. The number is positive, so the sign bit is 0. The exponent is 3, so the exponent code is 3 + 127 = 130 = 1000 0010. For the mantissa, drop the 1 before the point, leaving 00001, and pad with zeros to 23 bits: 000 0100 0000 0000 0000 0000. So 8.25 is stored in memory as 0100 0001 0000 0100 0000 0000 0000 0000.
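The worked example for 8.25 can be checked by splitting a float into its three fields. This is a minimal sketch that assumes `float` is a 32-bit IEEE 754 value; `struct float_fields` and `split_float` are hypothetical names introduced for illustration.

```c
#include <stdint.h>
#include <string.h>

/* The three stored fields of a 32-bit float (hypothetical helper type). */
struct float_fields {
    uint32_t sign;           /* 1 bit: 0 = positive, 1 = negative */
    uint32_t exponent_code;  /* 8 bits: actual exponent + 127 */
    uint32_t mantissa;       /* 23 bits: fraction without the leading 1 */
};

struct float_fields split_float(float f)
{
    uint32_t bits;
    struct float_fields out;

    memcpy(&bits, &f, sizeof bits);            /* read the raw bit pattern */
    out.sign          = bits >> 31;            /* top bit */
    out.exponent_code = (bits >> 23) & 0xFFu;  /* next 8 bits */
    out.mantissa      = bits & 0x7FFFFFu;      /* low 23 bits */
    return out;
}
```

For `split_float(8.25f)` the fields come out as sign 0, exponent code 130 (1000 0010), and mantissa 0x040000 (000 0100 0000 0000 0000 0000), which matches the hand calculation.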

**Now for some special values in floating-point storage.**

**1: The exponent field is neither all 0s nor all 1s.** That is, 0 < exponent field < 255. This is called the normalized form, and the number is interpreted by the rules described above: the exponent E equals the exponent code minus 127, and a 1 is placed in front of the fractional part, giving 1.xxxx.

**2: The exponent field is all 0s.** In this case the exponent E equals 1 - 127 = -126 (not 0 - 127; this is how the standard defines it), and no leading 1 is added: the significand is read as 0.xxxxxx. This form represents ±0 and very small numbers close to 0. IEEE defines 0.0 as the special case where both the exponent field and the mantissa are zero.

**3: The exponent field is all 1s.** If the mantissa is all 0s, the value is ±infinity (the sign is given by the sign bit S). If the mantissa is not all 0s, the value is not a number (NaN).
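The three cases above can be written as a small classifier over the raw 32-bit pattern of a float. This is only a sketch; the enum and function names are hypothetical and not part of any standard header.

```c
#include <stdint.h>

/* Hypothetical labels for the cases described above. */
enum float_kind {
    FLOAT_ZERO,       /* exponent field all 0s, mantissa all 0s */
    FLOAT_SUBNORMAL,  /* exponent field all 0s, mantissa not 0 */
    FLOAT_NORMAL,     /* exponent field between 1 and 254 */
    FLOAT_INFINITY,   /* exponent field all 1s, mantissa all 0s */
    FLOAT_NAN         /* exponent field all 1s, mantissa not 0 */
};

/* Classify a float from its 32-bit pattern; the sign bit is ignored. */
enum float_kind classify_bits(uint32_t bits)
{
    uint32_t exponent_code = (bits >> 23) & 0xFFu;
    uint32_t mantissa = bits & 0x7FFFFFu;

    if (exponent_code == 0)
        return mantissa == 0 ? FLOAT_ZERO : FLOAT_SUBNORMAL;
    if (exponent_code == 0xFFu)
        return mantissa == 0 ? FLOAT_INFINITY : FLOAT_NAN;
    return FLOAT_NORMAL;
}
```

For example, 0x41040000 (the pattern of 8.25) is classified as normal, 0x7F800000 as infinity, and 0x00000009 as a subnormal value close to 0.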


**What is the range of the exponent?** Based on the normalized and denormalized forms above, let's look at the range of the floating-point exponent. The exponent field is an unsigned integer. For float it is 8 bits, so its value ranges from 0 to 255. When the exponent code is 0, by the rule above the exponent is 1 - 127 = -126. When the exponent code is 255, it marks a special value, so it does not count. When the exponent code is 254, the exponent is 254 - 127 = 127. So the exponent ranges from -126 to 127.

**The range of numbers a float can represent:** First consider the largest absolute value. As analyzed above, the largest exponent is 127, and the largest mantissa is 23 ones. So the largest value a float can represent is 1.111 1111 1111 1111 1111 1111 × 2^127 ≈ 3.4 × 10^38. With the sign bit set to 1 it is the negative number -3.4 × 10^38.

What is the smallest nonzero absolute value a float can represent? When the exponent field is all 0s, the exponent is 1 - 127 = -126, and the smallest nonzero mantissa is 000 0000 0000 0000 0000 0001, that is, the fraction 0.000 0000 0000 0000 0000 0001 = 2^(-23). The smallest nonzero absolute value is therefore 2^(-23) × 2^(-126) = 2^(-149), about 1.4 × 10^(-45). (I am not fully sure about this derivation, although 2^(-149) is the value usually quoted for the smallest positive float.)
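These limits can be cross-checked against `<float.h>` by building floats directly from bit patterns. The sketch assumes a 32-bit IEEE 754 `float` and a compiler that accepts C99 hexadecimal floating constants; `float_from_bits` and the constant names are hypothetical, introduced only for this example.

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Bit patterns for the extreme positive values discussed above. */
#define LARGEST_FINITE_BITS     0x7F7FFFFFu  /* exponent code 254, mantissa all 1s */
#define SMALLEST_NORMAL_BITS    0x00800000u  /* exponent code 1, mantissa all 0s */
#define SMALLEST_SUBNORMAL_BITS 0x00000001u  /* exponent code 0, mantissa = 1 */

/* Reinterpret a 32-bit pattern as a float (assumes float is 32 bits wide). */
float float_from_bits(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

On such a platform, `float_from_bits(LARGEST_FINITE_BITS)` equals `FLT_MAX` (about 3.4 × 10^38), `float_from_bits(SMALLEST_NORMAL_BITS)` equals `FLT_MIN` (2^(-126)), and `float_from_bits(SMALLEST_SUBNORMAL_BITS)` equals the constant `0x1p-149f`, that is 2^(-149).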

**Floating-point overflow and underflow:** One of the programming exercises at the end of Chapter 3 of C Primer Plus asks you to observe, by experiment, how the system handles integer overflow, floating-point overflow, and floating-point underflow. The reference answer is:

    #include <stdio.h>

    int main(void)
    {
        unsigned int a = 4294967295U;  /* largest 32-bit unsigned value */
        float b = 3.4E38f;             /* close to the largest float */
        float c = b * 10;              /* overflows the float range */
        float d = 0.1234E-2f;

        printf("%u + 1 = %u\n", a, a + 1);
        printf("%e*10 = %e\n", b, c);
        printf("%f/10 = %f\n", d, d / 10);
        return 0;
    }

    /*
    The output in VC++ 6.0 is:
    4294967295 + 1 = 0
    3.400000e+038*10 = 1.#INF00e+000
    0.001234/10 = 0.000123    (a significant digit is lost)
    Press any key to continue
    */

At first I did not understand this. 0.001234 / 10 = 0.0001234, which is 1.234E-4, and a float has enough precision to hold that value. So how is a significant digit lost? Looking more closely, I found that the printf call uses %f, which prints in ordinary decimal form with six digits after the decimal point rather than in scientific notation, so the last significant digit is cut off in the output. If %f is changed to %e or %E, no significant digit is lost when 0.001234 is divided by 10.
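The difference between the two conversions can be seen by formatting the same value both ways. This is a small sketch; `format_both` is a hypothetical helper name.

```c
#include <stdio.h>

/* Format value with %f into fixed and with %e into sci.
 * Each buffer must be able to hold size characters. */
void format_both(float value, char *fixed, char *sci, size_t size)
{
    snprintf(fixed, size, "%f", value);  /* six digits after the decimal point */
    snprintf(sci, size, "%e", value);    /* six digits after the point of the significand */
}
```

Called with `0.001234f / 10`, the `%f` result is `0.000123`, while the `%e` result begins with `1.234000e` and so keeps the final 4. The value stored in the float is the same in both cases; only the printed form differs.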

For example:

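    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        int a = 0x00000009;
        float b;

        /* Reinterpret the bits of a as a float (assumes both are 32 bits wide) */
        memcpy(&b, &a, sizeof b);
        printf("%e\n", b);
        return 0;
    }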

If %f is used for output, the result is 0.000000. Splitting 0x00000009 into fields gives the sign bit S = 0, the next 8 bits as the exponent field E = 0000 0000, and the last 23 bits as the mantissa M = 000 0000 0000 0000 0000 1001. Because the exponent field is all 0s, the exponent is 1 - 127 = -126, and the fractional part gets a leading 0 instead of a leading 1. So the value V is: V = (-1)^0 × 0.00000000000000000001001 × 2^(-126) = 1.001 × 2^(-146) (in binary), roughly 1.26 × 10^(-44) in decimal. This is far too small to show up in the six decimal places that %f prints, so the output is 0.000000.
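The hand calculation for an all-0s exponent field can be reproduced with a few lines of C. This is a sketch under the assumption of the IEEE 754 single-precision layout; `subnormal_value` is a hypothetical name, and the result is computed as a `double`, which can hold these values exactly.

```c
#include <stdint.h>

/* Value of a float bit pattern whose exponent field is all 0s:
 * 0.mantissa x 2^(-126), which equals mantissa x 2^(-23) x 2^(-126).
 * The caller is expected to pass a pattern with exponent field 0. */
double subnormal_value(uint32_t bits)
{
    uint32_t mantissa = bits & 0x7FFFFFu;          /* low 23 bits */
    double sign = (bits >> 31) ? -1.0 : 1.0;       /* sign bit S */

    return sign * (double)mantissa * 0x1p-149;     /* 2^(-149) = 2^(-23) * 2^(-126) */
}
```

For the pattern 0x00000009 this gives 9 × 2^(-149), which is the same number as 1.001 (binary) × 2^(-146).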

**Errors in floating-point storage:** Take 2.2 as an example. Convert its fractional part to binary by repeatedly multiplying by 2:

0.2 × 2 = 0.4, integer part 0, so the first binary digit after the point is 0;

0.4 × 2 = 0.8, integer part 0, so the second digit is 0;

0.8 × 2 = 1.6, so the third digit is 1;

0.6 × 2 = 1.2, so the fourth digit is 1;

0.2 × 2 = 0.4, so the fifth digit is 0;

Continuing this way, the product never reaches exactly 1.0. The binary digits repeat forever as 0011 0011 0011 0011...

For single precision, the mantissa holds only 24 bits of precision (23 stored bits plus the implicit leading 1), so 2.2 stored as a float is: 0 10000000 00011001100110011001101 (sign bit, exponent code, mantissa).

However, converting this stored value back to decimal does not give exactly 2.2. The conversion from decimal to binary is itself inexact here, and that is the source of the error. So in floating-point representation some numbers are stored with an error. Other values (such as 2.25) convert to binary exactly, so they are stored without any error.
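A short C sketch can show which values survive the conversion to float unchanged. It assumes a 32-bit IEEE 754 `float` with the usual round-to-nearest conversion; `stored_bits` and `float_is_exact` are hypothetical names used only for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Return the 32-bit pattern actually stored for f. */
uint32_t stored_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Returns 1 if value is unchanged by a round trip through float, else 0. */
int float_is_exact(double value)
{
    return (double)(float)value == value;
}
```

Under those assumptions `stored_bits(2.2f)` is 0x400CCCCD, where the repeating 0011 pattern is cut off and rounded, and `float_is_exact(2.2)` is 0. By contrast `stored_bits(2.25f)` is 0x40100000 and `float_is_exact(2.25)` is 1.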

**How to compare two floating-point numbers:** Many C programmers have questions about comparing floating-point numbers. A floating-point value is only an approximation, so in effect one stored value stands for a small range of real values. That makes it unreliable to compare two floating-point values by testing whether their difference is exactly 0. Instead, test whether the difference falls within a small range, as follows:

    const float EPSILON = 0.000001f;

    if ((x - y) >= -EPSILON && (x - y) <= EPSILON)

This is the basic comparison method: it checks whether a value lies within a given range, here -0.000001 <= x - y <= 0.000001. Likewise, to test whether a value is 0, check that it is greater than -0.000001 and less than 0.000001. A value that is very close to 0 but not exactly 0 is then treated as 0, so the program can, for example, avoid dividing by it. Within ±0.000001 the program treats x as 0.0, even though such an x is not actually 0.0; for a float, the stricter condition x + 1.0 == 1.0 only holds for much smaller values (below about 6 × 10^(-8)).
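The range test above can be wrapped in a small function so that the tolerance is passed explicitly. This is a minimal sketch; `nearly_equal` is a hypothetical name, and the tolerance is an absolute one, so it suits values of roughly unit magnitude rather than very large or very small ones.

```c
/* Returns 1 if x and y differ by no more than epsilon, otherwise 0.
 * epsilon is an absolute tolerance chosen by the caller. */
int nearly_equal(float x, float y, float epsilon)
{
    float diff = x - y;

    return diff >= -epsilon && diff <= epsilon;
}
```

For example, `nearly_equal(0.1f + 0.2f, 0.3f, 0.000001f)` is 1 even if the sum is not bit-for-bit equal to 0.3f, while `nearly_equal(1.0f, 1.1f, 0.000001f)` is 0.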

I ran into an exercise on floating-point overflow and underflow while reading C Primer Plus, and I could not understand the answer, so I looked into how floating-point numbers are stored. It turned out that floating-point storage involves a great deal. I am no genius, and what should have been a small problem ended up taking me about 10 days on and off. After reading one blog post after another, I finally understand a little about floating-point storage. These are my own notes. Part of the content is copied from other people's blogs; no infringement is intended, and it is purely for note-taking.

If there are any errors in these notes, I hope more experienced readers will point them out so that I can correct them promptly.