Floating-point representation in computers

Never2die Yangtze University Jingzhou, Hubei Province

Chinese Abstract: This article introduces several methods of representing floating-point numbers in computers and analyzes them to provide a reference for future research.

Chinese keywords: floating point number; computer storage; floating point expression

**Floating-point number representation in computers**

**Abstract: This paper introduces several methods of representing floating-point numbers in computers and analyzes them, to provide a reference for future research.**

**Keywords: floating-point number; computer memory; floating-point representation**

In today's information age, computers have penetrated every aspect of our lives. I have studied computers for many years, but I have seldom come across articles or books on how floating-point numbers are represented in computers. After reading the available material and thinking it over, I have formed some ideas on the subject.

Binary does not match people's everyday habits, but computers use binary to represent information. The main reasons are as follows:

Simple circuits: a computer is built from logic circuits, which usually have only two states.

Reliable operation: two states represent the two digits, so digital transmission and processing are less prone to error and the circuit is more reliable.

Simple arithmetic: the rules of binary arithmetic are simple.

Strong logic: a computer operates on the basis of logical operations, and Boolean algebra is the theoretical basis of logical operations. Binary has exactly two digits, which correspond to the "true" and "false" values of Boolean algebra.

In computers, data is stored and represented differently from the way we usually write it: the two polarities of electricity or magnetism are used to distinguish data values. A computer therefore has only 0 and 1, while we use decimal in daily life. So how do we store decimal data in a computer?

We need to convert the decimal number to binary for storage and representation. An integer converts to binary easily and without any error. But what about a number with a fractional part? With direct conversion, only a small share of decimal fractions can be represented exactly in binary; the vast majority cannot. Here is a simple example:

Converting 3.14159 directly to binary gives 11.0010010000111111001..., an expansion that never terminates.

So 3.14159 cannot be represented exactly this way in a finite number of bits.
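The expansion of 3.14159 can be checked with a short sketch. The C++ below is only an illustration and not part of the original method: the function name `binaryFractionDigits` is hypothetical, and the fractional part is passed as an exact integer ratio (14159/100000) so that the check itself does not depend on floating-point rounding.

```cpp
#include <cstdint>
#include <string>

// Returns the first `count` binary digits after the radix point of the
// fraction numerator/denominator (assumes 0 <= numerator < denominator
// and that 2 * denominator fits in 64 bits).
// Uses repeated doubling with exact integer arithmetic.
std::string binaryFractionDigits(std::uint64_t numerator,
                                 std::uint64_t denominator,
                                 int count) {
    std::string digits;
    for (int i = 0; i < count; ++i) {
        numerator *= 2;                  // shift the fraction left by one bit
        if (numerator >= denominator) {  // the integer part became 1
            digits += '1';
            numerator -= denominator;
        } else {
            digits += '0';
        }
    }
    return digits;
}
```

Calling `binaryFractionDigits(14159, 100000, 19)` gives `0010010000111111001`. The remainder never reaches zero, because the denominator 100000 contains the factor 5, so the binary expansion does not terminate.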

We can use the following methods to represent floating-point numbers:

1. BCD code. Binary-coded decimal (BCD), also called binary-decimal code, is an encoding that represents decimal numbers with binary digits. It uses four bits to store each decimal digit, so conversion between the binary encoding and decimal is fast. This encoding is most often used in accounting systems, which frequently need exact calculations on long numbers. Compared with ordinary floating-point notation, BCD preserves the exact decimal value and avoids the time the computer would spend on floating-point arithmetic. BCD is also commonly used in other computations that require high accuracy.

A decimal digit takes one of ten values, 0 through 9, so at least four bits are needed to represent one decimal digit. Four bits give 2^4 = 16 code groups, and any 10 of the 16 can be assigned to the ten decimal digits, so there are N = 16!/(16-10)!, about 2.9 * 10^10, possible encodings. Several common BCD codes are listed below.
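The count of possible encodings is easy to verify. The following is a minimal sketch; the function name `arrangements` is an assumption made for illustration.

```cpp
#include <cstdint>

// Number of ways to assign k distinct codes out of n to k ordered items:
// n! / (n - k)!  (assumes k <= n and that the result fits in 64 bits).
std::uint64_t arrangements(unsigned n, unsigned k) {
    std::uint64_t result = 1;
    for (unsigned i = 0; i < k; ++i) {
        result *= (n - i);  // n * (n - 1) * ... * (n - k + 1)
    }
    return result;
}
```

`arrangements(16, 10)` returns 29059430400, which is about 2.9 * 10^10.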

1.1 8421 BCD code

In this encoding each code is read as a 4-bit binary number whose bit weights are 8, 4, 2 and 1, so the binary value of each code equals the decimal digit it represents. The encoding table is as follows:

Decimal | Binary
0 | 0000
1 | 0001
2 | 0010
3 | 0011
4 | 0100
5 | 0101
6 | 0110
7 | 0111
8 | 1000
9 | 1001
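As a rough sketch of how 8421 BCD is used in practice, the functions below pack each decimal digit of an integer into its own 4-bit group. The names `toPackedBcd` and `fromPackedBcd` are hypothetical, and storing the least significant digit in the lowest four bits is an assumption of this example.

```cpp
#include <cstdint>

// Packs the decimal digits of `value` into 4-bit groups (8421 BCD),
// least significant digit in the lowest four bits.
std::uint64_t toPackedBcd(std::uint32_t value) {
    std::uint64_t bcd = 0;
    int shift = 0;
    do {
        bcd |= static_cast<std::uint64_t>(value % 10) << shift;
        value /= 10;
        shift += 4;
    } while (value != 0);
    return bcd;
}

// Reverses toPackedBcd (assumes every 4-bit group holds a digit 0..9).
std::uint64_t fromPackedBcd(std::uint64_t bcd) {
    std::uint64_t value = 0;
    std::uint64_t place = 1;
    while (bcd != 0) {
        value += (bcd & 0xF) * place;
        place *= 10;
        bcd >>= 4;
    }
    return value;
}
```

For example, `toPackedBcd(1234)` returns `0x1234`: written in hexadecimal, the packed value shows the decimal digits directly, which is why conversion to and from decimal is fast.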

1.2 Excess-3 code

Adding 3 (binary 0011) to each 8421 BCD code gives the excess-3 code.

Decimal | Binary
0 | 0011
1 | 0100
2 | 0101
3 | 0110
4 | 0111
5 | 1000
6 | 1001
7 | 1010
8 | 1011
9 | 1100
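The rule "8421 code plus 3" can be sketched in a few lines. The function name `toExcess3` and the space-separated output format are assumptions chosen for readability.

```cpp
#include <stdexcept>
#include <string>

// Encodes a string of decimal digits as excess-3 code, one 4-bit group
// per digit, with groups separated by spaces.
std::string toExcess3(const std::string& decimalDigits) {
    std::string out;
    for (char ch : decimalDigits) {
        if (ch < '0' || ch > '9') {
            throw std::invalid_argument("not a decimal digit");
        }
        unsigned code = static_cast<unsigned>(ch - '0') + 3;  // 8421 value + 3
        if (!out.empty()) {
            out += ' ';
        }
        for (int bit = 3; bit >= 0; --bit) {
            out += ((code >> bit) & 1u) ? '1' : '0';
        }
    }
    return out;
}
```

`toExcess3("905")` returns `1100 0011 1000`, matching the rows for 9, 0 and 5 in the table.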

1.3 Cyclic code

Take each code of the 8421 BCD code and keep its highest bit unchanged. Each lower bit is the exclusive OR (XOR) of that bit and the next-higher bit of the 8421 code. The result is the cyclic code, in which adjacent codes differ in only one bit.

Decimal | Binary
0 | 0000
1 | 0001
2 | 0011
3 | 0010
4 | 0110
5 | 0111
6 | 0101
7 | 0100
8 | 1100
9 | 1101
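The XOR rule for the cyclic code can be written as a single expression. This is a minimal sketch; the function names `toCyclicCode` and `cyclicBits` are hypothetical.

```cpp
#include <bitset>
#include <string>

// Cyclic code of a 4-bit 8421 value: the highest bit is kept, and every
// lower bit is XORed with the bit above it. Shifting right by one lines
// each bit up with its higher neighbour.
unsigned toCyclicCode(unsigned digit) {
    return digit ^ (digit >> 1);
}

// The cyclic code of a decimal digit as a 4-character bit string.
std::string cyclicBits(unsigned digit) {
    return std::bitset<4>(toCyclicCode(digit)).to_string();
}
```

For instance, `cyclicBits(7)` gives `0100` and `cyclicBits(9)` gives `1101`, as in the table.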

2. Exponent and mantissa representation

This idea comes from exponential (scientific) notation in mathematics:

For example, a decimal number can be written as N = M * 10^C.

Similarly, a binary number can be written as N = M * 2^C.

2.1 Mantissa and exponent in floating-point notation

For a base-R number N = M * R^C, once the values of M and C are fixed, the value of N is uniquely determined. Therefore, for a binary floating-point number (R = 2), a computer only needs to store the values of M and C. M and C are called the mantissa (tail) and the exponent (order code) respectively. M is a pure fraction smaller than 1 and is represented in the same way as a fixed-point pure fraction; its length determines the precision of the number, and its sign determines the sign of the number. The exponent C corresponds to the exponent in mathematics; it is an integer and is represented in the same way as a fixed-point integer.

2.2 Representation of a floating-point number

Assume that a floating-point number is stored in four bytes. Generally, the exponent occupies one byte and the mantissa occupies three bytes. The highest bit of each part holds the sign of that part.

For example, the representation in a computer is as follows:

It is worth mentioning that the range of values a four-byte floating-point number can represent is far greater than that of a fixed-point number of the same length, which is an advantage of floating-point numbers. However, the arithmetic rules for fixed-point numbers are simpler and easier to implement than those for floating-point numbers. Therefore, both representations are available in computers, and the choice between them depends on the actual application.
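To make the four-byte layout concrete, here is a hedged sketch of decoding such a number. The text only says that the highest bit of each part is its sign; treating the remaining bits as a plain magnitude (sign-magnitude form) is an assumption of this example, as is the function name `decodeFourByteFloat`.

```cpp
#include <cmath>
#include <cstdint>

// Assumed layout (illustrative only):
//   exponentByte: bit 7 = sign of C, bits 6..0 = magnitude of C
//   mantissaBits: bit 23 = sign of M, bits 22..0 = fraction bits of M,
//                 so that |M| = fractionBits / 2^23 < 1
// Returns the value M * 2^C.
double decodeFourByteFloat(std::uint8_t exponentByte,
                           std::uint32_t mantissaBits) {
    int exponent = exponentByte & 0x7F;
    if (exponentByte & 0x80) {
        exponent = -exponent;
    }
    double mantissa =
        std::ldexp(static_cast<double>(mantissaBits & 0x7FFFFFu), -23);
    if (mantissaBits & 0x800000u) {
        mantissa = -mantissa;
    }
    return std::ldexp(mantissa, exponent);  // mantissa * 2^exponent
}
```

With this layout, `decodeFourByteFloat(0x02, 0x600000)` is 0.75 * 2^2 = 3.0. Under the same assumption the exponent can reach 127, so magnitudes approach 2^127, whereas a 32-bit signed fixed-point integer stops near 2^31.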

3. We can adapt the second method to get a better one. First, move the decimal point to the end of the number, writing it as N = M * R^C with R = 10, so that M is an integer. Then convert M to a binary integer B and store the number in exponent-mantissa form, with C as the exponent and B as the mantissa.

For example:

3.14159 = 314159 * 10^-5, and 314159 in binary is 1001100101100101111.

So 3.14159 can be represented by the mantissa B = 1001100101100101111 and the exponent C = -5.

This method can not only accurately represent the floating point value, but also make full use of the storage space.
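A minimal sketch of method 3 follows. The struct `DecimalScaled` and the functions `parseDecimal` and `toBinary` are hypothetical names introduced for illustration, and the parser assumes well-formed input that fits in 64 bits.

```cpp
#include <cstdint>
#include <string>

// A value stored as mantissa * 10^exponent10, with an integer mantissa.
struct DecimalScaled {
    std::int64_t mantissa;
    int exponent10;
};

// Parses a plain decimal string such as "3.14159" or "-0.5".
// Assumes well-formed input: optional '-', digits, optional '.', digits.
DecimalScaled parseDecimal(const std::string& text) {
    DecimalScaled result{0, 0};
    bool negative = false;
    bool afterPoint = false;
    for (char ch : text) {
        if (ch == '-') {
            negative = true;
        } else if (ch == '.') {
            afterPoint = true;
        } else if (ch >= '0' && ch <= '9') {
            result.mantissa = result.mantissa * 10 + (ch - '0');
            if (afterPoint) {
                --result.exponent10;  // each fractional digit shifts the point
            }
        }
    }
    if (negative) {
        result.mantissa = -result.mantissa;
    }
    return result;
}

// Binary digits of a non-negative integer, most significant bit first.
std::string toBinary(std::uint64_t value) {
    if (value == 0) {
        return "0";
    }
    std::string bits;
    while (value != 0) {
        bits.insert(bits.begin(), (value & 1u) ? '1' : '0');
        value >>= 1;
    }
    return bits;
}
```

`parseDecimal("3.14159")` yields the mantissa 314159 and the exponent -5, and `toBinary(314159)` yields `1001100101100101111`. Because the mantissa is an integer, its binary form is exact.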

Comparison of the methods:

1. Converting a floating-point number directly to binary cannot represent it exactly.

2. BCD code is easy to understand but does not make full use of the storage space.

3. Method 3 can represent floating-point numbers exactly and makes good use of space.

Conclusion: the working principles of computers receive less and less attention. People are busy with high-level development, and the underlying details seem to have nothing to do with them; whatever others have built is simply used as is. But if everything we use is ready-made, it is hard for us to do truly good work. My own knowledge of computers is still limited, and I welcome corrections wherever this article falls short.

As we all know, all data in a computer is represented in binary, and floating-point numbers are no exception. However, the binary representation of a floating-point number is not as simple as that of a fixed-point number.

First, a point of terminology: a floating-point number is not necessarily a fraction, and a fixed-point number is not necessarily an integer. "Floating point" means that the position of the radix point is logically not fixed, while a fixed-point number can only represent values whose radix point is in a fixed position. Both are simply ways of representing the value of a number.

C++ has three built-in floating-point types:

float: single precision, 32 bits
double: double precision, 64 bits
long double: extended precision, 80 bits on some platforms (which would make it the longest built-in type in C++)

C++ has no unsigned floating-point types; "unsigned float", "unsigned double" and "unsigned long double" are not valid types. Different compilers also support long double somewhat differently. As far as I know, some compilers implement it as the 80-bit extended format, many treat it as double, and perhaps a few treat it as 128 bits. I have only heard of the 128-bit long double and have not verified it, so I would be grateful to anyone who knows the details.

Here I use only float (single precision, 32 bits) to illustrate how floating-point numbers in C++ are represented in memory.

First, some basics: the binary representation of a pure fraction. (A pure fraction is a number with no integer part.) To be represented in binary, a pure fraction must first be normalized, that is, written in the form 1.xxxxx * (2 ^ n), where "^" denotes exponentiation and 2 ^ n is the n-th power of 2.
For a pure fraction D, n is found as follows:

n = floor(log2(D)); // for a pure fraction, n is always negative

Dividing D by 2 ^ n then gives the normalized value, which lies between 1 and 2.

The next step is the conversion from decimal to binary. To make this easier to follow, first look at how a pure fraction is written in base 10. Suppose the digits after the decimal point of a pure fraction D form, in order, the sequence {k1, k2, k3, ..., kn}. Then D can be written as:

D = k1/(10 ^ 1) + k2/(10 ^ 2) + k3/(10 ^ 3) + ... + kn/(10 ^ n)

Carried over to binary, the pure fraction is written as:

D = b1/(2 ^ 1) + b2/(2 ^ 2) + b3/(2 ^ 3) + ... + bn/(2 ^ n)

The question now is how to obtain b1, b2, b3, ..., bn. The algorithm is awkward to describe in words, so let us work through it with numbers. I call the number 1/(2 ^ n) the place value of bit n. Take 0.456 as an example:

Bit 1: 0.456 is less than the place value 0.5, so this bit is 0.
Bit 2: 0.456 is greater than the place value 0.25, so this bit is 1; subtract 0.25 from 0.456 and carry 0.206 to the next bit.
Bit 3: 0.206 is greater than the place value 0.125, so this bit is 1; subtract 0.125 from 0.206 and carry 0.081 to the next bit.
Bit 4: 0.081 is greater than the place value 0.0625, so this bit is 1; subtract 0.0625 from 0.081 and carry 0.0185 to the next bit.
Bit 5: 0.0185 is less than the place value 0.03125, so this bit is 0.
...

Finally, writing the computed 1s and 0s in bit order gives the binary form of the fraction, here 0.01110... This also reveals the precision problem: many numbers cannot be expressed exactly with any finite number of bits n. We can only use a larger n to represent the number more closely, which is why in many fields programmers prefer double to float.
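The normalization step and the place-value procedure for 0.456 can be sketched as follows. The function names are hypothetical. One caveat about the sketch itself: a literal such as 0.456 is already stored as the nearest double, so the digits produced are those of that double; for the leading bits they match the hand calculation.

```cpp
#include <cmath>
#include <string>

// Exponent n such that d = m * 2^n with 1 <= m < 2
// (assumes d is positive and finite).
int normalizedExponent(double d) {
    return std::ilogb(d);
}

// The normalized value m = d / 2^n, which lies in [1, 2).
double normalizedValue(double d) {
    return std::scalbn(d, -std::ilogb(d));
}

// First `count` binary digits of a pure fraction d (0 <= d < 1),
// found by comparing against the place values 1/2, 1/4, 1/8, ...
std::string fractionBits(double d, int count) {
    std::string bits;
    double placeValue = 0.5;
    for (int i = 0; i < count; ++i) {
        if (d >= placeValue) {
            bits += '1';
            d -= placeValue;  // carry the remainder to the next bit
        } else {
            bits += '0';
        }
        placeValue /= 2;
    }
    return bits;
}
```

`fractionBits(0.456, 8)` returns `01110100`, and `normalizedExponent(0.456)` returns -2, since 0.456 = 1.824 * 2^-2.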
The memory layout of a float can be described with a struct that uses bit fields:

    struct myfloat
    {
        bool bsign : 1;                 // sign: 1 bit, 1 = negative, 0 = positive
        char cexponent : 8;             // exponent: 8 bits
        unsigned long ulmantissa : 23;  // mantissa: 23 bits
    };

The struct is only a description of the three fields, listed from the highest bit down; how a compiler actually packs bit fields is implementation-defined.

The sign needs little explanation: 1 means negative, 0 means positive.

The exponent is a power of 2. The 8-bit field does not hold the exponent directly; it holds the original exponent plus 127. Read as a signed 8-bit value, the field ranges from -128 to 127, and a sum above 127 wraps around and continues from -128, just as addition and subtraction overflow on the x86 architecture. For example:

**127 + 2 = -127; -127 - 2 = 127**

The mantissa omits its leading bit, which is always 1 after normalization, so a 1 must be put back in front when the value is restored. The mantissa may hold both an integer part and a fractional part, or only one of them, depending on the size of the number. For numbers with an integer part there are two cases: an integer no greater than 16777215 (decimal) is stored in ordinary binary form, while an integer greater than 16777215 is stored in scientific notation. The scientific notation works the same way as in decimal. The fractional part always uses scientific notation, but in the form x * (2 ^ n) rather than x * (10 ^ n).

Laid out bit by bit:

0 00000000 00000000000000000000000
sign bit | exponent bits | mantissa bits
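The three fields can also be read out of a real float without relying on bit-field packing. The sketch below assumes that float is a 32-bit IEEE 754 value, which the static assertions check; `FloatFields`, `splitFloat` and `actualExponent` are hypothetical names.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit float");
static_assert(std::numeric_limits<float>::is_iec559,
              "this sketch assumes IEEE 754 floats");

struct FloatFields {
    unsigned sign;           // 1 bit: 1 = negative, 0 = positive
    unsigned exponent;       // 8 bits as stored (original exponent + 127)
    std::uint32_t mantissa;  // 23 bits, without the leading 1
};

// Copies the bytes of the float into an integer and masks out each field.
FloatFields splitFloat(float value) {
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);
    FloatFields fields;
    fields.sign = bits >> 31;
    fields.exponent = (bits >> 23) & 0xFFu;
    fields.mantissa = bits & 0x7FFFFFu;
    return fields;
}

// Removes the bias of 127 to recover the original exponent.
int actualExponent(const FloatFields& fields) {
    return static_cast<int>(fields.exponent) - 127;
}
```

For -6.5f, which is -1.101 (binary) * 2^2, `splitFloat` gives sign 1, stored exponent 129 and mantissa 0x500000; the leading 1 is not stored. Using `std::memcpy` rather than a bit-field overlay keeps the result independent of how a particular compiler orders bit fields.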