Analysis of IEEE floating point representation

Last Update:2018-12-03 Source: Internet

Author: User



As we all know, all data in a computer is represented in binary, and floating-point numbers are no exception. However, the binary representation of a floating-point number is not as simple as that of a fixed-point number.

First, let's clear up one concept: a floating-point number is not necessarily a fraction, and a fixed-point number is not necessarily an integer. "Floating point" means that the position of the radix point is logically not fixed, while a fixed-point number always has the same number of digits after the point. Both floating-point and fixed-point numbers simply represent the value of a number.

There are 3 floating-point types in C++:

float: single precision, 32 bits
double: double precision, 64 bits
long double: extended precision, 80 bits on x86 (10 bytes, probably the longest built-in type in C++!)

C++ has no unsigned floating-point types. Declarations such as unsigned float, unsigned double, and unsigned long double are not valid, because the sign bit is part of the floating-point format itself.

However, compilers differ in how they support long double. As far as I know, not every compiler implements it as the 80-bit extended format: GCC on x86 does, some compilers treat it as double (Microsoft Visual C++, for example), and perhaps a few platforms treat it as 128 bits. I have only heard of the 128-bit long double and have not verified it myself. If any expert knows the details, please let me know.

Here I use only float (signed, single precision, 32 bits) to illustrate how floating-point numbers in C++ are represented in memory. First, some background: the binary representation of pure fractions. (A pure fraction is a fraction with no integer part, that is, a number between 0 and 1. This note is for readers whose math is a little rusty.)

To write a pure fraction in binary floating-point form, it must first be normalized, that is, put in the form 1.xxxxx * (2 ^ n) ("^" denotes exponentiation, so 2 ^ n is 2 to the power n). For a pure fraction d, n is found as follows:
n = floor(log2(d)); // for a pure fraction (0 < d < 1), n is always negative
Then d / (2 ^ n) gives the normalized number, which lies between 1 and 2. For example, for d = 0.456, log2(d) is about -1.13, so n = -2 and d / (2 ^ n) = 1.824. The next step is the conversion from decimal to binary. To make this easier to follow, first look at how a pure fraction is written in base 10. Suppose the digits after the decimal point of a pure fraction d form, in order, the sequence:
{k1, k2, k3, ..., kn}
Then d can be expressed as follows:
d = k1/(10 ^ 1) + k2/(10 ^ 2) + k3/(10 ^ 3) + ... + kn/(10 ^ n)
Carrying this over to binary, a pure fraction is written as:
d = b1/(2 ^ 1) + b2/(2 ^ 2) + b3/(2 ^ 3) + ... + bn/(2 ^ n)
Now the question is how to obtain b1, b2, b3, ..., bn. Describing the algorithm in the abstract is awkward, so let's work through it with numbers. Note that the number 1/(2 ^ n) plays a special role; I call it the place weight of bit n.
Take 0.456 as an example. Bit 1: 0.456 is smaller than the weight 0.5, so this bit is 0. Bit 2: 0.456 is greater than the weight 0.25, so this bit is 1, and 0.456 minus 0.25 leaves 0.206 for the next bit. Bit 3: 0.206 is greater than the weight 0.125, so this bit is 1, and 0.206 minus 0.125 leaves 0.081 for the next bit. Bit 4: 0.081 is greater than the weight 0.0625, so this bit is 1, and 0.081 minus 0.0625 leaves 0.0185 for the next bit. Bit 5: 0.0185 is less than the weight 0.03125, so this bit is 0 ......
Finally, writing the computed 1s and 0s in bit order gives the pure fraction in binary, here 0.01110... This is also where the precision problem comes from: many numbers cannot be represented exactly with a finite number of bits n. We can only use a larger n to represent such a number more accurately, which is why in many fields programmers prefer double to float.

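The memory layout of float. I describe it with a struct containing bit fields, as follows:
struct myfloat
{
    // Fields are listed from the most significant bit to the least significant bit.
    // This is a description only: the real layout of bit fields is implementation-defined.
    unsigned int bsign : 1;       // sign, positive or negative, 1 bit
    unsigned int cexponent : 8;   // exponent, 8 bits
    unsigned int ulmantissa : 23; // mantissa, 23 bits
};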

The sign bit needs little explanation: 1 means negative and 0 means positive.
The exponent is a power of 2. What is stored in memory is the real exponent plus 127 (the bias), kept as an unsigned 8-bit value. For normalized numbers the real exponent ranges from -126 to 127 (stored values 1 to 254); the stored values 0 and 255 are reserved for zero and subnormal numbers and for infinity and NaN. If the stored value is handled in a signed 8-bit variable instead, anything above 127 wraps around to -128 and continues from there, just as 8-bit addition and subtraction overflow on the x86 architecture. For example: 127 + 2 = -127; -127 - 2 = 127.
The mantissa omits the leading 1 of the normalized number, so the 1 must be put back in front when the value is restored. Depending on the magnitude of the number, the mantissa may hold both integer and fractional bits, or only one of the two. For floating-point numbers with an integer part there are two cases. An integer no larger than 16777215 (2 ^ 24 - 1) fits in the 24 bits of the restored mantissa and is stored as ordinary binary. A larger integer is stored in scientific notation, in the same way as the fractional part.
The fractional part uses scientific notation directly, but in the form x * (2 ^ n) instead of x * (10 ^ n). Laid out bit by bit:

0 00000000 00000000000000000000000
sign bit | exponent bits | mantissa bits

The following program analyzes the in-memory data of a float. It has been tested; if you find any problem, please report it so the program can be improved.

#include <cmath>
#include <cstdint>
#include <cstring>
#include <iomanip>
#include <iostream>
using namespace std;

int main()
{
    // Definition and initialization of a signed 32-bit float variable. Its value can be changed at will.
    float fdigital = 0.0f;
    // Temporary variable that holds the raw memory bits of the float
    // (a fixed 32-bit type, because unsigned long is 64 bits on many platforms).
    uint32_t nmem;
    // Copy the float's memory into the temporary variable bit for bit.
    // memcpy avoids the undefined behavior of reading a float through an integer pointer.
    memcpy(&nmem, &fdigital, sizeof nmem);

    // Output the original floating-point number with 8 significant digits
    cout << setprecision(8);
    cout << "floating point: " << fdigital << endl;
    cout << "-----------------------" << endl;

    // Check whether all bits are 0. If so, the float represents 0 and there is nothing to analyze.
    // Note: -0.0f, subnormal numbers, infinities, and NaN are not handled specially
    // and would be decoded as if they were normalized numbers.
    if (nmem != 0)
    {
        // Print the sign.
        // The highest bit (bit 31) is the sign bit, held in a bool: true means negative, false means positive.
        // It is obtained by a bitwise AND with 0x80000000.
        bool bnegative = (nmem & 0x80000000u) != 0;
        // Output '-' for a negative number, otherwise '+'.
        cout << "sign: " << (bnegative ? '-' : '+') << endl;

        // Print the exponent.
        // Bits 30 to 23 are the exponent: 8 bits, stored with a bias of 127.
        // Shift the memory right by 23 bits and mask to 8 bits to get the stored exponent.
        // The IEEE representation stores the real exponent plus 127,
        // so subtract 127 to recover the real exponent. An int is used so that nothing overflows.
        int cexponent = static_cast<int>((nmem >> 23) & 0xFF) - 127;
        // Output the exponent as a signed decimal integer.
        cout << "exponent: " << cexponent << endl;

        // Print the mantissa.
        // Bits 22 to 0 are the mantissa. Because the leading 1 is omitted in memory,
        // the full value is 24-bit unsigned data, held in an unsigned long.
        // A bitwise AND with 0x7FFFFF (the low 23 bits all 1) gives the stored mantissa.
        unsigned long ulmantissa = nmem & 0x7FFFFFul;
        // A bitwise OR with 0x800000 (only bit 23 set) restores the omitted leading 1.
        ulmantissa |= 0x800000ul;
        // Output the mantissa as a hexadecimal integer.
        cout << "mantissa: 0x" << setbase(16) << setfill('0') << setw(8);
        cout << setiosflags(ios_base::uppercase) << ulmantissa << endl;

        // Converting a binary fraction to decimal with pure integer arithmetic is quite complicated,
        // so double is used here for a simple implementation that only illustrates the principle. See the description above for the algorithm.
        // Compute the integer part of the float, held as a double.
        double dinteger = 0;
        // If the exponent is greater than or equal to 0, the mantissa contains an integer part.
        if (cexponent >= 0)
        {
            // If the exponent is greater than 23, the integer is in scientific notation.
            if (cexponent > 23)
            {
                // dcurbit holds the place weight of the current bit in the loop
                double dcurbit = 1.0;
                // Compute the value 1.xxx of the 24-bit mantissa
                for (int nbitidx = 0; nbitidx < 24; nbitidx++)
                {
                    // Shift the mantissa right so that bit (23 - nbitidx) becomes the lowest bit
                    // and mask it out, multiply it by the current weight,
                    // and accumulate the result.
                    dinteger += dcurbit * ((ulmantissa >> (23 - nbitidx)) & 1ul);
                    // Halve the weight to get the weight of the next bit.
                    dcurbit /= 2;
                }
                // Scale by 2 to the power of the exponent to get the decimal value
                dinteger *= pow(2.0, (double)cexponent);
            }
            else
            {
                // Shift the fractional bits out of the mantissa to obtain the integer part.
                int nrightshift = 0;
                // If the exponent is smaller than 23, fractional bits exist and must be shifted out.
                if (cexponent < 23)
                {
                    // The stored mantissa length (23) minus the exponent is the length of the fractional part.
                    nrightshift = 23 - cexponent;
                }
                // Shift the mantissa right by the length of the fractional part to get the integer part.
                dinteger = (double)(ulmantissa >> nrightshift);
            }
        }
        // Output the integer part in decimal
        cout << "integer part: " << setbase(10) << setprecision(8);
        cout << dinteger << endl;

        // Compute the fractional part of the float, held as a double.
        double ddecimal = 0;
        // If the exponent is less than 23, the mantissa contains a fractional part.
        if (cexponent < 23)
        {
            // 23 minus the exponent is the length of the fractional part; see the integer part above.
            // Start from a mask covering the 23 stored mantissa bits and shift it right by the exponent
            // to obtain the mask that selects the fractional bits.
            unsigned long uldecimalmask = 0x7FFFFFul;
            // If the exponent is greater than or equal to 0, the mantissa contains an integer part.
            if (cexponent >= 0)
            {
                uldecimalmask >>= cexponent;
            }
            else
            {
                // For a pure fraction, the leading 1 removed by normalization must be restored
                ddecimal = 1.0;
            }
            // A bitwise AND of the mantissa and the mask gives the fractional bits,
            // held as a binary fraction in an unsigned long.
            unsigned long uldecimal = ulmantissa & uldecimalmask;
            // dcurbit holds the place weight of the current bit in the loop
            double dcurbit = 0.5;
            // Compute the value of the fractional bits
            for (int nbitidx = 1; nbitidx < 24; nbitidx++)
            {
                // Shift the fractional bits right so that bit (23 - nbitidx) becomes the lowest bit
                // and mask it out to get the current bit.
                // Multiply it by dcurbit and add the result to ddecimal.
                ddecimal += dcurbit * ((uldecimal >> (23 - nbitidx)) & 1ul);
                // Halve the weight to get the weight of the next bit.
                dcurbit /= 2;
            }
            // Scale by 2 to the power of the exponent to get the decimal value
            ddecimal *= pow(2.0, (double)cexponent);
        }
        // Output the fractional part in decimal
        cout << "fractional part: " << setbase(10) << setprecision(8);
        cout << ddecimal << endl;
    }
    else
    {
        cout << "The floating-point number is 0, so there is nothing to analyze." << endl;
    }

    cout << "-----------------------" << endl;
    cout << "Analysis complete, the program exits." << endl;

    return 0;
}

This article is an English version of an article originally written in Chinese on aliyun.com.
