How floating-point numbers work: IEEE 754

Last Update:2017-02-27 Source: Internet

Author: User

The IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754) has been the most widely used floating-point standard since the 1980s and is implemented by many CPUs and floating-point units. IEEE 754 specifies four formats for representing floating-point values: single precision (32 bits), double precision (64 bits), single-extended precision (43 bits or more, rarely used), and double-extended precision (79 bits or more, usually implemented as 80 bits).

IEEE 754 divides the bit sequence that stores a floating-point number into three fields: the sign bit S, the exponent E, and the mantissa M. For a 32-bit floating-point number, the highest bit is the sign bit S, the next 8 bits are the exponent E, and the remaining 23 bits are the mantissa M; for a 64-bit floating-point number, the highest bit is the sign bit S, the next 11 bits are the exponent E, and the remaining 52 bits are the mantissa M. The mathematical value of the floating-point number is V = (-1)^S * M * 2^E, where E is the actual exponent rather than the stored field value and M is the significand that the mantissa field encodes.
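The field layout can be inspected directly. The following is a minimal sketch for the 32-bit format, assuming Python's standard `struct` module is used to obtain the raw bits; the helper name `split_fields` is made up for this illustration.

```python
import struct

def split_fields(x: float) -> tuple:
    """Return (sign, stored exponent, mantissa field) of x as a 32-bit float."""
    # Pack x as a big-endian single-precision value, then reinterpret
    # the same 4 bytes as an unsigned 32-bit integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # highest 1 bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits (stored, still biased)
    mantissa = bits & 0x7FFFFF       # remaining 23 bits
    return sign, exponent, mantissa

print(split_fields(1.0))    # (0, 127, 0)
print(split_fields(-2.0))   # (1, 128, 0)
```

The exponent returned here is the stored field value, not yet the actual exponent.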

To make it easy to compare the magnitudes of two floating-point numbers, the exponent is stored as an unsigned integer. To still represent negative exponents, IEEE 754 specifies that the stored value of the exponent field is the actual exponent plus an exponent bias, and that the bias is 2^(e-1) - 1, where e is the number of bits in the exponent field. For example, if the exponent field is 8 bits long and its stored value is 129, the actual exponent is 129 - (2^7 - 1) = 2. In this case the stored value ranges over 0~255 and the actual exponent over -127~128. The stored values 0 and 255 are reserved for zero, denormalized numbers, infinity, and NaN, so normalized numbers use actual exponents from -126 to 127.
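As a small illustration of the bias formula, here is a sketch with two hypothetical helpers, `exponent_bias` and `actual_exponent`; the names are chosen for this example and come from no library.

```python
def exponent_bias(exponent_bits: int) -> int:
    """Bias for an exponent field that is exponent_bits wide: 2^(e-1) - 1."""
    return 2 ** (exponent_bits - 1) - 1

def actual_exponent(stored: int, exponent_bits: int = 8) -> int:
    """Subtract the bias from the stored field value."""
    return stored - exponent_bias(exponent_bits)

print(exponent_bias(8))      # 127  (single precision)
print(exponent_bias(11))     # 1023 (double precision)
print(actual_exponent(129))  # 2
```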

So how is a number stored as a floating-point value? Here is a step-by-step demonstration:

First, write the number in binary scientific notation. For example, 123.625 is converted as follows: 123.625 (10) ==> 1111011.101 (2) ==> 1.111011101*2^6. Then add the exponent 6 to the exponent bias 127 and store the result in the exponent field: 6+127 = 133 ==> 10000101. Next, drop the leading 1 from 1.111011101. Any nonzero binary number written in scientific notation has an integer part of 1 (the significand lies between 1 and 2), so that bit can be omitted from the mantissa field to leave room for one more bit of data: 1.111011101 ==> 111011101 ==> 11101110100000000000000 (padded with zeros to 23 bits). Finally, store 11101110100000000000000 in the mantissa field. The floating-point representation of 123.625 is therefore 01000010111101110100000000000000.
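The steps above can be sketched in code. This is an illustrative encoder only, under loud assumptions: it handles positive, finite values in the normalized single-precision range, and it truncates extra bits instead of rounding. The name `encode_float32` is hypothetical; the result is cross-checked against Python's `struct` module.

```python
import math
import struct

def encode_float32(value: float) -> str:
    """Encode a positive normalized value as a 32-character bit string."""
    if not (value > 0 and math.isfinite(value)):
        raise ValueError("this sketch only handles positive finite values")
    # Step 1: normalize to significand * 2^exponent with 1 <= significand < 2.
    exponent = 0
    significand = value
    while significand >= 2.0:
        significand /= 2.0
        exponent += 1
    while significand < 1.0:
        significand *= 2.0
        exponent -= 1
    # Step 2: add the bias 127 to get the stored exponent.
    stored_exponent = exponent + 127
    # Step 3: drop the leading 1 and keep 23 fraction bits (truncating).
    mantissa = int((significand - 1.0) * 2**23)
    return "0" + format(stored_exponent, "08b") + format(mantissa, "023b")

bits = encode_float32(123.625)
print(bits)  # 01000010111101110100000000000000

# Cross-check against the platform's own single-precision encoding.
(reference,) = struct.unpack(">I", struct.pack(">f", 123.625))
print(format(reference, "032b") == bits)  # True
```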

In the opposite direction, split the bits into fields and reverse the steps: 0 10000101 11101110100000000000000 ==> 0 133 11101110100000000000000 ==> 0 6 1.111011101 ==> 1.111011101*2^6 ==> 123.625
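The reverse process can be sketched the same way. The function below is a hypothetical helper that assumes a 32-character bit string holding a normalized single-precision value; it does not handle zero, denormalized numbers, infinity, or NaN.

```python
def decode_float32(bits: str) -> float:
    """Decode a 32-character bit string holding a normalized float32."""
    sign = int(bits[0], 2)               # 1 sign bit
    stored_exponent = int(bits[1:9], 2)  # 8 exponent bits
    mantissa = int(bits[9:], 2)          # 23 mantissa bits
    significand = 1 + mantissa / 2**23   # restore the implicit leading 1
    return (-1) ** sign * significand * 2.0 ** (stored_exponent - 127)

print(decode_float32("01000010111101110100000000000000"))  # 123.625
```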

If the exponent field of a floating-point number is 0 and its mantissa is nonzero, the number is called a denormalized (subnormal) floating-point number. The IEEE 754 standard specifies that the exponent bias of a denormalized number is 1 less than the exponent bias of a normalized number. For example, the smallest normalized single-precision number has an exponent field of 1 and an actual exponent of -126, while a denormalized single-precision number has an exponent field of 0 and an actual exponent of -126 rather than -127. When a denormalized number is converted to a value, no implicit 1 is prepended to the mantissa field, so the significand lies between 0 and 1, and every denormalized number is closer to 0 than any normalized number. The conversion process is as follows: 0 00000000 11101110100000000000000 ==> 0 -126 0.111011101 ==> 0.111011101*2^-126, which is approximately 1.095*10^-38.
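A minimal sketch of the denormalized case, for single precision only: the significand has no implicit leading 1 and the exponent is fixed at -126. The helper name `decode_subnormal32` is an assumption of this example, and the result is compared with what Python's `struct` module produces for the same bit pattern.

```python
import struct

def decode_subnormal32(mantissa_bits: str) -> float:
    """Value of a positive float32 whose exponent field is 0."""
    mantissa = int(mantissa_bits, 2)
    # No implicit leading 1: the significand is 0.mantissa, exponent is -126.
    return (mantissa / 2**23) * 2.0 ** -126

value = decode_subnormal32("11101110100000000000000")

# The same pattern (sign 0, exponent 0) reinterpreted by the platform.
raw = int("0" + "00000000" + "11101110100000000000000", 2)
(reference,) = struct.unpack(">f", struct.pack(">I", raw))

print(value == reference)   # True
print(value < 2.0 ** -126)  # True: below the smallest normalized float32
```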

When you convert a number to a floating-point value, the result cannot always be exact. For example, 0.1 has an infinitely repeating binary expansion, which needs far more than the 23 mantissa bits available in single precision. The extra bits must be rounded away, and the IEEE standard lists 4 rounding methods:
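The inexactness is easy to observe. The sketch below assumes that a round trip through `struct` with the `f` format is an acceptable way to obtain the single-precision value nearest to a Python float; `to_float32` is a name invented for this example.

```python
import struct
from fractions import Fraction

def to_float32(x: float) -> float:
    """Round x to single precision and return it as a Python float."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

stored = to_float32(0.1)
print(stored == 0.1)                     # False: 0.1 does not survive the round trip
print(Fraction(stored))                  # 13421773/134217728, the exact stored value
print(to_float32(123.625) == 123.625)    # True: 123.625 fits in 23 mantissa bits
```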

Round to nearest, ties to even (the default rounding method): The result is rounded to the nearest representable value. When two representable values are equally close, the one whose last binary digit is 0 (the even one) is chosen.

Round toward +∞: The result is rounded in the direction of positive infinity.

Round toward -∞: The result is rounded in the direction of negative infinity.

Round toward 0: The result is rounded in the direction of 0.
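The difference between the four methods shows up most clearly on ties. The sketch below is only an analogy: it rounds exact fractions to whole numbers instead of rounding a significand to its last stored bit, and the mode names and the function `round_with_mode` are hypothetical.

```python
import math
from fractions import Fraction

def round_with_mode(value: Fraction, mode: str) -> int:
    """Round an exact value to an integer under one of four modes."""
    if mode == "nearest_even":
        return round(value)        # Fraction rounds ties to the even integer
    if mode == "toward_positive":
        return math.ceil(value)
    if mode == "toward_negative":
        return math.floor(value)
    if mode == "toward_zero":
        return math.trunc(value)
    raise ValueError(f"unknown mode: {mode}")

modes = ("nearest_even", "toward_positive", "toward_negative", "toward_zero")
for v in (Fraction(5, 2), Fraction(-5, 2)):
    print(v, [round_with_mode(v, m) for m in modes])
# 5/2 [2, 3, 2, 2]
# -5/2 [-2, -2, -3, -2]
```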

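In addition, floating-point numbers can represent infinity and not-a-number (NaN). Taking single-precision floating-point numbers as an example: an exponent field of 255 with a mantissa of 0 represents infinity (positive or negative according to the sign bit), an exponent field of 255 with a nonzero mantissa represents NaN, and an exponent field of 0 with a mantissa of 0 represents zero.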
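For single precision, infinity is encoded with an exponent field of all ones (255) and a zero mantissa, and NaN with an exponent field of all ones and a nonzero mantissa. A minimal classifier sketch follows; the function names `classify_float32` and `float32_bits` are invented for this example.

```python
import struct

def float32_bits(x: float) -> int:
    """Raw 32-bit pattern of x stored as a single-precision float."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def classify_float32(bits: int) -> str:
    """Classify a 32-bit pattern by its exponent and mantissa fields."""
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0xFF:
        return "infinity" if mantissa == 0 else "NaN"
    if exponent == 0:
        return "zero" if mantissa == 0 else "denormalized"
    return "normalized"

print(classify_float32(float32_bits(float("inf"))))  # infinity
print(classify_float32(float32_bits(float("nan"))))  # NaN
print(classify_float32(0x00000001))                  # denormalized
print(classify_float32(float32_bits(123.625)))       # normalized
```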

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
