# Fixed-point and floating-point numbers

Source: Internet
Author: User
Tags numeric reserved

I am currently studying parallel optimization of convolutional neural networks (CNNs) on hardware platforms, so the relevant background knowledge needs to be filled in.

Transferred from: http://www.cnblogs.com/kevinq/p/4480563.html

Reference:

1.http://www.cnblogs.com/cloudseawang/archive/2007/02/06/641652.html

2.http://www.cnblogs.com/chenwu128/archive/2012/10/07/2714120.html

Brief introduction:

This article introduces the concepts of fixed-point and floating-point numbers, addition and subtraction of fixed-point and floating-point numbers (such as 34.6f-34.0f), and finally the special values of floating-point numbers.

I. Fixed-point number

In the so-called fixed-point format, the position of the radix point is fixed by convention for all data. Fixed-point data is usually represented as either a pure fraction or a pure integer. To represent a pure fraction, the radix point is fixed just before the most significant bit of the numeric part; to represent a pure integer, it is fixed just after the least significant bit of the numeric part, as shown in the following figure:

The radix point marked in the figure is not stored in the machine; its position is agreed on in advance. Once the position of the radix point has been determined for a computer, it does not change.

Suppose n bits are used to represent a fixed-point number x = x0x1x2...x(n-1), where x0 is the sign bit, usually placed in the leftmost position, with 0 and 1 standing for the plus and minus signs respectively, and the remaining bits represent the magnitude. If x is a pure integer, the radix point lies to the right of the lowest bit x(n-1) and the value range is 0 <= |x| <= 2^(n-1)-1; for example, 1111 represents -7. If x is a pure fraction, the radix point lies between x0 and x1 and the value range is 0 <= |x| <= 1-2^(-(n-1)); for example, 1111 represents -0.875.
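The two readings of the same bit pattern can be checked with a small sketch. The function names below are hypothetical, and the code assumes the sign-magnitude encoding described above with a width n between 2 and 31 bits:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Interpret the low n bits of `bits` as a sign-magnitude pure integer:
// the top bit is the sign, the remaining n-1 bits are the magnitude.
int sign_magnitude_integer(std::uint32_t bits, int n) {
    assert(n >= 2 && n <= 31);  // assumption of this sketch
    std::uint32_t magnitude = bits & ((1u << (n - 1)) - 1u);
    bool negative = ((bits >> (n - 1)) & 1u) != 0;
    return negative ? -static_cast<int>(magnitude) : static_cast<int>(magnitude);
}

// Interpret the same pattern as a pure fraction: the radix point sits
// right after the sign bit, so the magnitude is scaled by 2^-(n-1).
double sign_magnitude_fraction(std::uint32_t bits, int n) {
    return std::ldexp(static_cast<double>(sign_magnitude_integer(bits, n)), -(n - 1));
}
```

Calling both functions on the pattern 1111 with n = 4 gives -7 for the integer reading and -0.875 for the fraction reading.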

II. Fixed-point addition and subtraction

In two's-complement addition and subtraction, the sign bit takes part in the operation together with the numeric part regardless of whether the operands are positive or negative, and any carry out of the sign bit is discarded. For example:

short a = -9, b = -5;

cout << a + b << endl; // -14

The derivation process is as follows:

The sign-magnitude code of a is 1000 0000 0000 1001, so its two's complement is 1111 1111 1111 0111.

The sign-magnitude code of b is 1000 0000 0000 0101, so its two's complement is 1111 1111 1111 1011.

The two's complement of a+b is 1 1111 1111 1111 0010. Discarding the carry out of the sign bit gives the final result

1111 1111 1111 0010, whose sign-magnitude code is 1000 0000 0000 1110, i.e. -14.
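The derivation above can be replayed on the raw bit patterns. This is only an illustrative sketch with hypothetical helper names; it models the 16-bit adder with a wider unsigned register rather than relying on how `short` arithmetic is promoted in C++:

```cpp
#include <cstdint>

// Add two 16-bit two's-complement patterns the way a 16-bit adder does:
// sum in a wider register, then keep only the low 16 bits, which
// discards the carry out of the sign bit.
std::uint16_t add16_discard_carry(std::uint16_t a, std::uint16_t b) {
    std::uint32_t wide = static_cast<std::uint32_t>(a) + static_cast<std::uint32_t>(b);
    return static_cast<std::uint16_t>(wide & 0xFFFFu);
}

// Read a 16-bit pattern as a signed two's-complement value without
// depending on implementation-defined narrowing conversions.
int to_signed16(std::uint16_t bits) {
    return (bits & 0x8000u) ? static_cast<int>(bits) - 0x10000 : static_cast<int>(bits);
}
```

With a = 0xFFF7 (-9) and b = 0xFFFB (-5), the sum keeps the pattern 0xFFF2, which reads back as -14.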

III. Overflow detection for fixed-point addition and subtraction

1) Using a single sign bit to detect overflow

For addition, overflow is possible only when a positive number is added to a positive number or a negative number to a negative number; adding two numbers with different signs never overflows.

For subtraction, overflow is possible only for positive minus negative or negative minus positive; subtracting two numbers with the same sign never overflows.

Since subtraction is implemented in the machine with the adder, the rule is the same for addition and subtraction: overflow has occurred when the two's-complement sign bits of the operands actually fed to the adder (for subtraction, the minuend and the negated subtrahend) are the same and the sign bit of the result differs from them. For example:

In a 4-bit machine with a=5 and b=-4, a-b overflows. The derivation is as follows:

The sign-magnitude code of a is 0101 and its two's complement is 0101; the sign-magnitude code of -b is 0100 and its two's complement is 0100. The two's complement of a-b is therefore 1001: the sign bit of the result is 1 while the sign bits of the actual operands are 0, so overflow has occurred.
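The single-sign-bit rule can be sketched for a 4-bit machine as follows. The struct and function names are hypothetical. As an assumption of this sketch, subtraction is modelled as adding the bitwise inverse of the subtrahend with a carry-in of 1, which is one common way an adder negates its second operand:

```cpp
#include <cstdint>

struct Add4Result {
    std::uint8_t bits;  // low 4 bits of the sum
    bool overflow;      // true if the signed result does not fit in 4 bits
};

// Add two 4-bit two's-complement patterns. Overflow is flagged when both
// adder inputs have the same sign bit and the result's sign bit differs.
Add4Result add4(std::uint8_t x, std::uint8_t y, unsigned carry_in = 0) {
    std::uint8_t sum = static_cast<std::uint8_t>((x + y + carry_in) & 0xFu);
    bool sx = ((x >> 3) & 1u) != 0;
    bool sy = ((y >> 3) & 1u) != 0;
    bool ss = ((sum >> 3) & 1u) != 0;
    return {sum, sx == sy && ss != sx};
}

// x - y is computed by the adder as x + (~y) + 1.
Add4Result sub4(std::uint8_t x, std::uint8_t y) {
    std::uint8_t inverted = static_cast<std::uint8_t>(~y & 0xFu);
    return add4(static_cast<std::uint8_t>(x & 0xFu), inverted, 1);
}
```

For a = 0101 (5) and b = 1100 (-4), `sub4` produces the bits 1001 with the overflow flag set, matching the derivation, while `add4` on the same operands yields 0001 with no overflow.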

2) Using two sign bits to detect overflow

Here the rule is: if the two sign bits of the result differ, overflow has occurred; otherwise there is no overflow. The high sign bit always represents the true sign, whether or not overflow occurs. For example:

With x=-0.1011 and y=-0.0111, x+y overflows. The derivation is as follows:

The sign-magnitude code of x is 11.1011 and its two's complement is 11.0101; the sign-magnitude code of y is 11.0111 and its two's complement is 11.1001. The two's complement of x+y is 1 10.1110; the carry out of the sign bits is discarded, leaving 10.1110, whose two sign bits differ, so overflow has occurred.
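A minimal sketch of the double-sign-bit check, assuming the operands are already given as two's-complement patterns with two sign bits followed by `value_bits` numeric bits. The names are hypothetical, and the radix point is implicit, so 11.0101 is passed as the 6-bit pattern 110101:

```cpp
struct DoubleSignSum {
    unsigned bits;   // result pattern, two sign bits followed by value bits
    bool overflow;   // true when the two sign bits of the result differ
    bool negative;   // the high sign bit always gives the true sign
};

// Add two patterns that each carry two sign bits. The carry out of the
// high sign bit is dropped by masking to the pattern width.
DoubleSignSum add_double_sign(unsigned x, unsigned y, unsigned value_bits) {
    unsigned width = value_bits + 2u;
    unsigned mask = (1u << width) - 1u;
    unsigned sum = (x + y) & mask;
    unsigned high = (sum >> (width - 1u)) & 1u;
    unsigned low = (sum >> (width - 2u)) & 1u;
    return {sum, high != low, high == 1u};
}
```

For x = 110101 and y = 111001 with four value bits, the result pattern is 101110: the sign bits are 10, so the sum overflows and its true sign is negative.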

Note: by convention, the sign and value bits of an integer are separated by a comma, and the sign and value bits of a fraction are separated by a decimal point.

IV. Floating-point numbers

The disadvantage of fixed-point notation is that it is too rigid: the fixed position of the radix point determines the number of integer and fractional digits, which makes it hard to express very large and very small numbers at the same time. For this reason, most modern computer systems adopt floating-point representation, which expresses real numbers in scientific notation, that is, with a mantissa (sometimes called the significand; "mantissa" is actually the informal name for the significand), a radix (base), an exponent, and a sign indicating positive or negative. For example, 123.45 can be written in decimal scientific notation as 1.2345x10^2, where 1.2345 is the mantissa, 10 is the base, and 2 is the exponent. Floating-point numbers use the exponent to let the radix point float, which makes it possible to express a much wider range of real numbers flexibly.

1) IEEE floating-point number

In the IEEE standard, a floating-point number divides all the bits of a run of contiguous bytes of a given length into three fields of specific widths: sign, exponent, and mantissa. The values stored in these fields represent the sign, exponent, and mantissa of a given binary floating-point number, so that a value is expressed through a mantissa and an adjustable exponent.

IEEE 754 specifies:

Two basic floating-point formats: single precision and double precision. The single-precision format has 24 bits of significand (mantissa) precision and occupies 32 bits in total; the double-precision format has 53 bits of significand precision and occupies 64 bits in total.

Two extended floating-point formats: single extended and double extended. The standard does not specify the exact precision and size of these formats, only minimums; for example, the IEEE double-extended format must have at least 64 bits of significand precision and occupy at least 79 bits in total.

See the following illustration for a specific format:

2) Single-precision format

The IEEE single-precision format consists of three fields: a 23-bit fraction f, an 8-bit biased exponent e, and a 1-bit sign s, stored contiguously in one 32-bit word, as shown in the following illustration:

Bits 0:22 contain the 23-bit fraction f, where bit 0 is the least significant bit of the fraction and bit 22 is the most significant. The IEEE standard requires floating-point numbers to be normalized (normalization is described later), which means that the digit to the left of the binary point of the mantissa must be 1. When storing the mantissa we can therefore omit this leading 1, freeing up one bit for more mantissa digits, so a 23-bit fraction field actually expresses a 24-bit mantissa.

Bits 23:30 contain the 8-bit biased exponent e, where bit 23 is the least significant bit of the biased exponent and bit 30 is the most significant. An 8-bit exponent can represent 256 values from 0 to 255, but the exponent may be positive or negative, so to handle negative exponents a bias is added to the actual exponent before it is stored in the exponent field. The single-precision bias is 127 (2^7-1): for example, an actual exponent of 0 is stored as 127 (0+127), and an actual exponent of -63 is stored as 64 (-63+127). With the bias, the range of actual exponents that a single-precision number can express becomes -127 to +128 (both ends included), where -127 (stored as all 0s) and +128 (stored as all 1s) are reserved for special values, as described later. If Emin and Emax denote the smallest and largest ordinary exponents, namely -126 and +127, the reserved special exponent values can be written as Emin-1 and Emax+1.

Bit 31, the highest bit, contains the sign bit s: s = 0 for a positive number and s = 1 for a negative number.
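The three fields can be pulled out of a `float` with shifts and masks. The following is a minimal sketch, assuming that `float` is the IEEE 754 32-bit single-precision format on the target platform; the struct and function names are hypothetical:

```cpp
#include <cstdint>
#include <cstring>

struct FloatFields {
    std::uint32_t sign;             // bit 31
    std::uint32_t biased_exponent;  // bits 23:30, actual exponent + 127
    std::uint32_t fraction;         // bits 0:22, hidden leading 1 not stored
};

// Copy the bytes of the float into an integer and split the fields.
FloatFields split_float(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return {bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}
```

For 1.0f the actual exponent is 0, so the exponent field holds 127 and the fraction field is 0; for -0.5f the sign is 1 and the exponent field holds 126.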

It is worth noting that a single-precision number has only a 24-bit mantissa (the 1 to the left of the binary point is hidden), so the largest mantissa that can be expressed is 2^24-1 = 16,777,215. A single-precision floating-point value therefore carries no more than 8 truly significant decimal digits (about 7 in practice).
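The 24-bit limit is easy to observe: past 2^24, adding 1 to a `float` no longer changes it. A small sketch, assuming an IEEE 754 single-precision `float` and the default round-to-nearest mode (function names are hypothetical):

```cpp
#include <limits>

// Number of significand bits in float, counting the hidden leading 1.
// On an IEEE 754 platform this is 24.
constexpr int float_significand_bits() {
    return std::numeric_limits<float>::digits;
}

// Returns true if x + 1 is distinguishable from x after rounding to float.
bool plus_one_changes_value(float x) {
    volatile float next = x + 1.0f;  // volatile forces the sum to be rounded to float
    return next != x;
}
```

`plus_one_changes_value(16777215.0f)` is true, while `plus_one_changes_value(16777216.0f)` is false because 16,777,217 needs 25 significand bits.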

3) Double precision format

The IEEE double-precision format consists of three fields: a 52-bit fraction f, an 11-bit biased exponent e, and a 1-bit sign s, stored contiguously in two 32-bit words, as shown in the following illustration:

In the SPARC architecture, the 32-bit word with the higher address contains the 32 least significant bits of the fraction, whereas in the x86 architecture the 32-bit word with the lower address contains them.

Taking x86 as an example, f[31:0] denotes the 32 least significant bits of the fraction, where bit 0 is the least significant bit of the entire fraction. In the other 32-bit word, bits 0:19 hold the 20 most significant bits of the fraction, f[51:32], where bit 19 is the most significant bit of the entire fraction; bits 20:30 hold the 11-bit biased exponent e, where bit 20 is its least significant bit and bit 30 its most significant bit; and bit 31 is the sign bit s. The figure above numbers the two consecutive 32-bit words as one 64-bit word, in which:

Bits 0:51 contain the 52-bit fraction f, where bit 0 is the least significant bit of the fraction and bit 51 is the most significant. The IEEE standard requires floating-point numbers to be normalized, which means that the digit to the left of the binary point of the mantissa must be 1. When storing the mantissa we can therefore omit this leading 1, freeing up one bit for more mantissa digits, so a 52-bit fraction field actually expresses a 53-bit mantissa.

Bits 52:62 contain the 11-bit biased exponent e, where bit 52 is the least significant bit of the biased exponent and bit 62 is the most significant. An 11-bit exponent can represent 2048 values from 0 to 2047, but the exponent may be positive or negative, so to handle negative exponents a bias is added to the actual exponent before it is stored in the exponent field. The double-precision bias is 1023 (2^10-1). With the bias, the range of actual exponents that a double-precision number can express becomes -1023 to +1024 (both ends included). The smallest and largest ordinary exponents are denoted Emin and Emax (-1022 and +1023); the actual exponents -1023 (stored as all 0s) and +1024 (stored as all 1s) are reserved for special values, as described later.

Bit 63, the highest bit, contains the sign bit s: s = 0 for a positive number and s = 1 for a negative number.

4) Double-extended format (SPARC)

The quadruple-precision format of the SPARC floating-point environment conforms to the IEEE definition of the double-extended format.

The quadruple-precision format of the SPARC floating-point environment consists of three fields: a 112-bit fraction f, a 15-bit biased exponent e, and a 1-bit sign s, stored contiguously, as shown in the following illustration:

The 32-bit word with the highest address contains the 32 least significant bits of the fraction, denoted f[31:0]; the next two 32-bit words contain f[63:32] and f[95:64]. In the remaining 32-bit word, bits 0:15 contain the 16 most significant bits of the fraction, f[111:96], where bit 15 is the most significant bit of the entire fraction; bits 16:30 contain the 15-bit biased exponent e, where bit 16 is its least significant bit and bit 30 its most significant bit; and bit 31 contains the sign bit s.

The figure above numbers the four consecutive 32-bit words as one 128-bit word, in which bits 0:111 store the fraction f, bits 112:126 store the 15-bit biased exponent e, and bit 127 stores the sign bit s.

5) Double-extended format (x86)

The double-extended format of the x86 floating-point environment conforms to the IEEE definition of the double-extended format.

It consists of four fields: a 63-bit fraction f, a 1-bit explicit leading significand bit j, a 15-bit biased exponent e, and a 1-bit sign s. In the x86 architecture family, these fields are stored contiguously in ten successively addressed 8-bit bytes. However, the UNIX System V Application Binary Interface Intel 386 Processor Supplement (Intel ABI) requires a double-extended parameter to occupy three consecutively addressed 32-bit words on the stack, with the most significant 16 bits of the highest-addressed word unused, as shown in the following figure:

The lowest-addressed 32-bit word contains the 32 least significant bits of the fraction, f[31:0], where bit 0 is the least significant bit of the entire fraction. In the middle-addressed 32-bit word, bits 0:30 contain the 31 most significant bits of the fraction, f[62:32], where bit 30 is the most significant bit of the entire fraction, and bit 31 contains the explicit leading significand bit j.

In the highest-addressed 32-bit word, bits 0:14 contain the 15-bit biased exponent e, where bit 0 is the least significant bit of the biased exponent and bit 14 is the most significant; bit 15 contains the sign bit s.

VI. Normalization of floating-point numbers

The same value can be expressed as a floating-point number in many ways; for example, 123.45 from the example above can be written as 12.345x10^1, 0.12345x10^3, or 1.2345x10^2. Because of this diversity, the representation needs to be normalized so that there is a single agreed form. A normalized floating-point representation has the following form:

$$\pm d.dd\ldots d \times \beta^{e}, \quad (0 \le d_i < \beta)$$

where d.dd...d is the mantissa, β is the base, and e is the exponent. The number of digits in the mantissa is called the precision, denoted p. Each digit d is at least 0 and less than the base, and the digit to the left of the radix point is not 0.

The value of a floating-point number in normalized form can be computed with the following expression:

$$\pm \left(d_0 + d_1\beta^{-1} + \ldots + d_{p-1}\beta^{-(p-1)}\right)\beta^{e}, \quad (0 \le d_i < \beta)$$

For decimal floating-point numbers, where the base β equals 10, the expression above is easy to understand. Inside the computer, numbers are expressed in binary. From the expression above we can see that a binary number can also have a radix point and a representation analogous to decimal fractions, except that β equals 2 and each digit d can only be 0 or 1. For example, the binary number 1001.101 equals 1x2^3+0x2^2+0x2^1+1x2^0+1x2^-1+0x2^-2+1x2^-3, which is 9.625 in decimal, and its normalized floating-point form is 1.001101x2^3.
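The value expression for a normalized number can be evaluated directly from its digits. The function below is a hypothetical illustration that takes the digits d_0 ... d_(p-1), the base β, and the exponent e, and accumulates the sum in a `double`:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Evaluate (d_0 + d_1*beta^-1 + ... + d_(p-1)*beta^-(p-1)) * beta^e.
double normalized_value(const std::vector<int>& digits, int beta, int e,
                        bool negative = false) {
    double sum = 0.0;
    double weight = 1.0;  // beta^0 for d_0, then beta^-1, beta^-2, ...
    for (int d : digits) {
        assert(d >= 0 && d < beta);  // each digit must satisfy 0 <= d_i < beta
        sum += d * weight;
        weight /= beta;
    }
    double value = sum * std::pow(static_cast<double>(beta), e);
    return negative ? -value : value;
}
```

With digits 1,0,0,1,1,0,1, base 2, and exponent 3 (that is, 1.001101x2^3) the function returns 9.625. With digits 1,2,3,4,5, base 10, and exponent 2 it returns 123.45 up to the usual rounding error of `double`.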

6) Conversion between real and floating-point numbers

Suppose we have 32 bits of data holding a single-precision floating-point number whose hexadecimal representation is 0xc0b40000. To find the real number it represents, we first convert it to binary:

1100 0000 1011 0100 0000 0000 0000 0000

Then we split it into fields according to the floating-point format:

1 1000_0001 0110_1000_0000_0000_0000_000

The sign bit 1 indicates a negative number; the exponent field is 129, meaning the actual exponent is 2; and the fraction field is 01101 followed by zeros, meaning the actual binary mantissa is 1.01101. The real number represented is therefore:

-1.01101x2^2=-101.101=-5.625
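The same decoding can be done by hand in code and cross-checked against the hardware's own interpretation of the bits. This is a sketch with hypothetical function names; it handles only normalized numbers and assumes `float` is IEEE 754 single precision:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Decode a normalized single-precision bit pattern field by field.
double decode_single(std::uint32_t bits) {
    int sign = static_cast<int>(bits >> 31);
    int biased = static_cast<int>((bits >> 23) & 0xFFu);
    std::uint32_t fraction = bits & 0x7FFFFFu;
    assert(biased != 0 && biased != 255);  // special values are out of scope here
    // Restore the hidden leading 1: mantissa = 1 + fraction / 2^23.
    double mantissa = 1.0 + std::ldexp(static_cast<double>(fraction), -23);
    double magnitude = std::ldexp(mantissa, biased - 127);
    return sign ? -magnitude : magnitude;
}

// Let the machine reinterpret the same 32 bits as a float.
float bits_to_float(std::uint32_t bits) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 32-bit float");
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

Both `decode_single(0xC0B40000u)` and `bits_to_float(0xC0B40000u)` give -5.625.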

Converting a real number to a floating-point number is a little trickier. Suppose we need to express the real number -9.625 in single-precision format. The method is to write it first as a binary floating-point number and then convert that to the corresponding floating-point format.

First the integer part: the binary form of 9 is 1001. For the fractional part, the algorithm is to multiply the fraction by the base 2 repeatedly and record the integer part of each result:

0.625x2=1.25 1

0.25x2=0.5 0

0.5x2=1 1

The process ends when the remaining fractional part becomes 0, so the binary representation of the fractional part is 0.101. This gives the complete binary form 1001.101, which as a normalized floating-point number is:
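The multiply-by-2 procedure for the fractional part can be sketched as a short loop. The function name and the `max_bits` cut-off are assumptions of this sketch; the cut-off is needed because the process does not terminate for every input:

```cpp
#include <cassert>
#include <string>

// Produce the binary digits after the radix point of a fraction in [0, 1)
// by repeatedly multiplying by 2 and recording the integer part.
std::string fraction_to_binary(double fraction, int max_bits) {
    assert(fraction >= 0.0 && fraction < 1.0);
    std::string digits;
    while (fraction != 0.0 && static_cast<int>(digits.size()) < max_bits) {
        fraction *= 2.0;
        if (fraction >= 1.0) {
            digits += '1';
            fraction -= 1.0;  // keep only the fractional part for the next step
        } else {
            digits += '0';
        }
    }
    return digits;
}
```

`fraction_to_binary(0.625, 23)` stops after three steps and returns "101", whereas `fraction_to_binary(0.1, 8)` is cut off at eight bits and returns "00011001".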

1.001101x2^3

Because the number is negative, the sign bit is 1. The exponent is 3, so the exponent field is 3+127=130, that is, binary 1000 0010. The fraction field drops the 1 to the left of the binary point and is padded on the right with 0s. The final result is 1 1000_0010 0011_0100_0000_0000_0000_000. Regrouped into 4-bit groups for hexadecimal, this is 1100 0001 0001 1010 0000 0000 0000 0000, so the final result is 0xc11a0000.
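The packing step can be checked against what the compiler stores for the literal -9.625f. Both function names below are hypothetical, and the sketch assumes an IEEE 754 single-precision `float`:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assemble sign, actual exponent, and 23-bit fraction field into 32 bits.
std::uint32_t pack_single(std::uint32_t sign, int exponent, std::uint32_t fraction23) {
    assert(sign <= 1u && exponent >= -126 && exponent <= 127);
    std::uint32_t biased = static_cast<std::uint32_t>(exponent + 127);
    return (sign << 31) | (biased << 23) | (fraction23 & 0x7FFFFFu);
}

// Read back the raw bits the machine uses for a float.
std::uint32_t float_bits(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}
```

The fraction bits 001101 padded with seventeen 0s form the value 0x1A0000, so `pack_single(1, 3, 0x1A0000)` gives 0xC11A0000, the same pattern as `float_bits(-9.625f)`.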

One thing to note: the repeated multiplication by 2 above happened to terminate because the fractional part reached 0, which hides a problem. For many decimal fractions (such as 0.1) the process never terminates after a finite number of steps. Since the fraction field of a floating-point number has a limited number of bits, the process is carried on only until enough bits have been produced to fill the fraction field, and the extra bits are then rounded off. In other words, in addition to the precision limit discussed earlier, the decimal-to-binary conversion itself is not guaranteed to be exact, only approximate. In fact, only a small minority of decimal fractions have an exact binary floating-point representation. Combined with the accumulation of error during floating-point operations, the result is that many seemingly simple decimal calculations give unexpected results on a computer. This is the most common floating-point "inaccuracy" problem. For example, 34.6f-34.0f=0.599998: the error arises because 34.6f cannot be represented exactly as a floating-point number and is stored as a rounded approximation, so subtracting 34.0f from this approximation naturally cannot give the exact result (the process is described later).
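The 34.6f example can be reproduced with a few lines. The helper name is hypothetical, and the sketch assumes IEEE 754 single precision with round-to-nearest, under which 34.6f is stored as 9070182 x 2^-18:

```cpp
// Subtract in single precision; volatile forces the result to be
// rounded and stored as a float before it is returned.
float single_difference(float a, float b) {
    volatile float d = a - b;
    return d;
}
```

`single_difference(34.6f, 34.0f)` is exactly 157286 x 2^-18, which is slightly below 0.6 and is not equal to the literal 0.6f. Printed with the default six significant digits it shows as 0.599998.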

7) Rounding

According to the standard, a value that cannot be stored exactly must be rounded to the nearest representable value. This is a bit like the familiar decimal rounding, where less than half is rounded down and half or more is rounded up. For binary floating-point numbers, however, a discarded 0 is dropped but a discarded 1 does not necessarily round up: when the value lies exactly halfway between two representable neighbours, the one whose last significant bit is 0 is chosen, that is, rounding to even, so 0.5 rounds to 0 and 1.5 rounds to 2. (In other words, tentatively round up; if the last bit of the resulting mantissa is 0, the round-up stands; otherwise the extra bits are simply discarded.) See the following examples:
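Round-to-even can be observed both on small halfway values and on integers that need more than 24 significand bits. This is a minimal sketch with hypothetical function names, assuming an IEEE 754 platform; it sets the rounding mode to `FE_TONEAREST` explicitly, which is normally already the default:

```cpp
#include <cfenv>
#include <cmath>
#include <cstdint>

// Round to an integer using the round-to-nearest, ties-to-even mode.
double round_half_even(double x) {
    std::fesetround(FE_TONEAREST);
    return std::nearbyint(x);
}

// Convert an integer to float; values that need more than 24 significand
// bits are rounded to the nearest float, with ties going to even.
float int_to_float(std::int32_t n) {
    return static_cast<float>(n);
}
```

`round_half_even` maps 0.5 to 0, 1.5 to 2, and 2.5 to 2. In single precision, 16777217 lies halfway between 16777216 and 16777218 and becomes 16777216, while 16777219 lies halfway between 16777218 and 16777220 and becomes 16777220.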

Result derivation Analysis:
