This article builds on quantization, data type, overflow and underflow. It treats floating-point numbers as a quantization of the real line and analyzes the difference between floating-point numbers, in particular between denormal floating-point numbers and normal floating-point numbers.

0. Background, motivation and purpose

To better follow this article, you may want to read quantization, data type, overflow and underflow first. As there, floating-point numbers are viewed as a form of quantization: a mapping from the uncountable set of real numbers to a finite set. This article discusses single-precision floating-point numbers; double precision is analogous.

Several bloggers have written about the difference between denormalized numbers and normalized numbers; the first one you should read is Lu Junyi's article on floating-point basics. It starts from the difference in computational efficiency between denormal and normal numbers, outlines the relevant floating-point background, and then gives the definition of a denormalized number, with accompanying code. Most floating-point background can be found on Wikipedia, including

- Denormal number (denormalized floating-point number, also called subnormal number)
- Floating point (floating-point fundamentals)
- IEEE floating point (the IEEE 754 definition of floats)

More related material can be found through the links in those entries.

I had two purposes in writing quantization, data type, overflow and underflow. The first was to explain, from the viewpoint of quantization in digital signal processing, how a computer represents data internally (that is, how to represent arbitrary numbers with a finite set, and what problems this causes), and to explain where errors arise during data-type conversion and computation. The second was to remind myself that every data type, whether int or double, has limited capacity, so overflow and underflow (especially underflow) must be watched for during computation in order to avoid errors.

While writing, I treated floating-point numbers as non-uniform quantization, but found the exposition did not flow well; the floating-point material deserves its own treatment. So this article serves as a supplement to quantization, data type, overflow and underflow, clarifying:

- How floating-point numbers quantize the real line non-uniformly
- Why non-uniform quantization is desirable

1. Non-uniform quantization of floating-point numbers

First, an example. The interval (0, 4) is divided into several segments, and every number within a segment is assigned the same value; that is quantization. Since the segments have different lengths, this quantization is non-uniform.
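As a toy illustration of such a non-uniform scheme (the segment boundaries and reconstruction values below are my own, chosen to mimic the figure, with each segment twice as long as the previous one):

```python
import bisect

# Hypothetical non-uniform segmentation of (0, 4): each segment is twice
# as long as the previous one, and maps to the segment's midpoint.
edges  = [0.0, 0.5, 1.0, 2.0, 4.0]   # segment boundaries
levels = [0.25, 0.75, 1.5, 3.0]      # one reconstruction value per segment

def quantize(x):
    """Map x in (0, 4) to the representative value of its segment."""
    i = bisect.bisect_right(edges, x) - 1
    return levels[min(i, len(levels) - 1)]

print(quantize(0.3))   # 0.25 (fine step near 0)
print(quantize(3.3))   # 3.0  (coarse step near 4)
```

Note that the maximum quantization error grows with the magnitude of the input, which is exactly the point of the signal-to-noise argument below.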

(Figure by blacklemon67, created with TikZ, CC BY-SA 3.0, https://en.wikipedia.org/w/index.php?curid=46487370)

Why quantize non-uniformly? From the digital signal processing point of view, it keeps the signal-to-noise ratio roughly constant. A simple example:

When weighing an object of about 1 ton, a reading of 1 ton plus 1 gram versus 2 grams makes essentially no difference; but for an object weighing about 1 gram, the difference between 1 gram and 2 grams is enormous.

The formal definition of floating-point numbers was given in quantization, data type, overflow and underflow.

Floating point (32-bit single precision)

Referring to Wikipedia, a 32-bit floating-point number is stored as three fields: 1 sign bit s, an 8-bit exponent field e, and a 23-bit fraction field f.

The corresponding value of a normal floating-point number is, in decimal,

value = (-1)^s × 2^(e - 127) × (1 + f / 2^23)

For normal floating-point numbers, the exponent field ranges over 01 to FE (1 to 254). Floats in [1, 2) are spaced 2^-23 apart, while floats in [2, 4) are spaced 2^-22 apart; in general the spacing doubles with each power of two, meaning the quantization interval differs from segment to segment. When the exponent field is FF, the value is positive or negative infinity or not-a-number (for example, 0/0). What interests us here is the behavior of floating-point numbers when the exponent field is 0.
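The three fields can be inspected directly. Here is a minimal Python sketch (the helper name `fields` is my own) that unpacks a value as a single-precision float and extracts sign, exponent, and fraction:

```python
import struct

def fields(x):
    """Decode the sign, exponent, and fraction fields of a float32."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    sign = bits >> 31
    exp  = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    return sign, exp, frac

print(fields(1.0))            # (0, 127, 0): exponent field 127 encodes 2^0
print(fields(float('inf')))   # (0, 255, 0): exponent field FF, zero fraction
print(fields(1e-45))          # (0, 0, 1):   exponent field 0, a denormal
```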

2. Why introduce denormal floating-point numbers

First look at the structure of normal floating-point numbers: the exponent field selects the interval, each interval is twice as long as the previous one, and the fraction field divides each interval evenly into 2^23 segments, as shown in the figure.
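The doubling of the interval length can be observed with `math.ulp`, which returns the gap to the next representable number (shown here for Python's 64-bit double; the single-precision pattern is analogous, with 2^-23 in place of 2^-52):

```python
import math

# The spacing between adjacent floats doubles with each power of two.
print(math.ulp(1.0))   # 2**-52: spacing inside [1, 2)
print(math.ulp(2.0))   # 2**-51: spacing inside [2, 4)
print(math.ulp(4.0))   # 2**-50: spacing inside [4, 8)
```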

Clearly, if only normal floating-point numbers were used, the gap between 0 and the smallest normal number (2^-126) would be far larger than the gap between the smallest normal number and the next larger one, which is not what we expect. Since the exponent field of normal floats starts at 1, the remaining interval near 0 (the yellow bracket in the figure) is instead divided evenly into 2^23 segments; when the exponent field is 0, the value is

value = (-1)^s × 2^-126 × (f / 2^23)

This preserves the property that the quantization interval is smaller (or equal) for data of smaller absolute value, which improves computational accuracy to some extent. For example, without denormal floating-point numbers, any value smaller than 2^-126 underflows to 0; with denormal numbers, only values smaller than 2^-149 underflow to 0.
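These two thresholds can be checked in Python by rounding a value through single precision (`struct` is used here to emulate float32; the helper name `to_f32` is my own):

```python
import struct

def to_f32(x):
    """Round a Python double to single precision and back."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

min_normal   = 2.0 ** -126   # smallest positive normal float32
min_denormal = 2.0 ** -149   # smallest positive denormal float32

# With denormals, values below 2^-126 are still representable...
print(to_f32(min_normal / 2) > 0.0)      # True: 2^-127 is a denormal
# ...but values sufficiently far below 2^-149 round (underflow) to 0.
print(to_f32(min_denormal / 4) == 0.0)   # True
```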

3. Problems with denormal floating-point numbers

The expressive power of denormal floating-point numbers is still limited, and because their definition differs from that of normal numbers, they cause performance problems. Specifically:

- Denormal arithmetic is (generally) slower than normal arithmetic
- Denormal numbers do not eliminate underflow during computation

The performance question has been discussed in detail in Lu Junyi's floating-point basics article, so I will not repeat it here; I only sketch why. As an example, floating-point addition/subtraction proceeds in these steps:

- Exponent alignment. Shift the mantissa of the operand with the smaller exponent so that both exponents match
- Mantissa addition. After alignment, add/subtract the mantissas
- Normalization. Renormalize and truncate the mantissa to preserve precision
- Rounding. Round according to the discarded bits
- Result check. Determine whether the result overflows
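The steps above can be sketched as a toy model. This is my own simplified sign-free version that uses truncation instead of true rounding; it illustrates the flow, not actual hardware behavior:

```python
def fp_add(a, b, mant_bits=23):
    """Toy unsigned floating-point addition following the five steps.
    a and b are (exponent, mantissa) pairs with an explicit integer mantissa
    in [2^mant_bits, 2^(mant_bits+1)); value = mantissa * 2^(exp - mant_bits)."""
    (ea, ma), (eb, mb) = a, b
    # 1. Exponent alignment: shift the smaller-exponent mantissa right.
    if ea < eb:
        (ea, ma), (eb, mb) = (eb, mb), (ea, ma)
    mb >>= (ea - eb)
    # 2. Mantissa addition.
    m = ma + mb
    e = ea
    # 3. Normalization: keep the mantissa in [2^mant_bits, 2^(mant_bits+1)).
    while m >= 1 << (mant_bits + 1):
        m >>= 1          # 4. "rounding" here is plain truncation, for brevity
        e += 1
    # 5. A real implementation would check e for overflow here.
    return e, m

# 1.5 * 2^0 + 1.0 * 2^0 = 1.25 * 2^1
print(fp_add((0, 3 << 22), (0, 1 << 23)))   # (1, 10485760)
```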

Denormal operands complicate the exponent-alignment step. A uniform representation of normal and denormal numbers is

value = (-1)^s × 2^(max(e, 1) - 127) × (b + f / 2^23)

where the implicit leading bit b is 1 for normal numbers (e > 0) and 0 for denormal numbers (e = 0).

If a and/or b is a denormal number, the implicit leading bit differs, so the alignment step must distinguish three cases (both operands normal, one denormal, both denormal), which increases the complexity of the computation. I do not know the exact hardware procedure, so I only record the idea here.
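As a sanity check on the unified formula, here is a minimal Python sketch (the name `decode` is my own) that reconstructs a float32's value from its bit fields for both normal and denormal inputs:

```python
import struct

def decode(x):
    """Reconstruct a float32's value via the unified normal/denormal formula."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    sign = bits >> 31
    e    = (bits >> 23) & 0xFF
    f    = bits & 0x7FFFFF
    lead = 1 if e > 0 else 0            # implicit bit: 1 normal, 0 denormal
    exp  = (e if e > 0 else 1) - 127    # denormals reuse exponent -126
    return (-1.0) ** sign * 2.0 ** exp * (lead + f / 2.0 ** 23)

for x in (1.5, 2.0 ** -130):            # one normal, one denormal value
    assert decode(x) == x
print("unified formula reproduces both cases")
```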

As for the second problem: although denormal numbers greatly improve precision near 0, floating-point precision is still finite and cannot prevent underflow. Therefore, during computation, and especially in iterative algorithms with high precision requirements, underflow must be watched for. For a discussion of overflow, see quantization, data type, overflow and underflow.
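A quick illustration that underflow is unavoidable even with denormals, using Python's 64-bit double (the single-precision story is the same, only with smaller exponents):

```python
# Even with denormals, repeated scaling eventually underflows to exactly 0.
x = 1.0
steps = 0
while x > 0.0:
    x *= 0.5
    steps += 1
print(steps)   # 1075: the smallest positive double denormal is 2^-1074
```

An iterative algorithm that multiplies small probabilities or residuals together can hit this silently, which is why such code often works in log space instead.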

Denormal floating-point numbers vs. normal floating-point numbers