I once had to keep hardware costs under tight control on a project: the hardware budget was very low, yet the measurements had to be high precision, so along the way I implemented sine, cosine, arctangent, square root and other functions in C myself.
In what follows, both on the actual project and on my local machine, int is 4 bytes and the machine is little-endian; unless specifically noted, assume this throughout. The value a float represents does not depend on endianness, but its byte order in memory does: on a big-endian machine such as PowerPC the bytes of a float are stored in the opposite order from a little-endian machine such as x86/ARM. Since endianness only determines the order in which a floating-point number's bytes are stored, all the C programs in this series behave identically on big-endian PowerPC as well.
Although I would have liked to use double to store decimals in this project, I had to give that up because a double takes 8 bytes of storage. With storage that scarce I even considered implementing a 3-byte floating-point format, but a rough error estimate showed it would be unreliable (estimating the error of a computed formula requires deriving the error of each variable from the floating-point representation, then propagating it through the formula with numerical analysis; I may write that up in a later post). So I settled on float, the single-precision 4-byte floating-point type. The project revolved around balancing precision against running time, and I adjusted the data generation and even the underlying algorithm several times, but that is not the topic here. This series covers only the square-root implementation for single-precision 4-byte floating-point numbers.
Let's look at how floating-point numbers represent real numbers. IEEE 754 defines the structure of a floating-point number, but before studying that storage, let's first review scientific notation.
Our everyday number system is decimal. Every nonzero real number can be written as s * a * 10^n, where: s is 1 or -1, called the sign part; a satisfies 1 ≤ a < 10, called the significand; and n is an integer, called the exponent part.
Computers generally count in binary, for reasons I need not explain, and floating-point numbers are binary too: they use binary scientific notation. Decimal scientific notation carries over directly. Every nonzero real number can be written as s * a * 2^n, where: s is 1 or -1, the sign part; a satisfies 1 ≤ a < 2, the significand; and n is an integer, the exponent part. Of a float's 32 bits, the highest bit is the sign bit S, the next 8 bits are the exponent field N, and the last 23 bits are the significand field A.
S (1 bit) | N (8 bits) | A (23 bits)
The relationship between the stored fields S/N/A (uppercase, the raw binary fields) and the values s/n/a in the scientific notation above is as follows:

If S is 0, then s = 1; if S is 1, then s = -1.

n = N - 127, where N is the exponent field read as an unsigned binary integer.

a = 1 + A * 2^(-23), where A is the significand field read as an unsigned binary integer.
Like fixed-point numbers, floating-point numbers are discrete: a 4-byte float has 32 bits, so it can represent at most 2^32 distinct real numbers. It is only an approximation of the reals, but its large range satisfies many of our needs.
Write a C language program to verify this:
#include <stdio.h>
#include <string.h>
#include <inttypes.h>

int main(int argc, char **argv)
{
    union {
        float f;
        uint32_t u;
    } n;
    int i;

    scanf("%f", &n.f);
    for (i = 31; i >= 0; i--) {
        if (n.u & (1u << i))
            printf("1");
        else
            printf("0");
        if (i == 31 || i == 23)
            printf(" ");
    }
    printf("\n");
    printf("S: %u\n", (n.u & (1u << 31)) >> 31);
    printf("N: %u\n", (n.u & (0xffu << 23)) >> 23);
    printf("A: %u\n", n.u & 0x7fffff);
    return 0;
}
Let's feed it a few numbers to verify.
$ echo 1 | ./a.out
0 01111111 00000000000000000000000
S: 0
N: 127
A: 0
$ echo -1 | ./a.out
1 01111111 00000000000000000000000
S: 1
N: 127
A: 0
$ echo 2 | ./a.out
0 10000000 00000000000000000000000
S: 0
N: 128
A: 0
$ echo 3 | ./a.out
0 10000000 10000000000000000000000
S: 0
N: 128
A: 4194304
$ echo 3.5 | ./a.out
0 10000000 11000000000000000000000
S: 0
N: 128
A: 6291456
$ echo 3.75 | ./a.out
0 10000000 11100000000000000000000
S: 0
N: 128
A: 7340032
$ echo 0.75 | ./a.out
0 01111110 10000000000000000000000
S: 0
N: 126
A: 4194304
$ echo 0.875 | ./a.out
0 01111110 11000000000000000000000
S: 0
N: 126
A: 6291456
All of the numbers above match the formulas.
But looking back, scientific notation has a flaw: 0 cannot be expressed in scientific notation at all, yet 0 is needed everywhere, so floating-point numbers must support it. Therefore, of the 2^32 bit patterns of a single-precision float, not every one is a number in scientific (normalized) form.
IEEE 754 specifies that single-precision floating-point also supports denormalized (subnormal) numbers, which are not in scientific-notation form.
When the exponent field N is 0, i.e. all 8 exponent bits are 0, the sign bit still acts as the sign bit, and the value represented is s * A * 2^(-149), with no implicit leading 1 (equivalently s * (A * 2^(-23)) * 2^(-126)).

The exponent here is -149 because the smallest positive normalized number is 2^(-126), and shifting by the 23 significand bits gives 2^(-126-23) = 2^(-149).

When N is 0, the largest value represented is (2^23 - 1) * 2^(-149), which is very close to 2^(-126):
$ echo 'scale=60; 2^(-126); (2^23-1)*2^(-149);' | bc
.000000000000000000000000000000000000011754943508222875079687
.000000000000000000000000000000000000011754942106924410159919
Let's modify the C program to read a hex bit pattern instead, and verify:
#include <stdio.h>
#include <string.h>
#include <inttypes.h>

int main(int argc, char **argv)
{
    union {
        float f;
        uint32_t u;
    } n;
    int i;

    scanf("%" PRIx32, &n.u);
    for (i = 31; i >= 0; i--) {
        if (n.u & (1u << i))
            printf("1");
        else
            printf("0");
        if (i == 31 || i == 23)
            printf(" ");
    }
    printf("\n");
    printf("S: %u\n", (n.u & (1u << 31)) >> 31);
    printf("N: %u\n", (n.u & (0xffu << 23)) >> 23);
    printf("A: %u\n", n.u & 0x7fffff);
    printf("%.60f\n", n.f);
    return 0;
}
Find some numbers to verify.
$ echo 0x00000001 | ./a.out
0 00000000 00000000000000000000001
S: 0
N: 0
A: 1
0.000000000000000000000000000000000000000000001401298464324817
$ echo 0x007fffff | ./a.out
0 00000000 11111111111111111111111
S: 0
N: 0
A: 8388607
0.000000000000000000000000000000000000011754942106924410754870
$ echo 0x80000001 | ./a.out
1 00000000 00000000000000000000001
S: 1
N: 0
A: 1
-0.000000000000000000000000000000000000000000001401298464324817
$ echo 0x807fffff | ./a.out
1 00000000 11111111111111111111111
S: 1
N: 0
A: 8388607
-0.000000000000000000000000000000000000011754942106924410754870
$ echo 0x00000000 | ./a.out
0 00000000 00000000000000000000000
S: 0
N: 0
A: 0
0.000000000000000000000000000000000000000000000000000000000000
$ echo 0x80000000 | ./a.out
1 00000000 00000000000000000000000
S: 1
N: 0
A: 0
-0.000000000000000000000000000000000000000000000000000000000000
As you can see, there are two zeros, +0 and -0; floating-point numbers really are that quirky.
In addition, IEEE 754 specifies that when the exponent field N is 255, i.e. all 8 exponent bits are 1, the number is also special (not a normalized number), in three cases:

When S = 0, N = 255, A = 0, it is positive infinity;

When S = 1, N = 255, A = 0, it is negative infinity;

When N = 255, A ≠ 0, it is NaN (Not a Number).
Similarly, let's verify that:
$ echo 0x7f800000 | ./a.out
0 11111111 00000000000000000000000
S: 0
N: 255
A: 0
inf
$ echo 0xff800000 | ./a.out
1 11111111 00000000000000000000000
S: 1
N: 255
A: 0
-inf
$ echo 0x7f800001 | ./a.out
0 11111111 00000000000000000000001
S: 0
N: 255
A: 1
nan
$ echo 0xff800001 | ./a.out
1 11111111 00000000000000000000001
S: 1
N: 255
A: 1
nan
inf and -inf arise when an operation on two real numbers produces a result beyond the largest magnitude a float can represent, so it can only be expressed as infinity; dividing a nonzero float by 0 also produces them.
NaN, on the other hand, marks a result that falls outside the real numbers entirely: for example, certain operations involving inf, or taking the square root of a negative number. Floating-point numbers do not directly represent complex numbers; they are not meant to approximate an algebraic closure of the reals.
C Language Implementation of Square Root (1): Storage of Floating-Point Numbers