Fast floating-point square root operations
Code download: implementations of the several square root algorithms discussed below
In a previous blog post we introduced type punning through pointer casts, which lets us reinterpret the bits of a float value directly as an int. What is this bit-level reinterpretation actually good for? Today we give a concrete example, fast floating-point square root computation, to show the meaning of the conversion and the correspondence between the two representations.
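As a quick refresher, here is the trick in miniature (a minimal sketch of mine, not code from the earlier post; note that this kind of pointer cast technically violates strict aliasing, and memcpy is the portable alternative):

    #include <stdio.h>

    int main(void)
    {
        float f = 2.0f;
        int i = *(int *)&f;   /* reinterpret the 32 bits of the float as an int */
        printf("%f is stored as 0x%08x\n", f, (unsigned)i);   /* prints 0x40000000 for 2.0f */
        return 0;
    }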
1. Bisection Method
The problem: given a floating-point number x, compute its square root. There are many solutions to this simple problem; let's start with the easiest one to think of. The idea of bisection is straightforward: assume the answer lies between a lower and an upper bound, and repeatedly take the midpoint as the candidate solution. Set the lower bound to 0 and the upper bound to x, compute the midpoint, compare its square against x, and tighten the lower or upper bound accordingly. Then recompute the midpoint and start a new loop, until two consecutive midpoints differ by less than the given precision. Note that if x is less than 1 we must set the upper bound to 1, since in that case the square root of x is larger than x itself. The code is as follows:
#include <math.h>   /* for fabsf */

#define eps 1e-5f   /* convergence tolerance; the original post does not state its value */

float SqrtByBisection(float n)
{
    float low, up, mid, last;
    low = 0;
    up = (n < 1 ? 1 : n);    /* for n < 1, sqrt(n) > n, so the upper bound must be 1 */
    mid = (low + up) / 2;
    do {
        if (mid * mid > n)
            up = mid;
        else
            low = mid;
        last = mid;           /* remember the previous midpoint for the convergence test */
        mid = (up + low) / 2;
    } while (fabsf(mid - last) > eps);
    return mid;
}
This method is very intuitive and is a frequent interview question. But pay special attention to one point: the convergence test must compare two consecutive midpoints, not the upper and lower bounds; otherwise the loop may never terminate! This is a precision issue: during the loop the midpoint can become bitwise equal to up or low, after which further iterations no longer change it, so if the required precision has not been reached by then, the loop spins forever. Comparing consecutive midpoints avoids the problem (work out why for yourself); you can verify it on a few examples. That is the one trick in the bisection method. Although bisection is simple, it has a serious drawback: slow convergence! Each iteration only halves the interval, gaining about one bit of precision, so it takes dozens of cycles to meet the precision requirement. To speed up convergence we need other methods.
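To see the slow convergence concretely, here is a small instrumented variant (my own test harness, not part of the original post) that counts the halvings; for n = 2 and eps = 1e-5 it needs around eighteen iterations:

    #include <math.h>
    #include <stdio.h>

    #define eps 1e-5f

    int main(void)
    {
        float n = 2.0f;
        float low = 0, up = (n < 1 ? 1 : n), mid = (low + up) / 2, last;
        int count = 0;
        do {
            if (mid * mid > n) up = mid; else low = mid;
            last = mid;
            mid = (up + low) / 2;
            count++;                  /* each pass halves the bracketing interval */
        } while (fabsf(mid - last) > eps);
        printf("sqrt(%g) ~= %g after %d iterations\n", n, mid, count);
        return 0;
    }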
2. Newton Iteration Method
The principle is also fairly simple: instead of the midpoint, take the root of the tangent line as the next estimate. The idea can be explained with the figure below (taken from matrix67):
Fig 1 Newton Iteration Method
Suppose we want the square root of a (a = 2 in the figure). This is equivalent to finding the positive root of f(x) = x^2 - a, i.e. the point where the curve crosses the X axis. We first pick an initial guess x_0. A better estimate x_1 is obtained by drawing the tangent to the curve at the point (x_0, f(x_0)) and taking its intersection with the X axis. Repeating the step at (x_1, f(x_1)) yields an even better estimate x_2, and so on......
Clearly this method homes in on the target value, and its convergence should be faster than bisection's. But how do we compute the successive estimates? We need a recurrence. Look at the shaded triangle in the figure: the length of the vertical side is f(x_n), and we want the length of the horizontal side x_n - x_{n+1}. The hypotenuse of the triangle is the tangent line, so its slope is the derivative f'(x_n) = 2x_n, and by the definition of the tangent we get x_n - x_{n+1} = f(x_n)/f'(x_n). Substituting f(x) = x^2 - a gives the recurrence:

    x_{n+1} = x_n - (x_n^2 - a)/(2x_n) = (x_n + a/x_n)/2
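As a worked example (numbers mine, computed straight from the recurrence): to find the square root of 2 starting from x_0 = 2, the iterates are x_1 = 1.5, x_2 ≈ 1.416667, x_3 ≈ 1.414216, already within 2e-6 of the true value 1.414214 after just three steps.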
Next, we iterate with this formula until the precision requirement is met. The code is as follows:
float SqrtByNewton(float x)
{
    float val = x;   /* initial guess: the input itself */
    float last;
    do {
        last = val;
        val = (val + x / val) / 2;
    } while (fabsf(val - last) > eps);
    return val;
}
Testing the code above confirms it is indeed faster than bisection: taking the square root of the first 3 million integers takes 1600 ms with bisection and 1000 ms with Newton iteration. The speedup comes from needing fewer iterations. Although Newton iteration is faster, there is still room for optimization. First, the Newton loop contains two divisions while bisection has only one, and division is usually several times slower than multiplication, which slows down each individual iteration; if division could be eliminated, the speed would improve a lot (a division-free algorithm is introduced later). Second, we chose the input value itself as the initial guess, which is a poor estimate and forces many iterations to reach the required precision. Bisection has the same problem, but its bounds are hard to estimate any better, so it is stuck with the most conservative choice. Newton iteration, however, can start from any initial value, so choosing a good one becomes an optimization opportunity.
Let's analyze why Newton iteration can start from any initial value (as long as it is greater than 0). From the recurrence

    x_{n+1} = (x_n + a/x_n)/2

and the AM-GM inequality, x_{n+1} = (x_n + a/x_n)/2 ≥ sqrt(x_n · a/x_n) = sqrt(a), we can draw several conclusions: whatever positive initial value we pick, after one step the estimate is at or above sqrt(a); from then on the sequence decreases monotonically (x_n ≥ sqrt(a) implies a/x_n ≤ x_n, hence x_{n+1} ≤ x_n); so the iteration always converges, but a bad initial value wastes iterations getting close.
Therefore Newton iteration has an initial value selection problem: a good choice greatly reduces the number of iterations, while a bad one can make it even slower than bisection. First, here is the code with a better initial estimate:
float SqrtByNewton(float x)
{
    /* initial guess: halve the exponent via integer bit manipulation */
    int temp = (((*(int *)&x) & 0xff7fffff) >> 1) + (64 << 23);
    float val = *(float *)&temp;
    float last;
    do {
        last = val;
        val = (val + x / val) / 2;
    } while (fabsf(val - last) > eps);
    return val;
}
Repeating the earlier test with this code, the running time drops from 1000 ms to 240 ms, nearly a four-fold improvement! Why do those two extra lines of code speed things up so much? The answer requires the IEEE floating-point representation introduced in the previous blog. Recall that the IEEE standard represents a positive number as

    x = (1 + m) · 2^e,  where 0 ≤ m < 1

and stores it in a float as a bit pattern: an 8-bit biased exponent E = e + 127 in bits 23..30, and a 23-bit mantissa M = m · 2^23 in bits 0..22.
Now consider what the square root does to each part. The exponent e must be halved while the bias 127 stays unchanged, and the mantissa must be square-rooted. Since the exponent dominates the magnitude of a floating-point number, adjusting the exponent is the easiest way to bring the initial guess close to the exact value. Conveniently, halving the exponent is just a right shift by one bit. However, E = e + 127 may be odd, in which case the shift would leak a bit from the exponent into the mantissa, so we first clear the lowest bit of the exponent field; that is the purpose of & 0xff7fffff. Then the whole bit pattern is shifted right by one, which halves the exponent and also halves the mantissa (in effect, only the fractional part of the mantissa is halved). Because the shift also halves the bias 127, down to roughly 63.5, we must compensate by adding 64 back into the exponent field, which is why we add (64 << 23).
You may wonder: why add (64 << 23) rather than (63 << 23), and is clearing the lowest exponent bit really necessary? Other variants do work, but they run slower than the version above, which means this estimate lands closer to the exact value. Here is a brief analysis of why. Suppose e is even, say e = 2n; after the square root, e should become n, with the bias 127 unchanged. In the code, E = e + 127 = 2n + 127 is odd; clearing its lowest bit makes it 2n + 126; shifting right gives n + 63; adding the compensation 64 gives n + 127, exactly the right biased exponent! Now suppose e is odd, say e = 2n + 1; after the square root, e should become roughly n + 1 (this case is inherently inexact). In the code, E = 2n + 128 is even, so clearing the lowest bit changes nothing; shifting right gives n + 64; adding the compensation 64 gives n + 128 = (n + 1) + 127, again the right biased exponent! So this estimate really is more accurate than the alternatives, which is why it is faster.
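A quick sanity check of this analysis (a throwaway test of mine, not from the original post): print the bit-trick estimate next to the true square root for one even-exponent and one odd-exponent input.

    #include <math.h>
    #include <stdio.h>

    static float InitialGuess(float x)
    {
        /* clear the lowest exponent bit, shift right, add back the 64 << 23 compensation */
        int temp = (((*(int *)&x) & 0xff7fffff) >> 1) + (64 << 23);
        return *(float *)&temp;
    }

    int main(void)
    {
        printf("guess %g vs exact %g\n", InitialGuess(16.0f), sqrtf(16.0f)); /* e = 4 even: exact 4 */
        printf("guess %g vs exact %g\n", InitialGuess(8.0f),  sqrtf(8.0f));  /* e = 3 odd: rough   */
        return 0;
    }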
Although the optimized Newton iteration is much faster than bisection, it is still slower than the library function sqrtf: the same test with sqrtf takes only 100 ms, three times the performance of our optimized Newton iteration! How on earth are library functions implemented? Clearly our initial guess is still not accurate enough. Don't worry: next we introduce an algorithm that is even faster than the library function, with up to ten times its performance!
3. Carmack Algorithm
This algorithm was dug out of the source code of a 1999 game (Quake III) and was attributed to the game-industry legend John Carmack. It now appears the algorithm is much older, and its original author can no longer be traced; nevertheless it is commonly called the Carmack algorithm. Without further ado, here is the code:
float SqrtByCarmack(float number)
{
    int i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y = number;
    i = *(int *)&y;                       /* reinterpret the float bits as an int */
    i = 0x5f375a86 - (i >> 1);            /* magic initial estimate */
    y = *(float *)&i;
    y = y * (threehalfs - (x2 * y * y));  /* Newton iterations for 1/sqrt(x) */
    y = y * (threehalfs - (x2 * y * y));
    y = y * (threehalfs - (x2 * y * y));
    return number * y;                    /* sqrt(x) = x * (1/sqrt(x)) */
}
Two things about this code jump out immediately: it has no loop, and it has no division! Anyone seeing it for the first time is stunned. Let's explain how it works, starting from the original version:
float Q_rsqrt(float number)
{
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y = number;
    i = *(long *)&y;                       // evil floating point bit level hacking
    i = 0x5f3759df - (i >> 1);             // what the fuck?
    y = *(float *)&i;
    y = y * (threehalfs - (x2 * y * y));   // 1st iteration
//  y = y * (threehalfs - (x2 * y * y));   // 2nd iteration, this can be removed
    return y;
}
Fig 2 Newton iteration for the inverse square root
The original version computes the reciprocal of the square root, 1/sqrt(x), not the square root itself, for two reasons. First, the inverse square root is needed even more often than the square root: games constantly normalize vectors, and normalization uses the inverse square root. Second, the Newton iteration for the inverse square root contains no division, so it is faster than the Newton iteration shown earlier. Admittedly the code above does not look like Newton iteration at first glance; that is because the target function has changed, so the formula changes, but the derivation of the recurrence is exactly the same, as shown in Figure 2. Taking f(y) = 1/y^2 - x, whose positive root is y = 1/sqrt(x), we have f'(y) = -2/y^3, and the earlier derivation gives:

    y_{n+1} = y_n - f(y_n)/f'(y_n) = y_n · (1.5 - (x/2) · y_n^2) = y_n · (1.5 - x2 · y_n^2),  where x2 = x/2
With this formula we can now read the line y = y * (threehalfs - (x2 * y * y)); for what it is: a single Newton iteration. But why is one iteration enough? Because the accuracy after a single iteration is already very high; the code even states explicitly that the second iteration can be removed (for the precision a game needs). Figure 3 plots the relative error of the inverse square root for inputs from 0.01 to 10000 (taken from Wikipedia); the error is tiny and shrinks as the input grows.
Figure 3 Error of the Carmack algorithm
Why can a single iteration meet the precision requirement? From the earlier analysis we know the fundamental reason must be that the initial guess is extremely close to the exact solution. The key to that initial guess is the following line:
i = 0x5f3759df - ( i >> 1 );
It is precisely this line, and in particular the "magic number" in it, that makes the algorithm's initial guess so close to the exact solution. The principle is again the bit-reinterpretation trick from the previous blog: first reinterpret the float's bits as an int (on a 32-bit machine, the long in the code is equivalent to int), then perform the magic operation on that integer, and finally reinterpret the bits back as a float. The result is a remarkably accurate initial guess.
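To get a feel for that accuracy, here is a small error probe (my own check, consistent with Figure 3): compare the one-iteration Q_rsqrt against 1.0f / sqrtf for a few inputs; the relative error stays below roughly 0.2%.

    #include <math.h>
    #include <stdio.h>

    /* assumes the Q_rsqrt function shown above is in scope */
    float Q_rsqrt(float number);

    int main(void)
    {
        float tests[] = { 0.01f, 1.0f, 2.0f, 100.0f, 10000.0f };
        for (int k = 0; k < 5; k++) {
            float exact  = 1.0f / sqrtf(tests[k]);
            float approx = Q_rsqrt(tests[k]);
            printf("x = %-8g relative error = %g\n",
                   tests[k], fabsf(approx - exact) / exact);
        }
        return 0;
    }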
In the previous blog we gave a formula relating a positive float's value to the integer formed by its bits:

    I ≈ L · (log2(x) + B - σ),  where L = 2^23 and B = 127

Here I is the int obtained by reinterpreting the float's bits, x is the value of the floating-point number, B = 127 is the exponent bias, and σ is a small correction term (it absorbs the error of approximating log2(1 + m) by m). Rearranging, we get:

    log2(x) ≈ I/L - B + σ

With this formula we can derive where the initial guess comes from. Let y = 1/sqrt(x); taking logarithms gives log2(y) = -(1/2) · log2(x), and substituting the formula on both sides yields:

    I_y ≈ (3/2) · L · (B - σ) - (1/2) · I_x
This formula is the mathematical face of the magic operation: the first term is the magic number, and the second is the right shift and subtraction. The only unknown in it is σ, and there is no closed-form way to pick its best value, so it was found by brute-force search and experiment: the optimum came out as σ = 0.0450466, for which the first term (3/2) · L · (B - σ) equals 0x5f3759df. Later, more careful experiments showed that 0x5f375a86 achieves even better accuracy, which is why that constant is used in the newer code.
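The constant can be reproduced from the formula; here is a back-of-the-envelope check (mine, under the σ = 0.0450466 assumption; the last bit depends on rounding):

    #include <stdio.h>

    int main(void)
    {
        const double L = 8388608.0;      /* 2^23 */
        const double B = 127.0;          /* exponent bias */
        const double sigma = 0.0450466;  /* the empirically found correction */
        /* magic number for the inverse square root: (3/2) * L * (B - sigma) */
        printf("0x%08lx\n", (unsigned long)(1.5 * L * (B - sigma) + 0.5)); /* ~0x5f3759df */
        return 0;
    }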
Since our goal is the square root while the original Carmack algorithm computes the inverse square root, our version simply returns the input multiplied by the inverse square root. The algorithm delivers both high performance and high accuracy: with three iterations its accuracy matches the system function, at a fraction of the running time of sqrtf.
4. Improved Newton Iteration

The Carmack algorithm also inspires us to make a similar modification to the original Newton square-root iteration. Earlier we gave one way to estimate the initial value and obtained a good speedup; now let's re-derive the estimate following the Carmack idea. Let y = sqrt(x); then log2(y) = (1/2) · log2(x), and substituting the integer-logarithm formula on both sides gives:

    I_y ≈ (1/2) · L · (B - σ) + (1/2) · I_x
The new formula looks just like Carmack's, only with different coefficients, which is easy to understand: the logarithms of the square root and of the inverse square root differ only by a sign. Again σ is the only unknown. I did not have the energy to brute-force its optimal value and only tried two choices: σ = 0, and the same σ = 0.0450466 as in the Carmack algorithm. The resulting magic numbers are 0x1fc00000 and 0x1fbd1e2d respectively, so the guess becomes magic + (I_x >> 1). Both values give similar results: compared with our earlier initial estimate the performance improves by about 25%, but the code is still twice as slow as the library function sqrtf.
This result was frustrating. Being slower than the Carmack algorithm would be understandable, but twice as slow as the library function made no sense. Was the initial guess still that bad? In my frustration I deleted the loop and hard-coded exactly three iterations: the results matched the system function, and the speed doubled! This is startling; it means the do loop itself carries a large overhead. The same conclusion can be reproduced on the Carmack algorithm: wrapping its iterations in a loop slows it down by a factor of ten! Why is the loop so expensive? One reason may be the fabsf call needed for the convergence test; the other causes are unclear to me.
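For reference, the loop-free version reads roughly as follows (reconstructed from the description above, so treat it as a sketch; I use the σ = 0.0450466 magic number 0x1fbd1e2d from the derivation in the previous paragraph):

    float SqrtByNewtonNoLoop(float x)
    {
        /* Carmack-style initial guess for sqrt: magic + (bits >> 1) */
        int i = 0x1fbd1e2d + ((*(int *)&x) >> 1);
        float val = *(float *)&i;
        /* exactly three unrolled Newton steps: no loop, no fabsf convergence test */
        val = (val + x / val) * 0.5f;
        val = (val + x / val) * 0.5f;
        val = (val + x / val) * 0.5f;
        return val;
    }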
Now that the loop is gone and the speed has doubled, what can we conclude? Per call, the unrolled Newton version performs roughly 3 × 2 = 6 divisions (ignoring additions), while the Carmack version performs roughly 3 × 3 + 1 = 10 multiplications (the extra one when returning the result); both run three iterations, and their times differ by a factor of two. Setting 6 divisions equal to twice the cost of 10 multiplications suggests that a division costs roughly three times as much as a multiplication. The practical lesson: eliminating division is an important way to gain speed.
We only tried two values of σ, and the results were already good. Is there a better one? Yes, but it only matters when a single iteration is used (as in the original Carmack algorithm). With three iterations the choice stops mattering: any value in a sizable range produces identical results in identical time, because the result is already accurate to full precision.
Incidentally, if I also remove the loop from my very first initial-value version, three iterations do not reach the required accuracy, which shows that estimate really was too crude. Initial-value estimation is an art in itself. So what lessons do Newton iteration and the Carmack algorithm teach? First, for peak performance, avoid loops in the code; this is exactly what loop unrolling is about. Second, the final contest between the two algorithms reduced to division versus multiplication, so eliminate division through mathematical transformation wherever possible; it buys a lot of performance. Third, a deep understanding of how floating-point numbers are stored in the computer unlocks extreme-performance solutions to many problems.
5. SSE Assembly Instructions
Optimization never ends. At the high-level-language level the Carmack algorithm is currently the fastest square-root method, but at the machine-instruction level Intel provides the RSQRTSS instruction, which supports the same idea in hardware. RSQRTSS belongs to SSE (Streaming SIMD Extensions), a single-instruction multiple-data instruction set that greatly strengthens the CPU's floating-point capability. Compilers also expose SSE instructions as intrinsics, so they can be used directly from C/C++ code without writing assembly. The following code uses this instruction (from "searching for a faster square root reciprocal algorithm"):
#include <xmmintrin.h>   /* SSE intrinsics */

float SqrtByRSQRTSS(float a)
{
    float b = a;
    __m128 in = _mm_load_ss(&b);    /* load one float into an SSE register */
    __m128 out = _mm_rsqrt_ss(in);  /* hardware approximation of 1/sqrt */
    _mm_store_ss(&b, out);
    return a * b;                   /* sqrt(a) = a * (1/sqrt(a)) */
}
Because SSE instructions operate on 128-bit registers, a float cannot be converted to __m128 directly; we must go through SSE's dedicated load and store intrinsics. As for performance, "Timing square root" reports the result shown in Figure 4, but I could not reproduce it. Worse, with no compiler settings at all, the code above runs three times slower than the Carmack algorithm, which means the compiler does not enable SSE optimization by default. After specifying the /arch:SSE flag, its performance matches the Carmack algorithm, though it never reaches the three-fold gap shown in Figure 4. Still, Intel would hardly ship an instruction slower than a high-level-language routine; more likely I am missing some basic rules of SSE configuration, so I am inclined to trust the Figure 4 result. One more caveat of some importance: RSQRTSS is not exact; its relative error is up to ±1.5 × 2^-12, so the instruction cannot be used where strict precision is required.
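For completeness, a minimal driver (my sketch; the compile flag is toolchain-specific, e.g. /arch:SSE on 32-bit MSVC as mentioned above, or -msse on 32-bit gcc, while x86-64 compilers enable SSE by default):

    #include <stdio.h>

    /* assumes SqrtByRSQRTSS from above is in scope */
    float SqrtByRSQRTSS(float a);

    int main(void)
    {
        /* ~1.41421; remember the result carries RSQRTSS's up-to-2^-12 relative error */
        printf("%f\n", SqrtByRSQRTSS(2.0f));
        return 0;
    }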
Figure 4 Performance comparison between the SSE instructions and the Carmack algorithm
The second row of Figure 4 also shows the performance of the SQRTSS instruction, the exact hardware square root. Much like our improved Newton iteration, it is slower than RSQRTSS, and not by a small margin.
6. Summary

This blog surveyed the methods commonly used for fast floating-point square roots, and I hope it proves useful. Finally, here are all my experimental results for reference. My CPU is a dual-core 32-bit Core 2 T5750 with 2 GB of memory. The test takes the square root of every integer from 1 to W. The running times are as follows:
Later I re-ran the tests on the company's server, and the results made me sad. The server CPU is a 24-core 64-bit Xeon E5-2630 with 128 GB of memory, and the compiler is the paid icc compiler. The test again takes the square root of a range of integers. The running times are as follows:
Looking at those numbers, I can only say the system function is invincible (and with no precision loss)! Everything written above is hereby voided; measurements on your actual compiler are the only standard......