Abstract: This article presents two programs for computing factorials. The first embeds assembly code in C to remove the bottleneck in Program 2 of the previous article, making it more than three times faster. The second further improves the algorithm: when computing the factorial of 10000, it is 5 to 6 times faster than Program 2 of the previous article, and on the machine used in the previous article it needs only about 0.25 seconds.

In the previous article in this series (the second introductory part on big-number factorial calculation), we gave two programs for computing the factorial of a large number. The second of them relies on 64-bit integer division and is therefore slow. In this article we embed assembly code in C to remove that bottleneck and make the computation more than three times faster.

First, let's look at the core code used to compute a big-number factorial (see below). Each iteration of the loop needs one 64-bit multiplication, one 64-bit addition, and two 64-bit divisions (the remainder operation counts as a division). Division instructions are slower than multiplication instructions, and this is especially true of 64-bit division. In VC, dividing two 64-bit numbers is implemented by calling the aulldiv function, and taking the remainder of two 64-bit numbers by calling aullrem. Analyzing the source (assembly code) of these two functions shows that a division first checks the divisor. If the divisor is less than 2^32, two division instructions are needed to produce the 64-bit quotient; if the divisor is greater than or equal to 2^32, the operation is more complex. Likewise, for the remainder, if the divisor is less than 2^32, two division instructions are needed to produce the 32-bit remainder. In our case the divisor is 1000000000, which is less than 2^32, so this code executes four division instructions per iteration.
for (carry = 0, p = pTail; p >= pHead; p--)
{
    prod = (UINT64)(*p) * (UINT64)i + carry;
    *p = (DWORD)(prod % TEN9);
    carry = prod / TEN9;
}
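To make the loop concrete, here is a minimal, portable sketch of the same step, written with standard C types instead of the Windows DWORD and UINT64 typedefs. The function name `mul_small` and its signature are assumptions made for illustration and are not part of the original program. It multiplies a base-10^9 number, stored most significant digit first, by a small integer and returns the carry that falls out of the top digit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TEN9 1000000000u

/* Hypothetical helper: multiply the base-10^9 number stored in
   head..tail (most significant digit at head) by m, in place.
   Returns the carry out of the most significant digit.
   Assumes every stored digit is less than TEN9. */
static uint32_t mul_small(uint32_t *head, uint32_t *tail, uint32_t m)
{
    uint32_t carry = 0;
    uint32_t *p;

    assert(head != NULL && tail != NULL && head <= tail);
    for (p = tail; ; p--) {
        /* digit < 10^9, m < 2^32, carry < 2^32: the sum fits in 64 bits */
        uint64_t prod = (uint64_t)(*p) * m + carry;
        *p = (uint32_t)(prod % TEN9);     /* new digit */
        carry = (uint32_t)(prod / TEN9);  /* quotient is below 2^32 */
        if (p == head)
            break;  /* stop here so p never moves before the first digit */
    }
    return carry;
}
```

For example, with the two digits {0, 999999999}, multiplying by 1000 leaves {999, 999999000} and returns a carry of 0.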
In this code, the quotient of each division is always less than 2^32, so the 64-bit division and the remainder can both be obtained from a single division instruction. Next we embed assembly code in a C function that executes one division instruction and yields both the quotient and the remainder. Function prototype:

DWORD div_ten9_2(UINT64 x, DWORD *premainder);

This function returns the quotient of x divided by 1000000000 and stores the remainder in *premainder. The code for the div_ten9_2 function and for the factorial calculation follows. It is implemented with inline assembly in C; for the usage of inline assembly in C code, see MSDN.
inline DWORD div_ten9_2(UINT64 x, DWORD *premainder)
{
    _asm
    {
        mov eax, dword ptr [x]      // low DWORD
        mov edx, dword ptr [x + 4]  // high DWORD
        mov ebx, TEN9
        div ebx                     // edx:eax / ebx -> quotient in eax, remainder in edx
        mov ebx, premainder
        mov [ebx], edx              // remainder -> *premainder
        // the quotient stays in eax, which is the return value
    }
}

void calcfac2(DWORD n)
{
    DWORD *buff, *pHead, *pTail, *p;
    DWORD t, i, len, carry;
    UINT64 prod;

    if (n == 0)
    {
        printf("%d! = 1", n);
        return;
    }

    // --- calculate and allocate the required storage
    t = GetTickCount();
    len = calcresultlen(n, TEN9);
    buff = (DWORD *)malloc(sizeof(DWORD) * len);
    if (buff == NULL)
        return;

    // the following code calculates n!
    pHead = pTail = buff + len - 1;
    for (*pTail = 1, i = 2; i <= n; i++)
    {
        for (carry = 0, p = pTail; p >= pHead; p--)
        {
            prod = (UINT64)(*p) * (UINT64)i + (UINT64)carry;
            carry = div_ten9_2(prod, p);  // quotient -> carry, remainder -> *p
        }
        while (carry > 0)
        {
            pHead--;
            *pHead = (DWORD)(carry % TEN9);
            carry /= TEN9;
        }
    }
    t = GetTickCount() - t;

    // display the calculated result
    printf("It takes %d ms\n", t);
    printf("%d! = %d", n, *pHead);
    for (p = pHead + 1; p <= pTail; p++)
        printf("%09d", *p);
    printf("\n");

    free(buff);  // release allocated memory
}
Note: although the title of this article speaks of the power of assembly, using assembly language is not always this effective. In general, after the key code is rewritten in assembly, the performance gain is not as dramatic as in the example above and is at most around 30%. This program is a special case.

The final improvement: if we analyze the factorial functions above, we find that computing a factorial is really a double loop. The inner loop computes (i-1)! * i, and the outer loop computes 2!, 3!, 4!, and so on up to n!. If we have already computed r = (i-1)!, can we first compute prod = i * (i+1) * ... * m, where i * (i+1) * ... * m is less than 2^32 but i * (i+1) * ... * m * (m+1) is greater than or equal to 2^32, and then compute r * prod? This reduces the number of outer-loop iterations and so increases the speed. Both theory and test results show that the speed more than doubles when computing the factorial of a number below 30000. The code follows.
void calcfac3(DWORD n)
{
    DWORD *buff, *pHead, *pTail, *p;
    DWORD t, i, len, carry;
    UINT64 prod;

    if (n == 0)
    {
        printf("%d! = 1", n);
        return;
    }

    // --- calculate and allocate the required storage
    t = GetTickCount();
    len = calcresultlen(n, TEN9);
    buff = (DWORD *)malloc(sizeof(DWORD) * len);
    if (buff == NULL)
        return;

    // the following code calculates n!
    pHead = pTail = buff + len - 1;
    for (*pTail = 1, i = 2; i + 15 < n;)
    {
        // this t is the grouped multiplier; it shadows the timer variable t
        UINT64 t = i++;
        while (t < 4294967296i64)
        {
            t *= (UINT64)i;
            i++;
        }
        // the last factor pushed t to 2^32 or beyond, so take it back out
        i--;
        t /= i;

        for (carry = 0, p = pTail; p >= pHead; p--)
        {
            prod = (UINT64)(*p) * t + (UINT64)carry;
            carry = div_ten9_2(prod, p);
        }
        while (carry > 0)
        {
            pHead--;
            *pHead = (DWORD)(carry % TEN9);
            carry /= TEN9;
        }
    }

    // multiply in the remaining factors one at a time
    for (; i <= n; i++)
    {
        for (carry = 0, p = pTail; p >= pHead; p--)
        {
            prod = (UINT64)(*p) * (UINT64)i + (UINT64)carry;
            carry = div_ten9_2(prod, p);
        }
        while (carry > 0)
        {
            pHead--;
            *pHead = (DWORD)(carry % TEN9);
            carry /= TEN9;
        }
    }
    printf("It takes %d ms\n", GetTickCount() - t);

    // display the calculated result
    printf("%d! = %d", n, *pHead);
    for (p = pHead + 1; p <= pTail; p++)
        printf("%09d", *p);
    printf("\n");

    free(buff);  // release allocated memory
}
Copyright liangbch@263.net; please credit the source when reprinting.