Big integer algorithm [10] Comba multiplication (implementation) and comba Multiplication
★Introduction
In the previous article, we talked about the principle of Comba multiplication and how to implement it this time. To facilitate transplantation and give full play to the performance of different platforms, three different implementation methods are used:
1. Single and Double Precision variables.
2. Only single-precision variables are supported.
3. inline assembly can be used.
We have already introduced the principle and implementation of Comba multiplication. For convenience, I will post it here:
Calculate c = a * B, c0, c1, and c2 as single-precision variables.
1. Increase c to the required precision, and make c = 0, c-> used = a-> used + B-> used.
2. Make c2 = c1 = c0 = 0.
3. For the I from 0 to c-> used-1 cycle, perform the following operations:
3.1 ty = BN_MIN (I, B-> used-1)
3.2 tx = I-ty
3.3 k = BN_MIN (a-> used-tx, ty + 1)
3.4 three precision variable shifts one digit to the right: (c2 | c1 | c0) = (0 | c2 | c1)
3.5 execute a loop between j from 0 to k-1. The calculation is as follows:
(C2 | c1 | c0) = (c2 | c1 | c0) + a (tx + j) * B (ty-j)
3.6 c (I) = c0
4. compress the excess bits and return c.
The three implementation methods mentioned above are mainly reflected in the cycle in 3.5. The implementation code of Comba multiplication is as follows: (the input and output of x = x * y are the same variable)
static void bn_mul_comba(bignum *z, const bignum *x, const bignum *y){ bn_digit c0, c1, c2; bn_digit *px, *py, *pz; size_t nc, i, j, k, tx, ty; pz = z->dp; nc = z->used; c0 = c1 = c2 = 0; for(i = 0; i < nc; i++) { ty = BN_MIN(i, y->used - 1); tx = i - ty; k = BN_MIN(x->used - tx, ty + 1); px = x->dp + tx; py = y->dp + ty; c0 = c1; c1 = c2; c2 = 0U; //Comba 32 for(j = k; j >= 32; j -= 32) { COMBA_INIT COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_STOP } //Comba 16 for(; j >= 16; j -= 16) { COMBA_INIT COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_STOP } //Comba 8 for(; j >= 8; j -= 8) { COMBA_INIT COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_STOP } //Comba 4 for(; j >= 4; j -= 4) { COMBA_INIT COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_MULADDC COMBA_STOP } //Comba 1 for(; j > 0; j--) { COMBA_INIT COMBA_MULADDC COMBA_STOP } *pz++ = c0; } //for i bn_clamp(z);}
In the inner loop of step 1, multiplication is calculated in steps 1, 4, 8, 16, and 32. The expansion process requires a large amount of space, but reduces the overhead of loop control, therefore, it can save a lot of time. The key code for executing multiplication is defined in COMBA_INIT, COMBA_MULADDC, and COMBA_STOP. COMBA_INIT is used for variable definition or initialization. COMBA_MULADDC is used to calculate single-precision multiplication and accumulation. COMBA_STOP is used to store the current calculation result.
★Single and Double Precision Variables
In this case, bn_digit and bn_udbl are defined at the same time, which is the easiest way to implement. The three macros are defined as follows:
#define COMBA_INIT \{ \ bn_udbl r; \#define COMBA_MULADDC \ \ r = (bn_udbl)(*px++) * (*py--) + c0; \ c0 = (bn_digit)r; \ r = (bn_udbl)c1 + (r >> biL); \ c1 = (bn_digit)r; \ c2 += (bn_digit)(r >> biL); \#define COMBA_STOP \}
In a multiplier, the double-precision variable r is defined to store the single-precision multiplication result. In calculation, the single-precision multiplication is calculated first, and then added to the three-precision variable c2 | c1 | c0. Since c0, c1, and c2 are single-precision numbers, in order to avoid overflow, first place the accumulated calculation result to r, and then extract the lower half as the value of the current digit. Therefore, COMBA_STOP has no operation.
★Only single-precision Variables
In this case, bn_udbl is not defined, and each number is split into a high half and a low half. Four single-precision multiplication is required, so the calculation is complicated.
#define COMBA_INIT \{ \ bn_digit a0, a1, b0, b1; \ bn_digit t0, t1, r0, r1; \#define COMBA_MULADDC \ \ a0 = (*px << biLH) >> biLH; \ b0 = (*py << biLH) >> biLH; \ a1 = *px++ >> biLH; \ b1 = *py-- >> biLH; \ r0 = a0 * b0; \ r1 = a1 * b1; \ t0 = a1 * b0; \ t1 = a0 * b1; \ r1 += (t0 >> biLH); \ r1 += (t1 >> biLH); \ t0 <<= biLH; \ t1 <<= biLH; \ r0 += t0; \ r1 += (r0 < t0); \ r0 += t1; \ r1 += (r0 < t1); \ c0 += r0; \ c1 += (c0 < r0); \ c1 += r1; \ c2 += (c1 < r1); \#define COMBA_STOP \}
A1 and a0 respectively store the high half part and the low half part of x, and b1 and b0 respectively store the high half part and the low half part of y, these two operations can be obtained by moving x and y to the left and right half digits. BiLH mentioned above. Here we will talk about biLH, which is half the size of each digit. The above multiplication calculation is complicated, so let's talk about the principle first.
Assume that f (x), g (x), and h (x) represent the numbers a, B, and c, respectively. c = a * B is calculated.
A and B are represented by f (x) and g (x): f (x) = a1 * x + a0, g (x) = b1 * x + b0, here, x is 2 ^ (n/2 ).
Then h (x) = f (x) * g (x) = a1 * b1 * (x ^ 2) + (a1 * b0 + a0 * b1) * x + a0 * b0. By default, a0, a1, b0, and b1 are semi-precision variables, so their product can be stored with a single-precision variable.
Make t0 = a1 * b0 = p * x + q, t1 = a0 * b1 = r * x + s, r0 = a0 * b0, r1 = a1 * b1, where p, q, r, and s are semi-precision, so h (x) can be rewritten:
H (x) = a1 * b1 * (x ^ 2) + (p * x + q + r * x + s) * x + a0 * b0
= A1 * b1 * (x ^ 2) + (p + r) * x + (q + s) * x + a0 * b0
= A1 * b1 * (x ^ 2) + (p + r) * (x ^ 2) + (q + s) * x + a0 * b0
= (A1 * b1 + p + r) * (x ^ 2) + (q + s) * x + a0 * b0
= (R1 + p + r) * (x ^ 2) + (q + s) * x + r0
Therefore, to calculate (r1 + p + r) * (x ^ 2), We need to extract the high half of t0, t1 and the sum of r1 (because r1 and p, r are similar items, therefore, directly calculate the coefficient of x ^ 2 ). Calculate the sum of the remaining items (q + s) x and r0. Extract the lower half of t0 and t1, after moving half of the digits left, add them to r0 (r0 is the number of single precision, while q and s are the half precision. Multiply x to shift the half digit left ). After calculation (q + s) * x + r0, the result may overflow, indicating that carry is generated. Therefore, the carry must be passed to (r1 + p + r) * (x ^ 2), the final product of c can be expressed with the double-precision variable r1 | r0 consisting of r0 and r1. Finally, the product c and the three precision variables c2 | c1 | c0 can be accumulated.
★Use of inline assembly
Note: This section involves x86 assembly. If you do not know about assembly, you can skip this section to avoid wasting your mood. For the inline assembly of C, I will not talk about the details for the moment. refer to the following two articles:
[GCC and VC inline assembly] https://github.com/1184893257/simplelinux/blob/master/inlineasm.md
[X86 inline assembly in Linux] http://www.ibm.com/developerworks/cn/linux/sdk/assemble/inline/
If you can use inline assembly in the compilation environment, using the Assembly command to execute multiplication will speed up the calculation. Since I have only been familiar with the compilation of the x86 platform, I only consider the x86 platform, ARM, PowerPC or MIPS architecture for the time being. In the x86 environment, we need to consider the inline assembly in the GCC and VC environments. After all, their syntax is very different.
A. inline assembly in VC environment: (it is much simpler than GCC)
#define COMBA_INIT \{ \ __asm mov esi, px \ __asm mov edi, py \#define COMBA_MULADDC \ \ __asm lodsd \ __asm mov ebx, [edi] \ __asm mul ebx \ __asm add c0, eax \ __asm adc c1, edx \ __asm adc c2, 0 \ __asm sub edi, 4 \#define COMBA_STOP \ \ __asm mov px, esi \ __asm mov py, edi \}
1. First, send the px and py pointers to the source address register ESI and the target address register EDI with the MOV command.
2. Perform Single-precision multiplication and accumulation:
A. execute the LODSD command to send the content of a unit in the Data Segment pointed to by the ESI register (that is, the value of a certain number of x) to the EAX register, modify the content of the ESI register according to the direction log and data type. The specific operation is to increase the address in ESI by 4, which is equivalent to px ++ in C. In general, the direction sign DF = 0, and the address in ESI increases. If DF = 1, run the CLD command to set DF to 0, but this is not the case in general.
B. Use the MOV command to send the content of a unit in the Data Segment pointed to by the EDI register (that is, the value of a digit of y) to the EBX register.
C. execute the MUL command to execute Single-precision multiplication: (EDX, EAX) = EAX * EBX, calculate the product of EAX and EBX, and put the high result in The EDX register, and the low result in the EAX register.
D. Execute the ADD addition command to accumulate the low order of the product to the bitwise c0 of the Three-precision variable.
E. Execute the addition command of the ADC incoming bit to accumulate the product's high and the carry generated by the previous addition to the second digit c1 of the three refined variables.
F. Execute the addition command of the ADC incoming bit, and add the remaining carry to the highest c2 of the Three-precision variable.
G. Execute the SUB subtraction command to move the address in EDI 4 bytes forward, which is equivalent to py -- in C --.
3. storage Calculation Result: the MOV command is used to send the addresses in ESI and EDI back to the pointers px and py. The reason for this step is that in the loop, loop Control may modify the value of the CPU register. If the address values of ESI and EDI are not stored after calculation, the address may be lost because the register is modified.
B. inline assembly in the GCC environment: (complicated)
#define COMBA_INIT \{ \ asm \ ( \ "movl %5, %%esi \n\t" \ "movl %6, %%edi \n\t" \#define COMBA_MULADDC \ \ "lodsl \n\t" \ "movl (%%edi), %%ebx \n\t" \ "mull %%ebx \n\t" \ "addl %%eax, %2 \n\t" \ "adcl %%edx, %3 \n\t" \ "adcl $0, %4 \n\t" \ "subl $4, %%edi \n\t" \#define COMBA_STOP \ \ "movl %%esi, %0 \n\t" \ "movl %%edi, %1 \n\t" \ :"=m"(px),"=m"(py),"=m"(c0),"=m"(c1),"=m"(c2) \ :"m"(px),"m"(py) \ :"%eax","%ebx","%ecx","%edx","%esi","%edi" \ ); \}
This GCC inline assembly command is the same as the above VC inline assembly command operation, but the syntax is different. For specific syntax, please refer to the above information or Google search :)
★Summary
So much about Comba multiplication. Because the time complexity of Comba multiplication is still O (n ^ 2), the time required will increase dramatically when the input scale n is larger. The next article will explain how to reduce the time complexity of multiplication by means of divide and conquer.
[Back to the directory of this series]
Copyright Notice
Original blog, reproduced must contain this statement, to maintain the integrity of this article, and in the form of hyperlink to indicate the author Starrybird and the original address of this article: http://www.cnblogs.com/starrybird/p/4441022.html