Optimization of Matrix Multiplication

Source: Internet
Author: User

Address: http://www.51nod.com/onlineJudge/questionCode.html#!problemId=1113

I stayed up late last night working out a couple of optimizations for this problem, and only got around to writing the blog post this morning. Sorry about that, haha.

Friends who do OJ problems all know the power of fast exponentiation ("quick power"), so I won't presume to lecture on that. What I'm discussing here is optimization at the level of the matrix multiplication itself.
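For readers who haven't seen it, "quick power" computes M^e in O(log e) multiplications by repeated squaring instead of e - 1 multiplications. A minimal 2×2 sketch (the names `mul2`/`mat_pow` and the fixed size are mine, not from the problem's code; the actual problem uses n×n matrices mod P):

```c
#include <string.h>

typedef unsigned long long ULL;
#define P 1000000007ULL   /* assumed modulus; the post only says P is about 1e9 */

/* multiply two 2x2 matrices mod P; entries are < P, so each product fits in a ULL */
static void mul2(ULL c[2][2], ULL a[2][2], ULL b[2][2]) {
    ULL t[2][2] = {{0, 0}, {0, 0}};
    for (int i = 0; i < 2; ++i)
        for (int k = 0; k < 2; ++k)
            for (int j = 0; j < 2; ++j)
                t[i][j] = (t[i][j] + a[i][k] * b[k][j]) % P;
    memcpy(c, t, sizeof t);   /* t buffer makes c == a or c == b safe */
}

/* square-and-multiply: O(log e) calls to mul2 */
static void mat_pow(ULL r[2][2], ULL m[2][2], unsigned e) {
    ULL base[2][2];
    ULL id[2][2] = {{1, 0}, {0, 1}};  /* identity matrix */
    memcpy(base, m, sizeof base);
    memcpy(r, id, sizeof id);
    while (e) {
        if (e & 1) mul2(r, r, base);  /* fold in this bit of the exponent */
        mul2(base, base, base);       /* base = base^2 */
        e >>= 1;
    }
}
```

For example, powering the Fibonacci matrix [[1,1],[1,0]] yields Fibonacci numbers in the off-diagonal entries.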

My initial version took 1156 ms. The code:

void mat_mul( int (*a)[MAXN], int (*b)[MAXN], int (*c)[MAXN], int n ) {
    int  i, j, k;
    ULL  sum;
    for( i = 0; i < n; ++i ) {
        for( j = 0; j < n; ++j ) {
            sum = 0;
            for( k = 0; k < n; ++k )
                sum = ( sum + (ULL)a[i][k] * (ULL)b[k][j] ) % P;
            c[i][j] = sum;
        }
    }
}

I stored the matrices in int arrays to save memory, since every result is reduced modulo P anyway (using long long would double the memory).

In my head I was simulating the computation by hand: for each position of c, accumulate the products of the corresponding row of a with the corresponding column of b.

A local variable sum holds the running total and is assigned to c[i][j] at the end; this way sum can be optimized into a register.

The problems with this code: 1. every accumulation performs a modulo, and it sits in the innermost loop, exactly where the program does the bulk of its work; 2. b is read by column, which drives up the CPU cache miss rate — we all know that reading memory sequentially is the most efficient.
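The column-access point can be made concrete: in row-major C, neighbors within a row are adjacent in memory, while neighbors within a column are a full row apart. A tiny sketch (MAXN = 512 and the helper names are arbitrary choices of mine):

```c
#include <stddef.h>

#define MAXN 512
static int b[MAXN][MAXN];

/* byte distance between consecutive accesses when scanning a row of b */
ptrdiff_t row_step(void)    { return (char *)&b[0][1] - (char *)&b[0][0]; }

/* byte distance between consecutive accesses when scanning a column of b */
ptrdiff_t column_step(void) { return (char *)&b[1][0] - (char *)&b[0][0]; }
```

Scanning a row advances sizeof(int) bytes per access, mostly staying within one cache line; scanning a column jumps MAXN * sizeof(int) bytes, touching a new cache line on essentially every access.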

I figured the test data might not be that strong, so I tried a gamble: skip the modulo during accumulation (betting that sum never overflows) and reduce only once at the end. That cuts the number of modulo operations from O(n^3) to O(n^2).

The result was WA: the accumulated sum exceeds the range of unsigned long long (2^64 - 1, about 18 × 10^18). The code:

void mat_mul( int (*a)[MAXN], int (*b)[MAXN], int (*c)[MAXN], int n ) {
    int  i, j, k;
    ULL  sum;
    for( i = 0; i < n; ++i ) {
        for( j = 0; j < n; ++j ) {
            sum = 0;
            for( k = 0; k < n; ++k )
                sum += (ULL)a[i][k] * (ULL)b[k][j];
            c[i][j] = sum % P;
        }
    }
}

Next, I made several attempts: 1. the output section was guarded by an if, so I split that branch out; 2. I tried replacing unsigned long long with long long; 3. I compared my code against a top-ranked user's (noticeably faster than mine) and the only difference I could find was that he stores the matrix in long long.

None of this changed the timing, and attempt 3 even doubled my memory. I wanted to cry. It's the same triple loop — why is the gap between people so big?

So I studied that top user's code carefully. Why is it faster than mine? I finally found it: the order of his loops is different from mine.

After modifying mine accordingly, I got an 828 ms version:

void mat_mul( int (*a)[MAXN], int (*b)[MAXN], int (*c)[MAXN], int n ) {
    int  i, j, k, *p1, *p2, *end;
    ULL  tmp;
    memset( c, 0, sizeof( A[0] ) );
    for( i = 0; i < n; ++i ) {
        for( k = 0; k < n; ++k ) {
            tmp = a[i][k];
            for( p1 = c[i], p2 = b[k], end = p1 + n; p1 != end; ++p1, ++p2 )
                *p1 = ( *p1 + tmp * (*p2) ) % P;  // c[i][j] = ( c[i][j] + a[i][k] * b[k][j] ) % P;
        }
    }
}

This code moves the k loop to the middle, leaving j as the innermost loop. (I rewrote the inner loop with pointers; the commented-out line with two-dimensional indexing is presumably a bit less efficient, though I haven't verified that.)

The point of this code: in the innermost loop, c and b are both accessed sequentially, and a[i][k] stays constant for the whole loop, which makes far better use of the CPU cache. The larger the matrix, the more pronounced the speedup.

This idea should be used in practical projects in the future.

The last step is to optimize the modulo itself. Since we can't drop the modulo from the accumulation entirely, I accumulate partially and then reduce, which cuts down the most time-consuming operation, the modulo.

Analyzing the data: suppose the entries of a and b are close to their maximum possible value (modulo P that is P - 1, with P around 10^9). A single product is then about 10^18, so an unsigned long long can accumulate roughly 18 of them before overflowing. I chose 16: on the one hand it is close to 18, and on the other hand it lets the check use cheap bit operations (k & 15). This reduces the number of modulo operations to about 1/16 of before. Of course, the every-16th check adds a branch per iteration, but that cost is almost negligible. This version takes 484 ms; the code (a bit ugly):

void mat_mul( int (*a)[MAXN], int (*b)[MAXN], int (*c)[MAXN], int n ) {
    int  i, j, k, L, *p2;
    ULL  tmp[MAXN], con;
    //memset( c, 0, sizeof( A[0] ) );
    for( i = 0; i < n; ++i ) {
        memset( tmp, 0, sizeof( tmp ) );
        for( k = 0, L = (n & ~15); k < L; ++k ) {
            con = a[i][k];
            for( j = 0, p2 = b[k]; j < n; ++j, ++p2 )
                tmp[j] += con * (*p2);
            if( ( k & 15 ) == 15 ) {
                for( j = 0; j < n; ++j ) tmp[j] %= P;
            }
        }
        for( ; k < n; ++k ) {
            con = a[i][k];
            for( j = 0, p2 = b[k]; j < n; ++j, ++p2 )
                tmp[j] += con * (*p2);
        }
        for( j = 0; j < n; ++j )
            c[i][j] = tmp[j] % P;
    }
}

The code got longer. L is n rounded down to a multiple of 16, and the trailing loop handles the fewer-than-16 leftover iterations. The tmp array is itself an optimization (as I understand it, stack variables are faster than globals).
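The "about 18 products" bound can be checked directly. A sketch (taking P = 1e9+7 as an example; the post only says P is around 10^9, and `safe_adds` is my name):

```c
#include <limits.h>

typedef unsigned long long ULL;

/* How many worst-case products (p-1)*(p-1) fit in a ULL before overflow. */
unsigned safe_adds(ULL p) {
    ULL max_prod = (p - 1) * (p - 1);     /* largest single product, about 1e18 */
    return (unsigned)(ULLONG_MAX / max_prod);
}
```

For p around 10^9 this gives 18, which is why accumulating 16 products between reductions is safe: 16 is the nearest power of two below 18, so the counter check reduces to `(k & 15) == 15`.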

Of course, the I/O for this problem could also be optimized, since the input and output are large, but the speedup would be minor, so I left it alone.
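For reference, the usual I/O optimization here is a hand-rolled reader that skips scanf's format parsing. A sketch (not the author's code; `read_int` is my name, and it handles only non-negative integers):

```c
#include <stdio.h>

/* Parse the next non-negative integer from f, skipping any non-digit bytes. */
int read_int(FILE *f) {
    int c, x = 0;
    do { c = fgetc(f); } while (c != EOF && (c < '0' || c > '9'));  /* skip separators */
    for (; c >= '0' && c <= '9'; c = fgetc(f))
        x = x * 10 + (c - '0');                                     /* accumulate digits */
    return x;
}
```

In a contest setting this is typically pointed at stdin, often with an additional user-space buffer.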

I thought I was done writing, but just now I stumbled on one more optimization — it got down to 265 ms. In that case I have to write it up after all...

The Code is as follows:

void mat_mul( int (*a)[MAXN], int (*b)[MAXN], int (*c)[MAXN], int n ) {
    int  i, j, k, L, *p2;
    ULL  tmp[MAXN], con;
    //memset( c, 0, sizeof( A[0] ) );
    for( i = 0; i < n; ++i ) {
        memset( tmp, 0, sizeof( tmp ) );
        for( k = 0, L = (n & ~15); k < L; ) {
#define  OP  do { \
            for( con = a[i][k], p2 = b[k], j = 0; j < n; ++j, ++p2 ) \
                tmp[j] += con * (*p2); \
            ++k; \
        } while(0)
            OP; OP; OP; OP;
            OP; OP; OP; OP;
            OP; OP; OP; OP;
            OP; OP; OP; OP;
            for( j = 0; j < n; ++j ) tmp[j] %= P;
        }
        for( ; k < n; ) {
            OP;
        }
        for( j = 0; j < n; ++j )
            c[i][j] = tmp[j] % P;
    }
}

This code is equivalent to the previous one, but manually unrolling the 16 iterations eliminates the per-iteration branch (and its prediction cost). I didn't expect the effect to be this pronounced.
