Optimization of Matrix Multiplication

Source: Internet
Author: User

Address: http://www.51nod.com/onlineJudge/questionCode.html#!problemId=1113

I stayed up late last night working out a couple of optimizations for this problem, and only got around to writing the blog post this morning. Sorry about that, haha.

Friends who do OJ problems all know the power of fast exponentiation ("quick power"), so I won't presume to lecture on that. What I'm discussing here is optimization at the level of the matrix multiplication itself.
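For readers who haven't seen it, "quick power" computes M^e in O(log e) multiplications by repeated squaring instead of e - 1 multiplications. A minimal 2×2 sketch (the names `mul2`/`mat_pow` and the fixed size are mine, not from the problem's code; the actual problem uses n×n matrices mod P):

```c
#include <string.h>

typedef unsigned long long ULL;
#define P 1000000007ULL   /* assumed modulus; the post only says P is about 1e9 */

/* multiply two 2x2 matrices mod P; entries are < P, so each product fits in a ULL */
static void mul2(ULL c[2][2], ULL a[2][2], ULL b[2][2]) {
    ULL t[2][2] = {{0, 0}, {0, 0}};
    for (int i = 0; i < 2; ++i)
        for (int k = 0; k < 2; ++k)
            for (int j = 0; j < 2; ++j)
                t[i][j] = (t[i][j] + a[i][k] * b[k][j]) % P;
    memcpy(c, t, sizeof t);   /* t buffer makes c == a or c == b safe */
}

/* square-and-multiply: O(log e) calls to mul2 */
static void mat_pow(ULL r[2][2], ULL m[2][2], unsigned e) {
    ULL base[2][2];
    ULL id[2][2] = {{1, 0}, {0, 1}};  /* identity matrix */
    memcpy(base, m, sizeof base);
    memcpy(r, id, sizeof id);
    while (e) {
        if (e & 1) mul2(r, r, base);  /* fold in this bit of the exponent */
        mul2(base, base, base);       /* base = base^2 */
        e >>= 1;
    }
}
```

For example, powering the Fibonacci matrix [[1,1],[1,0]] yields Fibonacci numbers in the off-diagonal entries.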

My initial version took 1156 ms. The code:

void mat_mul( int (*a)[MAXN], int (*b)[MAXN], int (*c)[MAXN], int n ) {
    int  i, j, k;
    ULL  sum;
    for( i = 0; i < n; ++i ) {
        for( j = 0; j < n; ++j ) {
            sum = 0;
            for( k = 0; k < n; ++k )
                sum = ( sum + (ULL)a[i][k] * (ULL)b[k][j] ) % P;
            c[i][j] = sum;
        }
    }
}

I stored the matrices in int arrays to save memory, since every result is reduced modulo P anyway (using long long would double the memory).

In my head I was simulating the computation by hand: for each position of c, accumulate the products of the corresponding row of a with the corresponding column of b.

A local variable sum holds the running total and is assigned to c[i][j] at the end; this way sum can be optimized into a register.

The problems with this code: 1. every accumulation performs a modulo, and it sits in the innermost loop, exactly where the program does the bulk of its work; 2. b is read by column, which drives up the CPU cache miss rate — we all know that reading memory sequentially is the most efficient.
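The column-access point can be made concrete: in row-major C, neighbors within a row are adjacent in memory, while neighbors within a column are a full row apart. A tiny sketch (MAXN = 512 and the helper names are arbitrary choices of mine):

```c
#include <stddef.h>

#define MAXN 512
static int b[MAXN][MAXN];

/* byte distance between consecutive accesses when scanning a row of b */
ptrdiff_t row_step(void)    { return (char *)&b[0][1] - (char *)&b[0][0]; }

/* byte distance between consecutive accesses when scanning a column of b */
ptrdiff_t column_step(void) { return (char *)&b[1][0] - (char *)&b[0][0]; }
```

Scanning a row advances sizeof(int) bytes per access, mostly staying within one cache line; scanning a column jumps MAXN * sizeof(int) bytes, touching a new cache line on essentially every access.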

I figured the test data might not be that strong, so I tried a gamble: skip the modulo during accumulation (betting that sum never overflows) and reduce only once at the end. That cuts the number of modulo operations from O(n^3) to O(n^2).

The result was WA: the accumulated sum exceeds the range of unsigned long long (2^64 - 1, about 18 × 10^18). The code:

void mat_mul( int (*a)[MAXN], int (*b)[MAXN], int (*c)[MAXN], int n ) {
    int  i, j, k;
    ULL  sum;
    for( i = 0; i < n; ++i ) {
        for( j = 0; j < n; ++j ) {
            sum = 0;
            for( k = 0; k < n; ++k )
                sum += (ULL)a[i][k] * (ULL)b[k][j];
            c[i][j] = sum % P;
        }
    }
}

Next, I made several attempts: 1. the output section was guarded by an if, so I split that branch out; 2. I tried replacing unsigned long long with long long; 3. I compared my code against a top-ranked user's (noticeably faster than mine) and the only difference I could find was that he stores the matrix in long long.

None of this changed the timing, and attempt 3 even doubled my memory. I wanted to cry. It's the same triple loop — why is the gap between people so big?

So I studied that top user's code carefully. Why is it faster than mine? I finally found it: the order of his loops is different from mine.

After modifying mine accordingly, I got an 828 ms version:

void mat_mul( int (*a)[MAXN], int (*b)[MAXN], int (*c)[MAXN], int n ) {
    int  i, j, k, *p1, *p2, *end;
    ULL  tmp;
    memset( c, 0, sizeof( A[0] ) );
    for( i = 0; i < n; ++i ) {
        for( k = 0; k < n; ++k ) {
            tmp = a[i][k];
            for( p1 = c[i], p2 = b[k], end = p1 + n; p1 != end; ++p1, ++p2 )
                *p1 = ( *p1 + tmp * (*p2) ) % P;  // c[i][j] = ( c[i][j] + a[i][k] * b[k][j] ) % P;
        }
    }
}

This code moves the k loop to the middle, leaving j as the innermost loop. (I rewrote the inner loop with pointers; the commented-out line with two-dimensional indexing is presumably a bit less efficient, though I haven't verified that.)

The point of this code: in the innermost loop, c and b are both accessed sequentially, and a[i][k] stays constant for the whole loop, which makes far better use of the CPU cache. The larger the matrix, the more pronounced the speedup.

This idea should be used in practical projects in the future.

The last step is to optimize the modulo itself. Since we can't drop the modulo from the accumulation entirely, I accumulate partially and then reduce, which cuts down the most time-consuming operation, the modulo.

Analyzing the data: suppose the entries of a and b are close to their maximum possible value (modulo P that is P - 1, with P around 10^9). A single product is then about 10^18, so an unsigned long long can accumulate roughly 18 of them before overflowing. I chose 16: on the one hand it is close to 18, and on the other hand it lets the check use cheap bit operations (k & 15). This reduces the number of modulo operations to about 1/16 of before. Of course, the every-16th check adds a branch per iteration, but that cost is almost negligible. This version takes 484 ms; the code (a bit ugly):

void mat_mul( int (*a)[MAXN], int (*b)[MAXN], int (*c)[MAXN], int n ) {
    int  i, j, k, L, *p2;
    ULL  tmp[MAXN], con;
    //memset( c, 0, sizeof( A[0] ) );
    for( i = 0; i < n; ++i ) {
        memset( tmp, 0, sizeof( tmp ) );
        for( k = 0, L = (n & ~15); k < L; ++k ) {
            con = a[i][k];
            for( j = 0, p2 = b[k]; j < n; ++j, ++p2 )
                tmp[j] += con * (*p2);
            if( ( k & 15 ) == 15 ) {
                for( j = 0; j < n; ++j ) tmp[j] %= P;
            }
        }
        for( ; k < n; ++k ) {
            con = a[i][k];
            for( j = 0, p2 = b[k]; j < n; ++j, ++p2 )
                tmp[j] += con * (*p2);
        }
        for( j = 0; j < n; ++j )
            c[i][j] = tmp[j] % P;
    }
}

The code got longer. L is n rounded down to a multiple of 16, and the trailing loop handles the fewer-than-16 leftover iterations. The tmp array is itself an optimization (as I understand it, stack variables are faster than globals).
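The "about 18 products" bound can be checked directly. A sketch (taking P = 1e9+7 as an example; the post only says P is around 10^9, and `safe_adds` is my name):

```c
#include <limits.h>

typedef unsigned long long ULL;

/* How many worst-case products (p-1)*(p-1) fit in a ULL before overflow. */
unsigned safe_adds(ULL p) {
    ULL max_prod = (p - 1) * (p - 1);     /* largest single product, about 1e18 */
    return (unsigned)(ULLONG_MAX / max_prod);
}
```

For p around 10^9 this gives 18, which is why accumulating 16 products between reductions is safe: 16 is the nearest power of two below 18, so the counter check reduces to `(k & 15) == 15`.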

Of course, the I/O for this problem could also be optimized, since the input and output are large, but the speedup would be minor, so I left it alone.
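For reference, the usual I/O optimization here is a hand-rolled reader that skips scanf's format parsing. A sketch (not the author's code; `read_int` is my name, and it handles only non-negative integers):

```c
#include <stdio.h>

/* Parse the next non-negative integer from f, skipping any non-digit bytes. */
int read_int(FILE *f) {
    int c, x = 0;
    do { c = fgetc(f); } while (c != EOF && (c < '0' || c > '9'));  /* skip separators */
    for (; c >= '0' && c <= '9'; c = fgetc(f))
        x = x * 10 + (c - '0');                                     /* accumulate digits */
    return x;
}
```

In a contest setting this is typically pointed at stdin, often with an additional user-space buffer.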

I thought I was done writing, but just now I stumbled on one more optimization — it got down to 265 ms. In that case I have to write it up after all...

The Code is as follows:

void mat_mul( int (*a)[MAXN], int (*b)[MAXN], int (*c)[MAXN], int n ) {
    int  i, j, k, L, *p2;
    ULL  tmp[MAXN], con;
    //memset( c, 0, sizeof( A[0] ) );
    for( i = 0; i < n; ++i ) {
        memset( tmp, 0, sizeof( tmp ) );
        for( k = 0, L = (n & ~15); k < L; ) {
#define  OP  do { \
            for( con = a[i][k], p2 = b[k], j = 0; j < n; ++j, ++p2 ) \
                tmp[j] += con * (*p2); \
            ++k; \
        } while(0)
            OP; OP; OP; OP;
            OP; OP; OP; OP;
            OP; OP; OP; OP;
            OP; OP; OP; OP;
            for( j = 0; j < n; ++j ) tmp[j] %= P;
        }
        for( ; k < n; ) {
            OP;
        }
        for( j = 0; j < n; ++j )
            c[i][j] = tmp[j] % P;
    }
}

This code is equivalent to the previous one, but manually unrolling the 16 iterations eliminates the per-iteration branch (and its prediction cost). I didn't expect the effect to be this pronounced.
