Matrix multiplication is one of the most fundamental kernels in scientific and engineering computation. On distributed-memory parallel machines, the classical parallel matrix multiplication algorithms include the algorithm proposed by Cannon in 1969 for a two-dimensional mesh and the "broadcast-multiply-roll" algorithm proposed by Fox in 1987. The PUMMA algorithm proposed by Choi in 1994 extended the Fox algorithm to a two-dimensional block-cyclic (block-scattered) data distribution, and in the same year Huss-Lederman extended the Fox algorithm to a virtual two-dimensional torus-wrap data distribution. In 1995, van de Geijn and Watts proposed a simple and efficient matrix multiplication algorithm known as SUMMA.
The Dawning (Shuguang) 4000 is a large-scale parallel computer system developed by the National Research Center for Intelligent Computing Systems, based on a distributed-memory architecture and a message-passing mechanism. This post discusses an implementation of SUMMA, a typical matrix multiplication algorithm, on a large-scale parallel machine of the Dawning 4000 kind. In a later post, the author will analyze its performance and compare it with other parallel matrix multiplication algorithms.
The SUMMA algorithm first partitions A, B, and C into sub-blocks of the same size and maps them onto a mesh_r x mesh_c two-dimensional processor mesh. SUMMA then decomposes the matrix multiplication into a sequence of rank-nb updates: within each processor, A is treated as column blocks of width nb and B as row blocks of height nb, which are multiplied and accumulated (the block size referred to earlier is nb). In the algorithm, the broadcasts are pipelined along the row or column rings of the logical processor mesh, which allows computation to overlap with communication. The procedure is described in Algorithm 1.
Algorithm 1. SUMMA algorithm on a two-dimensional mesh
C = 0
for i = 0 to k-1 step nb do
    cur_col = i / (k/c)
    cur_row = i / (k/r)
    if my_col = cur_col, broadcast to this processor row the nb columns of A starting at local column i mod (k/c), stored in A'
    if my_row = cur_row, broadcast to this processor column the nb rows of B starting at local row i mod (k/r), stored in B'
    C = C + A' x B'
end for
Here r and c denote mesh_r and mesh_c, and k is the shared dimension of A and B.
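Algorithm 1 phrases the communication as broadcasts along processor rows and columns, while the MPI implementation shown later in this post replaces them with point-to-point sends. For reference, below is a minimal sketch of Algorithm 1 itself with nb = 1, using row and column communicators created with MPI_Comm_split; the routine and variable names (summa_bcast, a_blk, b_blk, c_blk, grid) are illustrative and are not part of the full source.
#include <mpi.h>
#include <stdlib.h>
/* Sketch of Algorithm 1 with nb = 1 on an r x c process mesh.  Each process holds an
   (m/r) x (k/c) block of A and a (k/r) x (n/c) block of B in row-major order and
   accumulates its (m/r) x (n/c) block of C. */
void summa_bcast(int m, int n, int k, int r, int c,
                 const int *a_blk, const int *b_blk, int *c_blk, MPI_Comm grid)
{
    int rank, my_row, my_col, i, j, t;
    MPI_Comm row_comm, col_comm;
    int *a_col = (int *) malloc((m / r) * sizeof(int));
    int *b_row = (int *) malloc((n / c) * sizeof(int));

    MPI_Comm_rank(grid, &rank);
    my_row = rank / c;
    my_col = rank % c;
    /* one communicator per mesh row and one per mesh column */
    MPI_Comm_split(grid, my_row, my_col, &row_comm);
    MPI_Comm_split(grid, my_col, my_row, &col_comm);

    for (i = 0; i < (m / r) * (n / c); ++i)
        c_blk[i] = 0;

    for (t = 0; t < k; ++t) {                    /* one rank-1 update per step */
        int cur_col = t / (k / c);               /* mesh column owning column t of A */
        int cur_row = t / (k / r);               /* mesh row owning row t of B */
        if (my_col == cur_col)                   /* pack local column t mod (k/c) of A */
            for (i = 0; i < m / r; ++i)
                a_col[i] = a_blk[i * (k / c) + t % (k / c)];
        if (my_row == cur_row)                   /* pack local row t mod (k/r) of B */
            for (j = 0; j < n / c; ++j)
                b_row[j] = b_blk[(t % (k / r)) * (n / c) + j];
        MPI_Bcast(a_col, m / r, MPI_INT, cur_col, row_comm);
        MPI_Bcast(b_row, n / c, MPI_INT, cur_row, col_comm);
        for (i = 0; i < m / r; ++i)              /* C += a_col x b_row (outer product) */
            for (j = 0; j < n / c; ++j)
                c_blk[i * (n / c) + j] += a_col[i] * b_row[j];
    }

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    free(a_col);
    free(b_row);
}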
The core idea of the SUMMA algorithm is that each processor collects the columns of the A sub-blocks held by the processors in its own mesh row and the rows of the B sub-blocks held by the processors in its own mesh column, then multiplies each such column by the corresponding row and accumulates the results into its sub-block of C. Finally, the processor with rank 0 collects the sub-blocks from all other processors to assemble the final matrix C.
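Viewed serially, each step of this process adds the outer product of one column of A and one row of B to C. The toy routine below (not part of the posted source) shows exactly this accumulation pattern, which is what each processor applies to its own C sub-block.
/* Serial illustration of the outer-product view behind SUMMA:
   C (m x n) = sum over t of (column t of A) x (row t of B), with A m x l and B l x n. */
void outer_product_mm(int m, int l, int n, const int *A, const int *B, int *C)
{
    int i, j, t;
    for (i = 0; i < m * n; ++i)
        C[i] = 0;
    for (t = 0; t < l; ++t)                      /* one rank-1 update per step */
        for (i = 0; i < m; ++i)
            for (j = 0; j < n; ++j)
                C[i * n + j] += A[i * l + t] * B[t * n + j];
}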
Compared with the Cannon algorithm, the advantage of SUMMA is that it can multiply an m x l matrix A by an l x n matrix B (m, l, and n need not be equal), whereas the Cannon algorithm can only multiply an n x n matrix A by an n x n matrix B, which is a serious limitation. However, because of the author's limited skill and to keep the matrix partitioning simple, this MPI implementation of SUMMA requires each matrix dimension (m, l, n) to be an integer multiple of the corresponding dimension of the processor mesh (which the user specifies); otherwise an error is reported. Suggestions from more experienced programmers on how to handle arbitrary matrix dimensions are very welcome.
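As an illustration of that restriction, a check along the following lines could reject unsuitable inputs early; the helper check_dims is hypothetical and does not appear in the posted source.
#include <stdio.h>
#include <mpi.h>
/* Hypothetical early check: A (m x l) needs m divisible by mesh_r and l by mesh_c,
   B (l x n) needs l divisible by mesh_r and n by mesh_c. */
static void check_dims(int my_rank, int m, int l, int n, int mesh_r, int mesh_c)
{
    if (m % mesh_r || l % mesh_r || l % mesh_c || n % mesh_c) {
        if (my_rank == 0)
            fprintf(stderr, "error: m, l, n must be multiples of the mesh dimensions\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
}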
Because the full source code is rather long, only part of the MPI source is shown here; if you need the complete source, contact me by email (contact information at the bottom of the post).
/* SUMMA algorithm */
void My_summa(const int my_rank, int *a, int *b)
{
    int k, i, j;
    int *c;
    int len_c;
    int dest, tag;
    int my_rank_row, my_rank_col;
    int *a_col_send, *b_row_send, *a_col_recv, *b_row_recv;
    MPI_Status status_a, status_b;

    my_rank_row = my_rank / mesh_c;
    my_rank_col = my_rank % mesh_c;

    a_col_send = (int *) malloc((m / mesh_r) * sizeof(int));
    b_row_send = (int *) malloc((n / mesh_c) * sizeof(int));
    a_col_recv = (int *) malloc((m / mesh_r) * sizeof(int));
    b_row_recv = (int *) malloc((n / mesh_c) * sizeof(int));

    /* Each processor sends the columns of its A block to the other processors in the same mesh row */
    for (k = 0; k < l / mesh_c; ++k)
    {
        for (i = 0; i < m / mesh_r; ++i)
        {
            a_col_send[i] = a[i * (l / mesh_c) + k];
        }
        for (i = 0; i < mesh_c; ++i)
        {
            dest = my_rank_row * mesh_c + i;
            tag  = my_rank_col * (l / mesh_c) + k;   /* tag = global column index in A */
            MPI_Send(a_col_send, m / mesh_r, MPI_INT, dest, tag, MPI_COMM_WORLD);
        }
    }
    /* Each processor sends the rows of its B block to the other processors in the same mesh column */
    for (k = 0; k < l / mesh_r; ++k)
    {
        for (i = 0; i < n / mesh_c; ++i)
        {
            b_row_send[i] = b[k * (n / mesh_c) + i];
        }
        for (i = 0; i < mesh_r; ++i)
        {
            dest = my_rank_col + i * mesh_c;
            tag  = my_rank_row * (l / mesh_r) + k;   /* tag = global row index in B */
            MPI_Send(b_row_send, n / mesh_c, MPI_INT, dest, tag, MPI_COMM_WORLD);
        }
    }
    /* Initialize the local C block */
    len_c = (m / mesh_r) * (n / mesh_c);
    c = (int *) malloc(len_c * sizeof(int));
    for (i = 0; i < len_c; ++i)
    {
        c[i] = 0;
    }

    for (k = 0; k < l; ++k)
    {
        /* Each processor receives columns of the A blocks from the processors in its mesh row
           and rows of the B blocks from the processors in its mesh column */
        MPI_Recv(a_col_recv, m / mesh_r, MPI_INT, my_rank_row * mesh_c + k / (l / mesh_c), k,
                 MPI_COMM_WORLD, &status_a);
        MPI_Recv(b_row_recv, n / mesh_c, MPI_INT, k / (l / mesh_r) * mesh_c + my_rank_col, k,
                 MPI_COMM_WORLD, &status_b);
        /* The processor now has all the information for this step: multiply and
           accumulate the outer product into its C sub-block */
        for (i = 0; i < m / mesh_r; ++i)
        {
            for (j = 0; j < n / mesh_c; ++j)
            {
                c[i * (n / mesh_c) + j] += a_col_recv[i] * b_row_recv[j];
            }
        }
    }

    /* Each processor sends its C sub-block to process 0 */
    MPI_Send(c, len_c, MPI_INT, 0, 99, MPI_COMM_WORLD);

    free(c);
    free(a_col_send);
    free(b_row_send);
    free(a_col_recv);
    free(b_row_recv);
}
/* Process 0 receives all of the C sub-blocks */
void Collect_C()
{
    int k, i, j;
    int len_c, id_recv;
    int *c_recv;
    int c_imin, c_imax, c_jmin, c_jmax;
    MPI_Status status;

    len_c = (m / mesh_r) * (n / mesh_c);
    c_recv = (int *) malloc(len_c * sizeof(int));

    /* Copy each received sub-block into its position in the global C matrix */
    for (k = 0; k < num_procs; ++k)
    {
        MPI_Recv(c_recv, len_c, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        id_recv = status.MPI_SOURCE;
        c_imin = (id_recv / mesh_c) * (m / mesh_r);
        c_imax = c_imin + m / mesh_r - 1;
        c_jmin = (id_recv % mesh_c) * (n / mesh_c);
        c_jmax = c_jmin + n / mesh_c - 1;
        for (i = c_imin; i <= c_imax; ++i)
        {
            for (j = c_jmin; j <= c_jmax; ++j)
            {
                C[i][j] = c_recv[(i - c_imin) * (n / mesh_c) + (j - c_jmin)];
            }
        }
    }
    free(c_recv);
}
/* Print a matrix */
void Print(char *str, int **matrix, int m, int n)
{
    int i, j;
    printf("%s", str);
    for (i = 0; i < m; ++i)
    {
        for (j = 0; j < n; ++j)
        {
            printf("%-6d", matrix[i][j]);
        }
        printf("\n");
    }
    printf("\n");
}
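The main() routine belongs to the omitted part of the source, as do the helpers Allocate_A_B_C, Random_A_B and Scatter_A_B that appear in the output below. Purely as a sketch of how the excerpted routines fit together, the hypothetical driver below declares the needed globals itself and has every rank fill its local blocks with 1s instead of scattering random data on rank 0, so it does not reproduce the original program's flow.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>

/* In the full source these globals are presumably declared near the top of summa.c,
   before the routines above; their exact declarations are not shown in the excerpt. */
int m, l, n, mesh_r, mesh_c, num_procs;
int **C;

int main(int argc, char *argv[])
{
    int my_rank, i;
    int *a_blk, *b_blk;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

    m = atoi(argv[1]);                        /* mpirun -np 9 summa 9 6 12  =>  m=9, l=6, n=12 */
    l = atoi(argv[2]);
    n = atoi(argv[3]);
    mesh_r = (int) sqrt((double) num_procs);  /* assumed square process mesh (3 x 3 for 9 ranks) */
    mesh_c = num_procs / mesh_r;

    /* Local blocks: (m/mesh_r) x (l/mesh_c) of A and (l/mesh_r) x (n/mesh_c) of B.
       Filled with 1s here; the real program fills and scatters random matrices instead. */
    a_blk = (int *) malloc((m / mesh_r) * (l / mesh_c) * sizeof(int));
    b_blk = (int *) malloc((l / mesh_r) * (n / mesh_c) * sizeof(int));
    for (i = 0; i < (m / mesh_r) * (l / mesh_c); ++i) a_blk[i] = 1;
    for (i = 0; i < (l / mesh_r) * (n / mesh_c); ++i) b_blk[i] = 1;

    if (my_rank == 0) {                       /* the assembled C lives only on rank 0 */
        C = (int **) malloc(m * sizeof(int *));
        for (i = 0; i < m; ++i)
            C[i] = (int *) malloc(n * sizeof(int));
    }

    My_summa(my_rank, a_blk, b_blk);          /* every rank computes and sends its C block */
    if (my_rank == 0) {
        Collect_C();                          /* rank 0 assembles the full C */
        Print("matrix C:\n", C, m, n);        /* with all-1 inputs, every entry equals l */
    }

    free(a_blk);
    free(b_blk);
    MPI_Finalize();
    return 0;
}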
Compile: mpicc summa.c -o summa -lm
Note: if the run fails with "p0_18495: p4_error: Child process exited while making connection to remote process on node2: 0, Killed by signal 2", the workaround is to run source /home/guest/.bashrc directly in the home directory.
Run: mpirun -np 9 summa 9 6 12
Run result:
Allocate_A_B_C cheers!
Random_A_B cheers!
Scatter_A_B cheers!
Collect_C cheers!
random matrix A:
54 9 90 76 54 45
52 70 1 99 64 31
40 34 34 31 7 16
82 26 18 70 29 44
64 58 29 71 98 89
42 5 50 84 81 4
29 86 26 83 85 42
66 77 76 0 8 35
69 91 13 87 13 43
random matrix B:
83 77