Analysis of Matrix Multiplication in Spark

Last Update:2016-03-27 Source: Internet

Author: User


Preface: Matrix multiplication is a common computational step in data mining and machine learning. In big data computation the shuffle process is unavoidable, and different ways of computing a matrix product shuffle different amounts of data. By studying these methods, we hope to offer some insight into optimizing the shuffle process of big data algorithms. There are many articles and papers on distributed matrix multiplication on the internet, but few of them analyze how it is implemented in Spark. This article discusses the implementation of distributed matrix multiplication in Spark in detail.
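Principle of distributed matrix multiplication: Matrix multiplication can be computed with either the inner product method or the outer product method. By granularity, implementations can further be divided into ordinary (element-wise) matrix implementations and block matrix implementations. For a matrix A of size n x m and a matrix B of size m x k, the product C = AB of size n x k is given by:

(1)  C_{ij} = \sum_{l=1}^{m} A_{il} B_{lj},  i = 1, ..., n,  j = 1, ..., k

(2)  C = \sum_{l=1}^{m} A_{:,l} B_{l,:}

Here A_{:,l} is column l of A and B_{l,:} is row l of B.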
Equation 1 is the inner product method: the inner product of a row of A and a column of B gives one element of C at a time. Equation 2 is the outer product method: the outer product of a column of A and a row of B gives an n x k matrix each time, and the m matrices obtained this way are added to give C.

Distributed matrix multiplication, inner product method: The inner product method computes one element of C at a time. Because each element of C is computed independently, the computation can run concurrently, and since C has n x k elements, the maximum supported concurrency is n x k. Computing one element of C uses the m elements of a row of A and the m elements of a column of B, so m elements from A and m elements from B must be shuffled to the node that computes that element. If the inner product method is applied directly, the maximum number of elements to shuffle is therefore 2 x m x n x k (because C has n x k elements). Each compute node must compute at least one element of C at a time, so its memory must hold at least 2m elements.

In practice, large matrix operations usually cannot reach a concurrency of n x k, so the actual shuffle volume is much smaller than 2 x m x n x k. When the concurrency is much smaller than the number of elements in C, several elements of C are computed on the same node, and the data already on that node can be reused instead of shuffling 2m elements for every element. This is also why the shuffle volume drops sharply when the inner product method is applied to block matrices.
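As a small single-machine illustration of the inner product method described above, the following sketch computes each element of C from one row of A and one column of B, and evaluates the 2 x m x n x k upper bound. The object and method names are made up for this example and are not part of Spark.

```scala
object InnerProductSketch {
  type Matrix = Array[Array[Double]]

  // C(i)(j) is the inner product of row i of A (n x m) and column j of B (m x k).
  def multiply(a: Matrix, b: Matrix): Matrix = {
    val n = a.length
    val m = b.length
    val k = b(0).length
    require(a.forall(_.length == m), "columns of A must equal rows of B")
    Array.tabulate(n, k) { (i, j) =>
      (0 until m).map(l => a(i)(l) * b(l)(j)).sum
    }
  }

  // Upper bound on shuffled elements when every element of C is computed
  // on its own node: each of the n * k elements pulls m values from A and m from B.
  def maxShuffledElements(n: Long, m: Long, k: Long): Long = 2L * m * n * k
}
```

Every call to the function inside `Array.tabulate` is independent of the others, which is what allows up to n x k tasks to run concurrently in the distributed setting.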
[Figure: shuffle process of the inner product method]

Disadvantages: The obvious drawback of this algorithm is that the shuffle volume is too large. In a distributed computing system, the shuffle process is an important factor affecting performance.

Advantages: The algorithm works well when either A or B is a small matrix. For example, if B is small, it can be broadcast to every node that holds part of the distributed matrix A. A then does not need to be shuffled when C is computed, and the computation can take full advantage of data locality.

Distributed matrix multiplication, outer product method: The outer product of a vector u of length n and a vector v of length k is the n x k matrix

u v^T = [ u_i v_j ],  i = 1, ..., n,  j = 1, ..., k
Combining the outer product formula with Equation 2, the outer product method first computes m matrices of size n x k and then adds them to obtain C. The m matrices are computed independently of one another, so the maximum supported concurrency is m. Computing any one of them uses the n elements of a column of A and the k elements of a row of B, so m x (n + k) elements must be moved to compute all m matrices.

After the m matrices are computed, they must be added. Matrix addition adds the elements at corresponding positions, so to finish each element of C, the data at that position in all m matrices must be shuffled to one compute node and summed. The maximum amount of data shuffled during the addition is therefore m x n x k (m matrices with n x k elements each).

Each compute node computes at least one of the m matrices, so it needs memory for at least n + k input elements. The output of each node, however, is n x k elements, which for large matrices is too much to keep in the memory of a single node, so the output is usually written to disk once it reaches a certain size. For large sparse matrices, the number of non-zero values in each of the m matrices is usually much smaller than n x k, so the volume of the second shuffle is usually much smaller than m x n x k.

[Figure: shuffle process of the outer product method]
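The outer product method can be sketched on a single machine as follows: each of the m terms is the outer product of a column of A and a row of B, and the terms are then summed element by element. This is only an illustration with made-up names; in the distributed case each term would be computed on a different node and the final sum would require the second shuffle.

```scala
object OuterProductSketch {
  type Matrix = Array[Array[Double]]

  // Outer product of a column of A (length n) and a row of B (length k): an n x k matrix.
  def outer(col: Array[Double], row: Array[Double]): Matrix =
    Array.tabulate(col.length, row.length)((i, j) => col(i) * row(j))

  // Element-wise sum of two matrices of equal shape.
  def add(x: Matrix, y: Matrix): Matrix =
    Array.tabulate(x.length, x(0).length)((i, j) => x(i)(j) + y(i)(j))

  // C = sum over l of outer(column l of A, row l of B); the m terms are independent.
  def multiply(a: Matrix, b: Matrix): Matrix = {
    val m = b.length
    require(a.forall(_.length == m), "columns of A must equal rows of B")
    (0 until m)
      .map(l => outer(a.map(row => row(l)), b(l)))
      .reduce((x, y) => add(x, y))
  }
}
```

Each intermediate matrix produced by `outer` has n x k entries, which is the memory cost the text warns about for dense inputs.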
Disadvantages: For large distributed matrices that are not sparse, the intermediate matrices are very large. Keeping an intermediate matrix entirely in the memory of a single compute node is practically impossible, so the intermediate results must be stored with the help of disk.

Advantages: The algorithm suits two cases: sparse matrices, and products where A and B are large but the result is small. In both cases each intermediate matrix fits entirely in memory, and the computation is relatively fast.

Distributed multiplication of block matrices: Block matrix multiplication works much like the non-block case and can also use either the inner product method or the outer product method. Using block matrices for distributed multiplication has two main benefits. First, it reduces the amount of data moved in the shuffle. Second, each small block can be computed locally by calling an existing matrix library, and a mature library is usually more efficient than a hand-written implementation. BLAS is a commonly used example of such a library. The basic formula for block matrix multiplication is as follows (from Wikipedia: https://zh.wikipedia.org/wiki/%E5%88%86%E5%A1%8A%E7%9F%A9%E9%99%A3), shown here for matrices split into 2 x 2 blocks:

C_{11} = A_{11} B_{11} + A_{12} B_{21}
C_{12} = A_{11} B_{12} + A_{12} B_{22}
C_{21} = A_{21} B_{11} + A_{22} B_{21}
C_{22} = A_{21} B_{12} + A_{22} B_{22}
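The computational complexity of the formula above is O(n^3). In 1969, Strassen used a divide-and-conquer algorithm to reduce the complexity of block matrix multiplication to O(n^{\log_2 7}), roughly O(n^{2.807}). The formula is as follows (from Wikipedia: https://zh.wikipedia.org/wiki/%E6%96%BD%E7%89%B9%E6%8B%89%E6%A3%AE%E6%BC%94%E7%AE%97%E6%B3%95):

M_1 = (A_{11} + A_{22})(B_{11} + B_{22})
M_2 = (A_{21} + A_{22}) B_{11}
M_3 = A_{11} (B_{12} - B_{22})
M_4 = A_{22} (B_{21} - B_{11})
M_5 = (A_{11} + A_{12}) B_{22}
M_6 = (A_{21} - A_{11})(B_{11} + B_{12})
M_7 = (A_{12} - A_{22})(B_{21} + B_{22})

C_{11} = M_1 + M_4 - M_5 + M_7
C_{12} = M_3 + M_5
C_{21} = M_2 + M_4
C_{22} = M_1 - M_2 + M_3 + M_6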
Single-machine Java packages such as Jampack and JAMA, which reportedly use the Strassen algorithm for matrix multiplication, do not implement parallel computation. In distributed computing, however, each small block can be distributed to a node and the corresponding single-machine package can then be called locally. In block matrix theory, the blocks of a matrix may have different numbers of rows, but for convenience, implementations of distributed block matrix multiplication usually give every block of a matrix the same number of rows.

In Spark, org.apache.spark.mllib.linalg.distributed.BlockMatrix comes with distributed matrix multiplication, and BlockMatrix multiplies distributed block matrices using the inner product method. There are also third-party implementations: the PASA laboratory at Nanjing University has improved distributed matrix multiplication on Spark (URL: http://pasa-bigdata.nju.edu.cn/project/Marlin.html), and the PASA package is likewise a distributed block matrix multiplication based on the inner product method.

The shuffle process of the inner and outer product methods for block matrices is the same as for the non-block matrices described above, except that each element becomes a small matrix, so only the shuffle volume is analyzed here. Suppose A and B are the same matrices as before, now divided into small blocks with r rows and c columns. The inner product method shuffles 2 x (m/c) x (n/r) x (k/r) x (r x c) = 2 x m x n x k / r elements, and the outer product method shuffles (m/c) x (n/r + k/r + (n/r) x (k/r)) x (r x c) = m x (n + k + n x k / r) elements. Block matrix multiplication therefore shuffles less data than the non-block algorithms. Spark's block matrix multiplication does not use the outer product method, mainly because of its high memory consumption.
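To make the shuffle volumes above concrete, the following sketch evaluates the four formulas for given dimensions. It follows the simplified block shape assumed in the text (r rows per block) and uses illustrative names that belong to no library.

```scala
object ShuffleVolumeSketch {
  // A is n x m, B is m x k; r is the number of rows per block, as in the text.

  // Inner product method, element by element: 2 * m * n * k.
  def innerElementwise(n: Long, m: Long, k: Long): Long = 2L * m * n * k

  // Inner product method on blocks: 2 * m * n * k / r.
  def innerBlocked(n: Long, m: Long, k: Long, r: Long): Long = 2L * m * n * k / r

  // Outer product method, element by element: m * (n + k) inputs plus up to m * n * k partial results.
  def outerElementwise(n: Long, m: Long, k: Long): Long = m * (n + k) + m * n * k

  // Outer product method on blocks: m * (n + k + n * k / r).
  def outerBlocked(n: Long, m: Long, k: Long, r: Long): Long = m * (n + k) + m * n * k / r
}
```

For n = m = k = 1000 and r = 100, the inner product method drops from 2,000,000,000 to 20,000,000 shuffled elements, and the outer product method from 1,002,000,000 to 12,000,000.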
Source analysis of Spark's built-in BlockMatrix multiplication: The necessary annotations are given as comments in the source code.

def multiply(other: BlockMatrix): BlockMatrix = {
  .......
  if (colsPerBlock == other.rowsPerBlock) {
    // GridPartitioner splits the numRowBlocks x other.numColBlocks grid of result blocks into partitions.
    val resultPartitioner = GridPartitioner(numRowBlocks, other.numColBlocks,
      math.max(blocks.partitions.length, other.blocks.partitions.length))
    // leftDestinations and rightDestinations both have type Map[(Int, Int), Set[Int]].
    // They record, ahead of time, which partitions each block of the left and right matrices must be shuffled to.
    val (leftDestinations, rightDestinations) = simulateMultiply(other, resultPartitioner)
    // Each block of A must be multiplied with the corresponding blocks in the columns of B.
    val flatA = blocks.flatMap { case ((blockRowIndex, blockColIndex), block) =>
      val destinations = leftDestinations.getOrElse((blockRowIndex, blockColIndex), Set.empty)
      destinations.map(j => (j, (blockRowIndex, blockColIndex, block)))
    }
    // Each block of B must be multiplied with the corresponding blocks in each row of A.
    val flatB = other.blocks.flatMap { case ((blockRowIndex, blockColIndex), block) =>
      val destinations = rightDestinations.getOrElse((blockRowIndex, blockColIndex), Set.empty)
      destinations.map(j => (j, (blockRowIndex, blockColIndex, block)))
    }
    // After the cogroup with resultPartitioner, all blocks of A and B needed for a given block of C = A * B
    // are in the same partition. reduceByKey can then combine locally (combineByKey), so in effect it only
    // performs the additions rather than a second large shuffle.
    val newBlocks = flatA.cogroup(flatB, resultPartitioner).flatMap { case (pId, (a, b)) =>
      a.flatMap { case (leftRowIndex, leftColIndex, leftBlock) =>
        b.filter(_._1 == leftColIndex).map { case (rightRowIndex, rightColIndex, rightBlock) =>
          // Local block multiplication uses the routines provided by com.github.fommil.netlib;
          // block addition uses the matrix addition provided by Breeze (the ScalaNLP package).
          val C = rightBlock match {
            case dense: DenseMatrix => leftBlock.multiply(dense)
            case sparse: SparseMatrix => leftBlock.multiply(sparse.toDense)
            case _ =>
              throw new SparkException(s"Unrecognized matrix type ${rightBlock.getClass}.")
          }
          ((leftRowIndex, rightColIndex), C.toBreeze)
        }
      }
    }.reduceByKey(resultPartitioner, (a, b) => a + b).mapValues(Matrices.fromBreeze)
    // TODO: Try to use aggregateByKey instead of reduceByKey to get rid of intermediate matrices
    new BlockMatrix(newBlocks, rowsPerBlock, other.colsPerBlock, numRows(), other.numCols())
  } else {
    .......
  }
}
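The structure of this method can be mimicked on local collections. In the sketch below, which uses plain Scala maps rather than RDDs and names invented for this example, pairing blocks whose inner indices match plays the role of cogroup plus filter, and grouping by result coordinates and summing plays the role of reduceByKey. Blocks missing from a map are treated as zero blocks and are never multiplied.

```scala
object BlockMultiplySketch {
  type Block = Array[Array[Double]]
  // (blockRow, blockCol) -> block; an absent key stands for an all-zero block.
  type Blocks = Map[(Int, Int), Block]

  // Dense local product of two blocks.
  def mul(x: Block, y: Block): Block =
    Array.tabulate(x.length, y(0).length) { (i, j) =>
      x(i).indices.map(l => x(i)(l) * y(l)(j)).sum
    }

  // Element-wise sum of two blocks of equal shape.
  def add(x: Block, y: Block): Block =
    Array.tabulate(x.length, x(0).length)((i, j) => x(i)(j) + y(i)(j))

  def multiply(a: Blocks, b: Blocks): Blocks = {
    // Pair each A block (i, l) with every stored B block (l, j).
    val partial = for {
      ((i, l), left) <- a.toSeq
      ((l2, j), right) <- b.toSeq
      if l == l2
    } yield ((i, j), mul(left, right))
    // Sum the partial products that belong to the same result block.
    partial.groupBy(_._1).map { case (key, parts) =>
      key -> parts.map(_._2).reduce((x, y) => add(x, y))
    }
  }
}
```

Because the pairing only iterates over stored blocks, a zero block of A or B contributes no work, which mirrors the effect of the destination maps computed by simulateMultiply.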
The simulateMultiply method used in the code above is important. Its source, with comments, is as follows:

private[distributed] def simulateMultiply(
    other: BlockMatrix,
    partitioner: GridPartitioner): (BlockDestinations, BlockDestinations) = {
  val leftMatrix = blockInfo.keys.collect() // blockInfo should already be cached
  val rightMatrix = other.blocks.keys.collect()
  // To read the code below, assume A * B = C. Block A11 is used when computing C11 through C1n,
  // so a copy of A11 is sent to every machine that computes C11 through C1n.
  val leftDestinations = leftMatrix.map { case (rowIndex, colIndex) =>
    // A block of the left matrix is multiplied with the blocks of the right matrix whose row index equals
    // its column index; this filter finds all stored blocks in that row of the right matrix.
    // Because of this check, a left block whose counterparts in the right matrix hold no values
    // is not replicated, which avoids computing with zero blocks.
    val rightCounterparts = rightMatrix.filter(_._1 == colIndex)
    // The additions that follow the multiplications (reduceByKey) can be combined on the same machine,
    // so this directly yields the partitions in which each block is needed once the multiplication is done.
    val partitions = rightCounterparts.map(b => partitioner.getPartition((rowIndex, b._2)))
    ((rowIndex, colIndex), partitions.toSet)
  }.toMap
  val rightDestinations = rightMatrix.map { case (rowIndex, colIndex) =>
    val leftCounterparts = leftMatrix.filter(_._2 == rowIndex)
    val partitions = leftCounterparts.map(b => partitioner.getPartition((b._1, colIndex)))
    ((rowIndex, colIndex), partitions.toSet)
  }.toMap
  (leftDestinations, rightDestinations)
}
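The destination computation can be tried out on plain block coordinates. The sketch below is a simplified stand-in and not Spark code: `partitionOf` is a hypothetical replacement for GridPartitioner.getPartition that assumes one partition per result block, numbered row by row.

```scala
object SimulateMultiplySketch {
  type BlockIndex = (Int, Int)

  // Hypothetical stand-in for GridPartitioner.getPartition: one partition per result block.
  def partitionOf(resultRow: Int, resultCol: Int, numColBlocks: Int): Int =
    resultRow * numColBlocks + resultCol

  // For each stored block of the left and right matrices, the set of partitions it must be sent to.
  def destinations(
      left: Seq[BlockIndex],
      right: Seq[BlockIndex],
      numColBlocks: Int): (Map[BlockIndex, Set[Int]], Map[BlockIndex, Set[Int]]) = {
    val leftDest = left.map { case (row, col) =>
      // A(row, col) only meets stored B blocks whose row index equals col.
      val targets = right.filter(_._1 == col).map(b => partitionOf(row, b._2, numColBlocks))
      (row, col) -> targets.toSet
    }.toMap
    val rightDest = right.map { case (row, col) =>
      // B(row, col) only meets stored A blocks whose column index equals row.
      val targets = left.filter(_._2 == row).map(a => partitionOf(a._1, col, numColBlocks))
      (row, col) -> targets.toSet
    }.toMap
    (leftDest, rightDest)
  }
}
```

With left blocks (0,0), (0,1), (1,1) and right blocks (0,0), (1,0), (1,1) on a 2 x 2 result grid, left block (0,0) is sent only to partition 0 because the right matrix stores a single block in row 0. A block with no stored counterpart gets an empty destination set and is not shuffled at all.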

As the code shows, block matrix multiplication in Spark requires each executor's memory to hold at least all the non-zero blocks in one row of the left matrix and all the non-zero blocks in one column of the right matrix. Only one shuffle is required during the computation.
BlockMatrix multiplication implemented by PASA on Spark: The source, with comments, is as follows:

def multiply(other: BlockMatrix): BlockMatrix = {
  .......
  if (numBlksByCol() == other.numBlksByRow()) {
    // num of rows to be split of this matrix
    val mSplitNum = numBlksByRow()
    // num of columns to be split of this matrix, meanwhile num of rows of that matrix
    val kSplitNum = numBlksByCol()
    // num of columns to be split of that matrix
    val nSplitNum = other.numBlksByCol()
    val partitioner = new MatrixMultPartitioner(mSplitNum, kSplitNum, nSplitNum)

    val thisEmitBlocks = blocks.flatMap({ case (blkId, blk) =>
      // Each block of the left matrix is multiplied with every block in the corresponding row of the
      // right matrix. Each row of the right matrix has nSplitNum blocks, so each block is copied nSplitNum times.
      // Zero blocks in the right matrix are not taken into account here, so the local matrix
      // multiplication after the join includes unnecessary zero-value computation.
      Iterator.tabulate[(BlockID, SubMatrix)](nSplitNum)(i => {
        val seq = blkId.row * nSplitNum * kSplitNum + i * kSplitNum + blkId.column
        (BlockID(blkId.row, i, seq), blk)})
    }).partitionBy(partitioner)
    val otherEmitBlocks = other.blocks.flatMap({ case (blkId, blk) =>
      Iterator.tabulate[(BlockID, SubMatrix)](mSplitNum)(i => {
        val seq = i * nSplitNum * kSplitNum + blkId.column * kSplitNum + blkId.row
        (BlockID(i, blkId.column, seq), blk)
      })
    }).partitionBy(partitioner)
    if (kSplitNum != 1) {
      // The join below uses MatrixMultPartitioner, while reduceByKey uses HashPartitioner.
      // The two steps use different partitioners, so two shuffles are unavoidable.
      val result = thisEmitBlocks.join(otherEmitBlocks).mapPartitions(iter =>
        iter.map { case (blkId, (block1, block2)) =>
          (BlockID(blkId.row, blkId.column), block1.multiply(block2))
        }
      ).reduceByKey((a, b) => a.add(b))
      new BlockMatrix(result, numRows(), other.numCols(), mSplitNum, nSplitNum)
    } else {
      val result = thisEmitBlocks.join(otherEmitBlocks).mapPartitions(iter =>
        iter.map { case (blkId, (block1, block2)) =>
          (BlockID(blkId.row, blkId.column), block1.multiply(block2))
        }
      )
      new BlockMatrix(result, numRows(), other.numCols(), mSplitNum, nSplitNum)
    }
  }
  ......
}
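The key scheme in this listing can be checked in isolation. The sketch below reproduces only the two seq formulas on plain tuples (result row, result column, seq) and counts the pairs the join would produce; it does not use the PASA classes, and the object and method names are invented for this example.

```scala
object PasaEmitSketch {
  // (result block row, result block column, seq), mirroring the fields of the join key.
  type Key = (Int, Int, Int)

  // Every left block (row, col) is copied once per column block of the right matrix.
  def emitLeft(row: Int, col: Int, kSplit: Int, nSplit: Int): Seq[Key] =
    (0 until nSplit).map(i => (row, i, row * nSplit * kSplit + i * kSplit + col))

  // Every right block (row, col) is copied once per row block of the left matrix.
  def emitRight(row: Int, col: Int, mSplit: Int, kSplit: Int, nSplit: Int): Seq[Key] =
    (0 until mSplit).map(i => (i, col, i * nSplit * kSplit + col * kSplit + row))

  // Number of block pairs the join yields for full mSplit x kSplit and kSplit x nSplit grids.
  def joinedPairs(mSplit: Int, kSplit: Int, nSplit: Int): Int = {
    val leftKeys = for {
      r <- 0 until mSplit
      c <- 0 until kSplit
      key <- emitLeft(r, c, kSplit, nSplit)
    } yield key
    val rightKeys = (for {
      r <- 0 until kSplit
      c <- 0 until nSplit
      key <- emitRight(r, c, mSplit, kSplit, nSplit)
    } yield key).toSet
    leftKeys.count(key => rightKeys.contains(key))
  }
}
```

A left key and a right key are equal exactly when the left block's column index equals the right block's row index, so the join yields mSplit x kSplit x nSplit block pairs regardless of how many blocks are actually zero.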

Summary: The source analysis shows that Spark's built-in BlockMatrix multiplication is more efficient than the BlockMatrix multiplication implemented by PASA: it avoids unnecessary computation on zero blocks and reduces the amount of shuffling. In practice, pay attention to memory usage when using Spark's BlockMatrix. When choosing the block size, consider not only memory but also whether the data within each block can be kept as compact as possible, so that zero-value computation is reduced.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
