Using Hadoop to multiply large matrices (II)



The method we introduced in "Using Hadoop to multiply large matrices" has one defect: the intermediate files occupy a large amount of storage space during computation. This article focuses on solving that problem.

Concept of Matrix Multiplication

The traditional method of matrix multiplication is row-by-column: a row of the left matrix is multiplied by a column of the right matrix. When this method is applied to sparse matrices, however, most of the work is spent on zero elements, which wastes computation and reduces efficiency. To solve this problem, we use column-by-row multiplication instead: each element in column k of the left matrix is multiplied by every element in row k of the right matrix, and products that fall on the same position of the result are summed. Because only nonzero elements are ever paired, this avoids the wasted computation in sparse matrix multiplication. The computing process is shown in Figure 1.

 

Figure 1: Column-by-row matrix multiplication
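The column-by-row idea can be sketched on a single machine as follows. This is only an illustration, not the Hadoop implementation: the dictionary layout `{(row, col): value}` and the function name `column_row_multiply` are assumptions made for this example.

```python
from collections import defaultdict

def column_row_multiply(left, right):
    """Multiply two sparse matrices stored as {(row, col): value} dicts."""
    # Group the left matrix by column and the right matrix by row.
    left_cols = defaultdict(list)   # k -> [(i, value)]
    right_rows = defaultdict(list)  # k -> [(j, value)]
    for (i, k), v in left.items():
        left_cols[k].append((i, v))
    for (k, j), v in right.items():
        right_rows[k].append((j, v))

    product = defaultdict(float)
    # Only indices k that occur on both sides contribute anything,
    # so zero elements are never touched.
    for k in left_cols.keys() & right_rows.keys():
        for i, lv in left_cols[k]:
            for j, rv in right_rows[k]:
                product[(i, j)] += lv * rv
    return dict(product)

left = {(0, 0): 1.0, (0, 2): 2.0, (1, 1): 3.0}
right = {(0, 1): 4.0, (2, 0): 5.0}
result = column_row_multiply(left, right)
```

Here column 1 of the left matrix has no matching row in the right matrix, so its element is skipped without any multiplication.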

Data preprocessing

To make it easier for the Map-Reduce model to process matrix elements, all matrix elements are stored in a text file, one element per line. For sparse matrices, zero elements are not included in the input text, as shown in Figure 2.

 

Figure 2: Input matrix elements

 

Figure 3: Preprocessed matrix elements

Take Figure 2 as an example. Assume that one record is: . It indicates the value of the element in the first row, second column of the left matrix.

In the Map phase, each input line is preprocessed. If a record represents an element of the left matrix, its column number is extracted as the key and the remaining information forms the value; if a record represents an element of the right matrix, its row number is extracted as the key and the remaining information forms the value, as shown in Figure 3. The reason is that in the following Reduce phase records are grouped by key, so the elements in column k of the left matrix meet the elements in row k of the right matrix, which are exactly the elements that must be multiplied together.
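A minimal sketch of this preprocessing step is given below. The record layout "matrix,row,col,value" with the tags "L" and "R" is a hypothetical one chosen for illustration, since the actual layout is only shown in Figures 2 and 3; the function name `preprocess_map` is likewise an assumption.

```python
def preprocess_map(line):
    """Map one input record to a (key, value) pair.

    Assumed (hypothetical) record layout: "<matrix>,<row>,<col>,<value>",
    where <matrix> is "L" for the left matrix or "R" for the right matrix.
    """
    matrix, row, col, value = line.strip().split(",")
    if matrix == "L":
        # Left matrix: the column number becomes the key.
        return col, f"L,{row},{value}"
    if matrix == "R":
        # Right matrix: the row number becomes the key.
        return row, f"R,{col},{value}"
    raise ValueError(f"unknown matrix tag: {matrix!r}")

pairs = [preprocess_map(r) for r in ["L,1,2,0.5", "R,2,7,4.0"]]
```

Both sample records end up under key "2": the left element sits in column 2 and the right element in row 2, so they will meet in the same Reduce call.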

Statistics and Segmentation

When the matrices are large enough, a single column of the left matrix or a single row of the right matrix may no longer fit in memory. To improve the scalability of matrix multiplication, we segment the elements of the left matrix by column and the elements of the right matrix by row. A single computing node then only needs to load one segment of the left matrix and one segment of the right matrix and multiply them in memory, which removes the memory limit. Segment multiplication is shown in Figure 4.

 

Next, we use Figure 4 to illustrate how the Reduce stage completes statistics and segmentation. In the Reduce stage, all values with the same key are first aggregated into a value list. If the key is k, the value list contains all the elements in column k of the left matrix and in row k of the right matrix, mixed together. A first traversal of the value list counts the elements in column k of the left matrix, Mk, and the elements in row k of the right matrix, Nk. A second traversal then segments the elements in column k of the left matrix and in row k of the right matrix. Assuming that each segment contains at most w elements, column k of the left matrix is divided into ⌈Mk/w⌉ segments and row k of the right matrix into ⌈Nk/w⌉ segments.
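The two traversals can be sketched as follows. This is an illustration under assumptions, not the original implementation: values are taken to be strings tagged "L,..." or "R,..." (a hypothetical layout), and the value list is assumed to be available as an in-memory list so that it can be traversed twice.

```python
import math

def segment_reduce(key, values, w):
    """Count and segment the values that share one key.

    `values` holds strings tagged "L,..." or "R,..." (hypothetical layout).
    Returns (m_k, n_k, left_segments, right_segments).
    """
    # First traversal: count the elements coming from each matrix.
    m_k = sum(1 for v in values if v.startswith("L,"))
    n_k = len(values) - m_k

    # Second traversal: cut each side into segments of at most w elements.
    left_segments, right_segments = [], []
    for v in values:
        segments = left_segments if v.startswith("L,") else right_segments
        if not segments or len(segments[-1]) == w:
            segments.append([])  # start a new segment
        segments[-1].append(v)

    # The segment counts agree with ceil(Mk / w) and ceil(Nk / w).
    assert len(left_segments) == math.ceil(m_k / w)
    assert len(right_segments) == math.ceil(n_k / w)
    return m_k, n_k, left_segments, right_segments

values = ["L,1,0.5", "R,7,4.0", "L,3,1.5", "L,4,2.0",
          "R,9,1.0", "L,8,3.0", "L,9,6.0"]
m_k, n_k, left_segs, right_segs = segment_reduce("2", values, w=2)
```

With w = 2, the five left elements form three segments and the two right elements form one, so only the last left segment is partially filled.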

 

Figure 5: Matrix segment information stored in the distributed cache

We represent segment i of column k of matrix L in the following format:

 

One field of this format indicates how many copies of the segment are required in the next step; element_list denotes the set of elements in the segment.

Similarly, segment j of row k of matrix R is represented in the following format:

 

To facilitate the subsequent Map-Reduce steps, the segments are stored in a file on disk, with one record (line) per segment. At the same time, we store the segment information of both matrices in the distributed cache, which helps solve the communication and data lookup problems between different nodes in the subsequent steps. The storage format is shown in Figure 5.

Figure 5 shows that the number of elements in column 1 of matrix L is M1 and each segment holds w elements, so the column has ⌈M1/w⌉ segments; similarly, the number of elements in row 1 of matrix R is N1 and each segment holds w elements, so the row has ⌈N1/w⌉ segments.

Copy task distribution: the Map iteration algorithm

Map iteration algorithm

As Figure 4 shows, the segments of the two matrices must be multiplied pairwise. For example, segment i of column k of matrix L has to be multiplied by every segment of row k of matrix R, so its content must be copied once for each of those segments. Similarly, each segment of row k of R must be copied once for each segment of column k of L. The copy operation is, of course, done through Map-Reduce. The problem is that if each segment of the two matrices needs a large number of copies, a single Map task has to copy each record many times, which greatly prolongs its execution time, while many other computing nodes may not take part in the operation at all.

 

To solve these problems, we propose the "Map iterative copy task distribution algorithm", which distributes the copy tasks of each record (each segment) across nodes. This effectively limits the number of copies any single node makes of a segment and involves more nodes in the copy operation.

So that each record knows which copies it is responsible for, we make a simple modification to formulas (1) and (2):

 

 

Formula (3) indicates that the record (segment) needs to be copied, with copy IDs running from 1 up to the number of copies required. Formula (4) is interpreted in the same way.

Take Figure 7 as an example. There, "... 1#10000 ..." is an abbreviation of formula (3) or formula (4) and indicates that a record requires 10000 copies. If every segment requires 10000 copies, only N nodes are involved in the copy operation. To let more computing nodes take part, we designed the Map iterative copy task distribution algorithm. Assuming a distribution expansion rate of 10, each iteration enlarges the file about 10 times, so after one iteration about 10 x N computing nodes take part in the copy operation, and after three iterations about 1000 x N. With 1000 x N nodes involved, each record has to be copied at most 10 times, as shown in Figure 7.
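The effect of the iterations on a single copy range can be sketched as follows. The exact splitting rule is only shown in Figure 7, so the even split used here, and the names `split_copy_range` and `distribute`, are assumptions for illustration.

```python
def split_copy_range(start, end, rate):
    """Split the copy-ID range [start, end] into at most `rate` sub-ranges."""
    total = end - start + 1
    if total <= rate:
        # Already small enough: leave it for the final copy step.
        return [(start, end)]
    size = -(-total // rate)  # ceiling division
    return [(s, min(s + size - 1, end)) for s in range(start, end + 1, size)]

def distribute(ranges, rate, iterations):
    """Apply the splitting step `iterations` times to every range."""
    for _ in range(iterations):
        ranges = [sub for (s, e) in ranges
                  for sub in split_copy_range(s, e, rate)]
    return ranges

# One record that needs copies 1..10000, expansion rate 10, three iterations.
ranges = distribute([(1, 10000)], rate=10, iterations=3)
```

After three iterations the single record has become 1000 records, each responsible for at most 10 copies; one of them carries the range 2191 to 2200.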

Number of iterations

In real-world large matrix multiplication the matrices are usually sparse, so the number of elements differs from row to row and from column to column, and the number of copies each segment needs is not known in advance. We therefore have to calculate the number of Map iterations and use it to control the Map iteration process. To do so, we use the segment information stored in the distributed cache (Figure 5) to obtain the maximum number of segments and combine it with the distribution expansion rate n to calculate the number of Map iterations using formula (5).
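Formula (5) itself is not reproduced in this text. The sketch below is therefore only an assumption that agrees with the worked example above (10000 copies, expansion rate 10, three iterations, at most 10 copies left per record): it returns the smallest number of iterations t such that n to the power t + 1 reaches the maximum copy count. The name `map_iterations` is hypothetical.

```python
def map_iterations(max_copies, rate):
    """Smallest t with rate ** (t + 1) >= max_copies, using integers only.

    Assumption: after t iterations each record should need at most
    `rate` copies, which the final Map round then produces.
    """
    if rate < 2:
        raise ValueError("the expansion rate must be at least 2")
    t, reach = 0, rate
    while reach < max_copies:
        reach *= rate
        t += 1
    return t

iterations = map_iterations(10000, 10)
```

Integer multiplication is used instead of a floating-point logarithm to avoid rounding problems at exact powers of the rate.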

 

Final computing module

After the copy tasks have been distributed, two more rounds of Map-Reduce are needed to complete the matrix multiplication.

The first round of Map-Reduce: segment copying and pairing

In this round, we first complete the distributed copy tasks in the Map phase. Suppose the record "... 2191#2200 ..." in Figure 7 is a segment of matrix L. Its original format is as follows:

 

After the copy operation is executed in Map, the record style is as follows:

K-i-2191 element_list

K-i-2192 element_list

......

K-i-2199 element_list

K-i-2200 element_list

If the record "... 2191#2200 ..." in Figure 7 is instead a segment of matrix R, its original format is as follows:

 

After the copy operation is executed in Map, the record style is as follows:

K-2191-j element_list

K-2192-j element_list

......

K-2199-j element_list

K-2200-j element_list
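The copy step for both kinds of record, and the pairing of copies that share a key, can be sketched as follows. The key layout "k-(L segment)-(R segment)" follows the record styles listed above; the function names, the tuple-based record layout, and the sample values are assumptions for illustration.

```python
from collections import defaultdict

def copy_map(side, k, seg, start, end, element_list):
    """Expand one record with copy range start#end into keyed copies."""
    for other in range(start, end + 1):
        # Keys read "k-<L segment>-<R segment>", so an L copy and an R copy
        # that belong together receive the same key.
        key = f"{k}-{seg}-{other}" if side == "L" else f"{k}-{other}-{seg}"
        yield key, (side, element_list)

def pair_reduce(records):
    """Merge the L segment and the R segment that share a key."""
    grouped = defaultdict(dict)
    for key, (side, elements) in records:
        grouped[key][side] = elements
    return {key: (v["L"], v["R"]) for key, v in grouped.items() if len(v) == 2}

# Segment 3 of column 5 of L is responsible for copies 2191..2200;
# segment 2192 of row 5 of R is responsible for copies 1..10.
records = list(copy_map("L", 5, 3, 2191, 2200, ["a", "b"]))
records += list(copy_map("R", 5, 2192, 1, 10, ["c"]))
pairs = pair_reduce(records)
```

In this small example only the key "5-3-2192" receives both an L segment and an R segment; the remaining copies would find their partners in records handled by other Map tasks.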

After this round of Map, each key (the first half of each record after copying) corresponds to two values: one segment of the L matrix and one segment of the R matrix. The two values with the same key are merged in the Reduce phase of this round. The result is as follows:

Xiao Zhou, cloud computing group of a research and development center
