Recently, a very capable general manager at my company resigned and invited me to a farewell dinner. She said to me, "You are a very good programmer." Then she immediately apologized: "I'm sorry, I called you a programmer. Did I insult you?" I was surprised. Are programmers really considered so low-end and so despised? Perhaps these days even people who sell pirated discs or repair computers say they work in IT, and ordinary people may have no idea what programmers actually do. In fact, there are many kinds of programmers. Some can only write if-then-else, and some can only copy existing examples, but I think a real programmer must be an expert in some field: maybe a mathematician, maybe a physicist, or maybe a specialist in some subfield of computing. Such a programmer combines theory with practice and goes beyond pure theory. My goal is to become a programmer who can be proud of the title.
When we talk about cloud computing, we have to mention big data, and processing big data is inseparable from distributed computing. In the Internet industry, whether for product recommendation, friend recommendation, or PageRank, the numbers of items and users to be processed are extremely large, ranging from millions and tens of millions up to hundreds of millions of records. Many excellent recommendation algorithms have been built on such data, and most of them rely on matrix operations. A single computer can no longer process data on this scale. Simply put, the memory of one server may not be enough to load even half of a matrix, let alone compute with it. "When one ox cannot pull the cart, people rarely look for a bigger and stronger ox; they find more oxen to pull it together." That is distributed computing, and Hadoop is a powerful tool for processing large record sets on distributed clusters.
I have recently become interested in recommendation algorithms and have studied some of them. Once you have thoroughly worked through the mathematical formulas, you feel the urge to implement them, but the matrix operations in those formulas are not so simple! So I want to start by studying super-large-scale matrix multiplication. On the one hand, I want to build up technical reserves for future work on large-scale matrix operations and recommendation algorithms. On the other hand, I want to truly experience the fun of implementing distributed computation with Hadoop. Most important of all, I want to write code that contains original ideas, research value, and technical depth.
This article first reviews the existing methods for large-matrix multiplication and points out their shortcomings. It then proposes our own method to solve those problems. Finally, through experiments, we observe the problems in our own method and optimize it to address them.
- Row-column Multiplication
In traditional matrix multiplication, each row of matrix A is multiplied by each column of matrix B. Assume that A has size (m * r) and B has size (r * n); then the result matrix C has size (m * n). Element C_{i,j} of C is obtained by multiplying the elements of row i of A by the corresponding elements of column j of B and summing the products. The formula is as follows:
C_{i,j} = \sum_{k=1}^{r} A_{i,k} B_{k,j}    (1)
The computation of each C_{i,j} is independent, so different elements can be computed by different computing nodes.
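As a minimal single-machine sketch of row-column multiplication, the Python below computes every C_{i,j} as a separate task. The function names `cell` and `row_column_multiply` are hypothetical, and the loop only stands in for what would be separate computing nodes in a cluster.

```python
from itertools import product


def cell(A, B, i, j):
    """Compute C[i][j] using only row i of A and column j of B."""
    return sum(A[i][k] * B[k][j] for k in range(len(B)))


def row_column_multiply(A, B):
    """Multiply A (m x r) by B (r x n), one independent task per cell."""
    m, n = len(A), len(B[0])
    C = [[0] * n for _ in range(m)]
    # Each (i, j) task is independent of the others, so in a cluster
    # each one could be handed to a different computing node.
    for i, j in product(range(m), range(n)):
        C[i][j] = cell(A, B, i, j)
    return C


A = [[1, 2, 3], [4, 5, 6]]        # 2 x 3
B = [[7, 8], [9, 10], [11, 12]]   # 3 x 2
C = row_column_multiply(A, B)
```

Note that `cell` needs a full row of A and a full column of B at once, which is exactly the memory requirement discussed next.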
This method has two drawbacks:
1. The matrix size is limited. If matrix A or matrix B is very large, a computing node may be limited by its memory and unable to load row i of A or column j of B.
2. Sparse matrices gain no advantage. If A and B are sparse, computing C_{i,j} still requires checking which positions in row i of A and column j of B hold zeros; in other words, the whole of row i and column j must still be loaded. Positions with no stored entry are filled with 0 during the operation. This leads back to the problem above: the data may not fit in memory.
When a matrix grows beyond a certain size, a single server cannot process it because of memory limits. However, a matrix can naturally be partitioned, so many block-based matrix algorithms have emerged. The large-matrix multiplication method introduced in the book The Beauty of Mathematics is block-based. A brief introduction follows:
1. When matrix A is large vertically but not horizontally, we split A by rows into blocks and multiply each block by matrix B. With Hadoop, these computations can run in parallel, as shown in Figure 1:
A1 * B = C1, A2 * B = C2, ... Each part can be computed on a different computing node, and the results are finally combined.
2. When matrix A is truly super-large (both horizontally and vertically), the matrix B multiplied with it must also be super-large (at least vertically). In this case, both A and B need to be partitioned by rows and columns, and the block computations are handed to different computing nodes, as shown in Figure 2.
In the figure, each block of A must be multiplied by the block at the corresponding position in B. These block multiplications can be performed by different computing nodes. Finally, the partial results of the different blocks must be tracked precisely and combined (mainly by addition) to obtain the final result C.
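The second case can be sketched on a single machine as follows. This is only an illustration under assumptions: the helper names `matmul`, `block`, and `block_multiply` and the square block size `bs` are hypothetical, and each innermost loop iteration stands in for a task that would run on its own computing node.

```python
def matmul(X, Y):
    """Plain multiplication of two small in-memory blocks."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def block(M, r0, r1, c0, c1):
    """Cut rows r0..r1-1 and columns c0..c1-1 out of M."""
    return [row[c0:c1] for row in M[r0:r1]]


def block_multiply(A, B, bs):
    """Multiply A by B using blocks of at most bs x bs elements."""
    m, r, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    for i0 in range(0, m, bs):
        for j0 in range(0, n, bs):
            for k0 in range(0, r, bs):
                # One independent block task: A block times matching B block.
                P = matmul(block(A, i0, i0 + bs, k0, k0 + bs),
                           block(B, k0, k0 + bs, j0, j0 + bs))
                # Combine partial results by addition into the C block.
                for di, row in enumerate(P):
                    for dj, v in enumerate(row):
                        C[i0 + di][j0 + dj] += v
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
C = block_multiply(A, B, 2)
```

Even in this small form, the bookkeeping needed to route each partial product to the right block of C is visible, which is the organizational burden the drawbacks below refer to.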
The block-based approach has the following drawbacks:
1. The block size is hard to choose for different matrix scales, and it is limited by the memory size.
2. Computing and organizing the block-to-block products is cumbersome.
3. It is not well suited to sparse matrix operations (zero values occupy a large amount of storage space and cause many useless operations).
- Algorithm Based on Minimum-Granularity Multiplication
The name of this algorithm is my own; I named it after the principle it is based on.
Both "row-column multiplication" and "block operations" are limited by the memory of the computing nodes. Is there an approach that does not depend on the memory size of a node? The answer is: yes! The minimum granularity of matrix multiplication is the product of two numbers, one from each matrix. For example, the product A_{i,k} * B_{k,j} is one component of C_{i,j}.
Assume there are two super-large matrices A and B, where A has size (m * r) and B has size (r * n). Counting the minimum-granularity multiplications in the matrix product, it is not hard to see that each element A_{i,k} of A must be multiplied in turn by the elements B_{k,j} (j = 1, 2, ..., n) of B, and each product is a component of C_{i,j}. Likewise, each element B_{k,j} of B must be multiplied in turn by the elements A_{i,k} (i = 1, 2, ..., m) of A, and each product is a component of C_{i,j}. See Figure 3.
Because each product A_{i,k} * B_{k,j} is independent, the products can be computed by different computing nodes. Finally, the products are grouped by the key (i, j) and added to obtain C_{i,j}. Each computing node loads only two numbers per multiplication and does not need to load a matrix block, a row, or a column, so there is no memory limit. In theory, as long as Hadoop's HDFS file system is large enough, matrix multiplication of any scale can be computed.
In Map-Reduce, each Map input record is processed only once and then discarded. Therefore, following the idea in Figure 3, before multiplication we need to generate n replicas of each element of matrix A and m replicas of each element of matrix B, and pair each replica with its counterpart. For example, A_{i,k} needs n copies, each paired with the corresponding element B_{k,j} of B (j = 1, 2, ..., n), with the row number i of the element of A and the column number j of the element of B forming the key i-j. Each record therefore holds a key i-j together with the two elements A_{i,k} and B_{k,j} to be multiplied.
Taking a file of such records as the Map input, the Map phase performs the multiplications and the Reduce phase adds the products by key to obtain the matrix product.
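The Map and Reduce steps just described can be simulated in plain Python. This is a sketch under assumptions rather than Hadoop code: the names `build_paired_records`, `map_phase`, and `reduce_phase` are hypothetical, indices are 0-based, and the pre-paired input file is built in memory, which is precisely the step that is hard at scale.

```python
from collections import defaultdict


def build_paired_records(A, B):
    """Pair every A[i][k] with every B[k][j] under the key (i, j)."""
    records = []
    for i, row in enumerate(A):
        for k, a in enumerate(row):
            for j, b in enumerate(B[k]):
                records.append(((i, j), a, b))
    return records


def map_phase(records):
    """Map: multiply the two numbers carried by each record."""
    for key, a, b in records:
        yield key, a * b


def reduce_phase(pairs):
    """Reduce: add up all products that share the key (i, j)."""
    sums = defaultdict(int)
    for key, value in pairs:
        sums[key] += value
    return dict(sums)


A = [[1, 2, 3], [4, 5, 6]]        # m = 2, r = 3
B = [[7, 8], [9, 10], [11, 12]]   # r = 3, n = 2
records = build_paired_records(A, B)
C = reduce_phase(map_phase(records))
```

The Map step only ever sees two numbers at a time, but the input already contains m * r * n paired records, 12 in this tiny example.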
- Disadvantages and difficulties
1. Preparing the copies of the matrix elements.
To use the above format as the initial Map input, we have to arrange the data into that format in advance, which is an arduous task for two super-large matrices. Matrix elements generally come from a database (for example, in product recommendation, user data and product data are stored in databases). If the file in the above format is assembled directly as the Map input, the number of database queries needed is:
m * r * n + r * n * m
Since m, r, and n are extremely large, this number of queries is intolerable. The ideal number of database queries is:
m * r + r * n
That is, each matrix element is retrieved only once.
Another approach is to retrieve each matrix element only once and generate its copies inside Map-Reduce. This raises another problem: if the elements of A and B are copied during Map-Reduce, the running time of a single Map becomes unacceptable. For example, if a Map block is 64 MB, it holds about 5 million elements of matrix A; if n for matrix B is 1 billion, the node running that Map must generate 5 million * 1 billion copies, which is intolerable.
2. How to pair the elements to be multiplied in the two matrices.
Pairing the elements through database queries is generally infeasible because the time complexity is too high. Map-Reduce can be used to pair them instead. However, a Map processes each input record only once and retains nothing in memory afterward, so it is difficult to match elements of the two matrices.
3. File size.
For super-large matrices, m for A and n for B are so large that a huge number of copies must be made. Even though sparse elements (those with value 0) are excluded from the computation, the copied file is extremely large. I ran an experiment with two dense matrices, A (1000 * 1000) and B (1000 * 1000), making 1000 copies of each element of A and 1000 copies of each element of B according to the rules. The number of copied records was 2 * 10^9, and the file size reached 24 GB. For matrices with dimensions in the hundreds of millions, the file size grows dramatically, because the number of records is proportional to m * r * n.
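The counts in this section can be checked with a little arithmetic. The sketch below is only illustrative: the function names are hypothetical, and the bytes-per-record figure assumes that 24 GB means 24 * 10^9 bytes.

```python
def naive_query_count(m, r, n):
    """Records (or queries) when every copy is produced separately."""
    return m * r * n + r * n * m


def ideal_query_count(m, r, n):
    """Queries when each matrix element is retrieved exactly once."""
    return m * r + r * n


m = r = n = 1000
replicated = naive_query_count(m, r, n)   # number of copied records
originals = ideal_query_count(m, r, n)    # number of original elements

# Assumption: 24 GB is taken as 24 * 10**9 bytes.
bytes_per_record = 24 * 10**9 / replicated
```

For m = r = n = 1000 this gives 2 * 10^9 copied records against 2 * 10^6 original elements, a factor of 1000, and roughly 12 bytes per record under the stated assumption.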
"Row-column multiplication" can be used for sparse matrices, but it cannot handle large-scale dense matrices. Many scholars have done a lot of research on "block matrix operations", but I do not like this algorithm very much: first, the control logic is troublesome, and second, tuning the block size does not solve the essential problem. I like simple things, so I prefer the "minimum-granularity multiplication algorithm". However, as mentioned above, it has three problems. Next, I will elaborate my own ideas on two of them.
- A Novel Element-Matching Method for Matrix Multiplication
In the matrix product A * B = C, C_{i,j} is the result of multiplying row i of A by column j of B, as shown in formula (1). The usual approach is to organize the data into records keyed by i-j, each holding one pair of elements A_{i,k} and B_{k,j} to be multiplied.
Each record is then multiplied on the Map side, and the products are summed in the Reduce phase to obtain the final matrix product. However, as mentioned in the section on the minimum-granularity multiplication algorithm, because the key i-j is not discriminating enough and matrix elements are not retained in memory during the Map process, it is extremely difficult to organize the data into this format. If the data is organized into this format before the Map input, the time complexity of the database queries is unacceptable.
On reflection, it is easy to see that the final result C_{i,j} is the sum of r values, the k-th of which is A_{i,k} * B_{k,j}. To make the key more discriminating, we change the key to i-j-k.
The two values sharing such a key are multiplied to obtain the k-th component of C_{i,j}. Therefore, after the copies of the elements of A and B are generated in the Map phase, among all Map output records there are at most two records with any given key i-j-k. Sparse elements are excluded from computation and copying, so if only one record has the key, the element it should be multiplied by is 0; if no record has the key, both the A element and the B element are 0 and are excluded from computation and copying.
In theory, each element of A takes part in n multiplications, so each element of A is copied n times. For A_{i,k}, the copy rule is as follows:
key = i-j-k, value = A_{i,k}, for j = 1, 2, ..., n    (3)
Similarly, each element of B takes part in m multiplications, so each element of B is copied m times. For B_{k,j}, the copy rule is as follows:
key = i-j-k, value = B_{k,j}, for i = 1, 2, ..., m    (4)
Since the copies of each element are generated independently, the work can be performed by different Map tasks, which greatly speeds up copying.
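The copy rules for A and B described above can be sketched as two small generators. This is an illustration under assumptions: the names `replicate_a` and `replicate_b` are hypothetical, indices are 1-based, and the sparse matrices are stored as dictionaries holding only nonzero elements.

```python
from collections import defaultdict


def replicate_a(i, k, value, n):
    """Copy A[i,k] once per column j of B, keyed by (i, j, k)."""
    for j in range(1, n + 1):
        yield (i, j, k), ("A", value)


def replicate_b(k, j, value, m):
    """Copy B[k,j] once per row i of A, keyed by (i, j, k)."""
    for i in range(1, m + 1):
        yield (i, j, k), ("B", value)


# Sparse storage: only nonzero elements are kept (1-based indices).
A = {(1, 1): 2, (1, 2): 3, (2, 2): 4}   # A is 2 x 2, A[2,1] = 0
B = {(1, 1): 5, (2, 1): 6, (2, 2): 7}   # B is 2 x 2, B[1,2] = 0
m = n = 2

groups = defaultdict(list)
for (i, k), v in A.items():
    for key, rec in replicate_a(i, k, v, n):
        groups[key].append(rec)
for (k, j), v in B.items():
    for key, rec in replicate_b(k, j, v, m):
        groups[key].append(rec)

total_records = sum(len(g) for g in groups.values())
max_group = max(len(g) for g in groups.values())
```

Each call depends only on a single element, so the calls could run in different Map tasks. No key ever collects more than two records, and a key with a single record, such as (1, 2, 1) here, marks a product whose other factor is 0.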
I ran an experiment with this method on A (m, r) * B (r, n) = C, where m = r = n = 1000, so the two matrices contain 2 * 10^6 elements in total. Both A and B are dense. The original elements of A and B are stored as records of the form "A-i-k value" and "B-k-j value", and the file size is 24 MB. Because the file is so small, the copying is handled by a single Map. Each element is copied one thousand times, for a total of 2 * 10^9 copies. The time consumed is shown in Figure 4.
Figure 4 shows that the Map runs for a very long time, because every record in the Map must be copied 1000 times. In practice, if the two matrices are very large, many Map blocks will be full, each holding about 5 million records, and each record must be copied m or n times (where m and n may be in the hundreds of millions), so the execution time of a single Map becomes a bottomless pit.
To reduce the execution time of each Map, I racked my brains and finally came up with a method, which is introduced in the next section.
- An Innovative Cell-Division Copy Algorithm
As mentioned in the previous section, the Map execution time is too long. Some colleagues suggested reducing the Map block size so that each block holds fewer records and different blocks are executed by different nodes. I think this idea is unreasonable. A smaller block does hold fewer records, but if each record needs a large number of copies, it does not help. Moreover, matrix multiplications of different sizes need different numbers of copies per element, so the block size is hard to tune. Besides, for Hadoop jobs the Map block size is usually increased rather than decreased, which helps keep the computation concentrated.
In this method, the Map copy process takes too long because each record needs too many copies. What if the copying of a single record could be split into segments and completed on different nodes? Following this idea, I designed a method that copies through iterated Map passes. Because the growth in the number of Maps during the iterations resembles cell division, I call it the "cell-division copy algorithm".
Since the number of copies needed for each matrix element is known in advance, we can design a segmented copy method in which different nodes copy different segments. Two variables are introduced here: num_split, the maximum number of segments a record is split into during one iteration, and num_copy, the maximum number of copies generated from each final segment. During an iteration, if the range of a record's segment is larger than num_copy, the segment is split further; otherwise, the copy operation is performed. The following example illustrates the iteration process.
An element A_{i,k} of A needs to be copied 1000 times. To copy it into the form of formula (3), we use the cell-division copy algorithm to distribute the copying across different computing nodes. In this example, num_split and num_copy are both 10, and the iteration process is shown in Figure 5.
For elements of B, we use the same iterative copy method to copy them into the form of formula (4); that is, the range being split is over i.
As shown in Figure 5, the number of data records grows by a factor of num_split in each iteration. As the record set grows, the file is divided into more and more Maps, which are naturally assigned to more and more computing nodes. Consider the third iteration in Figure 5. Because the record range now meets the generation condition (range <= num_copy), each record on each Map is copied only num_copy times in the third iteration, which takes far less time than copying a record 1000 times in a single Map. This method is especially suitable for large-scale matrices.
It is also worth mentioning that in practice A and B often differ in size. When implementing the cell-division copy algorithm, two flag variables are needed to record whether the segment-splitting iterations of each matrix have finished. Once splitting has finished for both matrices, the last iteration begins: generating the record copies.
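The iteration described in this section can be sketched in a few lines of Python. This is one possible reading of the description, not the original implementation: the names `split_range` and `iterate` are hypothetical, a record is modeled as (element, lo, hi) with an inclusive index range, and the single `done` flag stands in for the two flag variables mentioned above.

```python
def split_range(lo, hi, num_split):
    """Split the inclusive range [lo, hi] into at most num_split parts."""
    size = hi - lo + 1
    step = -(-size // num_split)  # ceiling division
    return [(s, min(s + step - 1, hi)) for s in range(lo, hi + 1, step)]


def iterate(records, num_split, num_copy):
    """One Map pass over (element, lo, hi) records.

    Returns (new_records, done). When every range is small enough,
    the pass generates the final copies and done is True.
    """
    if all(hi - lo + 1 <= num_copy for _, lo, hi in records):
        copies = [(elem, j) for elem, lo, hi in records
                  for j in range(lo, hi + 1)]
        return copies, True
    out = []
    for elem, lo, hi in records:
        if hi - lo + 1 > num_copy:
            out.extend((elem, s, e) for s, e in split_range(lo, hi, num_split))
        else:
            out.append((elem, lo, hi))
    return out, False


# One element of A that must be copied for j = 1..1000.
records = [("A[1,1]", 1, 1000)]
sizes = []
done = False
while not done:
    records, done = iterate(records, num_split=10, num_copy=10)
    sizes.append(len(records))
```

With num_split = num_copy = 10, the record count grows from 1 to 10 to 100 segments, and the third pass turns each segment into at most 10 copies, so no single Map ever produces 1000 copies of one record.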
I ran an experiment on A (m, r) * B (r, n) = C, where m = r = n = 1000, and completed the copying of the matrix elements in three iterations, as shown in Figure 6. In the first iteration, the input is only 24 MB, so there is only one Map, and its output is split across three Maps. In the second iteration, the three input Maps produce output for 30 Maps, consistent with the expansion factor num_split. The third iteration performs the copy operation.
We can also see that during the final copy-generation pass, the execution time of each Map is relatively stable, as shown in Figure 7. If the cluster is large enough, the 30 Maps can be executed in a single round.
Finally, once the copying and pairing of the matrix elements are complete, the rest is relatively simple: after two rounds of Map-Reduce, the result is obtained.
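The article does not spell out the two rounds, so the sketch below is an assumption about how they could work: the first round groups records by the key i-j-k and multiplies the pair, and the second groups the products by i-j and adds them. The names `round_one` and `round_two` are hypothetical, and the input records are built in memory for the example.

```python
from collections import defaultdict


def round_one(records):
    """Group by (i, j, k); multiply when both factors are present."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    for (i, j, k), values in groups.items():
        if len(values) == 2:  # a lone record means the other factor is 0
            yield (i, j), values[0] * values[1]


def round_two(products):
    """Group by (i, j) and add the products to obtain C[i,j]."""
    sums = defaultdict(int)
    for key, value in products:
        sums[key] += value
    return dict(sums)


# Sparse inputs with 1-based indices; only nonzero elements are stored.
A = {(1, 1): 2, (1, 2): 3, (2, 2): 4}   # [[2, 3], [0, 4]]
B = {(1, 1): 5, (2, 1): 6, (2, 2): 7}   # [[5, 0], [6, 7]]
m = n = 2

# Replicated records as produced by the copy step.
records = [((i, j, k), v) for (i, k), v in A.items() for j in range(1, n + 1)]
records += [((i, j, k), v) for (k, j), v in B.items() for i in range(1, m + 1)]

C = round_two(round_one(records))
```

Multiplication is commutative, so the first round does not need to know which of the two values came from A and which from B.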
This method targets the inherent shortcomings and difficulties of the "minimum-granularity multiplication algorithm". It uses the i-j-k key design so that Map-Reduce itself can pair the elements to be multiplied. In addition, to reduce the time lost when a single node makes too many copies of one element, the "cell-division copy algorithm" distributes the copying of a single record across different nodes, which greatly shortens the execution time on each node and takes full advantage of the cluster.
However, because of the inherent nature of the algorithm, this article does not solve the last drawback of the "minimum-granularity multiplication algorithm": the files take up too much space. In theory this is not a problem for HDFS, which has enough space to hold the data. However, the experiments show that the files are huge. When multiplying the two dense matrices A (1000 * 1000) and B (1000 * 1000), the number of records reaches 2 * 10^9 and the occupied space is 20 to 30 GB, as shown in Figure 8. How much file space would a larger matrix multiplication occupy? The answer is: immeasurable.
Most algorithms have their own advantages and limitations. The storage space consumed by the method in this article is inherent in its nature, and it has kept worrying me: although the algorithm solves some problems, it is at least not perfect. For several days I thought long and hard, and even in my dreams I saw pairs of matrix elements! Heaven rewards those who persevere, and a new method finally emerged in my mind. Please look forward to the next installment, "Using Hadoop to multiply large matrices (II)". There I will analyze why the present method inherently produces such large files, and I will introduce the new method, which I think is perfect. The new method is well suited to multiplying both large-scale dense matrices and sparse matrices. For sparse matrices in particular, there is essentially no wasted computation and no wasted space.
Xiao Zhou, cloud computing group of a research and development center