Implementation of kNN Based on Sparse Matrices
zouxy09@qq.com
http://blog.csdn.net/zouxy09
New Year's Day! 2015 is almost here! Haha, I have been busy for quite a while and have not updated my blog, which has quietly gone gray. With some free time over the holiday, I summarized a few scattered things and recorded them here to bring the blog back to life. It has been so long since I last wrote that my words feel unusually rusty. Anyway, I wish everyone an even better year ahead!
I. Overview
Here we first look at how to exploit sparse matrix storage to accelerate the kNN algorithm when our data is sparse. The kNN test program in my previous blog post was written for dense matrix data. In practice, however, we also encounter a lot of sparse data, and much of it is deliberately kept sparse, because sparse data has storage and computational properties that dense data cannot match; this matters a great deal for the memory and real-time requirements of engineering applications. So here we focus on how sparse matrices are stored, with an example of their application in the kNN algorithm.
As we all know, a sparse matrix is a matrix with many zero elements, that is, one in which the number of non-zero elements is much smaller than the total number of matrix elements. If we store only those few non-zero elements, we can greatly reduce the storage space of the matrix. For example, suppose a 1000 x 1000 matrix contains only 100 non-zero elements. If the whole matrix is stored, it requires 1000 x 1000 x 4 bytes = 4 MB of space (assuming each element is a float occupying 4 bytes). But if you store only the 100 non-zero elements, you need just 100 x 4 bytes = 0.4 KB. That enormous difference is a huge win for memory. But wait, the sharp-eyed will spot a problem: can 0.4 KB alone really describe this sparse matrix? Doesn't every element of a matrix carry positional meaning? Don't we need to store the row and column of each element too? That's right: we also need auxiliary arrays to describe the position of each non-zero element in the original matrix. Every radish needs its own pit, so to speak: each element has to be pinned to its place.
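A quick sanity check of this saving in scipy (a minimal sketch; the byte counts are approximate, since, as discussed next, the auxiliary index arrays also take space, and randomly chosen positions may occasionally collide):

import numpy as np
from scipy.sparse import csr_matrix

# a 1000 x 1000 float32 matrix with (at most) 100 non-zero elements
dense = np.zeros((1000, 1000), dtype=np.float32)
rows = np.random.randint(0, 1000, 100)
cols = np.random.randint(0, 1000, 100)
dense[rows, cols] = 1.0

sparse = csr_matrix(dense)
print(dense.nbytes)    # 4000000 bytes = 4 MB
# data + index arrays together: a few KB in total
print(sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes)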
For matrix operations, different ways of organizing sparse data have different characteristics, so a variety of storage formats exist for sparse matrices. What they have in common is that they store all non-zero elements of the matrix in a linear array and provide auxiliary arrays to describe the positions of those non-zero elements in the original matrix. For example, the sparse matrix module of the well-known scipy library provides the following storage formats:
bsr_matrix: Block Sparse Row matrix
coo_matrix: a sparse matrix in COOrdinate format
csc_matrix: Compressed Sparse Column matrix
csr_matrix: Compressed Sparse Row matrix
dia_matrix: sparse matrix with DIAgonal storage
dok_matrix: Dictionary Of Keys based sparse matrix
lil_matrix: row-based linked list sparse matrix
For more details on the differences between them, see the references and the official scipy documentation; a small interop sketch follows below. This section only describes the mainstream CSR format.
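The formats mainly differ in which operations they make cheap. As a small illustration (a minimal sketch; the matrix content is arbitrary), the same data can be built from (row, column, value) triplets in COO format and then converted to whichever format suits the computation:

import numpy as np
from scipy.sparse import coo_matrix

# build a small matrix from (row, col, value) triplets
row = np.array([0, 0, 1, 2])
col = np.array([0, 1, 1, 2])
val = np.array([1.0, 7.0, 2.0, 3.0])
A = coo_matrix((val, (row, col)), shape=(3, 3))

A_csr = A.tocsr()    # fast row slicing and matrix products
A_csc = A.tocsc()    # fast column slicing
A_lil = A.tolil()    # cheap incremental modification
print(A_csr.todense())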
II. CSR sparse matrix storage
The full name of CSR is Compressed Sparse Row format. Take the following matrix as an example (it is the example used in reference [2]):

[[1 7 0 0]
 [0 2 8 0]
 [5 0 3 9]
 [0 6 0 4]]

As a first step, a dense matrix can be described with three arrays. Save all non-zero element values of the matrix into a values array in row-major order: the non-zero elements of the first row line up first, followed by those of the second row, and so on. This matrix has nine non-zero elements, so the values array has nine elements: [1 7 2 8 5 3 9 6 4]. How do we record the positions of these non-zero elements in the original matrix? A position is just a row and a column, so a natural idea is to add a row array and a column array. First we record the row number of each element of the values array (that is, of each non-zero element of the original matrix); this gives an array named row indices, whose value is [0 0 1 1 2 2 2 3 3]. We also create an array column indices to record the column number of each element of the values array; its value is [0 1 1 2 0 2 3 1 3]. The three arrays correspond element by element and together describe the original matrix. For example, the third element of values is 2; its row in the original matrix is the third element of row indices, namely 1, and its column is the third element of column indices, also 1. So this non-zero element sits at position (1, 1) of the original matrix.
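These three arrays are exactly what scipy's coo_matrix constructor takes, so the example can be checked directly (a minimal sketch; the arrays are the ones listed above):

import numpy as np
from scipy.sparse import coo_matrix

values      = np.array([1, 7, 2, 8, 5, 3, 9, 6, 4], dtype=np.float32)
row_indices = np.array([0, 0, 1, 1, 2, 2, 2, 3, 3], dtype=np.int32)
col_indices = np.array([0, 1, 1, 2, 0, 2, 3, 1, 3], dtype=np.int32)

A = coo_matrix((values, (row_indices, col_indices)), shape=(4, 4))
print(A.todense())    # recovers the original 4 x 4 matrix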
At this point our goal is achieved, but the required storage is "number of non-zero elements x 3". Can it be smaller? Yes, and CSR is one way to achieve it. Its idea is also very simple: many non-zero elements of the matrix belong to the same row, right? In the example above, the 1st and 2nd elements of the values array belong to row 0 of the matrix, and the 3rd and 4th elements belong to row 1. Looking at row indices again, [0 0 1 1 2 2 2 3 3], we see long runs of identical values that increase monotonically. So we can merge each run and mark only where each row begins and ends; in other words, we record only the offsets. The resulting array is called row offset, and its size is the number of rows of the original matrix plus 1. Its elements at positions i and i+1 delimit row i: the elements of row i of the matrix occupy positions offset[i] through offset[i+1]-1 of the values array. In the example above, the row offset array is [0 2 4 7 9]. I am not sure I have explained this clearly; if not, please check the references.
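scipy's csr_matrix accepts exactly this (values, column indices, row offset) triple, so the compressed form of the example can be verified the same way (a minimal sketch):

import numpy as np
from scipy.sparse import csr_matrix

values      = np.array([1, 7, 2, 8, 5, 3, 9, 6, 4], dtype=np.float32)
col_indices = np.array([0, 1, 1, 2, 0, 2, 3, 1, 3], dtype=np.int32)
row_offsets = np.array([0, 2, 4, 7, 9], dtype=np.int64)

A = csr_matrix((values, col_indices, row_offsets), shape=(4, 4))
print(A.todense())         # recovers the original 4 x 4 matrix
print(A.getrow(2).data)    # row 2 = values[4:7] -> [5. 3. 9.]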
III. CSR-based kNN implementation
Operations on sparse matrices can be relatively fast, matrix multiplication for example, because many elements of the matrix are 0 and anything multiplied by 0 is 0, so those products can simply be skipped. Many libraries already implement optimized sparse matrix operations, so we will not go into the implementation details here; we just use the existing optimized libraries as building blocks. Specifically, we use the csr_matrix class of the scipy package; see the official documentation for details. We run our experiments in Python, and I recommend the Python distribution Anaconda. The latest version bundles as many as 195 popular Python packages, including the scientific computing packages we commonly use such as numpy and scipy. With it, you no longer have to worry about installing one dependency after another. With Anaconda in hand, everything is easy! Download it here: http://www.continuum.io/downloads
Here we use kNN for the experiment, which involves sparse matrix multiplication. For kNN there are generally two matrices. One is matrix A, which stores the N training samples; assume each row of the matrix represents one training sample. The other is matrix B, which stores the M test samples. kNN requires us to compute the distance (the squared Euclidean distance is used here) between each sample in B and each of the N samples in A, and then find the K samples with the smallest distance. We know that the squared Euclidean distance can be expanded:

|a - b|^2 = |a|^2 - 2*a.b + |b|^2

In the program, we can first compute the sum of squares of each row of matrix A and of matrix B, which gives the |a|^2 and |b|^2 terms. The inner products between all samples of B and all samples of A can be computed in one shot as the matrix product B*A^T. In this way the whole computation reduces to matrix operations, and we can exploit the power of vectorized arithmetic. This leads to the following program:
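Before the sparse version, the expansion itself can be sanity-checked on dense data (a minimal sketch with arbitrary sizes; it just confirms that the vectorized form matches the direct pairwise computation):

import numpy as np

A = np.random.rand(5, 3)    # 5 "training" samples
B = np.random.rand(2, 3)    # 2 "query" samples

A_sq = (A ** 2).sum(axis=1)    # |a|^2 for each training sample
B_sq = (B ** 2).sum(axis=1)    # |b|^2 for each query sample
expanded = B_sq[:, None] - 2 * B.dot(A.T) + A_sq[None, :]

# direct pairwise computation for comparison
direct = ((B[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
print(np.allclose(expanded, direct))    # prints True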
knn_sparse_csr.py
#****************************************************
#*
#* Description: KNN with sparse data
#* Author: Zou Xiaoyi
#* Date: 2014-12-31
#* HomePage: http://blog.csdn.net/zouxy09
#* Email: zouxy09@qq.com
#*
#****************************************************
import numpy as np
from scipy.sparse import csr_matrix


def kNN_Sparse(local_data_csr, query_data_csr, top_k):
    # calculate the squared sum of each vector
    local_data_sq = local_data_csr.multiply(local_data_csr).sum(1)
    query_data_sq = query_data_csr.multiply(query_data_csr).sum(1)

    # calculate the inner products between all query and training samples
    distance = query_data_csr.dot(local_data_csr.transpose()).todense()

    # expand the squared Euclidean distance: |a-b|^2 = |a|^2 - 2*a.b + |b|^2
    num_query, num_local = distance.shape
    distance = np.tile(query_data_sq, (1, num_local)) \
             + np.tile(local_data_sq.T, (num_query, 1)) - 2 * distance

    # get the top k nearest neighbours
    topK_idx = np.argsort(distance)[:, 0:top_k]
    topK_similarity = np.zeros((num_query, top_k), np.float32)
    for i in range(num_query):
        topK_similarity[i] = distance[i, topK_idx[i]]
    return topK_similarity, topK_idx


def run_knn():
    top_k = 2

    # training data: 4 samples with 3 dimensions, in CSR form
    local_data_offset = np.array([0, 1, 2, 4, 6], dtype=np.int64)
    local_data_index = np.array([0, 1, 0, 1, 0, 2], dtype=np.int32)
    local_data_value = np.array([1, 2, 3, 4, 8, 9], dtype=np.float32)
    local_data_csr = csr_matrix((local_data_value, local_data_index, local_data_offset), dtype=np.float32)
    print(local_data_csr.todense())

    # query data: 2 samples with 3 dimensions, in CSR form
    query_offset = np.array([0, 1, 4], dtype=np.int64)
    query_index = np.array([0, 0, 1, 2], dtype=np.int32)
    query_value = np.array([1.1, 3.1, 4, 0.1], dtype=np.float32)
    query_csr = csr_matrix((query_value, query_index, query_offset), dtype=np.float32)
    print(query_csr.todense())

    topK_similarity, topK_idx = kNN_Sparse(local_data_csr, query_csr, top_k)
    for i in range(query_offset.shape[0] - 1):
        print("for %d image, top %d is" % (i, top_k), topK_idx[i])
        print("corresponding similarity:", topK_similarity[i])


if __name__ == '__main__':
    run_knn()
The program output is as follows:
[[ 1.  0.  0.]
 [ 0.  2.  0.]
 [ 3.  4.  0.]
 [ 8.  0.  9.]]
[[ 1.10000002  0.          0.        ]
 [ 3.0999999   4.          0.1       ]]
for 0 image, top 2 is [[0 1]]
corresponding similarity: [ 0.00999999  5.21000004]
for 1 image, top 2 is [[2 1]]
corresponding similarity: [ 0.02000046  13.61999893]
On this toy data the code gets no acceleration at all. But if you process large, high-dimensional data with a high degree of sparsity, the speedup from sparse computation is amazing. PS: the program above has also been run on large matrices.
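To get a feel for the difference, here is a rough benchmark sketch (the size and density are made-up values; actual speedups depend heavily on the sparsity, the operation, and the hardware):

import time
import numpy as np
import scipy.sparse

# a 5000 x 5000 matrix with 0.1% non-zero elements
X = scipy.sparse.random(5000, 5000, density=0.001, format='csr', dtype=np.float32)
X_dense = X.toarray()
v = np.random.rand(5000).astype(np.float32)

t0 = time.time()
_ = X.dot(v)          # sparse mat-vec: touches only the non-zeros
t1 = time.time()
_ = X_dense.dot(v)    # dense mat-vec: touches every element
t2 = time.time()
print("sparse: %.5f s, dense: %.5f s" % (t1 - t0, t2 - t1))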
IV. References
[1] Sparse Matrix Storage Formats
[2] http://www.bu.edu/pasi/files/2011/01/NathanBell1-10-1000.pdf