SVD in Machine Learning in Action

Source: Internet
Author: User

1. Singular value decomposition (SVD)

1.1 Pros and cons of SVD

Pros: Simplifies the data, removes noise and redundant information, and can improve the results of downstream algorithms.

Cons: The transformed data can be difficult to interpret.

1.2 SVD applications

(1) Latent semantic indexing (LSI) / latent semantic analysis (LSA)

In LSI, a matrix is built from documents and words. Applying SVD to this matrix yields singular values that represent concepts or topics in the documents, which can be used for more efficient document search.

(2) Recommender systems

Use SVD to build a topic space from the data, then compute similarities in that topic space.

1.3 SVD decomposition

SVD is a matrix factorization technique. It decomposes the original data matrix A (m*n) into three matrices, A = U * Sigma * V(T), whose dimensions are m*m, m*n, and n*n respectively. Sigma is 0 everywhere except on its diagonal. The diagonal elements are called singular values and are arranged from largest to smallest. They are the singular values of the original data matrix A: the nonzero singular values are the square roots of the nonzero eigenvalues of A*A(T).
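As a minimal sketch of the shapes and the eigenvalue relationship described above, the following uses NumPy on a small made-up matrix. The matrix `A` and the variable names are assumptions for illustration only. The check uses A(T)*A, which has the same nonzero eigenvalues as A*A(T).

```python
import numpy as np

# Hypothetical 5x3 data matrix, used only for illustration.
A = np.array([
    [1.0, 1.0, 0.0],
    [2.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
    [1.0, 1.0, 1.0],
    [0.0, 1.0, 2.0],
])

# full_matrices=True returns U as m x m and Vt as n x n.
U, sigma, Vt = np.linalg.svd(A, full_matrices=True)

# NumPy returns the singular values as a 1-D array (largest first),
# so the m x n matrix Sigma has to be rebuilt explicitly.
Sigma = np.zeros(A.shape)
Sigma[:len(sigma), :len(sigma)] = np.diag(sigma)

# Multiplying the three factors recovers the original matrix.
reconstructed = U @ Sigma @ Vt

# The singular values are the square roots of the eigenvalues of A^T A.
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]          # sorted descending
sqrt_eigvals = np.sqrt(np.clip(eigvals, 0.0, None))  # guard against tiny negatives

print(U.shape, Sigma.shape, Vt.shape)
print(sigma)
print(sqrt_eigvals)
```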

After a certain number r of singular values, the remaining ones are small enough to be set to 0 and ignored. This means the data set has only r important features, and the rest are noise or redundant features.
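A rough sketch of this truncation, assuming NumPy: keep the r largest singular values and rebuild a rank-r approximation. The helper name `truncate_svd` and the synthetic data (a rank-2 signal plus small random noise) are hypothetical and chosen only to illustrate the idea.

```python
import numpy as np

def truncate_svd(A, r):
    """Return the rank-r approximation of A built from its r largest singular values."""
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :r] @ np.diag(sigma[:r]) @ Vt[:r, :]

# Synthetic example: a rank-2 "signal" with a little noise added on top.
rng = np.random.default_rng(0)
signal = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))
noisy = signal + 0.01 * rng.normal(size=(8, 6))

approx = truncate_svd(noisy, 2)

# The singular values beyond the second are small compared with the first two.
print(np.linalg.svd(noisy, compute_uv=False))
print(np.linalg.norm(approx - noisy))
```

In this setup the discarded singular values carry mostly the added noise, which is the sense in which truncation "removes noise".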

Question: How do we choose the value r?

Answer: There are many heuristics for deciding how many singular values to retain. A typical one is to preserve 90% of the energy in the matrix. The total energy is the sum of the squares of all singular values. The squared singular values are then accumulated from largest to smallest until the running sum reaches 90% of the total. Another heuristic, for matrices with tens of thousands of singular values, is simply to keep the first 2,000 or 3,000. This is simple to apply, but it does not guarantee that the retained singular values contain 90% of the energy.

Computing the SVD is time-consuming. Performing the SVD and the similarity calculations offline avoids redundant computation and reduces the time needed to produce recommendations.
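The 90% energy rule can be sketched as follows, assuming the singular values are already sorted from largest to smallest (as NumPy's `svd` returns them). The function name `choose_rank` and the sample values are hypothetical.

```python
import numpy as np

def choose_rank(sigma, energy=0.9):
    """Smallest r such that the first r squared singular values
    hold at least `energy` of the total squared sum."""
    squared = np.asarray(sigma, dtype=float) ** 2
    cumulative = np.cumsum(squared) / squared.sum()
    # searchsorted finds the first index whose cumulative share is >= energy.
    r = int(np.searchsorted(cumulative, energy)) + 1
    return min(r, len(squared))  # clamp in case of floating-point round-off

# Made-up singular values: squares are 100, 25, 4, 1, 0.25 (total 130.25).
sigma = [10.0, 5.0, 2.0, 1.0, 0.5]
print(choose_rank(sigma))        # first two hold 125 / 130.25, about 96%
print(choose_rank(sigma, 0.99))
```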

2. Recommendation engine based on collaborative filtering

2.1 Definitions

Collaborative filtering makes recommendations by comparing a user's data with the data of other users.

For example, to predict a movie that a user will like, the recommendation engine finds a movie the user has not seen and computes the similarity between that movie and the movies the user has seen. If the similarity is high, the algorithm assumes the user will like the movie.

Cons: When a new item arrives, no user has expressed a preference for it yet, so collaborative filtering cannot judge how much each user will like it (the cold-start problem).

2.2 Calculation of similarity

Here, collaborative filtering uses users' opinions of dishes to compute similarity. The input is a matrix of users' ratings of dishes, with one row per user and one column per dish.

Similarity is defined to lie between 0 and 1, and the more similar a pair of items is, the larger the value. A distance can be converted to a similarity with the formula similarity = 1/(1 + distance).

The following measures can be used:

(1) Euclidean distance

(2) Pearson correlation coefficient

Measures the similarity between two vectors. Its advantage over Euclidean distance is that it is insensitive to the magnitude of users' ratings. For example, if one user consistently rates every item 2 points higher than another user does, the Pearson correlation coefficient still treats the two rating vectors as perfectly correlated. The Pearson correlation coefficient ranges over [-1, 1], so it is normalized to [0, 1] with 0.5 + 0.5 * corrcoef().

(3) Cosine similarity

Computes the cosine of the angle between two vectors. Its range is [-1, 1], so it is also normalized to the [0, 1] interval.
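To see how the three measures react to a constant offset in ratings, here is a small sketch on plain NumPy arrays. The function names and the two rating vectors are hypothetical and are kept separate from the article's own listing.

```python
import numpy as np

def euclid_sim(a, b):
    # similarity = 1 / (1 + Euclidean distance)
    return 1.0 / (1.0 + np.linalg.norm(a - b))

def pearson_sim(a, b):
    # Pearson correlation rescaled from [-1, 1] to [0, 1]
    return 0.5 + 0.5 * np.corrcoef(a, b)[0, 1]

def cos_sim(a, b):
    # Cosine of the angle between a and b, rescaled to [0, 1]
    cosine = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 0.5 + 0.5 * cosine

# Two made-up users with the same taste; the second rates everything 2 points higher.
harsh = np.array([1.0, 2.0, 1.0, 3.0])
generous = harsh + 2.0

print(euclid_sim(harsh, generous))   # distance is 4, so 1 / (1 + 4)
print(pearson_sim(harsh, generous))  # the offset does not affect the correlation
print(cos_sim(harsh, generous))
```

The Euclidean-based similarity is penalized by the offset, while the Pearson-based one is not, which is the property described in (2).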

Here are the code implementations of these three similarity calculation methods:

    from numpy import corrcoef
    from numpy import linalg as la

    # Inputs are column vectors (numpy matrices) of ratings

    def eulidsim(in1, in2):
        # Similarity derived from Euclidean distance
        return 1.0 / (1.0 + la.norm(in1 - in2))

    def pearsonsim(in1, in2):
        # Check whether there are 3 or more points; with fewer,
        # the two vectors are treated as fully correlated
        if len(in1) < 3:
            return 1.0
        return 0.5 + 0.5 * corrcoef(in1, in2, rowvar=0)[0][1]

    def cossim(in1, in2):
        # (in1.T * in2) is a 1*1 matrix; take its single element
        num = float((in1.T * in2)[0, 0])
        denom = la.norm(in1) * la.norm(in2)
        return 0.5 + 0.5 * (num / denom)

2.3 Restaurant dish recommendation engine

(1) Purpose: recommend restaurant dishes. Given a user, the system returns the N best dishes to recommend to that user. To achieve this, it:

    • Finds the dishes the user has not rated, that is, the 0 entries in the user's row of the user-item matrix;
    • Estimates the rating the user would give to each of those unrated items;
    • Sorts the estimated ratings from high to low and returns the first N items.

Here is the implementation code:

    from numpy import shape, nonzero, logical_and

    # Estimate the user's rating of item under the given similarity measure
    def standest(datamat, user, simmea, item):
        n = shape(datamat)[1]
        simtotal = 0.0
        ratsimtotal = 0.0
        for j in range(n):
            userrate = datamat[user, j]
            if userrate == 0:
                continue
            # Row indices of the users who rated both item and j
            overlap = nonzero(logical_and(datamat[:, item].A > 0,
                                          datamat[:, j].A > 0))[0]
            if len(overlap) == 0:
                similarity = 0
            else:
                # Similarity between item and j, using only the ratings of
                # users who rated both items
                similarity = simmea(datamat[overlap, item], datamat[overlap, j])
            simtotal += similarity
            ratsimtotal += similarity * userrate
        if simtotal == 0:
            return 0
        return ratsimtotal / simtotal  # normalize

    # Inputs: data matrix, user index, number of dishes to return,
    # similarity measure, and rating estimation function
    def recommend(datamat, user, N=3, simmea=cossim, estmethod=standest):
        # Column indices of the dishes this user has not rated
        unrateditem = nonzero(datamat[user, :].A == 0)[1]
        if len(unrateditem) == 0:
            return 'rated every one'
        itemscore = []
        # Estimate the score the user would give each unrated dish
        for item in unrateditem:
            score = estmethod(datamat, user, simmea, item)
            itemscore.append((item, score))
        # Return the indices and scores of the N highest-scoring dishes
        return sorted(itemscore, key=lambda jj: jj[1], reverse=True)[:N]

2.4 Using SVD to improve the recommended results

Matrices built from real data sets are quite sparse, so SVD can first be used to map the original matrix into a low-dimensional space. The similarity between items is then computed in that low-dimensional space, which greatly reduces the amount of computation.

The code is as follows:
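The mapping into the low-dimensional space can be sketched like this, assuming NumPy and a small made-up ratings matrix (rows are users, columns are dishes). The names `ratings`, `k`, and `items_k` are hypothetical. Each dish becomes a k-dimensional row vector, and similarities are computed between those rows.

```python
import numpy as np

# Hypothetical ratings: 6 users x 5 dishes, 0 means "not rated".
ratings = np.array([
    [4, 4, 0, 2, 2],
    [4, 0, 0, 3, 3],
    [4, 0, 0, 1, 1],
    [1, 1, 1, 2, 0],
    [2, 2, 2, 0, 0],
    [5, 5, 5, 0, 0],
], dtype=float)

k = 2  # number of singular values kept (an arbitrary choice for this sketch)
U, sigma, Vt = np.linalg.svd(ratings, full_matrices=False)

# Project the items: ratings^T * U_k * Sigma_k^{-1} gives an (n_items x k) matrix.
items_k = ratings.T @ U[:, :k] @ np.diag(1.0 / sigma[:k])

def cos_sim(a, b):
    # Cosine similarity rescaled from [-1, 1] to [0, 1]
    return 0.5 + 0.5 * float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(items_k.shape)
print(cos_sim(items_k[0], items_k[1]))  # dishes 0 and 1 in the reduced space
```

Each similarity now compares two vectors of length k instead of two columns with one entry per user, which is where the saving in computation comes from.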

    from numpy import shape, mat, eye
    from numpy import linalg as la

    # Reduce the original data matrix with SVD so that the similarity
    # between items is easier to compute
    def scdest(datamat, user, simmea, item):
        n = shape(datamat)[1]
        simtotal = 0.0
        ratsimtotal = 0.0
        u, sigma, vt = la.svd(datamat)  # sigma is a 1-D array
        # Use only the 4 largest singular values, as a 4*4 diagonal matrix
        sig4 = mat(eye(4) * sigma[:4])
        xformeditems = datamat.T * u[:, :4] * sig4.I  # n*4
        for j in range(n):
            userrate = datamat[user, j]
            if userrate == 0 or j == item:
                continue
            # Unlike standest, similarity is computed between the reduced
            # item vectors, so no search for overlapping users is needed
            similarity = simmea(xformeditems[item, :].T, xformeditems[j, :].T)
            simtotal += similarity
            ratsimtotal += similarity * userrate
        if simtotal == 0:
            return 0
        return ratsimtotal / simtotal  # normalize

3. SVD-based image compression

    from numpy import mat, zeros
    from numpy import linalg as la

    # Print a 32*32 matrix as 0/1 values
    def printmat(in1, thresh=0.8):
        for i in range(32):
            for k in range(32):
                if float(in1[i, k]) > thresh:
                    print(1, end=' ')
                else:
                    print(0, end=' ')
            print('')

    # Image compression with SVD: reconstruct the image from any given
    # number of singular values (the first 3 by default)
    def imgcompress(numsv=3, thresh=0.8):
        my1 = []  # 32*32 matrix
        for line in open('0_5.txt').readlines():
            newrow = []
            for i in range(32):
                newrow.append(int(line[i]))
            my1.append(newrow)
        mymat = mat(my1)
        print('***original matrix***')
        printmat(mymat, thresh)
        u, sigma, vt = la.svd(mymat)
        # Turn sigma into a matrix: the diagonal of sigrecon holds
        # the first numsv singular values
        sigrecon = mat(zeros((numsv, numsv)))
        for k in range(numsv):
            sigrecon[k, k] = sigma[k]
        # Reconstruct the matrix
        reconmat = u[:, :numsv] * sigrecon * vt[:numsv, :]
        print('***reconstruct matrix***')
        printmat(reconmat, thresh)

Take handwritten digits as an example. The digit 0 is stored as a 32*32 matrix, which requires 1024 values. Experiments show that only 2 singular values are needed to reconstruct the image very accurately. The retained parts of U and VT are 32*2 and 2*32 matrices, so together with the 2 singular values, 32*2*2 + 2 = 130 values are enough to store the digit. Compared with the original 1024 values, this is a compression ratio of roughly 8 (1024/130 is about 7.9).
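The storage count above can be checked with a short sketch. It does not use the article's `0_5.txt` file. Instead it assumes a synthetic 32*32 binary ring that loosely resembles a "0", built so that it has rank 2. The variable names and the image are hypothetical.

```python
import numpy as np

# Synthetic stand-in for the digit image: a filled block with a hollow centre.
image = np.zeros((32, 32))
image[4:28, 8:24] = 1.0
image[8:24, 12:20] = 0.0

num_sv = 2
U, sigma, Vt = np.linalg.svd(image)

# Rebuild the image from the first num_sv singular values, then threshold to 0/1.
recon = U[:, :num_sv] @ np.diag(sigma[:num_sv]) @ Vt[:num_sv, :]
binary = (recon > 0.8).astype(int)

# Values to store: 32*num_sv for U, num_sv singular values, num_sv*32 for Vt.
stored = 32 * num_sv + num_sv + num_sv * 32
ratio = image.size / stored

print(np.array_equal(binary, image.astype(int)))
print(stored, ratio)
```

For this synthetic image two singular values reproduce the picture exactly after thresholding. Real digit images are not exactly rank 2, so the reconstruction there is approximate.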


