"Machine Learning in action" notes-simplifying data with SVD

Source: Internet
Author: User
Tags: square root

SVD (singular value decomposition) can be used to simplify data, remove noise, and improve the results of an algorithm.

I. SVD and recommendation system

Consider a matrix of restaurant dishes and diners: each diner rates a dish with an integer from 1 to 5, and a rating of 0 means the diner has not tasted that dish.


Create a new file svdrec.py and add the following code:

def loadExData():
    return [[0, 0, 0, 2, 2],
            [0, 0, 0, 3, 3],
            [0, 0, 0, 1, 1],
            [1, 1, 1, 0, 0],
            [2, 2, 2, 0, 0],
            [5, 5, 5, 0, 0],
            [1, 1, 1, 0, 0]]

>>> from numpy import *
>>> import svdRec
>>> Data = svdRec.loadExData()
>>> Data
[[0, 0, 0, 2, 2], [0, 0, 0, 3, 3], [0, 0, 0, 1, 1], [1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [5, 5, 5, 0, 0], [1, 1, 1, 0, 0]]
>>> U, Sigma, VT = linalg.svd(Data)
>>> Sigma
array([  9.64365076e+00,   5.29150262e+00,   8.05799147e-16,
         2.43883353e-16,   2.07518106e-17])

Looking at the singular values, the first two are much larger than the rest, so we can discard the last three values because they have very little effect.

Looking at the data, the first three users like dishes such as roast beef and pulled pork, which belong to an American BBQ restaurant. The two dominant singular values can be interpreted as two categories of food, American BBQ and Japanese food. So these three users can be treated as one class of users and the remaining four as another class, which makes recommendation very simple.

Now replace loadExData() in svdRec.py with a slightly modified matrix:

def loadExData():
    return [[1, 1, 1, 0, 0],
            [2, 2, 2, 0, 0],
            [1, 1, 1, 0, 0],
            [5, 5, 5, 0, 0],
            [1, 1, 0, 2, 2],
            [0, 0, 0, 3, 3],
            [0, 0, 0, 1, 1]]

SVD decomposition:

>>> reload(svdRec)
<module 'svdRec' from 'svdRec.py'>
>>> Data = svdRec.loadExData()
>>> Data
[[1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [1, 1, 1, 0, 0], [5, 5, 5, 0, 0], [1, 1, 0, 2, 2], [0, 0, 0, 3, 3], [0, 0, 0, 1, 1]]
>>> U, Sigma, VT = linalg.svd(Data)
>>> Sigma
array([  9.72140007e+00,   5.29397912e+00,   6.84226362e-01,
         1.67441533e-15,   3.39639411e-16])
Looking at the singular values, the first three are much larger than the other two, so we can discard the last two values because they have very little effect.

Keeping those three singular values, the original data can be approximately reconstructed.
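That reconstruction can be sketched as follows (a Python 3 / NumPy rewrite; the variable names are ours, not the book's):

```python
import numpy as np

# the 7x5 rating matrix from the modified loadExData()
data = np.array([[1, 1, 1, 0, 0],
                 [2, 2, 2, 0, 0],
                 [1, 1, 1, 0, 0],
                 [5, 5, 5, 0, 0],
                 [1, 1, 0, 2, 2],
                 [0, 0, 0, 3, 3],
                 [0, 0, 0, 1, 1]], dtype=float)

U, sigma, VT = np.linalg.svd(data)
k = 3  # keep only the three largest singular values
# rank-3 approximation: U_k * diag(sigma_k) * VT_k
approx = U[:, :k] @ np.diag(sigma[:k]) @ VT[:k, :]
print(np.round(approx, 2))  # matches the original matrix almost exactly
```

Because this matrix has rank 3, the rank-3 approximation is essentially exact; on noisier data the reconstruction would only be approximate.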

II. Recommendation engines based on collaborative filtering

Collaborative filtering makes recommendations by comparing a user's data with other users' data.

1. Calculation of similarity


from numpy import *
from numpy import linalg as la

def ecludSim(inA, inB):
    # Euclidean-distance similarity, normalized into (0, 1]
    return 1.0 / (1.0 + la.norm(inA - inB))

def pearsSim(inA, inB):
    # Pearson correlation, rescaled from [-1, 1] to [0, 1]
    if len(inA) < 3:
        return 1.0
    return 0.5 + 0.5 * corrcoef(inA, inB, rowvar=0)[0][1]

def cosSim(inA, inB):
    # cosine similarity, rescaled from [-1, 1] to [0, 1]
    num = float(inA.T * inB)
    denom = la.norm(inA) * la.norm(inB)
    return 0.5 + 0.5 * (num / denom)
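As a quick sanity check, the measures can be exercised on two columns of the example matrix (a Python 3 sketch using 1-D arrays instead of the book's matrix columns; function names are lowercase variants of the book's):

```python
import numpy as np
from numpy import linalg as la

def eclud_sim(a, b):
    # Euclidean-distance similarity: identical vectors give 1.0
    return 1.0 / (1.0 + la.norm(a - b))

def cos_sim(a, b):
    # cosine similarity rescaled into [0, 1]
    num = float(a @ b)
    return 0.5 + 0.5 * num / (la.norm(a) * la.norm(b))

item0 = np.array([1, 2, 1, 5, 1, 0, 0], dtype=float)  # column 0 of loadExData()
item4 = np.array([0, 0, 0, 0, 2, 3, 1], dtype=float)  # column 4 of loadExData()

print(eclud_sim(item0, item0))  # 1.0 (maximal similarity)
print(cos_sim(item0, item4))    # low: the two columns barely overlap
```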

2. Item-based similarity versus user-based similarity

When the number of users is very large, it is better to use item-based similarity calculation method.
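Item-based similarity compares columns of the rating matrix, while user-based similarity compares rows. A minimal sketch (Python 3, Euclidean measure; the matrix and names are ours for illustration):

```python
import numpy as np

ratings = np.array([[1, 1, 1, 0, 0],
                    [2, 2, 2, 0, 0],
                    [1, 1, 0, 2, 2],
                    [0, 0, 0, 3, 3]], dtype=float)

def eclud_sim(a, b):
    return 1.0 / (1.0 + np.linalg.norm(a - b))

# item-based: how alike are items 0 and 1, across all users (columns)?
item_sim = eclud_sim(ratings[:, 0], ratings[:, 1])
# user-based: how alike are users 0 and 1, across all items (rows)?
user_sim = eclud_sim(ratings[0, :], ratings[1, :])
print(item_sim, user_sim)
```

With many users and few items, the item-item similarities are both fewer in number and more stable over time, which is why item-based computation scales better.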

3. Example: Restaurant dish recommendation engine based on item similarity



from numpy import *
from numpy import linalg as la

def loadExData():
    return [[1, 1, 1, 0, 0],
            [2, 2, 2, 0, 0],
            [1, 1, 1, 0, 0],
            [5, 5, 5, 0, 0],
            [1, 1, 0, 2, 2],
            [0, 0, 0, 3, 3],
            [0, 0, 0, 1, 1]]

def loadExData2():
    return [[0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
            [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
            [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
            [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
            [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
            [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
            [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
            [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
            [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
            [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
            [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]]

def ecludSim(inA, inB):
    return 1.0 / (1.0 + la.norm(inA - inB))

def pearsSim(inA, inB):
    if len(inA) < 3:
        return 1.0
    return 0.5 + 0.5 * corrcoef(inA, inB, rowvar=0)[0][1]

def cosSim(inA, inB):
    num = float(inA.T * inB)
    denom = la.norm(inA) * la.norm(inB)
    return 0.5 + 0.5 * (num / denom)

# standEst(): estimate the rating that a user would give an item, under a given
# similarity measure. dataMat is the data matrix, user the user index,
# simMeas the similarity function, item the item index.
def standEst(dataMat, user, simMeas, item):
    n = shape(dataMat)[1]  # shape gives the matrix dimensions; n = number of items
    simTotal = 0.0; ratSimTotal = 0.0
    for j in range(n):
        userRating = dataMat[user, j]
        if userRating == 0:
            continue  # a rating of 0 means the user has not rated item j; skip it
        # find the users who have rated both item and j
        overLap = nonzero(logical_and(dataMat[:, item].A > 0,
                                      dataMat[:, j].A > 0))[0]
        if len(overLap) == 0:
            similarity = 0
        else:
            similarity = simMeas(dataMat[overLap, item], dataMat[overLap, j])
        #print 'the %d and %d similarity is: %f' % (item, j, similarity)
        simTotal += similarity
        ratSimTotal += similarity * userRating
    if simTotal == 0:
        return 0
    else:
        return ratSimTotal / simTotal

def recommend(dataMat, user, N=3, simMeas=cosSim, estMethod=standEst):
    # find the items this user has not yet rated
    unratedItems = nonzero(dataMat[user, :].A == 0)[1]
    if len(unratedItems) == 0:
        return 'you rated everything'
    itemScores = []
    for item in unratedItems:
        # for each unrated item, call the estimate method to predict a score
        estimatedScore = estMethod(dataMat, user, simMeas, item)
        # put the item index and its estimated score into the itemScores list
        itemScores.append((item, estimatedScore))
    # sort itemScores in descending order and return the top N items
    return sorted(itemScores, key=lambda jj: jj[1], reverse=True)[:N]

def svdEst(dataMat, user, simMeas, item):
    n = shape(dataMat)[1]
    simTotal = 0.0; ratSimTotal = 0.0
    U, Sigma, VT = la.svd(dataMat)
    Sig4 = mat(eye(4) * Sigma[:4])  # arrange Sig4 into a diagonal matrix
    xformedItems = dataMat.T * U[:, :4] * Sig4.I  # create transformed items
    for j in range(n):
        userRating = dataMat[user, j]
        if userRating == 0 or j == item:
            continue
        similarity = simMeas(xformedItems[item, :].T,
                             xformedItems[j, :].T)
        print 'the %d and %d similarity is: %f' % (item, j, similarity)
        simTotal += similarity
        ratSimTotal += similarity * userRating
    if simTotal == 0:
        return 0
    else:
        return ratSimTotal / simTotal

Here dataMat[:, item].A extracts the item column; since dataMat is a matrix, .A converts it to an array. logical_and then finds the rows where both the item column and column j are greater than 0 (only positive ratings yield True), and nonzero returns the indices of the nonzero entries.
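The overlap step can be seen in isolation (Python 3; starting from an ndarray, so the .A conversion is unnecessary here):

```python
import numpy as np

# a tiny rating matrix: 4 users (rows) x 2 items (columns)
data = np.array([[4, 0],
                 [5, 5],
                 [0, 3],
                 [2, 4]])

# indices of users who gave a positive rating to BOTH items
overlap = np.nonzero(np.logical_and(data[:, 0] > 0, data[:, 1] > 0))[0]
print(overlap)  # [1 3]: only users 1 and 3 rated both items
```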

Perform SVD decomposition:

>>> from numpy import linalg as la
>>> U, Sigma, VT = la.svd(mat(svdRec.loadExData2()))
>>> Sigma
array([  1.38487021e+01,   1.15944583e+01,   1.10219767e+01,
         5.31737732e+00,   4.55477815e+00,   2.69935136e+00,
         1.53799905e+00,   6.46087828e-01,   4.45444850e-01,
         9.86019201e-02,   9.96558169e-17])

How do we decide r? One quantitative approach is to keep enough singular values to account for 90% of the energy. As with PCA, each singular value is the square root of an eigenvalue of Data * Data^T, so the total energy is the sum of the squared singular values:

>>> Sig2 = Sigma**2
>>> sum(Sig2)
541.99999999999932

Summing the leading squared singular values, we find that more than 90% of the total energy is captured within the first few, so we take r = 4:

>>> sum (Sig2[:3]) 500.50028912757909
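The 90%-energy rule can be wrapped in a small helper (a sketch; the function name is ours, not the book's):

```python
import numpy as np

def num_sv_for_energy(sigma, energy=0.9):
    """Smallest r such that the first r squared singular values
    hold at least `energy` of the total energy."""
    sig2 = np.asarray(sigma, dtype=float) ** 2
    frac = np.cumsum(sig2) / sig2.sum()          # cumulative energy fraction
    return int(np.searchsorted(frac, energy)) + 1

# toy example: energies 9, 4, 1 sum to 14; 9+4=13 already exceeds 90%
print(num_sv_for_energy([3.0, 2.0, 1.0]))  # 2
```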

The key to the SVD step is that it reduces the dimension of the item vectors from the number of users (here 11) down to 4:

def svdEst(dataMat, user, simMeas, item):
    n = shape(dataMat)[1]
    simTotal = 0.0; ratSimTotal = 0.0
    U, Sigma, VT = la.svd(dataMat)
    Sig4 = mat(eye(4) * Sigma[:4])  # arrange Sig4 into a diagonal matrix
    xformedItems = dataMat.T * U[:, :4] * Sig4.I  # create transformed items
    for j in range(n):
        userRating = dataMat[user, j]
        if userRating == 0 or j == item:
            continue
        similarity = simMeas(xformedItems[item, :].T,
                             xformedItems[j, :].T)
        print 'the %d and %d similarity is: %f' % (item, j, similarity)
        simTotal += similarity
        ratSimTotal += similarity * userRating
    if simTotal == 0:
        return 0
    else:
        return ratSimTotal / simTotal
One of the key steps is dataMat.T * U[:, :4] * Sig4.I.

Scaling by the singular values, this converts the m x n dataMat into an n x 4 matrix whose rows represent the items in a 4-dimensional space of user classes.
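Because U is orthogonal, this transform reduces exactly to the first four right singular vectors: dataMat.T * U[:, :4] * Sig4.I equals V[:, :4]. A Python 3 check on the loadExData2 matrix:

```python
import numpy as np

data = np.array([[0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
                 [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
                 [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
                 [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
                 [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
                 [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
                 [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
                 [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
                 [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
                 [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
                 [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]], dtype=float)

U, sigma, VT = np.linalg.svd(data)
sig4_inv = np.diag(1.0 / sigma[:4])       # inverse of the 4x4 diagonal Sig4
xformed = data.T @ U[:, :4] @ sig4_inv    # each of the 11 items as a 4-D vector
print(xformed.shape)                      # (11, 4)
print(np.allclose(xformed, VT[:4, :].T))  # True: the transform recovers V[:, :4]
```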

>>> myMat = mat(svdRec.loadExData2())
>>> myMat
matrix([[0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
        [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
        [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
        [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
        [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
        [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
        [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
        [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
        [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
        [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
        [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]])
>>> svdRec.recommend(myMat, 1, estMethod=svdRec.svdEst)
the 0 and 3 similarity is: 0.490950
the 0 and 5 similarity is: 0.484274
the 0 and 10 similarity is: 0.512755
the 1 and 3 similarity is: 0.491294
the 1 and 5 similarity is: 0.481516
the 1 and 10 similarity is: 0.509709
the 2 and 3 similarity is: 0.491573
the 2 and 5 similarity is: 0.482346
the 2 and 10 similarity is: 0.510584
the 4 and 3 similarity is: 0.450495
the 4 and 5 similarity is: 0.506795
the 4 and 10 similarity is: 0.512896
the 6 and 3 similarity is: 0.743699
the 6 and 5 similarity is: 0.468366
the 6 and 10 similarity is: 0.439465
the 7 and 3 similarity is: 0.482175
the 7 and 5 similarity is: 0.494716
the 7 and 10 similarity is: 0.524970
the 8 and 3 similarity is: 0.491307
the 8 and 5 similarity is: 0.491228
the 8 and 10 similarity is: 0.520290
the 9 and 3 similarity is: 0.522379
the 9 and 5 similarity is: 0.496130
the 9 and 10 similarity is: 0.493617
[(4, 3.3447149384692283), (7, 3.3294020724526967), (9, 3.328100876390069)]

"Machine Learning in action" notes-simplifying data with SVD
