Singular Value Decomposition (SVD) and Its Extensions in Machine Learning: A Detailed Explanation


SVD is a common matrix factorization technique and an effective method of algebraic feature extraction. In collaborative filtering, the main idea of SVD is to use the existing ratings to analyze how much each rater prefers each latent factor and to what extent each film exhibits each factor, and then to predict unseen ratings from that analysis. RSVD, SVD++, and ASVD are improved algorithms based on SVD.

This article mainly considers these algorithms in the field of personalized recommendation.

1. Matrix Factorization Model and Baseline Predictors

The SVD used here is actually a combination of a matrix factorization model and baseline predictors, so to make things easier we first introduce these two components.

(1) Matrix Factorization Model

Think of the user ratings as a table:

Each row represents a user and each column represents an item. This is really just a matrix, but the matrix we have is probably very sparse, that is, the known ratings are very few. Still, since it is a matrix, it can naturally be expressed as the product of two other matrices (a proof is given below):

In collaborative filtering, this decomposition takes the form

$\hat{r}_{ui} = p_u^T q_i = \sum_{k=1}^{K} p_{uk}\, q_{ki}$

In such a decomposition model, $p_u$ (row $u$ of the user latent-factor matrix $P$) indicates user $u$'s preference for each factor $k$, and $q_i$ (column $i$ of the item latent-factor matrix $Q$) indicates the degree to which movie $i$ exhibits each factor $k$.
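As a concrete illustration (all numbers here are made up), the following sketch expresses a small rating matrix as the product of a user-factor matrix and an item-factor matrix, so that every rating is the dot product of a user's factor vector and an item's factor vector:

import numpy as np

np.random.seed(0)
n_users, n_items, K = 4, 5, 2              # 4 users, 5 movies, 2 latent factors

P = np.random.rand(n_users, K)             # user latent-factor matrix (n_users x K)
Q = np.random.rand(K, n_items)             # item latent-factor matrix (K x n_items)
R = P.dot(Q)                               # the full rating matrix implied by the factors

# the rating of user 1 for item 3 is row 1 of P times column 3 of Q:
print(np.isclose(R[1, 3], P[1, :].dot(Q[:, 3])))   # True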

(2) Baseline Predictors

Baseline predictors use $b_i$ to denote the deviation of movie $i$'s ratings from the overall average, $b_u$ to denote the deviation of user $u$'s ratings from the overall average, and $\mu$ to denote the overall average rating, giving the baseline prediction

$b_{ui} = \mu + b_u + b_i$
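For example (with made-up numbers): if the overall average is $\mu = 3.6$, user $u$ tends to rate 0.5 above average ($b_u = 0.5$), and movie $i$ tends to be rated 0.3 below average ($b_i = -0.3$), then the baseline prediction is $b_{ui} = 3.6 + 0.5 - 0.3 = 3.8$.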

2. SVD: Mathematical Principle and Derivation

Can a set of orthogonal vectors be found for an arbitrary $m \times n$ matrix such that they remain orthogonal after being transformed by the matrix? The answer is yes, and this is the essence of the SVD.

Now suppose there is an $m \times n$ matrix $A$. The matrix $A$ maps vectors from $n$-dimensional space into a $k$-dimensional subspace of $m$-dimensional space, where $k = \mathrm{rank}(A)$ and $k \le m$. The goal now is to find a set of orthogonal vectors in $n$-dimensional space that is still orthogonal after the transformation by $A$. Suppose such a set of orthogonal vectors has been found:

$v_1, v_2, \dots, v_n$

Then the matrix $A$ maps this set to:

$Av_1, Av_2, \dots, Av_n$

If we want the images to be pairwise orthogonal, then for $i \ne j$:

$(Av_i)^T (Av_j) = v_i^T A^T A\, v_j = 0$

Now assume that

$A^T A\, v_j = \lambda_j v_j$

that is, the orthogonal vectors $v_j$ are chosen to be eigenvectors of $A^T A$. Because $A^T A$ is a symmetric matrix, the $v_j$ can be chosen pairwise orthogonal, and then

$v_i^T A^T A\, v_j = \lambda_j\, v_i^T v_j = 0 \quad (i \ne j)$

In this way we have found a set of orthogonal vectors that remains orthogonal after the mapping. Now normalize the mapped vectors. Because

$|Av_i|^2 = (Av_i)^T (Av_i) = v_i^T A^T A\, v_i = \lambda_i\, v_i^T v_i = \lambda_i \ge 0$

we have

$|Av_i| = \sqrt{\lambda_i}$

So take the unit vectors

$u_i = \frac{Av_i}{|Av_i|} = \frac{1}{\sqrt{\lambda_i}}\, A v_i$

from which we get

$A v_i = \sigma_i u_i, \qquad \sigma_i = \sqrt{\lambda_i}, \quad 1 \le i \le k$

where $\sigma_i$ is called a singular value of $A$.

For $k < i \le m$, extend $u_1, u_2, \dots, u_k$ with $u_{k+1}, \dots, u_m$ so that $\{u_1, u_2, \dots, u_m\}$ becomes an orthonormal basis of $m$-dimensional space; that is, the set $\{u_1, \dots, u_k\}$ is extended to the orthonormal basis $\{u_1, \dots, u_m\}$. Similarly, extend $v_1, v_2, \dots, v_k$ with $v_{k+1}, \dots, v_n$ (these $n-k$ vectors lie in the null space of $A$, i.e., they form a basis of the solution space of $Ax = 0$), making $\{v_1, v_2, \dots, v_n\}$ an orthonormal basis of $n$-dimensional space.

We can then write

$A\,[\,v_1\ v_2\ \cdots\ v_n\,] = [\,u_1\ u_2\ \cdots\ u_m\,]\,\Sigma, \qquad \text{i.e.} \qquad AV = U\Sigma$

Multiplying both sides by $V^T$ gives the singular value decomposition of the matrix:

$A = U \Sigma V^T$

where $V$ is an $n \times n$ orthogonal matrix, $U$ is an $m \times m$ orthogonal matrix, and $\Sigma$ is an $m \times n$ diagonal matrix whose diagonal entries are the singular values $\sigma_1, \dots, \sigma_k$ followed by zeros.

We can now describe the mapping performed by the matrix $A$: if a (hyper)rectangle is drawn in $n$-dimensional space with its edges along the directions of the eigenvectors of $A^T A$, then its image under $A$ is still a (hyper)rectangle.

The eigenvectors of $A^T A$ are called the right singular vectors of $A$, and the vectors $u_i = \frac{1}{\sigma_i} A v_i$ are in fact eigenvectors of $A A^T$; they are called the left singular vectors of $A$. Below, the SVD is used to prove the full-rank decomposition claimed at the beginning of the article.

Using block matrix multiplication:

$A = U \Sigma V^T = [\,U_1\ \ U_2\,] \begin{bmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} = U_1 \Sigma_1 V_1^T + 0$

You can see that the second term is zero, so

$A = U_1 \Sigma_1 V_1^T$

where $U_1$ holds the first $k$ columns of $U$, $\Sigma_1$ is the $k \times k$ diagonal block of singular values, and $V_1$ holds the first $k$ columns of $V$. Letting

$X = U_1 \Sigma_1, \qquad Y = V_1^T$

we get $A = XY$, which is the full-rank decomposition of $A$.
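The decomposition and the full-rank factorization above can be checked numerically. Here is a minimal sketch using NumPy's built-in SVD (the rank-2 matrix is an arbitrary example):

import numpy as np

A = np.array([[1., 2., 3.],
              [2., 4., 6.],
              [1., 0., 1.]])              # a 3x3 matrix of rank 2

U, s, Vt = np.linalg.svd(A)               # A = U * diag(s) * Vt
k = int(np.sum(s > 1e-10))                # numerical rank = number of nonzero singular values

# keep the first k singular triplets: A = U1 * Sigma1 * V1^T
U1, Sigma1, V1t = U[:, :k], np.diag(s[:k]), Vt[:k, :]

X = U1.dot(Sigma1)                        # X = U1 * Sigma1   (m x k)
Y = V1t                                   # Y = V1^T          (k x n)
print(np.allclose(A, X.dot(Y)))           # True: A = XY is a full-rank decomposition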

3. Basic SVD

As indicated above, such a decomposition exists for the rating matrix $R$, so $R$ can be represented as the product of two matrices $P$ and $Q$:

$R_{U \times I} = P_{U \times K}\, Q_{K \times I}$

Here $U$ is the number of users, $I$ is the number of items, and $K = \mathrm{rank}(R)$. Then, using the known ratings in $R$, we train $P$ and $Q$ so that their product best fits the known ratings; an unknown rating can then be obtained as a row of $P$ times a column of $Q$:

$\hat{r}_{ui} = \sum_{k=1}^{K} p_{uk}\, q_{ki}$

This is the prediction of user $u$'s rating of item $i$; it equals the $u$-th row of the $P$ matrix multiplied by the $i$-th column of the $Q$ matrix. This is the most basic SVD algorithm. The question now is how to train on the known ratings to obtain the concrete values of $P$ and $Q$.

Suppose a known rating is $r_{ui}$. The error between the true value and the predicted value is

$e_{ui} = r_{ui} - \hat{r}_{ui}$

We can then compute the total sum of squared errors over all known ratings:

$SSE = \sum_{u,i} e_{ui}^2 = \sum_{u,i} \left( r_{ui} - \sum_{k=1}^{K} p_{uk}\, q_{ki} \right)^2$

By training $P$ and $Q$ to minimize the SSE, they best fit $R$. So how do we minimize the SSE? This article uses gradient descent.

Using gradient descent, we first need the gradient of the SSE with respect to the variable $p_{uk}$ (the element in row $u$, column $k$ of the $P$ matrix). By the chain rule, the derivative of a single error term $e_{ui}^2$ is $2 e_{ui}$ times the derivative of $e_{ui}$ with respect to $p_{uk}$:

$\frac{\partial}{\partial p_{uk}}\, e_{ui}^2 = 2\, e_{ui}\, \frac{\partial e_{ui}}{\partial p_{uk}}$

Because

$e_{ui} = r_{ui} - \sum_{k=1}^{K} p_{uk}\, q_{ki}$

if the sum in parentheses is expanded, the only term involving $p_{uk}$ is $p_{uk} q_{ki}$; the derivative of every other term with respect to $p_{uk}$ is zero, so

$\frac{\partial e_{ui}}{\partial p_{uk}} = -q_{ki}$

and therefore, for a single known rating $r_{ui}$ (the full gradient sums this over all items $i$ rated by user $u$),

$\frac{\partial}{\partial p_{uk}}\, SSE = -2\, e_{ui}\, q_{ki}$

To make the formula simpler, we put a factor of one half in front of the objective:

$SSE = \frac{1}{2} \sum_{u,i} e_{ui}^2$

This has no effect on the result; it just removes the 2 in front of the derivative and is easier to read. We get

$\frac{\partial}{\partial p_{uk}}\, SSE = -\, e_{ui}\, q_{ki}$

Now that the gradient of the objective function at $p_{uk}$ is obtained, gradient descent moves $p_{uk}$ in the direction of the negative gradient. Let the update step size (that is, the learning rate) be $\eta$. The $p_{uk}$ update is then

$p_{uk} \leftarrow p_{uk} + \eta\, e_{ui}\, q_{ki}$

and in the same way the $q_{ki}$ update is

$q_{ki} \leftarrow q_{ki} + \eta\, e_{ui}\, p_{uk}$
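As a concrete illustration, here is a minimal sketch that applies these two update rules one rating at a time (the triples and hyper-parameter values are toy examples; the two possible update schedules are discussed next):

import numpy as np

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0)]   # (u, i, r_ui) toy triples
n_users, n_items, K = 3, 2, 2
eta = 0.01                                 # learning rate

np.random.seed(0)
P = np.random.rand(n_users, K) * 0.1       # user factors, small random start
Q = np.random.rand(K, n_items) * 0.1       # item factors

for epoch in range(200):
    for u, i, r in ratings:
        e = r - P[u, :].dot(Q[:, i])       # e_ui = r_ui - sum_k p_uk * q_ki
        pu_old = P[u, :].copy()            # keep old p_u so both updates use it
        P[u, :] += eta * e * Q[:, i]       # p_uk <- p_uk + eta * e_ui * q_ki
        Q[:, i] += eta * e * pu_old        # q_ki <- q_ki + eta * e_ui * p_uk

print(P.dot(Q))                            # fitted ratings approach the known ones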

Now that the update formulas are available, let's discuss how the updates are carried out. There are two options: 1. compute the prediction errors for all known ratings first and then update $P$ and $Q$; 2. update $p_u$ and $q_i$ immediately after computing each single $e_{ui}$. Both methods have names: the first is batch gradient descent, the second is stochastic gradient descent. The difference is that batch gradient descent uses this pass's updated values only in the next pass, while stochastic gradient descent may already use values updated by the previous sample when processing the current sample. Because the randomness brings many benefits, such as helping to escape local optima, most implementations now tend to use stochastic gradient descent for the updates.

4. RSVD

The above is the basic SVD algorithm. The problem is that the training fits only the known ratings; fitting this part of the data too closely can lead to poor performance on the test set. This is the overfitting problem.

So how do we avoid overfitting? By adding a regularization term (a penalty term) to the objective function. Since we do not know in advance which variables will cause overfitting, all values in the $P$ and $Q$ matrices are penalized:

$SSE = \frac{1}{2} \sum_{u,i} e_{ui}^2 + \frac{1}{2} \lambda \sum_{u} |p_u|^2 + \frac{1}{2} \lambda \sum_{i} |q_i|^2$

At this point the derivative of the objective function with respect to $p_{uk}$ changes, and we need the derivative with the penalty terms included. The derivative of the first term with respect to $p_{uk}$ was found above, the derivative of the second term with respect to $p_{uk}$ is easy to obtain, and the third term is independent of $p_{uk}$, so its derivative is zero. Therefore

$\frac{\partial}{\partial p_{uk}}\, SSE = -\, e_{ui}\, q_{ki} + \lambda\, p_{uk}$

and the derivative of the SSE with respect to $q_{ki}$ is

$\frac{\partial}{\partial q_{ki}}\, SSE = -\, e_{ui}\, p_{uk} + \lambda\, q_{ki}$

Moving the two variables in the direction of the negative gradient gives the new updates:

$p_{uk} \leftarrow p_{uk} + \eta\,(e_{ui}\, q_{ki} - \lambda\, p_{uk})$
$q_{ki} \leftarrow q_{ki} + \eta\,(e_{ui}\, p_{uk} - \lambda\, q_{ki})$

This is SVD with regularization, also known as RSVD.

5. Adding Bias to the SVD (RSVD)

There are many variants of the SVD algorithm and the naming is not uniform; adding a few parameters to the prediction formula produces yet another name. Since a user's rating of an item depends not only on the interaction between the user and the item, but also on properties unique to the user and to the item, Koren changed the SVD prediction formula to

$\hat{r}_{ui} = \mu + b_u + b_i + \sum_{k=1}^{K} p_{uk}\, q_{ki}$

The first term is the overall average rating, $b_u$ is user $u$'s bias, and $b_i$ is item $i$'s bias. The two newly added variables also need to be penalized in the SSE, which becomes:

$SSE = \frac{1}{2} \sum_{u,i} e_{ui}^2 + \frac{1}{2} \lambda \left( \sum_u |p_u|^2 + \sum_i |q_i|^2 + \sum_u b_u^2 + \sum_i b_i^2 \right)$

From the formula above we can see that the derivatives of the SSE with respect to $p_{uk}$ and $q_{ki}$ are unchanged, but now the variables $b_u$ and $b_i$ also need to be updated. First take the derivative of the SSE with respect to $b_u$: only the first and fourth terms involve $b_u$. The derivative of the first term with respect to $b_u$ follows the chain rule as before, and the fourth term can be differentiated directly, giving the partial derivative

$\frac{\partial}{\partial b_u}\, SSE = -\, e_{ui} + \lambda\, b_u$

Similarly, the derivative with respect to $b_i$ is

$\frac{\partial}{\partial b_i}\, SSE = -\, e_{ui} + \lambda\, b_i$

Moving each in the direction of its negative gradient gives the updates

$b_u \leftarrow b_u + \eta\,(e_{ui} - \lambda\, b_u)$
$b_i \leftarrow b_i + \eta\,(e_{ui} - \lambda\, b_i)$
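For concreteness, here is a minimal sketch of one stochastic update step with the biases included (the function name and default values are illustrative; a full program is given in part 9):

import numpy as np

def biased_svd_step(r_ui, mu, bu, bi, pu, qi, eta=0.01, lam=0.1):
    # bu, bi are scalars; pu, qi are K-dimensional numpy vectors
    e = r_ui - (mu + bu + bi + pu.dot(qi))         # prediction error e_ui
    bu_new = bu + eta * (e - lam * bu)             # b_u update
    bi_new = bi + eta * (e - lam * bi)             # b_i update
    pu_new = pu + eta * (e * qi - lam * pu)        # p_u update (vectorized over k)
    qi_new = qi + eta * (e * pu - lam * qi)        # q_i update, using the old p_u
    return bu_new, bi_new, pu_new, qi_new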

This is the SVD with bias terms added (biased RSVD).

6. Asymmetric SVD (ASVD)

The full name is Asymmetric SVD; its prediction formula is

$\hat{r}_{ui} = \mu + b_u + b_i + q_i^T \left( |R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} (r_{uj} - b_{uj})\, x_j + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j \right)$

where $b_{uj} = \mu + b_u + b_j$ is the baseline prediction.

$R(u)$ denotes the set of items that user $u$ has rated, $N(u)$ denotes the set of items that user $u$ has browsed but not rated, and $x_j$ and $y_j$ are item attribute vectors.

This model is very interesting. Looking at the prediction formula, the user matrix $P$ has been removed; instead, the items the user has rated and the items the user has browsed but not rated are used to represent the user. This has a certain rationality, because a user's behavior record can reflect the user's preferences. Moreover, this model brings a great advantage: a mall or website may have tens of millions or even hundreds of millions of users, so storing the user factor matrix would occupy huge space, while the number of items is far smaller; the benefit of this model is therefore obvious. But it has a drawback: iteration takes too long. That is predictable, since it trades time for space.

Similarly, we need to compute the partial derivative of each parameter to find its update. Obviously, the updates of $b_u$ and $b_i$ are the same as those just derived. Write the vector $z$ as follows:

$z = |R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} (r_{uj} - b_{uj})\, x_j + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j$

What remain are the updates of $q_{ki}$, $x$, and $y$. Treating $z$ as $p_u$, the $q_{ki}$ update follows from the result obtained earlier:

$q_{ki} \leftarrow q_{ki} + \eta\,(e_{ui}\, z_k - \lambda\, q_{ki})$

Finding the derivative with respect to $x_j$ takes a little patience. The derivative of the penalty part of the SSE is easy to obtain, so we now work on the derivative of the error part with respect to $x_{jk}$:

$\frac{\partial}{\partial x_{jk}}\, \frac{1}{2} \sum_{i \in R(u)} e_{ui}^2 = \sum_{i \in R(u)} e_{ui}\, \frac{\partial e_{ui}}{\partial x_{jk}}$

In the summation above, both $i$ and $j$ belong to $R(u)$. Note that a component $z_m$ involves $x_{jk}$ only when $m = k$; the derivative of every other component with respect to $x_{jk}$ is zero. Because

$\frac{\partial \hat{r}_{ui}}{\partial z_k} = q_{ki} \qquad \text{and} \qquad \frac{\partial z_k}{\partial x_{jk}} = |R(u)|^{-\frac{1}{2}}\, (r_{uj} - b_{uj})$

we have

$\frac{\partial e_{ui}}{\partial x_{jk}} = -\, q_{ki}\, |R(u)|^{-\frac{1}{2}}\, (r_{uj} - b_{uj})$

So the derivative of the SSE with respect to $x_{jk}$ is

$\frac{\partial}{\partial x_{jk}}\, SSE = -\, |R(u)|^{-\frac{1}{2}}\, (r_{uj} - b_{uj}) \sum_{i \in R(u)} e_{ui}\, q_{ki} + \lambda\, x_{jk}$

and the derivative of the SSE with respect to $y_{jk}$ is

$\frac{\partial}{\partial y_{jk}}\, SSE = -\, |N(u)|^{-\frac{1}{2}} \sum_{i \in R(u)} e_{ui}\, q_{ki} + \lambda\, y_{jk}$

So we obtain the update equations for $x_{jk}$ and $y_{jk}$:

$x_{jk} \leftarrow x_{jk} + \eta \left( |R(u)|^{-\frac{1}{2}}\, (r_{uj} - b_{uj}) \sum_{i \in R(u)} e_{ui}\, q_{ki} - \lambda\, x_{jk} \right)$
$y_{jk} \leftarrow y_{jk} + \eta \left( |N(u)|^{-\frac{1}{2}} \sum_{i \in R(u)} e_{ui}\, q_{ki} - \lambda\, y_{jk} \right)$
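To make the prediction formula concrete, here is a sketch of the ASVD prediction step (the containers and names are illustrative assumptions, not code from the paper):

import numpy as np

def asvd_predict(u, i, mu, bu, bi, q, x, y, R_u, N_u, ratings):
    # bu, bi: dicts of user/item biases
    # q, x, y: dicts mapping an item id to a K-dimensional numpy vector
    # R_u: items u has rated; N_u: items u browsed but did not rate
    # ratings: dict mapping (user, item) to the known rating
    z = np.zeros_like(q[i])
    if R_u:
        z += sum((ratings[(u, j)] - (mu + bu[u] + bi[j])) * x[j]
                 for j in R_u) / np.sqrt(len(R_u))        # explicit-feedback term
    if N_u:
        z += sum(y[j] for j in N_u) / np.sqrt(len(N_u))   # implicit-feedback term
    return mu + bu[u] + bi[i] + q[i].dot(z)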


This is called Asymmetric SVD (ASVD).

7. SVD++

This model, SVD++, is also mentioned in Koren's article; its prediction formula is

$\hat{r}_{ui} = \mu + b_u + b_i + q_i^T \left( p_u + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j \right)$

Here $N(u)$ denotes user $u$'s behavior record (the set of items browsed or rated). After working through the ASVD derivation, this one should be very simple: the $p_{uk}$ and $q_{ki}$ updates take the same form as before (with $p_u + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j$ in place of $p_u$ in the $q$ update), and the $y_{jk}$ update is the same as in ASVD:

$y_{jk} \leftarrow y_{jk} + \eta \left( |N(u)|^{-\frac{1}{2}}\, e_{ui}\, q_{ki} - \lambda\, y_{jk} \right)$
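Again for concreteness, a sketch of the SVD++ prediction (the names and containers are illustrative):

import numpy as np

def svdpp_predict(u, i, mu, bu, bi, p, q, y, N_u):
    # bu, bi: dicts of user/item biases
    # p, q, y: dicts mapping a user/item id to a K-dimensional numpy vector
    # N_u: the set of items user u has browsed or rated
    implicit = sum(y[j] for j in N_u) / np.sqrt(len(N_u)) if N_u else 0.0
    return mu + bu[u] + bi[i] + q[i].dot(p[u] + implicit)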

These are the SVD algorithms that Koren used in the Netflix Prize competition.

8. Dual Algorithms

A dual algorithm is obtained by swapping $u$ and $i$ in the preceding prediction formulas. For RSVD, the positions of $u$ and $i$ are symmetric, so its dual is no different from itself. But for ASVD and SVD++, the dual algorithm is sometimes more accurate, and if you combine the predictions of the dual algorithm with those of the original algorithm, you may be amazed by the improvement.

The dual ASVD prediction formula:

$\hat{r}_{ui} = \mu + b_u + b_i + p_u^T \left( |R(i)|^{-\frac{1}{2}} \sum_{v \in R(i)} (r_{vi} - b_{vi})\, x_v + |N(i)|^{-\frac{1}{2}} \sum_{v \in N(i)} y_v \right)$

Here $R(i)$ denotes the set of users who have rated item $i$, and $N(i)$ denotes the set of users who have browsed item $i$ without rating it. Because the number of users is large, the dual ASVD occupies a lot of space, so a trade-off has to be made here.

The dual SVD++ prediction formula:

$\hat{r}_{ui} = \mu + b_u + b_i + p_u^T \left( q_i + |N(i)|^{-\frac{1}{2}} \sum_{v \in N(i)} y_v \right)$

Here $N(i)$ denotes the set of users who have interacted with (browsed or rated) item $i$.

Implementing these dual variants is very simple: just swap the positions of the user ID and the item ID when reading the data, that is, train on the transpose of the $R$ matrix.
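For example, with (user, item, rating) triples stored in a NumPy array (a toy sketch):

import numpy as np

data = np.array([[1, 10, 4],
                 [2, 10, 3],
                 [1, 20, 5]])        # (user_id, item_id, rating) triples
dual_data = data[:, [1, 0, 2]]       # swap the ID columns: training on dual_data trains R^T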

9. SVD in Practice

Task introduction:

Dataset: the u1 split of MovieLens 100K
Algorithm: biased SVD (see part 5 for the parameter update formulas)
Evaluation metric: RMSE

# -*- coding: utf-8 -*-
from __future__ import print_function  # use the Python 3.x print function
from __future__ import division        # exact (true) division
import numpy as np

def load_data(path):
    data = []
    with open(path, mode='r') as f:
        for line in f:
            (user_id, movie_id, rating, time_stamp) = line.strip().split('\t')
            data.append([user_id, movie_id, rating])
    data = np.array(data).astype(np.uint16)
    return data

# key variables: ave, bi, bu, qi, pu
# core parameters: k (rank), gamma (learning rate), lam (regularization factor)
class SVD:
    def __init__(self, X, k=20):
        """k is the length of the latent-factor vectors"""
        self.X = np.array(X)
        self.k = k
        self.ave = np.mean(self.X[:, 2])
        print("the train data size is", self.X.shape)
        self.bi = {}
        self.bu = {}
        self.qi = {}
        self.pu = {}
        self.movie_user = {}
        self.user_movie = {}
        for i in range(self.X.shape[0]):
            uid = self.X[i][0]
            mid = self.X[i][1]
            rat = self.X[i][2]
            self.movie_user.setdefault(mid, {})
            self.user_movie.setdefault(uid, {})
            self.movie_user[mid][uid] = rat
            self.user_movie[uid][mid] = rat
            self.bi.setdefault(mid, 0)
            self.bu.setdefault(uid, 0)
            self.qi.setdefault(mid, np.random.random((self.k, 1)) / 10 * np.sqrt(self.k))
            self.pu.setdefault(uid, np.random.random((self.k, 1)) / 10 * np.sqrt(self.k))

    def pred(self, uid, mid):
        # unseen users/items fall back to zero biases and zero factor vectors
        self.bi.setdefault(mid, 0)
        self.bu.setdefault(uid, 0)
        self.qi.setdefault(mid, np.zeros((self.k, 1)))
        self.pu.setdefault(uid, np.zeros((self.k, 1)))
        ans = self.ave + self.bi[mid] + self.bu[uid] + np.sum(self.qi[mid] * self.pu[uid])
        # clip the prediction to the valid rating range [1, 5]
        if ans > 5:
            return 5
        elif ans < 1:
            return 1
        return ans

    def train(self, steps=20, gamma=0.04, lam=0.15):
        # 'lam' is the regularization factor (lambda is a reserved word in Python)
        for step in range(steps):
            print('the', step, 'th step is running')
            rmse_sum = 0.0
            kk = np.random.permutation(self.X.shape[0])  # shuffle the sample order for SGD
            for j in range(self.X.shape[0]):
                i = kk[j]
                uid = self.X[i][0]
                mid = self.X[i][1]
                rat = self.X[i][2]
                eui = rat - self.pred(uid, mid)          # prediction error e_ui
                rmse_sum += eui ** 2
                # biased-SVD updates from part 5
                self.bu[uid] += gamma * (eui - lam * self.bu[uid])
                self.bi[mid] += gamma * (eui - lam * self.bi[mid])
                qi_old = self.qi[mid].copy()             # keep old q_i so both updates use it
                self.qi[mid] = self.qi[mid] + gamma * (eui * self.pu[uid] - lam * self.qi[mid])
                self.pu[uid] = self.pu[uid] + gamma * (eui * qi_old - lam * self.pu[uid])
            gamma = gamma * 0.93  # shrink the learning rate after every pass
            print('the rmse of this step on train data is', np.sqrt(rmse_sum / self.X.shape[0]))

    def test(self, test_X):
        test_X = np.array(test_X)
        rmse_sum = 0.0
        for i in range(test_X.shape[0]):
            eui = test_X[i][2] - self.pred(test_X[i][0], test_X[i][1])
            rmse_sum += eui ** 2
        print('the rmse on test data is', np.sqrt(rmse_sum / test_X.shape[0]))
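Assuming the u1.base and u1.test files from the MovieLens 100K archive are available locally (the paths below are assumptions), a typical run looks like this:

train_data = load_data('./ml-100k/u1.base')
test_data = load_data('./ml-100k/u1.test')

svd = SVD(train_data, k=20)
svd.train(steps=20, gamma=0.04, lam=0.15)   # RMSE on the training data is printed per step
svd.test(test_data)                         # RMSE on the u1 test split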
