Main reference paper: "A Guide to Singular Value Decomposition for Collaborative Filtering"
To be honest, I was confused at the beginning, because instead of reading the paper I started by searching online for the concept and uses of SVD. Most of what turns up is the following idea: assume C is an m*n matrix; then it can be decomposed into three matrices of sizes m*r, r*r and r*n, which can greatly reduce the storage cost. The important caveat is that this usage comes from information retrieval, where the matrix C is complete, so SVD can be applied to it directly. In a recommender system, however, the user-item rating matrix is incomplete, i.e. a sparse matrix, so it cannot be decomposed that way directly. Some articles suggest filling the missing entries with, for example, the average rating, but the results are not great.

After reading the paper above, it becomes clear that the application of "SVD" in collaborative filtering is really an optimization problem. We assume there is no direct relationship between users and items, but instead define a dimension called a feature. Features characterize items, for example whether a movie is a comedy or a tragedy, an action film or a romance. Users are related to features, for example one user prefers romance movies while another likes action films, and items are related to features as well, for example one movie is a comedy and another is a tragedy. Through these connections to features, we can factor a rating matrix of size m*n (m is the number of users, n is the number of items) into the product of two matrices: user_feature * t(item_feature), where t() denotes the transpose, user_feature is m*k (k is the feature dimension and can be chosen arbitrarily), and item_feature is n*k. What we have to do is find the values in these two matrices so that their product is as close as possible to the original rating matrix.
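In symbols, writing R for the m*n rating matrix, U for the m*k user_feature matrix and M for the n*k item_feature matrix (U_i and M_j being their i-th and j-th rows), this simply says:

R \approx U M^T, \qquad p(U_i, M_j) = U_i^T M_j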
"As close as possible" here means that an error E (the paper calls it the expectation) should be as small as possible. The error formula is as follows:
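Reconstructed from the description below (the paper uses two separate regularization coefficients k_u and k_m; a single \lambda is used here to match the code later on), the error is roughly:

E = \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{n} I_{ij} \left( V_{ij} - p(U_i, M_j) \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{m} \| U_i \|^2 + \frac{\lambda}{2} \sum_{j=1}^{n} \| M_j \|^2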
where m is the number of users, n is the number of items, and I[i][j] indicates whether user i has rated item j; we only need the predictions to be close on ratings that actually exist, so unrated entries are ignored. V_ij is the rating given in the training data, i.e. the actual rating, and p(U_i, M_j) is our predicted rating of user i on item j, computed as the dot product of the two feature vectors. The last two terms are regularization terms, there mainly to prevent overfitting, and the factor 1/2 is added just to make the derivative cleaner.
Our goal is to make the error E as small as possible, which is a minimization problem, so we can solve it with gradient descent. Gradient descent takes the derivative at the current point and then moves in the direction opposite to the gradient, which quickly leads to a local minimum. Taking the derivative of the formula above gives:
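The negative gradients work out to roughly:

-\frac{\partial E}{\partial U_i} = \sum_{j=1}^{n} I_{ij} \left( V_{ij} - p(U_i, M_j) \right) M_j - \lambda U_i

-\frac{\partial E}{\partial M_j} = \sum_{i=1}^{m} I_{ij} \left( V_{ij} - p(U_i, M_j) \right) U_i - \lambda M_j

Each gradient-descent step then moves U_i and M_j a small step of size \gamma (the learning rate) in these directions.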
The flow of this batch algorithm is therefore roughly the following:
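A rough C++ sketch of one batch iteration, reusing the global arrays of the full program further down (gradU and gradM are temporary accumulators introduced only for this sketch, and the prediction here is the plain dot product, without the biases added later):

// uses USERMAX, ITEMMAX, FEATURE, userF, itemF, rating, I, lamda, Gamma
// from the full program below
static double gradU[USERMAX][FEATURE];
static double gradM[ITEMMAX][FEATURE];

void batchIteration() {
    // start from the regularization part of the negative gradient
    for (int i = 0; i < USERMAX; i++)
        for (int f = 0; f < FEATURE; f++)
            gradU[i][f] = -lamda * userF[i][f];
    for (int j = 0; j < ITEMMAX; j++)
        for (int f = 0; f < FEATURE; f++)
            gradM[j][f] = -lamda * itemF[j][f];
    // accumulate the error part over every observed rating
    for (int i = 0; i < USERMAX; i++)
        for (int j = 0; j < ITEMMAX; j++)
            if (I[i][j]) {
                double pred = 0;                        // plain dot product U_i . M_j
                for (int f = 0; f < FEATURE; f++)
                    pred += userF[i][f] * itemF[j][f];
                double e = rating[i][j] - pred;
                for (int f = 0; f < FEATURE; f++) {
                    gradU[i][f] += e * itemF[j][f];
                    gradM[j][f] += e * userF[i][f];
                }
            }
    // only after the whole matrix has been visited are the factors updated
    for (int i = 0; i < USERMAX; i++)
        for (int f = 0; f < FEATURE; f++)
            userF[i][f] += Gamma * gradU[i][f];
    for (int j = 0; j < ITEMMAX; j++)
        for (int f = 0; f < FEATURE; f++)
            itemF[j][f] += Gamma * gradM[j][f];
}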
The implementation is fairly straightforward; RMSE is used to evaluate the result and will be described later.
The algorithm above is called the batch learning algorithm, because each iteration computes the error over the whole matrix (hence a big batch). There are also incremental learning algorithms. The difference is that the batch version computes the error over the entire matrix, while the incremental versions compute the error over only one row or one entry of the matrix: computing the error for one row is called incomplete incremental learning, and computing it for a single entry is called complete incremental learning.
The error used by incomplete incremental learning is computed for one row of the matrix, i.e. for a single user i:
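With the same notation, the per-user error should look roughly like this (only user i's row is involved):

E^{(i)} = \frac{1}{2} \sum_{j=1}^{n} I_{ij} \left( V_{ij} - p(U_i, M_j) \right)^2 + \frac{\lambda}{2} \| U_i \|^2 + \frac{\lambda}{2} \sum_{j=1}^{n} I_{ij} \| M_j \|^2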
Taking the derivative then gives the following:
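which gives, roughly:

-\frac{\partial E^{(i)}}{\partial U_i} = \sum_{j=1}^{n} I_{ij} \left( V_{ij} - p(U_i, M_j) \right) M_j - \lambda U_i

-\frac{\partial E^{(i)}}{\partial M_j} = I_{ij} \left( V_{ij} - p(U_i, M_j) \right) U_i - \lambda I_{ij} M_j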
The idea of the algorithm is the same as before, except that the parameters are updated once per user rather than once per full pass: for each user i we compute the gradient over the items that user has rated and update U_i and the corresponding M_j before moving on to the next user.
The complete (fully) incremental learning algorithm computes the error for each individual rating:
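For a single observed rating V_ij the error is simply (regularizing only the two vectors involved):

E_{ij} = \frac{1}{2} \left( V_{ij} - p(U_i, M_j) \right)^2 + \frac{\lambda}{2} \| U_i \|^2 + \frac{\lambda}{2} \| M_j \|^2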
Taking the derivative gives:
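so for each rating the update rules become the following, with \gamma the learning rate; these are exactly the updates used in the code below:

U_i \leftarrow U_i + \gamma \left( \left( V_{ij} - p(U_i, M_j) \right) M_j - \lambda U_i \right)

M_j \leftarrow M_j + \gamma \left( \left( V_{ij} - p(U_i, M_j) \right) U_i - \lambda M_j \right)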
So the whole algorithm flow looks like this:
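A rough C++ sketch of one such pass, again reusing the globals of the full program below (the prediction here is the plain dot product; the full program also adds the mean and the biases, which come up next):

// uses USERMAX, ITEMMAX, FEATURE, userF, itemF, rating, I, lamda, Gamma
// from the full program below
void sgdPass() {
    for (int i = 0; i < USERMAX; i++)
        for (int j = 0; j < ITEMMAX; j++)
            if (I[i][j]) {                              // every observed rating...
                double pred = 0;                        // plain dot product U_i . M_j
                for (int f = 0; f < FEATURE; f++)
                    pred += userF[i][f] * itemF[j][f];
                double e = rating[i][j] - pred;
                for (int f = 0; f < FEATURE; f++) {
                    double uf = userF[i][f], mf = itemF[j][f];
                    userF[i][f] += Gamma * (e * mf - lamda * uf);  // ...updates the
                    itemF[j][f] += Gamma * (e * uf - lamda * mf);  //    factors right away
                }
            }
}
// one call is one iteration; repeat until the RMSE stops improving or ITERMAX is reached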
All of these are variants of the same "SVD" idea; only the implementations differ. According to the paper, the third one, complete incremental learning, works best and converges very quickly.
Of course there is an even better variant, which takes the bias of each user and each item into account. Bias here means each person's systematic deviation: for example, two users A and B may both think a movie is good, but A rates conservatively and gives it 3 points while B rates loosely and gives it 4 points. A model that accounts for per-user and per-item bias is therefore more accurate than the algorithm above. Previously the rating was computed directly as user_feature * t(item_feature), with t() the transpose; now the various biases are added in, as follows:
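With the notation explained just below, the prediction becomes:

p(U_i, M_j) = a + a_i + b_j + U_i^T M_j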
where a is the average of all ratings, a_i is the bias of user i, b_j is the bias of item j, and the matrix product term has the same meaning as before.
The error formula and its derivatives then change accordingly (only the derivation for the bias terms is new; the derivation for the feature matrices is the same as above):
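The biases simply get their own regularization terms; writing e_{ij} = V_{ij} - p(U_i, M_j), their per-rating updates are (these are exactly the updates used in the code below):

a_i \leftarrow a_i + \gamma \left( e_{ij} - \lambda a_i \right)

b_j \leftarrow b_j + \gamma \left( e_{ij} - \lambda b_j \right)

The feature vectors U_i and M_j are updated exactly as before.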
Of course, talk without practice is cheap, so we implemented the last algorithm, the fully incremental one with biases. The data comes from the MovieLens 100K data set, which contains ratings from roughly 1000 users on roughly 2000 items (the implementation simply uses fixed-size arrays, so it would not cope with much larger data; the main point is to verify the effect of gradient descent). The ua.base set is used to train the model and ua.test is used for testing, and the result is measured with the following formula:
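i.e. the usual RMSE over the N ratings of the test set:

\text{RMSE} = \sqrt{ \frac{1}{N} \sum_{(i,j) \in \text{test}} \left( V_{ij} - p(U_i, M_j) \right)^2 }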
Put bluntly, it is just the root of the mean squared error... We also record the RMSE on the training set after each iteration.
The code is as follows:
#include <iostream>
#include <fstream>
#include <string>
#include <cstring>
#include <cstdlib>
#include <math.h>
using namespace std;

const int USERMAX = 1000;
const int ITEMMAX = 2000;
const int FEATURE = 50;
const int ITERMAX = 20;

double rating[USERMAX][ITEMMAX];
int I[USERMAX][ITEMMAX];          // indicates whether user i has rated item j
double userF[USERMAX][FEATURE];   // user_feature matrix
double itemF[ITEMMAX][FEATURE];   // item_feature matrix
double biasU[USERMAX];            // per-user bias
double biasI[ITEMMAX];            // per-item bias
double lamda = 0.15;              // regularization coefficient
double Gamma = 0.04;              // learning rate
double mean;                      // global rating average

// predicted rating of user i on item j, clamped to the valid range [1, 5]
double predict(int i, int j)
{
    double rate = mean + biasU[i] + biasI[j];
    for (int f = 0; f < FEATURE; f++)
        rate += userF[i][f] * itemF[j][f];
    if (rate < 1) rate = 1;
    else if (rate > 5) rate = 5;
    return rate;
}

// RMSE over all observed ratings of the training matrix
double calRMSE()
{
    int cnt = 0;
    double total = 0;
    for (int i = 0; i < USERMAX; i++)
        for (int j = 0; j < ITEMMAX; j++)
        {
            double rate = predict(i, j);
            total += I[i][j] * (rating[i][j] - rate) * (rating[i][j] - rate);
            cnt += I[i][j];
        }
    return pow(total / cnt, 0.5);
}

// average of all observed ratings
double calMean()
{
    double total = 0;
    int cnt = 0;
    for (int i = 0; i < USERMAX; i++)
        for (int j = 0; j < ITEMMAX; j++)
        {
            total += I[i][j] * rating[i][j];
            cnt += I[i][j];
        }
    return total / cnt;
}

// initialize the biases with each user's / item's average deviation from the mean
void initBias()
{
    memset(biasU, 0, sizeof(biasU));
    memset(biasI, 0, sizeof(biasI));
    mean = calMean();
    for (int i = 0; i < USERMAX; i++)
    {
        double total = 0;
        int cnt = 0;
        for (int j = 0; j < ITEMMAX; j++)
            if (I[i][j])
            {
                total += rating[i][j] - mean;
                cnt++;
            }
        biasU[i] = (cnt > 0) ? total / cnt : 0;
    }
    for (int j = 0; j < ITEMMAX; j++)
    {
        double total = 0;
        int cnt = 0;
        for (int i = 0; i < USERMAX; i++)
            if (I[i][j])
            {
                total += rating[i][j] - mean;
                cnt++;
            }
        biasI[j] = (cnt > 0) ? total / cnt : 0;
    }
}

void train()
{
    // read the rating matrix
    memset(rating, 0, sizeof(rating));
    memset(I, 0, sizeof(I));
    ifstream in("ua.base");
    if (!in)
    {
        cout << "file not exist" << endl;
        exit(1);
    }
    int userId, itemId, rate;
    string timeStamp;
    while (in >> userId >> itemId >> rate >> timeStamp)
    {
        rating[userId][itemId] = rate;
        I[userId][itemId] = 1;
    }
    initBias();
    // initialize the factor matrices with small random values
    for (int i = 0; i < USERMAX; i++)
        for (int f = 0; f < FEATURE; f++)
            userF[i][f] = (rand() % 10) / 10.0;
    for (int j = 0; j < ITEMMAX; j++)
        for (int f = 0; f < FEATURE; f++)
            itemF[j][f] = (rand() % 10) / 10.0;
    // fully incremental learning: update after every observed rating
    int iterCnt = 0;
    while (iterCnt < ITERMAX)
    {
        for (int i = 0; i < USERMAX; i++)
            for (int j = 0; j < ITEMMAX; j++)
                if (I[i][j])
                {
                    double predictRate = predict(i, j);
                    double eui = rating[i][j] - predictRate;
                    biasU[i] += Gamma * (eui - lamda * biasU[i]);
                    biasI[j] += Gamma * (eui - lamda * biasI[j]);
                    for (int f = 0; f < FEATURE; f++)
                    {
                        userF[i][f] += Gamma * (eui * itemF[j][f] - lamda * userF[i][f]);
                        itemF[j][f] += Gamma * (eui * userF[i][f] - lamda * itemF[j][f]);
                    }
                }
        double rmse = calRMSE();
        cout << "loop " << iterCnt << ": rmse is " << rmse << endl;
        iterCnt++;
    }
}

void test()
{
    ifstream in("ua.test");
    if (!in)
    {
        cout << "file not exist" << endl;
        exit(1);
    }
    int userId, itemId, rate;
    string timeStamp;
    double total = 0;
    double cnt = 0;
    while (in >> userId >> itemId >> rate >> timeStamp)
    {
        double r = predict(userId, itemId);
        total += (r - rate) * (r - rate);
        cnt += 1;
    }
    cout << "test rmse is " << pow(total / cnt, 0.5) << endl;
}

int main()
{
    train();
    test();
    return 0;
}
The result: the RMSE converges very quickly. On the training data it drops to about 0.8 within the 20 iterations, and on the test set the RMSE is 0.949, which is a pretty good prediction result. Of course, the various parameters can be tuned here to obtain better experimental results... so-called black tech? :)