In the NNet part of the series, matrix factorization seems out of place at first, but after watching the first section of the lecture it becomes clear why it belongs there.
Lin first introduced a difficult kind of input in machine learning: categorical features.
Such features are essentially ID numbers (for example, user IDs), not numerical values.
To handle this situation, we need an encoding from categorical to numerical.
One of the most common encodings is binary vector (one-hot) encoding (which I have also used in practice, during an internship); the resulting binary vectors are then used as the model input.
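As a tiny illustration (my own sketch, with made-up user IDs), binary vector encoding maps each categorical ID to a vector that is 1 in exactly one position and 0 elsewhere:

```python
import numpy as np

# Hypothetical IDs; in a real recommender these would come from the data set.
user_ids = ["u17", "u42", "u03", "u99"]
index = {uid: i for i, uid in enumerate(user_ids)}

def one_hot(uid):
    """Binary vector encoding: 1 at the position of this user's ID, 0 elsewhere."""
    x = np.zeros(len(user_ids))
    x[index[uid]] = 1.0
    return x

print(one_hot("u42"))   # [0. 1. 0. 0.]
```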
Connecting this to the models learned earlier, an NNet can be used to learn the mapping.
However, a binary vector is not a true numerical vector: each input is 1 in exactly one dimension and 0 everywhere else, so the tanh units in the NNet are unnecessary (each input x feeds a nonzero value into only one dimension of each tanh, the output is therefore affected by that single dimension only, and tanh is monotonic in its input).
So we use the following simplified version, a linear network, in which the tanh is replaced by a plain Σ summation.
Here is a description of the notation:
1) V is a d×N matrix (d is the number of hidden units, N is the number of users): each column of V holds one user's weights on the hidden units.
2) W' is an M×d matrix (M is the number of movies): each row of W' holds one movie's weights on the hidden units.
Since each xn is a one-hot binary vector, h(xn) = W'vn (easy to verify by writing out the multiplication): the output h(xn) of the linear network is an M-dimensional vector representing user n's predicted rating for every movie.
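As a quick numerical check (a sketch with random matrices, not part of the lecture): feeding a one-hot xn through the linear network W'Vx simply picks out column n of V, so the output equals W'vn.

```python
import numpy as np

d, N, M = 3, 5, 4            # hidden units, users, movies (arbitrary small sizes)
V = np.random.randn(d, N)    # column n is user n's latent vector v_n
W = np.random.randn(d, M)    # column m is movie m's latent vector w_m

n = 2
x_n = np.zeros(N)
x_n[n] = 1.0                           # one-hot input for user n

h = W.T @ (V @ x_n)                    # linear network output h(x_n) = W' V x_n
print(np.allclose(h, W.T @ V[:, n]))   # True: identical to W' v_n
```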
In summary, the linear network for the recommender system needs to learn two things: the V matrix (user-to-hidden-unit, i.e. user latent factors) and the W matrix (item-to-hidden-unit, i.e. item latent factors).
Before introducing the learning method, Lin re-organized the linear network problem.
Linear network, viewed per movie: for the m-th movie there is a corresponding wm that linearly weights the transformed input, hm(x) = wm'Φ(x).
As a result, the learning goals are clear:
1) the coefficient matrix of the transform (V)
2) the coefficient matrix of the linear models (W)
In conclusion, since the input of the linear network is a one-hot binary vector, the original linear network problem becomes a variant: rnm = wm'vn → R = V'W, which is exactly a matrix factorization problem. (I personally like this motivating explanation; it also makes clear why matrix factorization shows up in the NNet part of the course.)
As for the derivation from the linear network to matrix factorization, here are a few more steps based on my personal understanding:
h(x) = W'Vx (from the earlier slide)
= (Vx)'W (since h(x) is a vector it is fine to transpose it; the output changes from a column vector to a row vector, but the value in each position stays the same)
= x'V'W (by the transpose property (AB)' = B'A')
Then, stacking all inputs xn, n = 1, ..., N, row by row: X'V'W
= I(N)·V'W (each row of the matrix X' is one input binary vector; with the inputs arranged in ID order, X' is the N×N identity matrix)
= V'W (the original linear network problem is thus converted into the basic matrix factorization problem)
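The stacked form can also be checked numerically (again a small sketch of my own): when every one-hot input is fed in, the matrix collecting all outputs row by row is exactly V'W, the N×M matrix of predicted ratings.

```python
import numpy as np

d, N, M = 3, 5, 4
V = np.random.randn(d, N)
W = np.random.randn(d, M)

X_t = np.eye(N)                      # X': each row is a one-hot input, in ID order -> identity
R_hat = X_t @ V.T @ W                # stacked outputs h(x_1), ..., h(x_N) as rows
print(np.allclose(R_hat, V.T @ W))   # True: the network reduces to R = V' W
```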
Moreover, the decomposition has a nice physical interpretation: each hidden unit can be regarded as an implicit feature (comedy, action, ...), and V and W describe how users and movies relate to these hidden units.
Here's how to solve the model:
The optimization problem has two sets of variables, so, mimicking the alternating-minimization pattern seen in k-means, we optimize them in turn; this is the alternating least squares algorithm.
1) Fix V (i.e., fix the user-to-hidden-unit weights): learn wm for m = 1, ..., M. When learning each wm, feed in the pairs (vn, rnm) (the m-th column of R) for the users n who rated movie m, and run a linear regression without a bias term.
It is easy to form a misconception here: the rating matrix is empty at most positions, so one might think those positions also take part in the linear regression.
In fact, those missing values are not part of the regression at all (note that the error is computed only over the entries that actually have ratings).
2) Fixing W and learning V is symmetric, so the method is the same and is not repeated here.
The flow of the entire alternating least squares algorithm is as follows (a minimal code sketch follows the list):
1) Initialize V and W randomly.
2) Because Ein is lower-bounded and never increases during the alternating steps, the algorithm converges.
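Here is a minimal sketch of the whole procedure, assuming the ratings arrive as (user, movie, rating) triples (my own illustration, not Lin's code); note that each least-squares solve uses only the entries that actually have ratings.

```python
import numpy as np

def als(ratings, N, M, d=10, iters=20, seed=0):
    """Alternating least squares. ratings: list of (n, m, r) triples with
    0 <= n < N users and 0 <= m < M movies; returns V (d x N) and W (d x M)."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((d, N))          # 1) random initialization
    W = rng.standard_normal((d, M))
    by_movie = [[(n, r) for (n, mm, r) in ratings if mm == m] for m in range(M)]
    by_user  = [[(m, r) for (nn, m, r) in ratings if nn == n] for n in range(N)]
    for _ in range(iters):
        # fix V: for each movie m, a bias-free linear regression over the users who rated it
        for m, entries in enumerate(by_movie):
            if entries:
                A = np.array([V[:, n] for n, _ in entries])   # inputs v_n
                b = np.array([r for _, r in entries])         # targets r_nm
                W[:, m] = np.linalg.lstsq(A, b, rcond=None)[0]
        # fix W: the symmetric step, solve each v_n over the movies that user n rated
        for n, entries in enumerate(by_user):
            if entries:
                A = np.array([W[:, m] for m, _ in entries])
                b = np.array([r for _, r in entries])
                V[:, n] = np.linalg.lstsq(A, b, rcond=None)[0]
    return V, W
```

For example, calling als with a handful of (n, m, r) triples returns factors whose product V'W approximates the observed ratings at the rated positions.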
Here Lin also mentioned in passing that the linear autoencoder (PCA) is a special case of matrix factorization.
Another, more common method for solving matrix factorization is stochastic gradient descent (SGD).
When optimizing Ein, ignoring the leading constant, consider the per-example error (rnm - wm'vn)².
Because there are two sets of variables, the gradients with respect to vn and wm need to be computed separately; this follows the standard SGD algorithm and the derivatives are straightforward.
One more note: when taking the derivative with respect to vn, why is only the term (rnm - wm'vn)² considered?
Think of Ein as a sum over all rated entries; with respect to vn there are two kinds of terms:
1) Terms that do not contain vn have zero derivative with respect to vn and are naturally not considered.
2) Terms that do contain vn (paired with w1, ..., wM):
a. With batch gradient descent, all of these terms containing vn would be considered (differentiate each one, then average over them).
b. With stochastic gradient descent, only the single sampled entry (n, m) needs to be considered (provided rnm has a rating), so only this one term remains in the gradient formula; the resulting updates are written out right after this list.
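Writing the surviving term out explicitly (my own addition, using the same notation as above): for the sampled entry (n, m) the per-example error is err = (rnm - wm'vn)², so
∂err/∂vn = -2 (rnm - wm'vn) wm and ∂err/∂wm = -2 (rnm - wm'vn) vn.
Folding the constant 2 into the learning rate η, the per-step SGD updates become
vn ← vn + η (rnm - wm'vn) wm and wm ← wm + η (rnm - wm'vn) vn.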
(Personally I feel it pays to nail down details like this; it helps in understanding complex problems.)
Here's an article about the parallelization of gradient algorithms: http://www.superchun.com/machine-learning/parallel-matrix-factorization.html
The overall algorithm flow is as follows:
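As a companion to the ALS sketch above, here is a minimal SGD sketch implementing those updates (again my own illustration, under the same (n, m, r)-triple assumption):

```python
import numpy as np

def sgd_mf(ratings, N, M, d=10, eta=0.01, epochs=50, seed=0):
    """SGD on the per-example error (r_nm - w_m' v_n)^2 over the rated entries only."""
    rng = np.random.default_rng(seed)
    V = 0.1 * rng.standard_normal((d, N))
    W = 0.1 * rng.standard_normal((d, M))
    for _ in range(epochs):
        for i in rng.permutation(len(ratings)):     # visit rated entries in random order
            n, m, r = ratings[i]
            residual = r - W[:, m] @ V[:, n]        # r_nm - w_m' v_n
            v_old = V[:, n].copy()                  # keep the old v_n for w_m's update
            V[:, n] += eta * residual * W[:, m]
            W[:, m] += eta * residual * v_old
    return V, W
```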
Finally, Lin briefly mentioned a trick his team used with SGD in the KDD Cup:
The trick is called time-deterministic GD: in the final rounds of GD, instead of picking points at random, pick the points that are most recent on the timeline. For data with a time attribute this gives better results.
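A minimal sketch of my reading of this trick (assuming each rating also carries a timestamp t, and reusing the V and W produced by the SGD sketch above): make the last pass deterministic, visiting ratings in chronological order so the most recent ones produce the final updates.

```python
def time_deterministic_final_pass(ratings_with_time, V, W, eta=0.01):
    """Final SGD pass in time order instead of random order (hypothetical helper).
    ratings_with_time: list of (n, m, r, t) with t a timestamp; V, W are numpy arrays."""
    for n, m, r, _ in sorted(ratings_with_time, key=lambda x: x[3]):
        residual = r - W[:, m] @ V[:, n]
        v_old = V[:, n].copy()
        V[:, n] += eta * residual * W[:, m]
        W[:, m] += eta * residual * v_old
    return V, W
```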
"Matrix factorization" heights Field machine learning techniques