1. KNN
1.1 Basic KNN Model
The idea behind KNN (k-Nearest Neighbors) is simple: to evaluate an unknown item U, find the k known items most similar to U and use those k items to make the evaluation. Suppose we want to predict the rating user Feng Yanjun would give a movie m. Following the KNN idea, we can first find k users who are similar to Feng Yanjun and have rated m, then use their ratings to predict his rating of m. Alternatively, we can first find k movies that are similar to m and have been rated by Feng Yanjun, then use his ratings of those k movies to predict his rating of m. The first method, which looks for similar users, is user-based KNN; the second, which looks for similar items, is item-based KNN. The ideas and implementations of the two are analogous, so the rest of this article discusses only item-based KNN and refers to it simply as KNN.
KNN prediction can be divided into the following three steps (suppose we want to predict user u's rating of movie m):
(1) Similarity calculation
Commonly used similarity measures in recommender systems include Pearson correlation, cosine similarity, and squared distance; Pearson correlation is the most widely used, so this article describes only Pearson correlation.
The Pearson correlation takes values in [-1, 1]: -1 means the two variables are perfectly negatively correlated, 0 means they are uncorrelated, and 1 means they are perfectly positively correlated. The formula is as follows:
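For two movies m and n, let U(m, n) be the set of users who have rated both, r_um be user u's rating of movie m, and r̄_m be the average rating of m over U(m, n). The Pearson correlation is then:

$$
s_{mn} = \frac{\sum_{u \in U(m,n)} (r_{um} - \bar{r}_m)(r_{un} - \bar{r}_n)}{\sqrt{\sum_{u \in U(m,n)} (r_{um} - \bar{r}_m)^2}\ \sqrt{\sum_{u \in U(m,n)} (r_{un} - \bar{r}_n)^2}}
$$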
(2) Neighbor selection
Among all the movies that user u has rated, find the K movies most similar to movie m, and let N(u, m) denote this set of K movies.
(3) Computing the predicted value
Given the K similar movies, the rating can be predicted with the following formula:
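A standard form of this similarity-weighted average is:

$$
\hat{r}_{um} = \frac{\sum_{n \in N(u,m)} s_{mn}\, r_{un}}{\sum_{n \in N(u,m)} |s_{mn}|}
$$

The sketch below walks through all three steps in Python. The article's own implementation is linked at the end of Section 2; the data layout and names here, such as `ratings` and `predict`, are illustrative assumptions:

```python
import math

# Hypothetical toy data: ratings[movie][user] = score
ratings = {
    "m1": {"u1": 4.0, "u2": 3.0, "u3": 5.0},
    "m2": {"u1": 3.5, "u2": 2.5},
    "m3": {"u2": 3.5, "u3": 5.0},
}

def pearson(m, n):
    """Step 1: Pearson correlation between movies m and n over common raters."""
    common = set(ratings[m]) & set(ratings[n])
    if len(common) < 2:
        return 0.0
    mean_m = sum(ratings[m][u] for u in common) / len(common)
    mean_n = sum(ratings[n][u] for u in common) / len(common)
    num = sum((ratings[m][u] - mean_m) * (ratings[n][u] - mean_n) for u in common)
    den = (math.sqrt(sum((ratings[m][u] - mean_m) ** 2 for u in common))
           * math.sqrt(sum((ratings[n][u] - mean_n) ** 2 for u in common)))
    return num / den if den else 0.0

def predict(user, movie, k):
    """Steps 2 and 3: pick the k most similar rated movies, then average."""
    rated = [n for n in ratings if n != movie and user in ratings[n]]
    neighbors = sorted(rated, key=lambda n: pearson(movie, n), reverse=True)[:k]
    num = sum(pearson(movie, n) * ratings[n][user] for n in neighbors)
    den = sum(abs(pearson(movie, n)) for n in neighbors)
    return num / den if den else None

print(predict("u1", "m3", k=2))  # u1's predicted rating of m3
```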
1.2 Data Sparsity and KNN Improvement
The scale of data that recommender systems must handle keeps growing; with hundreds of thousands of users and products, the overlap between any two users is tiny. If a system's sparsity is measured as the proportion of observed user-product selections among all possible selections, the most widely studied MovieLens dataset has a sparsity of 4.5%, Netflix 1.2%, BibSonomy 0.35%, and Delicious 0.046%.
From the Pearson correlation formula, if the set of common raters of two movies is much smaller than that of other movie pairs, the similarity computed for those two movies is less reliable. Given the sparsity described above, many movie pairs in a recommender system have very small intersections, which greatly reduces the reliability of their similarities. To make the prediction results more reliable, we therefore need to compress (shrink) the similarity according to the intersection size:
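A common form of this shrinkage multiplies the raw similarity by a factor that grows with the intersection size; with n_mn the number of users who rated both movies and β a shrinkage constant:

$$
s'_{mn} = \frac{n_{mn}}{n_{mn} + \beta}\, s_{mn}
$$

In code this is a small wrapper around the earlier `pearson()` sketch (β = 100 is an arbitrary illustrative value, not the article's choice):

```python
def shrunk_pearson(m, n, beta=100.0):
    """Shrink the similarity toward 0 when the two movies share few raters."""
    n_common = len(set(ratings[m]) & set(ratings[n]))
    return n_common / (n_common + beta) * pearson(m, n)
```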
1.3 Global Effects and KNN Improvement
Users exhibit various tendencies when rating movies. For example, some users are strict raters and tend to give lower scores, while others are lenient and tend to give higher scores; likewise, some movies tend to receive higher scores than others regardless of who rates them. In recommender systems, these tendencies are called global effects (GE).
There are 16 commonly used GE types [1]; here we list only the four used in this article:
| No. | Global Effect | Meaning |
|-----|---------------|---------|
| 0 | Overall mean | Average of all ratings |
| 1 | Movie × 1 | Movie rating tendency |
| 2 | User × 1 | User rating tendency |
| 3 | User × time(user)^(1/2) | Time elapsed since the user's first rating |
The first column gives the order in which each GE is considered; the second gives the GE's name; the third gives its meaning. In the names in the second column, the part before "×" indicates whether the effect is user-based or movie-based, and the part after "×" specifies the explanatory variable x_um (described below).
For each GE we estimate one dedicated parameter (except GE 0, since the average of all ratings can be computed directly). Only one GE is considered at a time; the residuals left after subtracting the predictions of all previously fitted GEs serve as the "true" ratings for the current estimation. The true rating used by the (t+1)-th GE is computed by the following formula:
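Writing r_um^(0) for the raw rating and r̂_um^(t) for the t-th GE's prediction (for a user-based GE this is θ_u^(t) · x_um, defined below), the residual is:

$$
r^{(t+1)}_{um} = r^{(t)}_{um} - \hat{r}^{(t)}_{um}
$$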
When estimating a GE's parameter we must also account for the data sparsity problem described above; that is, the parameter itself needs to be compressed. The shrunk parameter estimate is given by the following formula:
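Following the shrinkage form used in [1], the estimate for a user-based GE is:

$$
\theta^{(t)}_u = \frac{|I(u)|}{|I(u)| + \alpha} \cdot \frac{\sum_{m \in I(u)} r^{(t)}_{um}\, x_{um}}{\sum_{m \in I(u)} x_{um}^2}
$$

where α is a shrinkage constant.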
Here θ_u^(t) denotes the t-th parameter (user-based in this case; movie-based GEs are symmetric), I(u) denotes the set of all movies that user u has rated, and x_um is the explanatory variable relating user u and movie m: x_um = 1 for GEs 1 and 2, and x_um is the square root of the time since u's first rating for GE 3.
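Below is a minimal sketch of one round of GE fitting, reusing the nested-dict layout from the earlier snippets (the function name `fit_user_effect` and α = 25 are illustrative assumptions, not the article's code):

```python
def fit_user_effect(residuals, x, alpha=25.0):
    """Estimate one user-based GE with shrinkage and update the residuals.

    residuals[u][m] -- current residual rating of movie m by user u
    x[u][m]         -- explanatory variable (1 for GEs 1 and 2,
                       sqrt(time since first rating) for GE 3)
    """
    theta = {}
    for u, movies in residuals.items():
        num = sum(r * x[u][m] for m, r in movies.items())
        den = sum(x[u][m] ** 2 for m in movies)
        raw = num / den if den else 0.0
        theta[u] = len(movies) / (len(movies) + alpha) * raw  # shrinkage
    # Subtract this GE's predictions to get the residuals for the next GE.
    new_residuals = {
        u: {m: r - theta[u] * x[u][m] for m, r in movies.items()}
        for u, movies in residuals.items()
    }
    return theta, new_residuals
```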
The basic KNN model does not include GE. To make predictions more accurate, GE should be incorporated into the KNN prediction formula. The improved prediction formula is as follows:
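With GE_um denoting the sum of the fitted global-effect predictions for user u and movie m, a natural way to combine the two models is to run KNN on the residuals and add the global effects back:

$$
\hat{r}_{um} = \mathrm{GE}_{um} + \frac{\sum_{n \in N(u,m)} s_{mn}\,(r_{un} - \mathrm{GE}_{un})}{\sum_{n \in N(u,m)} |s_{mn}|}
$$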
2. Experiments
The experiments use the MovieLens 100K dataset, which consists of 100,000 ratings given by roughly 1,000 users to roughly 1,700 movies. RMSE (Root Mean Squared Error) is used as the evaluation metric:
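For a test set T:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,m) \in T} \left(r_{um} - \hat{r}_{um}\right)^2}
$$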
Each algorithm's performance on this dataset is shown below; the values in the table are RMSE.
|  | K = 10 | K = 15 | K = 20 |
|---|--------|--------|--------|
| Basic KNN model | 1.076 | 1.071 | 1.068 |
| KNN with shrunk similarity | 1.011 | 1.016 | 1.020 |
| KNN with GE | 0.987 | 0.988 | 0.989 |
| KNN with shrunk similarity and GE | 0.946 | 0.951 | 0.955 |
As the table shows, at K = 10 similarity shrinkage improves RMSE by 6%, GE alone improves it by 8.2%, and the two together improve it by 12.1%. This shows that: (1) data sparsity has a large impact on the basic model; (2) GE has a large influence, because KNN's prediction is a similarity-weighted average of user ratings, and when the ratings contain factors unrelated to similarity (the GEs), the final result becomes less reliable unless those factors are removed.
Because the code is fairly long, it is not pasted here. A Python implementation can be downloaded from:
http://ishare.iask.sina.com.cn/f/34170290.html
3. References
[1] Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights
[2] Collaborative Filtering Algorithms in the Netflix Prize
[3] Research on Collaborative Filtering Algorithms in Personalized Recommendation Technology
[4] Top Ten Challenges of Personalized Recommendation for Big Data Applications
[5] MovieLens Data Sets