Suppose the data is as follows, where each row represents a user and each column represents a rated item:
Let's look at the three formulas first.
Cosine similarity (cosine-based similarity):
Pearson coefficient (Pearson correlation):
Adjusted cosine similarity:
where r_{u,i} denotes the rating that user u gives to item i.
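Written out in LaTeX, the three measures are commonly given in the form below. This is a sketch that follows the descriptions in this post and the notation of the second reference article; the symbols U (all users), U_{ij} (users who rated both i and j), \bar{r}_i (mean rating of item i over U_{ij}) and \bar{r}_u (mean of the ratings user u has given) are labels assumed here for illustration.

```latex
% Cosine similarity: sums run over all users, with r_{u,i} = 0 when u has not rated i
\mathrm{sim}_{\cos}(i,j) =
  \frac{\sum_{u \in U} r_{u,i}\, r_{u,j}}
       {\sqrt{\sum_{u \in U} r_{u,i}^{2}}\,\sqrt{\sum_{u \in U} r_{u,j}^{2}}}

% Pearson correlation: sums run over co-rating users, centered on the item means
\mathrm{sim}_{\mathrm{pearson}}(i,j) =
  \frac{\sum_{u \in U_{ij}} (r_{u,i} - \bar{r}_{i})(r_{u,j} - \bar{r}_{j})}
       {\sqrt{\sum_{u \in U_{ij}} (r_{u,i} - \bar{r}_{i})^{2}}\,
        \sqrt{\sum_{u \in U_{ij}} (r_{u,j} - \bar{r}_{j})^{2}}}

% Adjusted cosine similarity: sums run over co-rating users, centered on each user's mean
\mathrm{sim}_{\mathrm{adj}}(i,j) =
  \frac{\sum_{u \in U_{ij}} (r_{u,i} - \bar{r}_{u})(r_{u,j} - \bar{r}_{u})}
       {\sqrt{\sum_{u \in U_{ij}} (r_{u,i} - \bar{r}_{u})^{2}}\,
        \sqrt{\sum_{u \in U_{ij}} (r_{u,j} - \bar{r}_{u})^{2}}}
```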
1. Comparison of cosine similarity with the other two
Cosine similarity is computed from every user's ratings of item i and item j. That includes users who left a rating blank; a blank rating is treated as 0.
The Pearson coefficient and adjusted cosine similarity are computed only over the set of users who have rated both i and j.
Summary: cosine similarity differs from the other two in which set of users enters the formula.
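The difference in user sets can be sketched in Python with NumPy. The rating matrix R below is hypothetical (it is not the data from this post), np.nan is assumed to mark a blank rating, and the function names cosine_all_users and co_rated_users are made up for this illustration.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items; np.nan means "not rated".
R = np.array([
    [5.0,    3.0,    np.nan],
    [4.0,    np.nan, 2.0],
    [np.nan, 1.0,    5.0],
    [2.0,    4.0,    4.0],
    [3.0,    5.0,    1.0],
])

def cosine_all_users(R, i, j):
    """Plain cosine similarity: every user counts, a blank rating is treated as 0."""
    filled = np.nan_to_num(R, nan=0.0)
    a, b = filled[:, i], filled[:, j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def co_rated_users(R, i, j):
    """Row indices of users who rated both item i and item j.

    This is the user set that the Pearson coefficient and adjusted cosine use.
    """
    return np.flatnonzero(~np.isnan(R[:, i]) & ~np.isnan(R[:, j]))

print(cosine_all_users(R, 0, 1))  # uses all five users
print(co_rated_users(R, 0, 1))    # only the users with both ratings present
```

For items 0 and 1 in this toy matrix, two of the five users are missing one of the ratings, so the second function returns a strictly smaller set than the one cosine similarity sums over.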
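2. Comparison between the Pearson coefficient and adjusted cosine similarity
From the formulas, the only difference between the two is the mean that is subtracted from each rating: the mean rating of the item in the Pearson coefficient, versus the mean rating of the user in adjusted cosine similarity.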
In the Pearson coefficient, the mean is the average rating of item i over all users who have rated both i and j. In other words, when computing the Pearson coefficient, list the users who rated both i and j in a table whose rows are users and whose columns are their ratings, then take the average of column i.
In adjusted cosine similarity, the mean is the average of the ratings given by user u over the items that u has actually rated. Items the user has not rated are not counted as 0; they are simply ignored.
Summary: the Pearson coefficient and adjusted cosine similarity differ in how they center the ratings (by item mean versus by user mean).
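The two ways of centering can be compared side by side in a short Python sketch. As before, the matrix R is hypothetical, np.nan is assumed to mark a blank rating, and the function names pearson and adjusted_cosine are illustrative rather than taken from any library.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items; np.nan means "not rated".
R = np.array([
    [5.0,    3.0,    np.nan],
    [4.0,    np.nan, 2.0],
    [np.nan, 1.0,    5.0],
    [2.0,    4.0,    4.0],
    [3.0,    5.0,    1.0],
])

def _both_rated(R, i, j):
    """Boolean mask of users who rated both item i and item j."""
    return ~np.isnan(R[:, i]) & ~np.isnan(R[:, j])

def pearson(R, i, j):
    """Center each item column on its mean over the co-rating users."""
    both = _both_rated(R, i, j)
    a, b = R[both, i], R[both, j]
    a_c, b_c = a - a.mean(), b - b.mean()
    return float(a_c @ b_c / (np.linalg.norm(a_c) * np.linalg.norm(b_c)))

def adjusted_cosine(R, i, j):
    """Center each rating on that user's own mean; unrated items are ignored, not 0."""
    both = _both_rated(R, i, j)
    user_mean = np.nanmean(R, axis=1)  # per-user mean over rated items only
    a_c = R[both, i] - user_mean[both]
    b_c = R[both, j] - user_mean[both]
    return float(a_c @ b_c / (np.linalg.norm(a_c) * np.linalg.norm(b_c)))

print(pearson(R, 0, 1))
print(adjusted_cosine(R, 0, 1))
```

Both functions sum over the same co-rating users, so any difference between the two printed values comes purely from subtracting the item mean in one case and the user mean in the other. Note that either denominator can be zero on degenerate data (for example, a user whose co-rated ratings all equal that user's mean); this sketch does not guard against that.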
Reference articles:
1. http://www.zhihu.com/question/21824291
2. http://www10.org/cdrom/papers/519/node11.html
3. http://guidetodatamining.com/assets/guidechapters/datamining-ch3.pdf
If you find errors or have suggestions, please let me know. O(∩_∩)O Thank you!