Euclidean similarity (Euclidean Distance)
Euclidean distance was originally used to measure the distance between two points in Euclidean space. Take two users X and Y as an example: treat them as two vectors x and y in n-dimensional space, where x_i denotes user X's preference value for item i and y_i denotes user Y's preference value for item i. The Euclidean distance between them is:

d(x, y) = sqrt( Σ (x_i − y_i)² )
The corresponding Euclidean similarity is generally obtained with the following conversion, so that the smaller the distance, the greater the similarity:

sim(x, y) = 1 / (1 + d(x, y))
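The two formulas above can be sketched in plain Python (for illustration only, not Mahout's actual EuclideanDistanceSimilarity code):

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equally long preference vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def euclidean_similarity(x, y):
    """Convert distance to similarity: smaller distance -> larger similarity."""
    return 1.0 / (1.0 + euclidean_distance(x, y))
```

Identical vectors have distance 0 and therefore similarity 1; the similarity decays toward 0 as the vectors move apart.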
In Taste, the class that computes Euclidean similarity between users or between items is EuclideanDistanceSimilarity.
Pearson similarity (Pearson Correlation coefficient)
The Pearson correlation coefficient is generally used to measure the degree of linear correlation between two interval variables; its value lies in [−1, +1]. A value greater than 0 indicates that the two variables are positively correlated: the larger the value of one variable, the larger the value of the other. A value less than 0 indicates that they are negatively correlated: the larger the value of one variable, the smaller the value of the other. The calculation formula is:

r = Σ (x_i − x̄)(y_i − ȳ) / ((n − 1) · s_x · s_y)

where s_x and s_y are the sample standard deviations of x and y.
In Taste, the implementation class PearsonCorrelationSimilarity is not based directly on the formula above, but on a mathematically equivalent formulation (the cosine of the mean-centered vectors).
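The textbook formula can be sketched in plain Python as follows (an illustration of the statistic itself, not Taste's implementation; the (n − 1) factors in the sample standard deviations cancel against the one in the covariance, so they are omitted):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)
```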
Cosine similarity
Cosine similarity is the cosine of the angle between two vectors; it is widely used to compute the similarity of document data.
In Taste, the class that effectively computes this centered form of cosine similarity is PearsonCorrelationSimilarity; another class, UncenteredCosineSimilarity, computes the plain cosine of the angle between the raw (uncentered) vectors, using the following formula:

cos(x, y) = Σ x_i·y_i / ( sqrt(Σ x_i²) · sqrt(Σ y_i²) )
The reason for this distinction is as follows. Cosine similarity distinguishes differences in direction but is insensitive to absolute values, so it cannot capture differences in magnitude along each dimension. Consider user ratings of content on a 5-point scale: suppose users X and Y rate two items (1, 2) and (4, 5) respectively. Plain cosine similarity gives 0.98, suggesting the two users are very similar; but judging from the scores, X does not seem to like either item, while Y clearly does, so the cosine value is misleading. To correct this irrationality, the adjusted cosine similarity subtracts a mean from the value in every dimension. If both rating vectors are centered by the overall mean of 3, they become (−2, −1) and (1, 2); computing cosine similarity on these gives −0.8. The similarity is now negative and the difference substantial, which clearly matches reality better.
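The example can be checked with a small sketch (plain Python with hypothetical helper names, not Taste code):

```python
import math

def cosine(x, y):
    """Plain (uncentered) cosine similarity of two vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    return dot / (nx * ny)

def adjusted_cosine(x, y, mean):
    """Adjusted cosine: subtract a mean from every dimension first."""
    return cosine([xi - mean for xi in x], [yi - mean for yi in y])

# Ratings from the example: X rates the two items (1, 2), Y rates them (4, 5).
x, y = [1, 2], [4, 5]
plain = cosine(x, y)                 # ~0.98: looks very similar
adjusted = adjusted_cosine(x, y, 3)  # -0.8: actually quite different
```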
Tanimoto similarity (Tanimoto coefficient)
The Tanimoto coefficient, also known as the Jaccard coefficient, is an extension of cosine similarity and is used to compute document similarity. The calculation formula is:

T(X, Y) = |X ∩ Y| / |X ∪ Y| = |X ∩ Y| / ( |X| + |Y| − |X ∩ Y| )

where X denotes the set of all items that user X prefers, and Y denotes the set of all items that user Y prefers.
In Taste, the class that implements Tanimoto similarity is TanimotoCoefficientSimilarity. As the formula suggests, this measure is suited to the case where a user's preference for an item is binary (0 or 1).
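For binary like/don't-like data the coefficient reduces to simple set arithmetic; a minimal sketch (not Taste's code):

```python
def tanimoto(items_x, items_y):
    """Jaccard/Tanimoto coefficient of two sets of liked items."""
    x, y = set(items_x), set(items_y)
    inter = len(x & y)
    # |X ∩ Y| / (|X| + |Y| - |X ∩ Y|), i.e. intersection over union
    return inter / (len(x) + len(y) - inter)
```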
City Block (or Manhattan) similarity
Taxicab geometry, or the Manhattan distance, is a term coined by the 19th-century mathematician Hermann Minkowski. It is a geometric term used in metric spaces to denote the sum of the absolute differences of the coordinates of two points in a standard coordinate system. In the classic illustration, the red line represents the Manhattan distance, the green line represents the Euclidean (straight-line) distance, and the blue and yellow lines represent equivalent Manhattan distances.
The calculation formula is:

d(x, y) = Σ |x_i − y_i|

The converted similarity is:

sim(x, y) = 1 / (1 + d(x, y))
The implementation class CityBlockSimilarity in Taste uses a simplified calculation and is suited to binary (0/1) user preference data.
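The distance and the same 1/(1 + d) style conversion can be sketched as (illustration only, not the CityBlockSimilarity source):

```python
def manhattan_distance(x, y):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def city_block_similarity(x, y):
    """Convert the distance to a similarity in (0, 1]."""
    return 1.0 / (1.0 + manhattan_distance(x, y))
```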
Log-likelihood similarity
The formula is rather complex; the implementation class is LogLikelihoodSimilarity. Like the previous measure, it is suited to the case where user preference data is binary (0 or 1).
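The statistic behind this measure is Dunning's log-likelihood ratio over co-occurrence counts. A sketch of that statistic under the usual 2x2 contingency-table formulation (an illustration, not Mahout's exact code):

```python
import math

def x_log_x(v):
    """v * ln(v), with the convention 0 * ln(0) = 0."""
    return v * math.log(v) if v > 0 else 0.0

def entropy(*counts):
    """Unnormalized entropy N*H of a list of event counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def log_likelihood_ratio(k11, k12, k21, k22):
    """k11: items both users like, k12/k21: items only one user likes,
    k22: items neither user likes."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)
```

The ratio is 0 when the two users' preferences look statistically independent and grows as their overlap becomes more surprising.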
Spearman similarity (Spearman correlation coefficient)
The Spearman correlation can be understood as the Pearson correlation computed on ranked user preference values. Mahout in Action explains it this way: suppose that for each user we find his least favorite item and rewrite its rating as 1, then find the next least favorite item and rewrite its rating as 2, and so on. We then compute the Pearson correlation coefficient on these converted values; that is the Spearman correlation coefficient.
Computing the Spearman correlation discards some important information, namely the actual rating values, but it retains the intrinsic nature of the user's preferences: their ordering. The computation is based on ranks.
Because computing the Spearman correlation requires time to compute and store the ranks of the preference values, with cost that grows with the size of the data, the Spearman correlation coefficient is generally used for academic research or small-scale computation.
The implementation class in Taste is SpearmanCorrelationSimilarity.
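The rank-then-Pearson idea can be sketched as follows (plain Python; tie handling is omitted for brevity, whereas a real implementation would average tied ranks):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

def ranks(values):
    """Rewrite each preference value as its rank: 1 = least favorite."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```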
These are the main similarity calculation methods available in Mahout's Taste library.