From: http://blog.csdn.net/samxx8/article/details/7691868
| Similarity (the smaller the distance, the larger the value) | Advantages | Disadvantages | Range |
| --- | --- | --- | --- |
| PearsonCorrelationSimilarity: akin to computing the covariance of two rating vectors | Not affected by a user's habit of rating high or low | 1. The similarity cannot be computed when two items have fewer than 2 co-raters [can be mitigated with a threshold on the co-rating count], and the size of the overlap between two users is not considered [can be addressed with the Weighting parameter]. 2. The similarity of two items with identical ratings cannot be computed | [-1, 1] |
| EuclideanDistanceSimilarity: computes the Euclidean distance d and uses 1/(1+d) | Suited to cases where the magnitude of the ratings matters | If the magnitudes do not matter, normalization is needed; the computation is heavy, and every data update is troublesome | (0, 1] |
| CosineMeasureSimilarity: computes the angle between two vectors | Same as PearsonCorrelationSimilarity | | [-1, 1] |
| SpearmanCorrelationSimilarity: PearsonCorrelationSimilarity computed on ranks instead of ratings | A compromise between relying entirely on ratings and discarding them altogether | Computing the ranks is time-consuming, which is bad for data updates | [-1, 1] |
| CachingUserSimilarity: wraps another similarity and caches its results | Caches frequently queried user similarities | Extra memory overhead | |
| TanimotoCoefficientSimilarity: the proportion of the intersection of two sets within their union; the larger the proportion, the more similar | For data that records only associations, with no ratings | The ratings are ignored, so information is lost | [0, 1] |
| LogLikelihoodSimilarity: an improvement of TanimotoCoefficientSimilarity based on probability theory | Accounts for coincidental co-occurrence and for the distinctiveness of the items two users share | Computationally complex | [0, 1] |
In practice, recommender systems are generally based on collaborative filtering algorithms, which usually need to compute user-user or item-item similarity. For data sources that differ in volume and type, different similarity calculation methods are needed to improve recommendation performance. Mahout provides a large number of components for computing similarity, each implementing a different similarity measure. The relationships between these components are shown below:
Figure 1: Item similarity calculation components
Figure 2: User similarity calculation components
Here are some of the key similarity calculation methods:
Pearson correlation
Class name: PearsonCorrelationSimilarity
Principle: a statistic that measures the degree of linear correlation between two variables.
Range: [-1, 1]; the larger the absolute value, the stronger the correlation. Negative correlation is of little use for recommendation.
Notes: 1. The number of overlapping ratings is not taken into account. 2. If there is only one overlapping rating, the similarity cannot be computed (the calculation divides by n-1). 3. If all the overlapping ratings are equal, the similarity cannot be computed (the standard deviation is 0, giving division by zero).
This similarity is neither the best nor the worst choice; it is simply easy to understand and was often used in early studies. Using the Pearson linear correlation coefficient assumes that the data are drawn pairwise from normal distributions, and that the ratings are at least interval-scaled in a logical sense. Mahout extends the Pearson calculation with an enumeration-type parameter (Weighting) that makes the overlap count an influence factor in the similarity.
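The edge cases above (a single overlapping rating, or a zero standard deviation) are easy to see in a minimal plain-Python sketch of the Pearson calculation over two users' co-rated items; this is an illustration, not Mahout's Java implementation:

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of co-rated ratings."""
    n = len(xs)
    if n < 2:
        return None  # only one overlapping rating: undefined (divides by n-1)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return None  # all ratings equal: standard deviation is 0
    return cov / (sx * sy)
```

Note that the result depends only on the co-rated items, which is why the size of the overlap does not influence the value unless weighting is enabled.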
Euclidean distance similarity
Class name: EuclideanDistanceSimilarity
Principle: the similarity s is defined from the Euclidean distance d as s = 1/(1+d).
Range: (0, 1]; the larger the value, the smaller d, i.e. the closer the two points and the greater the similarity.
Description: like the Pearson similarity, this similarity does not by itself take the number of overlapping ratings into account; Mahout likewise adds an enumeration-type parameter (Weighting) to make the overlap count an influence factor in the similarity.
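The s = 1/(1+d) mapping can be sketched directly; this is a plain-Python illustration of the formula, not Mahout's implementation:

```python
import math

def euclidean_similarity(xs, ys):
    """Similarity s = 1 / (1 + d), where d is the Euclidean distance."""
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))
    return 1.0 / (1.0 + d)
```

Identical vectors give d = 0 and hence s = 1; as d grows, s decays toward 0 without ever reaching it.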
Cosine similarity
Class names: PearsonCorrelationSimilarity and UncenteredCosineSimilarity
Principle: the cosine of the angle between two vectors in a multidimensional space.
Range: [-1, 1]; the larger the value, the smaller the angle, so the closer the two vectors and the greater the similarity.
Note: mathematically, if the attribute vectors of two items are mean-centered, the computed cosine similarity equals the Pearson similarity; since Mahout centers the data, the Pearson similarity value is also the cosine similarity of the centered data. In newer versions, Mahout additionally provides the UncenteredCosineSimilarity class to compute the cosine similarity of non-centered data.
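The equivalence between Pearson correlation and cosine similarity on mean-centered data can be checked with a short plain-Python sketch (again an illustration, not Mahout's code):

```python
import math

def cosine(xs, ys):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(xs, ys))
    nx = math.sqrt(sum(x * x for x in xs))
    ny = math.sqrt(sum(y * y for y in ys))
    return dot / (nx * ny)

def centered(xs):
    """Subtract the mean from each component (mean-centering)."""
    m = sum(xs) / len(xs)
    return [x - m for x in xs]
```

Applying `cosine` to `centered` vectors reproduces the Pearson correlation of the original vectors, which is exactly the relationship described above.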
Spearman rank correlation coefficient
Class name: SpearmanCorrelationSimilarity
Principle: the Spearman rank correlation coefficient is generally understood as the Pearson linear correlation coefficient between the ranked variables.
Range: [-1, 1]; 1.0 when the two rankings agree exactly, -1.0 when they are exactly reversed.
Description: the calculation is very slow because of the large amount of sorting involved. For the data sets used in recommender systems, the Spearman rank correlation coefficient is usually inappropriate as a similarity measure.
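The "Pearson on ranks" idea can be sketched as follows; the ranking step (with average ranks for ties) is the part that makes the method slow on large data. This is a plain-Python illustration, not Mahout's implementation:

```python
import math

def ranks(xs):
    """1-based ranks of xs; tied values receive the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Pearson correlation applied to the ranks of xs and ys."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

Because every update to the ratings can change the ranks of many items, incremental updates are expensive, which matches the drawback noted above.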
Manhattan distance
Class name: CityBlockSimilarity
Principle: an implementation based on the Manhattan (city-block) distance, which, like the Euclidean distance, measures the distance between points in a multidimensional space.
Range: [0, 1], consistent with the Euclidean case; the larger the value, the smaller the distance and the greater the similarity.
Description: it involves less computation than the Euclidean distance, so its performance is relatively high.
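A minimal sketch of the idea, assuming the same 1/(1+d) mapping used for the Euclidean similarity (Mahout's CityBlockSimilarity may differ in detail, e.g. it operates on boolean preference data):

```python
def cityblock_similarity(xs, ys):
    """Manhattan (city-block) distance d, mapped to similarity 1 / (1 + d)."""
    d = sum(abs(x - y) for x, y in zip(xs, ys))
    return 1.0 / (1.0 + d)
```

The distance needs only additions and absolute values, no squares or square roots, which is why it is cheaper than the Euclidean distance.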
Tanimoto coefficient
Class name: TanimotoCoefficientSimilarity
Principle: also known as the generalized Jaccard coefficient, an extension of the Jaccard coefficient; for two sets A and B the value is |A ∩ B| / (|A| + |B| - |A ∩ B|).
Range: [0, 1]; 1 when the two sets overlap completely, 0 when they do not overlap at all, and the closer the value is to 1, the more similar they are.
Description: suitable for preference data without ratings.
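Over the sets of items two users have each expressed a preference for, the coefficient is a one-liner; a plain-Python illustration:

```python
def tanimoto(a, b):
    """|A ∩ B| / |A ∪ B| for the item sets of two users."""
    a, b = set(a), set(b)
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)
```

Only set membership is used, so rating values, even if present, do not affect the result; that is the information loss noted above.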
Log-likelihood similarity
Class name: LogLikelihoodSimilarity
Principle: based on the counts of items both users prefer, items only one of them prefers, and items neither of them prefers.
Range: see Ted Dunning's paper "Accurate Methods for the Statistics of Surprise and Coincidence" for the details.
Note: it handles preference data without ratings, and is smarter than the Tanimoto coefficient calculation.
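The counts above form a 2x2 contingency table, and the log-likelihood ratio (the G² statistic from Dunning's paper) is computed from entropies of that table. The sketch below, including the final 1 - 1/(1 + llr) mapping into [0, 1], follows my reading of Mahout's approach but is a plain-Python illustration, not the library's code:

```python
import math

def entropy(*counts):
    """N*ln(N) - sum(x*ln(x)): the unnormalized entropy of a count vector."""
    def x_log_x(x):
        return x * math.log(x) if x > 0 else 0.0
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 over the 2x2 table: k11 = both users prefer the item,
    k12 = only the first, k21 = only the second, k22 = neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

def llr_similarity(k11, k12, k21, k22):
    """Map the ratio into [0, 1); larger means less likely to be coincidence."""
    return 1.0 - 1.0 / (1.0 + log_likelihood_ratio(k11, k12, k21, k22))
```

When the two users' preferences are statistically independent (e.g. a uniform table), the ratio is 0, which is how the measure discounts coincidental overlap.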