Several similarity calculation methods in Mahout Taste

Source: Internet
Author: User

Euclidean similarity (Euclidean Distance)

Euclidean distance was originally used to calculate the distance between two points in Euclidean space. Take two users X and Y as an example and treat them as two vectors x and y in n-dimensional space, where x_i is user X's preference value for item i and y_i is user Y's preference value for item i. The Euclidean distance between them is

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

The corresponding Euclidean similarity is generally obtained with the following conversion, so that the smaller the distance, the greater the similarity:

$$\mathrm{sim}(x, y) = \frac{1}{1 + d(x, y)}$$

In Taste, the class that calculates Euclidean similarity between users or between items is EuclideanDistanceSimilarity.
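As a rough illustration of the distance-to-similarity conversion described above, here is a minimal Python sketch. It is not the Taste implementation: the function names are hypothetical, and it assumes both vectors already contain only the items that both users have rated.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length preference vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def euclidean_similarity(x, y):
    """Map a distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + euclidean_distance(x, y))


# Hypothetical preference values of users X and Y for three co-rated items.
x = [5.0, 3.0, 4.0]
y = [4.0, 3.0, 2.0]
print(euclidean_distance(x, y))
print(euclidean_similarity(x, y))
```

Identical vectors have distance 0 and therefore similarity 1; the similarity approaches 0 as the distance grows.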

Pearson similarity (Pearson Correlation coefficient)

The Pearson correlation coefficient is generally used to measure the degree of linear correlation between two interval-scale variables, and its value lies in [-1, +1]. A value greater than 0 indicates that the two variables are positively correlated: the greater the value of one variable, the greater the value of the other. A value less than 0 indicates that the two variables are negatively correlated: the greater the value of one variable, the smaller the value of the other. The calculation formula is as follows:

$$r(x, y) = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$

where s_x and s_y are the sample standard deviations of x and y.

In Taste, PearsonCorrelationSimilarity is not implemented from the formula above but from the mean-centered cosine form described in the cosine similarity section below.
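To make the definition in terms of sample standard deviations concrete, the following Python sketch computes the coefficient directly from standardized values. It only illustrates the textbook formula, not the Taste code; the function name and the choice to return NaN for a constant vector are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation from standardized values (sample standard deviations)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 entries")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
    if s_x == 0.0 or s_y == 0.0:
        # A constant vector has no variance, so the correlation is undefined.
        return float("nan")
    total = sum(((xi - mean_x) / s_x) * ((yi - mean_y) / s_y)
                for xi, yi in zip(x, y))
    return total / (n - 1)


print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # perfectly positive relation
print(pearson([1.0, 2.0, 3.0], [6.0, 4.0, 2.0]))  # perfectly negative relation
```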

Cosine similarity (cosine similarity)

Cosine similarity is the cosine of the angle between two vectors, and it is widely used to calculate the similarity of document data:

$$\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}$$

In Taste, PearsonCorrelationSimilarity computes the cosine on mean-centered preference values (which is why it coincides with the Pearson correlation), while another class, UncenteredCosineSimilarity, computes the plain cosine of the angle without centering. The centered form is:

$$\mathrm{sim}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

The reason for this form is as follows. Cosine similarity distinguishes vectors mainly by direction and is insensitive to absolute values, so it cannot measure differences in magnitude along each dimension. For example, suppose content is rated on a 5-point scale and users X and Y rate two items (1,2) and (4,5) respectively. Cosine similarity gives 0.98, so the two users appear very similar, yet judging by the scores X does not seem to like either item while Y likes both. This insensitivity to magnitude distorts the result. Adjusted cosine similarity corrects for it by subtracting a mean from the value in every dimension. If the mean score for X and Y is 3, the adjusted vectors are (-2,-1) and (1,2), and the cosine similarity becomes -0.8. The similarity is now negative and the difference is large, which is clearly closer to reality.
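The arithmetic in this example can be checked with a few lines of Python. The sketch assumes the two score vectors are (1, 2) and (4, 5), which reproduces the 0.98 and -0.8 quoted in the text, and it subtracts the shared mean of 3 as the text does; the helper names are hypothetical.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def centered(values, mean):
    """Subtract a given mean from every component."""
    return [v - mean for v in values]


x = [1.0, 2.0]  # assumed scores of user X
y = [4.0, 5.0]  # assumed scores of user Y

plain = cosine(x, y)
adjusted = cosine(centered(x, 3.0), centered(y, 3.0))
print(round(plain, 2), round(adjusted, 2))  # prints: 0.98 -0.8
```

The plain cosine only sees that both vectors point in nearly the same direction; after centering, the vectors point in roughly opposite directions and the sign flips.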

Tanimoto similarity

The Tanimoto coefficient, also called the Jaccard coefficient, is an extension of cosine similarity and is also used to calculate document similarity. The calculation formula is as follows:

$$T(X, Y) = \frac{|X \cap Y|}{|X| + |Y| - |X \cap Y|}$$

where X represents the set of all items that user X prefers, and Y represents the set of all items that user Y prefers.

In Taste, the class that implements Tanimoto similarity is TanimotoCoefficientSimilarity. As the formula shows, this measure applies when a user's preference for an item is boolean, that is, 0 or 1.
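Because the measure only depends on which items each user prefers, it can be sketched with plain sets. The Python snippet below is an illustrative sketch, not the Taste class; the function name and the convention of returning 0.0 for two empty sets are assumptions.

```python
def tanimoto(x_items, y_items):
    """Size of the intersection divided by the size of the union."""
    x_set, y_set = set(x_items), set(y_items)
    intersection = len(x_set & y_set)
    union = len(x_set) + len(y_set) - intersection
    if union == 0:
        # Assumed convention: two users with no preferences have similarity 0.
        return 0.0
    return intersection / union


# Hypothetical item IDs preferred by users X and Y.
print(tanimoto({101, 102, 103}, {102, 103, 104}))  # 2 shared out of 4 distinct
```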

City Block (or Manhattan) similarity

Taxicab geometry, or the Manhattan distance, is a term coined by Minkowski in the 19th century. It is a geometric term used in metric spaces to denote the sum of the absolute differences between the coordinates of two points in a standard coordinate system. In the accompanying figure, the red line represents the Manhattan distance, the green line represents the Euclidean distance (the straight-line distance), and the blue and yellow lines represent equivalent Manhattan distances.

The calculation formula is:

$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

The converted similarity is:

$$\mathrm{sim}(x, y) = \frac{1}{1 + d(x, y)}$$

The implementation class in Taste, CityBlockSimilarity, uses a simplified calculation that is better suited to user preference data that is 0 or 1.
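For comparison with the Euclidean case, here is a small Python sketch of the general vector form of the Manhattan distance and a 1 / (1 + d) conversion to similarity. It does not reproduce the simplified boolean calculation used by the Taste class, and the function names are hypothetical.

```python
def manhattan_distance(x, y):
    """Sum of absolute coordinate differences between two vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


def manhattan_similarity(x, y):
    """Assumed conversion: smaller distance gives larger similarity."""
    return 1.0 / (1.0 + manhattan_distance(x, y))


x = [1.0, 2.0]
y = [4.0, 6.0]
print(manhattan_distance(x, y))    # |1 - 4| + |2 - 6| = 7.0
print(manhattan_similarity(x, y))  # 1 / (1 + 7) = 0.125
```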

Log-likelihood similarity (Loglikelihood Similarity)

The formula is more complex. The implementation class is LogLikelihoodSimilarity, and it is best suited to cases where user preference data is 0 or 1.
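Since the formula is not given here, the following Python sketch shows one common formulation: the log-likelihood ratio computed from a 2x2 table of co-occurrence counts, expressed through unnormalized entropies. Treat the details as assumptions rather than a description of the Taste source: the function names are hypothetical, and the final mapping 1 - 1 / (1 + LLR) into [0, 1) is how the conversion is recalled here, not something stated in this article.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """LLR for a 2x2 table: k11 = both, k12 = only X, k21 = only Y, k22 = neither."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    column_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row_entropy + column_entropy - matrix_entropy))


def log_likelihood_similarity(x_items, y_items, num_items):
    """Assumed mapping of the LLR into [0, 1) for boolean preference sets."""
    x_set, y_set = set(x_items), set(y_items)
    k11 = len(x_set & y_set)
    k12 = len(x_set) - k11
    k21 = len(y_set) - k11
    k22 = num_items - len(x_set | y_set)
    llr = log_likelihood_ratio(k11, k12, k21, k22)
    return 1.0 - 1.0 / (1.0 + llr)


print(log_likelihood_ratio(10, 10, 10, 10))  # independent counts, ratio near 0
print(log_likelihood_similarity({1, 2, 3}, {2, 3, 4}, num_items=100))
```

A ratio near 0 means the overlap between the two preference sets is about what chance would produce; larger values mean the overlap is less likely to be accidental.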

Spearman similarity

The Spearman correlation can be understood as the Pearson correlation between ranked user preference values. Mahout in Action gives this explanation: suppose that for each user we find his least favorite item and rewrite its rating as "1", then find the next least favorite item and rewrite its rating as "2", and so on. We then calculate the Pearson correlation coefficient on these converted values, and the result is the Spearman correlation coefficient.

The Spearman correlation discards some important information, namely the actual rating values. But it retains the essence of the user's preferences, their ordering, because it is calculated from the ranks.

Computing the Spearman correlation requires computing and storing a rank for each preference value, which takes time, and how much depends on the size of the data. For this reason, the Spearman correlation coefficient is generally used for academic research or small-scale computations.

The implementation class in Taste is SpearmanCorrelationSimilarity.
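The rank-then-correlate procedure described above can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the Taste implementation: the function names are hypothetical, ties between ratings are assumed not to occur, and both vectors are assumed to cover the same items.

```python
import math


def ranks(values):
    """Rank 1 for the smallest value, 2 for the next, and so on (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result


def pearson(x, y):
    """Pearson correlation written as covariance over the product of spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Pearson correlation of the rank-transformed values."""
    return pearson(ranks(x), ranks(y))


# The ratings differ in value but order the four items identically.
x = [1.0, 2.5, 4.0, 5.0]
y = [1.0, 2.0, 3.0, 5.0]
print(spearman(x, y))
```

Because only the ordering matters, any two users who rank the items in the same order get the maximum correlation regardless of the actual rating values.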
