Introduction to the calculation method of similarity in Mahout

Last Update:2018-07-23 Source: Internet

Author: User


In practice, recommendation systems are generally built on collaborative filtering algorithms, which need to compute the similarity between users or between items. Data sources differ in volume and data type, so different similarity measures are needed to improve recommendation quality. Mahout provides a large number of components for computing similarity, each implementing a different similarity measure. The following figures show the relationships between the components that implement similarity calculation:

Figure 1. Item similarity calculation components

Figure 2. User similarity calculation components

Here is a description of several key similarity measures:

Pearson correlation

Class Name: PearsonCorrelationSimilarity

Principle: A statistic that reflects the degree of linear correlation between two variables.

Range: [-1, 1]. The greater the absolute value, the stronger the correlation; negative correlations are of little use for recommendation.

Note: 1. The number of overlapping items is not taken into account. 2. If there is only one overlapping item, the similarity cannot be computed (the calculation divides by n-1). 3. If the overlapping values are all equal, the similarity cannot be computed (the standard deviation is 0, which leads to division by zero).
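The edge cases in the note above can be illustrated with a minimal Python sketch. This is not Mahout's implementation; the function name `pearson_similarity` and the choice to return `None` for the undefined cases are assumptions made for illustration.

```python
import math


def pearson_similarity(xs, ys):
    """Pearson correlation over two equally long lists of co-rated values.

    Returns None when the correlation is undefined (hypothetical convention).
    """
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n < 2:
        # Only one overlapping item: no variance can be estimated.
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # All overlapping values are equal: standard deviation is 0.
        return None
    return cov / math.sqrt(var_x * var_y)
```

Because the result depends only on the co-rated values, two users with two overlapping items can score as high as two users with two hundred, which is the first point of the note.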

Euclidean distance

Class Name: EuclideanDistanceSimilarity

Principle: The similarity s is defined from the Euclidean distance d as s = 1 / (1 + d).

Range: [0, 1]. The larger the value, the smaller d is, that is, the closer the two points and the greater the similarity.
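As a rough illustration of the formula s = 1 / (1 + d), the following Python sketch computes the distance over two equally long lists of ratings. The function name is hypothetical, and Mahout's own class may differ in detail (for example in how it normalizes the distance).

```python
import math


def euclidean_similarity(xs, ys):
    """Map the Euclidean distance d between xs and ys to s = 1 / (1 + d)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))
    return 1.0 / (1.0 + d)
```

Identical rating vectors give d = 0 and therefore s = 1; as d grows, s approaches 0.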

Cosine similarity

Class Name: PearsonCorrelationSimilarity and UncenteredCosineSimilarity

Principle: The cosine of the angle formed at a reference point by two points in a multidimensional space.

Range: [-1, 1]. The larger the value, the smaller the angle, the closer the two points are in direction, and the greater the similarity.
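A minimal Python sketch of the cosine computation, with an optional centering step that subtracts each vector's mean. The names `cosine_similarity` and `center` are assumptions for illustration and do not correspond to Mahout methods.

```python
import math


def cosine_similarity(xs, ys):
    """Cosine of the angle between xs and ys, measured from the origin."""
    dot = sum(x * y for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum(x * x for x in xs))
    norm_y = math.sqrt(sum(y * y for y in ys))
    if norm_x == 0 or norm_y == 0:
        return None  # undefined for a zero vector (hypothetical convention)
    return dot / (norm_x * norm_y)


def center(values):
    """Subtract the mean so the values are centered on zero."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]
```

For example, `cosine_similarity([1, 2, 3], [3, 2, 1])` is positive because all ratings are positive, while `cosine_similarity(center([1, 2, 3]), center([3, 2, 1]))` is -1, the same value the Pearson correlation gives for these two lists.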

Note: Mathematically, if the attribute values of the two items are centered (mean-subtracted), the cosine similarity and the Pearson similarity are identical. Mahout's Pearson implementation performs this centering, so the Pearson similarity is also the cosine similarity of the centered data. In addition, newer versions of Mahout provide the UncenteredCosineSimilarity class for computing cosine similarity on uncentered data.

Spearman rank correlation coefficient

Class Name: SpearmanCorrelationSimilarity

Principle: The Spearman rank correlation coefficient is generally regarded as the Pearson linear correlation coefficient between the ranked variables.

Range: [-1.0, 1.0]; 1.0 when the rankings agree completely, -1.0 when they are completely reversed.
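The principle (rank the values, then apply Pearson to the ranks) can be sketched in Python as follows. This is a simplified illustration that assumes there are no tied values; the function names are hypothetical and the code is not taken from Mahout.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_similarity(xs, ys):
    """Pearson correlation of the ranks of xs and ys (no ties assumed)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        return None  # undefined with fewer than two overlapping items
    rx, ry = ranks(xs), ranks(ys)
    mean = (n + 1) / 2  # mean of the ranks 1..n
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    # Without ties both rank lists are permutations of 1..n,
    # so they share the same sum of squared deviations.
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var
```

The sort inside `ranks` runs once per list for every pair being compared, which gives a sense of where the sorting cost of this measure comes from.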

Description: The calculation is very slow because it involves a large amount of sorting. For data sets in recommender systems, the Spearman rank correlation coefficient is generally unsuitable as a similarity measure.

Manhattan distance

Class Name: CityBlockSimilarity

Principle: An implementation of the Manhattan distance, which, like the Euclidean distance, measures the spatial distance between multidimensional data points.

Range: [0, 1], the same as for Euclidean distance similarity. The larger the value, the smaller the distance and the greater the similarity.
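A short Python sketch of the Manhattan (city block) distance. Mapping the distance to a similarity with 1 / (1 + d) is an assumption made here by analogy with the Euclidean case, and the function name is hypothetical; Mahout's class may compute the distance differently.

```python
def city_block_similarity(xs, ys):
    """Map the Manhattan distance d between xs and ys to 1 / (1 + d)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    # Sum of absolute differences: no squares or square roots needed.
    d = sum(abs(x - y) for x, y in zip(xs, ys))
    return 1.0 / (1.0 + d)
```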

Description: It requires less computation than the Euclidean distance, so performance is relatively high.

Tanimoto coefficient

Class Name: TanimotoCoefficientSimilarity

Principle: Also known as the generalized Jaccard coefficient, it is an extension of the Jaccard coefficient. For two sets A and B, the equation is T(A, B) = |A ∩ B| / (|A| + |B| - |A ∩ B|).

Range: [0, 1]; 1 when the items overlap completely, 0 when there is no overlap. The closer the value is to 1, the more similar.
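For preference data stored as sets of item IDs, the coefficient is the size of the intersection divided by the size of the union. The following Python sketch illustrates this; the function name and the `None` result for two empty sets are assumptions for illustration.

```python
def tanimoto_similarity(items_a, items_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(items_a), set(items_b)
    union_size = len(a | b)
    if union_size == 0:
        return None  # undefined when both sets are empty
    return len(a & b) / union_size
```

For example, users who prefer items {1, 2, 3} and {2, 3, 4} share two of four distinct items, giving 0.5.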

Description: Handles preference data without rating values.

Log-likelihood similarity

Class Name: LogLikelihoodSimilarity

Principle: Based on the number of overlapping items, the number of non-overlapping items, and the number of items that neither has.

Range: For details, search Baidu Library for the paper "Accurate Methods for the Statistics of Surprise and Coincidence".

Note: Handles preference data without rating values, and is more intelligent than the Tanimoto coefficient calculation.
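The counts named in the principle can be arranged as a 2x2 table and scored with a log-likelihood ratio in the entropy form. The Python sketch below is an illustration under stated assumptions: the parameter names (k11 = items both have, k12 and k21 = items only one has, k22 = items neither has) are hypothetical, and mapping the ratio into [0, 1) with 1 - 1 / (1 + LLR) is an assumed convention, not something stated in the text above.

```python
import math


def x_log_x(x):
    """x * ln(x), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a group of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """Log-likelihood ratio of a 2x2 table of co-occurrence counts."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    column_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp at 0 to absorb small negative rounding errors.
    return max(0.0, 2.0 * (row_entropy + column_entropy - matrix_entropy))


def log_likelihood_similarity(k11, k12, k21, k22):
    """Map the ratio into [0, 1); this mapping is an assumed convention."""
    return 1.0 - 1.0 / (1.0 + log_likelihood_ratio(k11, k12, k21, k22))
```

When the overlap is what chance alone would produce, as with counts (10, 10, 10, 10), the ratio is close to 0; when the overlap is far from what chance would produce, as with (10, 0, 0, 10), the ratio is large and the similarity approaches 1. Unlike the Tanimoto coefficient, the score also takes into account the items that neither side has.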

This article is from the blog "someone who says I am a tech house"; please be sure to keep this source: http://1992mrwang.blog.51cto.com/3265935/1337938

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
