Similarity measures in Mahout (similarity algorithms)

Last Update:2018-07-26 Source: Internet

Author: User

Both user-based CF and item-based CF rely on similarity calculation, because only by measuring the similarity between users or between items can a user's "neighbors" be found to complete the recommendation. The calculation of similarity was briefly described above, but incompletely; the following is a detailed description of common similarity calculation methods:

1. Pearson correlation-based similarity

The Pearson correlation coefficient reflects the degree of linear correlation between two variables and takes values in [-1, 1]. As the linear relationship between the two variables strengthens, the coefficient tends to 1 or -1. When one variable increases as the other increases, they are positively correlated and the coefficient is greater than 0; when one variable increases as the other decreases, they are negatively correlated and the coefficient is less than 0; a coefficient of 0 indicates that there is no linear correlation between them.

Expressed as a formula, the Pearson correlation coefficient equals the covariance of the two variables divided by the product of their standard deviations.

Covariance: used in probability theory and statistics to measure how two variables vary together. If the two variables tend to change in the same direction, that is, when one is greater than its own expectation the other is also greater than its own expectation, the covariance between them is positive; if the two variables change in opposite directions, the covariance is negative.

cov(X, Y) = E[(X - u)(Y - v)]

where u denotes X's expectation E(X) and v denotes Y's expectation E(Y).

Standard deviation: the standard deviation is the square root of the variance.

Variance: in probability theory and statistics, the variance of a random variable measures its degree of dispersion, that is, how far the variable lies from its expected value.

That is, the variance equals the mean of the squared deviations from the expected value: var(X) = E[(X - u)^2].
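The definitions above can be sketched in plain Java. This is a minimal illustration rather than Mahout's implementation; the class name `PearsonSketch` and the sample arrays are assumptions made up for this example, and both inputs are assumed to have the same length.

```java
// Minimal sketch: Pearson correlation built from covariance and standard deviation.
// PearsonSketch and the sample data are hypothetical, not part of Mahout.
final class PearsonSketch {
    static double mean(double[] values) {
        double sum = 0.0;
        for (double value : values) {
            sum += value;
        }
        return sum / values.length;
    }

    // cov(X, Y) = E[(X - u)(Y - v)]
    static double covariance(double[] x, double[] y) {
        double u = mean(x);
        double v = mean(y);
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += (x[i] - u) * (y[i] - v);
        }
        return sum / x.length;
    }

    // The standard deviation is the square root of the variance, and var(X) = cov(X, X).
    static double stdDev(double[] x) {
        return Math.sqrt(covariance(x, x));
    }

    // Covariance divided by the product of the two standard deviations.
    static double pearson(double[] x, double[] y) {
        return covariance(x, y) / (stdDev(x) * stdDev(y));
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0, 4.0};
        double[] rising = {2.0, 4.0, 6.0, 8.0};   // grows with x
        double[] falling = {8.0, 6.0, 4.0, 2.0};  // shrinks as x grows
        System.out.println(pearson(x, rising));   // close to 1
        System.out.println(pearson(x, falling));  // close to -1
    }
}
```

Because of floating-point rounding, the printed values may differ from exactly 1 and -1 in the last digits.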

Similarity based on the Pearson correlation coefficient has two drawbacks: (1) it does not take into account the number of items that two users have both rated, and (2) if two users have only one co-rated item, the similarity cannot be calculated.

In the table above, each row holds one user's ratings for items 101~103. Intuitively, User1 and User5 have 3 co-rated items and their ratings do not differ much, so their similarity should be higher than that between User1 and User4; yet User1 and User4 have the higher similarity, 1.

The same scenario often occurs in real life. Two users who have watched the same 200 movies, even if they did not give identical or nearly identical ratings, should be more similar than two users who have only 2 movies in common. But this is not what Pearson correlation reports: if the two users with only 2 movies in common rated them the same or similarly, their Pearson similarity will be significantly greater than that of the users who watched the same 200 movies.
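Both drawbacks can be reproduced with a small sketch. The class name `PearsonOverlapSketch` and all rating values below are hypothetical illustration data, not the article's table; each array holds one user's ratings for the co-rated items only.

```java
// Minimal sketch of how Pearson correlation behaves with few co-rated items.
// PearsonOverlapSketch and the ratings are made up for illustration.
final class PearsonOverlapSketch {
    // Pearson correlation over the items both users have rated.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0.0;
        double sumY = 0.0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
        }
        double meanX = sumX / n;
        double meanY = sumY / n;
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        // With a single co-rated item both deviations are zero, so this is 0/0 = NaN.
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        // Three co-rated items with similar ratings: high, but below 1.
        System.out.println(pearson(new double[] {5.0, 3.0, 2.5}, new double[] {4.0, 3.0, 2.0}));
        // Two co-rated items: any two points lie on a line, so the result is 1 or -1.
        System.out.println(pearson(new double[] {5.0, 3.0}, new double[] {2.0, 1.0}));
        // One co-rated item: the similarity is undefined.
        System.out.println(pearson(new double[] {5.0}, new double[] {2.0}));
    }
}
```

In this sketch the pair with only two co-rated items scores higher than the pair with three, even though the latter share more evidence, which is drawback (1); the single-item case yields NaN, which is drawback (2).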

Mahout provides an implementation of similarity based on the Pearson correlation coefficient, which takes a DataModel as input.

Mahout also addresses drawback (1): by simply passing the Weighting.WEIGHTED parameter when constructing PearsonCorrelationSimilarity, you can push the similarity between users with more co-rated items closer to 1 or -1.

    import org.apache.mahout.cf.taste.common.Weighting;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    // model is an existing DataModel
    UserSimilarity similarity1 = new PearsonCorrelationSimilarity(model);
    double value1 = similarity1.userSimilarity(1, 5);

    UserSimilarity similarity2 = new PearsonCorrelationSimilarity(model, Weighting.WEIGHTED);
    double value2 = similarity2.userSimilarity(1, 5);

Results:

    Similarity of User1 and User5: 0.944911182523068
    Similarity of User1 and User5 with weighting: 0.9655694890769175
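One way such a weighting could work is sketched below. This is an assumption modeled on the weighting step in Mahout's AbstractSimilarity and should be checked against the Mahout source before being relied on; the class name `WeightingSketch` is hypothetical, and the total of 7 items is assumed here only because it is consistent with the weighted result shown above.

```java
// Minimal sketch of a weighting step that pulls a raw similarity toward +1 or -1
// as the share of co-rated items grows. Assumption: this mirrors Mahout's
// behavior only approximately; WeightingSketch is not a Mahout class.
final class WeightingSketch {
    static double weight(double raw, int commonItems, int totalItems) {
        // The more items the two users share, the smaller the scale factor.
        double scale = 1.0 - (double) commonItems / (totalItems + 1);
        if (raw < 0.0) {
            return -1.0 + scale * (1.0 + raw);  // move toward -1
        }
        return 1.0 - scale * (1.0 - raw);       // move toward +1
    }

    public static void main(String[] args) {
        // 3 co-rated items out of an assumed 7 items in the data model.
        System.out.println(weight(0.944911182523068, 3, 7));
    }
}
```

With these assumed inputs the result is about 0.96557, in line with the weighted similarity reported above.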

2. Euclidean distance-based similarity

Calculating similarity from Euclidean distance is the simplest and most easily understood of all the similarity methods. It takes the items that the users have both rated as the axes of a coordinate system, plots each user as a point in that coordinate system, and calculates the straight-line distance between the points.

In the figure, User A and User B have each rated items X and Y. User A gives item X a score of 1.8 and item Y a score of 4, which maps to point A (1.8, 4) in the coordinate system; likewise, User B's scores for items X and Y map to point B (4.5, 2.5). The Euclidean distance (straight-line distance) between them is: sqrt((B.x - A.x)^2 + (B.y - A.y)^2)

The calculated Euclidean distance d is a non-negative number. To make it better reflect the similarity between users, it can be mapped into the interval (0, 1] by computing 1 / (1 + d). See the table above.

As long as there is at least one co-rated item, similarity can be calculated using Euclidean distance; if there are no co-rated items, the Euclidean distance cannot be computed. In fact, if there is no co-rated item, the two users or items can be treated as not similar at all.
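The distance and its mapping to a similarity can be sketched as follows, using the points A (1.8, 4) and B (4.5, 2.5) from the example. The class name `EuclideanSketch` is hypothetical, and this is an illustration rather than Mahout's own code.

```java
// Minimal sketch: Euclidean distance between two users' rating vectors and
// the 1 / (1 + d) mapping into (0, 1]. EuclideanSketch is a made-up name.
final class EuclideanSketch {
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Distance 0 maps to similarity 1; larger distances approach 0.
    static double similarity(double[] a, double[] b) {
        return 1.0 / (1.0 + distance(a, b));
    }

    public static void main(String[] args) {
        double[] userA = {1.8, 4.0};  // scores for items X and Y
        double[] userB = {4.5, 2.5};
        System.out.println(distance(userA, userB));    // roughly 3.09
        System.out.println(similarity(userA, userB));  // roughly 0.245
    }
}
```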

3. Cosine similarity

Cosine similarity uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals. Compared with distance measures, cosine similarity focuses on the difference in direction between the two vectors rather than on distance or length.

As with Euclidean distance, the cosine-based method treats each user's preferences as a point in an N-dimensional coordinate system. Connecting that point to the origin forms a line (a vector), and the similarity between two users is the cosine of the angle between their two vectors. Because both vectors start at the origin, they meet there; the smaller the angle, the more similar the two users, and the larger the angle, the less similar. In trigonometry, the cosine of an angle lies in [-1, 1]: the cosine of a 0-degree angle is 1, and the cosine of a 180-degree angle is -1.
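The cosine of the angle between two rating vectors can be sketched as the dot product divided by the product of the vector lengths. The class name `CosineSketch` and the sample vectors are assumptions for illustration only.

```java
// Minimal sketch: cosine of the angle between two vectors.
// CosineSketch and the sample vectors are hypothetical.
final class CosineSketch {
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(cosine(new double[] {1.0, 2.0}, new double[] {2.0, 4.0}));    // same direction, close to 1
        System.out.println(cosine(new double[] {1.0, 0.0}, new double[] {0.0, 1.0}));    // right angle, 0
        System.out.println(cosine(new double[] {1.0, 2.0}, new double[] {-1.0, -2.0}));  // opposite direction, close to -1
    }
}
```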

The difference between Euclidean distance and cosine similarity can be seen in a three-dimensional coordinate system:

As the figure shows, the distance measure captures the absolute distance between points in space, which depends directly on the coordinates of each point (that is, the values of the individual feature dimensions), whereas cosine similarity measures the angle between vectors and so reflects a difference in direction rather than position. If point A stays fixed and point B moves farther from the origin along its original direction, the cosine similarity cos θ stays the same because the angle is unchanged, while the distance between A and B clearly changes. This is the difference between Euclidean distance and cosine similarity.

Given their respective calculation and measurement characteristics, Euclidean distance and cosine similarity suit different data analysis models. Euclidean distance reflects the absolute difference in individual numerical features, so it is used more when the difference lies in the magnitude of each dimension, for example when using user behavior metrics to analyze the similarity or difference in user value. Cosine similarity distinguishes differences by direction and is insensitive to absolute values, so it is used more to distinguish similarity and difference in user interests from users' content ratings; it also corrects for users applying inconsistent rating scales (because cosine similarity is insensitive to absolute values).
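The effect of moving B away from the origin can be checked numerically. In this sketch the class name `DirectionVsDistanceSketch` and the three-dimensional points are hypothetical; B is moved along its own direction by multiplying it by a factor k.

```java
// Minimal sketch: scaling point B leaves the cosine unchanged but changes
// the Euclidean distance. All names and values are made up for illustration.
final class DirectionVsDistanceSketch {
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    // Moves a point along the line through the origin by a factor k.
    static double[] scale(double[] v, double k) {
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] * k;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] a = {1.0, 2.0, 2.0};
        double[] b = {2.0, 1.0, 2.0};
        for (double k : new double[] {1.0, 2.0, 4.0}) {
            double[] moved = scale(b, k);
            System.out.printf("k=%.0f cos=%.4f dist=%.4f%n", k, cosine(a, moved), distance(a, moved));
        }
    }
}
```

The cosine column stays the same for every k, while the distance column grows as B moves outward.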

Mahout does not specifically give an implementation based on the cosine similarity.

4. Adjusted cosine similarity

As noted in the introduction to cosine similarity, it distinguishes differences by direction and is insensitive to absolute values, so it cannot measure differences in the value of each dimension. This can lead to cases like the following: content is rated on a 5-point scale, and two users X and Y rate two pieces of content (1, 2) and (4, 5) respectively. Cosine similarity gives 0.98, so the two appear very similar; but judging by the scores, X does not seem to like either piece of content, while Y likes both. This insensitivity to magnitude skews the result. Adjusted cosine similarity corrects it by subtracting a mean from the value in every dimension. For example, if the mean of X's and Y's scores is 3, the adjusted vectors are (-2, -1) and (1, 2); the cosine similarity then comes out as -0.8, a negative similarity and a large difference, which is clearly more in line with reality.
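The worked example can be reproduced with a short sketch. It follows the article's description of subtracting one shared mean from every dimension; other formulations of adjusted cosine subtract a per-user mean instead. The class name `AdjustedCosineSketch` is hypothetical.

```java
// Minimal sketch: cosine similarity before and after subtracting a mean.
// AdjustedCosineSketch is a made-up name; the shared mean follows the
// article's example and is passed in by the caller.
final class AdjustedCosineSketch {
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Subtracts the mean from every dimension.
    static double[] center(double[] v, double mean) {
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] - mean;
        }
        return out;
    }

    static double adjustedCosine(double[] a, double[] b, double mean) {
        return cosine(center(a, mean), center(b, mean));
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0};
        double[] y = {4.0, 5.0};
        System.out.println(cosine(x, y));               // about 0.98
        System.out.println(adjustedCosine(x, y, 3.0));  // about -0.8
    }
}
```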

5. Spearman correlation

Spearman correlation can be understood as the Pearson correlation between ranked user preference values. Mahout in Action explains it this way: suppose that for each user we find the least favorite item and rewrite its rating as "1", then find the next least favorite item and rewrite its rating as "2", and so on. We then calculate the Pearson correlation coefficient on these converted values; the result is the Spearman correlation coefficient.

The Spearman calculation discards some important information, namely the actual rating values. But it retains the essence of the user's preferences, their ordering, because it is calculated from the ranks.
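The rank-then-correlate procedure can be sketched as follows. This is an illustration, not Mahout's SpearmanCorrelationSimilarity; the class name `SpearmanSketch` and the ratings are hypothetical, and the ranking assumes that a user's ratings are all distinct (ties are not handled).

```java
// Minimal sketch: replace ratings by ranks, then take the Pearson correlation
// of the ranks. SpearmanSketch and the data are made up for illustration.
final class SpearmanSketch {
    // Lowest rating gets rank 1, the next lowest rank 2, and so on.
    // Assumption: ratings are distinct; tied ratings are not treated specially.
    static double[] ranks(double[] ratings) {
        double[] result = new double[ratings.length];
        for (int i = 0; i < ratings.length; i++) {
            int rank = 1;
            for (int j = 0; j < ratings.length; j++) {
                if (ratings[j] < ratings[i]) {
                    rank++;
                }
            }
            result[i] = rank;
        }
        return result;
    }

    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i] / n;
            meanY += y[i] / n;
        }
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
            syy += (y[i] - meanY) * (y[i] - meanY);
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    static double spearman(double[] x, double[] y) {
        return pearson(ranks(x), ranks(y));
    }

    public static void main(String[] args) {
        double[] user = {5.0, 3.0, 2.5};
        // Same ordering, very different values: close to 1.
        System.out.println(spearman(user, new double[] {4.0, 3.9, 0.1}));
        // Reversed ordering: close to -1.
        System.out.println(spearman(user, new double[] {2.0, 2.5, 5.0}));
    }
}
```

The first comparison shows the information loss described above: only the ordering of the ratings matters, not their actual values.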

Reviewing the ratings of User1~5 for items 101~103 in the preceding table, the similarities calculated with the Spearman correlation coefficient are:

We find that each calculated similarity is either 1 or -1, depending on whether the user's preferences change in the same direction as User1's or in the opposite direction. Mahout implements the Spearman correlation coefficient in SpearmanCorrelationSimilarity. Its execution efficiency is not very high, because the Spearman calculation takes time to compute and store the ranks of the preference values; the exact time depends on the size of the data. For this reason, the Spearman correlation coefficient is generally used for academic research or small-scale computations.

    UserSimilarity similarity1 = new SpearmanCorrelationSimilarity(model); // Construct a Spearman correlation-based similarity

Results:

This article is an English version of an article originally published in Chinese on aliyun.com.
