Theoretical basis for choosing the computing method of machine learning similarity


Classification and clustering algorithms frequently need to compute the distance between two input variables (usually in the form of feature vectors), i.e., a similarity measure. Different similarity measures can produce dramatically different results for the same algorithm, so it is important to choose a similarity measure suited to the characteristics of the input data.

Let $X = (x_1, x_2, \dots, x_n)^T$ and $Y = (y_1, y_2, \dots, y_n)^T$ be two input vectors.

1. Euclidean distance - EuclideanDistanceMeasure

$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

This corresponds to the straight-line distance between two points in high-dimensional space.
Because the components of a feature vector often have inconsistent units, they usually need to be standardized so that the result is unit-independent; for example, computing the Euclidean distance directly over height (cm) and weight (kg) may invalidate the results.
Advantages: simple and widely used (if that counts as a merit); reflects numerical differences well; k-means typically uses the Euclidean distance.
Disadvantage: correlation between components is not considered, so the result can be distorted when several components express the same underlying feature.
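As a minimal Python sketch (not Mahout's implementation) of the formula above, applied to the height/weight example — note how the centimeter dimension dominates when the components are not standardized:

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Hypothetical height (cm) / weight (kg) vectors on very different scales:
print(euclidean_distance([170, 60], [180, 65]))  # the height difference dominates
```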

2. Mahalanobis distance - MahalanobisDistanceMeasure

$d(X, Y) = \sqrt{(X - Y)^T C^{-1} (X - Y)}$

Here $C = E[(X - \bar{X})(X - \bar{X})^T]$ is the covariance matrix of the class to which the input vector X belongs. (T is the transpose symbol; with n samples, E averages over n − 1.)

Applicable occasions:
1) measuring the degree of difference between two random variables X and Y that obey the same distribution with covariance matrix C;
2) measuring the degree of difference between a sample X and the mean vector of a class, to determine which class the sample belongs to; in this case Y is the class mean vector.
Advantages:
1) independent of the components' units;
2) correlation between components is removed.
Disadvantage: different features cannot be treated differently, so weak features may be exaggerated.
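A minimal NumPy sketch of the formula, assuming the covariance matrix is estimated from a matrix of class samples (rows = observations) with the unbiased n − 1 divisor mentioned above:

```python
import numpy as np

def mahalanobis_distance(x, y, samples):
    """Mahalanobis distance using the covariance of `samples` (rows = observations)."""
    cov = np.cov(samples, rowvar=False)     # unbiased (n - 1) estimate, as in the text
    diff = np.asarray(x, float) - np.asarray(y, float)
    # (X - Y)^T C^{-1} (X - Y), then take the square root
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
```

For the class-membership use case, `y` would be the class mean vector and `samples` the observations of that class.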

3. Minkowski distance - MinkowskiDistanceMeasure (Mahout default: p = 3)

$d(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$

It can be seen as a generalization of the Euclidean distance by exponent. Good application examples are rare, but generalization is generally a kind of progress. In particular:

When p = 1, it is known as the Manhattan distance (also called the absolute distance or city-block distance): the distances along each dimension are simply summed. ManhattanDistanceMeasure.

$d(X, Y) = \sum_{i=1}^{n} |x_i - y_i|$

When p = ∞, it is known as the Chebyshev distance. ChebyshevDistanceMeasure.

The Chebyshev distance comes from the king's move in chess: the king can move to any of the 8 surrounding squares in one step, so how many steps at minimum does it take to go from square A (x1, y1) to square B (x2, y2)? Extended to multidimensional space, the Chebyshev distance is the limit of the Minkowski distance as p tends to infinity:
$d(X, Y) = \max_i |x_i - y_i| = \lim_{p \to \infty} \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
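The whole Minkowski family can be sketched in a few lines of Python (again, an illustration rather than Mahout's code), with the Manhattan and Chebyshev cases falling out of the choice of p:

```python
def minkowski_distance(x, y, p):
    """Minkowski distance: p=1 gives Manhattan, p=2 Euclidean, p → ∞ Chebyshev."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1 / p)

def chebyshev_distance(x, y):
    """Limit of the Minkowski distance as p tends to infinity."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))
```

For the chess example, the minimum number of king moves from A to B is exactly `chebyshev_distance((x1, y1), (x2, y2))`.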

4. Hamming distance - no Mahout implementation

In information theory, the Hamming distance between two equal-length strings is the number of positions at which the corresponding characters differ. In other words, it is the number of substitutions required to transform one string into the other.

For example:

The Hamming distance between 1011101 and 1001001 is 2.
The Hamming distance between 2143896 and 2233796 is 3.
The Hamming distance between "toned" and "roses" is 3.
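The three examples above can be checked with a short sketch:

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(ca != cb for ca, cb in zip(a, b))

print(hamming_distance("1011101", "1001001"))  # 2
print(hamming_distance("toned", "roses"))      # 3
```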

5. Tanimoto coefficient (also known as the generalized Jaccard coefficient) - TanimotoDistanceMeasure

It is usually applied when X is a Boolean vector, i.e., when each component takes only the values 0 or 1. In that case it represents the features shared by X and Y as a proportion of all features held by either X or Y:
$T(X, Y) = \frac{X \cdot Y}{\|X\|^2 + \|Y\|^2 - X \cdot Y}$

6. Jaccard coefficient

The Jaccard coefficient is primarily used to compute similarity between individuals described by symbolic or Boolean attributes. Because such attributes only tell us whether two individuals are "the same" on a feature, the magnitude of a difference cannot be measured; the Jaccard coefficient is therefore concerned only with whether the individuals' features are consistent with each other. To compare X and Y, it counts only the features on which they agree, using the formula:
$J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$
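Both coefficients can be sketched side by side — the set form for symbolic attributes, and the vector form that the Tanimoto coefficient generalizes (an illustration, not Mahout's code):

```python
def jaccard_similarity(x, y):
    """|X ∩ Y| / |X ∪ Y| for two collections of symbolic features."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

def tanimoto_coefficient(x, y):
    """Generalized Jaccard for numeric vectors: X·Y / (|X|² + |Y|² − X·Y)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return dot / (sum(xi * xi for xi in x) + sum(yi * yi for yi in y) - dot)
```

On 0/1 vectors the two agree: shared features divided by features present in either vector.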


7. Pearson correlation coefficient - PearsonCorrelationSimilarity

This is the correlation coefficient r from correlation analysis: it computes the cosine of the angle between X and Y after each vector has been centered on its own mean. The formula is as follows:

$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$
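A minimal sketch of the formula — center each vector on its mean, then take the cosine:

```python
def pearson_correlation(x, y):
    """Cosine of the angle between the mean-centered vectors (correlation coefficient r)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cx = [xi - mx for xi in x]          # center X on its mean
    cy = [yi - my for yi in y]          # center Y on its mean
    num = sum(a * b for a, b in zip(cx, cy))
    den = (sum(a * a for a in cx) * sum(b * b for b in cy)) ** 0.5
    return num / den
```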

8. Cosine similarity - CosineDistanceMeasure

$\cos\theta = \frac{X \cdot Y}{\|X\| \, \|Y\|}$

This is the cosine of the angle between the two vectors.

Cosine similarity measures the difference between two individuals by the cosine of the angle between their vectors in a vector space. Compared with distance measures, cosine similarity focuses on the direction of the two vectors rather than their distance or length.

Advantage: unaffected by rotation of the coordinate axes or by uniform scaling (zooming in or out) of the vectors.
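A minimal sketch of the formula; note in the example that scaling a vector does not change the result, since only the angle matters:

```python
def cosine_similarity(x, y):
    """cos θ = X·Y / (|X| |Y|) for two numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = sum(a * a for a in x) ** 0.5
    ny = sum(b * b for b in y) ** 0.5
    return dot / (nx * ny)

print(cosine_similarity([1, 2], [2, 4]))  # 1.0: same direction, different lengths
```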

9. Adjusted cosine similarity - adjusted cosine similarity

Although cosine similarity corrects for bias between individuals, it can only distinguish differences in direction across dimensions, not differences in the values themselves, which leads to situations like this: users rate content on a 5-point scale; user X rates two items (1, 2) and user Y rates them (4, 5). Cosine similarity gives 0.98 — the two appear very similar — yet judging from the scores, X does not seem to like either item while Y does. The error comes from cosine similarity's insensitivity to absolute values. Adjusted cosine similarity corrects this irrationality by subtracting a mean from the values on every dimension: with a mean of 3, X and Y adjust to (−2, −1) and (1, 2), and the cosine similarity becomes −0.8 — negative, and the difference is large, which is clearly closer to reality.

Adjusted cosine similarity, cosine similarity, and the Pearson correlation coefficient are all widely applied in recommender systems. For item-based recommendation, a GroupLens paper showed that adjusted cosine similarity performs better than the other two.

10. Weight-based distance measures:
WeightedDistanceMeasure, WeightedEuclideanDistanceMeasure, WeightedManhattanDistanceMeasure

Euclidean distance vs. cosine similarity

Euclidean distance measures the absolute distance between points in space and depends directly on each point's coordinates (i.e., the values of the individual feature dimensions), while cosine similarity measures the angle between space vectors and reflects differences in direction rather than position. If point A stays fixed and point B moves farther from the origin along the same direction, the cosine similarity cos θ is unchanged because the angle is unchanged, while the distance between A and B obviously changes. That is the difference between Euclidean distance and cosine similarity.

(Figure: the difference between Euclidean distance and cosine similarity, illustrated in a three-dimensional coordinate system.)

Given their different measurement characteristics, Euclidean distance and cosine similarity suit different data-analysis models. Euclidean distance reflects absolute differences in individual numerical features, so it is better suited to analyses that depend on the magnitudes of dimensions — for example, using user-behavior metrics to analyze similarity or difference in user value. Cosine similarity distinguishes differences in direction and is insensitive to absolute values, so it is better suited to distinguishing similarity of user interests from content ratings; its insensitivity to absolute values also corrects for the differing rating standards that may exist between users.

Copyright notice: this is the blogger's original article; please do not reproduce it without the blogger's permission.

