Distance metric algorithms

Source: Internet
Author: User

1. Euclidean distance

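Equation: the Euclidean distance between two n-dimensional vectors A (x11, x12, ..., x1n) and B (x21, x22, ..., x2n) is:

   $$d_{12} = \sqrt{\sum_{k=1}^{n} (x_{1k} - x_{2k})^2}$$

It can also be expressed as a vector operation, treating A and B as row vectors:

   $$d_{12} = \sqrt{(A - B)(A - B)^{T}}$$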

Application: suited to analyzing differences in the numerical magnitude of each dimension, for example using user behavior indicators to analyze the similarity or difference in user value.
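As a minimal sketch of the Euclidean distance described above, the following Python function sums the squared per-dimension differences and takes the square root. The function name `euclidean_distance` and the sample indicator values are assumptions made for illustration; they do not come from the original article.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    # Sum the squared differences per dimension, then take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical behavior indicators for two users (illustrative values only).
user_a = [3.0, 4.0, 0.0]
user_b = [0.0, 0.0, 0.0]
print(euclidean_distance(user_a, user_b))  # 5.0
```

Because the result depends on absolute magnitudes, indicators measured on very different scales would usually need to be normalized first.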

2. Cosine distance

Equation: for two n-dimensional sample points A (x11, x12, ..., x1n) and B (x21, x22, ..., x2n), a concept analogous to the cosine of the angle between two vectors can be used to measure how similar they are to each other:

   $$\cos(\theta) = \frac{A \cdot B}{|A|\,|B|}$$

That is:

   $$\cos(\theta) = \frac{\sum_{k=1}^{n} x_{1k} x_{2k}}{\sqrt{\sum_{k=1}^{n} x_{1k}^2}\,\sqrt{\sum_{k=1}^{n} x_{2k}^2}}$$

Application: distinguishes differences in direction rather than in magnitude. It is typically applied to users' content ratings to tell apart similarity and difference in interests, and it corrects for the problem that users may not share a uniform rating standard (because cosine distance is insensitive to absolute values).
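The insensitivity to absolute values can be seen in a short Python sketch. The function name `cosine_similarity` and the two rating vectors are assumptions for illustration. Raising an error for a zero vector is also a choice made here, since the cosine is undefined in that case.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Two hypothetical users whose ratings differ in scale but not in proportion.
strict_user = [1, 2, 3]
generous_user = [2, 4, 6]
print(cosine_similarity(strict_user, generous_user))  # approximately 1.0
```

When a distance rather than a similarity is needed, one common convention is to use 1 minus the cosine similarity.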

3. Jaccard Similarity measurement

Definition (1): Jaccard similarity coefficient

The proportion of the elements in the intersection of two sets A and B to the elements in their union is called the Jaccard similarity coefficient of the two sets, denoted by the symbol J(A, B):

   $$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

The Jaccard similarity coefficient is an indicator of the similarity of two sets.

Definition (2): Jaccard distance

The concept opposite to the Jaccard similarity coefficient is the Jaccard distance. The Jaccard distance can be expressed by the following formula:

   $$J_{\delta}(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$$

The Jaccard distance measures how distinguishable two sets are by the proportion of elements that belong to only one of the two sets among all elements of both sets.
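A minimal Python sketch of both definitions, using the built-in `set` type. The function names and the example sets are assumptions for illustration. Treating two empty sets as an error is also an assumption, since conventions for that case vary.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: reject two empty sets instead of picking a convention.
        raise ValueError("Jaccard similarity is undefined for two empty sets")
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """Jaccard distance, defined as 1 minus the similarity coefficient."""
    return 1.0 - jaccard_similarity(a, b)


# Illustrative sets: the intersection has 2 elements, the union has 4.
first = {"a", "b", "c"}
second = {"b", "c", "d"}
print(jaccard_similarity(first, second))  # 0.5
print(jaccard_distance(first, second))    # 0.5
```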

Definition (3): Application of Jaccard similarity coefficient and Jaccard distance

The Jaccard similarity coefficient can be used to measure the similarity of samples.

Suppose sample A and sample B are two n-dimensional vectors whose values in every dimension are 0 or 1, for example A (0111) and B (1011). We treat each sample as a set: 1 means the set contains the element, and 0 means it does not.

p: the number of dimensions where both sample A and sample B are 1

q: the number of dimensions where sample A is 1 and sample B is 0

r: the number of dimensions where sample A is 0 and sample B is 1

s: the number of dimensions where both sample A and sample B are 0

Then the Jaccard similarity coefficient of samples A and B can be expressed as:

   $$J = \frac{p}{p + q + r}$$

Here p + q + r can be understood as the number of elements in the union of A and B, and p is the number of elements in their intersection.

The Jaccard distance between samples A and B is expressed as:

   $$J_{\delta} = \frac{q + r}{p + q + r}$$
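The counting scheme above can be sketched in Python using the example A (0111) and B (1011) from the text. The function name `binary_jaccard` is an assumption for illustration. Dimensions where both samples are 0 (the count s) are ignored, so the sketch does not compute them.

```python
def binary_jaccard(a, b):
    """Return (similarity, distance) for two equal-length 0/1 vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    p = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # both are 1
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # only A is 1
    r = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # only B is 1
    if p + q + r == 0:
        # Assumption: all-zero vectors are rejected instead of given a value.
        raise ValueError("undefined when both vectors are all zeros")
    return p / (p + q + r), (q + r) / (p + q + r)


# The example from the text: here p = 2, q = 1, r = 1.
sample_a = [0, 1, 1, 1]
sample_b = [1, 0, 1, 1]
print(binary_jaccard(sample_a, sample_b))  # (0.5, 0.5)
```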

4. Adjusted cosine similarity algorithm

Cosine similarity mainly distinguishes differences in direction and is insensitive to absolute values, so it cannot measure differences in the values on each dimension. This can lead to situations like the following:

Suppose users rate content on a 5-point scale, and two users X and Y rate two items (1,2) and (4,5) respectively. Cosine similarity gives 0.98, which says the two users are very similar. But judging from the scores, X does not seem to like either of the 2 items, while Y likes them both. The insensitivity of cosine similarity to magnitude causes this error. Adjusted cosine similarity corrects it by subtracting a mean from the values on every dimension. For example, the mean of the X and Y scores is 3, so the adjusted vectors are (-2,-1) and (1,2). Computing cosine similarity on these gives -0.8: the similarity is negative and the difference is considerable, which is clearly closer to reality.
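The arithmetic in this example can be checked with a short Python sketch. It assumes X scored (1,2) and Y scored (4,5), which is consistent with the adjusted vector (-2,-1) and the mean of 3 given in the text. The function names are hypothetical. Following the text, a single mean is subtracted from every score; other formulations of adjusted cosine similarity subtract a per-user mean instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def adjusted_cosine_similarity(a, b, mean):
    """Subtract the given mean from every value, then take the cosine."""
    return cosine_similarity([x - mean for x in a], [y - mean for y in b])


x_scores = [1, 2]  # assumed scores for user X
y_scores = [4, 5]  # assumed scores for user Y

print(round(cosine_similarity(x_scores, y_scores), 2))              # 0.98
print(round(adjusted_cosine_similarity(x_scores, y_scores, 3), 2))  # -0.8
```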

Various other distances, see: http://blog.csdn.net/shiwei408/article/details/7602324
