Distance and similarity measurement

Source: Internet
Author: User

Similarity measures and distance functions are very important for algorithms such as clustering and nearest-neighbor search. As mentioned above, page deduplication is another application of similarity. How to define a proper similarity or distance function, however, depends entirely on the task at hand. In general, a distance function d(x, y) must meet the following criteria:

1. d(x, x) = 0 // The distance to oneself is 0.
2. d(x, y) >= 0 // The distance must be non-negative.
3. Symmetry, d(x, y) = d(y, x) // If the distance from A to B is a, then the distance from B to A should also be a.
4. Triangle inequality (the sum of two sides is at least the third side), d(x, k) + d(k, y) >= d(x, y)
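
The four conditions can be spot-checked numerically on a handful of sample points. The sketch below is only an illustration: the helper names `euclidean` and `check_metric` and the sample points are assumptions made for this example, and passing on a finite sample does not prove that a function is a metric. It does show how a plausible-looking function, the squared straight-line distance, fails the triangle condition.

```python
from itertools import product

def euclidean(x, y):
    """Straight-line distance between two equal-length tuples of numbers."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def check_metric(d, points, tol=1e-9):
    """Return True if d satisfies the four conditions on every triple of sample points."""
    for x, y, k in product(points, repeat=3):
        if abs(d(x, x)) > tol:                 # 1. distance to oneself is 0
            return False
        if d(x, y) < -tol:                     # 2. non-negative
            return False
        if abs(d(x, y) - d(y, x)) > tol:       # 3. symmetry
            return False
        if d(x, k) + d(k, y) < d(x, y) - tol:  # 4. triangle inequality
            return False
    return True

points = [(0, 0), (3, 4), (1, -2), (-5, 2)]
is_metric = check_metric(euclidean, points)

# Squared distance breaks condition 4 on three collinear points: 1 + 1 < 4.
squared_ok = check_metric(lambda x, y: euclidean(x, y) ** 2,
                          [(0, 0), (1, 0), (2, 0)])
```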

There are many distance functions that meet these four conditions. The common ones usually come from a fairly intuitive picture, such as the straight-line distance between two points on a plane. The following describes several widely used distance or similarity measures: Euclidean distance, cosine similarity, the Pearson correlation coefficient, the Jaccard index, and edit distance. If an object d (such as a document) is represented as an n-dimensional vector (d1, d2, ..., dn), where every dimension is a feature of the object, these measures are easy to apply.

1. Norms and Euclidean distance
Euclidean distance comes from Euclidean geometry (the geometry we first met in elementary school) and corresponds to what mathematics calls a norm. If an object corresponds to a point in a space, each feature is one dimension of that space. In the special case n = 1, we learned that the distance between two points on a line is |x1 - x2|. A natural way to extend this to higher dimensions is simply to add up the distances in each dimension. This gives the so-called 1-norm:

$$d_1(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

Simple, isn't it? If there is a 1-norm, there are also a 2-norm, a 3-norm, and so on up to the infinity norm. In fact, the familiar two-point distance formulas for two-dimensional and three-dimensional space are just the 2-norm in two and three dimensions.

In general, the p-norm is:

$$d_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$

Infinity norm:

$$d_\infty(x, y) = \max_{i} |x_i - y_i|$$

The distance between two points in space (the 2-norm) is the most commonly used distance formula: the Euclidean distance. It is simple and easy to use.
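
As a minimal sketch of the norms described above, the functions below compute the p-norm distance and the infinity-norm distance for plain Python sequences. The function names `minkowski` and `chebyshev` are the conventional names for these distances, chosen here for illustration, and the sample points are made up.

```python
def minkowski(x, y, p):
    """p-norm distance between equal-length sequences x and y (p >= 1)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def chebyshev(x, y):
    """Infinity-norm distance: the largest per-dimension difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (0, 0), (3, 4)
d1 = minkowski(x, y, 1)     # 1-norm: 3 + 4
d2 = minkowski(x, y, 2)     # 2-norm (Euclidean): sqrt(9 + 16)
d50 = minkowski(x, y, 50)   # a large p moves toward the infinity norm
dinf = chebyshev(x, y)      # infinity norm: max(3, 4)
```

As p grows, the largest per-dimension difference dominates the sum, which is why the infinity norm reduces to a maximum.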

2. Cosine Similarity
Cosine similarity is widely used. When learning vector geometry, you will have come across this formula:

$$\cos(A, B) = \frac{A \cdot B}{|A| \, |B|}$$

The numerator is the dot product of the two vectors, and |A| is the length of vector A. The appeal of this formula is that its value ranges from -1 to 1 as the angle changes. The cosine of the angle between the vectors is taken as the similarity between them. Cosine similarity says that if the angle between two vectors is fixed, their similarity stays the same no matter how much either vector is stretched. Therefore, before applying cosine similarity, we need to normalize every dimension of the object. In search engine technology, cosine similarity is widely used to compute the similarity between queries and documents. For a query (for example, "how is the weather tomorrow"), each dimension of the query vector is the TF-IDF weight of the corresponding word.
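
A minimal sketch of cosine similarity on plain Python lists follows. The function name `cosine_similarity` and the sample vectors (stand-ins for TF-IDF weights) are assumptions made for illustration. The example shows that stretching a vector leaves the similarity unchanged.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

query = [1.0, 2.0, 0.0]   # hypothetical TF-IDF weights
doc = [2.0, 4.0, 0.0]     # same direction, twice as long
same_direction = cosine_similarity(query, doc)
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
```

With non-negative weights such as TF-IDF, the result stays between 0 and 1; negative values only appear when components can be negative.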

One extension of cosine similarity is the Tanimoto coefficient:

$$T(A, B) = \frac{A \cdot B}{|A|^2 + |B|^2 - A \cdot B}$$

In fact, it is nothing special. The denominator of T(A, B) is greater than or equal to that of cosine similarity, and the two can be equal only when A and B have the same length. This means that the Tanimoto coefficient takes the length difference between the two vectors into account: the larger the length difference, the smaller the similarity.
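
The effect of vector length can be seen in a short sketch. It assumes the usual vector form of the coefficient, T(A, B) = A·B / (|A|^2 + |B|^2 - A·B); the function name `tanimoto` and the sample vectors are illustrative only.

```python
def tanimoto(a, b):
    """Tanimoto coefficient in vector form: a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(x * y for x, y in zip(a, b))
    sq_a = sum(x * x for x in a)
    sq_b = sum(y * y for y in b)
    denom = sq_a + sq_b - dot
    if denom == 0:
        raise ValueError("undefined when both vectors are zero")
    return dot / denom

a = [1.0, 2.0]
identical = tanimoto(a, a)             # 5 / (5 + 5 - 5)
doubled = tanimoto(a, [2.0, 4.0])      # 10 / (5 + 20 - 10)
tenfold = tanimoto(a, [10.0, 20.0])    # 50 / (5 + 500 - 50)
```

All three pairs point in the same direction, so their cosine similarity would be identical, while the coefficient computed here drops as the lengths diverge.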

3. Jaccard Index

The intuition behind Jaccard similarity is how much two sets have in common. Naturally, Jaccard is best suited to discrete variables. First look at the formula (it is simpler than it looks):

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

The numerator is the size of the set intersection and the denominator is the size of the set union. Draw a picture and it is immediately clear what is going on.

A formula similar to the Jaccard index is Dice's coefficient, which is also intuitive:

$$D(A, B) = \frac{2 \, |A \cap B|}{|A| + |B|}$$
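
Both set measures are short to write with Python's built-in `set` type. The sketch below is illustrative: the function names, the sample word lists, and the convention of returning 1.0 for two empty sets are all assumptions of this example.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(a | b)

def dice(a, b):
    """Twice the intersection size divided by the sum of the set sizes."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # same assumed convention as above
    return 2 * len(a & b) / (len(a) + len(b))

doc1 = "the quick brown fox".split()
doc2 = "the lazy brown dog".split()
j = jaccard(doc1, doc2)   # 2 shared words / 6 distinct words
d = dice(doc1, doc2)      # 2 * 2 / (4 + 4)
```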

4. Pearson Correlation Coefficient

Anyone who has studied probability theory knows about the mean, the variance, and the correlation coefficient. The correlation coefficient describes whether two sets of variables are linearly related. Its advantage is that it does not depend on the scale of the variables, which is similar to cosine. One application is a product recommendation system. To recommend a product to user A, you first use product ratings to find the k users most similar to A. However, some people naturally give high scores and others give low scores. The correlation coefficient is a good measure for eliminating this problem. The formula is:

$$r(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

This formula is related to cosine: it is the cosine similarity of the two vectors after the mean has been subtracted from each. I will not go into the details here. See correlation.
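
The rating-bias point can be illustrated with a small sketch. The function name `pearson` and the rating lists are made up for this example. Two users with the same taste, one of whom scores everything two points lower, still get a correlation of 1.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length rating lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]   # center each user's ratings on their own mean
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)

generous = [5, 4, 5, 3]   # hypothetical ratings from a high scorer
strict = [3, 2, 3, 1]     # same preferences, every score 2 lower
r_same = pearson(generous, strict)
r_opposite = pearson(generous, [1, 2, 1, 3])
```

Subtracting each user's mean before comparing is what removes the "high scorer versus low scorer" offset.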

5. Edit distance, or Levenshtein distance

Edit distance measures the similarity between two strings: the number of delete, insert, and substitute steps needed to turn string A into string B (more generally, each edit step can be given a weight). Wikipedia has a very good description, so I will not go into details here; see Levenshtein distance. A related string similarity measure is the Jaro-Winkler distance.
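
For reference, here is a minimal sketch of the standard dynamic-programming computation of Levenshtein distance with unit cost for every edit step. The function name `levenshtein` and the test strings are illustrative; a weighted variant would replace the three `1` costs with per-operation weights.

```python
def levenshtein(a, b):
    """Minimum number of single-character deletes, inserts, and substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb, or keep a match
            ))
        previous = current
    return previous[-1]

dist = levenshtein("kitten", "sitting")
```

Keeping only two rows of the table uses memory proportional to the length of the second string instead of the product of both lengths.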

6. SimRank similarity

SimRank comes from graph theory. Two objects are considered similar because they link to the same or similar nodes. This algorithm deserves a discussion of its own, which I will leave for next time.
