Distance and similarity measurement

Last Update:2018-12-03 Source: Internet

Author: User


Similarity measures and distance functions are essential to algorithms such as clustering and nearest-neighbor search. As mentioned above, page deduplication is another application of similarity. How to define a suitable similarity or distance function depends entirely on the task at hand. In general, a distance function d(x, y) must satisfy the following conditions:

1. d(x, x) = 0 // The distance from a point to itself is 0.
2. d(x, y) >= 0 // The distance must be non-negative.
3. Symmetry: d(x, y) = d(y, x) // If the distance from A to B is a, then the distance from B to A is also a.
4. Triangle inequality (the sum of two sides is at least the third side): d(x, k) + d(k, y) >= d(x, y)
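The four conditions can be spot-checked numerically on a handful of points. The sketch below is only an illustration: the helper name `satisfies_metric_axioms`, the sample points, and the tolerance are assumptions made for this example, and passing the check on a few points does not prove that a function is a metric.

```python
import itertools
import math


def euclidean(x, y):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    """Squared straight-line distance; used here as a counterexample."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def satisfies_metric_axioms(d, points, tol=1e-12):
    """Check the four conditions on every combination of the given points."""
    for x in points:
        if abs(d(x, x)) > tol:                      # 1. d(x, x) = 0
            return False
    for x, y in itertools.product(points, repeat=2):
        if d(x, y) < 0:                             # 2. non-negativity
            return False
        if abs(d(x, y) - d(y, x)) > tol:            # 3. symmetry
            return False
    for x, k, y in itertools.product(points, repeat=3):
        if d(x, k) + d(k, y) + tol < d(x, y):       # 4. triangle inequality
            return False
    return True


sample_points = [(0, 0), (3, 4), (1, 1), (-2, 5)]
line_points = [(0,), (1,), (2,)]

print(satisfies_metric_axioms(euclidean, sample_points))
print(satisfies_metric_axioms(squared_euclidean, line_points))
```

The squared distance fails on the three collinear points because 1 + 1 is smaller than 4, which violates the triangle inequality.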

Many distance functions meet these four conditions. The common ones usually start from an intuitive picture, such as the straight-line distance between two points in a plane. The following sections describe several widely used distance and similarity measures: Euclidean distance, cosine similarity, the Pearson correlation coefficient, the Jaccard index, and edit distance. If an object d (such as a document) is represented as an n-dimensional vector (d1, d2, ..., dn), where every dimension is a feature of the object, these measures are easy to apply.

1. Norms and Euclidean distance
Euclidean distance comes from Euclidean geometry (the geometry we first met in elementary school) and corresponds to what mathematics calls a norm. If an object corresponds to a point in a space, each feature is one dimension of that space. In the special case n = 1, the distance between two points x and y on a line is |x - y|. The natural way to extend this to higher dimensions is simply to add up the distances along each dimension. This gives the 1-norm:

$$d_1(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

Simple, isn't it? If there is a 1-norm, there are also a 2-norm, a 3-norm, and so on up to the infinity norm. In fact, the familiar formulas for the distance between two points in two-dimensional and three-dimensional space are the 2-norm in two and three dimensions.

In general, the p-norm is:

$$d_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$

Infinity norm:

$$d_\infty(x, y) = \max_{1 \le i \le n} |x_i - y_i|$$

The 2-norm, the distance between two points in space, is the most commonly used distance formula and is known as the Euclidean distance. It is easy to use.
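The 1-norm, 2-norm, p-norm, and infinity norm described above can all be computed by one small function. This is a minimal sketch using only the standard library; the function name `minkowski` and the convention of passing `float("inf")` for the infinity norm are choices made for this example.

```python
def minkowski(x, y, p):
    """p-norm distance between two equal-length numeric sequences.

    p = 1 gives the 1-norm, p = 2 the Euclidean distance, and
    p = float("inf") the infinity norm (largest per-dimension difference).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of dimensions")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == float("inf"):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


x, y = (0, 0), (3, 4)
for p in (1, 2, 3, float("inf")):
    print(p, minkowski(x, y, p))
```

For the points (0, 0) and (3, 4), the 1-norm is 7, the 2-norm is 5, and the infinity norm is 4; the distance shrinks toward the largest single-dimension difference as p grows.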

2. Cosine similarity
Cosine similarity is widely used. You probably met this formula when learning vector geometry:

$$\cos(A, B) = \frac{A \cdot B}{|A| \, |B|}$$

The numerator is the dot product of the two vectors, and |A| is the length of vector A. The value ranges from -1 to 1 as the angle between the vectors changes. The cosine of the angle between two vectors is taken as their similarity. As long as the angle between the two vectors is fixed, the similarity stays the same no matter how much either vector is stretched. Therefore, before applying cosine similarity, each dimension of the object needs to be normalized. In search engines, cosine similarity is widely used to compute the similarity between queries and documents. For a query (for example, "how is the weather tomorrow"), each dimension of the query vector is the TF-IDF weight of the corresponding word.
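A minimal sketch of cosine similarity, assuming vectors are given as plain Python sequences of numbers. The function name `cosine_similarity` and the choice to raise an error for a zero-length vector are assumptions of this example, not part of the original article.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The angle is undefined for a zero vector.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Stretching a vector does not change the angle, so the similarity stays near 1.
print(cosine_similarity((1, 2, 3), (2, 4, 6)))
# Perpendicular vectors have similarity 0.
print(cosine_similarity((1, 0), (0, 1)))
# Opposite vectors have similarity -1.
print(cosine_similarity((1, 0), (-1, 0)))
```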

One extension of cosine similarity is the Tanimoto coefficient:

$$T(A, B) = \frac{A \cdot B}{|A|^2 + |B|^2 - A \cdot B}$$

There is nothing mysterious about it. The denominator of T(A, B) is greater than or equal to that of cosine similarity, and the two can be equal only when A and B have the same length. This means that the Tanimoto coefficient takes the difference in length between the two vectors into account: the larger the difference in length, the smaller the similarity.
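The effect of vector length can be seen with a small sketch. It assumes the common vector form of the coefficient, T(A, B) = A·B / (|A|² + |B|² − A·B); the function name `tanimoto` and the sample vectors are made up for this example.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    sq_a = sum(x * x for x in a)
    sq_b = sum(y * y for y in b)
    denominator = sq_a + sq_b - dot
    if denominator == 0:
        # Only happens when both vectors are all zeros.
        raise ValueError("Tanimoto coefficient is undefined for two zero vectors")
    return dot / denominator


# Same direction, same length: the coefficient is 1.
print(tanimoto((1, 2, 3), (1, 2, 3)))
# Same direction, different length: cosine similarity would still be 1,
# but the Tanimoto coefficient drops.
print(tanimoto((1, 1), (2, 2)))
print(tanimoto((1, 1), (10, 10)))
```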

3. Jaccard index

The intuition behind Jaccard similarity is that two sets are similar when they share many elements. Jaccard is therefore best suited to sets of discrete variables. The formula is:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

The numerator is the size of the intersection and the denominator is the size of the union. Draw a picture and it is immediately clear what is going on.

A formula similar to the Jaccard index is Dice's coefficient, which is just as intuitive:

$$D(A, B) = \frac{2 \, |A \cap B|}{|A| + |B|}$$
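Both set-based measures map directly onto Python's built-in `set` operations. In this sketch the function names and the sample word sets are illustrative, and returning 1.0 for two empty sets is a convention assumed here because the formulas are undefined in that case.

```python
def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(a | b)


def dice_coefficient(a, b):
    """Twice the intersection size divided by the sum of the set sizes."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention, as above
    return 2 * len(a & b) / (len(a) + len(b))


doc1 = {"how", "is", "the", "weather"}
doc2 = {"the", "weather", "tomorrow", "morning"}

print(jaccard_index(doc1, doc2))     # 2 shared words out of 6 distinct words
print(dice_coefficient(doc1, doc2))  # 2 * 2 shared words over 4 + 4 words
```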

4. Pearson correlation coefficient

Anyone who has studied probability theory knows about the mean, the variance, and the correlation coefficient. The correlation coefficient describes whether two sets of variables are linearly related. Its advantage is that it does not depend on the magnitude of the variables, which makes it similar to cosine. One application is a product recommendation system. To recommend a product to user A, you first look at how users have rated products and find the k users most similar to user A. However, some people naturally give high scores and others give low scores. The correlation coefficient is a good measure for removing this bias. The formula is:

$$r(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

This formula is closely related to cosine: it is the cosine similarity of the two vectors after each has had its mean subtracted. I will not go into the details here; see correlation.
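The rating example can be made concrete with a short sketch. The function name `pearson` and the two rating lists are invented for illustration; the computation subtracts each user's mean rating and then takes the cosine of the centered vectors, which is the standard definition of the Pearson coefficient.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    centered_x = [v - mean_x for v in x]
    centered_y = [v - mean_y for v in y]
    numerator = sum(a * b for a, b in zip(centered_x, centered_y))
    denominator = (math.sqrt(sum(a * a for a in centered_x))
                   * math.sqrt(sum(b * b for b in centered_y)))
    if denominator == 0:
        # A constant sequence has no variance, so the coefficient is undefined.
        raise ValueError("correlation is undefined for a constant sequence")
    return numerator / denominator


# Hypothetical ratings of the same four products by two users.
strict_user = [1, 2, 1, 3]      # tends to give low scores
generous_user = [3, 4, 3, 5]    # same taste, but every score is 2 higher

print(pearson(strict_user, generous_user))
print(pearson([1, 2, 3], [3, 2, 1]))
```

The two users get a correlation of about 1 even though their raw scores differ, which is the bias removal described above. The second call shows two opposite rankings, which give a correlation of about -1.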

5. Edit distance (Levenshtein distance)

Edit distance measures the similarity between two strings: string A is turned into string B by deleting, inserting, and substituting characters. In general, the distance is the minimum number of steps needed to convert string A into string B, and each kind of edit step can be given its own weight. Wikipedia has a very good description (see Levenshtein distance), so I will not go into details here. A related string similarity measure is the Jaro-Winkler distance.
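A minimal sketch of the unweighted case, where every deletion, insertion, and substitution costs 1. It uses the standard dynamic-programming approach and keeps only two rows of the table; the function name `levenshtein` is chosen for this example.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits needed to turn a into b."""
    # previous[j] is the distance between the first i-1 characters of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("flaw", "lawn"))       # 2
```

Weighted variants replace the constant costs of 1 with a per-operation weight.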

6. SimRank similarity

SimRank comes from graph theory. Two objects are considered similar because they are linked to the same or similar nodes. This algorithm deserves its own discussion, which is left for next time.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
