Some common distances and metrics

Source: Internet
Author: User

A distance function has certain properties. If D(P1, P2) denotes the distance between points P1 and P2, the following properties hold:

(1) Non-negativity

(A) For all P1 and P2, D(P1, P2) ≥ 0;

(B) D(P1, P2) = 0 if and only if P1 = P2.

(2) Symmetry

For all P1 and P2, D(P1, P2) = D(P2, P1).

(3) Triangle inequality

For all P1, P2, and P3, D(P1, P3) ≤ D(P1, P2) + D(P2, P3).

A distance that satisfies the preceding three properties is called a metric.
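As a quick sanity check, the three properties can be verified numerically for a concrete distance. The sketch below (plain Python; the `euclidean` helper and the sample points are my own choices, using the Euclidean distance discussed later) asserts each property over all triples of a few points:

```python
import itertools
import math

def euclidean(p1, p2):
    # straight-line (L2) distance between two equal-length points
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))

points = [(0, 0), (3, 4), (6, 8), (1, -2)]

for p1, p2, p3 in itertools.product(points, repeat=3):
    assert euclidean(p1, p2) >= 0                                 # non-negativity
    assert (euclidean(p1, p2) == 0) == (p1 == p2)                 # zero iff same point
    assert euclidean(p1, p2) == euclidean(p2, p1)                 # symmetry
    # small slack for floating-point rounding
    assert euclidean(p1, p3) <= euclidean(p1, p2) + euclidean(p2, p3) + 1e-9
```

If any property failed, an assertion would raise; silence means all checks passed for these points.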

 

Below are some common distances:

(1) Minkowski distance: a metric on Euclidean space that generalizes both the Manhattan and Euclidean distances. For two points P1 (x1, x2,..., xn) and P2 (y1, y2,..., yn) in the space, the distance between them is defined as:

D(P1, P2) = ( |x1 − y1|^r + |x2 − y2|^r + … + |xn − yn|^r )^(1/r)

This is also called the Lr-norm distance: r = 1 gives the L1-norm or Manhattan (city-block) distance, and r = 2 gives the L2-norm or Euclidean distance. As r tends to infinity, we obtain the Chebyshev distance:

D(P1, P2) = max over i of |xi − yi|
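The limiting behavior can be seen numerically. A minimal sketch (plain Python; the function names and sample points are my own) showing the Minkowski distance approaching the Chebyshev distance as r grows:

```python
def minkowski(p1, p2, r):
    # L_r distance: (sum of |xi - yi|^r) ^ (1/r)
    return sum(abs(a - b) ** r for a, b in zip(p1, p2)) ** (1.0 / r)

def chebyshev(p1, p2):
    # L_inf distance: the largest coordinate-wise difference
    return max(abs(a - b) for a, b in zip(p1, p2))

p1, p2 = (0, 0, 0), (1, 2, 5)
for r in (1, 2, 10, 100):
    print(r, minkowski(p1, p2, r))
print("inf", chebyshev(p1, p2))
```

As r increases, the largest coordinate difference (here 5) dominates the sum, so the Minkowski distance converges to the Chebyshev distance.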

 

Given P1, P2, and r, the Minkowski distance can be computed as follows:

def minkowski_distance(p1, p2, r):
    total = 0
    for i in range(len(p1)):
        total = total + abs(p1[i] - p2[i]) ** r
    return total ** (1 / r)

 

(2) Jaccard distance

The distance between two sets A and B is defined as 1 − J(A, B), where J(A, B) is the Jaccard similarity coefficient (also called the Jaccard index). The Jaccard similarity coefficient of two sets is the number of elements in their intersection divided by the number of elements in their union:

J(A, B) = |A ∩ B| / |A ∪ B|

def jaccard_distance(A, B):
    C = A.intersection(B)
    return 1 - len(C) / (len(A) + len(B) - len(C))

 

(3) Cosine similarity

A document is often represented as a vector in which each attribute records the frequency with which a particular word occurs in the document. For two document vectors x and y, cosine similarity is defined as:

cos(x, y) = (x · y) / (|x| |y|)

Here "·" denotes the dot product of vectors and |x| denotes the length (L2-norm) of vector x. Cosine similarity ignores the magnitudes of the data objects (when magnitude matters, Euclidean distance may be a better choice).

def cosine_similarity(x, y):
    dot_product = 0
    square_x = 0
    square_y = 0
    for i in range(len(x)):
        dot_product = dot_product + x[i] * y[i]
        square_x = square_x + x[i] * x[i]
        square_y = square_y + y[i] * y[i]
    return dot_product / ((square_x ** 0.5) * (square_y ** 0.5))

x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
print(cosine_similarity(x, y))  # approximately 0.315

 

 

(4) Generalized Jaccard coefficient (also called the Tanimoto coefficient)

It can be used for document data. Denoted EJ, it is defined for two vectors x and y as:

EJ(x, y) = (x · y) / (|x|^2 + |y|^2 − x · y)

For binary attributes this reduces to the Jaccard coefficient. For example, consider:

x = (1, 1, 0, 0, 1), y = (0, 1, 1, 0, 0)

Incidentally, for two n-dimensional vectors x and y whose attributes are all binary (each attribute takes only the value 0 or 1), define:

M11: the number of attributes where x is 1 and y is also 1.

M10: the number of attributes where x is 1 and y is 0.

M01: the number of attributes where x is 0 and y is 1.

M00: the number of attributes where x is 0 and y is also 0.

M00 is not counted when computing the Jaccard coefficient because, for document vectors for example, zeros actually dominate; the vectors are sparse. If attributes where both values are 0 were counted, any two documents would appear similar simply because they share many zeros, which is unreasonable.

Therefore:

J = M11 / (M01 + M10 + M11)

So for the two vectors x and y given above:

J(x, y) = 1 / (1 + 2 + 1) = 1/4

EJ(x, y) = 1 / (3 + 2 − 1) = 1/4
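The two results above can be reproduced with a short sketch (plain Python; the function names are my own):

```python
def tanimoto(x, y):
    # generalized Jaccard (Tanimoto): x.y / (|x|^2 + |y|^2 - x.y)
    dot = sum(a * b for a, b in zip(x, y))
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    return dot / (sq_x + sq_y - dot)

def binary_jaccard(x, y):
    # M11 / (M01 + M10 + M11); attributes where both are 0 are ignored
    m11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    m10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    m01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return m11 / (m01 + m10 + m11)

x = (1, 1, 0, 0, 1)
y = (0, 1, 1, 0, 0)
print(binary_jaccard(x, y))  # 0.25
print(tanimoto(x, y))        # 0.25
```

Both functions give 1/4 for these binary vectors, illustrating that EJ reduces to the Jaccard coefficient in the binary case.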

 

(5) Pearson correlation coefficient

It measures the degree of linear correlation between two variables (here a variable may be a vector or a data set). The formula is:

corr(x, y) = covariance(x, y) / (std_dev(x) * std_dev(y)) = Σ(xi − x̄)(yi − ȳ) / sqrt( Σ(xi − x̄)^2 · Σ(yi − ȳ)^2 )

def pearson_correlation_coefficient(x, y):
    sum_x = 0
    sum_y = 0
    sum_x_square = 0
    sum_y_square = 0
    sum_product = 0
    n = len(x)
    for i in range(n):
        sum_x = sum_x + x[i]
        sum_y = sum_y + y[i]
        sum_x_square = sum_x_square + x[i] * x[i]
        sum_y_square = sum_y_square + y[i] * y[i]
        sum_product = sum_product + x[i] * y[i]

    num = sum_product - sum_x * sum_y / n
    den = ((sum_x_square - sum_x ** 2 / n) * (sum_y_square - sum_y ** 2 / n)) ** 0.5

    if den == 0:
        return 0
    return num / den

x = (1, 2, 3)
y = (2, 5, 6)
print(pearson_correlation_coefficient(x, y))
