Some common distances and metrics

Source: Internet
Author: User

A distance function has certain properties. If D(P1, P2) denotes the distance between points P1 and P2, the following properties hold:

(1) Non-negativity

(A) For all P1 and P2, D(P1, P2) ≥ 0;

(B) D(P1, P2) = 0 if and only if P1 = P2.

(2) Symmetry

For all P1 and P2, D(P1, P2) = D(P2, P1).

(3) Triangle inequality

For all P1, P2, and P3, D(P1, P3) ≤ D(P1, P2) + D(P2, P3).

A distance that satisfies the preceding three properties is called a metric.
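As a quick sanity check, the three properties can be verified numerically for a concrete distance. The sketch below (plain Python; the `euclidean` helper and the sample points are my own choices, using the Euclidean distance discussed later) asserts each property over all triples of a few points:

```python
import itertools
import math

def euclidean(p1, p2):
    # straight-line (L2) distance between two equal-length points
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))

points = [(0, 0), (3, 4), (6, 8), (1, -2)]

for p1, p2, p3 in itertools.product(points, repeat=3):
    assert euclidean(p1, p2) >= 0                                 # non-negativity
    assert (euclidean(p1, p2) == 0) == (p1 == p2)                 # zero iff same point
    assert euclidean(p1, p2) == euclidean(p2, p1)                 # symmetry
    # small slack for floating-point rounding
    assert euclidean(p1, p3) <= euclidean(p1, p2) + euclidean(p2, p3) + 1e-9
```

If any property failed, an assertion would raise; silence means all checks passed for these points.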

 

Below are some common distances:

(1) Minkowski distance: a metric on Euclidean space that generalizes both the Manhattan and Euclidean distances. For two points P1 (x1, x2,..., xn) and P2 (y1, y2,..., yn) in the space, the distance between them is defined as:

D(P1, P2) = ( |x1 − y1|^r + |x2 − y2|^r + … + |xn − yn|^r )^(1/r)

This is also called the Lr-norm distance: r = 1 gives the L1-norm or Manhattan (city-block) distance, and r = 2 gives the L2-norm or Euclidean distance. As r tends to infinity, we obtain the Chebyshev distance:

D(P1, P2) = max over i of |xi − yi|
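The limiting behavior can be seen numerically. A minimal sketch (plain Python; the function names and sample points are my own) showing the Minkowski distance approaching the Chebyshev distance as r grows:

```python
def minkowski(p1, p2, r):
    # L_r distance: (sum of |xi - yi|^r) ^ (1/r)
    return sum(abs(a - b) ** r for a, b in zip(p1, p2)) ** (1.0 / r)

def chebyshev(p1, p2):
    # L_inf distance: the largest coordinate-wise difference
    return max(abs(a - b) for a, b in zip(p1, p2))

p1, p2 = (0, 0, 0), (1, 2, 5)
for r in (1, 2, 10, 100):
    print(r, minkowski(p1, p2, r))
print("inf", chebyshev(p1, p2))
```

As r increases, the largest coordinate difference (here 5) dominates the sum, so the Minkowski distance converges to the Chebyshev distance.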

 

Given P1, P2, and r, the Minkowski distance can be computed as follows:

def minkowski_distance(p1, p2, r):
    total = 0
    for i in range(len(p1)):
        total = total + abs(p1[i] - p2[i]) ** r
    return total ** (1 / r)

 

(2) Jaccard distance

The distance between two sets A and B is defined as 1 − J(A, B), where J(A, B) is the Jaccard similarity coefficient (also called the Jaccard index). The Jaccard similarity coefficient of two sets is the number of elements in their intersection divided by the number of elements in their union:

J(A, B) = |A ∩ B| / |A ∪ B|

def jaccard_distance(A, B):
    C = A.intersection(B)
    return 1 - len(C) / (len(A) + len(B) - len(C))

 

(3) Cosine similarity

A document is often represented as a vector in which each attribute records the frequency with which a particular word occurs in the document. For two document vectors x and y, cosine similarity is defined as:

cos(x, y) = (x · y) / (|x| |y|)

Here "·" denotes the dot product of vectors and |x| denotes the length (L2-norm) of vector x. Cosine similarity ignores the magnitudes of the data objects (when magnitude matters, Euclidean distance may be a better choice).

def cosine_similarity(x, y):
    dot_product = 0
    square_x = 0
    square_y = 0
    for i in range(len(x)):
        dot_product = dot_product + x[i] * y[i]
        square_x = square_x + x[i] * x[i]
        square_y = square_y + y[i] * y[i]
    return dot_product / ((square_x ** 0.5) * (square_y ** 0.5))

x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
print(cosine_similarity(x, y))  # approximately 0.315

 

 

(4) Generalized Jaccard coefficient (also called the Tanimoto coefficient)

It can be used for document data. Denoted EJ, it is defined for two vectors x and y as:

EJ(x, y) = (x · y) / (|x|^2 + |y|^2 − x · y)

For binary attributes this reduces to the Jaccard coefficient. For example, consider:

x = (1, 1, 0, 0, 1), y = (0, 1, 1, 0, 0)

Incidentally, for two n-dimensional vectors x and y whose attributes are all binary (each attribute takes only the value 0 or 1), define:

M11: the number of attributes where x is 1 and y is also 1.

M10: the number of attributes where x is 1 and y is 0.

M01: the number of attributes where x is 0 and y is 1.

M00: the number of attributes where x is 0 and y is also 0.

M00 is not counted when computing the Jaccard coefficient because, for document vectors for example, zeros actually dominate; the vectors are sparse. If attributes where both values are 0 were counted, any two documents would appear similar simply because they share many zeros, which is unreasonable.

Therefore:

J = M11 / (M01 + M10 + M11)

So for the two vectors x and y given above:

J(x, y) = 1 / (1 + 2 + 1) = 1/4

EJ(x, y) = 1 / (3 + 2 − 1) = 1/4
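The two results above can be reproduced with a short sketch (plain Python; the function names are my own):

```python
def tanimoto(x, y):
    # generalized Jaccard (Tanimoto): x.y / (|x|^2 + |y|^2 - x.y)
    dot = sum(a * b for a, b in zip(x, y))
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    return dot / (sq_x + sq_y - dot)

def binary_jaccard(x, y):
    # M11 / (M01 + M10 + M11); attributes where both are 0 are ignored
    m11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    m10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    m01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return m11 / (m01 + m10 + m11)

x = (1, 1, 0, 0, 1)
y = (0, 1, 1, 0, 0)
print(binary_jaccard(x, y))  # 0.25
print(tanimoto(x, y))        # 0.25
```

Both functions give 1/4 for these binary vectors, illustrating that EJ reduces to the Jaccard coefficient in the binary case.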

 

(5) Pearson correlation coefficient

It measures the degree of linear correlation between two variables (here a variable may be a vector or a data set). The formula is:

corr(x, y) = covariance(x, y) / (std_dev(x) * std_dev(y)) = Σ(xi − x̄)(yi − ȳ) / sqrt( Σ(xi − x̄)^2 · Σ(yi − ȳ)^2 )

def pearson_correlation_coefficient(x, y):
    sum_x = 0
    sum_y = 0
    sum_x_square = 0
    sum_y_square = 0
    sum_product = 0
    n = len(x)
    for i in range(n):
        sum_x = sum_x + x[i]
        sum_y = sum_y + y[i]
        sum_x_square = sum_x_square + x[i] * x[i]
        sum_y_square = sum_y_square + y[i] * y[i]
        sum_product = sum_product + x[i] * y[i]

    num = sum_product - sum_x * sum_y / n
    den = ((sum_x_square - sum_x ** 2 / n) * (sum_y_square - sum_y ** 2 / n)) ** 0.5

    if den == 0:
        return 0
    return num / den

x = (1, 2, 3)
y = (2, 5, 6)
print(pearson_correlation_coefficient(x, y))
