A distance function has several important properties. If d(P1, P2) denotes the distance between points P1 and P2, the following properties hold:
(1) Non-negativity
(A) For all P1 and P2, d(P1, P2) ≥ 0.
(B) d(P1, P2) = 0 if and only if P1 = P2.
(2) Symmetry
For all P1 and P2, d(P1, P2) = d(P2, P1).
(3) Triangle inequality
For all P1, P2, and P3, d(P1, P3) ≤ d(P1, P2) + d(P2, P3).
A measure that satisfies the preceding three properties is called a metric.
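As a quick illustration, here is a minimal sketch (assuming Python 3.8+ for math.dist, and sample points of my own choosing) that spot-checks the three properties for the Euclidean distance:

import math

def check_metric_properties(points, d):
    # Spot-check non-negativity, symmetry, and the triangle
    # inequality for distance function d on the given points.
    for p1 in points:
        for p2 in points:
            assert d(p1, p2) >= 0                  # non-negativity
            assert (d(p1, p2) == 0) == (p1 == p2)  # zero iff identical
            assert d(p1, p2) == d(p2, p1)          # symmetry
            for p3 in points:
                # triangle inequality (tiny tolerance for float rounding)
                assert d(p1, p3) <= d(p1, p2) + d(p2, p3) + 1e-12

points = [(0, 0), (3, 4), (1, 1), (-2, 5)]
check_metric_properties(points, math.dist)  # math.dist: Euclidean distance
print("All metric properties hold on the sample points.")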
Below are some common distances:
(1) Minkowski distance: a metric on Euclidean space that generalizes both the Manhattan and Euclidean distances. For two points P1(x1, x2, ..., xn) and P2(y1, y2, ..., yn) in the space, the distance between them is defined as:

d(P1, P2) = (|x1 − y1|^r + |x2 − y2|^r + ... + |xn − yn|^r)^(1/r)
This is also called the Lr norm: r = 1 gives the L1 norm (the Manhattan distance; for binary vectors this coincides with the Hamming distance), and r = 2 gives the L2 norm, the Euclidean distance. As r tends to infinity, we get the Chebyshev distance (the L∞ norm):

d(P1, P2) = max(|x1 − y1|, |x2 − y2|, ..., |xn − yn|)
Given P1, P2, and r, the Minkowski distance in Euclidean space can be computed as follows:
def euclidean_space_distance(p1, p2, r):
    # Minkowski (Lr) distance between two equal-length sequences
    total = 0
    for i in range(len(p1)):
        total = total + pow(abs(p1[i] - p2[i]), r)
    return pow(total, 1 / r)
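For example (a quick check with illustrative points; Python 3 is assumed so that 1 / r is true division):

p1 = (0, 0)
p2 = (3, 4)
print(euclidean_space_distance(p1, p2, 1))   # 7.0 (L1 / Manhattan)
print(euclidean_space_distance(p1, p2, 2))   # 5.0 (L2 / Euclidean)

# In the limit r -> infinity the Minkowski distance approaches the
# Chebyshev distance, which can be computed directly as a maximum:
def chebyshev_distance(p1, p2):
    return max(abs(p1[i] - p2[i]) for i in range(len(p1)))

print(chebyshev_distance(p1, p2))            # 4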
(2) Jaccard distance
The Jaccard distance between two sets A and B is defined as 1 − J(A, B), where J(A, B) is the Jaccard similarity coefficient (also called the Jaccard index): the number of elements in the intersection of the two sets divided by the number of elements in their union:

J(A, B) = |A ∩ B| / |A ∪ B|
def jaccard_distance(a, b):
    # a and b are sets; |A ∪ B| = len(a) + len(b) - len(a ∩ b)
    c = a.intersection(b)
    return 1 - (len(c) / (len(a) + len(b) - len(c)))
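A quick usage example (the sets are illustrative):

a = {1, 2, 3, 4}
b = {3, 4, 5}
print(jaccard_distance(a, b))   # 1 - 2/5 = 0.6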
(3) Cosine similarity
A document is often represented as a vector in which each attribute records the frequency with which a particular word appears in the document. For two document vectors x and y, cosine similarity is defined as:

cos(x, y) = (x · y) / (|x| |y|)

where "·" denotes the dot product of vectors and |x| denotes the length of vector x. Cosine similarity does not take the magnitudes of the data objects into account (when magnitude matters, Euclidean distance may be a better choice).
def cosine_similarity(x, y):
    dot_product = 0
    square_x = 0
    square_y = 0
    for i in range(len(x)):
        dot_product = dot_product + x[i] * y[i]
        square_x = square_x + x[i] * x[i]
        square_y = square_y + y[i] * y[i]
    return dot_product / (pow(square_x, 0.5) * pow(square_y, 0.5))

x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
print(cosine_similarity(x, y))
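To see the magnitude-invariance mentioned above, scaling a vector leaves the similarity unchanged (a small illustrative check, equal up to floating-point rounding):

x_scaled = tuple(10 * v for v in x)    # same direction, 10x the magnitude
print(cosine_similarity(x_scaled, y))  # same value as cosine_similarity(x, y)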
(4) Generalized Jaccard coefficient (also called the Tanimoto coefficient)
It can be used for document data, and in the case of binary attributes it reduces to the Jaccard coefficient. It is denoted EJ:

EJ(x, y) = (x · y) / (|x|² + |y|² − x · y)

For binary attributes this indeed reduces to the Jaccard coefficient. For example, given:
X = (1, 1, 0, 0, 1), Y = (0, 1, 1, 0, 0)
For two n-dimensional vectors x and y whose attributes are all binary (each attribute takes only the value 0 or 1), define:
M11 = the number of attributes where x is 1 and y is 1.
M10 = the number of attributes where x is 1 and y is 0.
M01 = the number of attributes where x is 0 and y is 1.
M00 = the number of attributes where x is 0 and y is 0.
Matches where both values are 0 are not counted when computing the Jaccard coefficient. Document vectors, for example, are typically sparse, so zeros make up most of the attributes; if both-zero matches were counted, almost any two documents would come out similar simply because they share many zeros, which is unreasonable.
Therefore:

J(x, y) = M11 / (M10 + M01 + M11)
So for the two vectors X and Y given above (M11 = 1, M10 = 2, M01 = 1):
J(X, Y) = 1/(2 + 1 + 1) = 1/4
EJ(X, Y) = 1/(3 + 2 − 1) = 1/4
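A minimal sketch that computes both coefficients for binary vectors and reproduces the values above (the helper names are my own):

def jaccard_binary(x, y):
    # Count the match types; M00 is deliberately ignored
    m11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    m10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    m01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return m11 / (m10 + m01 + m11)

def tanimoto(x, y):
    # EJ(x, y) = (x . y) / (|x|^2 + |y|^2 - x . y)
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

X = (1, 1, 0, 0, 1)
Y = (0, 1, 1, 0, 0)
print(jaccard_binary(X, Y))   # 0.25
print(tanimoto(X, Y))         # 0.25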
(5) Pearson correlation coefficient
It measures the degree of linear correlation between two variables (the variables here can be vectors or datasets). In the computational form used by the code below, the formula is:

r(x, y) = (Σxy − (Σx)(Σy)/n) / sqrt((Σx² − (Σx)²/n) · (Σy² − (Σy)²/n))
def pearson_correlation_coefficient(x, y):
    sum_x = 0
    sum_y = 0
    sum_xsquare = 0
    sum_ysquare = 0
    sum_product = 0
    n = len(x)
    for i in range(n):
        sum_x = sum_x + x[i]
        sum_y = sum_y + y[i]
        sum_xsquare = sum_xsquare + x[i] * x[i]
        sum_ysquare = sum_ysquare + y[i] * y[i]
        sum_product = sum_product + x[i] * y[i]

    num = sum_product - (sum_x * sum_y / n)
    den = pow((sum_xsquare - pow(sum_x, 2) / n) * (sum_ysquare - pow(sum_y, 2) / n), 0.5)

    # Avoid division by zero when either variable has zero variance
    if den == 0:
        return 0
    return num / den

x = (1, 2, 3)
y = (2, 5, 6)
print(pearson_correlation_coefficient(x, y))
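As a cross-check, Python 3.10+ ships statistics.correlation, which computes the same Pearson coefficient:

import statistics

x = (1, 2, 3)
y = (2, 5, 6)
print(statistics.correlation(x, y))   # ~0.9608, matching the function above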