Summary:

1. Common distance algorithms

1.1 Euclidean distance and standardized Euclidean distance

1.2 Mahalanobis distance

1.3 Manhattan distance

1.4 Chebyshev distance

1.5 Minkowski distance

2. Common similarity (coefficient) algorithms

2.1 Cosine similarity and adjusted cosine similarity

2.2 Pearson correlation coefficient

2.3 Jaccard similarity coefficient

2.4 Tanimoto coefficient (generalized Jaccard similarity coefficient)

2.5 Log-likelihood similarity / log-likelihood ratio

2.6 Entropy, Gini coefficient, etc.

Content:

1. Common distance algorithms

1.1 Euclidean distance

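Formula: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$, where $x$ and $y$ are two points with $n$ components each.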
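As a minimal sketch, the Euclidean distance (the square root of the sum of squared component differences) could be computed as follows. The function name `euclidean_distance` and the choice of plain Python sequences as inputs are illustrative assumptions, not something the original text prescribes.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # Sum the squared per-component differences, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```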

The idea behind the standardized Euclidean distance: first standardize each dimension of the data, using standardized value = (original value - component mean) / component standard deviation, and then compute the Euclidean distance on the standardized values.

Standardized Euclidean distance

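Formula: $d(x, y) = \sqrt{\sum_{i=1}^{n} \left( \frac{x_i - y_i}{s_i} \right)^2}$, where $s_i$ is the standard deviation of the $i$-th component.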
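The standardization step can be sketched in a few lines. Because both points have the same component mean subtracted, the means cancel in the difference and only the per-component standard deviation remains. In this sketch the standard deviations are estimated from a sample `data` passed in by the caller, and the population standard deviation (`statistics.pstdev`) is used; both choices are assumptions, since the original does not say how the deviation is estimated.

```python
import math
import statistics

def standardized_euclidean(x, y, data):
    """Euclidean distance after scaling each component difference by that
    component's standard deviation, estimated from the rows of `data`."""
    columns = list(zip(*data))  # one tuple of values per component
    stdevs = [statistics.pstdev(col) for col in columns]  # assumption: population std
    if any(s == 0 for s in stdevs):
        raise ValueError("a component with zero variance cannot be standardized")
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(x, y, stdevs)))

# The second component is on a scale 100 times larger than the first,
# yet both contribute equally after standardization.
data = [[1, 100], [2, 200], [3, 300]]
print(standardized_euclidean(data[0], data[2], data))  # approximately 3.464
```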

1.2 Mahalanobis distance

Formula: $d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$, where $S$ is the covariance matrix of the data.

Relationship: If the covariance matrix is diagonal, the formula reduces to the standardized Euclidean distance; if it is the identity matrix, it reduces to the ordinary Euclidean distance.

Features: It is independent of the measurement scale of each dimension, and it removes the interference caused by correlations between variables.
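The relationship to the other distances can be checked with a small sketch. It assumes the covariance matrix `cov` is supplied by the caller as a list of rows, and it solves the linear system with a hand-written Gauss-Jordan elimination (the helper `solve_linear` is hypothetical, introduced only to keep the example free of third-party libraries).

```python
import math

def solve_linear(matrix, rhs):
    """Solve matrix * z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the caller's data is not modified.
    aug = [[float(v) for v in row] + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def mahalanobis_distance(x, y, cov):
    """sqrt((x - y)^T S^-1 (x - y)) for a covariance matrix S given as `cov`."""
    diff = [a - b for a, b in zip(x, y)]
    z = solve_linear(cov, diff)  # z = S^-1 (x - y), without forming the inverse
    return math.sqrt(sum(d * zi for d, zi in zip(diff, z)))

# Identity covariance: same result as the Euclidean distance.
print(mahalanobis_distance([3, 4], [0, 0], [[1, 0], [0, 1]]))  # 5.0
# Diagonal covariance: same result as the standardized Euclidean distance.
print(mahalanobis_distance([2, 3], [0, 0], [[4, 0], [0, 9]]))  # approximately 1.414
```

Solving the system instead of inverting the matrix is a design choice of this sketch; for real data a numerical library would normally be used.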

1.3 Manhattan distance

Formula: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$

Definition: Imagine driving from one intersection in Manhattan to another. Is the driving distance the straight-line distance between the two points? Obviously not, unless you can drive through buildings. The actual driving distance is the "Manhattan distance", which is where the name comes from. The Manhattan distance is also known as the city block distance.
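The street-grid picture translates directly into code: add up the absolute differences along each axis. The function name `manhattan_distance` and the intersection coordinates are illustrative assumptions.

```python
def manhattan_distance(x, y):
    """Sum of absolute per-component differences (city block distance)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y))

# Three blocks east and four blocks north: 7 blocks of driving,
# although the straight-line distance would be 5.
print(manhattan_distance((0, 0), (3, 4)))  # 7
```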

1.4 Chebyshev distance

Formula: $d(x, y) = \max_{i} |x_i - y_i|$
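A minimal sketch of the Chebyshev distance, which keeps only the largest absolute difference over all components. The function name `chebyshev_distance` is an illustrative assumption.

```python
def chebyshev_distance(x, y):
    """Largest absolute per-component difference between x and y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return max(abs(a - b) for a, b in zip(x, y))

print(chebyshev_distance((0, 0), (3, 4)))  # 4
```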

1.5 Minkowski distance

Definition: $d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$

Relationship: The Minkowski distance is a generalization of the Euclidean distance that unifies several distance formulas. With p=1 it reduces to the Manhattan distance, with p=2 it reduces to the Euclidean distance, and the Chebyshev distance is its limit as p tends to infinity.
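The three special cases can be verified numerically with a short sketch. The function name `minkowski_distance`, the restriction to p >= 1, and the handling of `math.inf` as the Chebyshev case are assumptions of this example.

```python
import math

def minkowski_distance(x, y, p):
    """Minkowski distance of order p (p >= 1); p = math.inf gives the limit case."""
    if p < 1:
        raise ValueError("p must be at least 1")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(diffs)  # limit as p tends to infinity: Chebyshev distance
    return sum(d ** p for d in diffs) ** (1.0 / p)

a, b = (0, 0), (3, 4)
print(minkowski_distance(a, b, 1))         # 7.0, the Manhattan distance
print(minkowski_distance(a, b, 2))         # 5.0, the Euclidean distance
print(minkowski_distance(a, b, math.inf))  # 4, the Chebyshev distance
```

For large finite p the value approaches the Chebyshev distance from above, for example `minkowski_distance(a, b, 50)` is only marginally larger than 4.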

2. Common similarity (coefficient) algorithms

2.1 Cosine similarity

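Formula: $\cos\theta = \frac{x \cdot y}{\|x\| \, \|y\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}$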

Definition: The more similar two vectors are, the smaller the angle between them and the closer the cosine is to 1; a negative value means the two vectors are negatively correlated.

Limitation: Cosine similarity only distinguishes differences in direction between individuals; it cannot measure differences in the magnitude of the values in each dimension. For example, suppose users rate content on a 5-point scale, and users X and Y rate two items (1,2) and (4,5) respectively. The cosine similarity is 0.98, so the two users appear very similar, yet judging by the scores X does not seem to like either item while Y likes both. This insensitivity to magnitude distorts the result, and it needs to be corrected.
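The 0.98 figure can be reproduced with a short sketch. Here (1, 2) is taken as X's ratings and (4, 5) as Y's; with these values the cosine similarity rounds to 0.98 even though the two users clearly disagree about how good the items are. The function name `cosine_similarity` is an illustrative assumption.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)

ratings_x = (1, 2)  # ratings taken for user X in this example
ratings_y = (4, 5)  # ratings of user Y
print(round(cosine_similarity(ratings_x, ratings_y), 2))  # 0.98
```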

Adjusted cosine similarity

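Formula: $sim(i, j) = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_u)(R_{u,j} - \bar{R}_u)}{\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_u)^2} \, \sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_u)^2}}$, where $U$ is the set of users who rated both items $i$ and $j$, $R_{u,i}$ is the rating of the $u$-th user for item $i$, and $\bar{R}_u$ is the average of the $u$-th user's ratings.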
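A minimal sketch of the adjusted cosine similarity between two items, where each user's average rating is subtracted before the cosine is taken. The nested-dictionary layout `{user: {item: rating}}`, the function name `adjusted_cosine`, the sample ratings, and the decision to return 0.0 when a denominator vanishes are all assumptions made for this example.

```python
import math

def adjusted_cosine(ratings, item_i, item_j):
    """Adjusted cosine similarity of two items over users who rated both.

    ratings: {user: {item: rating}}; each user's mean is taken over all
    items that user rated.
    """
    num = den_i = den_j = 0.0
    for user_ratings in ratings.values():
        if item_i in user_ratings and item_j in user_ratings:
            mean_u = sum(user_ratings.values()) / len(user_ratings)
            di = user_ratings[item_i] - mean_u
            dj = user_ratings[item_j] - mean_u
            num += di * dj
            den_i += di * di
            den_j += dj * dj
    if den_i == 0 or den_j == 0:
        return 0.0  # assumption: no usable signal is treated as zero similarity
    return num / math.sqrt(den_i * den_j)

ratings = {
    "u1": {"a": 1, "b": 2, "c": 5},
    "u2": {"a": 4, "b": 5, "c": 1},
}
print(adjusted_cosine(ratings, "a", "b"))  # approximately 0.69
```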

2.2 Pearson correlation coefficient

Definition: The Pearson correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations: $\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$.
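The definition can be sketched directly: compute the covariance and the two standard deviations, then divide. The common 1/n factors cancel, so the sketch works with plain sums. The function name `pearson_correlation` is an illustrative assumption.

```python
import math

def pearson_correlation(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # The 1/n factors of covariance and variances cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)

print(pearson_correlation([1, 2, 3], [2, 4, 6]))  # 1.0
print(pearson_correlation([1, 2, 3], [6, 4, 2]))  # -1.0
```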

2.3 Jaccard similarity coefficient

Formula: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$

Definition: The Jaccard coefficient is mainly used to compute the similarity between individuals described by symbolic or Boolean attributes. Because such attributes are symbols or Boolean values, the size of a difference cannot be measured; one can only tell whether two values are the same. The Jaccard coefficient therefore only considers whether the features of the individuals agree.
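With each individual represented as the set of features it has, the Jaccard coefficient is the size of the intersection divided by the size of the union. The function name `jaccard_similarity`, the sample feature sets, and the convention of returning 1.0 for two empty sets are assumptions of this sketch.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(a & b) / len(union)

likes_x = {"rock", "jazz", "pop"}
likes_y = {"jazz", "pop", "folk"}
print(jaccard_similarity(likes_x, likes_y))  # 0.5
```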

2.4 Tanimoto coefficient (generalized Jaccard similarity coefficient)

Formula: $T(x, y) = \frac{x \cdot y}{\|x\|^2 + \|y\|^2 - x \cdot y}$
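A minimal sketch of the Tanimoto coefficient for numeric vectors, computed as the dot product divided by the sum of the squared norms minus the dot product. For 0/1 vectors this gives the same value as the Jaccard coefficient of the corresponding sets, which is why it is called a generalization. The function name `tanimoto_coefficient` is an illustrative assumption.

```python
def tanimoto_coefficient(x, y):
    """x.y / (|x|^2 + |y|^2 - x.y) for two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    denom = sq_x + sq_y - dot
    if denom == 0:
        raise ValueError("undefined when both vectors are zero")
    return dot / denom

# Binary vectors: 2 shared features out of 4 present overall.
print(tanimoto_coefficient([1, 1, 1, 0], [0, 1, 1, 1]))  # 0.5
# Real-valued vectors are handled by the same formula.
print(tanimoto_coefficient([1, 2], [4, 5]))  # 0.4375
```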

2.5 Log-likelihood similarity / log-likelihood ratio

Log-likelihood similarity

Formula:

Definition: The likelihood ratio for a hypothesis is the ratio of the maximum value of the likelihood function over the subspace represented by the hypothesis to the maximum value of the likelihood function over the entire parameter space.
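One common concrete instance of this definition is the log-likelihood ratio statistic for a 2x2 table of co-occurrence counts, which equals 2 * sum(observed * ln(observed / expected)) under the hypothesis that rows and columns are independent. The sketch below is an illustration under that assumption; the original does not confirm that this is the exact formula it uses. The count names `k11`, `k12`, `k21`, `k22` and the final mapping to a similarity in [0, 1) are likewise assumptions.

```python
import math

def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy: N * ln(N) - sum(k * ln(k))."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def log_likelihood_ratio(k11, k12, k21, k22):
    """Log-likelihood ratio statistic for a 2x2 contingency table of counts."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Rounding can push a true zero slightly negative, so clamp at 0.
    return max(0.0, 2.0 * (row + col - mat))

def llr_to_similarity(llr):
    """One possible mapping of the statistic into [0, 1) (an assumption)."""
    return 1.0 - 1.0 / (1.0 + llr)

print(log_likelihood_ratio(10, 10, 10, 10))  # close to 0: no association
print(log_likelihood_ratio(10, 0, 0, 10))    # approximately 27.73: strong association
```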

Log-likelihood ratio

Formula:

Definition:

2.6 Entropy, Gini coefficient, etc.

Common distance algorithms and similarity (correlation coefficient) calculation methods