1. Euclidean distance
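Equation: the Euclidean distance between two n-dimensional vectors A (x11, x12, ..., x1n) and B (x21, x22, ..., x2n) is:
$$d_{12} = \sqrt{\sum_{k=1}^{n} (x_{1k} - x_{2k})^2}$$
It can also be expressed as a vector operation, treating A and B as row vectors:
$$d_{12} = \sqrt{(A - B)(A - B)^{T}}$$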
Application: analyzing differences in the numerical magnitude of each dimension, for example using user behavior indicators to analyze similarities or differences in user value.
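
As an illustration of the Euclidean distance described in this section, here is a minimal Python sketch. The function name `euclidean_distance` and the sample vectors are assumptions made for this example, not part of the original text.

```python
import math

def euclidean_distance(a, b):
    """Return the Euclidean distance between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    # Sum the squared per-dimension differences, then take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Illustrative vectors: the differences are (3, 4), so the distance is 5.
print(euclidean_distance([1, 2], [4, 6]))  # 5.0
```

Because every dimension contributes its raw difference, a dimension with a larger numeric range dominates the result unless the data is scaled first.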
2. Cosine distance
Equation: for two n-dimensional sample points A (x11, x12, ..., x1n) and B (x21, x22, ..., x2n), the cosine of the angle between them can be used to measure how similar they are to each other. That is:
$$\cos\theta = \frac{\sum_{k=1}^{n} x_{1k} x_{2k}}{\sqrt{\sum_{k=1}^{n} x_{1k}^2}\,\sqrt{\sum_{k=1}^{n} x_{2k}^2}}$$
Application: distinguishing differences in direction, for example using users' content ratings to distinguish similarities and differences in interest. It also corrects for the possibility that users do not share a common rating scale, because cosine distance is insensitive to absolute values.
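
The angle cosine described in this section can be sketched in Python as follows. The function name `cosine_similarity` and the sample ratings are assumptions chosen for illustration. Raising an error for a zero vector is also a choice made here, since the angle is undefined in that case.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Two hypothetical users with the same preference direction but different scales.
print(round(cosine_similarity([1, 2], [2, 4]), 6))  # 1.0
# Orthogonal vectors share no direction at all.
print(cosine_similarity([1, 0], [0, 1]))  # 0.0
```

The first call shows the insensitivity to absolute values: doubling every rating leaves the direction, and so the result, unchanged.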
3. Jaccard Similarity measurement
Definition (1): Jaccard similarity coefficient
The proportion of the intersection of two sets A and B within their union is called the Jaccard similarity coefficient of the two sets, denoted by the symbol J(A, B):
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
Jaccard similarity coefficient is an indicator of the similarity of two sets.
Definition (2): Jaccard distance
The concept opposite to the Jaccard similarity coefficient is the Jaccard distance. The Jaccard distance can be expressed by the following formula:
$$J_{\delta}(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$$
The Jaccard distance measures how distinct two sets are by the proportion of differing elements among all elements of the two sets.
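
A minimal Python sketch of the Jaccard similarity coefficient and the Jaccard distance on sets follows. The function names and the sample sets are assumptions for illustration. Treating two empty sets as an error is a choice made here, because the ratio is 0/0 in that case.

```python
def jaccard_similarity(a, b):
    """Return |A ∩ B| / |A ∪ B| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        raise ValueError("Jaccard similarity is undefined for two empty sets")
    return len(a & b) / len(union)

def jaccard_distance(a, b):
    """Return the Jaccard distance, the complement of the similarity coefficient."""
    return 1.0 - jaccard_similarity(a, b)

# Illustrative sets: 2 shared elements out of 4 distinct elements overall.
print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))  # 0.5
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))    # 0.5
```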
Definition (3): Application of Jaccard similarity coefficient and Jaccard distance
The Jaccard similarity coefficient can be used to measure the similarity of samples.
Sample A and sample B are two n-dimensional vectors whose values in every dimension are 0 or 1, for example A (0111) and B (1011). We treat each sample as a set: 1 means that the set contains the element, and 0 means that it does not.
p: the number of dimensions where both sample A and sample B are 1
q: the number of dimensions where sample A is 1 and sample B is 0
r: the number of dimensions where sample A is 0 and sample B is 1
s: the number of dimensions where both sample A and sample B are 0
Then the Jaccard similarity coefficient of samples A and B can be expressed as:
$$J = \frac{p}{p + q + r}$$
Here p+q+r can be understood as the number of elements in the union of A and B, and p is the number of elements in the intersection of A and B.
The Jaccard distance between samples A and B is expressed as:
$$J_{\delta} = \frac{q + r}{p + q + r}$$
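
The counting scheme above can be sketched in Python, using the example samples A (0111) and B (1011) from the text. The function name `binary_jaccard` is an assumption for illustration. The count of dimensions where both samples are 0 is not computed because it does not enter either formula.

```python
def binary_jaccard(a, b):
    """Return (similarity, distance) for two equal-length 0/1 vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    pairs = list(zip(a, b))
    p = sum(1 for x, y in pairs if x == 1 and y == 1)  # both are 1
    q = sum(1 for x, y in pairs if x == 1 and y == 0)  # only A is 1
    r = sum(1 for x, y in pairs if x == 0 and y == 1)  # only B is 1
    if p + q + r == 0:
        raise ValueError("undefined when neither sample contains any element")
    return p / (p + q + r), (q + r) / (p + q + r)

# A = 0111 and B = 1011 give p = 2, q = 1, r = 1.
similarity, distance = binary_jaccard([0, 1, 1, 1], [1, 0, 1, 1])
print(similarity, distance)  # 0.5 0.5
```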
4. Adjusted cosine similarity
Cosine similarity mainly distinguishes differences in direction and is insensitive to absolute values, so it cannot measure differences in the values on each dimension. This can lead to a situation like the following:
Users rate content on a 5-point scale, and two users X and Y rate two items (1, 2) and (4, 5) respectively. Cosine similarity gives 0.98, which suggests the two users are very similar. Judging by the scores, however, X does not seem to like the two items, while Y likes them considerably more. The insensitivity of cosine similarity to magnitude causes this error. Adjusted cosine similarity corrects it by subtracting a mean from the values in every dimension. For example, the mean of the X and Y scores is 3, so the adjusted vectors are (-2, -1) and (1, 2). Computing the cosine similarity of these gives -0.8: the similarity is negative and the difference is substantial, which is clearly closer to reality.
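
The adjustment can be checked with a short Python sketch. It assumes X scored (1, 2) and Y scored (4, 5), the values consistent with the adjusted vector (-2, -1), the mean of 3, and the 0.98 result. The function names are assumptions for illustration. The mean is passed in explicitly, following the example above. Other formulations subtract a separate mean for each user or item.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length numeric sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def adjusted_cosine_similarity(a, b, mean):
    """Subtract the given mean from every dimension, then take the cosine."""
    return cosine_similarity([x - mean for x in a], [y - mean for y in b])

x_scores = [1, 2]  # assumed scores for user X
y_scores = [4, 5]  # scores for user Y
print(round(cosine_similarity(x_scores, y_scores), 2))                   # 0.98
print(round(adjusted_cosine_similarity(x_scores, y_scores, mean=3), 2))  # -0.8
```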
Various other distances, see: http://blog.csdn.net/shiwei408/article/details/7602324