Distance metrics and Python implementations (ii)

7. Included angle cosine (cosine similarity)

Also known as cosine similarity. Geometrically, the cosine of the included angle measures the difference in direction between two vectors; machine learning borrows this idea to measure the difference between sample vectors.
(1) The cosine of the angle between vector A(x1, y1) and vector B(x2, y2) in two-dimensional space:

$\cos\theta = \dfrac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^{2} + y_1^{2}}\,\sqrt{x_2^{2} + y_2^{2}}}$

(2) The cosine of the angle between two n-dimensional sample points A(x11, x12, ..., x1n) and B(x21, x22, ..., x2n):
Similarly, for two n-dimensional sample points A(x11, x12, ..., x1n) and B(x21, x22, ..., x2n), a quantity analogous to the included angle cosine can be used to measure how similar they are to each other. That is:

$\cos\theta = \dfrac{\sum_{k=1}^{n} x_{1k}\, x_{2k}}{\sqrt{\sum_{k=1}^{n} x_{1k}^{2}}\,\sqrt{\sum_{k=1}^{n} x_{2k}^{2}}}$

The cosine value ranges over [-1, 1]. The cosine of the angle between two vectors can be used to characterize their similarity: the smaller the angle (the closer to 0 degrees), the closer the cosine is to 1, and the more closely their directions agree. When the two vectors point in exactly opposite directions, the cosine takes its minimum value of -1. When the cosine is 0, the two vectors are orthogonal and the angle is 90 degrees. Note that cosine similarity is independent of the magnitude of the vectors and depends only on their direction.
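
As a quick sketch of this last point (the example vectors and the cosine helper below are my own, not from the original article), scaling a vector changes its magnitude but leaves the cosine similarity unchanged:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 1.0, 0.5])

def cosine(a, b):
    # cosine of the included angle between vectors a and b
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Multiplying x by 10 changes its length but not its direction,
# so the cosine similarity is the same.
print(np.allclose(cosine(x, y), cosine(10 * x, y)))  # True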

Implementations in Python:

import numpy as np
from scipy.spatial.distance import pdist

x = np.random.random(10)
y = np.random.random(10)

# Method 1: solve directly from the formula
d1 = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Method 2: solve using the SciPy library
X = np.vstack([x, y])
d2 = 1 - pdist(X, 'cosine')

When two vectors are exactly equal, the cosine value is 1, as the following code verifies (d = 1).

d = 1 - pdist(np.vstack([x, x]), 'cosine')

8. Pearson correlation coefficient (Pearson correlation)

(1) Definition of the Pearson correlation coefficient

The cosine similarity discussed above depends only on the direction of the vectors, but it is not invariant to translation: if x is shifted to x + 1 in the cosine formula, the cosine value changes. How can translation invariance be achieved? Use the Pearson correlation coefficient (Pearson correlation), sometimes simply called the correlation coefficient.

If the included angle cosine formula is written as

$\cos(x, y) = \dfrac{\langle x, y \rangle}{\|x\|\,\|y\|}$

which represents the cosine of the angle between vector x and vector y, then the Pearson correlation coefficient can be expressed as

$\rho_{xy} = \cos(x - \bar{x},\, y - \bar{y}) = \dfrac{\langle x - \bar{x},\, y - \bar{y} \rangle}{\|x - \bar{x}\|\,\|y - \bar{y}\|}$

where $\bar{x}$ and $\bar{y}$ are the means of x and y.
The Pearson correlation coefficient is therefore invariant to both translation and scaling, and it measures the correlation between two vectors (or two variables).
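
As an illustrative sketch of the translation-invariance point (the random test vectors and cosine helper are my own), shifting x changes the cosine value but not the Pearson correlation coefficient:

import numpy as np

x = np.random.random(10)
y = np.random.random(10)

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Cosine similarity changes when x is translated to x + 1 ...
print(cosine(x, y), cosine(x + 1, y))

# ... while the Pearson correlation coefficient stays the same.
print(np.corrcoef(x, y)[0][1], np.corrcoef(x + 1, y)[0][1])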

Implementations in Python:

import numpy as np

x = np.random.random(10)
y = np.random.random(10)

# Method 1: solve directly from the formula
x_ = x - np.mean(x)
y_ = y - np.mean(y)
d1 = np.dot(x_, y_) / (np.linalg.norm(x_) * np.linalg.norm(y_))

# Method 2: solve using the NumPy library
X = np.vstack([x, y])
d2 = np.corrcoef(X)[0][1]

The correlation coefficient measures the degree of correlation between random variables x and y, and its range is [-1, 1]. The greater the absolute value of the correlation coefficient, the stronger the correlation between x and y. When x and y are linearly related, the correlation coefficient is 1 (positive linear correlation) or -1 (negative linear correlation).
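
A small sketch of this ±1 property (the linear relationships below are arbitrary examples of mine):

import numpy as np

x = np.arange(10, dtype=float)

# Perfect positive linear relationship: correlation coefficient is 1
y_pos = 2 * x + 3
print(np.corrcoef(x, y_pos)[0][1])   # approximately 1.0

# Perfect negative linear relationship: correlation coefficient is -1
y_neg = -0.5 * x + 7
print(np.corrcoef(x, y_neg)[0][1])   # approximately -1.0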

9. Hamming distance (Hamming distance)
(1) Definition of Hamming distance
The Hamming distance between two equal-length strings s1 and s2 is defined as the minimum number of character substitutions required to change one into the other. For example, the Hamming distance between the strings "1111" and "1001" is 2.
Application: information coding (to enhance fault tolerance, the minimum Hamming distance between code words should be made as large as possible).
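
For the string example above, a minimal sketch (the helper function name is my own) that counts differing positions directly:

def hamming_distance(s1, s2):
    # The Hamming distance is defined only for strings of equal length.
    if len(s1) != len(s2):
        raise ValueError("strings must have equal length")
    # Count the positions at which the corresponding characters differ.
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance("1111", "1001"))  # 2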

Implementations in Python:

import numpy as np
from scipy.spatial.distance import pdist

x = np.random.random(10) > 0.5
y = np.random.random(10) > 0.5
x = np.asarray(x, np.int32)
y = np.asarray(y, np.int32)

# Method 1: solve from the formula (fraction of differing positions)
d1 = np.mean(x != y)

# Method 2: solve using the SciPy library
X = np.vstack([x, y])
d2 = pdist(X, 'hamming')

10. Jaccard similarity coefficient (Jaccard similarity coefficient)
(1) Jaccard similarity coefficient
The proportion of the elements in the intersection of two sets A and B relative to the elements in their union is called the Jaccard similarity coefficient of the two sets, denoted J(A, B):

$J(A, B) = \dfrac{|A \cap B|}{|A \cup B|}$

The Jaccard similarity coefficient is an indicator of how similar the two sets are.
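
As a quick illustration (the two example sets are my own), the coefficient can be computed directly with Python set operations:

# Jaccard similarity coefficient of two sets: |A ∩ B| / |A ∪ B|
A = {1, 2, 3}
B = {2, 3, 4}

jaccard = len(A & B) / len(A | B)   # 2 / 4
print(jaccard)                      # 0.5
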
(2) Jaccard distance
The concept opposite to the Jaccard similarity coefficient is the Jaccard distance. The Jaccard distance can be expressed by the following formula:

$J_{\delta}(A, B) = 1 - J(A, B) = \dfrac{|A \cup B| - |A \cap B|}{|A \cup B|}$
The Jaccard distance measures how distinct two sets are by the proportion of differing elements among all of their elements.
(3) Applications of the Jaccard similarity coefficient and Jaccard distance
The Jaccard similarity coefficient can be used to measure the similarity of samples.
Sample A and sample B are two n-dimensional vectors in which every dimension takes the value 0 or 1, for example A (0111) and B (1011). We treat each sample as a set: 1 means the set contains the corresponding element, and 0 means it does not.
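
For the example above, a small sketch (my own illustration) that treats the positions holding a 1 as set members:

import numpy as np

# A (0111) and B (1011): 1 means the set contains the element at that position
A = np.array([0, 1, 1, 1])
B = np.array([1, 0, 1, 1])

intersection = np.sum((A == 1) & (B == 1))   # elements contained in both sets: 2
union = np.sum((A == 1) | (B == 1))          # elements contained in either set: 4

jaccard_similarity = intersection / union    # 0.5
jaccard_distance = 1 - jaccard_similarity    # 0.5
print(jaccard_similarity, jaccard_distance)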

Implementations in Python:

import numpy as np
from scipy.spatial.distance import pdist

x = np.random.random(10) > 0.5
y = np.random.random(10) > 0.5
x = np.asarray(x, np.int32)
y = np.asarray(y, np.int32)

# Method 1: solve from the formula
up = np.double(np.bitwise_and((x != y), np.bitwise_or(x != 0, y != 0)).sum())
down = np.double(np.bitwise_or(x != 0, y != 0).sum())
d1 = up / down

# Method 2: solve using the SciPy library
X = np.vstack([x, y])
d2 = pdist(X, 'jaccard')

11. Bray-Curtis distance (Bray Curtis Distance)

The Bray-Curtis distance is mainly used in ecology and environmental science to compute distances between coordinates. The distance value lies in the range [0, 1]. It can also be used to measure the differences between samples.

Sample data: x = (11, 0, 7, 8, 0), y = (24, 37, 5, 18, 1)

Calculation (the Bray-Curtis distance is the sum of absolute differences divided by the sum of all values):

$d(x, y) = \dfrac{\sum_{i}|x_i - y_i|}{\sum_{i}(x_i + y_i)} = \dfrac{|11-24| + |0-37| + |7-5| + |8-18| + |0-1|}{(11+0+7+8+0) + (24+37+5+18+1)} = \dfrac{63}{111} \approx 0.568$

Implementations in Python:

import numpy as np
from scipy.spatial.distance import pdist

x = np.array([11, 0, 7, 8, 0])
y = np.array([24, 37, 5, 18, 1])

# Method 1: solve from the formula
up = np.sum(np.abs(y - x))
down = np.sum(x) + np.sum(y)
d1 = up / down

# Method 2: solve using the SciPy library
X = np.vstack([x, y])
d2 = pdist(X, 'braycurtis')

Transferred from: https://www.cnblogs.com/denny402/p/7028832.html
