Pearson Similarity Calculation Example (R Language)


This post organizes my recent notes on the Pearson similarity calculation used in collaborative filtering recommendation algorithms, and along the way covers some basic usage of the R language and a review of probability and statistics concepts.
I. A review of probability and statistics concepts
1) Expected value (expected value): since each value here is equally likely, the expected value is simply the average of all elements in an array or vector. In R, use the function mean().
2) Variance (variance)
There are two versions: population variance and sample variance. The difference is that the population variance divides by N while the sample variance divides by N-1. The sample variance is the one commonly used in mathematical statistics, and R's var() function also computes the sample variance. The reason is that the sample variance is an unbiased estimator; if you want to get to the bottom of it, look up Bessel's correction.
3) Standard deviation (standard deviation) is simple: it is the square root of the variance. The R function is sd().
4) Covariance (covariance) also comes in population and sample versions, with the same difference as above. The R function is cov(). Note that when a vector contains missing elements (NA), for example a row of a sparse matrix, you need cov(x, y, use='complete'). Variance can also be seen as a special case of covariance, i.e. var(x) = cov(x, x).
The bare formulas can look a bit dizzying on their own; the short R snippet below and the worked example in section IV should make them clear at a glance.
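As a quick illustration of the four functions above, here is a minimal R snippet; the sample vectors x and y are arbitrary values chosen only for illustration:

x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # an arbitrary sample vector
y <- c(1, 3, 2, 5, 4, 6, 8, 7)   # a second vector of the same length

mean(x)       # expected value / average of x: 5
var(x)        # sample variance, divides by n-1: 4.571429
sd(x)         # standard deviation = sqrt(var(x)): 2.13809
cov(x, y)     # sample covariance of x and y: 4.428571
cov(x, x)     # equals var(x): variance is a special case of covariance
cov(c(1, NA, 3), c(2, 4, 6), use = 'complete')   # 4, only complete pairs are used when NAs are present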

II. The position of similarity calculation in the collaborative filtering recommendation algorithm
In collaborative filtering, whether user-based or item-based, the offline model (the training/learning step) is obtained by computing similarities between users or between items. A sorting and weighting step then produces the final top-N list of recommended items. The choice of similarity measure has a large impact on the final recommendation results. The commonly used measures are:
1) Cosine similarity (cosine-based similarity)
2) Correlation similarity (correlation-based similarity): the Pearson correlation coefficient is the algorithm used for this similarity.
3) Adjusted cosine similarity (adjusted cosine-based similarity)
A short R sketch of all three measures follows this list.
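The sketch below is a minimal illustration, not a full recommender. It assumes item_a and item_b hold the ratings three users gave to two items; the vectors, the per-user means, and the names item_a/item_b/user_mean are made-up values used only to show the three formulas.

# Hypothetical ratings that three users gave to two items
item_a <- c(5.0, 3.0, 2.5)      # ratings of item A by users 1..3
item_b <- c(4.0, 3.0, 2.0)      # ratings of item B by the same users
user_mean <- c(4.0, 3.5, 2.0)   # assumed mean rating of each user over all items

# 1) Cosine similarity: cosine of the angle between the raw rating vectors
sum(item_a * item_b) / (sqrt(sum(item_a^2)) * sqrt(sum(item_b^2)))

# 2) Correlation (Pearson) similarity
cor(item_a, item_b)

# 3) Adjusted cosine similarity: subtract each user's mean rating, then take the cosine
a_c <- item_a - user_mean
b_c <- item_b - user_mean
sum(a_c * b_c) / (sqrt(sum(a_c^2)) * sqrt(sum(b_c^2)))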

III. A brief introduction to the R language
On Windows, the R installer can be downloaded from http://cran.r-project.org/bin/windows/base/ ; after downloading, run the EXE to install, then launch the interactive console and you are ready to go.


Common functions can be found on the Web: http://jiaoyan.org/r/?page_id=4100
One thing that takes getting used to is the way R expresses operations. For example, in the console:
> x <- c(1:10)
> x - mean(x)
 [1] -4.5 -3.5 -2.5 -1.5 -0.5  0.5  1.5  2.5  3.5  4.5
x - mean(x) means that every element of the vector x has the mean of x, mean(x), subtracted from it, which is highly abstract and expressive. We can then aggregate the result with other functions:
> sum(x - mean(x))
[1] 0

IV. Pearson similarity (Pearson similarity) calculation example
The following walks through the computation of Pearson similarity step by step, using the user-item rating example from another article.


The original formula for Pearson similarity is cor(x, y) = cov(x, y) / (sd(x) * sd(y)); we will not expand or simplify it further here.
1) Define the user vectors: user1 <- c(5.0, 3.0, 2.5) and user5 <- c(4.0, 3.0, 2.0)
2) Calculate the variances: var(user1) = sum((user1 - mean(user1))^2) / (3 - 1) = 1.75; var(user5) = sum((user5 - mean(user5))^2) / (3 - 1) = 1
3) Calculate the standard deviations: sd(user1) = sqrt(var(user1)) = 1.322876; sd(user5) = sqrt(var(user5)) = 1
4) Calculate the covariance: cov(user1, user5) = sum((user1 - mean(user1)) * (user5 - mean(user5))) / (3 - 1) = 1.25
5) Compute the similarity: cor(user1, user5) = cov(user1, user5) / (sd(user1) * sd(user5)) = 0.9449112
The complete R session below reproduces these numbers.
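A minimal R session collecting the steps above; the variable names user1 and user5 mirror the example:

user1 <- c(5.0, 3.0, 2.5)   # user1's ratings on the three co-rated items
user5 <- c(4.0, 3.0, 2.0)   # user5's ratings on the same items

var(user1)                                    # 1.75  (sample variance, divides by n-1)
var(user5)                                    # 1
sd(user1)                                     # 1.322876
sd(user5)                                     # 1
cov(user1, user5)                             # 1.25
cov(user1, user5) / (sd(user1) * sd(user5))   # 0.9449112
cor(user1, user5)                             # 0.9449112, the same result in one call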

V. Mathematical properties and remaining problems
Points 1) and 2) below are adapted from Wikipedia:
1) Algebraic characteristics
The Pearson correlation coefficient ranges from −1 to 1. A value of 1 means that X and Y are perfectly described by a linear equation: all data points fall on a straight line, and Y increases as X increases. A value of −1 means that all data points fall on a straight line and Y decreases as X increases. A value of 0 means there is no linear relationship between the two variables.
Changes in the location and scale of the two variables do not change the coefficient (up to sign). That is, if we transform X to a + bX and Y to c + dY, where a, b, c and d are constants with b, d > 0, the correlation coefficient of the two variables does not change (this holds for both the population and sample Pearson correlation coefficients). More general linear transformations do change the correlation coefficient. This is easy to verify in R, as shown below.
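A quick check of this invariance, reusing the vectors from section IV; the constants 2, 3, 7 and 0.5 are arbitrary choices for illustration:

user1 <- c(5.0, 3.0, 2.5)
user5 <- c(4.0, 3.0, 2.0)

cor(user1, user5)                      # 0.9449112
cor(2 + 3 * user1, 7 + 0.5 * user5)    # 0.9449112, unchanged by location/scale shifts
cor(user1, -user5)                     # -0.9449112, a negative scale factor only flips the sign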
2) Geometrical meaning
For non-centered data, the correlation coefficient matches the cosine of the angle between the two possible regression lines y = gx(x) and x = gy(y).
For centered data (that is, data shifted by the sample mean so that its mean is zero), the correlation coefficient can also be viewed as the cosine of the angle θ between the two random variable vectors, as the small check below shows.
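A minimal sketch of the centered-data interpretation, again reusing user1 and user5 from section IV:

user1 <- c(5.0, 3.0, 2.5)
user5 <- c(4.0, 3.0, 2.0)

# Center the data: subtract the sample mean so each vector has mean 0
cx <- user1 - mean(user1)
cy <- user5 - mean(user5)

# Cosine of the angle between the centered vectors...
sum(cx * cy) / (sqrt(sum(cx^2)) * sqrt(sum(cy^2)))   # 0.9449112

# ...equals the Pearson correlation of the original vectors
cor(user1, user5)                                    # 0.9449112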


3) Problems in practice
This is why user1 and user4 come out as very similar: user4 rated only item101 and item103, yet the line through those two ratings is close to the trend of user1's ratings. A related issue follows from the invariance above: because location and scale transformations do not change the correlation coefficient, the absolute level of the ratings is ignored and only their trend matters. Of course, this has no effect on a user-item purchase matrix whose entries are only 0 and 1. A small illustration of the first problem follows.
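A minimal illustration of the overlap problem. The original rating table is not reproduced here, so the values below are hypothetical: they assume user1's ratings on item101 and item103 are 5.0 and 2.5, and invent ratings for user4. The point is that with only two co-rated items the Pearson similarity degenerates to ±1:

# assumed ratings of user1 on item101 and item103
u1 <- c(5.0, 2.5)
# hypothetical ratings of user4 on the same two items
u4 <- c(5.0, 3.0)

cor(u1, u4)   # 1: a "perfect" similarity computed from just two overlapping ratings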
