This post sorts out the Pearson similarity calculation used in the collaborative filtering recommendation algorithm, and along the way covers basic usage of the R language and reviews some probability and statistics.
I. Review of probability theory and statistics concepts
1) Expected value (expected value): when every number occurs with equal probability, the expected value is simply the average of all the elements in an array or vector. In R it is computed with the function mean().
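A quick console check, using an arbitrary example vector:
> x <- c(1, 2, 3, 4)
> mean(x)   # (1 + 2 + 3 + 4) / 4
[1] 2.5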
2) Variance (Variance)
Variance is the mean of the squared deviations and comes in two versions: population variance and sample variance. The difference is that the population variance divides by N while the sample variance divides by N-1. Mathematical statistics usually works with the sample variance, and R's var() function also returns the sample variance. The reason is that the sample variance is an unbiased estimator; if you want to get to the root of it, Google "unbiased estimator".
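A minimal console sketch of the two definitions, again with an arbitrary example vector:
> x <- c(5.0, 3.0, 2.5)
> var(x)                             # sample variance, divides by N-1
[1] 1.75
> sum((x - mean(x))^2) / length(x)   # population variance, divides by N
[1] 1.166667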
3) Standard deviation (standard deviation) is simple: it is the square root of the variance. The R function is sd().
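Continuing with the vector x from the sketch above:
> sd(x)
[1] 1.322876
> sqrt(var(x))
[1] 1.322876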
4) Covariance (covariance) also comes in a population and a sample version, with the same difference as above. The R function is cov(). Note that when a vector contains missing elements (NA), as with a row of a sparse rating matrix, you need cov(x, y, use = 'complete'). Variance can be seen as a special case of covariance, i.e. var(x) = cov(x, x).
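A minimal sketch with arbitrary vectors; the NA shows why use = 'complete' is needed:
> x <- c(5.0, 3.0, 2.5)
> y <- c(4.0, 3.0, 2.0)
> cov(x, y)
[1] 1.25
> cov(x, x)                        # equals var(x)
[1] 1.75
> y_na <- c(4.0, NA, 2.0)
> cov(x, y_na, use = "complete")   # drops the pairs that contain NA
[1] 2.5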
Only the calculation formulas are listed here, and on their own they can look a bit dizzying; the concrete example below should make them clear at a glance.
II. The role of similarity calculation in the collaborative filtering recommendation algorithm
In collaborative filtering, whether user-based or item-based, the offline model (the training/learning step) is built by computing similarities between users or between items; a sorting and weighting step then produces the final top-N list of recommended items. The choice of similarity measure has a large impact on the final recommendation results. Commonly used measures include the following (a small R sketch of the three follows the list):
1) Cosine similarity (cosine-based similarity)
2) Correlation similarity (correlation-based similarity), which is computed with the Pearson correlation coefficient.
3) Adjusted cosine similarity (adjusted cosine-based similarity)
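A minimal sketch of the three measures; the helper functions, the example item ratings, and the user means below are my own illustration under the usual definitions, not code from the original article:

cosine_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

pearson_sim <- function(x, y) cor(x, y)   # Pearson correlation coefficient

# Adjusted cosine (item-based): subtract each user's mean rating before taking the cosine.
# x and y are the ratings of two items by the same users; user_means are those users' average ratings.
adjusted_cosine_sim <- function(x, y, user_means) {
  xc <- x - user_means
  yc <- y - user_means
  sum(xc * yc) / (sqrt(sum(xc^2)) * sqrt(sum(yc^2)))
}

item_a <- c(5, 3, 4)
item_b <- c(4, 3, 5)
user_means <- c(4, 2.5, 4.5)
cosine_sim(item_a, item_b)                        # approx. 0.98
pearson_sim(item_a, item_b)                       # 0.5
adjusted_cosine_sim(item_a, item_b, user_means)   # 0

The three values can differ considerably, which is exactly why the choice of measure matters.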
III. A brief introduction to the R language
Under Windows, the R installation package can be downloaded from http://cran.r-project.org/bin/windows/base/. After downloading, run the EXE installer directly; once installed, the interactive console is ready to use.
Common functions can be found on the Web: http://jiaoyan.org/r/?page_id=4100
One thing to get used to is the way expressions are written in R. For example, in the console:
> x <- c(1:10)
> x - mean(x)
[1] -4.5 -3.5 -2.5 -1.5 -0.5  0.5  1.5  2.5  3.5  4.5
x - mean(x) means that the average mean(x) is subtracted from every element of the vector x, which is remarkably abstract and expressive. We can then feed the result into other functions for aggregation:
> sum(x - mean(x))
[1] 0
IV. Pearson similarity calculation example
The following walks through the computation of Pearson similarity using the user-item rating example from another article.
The original formula for Pearson similarity is r(X, Y) = cov(X, Y) / (sd(X) * sd(Y)) = sum((x_i - mean(X)) * (y_i - mean(Y))) / (sqrt(sum((x_i - mean(X))^2)) * sqrt(sum((y_i - mean(Y))^2))); we will not expand or simplify it further here.
1) Define the user vectors: user1 <- c(5.0, 3.0, 2.5) and user5 <- c(4.0, 3.0, 2.0)
2) Calculate the variances: var(user1) = sum((user1 - mean(user1))^2) / (3 - 1) = 1.75 and var(user5) = sum((user5 - mean(user5))^2) / (3 - 1) = 1
3) Calculate the standard deviations: sd(user1) = sqrt(var(user1)) = 1.322876 and sd(user5) = sqrt(var(user5)) = 1
4) Calculate the covariance: cov(user1, user5) = sum((user1 - mean(user1)) * (user5 - mean(user5))) / (3 - 1) = 1.25
5) Compute the similarity: cor(user1, user5) = cov(user1, user5) / (sd(user1) * sd(user5)) = 0.9449112
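For completeness, the whole calculation can be reproduced directly in the R console; the outputs below are what R prints for these two vectors:
> user1 <- c(5.0, 3.0, 2.5)
> user5 <- c(4.0, 3.0, 2.0)
> var(user1)
[1] 1.75
> var(user5)
[1] 1
> sd(user1)
[1] 1.322876
> cov(user1, user5)
[1] 1.25
> cor(user1, user5)
[1] 0.9449112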
V. Mathematical properties and existing problems
Points 1) and 2) below are adapted from Wikipedia:
1) Algebraic characteristics
The Pearson correlation coefficient ranges from -1 to 1. A value of 1 means that X and Y are perfectly described by a linear equation: all data points fall exactly on a straight line, and Y increases as X increases. A value of -1 means that all data points fall on a straight line and Y decreases as X increases. A value of 0 means there is no linear relationship between the two variables.
Changing the location and scale of the two variables does not change the coefficient, i.e. it is invariant under such transformations (up to sign, which is determined by the signs of the scale factors). That is, if we transform X to a + bX and Y to c + dY, where a, b, c and d are constants with b and d positive, the correlation coefficient of the two variables is unchanged; this holds for both the population and the sample Pearson correlation coefficient. More general linear transformations, however, do change the correlation coefficient.
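A quick console check of this invariance, with arbitrary example vectors:
> x <- c(1, 2, 3, 4, 5)
> y <- c(2, 1, 4, 3, 5)
> cor(x, y)
[1] 0.8
> cor(10 + 2 * x, -3 + 0.5 * y)   # positive location/scale change leaves it unchanged
[1] 0.8
> cor(x, -y)                      # a negative slope only flips the sign
[1] -0.8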
2) Geometrical meaning
For uncentered data, the correlation coefficient is the same as the cosine of the angle between the two possible regression lines y = gx(x) and x = gy(y).
For centered data (that is, data shifted by its sample mean so that the mean is 0), the correlation coefficient can also be viewed as the cosine of the angle θ between the two random variable vectors.
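A minimal check that, after centering, the correlation equals the cosine of the angle between the two vectors (arbitrary example vectors):
> x <- c(5.0, 3.0, 2.5)
> y <- c(4.0, 3.0, 2.0)
> xc <- x - mean(x); yc <- y - mean(y)
> sum(xc * yc) / (sqrt(sum(xc^2)) * sqrt(sum(yc^2)))   # cosine of the angle
[1] 0.9449112
> cor(x, y)
[1] 0.9449112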
3) Existing problems
This is why User1 and User4 come out as highly similar: even though User4 rated only Item101 and Item103, the line formed by those two ratings follows almost the same straight-line trend as User1's ratings. Another problem follows from the fact that certain geometric transformations do not affect the correlation coefficient: the absolute level of the ratings is ignored and only the rating trend matters, as the sketch below shows. This, of course, makes no difference for a user-item purchase matrix, whose entries are only 0 and 1.
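A small sketch of this effect with two hypothetical users (not taken from the original article's table): they rate every item two points apart but follow the same trend, and the Pearson similarity still comes out as a perfect 1:
> strict_user  <- c(1, 2, 3)    # rates everything low
> lenient_user <- c(3, 4, 5)    # same trend, two points higher
> cor(strict_user, lenient_user)
[1] 1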