A study of the Pearson product-moment correlation coefficient
The Pearson correlation coefficient is often used in similarity calculations. So how should the coefficient be understood? What is its mathematical nature and meaning?
There are two ways to understand the Pearson correlation coefficient.
First, following the high school textbook treatment: convert both data sets to z-scores, sum the products of the paired z-scores, and divide by the number of samples.
A z-score expresses how far a value lies from the center of its data, and is most often used with normally distributed data. It equals the value minus the mean, divided by the standard deviation. The standard deviation is the square root of the sum of squared deviations from the mean divided by the number of samples. Substituting these definitions, the formula can be refined to:
$$r = \frac{\sum x_i y_i - \frac{\sum x_i \sum y_i}{n}}{\sqrt{\left(\sum x_i^2 - \frac{(\sum x_i)^2}{n}\right)\left(\sum y_i^2 - \frac{(\sum y_i)^2}{n}\right)}}$$
The following is a Python implementation:
from math import sqrt

# Return the Pearson correlation coefficient of p1 and p2
def sim_pearson(prefs, p1, p2):
    # Collect the items that both have rated
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]:
            si[item] = 1

    # Number of shared items
    n = len(si)

    # With nothing in common there is no evidence of similarity, so return 0
    if not n:
        return 0

    # Sum each person's preferences
    sum1 = sum([prefs[p1][it] for it in si])
    sum2 = sum([prefs[p2][it] for it in si])

    # Sum of squares
    sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
    sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])

    # Sum of products
    pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])

    # Pearson score
    num = pSum - (sum1 * sum2 / n)
    den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))
    if not den:
        return 0

    r = num / den
    return r
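The z-score description of the first view can also be written out step by step. The following is only a minimal sketch: the function name `pearson_from_z_scores` and its two-list signature are assumptions made for illustration and are not part of the original code. It uses the population standard deviation (dividing by n), as in the text.

```python
from math import sqrt

def pearson_from_z_scores(xs, ys):
    """Pearson r as the mean of the products of paired z-scores.

    Hypothetical helper for illustration; takes two equal-length lists.
    """
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("need two non-empty sequences of equal length")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Population standard deviation: divide by n, not n - 1.
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)

    # A constant series has no defined correlation; return 0.0 by convention.
    if std_x == 0 or std_y == 0:
        return 0.0

    # Convert every value to its z-score.
    zx = [(x - mean_x) / std_x for x in xs]
    zy = [(y - mean_y) / std_y for y in ys]

    # Sum the products of paired z-scores and divide by the sample count.
    return sum(a * b for a, b in zip(zx, zy)) / n
```

For example, `pearson_from_z_scores([1, 2, 3], [1, 3, 2])` gives a value of about 0.5, which is what the sum-based formula yields for the same three pairs.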
Second, at the level of university linear algebra, the coefficient has a more involved reading: it can be seen as the cosine of the angle between the two data sets treated as vectors, once each has been centered on its mean.
For data that has not been centered, the coefficient is instead tied to the angle between the two possible regression lines y = g_X(x) and x = g_Y(y).
1. An ordered list of n numbers $(x_1, x_2, x_3, \dots, x_n)$ is called an n-dimensional vector and is abbreviated with an uppercase X.
$|X| = \sqrt{x_1^2 + x_2^2 + x_3^2 + \dots + x_n^2}$ is defined as the modulus of vector X, and the inner product of vectors X and Y is $X \cdot Y = x_1 y_1 + x_2 y_2 + \dots + x_n y_n$.
2. The cosine of the angle between vectors X and Y is calculated with the following formula:
$$\cos\theta = \frac{X \cdot Y}{|X| \, |Y|}$$
3. The closer the cosine of the angle is to 1, the more similar the two vectors are.
The following is a Python implementation:
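import math
import numpy

# Despite its name, this returns the cosine of the angle between u and v
# (a similarity, where 1 means the same direction), not a distance.
def cosine_distance(u, v):
    return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))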
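The link between the two views can be checked numerically: the cosine of the raw vectors is generally not the Pearson coefficient, but the cosine of the mean-centered vectors is. The following is a small sketch in plain Python. The helper names `cosine` and `centered` and the sample vectors are assumptions chosen for illustration.

```python
from math import sqrt

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def centered(u):
    """Shift a vector so that its mean is zero."""
    mean = sum(u) / len(u)
    return [a - mean for a in u]

# Illustrative data only.
x = [1.0, 2.0, 3.0]
y = [1.0, 3.0, 2.0]

raw_cosine = cosine(x, y)                     # angle between the raw vectors
pearson_r = cosine(centered(x), centered(y))  # angle after centering on the mean
print(raw_cosine, pearson_r)
```

With these sample vectors the raw cosine is about 0.93, while the centered cosine is about 0.5. The second value is the Pearson correlation coefficient of x and y.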
From the explanations above, the assumptions behind the Pearson correlation can also be understood:
The relationship between the two variables is linear
Both variables are continuous
Each variable is normally distributed, and their joint (bivariate) distribution is also normal
The paired observations are independent of one another
In practice, statistical output generally contains only two values. One is the correlation coefficient itself, which lies between -1 and 1. The other is a test coefficient from an independent-samples test, used to check the consistency of the sample.
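The text does not name the test behind the second value. Purely as an illustration, and assuming it is the usual t-test of the null hypothesis that the true correlation is zero, the test statistic can be derived from r and the number of paired samples n. The function name `pearson_t_statistic` is hypothetical.

```python
from math import sqrt

def pearson_t_statistic(r, n):
    """t statistic for the null hypothesis of zero correlation.

    Assumption: the test meant in the text is the standard t-test with
    n - 2 degrees of freedom; the original does not say which test is used.
    """
    if n < 3:
        raise ValueError("need at least 3 paired observations")
    # A perfect correlation sends the statistic to infinity.
    if abs(r) >= 1.0:
        return float("inf") if r > 0 else float("-inf")
    return r * sqrt((n - 2) / (1.0 - r * r))
```

For instance, r = 0.6 measured on 27 pairs gives a statistic of about 3.75. The statistic is then compared against a t distribution with n - 2 degrees of freedom to obtain a significance level.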