Derivation of linear regression formula
Given many points scattered in the plane, we can model them with a straight line y = mx + b. The best-fitting regression line is the one with the smallest squared error to the line, SEline: the sum of the squared vertical distances from each point to the line. We need to find the m and b that minimize SEline.
$$
\begin{aligned}
SE_{line} &= (y_1 - (mx_1 + b))^2 + (y_2 - (mx_2 + b))^2 + \dots + (y_n - (mx_n + b))^2 \\
&= y_1^2 - 2y_1(mx_1 + b) + (mx_1 + b)^2 + y_2^2 - 2y_2(mx_2 + b) + (mx_2 + b)^2 + \dots + y_n^2 - 2y_n(mx_n + b) + (mx_n + b)^2 \\
&= y_1^2 - 2y_1 m x_1 - 2y_1 b + m^2 x_1^2 + 2m x_1 b + b^2 + \dots \\
&= (y_1^2 + y_2^2 + \dots + y_n^2) - 2m(x_1 y_1 + x_2 y_2 + \dots + x_n y_n) - 2b(y_1 + y_2 + \dots + y_n) \\
&\quad + m^2(x_1^2 + x_2^2 + \dots + x_n^2) + 2mb(x_1 + x_2 + \dots + x_n) + nb^2
\end{aligned}
$$
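The expansion can be checked numerically. The following is a minimal sketch, assuming a small made-up data set and an arbitrary trial line (the values of `xs`, `ys`, `m`, and `b` are hypothetical and not from the original text). It computes SEline once directly from the definition and once from the expanded sum, and the two should agree.

```python
# Hypothetical sample points and trial line, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 4.0]
m, b = 0.5, 1.0


def se_line_direct(xs, ys, m, b):
    """Sum of squared vertical distances from each point to y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


def se_line_expanded(xs, ys, m, b):
    """The same quantity, written as the expanded sum of six terms."""
    n = len(xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_x = sum(xs)
    return (sum_y2 - 2 * m * sum_xy - 2 * b * sum_y
            + m * m * sum_x2 + 2 * m * b * sum_x + n * b * b)


direct = se_line_direct(xs, ys, m, b)
expanded = se_line_expanded(xs, ys, m, b)
print(direct, expanded)  # both forms give the same squared error
```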
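If we know all the points, that is, all the x and y values, then each choice of m and b gives a different SEline. Plotted over m and b, SEline is a surface in three dimensions, shaped like a bowl. At the bottom of the bowl, where SEline is smallest, the partial derivatives of SEline with respect to m and b are both zero, so m and b can be found by setting both partial derivatives to zero. A partial derivative is the derivative with respect to one independent variable while the others are held constant.
$$
\frac{\partial SE_{line}}{\partial m} = -2(x_1 y_1 + \dots + x_n y_n) + 2m(x_1^2 + \dots + x_n^2) + 2b(x_1 + \dots + x_n) = 0
$$
$$
\frac{\partial SE_{line}}{\partial b} = -2(y_1 + \dots + y_n) + 2m(x_1 + \dots + x_n) + 2nb = 0
$$
Dividing the second equation by 2n gives \bar{y} = m\bar{x} + b, so the point formed by the mean of x and the mean of y lies on the regression line.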
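Setting the partial derivatives of SEline with respect to m and b to zero gives two linear equations in m and b. A minimal sketch of solving them, assuming the standard closed form m = (mean(xy) - mean(x)mean(y)) / (mean(x^2) - mean(x)^2) and b = mean(y) - m mean(x); the function name `fit_line` and the data are hypothetical, for illustration only.

```python
def fit_line(xs, ys):
    """Return (m, b) of the least-squares line y = m*x + b.

    Assumes the x values are not all identical; otherwise the
    denominator is zero and no unique line exists.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_x2 = sum(x * x for x in xs) / n
    m = (mean_xy - mean_x * mean_y) / (mean_x2 - mean_x ** 2)
    b = mean_y - m * mean_x  # from the equation dSE/db = 0
    return m, b


# Hypothetical sample points.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 4.0]
m, b = fit_line(xs, ys)

# The point (mean of x, mean of y) should lie on the fitted line.
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
print(m, b, abs(m * mean_x + b - mean_y) < 1e-9)
```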
Coefficient of determination r²
The line y = mx + b was chosen to minimize SEline. We still need to measure how well the regression line fits the data, that is, how much (what %) of the total variation in y is described by the variation in x (or by the regression line).
The total variation of y is the squared error from the mean of y:
$$
SE_{\bar{y}} = (y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \dots + (y_n - \bar{y})^2
$$
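How much of the total variation is NOT described by the regression line? That is the squared error to the line, with f(x) = mx + b:
$$
\begin{aligned}
SE_{line} &= (y_1 - f(x_1))^2 + (y_2 - f(x_2))^2 + \dots + (y_n - f(x_n))^2 \\
&= (y_1 - (mx_1 + b))^2 + (y_2 - (mx_2 + b))^2 + \dots + (y_n - (mx_n + b))^2
\end{aligned}
$$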
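What % of the total variation is NOT described by the variation in x, or by the regression line? It is the ratio of SEline to the total variation of y (the squared error from the mean of y):
$$
\frac{SE_{line}}{SE_{\bar{y}}}
$$
Here the regression line y = mx + b uses x to describe y.
What % of the total variation is described by the variation in x:
$$
r^2 = 1 - \frac{SE_{line}}{SE_{\bar{y}}}
$$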
r² is the coefficient of determination. The smaller SEline is, the better the data fits the regression line and the closer r² is to 1; the closer SEline gets to the total variation of y, the closer r² is to 0. r² can therefore be used as a measure of how well the regression line fits the data.
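As an illustration, r² can be computed as one minus the ratio of the squared error to the line over the squared error from the mean of y. This is a minimal sketch; the function name `r_squared`, the data, and the line parameters are hypothetical (the line used here is the least-squares line for these four points).

```python
def r_squared(xs, ys, m, b):
    """Fraction of the total variation in y described by the line y = m*x + b."""
    mean_y = sum(ys) / len(ys)
    se_line = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    se_mean = sum((y - mean_y) ** 2 for y in ys)  # total variation of y
    return 1 - se_line / se_mean


# Hypothetical sample points and their least-squares line.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 4.0]
r2_fit = r_squared(xs, ys, 0.8, 1.5)

# A horizontal line through the mean of y describes none of the variation.
r2_flat = r_squared(xs, ys, 0.0, 3.5)
print(r2_fit, r2_flat)
```

With these points the fitted line gives an r² of about 0.64, while the flat line through the mean gives 0, since its squared error equals the total variation of y.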
Covariance
The covariance, Cov(X, Y) = E[(X - E(X))(Y - E(Y))], measures how the deviation of X from its mean, X - E(X), moves together with the deviation of Y from its mean, Y - E(Y): whether Y - E(Y) also rises when X - E(X) rises. It describes the association between the two variables.
Expanding the product:
$$
Cov(X, Y) = E[(X - E(X))(Y - E(Y))] = E[XY - X E(Y) - Y E(X) + E(X) E(Y)]
$$
Because expectation is linear,
$$
Cov(X, Y) = E[XY] - E[X E(Y)] - E[Y E(X)] + E[E(X) E(Y)]
$$
Here E(X) and E(Y) are constants, so they can be taken outside the expectation:
$$
Cov(X, Y) = E[XY] - E(Y) E(X) - E(X) E(Y) + E(X) E(Y) = E[XY] - E(X) E(Y)
$$
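The identity Cov(X, Y) = E[XY] - E(X)E(Y) can be checked on data. The sketch below assumes that expectations are replaced by plain averages over a small made-up sample (dividing by n, not n - 1); the function names and data are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def cov_definition(xs, ys):
    """E[(X - E(X))(Y - E(Y))], with E taken as the sample average."""
    mean_x, mean_y = mean(xs), mean(ys)
    return mean([(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)])


def cov_shortcut(xs, ys):
    """E[XY] - E(X)E(Y), with E taken as the sample average."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)


# Hypothetical sample points.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 4.0]
by_definition = cov_definition(xs, ys)
by_shortcut = cov_shortcut(xs, ys)
print(by_definition, by_shortcut)  # the two forms agree
```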
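When estimating from a sample, the expectations are replaced by sample means:
$$
Cov(X, Y) \approx \overline{xy} - \bar{x}\,\bar{y}
$$
This lets us rewrite the slope of the regression line as
$$
m = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2} = \frac{Cov(X, Y)}{Var(X)}
$$
where Var(X) = E[(X - E(X))^2] = Cov(X, X).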
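A short sketch of the slope written as Cov(X, Y) / Var(X), assuming sample means stand in for expectations and using made-up data. As a sanity check, it also nudges the slope and intercept slightly and confirms that the squared error to the line never drops below the value at the computed line. All names here are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def cov(xs, ys):
    # E[XY] - E(X)E(Y), estimated with sample means.
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)


def se_line(xs, ys, m, b):
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# Hypothetical sample points.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 4.0]

slope = cov(xs, ys) / cov(xs, xs)        # Var(X) = Cov(X, X)
intercept = mean(ys) - slope * mean(xs)  # the mean point lies on the line

best = se_line(xs, ys, slope, intercept)
nudged = [se_line(xs, ys, slope + dm, intercept + db)
          for dm in (-0.1, 0.0, 0.1)
          for db in (-0.1, 0.0, 0.1)]
print(slope, intercept)
print(all(best <= value + 1e-12 for value in nudged))
```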