Lesson 2
- Overview
- Regression describes the relationship between variables
- Correlation coefficient
- RSS
- Linear regression via R language
- Multivariate linear model
- Dummy variables
- Multivariate linear regression model
- Regression diagnosis
- Generalized linear model
- Simple and multiple linear regression involve a lot of statistics and plenty of statistical terminology
- The statistical fundamentals underpinning big data
- Logistic regression, which belongs to the generalized linear model family
- Variable selection: picking the useful variables out of many; dimensionality reduction
1. Overview
Fitting: generally choose a straight line or a low-degree curve. (Measurements contain error; a curve that passes through every point is called overfitting: although it is very accurate on the training data, its predictions may be worse.)
Train on a learning set, then predict. Example regression model: W = a + bH (predicting weight W from height H).
- Linear regression:
- Simple linear regression (one independent variable);
- Multiple linear regression (several independent variables, a first-degree equation; geometrically a plane, or a hyperplane in high-dimensional space)
- Nonlinear regression: quadratic, logistic, and so on. A nonlinear model that can be expressed in linear form is called generalized linear (e.g. logistic)
- Difficulties: variable selection (in the multivariate case) and dimensionality reduction are the hard parts of regression modeling. The laws of the world are mostly simple things;
multicollinearity (some variables just tag along: how to detect them, and how to remove them).
Testing whether the model is reasonable requires some diagnostic tools.
2. Regression describes the relationship between variables
- The relationship between the independent variable and the dependent variable
- Function relationship: deterministic, e.g. Y = a + bX (a is the intercept, b the slope)
- Correlation relationship: non-deterministic
3. Correlation coefficient
The correlation coefficient measures the strength of linear correlation and decides whether a linear regression model is appropriate:

$$r=\frac{\sum_i (x_i-\bar x)(y_i-\bar y)}{\sqrt{\sum_i (x_i-\bar x)^2\sum_i (y_i-\bar y)^2}}$$

Several concepts in the formula:
- The subscript i indexes the samples.
- $\bar x$ ("x bar") denotes the average.
- $\Sigma$ (sigma) denotes summation; written without limits, it sums over all samples.
- By the Cauchy-Schwarz inequality, |r| ≤ 1. The closer |r| is to 1, the more suitable a linear regression model is.
- A positive coefficient means y increases with x; a negative one means y decreases as x increases.
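As a quick check in R (a minimal sketch, using the same data as section 5 below), the built-in cor() computes this coefficient directly:

```r
# Heights (x) and the response values (y) from section 5
x <- c(170, 168, 175, 153, 185, 135, 172)
y <- c(61, 57, 58, 40, 90, 35, 68)
cor(x, y)  # close to 1 => linear regression is appropriate
```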
4. RSS
- Which regression line fits best? The intuitive approach: minimize the distances from the points to the line, so that the total over all points is smallest.
- The trouble is that perpendicular distance involves a square root, which makes the extremum hard to derive. So it is replaced by the vertical distance (parallel to the Y axis), called the residual.
- Absolute values are awkward to optimize, so squares are used instead.
- RSS (residual sum of squares): the sum of squared residuals/errors, measuring the difference between the predicted values and the true values.
- Least squares: minimize RSS, a quadratic function, using standard extremum methods.
- How to find the extremum: take partial derivatives. There are two unknowns (a and b), so take two partial derivatives, set them to zero, and solve the resulting system of two equations, as written out below.
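Written out (standard least-squares algebra, not specific to these notes):

$$\mathrm{RSS}(a,b)=\sum_i \left(y_i - a - b x_i\right)^2$$

Setting $\partial \mathrm{RSS}/\partial a = 0$ and $\partial \mathrm{RSS}/\partial b = 0$ and solving the two equations gives

$$b=\frac{\sum_i (x_i-\bar x)(y_i-\bar y)}{\sum_i (x_i-\bar x)^2},\qquad a=\bar y - b\,\bar x.$$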
5. Linear regression via the R language
- y = c(61, 57, 58, 40, 90, 35, 72)
- x = c(170, 168, 175, 153, 185, 135, 172)
- plot(x, y)  # draw the scatter plot
- z = lm(y ~ x + 1)  # lm fits y = a + bx; the trailing "+ 1" (intercept term) may be omitted
- z = lm(y ~ x - 1)  # force the line through the origin, i.e. no intercept
- plot(y ~ x + 1)
- summary(z)  # solve the model; the meaning of each field in the summary:
- (Intercept): the intercept
- Residuals: the residuals
- Residual standard error: the standard error of the residuals
- Multiple R-squared: the squared correlation coefficient; the higher, the stronger the correlation and the better the fit
- Adjusted R-squared: goodness of fit adjusted for the number of variables
- t value: the test statistic of the hypothesis test on each coefficient
- Pr(>|t|): the tail area beyond |t|; the smaller, the better
- F-statistic: the F statistic
- p-value: the overall hypothesis test for the model; if it is not significant, the regression model is invalid
- plot(z)  # diagnostic plots; several graphs are produced, press Enter to step through them
- deviance(z)  # residual sum of squares
- residuals(z)  # compute the residuals
- print(z)  # print the model information
- anova(z)  # analysis of variance table, for example:
```
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value   Pr(>F)
x          1 197.633 197.633  47.943 0.006176 **
Residuals  3  12.367   4.122
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
Prediction: x is the independent variable, m holds the new value, z is the fitted model.
m = data.frame(x = 185)
predict(z, m)
6. Multivariate linear model
Type swiss in R to view the built-in swiss dataset.
swiss.lm = lm(Fertility ~ ., data = swiss)
summary(swiss.lm)
With many observations, the residuals are summarized by their quartiles: Min, 1Q, Median, 3Q, Max.
7. Dummy variables
A categorical variable such as sex is represented by two dummy variables, isMan and isWoman (see the R sketch after this list):
- Additive model: the dummy variable adjusts the intercept, e.g. W = a + bH + c*isMan
- Multiplicative model: the dummy variable adjusts the slope, e.g. W = a + bH + c*isMan*H
- Mixed model: affects both the intercept and the slope, e.g. W = a + bH + c*isMan + d*isWoman + e*isMan*H + f*isWoman*H + g
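A minimal sketch of fitting these models in R; the data and variable names (h, sex, w) are hypothetical. lm() builds the dummy variable automatically when given a factor:

```r
# Hypothetical data: height h (cm), sex, weight w (kg)
h   <- c(170, 168, 175, 153, 185, 160)
sex <- factor(c("man", "woman", "man", "woman", "man", "woman"))
w   <- c(68, 55, 72, 48, 80, 52)

# Additive model: the factor shifts the intercept (W = a + bH + c*isWoman)
fit_add <- lm(w ~ h + sex)

# Mixed model: h * sex also lets the slope differ between the sexes
fit_mix <- lm(w ~ h * sex)

summary(fit_add)
```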
8. Multivariate linear regression model
Model correction: see R-modeling, page 324.
- lm.new <- update(lm.sol, . ~ . + I(x2^2))  # I(x2^2) adds the square term of x2
- lm2.new <- update(lm.new, . ~ . - x2)  # remove the linear term of x2
- lm3.new <- update(lm.new, . ~ . + x1*x2)  # add x1, x2 and their interaction
Such corrections rely on the analyst's experience and eyeballing. Statistics also offers a mechanized, well-supported variable-selection method: stepwise regression. There are several approaches:
- Forward selection: start from a simple regression and add variables one at a time
- Backward elimination: start with all variables and remove them one at a time
- Stepwise selection: a combination of the two
Assessment criteria:
- RSS (residual sum of squares), corresponding to "Residual standard error" in the summary output
- R^2 (squared correlation coefficient), corresponding to "Multiple R-squared" in the summary output
- AIC (Akaike information criterion)
s = lm(Fertility ~ ., data = swiss)
s1 = step(s, direction = "forward")  # no variables left to add
s1 = step(s, direction = "backward")
s1 = step(s, direction = "both")
Manual stepwise regression: see R-modeling, page 334.
9. Regression Diagnostics
- Does the sample follow a normal distribution?
- Normality test: the shapiro.test() function, e.g. shapiro.test(X$X1)
- Checking the shape of the distribution for normality
- Does the learning set contain outliers? How do we find outliers?
- Is a linear model reasonable? The true relationship may be more complicated.
- Do the errors satisfy independence and equal variance (the error should be unrelated to the size of Y)?
- If the sample is normally distributed, the residuals from residuals() are also normally distributed
- Multicollinearity (the independent variables are not mutually independent)
- Multicollinearity makes inverting the matrix numerically unstable, so the results become very uncertain.
- The kappa value (the Greek letter κ): multiply the sample data matrix by its transpose, take the square roots of the eigenvalues, and divide the largest by the smallest (see the sketch after this list).
- κ < 100 indicates mild collinearity; 100 < κ < 1000 indicates strong multicollinearity; κ > 1000 indicates severe multicollinearity.
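A minimal sketch of these checks in R; the matrix X is hypothetical. The condition number is computed both by the eigenvalue recipe above and with R's built-in kappa():

```r
# Hypothetical design matrix with two nearly collinear columns
x1 <- c(1, 2, 3, 4, 5, 6)
x2 <- 2 * x1 + rnorm(6, sd = 0.01)  # x2 is almost a multiple of x1
X  <- cbind(x1, x2)

# Normality test (here applied to x1 for illustration)
shapiro.test(x1)

# Condition number: square roots of the eigenvalues of X'X, max over min
ev <- eigen(t(X) %*% X)$values
sqrt(max(ev) / min(ev))   # far above 1000 => severe multicollinearity

kappa(X, exact = TRUE)    # R's built-in equivalent
```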
10. Generalized linear model
- Nonlinear models
- The S-curve, famous in statistics, is called the logistic curve (written out below)
- glm() fits generalized linear models (Fitting Generalized Linear Models)
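The logistic curve in its standard form (general background, not from these notes):

$$P(Y=1\mid x)=\frac{1}{1+e^{-(a+bx)}}$$

Equivalently $\log\frac{p}{1-p}=a+bx$: the log-odds are linear in $x$, which is what makes logistic regression a *generalized* linear model.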
Here is the Norell experiment:
norell <- data.frame(x = 0:5, n = rep(70, 6), success = c(0, 9, 21, 47, 60, 63))
norell$Ymat <- cbind(norell$success, norell$n - norell$success)
glm.sol <- glm(Ymat ~ x, family = binomial, data = norell)
summary(glm.sol)
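To predict with the fitted model (a sketch; the probe value x = 3.5 is arbitrary), type = "response" makes predict() return the success probability directly:

```r
predict(glm.sol, data.frame(x = 3.5), type = "response")
```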
Methods to convert these nonlinear models to linear ones (a worked example follows the list):
- Logarithmic: Y = a + b*log(x); fit with lm.log = lm(y ~ log(x))
- Exponential: Y = a*e^(bx); fit with lm.exp = lm(log(y) ~ x)
- Power function: Y = a*x^b; fit with lm.pow = lm(log(y) ~ log(x))
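For example, the exponential case linearizes by taking logarithms (standard algebra):

$$Y=a e^{bx}\;\Rightarrow\;\log Y=\log a + b x,$$

so regressing log(y) on x gives b as the slope and log(a) as the intercept; the power-function case works the same way with log(x) as the regressor.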
Machine Learning Course, Lesson 2 notes