Machine Learning Course 2-Notes

Last Update:2015-06-26 Source: Internet

Author: User

Lesson 2
- Induction
- Regression is the relationship between variables
- Correlation coefficient
- RSS
- Linear regression via R language
- Multivariate linear model
- Dummy variables
- Multivariate linear regression model
- Regression diagnostics
- Generalized linear model

This lesson covers simple (one-variable) and multiple linear regression. It is heavy on statistics and statistical terminology.
These topics are the statistical foundation of big data.
Logistic regression is treated as a generalized linear regression model.
Variable selection (picking the useful variables out of many candidates) and dimensionality reduction.

1. Induction

Fitting: usually choose a straight line or a low-degree curve. (Observations contain error. If the curve passes through every point, that is called overfitting: the model is very accurate on the data it was fit to, but its predictions may be less accurate.)
Training set and prediction. Regression model: W = a + bH
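The overfitting remark above can be illustrated with a small sketch. It borrows the seven (x, y) points from the R example in section 5 and compares a straight line with a degree-6 polynomial. The degree 6 is an assumption chosen so that the curve has as many coefficients as there are points.

```r
# Data borrowed from the worked example in section 5 of these notes.
x <- c(170, 168, 175, 153, 185, 135, 172)
y <- c(61, 57, 58, 40, 90, 35, 68)

fit_line  <- lm(y ~ x)            # straight line: 2 coefficients
fit_curve <- lm(y ~ poly(x, 6))   # degree-6 polynomial: 7 coefficients for 7 points

rss_line  <- deviance(fit_line)   # residual sum of squares on the training data
rss_curve <- deviance(fit_curve)  # essentially zero: the curve passes through every point

print(c(line = rss_line, curve = rss_curve))
```

A training RSS near zero only says the curve reproduces the sample. It says nothing about how well the curve predicts a new point.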

Linear regression:
- Simple linear (one independent variable),
- Multiple linear (several independent variables in a first-degree equation; the fit is a plane, or a hyperplane in high-dimensional space);
Nonlinear regression: quadratic, logistic and so on. A nonlinear model that can be expressed through a linear one is called generalized linear (e.g. logistic).
Difficulty: variable selection (in the multivariate case) and dimensionality reduction are the hard parts of regression modeling. The laws of the world tend to be simple, so a good model should need only a few variables.
Multicollinearity (some variables add nothing of their own; how do we detect and remove them?)
Checking whether the model is reasonable requires some testing methods.

2. Regression is the relationship between variables

The relationship between the independent variable and the dependent variable
Functional relationship: deterministic, Y = a + bX (a is the intercept, b is the slope)
Correlation: a non-deterministic relationship

3. Correlation coefficient

The correlation coefficient measures the strength of linear correlation and is used to decide whether a regression model is appropriate.
Several concepts in the formula:

Subscript: the index of the sample.
x-bar: the mean of x.
Sigma: summation. With no subscript range, the sum runs over all samples.
By the Cauchy inequality, the absolute value of the coefficient is at most 1. If it is close to 1, a linear regression model is suitable.
A positive coefficient means y increases as x increases. A negative coefficient means y decreases as x increases.
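The formula itself did not survive in these notes. Assuming it is the standard Pearson coefficient, r = sum((x_i - x-bar)(y_i - y-bar)) / sqrt(sum((x_i - x-bar)^2) * sum((y_i - y-bar)^2)), the sketch below computes it by hand and compares it with R's built-in cor(). The data are borrowed from the example in section 5.

```r
x <- c(170, 168, 175, 153, 185, 135, 172)
y <- c(61, 57, 58, 40, 90, 35, 68)

dx <- x - mean(x)   # deviations from x-bar
dy <- y - mean(y)   # deviations from y-bar

# Pearson correlation coefficient from its definition
r_manual  <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_builtin <- cor(x, y)   # cor() uses the Pearson method by default

print(c(manual = r_manual, builtin = r_builtin))
```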

4. RSS

Which regression line fits best? The intuitive approach is to take the distance from each point to the line and make the total distance over all points as small as possible.
The trouble is that this distance involves a square root, which makes the minimization awkward. It is therefore replaced by the vertical distance, measured parallel to the Y axis, which is called the residual.
An absolute value is also awkward to minimize, so the residual is squared instead.
RSS: residual sum of squares, which measures the difference between the true values and the predicted values.
Minimizing the RSS (least squares) means finding the minimum of a quadratic function.
How to find the minimum: take partial derivatives. There are two unknowns, so take the two partial derivatives, set them to zero, and solve the resulting system of two equations.
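Written out for the line Y = a + bX from section 2, the minimization looks as follows. This is the standard textbook derivation, with a the intercept and b the slope, and is not taken verbatim from the course.

```latex
\begin{aligned}
\mathrm{RSS}(a,b) &= \sum_{i} (y_i - a - b x_i)^2 \\
\frac{\partial\,\mathrm{RSS}}{\partial a} &= -2 \sum_{i} (y_i - a - b x_i) = 0 \\
\frac{\partial\,\mathrm{RSS}}{\partial b} &= -2 \sum_{i} x_i\,(y_i - a - b x_i) = 0 \\
b &= \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i} (x_i - \bar{x})^2},
\qquad a = \bar{y} - b\,\bar{x}
\end{aligned}
```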

5. Linear regression via R language

y = c(61,57,58,40,90,35,68)
x = c(170,168,175,153,185,135,172)
plot(x, y) # draw the scatter plot
z = lm(y ~ x + 1) # lm assumes y = ax + b; the trailing +1 is optional
z0 = lm(y ~ x - 1) # alternative: line through the origin, no intercept
plot(y ~ x) # the same scatter plot, using the formula interface
summary(z) # fit summary; the meaning of each field in the summary:
- (Intercept): the intercept
- Residuals: the residuals
- Residual standard error: the standard deviation of the residuals
- Multiple R-squared: the squared correlation coefficient; the higher it is, the better the fit.
- Adjusted R-squared: the adjusted goodness of fit, of limited use here
- t value: the t statistic used in the hypothesis test for each coefficient
- Pr(>|t|): the tail area beyond |t|; the smaller the better.
- F-statistic: the F statistic
- p-value: the overall hypothesis test. It should be small; if it is not, the regression model is not valid
plot(z) # diagnostic plots; enlarge the plot window first, then press Enter repeatedly to step through the several plots
deviance(z) # residual sum of squares
residuals(z) # compute the residuals
print(z) # print the model information
anova(z) # analysis of variance table

Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value   Pr(>F)
x          1 197.633 197.633  47.943 0.006176 **
Residuals  3  12.367   4.122
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Prediction: x is the independent variable, m holds the value to predict at, and z is the fitted model
m = data.frame(x = 185)
predict(z, m)
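As a self-contained sketch, the steps of this section can be run end to end and cross-checked against the closed-form least-squares estimates. The names a_hat, b_hat, rss and pred are illustrative and do not come from the course.

```r
x <- c(170, 168, 175, 153, 185, 135, 172)
y <- c(61, 57, 58, 40, 90, 35, 68)

z <- lm(y ~ x)   # fit y = a + b*x

# Closed-form least-squares estimates, for comparison with lm()
b_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a_hat <- mean(y) - b_hat * mean(x)

rss <- sum(residuals(z)^2)   # the same quantity that deviance(z) reports

m    <- data.frame(x = 185)
pred <- predict(z, m)        # equals a + b * 185

print(coef(z))
print(c(a_hat = a_hat, b_hat = b_hat))
print(rss)
print(pred)
```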

6. Multivariate linear model

In R, type swiss to see the built-in swiss dataset

swiss.lm = lm(Fertility ~ ., data = swiss)
summary(swiss.lm)

With this many observations, the residuals are summarized by their quartiles: Min, 1Q, Median, 3Q, Max

7. Dummy variables

Dummy variables: a categorical variable such as sex is represented by two dummy variables, isman and iswoman

Additive model: the dummy variable adjusts the intercept, e.g. w=a+bh+c*isman
Multiplicative model: the dummy variable adjusts the slope, e.g. w=a+bh+c*isman*h
Mixed model: both the intercept and the slope are affected, e.g. w=a+bh+c*isman+d*iswoman+e*isman*h+f*iswoman*h+g (in practice one of the two dummies is dropped when fitting, because isman + iswoman = 1)
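A minimal sketch of the three models in R follows. The data frame d is hypothetical and invented purely for illustration, and only isman is used, since iswoman = 1 - isman.

```r
# Hypothetical data, invented for illustration only.
d <- data.frame(
  h     = c(160, 165, 170, 175, 180, 185, 150, 155, 160, 165, 170, 175),
  isman = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0),
  w     = c(58, 63, 66, 72, 75, 82, 45, 48, 53, 55, 60, 62)
)

additive       <- lm(w ~ h + isman, data = d)            # dummy shifts the intercept
multiplicative <- lm(w ~ h + h:isman, data = d)          # dummy changes the slope
mixed          <- lm(w ~ h + isman + h:isman, data = d)  # dummy changes both

print(coef(additive))
print(coef(multiplicative))
print(coef(mixed))
```

In a formula, h:isman is the product term only, whereas h*isman would expand to h + isman + h:isman.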

8. Multivariate linear regression model

Model revision: see the R modeling book, page 324

lm.new <- update(lm.sol, . ~ . + I(x2^2)) # I(x2^2) denotes the squared term of x2
lm2.new <- update(lm.new, . ~ . - x2) # drop the linear term of x2
lm3.new <- update(lm.new, . ~ . + x1*x2) # add x1*x2, i.e. x1, x2 and their interaction x1:x2

Note: revisions like these rely on the analyst's experience and on visual inspection. Is there a mechanical method in statistics that supports variable selection? There is: stepwise regression. It comes in several forms:

Forward selection: start from a one-variable regression and add variables one at a time
Backward elimination: start with all variables and remove them one at a time
Stepwise selection: a combination of the two

Assessment criteria:

RSS (residual sum of squares), which corresponds to the Residual standard error in the summary output
R^2 (squared correlation coefficient), which corresponds to the Multiple R-squared in the summary output
AIC (Akaike information criterion)

s = lm(Fertility ~ ., data = swiss)
s1 = step(s, direction = "forward") # no variables left to add
s1 = step(s, direction = "backward")
s1 = step(s, direction = "both")

Manual stepwise regression: see the R modeling book, page 334

add1()
drop1()
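As a rough sketch of how the manual and automatic routes relate, drop1() reports what would happen if each term were removed once, and step() repeats that choice until the AIC stops improving. The variable names full and reduced are illustrative.

```r
full <- lm(Fertility ~ ., data = swiss)

# One manual step: RSS and AIC after dropping each term in turn
print(drop1(full))

# Automatic backward elimination by AIC; trace = 0 silences the step-by-step log
reduced <- step(full, direction = "backward", trace = 0)

print(formula(reduced))
print(c(full = extractAIC(full)[2], reduced = extractAIC(reduced)[2]))
```

extractAIC() is the criterion step() uses internally. Its second element is the AIC value, and the first is the equivalent degrees of freedom.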

9. Regression Diagnostics

Does the sample follow a normal distribution?
- Normality test: the function shapiro.test(X$X1)
- Check the distribution for normality
Are there outliers in the training set? How can outliers be found?
Is a linear model reasonable? The real relationship may be more complicated.
Do the errors satisfy independence and equal variance (the error is unrelated to the size of y)?
- If the sample is normally distributed, the residuals from residuals() are also normally distributed
Multicollinearity (the independent variables are not independent of one another)
- Multicollinearity makes the result of the matrix inversion very unstable.
- Kappa value (the Greek letter κ): multiply the transpose of the sample data matrix by the matrix itself, compute the eigenvalues, and divide the largest by the smallest
- κ < 100 indicates little collinearity; 100 < κ < 1000 indicates strong multicollinearity; κ > 1000 indicates severe multicollinearity
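The two numeric checks above can be sketched on the built-in swiss data. Standardizing the predictors before forming X'X is an assumption made here so that differing units do not dominate the eigenvalues, and the test is applied to the model residuals rather than to a raw column.

```r
fit <- lm(Fertility ~ ., data = swiss)

# Normality of the residuals (Shapiro-Wilk); a small p-value suggests non-normality
sw <- shapiro.test(residuals(fit))

# Kappa: ratio of the largest to the smallest eigenvalue of X'X.
# Assumption: predictors are standardized first.
X   <- scale(as.matrix(swiss[, -1]))   # drop the response (column 1, Fertility)
XtX <- crossprod(X)                    # t(X) %*% X
ev  <- eigen(XtX, symmetric = TRUE)$values
k   <- max(ev) / min(ev)

print(sw$p.value)
print(k)   # compare with the 100 and 1000 thresholds above
```

For a symmetric matrix like this one, kappa(XtX, exact = TRUE) should return the same ratio.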

10. Generalized linear model

Nonlinear
The S-shaped curve, very famous in statistics, is called the logistic curve
glm() fits generalized linear models

Here is the Norell experiment:

norell <- data.frame(x = 0:5, n = rep(70, 6), success = c(0, 9, 21, 47, 60, 63))
norell$Ymat <- cbind(norell$success, norell$n - norell$success)
glm.sol <- glm(Ymat ~ x, family = binomial, data = norell)
summary(glm.sol)
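Assuming the model is a logistic regression of the success/failure counts on x with the binomial family, a fitted model can be turned into a predicted success probability. The sketch computes it with predict() and by hand from the logistic curve. The choice of x = 3 and the names p_pred and p_manual are illustrative.

```r
norell <- data.frame(x = 0:5, n = rep(70, 6), success = c(0, 9, 21, 47, 60, 63))
norell$Ymat <- cbind(norell$success, norell$n - norell$success)  # successes, failures

fit <- glm(Ymat ~ x, family = binomial, data = norell)

# Predicted success probability at x = 3, computed two equivalent ways
p_pred   <- predict(fit, data.frame(x = 3), type = "response")
p_manual <- plogis(coef(fit)[1] + coef(fit)[2] * 3)   # 1 / (1 + exp(-(a + b*x)))

print(c(predict = unname(p_pred), manual = unname(p_manual)))
```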

Methods for converting a nonlinear model into a linear one

Logarithmic: y = a + b log(x), lm.log = lm(y ~ log(x))
Exponential: y = a e^(bx), lm.exp = lm(log(y) ~ x)
Power function: y = a x^b, lm.pow = lm(log(y) ~ log(x))
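A short sketch of the exponential case: taking logs turns y = a e^(bx) into log(y) = log(a) + b x, so lm() estimates log(a) and b. The data are synthetic and noise-free, generated from a = 2 and b = 0.5 (values chosen for illustration), so the fit should recover those values up to rounding.

```r
# Synthetic, noise-free data from y = 2 * exp(0.5 * x) (illustrative values)
x <- 1:8
y <- 2 * exp(0.5 * x)

lm.exp <- lm(log(y) ~ x)         # log(y) = log(a) + b*x

a_hat <- exp(coef(lm.exp)[1])    # back-transform the intercept to get a
b_hat <- coef(lm.exp)[2]         # the slope is b directly

print(c(a = unname(a_hat), b = unname(b_hat)))
```

With real, noisy data the back-transformed fit minimizes error on the log scale, which is not the same as minimizing RSS on the original scale.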

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
