Regression Problems

Source: Internet
Author: User
Document directory
  • Estimated simple regression equation
  • Coefficient of determination
  • Significance test for Linear Regression
  • Confidence Interval for Linear Regression
  • Prediction Interval for Linear Regression
  • Residual plot
  • Standardized residual
  • Normal probability plot of residuals
  • Estimated Multiple Regression Equation
  • Multiple coefficient of determination
  • Adjusted Coefficient of determination
  • Significance test for MLR
  • Confidence Interval for MLR
  • Prediction Interval for MLR
  • Estimated Logistic Regression Equation
  • Significance test for Logistic Regression

Refer to http://www.r-tutor.com/elementary-statistics/simple-linear-regression

 

Simple linear regression

A simple linear regression model that describes the relationship between two variables x and y can be expressed by the following equation. The numbers α and β are called parameters, and ε is the error term.

$$ y = \alpha + \beta x + \epsilon $$

For example, the data set faithful contains sample data of two random variables named waiting and eruptions. The waiting variable denotes the waiting time until the next eruption, and eruptions denotes the duration of the eruption. Its linear regression model can be expressed as:

$$ \text{eruptions} = \alpha + \beta \cdot \text{waiting} + \epsilon $$

Linear regression is a statistical method that uses regression analysis to determine the quantitative relationship between two or more variables, and it is widely used. Depending on the form of the relationship between the independent and dependent variables, regression analysis is divided into linear and nonlinear regression analysis.

 

Estimated simple regression equation

If we choose the parameters α and β in the simple linear regression model so as to minimize the sum of squares of the error term ε, we will have the so-called estimated simple regression equation. It allows us to compute fitted values of y based on values of x.

$$ \hat{y} = a + b x $$
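As a minimal sketch of what "minimizing the sum of squares" produces, the estimates can be computed from the standard closed-form least-squares expressions and compared with the output of lm. The variable names x, y, a, b and fit are chosen here for illustration only.

```r
# Closed-form least-squares estimates for eruptions = a + b * waiting
x <- faithful$waiting
y <- faithful$eruptions

b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
a <- mean(y) - b * mean(x)                                      # intercept

# Compare with the estimates returned by lm
fit <- lm(eruptions ~ waiting, data = faithful)
print(c(a = a, b = b))
print(coefficients(fit))
```

Both printed vectors should agree, which is a quick way to convince yourself that lm is solving the same minimization.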

 

Problem

Apply the simple linear regression model for the data set faithful, and estimate the next eruption duration if the waiting time since the last eruption has been 80 minutes.

In other words, predict the duration of the next eruption from the existing sample.

Solution

We apply the lm function to a formula that describes the variable eruptions by the variable waiting, and save the linear regression model in a new variable eruption.lm.

> eruption.lm = lm(eruptions ~ waiting, data=faithful)

Then we extract the parameters of the estimated regression equation with the coefficients function.

> coeffs = coefficients(eruption.lm); coeffs
(Intercept)     waiting
  -1.874016    0.075628

We now fit the eruption duration using the estimated regression equation.

> waiting = 80           # the waiting time
> duration = coeffs[1] + coeffs[2]*waiting
> duration
(Intercept)
     4.1762

Answer

Based on the simple linear regression model, if the waiting time since the last eruption has been 80 minutes, we expect the next one to last 4.1762 minutes.

 

Coefficient of determination

The coefficient of determination of a linear regression model is the quotient of the variances of the fitted values and observed values of the dependent variable. If we denote y_i as the observed values of the dependent variable, ȳ as its mean, and ŷ_i as the fitted value, then the coefficient of determination is:

$$ R^2 = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2} $$

It measures how closely the fitted (predicted) values track the observed (true) values. Put plainly, it tells us how well the linear regression hypothesis holds for the data.

Problem

Find the coefficient of determination for the simple linear regression model of the data set faithful.

Solution

We apply the lm function to a formula that describes the variable eruptions by the variable waiting, and save the linear regression model in a new variable eruption.lm.

> eruption.lm = lm(eruptions ~ waiting, data=faithful)

Then we extract the coefficient of determination from the r.squared attribute of its summary.

> summary(eruption.lm)$r.squared
[1] 0.81146

Answer

The coefficient of determination of the simple linear regression model for the data set faithful is 0.81146.
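The same number can be reproduced from the definition of the coefficient of determination as a ratio of sums of squares. The following is a small sketch; the names fit, y, y.hat and r2 are illustrative and not part of the original text.

```r
# Coefficient of determination from its definition
fit <- lm(eruptions ~ waiting, data = faithful)

y     <- faithful$eruptions   # observed values
y.hat <- fitted(fit)          # fitted values

r2 <- sum((y.hat - mean(y))^2) / sum((y - mean(y))^2)

print(r2)
print(summary(fit)$r.squared)  # value reported by summary, for comparison
```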

 

Significance test for Linear Regression

Assume that the error term ε in the linear regression model is independent of x, and is normally distributed, with zero mean and constant variance. We can decide whether there is any significant relationship between x and y by testing the null hypothesis that β = 0.

What is the significance test? Put plainly, it checks whether the linear relationship you assume really exists, or whether it is far-fetched and the two variables have nothing to do with each other.

For y = α + βx, if β may be 0, then y is just a constant plus noise and has nothing to do with x.

Therefore, for this linear regression to hold, β = 0 must be a small-probability event given the data; otherwise the linear regression assumption is not supported.

Note: The basic idea of the significance test can be explained by the small-probability principle, which states that a small-probability event is almost impossible in a single trial. If such an event does occur in a trial, we can only conclude that our assumption about the population is incorrect.

So this is a hypothesis testing problem: we test the null hypothesis that β = 0. If this hypothesis is rejected, the linear regression is significant.

Problem

Decide whether there is a significant relationship between the variables in the linear regression model of the data set faithful at the .05 significance level.

Solution

We apply the lm function to a formula that describes the variable eruptions by the variable waiting, and save the linear regression model in a new variable eruption.lm.

> eruption.lm = lm(eruptions ~ waiting, data=faithful)

Then we print out the F-statistic of the significance test with the summary function.

> summary(eruption.lm)

Call:
lm(formula = eruptions ~ waiting, data = faithful)

Residuals:
    Min      1Q  Median      3Q     Max
-1.2992 -0.3769  0.0351  0.3491  1.1933

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.87402    0.16014   -11.7   <2e-16 ***
waiting      0.07563    0.00222    34.1   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.497 on 270 degrees of freedom
Multiple R-squared: 0.811,     Adjusted R-squared: 0.811
F-statistic: 1.16e+03 on 1 and 270 DF,  p-value: <2e-16

Answer

As the p-value is much less than 0.05, we reject the null hypothesis that β = 0. Hence there is a significant relationship between the variables in the linear regression model of the data set faithful.

I did not fully follow the process. In short, the result is that the null hypothesis is rejected.
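To make the process less opaque, here is a sketch of how the t value and p-value for the slope can be computed by hand, assuming the usual textbook formulas for simple linear regression. The names s, se.b, t.stat and p.value are illustrative.

```r
# Significance test for the slope, step by step
fit <- lm(eruptions ~ waiting, data = faithful)
x <- faithful$waiting
n <- nrow(faithful)

s    <- sqrt(sum(resid(fit)^2) / (n - 2))   # residual standard error
se.b <- s / sqrt(sum((x - mean(x))^2))      # standard error of the slope

t.stat  <- unname(coefficients(fit)[2]) / se.b   # test statistic under beta = 0
p.value <- 2 * pt(-abs(t.stat), df = n - 2)      # two-sided p-value

print(c(t = t.stat, F = t.stat^2))
print(p.value < 0.05)   # TRUE means the null hypothesis is rejected at .05
```

With a single predictor, the F-statistic printed by summary is the square of this t value, so the two tests lead to the same decision.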

 

Confidence Interval for Linear Regression

Assume that the error term ε in the linear regression model is independent of x, and is normally distributed, with zero mean and constant variance. For a given value of x, the interval estimate for the mean of the dependent variable, ȳ, is called the confidence interval.

As the example shows, regression is a prediction, so it cannot be completely accurate. It is therefore more reasonable to give an interval than a single value. It is called a confidence interval because the range depends on the chosen confidence level.

Problem

In the data set faithful, develop a 95% confidence interval of the mean eruption duration for the waiting time of 80 minutes.

Solution

We apply the lm function to a formula that describes the variable eruptions by the variable waiting, and save the linear regression model in a new variable eruption.lm.

> attach(faithful)     # attach the data frame
> eruption.lm = lm(eruptions ~ waiting)

Then we create a new data frame that sets the waiting time value.

> newdata = data.frame(waiting=80)

We now apply the predict function and set the predictor variable in the newdata argument. We also set the interval type as "confidence", and use the default 0.95 confidence level.

> predict(eruption.lm, newdata, interval="confidence")
     fit    lwr    upr
1 4.1762 4.1048 4.2476
> detach(faithful)     # clean up

Answer

The 95% confidence interval of the mean eruption duration for the waiting time of 80 minutes is between 4.1048 and 4.2476 minutes.

I think this example relies on a high-level function, predict, which does not help much in understanding how the interval is actually computed.
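For readers who want to see what happens underneath, the interval can be rebuilt by hand. This is a sketch assuming the standard formula for the standard error of the mean response in simple linear regression; the names x0, se.mean, t.crit and ci are illustrative.

```r
# 95% confidence interval for the mean response at waiting = 80, by hand
fit <- lm(eruptions ~ waiting, data = faithful)
x  <- faithful$waiting
n  <- length(x)
x0 <- 80

s  <- summary(fit)$sigma                                   # residual standard error
y0 <- unname(coefficients(fit)[1] + coefficients(fit)[2] * x0)  # point estimate

# standard error of the estimated mean response at x0
se.mean <- s * sqrt(1 / n + (x0 - mean(x))^2 / sum((x - mean(x))^2))
t.crit  <- qt(0.975, df = n - 2)                           # two-sided 95% critical value

ci <- c(fit = y0, lwr = y0 - t.crit * se.mean, upr = y0 + t.crit * se.mean)
print(ci)
```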

 

Prediction Interval for Linear Regression

Assume that the error term ε in the simple linear regression model is independent of x, and is normally distributed, with zero mean and constant variance. For a given value of x, the interval estimate of the dependent variable y is called the prediction interval.

The difference is that the interval above estimates the mean of y, while the interval here estimates the range of an individual value of y.

> predict(eruption.lm, newdata, interval="predict")

The only difference from the code above is that the interval type is changed to "predict".
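A short sketch that puts the two interval types side by side may make the distinction concrete. It spells the interval type out as "prediction"; R also accepts the abbreviation "predict" through partial argument matching. The names conf.int and pred.int are illustrative.

```r
# Confidence interval versus prediction interval at waiting = 80
fit     <- lm(eruptions ~ waiting, data = faithful)
newdata <- data.frame(waiting = 80)

conf.int <- predict(fit, newdata, interval = "confidence")  # for the mean of y
pred.int <- predict(fit, newdata, interval = "prediction")  # for a single new y

print(rbind(confidence = conf.int[1, ], prediction = pred.int[1, ]))
```

Both intervals share the same centre, but the prediction interval is wider because it also accounts for the variance of an individual observation around the mean.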

 

Residual plot

The residual data of the simple linear regression model is the difference between the observed data of the dependent variable y and the fitted values ŷ.

$$ \text{Residual} = y - \hat{y} $$

Problem

Plot the residual of the simple linear regression model of the data set faithful against the independent variable waiting.

Solution

We apply the lm function to a formula that describes the variable eruptions by the variable waiting, and save the linear regression model in a new variable eruption.lm. Then we compute the residual with the resid function.

> eruption.lm = lm(eruptions ~ waiting, data=faithful)
> eruption.res = resid(eruption.lm)

We now plot the residual against the observed values of the variable waiting.

> plot(faithful$waiting, eruption.res,
+     ylab="Residuals", xlab="Waiting Time",
+     main="Old Faithful Eruptions")
> abline(0, 0)                  # the horizon

The plot is a graphical representation of the difference between the observed values and the estimated values.

 

Standardized residual

The standardized residual is the residual divided by its standard deviation.

$$ \text{Standardized residual}_i = \frac{\text{Residual}_i}{\text{Standard deviation of residual}_i} $$
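As a sketch of this definition in R: the built-in rstandard function returns standardized residuals, and the manual version below assumes the usual estimate s * sqrt(1 - h_i) for the standard deviation of residual i, where h_i is the leverage from hatvalues. That formula is an assumption brought in here, not something stated in the text above.

```r
# Standardized residuals: built-in function versus manual computation
fit <- lm(eruptions ~ waiting, data = faithful)

s <- summary(fit)$sigma   # residual standard error
h <- hatvalues(fit)       # leverage of each observation (assumed formula input)

std.res.manual <- resid(fit) / (s * sqrt(1 - h))
std.res        <- rstandard(fit)

print(head(cbind(manual = std.res.manual, builtin = std.res)))
```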

 

Normal probability plot of residuals

The normal probability plot is a graphical tool for comparing a data set with the normal distribution. We can use it with the standardized residual of the linear regression model and see if the error term ε is actually normally distributed.
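A minimal sketch of such a plot for the faithful model, using the base R functions qqnorm and qqline. The axis labels and title are illustrative choices.

```r
# Normal probability plot of the standardized residuals
fit     <- lm(eruptions ~ waiting, data = faithful)
std.res <- rstandard(fit)

qq <- qqnorm(std.res,
             ylab = "Standardized Residuals",
             xlab = "Normal Scores",
             main = "Old Faithful Eruptions")
qqline(std.res)   # reference line through the first and third quartiles
```

If the points fall close to the reference line, the normality assumption on the error term looks reasonable.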

 

Multiple linear regression

A multiple linear regression (MLR) model that describes a dependent variable y by independent variables x_1, x_2, ..., x_p (p > 1) is expressed by the equation as follows, where the numbers α and β_k (k = 1, 2, ..., p) are the parameters, and ε is the error term.

$$ y = \alpha + \sum_k \beta_k x_k + \epsilon $$

For example, in the built-in data set stackloss from observations of a chemical plant operation, if we assign stack.loss as the dependent variable, and assign Air.Flow (cooling air flow), Water.Temp (inlet water temperature) and Acid.Conc. (acid concentration) as independent variables, the multiple linear regression model is:

$$ \text{stack.loss} = \alpha + \beta_1 \cdot \text{Air.Flow} + \beta_2 \cdot \text{Water.Temp} + \beta_3 \cdot \text{Acid.Conc.} + \epsilon $$

The real world is complex and outcomes usually depend on several factors, so linear regression with multiple independent variables is more practical than simple linear regression.

The rest of the content is essentially the same as for simple linear regression (SLR).

Estimated Multiple Regression Equation

If we choose the parameters α and β_k (k = 1, 2, ..., p) in the multiple linear regression model so as to minimize the sum of squares of the error term ε, we will have the so-called estimated multiple regression equation. It allows us to compute fitted values of y based on a set of values of x_k (k = 1, 2, ..., p).

$$ \hat{y} = a + \sum_k b_k x_k $$

Problem

Apply the multiple linear regression model for the data set stackloss, and predict the stack loss if the air flow is 72, water temperature is 20 and acid concentration is 85.

Solution

We apply the lm function to a formula that describes the variable stack.loss by the variables Air.Flow, Water.Temp and Acid.Conc., and we save the linear regression model in a new variable stackloss.lm.

> stackloss.lm = lm(stack.loss ~
+     Air.Flow + Water.Temp + Acid.Conc.,
+     data=stackloss)

We also wrap the parameters inside a new data frame named newdata.

> newdata = data.frame(Air.Flow=72,     # wrap the parameters
+                      Water.Temp=20,
+                      Acid.Conc.=85)

Lastly, we apply the predict function to stackloss.lm and newdata.

> predict(stackloss.lm, newdata)
1
24.582

 

Multiple coefficient of determination

The coefficient of determination of a multiple linear regression model is the quotient of the variances of the fitted values and observed values of the dependent variable. If we denote y_i as the observed values of the dependent variable, ȳ as its mean, and ŷ_i as the fitted value, then the coefficient of determination is:

$$ R^2 = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2} $$

 

Adjusted Coefficient of determination

The adjusted coefficient of determination of a multiple linear regression model is defined in terms of the coefficient of determination as follows, where n is the number of observations in the data set, and p is the number of independent variables.

$$ R^2_{adj} = 1 - (1 - R^2) \frac{n - 1}{n - p - 1} $$

Why do I need to adjust it ...?
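One way to see the motivation is that the plain coefficient of determination never decreases when another predictor is added, even a meaningless one, whereas the adjusted version penalizes the extra parameter. The sketch below illustrates this on stackloss; the random column named noise and the seed are hypothetical additions made purely for the demonstration.

```r
# Adjusted R-squared from its definition, and the effect of a useless predictor
fit <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)
n <- nrow(stackloss)   # number of observations
p <- 3                 # number of independent variables

r2     <- summary(fit)$r.squared
adj.r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(c(r.squared = r2, adjusted = adj.r2))

# Add a pure-noise column (hypothetical) and refit on all four predictors
set.seed(1)
d    <- transform(stackloss, noise = rnorm(nrow(stackloss)))
fit2 <- lm(stack.loss ~ ., data = d)
print(c(r.squared = summary(fit2)$r.squared,
        adjusted  = summary(fit2)$adj.r.squared))
```

The plain value for the second model cannot be lower than for the first, so it is not a fair basis for comparing models with different numbers of predictors.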

 

Significance test for MLR

Assume that the error term ε in the multiple linear regression (MLR) model is independent of x_k (k = 1, 2, ..., p), and is normally distributed, with zero mean and constant variance. We can decide whether there is any significant relationship between the dependent variable y and any of the independent variables x_k (k = 1, 2, ..., p).

That is, we test whether y is significantly related to each independent variable x_k, in the same way as for SLR above.
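A sketch of how this could be checked for the stackloss model at the .05 level: the per-coefficient p-values are read from the coefficient table of summary. The name p.values is illustrative.

```r
# Significance of each independent variable in the stackloss model
fit <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)

coefs    <- summary(fit)$coefficients
p.values <- coefs[-1, "Pr(>|t|)"]   # drop the intercept row

print(p.values)
print(p.values < 0.05)   # TRUE marks variables significant at the .05 level
```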

 

Confidence Interval for MLR

Assume that the error term ε in the multiple linear regression (MLR) model is independent of x_k (k = 1, 2, ..., p), and is normally distributed, with zero mean and constant variance. For a given set of values of x_k (k = 1, 2, ..., p), the interval estimate for the mean of the dependent variable is called the confidence interval.
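As with simple linear regression, a sketch using predict with the interval type "confidence" gives the interval for the mean stack loss. The input values (air flow 72, water temperature 20, acid concentration 85) are reused from the estimation problem earlier in the document.

```r
# 95% confidence interval for the mean stack loss at the given inputs
fit <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)
newdata <- data.frame(Air.Flow = 72, Water.Temp = 20, Acid.Conc. = 85)

mlr.conf <- predict(fit, newdata, interval = "confidence")  # default level 0.95
print(mlr.conf)
```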

 

Prediction Interval for MLR

Assume that the error term ε in the multiple linear regression (MLR) model is independent of x_k (k = 1, 2, ..., p), and is normally distributed, with zero mean and constant variance. For a given set of values of x_k (k = 1, 2, ..., p), the interval estimate of the dependent variable y is called the prediction interval.
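A matching sketch for the prediction interval, again reusing the input values from the earlier estimation problem; only the interval type changes.

```r
# 95% prediction interval for an individual stack loss at the given inputs
fit <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)
newdata <- data.frame(Air.Flow = 72, Water.Temp = 20, Acid.Conc. = 85)

mlr.pred <- predict(fit, newdata, interval = "prediction")
print(mlr.pred)
```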

 

Logistic regression

We use the logistic regression equation to predict the probability of a dependent variable taking the dichotomy values 0 or 1. Suppose x_1, x_2, ..., x_p (p > 1) are the independent variables, α and β_k (k = 1, 2, ..., p) are the parameters, and E(y) is the expected value of the dependent variable y. Then the logistic regression equation is:

$$ E(y) = \frac{1}{1 + e^{-(\alpha + \sum_k \beta_k x_k)}} $$

For example, in the built-in data set mtcars, the data column am represents the transmission type of the automobile model (0 = automatic, 1 = manual). With the logistic regression equation, we can model the probability of a manual transmission in a vehicle based on its engine horsepower and weight data.

Logistic regression is a form of non-linear regression. Relationships between things in the real world are not always linear, so we also need to study non-linear regression.

 

Estimated Logistic Regression Equation

Using the generalized linear model, an estimated logistic regression equation can be formulated as below. The coefficients a and b_k (k = 1, 2, ..., p) are determined according to a maximum likelihood approach, and the equation allows us to estimate the probability of the dependent variable y taking on the value 1 for given values of x_k (k = 1, 2, ..., p).

$$ \text{Estimate of } P(y = 1 \mid x_1, \ldots, x_p) = \frac{1}{1 + e^{-(a + \sum_k b_k x_k)}} $$

Problem

Using the logistic regression equation of vehicle transmission in the data set mtcars, estimate the probability of a vehicle being fitted with a manual transmission if it has a 120hp engine and weighs 2800 lbs.

Solution

We apply the function glm to a formula that describes the transmission type (am) by the horsepower (hp) and weight (wt). This creates a generalized linear model (GLM) in the binomial family.

> am.glm = glm(formula=am ~ hp + wt,
+              data=mtcars,
+              family=binomial)

We then wrap the test parameters inside a data frame newdata. The wt column of mtcars is recorded in units of 1000 lbs, so 2800 lbs becomes 2.8.

> newdata = data.frame(hp=120, wt=2.8)

Now we apply the function predict to the generalized linear model am.glm along with newdata. We have to select the response prediction type in order to obtain the predicted probability.

> predict(am.glm, newdata, type="response")
1
0.64181
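To connect this result back to the estimated logistic regression equation, the probability can also be computed directly from the fitted coefficients. This is a sketch; the names b, eta and p are illustrative.

```r
# Predicted probability of a manual transmission from the logistic equation
am.glm <- glm(am ~ hp + wt, data = mtcars, family = binomial)

b   <- coefficients(am.glm)
eta <- b["(Intercept)"] + b["hp"] * 120 + b["wt"] * 2.8   # linear predictor
p   <- unname(1 / (1 + exp(-eta)))                        # logistic transform

print(p)
print(predict(am.glm, data.frame(hp = 120, wt = 2.8), type = "response"))
```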

 

Significance test for Logistic Regression

We can decide whether there is any significant relationship between the dependent variable y and the independent variables x_k (k = 1, 2, ..., p) in the logistic regression equation. In particular, if the null hypothesis that β_k = 0 holds for some k (k = 1, 2, ..., p), then x_k is statistically insignificant in the logistic regression model.
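A sketch of how this test could be read off for the mtcars transmission model at the .05 level: for a binomial glm, the coefficient table of summary reports a z value and its p-value for each coefficient. The name p.values is illustrative.

```r
# Significance of hp and wt in the logistic regression model
am.glm <- glm(am ~ hp + wt, data = mtcars, family = binomial)

coefs    <- summary(am.glm)$coefficients
p.values <- coefs[-1, "Pr(>|z|)"]   # drop the intercept row

print(p.values)
print(p.values < 0.05)   # TRUE marks variables significant at the .05 level
```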
