Linear regression: an R case study


Introduction to Linear regression

If the independent variable (x) and the dependent variable (y) are plotted in two-dimensional coordinates, each record corresponds to a point. The most common application of linear regression is to fit a straight line to the known points and then predict the y value for a given x value. All we have to do is find a suitable line, which means finding the right slope and intercept.

SSE & RMSE

SSE stands for the sum of squared errors: the sum of the squared differences between the predicted values and the actual values, which can be used to judge the error of a model. However, SSE has some drawbacks as a measure of model quality: it grows with the number of points, and its units are the squared units of the variable, which is hard to interpret. So we use another value to measure the error of a model: RMSE (root-mean-square error).

RMSE is normalized by the number of points N, and its units are the same as the units of the variable.
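Written out, with ŷᵢ the predicted value, yᵢ the actual value, and N the number of points:

```latex
\mathrm{SSE} = \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\frac{\mathrm{SSE}}{N}}
```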

Case

Many studies have shown that the global average temperature has risen in the past few decades, resulting in rising sea levels and extreme weather that can affect countless people. The case in this paper attempts to study the relationship between global average temperature and some other factors.
The data file climate_change.csv used here can be downloaded from
https://courses.edx.org/c4x/MITx/15.071x_2/asset/climate_change.csv
This dataset contains monthly data from May 1983 to December 2008.
In this example, we use the data from May 1983 to December 2006 as the training set and the remaining data as the test set.

Data

First, load the data.
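A minimal sketch of the load-and-split step in R; the local file name and the split-by-Year rule are assumptions based on the description above:

```r
# Split the climate data frame into the training period (through 2006)
# and the test period (2007 onward), as described above.
split_climate <- function(df, last_train_year = 2006) {
  list(train = subset(df, Year <= last_train_year),
       test  = subset(df, Year >  last_train_year))
}

# climate <- read.csv("climate_change.csv")  # downloaded from the URL above
# sets <- split_climate(climate)             # sets$train: May 1983 - Dec 2006
```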

Data interpretation

    • Year: the year of the observation

    • Month: the month of the observation

    • MEI: multivariate El Niño/Southern Oscillation index

    • Temp: the difference between the global average temperature in the current period and a reference value

    • CO2, N2O, CH4, cfc.11, cfc.12: atmospheric concentrations of these gases

    • TSI: total solar irradiance

    • aerosols: mean stratospheric aerosol optical depth

Model selection

Building a linear regression model involves two parts.

    • Selecting the target features. Our data contains multiple features, but not all of them are helpful for prediction, and not all of them need to work together to make predictions, so we need to sift out the smallest feature combination whose predictions come closest to the facts.

    • Determining the feature coefficients. After the features are selected, we want to determine the weight each feature carries in the predicted result; these weights are the coefficients.

Selecting a model by example

Start with all features
Select all features as the first model (model1) and use the summary function to read off its adjusted R², which is 0.7371.
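In R this first fit can be sketched as follows; the column spellings (CFC.11, Aerosols, etc.) are assumptions about how read.csv renders the CSV headers:

```r
# Fit the first model: regress Temp on all eight candidate features.
fit_full_model <- function(train) {
  lm(Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + TSI + Aerosols,
     data = train)
}

# model1 <- fit_full_model(sets$train)
# summary(model1)$adj.r.squared   # the article reports 0.7371 on the real data
```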

Remove features one at a time
Remove each single feature from model1 in turn, and note the corresponding adjusted R²:

Remaining features                                      Adjusted R²
CO2 + CH4 + N2O + cfc.11 + cfc.12 + TSI + aerosols 0.6373
MEI + CH4 + N2O + cfc.11 + cfc.12 + TSI + aerosols 0.7331
MEI + CO2 + N2O + cfc.11 + cfc.12 + TSI + aerosols 0.738
MEI + CO2 + CH4 + cfc.11 + cfc.12 + TSI + aerosols 0.7339
MEI + CO2 + CH4 + N2O + cfc.12 + TSI + aerosols 0.7163
MEI + CO2 + CH4 + N2O + cfc.11 + TSI + aerosols 0.7172
MEI + CO2 + CH4 + N2O + cfc.11 + cfc.12 + aerosols 0.697
MEI + CO2 + CH4 + N2O + cfc.11 + cfc.12 + TSI 0.6883

The best result of this round (adjusted R² = 0.738, obtained by removing CH4) gives model2: temp ~ MEI + CO2 + N2O + cfc.11 + cfc.12 + TSI + aerosols

Next, remove each single feature from model2 in turn, and note the corresponding adjusted R²:

Remaining features                                      Adjusted R²
CO2 + N2O + cfc.11 + cfc.12 + TSI + aerosols 0.6377
MEI + N2O + cfc.11 + cfc.12 + TSI + aerosols 0.7339
MEI + CO2 + cfc.11 + cfc.12 + TSI + aerosols 0.7346
MEI + CO2 + N2O + cfc.12 + TSI + aerosols 0.7171
MEI + CO2 + N2O + cfc.11 + TSI + aerosols 0.7166
MEI + CO2 + N2O + cfc.11 + cfc.12 + aerosols 0.698
MEI + CO2 + N2O + cfc.11 + cfc.12 + TSI 0.6891

Every combination in this round has a smaller adjusted R² than the previous round's 0.738, so we keep the previous round's feature combination as the final model, i.e. temp ~ MEI + CO2 + N2O + cfc.11 + cfc.12 + TSI + aerosols
The coefficient of each feature can then be read from summary(model2).
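One round of the drop-one search described above can be sketched as a small helper; the feature names and `train` data frame follow the earlier assumptions:

```r
# Refit the model with each feature removed in turn and record the
# adjusted R^2, so the combinations can be compared as in the tables above.
drop_one_adj_r2 <- function(train, features, response = "Temp") {
  sapply(features, function(f) {
    fml <- reformulate(setdiff(features, f), response = response)
    summary(lm(fml, data = train))$adj.r.squared
  })
}

# features <- c("MEI","CO2","CH4","N2O","CFC.11","CFC.12","TSI","Aerosols")
# drop_one_adj_r2(train, features)    # round 1: best result = drop CH4
# model2 <- lm(Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI + Aerosols,
#              data = train)
# summary(model2)$coefficients        # per-feature coefficients
```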

Linear regression theory

In linear regression, relationships are modeled with linear predictor functions whose unknown parameters are estimated from the data; such models are called linear models. The most common form of linear regression models the conditional mean of y given x as an affine function of x.
Linear regression was the first type of regression analysis to be studied rigorously and used extensively in practical applications. This is because models that depend linearly on their unknown parameters are easier to fit than models that depend nonlinearly on their parameters, and the statistical properties of the resulting estimators are easier to determine.
The above definition comes from Wikipedia.

This error estimation function takes the sum of the squared differences between the estimate for x⁽ⁱ⁾ and the true value y⁽ⁱ⁾. The factor 1/2m in front is there so that the constant cancels when we differentiate. As for why the squared error is chosen as the error estimation function, that has to be explained from the angle of probability distributions.
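In symbols, with m training examples, hypothesis h_θ, and the 1/2m factor mentioned above:

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
```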
There are many methods to adjust θ so that J(θ) reaches its minimum; this article focuses on the gradient descent method and the normal equation method.

Gradient Descent

After the linear regression model is selected, it can be used for prediction only once the parameter θ is determined, and θ should make J(θ) as small as possible. So the problem boils down to finding a minimum.
The gradient descent process is as follows:

1. First assign a value to θ; it can be random, or θ can simply be the all-zero vector.
2. Change the value of θ so that J(θ) decreases along the direction of gradient descent.

The gradient direction is given by the partial derivatives of J(θ) with respect to θ; since we are seeking a minimum, we move in the direction opposite to the gradient. The update formula is:
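For the squared-error cost J(θ) above, the update for each component θ_j (applied simultaneously for all j) is:

```latex
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)
          = \theta_j - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
```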

This method evaluates the error on all of the training data before each update of θ, so it is known as batch gradient descent (α is the learning rate).
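The batch update above can be sketched in R for one-variable regression; this is a toy illustration, not the article's own code, and the step size and iteration count are chosen for this small example:

```r
# Batch gradient descent for y ~ theta0 + theta1 * x.
gradient_descent <- function(x, y, alpha = 0.5, iters = 2000) {
  m <- length(y)
  theta <- c(0, 0)                            # start from the all-zero vector
  X <- cbind(1, x)                            # design matrix with intercept column
  for (i in seq_len(iters)) {
    grad <- t(X) %*% (X %*% theta - y) / m    # partial derivatives of J(theta)
    theta <- theta - alpha * grad             # step against the gradient
  }
  drop(theta)                                 # c(intercept, slope)
}
```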

Normal equation
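The normal equation gives the minimizer of J(θ) in closed form, θ = (XᵀX)⁻¹Xᵀy, where X is the design matrix with a leading column of ones and y is the vector of targets. A minimal R sketch (function name illustrative; X'X is assumed non-singular):

```r
# Solve the normal equation directly; solve(A, b) computes A^{-1} b
# without forming the inverse explicitly.
normal_equation <- function(x, y) {
  X <- cbind(1, x)                      # design matrix with intercept column
  drop(solve(t(X) %*% X, t(X) %*% y))   # theta = (X'X)^{-1} X'y
}
```

Unlike gradient descent, this needs no learning rate or iterations, but inverting XᵀX becomes expensive when the number of features is large.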

 

