Interpreting the multivariate linear regression model with R

Reprinted from: http://blog.fens.me/r-multi-linear-regression/

Objective

Following the earlier article on simple linear regression, this article uses R to interpret the multivariate linear regression model. In many practical problems of life and work, more than one factor may affect the dependent variable. Take the conclusion "the higher the level of education, the higher the income": behind it there may be a better family background that paid for the education, a first-tier city offering better job opportunities, or an industry riding a large economic upswing. Interpreting such patterns is complex and multidimensional, and multivariate regression analysis is well suited to it.

As this is not an article by a statistics professional, wherever its description does not match the textbook, please take the textbook as authoritative. The aim of this article is to introduce the knowledge of multivariate linear regression, and its R implementation, in plain language.

Contents

    1. Introduction to multivariate linear regression
    2. Multivariate linear regression modeling
    3. Model optimization
    4. Case: validation with black-series futures daily candlestick data

1. Introduction to multivariate linear regression

In contrast to simple linear regression, multivariate linear regression is a statistical method for determining the relationship between two or more variables. The basic analysis procedure is similar to the single-variable case: first select the multivariate data set and define the mathematical model, then estimate the parameters, run significance tests on the estimated parameters, perform residual analysis and anomaly detection, and finally settle on the regression equation and use the model for prediction.

Since a multivariate regression equation has several independent variables, unlike a single-variable regression equation, one of the most important operations is variable selection: keeping the most significant independent variables and removing the non-significant ones. R provides convenient optimization functions that help us improve the regression model.

The modeling process for multivariate linear regression starts below.

2. Multivariate linear regression modeling

Anyone who has done commodity futures research knows that the black-series varieties are tied together along an industrial chain. Iron ore is the raw material for steelmaking; coking coal and coke are the energy inputs of steelmaking; hot coil (hot-rolled coil) is a steel sheet made by heating and rolling slab; and rebar is ribbed reinforcing bar.

Because of these industrial-chain relationships, if we want to predict the price of rebar, the factors affecting it can be found among its raw materials, energy inputs, and similar products. For example, if the iron ore price rises, the rebar price should tend to rise as well.

2.1 Datasets and mathematical models

Starting with the data: for this data set I chose the black-series commodity futures, including coking coal (JM), coke (J), and iron ore (I) from the Dalian Commodity Exchange, and rebar (RB) and hot coil (HC) from the Shanghai Futures Exchange.

The data set contains the 1-minute bar price data of these 5 futures contracts for the trading session of March 15, 2016.

# The data set already exists in the variable df
> head(df, 20)
                       x1    x2    x3   x4    y
2016-03-15 09:01:00 754.5 616.5 426.5 2215 2055
2016-03-15 09:02:00 752.5 614.5 423.5 2206 2048
2016-03-15 09:03:00 753.0 614.0 423.0 2199 2044
2016-03-15 09:04:00 752.5 613.0 422.5 2197 2040
2016-03-15 09:05:00 753.0 615.5 424.0 2198 2043
2016-03-15 09:06:00 752.5 614.5 422.0 2195 2040
2016-03-15 09:07:00 752.0 614.0 421.5 2193 2036
2016-03-15 09:08:00 753.0 615.0 422.5 2197 2043
2016-03-15 09:09:00 754.0 615.5 422.5 2197 2041
2016-03-15 09:10:00 754.5 615.5 423.0 2200 2044
2016-03-15 09:11:00 757.0 616.5 423.0 2201 2045
2016-03-15 09:12:00 756.0 615.5 423.0 2200 2044
2016-03-15 09:13:00 755.5 615.0 423.0 2197 2042
2016-03-15 09:14:00 755.5 615.0 423.0 2196 2042
2016-03-15 09:15:00 756.0 616.0 423.5 2200 2045
2016-03-15 09:16:00 757.5 616.0 424.0 2205 2052
2016-03-15 09:17:00 758.5 618.0 424.0 2204 2051
2016-03-15 09:18:00 759.5 618.5 424.0 2205 2053
2016-03-15 09:19:00 759.5 617.5 424.5 2206 2053
2016-03-15 09:20:00 758.5 617.5 423.5 2201 2050

The dataset includes 6 columns:

    • index, the time of each bar
    • x1, 1-minute bar price data of the coke (j1605) contract
    • x2, 1-minute bar price data of the coking coal (jm1605) contract
    • x3, 1-minute bar price data of the iron ore (i1605) contract
    • x4, 1-minute bar price data of the hot coil (hc1605) contract
    • y, 1-minute bar price data of the rebar (rb1605) contract
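
For readers who want to reproduce a similar setup, below is a minimal sketch of how such a minute-bar dataset might be assembled with the xts package. The file names, column layout, and helper function are hypothetical; the original article does not show this step.

# Hypothetical reconstruction of the df dataset (not shown in the original article).
# Assumes one CSV per contract with "time" and "close" columns; adjust to your data source.
library(xts)

load_close <- function(file) {
  d <- read.csv(file, stringsAsFactors = FALSE)
  xts(d$close, order.by = as.POSIXct(d$time))
}

x1 <- load_close("j1605_1min.csv")    # coke
x2 <- load_close("jm1605_1min.csv")   # coking coal
x3 <- load_close("i1605_1min.csv")    # iron ore
x4 <- load_close("hc1605_1min.csv")   # hot coil
y  <- load_close("rb1605_1min.csv")   # rebar

df <- na.omit(merge(x1, x2, x3, x4, y))
names(df) <- c("x1", "x2", "x3", "x4", "y")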

Assuming that the price of rebar is linearly related to the prices of the other 4 commodities, we establish a multivariate linear regression model with rebar as the dependent variable and coke, coking coal, iron ore, and hot coil as the independent variables. Expressed as a formula:

y = a + b * x1 + c * x2 + d * x3 + e * x4 + ε
    • y, the dependent variable, rebar
    • x1, an independent variable, coke
    • x2, an independent variable, coking coal
    • x3, an independent variable, iron ore
    • x4, an independent variable, hot coil
    • a, the intercept
    • b, c, d, e, the coefficients of the independent variables
    • ε, the residual, which sums up all other unobservable factors affecting y. It is assumed that ε follows the normal distribution N(0, σ²).

Having defined the mathematical model of multivariate linear regression, we next use the data set to estimate the parameters of the regression model.

2.2 Regression parameter estimation

In the formula above, the regression parameters a, b, c, d, e are unknown to us; parameter estimation uses the data to estimate these parameters and thereby determine the relationship between the independent variables and the dependent variable. Our goal is to find the line that minimizes the sum of squared differences between the actual and predicted y values, i.e. to minimize (y1 actual − y1 predicted)² + (y2 actual − y2 predicted)² + ... + (yn actual − yn predicted)². In parameter estimation we only consider the part of y that varies linearly with the independent variables; the residual ε is unobservable, and the estimation method does not need to model it.
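
To make the least-squares criterion concrete, here is a small illustrative sketch (my addition) that computes the estimates directly from the normal equations beta = (X'X)^(-1) X'y. It should reproduce the coefficients that lm() reports in the next step, although lm() itself uses a numerically safer QR decomposition rather than this explicit matrix solve.

# Least-squares estimates via the normal equations (illustration only;
# lm() uses a QR decomposition internally).
m  <- as.matrix(as.data.frame(df))                       # columns x1, x2, x3, x4, y
X  <- cbind("(Intercept)" = 1, m[, c("x1", "x2", "x3", "x4")])
yv <- m[, "y"]
beta <- solve(t(X) %*% X, t(X) %*% yv)                   # (X'X)^(-1) X'y
beta                                                     # should match the lm() coefficients below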

As with simple linear regression, we use R to estimate the parameters of the regression model from the data, using the lm() function to carry out the multivariate linear regression modeling.

# Build the multivariate linear regression model
> lm1 <- lm(y~x1+x2+x3+x4, data=df)

# Print the parameter estimation results
> lm1

Call:
lm(formula = y ~ x1 + x2 + x3 + x4, data = df)

Coefficients:
(Intercept)           x1           x2           x3           x4
   212.8780       0.8542       0.6672      -0.6674       0.4821

This gives us the equation relating y to the x variables:

y = 212.8780 + 0.8542 * x1 + 0.6672 * x2 - 0.6674 * x3 + 0.4821 * x4

2.3 Significance tests of the regression equation

As with simple linear regression, the significance of the multivariate linear regression is checked with the t-test, the F-test, and the R² (R-squared) measure of correlation. All three tests are already implemented in R; we only need to interpret the results, which can be extracted from the model with the summary() function.

> summary(lm1)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-4.9648 -1.3241 -0.0319  1.2403  5.4194

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 212.87796   58.26788   3.653 0.000323 ***
x1            0.85423    0.10958   7.795 2.50e-13 ***
x2            0.66724    0.12938   5.157 5.57e-07 ***
x3           -0.66741    0.15421  -4.328 2.28e-05 ***
x4            0.48214    0.01959  24.609  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.028 on 221 degrees of freedom
Multiple R-squared:  0.9725,  Adjusted R-squared:  0.972
F-statistic:  1956 on 4 and 221 DF,  p-value: < 2.2e-16
    • t-test: every independent variable is highly significant (***)
    • F-test: also highly significant, p-value < 2.2e-16
    • Adjusted R²: 0.972, a very strong correlation
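
If you want these statistics programmatically rather than reading them off the printout, the summary object exposes them directly; a short sketch using standard summary.lm fields:

# Extract the test statistics from the fitted model
s <- summary(lm1)
s$coefficients      # estimates, standard errors, t values, and p-values
s$adj.r.squared     # adjusted R-squared, 0.972 here
s$fstatistic        # F value with its two degrees of freedom
# summary() does not store the F-test p-value directly; compute it:
pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3], lower.tail = FALSE)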

Finally, having passed the tests of the regression parameters and of the regression equation, we obtain the final multivariate linear regression equation:

y = 212.87796 + 0.85423 * x1 + 0.66724 * x2 - 0.66741 * x3 + 0.48214 * x4

that is, rebar = 212.87796 + 0.85423 * coke + 0.66724 * coking coal - 0.66741 * iron ore + 0.48214 * hot coil

2.4 Residual analysis and anomaly detection

After the significance tests of the regression model, residual analysis (on the differences between predicted and actual values) is done to verify the correctness of the model; the residuals should follow the normal distribution N(0, σ²). The plot() function generates 4 diagnostic graphs for the model, which we inspect visually.

> par(mfrow=c(2,2))
> plot(lm1)

    • Residuals vs Fitted (top left): the points are distributed evenly and randomly on both sides of y=0, and the red line is a smooth curve with no obvious shape, as desired.
    • Normal Q-Q plot of the residuals (top right): the points lie close to the diagonal reference line, tending to a straight line, so the residuals visually satisfy the normality assumption.
    • Scale-Location, the square root of the standardized residuals vs fitted values (bottom left): the points are scattered randomly and the red line is a smooth curve with no obvious trend.
    • Standardized residuals vs leverage (bottom right): no red Cook's distance contour appears, indicating that there are no outliers that strongly influence the regression results.

In conclusion, there are no obvious anomalies, and the residuals satisfy the model assumptions.
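
Beyond the visual diagnostics, the normality assumption can also be checked with a formal test. A minimal sketch using the Shapiro-Wilk test from base R (my addition; not part of the original article's workflow):

# Formal normality check of the residuals (supplementary to plot(lm1))
res <- residuals(lm1)
shapiro.test(res)   # H0: the residuals are normally distributed;
                    # a large p-value means normality cannot be rejected
mean(res)           # should be close to 0 for a least-squares fit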

2.5 Model prediction

Now that we have the multivariate linear regression equation, we can make predictions from data. We use R's predict() function to calculate the predicted value y0 and the corresponding prediction interval, and then visualize the actual and predicted values together.

> par(mfrow=c(1,1))  # set the plot layout

# Compute the predictions
> dfp <- predict(lm1, interval="prediction")

# Print the predicted values
> head(dfp, 10)
                 fit      lwr      upr
2014-03-21 3160.526 3046.425 3274.626
2014-03-24 3193.253 3078.868 3307.637
2014-03-25 3240.389 3126.171 3354.607
2014-03-26 3228.565 3114.420 3342.710
2014-03-27 3222.528 3108.342 3336.713
2014-03-28 3262.399 3148.132 3376.666
2014-03-31 3291.996 3177.648 3406.344
2014-04-01 3305.870 3191.447 3420.294
2014-04-02 3275.370 3161.018 3389.723
2014-04-03 3297.358 3182.960 3411.755

# Merge the data
> mdf <- merge(df$y, dfp)

# Plot
> draw(mdf)

Legend Description

    • y, the actual price, red line
    • fit, the predicted price, green line
    • lwr, the lower bound of the prediction interval, blue line
    • upr, the upper bound of the prediction interval, purple line

As we can see, the actual price y and the predicted price fit are very close most of the time. Our first model is trained!
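
Note that draw() above is the author's own plotting helper, not a base R function. A minimal sketch of what such a helper might look like, assuming mdf is an xts/zoo object with columns y, fit, lwr, upr and using the colors from the legend:

# Hypothetical reconstruction of the author's draw() helper
library(zoo)

draw <- function(mdf) {
  cols <- c("red", "green", "blue", "purple")          # y, fit, lwr, upr
  plot(as.zoo(mdf), plot.type = "single", col = cols,
       xlab = "Time", ylab = "Price")
  legend("topleft", legend = colnames(mdf), lty = 1, col = cols)
}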

3. Model optimization

In the section above we already obtained a model that fits very well. For model optimization, R's update() function can be used to adjust the model. Let's first examine the relationship between each independent variable x1, x2, x3, x4 and the dependent variable y.

pairs(as.data.frame(df))
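
As a numeric complement to the scatterplot matrix (my addition, not in the original article), the pairwise correlations can also be computed directly:

# Pairwise linear correlations between x1..x4 and y
round(cor(as.data.frame(df)), 3)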

From the scatterplot matrix we can see that the relationship between x2 and y deviates the most from a straight line, so we try adjusting the multivariate linear regression model by removing the x2 variable from the original model.

# Adjust the model
> lm2 <- update(lm1, .~. -x2)
> summary(lm2)

Call:
lm(formula = y ~ x1 + x3 + x4, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-6.0039 -1.3842  0.0177  1.3513  4.8028

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 462.47104   34.26636   13.50  < 2e-16 ***
x1            1.08728    0.10543   10.31  < 2e-16 ***
x3           -0.40788    0.15394   -2.65  0.00864 **
x4            0.42582    0.01718   24.79  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.142 on 222 degrees of freedom
Multiple R-squared:  0.9692,  Adjusted R-squared:  0.9688
F-statistic:  2330 on 3 and 222 DF,  p-value: < 2.2e-16

After removing x2, the p-value of x3's t-test becomes larger (x3 is now less significant), while the adjusted R-squared becomes smaller, so this adjustment makes the model worse.
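
This judgment can be confirmed formally with a partial F-test comparing the two nested models (my addition; anova() on nested lm fits is standard R):

# Does adding x2 back significantly improve the fit over lm2?
anova(lm2, lm1)   # a small p-value says x2 belongs in the model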

Analyzing the internal logic of production and raw materials, coking coal and coke have an upstream-downstream relationship: coking coal is a raw material for producing coke, which is formed from coking coal and other inputs in the coking process. Producing 1 ton of coke generally requires about 1.33 tons of coking coal, with coking coal accounting for at least 30% of the cost of coke.

To express this relationship between coking coal and coke, we add an interaction term x1*x2 to the model and see the effect.

# Adjust the model
> lm3 <- update(lm1, .~. + x1*x2)
> summary(lm3)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + x1:x2, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-4.8110 -1.3501 -0.0595  1.2019  5.3884

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 7160.32231 7814.50048   0.916    0.361
x1            -8.45530   10.47167  -0.807    0.420
x2           -10.58406   12.65579  -0.836    0.404
x3            -0.64344    0.15662  -4.108 5.63e-05 ***
x4             0.48363    0.01967  24.584  < 2e-16 ***
x1:x2          0.01505    0.01693   0.889    0.375
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.029 on 220 degrees of freedom
Multiple R-squared:  0.9726,  Adjusted R-squared:  0.972
F-statistic:  1563 on 5 and 220 DF,  p-value: < 2.2e-16

The results show that after adding the x1:x2 interaction column, the t-tests of the original x1, x2, and the intercept are no longer significant. We continue adjusting the model by removing the x1 and x2 main-effect terms.

# Adjust the model
> lm4 <- update(lm3, .~. -x1-x2)
> summary(lm4)

Call:
lm(formula = y ~ x3 + x4 + x1:x2, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-4.9027 -1.2516 -0.0167  1.2748  5.8683

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.950e+02  1.609e+01  43.183  < 2e-16 ***
x3          -6.284e-01  1.530e-01  -4.108 5.61e-05 ***
x4           4.959e-01  1.785e-02  27.783  < 2e-16 ***
x1:x2        1.133e-03  9.524e-05  11.897  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.035 on 222 degrees of freedom
Multiple R-squared:  0.9722,  Adjusted R-squared:  0.9718
F-statistic:  2588 on 3 and 222 DF,  p-value: < 2.2e-16

Judging from the adjusted results, the fit is good; however, it is still no improvement over the initial model.

When adjusting a model by hand like this, we generally rely on business knowledge to decide what to try. If we want the adjustment driven by data metrics instead, we can use the stepwise regression method provided in R, which uses the AIC indicator to decide whether a parameter change improves the model.

# Stepwise regression on the lm1 model
> step(lm1)
Start:  AIC=324.51
y ~ x1 + x2 + x3 + x4

       Df Sum of Sq    RSS    AIC
<none>              908.8 324.51
- x3    1     77.03  985.9 340.90
- x2    1    109.37 1018.2 348.19
- x1    1    249.90 1158.8 377.41
- x4    1   2490.56 3399.4 620.65

Call:
lm(formula = y ~ x1 + x2 + x3 + x4, data = df)

Coefficients:
(Intercept)           x1           x2           x3           x4
   212.8780       0.8542       0.6672      -0.6674       0.4821

From the AIC calculation, the lm1 model starts with an AIC of 324.51, and removing any one of the independent variables makes the AIC larger, so it is best not to adjust lm1 at all.

Next, run stepwise regression on the lm3 model.

# Stepwise regression on the lm3 model
> step(lm3)
Start:  AIC=325.7                    # current AIC
y ~ x1 + x2 + x3 + x4 + x1:x2

        Df Sum of Sq    RSS    AIC
- x1:x2  1      3.25  908.8 324.51
<none>               905.6 325.70
- x3     1     69.47  975.1 340.41
- x4     1   2487.86 3393.5 622.25

Step:  AIC=324.51                    # AIC after dropping the x1:x2 term
y ~ x1 + x2 + x3 + x4

       Df Sum of Sq    RSS    AIC
<none>              908.8 324.51
- x3    1     77.03  985.9 340.90
- x2    1    109.37 1018.2 348.19
- x1    1    249.90 1158.8 377.41
- x4    1   2490.56 3399.4 620.65

Call:
lm(formula = y ~ x1 + x2 + x3 + x4, data = df)

Coefficients:
(Intercept)           x1           x2           x3           x4
   212.8780       0.8542       0.6672      -0.6674       0.4821

Judged by AIC, the value is smallest after removing the x1:x2 term, and the final result tells us that the original model is the best.
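
As a cross-check, the four candidate models can also be compared side by side with AIC(). Note that step() reports extractAIC() values, which differ from AIC() by an additive constant, so only the ranking, not the numbers, should be compared:

# Lower AIC is better; the ranking should agree with step()'s conclusion
AIC(lm1, lm2, lm3, lm4)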

4. Case: validation with black-series futures daily candlestick data

Finally, we validate the multivariate regression relationship using the daily candlestick data of the same 5 futures contracts.

> lm9 <- lm(y~x1+x2+x3+x4, data=df)  # daily candlestick data
> summary(lm9)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4, data = df)

Residuals:
     Min       1Q   Median       3Q      Max
-173.338  -37.470    3.465   32.158  178.982

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 386.33482   31.07729  12.431  < 2e-16 ***
x1            0.75871    0.07554  10.045  < 2e-16 ***
x2           -0.62907    0.14715  -4.275 2.24e-05 ***
x3            1.16070    0.05224  22.219  < 2e-16 ***
x4            0.46461    0.02168  21.427  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 57.78 on 565 degrees of freedom
Multiple R-squared:  0.9844,  Adjusted R-squared:  0.9843
F-statistic:  8906 on 4 and 565 DF,  p-value: < 2.2e-16

Basic statistics for the dataset:

> summary(df)
     Index                           x1               x2
 Min.   :2014-03-21 00:00:00   Min.   : 606.5   Min.   :494.0
 1st Qu.:2014-10-21 06:00:00   1st Qu.: 803.5   1st Qu.:613.1
 Median :2015-05-20 12:00:00   Median : 939.0   Median :705.8
 Mean   :2015-05-21 08:02:31   Mean   : 936.1   Mean   :695.3
 3rd Qu.:2015-12-16 18:00:00   3rd Qu.:1075.0   3rd Qu.:773.0
 Max.   :2016-07-25 00:00:00   Max.   :1280.0   Max.   :898.0
       x3              x4             y
 Min.   :284.0   Min.   :1691   Min.   :1626
 1st Qu.:374.1   1st Qu.:2084   1st Qu.:2012
 Median :434.0   Median :2503   Median :2378
 Mean   :476.5   Mean   :2545   Mean   :2395
 3rd Qu.:545.8   3rd Qu.:2916   3rd Qu.:2592
 Max.   :825.0   Max.   :3480   Max.   :3414

For the daily candlestick data, the 5 black-series varieties likewise show a very strong correlation, so we can consider applying this conclusion in actual trading.
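
Before putting the relationship into actual trading, it would be prudent to check that it holds out of sample. A minimal sketch of a train/test split on the daily data (my addition; the 80/20 split is an arbitrary choice):

# Fit on the first 80% of days, evaluate on the remaining 20%
d     <- as.data.frame(df)                    # daily candlestick dataset
n     <- nrow(d)
k     <- floor(0.8 * n)
train <- d[1:k, ]
test  <- d[(k + 1):n, ]

fit  <- lm(y ~ x1 + x2 + x3 + x4, data = train)
pred <- predict(fit, newdata = test)
sqrt(mean((test$y - pred)^2))                 # out-of-sample RMSE, in price points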
