Linear regression models rest on four conditions: 1. linearity, 2. no autocorrelation, 3. normally distributed residuals, 4. homogeneity of variance. When the data fail to meet these conditions, we either transform the data so that they satisfy the conditions of linear regression, or adjust the model to fit the original data. In short, it is a process of adapting data and model to each other. Below are some ways to handle cases in which these conditions are not satisfied:
First, the nonlinear case
One of the most important prerequisites for a linear regression model is that the data follow a linear trend. This can be judged from a scatter plot, or from a residual plot after fitting. When the data do not follow a linear trend, there are two ways to handle it.
1. Variable linearization
The basic idea is to determine the function type first and then apply a variable transformation that turns the relationship into a linear one. The function type can be determined from domain knowledge, the literature, or experience, or by inspecting the scatter plot. When more than one type is plausible, fit a regression for each candidate and select the best one. Once the function type has been determined, the variables can be transformed, for example:
Some common function types and their transformations:
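A few textbook linearizations are listed below as an illustration. They are standard forms rather than a definitive list, and the symbols a, b, x', and y' are notation assumed here.

```latex
\begin{array}{lll}
\text{Power:}       & y = a x^{b}                   & \ln y = \ln a + b \ln x \\
\text{Exponential:} & y = a e^{bx}                  & \ln y = \ln a + b x \\
\text{Logarithmic:} & y = a + b \ln x               & y = a + b x', \quad x' = \ln x \\
\text{Hyperbolic:}  & \dfrac{1}{y} = a + \dfrac{b}{x} & y' = a + b x', \quad y' = 1/y, \; x' = 1/x
\end{array}
```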
Variable transformation is very convenient when the functional relationship is explicit, but its drawbacks are obvious:
1. Not all function types can be linearized by a transformation.
2. When there is more than one candidate type, choosing among them is cumbersome. If the wrong type is chosen, the transformation and the regression still run as usual, but the final result is wrong.
3. Linear regression is fitted by least squares. After the transformation, the fit minimizes the residual sum of squares of the transformed values, so the resulting equation is not necessarily optimal for the original values.
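The third drawback can be made concrete with a small sketch. It assumes an exponential relationship y = a * exp(b * x) and synthetic data (both chosen here for illustration), fits the log-transformed model with NumPy, and then evaluates the residual sum of squares on the original scale, which is not the quantity the transformed fit minimizes.

```python
import numpy as np

# Synthetic data from y = a * exp(b * x) with multiplicative noise (assumed for illustration).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 50)
a_true, b_true = 2.0, 0.8
y = a_true * np.exp(b_true * x) * np.exp(rng.normal(0.0, 0.1, x.size))

# Linearize: ln y = ln a + b x, then fit a straight line by least squares.
# np.polyfit returns the slope first, then the intercept.
b_hat, ln_a_hat = np.polyfit(x, np.log(y), 1)
a_hat = np.exp(ln_a_hat)

# Residual sum of squares on the original scale. The fit above minimizes
# the squared residuals of ln y, not this quantity.
sse_original = float(np.sum((y - a_hat * np.exp(b_hat * x)) ** 2))
print(a_hat, b_hat, sse_original)
```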
2. Nonlinear regression model
Given the limitations of variable linearization, nonlinear regression can be used directly. Parameter estimation in nonlinear regression is similar to that in linear regression: an error function (or loss function) is specified first, then it is minimized, and the parameter values at the minimum are taken as the estimates.
Since the equation is nonlinear, the least squares estimates cannot be computed directly in closed form. They are usually obtained with the Gauss-Newton method: expand the nonlinear equation in a Taylor series around an initial value so that it is approximately linear there, then estimate the parameters by least squares. Substitute the estimates back into the equation, expand the Taylor series again, and re-estimate the linearized equation by least squares. Repeat until the parameter estimates converge.
Note that the initial values have a significant effect on the regression results, so choose them carefully.
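The iteration described above can be sketched as follows. This is a minimal illustration, not a production solver: the model y = a * exp(b * x), the function name `gauss_newton`, the noise-free data, and the starting values are all assumptions made for the example.

```python
import numpy as np

def gauss_newton(x, y, theta0, n_iter=50, tol=1e-10):
    """Fit y = a * exp(b * x) by Gauss-Newton; theta = (a, b)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        a, b = theta
        pred = a * np.exp(b * x)
        resid = y - pred
        # Jacobian of the model with respect to (a, b): the first-order Taylor term.
        J = np.column_stack([np.exp(b * x), a * x * np.exp(b * x)])
        # Least squares solution of the linearized problem J @ step = resid.
        step, *_ = np.linalg.lstsq(J, resid, rcond=None)
        theta = theta + step
        if np.linalg.norm(step) < tol:
            break
    return theta

x = np.linspace(0.0, 2.0, 30)
y = 2.0 * np.exp(0.8 * x)   # noise-free data generated with a = 2.0, b = 0.8
theta_hat = gauss_newton(x, y, theta0=(1.5, 0.6))
print(theta_hat)
```

A starting point far from the solution can make the plain iteration overshoot or diverge, which is why practical implementations usually add step-size control.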
=======================================================
Second, the case of non-homogeneous variance
Homogeneity of variance is also one of the basic assumptions of the linear model: the variance of the random error is a fixed value throughout the analysis and does not change with the independent variable. In practice, however, the variance is often not a constant but varies. This is the case of non-homogeneous variance, also known as heteroscedasticity.
First, heteroscedasticity itself
Heteroscedasticity appears mainly in cross-sectional data, and its causes are as follows:
1. The measurement error of the explained (dependent) variable varies with time
2. Some explanatory variables have been omitted
3. The mathematical form of the model is misspecified
4. The data are grouped data
5. Certain patterns of economic behavior
Heteroscedasticity can be categorized into three types:
1. Monotonically increasing: the variance increases as x increases
2. Monotonically decreasing: the variance decreases as x increases
3. Complex: no obvious pattern
When heteroscedasticity is present, the following problems occur if ordinary least squares is still used for parameter estimation:
1. The parameter estimates are not efficient
Although they are unbiased, they are not efficient, because the proof of efficiency assumes that the variance is a fixed value.
2. The significance tests of the variables lose their meaning
The t statistic constructed for the variable significance test is also built on the assumption of constant variance, and the same is true of the other tests.
3. Model prediction fails
The confidence interval of the predicted value also contains the estimated variance of the parameters. Since that value is inaccurate, the confidence interval is inaccurate as well.
In short, when heteroscedasticity occurs, any inference based on constant variance is invalidated, the prediction error of Y becomes larger, and prediction accuracy is reduced.
Methods for detecting heteroscedasticity:
Heteroscedasticity means that the variance of the random error differs across observations, so it can be detected by examining directly the relationship between the variance of the random error term and the observed values of the explanatory variable.
1. Graphical method
Make a scatter plot of the squared residuals (an estimate of the variance of the random error term) against x and check whether it follows a straight line with a slope of 0.
2. Park-Glejser test
Establish a regression equation between the residuals, which stand in for the random error, and the observed values of the explanatory variable. In the Glejser form the absolute residual $|e_i|$ is regressed on a function $f(x_i)$ with coefficient $\alpha$. In the Park form $\ln e_i^2$ is regressed on $\ln x_i$.
Choose different functional forms of the variable x, estimate the equation, and test it for significance. If some functional form makes the equation significant, the original model is heteroscedastic. Common functional forms include $x$, $\sqrt{x}$, and $1/x$.
If α is statistically significant, heteroscedasticity is present.
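The auxiliary regression can be sketched as follows. The Glejser-type form |e| = a0 + alpha * x + v, the helper name `ols`, and the synthetic data with error standard deviation proportional to x are all assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

def ols(X, y):
    """Return OLS coefficients, residuals, and coefficient standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    cov = (resid @ resid / dof) * np.linalg.inv(X.T @ X)
    return beta, resid, np.sqrt(np.diag(cov))

rng = np.random.default_rng(1)
n = 200
x = np.linspace(1.0, 10.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5 * x)   # error s.d. grows with x

# Original regression and its residuals.
X = np.column_stack([np.ones(n), x])
_, e, _ = ols(X, y)

# Auxiliary regression: |e| = a0 + alpha * f(x) + v, here with f(x) = x.
Z = np.column_stack([np.ones(n), x])
coef, _, se = ols(Z, np.abs(e))
t_alpha = coef[1] / se[1]
p_alpha = 2 * stats.t.sf(abs(t_alpha), df=n - 2)   # two-sided p-value for alpha
print(t_alpha, p_alpha)
```

A small p-value for alpha suggests that the spread of the residuals depends on x. Other choices of f(x) can be tried by changing the second column of `Z`.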
3. Goldfeld-Quandt test (G-Q test)
This test is based on the F-test. The sample is sorted by the observed values of the explanatory variable and divided into two parts, a regression is run on sample 1 and on sample 2, and a statistic formed from the residual sums of squares of the two samples is used for the test. The statistic follows an F distribution: with increasing variance F is greater than 1, with decreasing variance F is less than 1, and with equal variance F is approximately 1.
Step 1: Sort the n observations by the size of the observed values
Step 2: Remove c = n/4 observations from the middle of the sorted sequence and divide the remaining observations into two sub-samples of equal size, one with the smaller values and one with the larger values, each with a sample size of (n-c)/2
Step 3: Run an ordinary least squares regression on each sub-sample and calculate its residual sum of squares
Step 4: Take homoscedasticity as the null hypothesis and construct the F statistic as the ratio of the residual sum of squares of the larger-value sub-sample to that of the smaller-value sub-sample, each divided by its degrees of freedom
At a given significance level α, determine the critical value Fα. If F > Fα, reject the null hypothesis, which indicates that heteroscedasticity exists
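The four steps above can be sketched as follows for a model with one explanatory variable. The function names `rss` and `goldfeld_quandt`, the one-sided alternative of increasing variance, and the synthetic data are assumptions made for this illustration.

```python
import numpy as np
from scipy import stats

def rss(x, y):
    """Residual sum of squares from an OLS fit of y on an intercept and x."""
    X = np.column_stack([np.ones(x.size), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def goldfeld_quandt(x, y, drop_frac=0.25):
    order = np.argsort(x)          # Step 1: sort by the explanatory variable
    x, y = x[order], y[order]
    n = x.size
    c = int(n * drop_frac)         # Step 2: leave out about c middle observations
    m = (n - c) // 2               # size of each sub-sample
    rss1 = rss(x[:m], y[:m])       # Step 3: sub-sample with the smaller x values
    rss2 = rss(x[-m:], y[-m:])     #         sub-sample with the larger x values
    df = m - 2                     # two estimated parameters: intercept and slope
    F = (rss2 / df) / (rss1 / df)  # Step 4: ratio of the two error variance estimates
    p = stats.f.sf(F, df, df)      # one-sided p-value against increasing variance
    return F, p

rng = np.random.default_rng(2)
x = rng.uniform(1.0, 10.0, 200)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5 * x)   # error s.d. grows with x
F, p = goldfeld_quandt(x, y)
print(F, p)
```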
4. White test
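First, estimate the model by ordinary least squares and obtain the squared residuals. Then run an auxiliary regression of the squared residuals on the explanatory variables, their squares, and their cross-products.
Under homoscedasticity, $nR^2 \overset{a}{\sim} \chi^2(h)$.
Here n is the sample size, $R^2$ is the coefficient of determination of the auxiliary regression of the squared residuals, h is the number of explanatory variables in the auxiliary regression, and the symbol in the middle denotes the asymptotic distribution.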
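A minimal sketch of the White test for a model with a single explanatory variable follows. With one regressor the auxiliary regression contains only x and x squared, so there are no cross-products. The function name `white_test` and the synthetic data are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def white_test(x, y):
    n = x.size
    # Original OLS regression and its squared residuals.
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e2 = (y - X @ beta) ** 2
    # Auxiliary regression of e^2 on x and x^2.
    Z = np.column_stack([np.ones(n), x, x ** 2])
    gamma, *_ = np.linalg.lstsq(Z, e2, rcond=None)
    fitted = Z @ gamma
    r2 = 1.0 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)
    h = Z.shape[1] - 1                       # regressors in the auxiliary regression
    stat = n * r2                            # asymptotically chi-square with h d.o.f.
    return stat, stats.chi2.sf(stat, h)

rng = np.random.default_rng(4)
x = rng.uniform(1.0, 10.0, 200)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5 * x)   # error s.d. grows with x
stat, p = white_test(x, y)
print(stat, p)
```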
Correcting heteroscedasticity
As mentioned earlier, ordinary least squares is not advisable when heteroscedasticity occurs, and weighted least squares (WLS) can be used for estimation instead. Weighted least squares applies weights to the original model: observations with smaller squared residuals are given larger weights, and observations with larger squared residuals are given smaller weights. This removes the heteroscedasticity, after which ordinary least squares is used to estimate the parameters. For example:
Suppose a model has been tested and the variance of its error term is found to be proportional to a known function of x.
The variance of the model then varies with x, so heteroscedasticity is present. We therefore divide the model by the square root of that function to obtain a new model whose error variance is constant. The reciprocal of that square root is called the weight function.
Multiplying the original model by the weights and then applying ordinary least squares is weighted least squares. The estimates obtained this way are unbiased and efficient, but only for the theoretically optimal weight function. In actual analysis the optimal weight function is often unknown and can only be assumed on the basis of experience, so in practice weighted least squares estimation is only approximately unbiased and efficient.
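The weighting step can be sketched as follows. The example assumes that the error variance is proportional to x squared, so the weight is 1 / x. The data and the variance function are synthetic assumptions, and in real analyses the weight function usually has to be estimated or guessed.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = np.linspace(1.0, 10.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5 * x)   # Var(error_i) = 0.25 * x_i**2

X = np.column_stack([np.ones(n), x])
w = 1.0 / x                                    # weight = 1 / sqrt(x_i**2)

# Multiply both sides of the model by the weight, then run ordinary least squares.
beta_wls, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)

# Unweighted fit for comparison.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_ols, beta_wls)
```

Both estimators target the same coefficients. The weighted fit gives less influence to the noisy observations at large x.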
In short, weighted least squares is a common method for handling heteroscedasticity, but it is not the only one. The choice of weight function is very important because it directly affects the results of the analysis. There is no uniform standard or method for determining the weight function: each model has its own optimal weight function, and sometimes several weight functions must be compared to choose the best one. Some statistical software, such as SPSS, has a default weight function. When the weight function has to be determined manually, one proposal is to fit a parabola relating the squared residuals from ordinary least squares to the variable x and to use it as the weight function. Iterative algorithms for weighting have also been suggested, and these need further discussion.
Handling violations of the linear regression assumptions