A Review of Regression Prediction and R Language Realization, Part 1: Regression Basics


Part 1: Regression Basics


There are many kinds of regression methods; the most common are linear regression (both simple and multivariate), polynomial regression, and nonlinear regression. In addition, we will briefly explain methods for testing the predicted results.

Linear regression

Simple (unary) linear regression is the simplest and most common regression model, similar to the linear equation in one unknown from junior-high mathematics. Its basic model is as follows:

$$y = b_0 + b_1 x + u$$

Our common simple linear regression equation generally omits the last term, and in practical applications we likewise ignore it. The practical meaning of the last term $u$ is that it stands for all factors other than the independent variable $x$ that affect the dependent variable $y$. When applying regression for prediction, we assume that $u$ is a random variable with mean zero and constant variance, and that the $u_i$ for different observations are independent of each other and independent of the independent variable $x$.
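As a minimal sketch of how such a model is fitted in R (with simulated data standing in for real observations; the variable names are illustrative):

```r
# Simulated data for illustration: y depends linearly on x plus noise u
set.seed(42)
x <- 1:20
y <- 3 + 0.8 * x + rnorm(20, mean = 0, sd = 1)  # true b0 = 3, b1 = 0.8

fit <- lm(y ~ x)  # ordinary least squares fit of y = b0 + b1*x + u
coef(fit)         # estimated intercept b0 and slope b1
```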

Multivariate linear regression, analogous to a linear equation in several unknowns, refers to the case where two or more independent variables each have a linear effect on the dependent variable $y$. Whether the effect is truly linear is not actually known; "linear" here simply names the assumed form of the relationship. The multivariate linear regression model is as follows:

$$y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n + u$$

Here $x_1, \ldots, x_n$ are the $n$ independent variables that have an effect on the dependent variable $y$. Binary and ternary linear regression are the most common in practical applications; because the relationships among many variables are complicated yet are simply defined as linear, the error may be larger when the model is used for prediction.
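A corresponding sketch for two independent variables, again with simulated data:

```r
# Hypothetical data with two independent variables
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)  # fits y = b0 + b1*x1 + b2*x2 + u
coef(fit)               # estimated b0, b1, b2
```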

When using linear regression for prediction, we need to estimate the parameters from the observed data. The common methods for estimating the $b$ parameters are the least squares method and maximum likelihood estimation.

In simple terms, the least squares method requires that the estimates fit the observed values well, by minimizing the sum of squared differences between the estimates and the observations. Maximum likelihood estimation is based on the idea that the event which actually occurred is the one that was most likely to occur. Taking simple linear regression as an example, we show how the two methods estimate the parameters.

Least squares

Based on the idea of least squares, the sum of squared differences between the estimates and the observed values is minimized, i.e., the following quantity attains its minimum:

$$Q = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$$

According to the calculus principle for finding extrema, we only need to take the partial derivatives with respect to $b_0$ and $b_1$ separately and set them equal to 0 to obtain the minimum. The resulting estimates are:

$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$
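These closed-form estimates can be checked against R's lm() on simulated data; the sketch below assumes nothing beyond base R:

```r
# Verify the closed-form least squares estimates against lm()
set.seed(7)
x <- runif(30, 0, 10)
y <- 5 + 1.5 * x + rnorm(30)

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0 = b0, b1 = b1)
coef(lm(y ~ x))  # identical estimates
```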
Maximum likelihood estimation method

For simple linear regression, the least squares method above is the one commonly used; under the assumption of normally distributed errors, maximum likelihood estimation yields the same estimates, so its derivation is not described in detail here. For reference, see http://blog.csdn.net/ppn029012/article/details/8908104.

Polynomial regression

Polynomial regression, in simple terms, models the dependent variable $y$ as a polynomial in the independent variable $x$. Its model is as follows:

$$y = b_0 + b_1 x + b_2 x^2 + \cdots + b_k x^k + u$$
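A sketch of fitting a quadratic polynomial in R with simulated data; poly() with raw = TRUE keeps plain powers of $x$, so the coefficients line up with $b_0, b_1, b_2$:

```r
# Quadratic polynomial regression: y = b0 + b1*x + b2*x^2 + u
set.seed(3)
x <- seq(-3, 3, length.out = 40)
y <- 1 - 2 * x + 0.5 * x^2 + rnorm(40, sd = 0.5)

fit <- lm(y ~ poly(x, 2, raw = TRUE))  # raw = TRUE: use x and x^2 directly
coef(fit)                              # estimates of b0, b1, b2
```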

Nonlinear regression

In real life, many problems are not simple linear relations; in such cases we must choose a suitable curve to describe the actual problem. The polynomial regression above is one kind of nonlinear relationship. Below we introduce several common nonlinear regression relationships (the curves can be plotted with http://fooplot.com/):

1. Power function

The power model is $y = a x^b$. When $b > 0$, typical curves are those with $a=1, b=0.5$; $a=1, b=1$; and $a=1, b=2$. When $b < 0$, typical curves are those with $a=1, b=-0.5$; $a=1, b=-1$; and $a=1, b=-2$.
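One common way to fit the power model, assuming all data are positive, is to linearize it as $\log y = \log a + b \log x$ and fit with lm(). This is a sketch with simulated data, not the only possible approach:

```r
# Fit y = a * x^b by linearizing: log(y) = log(a) + b * log(x)
set.seed(9)
x <- runif(40, 1, 10)
y <- 2 * x^0.5 * exp(rnorm(40, sd = 0.1))  # a = 2, b = 0.5, multiplicative noise

fit <- lm(log(y) ~ log(x))
a <- exp(coef(fit)[1])  # back-transform the intercept to recover a
b <- coef(fit)[2]
c(a = unname(a), b = unname(b))
```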

2. Exponential function and logarithmic function

Common forms are the exponential model $y = a e^{bx}$ and the logarithmic model $y = a + b \ln x$.

3. Parabolic function

This is polynomial regression of degree two (a quadratic), a very common model for describing real problems; it should feel familiar from the quadratic functions dealt with extensively in middle school. The model is $y = a x^2 + b x + c$. The graph opens upward for cases such as $a=1, b=-2, c=1$, and opens downward for cases such as $a=-1, b=2, c=-1$.

4. S-Shape function

It is also called the logistic function. Its graph is quite distinctive and well suited to describing certain real problems; a common form is $y = \frac{L}{1 + e^{-k(x - x_0)}}$. If interested, see the explanation at http://zh.wikipedia.org/wiki/%E9%82%8F%E8%BC%AF%E5%87%BD%E6%95%B8.
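A sketch of fitting an S-shaped curve in R with simulated data, using nls() and the self-starting logistic model SSlogis() from base R:

```r
# Fit an S-shaped (logistic) curve with nls(); SSlogis is the self-starting
# logistic model in base R: Asym / (1 + exp((xmid - x) / scal))
set.seed(5)
x <- seq(-5, 5, length.out = 60)
y <- 10 / (1 + exp(-(x - 1))) + rnorm(60, sd = 0.3)
df <- data.frame(x = x, y = y)

fit <- nls(y ~ SSlogis(x, Asym, xmid, scal), data = df)
coef(fit)  # estimates should be near Asym = 10, xmid = 1, scal = 1
```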

Validation methods

The following is a brief description of several methods for validating regression results.

1. Standard error

The standard error is the square root of the average squared error between the estimates and the observed values. It is calculated as:

$$S = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}$$
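A sketch computing this on simulated data; note that R's built-in sigma() divides by $n-2$ (the residual degrees of freedom) rather than by $n$, so the two values differ slightly:

```r
# Standard error of the estimate: root mean squared residual
set.seed(11)
x <- 1:25
y <- 4 + 0.6 * x + rnorm(25)
fit <- lm(y ~ x)

S <- sqrt(sum(residuals(fit)^2) / length(y))  # divides by n, as in the formula
S
sigma(fit)  # R's residual standard error divides by n - 2 instead
```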

2. Coefficient of determination

The coefficient of determination $r^2$ ranges over 0~1. It equals 1 minus the ratio of the unexplained deviation to the total deviation. The closer $r^2$ is to 1, the better the regression line fits the observed values; conversely, the closer $r^2$ is to 0, the worse the fit. It is calculated as follows:

$$r^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

3. Correlation coefficient

The correlation coefficient ranges over -1~1. It is in fact the square root of the coefficient of determination above, except that, unlike the coefficient of determination, the correlation coefficient can be positive or negative. When the correlation coefficient is close to -1 or 1 the fit is good, and when it is close to 0 the fit is poor. It is calculated as follows:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
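A sketch computing both the coefficient of determination and the correlation coefficient on simulated data:

```r
# Coefficient of determination and correlation coefficient by hand
set.seed(13)
x <- 1:30
y <- 2 + 0.7 * x + rnorm(30, sd = 2)
fit <- lm(y ~ x)

r2 <- 1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
r2                            # matches summary(fit)$r.squared
r <- cor(x, y)                # Pearson correlation coefficient
c(r = r, sqrt_r2 = sqrt(r2))  # |r| equals the square root of r^2
```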


4. F-Test

The total deviation can be decomposed into two parts, the regression deviation and the residual deviation:

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

The total degrees of freedom $n-1$ can likewise be decomposed into two parts: the regression degrees of freedom, 1, and the residual degrees of freedom, $n-2$. The test statistic $F$ compares the regression deviation and the residual deviation, each divided by its degrees of freedom:

$$F = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 / 1}{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / (n-2)}$$

Here $F$ follows the $F(1, n-2)$ distribution. Take a significance level $\alpha$; if $F > F_\alpha(1, n-2)$, the regression model is significant; otherwise the regression model is not significant for prediction.

Briefly, the degrees of freedom represent how many values in a set of data are free to vary. $n-1$ is the usual calculation; more precisely it should be $n-x$, where $n$ represents the number of observations and $x$ represents the number of parameters that actually need to be estimated.

5. T test

The significance of a regression coefficient is tested with the $t$ value, which is calculated as follows:

$$t = \frac{b_1}{s_{b_1}}$$

where $s_{b_1}$ is the standard error of the estimated coefficient $b_1$.

Here $t$ follows the $t$ distribution with $n-2$ degrees of freedom. Take a significance level $\alpha$; if $|t| > t_{\alpha/2}(n-2)$, the regression coefficient $b$ is significant.
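A sketch of both tests in R on simulated data; summary() reports the $t$ value for each coefficient and the overall $F$ statistic, which can be compared with the critical values from qt() and qf():

```r
# t test and F test for a simple linear regression (simulated data)
set.seed(17)
x <- 1:40
y <- 1 + 0.5 * x + rnorm(40, sd = 3)
fit <- lm(y ~ x)
n <- length(y)

summary(fit)  # shows t values, p-values, R-squared and the F statistic

# F statistic by hand: F = (SSR / 1) / (SSE / (n - 2))
SSE <- sum(residuals(fit)^2)
SSR <- sum((fitted(fit) - mean(y))^2)
F_stat <- (SSR / 1) / (SSE / (n - 2))
F_stat > qf(0.95, df1 = 1, df2 = n - 2)  # TRUE => model significant at alpha = 0.05

# t statistic for the slope, compared with the two-sided critical value
t_stat <- coef(summary(fit))["x", "t value"]
abs(t_stat) > qt(0.975, df = n - 2)      # TRUE => slope b1 significant
```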


Reference: Statistical Prediction and Decision-Making, Xu Guoxiang (all formulas are derived from there).

If you have any questions or suggestions, you are welcome to point them out. Thank you!

