Regression analysis algorithm


Regression analysis

1. Fundamentals of regression analysis

Regression analysis is a method of mathematical statistics that, based on a large amount of observational data, establishes a functional expression for the regression relationship between the dependent variable and the independent variables (called the regression equation). Regression analysis is a predictive modeling technique that studies the relationship between a dependent variable (target) and one or more independent variables (predictors), including uncertain (correlation) relationships between them. This technique is commonly used for predictive analysis, time-series modeling, and discovering causal relationships between variables.

2. Why use regression analysis?

As mentioned above, regression analysis estimates the relationship between two or more variables. Its benefits are many, specifically:

1. It shows whether there is a significant relationship between the independent variables and the dependent variable;
2. It shows the strength of the influence of multiple independent variables on a dependent variable.

Regression analysis also allows us to compare the effects of variables measured on different scales, such as the link between price movements and the number of promotional activities.

3. How many kinds of regression techniques are there?

There are a variety of regression techniques used for prediction. These techniques are mainly distinguished by three measures: the number of independent variables, the type of dependent variable, and the shape of the regression line. Seven common types are: linear regression, logistic regression, polynomial regression, stepwise regression, ridge regression, lasso regression, and ElasticNet regression.

1. Linear Regression

It is one of the most widely known modeling techniques and is often among the first that people learn in predictive modeling. In this technique, the dependent variable is continuous, the independent variables can be continuous or discrete, and the regression line is linear in nature.

Linear regression establishes a relationship between the dependent variable (Y) and one or more independent variables (X) using a best-fit line (the regression line). It is represented by the equation Y = a + b*X + e, where a is the intercept, b is the slope of the line, and e is the error term. The equation can predict the value of the target variable from given values of the predictor variables.

The difference between simple (unary) linear regression and multiple linear regression is that multiple linear regression has more than one independent variable, whereas simple linear regression has only one. The question now is: "How do we obtain the best-fit line?"

1) Obtaining the best-fit line (the values of a and b)

This problem is easily solved with the least squares method, which is also the most common method for fitting a regression line. For the observed data, it calculates the best-fit line by minimizing the sum of the squares of the vertical deviations from each data point to the line. Because the deviations are squared before being summed, positive and negative deviations do not cancel each other out.

2) Principle of least squares

Here it is assumed that there is a linear correlation between the variables y and x. Given n pairs of observations $(x_i, y_i)$, construct the linear function $y = ax + b$. As elaborated above, solving for the regression parameters by least squares means finding the parameters $(a, b)$ that minimize

$$S(a, b) = \sum_{i=1}^{n} (y_i - a x_i - b)^2$$

Taking the partial derivatives of S with respect to a and b and setting them equal to zero yields the parameter estimates

$$a = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}, \qquad b = \bar{y} - a\bar{x}$$
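As a concrete illustration, here is a minimal Python sketch of the closed-form least-squares solution above (the synthetic data and the helper name are illustrative assumptions, not part of the original article):

```python
import numpy as np

# Minimal least-squares fit for y = a*x + b (illustrative sketch).
def least_squares_fit(x, y):
    n = len(x)
    x_mean, y_mean = x.mean(), y.mean()
    # Closed-form solution from setting dS/da = dS/db = 0:
    a = ((x * y).sum() - n * x_mean * y_mean) / ((x ** 2).sum() - n * x_mean ** 2)
    b = y_mean - a * x_mean
    return a, b

# Example with noisy synthetic data (values made up for illustration)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=50)
a, b = least_squares_fit(x, y)
print(f"slope a = {a:.3f}, intercept b = {b:.3f}")  # should be near 2 and 1
```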

Points:

There must be a linear relationship between the independent variables and the dependent variable.

Multiple regression suffers from multicollinearity, autocorrelation, and heteroscedasticity.

Linear regression is very sensitive to outliers. They can seriously affect the regression line and ultimately the predicted values.

Multicollinearity increases the variance of the coefficient estimates, making the estimates very sensitive to slight changes in the model. The result is unstable coefficient estimates.

When there are multiple independent variables, we can use forward selection, backward elimination, and stepwise selection to choose the most important ones.

3) Significance test of the linear regression equation

After a linear regression equation is established, it generally needs a significance test. Commonly used methods include variance decomposition, correlation analysis, the F-test, the t-test, and the Durbin-Watson (D-W) test. The F-test and t-test are introduced below.

F-test method

In a simple linear regression model, if b = 0, a change in x causes no change in Y; that is, Y has no linear correlation with x. The significance test of the linear regression equation can therefore be accomplished by an F-test of the regression equation.

Propose the null hypothesis $H_0$: b = 0 against the alternative $H_1$: b ≠ 0; under $H_0$ there is no linear correlation between Y and x. The statistic

$$F = \frac{SSR/1}{SSE/(n-2)}$$

follows the F-distribution with degrees of freedom (1, n−2). Given the significance level α of the test, the critical value $F_\alpha(1, n-2)$ can be obtained from the F-distribution table. If the F value calculated for a set of samples is greater than $F_\alpha$, reject $H_0$ (that is, b ≠ 0), indicating that there is a linear correlation between x and Y. The correlation test of the regression equation can therefore be carried out as an F-test in the following steps.

Step 1: State the hypotheses $H_0$: b = 0 and $H_1$: b ≠ 0.

Step 2: Under $H_0$, the statistic follows F(1, n−2); for the given significance level α, look up the F-distribution table to obtain the critical value $F_\alpha(1, n-2)$.

Step 3: Calculate SSR and SSE for the set of samples and obtain the F value from them.

Step 4: Compare F with $F_\alpha$; if F > $F_\alpha$, reject the null hypothesis and conclude that there is a linear correlation between x and Y; otherwise, accept that there is no linear correlation between x and Y.
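The four steps can be mirrored in a short Python sketch, assuming numpy and scipy are available (the helper function below is ours, not from the article):

```python
import numpy as np
from scipy import stats

# F-test for the significance of a simple linear regression (minimal sketch).
def f_test(x, y, alpha=0.05):
    n = len(x)
    a, b = np.polyfit(x, y, 1)                 # fitted slope a and intercept b
    y_hat = a * x + b
    ssr = ((y_hat - y.mean()) ** 2).sum()      # regression sum of squares (SSR)
    sse = ((y - y_hat) ** 2).sum()             # residual sum of squares (SSE)
    f_value = (ssr / 1) / (sse / (n - 2))
    f_crit = stats.f.ppf(1 - alpha, 1, n - 2)  # critical value F_alpha(1, n-2)
    return f_value, f_crit, f_value > f_crit   # True means: reject H0, b != 0
```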

t-test method

Although the correlation coefficient r measures how closely the variables y and x are linearly related, r is calculated from sample data and is therefore subject to randomness; the smaller the sample size, the greater the randomness. We therefore also need to judge the population correlation from the sample correlation coefficient r. Because the distribution density function of r is complicated, r needs to be transformed in practical applications. Let

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

Then the statistic t follows the t(n−2) distribution, so the question of whether the population is linearly correlated becomes a hypothesis test of whether the population correlation coefficient ρ = 0, which can be carried out as a t-test on the statistic t.

Calculate the t value for a set of samples; then, for the significance level α and the degrees of freedom n−2 given by the problem, look up the t-distribution table to find the corresponding critical value $t_{\alpha/2}(n-2)$. If $|t| > t_{\alpha/2}(n-2)$, t is statistically significant, that is, there is a linear relationship between the two variables in the population. Otherwise, conclude that there is no linear relationship between the two variables.
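A corresponding sketch of the correlation t-test, under the same assumptions (hypothetical helper, scipy assumed available):

```python
import numpy as np
from scipy import stats

# t-test on the sample correlation coefficient r (minimal sketch).
def corr_t_test(x, y, alpha=0.05):
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    t_value = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
    t_crit = stats.t.ppf(1 - alpha / 2, n - 2)     # two-sided critical value t_{alpha/2}(n-2)
    return t_value, t_crit, abs(t_value) > t_crit  # True: linear relationship in the population
```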

2. Logistic Regression

Logistic regression is used to calculate the probability of "event = success" versus "event = failure". We should use logistic regression when the dependent variable is binary (1/0, true/false, yes/no).

Logistic regression selects its parameters by maximizing the likelihood of the observed sample, rather than by minimizing the sum of squared errors (as in ordinary regression).

2.1 Steps for solving the parameters by maximum likelihood estimation:

(1) Write out the likelihood function:

$$L = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}, \quad p_i = P(y_i = 1 \mid x_i)$$

Here, n is the number of samples, and the likelihood function represents the probability that the n samples (events) occur simultaneously.

(2) Take the logarithm of the likelihood function:

$$\ln L = \sum_{i=1}^{n} \left[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \right]$$
(3) Take the partial derivative of the log-likelihood function with respect to each parameter and set it to 0, obtaining the log-likelihood equations.

(4) Solve the equations for the parameters. (For logistic regression these equations have no closed-form solution, so they are solved numerically.)
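Since the log-likelihood equations are solved iteratively, one simple way to do so is gradient ascent; below is a minimal numpy sketch (the learning rate, iteration count, and function names are illustrative assumptions):

```python
import numpy as np

# Maximum likelihood estimation for logistic regression via gradient ascent
# (minimal sketch; production libraries use Newton-type solvers instead).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    X = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)                # P(y=1 | x) under the current parameters
        w += lr * X.T @ (y - p) / len(y)  # gradient of the log-likelihood is X^T (y - p)
    return w
```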

Points:

It is widely used for classification problems.

Logistic regression does not require a linear relationship between the independent variables and the dependent variable. It can handle various types of relationships because it applies a nonlinear log transformation in predicting the odds ratio (OR).

To avoid overfitting and underfitting, we should include all the important variables. A good way to ensure this is to estimate the logistic regression with a stepwise selection method.

It requires a large sample size, because maximum likelihood estimation performs worse than ordinary least squares when the sample is small.

The independent variables should not be correlated with each other; that is, there should be no multicollinearity. However, in analysis and modeling, we can choose to include interaction effects of categorical variables.

If the dependent variable is ordinal, the model is called ordinal logistic regression.

If the dependent variable is multi-class, the model is called multinomial logistic regression.

2.2 Logistic Regression modeling steps

1) Choose the indicator variables (dependent and independent variables) according to the purpose of the analysis, then collect the data.

2) Let the probability that Y takes 1 be p = P(y = 1 | X) and the probability that it takes 0 be 1 − p. Write the regression equation relating ln(p/(1−p)) to the independent variables, and estimate the regression coefficients in the model.

3) Model test: use the F and p values in the output analysis-of-variance table to test whether the regression equation is significant. If the p-value is less than the significance level, the model passes the test and the regression coefficients can be tested next; otherwise, reselect the indicator variables and re-establish the regression equation.

4) Significance test of the regression coefficients: in multivariate regression, a significant regression equation does not mean that every independent variable has a significant impact on Y. To remove the secondary, dispensable variables from the regression equation and re-establish a simpler one, each independent variable must be tested for significance; the test results come from the parameter-estimates table. With the stepwise regression method, the least significant independent variable is removed, the regression equation is re-established, and the model and the remaining regression coefficients are tested again.

5) Model application: input the values of the independent variables to obtain the predicted value of the dependent variable, or control the values of the independent variables according to a target value of the dependent variable.

[Figure: flowchart of the logistic regression modeling steps]
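As a sketch of steps 2) through 4), the statsmodels library can fit the model and report the significance statistics mentioned above (the synthetic data here are purely illustrative, not from the article):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic binary data standing in for the collected indicators (step 1).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
p = 1 / (1 + np.exp(-(0.5 + X @ np.array([1.0, -2.0, 0.0]))))
y = (rng.random(200) < p).astype(int)

# Steps 2-4: fit the logistic model and inspect the significance statistics.
model = sm.Logit(y, sm.add_constant(X)).fit()
print(model.summary())  # overall LLR p-value plus per-coefficient p-values,
                        # which guide the stepwise removal described in step 4
```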


3. Polynomial Regression

For a regression equation, if the exponent of an independent variable is greater than 1, then it is a polynomial regression equation (for example, y = a + b·x²).

In this regression technique, the best-fit line is not a straight line; it is a curve fitted to the data points.
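A minimal sketch of fitting such a curve with numpy (the degree and data are arbitrary choices for illustration):

```python
import numpy as np

# Fitting a degree-2 polynomial by least squares (minimal sketch).
rng = np.random.default_rng(2)
x = np.linspace(-3, 3, 60)
y = 1.0 + 0.5 * x + 2.0 * x ** 2 + rng.normal(0, 1, size=60)

coeffs = np.polyfit(x, y, deg=2)   # returns [c2, c1, c0], highest degree first
y_hat = np.polyval(coeffs, x)      # the fitted curve, not a straight line
print("coefficients:", coeffs)     # should be close to [2.0, 0.5, 1.0]
```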

4. Stepwise Regression

We can use this form of regression when dealing with multiple independent variables. In this technique, the selection of independent variables is done by an automated process, without human intervention. This is achieved by observing statistical values such as R-squared, t-statistics, and the AIC metric to identify important variables. Stepwise regression fits the model by adding or dropping covariates one at a time based on a specified criterion. Some of the most commonly used stepwise regression methods are listed below:

The standard stepwise regression method does two things: it adds and removes predictors as needed at each step.

Forward selection starts with the most significant predictor in the model and adds a variable at each step.

Backward elimination starts with all of the model's predictors and removes the least significant variable at each step.

The purpose of this modeling technique is to maximize predictive power using the fewest predictor variables. It is also one of the ways to handle high-dimensional datasets.
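As one possible realization of the forward-selection variant described above, here is a greedy AIC-based sketch using statsmodels (the helper function is hypothetical, not a library API):

```python
import numpy as np
import statsmodels.api as sm

# Greedy forward selection by AIC (minimal sketch of one stepwise variant).
def forward_select(X, y):
    remaining = list(range(X.shape[1]))
    selected = []
    best_aic = sm.OLS(y, np.ones((len(y), 1))).fit().aic  # intercept-only model
    while remaining:
        # AIC of every candidate model that adds one more variable
        scores = [(sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().aic, j)
                  for j in remaining]
        aic, j = min(scores)
        if aic >= best_aic:
            break                # no candidate improves the model; stop
        best_aic = aic
        selected.append(j)
        remaining.remove(j)
    return selected
```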

5. Ridge Regression

Ridge regression is a technique for data that exhibit multicollinearity (highly correlated independent variables). With multicollinearity, although the ordinary least squares (OLS) estimates are unbiased, their variances are large, which pushes the estimates away from the true values. Ridge regression reduces the standard errors by adding a degree of bias to the regression estimates.
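The bias added by ridge regression corresponds to a penalty term λ‖β‖² in the least-squares objective, which leads to the closed-form estimate sketched below (a minimal numpy illustration; the λ value is an arbitrary assumption):

```python
import numpy as np

# Ridge estimate: beta = (X^T X + lambda * I)^(-1) X^T y.
# Setting lam = 0 recovers the ordinary least-squares estimate.
def ridge_fit(X, y, lam=1.0):
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)
```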

6. Lasso Regression Lasso Regression

Similar to ridge regression, Lasso (Least Absolute Shrinkage and Selection Operator) also penalizes the absolute size of the regression coefficients. In addition, it can reduce variability and improve the accuracy of linear regression models.

7. ElasticNet Regression

ElasticNet is a hybrid of the lasso and ridge regression techniques. It is trained with both L1 and L2 penalties as its regularizer. ElasticNet is useful when there are multiple correlated features: lasso tends to pick one of them at random, while ElasticNet tends to select both.
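The contrast between lasso and ElasticNet on correlated features can be seen in a small scikit-learn sketch (the alpha values and data are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

# Two perfectly correlated features: lasso tends to keep only one of them,
# while ElasticNet spreads the weight over both.
rng = np.random.default_rng(3)
x = rng.normal(size=100)
X = np.column_stack([x, x])                  # duplicated, fully correlated feature
y = 3.0 * x + rng.normal(0, 0.1, size=100)

print(Lasso(alpha=0.1).fit(X, y).coef_)      # typically one nonzero coefficient
print(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)  # weight shared by both
```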

How to choose the regression model correctly?

Among the many kinds of regression models, it is important to select the most appropriate technique based on the types of the independent and dependent variables, the dimensionality of the data, and other basic characteristics of the data. Here are the key factors for choosing the right regression model:

1) Data exploration is an essential part of building a predictive model. It should be the first step when choosing the right model, for example identifying the relationships and effects of the variables. To compare the merits of different models, we can analyze different metrics, such as the statistical significance of the parameters, R-squared, adjusted R-squared, AIC, BIC, and the error term; another is Mallows' Cp criterion, which checks for possible bias in your model by comparing it with all possible sub-models (or a careful selection of them).

2) Cross-validation is the best way to evaluate a predictive model: divide your dataset into two parts (one for training, one for validation) and use the mean squared error between the observed and predicted values to measure predictive accuracy (see the sketch after this list). If your dataset has many confounded variables, you should not use an automatic model-selection method, because you would not want to put all of them in the same model at the same time.

3) It also depends on your purpose. A less powerful model may be easier to implement than one with high statistical significance.

4) Regression regularization methods (lasso, ridge, and ElasticNet) work well when the dataset has high dimensionality and multicollinearity among the variables.
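Here is a minimal hold-out validation sketch following point 2) above, using scikit-learn (the synthetic data and split ratio are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hold-out validation of a regression model (minimal sketch, synthetic data).
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(0, 1, size=200)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
print("validation MSE:", mean_squared_error(y_va, model.predict(X_va)))
```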
