Logistic Regression: Algorithm Summary


Many near-identical copies of this article circulate online, and I don't know who the original author is. Because this version had the fewest errors, I have edited it and marked the key points (the content between the horizontal rules is the dense derivation rather than plain talk; the later steps classify samples using the parameters derived above).

 

First Look

The Logistic Regression (LR) classifier is no secret: what it learns is simply a set of weights w0, w1, ..., wm.
When a sample from the test set arrives, the weights are combined linearly with the test features:

z = w0 + w1 * x1 + w2 * x2 + ... + wm * xm ① (x1, x2, ..., xm are the features of a sample with dimension m)

Then we map z through the sigmoid function:

σ(z) = 1 / (1 + exp(−z)) ②

The domain of the sigmoid function is (−∞, +∞) and its range is (0, 1). Therefore, the most basic LR classifier is suited to classifying two classes of objects.

So how do we obtain the weights w0, w1, ..., wm of the LR classifier? This requires the concept of maximum likelihood estimation (MLE) and an optimization algorithm.

We interpret the sigmoid output as the probability that a sample belongs to class 1; for each sample point this probability can be computed from formulas ① and ② above.
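To make ① and ② concrete, here is a minimal NumPy sketch (the weights and features below are made up):

```python
import numpy as np

def sigmoid(z):
    """Formula 2: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(weights, x):
    """Formula 1: weights = [w0, w1, ..., wm], x = [x1, ..., xm].

    Returns the probability that the sample belongs to class 1;
    classify as class 1 when the probability exceeds 0.5.
    """
    z = weights[0] + np.dot(weights[1:], x)
    return sigmoid(z)

w = np.array([-1.0, 0.8, 0.5])   # hypothetical learned weights (w0, w1, w2)
x = np.array([1.2, 0.7])         # one test sample with m = 2 features
p = predict(w, x)
print(p, int(p > 0.5))
```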

 

Detailed description

1. Logistic Regression Model

 

1.1 Logistic Regression Model

Consider a vector x = (x1, x2, ..., xp) of p independent variables, and let π(x) = P(y = 1 | x) be the conditional probability that the event occurs given the observations. The logistic regression model can be expressed as

π(x) = exp(g(x)) / (1 + exp(g(x)))    (1.1)

The function of this form on the right-hand side is called the logistic function; g(x) is the linear predictor defined below.

 

If x contains a nominal variable, it is converted to dummy variables: a nominal variable with k levels becomes k − 1 dummy variables. The linear predictor is then

g(x) = β0 + β1 x1 + β2 x2 + ... + βp xp    (1.2)

The conditional probability that the event does not occur is defined as

P(y = 0 | x) = 1 − π(x) = 1 / (1 + exp(g(x)))    (1.3)

The ratio of the probability that the event occurs to the probability that it does not occur is

π(x) / (1 − π(x)) = exp(g(x))    (1.4)

This ratio is called the odds of the event. Because 0 < p < 1, we have odds > 0. Taking the logarithm of the odds gives a linear function of x:

ln[π(x) / (1 − π(x))] = g(x) = β0 + β1 x1 + ... + βp xp    (1.5)
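As a quick sanity check of (1.1)-(1.5), the following sketch (the names and numbers are hypothetical) confirms that the log-odds reproduce the linear predictor:

```python
import numpy as np

beta = np.array([-0.5, 1.2, -0.7])   # hypothetical (beta0, beta1, beta2)
x = np.array([1.0, 0.3, 2.0])        # leading 1 stands in for the intercept

g = beta @ x                          # linear predictor g(x), formula (1.2)
pi = np.exp(g) / (1 + np.exp(g))      # event probability, formula (1.1)
odds = pi / (1 - pi)                  # formula (1.4)
print(np.log(odds), g)                # equal: the log-odds is linear, (1.5)
```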

 

1.2 Maximum Likelihood Function

Assume there are n observation samples with observed values y1, y2, ..., yn. Let P(yi = 1 | xi) = π(xi) be the conditional probability of obtaining yi = 1 given xi; the conditional probability of obtaining yi = 0 under the same conditions is P(yi = 0 | xi) = 1 − π(xi). The probability of a single observation is therefore

P(yi) = π(xi)^yi [1 − π(xi)]^(1 − yi)    (1.6) ----- this formula simply combines the two preceding equations; there is nothing special about it

Because the observations are independent, their joint distribution can be expressed as the product of the marginal distributions:

L(β) = ∏ π(xi)^yi [1 − π(xi)]^(1 − yi)    (1.7)

The formula above is called the likelihood function of the n observations. Our goal is to find the parameter estimates that maximize its value; the key to maximum likelihood estimation is therefore to choose parameters β so that the expression above attains its maximum.

Taking the logarithm of the function above gives

ln L(β) = Σ { yi ln π(xi) + (1 − yi) ln[1 − π(xi)] }    (1.8)

The formula above is called the log-likelihood function. The parameters are estimated by maximizing it.

Differentiating the function with respect to each of the p + 1 parameters yields

Σ [yi − π(xi)] xij = 0,  j = 0, 1, 2, ..., p    (1.9)

where xi0 = 1 for the intercept. ----- p is the number of independent variables.

These are called the likelihood equations. To solve this system of nonlinear equations, the Newton-Raphson method is used for an iterative solution.
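Before turning to the solver, here is a minimal NumPy sketch of (1.8) and (1.9); the function and variable names are my own:

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Formula (1.8). X is the n x (p+1) design matrix whose first column
    is all ones; y is the 0/1 response vector."""
    g = X @ beta
    # y*g - ln(1 + e^g) equals y*ln(pi) + (1-y)*ln(1-pi), numerically safer
    return np.sum(y * g - np.log1p(np.exp(g)))

def score(beta, X, y):
    """Formula (1.9): the gradient X'(y - pi); the MLE sets it to zero."""
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (y - pi)
```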

1.3 Newton-Raphson Iteration Method

Taking second-order partial derivatives of the log-likelihood gives the entries of the Hessian matrix:

∂² ln L / ∂βj ∂βk = − Σ xij xik π(xi)[1 − π(xi)]    (1.10)

Written in matrix form, with H denoting the Hessian and X the n × (p + 1) design matrix,

H = − X' V X    (1.11)

where V = diag{π(x1)[1 − π(x1)], ..., π(xn)[1 − π(xn)]}.

Let

U = X' (y − π)    (1.12)

(note that the design matrix is transposed); this is the matrix form of the likelihood equations (1.9), with y the vector of observed responses and π the vector of fitted probabilities.

The Newton iteration then takes the form

β_new = β_old − H⁻¹ U = β_old + (X' V X)⁻¹ X' (y − π)    (1.13)

Note that the matrix −H = X'VX in the formula above is symmetric positive definite, so each iteration amounts to solving the linear system (X'VX) δ = U for the update step δ, for example by performing a Cholesky decomposition of X'VX.
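Putting (1.11)-(1.14) together, here is a sketch of the Newton-Raphson fit under the notation above; it assumes SciPy is available for the Cholesky solve, and the names are my own:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def newton_raphson_logit(X, y, tol=1e-8, max_iter=25):
    """Iterate formula (1.13): beta <- beta + (X'VX)^{-1} X'(y - pi).

    X'VX = -H is symmetric positive definite, so each step solves
    (X'VX) delta = U via a Cholesky factorization rather than an
    explicit matrix inverse.
    """
    n, p1 = X.shape
    beta = np.zeros(p1)
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        U = X.T @ (y - pi)                     # score vector, (1.12)
        V = pi * (1.0 - pi)                    # diagonal of V
        XtVX = X.T @ (X * V[:, None])          # information matrix I = X'VX
        delta = cho_solve(cho_factor(XtVX), U)
        beta += delta
        if np.max(np.abs(delta)) < tol:
            break
    se = np.sqrt(np.diag(np.linalg.inv(XtVX)))  # standard errors, (1.14)
    return beta, se
```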

The asymptotic variances and covariances of the maximum likelihood estimates can be estimated from the inverse of the information matrix. The information matrix is the negative of the matrix of second derivatives, I(β) = −H = X'VX, and the variance-covariance matrix of the estimates is I⁻¹: the variance of estimate β̂j is the j-th diagonal entry of I⁻¹, and the covariance of β̂i and β̂j is the corresponding off-diagonal entry (the matrix is symmetric, so Cov(β̂i, β̂j) = Cov(β̂j, β̂i)). In most cases, however, we use the standard error of each estimate, written

SE(β̂j) = sqrt( [I⁻¹]jj ),  for j = 0, 1, 2, ..., p    (1.14)

-----------------------------------------------------------------------------------------------------------------------------------------------

2. Significance Tests

The following sections describe how to test whether an independent variable in the logistic regression model is significantly related to the response variable. The null hypothesis is H0: βj = 0 (the independent variable xj has no effect on the probability of the event occurring). If the null hypothesis is rejected, the probability of the event depends on changes in xj.

2.1 Wald test

The Wald test is usually used for the significance test of a single regression coefficient. The statistic is

Wj = [ β̂j / SE(β̂j) ]²    (2.1)

where SE(β̂j) is the standard error. This single-variable Wald statistic follows a chi-square distribution with one degree of freedom.

To test the joint hypothesis H0: β1 = β2 = ... = βp = 0, compute the statistic

W = β̂*' (Σ̂*)⁻¹ β̂*    (2.2)

where β̂* is the estimate vector with the intercept estimate removed and Σ̂* is the estimated covariance matrix with the intercept's row and column removed. This Wald statistic follows a chi-square distribution with p degrees of freedom. Written in matrix form, the formula becomes

W = (Q β̂)' (Q Σ̂ Q')⁻¹ (Q β̂)    (2.3)

where Q is a constant matrix whose first column is zero, selecting the coefficients under test. For example, to test H0: β1 = β2 = 0, take Q = [0 1 0 ... 0; 0 0 1 0 ... 0].

However, when the absolute value of a regression coefficient is very large, the estimated standard error of that coefficient inflates, which shrinks the Wald statistic and raises the probability of a Type II error: the test fails to reject a null hypothesis that should be rejected. Therefore, when we find a regression coefficient with a very large absolute value, we should not test the null hypothesis with the Wald statistic but should use the likelihood ratio test instead.
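Using the estimates and standard errors from the Newton-Raphson sketch above, a per-coefficient Wald test might look like this (a sketch, assuming SciPy):

```python
import numpy as np
from scipy.stats import chi2

def wald_test(beta_hat, se):
    """Formula (2.1): W_j = (beta_j / SE(beta_j))^2, chi-square with 1 df."""
    w = (np.asarray(beta_hat) / np.asarray(se)) ** 2
    return w, chi2.sf(w, df=1)      # statistic and p-value per coefficient
```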

2.2 Likelihood Ratio Test

Multiplying by −2 the difference between the log-likelihood of the model without the variables and that of the model with them gives a statistic that follows a chi-square distribution. This test statistic is called the likelihood ratio and is written

G = −2 ln( L_without / L_with ) = −2 [ ln L_reduced − ln L_full ]    (2.4)

The log-likelihood values are computed using formula (1.8).

If we need to test the joint hypothesis H0: β1 = β2 = ... = βp = 0, compute the statistic

G = −2 ln{ [ (n1/n)^n1 (n0/n)^n0 ] / L̂ }    (2.5)

where n0 is the number of observations with y = 0, n1 the number with y = 1, and n = n0 + n1 the total number of observations. The numerator on the right-hand side is simply the maximized likelihood of the intercept-only model, and L̂ is the maximized likelihood of the fitted model. The statistic G follows a chi-square distribution with p degrees of freedom.
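A sketch of (2.4)-(2.5), assuming SciPy; log_likelihood is the helper sketched in section 1.2:

```python
import numpy as np
from scipy.stats import chi2

def loglik_intercept_only(y):
    """ln of the numerator in (2.5): n1*ln(n1/n) + n0*ln(n0/n)."""
    n, n1 = len(y), int(np.sum(y))
    n0 = n - n1
    return n1 * np.log(n1 / n) + n0 * np.log(n0 / n)

def likelihood_ratio_test(loglik_full, loglik_reduced, df):
    """Formula (2.4): G = -2 (lnL_reduced - lnL_full), chi-square with df
    equal to the number of coefficients set to zero under H0."""
    G = -2.0 * (loglik_reduced - loglik_full)
    return G, chi2.sf(G, df)
```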

2.3 Score Test

Under the null hypothesis H0: βj = 0, let the restricted parameter estimate be β̃, that is, the maximum likelihood estimate with βj fixed at 0. The Score statistic is computed as

ST = U(β̃)² / I(β̃)    (2.6)

where U(β̃) is the first-order partial derivative of the log-likelihood (1.8) with respect to βj evaluated at β̃, and I(β̃) is the negative of the second-order partial derivative (the information) evaluated at β̃. The Score statistic follows a chi-square distribution with one degree of freedom.
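For the simplest case, testing a single covariate against the intercept-only model, (2.6) reduces to the standard one-variable score test sketched below (names are my own):

```python
import numpy as np

def score_test_single(x, y):
    """One-covariate Score test of H0: beta1 = 0 (intercept retained).

    Under H0 the restricted fit is pi_i = mean(y) for every i, so the
    statistic (2.6) reduces to U^2 / I with the quantities below; it is
    compared to a chi-square distribution with 1 degree of freedom.
    """
    pbar = y.mean()                                    # restricted MLE
    U = np.sum(x * (y - pbar))                         # score at the null
    I = pbar * (1 - pbar) * np.sum((x - x.mean())**2)  # information at the null
    return U**2 / I
```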

2.4 Model Fit Information

After the model is built, we consider and compare how well it fits the data. Three measures can serve as the basis for assessing fit.

(1) −2 Log Likelihood (−2LogL)

−2LogL = −2 Σ { yi ln π̂(xi) + (1 − yi) ln[1 − π̂(xi)] }    (2.7)

(2) Akaike Information Criterion (AIC)

AIC = −2LogL + 2(K + S)    (2.8)

where K is the number of independent variables in the model and S is the total number of response categories minus 1; for binary logistic regression, S = 2 − 1 = 1. The value of −2LogL ranges from 0 to +∞; the smaller the value, the better the fit. As the number of parameters in the model increases, the likelihood increases and −2LogL decreases, so 2(K + S) is added in the AIC formula to offset the effect of the number of parameters. Other conditions being equal, a smaller AIC value indicates a better-fitting model.

(3) Schwarz Criterion (SC)

This measure adjusts the −2LogL value for the number of independent variables and the number of observations. SC is defined as

SC = −2LogL + (K + S) ln(n)    (2.9)

where ln(n) is the natural logarithm of the number of observations. This measure can only be used to compare different models fit to the same data. Other conditions being the same, the smaller a model's AIC or SC value, the better the model fits.
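A small sketch of (2.7)-(2.9) under the definitions above (the function name is my own):

```python
import numpy as np

def fit_statistics(loglik, K, n, S=1):
    """Formulas (2.7)-(2.9); S = 2 - 1 = 1 for binary logistic regression.

    loglik: maximized log-likelihood, K: number of covariates,
    n: number of observations.
    """
    neg2logl = -2.0 * loglik               # (2.7)
    aic = neg2logl + 2.0 * (K + S)         # (2.8)
    sc = neg2logl + (K + S) * np.log(n)    # (2.9)
    return neg2logl, aic, sc
```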

 

3. Interpreting the Regression Coefficients

3.1 Odds Ratio

Odds = p / (1 − p), that is, the ratio of the probability that the event occurs to the probability that it does not. The odds ratio (OR) is then:

(1) Continuous independent variable. For each one-unit increase in the independent variable, the odds ratio is

OR = exp(βj)    (3.1)

(2) Odds ratio of a binary independent variable. A variable that can take only the values 0 or 1 is called a dummy variable. The odds ratio of the value 1 relative to the value 0 is

OR = exp(βj)    (3.2)

that is, e raised to the power of the corresponding coefficient.

(3) Odds ratios of a categorical independent variable.

If a categorical variable has m categories, m − 1 dummy variables need to be created. The omitted category is called the reference category. Let D1, ..., D(m−1) be the dummy variables with coefficients β1, ..., β(m−1); relative to the reference category, the odds ratio for category l is exp(βl).

3.2 Confidence Intervals for Logistic Regression Coefficients

For confidence level 1 − α, the 100(1 − α)% confidence interval for the parameter βj is

β̂j ± z(1−α/2) · SE(β̂j)    (3.3)

where z(1−α/2) is the critical value under the standard normal curve and SE(β̂j) is the standard error of the coefficient estimate; the two resulting values are the lower and upper limits of the confidence interval. When the sample size is large, the 95% (α = 0.05) confidence interval for the coefficient is

β̂j ± 1.96 · SE(β̂j)    (3.4)
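Since an interval for βj maps through exp into an interval for the odds ratio, (3.1)-(3.4) can be combined in a few lines (the estimates below are made up):

```python
import numpy as np

z = 1.96                                  # critical value for a 95% interval
beta_hat = np.array([0.47, -1.10])        # hypothetical slope estimates
se = np.array([0.15, 0.40])               # their standard errors

odds_ratio = np.exp(beta_hat)             # (3.1)/(3.2)
lower = np.exp(beta_hat - z * se)         # (3.3)/(3.4) mapped through exp
upper = np.exp(beta_hat + z * se)
for o, lo, hi in zip(odds_ratio, lower, upper):
    print(f"OR = {o:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```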

-----------------------------------------------------------------------------------------------------------------------------------------------

4. Variable Selection

4.1 Forward selection: starting from the intercept-only model, add the independent variables that meet the specified significance level to the model one at a time.

The specific selection procedure is as follows:

(1) The constant (intercept) enters the model.

(2) Using formula (2.6), compute the Score test value of each variable not yet in the model and obtain the corresponding P value.

(3) Find the smallest P value. If it is below the significance level, the corresponding variable enters the model. If this variable is a dummy variable for one level of a nominal variable, the dummy variables for the other levels of that nominal variable also enter the model. Otherwise, no variable can be selected into the model and the selection process terminates.

(4) Return to (2) and continue with the next selection. (A sketch of the procedure follows this list.)
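Here is a sketch of the loop, assuming statsmodels is available; as a shortcut it ranks candidates by the Wald p-value of each coefficient when added alone, rather than by the Score test of (2.6), which is asymptotically similar:

```python
import numpy as np
import statsmodels.api as sm

def forward_select(X, y, alpha_enter=0.05):
    """Forward selection sketch. X is an n x (p+1) array whose first column
    is all ones (the intercept); y is the 0/1 response."""
    selected = [0]                                   # start from the intercept
    remaining = list(range(1, X.shape[1]))
    while remaining:
        # p-value of each candidate's coefficient when added by itself
        pvals = {j: sm.Logit(y, X[:, selected + [j]]).fit(disp=0).pvalues[-1]
                 for j in remaining}
        best = min(pvals, key=pvals.get)             # smallest P value
        if pvals[best] >= alpha_enter:
            break                                    # step (3): nothing enters
        selected.append(best)                        # step (4): loop again
        remaining.remove(best)
    return selected
```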

4.2 Backward selection: starting from the model that includes all candidate variables, delete one at a time the independent variables that do not meet the retention criterion.

The specific selection procedure is as follows:

(1) All variables enter the model.

(2) Using formula (2.1), compute the Wald test value of every variable in the model and obtain the corresponding P value.

(3) Find the largest P value. If it exceeds the significance level, that variable is removed. For a nominal variable, if the smallest P value among its dummy variables exceeds the significance level, the other dummy variables of that nominal variable are removed as well. Otherwise, no variable can be removed and the selection process ends.

(4) Return to (2) for the next elimination. (A sketch follows this list.)
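The mirror-image sketch, again assuming statsmodels, drops the variable with the worst Wald p-value (2.1) until everything remaining is significant:

```python
import numpy as np
import statsmodels.api as sm

def backward_select(X, y, alpha_stay=0.05):
    """Backward elimination sketch: start from the full model and repeatedly
    drop the variable with the largest Wald p-value."""
    keep = list(range(X.shape[1]))                   # column 0 = intercept
    while len(keep) > 1:
        res = sm.Logit(y, X[:, keep]).fit(disp=0)
        pvals = res.pvalues[1:]                      # never drop the intercept
        worst = int(np.argmax(pvals))
        if pvals[worst] <= alpha_stay:
            break                                    # everything is significant
        del keep[worst + 1]                          # +1 offsets the intercept
    return keep
```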

4.3 Stepwise selection

(1) Basic idea: introduce the independent variables one at a time, each time adding the variable with the most significant effect on Y, and after each addition re-test the variables already in the equation one by one, removing those whose effects are no longer significant. In this way the final equation neither misses any variable with a significant effect on Y nor retains any variable without a significant effect on Y.

(2) Screening levels: first specify a significance level for introducing variables and a significance level for removing variables, then screen the variables accordingly.

(3) Basic steps

Screening variables involves two basic steps: first, introducing new variables from outside the equation; second, removing non-significant variables from the regression equation.

Suppose there are p independent variables to consider for inclusion in the regression equation.

① Fit the intercept-only model by maximum likelihood, then compute the Score test value for each of the p independent variables and obtain the corresponding P values. The variable with the smallest P value enters the model if that P value is below the entry significance level; otherwise the procedure stops. If this variable is a dummy variable for one level of a nominal variable, the dummy variables for the other levels of that nominal variable also enter the model.

② To determine whether each of the remaining p − 1 variables is important once the first variable is in the model, fit the p − 1 two-variable models separately and compute the Score test value and P value of each remaining variable. The variable with the smallest P value enters the model if that P value is below the entry level; otherwise stop. Dummy variables of nominal variables are handled as in the previous step.

③ This step starts once the model contains two variables. Note that a variable may no longer be important after others have been introduced, so this step includes a deletion check. Using (2.1), compute the Wald test value of each variable in the model and the corresponding P value. Take the variable with the largest P value; if that P value exceeds the removal significance level, delete the variable from the model, otherwise stop deleting. For a nominal variable, if the smallest P value among its dummy variables exceeds the removal level, the whole nominal variable is deleted from the model.

④ Continue in this way: each time a variable is selected to enter, a backward deletion check is performed. The loop terminates when all p variables have entered the model; or when every variable in the model has a P value below the removal level while every variable outside the model has a P value above the entry level; or when a variable that enters the model would be deleted in the very next step, which would otherwise create a cycle. (A sketch of the full loop follows.)
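The sketch below combines the two moves above, again assuming statsmodels and again substituting Wald p-values for the Score test at the entry step:

```python
import numpy as np
import statsmodels.api as sm

def stepwise_select(X, y, alpha_enter=0.05, alpha_remove=0.10):
    """Stepwise sketch: one forward entry step, then one backward deletion
    pass, repeated. Choosing alpha_enter < alpha_remove keeps a freshly
    entered variable from being deleted immediately (the cycle in step 4)."""
    selected = [0]                                   # column 0 = intercept
    for _ in range(2 * X.shape[1]):                  # guard against cycling
        remaining = [j for j in range(1, X.shape[1]) if j not in selected]
        if not remaining:
            break
        # forward step: candidate with the smallest p-value when added alone
        pvals = {j: sm.Logit(y, X[:, selected + [j]]).fit(disp=0).pvalues[-1]
                 for j in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha_enter:
            break                                    # nothing left to enter
        selected.append(best)
        # backward pass: drop variables whose p-values have drifted upward
        res = sm.Logit(y, X[:, selected]).fit(disp=0)
        for k in range(len(selected) - 1, 0, -1):    # never drop the intercept
            if res.pvalues[k] > alpha_remove:
                del selected[k]
    return selected
```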

 

 

