Statistical Learning Method (VI): Logistic Regression and the Maximum Entropy Model



Several similar articles circulate elsewhere and the original author is unknown. Because the copy I found contained errors, I have revised it and added emphasis marks where appropriate (the original layout was unclear and somewhat convoluted, so the material below is reorganized to follow the order of the computation).

A First Look

It is called the LR classifier (Logistic Regression Classifier), and there is nothing mysterious about it. For classification, a trained LR classifier is simply a set of weights w0, w1, ..., wm.
When a sample arrives from the test set, a value z is computed as the linear combination of these weights with the sample's features:

z = w0 + w1*x1 + w2*x2 + ... + wm*xm    ① (where x1, x2, ..., xm are the features of one sample and m is the feature dimension)

Then z is passed through the sigmoid function:

σ(z) = 1/(1 + exp(-z))    ②

Since the domain of the sigmoid function is (-inf, +inf) and its range is (0, 1), the most basic LR classifier is suited to binary (two-class) classification.

So how is this set of weights w0, w1, ..., wm obtained? This requires the concept of maximum likelihood estimation (MLE) and an optimization algorithm.

We treat the sigmoid output as the conditional probability that a sample belongs to the positive class; for each sample point, this probability can be computed from formulas ① and ② above.
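The prediction step just described is only a few lines of code. A minimal sketch (the weight values below are hypothetical, not learned from any data):

```python
import math

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real z into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, x):
    # weights = [w0, w1, ..., wm] (w0 is the intercept); x = [x1, ..., xm].
    z = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return sigmoid(z)

# Hypothetical weights for a 2-feature problem.
w = [-1.0, 2.0, 0.5]
p = predict_proba(w, [1.0, 2.0])  # z = -1 + 2*1 + 0.5*2 = 2, p = sigmoid(2)
```

A sample is assigned to the positive class when p exceeds some threshold, typically 0.5.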

Detailed description

1. The Logistic Regression Model

1.1 Logistic regression model

Consider a vector x = (x1, x2, ..., xp) of p independent variables, and let π(x) = P(y = 1 | x) be the conditional probability that the event occurs given the observed values. The logistic regression model can be expressed as

π(x) = exp(β0 + β1*x1 + ... + βp*xp) / (1 + exp(β0 + β1*x1 + ... + βp*xp))    (1.1)

The function on the right-hand side is called the logistic function; its graph is an S-shaped curve.

If a nominal variable is among the covariates, it is converted into dummy variables: a nominal variable with k possible values becomes k - 1 dummy variables. The model then takes the form

π(x) = exp(β0 + Σj βj*xj + Σl γl*Dl) / (1 + exp(β0 + Σj βj*xj + Σl γl*Dl))    (1.2)

where D1, ..., D(k-1) are the dummy variables with coefficients γ1, ..., γ(k-1).

The conditional probability that the event does not occur is then

1 - π(x) = 1 / (1 + exp(β0 + β1*x1 + ... + βp*xp))    (1.3)

The ratio of the probability that the event occurs to the probability that it does not is therefore

π(x) / (1 - π(x)) = exp(β0 + β1*x1 + ... + βp*xp)    (1.4)

This ratio is called the odds of experiencing the event, or simply the odds. Because 0 < π < 1, we have odds > 0. Taking the logarithm of the odds yields the linear function

ln[π(x) / (1 - π(x))] = β0 + β1*x1 + ... + βp*xp    (1.5)
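To make the relationship between probability, odds, and log-odds concrete, here is a small illustration (the probability value 0.8 is just an example):

```python
import math

def odds(p):
    # Odds: probability the event occurs over probability it does not.
    return p / (1.0 - p)

def logit(p):
    # Log-odds; in the model this equals b0 + b1*x1 + ... + bp*xp.
    return math.log(odds(p))

# Example: p = 0.8 gives odds = 4 and log-odds = ln(4).
```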

1.2 Maximum likelihood function

Suppose there are n observations y1, y2, ..., yn. Given the covariates xi, let π(xi) be the probability of getting yi = 1, so that 1 - π(xi) is the probability of getting yi = 0 under the same conditions. The probability of one observed value is thus

P(yi) = π(xi)^yi * [1 - π(xi)]^(1-yi),  yi = 0 or 1    (1.6)

This formula simply combines the two preceding cases into a single expression; there is nothing special about it.

Because the observations are independent, their joint distribution can be expressed as the product of the marginal distributions:

L(β) = Π i=1..n π(xi)^yi * [1 - π(xi)]^(1-yi)    (1.7)

This is called the likelihood function of the n observations. Our goal is to find parameter estimates that maximize this likelihood; the key step of maximum likelihood estimation is thus to find the parameters β for which L(β) attains its maximum.

Taking the logarithm of the function above gives

ln L(β) = Σ i=1..n { yi*ln[π(xi)] + (1 - yi)*ln[1 - π(xi)] }    (1.8)

This is called the log-likelihood function, and we seek the parameter values that maximize it.

Differentiating this function with respect to each parameter yields p + 1 equations:

∂ln L/∂βj = Σ i=1..n [yi - π(xi)]*xij = 0,  j = 0, 1, 2, ..., p    (1.9)

(where p is the number of independent variables and xi0 = 1 for the intercept term). These are called the likelihood equations. Because they are nonlinear, they are solved iteratively with the Newton-Raphson method.

1.3 The Newton-Raphson Iterative Method

The second-order partial derivatives, i.e. the entries of the Hessian matrix, are

∂²ln L/∂βj∂βk = -Σ i=1..n xij*xik*π(xi)*[1 - π(xi)]    (1.10)

Written in matrix form, with H denoting the Hessian and X the n x (p+1) design matrix whose i-th row is (1, xi1, ..., xip),

H = -X'VX,  where V = diag{π(x1)[1 - π(x1)], ..., π(xn)[1 - π(xn)]}    (1.11)

Let

u = X'(y - π),  where y = (y1, ..., yn)' and π = (π(x1), ..., π(xn))'    (1.12)

(note that the matrix X must be transposed here). Then u(β) = 0 is the matrix form of the likelihood equations (1.9).

Newton's iterative method then takes the form

β_new = β_old + (X'VX)^(-1) * X'(y - π)    (1.13)

Note that the matrix X'VX in this equation is symmetric positive definite, so the update amounts to solving the linear system (X'VX)x = u, for which the Cholesky decomposition of X'VX can be used.
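The iteration (1.13) can be sketched in a few lines (a minimal NumPy illustration, not production code — there is no convergence check or step halving):

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    # Newton-Raphson iteration (1.13) for logistic regression.
    # X: (n, p) matrix of covariates WITHOUT the intercept column;
    # y: (n,) vector of 0/1 responses.
    n, p = X.shape
    Xd = np.hstack([np.ones((n, 1)), X])       # prepend the intercept column
    beta = np.zeros(p + 1)
    for _ in range(iters):
        pi = 1.0 / (1.0 + np.exp(-Xd @ beta))  # fitted probabilities pi(x_i)
        u = Xd.T @ (y - pi)                    # score vector u = X'(y - pi)
        v = pi * (1.0 - pi)                    # diagonal of V
        H = Xd.T @ (Xd * v[:, None])           # X'VX (the negated Hessian)
        beta = beta + np.linalg.solve(H, u)    # beta_new = beta_old + (X'VX)^-1 u
    return beta
```

At convergence the score vector u is numerically zero, which is exactly the likelihood equation in matrix form.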

The asymptotic variances and covariances of the maximum likelihood estimates can be estimated from the inverse of the information matrix. The information matrix is simply the negative of the matrix of second derivatives, I(β) = -H = X'VX. The variances of the estimates are the diagonal elements of the inverse matrix I(β)^(-1), and the covariances between pairs of estimates are its off-diagonal elements. In most cases, however, we use the standard errors of the estimates, expressed as

SE(βj) = {[I(β)^(-1)]jj}^(1/2),  for j = 0, 1, 2, ..., p    (1.14)

----------------------------------------------------------------

2. Significance Tests

The question addressed here is whether an independent variable in the logistic regression model is significantly related to the response variable. The null hypothesis is H0: βj = 0 (the independent variable has no effect on the probability that the event occurs). If the null hypothesis is rejected, the probability of the event depends on changes in that variable.

2.1 Wald Test

The significance of a regression coefficient is usually assessed with the Wald test, whose formula is

W = [βj / SE(βj)]²    (2.1)

where SE(βj) is the standard error of βj. This single-variable Wald statistic follows a chi-square distribution with 1 degree of freedom.
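The univariate Wald statistic is a one-liner (the coefficient and standard error below are made-up numbers, just to exercise the formula):

```python
def wald_statistic(beta, se):
    # Univariate Wald statistic: (beta / SE(beta))^2,
    # chi-square with 1 degree of freedom under the null hypothesis.
    return (beta / se) ** 2

# E.g. beta = 0.5 with SE = 0.25 gives W = 4.0.
```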

If we need to test the joint hypothesis H0: β1 = β2 = ... = βp = 0, we compute the statistic

W = b'[Cov(b)]^(-1) b    (2.2)

where b is the vector of coefficient estimates with the intercept removed, and Cov(b) is the estimated covariance matrix with the intercept's row and column removed accordingly. This Wald statistic follows a chi-square distribution with p degrees of freedom. Written in matrix form for a general linear hypothesis, there is

W = (Qb)'[Q Cov(b) Q']^(-1) (Qb)    (2.3)

where Q is a constant matrix whose first column is zero, so that the intercept is excluded from the test.

However, when the absolute value of a regression coefficient is large, the estimated standard error of that coefficient is inflated, which makes the Wald statistic too small and increases the probability of a Type II error. In other words, the null hypothesis fails to be rejected when it should be. Therefore, when a regression coefficient has a large absolute value, the Wald statistic should no longer be used to test the null hypothesis; the likelihood ratio test should be used instead.

2.2 The Likelihood Ratio Test

The test statistic is -2 times the difference between the log-likelihood of the model without the variable and that of the model containing it, and it follows a chi-square distribution. This statistic is called the likelihood ratio, expressed as

G = -2 * (ln L_without - ln L_with)    (2.4)

The log-likelihood values are computed with formula (1.8).

If we need to test the joint hypothesis H0: β1 = β2 = ... = βp = 0, we compute the statistic

G = -2 * [n1*ln(n1/n) + n0*ln(n0/n) - ln L_full]    (2.5)

where n0 is the number of observations with y = 0, n1 is the number of observations with y = 1, and n = n0 + n1 is the total number of observations. The first part in the brackets is just the log-likelihood of the intercept-only model, in which every observation is assigned the overall event rate n1/n. The statistic G follows a chi-square distribution with p degrees of freedom.
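The likelihood-ratio computation can be sketched as follows (a minimal illustration; in practice the full-model log-likelihood comes from a fitted model, and the numbers used in the test are invented):

```python
import math

def loglik_intercept_only(n1, n0):
    # Log-likelihood of the model with no covariates: every observation
    # is assigned the overall event rate n1 / (n1 + n0).
    n = n1 + n0
    return n1 * math.log(n1 / n) + n0 * math.log(n0 / n)

def lr_statistic(ll_full, ll_reduced):
    # Likelihood ratio statistic: G = -2 * (lnL_reduced - lnL_full),
    # chi-square distributed under the null hypothesis.
    return -2.0 * (ll_reduced - ll_full)
```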

2.3 Score Test

Under the null hypothesis H0: βj = 0, the parameter estimates are computed with βj fixed at 0. The score statistic is then

S = U(0)² / I(0)    (2.6)

where U(0) is the value of the partial derivative (1.9) of the log-likelihood evaluated at βj = 0, and I(0) is the negative second derivative of the log-likelihood evaluated at βj = 0. The score statistic follows a chi-square distribution with 1 degree of freedom.

2.4 Model fitting Information

After a model is established, its goodness of fit should be assessed and compared. Three measures can serve as the basis for this judgment.

(1) -2 log-likelihood:

-2*ln L    (2.7)

(2) The Akaike Information Criterion (AIC):

AIC = -2*ln L + 2*(k + s)    (2.8)

where k is the number of independent variables in the model and s is the total number of response categories minus 1; for binary logistic regression, s = 2 - 1 = 1. The range of -2*ln L is 0 to +inf, and the smaller the value, the better the fit. As the number of parameters in the model grows, the likelihood becomes larger and -2*ln L becomes smaller, so the term 2*(k + s) is added in the AIC formula to offset the effect of the number of parameters. Other conditions being equal, a smaller AIC value indicates a better-fitting model.

(3) The Schwarz Criterion (SC)

This index further adjusts the -2*ln L value according to the number of independent variables and the number of observations. The SC index is defined as

SC = -2*ln L + (k + s)*ln(n)    (2.9)

where ln(n) is the natural logarithm of the number of observations. This index can only be used to compare different models fitted to the same data. Other conditions being equal, a smaller AIC or SC value indicates a better-fitting model.
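Formulas (2.8) and (2.9) translate directly into code (a minimal sketch; the log-likelihood, k, and n values in the example are invented):

```python
import math

def aic(loglik, k, s=1):
    # Akaike Information Criterion: -2*lnL + 2*(k + s).
    return -2.0 * loglik + 2.0 * (k + s)

def sc(loglik, k, n, s=1):
    # Schwarz Criterion: -2*lnL + (k + s)*ln(n), n = number of observations.
    return -2.0 * loglik + (k + s) * math.log(n)

# E.g. lnL = -55.0 with k = 3 covariates gives AIC = 110 + 8 = 118.
```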

3. Interpreting the Regression Coefficients

3.1 Odds and the Odds Ratio

odds = p/(1 - p), i.e. the ratio of the probability that the event occurs to the probability that it does not. The odds ratio (OR) compares the odds at two different covariate values:

(1) A continuous independent variable. For covariate xj, the odds ratio associated with each one-unit increase is

OR = exp(βj)    (3.1)

(2) A binary independent variable. Such a variable takes only the values 0 and 1 and is called a dummy variable. The odds ratio of the value 1 relative to the value 0 is

OR = exp(βj)    (3.2)

i.e. e raised to the power of the corresponding coefficient.

(3) A polytomous (multi-category) independent variable.

If a categorical variable has m categories, m - 1 dummy variables must be created, and the omitted category is called the reference category. If the dummy variables have coefficients β1, ..., β(m-1), then the odds ratio of category l relative to the reference category is exp(βl).
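The exp(β) interpretation in (3.1) and (3.2) is a single line of code (the coefficient value in the comment is an invented example):

```python
import math

def odds_ratio(beta, delta=1.0):
    # Odds ratio: change in the odds for a delta-unit increase
    # in a covariate whose coefficient is beta.
    return math.exp(beta * delta)

# A coefficient of ln(2) means each unit increase doubles the odds.
```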

3.2 Confidence intervals of the logistic regression coefficients

For confidence level 1 - α, the 100(1 - α)% confidence interval for a parameter βj is

βj ± z(1-α/2) * SE(βj)    (3.3)

where z(1-α/2) is the critical value under the standard normal curve, SE(βj) is the standard error of the coefficient estimate, and the two resulting values are the lower and upper bounds of the confidence interval. For large samples, at the α = 0.05 level the 95% confidence interval of a coefficient is

βj ± 1.96 * SE(βj)    (3.4)
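The Wald interval, plus the corresponding interval for the odds ratio obtained by exponentiating its endpoints, as a sketch (the coefficient and standard error in the test are invented numbers):

```python
import math

def wald_ci(beta, se, z=1.96):
    # Wald confidence interval for a coefficient: beta ± z * SE(beta).
    return beta - z * se, beta + z * se

def odds_ratio_ci(beta, se, z=1.96):
    # Exponentiating the endpoints gives a confidence interval for exp(beta).
    lo, hi = wald_ci(beta, se, z)
    return math.exp(lo), math.exp(hi)
```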

----------------------------------------------------------------

4. Variable Selection

4.1 Forward selection: starting from the intercept-only model, independent variables that meet the significance level are added to the model one at a time.

The specific selection procedure is as follows

(1) The constant (i.e. the intercept) enters the model.

(2) The score test value of each candidate variable is computed with formula (2.6), along with the corresponding p-value.

(3) Find the smallest p-value; if it is below the significance level, that variable enters the model. If the variable is a dummy variable of a nominal variable, the other dummy variables of that nominal variable enter the model as well. Otherwise, no variable can be selected into the model and the selection procedure terminates.

(4) Return to (2) for the next selection round.
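The four steps above can be sketched generically; the p-value computation (e.g. the score test) is abstracted into a caller-supplied function, and the variable names and p-values in the test are invented:

```python
def forward_select(candidates, pvalue, alpha=0.05):
    # Forward selection sketch: starting from the intercept-only model,
    # repeatedly add the candidate with the smallest p-value while it is
    # below the significance level alpha.
    # `pvalue(var, selected)` must return the p-value of `var` given the
    # variables already selected (e.g. from a score test).
    selected = []
    remaining = list(candidates)
    while remaining:
        best = min(remaining, key=lambda v: pvalue(v, selected))
        if pvalue(best, selected) >= alpha:
            break  # no remaining variable is significant: stop
        selected.append(best)
        remaining.remove(best)
    return selected
```

The grouped handling of a nominal variable's dummy variables (step 3) is omitted for brevity.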

4.2 Backward selection: starting from the model that includes all candidate variables, independent variables that do not meet the retention requirement are removed one at a time.

The specific selection procedure is as follows

(1) All variables enter the model.

(2) The Wald test value of every variable is computed with formula (2.1), along with the corresponding p-value.

(3) Find the largest p-value; if it is greater than the significance level, that variable is removed. For a dummy variable of a nominal variable: if the smallest p-value among that nominal variable's dummies is greater than the significance level, the other dummy variables of the nominal variable are deleted as well. Otherwise, no variable can be removed and the selection procedure terminates.

(4) Return to (2) for the next removal round.

4.3 Stepwise regression (stepwise selection)

(1) Basic idea: introduce the independent variables one at a time. At each step, the variable with the most significant effect on y is introduced, and the variables already in the equation are re-tested one by one; those that are no longer significant are removed from the equation. The resulting equation therefore neither misses any variable with a significant effect on y nor contains any variable without one.

(2) Screening steps: first set the significance level for introducing a variable and the significance level for removing a variable, then screen the variables.

(3) The basic steps of the Stepwise screening method

The variable-screening process consists of two basic steps: considering the introduction of new variables from among those not yet in the equation, and considering the removal of non-significant variables from the regression equation.

Suppose there are p variables that are candidates for entry into the regression equation.

① Fit the model containing only the intercept term and compute its maximum likelihood estimate. Compute the score test value of each of the p candidate variables separately, along with its p-value. Let the variable with the smallest p-value be denoted xe1 (for a dummy variable, the smallest p-value among the dummies of its nominal variable is used). If this p-value is below the entry level αE, the variable enters the model; otherwise the procedure stops. If this variable is a dummy variable of a nominal variable, the other dummy variables of that nominal variable enter the model as well. αE is the significance level for introducing variables.

② To determine whether the other p - 1 variables are important once xe1 is in the model, fit the models containing xe1 together with each remaining variable separately. Compute the score test values of the p - 1 variables and their p-values. Let the variable with the smallest p-value be xe2. If this p-value is below αE, go to the next step; otherwise stop. Dummy variables are handled as in step ①.

③ This step starts from the model that already contains xe1 and xe2. Note that a variable may cease to be important after other variables are introduced, so this step includes backward removal. Compute the Wald test value of each variable in the model with (2.1) and the corresponding p-value. Let the variable with the largest p-value be xr. If this p-value is greater than the removal level αR, the variable is deleted from the model; otherwise stop. For a nominal variable: if the smallest p-value among its dummy variables is greater than αR, the whole nominal variable is removed from the model.

④ Thus, every time a variable is added in a forward step, a backward-removal check follows. The procedure terminates when all p variables have entered the model, when every variable in the model has a p-value below αR while every variable outside it has a p-value above αE, or when a variable enters the model and is deleted at the very next step, which would form an endless loop.

