Logistic regression is a classification algorithm that can handle both binary and multiclass classification. Although its name contains the word "regression", it is not a regression algorithm. So why does it carry this misleading word? Personally I think it is because, although logistic regression is a classification model, its principle still carries the shadow of the regression model. This post summarizes the principle of logistic regression.
1. From linear regression to logistic regression
We know that the linear regression model finds the linear relationship between the output vector Y and the input sample matrix X, with coefficients \(\theta\) satisfying \(\mathbf{Y = X\theta}\). Here Y is continuous, so this is a regression model. What if we want Y to be discrete? One natural idea is to apply a further transformation to this Y, say \(g(Y)\). If we let the value of \(g(Y)\) correspond to class A on one real interval and class B on another, we obtain a classification model. If the result has only two categories, we get a binary classification model. This is exactly the starting point of logistic regression. Let us begin with binary logistic regression.
2. Binary logistic regression model
In the previous section we mentioned that applying a transformation by a function g to the result of linear regression turns it into logistic regression. In logistic regression this function g is usually taken to be the sigmoid function, whose form is:
\(g(z) = \frac{1}{1+e^{-z}}\)
It has a very nice property: when z tends to positive infinity, \(g(z)\) tends to 1, and when z tends to negative infinity, \(g(z)\) tends to 0, which suits a classification probability model very well. It also has a very convenient derivative:
\(g'(z) = g(z)(1-g(z))\)
This is easily obtained by differentiating \(g(z)\), and we will use it later.
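Concretely, differentiating the quotient gives:

\(g'(z) = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}} = g(z)(1-g(z))\)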
If we take the z in \(g(z)\) to be \(z = x\theta\), we get the general form of the binary logistic regression model:
\(h_{\theta}(x) = \frac{1}{1+e^{-x\theta}}\)
where x is the sample input and \(h_{\theta}(x)\) is the model output, which can be understood as the probability of belonging to a certain class; \(\theta\) is the model parameter to be learned. For the model output \(h_{\theta}(x)\) and the binary sample output y (assumed to be 0 or 1), we have the following correspondence: if \(h_{\theta}(x) > 0.5\), i.e. \(x\theta > 0\), then y is 1; if \(h_{\theta}(x) < 0.5\), i.e. \(x\theta < 0\), then y is 0. \(h_{\theta}(x) = 0.5\) is the critical case, where \(x\theta = 0\) and the class cannot be determined by the logistic regression model itself.
The smaller the value of \(h_{\theta}(x)\), the higher the probability that the class is 0; the larger the value, the higher the probability that the class is 1. Close to the critical point, the classification accuracy decreases.
We can also write the model in matrix form:
\(h_{\theta}(X) = \frac{1}{1+e^{-X\theta}}\)
where \(h_{\theta}(X)\) is the model output, an m×1 vector; X is the sample feature matrix, of dimension m×n; and \(\theta\) is the coefficient vector of the classification model, of dimension n×1.
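To make the notation concrete, here is a minimal NumPy sketch of the hypothesis \(h_{\theta}(X)\) and the 0.5-threshold decision rule described above (the function names are illustrative, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, theta):
    """h_theta(X) = sigmoid(X theta); X is m x n, theta has length n."""
    return sigmoid(X @ theta)

def predict(X, theta):
    """Classify as 1 when h_theta(x) > 0.5, i.e. when x theta > 0."""
    return (predict_proba(X, theta) > 0.5).astype(int)
```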
Now that we understand the binary logistic regression model, let us look at its loss function; our goal is to minimize the loss function to obtain the corresponding model coefficients \(\theta\).
3. Loss function of binary logistic regression
Recall the loss function of linear regression: since linear regression is continuous, its loss function can be defined as the sum of squared model errors. But the output of logistic regression is not continuous, so the natural squared-error loss of linear regression cannot be reused. Instead, we can use the maximum likelihood method to derive our loss function.
According to the definition of logistic regression in Section 2, we assume that the sample output takes one of the two classes 0 and 1. Then we have:
\(P(y=1|x,\theta) = h_{\theta}(x)\)

\(P(y=0|x,\theta) = 1 - h_{\theta}(x)\)
Combining these two formulas into one:
\(P(y|x,\theta) = h_{\theta}(x)^{y}(1-h_{\theta}(x))^{1-y}\)
where y can only take the value 0 or 1.
In matrix notation this is:

\(P(Y|X,\theta) = h_{\theta}(X)^{Y}(E - h_{\theta}(X))^{1-Y}\), where E is a vector of all ones with the same dimension as \(h_{\theta}(X)\).
Having obtained the probability distribution of y, we can maximize the likelihood function to solve for the model coefficients \(\theta\) we need.
To make the solution easier, we maximize the log-likelihood; the negative of the log-likelihood is our loss function \(J(\theta)\).
In algebraic form the likelihood function is:
\(L(\theta) = \prod\limits_{i=1}^{m}\left(h_{\theta}(x^{(i)})\right)^{y^{(i)}}\left(1-h_{\theta}(x^{(i)})\right)^{1-y^{(i)}}\)
where m is the number of samples.
Taking the negative logarithm of the likelihood, the loss function expression is:
\(J(\theta) = -\ln L(\theta) = -\sum\limits_{i=1}^{m}\left(y^{(i)}\log(h_{\theta}(x^{(i)})) + (1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right)\)
In matrix form, the loss function is expressed more concisely as:
\(J(\theta) = -Y\bullet \log h_{\theta}(X) - (E-Y)\bullet \log(E - h_{\theta}(X))\)
where E is a vector of all ones and \(\bullet\) denotes the inner product.
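As a sanity check on the expressions above, here is a short NumPy sketch of this loss; the names are illustrative, and the small clipping constant is only a numerical safeguard against log(0), not part of the derivation:

```python
import numpy as np

def log_loss(theta, X, y):
    """J(theta) = -sum_i [ y_i*log(h_i) + (1 - y_i)*log(1 - h_i) ].

    X is the m x n sample matrix, y is a length-m vector of 0/1 labels.
    """
    h = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(X), length m
    h = np.clip(h, 1e-12, 1.0 - 1e-12)     # avoid log(0)
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```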
4. Optimizing the loss function of binary logistic regression
There are many methods for minimizing the loss function of binary logistic regression; the most common are gradient descent, coordinate descent, Newton's method, and so on. Here we derive the formula for each gradient descent iteration of \(\theta\). Because the algebraic derivation is tedious, I prefer to use matrix calculus for optimizing the loss function, so the gradient of binary logistic regression is derived here with the matrix method.
For \(J(\theta) = -Y\bullet \log h_{\theta}(X) - (E-Y)\bullet \log(E - h_{\theta}(X))\), we take the derivative of \(J(\theta)\) with respect to the vector \(\theta\):
\(\frac{\partial}{\partial\theta}J(\theta) = -Y \bullet X^T\frac{1}{h_{\theta}(X)}h_{\theta}(X)(E-h_{\theta}(X)) + (E-Y) \bullet X^T\frac{1}{E-h_{\theta}(X)}h_{\theta}(X)(E-h_{\theta}(X))\)
In this step we used the chain rule of matrix differentiation and the following three derivative formulas:
\(\frac{\partial}{\partial x}\log x = \frac{1}{x}\)
\(\frac{\partial}{\partial z}g(z) = g(z)(1-g(z))\) (where \(g(z)\) is the sigmoid function)
\(\frac{\partial}{\partial\theta}X\theta = X^T\)
Simplifying the derivative above, we get:
\(\frac{\partial}{\partial\theta}J(\theta) = X^T(h_{\theta}(X) - Y)\)
Thus the update of the vector \(\theta\) at each step of gradient descent is:
\(\theta = \theta - \alpha X^T(h_{\theta}(X) - Y)\)
where \(\alpha\) is the step size of gradient descent.
In practice, we generally do not need to worry about the optimization method ourselves, since most machine learning libraries have built-in optimizers for logistic regression. Still, it is worth understanding at least one of them.
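As an illustration, here is a minimal sketch of batch gradient descent with the update \(\theta = \theta - \alpha X^T(h_{\theta}(X) - Y)\) derived above; the division by m is an extra scaling I add for convenience (it only rescales the step size \(\alpha\)), and the function name and defaults are arbitrary:

```python
import numpy as np

def fit_logistic_gd(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent for binary logistic regression."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(X)
        grad = X.T @ (h - y)                    # X^T (h_theta(X) - Y)
        theta -= alpha / m * grad               # theta <- theta - alpha * gradient
    return theta
```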
5. Regularization of binary logistic regression
Logistic regression also faces the problem of overfitting, so we need to consider regularization as well. The common choices are L1 regularization and L2 regularization.
Compared with the ordinary logistic regression loss function, L1 regularization adds the L1 norm as a penalty term, with a hyperparameter \(\alpha\) as the penalty coefficient that adjusts the size of the penalty.
The L1-regularized loss function of binary logistic regression is:
\(J(\theta) = -Y\bullet \log h_{\theta}(X) - (E-Y)\bullet \log(E - h_{\theta}(X)) + \alpha\|\theta\|_1\)
where \(\|\theta\|_1\) is the L1 norm of \(\theta\).
The L1-regularized loss function of logistic regression is commonly optimized with coordinate descent or least angle regression.
The L2-regularized loss function of binary logistic regression is:
\(J(\theta) = -Y\bullet \log h_{\theta}(X) - (E-Y)\bullet \log(E - h_{\theta}(X)) + \frac{1}{2}\alpha\|\theta\|_2^2\)
where \(\|\theta\|_2\) is the L2 norm of \(\theta\).
The optimization method of L2 regularization loss function for logistic regression is similar to that of normal logistic regression.
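For reference, scikit-learn's LogisticRegression supports both penalties; a minimal sketch (the value of C here is arbitrary, and note that C is the inverse of the regularization strength rather than \(\alpha\) itself):

```python
from sklearn.linear_model import LogisticRegression

# L2 penalty (the scikit-learn default); larger C means weaker regularization.
clf_l2 = LogisticRegression(penalty="l2", C=1.0)

# L1 penalty; the liblinear solver (coordinate-descent based) supports it.
clf_l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")

# Typical usage (X_train, y_train, X_test are assumed to exist):
# clf_l1.fit(X_train, y_train)
# y_pred = clf_l1.predict(X_test)
```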
6. Extending binary logistic regression: multinomial logistic regression
In the previous sections, our logistic regression model and loss function were limited to binary logistic regression. In fact, the binary model and loss function can easily be generalized to multinomial logistic regression. For example, one class can always be treated as positive and all the others as negative; this is the most commonly used one-vs-rest approach, abbreviated OvR.
Recall binary logistic regression:
\(P(y=1|x,\theta) = h_{\theta}(x) = \frac{1}{1+e^{-x\theta}} = \frac{e^{x\theta}}{1+e^{x\theta}}\)

\(P(y=0|x,\theta) = 1 - h_{\theta}(x) = \frac{1}{1+e^{x\theta}}\)
where y can only take the values 0 and 1. Then we have:
\(\ln\frac{P(y=1|x,\theta)}{P(y=0|x,\theta)} = x\theta\)
To generalize to multinomial logistic regression, the model needs to be expanded slightly.
Assume a K-class classification model, that is, the sample output y takes the values 1, 2, ..., K.
Following the binary logistic regression case, we have:
\(\ln\frac{P(y=1|x,\theta)}{P(y=K|x,\theta)} = x\theta_1\)

\(\ln\frac{P(y=2|x,\theta)}{P(y=K|x,\theta)} = x\theta_2\)

...

\(\ln\frac{P(y=K-1|x,\theta)}{P(y=K|x,\theta)} = x\theta_{K-1}\)
The above gives K-1 equations.
Adding the constraint that the probabilities sum to 1 gives one more equation:
\(\sum\limits_{i=1}^{K}P(y=i|x,\theta) = 1\)
We now have K equations in total. Solving this system of K linear equations yields the probability distribution of K-class logistic regression:
\(P(y=k|x,\theta) = e^{x\theta_k}\bigg/\left(1+\sum\limits_{t=1}^{K-1}e^{x\theta_t}\right)\), for k = 1, 2, ..., K-1

\(P(y=K|x,\theta) = 1\bigg/\left(1+\sum\limits_{t=1}^{K-1}e^{x\theta_t}\right)\)
The derivation of the loss function and the optimization methods for multinomial logistic regression are similar to binary logistic regression and are not repeated here.
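To make the K-class probabilities above concrete, here is a small NumPy sketch; the names are illustrative, and the K-th class is treated as the reference class with its score fixed at 0, exactly as in the two formulas above:

```python
import numpy as np

def multinomial_proba(x, thetas):
    """Class probabilities for K-class logistic regression.

    x: feature vector of length n.
    thetas: (K-1) x n matrix whose rows are theta_1, ..., theta_{K-1};
            the K-th (reference) class has its score fixed at 0.
    Returns a length-K probability vector that sums to 1.
    """
    scores = np.exp(thetas @ x)                # e^{x theta_k}, k = 1..K-1
    denom = 1.0 + np.sum(scores)               # 1 + sum_t e^{x theta_t}
    return np.append(scores, 1.0) / denom      # classes 1..K-1, then class K
```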
7. Summary
Logistic regression, especially binary logistic regression, is a very common model. It trains very quickly; although it is not as fashionable as the support vector machine (SVM), it is good enough for ordinary classification problems and trains much faster than SVM. If you want to learn machine learning classification algorithms, personally I think the first one to learn should be logistic regression. Once you understand logistic regression, other classification algorithms should not be difficult to learn.
(Reprints are welcome, but please indicate the source. Feedback is welcome: [email protected])