Logistic regression
Logistic regression is typically used as a binary classifier (it can also be extended to multi-class classification), for example in problems such as:
- Email: spam / not spam
- Tumor: malignant / benign
Hypothesis: $$h_\theta(x) = g(\theta^T x)$$ $$g(z) = \frac{1}{1+e^{-z}}$$ where $g(z)$ is called the sigmoid function. From its graph we can see that the predicted value lies in the range $(0, 1)$, so when $h_\theta(x) \geq 0.5$ the model outputs $y = 1$, and when $h_\theta(x) < 0.5$ the model outputs $y = 0$.
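A minimal sketch of the hypothesis and the 0.5 threshold in code (assuming NumPy and that the design matrix `X` already contains an all-ones intercept column; function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid (logistic) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X):
    """Predict y = 1 when h_theta(x) = g(theta^T x) >= 0.5, else y = 0."""
    h = sigmoid(X @ theta)          # h_theta(x) for every example
    return (h >= 0.5).astype(int)   # threshold at 0.5
```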
1. Interpretation of the output
$h_\theta(x)$ is the probability that the example belongs to the class $y = 1$, that is, $$h_\theta(x) = P(y = 1 \mid x; \theta)$$ Since $y$ can only take the values 0 or 1 (each example belongs either to class 0 or to class 1), if the probability of belonging to class 1 is $p$, then the probability of belonging to class 0 is of course $1 - p$, so we have the following conclusions: $$P(y=1 \mid x;\theta) + P(y=0 \mid x;\theta) = 1$$ $$P(y=0 \mid x;\theta) = 1 - P(y=1 \mid x;\theta)$$
2. Decision boundary
The function $g(z)$ is monotone, so
- $h_\theta(x) \geq 0.5$ predicting output $y=1$ is equivalent to $\theta^T x \geq 0$ predicting output $y=1$;
- $h_\theta(x) < 0.5$ predicting output $y=0$ is equivalent to $\theta^T x < 0$ predicting output $y=0$.
This does not depend on the specific form of the sigmoid function: solving $\theta^T x \geq 0$ is enough to obtain the corresponding classification boundary. Examples of a linear and a nonlinear classification boundary are given below.
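As a hypothetical illustration (parameter values chosen for exposition, not taken from the notes): with two features and $\theta = (-3, 1, 1)^T$,

$$\theta^T x \geq 0 \iff -3 + x_1 + x_2 \geq 0 \iff x_1 + x_2 \geq 3,$$

so the straight line $x_1 + x_2 = 3$ is a linear decision boundary. Adding polynomial features allows nonlinear boundaries; for example, $\theta = (-1, 0, 0, 1, 1)^T$ with features $(1, x_1, x_2, x_1^2, x_2^2)$ gives the circular boundary $x_1^2 + x_2^2 = 1$.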
3. Cost function
In linear regression the cost function was defined by least squares:
$$J(\theta) = \frac{1}{m}\sum\limits_{i=1}^{m}\text{cost}(h_\theta(x^{(i)}), y^{(i)})$$ $$\text{cost}(h_\theta(x), y) = \frac{1}{2}(h_\theta(x) - y)^2$$
However, since the hypothesis here is the sigmoid function, using the above cost function directly would make $J(\theta)$ non-convex, and gradient descent could not be relied on to find the minimum. We therefore define the logistic cost function as
$$\text{cost}(h_\theta(x), y) = \begin{cases}-\log(h_\theta(x)) & y = 1\\ -\log(1-h_\theta(x)) & y = 0\end{cases}$$
As can be seen from the function's graph, when $y=1$ a correct prediction ($h_\theta(x) = 1$) has zero cost, whereas a wrong prediction ($h_\theta(x) = 0$) has a very large cost, which matches our expectations. Similarly, when $y=0$ a correct prediction ($h_\theta(x) = 0$) has zero cost, and a wrong prediction ($h_\theta(x) = 1$) has a very large cost. This shows that the cost function is defined very reasonably.
4. A simplified cost function
The preceding cost function is piecewise. To make computation more convenient, it can be written as a single expression:
$$\text{cost}(h_\theta(x), y) = -y\log(h_\theta(x)) - (1-y)\log(1-h_\theta(x))$$
$$J(\theta) = -\frac{1}{m}\sum\limits_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]$$
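A minimal NumPy sketch of this cost function (assuming 0/1 labels and a design matrix with the intercept column; names are illustrative):

```python
import numpy as np

def cost(theta, X, y):
    """Logistic regression cost J(theta), without regularization.

    X: (m, n+1) design matrix with a leading column of ones
    y: (m,) vector of 0/1 labels
    """
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # h_theta(x^{(i)}) for all i
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
```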
Gradient Descent
With the cost function in hand, the problem becomes an optimization problem of finding the minimum, which can be solved by gradient descent. The update formula for the parameter $\theta$ is
$$\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\theta)$$
The partial derivative of $J(\theta)$ is $$\frac{\partial}{\partial \theta_j}J(\theta) = \frac{1}{m}\sum\limits_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$$ Note that in logistic regression the hypothesis $h_\theta(x)$ has changed (the sigmoid was added) and the cost function $J(\theta)$ has changed (negative logarithm instead of least squares), yet the result above shows that the partial derivative has exactly the same form as in linear regression. The parameter update formula for $\theta$ is therefore also the same, as follows:
$$\theta_j = \theta_j - \alpha \frac{1}{m}\sum\limits_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$$
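A minimal batch gradient descent sketch based on this update rule (the learning rate and iteration count are illustrative placeholders, not values from the notes):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Batch gradient descent for logistic regression.

    Each step applies theta_j := theta_j - alpha/m * sum_i (h(x_i) - y_i) * x_ij.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # predictions under current theta
        theta -= alpha / m * (X.T @ (h - y))     # simultaneous update of all theta_j
    return theta
```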
Why does the partial derivative have the same form as in linear regression? The open class does not give the proof; it mainly relies on applying the chain rule to composite functions several times, and a detailed derivation can be found in reference [2].
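A brief sketch of that chain-rule derivation for a single example, using $g'(z) = g(z)(1 - g(z))$ so that $\frac{\partial}{\partial \theta_j} h_\theta(x) = h_\theta(x)(1 - h_\theta(x))\,x_j$:

$$\frac{\partial}{\partial \theta_j}\Big[-y\log h_\theta(x) - (1-y)\log(1-h_\theta(x))\Big] = -\left(\frac{y}{h_\theta(x)} - \frac{1-y}{1-h_\theta(x)}\right) h_\theta(x)\big(1-h_\theta(x)\big)\, x_j = \big(h_\theta(x) - y\big)\, x_j$$

Averaging over the $m$ training examples gives the formula above.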
Advanced optimization algorithms
Besides gradient descent, more advanced optimization algorithms can be used, such as the ones listed below (see the sketch after this list). Their advantage is that they do not require manually choosing the learning rate $\alpha$ and they usually converge faster than gradient descent; their disadvantage is that they are more complex.
- Conjugate gradient (conjugate gradient method)
- BFGS (a quasi-Newton method)
- L-BFGS (a limited-memory variant of BFGS)
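As an illustration (not part of the original notes), SciPy's general-purpose `minimize` can run such optimizers when given the cost and its gradient; the snippet below is a sketch under that assumption:

```python
import numpy as np
from scipy.optimize import minimize

def cost_and_grad(theta, X, y):
    """Return J(theta) and its gradient, in the form SciPy's optimizers expect."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    J = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    grad = X.T @ (h - y) / m
    return J, grad

# Usage (X, y defined elsewhere):
# theta0 = np.zeros(X.shape[1])
# result = minimize(cost_and_grad, theta0, args=(X, y), jac=True, method='L-BFGS-B')
# theta_opt = result.x
```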
Logistic regression for multi-class classification
Logistic regression can be used for multi-class classification via the so-called one-vs-all method. Specifically, suppose there are K classes {1, 2, ..., K}. We first train an LR model that separates class 1 from everything that is not class 1, then train a second LR model that separates class 2 from everything that is not class 2, and so on, until the K-th LR model has been trained.
For a new example, we feed it into each of the K trained models, each of which computes a predicted value (as explained earlier, the predicted value represents the probability of belonging to that class), and we select the class with the largest predicted value as the predicted classification.
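A sketch of one-vs-all training and prediction, assuming classes are numbered 1..K and `train_lr` is a hypothetical helper that fits a binary logistic regression (for example, the gradient descent sketch above):

```python
import numpy as np

def train_one_vs_all(X, y, K, train_lr):
    """Train K one-vs-all classifiers; returns a (K, n+1) array of parameters."""
    thetas = []
    for k in range(1, K + 1):
        labels = (y == k).astype(int)     # class k vs. "not class k"
        thetas.append(train_lr(X, labels))
    return np.array(thetas)

def predict_one_vs_all(thetas, X):
    """Pick the class whose model gives the largest h_theta(x)."""
    h = 1.0 / (1.0 + np.exp(-(X @ thetas.T)))   # (m, K) matrix of predicted values
    return np.argmax(h, axis=1) + 1             # classes numbered 1..K
```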
Regularization
1. Overfitting
The three images above show the data fitted with a simple, a medium, and a complex model. The model on the left is too simple to capture the characteristics of the data (this is called underfitting); the middle model captures them well; the most complex model on the right passes through every data point. On the surface it matches the data very well, but when predicting new examples it fails to capture the trend. This is called overfitting.
Overfitting usually occurs when the model has too many features: the model fits the training set very well (the cost function $J(\theta)$ is close to 0), but when the model is generalized to new data its predictions perform poorly.
Overfitting Solutions
- Reduce the number of features:
- Manually select important features and discard unnecessary ones
- Use an algorithm to select features (e.g., the PCA algorithm)
- Regularization
- Keep the number of features unchanged, but reduce the magnitude (values) of the parameters $\theta_j$
- This approach works well when there are many features, each of which contributes a little to the prediction
2. Regularization of linear regression
Add a penalty on the parameters to the original cost function, as shown below. Note that the penalty term starts from $j=1$: the 0th feature is the all-ones intercept term and does not need to be penalized.
Cost function:
$$J(\theta) = \frac{1}{2m}\left[\sum\limits_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum\limits_{j=1}^{n}\theta_j^{2}\right]$$
Gradient Descent parameter update:
$$\theta_0 = \theta_0 - \alpha\frac{1}{m}\sum\limits_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)}; \quad j = 0$$
$$\theta_j = \theta_j - \alpha \left[\frac{1}{m}\sum\limits_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j \right]; \quad j \geq 1$$
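A sketch of one such update step in NumPy, assuming `X` includes the all-ones intercept column so that `theta[0]` is $\theta_0$ and is left unpenalized:

```python
import numpy as np

def regularized_linear_gd_step(theta, X, y, alpha, lam):
    """One gradient descent step for regularized linear regression."""
    m = len(y)
    h = X @ theta                         # linear hypothesis h_theta(x) = theta^T x
    grad = X.T @ (h - y) / m              # unregularized gradient
    grad[1:] += lam / m * theta[1:]       # add lambda/m * theta_j for j >= 1 only
    return theta - alpha * grad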
3. Regularization of logistic regression
Cost function: $$J(\theta) = -\frac{1}{m}\sum\limits_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum\limits_{j=1}^{n}\theta_j^{2}$$
Gradient Descent parameter update:
$$\theta_0 = \theta_0 - \alpha\frac{1}{m}\sum\limits_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)}; \quad j = 0$$
$$\theta_j = \theta_j - \alpha \left[\frac{1}{m}\sum\limits_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j \right]; \quad j \geq 1$$
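Correspondingly, a sketch of the regularized logistic cost and gradient (again not penalizing $\theta_0$), which could also be handed to the advanced optimizers mentioned earlier:

```python
import numpy as np

def regularized_logistic_cost_grad(theta, X, y, lam):
    """Regularized logistic cost J(theta) and its gradient (theta_0 not penalized)."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    J = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m \
        + lam / (2 * m) * np.sum(theta[1:] ** 2)
    grad = X.T @ (h - y) / m
    grad[1:] += lam / m * theta[1:]
    return J, grad
```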
References
[1] Andrew Ng, Machine Learning open course on Coursera, Week 2
[2] Derivative of the logistic cost function: http://feature-space.com/en/post24.html