This article is a personal study note for Professor Andrew Ng's machine learning course at Stanford University; where the two differ, the original course material takes precedence. Thanks to blogger Rachel Zhang, whose personal notes provided an excellent reference and model for my own.
§3. Logistic Regression
1 Classification
First, the concept of a classification problem is introduced: in a classification problem, the $y$ to be predicted takes discrete values. For example, determining whether an e-mail is spam, whether an online transaction is fraudulent, or whether a tumor is benign or malignant are all classification problems.
For problems with exactly two classes (such as the three examples above), the two classes can be labeled the positive class and the negative class. Which class is labeled positive is in practice arbitrary, but by convention the positive class represents the presence of something and the negative class its absence.
Classification problems can be divided into multiclass classification problems and binary classification problems.
Andrew Ng takes tumor classification as an example to explain why linear regression works poorly for classification problems.
Given the data set shown in the lecture, if linear regression is applied and $h_{\theta}(x) = 0.5$ is used as the threshold for classifying tumors, that is, the projection onto the horizontal axis of the point where $h_{\theta}(x) = 0.5$ serves as the dividing line, with tumors to its left predicted benign and tumors to its right predicted malignant, then the predictions are quite good.
However, after the rightmost data point is added, the line representing $h_{\theta}(x)$ changes from the purple line to the blue line, and the prediction accuracy with the threshold $h_{\theta}(x) = 0.5$ visibly drops.
Moreover, if linear regression is applied to a classification problem with $y \in \{0, 1\}$, the hypothesis may output $h_{\theta}(x) < 0$ or $h_{\theta}(x) > 1$, and $h_{\theta}(x)$ may in fact be far below 0 or far above 1. For these reasons, linear regression is not a suitable method for classification problems.
2 Logistic Regression
To address this, the logistic regression algorithm, whose hypothesis satisfies $0 \le h_{\theta}(x) \le 1$, is introduced. Although "regression" appears in its name, logistic regression is actually a classification algorithm.
First the logistic function, also known as the sigmoid function, is introduced: $g(z) = \frac{1}{1 + e^{-z}}$, with the hypothesis $h_{\theta}(x) = g(\theta^{T}x)$. The logistic function approaches 1 as $z \to +\infty$, approaches 0 as $z \to -\infty$, and equals 0.5 at $z = 0$.
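As a quick illustration, here is a minimal Octave sketch of the logistic function (the function name sigmoid is my choice, though it matches the course exercises):

```octave
% sigmoid.m -- logistic (sigmoid) function, applied element-wise
% so it also works on vectors and matrices of z values.
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));
end
```

For instance, sigmoid(0) returns 0.5 and sigmoid(10) is very close to 1, matching the properties above.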
Andrew Ng explains what $P(y=1|x;\theta)$ represents: the probability that $y = 1$ for input $x$, parameterized by $\theta$. For example, $h_{\theta}(x) = 0.7$ means a 70% chance that the tumor is malignant. An important property follows: $P(y=1|x;\theta) + P(y=0|x;\theta) = 1$.
The lecture then works through examples that test these points.
3 Decision Boundary
The decision boundary divides the plane into the two prediction regions $y=1$ and $y=0$: where $\theta^{T}x \ge 0$ we have $h_{\theta}(x) \ge 0.5$ and predict $y = 1$; where $\theta^{T}x < 0$ we have $h_{\theta}(x) < 0.5$ and predict $y = 0$.
The decision boundary is a property not of the training set but of the hypothesis and its parameters: once $\theta$ is given, the decision boundary is determined. We do not use the training set to define the decision boundary; we use it to fit the parameters $\theta$.
Plotted on the same plane, the training set and the decision boundary should look similar to the figure shown in the lecture.
In another example, $\theta^{T}x = 5 - x_{1}$; $\theta^{T}x \ge 0$ holds exactly when $x_{1} \le 5$, so the region $x_{1} \le 5$ is predicted $y = 1$, and the vertical line $x_{1} = 5$ is the decision boundary of the hypothesis.
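The thresholding rule translates directly into code. A minimal Octave sketch, assuming a design matrix X whose first column is all ones and reusing the sigmoid function above (the name predict is my own):

```octave
% predict.m -- predict y = 1 where h_theta(x) >= 0.5,
% which is equivalent to theta' * x >= 0.
% X is an m-by-(n+1) design matrix whose first column is all ones.
function p = predict(theta, X)
  p = (X * theta) >= 0;   % same result as sigmoid(X * theta) >= 0.5
end
```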
A nonlinear decision boundary arises when higher-order polynomial features are added: the hypothesis can then produce complex boundaries rather than a straight line separating the positive and negative samples.
For example, with $h_{\theta}(x) = g(\theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \theta_{3}x_{1}^{2} + \theta_{4}x_{2}^{2})$ and $\theta = (-1, 0, 0, 1, 1)^{T}$, the prediction is $y = 1$ whenever $x_{1}^{2} + x_{2}^{2} \ge 1$, so the decision boundary is the circle $x_{1}^{2} + x_{2}^{2} = 1$.
4 Cost Function
The cost function used in the logistic regression model is:
$$\mathrm{Cost}(h_{\theta}(x), y) = \begin{cases} -\log(h_{\theta}(x)) & \text{if } y = 1 \\ -\log(1 - h_{\theta}(x)) & \text{if } y = 0 \end{cases}$$
For $y = 1$: if the prediction is correct ($h_{\theta}(x) = 1$), the cost is 0; if the prediction is wrong, the cost tends to infinity as $h_{\theta}(x)$ tends to 0. That is, a wrong prediction is penalized with a very large cost.
For $y = 0$ the situation is mirrored: $\mathrm{Cost} = 0$ if $y = 0$ and $h_{\theta}(x) = 0$, but as $h_{\theta}(x) \rightarrow 1$, $\mathrm{Cost} \rightarrow \infty$. This captures the intuition that if $h_{\theta}(x) = 1$ (predicting $P(y=0|x;\theta) = 0$) while in fact $y = 0$, the learning algorithm is penalized by a very large cost.
5 Simplified Cost Function and Gradient Descent
Because $y$ takes only the two values 0 and 1, the two cases of the cost function can be combined into a single expression:
$$\mathrm{Cost}(h_{\theta}(x), y) = -y\log(h_{\theta}(x)) - (1 - y)\log(1 - h_{\theta}(x))$$
giving the overall cost function
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_{\theta}(x^{(i)}) + (1 - y^{(i)})\log\left(1 - h_{\theta}(x^{(i)})\right)\right]$$
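A minimal vectorized Octave sketch of this cost and its gradient (used in the next subsection); the name costFunction and the argument layout are my assumptions, and it reuses the sigmoid sketch above:

```octave
% costFunction.m -- logistic regression cost J(theta) and its gradient.
% X: m-by-(n+1) design matrix, y: m-by-1 vector of 0/1 labels.
function [J, grad] = costFunction(theta, X, y)
  m = length(y);
  h = sigmoid(X * theta);                               % hypothesis for all examples
  J = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h));   % simplified cost
  grad = (1/m) * (X' * (h - y));                        % partial derivatives of J
end
```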
Next, the goal is to find the parameters $\theta$ that minimize $J(\theta)$.
The gradient descent algorithm mentioned earlier is used here in the same way: repeatedly set $\theta_{j} := \theta_{j} - \alpha\frac{\partial}{\partial\theta_{j}}J(\theta)$, updating all $\theta_{j}$ simultaneously.
Substituting the derivative of $J(\theta)$ gives
$$\theta_{j} := \theta_{j} - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x_{j}^{(i)}$$
This update rule looks identical to the one used for linear regression, but the algorithm is not the same: here the hypothesis is $h_{\theta}(x) = g(\theta^{T}x)$, whereas linear regression uses $h_{\theta}(x) = \theta^{T}x$.
Feature scaling also applies to logistic regression and makes gradient descent converge faster.
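Putting the pieces together, here is a sketch of the batch gradient descent loop using the costFunction above; the learning rate alpha and the iteration count are illustrative values I chose, and X and y are assumed to be the training data already in memory:

```octave
% Batch gradient descent for logistic regression.
alpha = 0.01;                   % learning rate (chosen by hand)
num_iters = 1000;               % number of descent steps
theta = zeros(size(X, 2), 1);   % initialize parameters to zero
for iter = 1:num_iters
  [J, grad] = costFunction(theta, X, y);
  theta = theta - alpha * grad; % simultaneous update of all theta_j
end
```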
6 Advanced Optimization
In addition to gradient descent, three more advanced algorithms can be considered: conjugate gradient, BFGS, and L-BFGS. These algorithms need no manually chosen learning rate $\alpha$ and are often faster, but they are also correspondingly more complex.
When implementing them, it is recommended to call the existing libraries in MATLAB or Octave rather than coding these algorithms by hand.
For example, we can use fminunc in Octave, which implements such an advanced optimization algorithm; note that fminunc requires the dimension of $\theta$ to be at least 2 (it does not accept a scalar $\theta$).
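The call then looks roughly like the following, reusing the costFunction sketch above; setting 'GradObj' to 'on' tells fminunc that our function also returns the gradient:

```octave
% Minimize J(theta) with fminunc instead of a hand-rolled descent loop.
options = optimset('GradObj', 'on', 'MaxIter', 400);
initialTheta = zeros(size(X, 2), 1);
[optTheta, minJ, exitFlag] = fminunc(@(t) costFunction(t, X, y), ...
                                     initialTheta, options);
```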
7 Multiclass Classification
A multiclass classification problem is a classification problem with more than two classes.
In a multiclass classification problem we in fact train multiple classifiers, one for each class.
In the one-vs-all method, a separate logistic regression classifier $h_{\theta}^{(i)}(x)$ is trained for each class $i$, treating the examples with $y = i$ as the positive class and all other examples as the negative class, so that $h_{\theta}^{(i)}(x)$ estimates $P(y = i|x;\theta)$.
To classify a new input $x$, we then pick the class $i$ that maximizes $h_{\theta}^{(i)}(x)$, that is, the classifier reporting the highest probability (as sketched below).
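A sketch of one-vs-all training and prediction in Octave, under the assumptions above (labels 1..K, and the costFunction and sigmoid sketches; the course exercises use fmincg, but fminunc works the same way here):

```octave
% Train K one-vs-all classifiers: the i-th treats class i as positive (1)
% and every other class as negative (0). Labels in y are assumed to be 1..K.
K = numel(unique(y));
n = size(X, 2);
allTheta = zeros(K, n);
options = optimset('GradObj', 'on', 'MaxIter', 400);
for i = 1:K
  yi = double(y == i);   % relabel: 1 for class i, 0 otherwise
  allTheta(i, :) = fminunc(@(t) costFunction(t, X, yi), zeros(n, 1), options)';
end

% Predict: pick the class whose classifier outputs the highest probability.
[~, pred] = max(sigmoid(X * allTheta'), [], 2);
```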
Notes Directory
(i) Linear Regression with One Variable
(ii) Linear Regression with Multiple Variables
(iii) Logistic Regression
(iv) Regularization / The Problem of Overfitting
(v) Neural Networks: Representation
(vi) Neural Networks: Learning
(vii) Advice for Applying Machine Learning
(viii) Machine Learning System Design
(ix) Support Vector Machines
(x) Unsupervised Learning
(xi) Dimensionality Reduction
(xii) Anomaly Detection
(xiii) Recommender Systems
(xiv) Large Scale Machine Learning