Logistic Regression in Machine Learning


Notes organized from Week 3 of Andrew Ng's Machine Learning course.

Directory:

    • Binary classification problems
      • Model representation
      • Decision boundary
    • Loss function
    • Multi-classification problems
    • Overfitting problems and regularization
      • What is overfitting
      • How to resolve overfitting
      • Regularization methods

1. Binary classification problems

What is a binary classification problem?

    • Spam / not spam?
    • Fraudulent site / legitimate site?
    • Malignant tumor / benign tumor?

Expressed as an expression: $y \in \left\{0,1\right\}$, where

$$\begin{matrix}
0: & \text{Negative Class}\\
1: & \text{Positive Class}
\end{matrix}$$

Can I use linear regression to deal with classification problems?

When using linear regression for a classification problem, a threshold can be chosen; for example, when $h_\theta(x) \geq 0.5$, predict $y=1$; when $h_\theta(x) < 0.5$, predict $y=0$.

When the sample contains only the 8 red crosses shown (top and bottom rows), the red line is the linear regression fit; with a threshold of 0.5, the magenta vertical line separates the positive and negative classes correctly, and there is no problem.

However, when an extra sample (the green cross) is added, the regression line becomes the green line; with 0.5 still chosen as the threshold, the 4 red crosses on top (positive class) now fall into the negative class, which is a serious problem.

In addition, in a binary classification problem $y=0$ or $y=1$, while in linear regression $h_\theta(x)$ can be greater than 1 or less than 0, which is also unreasonable (ideally $0 \leq h_\theta(x) \leq 1$).

The example above shows that using linear regression for classification is unreasonable; the result is not stable.

Representation of the logistic regression model

Instead of the linear regression model, use the logistic regression model:

$g(z)=\frac{1}{1+e^{-z}}$, with $0<g(z)<1$. This is the sigmoid function (also called the logistic function); its graph is as follows:

$h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}}$

Interpretation: $h_\theta(x)=P(y=1\mid x;\theta)$, the estimated probability that $y=1$ given $x$, parameterized by $\theta$.
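
As a concrete illustration, here is a minimal NumPy sketch of the sigmoid and the logistic hypothesis (the code and names such as `hypothesis` are mine, not from the course; the parameter values are just for illustration):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid / logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    """h_theta(x) = g(theta^T x) for every row of X; X includes the bias column x0 = 1."""
    return sigmoid(X @ theta)

# Illustration values: theta chosen by hand, one sample x = [1, 2.0, 3.0] (first entry is x0 = 1)
theta = np.array([-3.0, 1.0, 1.0])
X = np.array([[1.0, 2.0, 3.0]])
print(hypothesis(theta, X))  # estimated P(y = 1 | x; theta)
```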

Linear decision boundary

The decision boundary divides the two classes; in the example shown, the boundary is $x_1+x_2=3$.

Non-linear decision boundary

The boundary shown below is $x_1^2+x_2^2=1$.

Note that the decision boundary is a property of the hypothesis and its parameters: once the parameters are fixed, the boundary is fixed.
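
A small sketch of how the boundary follows from fixed parameters: predicting $y=1$ whenever $h_\theta(x)\geq 0.5$ is the same as checking $\theta^Tx \geq 0$, and with the hand-picked $\theta = (-3, 1, 1)$ from above the boundary is $x_1+x_2=3$ (these values are illustrative, not a fitted model):

```python
import numpy as np

def predict(theta, X, threshold=0.5):
    """Predict y = 1 when h_theta(x) >= threshold, i.e. theta^T x >= 0 for threshold 0.5."""
    probs = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return (probs >= threshold).astype(int)

theta = np.array([-3.0, 1.0, 1.0])   # boundary x1 + x2 = 3 (illustration values)
X = np.array([[1.0, 1.0, 1.0],       # x1 + x2 = 2 < 3 -> predict 0
              [1.0, 2.5, 2.5]])      # x1 + x2 = 5 > 3 -> predict 1
print(predict(theta, X))             # [0 1]
```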

2. Loss function

How to find the parameters of the model?

If we followed linear regression, the loss would be the squared loss; for the simple hypothesis used in linear regression, the loss defined this way is convex and easy to minimize. In logistic regression, however, the hypothesis is a complex nonlinear function ($g(z)=\frac{1}{1+e^{-z}}$), and the squared loss is no longer convex: it has many local minima. Logistic regression therefore needs a different loss function.

Logistic regression loss function

$$\mathrm{cost}(h_\theta(x),y)=\left\{\begin{matrix}
-\log(h_\theta(x)) & \text{if } y=1\\
-\log(1-h_\theta(x)) & \text{if } y=0
\end{matrix}\right.$$

When $y=1$ (left plot): when $h_\theta(x)=1$, the cost is 0; as $h_\theta(x)\to 0$, the cost tends to infinity.

When $y=0$ (right plot): when $h_\theta(x)=0$, the cost is 0; as $h_\theta(x)\to 1$, the cost tends to infinity.

The most important thing is that this function is convex!

Simplified loss function and gradient descent

$\mathrm{cost}(h_\theta(x),y) = -y\log(h_\theta(x)) - (1-y)\log(1-h_\theta(x))$

This is the loss function logistic regression actually uses. Why this function?

    • The parameters can be obtained by maximum likelihood estimation
    • It is a convex function
    • It is equivalent to the piecewise loss above

So the overall cost function is:

$J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^m y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]$

Find the parameters $\theta$: $\underset{\theta}{\min}\, J(\theta)$

Given $x$, predict $y$: $h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}}$
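
A minimal sketch of this cost function in NumPy (the small `eps` inside the logs, added to avoid `log(0)`, is my own detail, not from the course):

```python
import numpy as np

def cost(theta, X, y, eps=1e-12):
    """J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ]."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
```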

Gradient Descent

$\theta_j=\theta_j-\alpha \frac{\partial J(\theta)}{\partial \theta_j}=\theta_j-\alpha \frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)}) x_j^{(i)}$

The parameter update has the same form as in linear regression, but note that $h_\theta(x)$ is different here;

Note that feature scaling is also useful in the logistic classification model;
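
A sketch of batch gradient descent with this update rule, written in NumPy (the learning rate and iteration count are arbitrary illustration values):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Repeat theta_j := theta_j - alpha * (1/m) * sum((h - y) * x_j), all j updated simultaneously."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # logistic hypothesis
        grad = X.T @ (h - y) / m                  # gradient of J(theta)
        theta -= alpha * grad
    return theta
```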

Advanced optimization methods

In addition to the gradient descent algorithm, there are more advanced and sophisticated optimization methods that often converge faster: conjugate gradient, BFGS, L-BFGS.
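
These optimizers do not require picking a learning rate by hand. As one possible illustration (not the course's Octave code), SciPy exposes BFGS/L-BFGS through `scipy.optimize.minimize`; a sketch with a tiny made-up dataset:

```python
import numpy as np
from scipy.optimize import minimize

def cost_and_grad(theta, X, y):
    """Return J(theta) and its gradient, the form expected when jac=True."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    J = -np.mean(y * np.log(h + 1e-12) + (1 - y) * np.log(1 - h + 1e-12))
    grad = X.T @ (h - y) / m
    return J, grad

# Tiny synthetic example (bias column plus two features); values are illustrative only
X = np.array([[1.0, 0.5, 1.0], [1.0, 2.5, 2.0], [1.0, 1.0, 0.5], [1.0, 3.0, 3.5]])
y = np.array([0.0, 1.0, 0.0, 1.0])
result = minimize(cost_and_grad, np.zeros(X.shape[1]), args=(X, y), jac=True, method='L-BFGS-B')
print(result.x)  # fitted theta; no learning rate had to be chosen
```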

3. Multi-classification problems

Email categories: friends, family, work, ...

Weather: sunny, cloudy, rain, snow, ...

One approach to the multi-classification problem is one-vs-all;

As shown below, for a multi-classification problem with 3 classes, construct 3 classifiers, each distinguishing one class from all the others: $h_\theta^{(i)}(x),\; i=1,2,3$:

Each classifier gives the probability that $y=i$ ($i=1,2,3$); the class $i$ with the largest probability is the predicted category, that is: $\underset{i}{\max}\, h_\theta^{(i)}(x),\; i=1,2,3$
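
A sketch of the one-vs-all prediction step in NumPy; the training step is omitted and `thetas` is assumed to hold one already-fitted parameter vector per class:

```python
import numpy as np

def predict_one_vs_all(thetas, X):
    """thetas has shape (num_classes, n); return the index i maximizing h_theta^(i)(x) per row of X."""
    probs = 1.0 / (1.0 + np.exp(-(X @ thetas.T)))   # shape (m, num_classes)
    return np.argmax(probs, axis=1)
```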

4. Overfitting problems and regularization

Overfitting problem

As shown, there are three models for house price forecasting:

The first model is very simple and fits the data poorly; this is called "underfitting", and it has a relatively large bias.

The second model is a little more complex than the first and fits well; it can be considered "just right".

The third model is very complex and fits the training data perfectly; this is "overfitting", and it has a large variance.

Overfitting is the problem in the third picture: if we have many features, the learned model can fit the training data very well ($J(\theta) \approx 0$), but it does poorly on new data; its generalization ability is weak.

Similarly, in logistic regression:

How do we solve the overfitting problem?

    • Reduce the number of features
      • Manually choose which features to keep
      • Use an automatic model selection algorithm
    • Regularization
      • Keep all the features, but reduce the magnitude/values of the parameters
      • This works well when there are many features, each of which contributes a little to the prediction

The loss function after regularization

As shown, when a penalty term is added to the original loss function, $\theta_3$ and $\theta_4$ become very small; although the model is complex in form, the higher-order terms contribute very little, so it behaves like a lower-order function.

Regularization "simplifies" the model, so its tendency to overfit is reduced;

Regularization of linear regression:

$J(\theta)=\frac{1}{2m} \left[\sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})^2 + \lambda \sum_{j=1}^n \theta_j^2\right]$

Note that when $\lambda$ is very large, underfitting can occur;
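
A sketch of this regularized cost in NumPy; note that $\theta_0$ is excluded from the penalty term:

```python
import numpy as np

def regularized_linear_cost(theta, X, y, lam):
    """J = (1/2m) * [ sum((X@theta - y)^2) + lambda * sum(theta_j^2 for j >= 1) ]."""
    m = len(y)
    err = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)   # theta_0 is not penalized
    return (err @ err + penalty) / (2 * m)
```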

At this point the gradient descent algorithm is updated to:

$\theta_0=\theta_0-\alpha \frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)}) x_0^{(i)}$

$\theta_j=\theta_j-\alpha \left[\frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)}) x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]$, $j=1,2,\dots,n$;

Note: $\theta_0$ is not regularized, so its update has no $\frac{\lambda}{m}\theta_0$ term.

Note that:

$\theta_j=\theta_j\left(1-\alpha\frac{\lambda}{m}\right)-\alpha \frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)}) x_j^{(i)}$

$\left(1-\alpha\frac{\lambda}{m}\right)$ is a number slightly less than 1, perhaps 0.99; so compared with the unregularized update, the regularized update first shrinks $\theta_j$ a little at every step.
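
A sketch of one gradient step written in this "shrink, then step" form; $\theta_0$ is updated without the shrink factor:

```python
import numpy as np

def regularized_step(theta, X, y, alpha, lam):
    """theta_j := theta_j*(1 - alpha*lam/m) - alpha*(1/m)*sum((h - y)*x_j); theta_0 is not shrunk."""
    m = len(y)
    h = X @ theta                       # linear regression hypothesis; use the sigmoid for logistic regression
    grad = X.T @ (h - y) / m
    shrink = np.ones_like(theta)
    shrink[1:] = 1 - alpha * lam / m    # no shrink on theta_0
    return theta * shrink - alpha * grad
```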

Normal equation

$$\theta= \left(X^TX+\lambda\begin{bmatrix}
0 & & & \\
 & 1 & & \\
 & & \ddots & \\
 & & & 1
\end{bmatrix}\right)^{-1}X^Ty$$

In the non-regularized linear regression problem, the normal equation can run into the problem that $X^TX$ is not invertible; however, it can be proved that $X^TX+\lambda L$, with $L$ the matrix above, is always invertible;
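
A sketch of the regularized normal equation in NumPy; `L` is the identity with the top-left entry zeroed so that $\theta_0$ is not penalized:

```python
import numpy as np

def normal_equation_regularized(X, y, lam):
    """theta = (X^T X + lambda * L)^{-1} X^T y, with L = diag(0, 1, ..., 1)."""
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)   # solve is preferred over an explicit inverse
```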

Regularization of logistic regression

This works the same way as the regularization of linear regression; only the model function $h_\theta(x)$ is replaced.
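
For completeness, a sketch of the regularized logistic regression cost: the cost from Section 2 plus the same penalty term as in the linear case (the `eps` guard against `log(0)` is my own detail):

```python
import numpy as np

def regularized_logistic_cost(theta, X, y, lam, eps=1e-12):
    """J = -(1/m)*sum[y*log(h) + (1-y)*log(1-h)] + (lambda/2m)*sum(theta_j^2 for j >= 1)."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    data_term = -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
    return data_term + lam / (2 * m) * np.sum(theta[1:] ** 2)   # theta_0 not penalized
```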

