A logistic regression algorithm for machine learning

This content is based on Andrew Ng's Machine Learning course on Coursera, with thanks to Andrew Ng.

The "Logic regression" study notes for the sixth course of machine learning at Stanford University, this course consists of 7 main parts:
1) Classification (category)
2) Hypothesis representation (modeling)
3) Decision boundary (decision boundary)
4 Price function (cost functions, costs function)
5) simplified cost function and gradient descent (simplified version price and gradient descent algorithm)
6) Advanced Optimization (other optimization algorithms)
7) Multi-Class Classification:one-vs-all (Multi-class classification problem)

1) Classification
Examples of classification problems:
Email: spam / not spam.
Online transactions: fraudulent / not fraudulent (yes/no).
Tumor: malignant / benign.
The problems above are all binary classification problems, which can be defined in the following form:

y ∈ {0, 1}, where 0 denotes the negative class and 1 denotes the positive class.

2) Hypothesis representation
If the classifier were a regression model and a model had already been trained, we could set a threshold:
If hθ(x) ≥ 0.5, predict y = 1, i.e. y is the positive class;
If hθ(x) < 0.5, predict y = 0, i.e. y is the negative class.
For a linear regression model applied to the binary tumor-classification problem, the picture looks like this:

As the figure shows, linear regression does not work well for predicting the tumor type. Moreover, for a binary classification problem, the output hθ(x) of a linear regression model can be greater than 1 or less than 0.

We therefore introduce a new model, logistic regression, whose output always lies between 0 and 1.
The hypothesis of the logistic regression model is: hθ(x) = g(θᵀx)
where
x is the feature vector;
g is the logistic function (also called the sigmoid function), a commonly used S-shaped function given by:
g(z) = 1 / (1 + e^(−z))
The graph of this function is:

Putting these together, we get the hypothesis of the logistic regression model:
hθ(x) = 1 / (1 + e^(−θᵀx))
The purpose of hθ(x) is to estimate, for the chosen parameters and a given input x, the probability that the output variable equals 1, i.e. hθ(x) = P(y = 1 | x; θ).

For example, if for a given x the already-determined parameters give hθ(x) = 0.7, then there is a 70% probability that y is the positive class, and the probability that y is the negative class is 1 − 0.7 = 0.3.
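As a small illustration (an added sketch, not part of the original notes), the hypothesis can be written in a few lines of Octave; the parameter and feature values below are made up for the example:

% Sketch: sigmoid function and logistic regression hypothesis (illustrative values).
sigmoid = @(z) 1 ./ (1 + exp(-z));   % g(z) = 1 / (1 + e^(-z))

theta = [-3; 1; 1];                  % example parameters
x     = [1; 2.5; 1.2];               % feature vector with x0 = 1 (intercept term)
h     = sigmoid(theta' * x);         % h_theta(x) = g(theta' * x), here ~= 0.67
printf('P(y = 1 | x; theta) = %.2f\n', h);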

3) Decision boundary
As described in the previous section, the logistic regression model can be written as:

hθ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))
Suppose the threshold is 0.5. When hθ(x) ≥ 0.5, predict y = 1;
when hθ(x) < 0.5, predict y = 0.
Looking back at the graph of the sigmoid function g(z):
g(z) ≥ 0.5 exactly when z ≥ 0;
so hθ(x) = g(θᵀx) ≥ 0.5 exactly when θᵀx ≥ 0, in which case we predict y = 1.
Conversely, when we predict y = 0, θᵀx < 0.
We can therefore regard θᵀx = 0 as the decision boundary: depending on whether θᵀx is greater than or less than 0, the logistic regression model predicts one class or the other. For example:

hθ(x) = g(θ0 + θ1x1 + θ2x2)

with θ0, θ1, θ2 equal to −3, 1, 1 respectively.

The model predicts y = 1 when −3 + x1 + x2 ≥ 0, that is, when x1 + x2 ≥ 3.
We can draw the straight line x1 + x2 = 3; it is the model's dividing line, separating the region where the model predicts 1 from the region where it predicts 0.
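A quick Octave sketch of this example (the parameters come from the text above; the sample points are made up):

% Linear decision boundary example: theta = [-3; 1; 1].
% Points with x1 + x2 >= 3 are predicted as y = 1, otherwise y = 0.
sigmoid = @(z) 1 ./ (1 + exp(-z));
theta = [-3; 1; 1];

X = [1 1 1;      % each row is [x0 x1 x2], with x0 = 1
     1 2 2;
     1 4 1];
h = sigmoid(X * theta);     % predicted probabilities for each row
predictions = (h >= 0.5);   % gives [0; 1; 1]: only rows with x1 + x2 >= 3 predict y = 1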

What if our data were distributed as shown below? How should the model fit it?

Because we need a curve to separate the y = 0 region from the y = 1 region, we need second-order (quadratic) features:
hθ(x) = g(θ0 + θ1x1 + θ2x2 + θ3x1² + θ4x2²)

Assuming the parameters are [−1 0 0 1 1], the decision boundary we obtain is exactly the circle of radius 1 centered at the origin.
We can use very complex models to fit decision boundaries of very complex shapes.
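A similar sketch for the quadratic case (the parameters come from the text; the test points are made up):

% Quadratic-feature example: theta = [-1; 0; 0; 1; 1] gives the boundary
% x1^2 + x2^2 = 1 (the unit circle).
sigmoid = @(z) 1 ./ (1 + exp(-z));
theta = [-1; 0; 0; 1; 1];

p1 = [0.5; 0.5];                                % a point inside the unit circle
f1 = [1; p1(1); p1(2); p1(1)^2; p1(2)^2];       % features [1 x1 x2 x1^2 x2^2]
h1 = sigmoid(theta' * f1);                      % ~= 0.38, so predict y = 0

p2 = [1.5; 0];                                  % a point outside the unit circle
f2 = [1; p2(1); p2(2); p2(1)^2; p2(2)^2];
h2 = sigmoid(theta' * f2);                      % ~= 0.78, so predict y = 1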

4) Cost function
For the linear regression model, the cost function we defined was the sum of squared errors of the model. In theory we could use the same definition for the logistic regression model, but the problem is that when we substitute hθ(x) = 1 / (1 + e^(−θᵀx)) into a cost function defined that way, the resulting cost function is non-convex.

We know that the cost function of linear regression is convex, a bowl shape, and convex functions have a useful property: a local minimum of a convex function is also its global minimum, so once we find a minimum point of such a function, that point must be the global minimum.

Therefore the cost function above is not suitable for logistic regression, and we need a different form of cost function to ensure that the cost function of logistic regression is convex.
The cost function of logistic regression is therefore redefined as:

Cost(hθ(x), y) = −log(hθ(x))      if y = 1
Cost(hθ(x), y) = −log(1 − hθ(x))  if y = 0
To get an intuitive feel for this cost function, first look at the case y = 1:

Intuitively, if y = 1 and hθ(x) = 1, then the cost is 0; that is, when the prediction exactly equals the true value, the cost is 0.
However, as hθ(x) → 0, the cost → ∞.
Intuitively this is because the prediction is diametrically opposed to the truth:
hθ(x) = 0 means the prediction P(y = 1 | x; θ) = 0, i.e. the probability that y = 1 is 0, yet in fact y = 1;
so the learning algorithm should be given a very large cost penalty.
The same is true for y=0:

Summary: the cost function Cost(hθ(x), y) constructed this way has the following properties: when the actual y = 1 and hθ(x) is also 1, the cost is 0; when y = 1 but hθ(x) is not 1, the cost grows larger as hθ(x) grows smaller; when the actual y = 0 and hθ(x) is also 0, the cost is 0; when y = 0 but hθ(x) is not 0, the cost grows larger as hθ(x) grows larger.
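A small Octave sketch of this behaviour (the probability values are made up for the example):

% Piecewise cost of logistic regression for a single example.
cost_y1 = @(h) -log(h);       % cost when y = 1
cost_y0 = @(h) -log(1 - h);   % cost when y = 0

cost_y1(0.99)   % prediction close to 1 while y = 1 -> cost ~= 0.01
cost_y1(0.01)   % prediction close to 0 while y = 1 -> cost ~= 4.6
cost_y0(0.01)   % prediction close to 0 while y = 0 -> cost ~= 0.01
cost_y0(0.99)   % prediction close to 1 while y = 0 -> cost ~= 4.6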

5) Simplified cost function and gradient descent
The cost Cost(hθ(x), y) constructed above can be simplified into a single expression:

Cost(hθ(x), y) = −y·log(hθ(x)) − (1 − y)·log(1 − hθ(x))
Substituting this into the overall cost function gives:

J(θ) = −(1/m) · Σᵢ [ y^(i)·log(hθ(x^(i))) + (1 − y^(i))·log(1 − hθ(x^(i))) ]   (sum over the m training examples)
A note on this formula: the expression in square brackets is the log-likelihood function used in maximum likelihood estimation for logistic regression; maximizing it gives the estimate of the parameter θ. Conversely, to find suitable parameters we minimize the cost function, i.e.:
min_θ J(θ)
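A vectorized Octave sketch of computing J(θ) (the tiny data set X, y and the parameters are made up for the example):

% Vectorized cost J(theta) for logistic regression.
sigmoid = @(z) 1 ./ (1 + exp(-z));

X = [1 1; 1 2; 1 3; 1 4];   % m x (n+1) design matrix, first column is x0 = 1
y = [0; 0; 1; 1];           % labels
theta = [-2.5; 1];          % example parameters
m = length(y);

h = sigmoid(X * theta);                                   % m x 1 vector of predictions
J = (1 / m) * sum(-y .* log(h) - (1 - y) .* log(1 - h));  % cost J(theta)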
As with linear regression, we use the gradient descent algorithm to learn the parameters θ.
The algorithm is:

Repeat {  θj := θj − α · ∂J(θ)/∂θj  }   (simultaneously updating all θj)

After taking the derivative of J(θ), the gradient descent algorithm becomes:

Repeat {  θj := θj − α · (1/m) · Σᵢ (hθ(x^(i)) − y^(i)) · xj^(i)  }   (simultaneously updating all θj)
Note: although the resulting gradient descent update looks identical to the gradient descent update for linear regression, here hθ(x) = g(θᵀx), which differs from linear regression, so it is actually not the same algorithm. In addition, feature scaling is still worthwhile before running gradient descent.
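A minimal Octave sketch of the gradient descent loop above (the data set, learning rate, and iteration count are made up for the example):

% Gradient descent for logistic regression on a tiny made-up data set.
sigmoid = @(z) 1 ./ (1 + exp(-z));

X = [1 1; 1 2; 1 3; 1 4];   % design matrix with intercept column x0 = 1
y = [0; 0; 1; 1];
m = length(y);
theta = zeros(2, 1);
alpha = 0.1;                % learning rate
num_iters = 1000;

for iter = 1:num_iters
    h = sigmoid(X * theta);
    grad = (1 / m) * X' * (h - y);   % partial derivatives of J(theta)
    theta = theta - alpha * grad;    % simultaneous update of all theta_j
end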
Addendum: in logistic regression the cost function is defined as J(θ), and gradient descent needs to minimize J(θ), which requires the partial derivative with respect to θ; it works out to:

∂J(θ)/∂θj = (1/m) · Σᵢ (hθ(x^(i)) − y^(i)) · xj^(i)
6) Advanced optimization
Besides gradient descent, there are other algorithms commonly used to minimize the cost function. They are more complex and more sophisticated, usually do not require manually choosing a learning rate, and are usually faster than gradient descent. These algorithms include: conjugate gradient, BFGS (Broyden–Fletcher–Goldfarb–Shanno), and L-BFGS (limited-memory BFGS).
fminunc is an unconstrained minimization function available in both MATLAB and Octave. When using it, we must provide the cost function and the derivative with respect to each parameter. The following is a code example that uses fminunc in Octave:

function [jVal, gradient] = costFunction(theta)
    jVal = [... code to compute J(theta) ...];
    gradient = [... code to compute derivative of J(theta) ...];
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
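One possible way to fill in the placeholders above is sketched below; the extra X and y arguments (passed in through an anonymous function) are an assumption of this sketch, not part of the original example:

% Sketch of a concrete costFunction for logistic regression.
% X is the design matrix (with intercept column) and y the label vector.
function [jVal, gradient] = costFunction(theta, X, y)
    m = length(y);
    h = 1 ./ (1 + exp(-(X * theta)));                              % hypothesis h_theta(x)
    jVal = (1 / m) * sum(-y .* log(h) - (1 - y) .* log(1 - h));    % J(theta)
    gradient = (1 / m) * X' * (h - y);                             % dJ/dtheta
end

% Usage with fminunc (assuming X and y are already defined):
% options = optimset('GradObj', 'on', 'MaxIter', 100);
% initialTheta = zeros(size(X, 2), 1);
% [optTheta, functionVal, exitFlag] = ...
%     fminunc(@(t) costFunction(t, X, y), initialTheta, options);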

7) Multi-class classification: one-vs-all
In multi-class classification problems, the training set contains more than two classes, so we can no longer use a single binary variable (0 or 1) as the basis for the prediction. For example, we might want to predict four kinds of weather: sunny, cloudy, rainy, or snowy.

The figure below shows what the data might look like in a multi-class classification problem:


One way to solve such a problem is the one-vs-all approach. In the one-vs-all approach, we transform the multi-class classification problem into several binary classification problems. To make this transformation, we mark one of the classes as the positive class (y = 1) and mark all the other classes as the negative class; this model is denoted hθ^(1)(x). Then, similarly, we select another class to mark as the positive class (y = 2), mark the remaining classes as negative, and call that model hθ^(2)(x), and so on.

Finally we obtain a series of models, denoted:
hθ^(i)(x) = P(y = i | x; θ),  for each class i

Finally, when we need to make a prediction, we run all the classifiers on the new input and select the class whose classifier gives the highest output, i.e. max over i of hθ^(i)(x).
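A small Octave sketch of one-vs-all prediction (the matrix all_theta and the example input are made up; each classifier is assumed to have been trained already, e.g. with fminunc as above):

% One-vs-all prediction sketch.
% all_theta is a k x (n+1) matrix whose i-th row holds the parameters of the
% classifier trained with class i as the positive class.
sigmoid = @(z) 1 ./ (1 + exp(-z));

all_theta = [ 1.0 -2.0  0.5;     % classifier for class 1
             -0.5  1.5 -1.0;     % classifier for class 2
             -1.0  0.5  2.0];    % classifier for class 3
x = [1; 0.8; 1.2];               % new example, with intercept term x0 = 1

probs = sigmoid(all_theta * x);  % h^(i)(x) = P(y = i | x; theta) for each i
[~, prediction] = max(probs);    % pick the class with the highest probability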
