Original address: http://blog.csdn.net/abcjennifer/article/details/7716281
This column (Machine Learning) covers linear regression with one variable, linear regression with multiple variables, the Octave tutorial, logistic regression, regularization, neural networks, machine learning system design, SVMs (Support Vector Machines), clustering, dimensionality reduction, anomaly detection, large-scale machine learning, and other chapters. All of the content comes from Andrew Ng's explanations in the Stanford public course Machine Learning. (https://class.coursera.org/ml/class/index)
Lecture 3 ------- Logistic Regression & Regularization
The content of this lecture:
Logistic Regression
=========================
(i), Classification
(ii), Hypothesis Representation
(iii), Decision Boundary
(iv), Cost Function
(v), Simplified Cost Function and Gradient Descent
(vi), Parameter Optimization in MATLAB
(vii), Multiclass Classification: One-vs-All
The problem of overfitting and how to solve it
=========================
(viii), The Problem of Overfitting
(ix), Cost Function
(x), Regularized Linear Regression
(xi), Regularized Logistic Regression
This chapter mainly describes logistic regression and regularization for solving the problem of overfitting. Both are very important and among the most commonly used machine learning regression tools; they are explained in the two parts below.
Part I: Logistic Regression
/************* (i) ~ (ii), Classification / Hypothesis Representation ***********/
Suppose we want to predict whether a patient's tumor is malignant or benign as a function of tumor size.
Eight data points are given below:
Assume the hypothesis obtained by linear regression is the pink line shown; then a threshold of 0.5 can be used to predict:
y = 1, if h(x) >= 0.5
y = 0, if h(x) < 0.5
That is, project the point where h(x) = 0.5 down onto the horizontal axis; points to its right are predicted y = 1, points to its left y = 0, and the data are classified well.
So, what if the dataset is like this?
In this case, suppose linear regression gives the blue line as the fitted hypothesis; then the 0.5 threshold no longer classifies the data properly, because it does not satisfy
y = 1, if h(x) > 0.5
y = 0, if h(x) <= 0.5
At this point, we introduce the logistic regression model:
The so-called sigmoid function, or logistic function, is the function g(z) shown: g(z) = 1 / (1 + e^(-z)).
When z >= 0, g(z) >= 0.5; when z < 0, g(z) < 0.5.
Given the data x and parameters θ, hθ(x) is interpreted as the probability that y = 1; the probabilities of y = 0 and y = 1 sum to 1, as shown by the middle formula.
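As a minimal MATLAB/Octave sketch (not from the original post; the function name sigmoid is chosen here for illustration), the hypothesis can be computed like this:

function g = sigmoid(z)
% SIGMOID Logistic function g(z) = 1 / (1 + exp(-z)), applied elementwise.
g = 1 ./ (1 + exp(-z));
end

% Usage: for a feature column vector x (with intercept term x0 = 1) and parameters theta,
%   h = sigmoid(theta' * x);   % h = P(y = 1 | x; theta), and P(y = 0 | x; theta) = 1 - h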
/***************************** (iii), Decision Boundary **************************/
The so-called decision boundary is the boundary given by h(x) that separates the data points into the two classes.
As shown, assume a hypothesis of the form h(x) = g(θ0 + θ1x1 + θ2x2) with parameters θ = [-3, 1, 1]^T. Then:
Predict y = 1, if -3 + x1 + x2 >= 0
Predict y = 0, if -3 + x1 + x2 < 0
which classifies the dataset shown in the diagram nicely.
Another Example:
Answer
In addition to linear boundaries, there are nonlinear decision boundaries, for example
as shown, where the decision boundary separating the classes is a circle of radius 1:
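A small illustrative sketch (assuming the sigmoid helper above and θ values like those in the figures; the concrete test points are made up here) shows that the prediction reduces to checking the sign of θᵀx:

% Linear boundary example: h(x) = g(-3 + x1 + x2)
theta_linear = [-3; 1; 1];
x = [1; 2; 2.5];                                   % [x0; x1; x2], intercept x0 = 1
p1 = sigmoid(theta_linear' * x) >= 0.5;            % 1, since -3 + 2 + 2.5 >= 0

% Circular boundary example: h(x) = g(-1 + x1^2 + x2^2), boundary x1^2 + x2^2 = 1
theta_circle = [-1; 0; 0; 1; 1];
x_nl = [1; 0.5; 0.5; 0.5^2; 0.5^2];                % features [1; x1; x2; x1^2; x2^2]
p2 = sigmoid(theta_circle' * x_nl) >= 0.5;         % 0, the point lies inside the circle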
/******************** (iv) ~ (v), Simplified Cost Function and Gradient Descent <very important> *******************/
This section describes the simplified cost function for logistic regression and how to implement gradient descent for it.
Assuming that our labels y only take the values 0 and 1, the cost function for the logistic regression model is defined as follows:
Since y only takes the values 0 and 1, the cost can be written in a single expression:
If you are not convinced, substitute y = 0 and y = 1 separately and you will find that this J(θ) is the same as the piecewise cost(hθ(x), y) above (*^__^*). The remaining work is then to find the θ that minimizes J(θ).
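A minimal vectorized sketch of this J(θ) in MATLAB/Octave (logisticCost is an illustrative name, and the sigmoid helper from above is assumed):

function J = logisticCost(theta, X, y)
% LOGISTICCOST Cross-entropy cost for logistic regression.
% X is m x (n+1) with a leading column of ones, y is m x 1 with entries in {0, 1}.
m = size(X, 1);
h = sigmoid(X * theta);                                   % m x 1 predictions
J = -(1 / m) * (y' * log(h) + (1 - y)' * log(1 - h));     % J(theta)
end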
In the first lecture we discussed how to apply gradient descent: in the repeat part, all dimensions of θ are updated simultaneously, and the derivative of J(θ) is given by the following formula, as shown in the handwritten derivation:
Now substitute it into the repeat loop:
Surprisingly, we find that it is exactly the same as the update formula we obtained in the first lecture.
In other words, as shown, whether h(x) is the linear regression or the logistic regression model, the same parameter update rule is obtained (only the form of hθ(x) differs).
So how do we vectorize it? In other words, instead of updating each θj in a for loop, we use matrix multiplication to update the whole θ at once. That is, we want to solve the following problem:
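A hedged sketch of the vectorized update in MATLAB/Octave (X, y, theta, alpha, and num_iters are assumed to be defined as in the cost sketch above):

m = length(y);
for iter = 1:num_iters
    h = sigmoid(X * theta);                         % m x 1 predictions
    theta = theta - (alpha / m) * (X' * (h - y));   % update every theta_j simultaneously
end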
The above formula gives the update of the parameter vector θ. Now a question: the second lecture discussed how to judge whether the learning rate α is set appropriately; how do we judge this in a logistic regression system?
Q: Suppose you are running gradient descent to fit a logistic regression model with parameter θ ∈ R^(n+1). Which of the following is a reasonable way to make sure the learning rate α is set properly and that gradient descent is running correctly?
A:
/************* (vi), Parameter Optimization in MATLAB ***********/
This section optimizes the logistic regression training itself so that the parameters can be found faster than with plain gradient descent, and implements the process of computing the optimal parameters in MATLAB.
First of all, besides gradient descent there are many other methods we can use. As shown, the left side lists three other methods (conjugate gradient, BFGS, L-BFGS), and the right side lists their common advantages and disadvantages: no need to choose the learning rate α and usually faster, but more complex.
That is, MATLAB already implements several such methods for optimizing the parameters θ. Our task is to write the cost function and tell the system which options to use when optimizing. For example, we use the 'GradObj' option to specify that the cost function also returns a second output argument g containing the partial derivatives df/dx of the function at the point x.
As shown, given the parameters θ, we need to supply the cost function, where
jVal is the value of the cost function. For example, to fit the two points (1, 0, 5) and (0, 1, 5) with the model hθ(x) = θ1x1 + θ2x2,
the cost function J(θ) is: jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
In each iteration, the parameters are updated in the spirit of gradient descent, θ(i) -= gradient(i), where gradient(i) is the derivative of J(θ) with respect to θi; in this example gradient(1) = 2*(theta(1) - 5) and gradient(2) = 2*(theta(2) - 5). This is shown in the following code:
The function costFunction defines jVal = J(θ) and the gradient with respect to the two components of θ:
function [jVal, gradient] = costFunction(theta)
%COSTFUNCTION Summary of this function goes here
%   Detailed explanation goes here
jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
gradient = zeros(2, 1);
% code to compute derivative to theta
gradient(1) = 2 * (theta(1) - 5);
gradient(2) = 2 * (theta(2) - 5);
end
Write the function gradient_descent that performs the parameter optimization:
function [optTheta, functionVal, exitFlag] = gradient_descent()
%GRADIENT_DESCENT Summary of this function goes here
%   Detailed explanation goes here
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1)
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
end
Calling it from the MATLAB command window gives the optimized parameters (θ1, θ2) = (5, 5), that is, hθ(x) = θ1x1 + θ2x2 = 5*x1 + 5*x2:
[optTheta, functionVal, exitFlag] = gradient_descent()
initialTheta =
     0
     0
Local minimum found.
Optimization completed because the size of the gradient is less than
the default value of the function tolerance.
<stopping criteria details>
optTheta =
     5
     5
functionVal =
     0
exitFlag =
     1
The final result shows the optimized parameters optTheta = [5, 5], and functionVal, the value of costFunction after the iterations, is 0.
/***************************** (vii), Multiclass Classification: One-vs-All **************************/
The so-called one-vs-all method applies the binary classification method to multi-class classification.
For example, to divide the data into K classes, we take one class as positive and the other (K-1) classes together as negative, and optimize the parameters of K hypotheses hθ(x); each hθ(x), given θ and x, gives the probability that x belongs to its positive class.
In this way, given an input vector x, the class whose hθ(x) is the largest is the class that x is assigned to.
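A hedged sketch of this in MATLAB/Octave (trainLogistic is a hypothetical stand-in for any binary logistic regression trainer, e.g. fminunc on the cost function above; K, X, y, all_theta are illustrative names):

% Train K one-vs-all classifiers: class k is positive, all other classes negative.
all_theta = zeros(K, size(X, 2));
for k = 1:K
    yk = double(y == k);                       % relabel: 1 for class k, 0 otherwise
    all_theta(k, :) = trainLogistic(X, yk)';   % hypothetical binary trainer
end

% Prediction: the class with the largest h_theta(x) wins.
[~, pred] = max(sigmoid(X * all_theta'), [], 2);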
Part II: The Problem of Overfitting and How to Solve It
/************ (viii), The Problem of Overfitting ***********/
The problem of overfitting:
Overfitting means the hypothesis fits the training data too closely, as in the right-hand side of the picture. Both of the models described above (linear regression and logistic regression) can suffer from overfitting, as illustrated by the two pictures below:
<Linear Regression>:
<Logistic Regression>:
How do we solve the overfitting problem? Two methods:
1. Reduce the number of features (manually decide which features to keep, or have an algorithm select them).
2. Regularization (keep all the features, but force the parameters of some features to be very small).
We explain regularization in detail below.
For the linear regression model, our problem is to minimize
Written in matrix form, this becomes
i.e. the loss function can be written as
from which we can get:
After regularization, however, we have:
/************ (ix), Cost Function ***********/
For regularization, the method below adds terms to the cost function that penalize θ3 and θ4 very heavily, so after minimizing the cost function the resulting θ3 and θ4 are very small.
Written as a general formula: add a penalty on θ1 ~ θn to the cost function.
Note the choice of λ here; see the following question:
Q:
A: A very large λ will cause all θ ≈ 0.
Below, we carry out the regularization steps for linear regression and for logistic regression separately.
/************ (x), Regularized Linear Regression ***********/
<Linear Regression>:
First look at the cost function formula above and how to apply gradient descent to update the parameters.
For θ0 there is no penalty, so its update formula is the same as the original one.
For the other θj, the derivative of J(θ) gains an extra (λ/m)·θj term, see (and the sketch below):
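A hedged sketch of one regularized gradient descent step for linear regression (X, y, theta, alpha, lambda assumed defined as before; theta(1) is θ0 in MATLAB's 1-based indexing):

m = length(y);
h = X * theta;                                            % linear hypothesis
grad = (1 / m) * (X' * (h - y));                          % unregularized gradient
grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);  % add (lambda/m)*theta_j for j >= 1, theta_0 untouched
theta = theta - alpha * grad;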
If, instead of gradient descent (gradient descent + regularization), we use the matrix calculation (the normal equation) to find θ, then the θ minimizing J(θ) is obtained by setting the derivative of J(θ) with respect to each θj equal to 0, which gives the following formula:
It can also be shown that the matrix in parentheses in the formula above is invertible.
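A minimal sketch of this regularized normal equation in MATLAB/Octave (X is the m x (n+1) design matrix with a leading column of ones, y the m x 1 targets, lambda the regularization parameter):

L = eye(size(X, 2));                        % (n+1) x (n+1) identity ...
L(1, 1) = 0;                                % ... with (1,1) zeroed so theta_0 is not penalized
theta = (X' * X + lambda * L) \ (X' * y);   % theta = inv(X'X + lambda*L) * X'y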
/************ (xi), Regularized Logistic Regression ***********/
<Logistic Regression>:
The cost function and overfitting of logistic regression were mentioned earlier, as shown:
As with linear regression, we add a penalty on θ to J(θ) to suppress overfitting:
Using gradient descent, we take the derivative of J(θ) with respect to each θj and plug it into the update rule, resulting in:
Here we find that the θ update rule has the same form as for regularized linear regression (only hθ(x) differs).
Q: When using regularized logistic regression, which of these is the best way to monitor whether gradient descent is working correctly?
Similar to the MATLAB example invoked above, we can define the cost function of regularized logistic regression as follows:
In the figure, jVal is the cost function expression, whose last term is the penalty on the parameters θ. Below it is the gradient with respect to each θj: θ0 is not included in the penalty, so its gradient is unchanged, while the gradients for θ1 ~ θn each gain an extra (λ/m)·θj term.
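In the spirit of the fminunc example above, a hedged sketch of such a cost function (costFunctionReg is an illustrative name; the sigmoid helper from earlier is assumed):

function [jVal, gradient] = costFunctionReg(theta, X, y, lambda)
%COSTFUNCTIONREG Regularized logistic regression cost and gradient (sketch).
m = length(y);
h = sigmoid(X * theta);
% cross-entropy cost plus (lambda/2m) * sum of theta_j^2 for j >= 1 (theta_0 excluded)
jVal = -(1 / m) * (y' * log(h) + (1 - y)' * log(1 - h)) ...
       + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);
% gradient: theta_0 unpenalized, the others gain (lambda/m) * theta_j
gradient = (1 / m) * (X' * (h - y));
gradient(2:end) = gradient(2:end) + (lambda / m) * theta(2:end);
end

% Usage (illustrative): fminunc(@(t) costFunctionReg(t, X, y, lambda), initialTheta, options)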
At this point, regularization solves the overfitting problem for both linear and logistic regression ~
Stanford Machine Learning --- Lecture 3. Logistic Regression & Regularization: logistic regression and the solution to the overfitting problem