1. Linear regression (used directly for classification)
As shown in the figure, consider the tumor-size data. At first glance a linear hypothesis h(x) seems to classify the data well: when h(x) > 0.5 we predict that the patient has a tumor, and when h(x) < 0.5 we predict normal. But the linear model can run into the following situation.
If we now adjust the parameters of the linear model, the resulting fit is the blue line, and we find that some of the red crosses (actual tumor cases) are predicted as normal. That is clearly unreasonable, and the consequences are serious (the patient is sick but is predicted to be healthy, which affects treatment). In addition, taking binary classification with labels {0, 1} as an example, a linear model can produce predictions for y that are much larger than 1 or much smaller than 0, which is also unreasonable. This is what motivates logistic regression.
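To make the two problems concrete, here is a minimal NumPy sketch with made-up tumor-size numbers (none of these values come from the original figures): refitting after adding one very large tumor shifts the 0.5 threshold to the right so that a true tumor case is predicted as normal, and the linear h(x) also produces values greater than 1.

```python
import numpy as np

def fit_linear(x, y):
    """Ordinary least squares on [1, x] (the fitted line)."""
    X = np.column_stack([np.ones_like(x), x])
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

# Made-up tumor sizes; label 1 = tumor, 0 = normal.
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

t0, t1 = fit_linear(x, y)
print("0.5 threshold without outlier:", (0.5 - t0) / t1)   # ~4.5

# Add one very large tumor (still label 1) and refit.
t0, t1 = fit_linear(np.append(x, 100.0), np.append(y, 1.0))
print("0.5 threshold with outlier:  ", (0.5 - t0) / t1)    # ~6.6
# The true tumor at x = 6 now falls below the threshold and is
# predicted "normal"; the prediction is also unbounded:
print("h(100) =", t0 + t1 * 100.0)                          # ~1.08, i.e. > 1
```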
2. Logistic regression
Logistic regression actually changes our hypothesis (as shown) to h_θ(x) = g(θ^T x) = 1 / (1 + e^(−θ^T x)), where g is the sigmoid function, so that g(z) ≥ 0.5 exactly when z ≥ 0. Taking h_θ(x) = g(θ0 + θ1·x1 + θ2·x2) with θ = [−3, 1, 1]^T as an example, we have:
Predict y = 1 if −3 + x1 + x2 ≥ 0
Predict y = 0 if −3 + x1 + x2 < 0
This decision boundary classifies the dataset shown in the diagram nicely; a small sketch of the rule follows below.
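A toy implementation of that decision rule (illustrative code, not from the original post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-3.0, 1.0, 1.0])                 # [theta0, theta1, theta2]

def predict(x1, x2):
    h = sigmoid(theta @ np.array([1.0, x1, x2]))   # h_theta(x) = g(theta^T x)
    return 1 if h >= 0.5 else 0                    # same as: -3 + x1 + x2 >= 0

print(predict(2.5, 1.0))   # -3 + 3.5 >= 0  -> 1
print(predict(1.0, 1.0))   # -3 + 2.0 <  0  -> 0
```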
Non-linear decision boundaries work in a similar way, as in the example below.
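For example (a standard illustration from the same course, not one of the original figures), taking polynomial features with θ = [−1, 0, 0, 1, 1]^T gives:

```latex
h_\theta(x) = g\!\left(-1 + x_1^2 + x_2^2\right), \qquad \text{predict } y = 1 \iff x_1^2 + x_2^2 \ge 1,
```

so the decision boundary is the unit circle rather than a straight line.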
The cost function of logistic regression
Recall that the cost function of linear regression is as follows:
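The formula image is not reproduced here; the standard squared-error cost from the course is:

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2
```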
At this point we can no longer simply reuse the linear model's cost function for logistic regression, because that leads to gradient descent on a non-convex function, which easily gets stuck in local minima. The graph at the bottom left shows the cost function obtained by plugging the logistic hypothesis directly into the linear model's cost: since the logistic hypothesis is itself nonlinear, the resulting cost function is non-convex, and if gradient descent is used to optimize the parameters it can easily fall into a local minimum, affecting the final classification results.
So we need to design a new cost function, as shown.
At first I did not understand why the cost function is designed this way; Andrew's video explains it in detail. In the plot, the horizontal axis is h(x) and the vertical axis is the cost, and the plot above is for the case y = 1. When h(x) = 1, our prediction is 1 and the actual label is also 1, so the prediction is correct and the cost is 0; this is the point (1, 0) where the curve meets the h(x) axis. When h(x) = 0, the prediction is 0 but the actual label is 1, so the prediction is wrong and the cost tends to infinity. In other words, under y = 1 a correct prediction gives cost = 0 (the intersection (1, 0)), while a wrong prediction (h(x) → 0) gives a cost approaching infinity: we punish that error by assigning a very large cost, which gradient descent then corrects when the parameters are updated. The case y = 0 is similar, as shown (the analysis mirrors the y = 1 case).
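For reference, the curves described above are those of the standard logistic cost (per-example cost and full cost function):

```latex
\mathrm{Cost}\!\left(h_\theta(x), y\right) =
\begin{cases}
-\log\!\big(h_\theta(x)\big) & \text{if } y = 1,\\[2pt]
-\log\!\big(1 - h_\theta(x)\big) & \text{if } y = 0,
\end{cases}
\qquad
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log h_\theta\!\left(x^{(i)}\right) + \big(1 - y^{(i)}\big) \log\!\big(1 - h_\theta(x^{(i)})\big) \Big]
```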
The explanation of the logistic cost function above comes from the short Coursera course; because it is a short course, Professor Andrew does not give a detailed derivation there. Readers who want the details can find the full lectures and derivation in the NetEase Open Course (linked in the references below). I took the time to watch that course, and below is a summary of where the logistic cost function actually comes from.
First we make the following assumption about the hypothesis: h(x) represents the probability that y = 1 given x, with θ as the parameter, i.e. P(y = 1 | x; θ) = h_θ(x) and P(y = 0 | x; θ) = 1 − h_θ(x). From this we can write down the likelihood function of y given x, with θ as the parameter, as follows.
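The formula referred to here (written out in its standard form, since the image is missing) is:

```latex
P(y \mid x; \theta) = h_\theta(x)^{\,y} \big(1 - h_\theta(x)\big)^{1 - y},
\qquad
L(\theta) = \prod_{i=1}^{m} h_\theta\!\left(x^{(i)}\right)^{y^{(i)}} \big(1 - h_\theta(x^{(i)})\big)^{1 - y^{(i)}}
```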
All we have to do is find a θ that makes L(θ) as large as possible. Using basic calculus, we first take ℓ(θ) = log L(θ) and then use gradient ascent to update θ step by step, iterating until we find the θ that maximizes ℓ(θ). As shown below (this ℓ(θ) is closely related to the cost function above: minimizing that cost is the same as maximizing ℓ(θ)):
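The missing figure corresponds to the standard log-likelihood and the resulting update rule:

```latex
\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} \Big[ y^{(i)} \log h_\theta\!\left(x^{(i)}\right) + \big(1 - y^{(i)}\big) \log\!\big(1 - h_\theta(x^{(i)})\big) \Big],
\qquad
\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \big( y^{(i)} - h_\theta(x^{(i)}) \big)\, x_j^{(i)}
```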
Notice the update for θ_j here: isn't it surprisingly similar to the update of our linear model? Does that mean logistic regression is really no different from the linear model? In fact it is not, because here h(x) is the nonlinear sigmoid rather than a linear function, and that difference is important.
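Putting the pieces together, here is a minimal NumPy sketch of logistic regression trained by batch gradient ascent on a made-up dataset (the data, learning rate, and iteration count are illustrative only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, alpha=0.1, n_iters=5000):
    """Batch gradient ascent on the log-likelihood l(theta).

    X is an (m, n) feature matrix without the intercept column and
    y is an (m,) array of 0/1 labels; returns theta, intercept first.
    """
    m = X.shape[0]
    Xb = np.column_stack([np.ones(m), X])   # prepend x0 = 1
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        h = sigmoid(Xb @ theta)             # h_theta(x) for every example
        grad = Xb.T @ (y - h)               # gradient of l(theta)
        theta += alpha * grad / m           # ascend the log-likelihood
    return theta

# Made-up data roughly matching the earlier figure: positive class
# where x1 + x2 >= 3.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 3.0, size=(200, 2))
y = (X[:, 0] + X[:, 1] >= 3.0).astype(float)

theta = train_logistic(X, y)
print("learned theta:", theta)              # roughly along the direction of [-3, 1, 1]
h = sigmoid(np.column_stack([np.ones(len(X)), X]) @ theta)
print("training accuracy:", ((h >= 0.5).astype(float) == y).mean())
```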
OK, that basically wraps up logistic regression; I will keep adding to this post. There is also a more efficient method that obtains the updated θ directly through matrix operations and avoids so many iterations, which I will add later.
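The post does not say which matrix-based method is meant; one common candidate from the same course is Newton's method, which typically converges in far fewer iterations. A sketch under that assumption (it may or may not be what the author had in mind):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_newton(X, y, n_iters=10):
    """Newton's method on the log-likelihood l(theta); usage mirrors
    train_logistic above (X without intercept column, y of 0/1 labels)."""
    m = X.shape[0]
    Xb = np.column_stack([np.ones(m), X])     # prepend intercept column
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        h = sigmoid(Xb @ theta)
        grad = Xb.T @ (y - h)                 # gradient of l(theta)
        S = h * (1.0 - h)                     # per-example weights h(1 - h)
        H = -(Xb.T * S) @ Xb                  # Hessian of l(theta)
        theta -= np.linalg.solve(H, grad)     # Newton step: theta - H^{-1} grad
    return theta
```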
References
http://blog.csdn.net/pakko/article/details/37878837
http://blog.csdn.net/abcjennifer/article/details/7716281
http://open.163.com/movie/2008/1/E/B/M6SGF6VB4_M6SGHM4EB.html