I recently read through Peter Harrington's "Machine Learning in Action", and the logistic regression chapter left me a little puzzled.

After a brief introduction to the principle of logistic regression, the author immediately gives the code for the gradient ascent algorithm. That is a fairly big jump, and the author himself says that a simple mathematical derivation has been omitted here.

In fact, the same step also comes up in Andrew Ng's machine learning open course. Thinking back, when I watched Andrew's videos as a sophomore, this point already bothered me (Andrew skipped a step as well).

So this post looks at which derivation the author omitted and how to carry it out.

Before that, let's first review logistic regression.

**Logistic regression**

**Rationale:** As "Machine Learning in Action" puts it, "regression" means fitting a straight line to a set of data points, and this fitting process is called regression. The main idea of classification with logistic regression is to build a regression formula for the classification boundary from the existing data.

An example from Andrew's open course illustrates this:

The circles (blue) and crosses (red) are two classes of data points. We need to find a decision boundary that separates them. Here the boundary is clearly linear and can be written as:

$$\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$$

We write the corresponding hypothesis as:

$$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$$

Here g is a function that takes the input, computes a value, and classifies it. We use the classic sigmoid function:

$$g(z) = \frac{1}{1 + e^{-z}}$$
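As a small illustration, the sigmoid can be sketched in plain Python. The function name `sigmoid` and the two-branch form are my own choices for this sketch, not code from the book; the second branch is just an algebraically equivalent form that avoids overflow for very negative inputs.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number z into the interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, exp(-z) could overflow, so use the equivalent
    # form e^z / (1 + e^z) instead.
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

The function returns exactly 0.5 at z = 0 and approaches 1 or 0 as z grows large in either direction, which is why its output can be read as a probability.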

Sometimes, however, the classes cannot be separated by a straight line. In that case the number of θ parameters changes, as with the following set of data:

This is a nonlinear relationship:

$$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)$$

Here both x1 and x2 also appear squared, which makes it possible to find a circular boundary.
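To make the circular boundary concrete, here is a minimal sketch. It assumes the feature vector (1, x1, x2, x1², x2²) and the hypothetical parameter values θ = (-1, 0, 0, 1, 1), which give the unit circle x1² + x2² = 1 as the boundary. These values are chosen for illustration only and do not come from the book.

```python
def circular_boundary_score(x1, x2, theta=(-1.0, 0.0, 0.0, 1.0, 1.0)):
    """Return theta . (1, x1, x2, x1^2, x2^2) for the illustrative theta.

    The score is negative inside the unit circle, zero on it,
    and positive outside it.
    """
    features = (1.0, x1, x2, x1 * x1, x2 * x2)
    return sum(t * f for t, f in zip(theta, features))
```

The score is still linear in θ; only the features are nonlinear in x1 and x2, so the rest of the derivation is unchanged.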

**Formula derivation**

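So we can generalize the boundary to the following form (taking $x_0 = 1$):

$$\theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n = \sum_{i=0}^{n} \theta_i x_i$$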

The boundary expression is the inner product of two vectors, namely:

$$\sum_{i=0}^{n} \theta_i x_i = \theta^T x$$

Feeding this into the sigmoid function to decide the class gives our prediction function, written as:

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

From the graph of the sigmoid, if the output of the prediction function is greater than 0.5 (equivalently, θᵀx > 0), the data point x belongs to class 1; otherwise it belongs to class 0 (for a two-class problem).
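The prediction rule can be sketched as follows. This assumes each data point is given as a list whose first entry is the constant 1 (so that θ0 acts as the intercept) and uses a hypothetical θ = (-3, 1, 1), i.e. the line x1 + x2 = 3, purely as an example. The names `h` and `predict` are illustrative.

```python
import math

def h(theta, x):
    """Prediction function: sigmoid of the inner product theta . x."""
    z = sum(t * xi for t, xi in zip(theta, x))
    # Plain form of the sigmoid; adequate for moderate values of z.
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x, threshold=0.5):
    """Return class 1 if h(theta, x) exceeds the threshold, else class 0."""
    return 1 if h(theta, x) > threshold else 0

theta = [-3.0, 1.0, 1.0]               # hypothetical parameters
above = predict(theta, [1.0, 2.0, 2.0])  # point with x1 + x2 > 3
below = predict(theta, [1.0, 1.0, 1.0])  # point with x1 + x2 < 3
```

With these values, `above` is 1 and `below` is 0, matching which side of the line each point lies on.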

But don't forget our original goal: the θ vector is still unknown. Our aim is:

**To determine the values of the θ parameters so that the decision boundary partitions the data set as well as possible.**

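This is the step that Andrew's course skips: he gives the cost function and J(θ) directly, and then obtains the optimal θ by gradient descent. The cost function and J(θ) are:

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\ln h_\theta(x) & \text{if } y = 1 \\ -\ln\big(1 - h_\theta(x)\big) & \text{if } y = 0 \end{cases}$$

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\big(h_\theta(x^{(i)}), y^{(i)}\big) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \ln h_\theta(x^{(i)}) + (1 - y^{(i)}) \ln\big(1 - h_\theta(x^{(i)})\big) \Big]$$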

With this formula and the gradient descent algorithm, we can solve for θ.
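As a rough sketch of that procedure, the code below assumes J(θ) is the usual cross-entropy cost, J(θ) = -(1/m) Σ [y ln h + (1 - y) ln(1 - h)], whose partial derivative with respect to θj is (1/m) Σ (h - y) xj. The tiny data set, the learning rate and the step count are made-up values for illustration, not taken from the book or the course.

```python
import math

def h(theta, x):
    """Sigmoid of the inner product theta . x."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y):
    """Cross-entropy cost J(theta), averaged over the m samples."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = h(theta, xi)
        # Assumes 0 < p < 1; log would fail if p hit 0 or 1 exactly.
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return -total / len(X)

def gradient_descent(X, y, alpha=0.1, steps=1000):
    """Minimize J(theta) with fixed-step batch gradient descent."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(steps):
        errors = [h(theta, xi) - yi for xi, yi in zip(X, y)]
        grad = [sum(e * xi[j] for e, xi in zip(errors, X)) / m
                for j in range(n)]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Hypothetical one-feature data; the leading 1.0 is the intercept term x0.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
y = [0, 0, 1, 0, 1, 1]
theta_hat = gradient_descent(X, y)
```

Starting from θ = 0, every prediction is 0.5 and the cost is ln 2; after the descent steps the cost on this data is lower.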

Now let's see how this formula is derived.

Let's start by looking at what we already know:

1. A set of data points and their classes (2 classes)

2. Their probability distribution hθ(x), although θ is still an unknown parameter for now

Our goal is **to find the unknown parameters such that the probability of every sample data point belonging to the class it is currently labeled with is as high as possible.**

This leads to Fisher's **maximum likelihood estimation**.

I won't go into the concept and formulas of maximum likelihood estimation here. Instead, here is an example that illustrates what it does:

A hunter and a student are walking along a mountain road when a rabbit suddenly runs out. A shot rings out and the rabbit falls dead. Question: who most likely shot the rabbit?

The answer is obvious: the hunter. Here the hunter plays the role of the parameter θ. The goal of maximum likelihood estimation is to choose the parameters being estimated so that the probability of the observed sample event is as large as possible.

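Maximum likelihood estimation needs the probability function of the distribution. In this case that is simply the output of the sigmoid function (its range of 0 to 1 is exactly a probability of occurrence), which we restate here:

$$P(y = 1 \mid x; \theta) = h_\theta(x)$$

$$P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$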

Writing these two formulas as one:

$$P(y \mid x; \theta) = \big(h_\theta(x)\big)^{y} \big(1 - h_\theta(x)\big)^{1 - y}$$

It can be verified that for y = 1 and for y = 0 the combined formula reduces to the corresponding one of the two cases. Every sample data point satisfies this formula, so we now move on to the whole data set (the sample event here is: every sample data point belongs to its own labeled class).
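That check can also be done numerically. The sketch below assumes the combined form P(y | x; θ) = h^y (1 - h)^(1 - y), with `p` standing in for hθ(x); the function name and the value 0.8 are arbitrary choices for illustration.

```python
import math

def label_probability(y, p):
    """Probability of label y (0 or 1) when P(y = 1) is p."""
    return (p ** y) * ((1 - p) ** (1 - y))

p = 0.8  # hypothetical value of h_theta(x)
prob_if_one = label_probability(1, p)   # the second factor becomes 1
prob_if_zero = label_probability(0, p)  # the first factor becomes 1
```

For y = 1 the exponent of the second factor is zero, leaving p; for y = 0 the exponent of the first factor is zero, leaving 1 - p.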

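Following the procedure of maximum likelihood estimation, we obtain the likelihood function:

$$L(\theta) = \prod_{i=1}^{m} P\big(y^{(i)} \mid x^{(i)}; \theta\big) = \prod_{i=1}^{m} \big(h_\theta(x^{(i)})\big)^{y^{(i)}} \big(1 - h_\theta(x^{(i)})\big)^{1 - y^{(i)}}$$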

We want the θ that maximizes L(θ).

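A product is hard to maximize directly, and multiplying many probabilities together also easily causes numerical underflow. Since x and ln(x) have the same monotonicity, we take the logarithm of both sides:

$$\ln L(\theta) = \sum_{i=1}^{m} \Big[ y^{(i)} \ln h_\theta(x^{(i)}) + (1 - y^{(i)}) \ln\big(1 - h_\theta(x^{(i)})\big) \Big]$$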
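The numerical problem with the raw product is easy to reproduce. The sketch below uses 2000 made-up samples, each assigned probability 0.5 (both numbers are arbitrary). A product of many values below 1 shrinks toward zero until a double-precision float can no longer represent it, whereas the sum of logarithms stays well within range.

```python
import math

probs = [0.5] * 2000  # hypothetical per-sample probabilities

# Direct product: shrinks below the smallest representable float
# and ends up as exactly 0.0.
likelihood = 1.0
for p in probs:
    likelihood *= p

# Sum of logs: the same information on a scale floats handle easily.
log_likelihood = sum(math.log(p) for p in probs)
```

Because the logarithm is monotonic, the θ that maximizes the log-likelihood is the same θ that maximizes the likelihood itself.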

This is exactly the J(θ) that Andrew gave. The only difference is that Andrew puts a negative coefficient, -1/m, in front of it, which turns the maximization into a minimization so that the gradient descent algorithm can be used.
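For completeness, here is a sketch of the gradient that either algorithm needs. It assumes the log-likelihood is written as $\ell(\theta) = \ln L(\theta)$ with $h_\theta(x) = g(\theta^T x)$, and uses the standard sigmoid identity $g'(z) = g(z)\big(1 - g(z)\big)$. The symbol $\ell$ is my own shorthand.

$$
\begin{aligned}
\ell(\theta) &= \sum_{i=1}^{m} \Big[ y^{(i)} \ln h_\theta(x^{(i)}) + (1 - y^{(i)}) \ln\big(1 - h_\theta(x^{(i)})\big) \Big] \\
\frac{\partial \ell}{\partial \theta_j}
&= \sum_{i=1}^{m} \left[ \frac{y^{(i)}}{h_\theta(x^{(i)})} - \frac{1 - y^{(i)}}{1 - h_\theta(x^{(i)})} \right] h_\theta(x^{(i)}) \big(1 - h_\theta(x^{(i)})\big)\, x_j^{(i)} \\
&= \sum_{i=1}^{m} \big( y^{(i)} - h_\theta(x^{(i)}) \big)\, x_j^{(i)} \\
J(\theta) = -\frac{1}{m}\, \ell(\theta)
\quad &\Longrightarrow \quad
\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)\, x_j^{(i)}
\end{aligned}
$$

The two gradients differ only by the factor $-1/m$, which is why climbing $\ell(\theta)$ and descending $J(\theta)$ move θ in the same direction.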

In fact, this formula can do the job just as well as it is. The algorithm simply becomes gradient ascent, and there is no real difference.
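The equivalence can be checked with a small sketch. It assumes the gradient of the log-likelihood with respect to θj is Σ (y - h) xj and that J(θ) is the log-likelihood multiplied by -1/m. The function names and the four-point data set are hypothetical.

```python
import math

def h(theta, x):
    """Sigmoid of the inner product theta . x."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def ascent_step(theta, X, y, alpha):
    """One gradient ascent step on the log-likelihood."""
    errors = [yi - h(theta, xi) for xi, yi in zip(X, y)]
    return [t + alpha * sum(e * xi[j] for e, xi in zip(errors, X))
            for j, t in enumerate(theta)]

def descent_step(theta, X, y, alpha):
    """One gradient descent step on J(theta) = -(1/m) * log-likelihood."""
    m = len(X)
    errors = [h(theta, xi) - yi for xi, yi in zip(X, y)]
    return [t - alpha * sum(e * xi[j] for e, xi in zip(errors, X)) / m
            for j, t in enumerate(theta)]

# Hypothetical data; the leading 1.0 is the intercept term x0.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 1, 0, 1]
theta = [0.0, 0.0]
alpha = 0.1

up = ascent_step(theta, X, y, alpha / len(X))  # ascent with rate alpha / m
down = descent_step(theta, X, y, alpha)        # descent with rate alpha
```

Up to floating-point rounding, `up` and `down` hold the same values: the two algorithms differ only in the sign convention and in whether the 1/m factor is folded into the learning rate.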

**Conclusion**

Finally, a recommendation for "Machine Learning in Action": it is a really good and very practical book that both introduces ML and builds hands-on skills.

Logistic regression cost function and the derivation of J(θ): Andrew Ng's "Machine Learning" open course