Personal Summary:
1. This article is mostly proofs, so it contains quite a few mathematical formulas. The author of the original note also omits some steps, and the material does not connect smoothly with the previous article, so beginners may find it hard to follow. I suggest reading it alongside the original Stanford machine learning handout (in English; I could not find a complete Chinese translation). If the derivations still feel confusing, that is a sign you need to brush up on some basic math.
2. Building on the gradient descent method from the previous article, this lecture introduces a faster iterative method: Newton's method. Formula (1) in the original note is easy enough to follow, but how does it suddenly turn into formula (2)? I suspect this confuses some readers. In fact the original note is written in a slightly misleading way: formula (2) should not be read in terms of a generic f(θ) but in terms of ℓ(θ). And what is ℓ(θ)? It is the likelihood function mentioned in the previous article, written in logarithmic form:

\ell(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]

Taking the derivative gives:

\frac{\partial \ell(\theta)}{\partial \theta_j} = \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}

So what plays the role of f(θ) in formula (2) is actually ℓ′(θ), the derivative above, and what plays the role of f′(θ) is the derivative of that expression, i.e. ℓ″(θ).
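For readers who do not have the original figures, here is what formulas (1) and (2) presumably refer to, namely the standard Newton updates from the Stanford handout (the notation here is mine, not the original author's):

% Formula (1): Newton's method for finding a root of f(\theta)
\theta := \theta - \frac{f(\theta)}{f'(\theta)}

% Formula (2): the same update applied to maximizing \ell(\theta),
% i.e. finding a root of \ell'(\theta)
\theta := \theta - \frac{\ell'(\theta)}{\ell''(\theta)}

% Vector-valued \theta, with H the Hessian of \ell(\theta)
\theta := \theta - H^{-1} \nabla_\theta \ell(\theta)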
3. As mentioned above, Newton's method iterates quickly, and it is said to have quadratic (second-order) convergence. Presumably many people will ask: what is quadratic convergence, and why is it fast? Simply put, Newton's method takes the curvature (second derivative) into account, not just the gradient, so on a quadratic objective it reaches the minimum in a single step. In effect, it fits the surface at the current position with a quadratic surface, while gradient descent fits it with a plane. If that is not clear, look at this diagram from the wiki:
The red line is Newton's method and the green line is gradient descent. A popular way to put it: gradient descent is a greedy algorithm, taking one step and then looking again, each time choosing the direction of steepest descent at the current point; Newton's method also considers the gradient of the gradient, so it has a more global view: it accounts for how the gradient will change after the step is taken, which is closer to the truly optimal descent strategy. For the underlying theory, mathematics, and convex optimization background, see: "Why does Newton's method need fewer iterations than gradient descent when solving optimization problems?" and "Gradient descent, Newton, and quasi-Newton optimization algorithms and their implementations".
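To make the "one big step vs. many small steps" intuition concrete, here is a toy comparison on a simple 1-D convex function. This is an illustration I added (the function exp(x) - 2x and the step size are arbitrary choices of mine), not code from the notes:

```python
# Illustrative only: compare plain gradient descent with Newton's method
# on f(x) = exp(x) - 2x, whose minimum is at x = ln 2.
import math

def grad(x):   # f'(x) = e^x - 2
    return math.exp(x) - 2.0

def hess(x):   # f''(x) = e^x
    return math.exp(x)

def gradient_descent(x=0.0, lr=0.1, tol=1e-10, max_iter=10_000):
    for i in range(max_iter):
        g = grad(x)
        if abs(g) < tol:
            return x, i
        x -= lr * g           # first-order step: only uses the slope
    return x, max_iter

def newton(x=0.0, tol=1e-10, max_iter=100):
    for i in range(max_iter):
        g = grad(x)
        if abs(g) < tol:
            return x, i
        x -= g / hess(x)      # second-order step: slope scaled by curvature
    return x, max_iter

if __name__ == "__main__":
    x_gd, it_gd = gradient_descent()
    x_nt, it_nt = newton()
    print(f"gradient descent: x = {x_gd:.10f} in {it_gd} iterations")
    print(f"newton's method : x = {x_nt:.10f} in {it_nt} iterations")
    # Newton typically needs only a handful of iterations here, while
    # gradient descent with a fixed step size needs far more.
```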
4. Regarding the question raised in the previous article, namely why the final update formulas for logistic regression and least squares look so similar: this lecture shows that both underlying distributions belong to the exponential family, which then leads to the generalized linear model (GLM). This part involves a lot of mathematical derivation; if your math background is not strong, read the original English handout, and if it still does not make sense, at least remember the conclusion.
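For reference, a sketch of the exponential family form used in the handout (my recollection of the CS229 notes, not the original author's wording):

p(y; \eta) = b(y) \, \exp\!\left( \eta^{T} T(y) - a(\eta) \right)

Here \eta is the natural parameter, T(y) is the sufficient statistic (often simply T(y) = y), and a(\eta) is the log partition function. Both the Bernoulli and Gaussian distributions can be written in this form, which is why logistic regression and least squares fall out of the same GLM construction with \eta = \theta^{T} x.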
5. About Newton's method: as mentioned above, it involves a Hessian matrix H of size n*n (actually (n+1)*(n+1) once the x0 intercept term is included, where n is the number of features) that must be formed and inverted at every step, so n cannot be too large. Newton's method can be combined with stochastic gradient descent: first use stochastic gradient descent to get near the optimum, then switch to Newton's method to finish; the effect tends to be better. A sketch of the Hessian-based update follows below.
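To show where the (n+1)x(n+1) Hessian comes in, here is a minimal sketch of Newton's method for logistic regression, assuming the sigmoid hypothesis from the earlier note; the NumPy code and variable names are mine, not the original author's:

```python
# Minimal sketch of Newton's method for logistic regression with
# hypothesis h(x) = sigmoid(theta^T x). Illustrative only.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, n_iter=10):
    # X: (m, n) feature matrix, y: (m,) labels in {0, 1}
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])      # prepend x0 = 1 intercept column
    theta = np.zeros(n + 1)
    for _ in range(n_iter):
        h = sigmoid(Xb @ theta)               # current predictions
        grad = Xb.T @ (y - h)                 # gradient of the log-likelihood
        W = h * (1.0 - h)                     # per-example weights h(1 - h)
        H = -(Xb * W[:, None]).T @ Xb         # Hessian: an (n+1) x (n+1) matrix
        theta -= np.linalg.solve(H, grad)     # Newton step: theta := theta - H^{-1} grad
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    true_theta = np.array([0.5, 2.0, -1.0, 0.5])
    p = sigmoid(np.hstack([np.ones((200, 1)), X]) @ true_theta)
    y = (p > rng.uniform(size=200)).astype(float)
    print(newton_logistic(X, y))
```

Each iteration forms and solves an (n+1)x(n+1) linear system, which is exactly why the method becomes impractical when the number of features n is very large.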
6. On the multi-class problem: it is really just a generalization of binary classification. For multi-class problems, tree models such as regression trees and classification trees are used more often in practice.
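As a side note, the handout itself handles the multi-class case by deriving softmax regression from the GLM framework; roughly (my summary, not the original author's wording):

p(y = i \mid x; \theta) = \frac{e^{\theta_i^{T} x}}{\sum_{j=1}^{k} e^{\theta_j^{T} x}}, \quad i = 1, \dots, k

which reduces to logistic regression when k = 2.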
Newton's method, the exponential family, and generalized linear models - Stanford ML open course, Lecture Note 4