Copyright Notice:
This article is published by LeftNotEasy at http://leftnoteasy.cnblogs.com. It may be reproduced in whole or in part, but please indicate the source; if there is any problem, please contact [email protected]
Objective:
At the end of the previous post I mentioned that I was preparing to write about linear classification, and that article is nearly finished. But then I heard that the team is planning to build a distributed classifier, probably using random forests, so I looked through a few papers. A plain random forest is fairly easy to understand; the more sophisticated variants combine it with boosting and other algorithms (see ICCV09). I did not know boosting well either, so I crammed on it. Speaking of boosting, a colleague had implemented a gradient boosting decision tree (GBDT) algorithm before, which made a handy reference.
Several recent papers point out the benefits of model combination: methods such as GBDT or random forests, which combine many simple models, often give better results than a single, more complex model. There are many ways to combine models; randomization (as in random forests) and boosting (as in GBDT) are typical. Today I will mainly discuss some of the mathematical foundations of the gradient boosting method (which differs a bit from traditional boosting). With this foundation, you can follow its application in Friedman's gradient boosting machine.
This article assumes the reader knows basic college mathematics and the basic machine learning concepts of classification and regression.
The main references for this article are PRML and the Gradient Boosting Machine paper.
Boosting method:
The idea behind boosting is actually quite simple. Given a dataset, we build M models (for example, classifiers), each of which is usually simple and is called a weak learner. At each round, the weights of the data points that the previous round got wrong are increased a little, and the data are classified again. Round after round, the final combined classifier achieves better results on both the training data and the test data.
(Figure from PRML, p. 660.) The figure shows a boosting process: the green line is the combined model obtained so far (the merged result of the weak models from the first m rounds), and the dashed line is the weak model obtained in the current round. At each round, more attention is paid to the misclassified points. The red and blue dots are the data; the larger a point is drawn, the higher its weight. Looking at the lower-right panel, by m = 150 the combined model can already separate the red and blue points quite well.
Boosting can be represented by the following formula:
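The formula itself was an image in the original post; the following is a reconstruction in PRML's two-class notation, where y_m(x) is the m-th weak classifier and α_m its weight (the sign(·) form is for the two-class case):

```latex
Y_M(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{m=1}^{M} \alpha_m \, y_m(\mathbf{x}) \right)
```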
There are N points in the training set. We assign each point a weight w_i (0 ≤ i < N) indicating its importance. As the models are trained, these weights are adjusted: if a point is classified correctly, its weight is decreased; if it is classified incorrectly, its weight is increased. At the beginning, all weights are equal. The green line in the figure above indicates the model trained so far; you can imagine that the longer the procedure runs, the more the training focuses on the points that are easy to get wrong (the high-weight ones). When the procedure finishes, we have M models y_1(x), ..., y_M(x), which are combined by weighting into a final model Y_M(x). A minimal code sketch of this loop is given below.
I think boosting is rather like how a person learns. When you start learning something, you do some exercises, and at first you get even simple problems wrong. But as you go on, the simple problems no longer trip you up, and you move on to harder ones, until after enough practice you can solve both the hard problems and the easy ones.
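To make the reweight-and-combine loop described above concrete, here is a minimal sketch of one well-known instance of boosting, AdaBoost with decision stumps. The choice of AdaBoost's specific weight-update rule, and of scikit-learn stumps as the weak learners, are my assumptions; the post itself does not fix a particular rule.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=150):
    """Minimal AdaBoost sketch; labels y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                  # start with equal weights on all points
    learners, alphas = [], []

    for m in range(n_rounds):
        # Train a weak learner (a decision stump) on the weighted data.
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        miss = pred != y
        err = np.dot(w, miss) / w.sum()      # weighted training error
        if err >= 0.5 or err == 0.0:         # no better than chance, or already perfect
            break
        # Weight of this weak learner in the final combination.
        alpha = 0.5 * np.log((1.0 - err) / err)
        # Increase the weights of misclassified points, decrease the others.
        w *= np.exp(alpha * np.where(miss, 1.0, -1.0))
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)

    def predict(X_new):
        # Weighted vote of all weak learners: Y_M(x) = sign(sum_m alpha_m * y_m(x)).
        scores = sum(a * s.predict(X_new) for a, s in zip(alphas, learners))
        return np.sign(scores)

    return predict
```

Each round, the misclassified (high-weight) points dominate the next stump's training, which is exactly the behaviour shown in the PRML figure.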
Gradient Boosting Method:
In fact, boosting is more of a general idea, and gradient boosting is one particular boosting method. Its main idea is that each new model is built along the gradient descent direction of the loss function of the models built so far. That sentence is a bit of a mouthful. The loss function describes how unreliable the model is: the larger the loss, the more error-prone the model (in reality there is also a bias-variance trade-off, but here we simply assume that a larger loss means a more error-prone model). If our models keep driving the loss function down, it means they keep improving, and the best way to make the loss decrease is to step along the direction of its (negative) gradient.
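As a concrete illustration of "letting the loss decrease along its gradient" (my own example, not from the original post): for the squared-error loss, the negative gradient of the loss with respect to the model's prediction at a point is simply the residual, so each new model in gradient boosting ends up being fit to what the current model still gets wrong.

```latex
L\big(y, F(\mathbf{x})\big) = \tfrac{1}{2}\,\big(y - F(\mathbf{x})\big)^{2}
\quad\Longrightarrow\quad
-\,\frac{\partial L\big(y, F(\mathbf{x})\big)}{\partial F(\mathbf{x})} \;=\; y - F(\mathbf{x})
```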
Below is a mathematical description of gradient boosting. The math is not too complicated; read it patiently and you will understand it. :)
Gradient representation with additive parameters:
Suppose our model can be represented as a function F(X; P), where P denotes the parameters; there may be several of them, P = {P_0, P_1, P_2, ...}, and F(X; P) is our prediction function of X under the parameters P. The model is additive, i.e., a weighted sum of several sub-models, where β gives the weight of each sub-model and α the parameters inside each sub-model. To optimize F, we can optimize {β, α}, that is, P.
We still use P to denote the parameters of the model. Φ(P) denotes the likelihood function of P, that is, the loss function of the model F(X; P). Its concrete expression looks complicated, but you only need to understand it as a loss function, so don't be scared off.
Since the model F(X; P) is additive, the parameter P is additive as well, so optimizing P can itself be a gradient descent process. Assuming the first m-1 models have already been obtained, to get the m-th one we first compute the gradient of the loss at the parameters of the first m-1 models; this gives g_m, the direction of steepest descent.
Here there is a very important assumption: the first m-1 models that have already been found are treated as known and are not changed; our goal is only to improve the model built on top of them. It is just like doing things in real life: there is no medicine for regretting what was already done wrong; one can only try not to make mistakes in what follows.
The new parameter increment we obtain therefore lies along the gradient direction of the likelihood (loss) function of P, and ρ is the distance descended along that gradient direction.
Finally, we can obtain the best ρ by a line search, that is, by minimizing the loss along that direction.
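The formulas in this section were images in the original post. Here is a reconstruction of the parameter-space derivation following the notation of Friedman's paper; writing the loss as a sum over the training points (x_i, y_i) is my assumption:

```latex
% Additive model with additive parameters
F(\mathbf{x}; P) = \sum_{m=1}^{M} \beta_m \, h(\mathbf{x}; \alpha_m), \qquad
P = \{\beta_m, \alpha_m\}_{1}^{M}, \qquad
P_M = \sum_{m=0}^{M} p_m

% Loss (likelihood) function of the parameters
\Phi(P) = \sum_{i=1}^{N} L\big(y_i,\, F(\mathbf{x}_i; P)\big)

% Gradient at the parameters of the first m-1 models
g_m = \left[ \frac{\partial \Phi(P)}{\partial P} \right]_{P = P_{m-1}}, \qquad
P_{m-1} = \sum_{j=0}^{m-1} p_j

% New parameter increment: a step of length rho_m against the gradient
p_m = -\rho_m \, g_m

% Line search for the best step length
\rho_m = \arg\min_{\rho} \; \Phi\big(P_{m-1} - \rho \, g_m\big)
```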
Gradient representation with additive functions:
The previous section obtained a gradient descent method for the loss (likelihood) function of P by using the additivity of the parameters P. We can generalize additivity from the parameters to function space: the model F(x) is written directly as a sum of functions f_i(x), where f_i(x) plays a role similar to h(x; α) above. Since this is the notation used in the author's paper, I use it here as well.
Similarly, we can obtain the gradient descent direction g_m(x) of the loss with respect to the function F(x).
Finally, we obtain the expression for the m-th increment f_m(x) and the updated model F_m(x).
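Again, the formulas were images; the following reconstruction gives the function-space view under the same assumptions (the model is treated "non-parametrically", with the loss differentiated directly with respect to the function value F(x)):

```latex
% Additive expansion in function space
F_M(\mathbf{x}) = \sum_{m=0}^{M} f_m(\mathbf{x})

% Gradient of the loss with respect to the function value, at the current model
g_m(\mathbf{x}) = \left[ \frac{\partial L\big(y,\, F(\mathbf{x})\big)}{\partial F(\mathbf{x})} \right]_{F(\mathbf{x}) = F_{m-1}(\mathbf{x})}, \qquad
F_{m-1}(\mathbf{x}) = \sum_{j=0}^{m-1} f_j(\mathbf{x})

% The m-th increment is a steepest-descent step in function space
f_m(\mathbf{x}) = -\rho_m \, g_m(\mathbf{x}), \qquad
F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) - \rho_m \, g_m(\mathbf{x})
```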
General gradient descent boosting framework:
Below I derive the general form of gradient descent boosting, following the discussion above:
For the model parameters {β, α}, the training objective can be stated as follows: for the N sample points (x_i, y_i), compute their loss under the model F(x; α, β); the optimal {β, α} is the set of parameters that makes this loss smallest. Here {β_m, α_m}, m = 1, ..., M, denotes the M pairs of parameters, one pair per sub-model.
Written in the gradient descent style, this becomes: the parameters {α_m, β_m} of the model f_m(x) we are about to obtain should make f_m point in the direction in which the loss of the previously obtained model F_{m-1}(x) decreases fastest.
For each data point x_i we can compute a g_m(x_i), and putting these together gives the full gradient descent direction over the training set.
To make f_m(x) point along the direction of g_m(x), we fit it to that direction with least squares, which gives α_m.
On the basis of the α_m obtained this way, β_m can then be obtained (by a line search over the loss, just like the ρ above).
Finally, the new sub-model is merged into the overall model.
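The framework formulas were also images in the original. Below is a reconstruction of the generic steps from Friedman's paper (his Algorithm 1), with β_m playing the role of the step length ρ_m to match the post's notation:

```latex
% Training objective over the N samples
\{\beta_m, \alpha_m\}_{1}^{M}
  = \arg\min_{\{\beta'_m, \alpha'_m\}_{1}^{M}}
    \sum_{i=1}^{N} L\Big(y_i,\; \sum_{m=1}^{M} \beta'_m \, h(\mathbf{x}_i; \alpha'_m)\Big)

% Greedy stagewise version: add one sub-model at a time
(\beta_m, \alpha_m)
  = \arg\min_{\beta, \alpha}
    \sum_{i=1}^{N} L\Big(y_i,\; F_{m-1}(\mathbf{x}_i) + \beta \, h(\mathbf{x}_i; \alpha)\Big)

% Pseudo-residuals: the negative gradient of the loss at each data point
\tilde{y}_i = -\left[ \frac{\partial L\big(y_i,\, F(\mathbf{x}_i)\big)}{\partial F(\mathbf{x}_i)} \right]_{F = F_{m-1}},
\qquad i = 1, \dots, N

% Least-squares fit of the sub-model to the negative gradient
\alpha_m = \arg\min_{\alpha, \beta}
    \sum_{i=1}^{N} \big[\, \tilde{y}_i - \beta \, h(\mathbf{x}_i; \alpha) \,\big]^{2}

% Line search for the step length, then merge into the model
\beta_m = \arg\min_{\beta}
    \sum_{i=1}^{N} L\Big(y_i,\; F_{m-1}(\mathbf{x}_i) + \beta \, h(\mathbf{x}_i; \alpha_m)\Big),
\qquad
F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \beta_m \, h(\mathbf{x}; \alpha_m)
```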
The overall flow of the algorithm is as follows.
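The original post showed this as a flowchart image. As a stand-in, here is a minimal Python sketch of the same generic procedure for regression with squared-error loss; the choice of squared error (under which the pseudo-residuals are ordinary residuals and the line search has a closed form) and of a depth-limited scikit-learn regression tree as the weak learner h(x; α) are my assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100, max_depth=2):
    """Generic gradient boosting sketch for the squared-error loss L = 1/2 (y - F)^2."""
    # F_0 is the constant that minimizes the loss (the mean, for squared error).
    f0 = float(np.mean(y))
    F = np.full(len(y), f0)
    learners, steps = [], []

    for m in range(n_rounds):
        # Pseudo-residuals: the negative gradient of the loss at F_{m-1}.
        residuals = y - F
        # Least-squares fit of the weak learner h(x; alpha_m) to the pseudo-residuals.
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred = h.predict(X)
        # Line search for the step length (closed form for squared error).
        rho = float(np.dot(residuals, pred) / (np.dot(pred, pred) + 1e-12))
        # Merge the new weak model into the ensemble: F_m = F_{m-1} + rho * h.
        F = F + rho * pred
        learners.append(h)
        steps.append(rho)

    def predict(X_new):
        out = np.full(X_new.shape[0], f0)
        for h, rho in zip(learners, steps):
            out += rho * h.predict(X_new)
        return out

    return predict

# Tiny usage example on synthetic data (illustration only).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
    model = gradient_boost(X, y, n_rounds=50)
    print("train MSE:", np.mean((model(X) - y) ** 2))
```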
Later in the paper, the author also describes extensions of the algorithm to other settings; among them, the multi-class logistic regression and classification variant is an implementation of GBDT, and you can see that its flow is similar to the algorithm above. I will not continue further here, otherwise this post would turn into a translation of the paper; please refer to the article Greedy Function Approximation: A Gradient Boosting Machine by Friedman.
Summary:
This article mainly discussed the boosting and gradient boosting methods. Boosting is mainly an idea, roughly "learn from your mistakes and correct them." Gradient boosting, under this idea, is a method for optimizing a function (or model): first decompose the function into an additive form (in fact any function can be written additively; the question is whether it fits the framework well and what the final effect will be), then perform M iterations, each time reducing the loss function along its gradient direction, and in the end obtain a good model. It is worth mentioning that the amount by which the loss is reduced along the gradient direction at each step can itself be regarded as a "small" or "weak" model; in the end these weak models are combined with weights (namely, the distance descended in the gradient direction at each step) into a better overall model.
With gradient descent as a foundation, many more things can be done. One more step along the road of machine learning. :)
(Repost) Mathematics in Machine Learning (3): Boosting and Gradient Boosting for Model Combination