GBDT stands for Gradient Boosting Decision Tree. As the name implies, it is a classification and regression algorithm built on decision trees. It is easy to see that GBDT has two parts: gradient boosting and the decision tree. Boosting, as a way of combining models, has deep roots in gradient descent; what exactly is the relationship between them? And how are decision trees, as the base weak learners, assembled into a strong classifier through boosting? This article looks at both the surface and the substance of GBDT by answering these two questions.
Gradient descent is well known as a common method for solving differentiable optimization problems. It is an iterative process in which the solution moves in the direction opposite to the gradient at the current solution. This direction is also called the steepest descent direction. The derivation is as follows. Suppose iteration k has just finished; how do we obtain the result of iteration k+1? We take the following first-order Taylor expansion of the function $F$:

$$F(x_{k+1}) = F(x_k + x_{k+1} - x_k) \approx F(x_k) + \nabla F(x_k)^\top (x_{k+1} - x_k)$$

For the function value at iteration k+1 to be smaller than at iteration k, the following inequality must hold:

$$F(x_{k+1}) < F(x_k)$$

We only need to set:

$$x_{k+1} = x_k - \gamma \nabla F(x_k), \quad \gamma > 0$$

because then $F(x_{k+1}) - F(x_k) \approx -\gamma \lVert \nabla F(x_k) \rVert^2 \le 0$.

Continue iterating in this way until $\nabla F(x_k) = 0$, at which point $x_{k+1} = x_k$, the function has converged, and the iteration stops. Because the Taylor expansion requires $x_{k+1} - x_k$ to be small enough, $\gamma$ must be relatively small; it is generally set to a decimal between 0 and 1.
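The iteration above can be sketched in a few lines of Python. This is only a minimal illustration: the objective $F(x) = (x - 3)^2$, the stopping tolerance `tol`, and the name `gradient_descent` are assumptions chosen for the example, not part of the original derivation.

```python
def gradient_descent(grad, x0, gamma=0.1, tol=1e-8, max_iter=10_000):
    """Iterate x_{k+1} = x_k - gamma * grad(x_k) until the gradient is close to 0."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if abs(g) < tol:       # gradient (numerically) zero: converged
            break
        x = x - gamma * g      # step against the gradient
    return x


# Illustrative objective F(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

With `gamma=0.1` each step shrinks the distance to the minimizer by a constant factor, so `x_min` ends up very close to 3.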
What? You have never heard of this $\gamma$? Then let me give you its other name, which you certainly know: the learning rate. Now you finally know why the learning rate should be set relatively small.
Incidentally, gradient descent is a first-order optimization method. Why? Because it needs no second-order or higher information during the iteration. If we expanded the Taylor series to second order instead of first order, the corresponding method would be another well-known method for solving differentiable problems: Newton's method. The details of Newton's method will be introduced in a later article.
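To make the first-order versus second-order distinction concrete, here is a small sketch comparing one gradient descent step with one Newton step on the same one-dimensional function. The function $F(x) = (x - 3)^2$ and the helper names `gd_step` and `newton_step` are assumptions made for illustration only.

```python
def gd_step(x, grad, gamma):
    """First-order step: uses only the gradient."""
    return x - gamma * grad(x)


def newton_step(x, grad, hess):
    """Second-order step: rescales the gradient by the second derivative."""
    return x - grad(x) / hess(x)


grad = lambda x: 2.0 * (x - 3.0)   # F'(x) for F(x) = (x - 3)^2
hess = lambda x: 2.0               # F''(x), constant for a quadratic

x_gd = gd_step(0.0, grad, gamma=0.1)
x_newton = newton_step(0.0, grad, hess)
```

On a quadratic the Newton step lands on the minimizer in a single move, while the gradient step only moves a fraction of the way, which is why second-order information can improve convergence.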
- Boosting: gradient descent in function space
Boosting generally serves as a way of combining models, and that is also its role in GBDT. So what does boosting have to do with gradient descent? In the previous section we said that gradient descent is a method for solving differentiable problems. There is a requirement here: the loss function F above must be directly differentiable with respect to the model x, so that x can be solved directly by gradient iteration. Requiring the loss function to be directly differentiable with respect to the model is a strong assumption, and not all models satisfy it; the decision tree model is one example. Now let us go back to the first section and write F(x) more specifically:
$$F(x) = L(h(x, D), Y)$$
where D is the data features; Y is the data labels; h is the model function, which solves the mapping D -> Y; x is the parameter of the model function, which is what we usually call the model; and L is the objective function or loss function. For the relationship among them, see "The mathematical principles of the decision tree".
Take logistic regression as an example: x is the weight vector, and the model function h expands to:
The objective function L expands to:
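The two expansions referred to above are not reproduced in this text. One standard way to write them is sketched below, assuming labels $Y_i \in \{0, 1\}$ and taking $h$ to be the raw linear score of sample $i$ (the inner product of $x$ and that sample's features $D_i$). The author's original notation may differ.

```latex
\begin{aligned}
h(x, D_i) &= x^\top D_i \\
L\bigl(h(x, D), Y\bigr) &= \sum_i \Bigl[ \log\bigl(1 + e^{h(x, D_i)}\bigr) - Y_i \, h(x, D_i) \Bigr]
\end{aligned}
```

Under this assumption, $\partial L / \partial h(x, D_i) = \sigma(h(x, D_i)) - Y_i$ with $\sigma(z) = 1 / (1 + e^{-z})$, and $\partial h(x, D_i) / \partial x = D_i$, so both derivatives exist.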
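We find that L is differentiable with respect to h, and h is differentiable with respect to x, so L is differentiable with respect to x. We can therefore solve for x directly by gradient descent, with no need to keep h. But what if L is differentiable with respect to h while h is not differentiable with respect to x? We still Taylor-expand L as in the first section, only with respect to h rather than x. For simplicity, we omit D and Y:

$$L(h(x_{t+1})) \approx L(h(x_t)) + \nabla L(h(x_t))^\top \bigl(h(x_{t+1}) - h(x_t)\bigr)$$

where

$$\nabla L(h(x_t)) = \left. \frac{\partial L(h)}{\partial h} \right|_{h = h(x_t)}$$

Following the logic of the first section, it is not hard to derive the following iterative formula:

$$h(x_{t+1}) = h(x_t) - \gamma \nabla L(h(x_t))$$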
But do not forget that our goal is to solve for x, not for h. Since h is not differentiable with respect to x, x must be re-learned from the data. The target for re-learning x is no longer the original target Y, but the gradient of the original loss function L at the current h, namely:

$$\nabla L(h(x_t))$$
This process of re-learning x is exactly what each base weak learner does. Boosting thus uses a weak learner to fit the gradient at each iteration, and in doing so combines the weak learners. Because in the derivation the loss function L cannot be differentiated directly with respect to the model x, only with respect to the model function h, boosting has another nickname: "gradient descent in function space".
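The loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the author's implementation: the loss is the log loss on raw scores with labels in {0, 1}, the data has a single feature, and `fit_fixed_split` is a hypothetical stand-in for a real tree learner that always splits at a fixed threshold.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(h, y):
    """L(h) = sum(log(1 + exp(h_i)) - y_i * h_i), with labels y_i in {0, 1}."""
    return sum(math.log1p(math.exp(hi)) - yi * hi for hi, yi in zip(h, y))


def log_loss_gradient(h, y):
    """Gradient of L with respect to each prediction h_i."""
    return [sigmoid(hi) - yi for hi, yi in zip(h, y)]


def fit_fixed_split(D, target, threshold=0.0):
    """Hypothetical weak learner: two leaves split at a fixed threshold,
    each leaf predicting the mean target of the samples it contains."""
    left = [t for d, t in zip(D, target) if d < threshold]
    right = [t for d, t in zip(D, target) if d >= threshold]
    w_left = sum(left) / len(left) if left else 0.0
    w_right = sum(right) / len(right) if right else 0.0
    return lambda d: w_left if d < threshold else w_right


def boost(D, y, fit_weak_learner, rounds=50, gamma=0.5):
    h = [0.0] * len(y)                         # current predictions, one per sample
    learners = []
    for _ in range(rounds):
        target = log_loss_gradient(h, y)       # new target: gradient of L at the current h
        learner = fit_weak_learner(D, target)  # re-learn x from (D, target)
        learners.append(learner)
        # Step against the fitted gradient: h <- h - gamma * (fitted gradient).
        h = [hi - gamma * learner(d) for hi, d in zip(h, D)]
    return learners, h


D = [-2.0, -1.0, 1.0, 2.0]
y = [0, 0, 1, 1]
learners, h = boost(D, y, fit_fixed_split)
```

Note that the gradient is always taken with respect to the predictions `h`, never with respect to the internal structure of a learner; the learner is simply refit to a new target in every round.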
You may also have heard that boosting is additive. Note that additivity refers to h, not to x. For example, if x is a decision tree, how would you add two decision trees together? At most you can put them side by side. The only thing that can be added is the predicted value h(x, D) that the decision tree model produces for each sample.
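A tiny sketch of this point: the two trees below cannot be "added" as structures, but their predictions for a sample can. The trees `tree_a` and `tree_b`, their split points, and their leaf values are made up for illustration.

```python
def tree_a(d):
    # Hypothetical tree with a single split at d < 0.
    return -0.5 if d < 0 else 0.5


def tree_b(d):
    # Hypothetical tree with a single split at d < 1.
    return -0.2 if d < 1 else 0.3


def ensemble_predict(trees, d):
    """Additive model: the prediction is the sum of the trees' predictions."""
    return sum(tree(d) for tree in trees)


prediction = ensemble_predict([tree_a, tree_b], 0.5)  # 0.5 + (-0.2)
```

The ensemble is kept as a list of trees placed side by side, and the addition happens only at prediction time.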
- Decision tree: the base weak learner
In the previous section we said that the essence of boosting is to use each weak learner to fit the current gradient. Here D is still the D of the original data, but Y is no longer the original Y. The weak learner in GBDT is a classification and regression tree (CART), so we can use a decision tree to fit the gradient $\nabla L(h(x_t))$ directly. Solving for x then becomes finding a tree with K leaf nodes that minimizes the following objective function:

$$\sum_{j=1}^{K} \sum_{i \in l_j} (T_i - w_j)^2$$

where $T$ is the target $\nabla L(h(x_t))$, with $T_i$ its value for sample $i$; $w_j$ is the weight of leaf node $j$; and $l_1, \dots, l_K$ is the collection of leaf nodes, with $i \in l_j$ meaning that sample $i$ falls into leaf $j$. It is easy to see that the weight of each leaf node is the mean of the targets of the samples in that leaf, namely:

$$w_j = \frac{1}{|l_j|} \sum_{i \in l_j} T_i$$

The predicted value for each sample is the weight of the leaf node to which it belongs, i.e. $h(x_{t+1})$.
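As a concrete instance of fitting a tree to the gradient targets, here is a minimal sketch for the smallest possible case: a tree with two leaves on a single feature. The squared-error split criterion and the names `fit_stump` and `predict_stump` are assumptions made for illustration; a real CART implementation handles many features and deeper trees.

```python
def fit_stump(D, T):
    """Fit a 2-leaf regression tree on one feature by minimizing squared error.

    D: list of feature values, T: list of targets (e.g. gradient values).
    Returns (threshold, w_left, w_right); each leaf weight is the mean target
    of the samples that fall into that leaf.
    """
    values = sorted(set(D))
    if len(values) < 2:
        raise ValueError("need at least two distinct feature values to split")
    best = None
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0                      # candidate split point
        left = [t for d, t in zip(D, T) if d < thr]
        right = [t for d, t in zip(D, T) if d >= thr]
        w_left = sum(left) / len(left)             # leaf weight = sample mean
        w_right = sum(right) / len(right)
        sse = (sum((t - w_left) ** 2 for t in left)
               + sum((t - w_right) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, thr, w_left, w_right)
    return best[1], best[2], best[3]


def predict_stump(stump, d):
    """The prediction is the weight of the leaf the sample falls into."""
    thr, w_left, w_right = stump
    return w_left if d < thr else w_right


stump = fit_stump([1.0, 2.0, 3.0, 4.0], [0.5, 0.5, -0.5, -0.5])
```

Here the targets are constant on each side of 2.5, so the best split separates them exactly and each leaf weight equals the common target value on its side.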
In fact, it is not hard to see that the essential difference between GBDT and logistic regression lies in h. If the x in h is a decision tree and the predicted value is obtained from the decision tree, the algorithm is GBDT; if the x in h is a weight vector and the predicted value is the inner product of x and D, the algorithm becomes logistic regression (LR).
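For contrast, here is a sketch of the logistic regression case, where h is differentiable with respect to x and the weight vector can be updated directly through the chain rule. It assumes labels in {0, 1}, the log loss on raw scores, and a constant second feature acting as a bias term; the function name and the toy data are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(D, y, gamma=0.1, rounds=200):
    """Gradient descent directly on the weight vector x."""
    x = [0.0] * len(D[0])
    for _ in range(rounds):
        # h_i is the inner product of x and the features of sample i.
        h = [sum(xj * dj for xj, dj in zip(x, d)) for d in D]
        dL_dh = [sigmoid(hi) - yi for hi, yi in zip(h, y)]
        # Chain rule: dL/dx_j = sum_i dL/dh_i * dh_i/dx_j = sum_i dL/dh_i * D_ij.
        grad_x = [sum(g * d[j] for g, d in zip(dL_dh, D)) for j in range(len(x))]
        x = [xj - gamma * gj for xj, gj in zip(x, grad_x)]
    return x


D = [[-2.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [0, 0, 1, 1]
x = fit_logistic_regression(D, y)
scores = [sum(xj * dj for xj, dj in zip(x, d)) for d in D]
```

The loss and its gradient with respect to h are the same as in the boosting case; only the way x is obtained from that gradient changes.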
The content of this article can therefore be regarded as a fairly general algorithmic framework: for different business problems and data, one only needs to design the corresponding weak learner and loss function. In practical applications, the key is to recognize which parts change and which parts stay the same.
The Taylor expansion in this article is limited to first order. What changes if we expand the original loss function to second order (provided that the loss function has a second derivative)?
You can work through the derivation yourself.
In addition, for a complete GBDT implementation, interested readers can refer to GitHub:
https://github.com/liuzhiqiangruc/dml/tree/master/gbdt
The objective function in this implementation is the "0-1 log loss", which is consistent with the loss function in logistic regression. It also adopts the strategy of Taylor-expanding the loss function to second order.
By correcting the gradient with the second derivative, the algorithm converges better. This is analogous to the difference between gradient descent and Newton's method.
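One common way to apply such a second-order correction is to choose, for each leaf, the increment $w$ that minimizes the second-order expansion $\sum_i (g_i w + \tfrac{1}{2} s_i w^2)$, which gives $w = -\sum_i g_i / \sum_i s_i$. The sketch below compares this with a plain first-order step. It assumes the log loss on raw scores with labels in {0, 1}; it is not taken from the linked repository, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def first_order_increment(h_leaf, y_leaf, gamma=0.5):
    """Increment added to h for the samples in a leaf: -gamma * mean gradient."""
    g = [sigmoid(h) - y for h, y in zip(h_leaf, y_leaf)]
    return -gamma * sum(g) / len(g)


def second_order_increment(h_leaf, y_leaf):
    """Newton-style increment: -(sum of gradients) / (sum of second derivatives)."""
    g = [sigmoid(h) - y for h, y in zip(h_leaf, y_leaf)]
    s = [sigmoid(h) * (1.0 - sigmoid(h)) for h in h_leaf]  # d^2 L / d h_i^2
    return -sum(g) / sum(s)


# Two positive samples whose current score is 0.
step_first = first_order_increment([0.0, 0.0], [1, 1])
step_second = second_order_increment([0.0, 0.0], [1, 1])
```

Both increments point in the same direction, but the second-order one adapts its size to the local curvature of the loss instead of relying only on a fixed learning rate.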
The code structure is designed as follows:
Deep understanding of GBDT