GBDT and Xgboost appear very frequently in competitions and in industry, and can be applied effectively to classification, regression, and ranking problems. They are not hard to use, but fully understanding them takes some effort. This article walks step by step through GB, GBDT, and Xgboost, which are closely related: GBDT is the GB algorithm with decision trees (CART) as base learners, and Xgboost extends and improves GBDT, making the algorithm faster and usually more accurate.
1. Gradient Boosting (GB)
The goal of a learning algorithm in machine learning is to optimize, that is minimize, the loss function. The idea of gradient boosting is to iteratively build multiple (M) weak models and add up their predictions; the next model $F_{m+1}(x)$ is built on top of the previously learned model $F_m(x)$, with the relationship:

$$F_{m+1}(x) = F_m(x) + h(x)$$
The idea of the GB algorithm is simple; the key question is how to generate $h(x)$.
If the loss is the mean squared error of a regression problem, it is easy to see that the ideal $h(x)$ should fit the residual $y - F_m(x)$ perfectly; this is the often-mentioned learning on residuals. Residual learning works well for regression problems, but for general problems (classification, ranking) what is learned is the negative gradient of the loss function in function space; for regression with squared loss the residual and the negative gradient are the same thing. Here $F$ should not be understood as a function in the traditional sense but as a function vector $F(x_1), \dots, F(x_n)$ whose number of elements equals the number of training samples, so learning based on the negative gradient of the loss in function space is also called learning on "pseudo-residuals".
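To see why the two coincide for squared loss (using the conventional factor of one half):

$$L\big(y_i, F(x_i)\big) = \tfrac{1}{2}\big(y_i - F(x_i)\big)^2 \quad\Longrightarrow\quad -\frac{\partial L}{\partial F(x_i)} = y_i - F(x_i),$$

so the pseudo-residual (negative gradient) is exactly the ordinary residual.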
Steps of the GB algorithm (a minimal Python sketch of these steps follows the list):

1. Initialize the model with a constant value:

   $$F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$$

2. Iteratively generate the M base learners; for m = 1 to M:

   1. Compute the pseudo-residuals:

      $$r_{im} = -\left[\frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)}\right]_{F(x)=F_{m-1}(x)}, \qquad i = 1, \dots, n$$

   2. Fit a base learner $h_m(x)$ to the pseudo-residuals, i.e. train it on the set $\{(x_i, r_{im})\}_{i=1}^{n}$

   3. Compute the optimal multiplier by line search:

      $$\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\big(y_i,\, F_{m-1}(x_i) + \gamma\, h_m(x_i)\big)$$

   4. Update the model:

      $$F_m(x) = F_{m-1}(x) + \gamma_m\, h_m(x)$$
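Below is a minimal sketch of these steps for the squared loss, using a shallow scikit-learn regression tree as the base learner (the function and parameter names here are illustrative, not taken from any particular GBDT library):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_mse(X, y, n_rounds=100, learning_rate=0.1, max_depth=3):
    # Step 1: initialize with the constant that minimizes squared loss (the mean).
    f0 = y.mean()
    pred = np.full(len(y), f0)
    trees = []
    for m in range(n_rounds):
        # Step 2.1: pseudo-residuals; for 1/2*(y - F)^2 the negative gradient is y - F.
        residuals = y - pred
        # Step 2.2: fit a weak base learner to the pseudo-residuals.
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        # Steps 2.3/2.4: for squared loss the line-search multiplier is absorbed into
        # the leaf values, so apply shrinkage and update the model.
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def gb_predict(f0, trees, X, learning_rate=0.1):
    return f0 + learning_rate * sum(t.predict(X) for t in trees)
```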
2. Gradient Boosting Decision Tree (GBDT)
GBDT is the GB algorithm with decision trees, specifically CART, as the base learners; as the name implies, GBDT combines GB and DT. Note that the decision trees here are regression trees. The trees in GBDT are weak models: the depth is small, generally no more than 5, and the number of leaf nodes does not exceed 10; each generated tree is scaled by a relatively small shrinkage factor (learning rate < 0.1), and some GBDT implementations also add random row subsampling (0.5 <= subsample <= 0.8) to improve the generalization ability of the model. The best parameters are chosen by cross-validation. The core practical question of GBDT therefore becomes how to build a CART regression tree.
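These hyperparameters map directly onto, for example, scikit-learn's GradientBoostingRegressor; a usage sketch with illustrative (not recommended) values:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=2000, n_features=20, noise=0.5, random_state=0)

# Weak trees (small depth, few leaves), strong shrinkage, random row subsampling.
gbdt = GradientBoostingRegressor(
    n_estimators=300,
    learning_rate=0.05,   # shrinkage factor < 0.1
    max_depth=4,          # depth no more than ~5
    max_leaf_nodes=10,    # at most ~10 leaves per tree
    subsample=0.7,        # row sampling fraction in [0.5, 0.8]
)

# Pick the best parameters by cross-validation.
search = GridSearchCV(gbdt, {"learning_rate": [0.01, 0.05, 0.1], "max_depth": [3, 4, 5]}, cv=5)
search.fit(X, y)
print(search.best_params_)
```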
The CART classification tree is described in many books and materials, but it bears repeating that GBDT uses regression trees. As a comparison, recall that CART is a binary tree. At each split, the classification tree exhaustively tries every threshold of every feature, finds the feature and threshold with the best Gini index, and then divides the samples into two branches according to feature <= threshold and feature > threshold, each branch containing the samples that satisfy its condition. It keeps branching in the same way until all samples under a branch belong to a single class or a preset stopping condition is reached; if the classes in a final leaf node are not unique, the majority class is used as the class of that leaf.

The regression tree follows a similar overall process, but every node (not only the leaf nodes) has a predicted value; taking age as an example, the predicted value equals the average age of everyone belonging to that node. Branching still exhaustively tries every threshold of every feature to find the best split point, but the criterion is no longer the Gini index; it is the minimization of the squared error, i.e. the sum over everyone of (actual age - predicted age)^2, divided by N. Intuitively, the more people whose age is predicted wrong, and the more absurdly wrong the predictions are, the larger this error, so minimizing it finds the most reliable split. Branching continues until the age at every leaf node is unique (which is rarely achievable) or a preset stopping condition is reached (such as a maximum number of leaves); if the ages in a final leaf node are not unique, the average age of the people in that node is used as the predicted age of that leaf.
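As a toy illustration of this split criterion for a single numeric feature (real implementations pre-sort the feature and scan the candidates incrementally rather than recomputing the means each time):

```python
import numpy as np

def best_split_sse(x, y):
    """Try every threshold of one feature; return the split with the smallest
    total squared error of the two children."""
    best_sse, best_threshold = np.inf, None
    for t in np.unique(x)[:-1]:                 # candidate thresholds
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_threshold = sse, t
    return best_threshold, best_sse

ages = np.array([12.0, 14.0, 15.0, 35.0, 40.0, 42.0])
heights = np.array([1.4, 1.5, 1.6, 1.7, 1.72, 1.75])   # feature used to split
print(best_split_sse(heights, ages))                   # separates young from old
```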
3. Xgboost
Xgboost is an efficient implementation of the GB algorithm; the base learner in Xgboost can be either a CART tree (gbtree) or a linear model (gblinear). All of the following, including the formulas, comes from the original paper.
(1). Xgboost adds an explicit regularization term to the objective function; when the base learner is a CART tree, the regularization is related to the number of leaf nodes of the tree and the values (weights) at the leaves.
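Concretely, for a tree $f$ with $T$ leaves and leaf weights $w_1, \dots, w_T$, the regularization term in the paper is

$$\Omega(f) = \gamma\, T + \tfrac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^2,$$

and the objective at each boosting round is the training loss plus this penalty.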
(2). GB uses only the first derivative of the loss function with respect to $F(x)$ to compute the pseudo-residuals used to learn $f_m(x)$; Xgboost uses not only the first derivative but also the second derivative.
The loss at round t:

$$L^{(t)} = \sum_{i=1}^{n} l\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t)$$

Taking a second-order Taylor expansion, where $g_i$ is the first derivative and $h_i$ the second derivative of the loss with respect to $\hat{y}_i^{(t-1)}$:

$$L^{(t)} \simeq \sum_{i=1}^{n} \Big[\, l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t^2(x_i) \Big] + \Omega(f_t)$$
(3). The criterion for finding the best split point in a CART regression tree is minimizing the squared error; Xgboost's criterion is maximizing the split gain, where $\lambda$ and $\gamma$ come from the regularization term:

$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$$

Here $G_L, H_L$ (resp. $G_R, H_R$) are the sums of $g_i$ and $h_i$ over the samples that fall into the left (resp. right) child.
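A small numeric sketch of this criterion (the function name `split_gain` is illustrative, not the library's internals); for squared loss, $g_i = \hat{y}_i - y_i$ and $h_i = 1$:

```python
import numpy as np

def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of splitting a node into (left, right), following the structure score."""
    def score(G, H):
        return G * G / (H + lam)
    GL, HL = g_left.sum(), h_left.sum()
    GR, HR = g_right.sum(), h_right.sum()
    gain = 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma
    # The optimal leaf weight of any node is w* = -G / (H + lambda).
    return gain, -GL / (HL + lam), -GR / (HR + lam)

# Squared loss: g = prediction - label, h = 1 for every sample.
y = np.array([1.0, 1.2, 3.0, 3.1])
pred = np.zeros(4)
g, h = pred - y, np.ones(4)
print(split_gain(g[:2], h[:2], g[2:], h[2:]))   # split separating small and large labels
```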
The overall steps of the Xgboost algorithm are basically the same as GB: initialize with a constant; where GB fits each base learner to the first-order information $r_i$, Xgboost uses both the first derivative $g_i$ and the second derivative $h_i$; iteratively generate the base learners and add them up to update the model.
In addition to the three points above, xgboost's implementation also contains a number of optimizations compared with GBDT (a usage sketch follows the list):
- When looking for the best split point, the traditional greedy method of enumerating every possible split of every feature is too inefficient, so xgboost implements an approximate algorithm. The general idea is to enumerate a number of candidate split points according to percentiles of the feature distribution, and then to find the best split point from these candidates using the split-gain formula above.
- xgboost takes sparse training data into account: for missing values (or a specified value) a default branch direction can be learned at each split, which can greatly increase the efficiency of the algorithm; the paper mentions a speedup of about 50 times.
- Feature values are pre-sorted and stored in memory as column blocks that can be reused across iterations; although the boosting rounds themselves must be sequential, split finding over the different feature columns can be done in parallel.
- Column-wise storage helps find the best split point, but gathering the gradient statistics row by row causes non-contiguous memory access and, in bad cases, cache misses that slow the algorithm down; the paper collects the data into per-thread buffers first and then computes (cache-aware access), improving efficiency.
- xgboost also considers how to use the disk effectively when the amount of data is large and memory is not enough (out-of-core computation), mainly combining multi-threading, block compression, and block sharding to improve the efficiency of the algorithm as much as possible.
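A minimal usage sketch showing where several of the points above surface in the xgboost Python API (the data and parameter values are illustrative, not recommended settings):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=1000)
X[rng.random(X.shape) < 0.1] = np.nan        # missing values get a learned default direction

params = {
    "objective": "reg:squarederror",  # loss from which g_i and h_i are derived
    "max_depth": 5,                   # shallow, weak trees
    "eta": 0.1,                       # shrinkage / learning rate
    "subsample": 0.8,                 # random row subsampling
    "lambda": 1.0,                    # L2 penalty on leaf weights
    "gamma": 0.0,                     # per-leaf penalty (minimum split gain)
    "tree_method": "approx",          # percentile-based candidate split points
}
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train(params, dtrain, num_boost_round=100)
pred = booster.predict(xgb.DMatrix(X))
```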
Resources:
1. Wikipedia: Gradient boosting
2. XGBoost: A Scalable Tree Boosting System
3. Tianqi Chen's slides
4. Xgboost guide and hands-on practice (PDF)