A Brief Introduction to the GBDT Algorithm

Source: Internet
Author: User
Tags: xgboost

Gradient Boosting Decision Tree (GBDT)

The Gradient Boosting Decision Tree (GBDT) algorithm is one of the most frequently mentioned algorithms in recent years, mainly because of its excellent performance in various data mining and machine learning competitions. Many open-source implementations of GBDT have been developed; the most popular are Tianqi Chen's xgboost and Microsoft's LightGBM.

I. Supervised learning

1. The main task of supervised learning

Supervised learning is an important part of machine learning. A supervised learning problem starts from a training set of m samples

{(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)},

where x_i is the feature vector of the i-th sample and y_i is its label. The label y_i may be a discrete value, as in a classification problem, or a continuous value, as in a regression problem. Supervised learning uses the training samples to train a model F that maps sample features to labels, ŷ = F(x).

To solve for the mapping F, a loss function L(y, F(x)) is usually set on the model, and the mapping that makes the loss function smallest is taken as the best mapping:

F* = argmin_F Σ_i L(y_i, F(x_i)).

For a specific problem, such as a linear regression problem, the mapping function takes the form

F(x) = w^T x = w_1 x_1 + w_2 x_2 + ... + w_n x_n,

and the corresponding loss function is usually the squared error L(w) = (1/2m) Σ_i (F(x_i) - y_i)^2; the best parameters w are the ones that minimize L(w).

The gradient descent algorithm is the simplest and most direct way to solve this optimization problem. Gradient descent is an iterative optimization algorithm: starting from an initial point, it repeatedly moves along the negative gradient of the loss function.

The basic steps are:

1) Randomly select an initial point w_0.

2) Repeat the following procedure:

Determine the descent direction: d_t = -∇L(w_t), the negative gradient of the loss at the current point.

Select a step size ρ_t.

Update: w_{t+1} = w_t + ρ_t · d_t.

Until the termination condition is met (for example, the gradient is close to zero or the change in w falls below a threshold).

The specific process of the gradient descent method can be sketched in code as follows:
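The sketch below is a minimal NumPy illustration of the steps above, applied to the squared-error loss of linear regression; the learning rate, iteration count, and toy data are illustrative assumptions, not values from the text.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iters=1000, tol=1e-8):
    """Minimize the squared-error loss L(w) = ||X w - y||^2 / (2m) by gradient descent."""
    m, n = X.shape
    w = np.zeros(n)                          # 1) initial point
    for _ in range(n_iters):                 # 2) repeat
        grad = X.T @ (X @ w - y) / m         #    gradient of L at the current point
        direction = -grad                    #    descent direction = negative gradient
        w_new = w + lr * direction           #    update with step size lr
        if np.linalg.norm(w_new - w) < tol:  #    termination condition
            return w_new
        w = w_new
    return w

# Toy usage: recover w = (2, -3) from noiseless synthetic data (illustrative).
X = np.random.randn(100, 2)
y = X @ np.array([2.0, -3.0])
print(gradient_descent(X, y))
```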

2. Optimization in function space

The above searches for the optimal model within a specified parameter space (the coefficients w of a fixed functional form). Can we instead search for the optimal function directly in function space? Borrowing the idea of the gradient descent method above, we treat the function F itself as the quantity to be optimized: at each step the model is updated along the negative gradient of the loss function with respect to F, so that the loss keeps decreasing.
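A compact sketch of this update, in its standard formulation, is:

```latex
% Gradient descent in function space: the quantity being updated is the function F itself.
F_m(x) = F_{m-1}(x) - \rho_m \, g_m(x),
\qquad
g_m(x) = \left[ \frac{\partial L\bigl(y, F(x)\bigr)}{\partial F(x)} \right]_{F = F_{m-1}}
```

In GBDT, the negative gradient -g_m(x_i), evaluated at the training points, is what the m-th regression tree is trained to fit.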

II. Boosting

1. Boosting as an ensemble method

Boosting is an important method in ensemble learning. The two main approaches in ensemble learning are bagging and boosting. In bagging, different training sample sets are obtained by resampling the training samples (bootstrap sampling); a learner is trained on each of these new training sets, and the results of the individual learners are finally merged to give the final result. The specific process of the bagging method can be sketched as follows:
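The sketch below assumes regression with decision trees as base learners, B bootstrap rounds, and averaging as the merge step; these are illustrative choices, not prescribed by the text.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit_predict(X, y, X_test, B=50, seed=0):
    """Bagging: resample the training set B times, fit one learner per resample, average the results."""
    rng = np.random.RandomState(seed)
    preds = []
    for _ in range(B):
        idx = rng.randint(0, len(X), size=len(X))      # bootstrap resample (with replacement)
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])
        preds.append(tree.predict(X_test))
    return np.mean(preds, axis=0)                      # merge the B learners' results
```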

The most important algorithm in the bagging family is the random forest (RF) algorithm. As the process above shows, in bagging the B learners are independent of each other, which makes the bagging method easy to parallelize. Boosting is different: in a boosting algorithm the learners are built in sequence, and each training sample carries a weight. Initially all sample weights are equal. First, the 1st learner is trained on the training samples; when training is complete, the weights of the misclassified samples are increased and the weights of the correctly classified samples are decreased, and the 2nd learner is then trained on the reweighted data. This continues until B learners have been obtained, and their results are finally merged. Unlike bagging, the weights given to the individual learners in the final combination are not equal.

The most important methods in the boosting family are AdaBoost and GBDT.

GB (gradient boosting) builds the model through M iterations, each of which produces one regression tree. We want each iteration's model to reduce the loss function on the training set as much as possible. How do we make the loss function smaller and smaller? We use the gradient descent idea: at each iteration we move the model in the negative gradient direction of the loss function, so that the model becomes more and more accurate.

Suppose the GBDT model T is composed of 4 regression trees t1, t2, t3, t4, and the sample labels are y = (y1, y2, y3, ..., yn).

The error function of this model is L; for squared error, the error over the whole sample is

L = Σ_i (y_i - T(x_i))^2,  where T(x) = t1(x) + t2(x) + t3(x) + t4(x).

For the first tree, it can be seen that what is fitted is the training sample label y itself, leaving the residual r1 = y - t1. From the formula of the error function it follows that the remaining residuals satisfy r2 = r1 - t2, r3 = r2 - t3, r4 = r3 - t4, ...; in other words, each later regression tree t2, t3, t4 is built to fit the residual left by the trees before it. The residual therefore keeps decreasing until an acceptable threshold is reached.

For the gradient version, the negative gradient of the error function at the current model is used as the "residual" left by the current model's prediction, and a new regression tree is created to fit it. After the update, the residual of the overall GBDT model is further reduced, and L keeps decreasing as well.
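To make this concrete, here is a minimal from-scratch sketch using scikit-learn's DecisionTreeRegressor with squared error, where the negative gradient coincides (up to a constant factor) with the residual; the tree depth, learning rate, and number of trees are illustrative assumptions rather than values from the text.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=4, max_depth=2, lr=1.0):
    """GBDT with squared error: each tree fits the residual (= negative gradient) of the current model."""
    trees, F = [], np.zeros_like(y, dtype=float)       # start from the zero model
    for _ in range(n_trees):
        residual = y - F                               # r = y - F(x); also -dL/dF for L = (y - F)^2 / 2
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        trees.append(tree)
        F += lr * tree.predict(X)                      # add the new tree; the residual shrinks
    return trees

def gbdt_predict(trees, X, lr=1.0):
    return lr * sum(tree.predict(X) for tree in trees)  # sum of all trees' outputs is the final answer
```

With squared error the residual and gradient views coincide; the difference only appears for other losses, as discussed below.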

GBDT comes in two versions:

(1) Residual version

The residual is simply the difference between the real value and the predicted value. In the learning process, a regression tree is first learned; then the residual, "real value - predicted value", is computed and used as the learning target for the next tree, and so on, until the residual falls below a threshold close to 0 or the number of regression trees reaches a limit. The core idea is to reduce the loss function by fitting the residual in each round.

In general, the first tree fits the labels directly, and every subsequent tree is determined by the residuals.

(2) Gradient version

The residual version of GBDT is called a residual-iteration tree: each regression tree learns the residual left by the previous N-1 trees. The gradient version of GBDT is called a gradient-iteration tree: it is solved with the gradient descent method, and each regression tree learns the negative-gradient value of the loss over the previous N-1 trees. What the two have in common is that both are iterated regression trees, both accumulate the result of every tree as the final result, and in both each tree learns what the previous N-1 trees still fail to capture; viewed from the overall process and from the inputs and outputs, there is no difference between the two.

The key difference is whether the gradient is used as the fitting target at each iteration. The former does not use the gradient but the residual: the residual is the globally optimal target, while the gradient is only a locally optimal direction multiplied by a step size. In other words, the former tries to make the result best at every step, while the latter tries to make the result a little better at every step.

Both have advantages and disadvantages. At first sight the former looks more principled: with the exact best target available, why take the long way round and learn only a locally optimal direction? The answer is flexibility. The biggest problem of the former is that, because it relies on residuals, the loss function is essentially fixed to the squared error of the residuals, so it is hard to handle problems other than pure regression. The latter is solved with gradient descent, so any differentiable loss function can be used.
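The extra flexibility of the gradient version can be seen by swapping in a different loss. For absolute error L = |y - F|, the negative gradient is sign(y - F), so each tree fits the sign of the residual rather than the residual itself. The sketch below mirrors the previous one under the same illustrative settings; a full implementation would also re-estimate the leaf values (line search), which is omitted here.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit_l1(X, y, n_trees=50, max_depth=2, lr=0.1):
    """Gradient-version GBDT for absolute error: fit the negative gradient sign(y - F) each round."""
    trees, F = [], np.zeros_like(y, dtype=float)
    for _ in range(n_trees):
        neg_grad = np.sign(y - F)                      # -dL/dF for L = |y - F|
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, neg_grad)
        trees.append(tree)
        F += lr * tree.predict(X)
    return trees
```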

Summary: GBDT, also called MART (Multiple Additive Regression Trees), is an iterative decision tree algorithm. The model is composed of a number of decision trees, and the conclusions of all the trees are summed up to give the final answer. When it was first proposed, it was, together with SVM, regarded as an algorithm with strong generalization ability.

The trees in GBDT are regression trees (not classification trees); GBDT is used to make regression predictions, and with an appropriate loss it can also be used for classification.
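For example, binary classification is commonly handled by boosting regression trees on the gradient of the logistic loss; with scikit-learn this looks roughly like the following (the dataset and parameters are purely illustrative).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
# Internally the ensemble is built from regression trees fitted to the
# negative gradient of the log-loss; the summed scores are mapped to probabilities.
clf = GradientBoostingClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
clf.fit(X, y)
print(clf.score(X, y))
```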

III. Setting and significance of important parameters

Question: Why can XGBoost/GBDT reach very high precision with only a small tree depth when tuning parameters, while a random forest needs much deeper trees?

With XGBoost/GBDT, setting the maximum tree depth to 6 already gives very high precision, but with a decision tree or random forest you need to push the depth to 15 or higher. I can understand the depth that random forest requires compared with a single decision tree, because it is a combination of decision trees, equivalent to running a decision tree many times. But XGBoost/GBDT reach very high prediction accuracy using only the gradient boosting method with shallow trees, which surprised me enough to suspect black technology. How does XGBoost/GBDT do it? Are its tree nodes different from those of an ordinary decision tree?

Answer: Boosting mainly focuses on reducing bias, which is why boosting can build a strong ensemble out of learners with very weak generalization performance. Bagging mainly focuses on reducing variance, which is why it is most effective on learners that are sensitive to sample perturbation, such as unpruned decision trees and neural networks.

Random forests and GBDT both belong to the category of ensemble learning, which has two important strategies: bagging and boosting.

For the bagging algorithm, because we train many different classifiers in parallel, the goal is to reduce the variance: since the base classifiers are (approximately) independent, their averaged prediction h will naturally be close to its expectation. Then, for each individual classifier, the goal becomes reducing its bias, so we use very deep, even unpruned, decision trees.

For boosting, at each step we fit the original data more closely on the basis of the previous round, so the bias keeps decreasing and is taken care of by the procedure itself. Then, for each base classifier, the problem is how to choose one with smaller variance, i.e. a simpler classifier, so we choose shallow decision trees.
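A quick way to see this trade-off in practice with scikit-learn (the dataset, depths, and estimator counts below are illustrative choices, not values from the original question):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)

# Boosting reduces bias, so shallow (low-variance) trees are usually enough.
gbdt = GradientBoostingRegressor(max_depth=6, n_estimators=200, random_state=0)
# Bagging-style RF reduces variance, so each tree is grown deep (low bias).
rf = RandomForestRegressor(max_depth=None, n_estimators=200, random_state=0)

print("GBDT :", cross_val_score(gbdt, X, y, cv=3).mean())
print("RF   :", cross_val_score(rf, X, y, cv=3).mean())
```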
