If linear regression is the Toyota Camry of machine learning, then gradient boosting (GB) is the UH-60 Black Hawk helicopter. XGBoost, an implementation of gradient boosting, has become the winning general of Kaggle machine learning competitions. Unfortunately, many practitioners (myself included, until recently) use the algorithm only as a black box. The purpose of this article is to give an intuitive yet reasonably complete introduction to classical gradient boosting.
How gradient boosting works
Let's start with a simple example. We want to predict a person's age from three features: whether they like gardening, whether they play video games, and whether they like to wear hats. Our objective is to minimize the sum of squared errors, and the training set used to fit the model is as follows:
| PersonID | Age | Likes gardening | Plays video games | Likes hats |
|---|---|---|---|---|
| 1 | 13 | False | True | True |
| 2 | 14 | False | True | False |
| 3 | 15 | False | True | False |
| 4 | 25 | True | True | True |
| 5 | 35 | False | True | True |
| 6 | 49 | True | False | False |
| 7 | 68 | True | True | True |
| 8 | 71 | True | False | False |
| 9 | 73 | True | False | True |
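If you want to follow along, here is a minimal sketch that builds this toy dataset with pandas. The column names (PersonID, LikesGardening, etc.) are my own choice for illustration, not names from the original post:

```python
import pandas as pd

# Toy training set from the article: predict Age from three boolean features.
df = pd.DataFrame({
    "PersonID":        [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "Age":             [13, 14, 15, 25, 35, 49, 68, 71, 73],
    "LikesGardening":  [False, False, False, True, False, True, True, True, True],
    "PlaysVideoGames": [True, True, True, True, True, False, True, False, False],
    "LikesHats":       [True, False, False, True, True, False, True, False, True],
})
```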
We might have the following intuitions about the data:
- People who like gardening are probably older.
- People who like video games are probably younger.
- Liking hats probably tells us little about age.
We can quickly check these intuitions against the data.
| Feature | Ages where False | Ages where True |
|---|---|---|
| Likes gardening | 13, 14, 15, 35 | 25, 49, 68, 71, 73 |
| Plays video games | 49, 71, 73 | 13, 14, 15, 25, 35, 68 |
| Likes hats | 14, 15, 49, 71 | 13, 25, 35, 68, 73 |
Now let's try to model the data with a regression tree. To start, we require each leaf to contain at least three data points. Under this constraint the tree makes a single split, on the gardening feature, and then stops, since no further split can satisfy the constraint. The result looks like this:
The result is reasonable, but it ignores the video-games feature. Let's relax the constraint to at least two data points per leaf. The result:
This tree uses all three features, including the hat feature, which has little to do with age; the tree is probably overfitting.
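A quick way to reproduce these two trees is scikit-learn's DecisionTreeRegressor; this is a sketch, assuming the df built above, where min_samples_leaf plays the role of the leaf-size constraint:

```python
from sklearn.tree import DecisionTreeRegressor, export_text

X = df[["LikesGardening", "PlaysVideoGames", "LikesHats"]]
y = df["Age"]

# At least three samples per leaf: the tree splits once, on LikesGardening.
tree_a = DecisionTreeRegressor(min_samples_leaf=3).fit(X, y)
print(export_text(tree_a, feature_names=list(X.columns)))

# Relaxing the constraint to two samples per leaf lets the noisy hat feature in.
tree_b = DecisionTreeRegressor(min_samples_leaf=2).fit(X, y)
print(export_text(tree_b, feature_names=list(X.columns)))
```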
This example illustrates a weakness of a single decision/regression tree: it cannot combine two informative features whose effects overlap (for example, liking gardening and playing video games). Suppose we measure the training error of the first tree; the residuals are as follows:
| PersonID | Age | Tree1 prediction | Tree1 residual |
|---|---|---|---|
| 1 | 13 | 19.25 | -6.25 |
| 2 | 14 | 19.25 | -5.25 |
| 3 | 15 | 19.25 | -4.25 |
| 4 | 25 | 57.2 | -32.2 |
| 5 | 35 | 19.25 | 15.75 |
| 6 | 49 | 57.2 | -8.2 |
| 7 | 68 | 57.2 | 10.8 |
| 8 | 71 | 57.2 | 13.8 |
| 9 | 73 | 57.2 | 15.8 |
Now we can fit a second tree, Tree2, to the residuals of Tree1. The result:
Notice that Tree2 does not use the hat feature, even though the overfit tree above did. That is because Tree2 can evaluate each feature over the entire sample, whereas the earlier tree examined the hat feature only within small local regions, where it happened to look informative.
Now we add this second, "error-correcting" tree to the first one, and the results are as follows:
| PersonID | Age | Tree1 prediction | Tree1 residual | Tree2 prediction | Combined prediction | Final residual |
|---|---|---|---|---|---|---|
| 1 | 13 | 19.25 | -6.25 | -3.567 | 15.68 | 2.683 |
| 2 | 14 | 19.25 | -5.25 | -3.567 | 15.68 | 1.683 |
| 3 | 15 | 19.25 | -4.25 | -3.567 | 15.68 | 0.6833 |
| 4 | 25 | 57.2 | -32.2 | -3.567 | 53.63 | 28.63 |
| 5 | 35 | 19.25 | 15.75 | -3.567 | 15.68 | -19.32 |
| 6 | 49 | 57.2 | -8.2 | 7.133 | 64.33 | 15.33 |
| 7 | 68 | 57.2 | 10.8 | -3.567 | 53.63 | -14.37 |
| 8 | 71 | 57.2 | 13.8 | 7.133 | 64.33 | -6.667 |
| 9 | 73 | 57.2 | 15.8 | 7.133 | 64.33 | -8.667 |

| Tree1 SSE | Combined SSE |
|---|---|
| 1994 | 1765 |
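If you want to reproduce these numbers, here is a minimal sketch with two depth-1 scikit-learn stumps, continuing with the X and y defined earlier; the leaf values in the table above are shown rounded:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Tree1: a depth-1 stump trained on Age; its single split is on LikesGardening.
tree1 = DecisionTreeRegressor(max_depth=1).fit(X, y)
pred1 = tree1.predict(X)
residual1 = y - pred1

# Tree2: a stump trained on Tree1's residuals; it splits on PlaysVideoGames.
tree2 = DecisionTreeRegressor(max_depth=1).fit(X, residual1)
combined = pred1 + tree2.predict(X)

print("Tree1 SSE:   ", np.sum((y - pred1) ** 2))     # ~1994
print("Combined SSE:", np.sum((y - combined) ** 2))  # ~1765
```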
Gradient Boost Scheme 1
Inspired by the "error-correcting" tree above, we can define the following gradient boosting procedure:
- Fit a model to the data: F1(x) = y
- Fit another model to the residuals of the first: h1(x) = y - F1(x)
- Combine the two into a corrected model: F2(x) = F1(x) + h1(x)
It is natural to keep inserting models, each one correcting the errors of the models before it (ResNet can be seen as a similar idea):
F_m(x) = F_{m-1}(x) + h_{m-1}(x), where F_1(x) is the initial model.
Because the first step only initializes the model F_1(x), the task at every later step is to fit the residuals: h_m(x) = y - F_m(x).
Pause here and notice that we only said h_m is a "model", not that it has to be a tree-based model. This is one of the attractions of gradient boosting: we can plug in practically any model, which is to say that gradient boosting is really a framework for iteratively improving a weak model. In theory the weak model can be anything, but in practice it is almost always tree-based, so for now there is no harm in picturing h_m as a regression tree.
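As a sketch of Scheme 1 in code, here is the loop written in Python with regression-tree stumps as the weak model; the function names and hyperparameters (n_rounds, max_depth) are illustrative, not part of any library:

```python
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_rounds=10, max_depth=1):
    """Scheme 1: fit a first model to y, then repeatedly fit new trees to the residuals."""
    models = [DecisionTreeRegressor(max_depth=max_depth).fit(X, y)]   # F1
    current = models[0].predict(X)
    for _ in range(n_rounds - 1):
        residuals = y - current                                        # y - F_m(x)
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        models.append(h)
        current = current + h.predict(X)                               # F_{m+1} = F_m + h_m
    return models

def predict_boosted(models, X):
    """Sum the predictions of the first model and all residual-correcting models."""
    return sum(m.predict(X) for m in models)
```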
Gradient Boost Scheme 2
Let's now initialize the model the way most gradient boosting implementations do: with a model that outputs a single constant prediction. Because our current objective is to minimize the sum of squared errors, we can initialize F_0 to the mean of the training targets:
F_0(x) = mean of the y_i in the training sample
Now we can define the subsequent models recursively:
F_m(x) = F_{m-1}(x) + h_m(x), for m >= 1,
where h_m is a base model, such as a regression tree.
At this point you might wonder how to choose the hyperparameter M, in other words how many rounds of this residual-correction process to run. One approach is to try different values of M and pick the best one by cross-validation.
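Here is one way such a cross-validation over M might look with scikit-learn's GradientBoostingRegressor, reusing X and y from the sketches above; the candidate values of M are arbitrary:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

candidate_M = [10, 50, 100, 200]
scores = []
for m in candidate_M:
    model = GradientBoostingRegressor(n_estimators=m)
    # 3-fold cross-validated negative MSE; higher is better.
    score = cross_val_score(model, X, y, cv=3,
                            scoring="neg_mean_squared_error").mean()
    scores.append(score)

best_M = candidate_M[int(np.argmax(scores))]
print("Best number of boosting rounds:", best_M)
```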
Gradient Boost Scheme 3
So far our objective has been to minimize the squared error (L2 loss). But what if we want to minimize the absolute error (L1 loss) instead? An obvious idea is to change the objective function of the base model (the regression tree) itself, but this has several drawbacks:
- It can be computationally expensive on large datasets (every split would require finding medians).
- We would lose the property, noted above, that gradient boosting places no restrictions on the base model; we could only use base models that happen to support the new objective.
Let's look for a more elegant solution. Returning to our example, to minimize the absolute error we let F0 predict the median of the training targets, which is 35. Now we can compute the residuals of F0:
| PersonID | Age | F0 | Residual0 |
|---|---|---|---|
| 1 | 13 | 35 | -22 |
| 2 | 14 | 35 | -21 |
| 3 | 15 | 35 | -20 |
| 4 | 25 | 35 | -10 |
| 5 | 35 | 35 | 0 |
| 6 | 49 | 35 | 14 |
| 7 | 68 | 35 | 33 |
| 8 | 71 | 35 | 36 |
| 9 | 73 | 35 | 38 |
Look at the residuals of the first and fourth training samples: -22 and -10. Suppose we could move each prediction one unit closer to its true value. The squared error of sample 1 would then drop by 43 (22^2 - 21^2) and that of sample 4 by 19 (10^2 - 9^2), while the absolute error of each would drop by exactly 1. So a regression tree trained on squared error will concentrate on reducing the residual of the first sample, while a tree trained on absolute error treats the two samples equally. Now we train h0 to correct F0, but unlike before we fit h0 to the derivative of the loss function with respect to the predictions of F0. With absolute error, h0 only sees the sign of each residual of F0 (with squared error it also sees the magnitude). After the samples are partitioned into the leaves of h0, we can compute the average gradient in each leaf and scale it by a weight to update the model, so that the loss attributable to each leaf shrinks (in practice, the weights actually used for the leaves may differ).
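The contrast between the two losses is easiest to see in the pseudo-residuals (the negative gradients) that the next tree is trained on; a minimal sketch:

```python
import numpy as np

y_true = np.array([13, 14, 15, 25, 35, 49, 68, 71, 73], dtype=float)
F0 = np.full_like(y_true, 35.0)          # initial prediction: the median, as above

# Negative gradient of the squared error (up to a constant factor): the residual itself.
pseudo_residuals_l2 = y_true - F0

# Negative gradient of the absolute error: only the sign of the residual.
pseudo_residuals_l1 = np.sign(y_true - F0)

print(pseudo_residuals_l2)   # [-22. -21. -20. -10.   0.  14.  33.  36.  38.]
print(pseudo_residuals_l1)   # [-1. -1. -1. -1.  0.  1.  1.  1.  1.]
```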
Gradient descent
Now let's make these ideas more formal using gradient descent. Consider the following loss function:
L(f(x1), f(x2)) = (f(x1) - 15)^2 + (f(x2) - 25)^2
The goal is to find the pair (f(x1), f(x2)) that minimizes L. Notice that this loss is just the squared error for two data points whose true values are 15 and 25. We could find the minimum of this particular loss analytically, but gradient descent lets us optimize far more complex loss functions for which no analytic solution exists.
Initialization
- Number of iterations M = 100
- Starting prediction s_0
- Step size (learning rate) gamma
Iteration
For m = 1 to M:
1. Compute the gradient of L at the previous prediction s_{m-1}.
2. Move the prediction in the direction of steepest descent: s_m = s_{m-1} - gamma * grad L(s_{m-1}).
If the step size gamma is small enough and M is large enough, the final prediction s_M will converge to the minimizer of L.
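Assuming the two-point squared-error loss written above, here is a small sketch of this procedure; the starting prediction and step size are illustrative choices:

```python
import numpy as np

def loss(s):
    """L(f(x1), f(x2)) = (f(x1) - 15)^2 + (f(x2) - 25)^2."""
    return (s[0] - 15) ** 2 + (s[1] - 25) ** 2

def gradient(s):
    """Gradient of L with respect to the two predictions."""
    return np.array([2 * (s[0] - 15), 2 * (s[1] - 25)])

M = 100                     # number of iterations
step_size = 0.1             # learning rate (illustrative value)
s = np.array([0.0, 0.0])    # starting prediction (arbitrary)

for _ in range(M):
    s = s - step_size * gradient(s)   # step against the gradient

print(s)   # converges toward the minimizer (15, 25)
```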
Using gradient descent
Now let's apply gradient descent inside the gradient boosting model. Write the objective function as L and the initial model as F_0(x). At iteration m = 1 we compute the gradient of L with respect to the predictions of F_0(x) and fit a weak model to that gradient. With a regression tree as the weak model, samples with similar features land in the same leaf, and the average gradient within each leaf is computed; this average, suitably weighted, is used to update the model and obtain F_1. We repeat the process until we reach F_M.
Tidying up, the gradient boosting algorithm driven by gradient descent reads as follows:
Initialize the model with a constant: F_0(x) = argmin_gamma sum_i L(y_i, gamma)
For m = 1 to M:
1. Compute the pseudo-residuals, the negative gradient of the loss with respect to the current predictions: r_{im} = -[dL(y_i, F(x_i)) / dF(x_i)] evaluated at F = F_{m-1}.
2. Fit a base model h_m(x) to the pseudo-residuals.
3. Compute the step size gamma_m (with decision trees, a separate step size can be chosen for each leaf).
4. Update the model: F_m(x) = F_{m-1}(x) + gamma_m * h_m(x).
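Putting the pieces together, here is a compact from-scratch sketch of the algorithm for the squared-error loss, with the simplification that every step size gamma_m is fixed at 1 (real implementations optimize it, often per leaf):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class SimpleGradientBoosting:
    """Bare-bones gradient boosting for squared error with regression-tree base models."""

    def __init__(self, n_rounds=10, max_depth=1):
        self.n_rounds = n_rounds
        self.max_depth = max_depth

    def fit(self, X, y):
        self.f0 = np.mean(y)                     # F_0: the constant minimizing squared error
        current = np.full(len(y), self.f0)
        self.trees = []
        for _ in range(self.n_rounds):
            pseudo_residuals = y - current       # negative gradient of squared error (up to a factor)
            tree = DecisionTreeRegressor(max_depth=self.max_depth).fit(X, pseudo_residuals)
            self.trees.append(tree)
            current = current + tree.predict(X)  # F_m = F_{m-1} + h_m, with gamma_m fixed at 1
        return self

    def predict(self, X):
        prediction = np.full(X.shape[0], self.f0)
        for tree in self.trees:
            prediction = prediction + tree.predict(X)
        return prediction
```

On the toy dataset above, SimpleGradientBoosting(n_rounds=2, max_depth=1).fit(X, y) reproduces the F0, F1, and F2 columns of the L2 table below.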
To help you check your understanding of the gradient boosting algorithm, here are the results of running it on the example problem with the L2 and the L1 objective.
L2 loss function
| Age | F0 | PseudoResidual0 | h0 | gamma0 | F1 | PseudoResidual1 | h1 | gamma1 | F2 |
|---|---|---|---|---|---|---|---|---|---|
| 13 | 40.33 | -27.33 | -21.08 | 1 | 19.25 | -6.25 | -3.567 | 1 | 15.68 |
| 14 | 40.33 | -26.33 | -21.08 | 1 | 19.25 | -5.25 | -3.567 | 1 | 15.68 |
| 15 | 40.33 | -25.33 | -21.08 | 1 | 19.25 | -4.25 | -3.567 | 1 | 15.68 |
| 25 | 40.33 | -15.33 | 16.87 | 1 | 57.2 | -32.2 | -3.567 | 1 | 53.63 |
| 35 | 40.33 | -5.333 | -21.08 | 1 | 19.25 | 15.75 | -3.567 | 1 | 15.68 |
| 49 | 40.33 | 8.667 | 16.87 | 1 | 57.2 | -8.2 | 7.133 | 1 | 64.33 |
| 68 | 40.33 | 27.67 | 16.87 | 1 | 57.2 | 10.8 | -3.567 | 1 | 53.63 |
| 71 | 40.33 | 30.67 | 16.87 | 1 | 57.2 | 13.8 | 7.133 | 1 | 64.33 |
| 73 | 40.33 | 32.67 | 16.87 | 1 | 57.2 | 15.8 | 7.133 | 1 | 64.33 |
L1 loss function
| Age | F0 | PseudoResidual0 | h0 | gamma0 | F1 | PseudoResidual1 | h1 | gamma1 | F2 |
|---|---|---|---|---|---|---|---|---|---|
| 13 | 35 | -1 | -1 | 20.5 | 14.5 | -1 | -0.3333 | 0.75 | 14.25 |
| 14 | 35 | -1 | -1 | 20.5 | 14.5 | -1 | -0.3333 | 0.75 | 14.25 |
| 15 | 35 | -1 | -1 | 20.5 | 14.5 | 1 | -0.3333 | 0.75 | 14.25 |
| 25 | 35 | -1 | 0.6 | 55 | 68 | -1 | -0.3333 | 0.75 | 67.75 |
| 35 | 35 | -1 | -1 | 20.5 | 14.5 | 1 | -0.3333 | 0.75 | 14.25 |
| 49 | 35 | 1 | 0.6 | 55 | 68 | -1 | 0.3333 | 9 | 71 |
| 68 | 35 | 1 | 0.6 | 55 | 68 | -1 | -0.3333 | 0.75 | 67.75 |
| 71 | 35 | 1 | 0.6 | 55 | 68 | 1 | 0.3333 | 9 | 71 |
| 73 | 35 | 1 | 0.6 | 55 | 68 | 1 | 0.3333 | 9 | 71 |
Gradient Boost Scheme 4
Shrinkage: scale each model update by a learning rate between 0 and 1, which helps stabilize convergence.
Gradient Boost Scheme 5
Row sampling and column sampling. Subsampling helps because different subsamples lead to different tree splits, which means the ensemble as a whole extracts more information from the data.
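Both ideas appear as standard knobs in the popular libraries. For example, a sketch with xgboost's scikit-learn wrapper, reusing X and y from above; the parameter values are purely illustrative:

```python
import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=500,       # number of boosting rounds M
    learning_rate=0.05,     # shrinkage: scale each tree's contribution (Scheme 4)
    subsample=0.8,          # row sampling: fraction of rows used per tree (Scheme 5)
    colsample_bytree=0.8,   # column sampling: fraction of features used per tree (Scheme 5)
    max_depth=3,
)
model.fit(X, y)
```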
Gradient boosting in practice
Gradient boosting is extremely effective in practice. One of its most popular implementations, XGBoost, has a winning record in Kaggle competitions. XGBoost uses a number of tricks to speed up training and improve accuracy (most notably, second-order gradient information). LightGBM from Microsoft has also attracted a lot of attention recently.
What else can gradient boosting do? Besides regression, it can be used for classification and ranking, as long as the loss function is differentiable. For binary classification the logistic loss is commonly used; for multiclass classification, the softmax loss.
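As an illustration, here is a sketch of how the two classification cases are typically configured with xgboost's scikit-learn wrapper; the training data (X_train, y_train) and parameter values are placeholders:

```python
from xgboost import XGBClassifier

# Binary classification: boosted trees trained with the logistic loss.
binary_clf = XGBClassifier(objective="binary:logistic", n_estimators=100, learning_rate=0.1)

# Multiclass classification: boosted trees trained with the softmax loss
# (the number of classes is inferred from the labels when .fit is called).
multi_clf = XGBClassifier(objective="multi:softmax", n_estimators=100, learning_rate=0.1)

# binary_clf.fit(X_train, y_train)  # y_train in {0, 1}
# multi_clf.fit(X_train, y_train)   # y_train in {0, ..., K-1}
```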
Original link: http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/
A Kaggle Master Explains Gradient Boosting (translated)