A Kaggle Master Explains Gradient Boosting (translated)

Last Update:2018-10-21 Source: Internet

Author: User

Tags xgboost


If linear regression is a Toyota Camry, then gradient boosting (GB) is a UH-60 Black Hawk helicopter. XGBoost, one implementation of GB, is a consistent winner in Kaggle machine learning competitions. Unfortunately, many practitioners (my former self included) treat the algorithm as a black box. This article aims to explain the principles of classical gradient boosting intuitively and comprehensively.

Principle explanation

Let's start with a simple example. We want to predict a person's age from three features: whether they play video games, whether they enjoy gardening, and whether they like to wear hats. Our objective is to minimize the sum of squared errors, and the training set is as follows:

Id	Age	Like gardening	Play video games	Like to wear hats
1	13	False	True	True
2	14	False	True	False
3	15	False	True	False
4	25	True	True	True
5	35	False	True	True
6	49	True	False	False
7	68	True	True	True
8	71	True	False	False
9	73	True	False	True

We may have the following intuition about the data:

People who like gardening may be older.
People who like video games may be younger.
Whether someone likes wearing hats probably has little to do with age.

We can quickly check these intuitions against the data.

Feature	False	True
Like gardening	13, 14, 15, 35	25, 49, 68, 71, 73
Play video games	49, 71, 73	13, 14, 15, 25, 35, 68
Like to wear hats	14, 15, 49, 71	13, 25, 35, 68, 73
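As a minimal sketch of that quick check, the Python snippet below recomputes the mean age on each side of every feature. The variable and function names are illustrative and are not from the original article.

```python
from statistics import mean

# Training data copied from the table above (Id 1..9).
ages = [13, 14, 15, 25, 35, 49, 68, 71, 73]
features = {
    "Like gardening":    [False, False, False, True, False, True, True, True, True],
    "Play video games":  [True, True, True, True, True, False, True, False, False],
    "Like to wear hats": [True, False, False, True, True, False, True, False, True],
}

def mean_age_by_value(flags, ages):
    """Return (mean age where the flag is False, mean age where it is True)."""
    false_ages = [a for f, a in zip(flags, ages) if not f]
    true_ages = [a for f, a in zip(flags, ages) if f]
    return mean(false_ages), mean(true_ages)

for name, flags in features.items():
    mean_false, mean_true = mean_age_by_value(flags, ages)
    print(f"{name}: False={mean_false:.2f}, True={mean_true:.2f}")
```

Gardening and video games each separate the group means by several decades, while the hat feature separates them by only a few years.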

Now let's model the data with a regression tree. To start, we require every leaf node to contain at least three data points. Under that constraint the tree makes its first split on the gardening feature, and no further split satisfies the constraint, so the tree stops there. The result has two leaves: people who do not like gardening are predicted to be 19.25, and people who do are predicted to be 57.2 (the mean age in each group).

This is a reasonable result, but it ignores the video games feature. Let's relax the constraint so that a leaf needs only two data points.

The resulting tree uses all three features, including whether someone likes hats. That feature is unrelated to age, which suggests the tree is overfitting.
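The effect of the minimum-leaf-size constraint can be sketched in a few lines of Python. This is an illustrative toy splitter, assuming each split is chosen to minimize the sum of squared errors; `best_split` and the other names are hypothetical and are not taken from any library.

```python
ages = [13, 14, 15, 25, 35, 49, 68, 71, 73]
X = {
    "gardening": [False, False, False, True, False, True, True, True, True],
    "games":     [True, True, True, True, True, False, True, False, False],
    "hats":      [True, False, False, True, True, False, True, False, True],
}

def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows, min_leaf):
    """Return the feature whose split has the lowest SSE, or None if no
    split leaves at least `min_leaf` samples in both children."""
    best_name, best_err = None, float("inf")
    for name, flags in X.items():
        left = [ages[i] for i in rows if not flags[i]]
        right = [ages[i] for i in rows if flags[i]]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue  # this split would create a leaf that is too small
        err = sse(left) + sse(right)
        if err < best_err:
            best_name, best_err = name, err
    return best_name

all_rows = list(range(len(ages)))
no_garden = [i for i in all_rows if not X["gardening"][i]]
garden = [i for i in all_rows if X["gardening"][i]]

print(best_split(all_rows, 3))                          # root split
print(best_split(no_garden, 3), best_split(garden, 3))  # children, min leaf 3
print(best_split(no_garden, 2), best_split(garden, 2))  # children, min leaf 2
```

With a minimum of three samples per leaf, neither child of the root can be split again. With a minimum of two, both children split, and one of them splits on the hat feature.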

This example shows a drawback of a single decision/regression tree: it cannot add together the effects of two useful features whose regions overlap (for example, liking gardening and playing video games). Suppose we measure the training error of the first tree. The result is as follows:

Id	Age	Tree1 Prediction	Tree1 Residual
1	13	19.25	-6.25
2	14	19.25	-5.25
3	15	19.25	-4.25
4	25	57.2	-32.2
5	35	19.25	15.75
6	49	57.2	-8.2
7	68	57.2	10.8
8	71	57.2	13.8
9	73	57.2	15.8

Now we can fit a second tree, Tree2, to the residuals of Tree1. It splits on the video games feature: people who play video games get a correction of -3.567, and people who do not get a correction of 7.133.

Notice that this tree does not use the hat feature, even though the overfitted tree above did. Tree2 evaluates each feature against the entire training set, whereas the deeper splits of the overfitted tree only looked at a small local subset of samples, where noise can make an irrelevant feature look useful.

Now we can add this second "error correction" tree to the first one. The results are as follows:

PersonID	Age	Tree1 Prediction	Tree1 Residual	Tree2 Prediction	Combined Prediction	Final Residual
1	13	19.25	-6.25	-3.567	15.68	-2.683
2	14	19.25	-5.25	-3.567	15.68	-1.683
3	15	19.25	-4.25	-3.567	15.68	-0.6833
4	25	57.2	-32.2	-3.567	53.63	-28.63
5	35	19.25	15.75	-3.567	15.68	19.32
6	49	57.2	-8.2	7.133	64.33	-15.33
7	68	57.2	10.8	-3.567	53.63	14.37
8	71	57.2	13.8	7.133	64.33	6.667
9	73	57.2	15.8	7.133	64.33	8.667

Tree1 SSE	Combined SSE
1994	1765
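The two SSE figures can be reproduced with a short Python sketch. It assumes, as the prediction columns in the tables indicate, that Tree1 is a single split on gardening and Tree2 a single split on video games; `fit_stump` is an illustrative helper, not a library function.

```python
ages = [13, 14, 15, 25, 35, 49, 68, 71, 73]
gardening = [False, False, False, True, False, True, True, True, True]
games = [True, True, True, True, True, False, True, False, False]

def fit_stump(flags, targets):
    """One-split regression tree: each leaf predicts the mean of its targets."""
    leaves = {}
    for value in (False, True):
        members = [t for f, t in zip(flags, targets) if f == value]
        leaves[value] = sum(members) / len(members)
    return leaves

def sse(residuals):
    """Sum of squared errors."""
    return sum(r * r for r in residuals)

# Tree1 fits the ages directly.
tree1 = fit_stump(gardening, ages)
pred1 = [tree1[f] for f in gardening]
resid1 = [a - p for a, p in zip(ages, pred1)]

# Tree2 fits the residuals left by Tree1.
tree2 = fit_stump(games, resid1)
combined = [p + tree2[g] for p, g in zip(pred1, games)]
final_resid = [a - c for a, c in zip(ages, combined)]

print(round(sse(resid1)), round(sse(final_resid)))
```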

Gradient Boost Scheme 1

Inspired by the "error correction" tree above, we can define a first version of gradient boosting:

Fit a model to the data: $F_1(x) = y$
Fit another model to the residuals of the previous model: $h_1(x) = y - F_1(x)$
Combine the residual model with the original model: $F_2(x) = F_1(x) + h_1(x)$

It is natural to extend this by adding more models, each correcting the errors of the previous one (ResNet can be seen as a similar idea):

$F_M(x) = F_{M-1}(x) + h_{M-1}(x)$, where $F_1(x)$ is the initial model

Since the first step initializes the model $F_1(x)$, the task at every later step $m$ is to fit the residuals: $h_m(x) = y - F_m(x)$.

Let's pause here. We only said that $h_m$ is a "model"; nothing requires it to be tree-based. This is one of the strengths of gradient boosting: it is a framework for iteratively improving weak models, so any weak learner can be plugged in. In theory the weak model can be anything, but in practice it is almost always a tree, so from here on we take $h_m$ to be a regression tree.
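To illustrate that the weak model is pluggable, here is a minimal Python sketch of Scheme 1 in which every round may use a different kind of model. The `boost`, `fit_constant` and `stump_on` helpers are hypothetical names introduced for this example only.

```python
def boost(fitters, X, y):
    """Scheme 1: fit each weak model to the residuals left by the ones before it.

    `fitters` is a list of functions fit(X, targets) -> model, where a model
    is any callable that maps one sample to a number.
    """
    models = []

    def predict(x):
        return sum(model(x) for model in models)

    for fit in fitters:
        residuals = [yi - predict(xi) for xi, yi in zip(X, y)]
        models.append(fit(X, residuals))
    return predict

def fit_constant(X, targets):
    """Weak model A: always predict the mean target."""
    c = sum(targets) / len(targets)
    return lambda x: c

def stump_on(j):
    """Weak model B: a one-split tree on boolean feature j."""
    def fit(X, targets):
        leaf = {}
        for value in (False, True):
            group = [t for x, t in zip(X, targets) if x[j] == value]
            leaf[value] = sum(group) / len(group) if group else 0.0
        return lambda x: leaf[x[j]]
    return fit

# Columns: (like gardening, play video games)
X = [(False, True), (False, True), (False, True), (True, True), (False, True),
     (True, False), (True, True), (True, False), (True, False)]
y = [13, 14, 15, 25, 35, 49, 68, 71, 73]

model = boost([fit_constant, stump_on(0), stump_on(1)], X, y)
print([round(model(x), 2) for x in X])
```

The loop never inspects what kind of model it receives; it only needs something that can be fitted to residuals and then called on a sample.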

Gradient Boost Scheme 2

Now let's initialize the way most gradient boosting implementations do: the initial model outputs a single constant prediction. Because our task so far is to minimize the sum of squared errors, we initialize $F_0$ to the mean of the training targets:

$F_0(x) = \frac{1}{n} \sum_{i=1}^{n} y_i$

Now we can define the subsequent models recursively:

$F_{m+1}(x) = F_m(x) + h_m(x)$, for $m \geq 0$

where $h_m$ is a base learner, such as a regression tree.

At this point you might ask how to choose the hyperparameter $M$, that is, how many rounds of residual correction to run. One approach is to try different values of $M$ and pick the best one by cross-validation.
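As a rough sketch of that idea, the Python code below scores several candidate numbers of rounds with leave-one-out cross-validation on the nine training samples. Everything here is an assumption made for illustration: the helper names, the choice of leave-one-out, the candidate list, and the use of a best single-feature split as the weak learner. Nine samples are far too few for a meaningful model selection, so treat it as a demonstration of the mechanics only.

```python
def fit_best_stump(X, targets):
    """Pick the single boolean feature whose split best fits `targets`."""
    best = None
    for j in range(len(X[0])):
        groups = {False: [], True: []}
        for x, t in zip(X, targets):
            groups[x[j]].append(t)
        if not groups[False] or not groups[True]:
            continue  # feature j does not separate these samples
        leaf = {v: sum(g) / len(g) for v, g in groups.items()}
        err = sum((t - leaf[x[j]]) ** 2 for x, t in zip(X, targets))
        if best is None or err < best[0]:
            best = (err, j, leaf)
    if best is None:
        return lambda x: 0.0
    _, j, leaf = best
    return lambda x: leaf[x[j]]

def fit_boosted(X, y, M):
    """F0 is the mean of y, followed by M stumps fitted to the residuals."""
    f0 = sum(y) / len(y)
    stumps = []
    pred = [f0] * len(y)
    for _ in range(M):
        h = fit_best_stump(X, [yi - p for yi, p in zip(y, pred)])
        stumps.append(h)
        pred = [p + h(x) for p, x in zip(pred, X)]
    return lambda x: f0 + sum(s(x) for s in stumps)

def loo_cv_error(X, y, M):
    """Mean squared error over folds that each hold out one sample."""
    total = 0.0
    for i in range(len(y)):
        model = fit_boosted(X[:i] + X[i + 1:], y[:i] + y[i + 1:], M)
        total += (y[i] - model(X[i])) ** 2
    return total / len(y)

# Columns: (like gardening, play video games, like to wear hats)
X = [(False, True, True), (False, True, False), (False, True, False),
     (True, True, True), (False, True, True), (True, False, False),
     (True, True, True), (True, False, False), (True, False, True)]
y = [13, 14, 15, 25, 35, 49, 68, 71, 73]

candidates = [0, 1, 2, 3, 5]
errors = {M: loo_cv_error(X, y, M) for M in candidates}
best_M = min(errors, key=errors.get)
print(errors, best_M)
```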

Gradient Boost Scheme 3

So far our goal has been to minimize the sum of squared errors (L2), but what if we want to minimize the sum of absolute errors (L1)? The obvious approach is to change the objective function of the base learner (the regression tree), but this has several drawbacks:

For large data sets this can be computationally expensive (every candidate split requires finding a median)
We lose the property that GB places no restriction on the base learner, since we could only use base learners that support this objective function

Let's look for a more elegant solution. Returning to the example, we let $F_0$ predict the median of the training targets, which is 35, to minimize the sum of absolute errors. Now we can calculate the residuals of $F_0$:

PersonID	Age	F0	Residual0
1	13	35	-22
2	14	35	-21
3	15	35	-20
4	25	35	-10
5	35	35	0
6	49	35	14
7	68	35	33
8	71	35	36
9	73	35	38
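A quick Python check of why the median is the right constant for absolute error: the snippet recomputes the residual column and compares the total absolute error of the median against the mean and against every integer guess from 0 to 99. The function name is illustrative.

```python
from statistics import mean, median

ages = [13, 14, 15, 25, 35, 49, 68, 71, 73]

def sum_abs_error(c):
    """Total absolute error when every age is predicted as the constant c."""
    return sum(abs(a - c) for a in ages)

f0 = median(ages)
residual0 = [a - f0 for a in ages]

print(f0, residual0)
print(sum_abs_error(f0), round(sum_abs_error(mean(ages)), 2))
```

The median gives a total absolute error of 194, lower than the total obtained with the mean.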

Look at the residuals of the first and fourth training samples, -22 and -10. Suppose we could move each prediction 1 unit closer to its true value. The squared error of samples 1 and 4 would then fall by 43 and 19 respectively, while the absolute error of each would fall by 1. A regression tree trained on squared error will therefore focus mainly on reducing the residual of the first sample, whereas a tree trained on absolute error cares about both samples equally.

Now we train $h_0$ to fit the residuals of $F_0$, but unlike before, we train it on the derivative of the loss function evaluated at the predictions of $F_0$. With absolute error, $h_m$ only considers the sign of each residual of $F_m$ (with squared error it also considers the magnitude). After $h_m$ has assigned the samples to leaf nodes, we compute the average gradient in each leaf and scale it by a factor $\gamma$ to update the model: $F_{m+1}(x) = F_m(x) + \gamma h_m(x)$, with $\gamma$ chosen so that the loss of the samples in each leaf is reduced. In practice each leaf may get its own scaling factor.
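The arithmetic behind the 43, 19 and 1 figures can be checked with a few lines of Python. The helper names are illustrative.

```python
def squared_gain(residual, step=1):
    """Drop in squared error when the prediction moves `step` closer."""
    return residual ** 2 - (abs(residual) - step) ** 2

def absolute_gain(residual, step=1):
    """Drop in absolute error when the prediction moves `step` closer."""
    return abs(residual) - (abs(residual) - step)

def sign(v):
    """-1, 0 or 1; the only thing an absolute-error gradient sees."""
    return (v > 0) - (v < 0)

for r in (-22, -10):
    print(r, squared_gain(r), absolute_gain(r), sign(r))
```

Under squared error the first sample offers more than twice the gain of the fourth; under absolute error both offer the same gain, and both share the same sign.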

Gradient descent

Now let's use gradient descent (GD) to make these ideas more formal. Consider the following loss function:

$L = \frac{1}{2}(x_1 - 15)^2 + \frac{1}{2}(x_2 - 25)^2$

The goal is to find the pair $(x_1, x_2)$ that minimizes $L$. This loss can be read as the mean squared error of two predictions, $x_1$ and $x_2$, whose true values are 15 and 25 respectively. Although this particular minimum can be found analytically, GD lets us optimize more complex loss functions for which no analytic solution may exist.

Initialization

Total number of iterations: $M = 100$

Starting prediction: $s_0$

Step size: $\gamma$

For iteration $m = 1$ to $M$:
1. Calculate the gradient of $L$ at the previous prediction $s_{m-1}$
2. Move the prediction in the direction of steepest descent (the negative gradient), i.e. $s_m = s_{m-1} - \gamma \nabla L(s_{m-1})$

If the step size is small enough and $M$ is large enough, the final prediction $s_M$ converges to the minimizer of $L$.
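Here is a minimal Python sketch of these steps. It assumes a loss of the form $L = \frac{1}{2}(x_1 - 15)^2 + \frac{1}{2}(x_2 - 25)^2$, that is, squared error for two predictions whose true values are 15 and 25. The starting point (0, 0) and the step size 0.1 are values chosen for this illustration.

```python
def grad_L(s):
    """Gradient of L = 0.5*(x1 - 15)^2 + 0.5*(x2 - 25)^2 at the point s."""
    x1, x2 = s
    return (x1 - 15.0, x2 - 25.0)

def gradient_descent(s0, step, M):
    """Take M steps of size `step` along the negative gradient."""
    s = s0
    for _ in range(M):
        g = grad_L(s)
        s = (s[0] - step * g[0], s[1] - step * g[1])
    return s

# Assumed values: start at the origin with step size 0.1, run M = 100 iterations.
s_M = gradient_descent(s0=(0.0, 0.0), step=0.1, M=100)
print(s_M)
```

Each iteration shrinks the distance to the minimizer (15, 25) by a factor of 0.9, so after 100 iterations the prediction is very close to it.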

Using gradient descent

Now let's apply GD to the gradient boosting model. Denote the objective function by $L$ and the starting model by $F_0(x)$. At iteration $m = 1$ we compute the gradient of $L$ with respect to $F_0(x)$. We then fit a weak model to that gradient. Taking a regression tree as the example, samples with similar features land in the same leaf node, the average gradient of each leaf is computed, and that average (scaled by a step size) is used to update the model, giving $F_1$. We repeat this process until we reach $F_M$.

To summarize, the gradient boosting algorithm using GD is as follows:

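Initialize the model with a constant: $F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$

For $m = 1$ to $M$:

Calculate the pseudo residuals: $r_{i,m-1} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}$ for $i = 1, \ldots, n$

Fit a base learner $h_{m-1}(x)$ to the pseudo residuals

Calculate the step size $\gamma_{m-1}$ (for a decision tree, a separate step size can be assigned to each leaf node)

Update the model: $F_m(x) = F_{m-1}(x) + \gamma_{m-1} h_{m-1}(x)$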
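The loop above can be sketched in Python with the loss supplied from outside. This is a simplified illustration under stated assumptions: each round uses one fixed boolean split (so the "tree" has two leaves), and the caller passes in three hypothetical functions: `init` for the best constant, `neg_grad` for the pseudo residual, and `leaf_value` for the best shift of a leaf. The example call uses squared error.

```python
from statistics import mean

def gradient_boost(y, splits, init, neg_grad, leaf_value):
    """Gradient boosting with one fixed boolean split per round.

    y          : list of targets
    splits     : list of boolean feature columns, one per boosting round
    init       : function(y) -> constant that minimizes the loss
    neg_grad   : function(y_i, f_i) -> pseudo residual for one sample
    leaf_value : function(residuals) -> shift that minimizes the leaf's loss
    """
    f = [init(y)] * len(y)                               # initialize
    for flags in splits:
        r = [neg_grad(yi, fi) for yi, fi in zip(y, f)]   # pseudo residuals
        for value in (False, True):                      # the two leaves
            idx = [i for i, flag in enumerate(flags) if flag == value]
            if not idx:
                continue
            h = mean([r[i] for i in idx])                # fitted leaf output
            shift = leaf_value([y[i] - f[i] for i in idx])
            gamma = shift / h if h else 0.0              # per-leaf step size
            for i in idx:
                f[i] += gamma * h                        # update the model
    return f

ages = [13, 14, 15, 25, 35, 49, 68, 71, 73]
gardening = [False, False, False, True, False, True, True, True, True]
games = [True, True, True, True, True, False, True, False, False]

# Squared error: constant = mean, pseudo residual = y - f, leaf shift = mean.
f2 = gradient_boost(ages, [gardening, games],
                    init=mean,
                    neg_grad=lambda yi, fi: yi - fi,
                    leaf_value=mean)
print([round(v, 2) for v in f2])
```

For squared error the pseudo residual equals the ordinary residual, so every step size comes out as 1 and the procedure reduces to the residual-fitting scheme described earlier.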

To help you check your understanding of the gradient boosting algorithm, here are the results of applying it to the example problem with the L2 and L1 objective functions.

L2 loss function

Age	F0	Pseudo Residual0	H0	Gamma0	F1	Pseudo Residual1	H1	Gamma1	F2
13	40.33	-27.33	-21.08	1	19.25	-6.25	-3.567	1	15.68
14	40.33	-26.33	-21.08	1	19.25	-5.25	-3.567	1	15.68
15	40.33	-25.33	-21.08	1	19.25	-4.25	-3.567	1	15.68
25	40.33	-15.33	16.87	1	57.2	-32.2	-3.567	1	53.63
35	40.33	-5.333	-21.08	1	19.25	15.75	-3.567	1	15.68
49	40.33	8.667	16.87	1	57.2	-8.2	7.133	1	64.33
68	40.33	27.67	16.87	1	57.2	10.8	-3.567	1	53.63
71	40.33	30.67	16.87	1	57.2	13.8	7.133	1	64.33
73	40.33	32.67	16.87	1	57.2	15.8	7.133	1	64.33

L1 loss function

Age	F0	Pseudo Residual0	H0	Gamma0	F1	Pseudo Residual1	H1	Gamma1	F2
13	35	-1	-1	20.5	14.5	-1	-0.3333	0.75	14.25
14	35	-1	-1	20.5	14.5	-1	-0.3333	0.75	14.25
15	35	-1	-1	20.5	14.5	1	-0.3333	0.75	14.25
25	35	-1	0.6	55	68	-1	-0.3333	0.75	67.75
35	35	-1	-1	20.5	14.5	1	-0.3333	0.75	14.25
49	35	1	0.6	55	68	-1	0.3333	9	71
68	35	1	0.6	55	68	-1	-0.3333	0.75	67.75
71	35	1	0.6	55	68	1	0.3333	9	71
73	35	1	0.6	55	68	1	0.3333	9	71
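The L1 table can be reproduced with the short Python sketch below. It assumes the first round splits on gardening and the second on video games, which is what the H0 and H1 columns indicate, and it chooses each leaf's update as the median of the residuals in that leaf (the shift that minimizes absolute error). The function name is illustrative.

```python
from statistics import mean, median

ages = [13, 14, 15, 25, 35, 49, 68, 71, 73]
gardening = [False, False, False, True, False, True, True, True, True]
games = [True, True, True, True, True, False, True, False, False]

def l1_round(y, f, flags):
    """One boosting round for absolute error, splitting on `flags`.

    Returns the updated predictions and the step size of each leaf.
    """
    # The table reports -1 where the residual is exactly 0, so this sketch does too.
    pseudo = [1 if yi > fi else -1 for yi, fi in zip(y, f)]
    new_f, gammas = list(f), {}
    for value in (False, True):
        idx = [i for i, flag in enumerate(flags) if flag == value]
        h = mean([pseudo[i] for i in idx])           # leaf output of the tree
        shift = median([y[i] - f[i] for i in idx])   # best L1 shift for the leaf
        gammas[value] = shift / h if h else None     # undefined when h is 0
        for i in idx:
            new_f[i] = f[i] + shift                  # same as f + gamma * h
    return new_f, gammas

f0 = [median(ages)] * len(ages)
f1, gamma0 = l1_round(ages, f0, gardening)
f2, gamma1 = l1_round(ages, f1, games)

print(f1, gamma0)
print(f2, gamma1)
```

The step sizes differ from leaf to leaf because the tree only outputs an average sign, and each leaf needs its own scale to move the predictions to the leaf's median.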

Gradient Boost Scheme 4

Introduce shrinkage: multiply each gradient step by a learning rate between 0 and 1. Taking smaller steps helps stabilize convergence.
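A small Python sketch of shrinkage under squared error, assuming every round reuses the same split on the gardening feature and a constant learning rate of 0.1 (both chosen for illustration). Each round moves the predictions only a tenth of the way toward the leaf means.

```python
ages = [13, 14, 15, 25, 35, 49, 68, 71, 73]
gardening = [False, False, False, True, False, True, True, True, True]

def boost_with_shrinkage(y, flags, learning_rate, n_rounds):
    """Squared-error boosting where each step is scaled by `learning_rate`."""
    f = [sum(y) / len(y)] * len(y)          # start from the mean
    for _ in range(n_rounds):
        resid = [yi - fi for yi, fi in zip(y, f)]
        for value in (False, True):
            idx = [i for i, flag in enumerate(flags) if flag == value]
            h = sum(resid[i] for i in idx) / len(idx)   # leaf mean residual
            for i in idx:
                f[i] += learning_rate * h               # shrunken step
    return f

for rounds in (1, 10, 200):
    f = boost_with_shrinkage(ages, gardening, 0.1, rounds)
    print(rounds, round(f[0], 3), round(f[3], 3))
```

With more rounds the predictions approach the unshrunken leaf means of 19.25 and 57.2, but each individual step is small.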

Gradient Boost Scheme 5

Row sampling and column sampling. These sampling techniques are effective because different samples lead to different tree splits, which means the ensemble captures more information.
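As an illustration, the Python sketch below draws a different random subset of rows and columns for each boosting round. The function name, the sampling fractions and the seed are assumptions made for this example; real libraries expose the same idea through their own parameters.

```python
import random

def sample_rows_and_columns(n_rows, feature_names, row_frac, col_frac, rng):
    """Pick a random subset of row indices and feature names for one tree."""
    n_r = max(1, round(row_frac * n_rows))
    n_c = max(1, round(col_frac * len(feature_names)))
    rows = sorted(rng.sample(range(n_rows), n_r))
    cols = sorted(rng.sample(feature_names, n_c))
    return rows, cols

feature_names = ["gardening", "games", "hats"]
rng = random.Random(0)  # fixed seed so repeated runs pick the same subsets

for m in range(3):
    rows, cols = sample_rows_and_columns(9, feature_names, 0.67, 0.67, rng)
    print(f"tree {m}: rows={rows}, columns={cols}")
```

Each tree would then be fitted only on its own rows and columns, so successive trees see slightly different views of the data.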

Gradient boosting in practice

Gradient boosting is very effective in practice. One of the most popular implementations in Kaggle competitions is XGBoost, which uses a number of tricks to improve speed and accuracy (most notably second-order gradient descent). LightGBM from Microsoft has also attracted a lot of attention recently.

What else can gradient boosting do? Besides regression, it can be used for classification and ranking, as long as the loss function is differentiable. For classification, binary problems commonly use the logistic loss, and multiclass problems use the softmax loss.
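For binary classification, the only piece that changes is the pseudo residual. The Python sketch below assumes labels in {0, 1} and a model that outputs a raw score passed through a sigmoid; under that assumption the negative gradient of the logistic loss is the label minus the predicted probability. The function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_loss(y, f):
    """Negative log-likelihood for label y in {0, 1} and raw score f."""
    p = sigmoid(f)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def pseudo_residual(y, f):
    """Negative gradient of the logistic loss with respect to the score f."""
    return y - sigmoid(f)

# Compare the closed form with a finite-difference estimate of the gradient.
y, f, eps = 1, 0.3, 1e-6
numeric = -(logistic_loss(y, f + eps) - logistic_loss(y, f - eps)) / (2 * eps)
print(pseudo_residual(y, f), numeric)
```

A weak learner fitted to these pseudo residuals plays the same role as the regression trees fitted to age residuals above.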

Original link: http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
