One: The purpose of machine learning in the GBDT algorithm
GBDT is a supervised learning algorithm. A supervised learning algorithm must address the following two questions:
1. Make the loss function as small as possible, so that the learned objective function fits the samples well.
2. Use a regularization function to penalize the training result and avoid overfitting, so that the model remains accurate at prediction time.
What the GBDT algorithm ultimately needs to learn is a model whose loss function is as small as possible while effectively preventing overfitting.
Taking as an example a quantity that changes over time, the following figures illustrate the role of machine learning. Suppose the following is a sample of a quantity K changing over time:
If there is no effective regularization, the learning result will look like the following:
In this case, the learning result fits the sample very closely and the loss function is very small, but when new samples are predicted, the failure rate will be high because of overfitting.
If the loss function is left too large, the learning result looks like the following figure:
In this case, the learning result differs too much from the sample and the loss function is very large; because of this large error, the failure rate at prediction time will also be very high.
When the loss function and the regularization that prevents overfitting are balanced, the learning result looks like the following figure:
In this case, the loss function and the regularization function that prevents overfitting reach a balance, and predictions will be more accurate.
The training result of the GBDT algorithm is a decision forest. If the GBDT algorithm iterates n times during training, the forest contains n trees, and each tree contains several leaves, each of which corresponds to a specific score. What GBDT forest learning must ultimately determine is:
1. The score corresponding to each leaf
2. The structure of each decision tree
As an example, consider building a decision forest from samples to predict whether a person likes a computer game. As shown in the following figure, there are 5 samples. Assuming 2 iterations, the learning result consists of the following 2 trees:
To score whether a person likes the game: for the first sample, the boy, the first tree scores 2 points and the second tree scores 0.9 points, so his total score is 2.9 points; for the third sample, the grandpa, the first tree scores -1 and the second tree scores 0.9, so his total score is -0.1 points.
For the above example, the ultimate goal of machine learning is to learn the function F1 of the first tree, so that
F1(boy) = 2
and the function F2 of the second tree, so that
F2(boy) = 0.9
It must also learn the structure of the first tree: why the feature "age" is the first splitting feature, and why the split is made at age 15.
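To make this concrete, here is a minimal sketch of how a learned forest scores a sample by summing the leaf scores of its trees. The trees below are hand-built to match the example (the structure of the second tree and the feature names are hypothetical assumptions, not taken from the original figure):

```python
# Minimal sketch: score a sample with a hand-built 2-tree forest.
# Tree 1 follows the example (split on age < 15, then gender);
# tree 2's structure is a hypothetical assumption for illustration.

def tree1(sample):
    if sample["age"] < 15:
        return 2.0 if sample["is_male"] else 0.1
    return -1.0

def tree2(sample):
    # Hypothetical second tree: split on daily computer use.
    return 0.9 if sample["uses_computer_daily"] else -0.9

def forest_score(sample):
    # The forest's prediction is the sum of the leaf scores of all trees.
    return tree1(sample) + tree2(sample)

boy = {"age": 10, "is_male": True, "uses_computer_daily": True}
grandpa = {"age": 70, "is_male": True, "uses_computer_daily": True}

print(forest_score(boy))      # 2.0 + 0.9 = 2.9
print(forest_score(grandpa))  # -1.0 + 0.9 = -0.1
```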
Two: The principle of the GBDT algorithm
Assuming there are K trees, the score of sample i is:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$

With n samples, the objective function with K trees is:

$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
The iterative (additive training) process of the GBDT algorithm can be represented as follows:

$$\hat{y}_i^{(0)} = 0, \qquad \hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)$$

In the t-th round of iteration, what we need to determine is the t-th tree $f_t$. The objective function of the t-th round is:

$$Obj^{(t)} = \sum_{i=1}^{n} l\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) + \text{constant}$$

The only variable in this objective function is $f_t$. By optimizing the objective function of the t-th round, we determine $f_t$. The following mathematical derivation is the process of optimizing this objective function.
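A minimal sketch of this additive training loop, assuming a hypothetical fit_one_tree routine (not part of the original text) that builds the t-th tree against the current predictions; it is shown only to make the iteration structure explicit:

```python
import numpy as np

def train_gbdt(X, y, n_rounds, fit_one_tree):
    """Additive training: y_hat starts at 0 and each round adds one tree.

    fit_one_tree(X, y, y_hat) is a hypothetical routine that builds the
    t-th tree f_t by optimizing the round-t objective (derived below)
    and returns a callable: tree(X) -> per-sample leaf scores.
    """
    y_hat = np.zeros(len(y))               # y_hat^(0) = 0
    trees = []
    for t in range(n_rounds):
        tree = fit_one_tree(X, y, y_hat)   # determine f_t in round t
        y_hat += tree(X)                   # y_hat^(t) = y_hat^(t-1) + f_t(x)
        trees.append(tree)
    return trees
```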
Optimizing the objective function of the t-th round uses the second-order Taylor expansion:

$$f(x + \Delta x) \approx f(x) + f'(x)\,\Delta x + \frac{1}{2} f''(x)\,\Delta x^2$$

Expanded with this Taylor approximation, the objective function becomes:

$$Obj^{(t)} \approx \sum_{i=1}^{n} \Big[\, l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \,\Big] + \Omega(f_t) + \text{constant}$$

where

$$g_i = \partial_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}^{(t-1)}\big), \qquad h_i = \partial^2_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}^{(t-1)}\big)$$

represent the first and second derivatives of the loss function, respectively.
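For a concrete loss, these two derivatives have simple closed forms. A sketch for the squared-error loss l(y, ŷ) = (y − ŷ)² follows; other losses (e.g. logistic) only change these two functions:

```python
import numpy as np

def squared_error_grad_hess(y, y_hat_prev):
    # l(y, y_hat) = (y - y_hat)^2, differentiated w.r.t. y_hat^(t-1)
    g = 2.0 * (y_hat_prev - y)      # first derivative  g_i
    h = 2.0 * np.ones_like(y)       # second derivative h_i (constant here)
    return g, h
```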
In the t-th tree there is a mapping function q that maps a sample to a leaf node, so the tree can be written as:

$$f_t(x) = w_{q(x)}, \qquad w \in \mathbb{R}^T$$

where $w$ is the vector of leaf scores and $q(x)$ is the index of the leaf that sample x falls into.
To illustrate the effect of this mapping, consider the following example tree:
As highlighted in red in the figure, under this mapping the little boy is mapped to the first leaf, and the grandmother is mapped to the third leaf.
For the little boy in the figure above, $f_t(\text{boy}) = w_{q(\text{boy})} = w_1$.
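Expressed in code, a tree is just a leaf-score vector w plus the mapping q from a sample to a leaf index. This sketch matches the example tree (the split conditions restate the age-15 and gender splits described above; the exact leaf-2 score is an assumption):

```python
# Leaf scores of the example tree (0-based indexing): w1, w2, w3.
w = [2.0, 0.1, -1.0]

def q(sample):
    """Map a sample to a leaf index: the structure part of the tree."""
    if sample["age"] < 15:
        return 0 if sample["is_male"] else 1   # leaf 1 or leaf 2
    return 2                                   # leaf 3

def f_t(sample):
    # f_t(x) = w_{q(x)}
    return w[q(sample)]

print(f_t({"age": 10, "is_male": True}))   # boy -> leaf 1 -> 2.0
print(f_t({"age": 70, "is_male": True}))   # grandpa -> leaf 3 -> -1.0
```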
Here the regularization function is defined as:

$$\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$

where T indicates the number of leaves in the tree. For the example tree above, with 3 leaves, its regularization penalty is:

$$\Omega = 3\gamma + \frac{1}{2}\lambda\,(w_1^2 + w_2^2 + w_3^2)$$
The set of samples that fall into the j-th leaf is expressed as:

$$I_j = \{\, i \mid q(x_i) = j \,\}$$

Since every sample is mapped to exactly one leaf, the objective function can be converted from a sum over samples to a sum over leaves:

$$Obj^{(t)} \approx \sum_{j=1}^{T} \Big[ \Big(\sum_{i \in I_j} g_i\Big) w_j + \frac{1}{2} \Big(\sum_{i \in I_j} h_i + \lambda\Big) w_j^2 \Big] + \gamma T$$

The objective function has thus been transformed into a sum of T independent single-variable quadratic functions, one per leaf.
We introduce the following definitions:

$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i$$

The above objective function (a sum of single-variable quadratics) attains its minimum at:

$$w_j^* = -\frac{G_j}{H_j + \lambda}$$

and the minimum value is:

$$Obj^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$

You can see that the optimal value of the objective function is a sum over the T leaves of each leaf's gradient statistics, plus the complexity penalty $\gamma T$.
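Putting the last few formulas into code: given the per-sample derivatives and the leaf each sample falls into, the optimal leaf scores and objective value follow directly. A sketch (gamma and lam stand for the γ and λ parameters above):

```python
import numpy as np

def optimal_leaves(g, h, leaf_index, n_leaves, lam, gamma):
    """Compute w_j* = -G_j / (H_j + lambda) and the optimal objective.

    g, h          : per-sample first/second derivatives g_i, h_i
    leaf_index[i] : q(x_i), the (0-based) leaf that sample i falls into
    """
    G = np.zeros(n_leaves)
    H = np.zeros(n_leaves)
    for i, j in enumerate(leaf_index):   # G_j = sum of g_i, H_j = sum of h_i
        G[j] += g[i]
        H[j] += h[i]
    w = -G / (H + lam)                                    # optimal leaf scores
    obj = -0.5 * np.sum(G**2 / (H + lam)) + gamma * n_leaves
    return w, obj
```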
For the example tree above, the corresponding objective function result is shown in the following figure:
In the figure above, the third leaf node is marked in red. You can see that the third leaf contains a total of 3 samples, the 2nd, 3rd, and 5th, so the gradient statistics of these 3 samples ($G_3 = g_2 + g_3 + g_5$ and $H_3 = h_2 + h_3 + h_5$) are used to calculate the score of the third leaf.
When a decision tree is being built, the information gain from splitting a leaf node into a left part L and a right part R is:

$$Gain = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$

The optimal splitting point is the position where this information gain (Gain) is maximal.
To find the maximum-gain split of a feature, the samples are first sorted by the value of that feature; then all the samples are scanned in order, the gain of each candidate split point is computed, and the position of maximum gain is taken as the split point, as shown in the following illustration:
In the figure above, the samples are sorted by the feature "age" and scanned from small to large; the gain of each split point is computed, and finally the maximum-gain position is taken as the split point of the age feature.
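A minimal sketch of this sorted scan for a single feature, using the Gain formula above (feature is one column of the data, g and h are the per-sample derivatives; all names are illustrative):

```python
import numpy as np

def best_split_one_feature(feature, g, h, lam, gamma):
    """Sort samples by one feature, then scan all split points for max gain."""
    order = np.argsort(feature)
    g, h = g[order], h[order]
    G_total, H_total = g.sum(), h.sum()

    def score(G, H):                      # the G^2 / (H + lambda) term
        return G * G / (H + lam)

    best_gain, best_value = -np.inf, None
    G_L = H_L = 0.0
    for i in range(len(feature) - 1):     # split between positions i and i+1
        G_L += g[i]; H_L += h[i]
        G_R, H_R = G_total - G_L, H_total - H_L
        gain = 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                      - score(G_total, H_total)) - gamma
        if gain > best_gain:
            best_gain = gain
            best_value = (feature[order[i]] + feature[order[i + 1]]) / 2
    return best_gain, best_value
```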
But there are multiple features when building a decision tree, so which feature splits first?
The answer is that we traverse all the features, find the maximum-gain split point of each feature, and then take the feature whose best split point has the greatest gain among all features as the first split. This process is illustrated as follows:
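Finding the first split is then just a loop over all features, reusing the single-feature scan sketched above (again an illustrative sketch, not the original implementation):

```python
def best_split_all_features(X, g, h, lam, gamma):
    """Traverse all features; pick the one whose best split has max gain."""
    best = (-np.inf, None, None)          # (gain, feature index, split value)
    for j in range(X.shape[1]):
        gain, value = best_split_one_feature(X[:, j], g, h, lam, gamma)
        if gain > best[0]:
            best = (gain, j, value)
    return best
```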
Finally, using this greedy method, we repeat the above process and build a complete decision tree.
From the above splitting process we know that the purpose of each split is to obtain more information gain; if the information gain after a split is negative, splitting stops.