In the previous article, we discussed the first version of the GBDT algorithm, which is based on the idea of learning residuals. Today we look at the second version. This version is more complex and involves some derivation and a bit of matrix notation, but as we will see, the connection between the two versions is an important key to understanding the learning algorithm.
This blog post covers the gradient-based GBDT algorithm from the following angles:
(1) The basic steps of the algorithm;
(2) The mathematical derivation behind it;
(3) The link between the gradient-based version and the residual-based version;
Before explaining the detailed steps of the algorithm, let us first make one point clear: the gradient version of GBDT is implemented here with a multi-class (Multi-Class) classification approach. In other words, this version of GBDT is easier to master if you view it as embedded in a multi-class classification problem.
1. Basic steps of the algorithm
First, the following diagram of the algorithm appears in many blog posts online:

In other words, the gradient version of the GBDT algorithm has 8 main steps.
(0) Initialize the estimates of all samples for each of the K categories.
F_k(x) can be arranged as a matrix, which we can initialize to all zeros or set randomly, as follows:

In the matrix above, we assume there are n = 8 samples, each of which may fall into one of K = 5 categories; the estimate matrix F is initialized to all zeros.
In addition, each of these 8 training samples is labeled with its category, for example:

This says that sample i = 1 belongs to category 3.
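As a concrete illustration, here is a minimal NumPy sketch of this initialization. Only n = 8, K = 5 and "sample 1 belongs to category 3" come from the text; the remaining label values and all variable names are made up for the example.

```python
import numpy as np

n, K = 8, 5                              # 8 training samples, 5 possible categories

# Estimate matrix F: one row per sample, one column per category, initialized to all zeros.
F = np.zeros((n, K))

# Category labels (1..K). Only the first one, category 3 for sample i = 1, is from the text;
# the rest are made-up values for illustration.
labels = np.array([3, 1, 5, 2, 4, 3, 1, 2])

# One-hot "true probability" matrix Y: Y[i, k] = 1 if sample i belongs to category k+1, else 0.
Y = np.zeros((n, K))
Y[np.arange(n), labels - 1] = 1
```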
(1) Repeat the following learning and update process M times;
(2) Apply a logistic transformation to the estimates of all samples.
As we mentioned before in logistic regression, an important property of the logistic function is that it maps values to probabilities between 0 and 1. We can convert the estimates of a sample into the probabilities that the sample belongs to each category via the multi-class form of this transform:

$$p_k(x_i) = \frac{\exp(F_k(x_i))}{\sum_{l=1}^{K} \exp(F_l(x_i))}$$

You can see that, initially, every sample's estimate for every category is 0, so the probabilities of belonging to each category are all equal; as the estimates are updated in later iterations they become different, and so, naturally, do the probabilities.

For example:
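Here is a minimal Python sketch of this transform (the multi-class, row-wise form of the logistic function); the non-zero estimates in the second call are made-up numbers, not the original example.

```python
import numpy as np

def softmax_rows(F):
    """Row-wise multi-class logistic (softmax) transform:
    p_k(x_i) = exp(F[i, k]) / sum_l exp(F[i, l])."""
    E = np.exp(F - F.max(axis=1, keepdims=True))   # subtract the row max for numerical stability
    return E / E.sum(axis=1, keepdims=True)

print(softmax_rows(np.zeros((1, 5))))                        # all estimates 0 -> equal probabilities (0.2 each)
print(softmax_rows(np.array([[1.0, 0.0, 2.0, 0.0, 0.0]])))   # unequal estimates -> unequal probabilities
```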
(3) Loop over each of the K categories
Note that in this step we traverse the categories, not the samples.

Why do we learn category by category? Because later we need to learn one regression tree for each of the K categories.
(4) Compute the gradient of the probability of each sample on category k
From the previous step we have, for each sample, the estimated probability that it belongs to category k; we also know whether it truly belongs to category k (these are training samples, so the membership of each sample is known, and the true probability is either 0 or 1). Driving the estimated probabilities toward the true probabilities is therefore a typical regression problem, so we can of course use the regression tree algorithm (note: this is the key link between the multi-class classification problem and GBDT).
We learn in the usual way: set up a cost function and follow its gradient downhill. The cost function is the log-likelihood loss, of the form:

$$L(y, p) = -\sum_{k=1}^{K} y_k \log p_k(x)$$
Differentiating this cost function with respect to the estimate F_k, we get:

$$-\frac{\partial L}{\partial F_k(x_i)} = y_{ik} - p_k(x_i)$$

(The detailed derivation is given in the next section.)

Notice that the derivative obtained here has exactly the form of a residual: for sample i on category k, residual = true probability − estimated probability.
This step is also where the residual version and the gradient version connect.

These gradients are also the learning direction along which we build the regression tree below.
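Continuing the NumPy sketch from above (all variable names are my own), step (4) amounts to one subtraction:

```python
# Step (4): the gradient (residual) of every sample on every category,
# following the formula above: residual = true probability - estimated probability.
P = softmax_rows(F)      # step (2): current estimated probabilities, shape (n, K)
R = Y - P                # R[i, k] is the residual of sample i on category k
```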
(5) Learn a regression tree with J leaf nodes along the gradient direction.
The pseudo-code of this learning step:

We take all samples x_i, i = 1…n, as input, with each sample's probability residual on category k as the target (the update direction), and learn a regression tree with J leaves. The learning process is the same as for an ordinary regression tree: traverse the feature dimensions of the samples and choose the feature and split point by the minimum-variance principle, i.e. pick the split that minimizes the sum of squared deviations of the target values (residuals) in the left subtree plus that of the right subtree, or equivalently maximizes the reduction relative to the sum of squared deviations at the parent node. Once the tree has grown to J leaf nodes, we stop learning. As a result, each leaf of the regression tree ends up holding some of the samples (a code sketch of this step is given below).
Remember: the samples on each leaf carry their own estimated probabilities for category k as well as their true probabilities, because both are needed to compute the gain later.
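A minimal sketch of this step, assuming the variables from the earlier snippets and a hypothetical feature matrix X; it uses scikit-learn's DecisionTreeRegressor, whose squared-error splitting matches the variance-reduction rule described above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

J = 4                                    # number of leaves, a hyperparameter
k = 0                                    # one pass of the class loop from step (3)
X = np.random.rand(F.shape[0], 3)        # hypothetical feature matrix with 3 made-up features

# Step (5): fit a regression tree with at most J leaves to the residuals of category k.
tree = DecisionTreeRegressor(max_leaf_nodes=J)
tree.fit(X, R[:, k])

leaf_id = tree.apply(X)                  # the leaf index each training sample falls into
```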
(6) Compute the gain of each leaf node
The gain of each leaf node is computed as:

$$\gamma_{jk} = \frac{K-1}{K} \cdot \frac{\sum_{x_i \in R_{jk}} \tilde{y}_{ik}}{\sum_{x_i \in R_{jk}} |\tilde{y}_{ik}|\,(1 - |\tilde{y}_{ik}|)}$$

where $R_{jk}$ is the set of samples on leaf $j$ and $\tilde{y}_{ik} = y_{ik} - p_k(x_i)$ is the residual (gradient) of sample $i$ on category $k$.

Note that each leaf node j has a single gain value (a scalar, not a vector), and computing it requires the gradients of all the samples on that leaf.

In other words, every leaf node yields one gain value; remember these values!
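A minimal sketch of this computation using the variables from the previous snippets; the formula is the one written above (Friedman's multi-class gain), and the small epsilon is my own guard against a zero denominator.

```python
# Step (6): one gain value per leaf.
gamma = {}
for j in np.unique(leaf_id):
    r = R[leaf_id == j, k]                          # residuals of the samples on leaf j
    denom = np.sum(np.abs(r) * (1.0 - np.abs(r)))
    gamma[j] = (K - 1) / K * np.sum(r) / (denom + 1e-12)
```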
(7) Update the estimates of all samples for category k
The gain obtained in the previous step is computed from the gradients, and as noted above the gradients have the form of residuals, so we can use this gain to update the estimated values of the samples.

In iteration m, the estimate F_k of every sample under category k is obtained from that sample's estimate in iteration m − 1 plus a gain vector. This gain vector is the sum, over the J leaf nodes, of each leaf's gain value multiplied by the indicator vector of the samples on that leaf; in other words, each sample simply receives the gain of the leaf it falls into:

$$F_k^{(m)}(x) = F_k^{(m-1)}(x) + \sum_{j=1}^{J} \gamma_{jk}\,\mathbb{1}\!\left[x \in R_{jk}\right]$$
As we said above, the estimates of all samples for category k form one column of the estimate matrix:

That is, the update is done column by column. Because these steps sit inside the loop over the K categories, the estimates of every class (every column) get updated. Be sure to remember this: we update by column, building one regression tree per class (per column), and in the end the whole estimate matrix of n samples over K classes has been refreshed. With this new estimate matrix, we enter the next iteration, m + 1, of learning.
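In the running NumPy sketch, the column update for class k looks like this (the shrinkage/learning-rate factor is a common practical addition of my own, not part of the pseudo-code above):

```python
# Step (7): each sample's estimate for category k is increased by the gain of the leaf
# it falls into, scaled by an optional learning rate.
learning_rate = 0.1
F[:, k] += learning_rate * np.array([gamma[j] for j in leaf_id])
```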
So, after M iterations of learning, we obtain the final estimate matrix of all samples under all categories, and based on this matrix we can perform multi-class classification.
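For completeness, a one-line sketch of how classification can be read off the final estimate matrix (my own illustration):

```python
# For each sample, predict the category with the largest estimate
# (equivalently, the largest probability after the logistic/softmax transform).
predicted = np.argmax(F, axis=1) + 1     # +1 to report categories as 1..K, matching the text
```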
This completes all the detailed steps of the gradient-based version of the GBDT algorithm.
2. Formula derivation
The cost function established above is the log-likelihood loss:

$$L(y, F) = -\sum_{k=1}^{K} y_k \log p_k(x), \qquad p_k(x) = \frac{\exp(F_k(x))}{\sum_{l=1}^{K} \exp(F_l(x))}$$

Differentiating this cost function with respect to F_k, we get:

$$\frac{\partial L}{\partial F_k} = p_k - y_k, \qquad \text{i.e. the negative gradient is } y_k - p_k
$$
So what is the detailed derivation process?
The derivation mainly involves differentiating the logarithm and, in the last step, using the fact that y_k is the true probability that the sample belongs to category k: each y_k is either 0 or 1, and since a sample belongs to exactly one of the K categories, exactly one y_k equals 1 and the rest are 0. That is what gives the final result.
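For completeness, here is the derivation written out in LaTeX; the intermediate steps are my own filling-in of the argument sketched above.

```latex
% Cost function (log-likelihood loss) and the softmax probabilities:
L(y, F) = -\sum_{l=1}^{K} y_l \log p_l,
\qquad
p_l = \frac{e^{F_l}}{\sum_{s=1}^{K} e^{F_s}}

% Derivative of the softmax with respect to one estimate F_k:
\frac{\partial p_l}{\partial F_k} = p_l \left( \mathbb{1}[l = k] - p_k \right)

% Chain rule:
\frac{\partial L}{\partial F_k}
  = -\sum_{l=1}^{K} \frac{y_l}{p_l} \, \frac{\partial p_l}{\partial F_k}
  = -\sum_{l=1}^{K} y_l \left( \mathbb{1}[l = k] - p_k \right)
  = -y_k + p_k \sum_{l=1}^{K} y_l
  = p_k - y_k

% The last step uses the fact that exactly one y_l equals 1 and the rest are 0, so
% \sum_l y_l = 1. The negative gradient (the learning direction) is therefore y_k - p_k.
```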
3. Links between the two versions
Some of the connections mentioned earlier are summarized here:
- The residual-based version takes the residual as the overall direction of learning and leans toward regression applications. The gradient-based version takes the gradient direction of the cost function as the update direction, so its range of application is wider.
- If the log-likelihood (logistic) loss is used as the cost function, its gradient has the same form as the residual, which shows that the two versions are closely linked: the implementation ideas differ, but the overall goal is the same. Put another way, the residual version can be seen as a special case of the gradient version; when the cost function is replaced by some other function, the gradient version still applies.
Reference:
http://blog.csdn.net/w28971023/article/details/43704775
http://www.cnblogs.com/LeftNotEasy/archive/2011/03/07/1976562.html
http://blog.csdn.net/kunlong0909/article/details/17587101
http://blog.csdn.net/puqutogether/article/details/44752611