The previous article discussed weight-based boosting; this one covers another form of boosting: gradient boosting. The weight-based approach is represented by AdaBoost, where sample weights change from one iteration to the next depending on whether each sample was classified correctly. Gradient boosting has no concept of sample weights; instead, at each iteration it fits the residual between the output of the current model and the supervised target value of each sample, approaching the minimum point of the loss function step by step. Each new base model is built along the descent direction of the gradient of the previous model's loss.
Gradient boosting
Gradient boosting is also an additive model built with the forward stagewise algorithm. Note that there is no weight concept in gradient boosting, i.e., each base learner is combined with equal weight. The model can be expressed as:
\[ f_M(x) = \sum_{m=1}^{M} T(x;\theta_m) \]
Gradient boosting also uses the forward stagewise algorithm: first determine the initial model by defining the initial base learner $f_0(x) = c$; the model at the $m$-th step is:
\[ f_m(x) = f_{m-1}(x) + T(x;\theta_m) \]
The parameter $\theta_m$ is determined by minimizing the loss:
\[ \arg\min_{\theta_m} \sum_i L\bigl(y_i, f_{m-1}(x_i) + T(x_i;\theta_m)\bigr) \]
For example, with the mean squared loss $L(y, f(x)) = (y - f(x))^2$ in the forward stagewise step: at the $m$-th iteration, $f_{m-1}(x)$ is already known and we need to solve for $\theta_m$. For a sample $(x_i, y_i)$ we have:
\[
\begin{aligned}
L\bigl(y_i, f_{m-1}(x_i) + T(x_i;\theta_m)\bigr)
&= \bigl[y_i - f_{m-1}(x_i) - T(x_i;\theta_m)\bigr]^2 \\
&= \bigl[r_i - T(x_i;\theta_m)\bigr]^2
\end{aligned}
\]
Here $r_i = y_i - f_{m-1}(x_i)$ is called the residual, i.e., the gap between the current model $f_{m-1}(x_i)$ and the supervised target value. Letting the current base learner $T(x;\theta_m)$ fit this residual makes $f_m(x_i)$ closer to $y_i$. This is the idea of fitting residuals: in effect, the residual is simply used as the label for training the next model. Iterating in this way makes the residuals shrink step by step, and as the residuals shrink the model becomes more and more accurate. The goal of building the model is therefore to keep driving the loss function down, which is equivalent to solving for the minimum of the loss function, and the simplest method for minimizing a loss function is gradient descent. Recall gradient descent: for a loss function $L(\theta)$ of a parameter $\theta$:
\[ \theta^{new} = \theta^{old} - \frac{\partial L(\theta)}{\partial \theta} \]
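As a quick illustration (an added sketch, not part of the original derivation), this update rule can be applied to the toy loss $L(\theta) = (\theta - 3)^2$; the step size `lr` is an assumption here, since the formula above omits a learning rate:

```python
# Minimal gradient descent sketch for the toy loss L(theta) = (theta - 3)^2.
# The gradient is dL/dtheta = 2 * (theta - 3); the minimum is at theta = 3.
def gradient_descent(theta=0.0, lr=0.1, n_iters=100):
    for _ in range(n_iters):
        grad = 2 * (theta - 3)      # partial L / partial theta
        theta = theta - lr * grad   # theta_new = theta_old - lr * gradient
    return theta

print(gradient_descent())  # approaches 3.0
```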
Iterating this update eventually finds an optimal $\theta^*$ satisfying $\theta^* = \arg\min_{\theta} L(\theta)$. Back to gradient boosting: for a sample $(x_i, y_i)$ we are given a loss function $L(y, f(x))$ that measures the gap between the current model's prediction and the true value (the smaller, the better). Treating $L(y_i, f(x_i))$ as a function of $f(x_i)$, moving $f(x_i)$ along the negative gradient direction makes the loss smaller and smaller. For the $m$-th iteration:
\[ f_m(x_i) = f_{m-1}(x_i) - \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x_i) = f_{m-1}(x_i)} \]
Clearly $-\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}$ is the approximate residual that the base learner $T(x;\theta_m)$ must fit between two iterations; fitting this residual moves the model a further step along the direction of gradient descent, and iterating in this way yields the model $f_M(x)$ that minimizes the loss.
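To make the link between the negative gradient and the residual concrete (a small added derivation, with a factor of $\frac{1}{2}$ introduced only to keep the constant clean), take the squared loss used above:
\[
L(y_i, f(x_i)) = \tfrac{1}{2}\bigl(y_i - f(x_i)\bigr)^2
\quad\Longrightarrow\quad
-\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} = y_i - f(x_i),
\]
so for the squared loss the negative gradient evaluated at $f_{m-1}(x_i)$ is exactly the residual $r_i = y_i - f_{m-1}(x_i)$; for other losses it is only an approximate residual.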
Here is the gradient boosting algorithm:
Input: training set $\left\{(x_i, y_i)\right\}^n_{i=1}$, loss function $L(y, f(x))$.
1. Initialize the model with a constant $c $:
\[ f_0(x) = \arg\min_c \sum_i L(y_i, c) \]
2. For $m = 1, \dots, M$ do:
Calculate approximate residuals:
\[ r_{mi} = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x_i) = f_{m-1}(x_i)}, \qquad i = 1, \dots, n \]
Fit the base learner $T(x;\theta_m)$ to the approximate residuals, i.e., its training set is $\left\{(x_i, r_{mi})\right\}^n_{i=1}$
Calculate the weight of the base learner $\gamma_m$:
\[ \gamma_m = \arg\min_{\gamma} \sum_i L\bigl(y_i, f_{m-1}(x_i) + \gamma\, T(x_i;\theta_m)\bigr) \]
Update the model: $f_m(x) = f_{m-1}(x) + \gamma_m T(x;\theta_m)$
3. Output the final model $f_M(x)$.
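The following Python sketch is one minimal way to implement the algorithm above for the squared loss, using shallow scikit-learn regression trees as the base learners $T(x;\theta_m)$; the class name `SimpleGradientBoosting`, the tree depth, and the closed-form line search for $\gamma_m$ are illustrative assumptions, not a reference implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class SimpleGradientBoosting:
    """Minimal gradient boosting sketch for the squared loss L(y, f) = (y - f)^2."""

    def __init__(self, n_estimators=100, max_depth=3):
        self.n_estimators = n_estimators   # M: number of boosting iterations
        self.max_depth = max_depth         # depth of each base tree T(x; theta_m)
        self.trees, self.gammas = [], []

    def fit(self, X, y):
        # Step 1: initialize with the constant c minimizing sum_i L(y_i, c);
        # for the squared loss this is the mean of y.
        self.f0 = y.mean()
        f = np.full(len(y), self.f0)
        for _ in range(self.n_estimators):
            # Step 2: approximate residuals = negative gradient of the squared loss.
            r = y - f
            # Fit a base learner to the residual training set {(x_i, r_mi)}.
            tree = DecisionTreeRegressor(max_depth=self.max_depth).fit(X, r)
            h = tree.predict(X)
            # Line search for gamma_m (closed form for the squared loss).
            gamma = (r @ h) / (h @ h + 1e-12)
            # Update the model: f_m(x) = f_{m-1}(x) + gamma_m * T(x; theta_m).
            f += gamma * h
            self.trees.append(tree)
            self.gammas.append(gamma)
        return self

    def predict(self, X):
        f = np.full(X.shape[0], self.f0)
        for gamma, tree in zip(self.gammas, self.trees):
            f += gamma * tree.predict(X)
        return f
```

Usage would look like `SimpleGradientBoosting(n_estimators=200, max_depth=2).fit(X_train, y_train).predict(X_test)` on a regression dataset; note that the closed-form $\gamma_m$ above is specific to the squared loss, and for other losses the line search would be done numerically.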
It is also necessary to pay attention to overfitting, i.e., the trade-off between bias and variance: if bias is driven down too far, variance may become so large that the model loses its generalization ability. Gradient boosting has two ways to avoid overfitting:
1) Control the number of iterations $M$. Although a larger $M$ reduces bias, it increases the risk of overfitting; $M$ can be selected by cross-validation (see the sketch after this list).
2) Shrinkage. Shrinkage multiplies the update by a parameter $v$, with $0 < v < 1$, when updating the model at each iteration, namely:
\[ f_m(x) = f_{m-1}(x) + v\,\gamma_m T(x;\theta_m) \]
In practice, a very small shrinkage parameter $v$, e.g. $v < 0.1$, gives much better generalization performance than $v = 1$, but a small $v$ requires a larger number of iterations $M$.
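As a usage sketch of both controls, scikit-learn's `GradientBoostingRegressor` exposes $M$ as `n_estimators` and the shrinkage $v$ as `learning_rate`, and both can be chosen by cross-validation; the dataset and parameter grid below are illustrative assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data just for illustration.
X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

# learning_rate is the shrinkage v (0 < v < 1); n_estimators is M.
param_grid = {"n_estimators": [100, 300, 500], "learning_rate": [0.01, 0.05, 0.1]}
search = GridSearchCV(GradientBoostingRegressor(max_depth=3), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```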
Gradient boosting decision tree (GBDT) is simply gradient boosting with a decision tree as the base learner, commonly CART; it can be used for classification as well as regression. Here is the CART-based GBDT algorithm: