XGBoost Source Reading Notes (2) -- Tree Construction with the Exact Greedy Algorithm


The previous article, "XGBoost Source Reading Notes (1) -- Code Logic Structure", introduced the logical structure of the XGBoost source code and briefly covered the basics of XGBoost. This article continues by explaining how the XGBoost source code constructs a regression tree. Before analyzing the source, however, it is worth deriving the XGBoost objective function together. The formula screenshots in this derivation are mainly excerpted from Tianqi Chen's paper "XGBoost: A Scalable Tree Boosting System". The subsequent source analysis omits parts unrelated to this article, such as parallelization and multithreading.

I. Optimization of the Objective Function

One of the differences between XGBoost and earlier GBT (Gradient Boosted Tree) implementations is that the objective function is expanded with a second-order Taylor series, and the second derivative is used to speed up model convergence during training. At the same time, to prevent the model from overfitting, a penalty that controls the model structure is added to the objective function.

Figure 1-1 Objective function
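The formula screenshot is not reproduced here; for reference, the objective in the paper has the form

$$ \mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2 $$

where l is the loss on each instance, T is the number of leaves of a tree f, and w is its vector of leaf weights.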

The objective function is composed of two main parts. The first part is the prediction error of the model, and the second part is a penalty on the model structure.

The more leaves a tree has and the larger its leaf weights, the larger the objective function becomes; the same holds when the model's prediction error increases. Our optimization goal is to make the objective function as small as possible, which reduces the prediction error while also keeping the number of leaves and the leaf weights small. This is in line with the "Occam's Razor" principle in machine learning: choose the simplest hypothesis that is consistent with the empirical observations.

Because the structural penalty takes functions as parameters, the objective function of Figure 1-1 cannot be optimized with traditional methods, so it is rewritten in the following form:


Figure 1-2 Changing the form of the objective function
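For reference, the additive (per-iteration) form of the objective from the paper is

$$ \mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) $$

where f_t is the weak learner (tree) added at iteration t.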

The difference between Figure 1-2 and Figure 1-1 is that Figure 1-1 optimizes over the whole model at once, while the goal in Figure 1-2 is to construct, in each iteration, the weak learner that minimizes the objective function; from this we can see that Figure 1-2 uses a greedy, additive strategy. The prediction error term in Figure 1-2 is then Taylor expanded:

Figure 1-3 Taylor expansion
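For reference, the second-order Taylor expansion from the paper is

$$ \mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) $$

where $g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ and $h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ are the first and second derivatives of the loss.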

and the constant term is omitted:


Figure 1-4 Omitting constant entries
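That is,

$$ \tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) $$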

Figure 1-4 is the simplified objective function for each iteration. Our goal is to obtain the weak learner that minimizes this objective at iteration t. Here the summation bound n is the number of sample instances. To make the later derivation more convenient, define a new variable that collects all sample instances falling into leaf j:


Figure 1-5 a new variable
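In the notation of the paper, this set is

$$ I_j = \left\{ i \mid q(x_i) = j \right\} $$

where q is the tree structure that maps an instance to its leaf index.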

At the same time, the structural penalty in the objective function is expanded, and the objective can be rewritten per leaf:


Figure 1-6 Objective function written per leaf
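For reference, the per-leaf form from the paper is

$$ \tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T $$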

Here the function f assigns each instance to its corresponding leaf and returns the weight w of that leaf. Differentiating Figure 1-6 with respect to the leaf weight w and setting the derivative to zero gives the optimal leaf weight:


Figure 1-7 The optimal leaf weight
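That is,

$$ w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda} $$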

Substituting this weight back into the objective function and omitting the constants yields the analytic form of the objective:


Figure 1-8 Analytic formula of the objective function
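That is,

$$ \tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left( \sum_{i \in I_j} g_i \right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T $$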

Our goal is to minimize this analytic form of the objective function. Its calculation is illustrated clearly in Figure 1-9.


Figure 1-9 Analytic calculation process of objective function

From Figure 1-9 we can clearly see how the analytic form of the objective function is computed. Its value can be used to evaluate the quality of the model. Thus, during model training, whether a leaf node needs to continue splitting is determined mainly by the split gain loss_change.


Figure 1-10 Split gain
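For reference, the split gain formula from the paper is

$$ \mathcal{L}_{split} = \frac{1}{2} \left[ \frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma $$

where I_L and I_R are the instance sets of the left and right children and I = I_L ∪ I_R is that of the parent.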

The formula for the gain loss_change is shown in Figure 1-10: it is the gain of the left child after the split, plus the gain of the right child, minus the gain of the parent node. The split point chosen is the one with the largest loss_change. Finding the best split point is a very time-consuming process; the previous article, "XGBoost Source Reading Notes (1) -- Code Logic Structure", introduced several split algorithms used by XGBoost. Here we choose the simplest one, the exact greedy algorithm, to explain:


Figure 1-11 Exact greedy algorithm

The main idea of the algorithm in Figure 1-11 is to traverse every feature and, within each feature, try every value as the split point and compute its gain loss_change. After all features have been traversed, the feature value with the largest gain becomes the split point. This is clearly an exhaustive algorithm, and finding the optimal split point is the most time-consuming part of constructing the whole tree. But because the algorithm is simple and easy to understand, it is used here to introduce how the XGBoost source constructs a tree.
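To make the scan concrete, here is a minimal standalone sketch of an exact greedy scan over a single pre-sorted feature column. It is not the actual xgboost code: the types GradPair and Entry and the function ScanFeature are hypothetical, and the real implementation additionally handles duplicate feature values, missing values and both scan directions.

#include <cmath>
#include <utility>
#include <vector>

// Hypothetical helper types for illustration only.
struct GradPair { double g; double h; };        // gradient and hessian of one instance
struct Entry    { double fvalue; int rindex; }; // one (feature value, instance index) pair

// Node gain as in Figure 1-8 (the constant 1/2 factor is dropped).
double CalcGainSketch(double sum_grad, double sum_hess, double lambda) {
  return (sum_grad * sum_grad) / (sum_hess + lambda);
}

// Exact greedy scan over one feature column sorted by feature value:
// try every adjacent pair of values as a split threshold and keep the best.
std::pair<double, double> ScanFeature(const std::vector<Entry>& column,
                                      const std::vector<GradPair>& gpair,
                                      double lambda, double gamma) {
  double total_g = 0.0, total_h = 0.0;
  for (const Entry& e : column) {
    total_g += gpair[e.rindex].g;
    total_h += gpair[e.rindex].h;
  }
  const double parent_gain = CalcGainSketch(total_g, total_h, lambda);

  double best_change = 0.0;
  double best_threshold = NAN;
  double left_g = 0.0, left_h = 0.0;
  for (size_t i = 0; i + 1 < column.size(); ++i) {
    // Instances 0..i go to the left child, the rest to the right child.
    left_g += gpair[column[i].rindex].g;
    left_h += gpair[column[i].rindex].h;
    const double right_g = total_g - left_g;
    const double right_h = total_h - left_h;
    // loss_change = gain(left) + gain(right) - gain(parent) - gamma (Figure 1-10).
    const double change = CalcGainSketch(left_g, left_h, lambda) +
                          CalcGainSketch(right_g, right_h, lambda) -
                          parent_gain - gamma;
    if (change > best_change) {
      best_change = change;
      // Place the threshold midway between the two adjacent feature values.
      best_threshold = 0.5 * (column[i].fvalue + column[i + 1].fvalue);
    }
  }
  return std::make_pair(best_change, best_threshold);
}

Running this scan for every feature and keeping the overall best (feature, threshold) pair is exactly the exhaustive search described above.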

It does not matter if the derivation above is hard to follow; the main things to remember are the formulas for the gain and the weight of each node. The gain is used to decide whether the current node should continue splitting, and the linear combination of the leaf weights is the final output value of the model. Remembering these two formulas is enough to follow the source code.

II. Source Code Analysis

1) Code Logic Structure Review

The previous article ended with the following final call chain in the source code:

gbtree.cc
|--GBTree::DoBoost()
  |--GBTree::BoostNewTrees()
    |--GBTree::InitUpdater()
    |--TreeUpdater::Update()

Here the simplified source code is as follows:

gbtree.cc line:452
void BoostNewTrees(const std::vector<bst_gpair>& gpair,
                   DMatrix* p_fmat,
                   int bst_group,
                   std::vector<std::unique_ptr<RegTree> >* ret) {
  this->InitUpdater();
  std::vector<RegTree*> new_trees;
  for (auto& up : updaters) {
    up->Update(gpair, p_fmat, new_trees);
  }
}

gpair is a vector that holds the first and second derivatives (gradient and Hessian) of each sample instance. p_fmat is a pointer to the feature matrix of the sample instances, and new_trees is used to store the constructed regression trees.

InitUpdater() initializes updaters. As mentioned in the previous article, updaters holds pointer objects of the abstract class TreeUpdater, which defines the basic Init and Update interfaces; its derived classes define a series of tree construction and pruning methods. Here we mainly introduce the derived class ColMaker, which uses exactly the exact greedy algorithm described earlier.
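For orientation, the TreeUpdater interface looks roughly like the sketch below (simplified; member names and exact signatures vary between xgboost versions):

// tree_updater.h (simplified sketch)
class TreeUpdater {
 public:
  virtual ~TreeUpdater() {}
  // Initialize the updater from name-value parameter pairs.
  virtual void Init(const std::vector<std::pair<std::string, std::string> >& args) = 0;
  // Perform one boosting update: grow or prune the given trees using the
  // gradient statistics in gpair and the feature matrix p_fmat.
  virtual void Update(const std::vector<bst_gpair>& gpair,
                      DMatrix* p_fmat,
                      const std::vector<RegTree*>& trees) = 0;
};

ColMaker, analyzed below, is one such derived class.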

2) Class ColMaker Data Structure Introduction

In class ColMaker, some data structures are defined to assist tree construction.

updater_colmaker.cc line:755
const TrainParam& param;        // training parameters, i.e. the hyper-parameters we set
std::vector<int> position;      // index of the tree node each sample instance is currently assigned to
std::vector<NodeEntry> snode;   // statistics of the nodes in the regression tree
std::vector<int> qexpand_;      // indices of the nodes that are candidates for further splitting

XGBoost builds its trees in a way similar to BFS (breadth-first search), constructing the tree nodes layer by layer. A queue qexpand_ is therefore needed to hold the nodes of the current layer, and each of these nodes decides whether to split into nodes of the next layer based on its gain loss_change.

3) Class ColMaker Tree Construction Source

updater_colmaker.cc line:29
void ColMaker::Update(...)
{
  for (size_t i = 0; i < trees.size(); ++i) {
    Builder builder(param);
    builder.Update(gpair, dmat, trees[i]);
  }
}

A class Builder is defined inside class ColMaker, and the whole construction process is carried out by this class.

updater_colmaker.cc line:89
void ColMaker::Builder::Update(...)
{
  // Initialize the Builder parameters
  this->InitData(...);
  // Initialize the weight and gain of the tree root node
  this->InitNewNode(gpair, *p_fmat, *p_tree);
  for (int depth = 0; depth < param.max_depth; ++depth)
  {
    // Find the split feature for the nodes in the queue (the current layer)
    // and construct the next layer of the tree
    this->FindSplit(depth, qexpand_, gpair, p_fmat, p_tree);
    // Assign the sample instances in the non-leaf nodes of this layer
    // to the nodes of the next layer
    this->ResetPosition(...);
    // Update the queue so that it stores the nodes of the next layer
    this->UpdateQueueExpand(...);
    // Compute the weight and gain of the next-layer nodes in the queue
    this->InitNewNode(...);
    // If there are no candidate split nodes left in the queue, exit the loop
    if (qexpand_.size() == 0) break;
  }
  // Because of the tree depth limit, the remaining nodes in the queue
  // are turned into leaves of the tree
  for (size_t i = 0; i < qexpand_.size(); ++i)
  {
    ...
  }
  // Record some auxiliary statistics for the constructed regression tree
  ...
}

The core of the above code is the four functions called in the first loop. Let's start by looking at how Builder::InitNewNode() initializes the gain and weight of a node.

1. Builder::InitNewNode()

updater_colmaker.cc
|--Builder::InitNewNode()
  |--for (size_t j = 0; j < qexpand_.size(); ++j)
  |--{
  |--  snode[qexpand_[j]].root_gain = CalcGain(...)
  |--  snode[qexpand_[j]].weight    = CalcWeight(...)
  |--}

The root_gain here is the node gain mentioned earlier and will be used to decide whether the node needs to be split. weight is the weight of the current node, and the final model output is a linear combination of the leaf node weights. CalcGain() and CalcWeight() are two template functions; their simplified source code is as follows:

param.h line:242
template <typename TrainingParams, typename T>
T CalcGain(const TrainingParams& p, T sum_grad, T sum_hess)
{
  return (sum_grad * sum_grad) / (sum_hess + p.reg_lambda);
}

param.h line:275
template <typename TrainingParams, typename T>
T CalcWeight(const TrainingParams& p, T sum_grad, T sum_hess)
{
  return -sum_grad / (sum_hess + p.reg_lambda);
}

The two functions above implement the two formulas derived in Section I, i.e. they compute the node gain and weight. After the nodes in the queue are initialized, the best split feature must be found for each node in the queue.
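In the notation of Section I, for a node whose instance set I has summed statistics G = Σ_{i∈I} g_i and H = Σ_{i∈I} h_i, these functions compute

$$ \mathrm{gain} = \frac{G^2}{H + \lambda}, \qquad w = -\frac{G}{H + \lambda} $$

The constant factor 1/2 and the γ penalty from Figures 1-8 and 1-10 are omitted here; they are applied separately and do not change which candidate split is best.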

2. xgboost::Builder::FindSplit()

updater_colmaker.cc
|--xgboost::Builder::FindSplit()
  |--// search for the best split value of each feature
  |--for (size_t i = 0; i < feature_num; i++)
  |--{
  |--  xgboost::Builder::UpdateSolution()
  |     |--xgboost::Builder::EnumerateSplit()
  |--}

The splitting process eventually calls the EnumerateSplit() function. The code below has been simplified and some variable names changed for ease of understanding:
updater_colmaker.cc line:508
void EnumerateSplit(...)
{
  // Create a temporary vector temp to hold node statistics; its size is
  // the maximum node index in the queue qexpand_ plus one
  std::vector<TStats> temp(*std::max_element(qexpand_.begin(), qexpand_.end()) + 1);
  // Statistics of the left child after a node splits
  TStats left_child(param);
  // Traverse all values of the current feature
  for (const ColBatch::Entry* it = begin; it != end; it += d_step)
  {
    // Get the sample instance index and feature value of the current entry
    const int rindex = it->index;
    const float fvalue = it->value;
    // Index of the node the current sample is assigned to
    const int node_id = position[rindex];
    // Statistics of the right child after the node splits
    TStats& right_child = temp[node_id];
    // With the current feature value as the split threshold, the remaining
    // samples of the node form the left child
    left_child = snode[node_id].stats - right_child;
    // Compute the gain loss_change
    double loss_change = CalcSplitGain(param, left_child, right_child)
                         - snode[node_id].root_gain;
    // Record the best split threshold, which is the midpoint of the
    // adjacent feature values of the left and right children
    right_child.best.Update(loss_change, feature_id,
                            0.5 * (fvalue + right_child.left_value));
    // Add the current sample instance to the right child's statistics
    right_child.Add(gpair, info, rindex);
  }
}
It is clear from the above code that the whole process follows the exact greedy algorithm described earlier. The split point can be searched in two directions, from left to right and from right to left; the code above shows only one direction. The function that computes the split gain for a candidate threshold is CalcSplitGain(), with the following code:

param.h line:365
double CalcSplitGain(const TrainParam& param,
                     GradStats left, GradStats right) const {
  return left.CalcGain(param) + right.CalcGain(param);
}
The code above simply adds the gain of the left child to that of the right child; the gain loss_change is then obtained by subtracting the parent node's gain from this sum.

3. xgboost::Builder::ResetPosition()

After the split threshold of each node in the current layer has been found, left and right children can be constructed under the corresponding nodes, increasing the depth of the tree. When the depth of the tree increases, the sample instances that were assigned to the non-leaf nodes of the current layer must be reassigned to the nodes of the next layer. This is done by the ResetPosition() function.

4. xgboost::Builder::UpdateQueueExpand()

The UpdateQueueExpand() function replaces the nodes in the qexpand_ queue with the nodes of the next layer, and then InitNewNode() is called again to compute the weight and gain of the new nodes in qexpand_ for the next iteration.

III. Summary

This article described in detail how XGBoost builds a tree with the exact greedy algorithm and analyzed the corresponding source code. To make the source easier to understand, some simplifications were made during the analysis, such as removing the multithreading and parallelization logic and renaming some variables. After a tree has been constructed, it should be pruned to prevent the model from overfitting; due to space limitations, pruning is not described here. This article is only meant as a guide. To gain a deeper understanding of the implementation details of XGBoost, you still need to read the XGBoost source code itself; after all, some things are explained far more clearly by the code than by any amount of text. Finally, everyone is welcome to discuss.
