XGBoost Source Reading Notes (1): Code Logical Structure


One. XGBoost Introduction

XGBoost (eXtreme Gradient Boosting) is an efficient, convenient and extensible machine learning library based on the gradient boosting (GB) framework. Tianqi Chen open-sourced it on GitHub in 2014 after completing version v0.1 [1]; the latest version at the time of writing is v0.6. It now appears in all kinds of competitions. For example, among the 29 Kaggle [2] challenge-winning solutions published in 2015, 17 used XGBoost, while only 11 used deep learning; likewise, every top-10 team in the 2015 KDDCup used XGBoost [3]. Because it is so similar to GBDT (Gradient Boosting Decision Tree), comparisons between GBDT and XGBoost are common on the web [4]. I recently read Tianqi Chen's paper "XGBoost: A Scalable Tree Boosting System" [3]; from the paper, the novel points of XGBoost are:

1. A regularized objective function is used: the added penalty term controls both the model complexity (the number of leaves) and the score weights of the leaf nodes.


Figure 1-1 Objective function
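
For reference, the regularized objective from the paper [3] (transcribed here, since the figure itself is not reproduced) is

\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^2

where l is the training loss, T is the number of leaves of a tree and w is the vector of leaf scores.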

2. Shrinkage is used: the weights of each newly generated tree are scaled down by a factor η, which reduces the influence of that individual tree and leaves room for subsequent trees to improve the model.

3. Column (feature) subsampling is supported, a technique previously used in random forests. It helps prevent overfitting and also speeds up model training.

4. Parallel computation. Boosted trees are generated serially, so the parallelism is applied to the search for a tree's split points, which speeds up model training.

There are several methods for finding split points:

1. The basic exact greedy algorithm. After sorting a feature by its values, every feature value is enumerated as a candidate split point and its gain is computed; the split point with the maximum gain is then chosen (see the sketch after this list).

2. The approximate greedy algorithm. Before searching for a split point, this method sorts all features by their values and selects percentile points as the candidate set, then runs the basic exact greedy search over those candidates.

3. Weighted quantile sketch. This method makes it possible to handle weighted data when building the candidate split points.

4. Sparsity-aware split finding, which speeds up the handling of sparse data.
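
The following is a minimal sketch of the exact greedy search on a single (already extracted) feature column, using the gain formula from the paper [3]. The names Entry, BestSplit and FindSplit are mine, chosen for illustration; this is not xgboost's actual implementation.

#include <algorithm>
#include <cstddef>
#include <vector>

struct Entry { float fvalue; float grad; float hess; };   // one instance: feature value, g_i, h_i
struct BestSplit { float gain = 0.0f; float threshold = 0.0f; };

BestSplit FindSplit(std::vector<Entry> col, float lambda, float gamma) {
  // Sort the column by feature value so that each gap is a candidate split.
  std::sort(col.begin(), col.end(),
            [](const Entry& a, const Entry& b) { return a.fvalue < b.fvalue; });
  float G = 0.0f, H = 0.0f;                 // total gradient / hessian statistics
  for (const Entry& e : col) { G += e.grad; H += e.hess; }

  BestSplit best;
  float GL = 0.0f, HL = 0.0f;               // running left-side statistics
  for (std::size_t i = 0; i + 1 < col.size(); ++i) {
    GL += col[i].grad; HL += col[i].hess;
    const float GR = G - GL, HR = H - HL;
    // gain = 1/2 * [ GL^2/(HL+lambda) + GR^2/(HR+lambda) - (GL+GR)^2/(HL+HR+lambda) ] - gamma
    const float gain = 0.5f * (GL * GL / (HL + lambda) + GR * GR / (HR + lambda)
                               - G * G / (H + lambda)) - gamma;
    if (gain > best.gain) {
      best.gain = gain;
      best.threshold = 0.5f * (col[i].fvalue + col[i + 1].fvalue);  // split at the midpoint
    }
  }
  return best;
}

In the real code this search is repeated for every feature (in parallel over columns), and the feature/threshold pair with the largest gain is used to split the node.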

One of the differences from GBDT is that XGBoost applies a second-order Taylor expansion to the objective function and uses the second derivative to speed up the convergence of the model. Overall, the most important factor behind XGBoost's popularity is its fast training.
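
Concretely, at iteration t the objective is approximated by the following second-order expansion, after dropping constant terms (transcribed from the paper [3]):

\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t),
\qquad g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)}), \quad h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})

where g_i and h_i are the first and second derivatives of the loss with respect to the previous prediction. For a fixed tree structure, the optimal weight of leaf j then has the closed form w_j^* = -G_j / (H_j + \lambda), where G_j and H_j are the sums of g_i and h_i over the instances assigned to leaf j.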


Two. Download and Compile the Source Code

The source code download and compilation process on Linux is as follows [5]:

git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
make

The --recursive flag is needed because xgboost uses the author's own distributed-computing libraries, which are pulled in as git submodules by this flag. Once the build finishes we can start reading the source. The main directory structure of xgboost is as follows:

|--xgboost
  |--include
    |--xgboost       // the public xgboost header files
  |--src
    |--c_api         // the C API
    |--common        // common utilities, such as configuration-file handling
    |--data          // the data structures used, such as DMatrix
    |--gbm           // defines the gradient boosters (weak learners), such as gbtree and gblinear
    |--metric        // defines the evaluation metrics
    |--objective     // defines the objective functions
    |--tree          // a series of operations on trees


Three. Source Code Logical Structure

The execution entry point of the program is in the file cli_main.cc:

cli_main.cc
|--main()
  |--CLIRunTask()
    |--CLIParam::Configure()
    |--switch (param.task)
      {
        case kTrain:     CLITrain(param);
        case kDumpModel: CLIDumpModel(param);
        case kPredict:   CLIPredict(param);
      }

Only the CLIRunTask() function is called from main(), which shows that after the program parses the configuration file via Configure(), the corresponding execution function is selected according to the task parameter. Here we mainly look at the training function CLITrain().

cli_main.cc
|--CLITrain()
  |--DMatrix::Load()
  |--Learner::Create()
  |--Learner::Configure()
  |--Learner::InitModel()
  |--for (int iter = 0; iter < max_iter; ++iter)
     {
       Learner::UpdateOneIter();
       Learner::EvalOneIter();
     }

In CLITrain(), the training data is first loaded into memory, then a Learner instance is created; Learner's Configure() function sets its parameters and InitModel() initializes the model. The boosting training then begins, and the main call in each round is Learner's UpdateOneIter() function.
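
Below is a simplified sketch of this flow, loosely modeled on cli_main.cc. The signatures are abbreviated and may differ between xgboost versions (for example, the exact arguments of DMatrix::Load() and UpdateOneIter()); treat it as pseudocode rather than the actual implementation.

#include <memory>
#include <string>
#include <utility>
#include <vector>

#include "xgboost/data.h"     // DMatrix
#include "xgboost/learner.h"  // Learner

// Sketch of the CLITrain() flow; error handling, model checkpointing and the
// evaluation-set bookkeeping of the real implementation are omitted.
void TrainSketch(const std::string& train_path,
                 const std::vector<std::pair<std::string, std::string>>& cfg,
                 int max_iter) {
  std::shared_ptr<xgboost::DMatrix> dtrain(
      xgboost::DMatrix::Load(train_path, /*silent=*/true, /*load_row_split=*/false));

  std::unique_ptr<xgboost::Learner> learner(xgboost::Learner::Create({dtrain}));
  learner->Configure(cfg);   // parse booster / objective / tree parameters
  learner->InitModel();      // set up the initial (empty) model

  for (int iter = 0; iter < max_iter; ++iter) {
    learner->UpdateOneIter(iter, dtrain.get());            // one boosting round
    // learner->EvalOneIter(iter, eval_sets, eval_names);  // optional per-round evaluation
  }
}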

learner.cc
|--UpdateOneIter()
  |--Learner::LazyInitDMatrix()
  |--Learner::PredictRaw()
  |--ObjFunction::GetGradient()
  |--GradientBooster::DoBoost()

During each iteration, LazyInitDMatrix() first initializes the data structures that will be needed, PredictRaw() computes the current model's raw predictions, GetGradient() computes the first- and second-order derivatives of the objective function, and finally DoBoost() performs a boosting step that generates a regression tree. Class GradientBooster is an abstract class that defines the gradient boosting interface. Its two derived classes, GBTree and GBLinear, correspond to the configuration values "gbtree" and "gblinear": GBTree mainly uses regression trees as its weak learners, while GBLinear uses linear regression or logistic regression.
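
As an illustration of what GetGradient() produces, here is a sketch for squared-error regression, where g_i = ŷ_i − y_i and h_i = 1. The struct name GradPair and the function name are mine; the real objective implementations live in src/objective and cover many more losses.

#include <cstddef>
#include <vector>

struct GradPair { float grad; float hess; };  // (g_i, h_i) for one training instance

// Gradient statistics for the squared-error loss l(y, yhat) = 1/2 * (yhat - y)^2.
std::vector<GradPair> SquaredErrorGradient(const std::vector<float>& preds,
                                           const std::vector<float>& labels) {
  std::vector<GradPair> gpair(preds.size());
  for (std::size_t i = 0; i < preds.size(); ++i) {
    gpair[i].grad = preds[i] - labels[i];  // first derivative w.r.t. the prediction
    gpair[i].hess = 1.0f;                  // second derivative is constant
  }
  return gpair;
}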

Of the two, GBTree is used more often; its DoBoost() function performs the following operations:

gbtree.cc
|--GBTree::DoBoost()
  |--GBTree::BoostNewTrees()
    |--GBTree::InitUpdater()
    |--TreeUpdater::Update()

DoBoost() calls the BoostNewTrees() function. In BoostNewTrees(), the TreeUpdater instances are initialized, and a regression tree is generated when their Update() functions are called. TreeUpdater is an abstract class from which many different updaters are derived depending on the algorithm used; these updaters live in the src/tree directory.

|--src
  |--tree
    |--updater_basemaker-inl.h
    |--updater_colmaker.cc
    |--updater_skmaker.cc
    |--updater_refresh.cc
    |--updater_prune.cc
    |--updater_histmaker.cc
    |--updater_fast_hist.cc

The file updater_basemaker-inl.h defines class BaseMaker, which derives from TreeUpdater. Class ColMaker uses the basic exact greedy search algorithm, enumerating all features to find the best split point; class SketchMaker derives from BaseMaker and uses an approximate sketch method to find the best split point; class TreeRefresher refreshes the statistics and leaf values of a tree on a dataset; class TreePruner performs the pruning operation on a tree; and class HistMaker uses a histogram method, which is not mentioned in the paper, so I am not very clear about it.
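
To make the chaining of updaters concrete, here is a rough sketch of how BoostNewTrees() lets a sequence of updaters act on the same newly created tree. The interfaces below are simplified stand-ins (RegTree, TreeUpdater and GradPair are reduced to the bare minimum), not xgboost's real classes; with the default gbtree settings the configured sequence is roughly "grow_colmaker,prune", i.e. a ColMaker-style grower followed by a TreePruner-style pruner.

#include <memory>
#include <vector>

// Simplified stand-ins for xgboost's tree and updater types.
struct RegTree {};
struct GradPair { float grad; float hess; };

class TreeUpdater {
 public:
  virtual ~TreeUpdater() = default;
  virtual void Update(const std::vector<GradPair>& gpair, RegTree* tree) = 0;
};

// BoostNewTrees(), reduced to its control flow: create a new regression tree,
// then let each configured updater refine it in order (grow, then prune, ...).
void BoostNewTreesSketch(const std::vector<GradPair>& gpair,
                         const std::vector<std::unique_ptr<TreeUpdater>>& updaters,
                         std::vector<std::unique_ptr<RegTree>>* trees) {
  trees->emplace_back(new RegTree());
  for (const auto& up : updaters) {
    up->Update(gpair, trees->back().get());
  }
}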

At this point we have a picture of the logical structure of the XGBoost source code; this is as far as the current reading goes. After looking at the concrete implementation of each algorithm, I will cover the implementation details in subsequent articles.


Four. References

[1]. https://github.com/dmlc/xgboost

[2]. https://www.kaggle.com

[3]. Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2016. URL: https://arxiv.org/abs/1603.02754

[4]. https://www.zhihu.com/question/41354392

[5]. http://xgboost.readthedocs.io/en/latest/build.html

