Xgboost principle

Source: Internet
Author: User
Tags: builtin, xgboost
SOURCE http://blog.csdn.net/a819825294

This article is fairly long; readers can use the table of contents at the top to jump directly to the chapters that interest them.

1. Preface

It has been nearly 10 months since the last edit. Thanks to a recommendation from teacher 爱可可 (on Weibo), traffic rose sharply. My recent graduation thesis is related to Xgboost, so I am picking this article up again.

Most of the online resources on the Xgboost principle stay at the application level. This article draws on Dr. Tianqi Chen's slides, the paper, and several online resources, and aims at an in-depth understanding of the Xgboost principle. (The addresses are given in the references at the end.)

2. Xgboost vs GBDT

When talking about Xgboost, one has to mention GBDT; both are boosting methods (as shown in Figure 1). To understand GBDT, see my earlier article (address).


Figure 1

If we set aside some differences in engineering implementation and problem solving, the main difference between Xgboost and GBDT is the definition of the objective function.

Note: the L pointed to by the red arrow is the loss function; the red box is the regularization term, which includes L1 and L2; and the red circle is the constant term. Xgboost approximates the loss with a second-order Taylor expansion (keeping three terms), from which we can clearly see that the final objective function depends only on the first and second derivatives of the error function at each data point.
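Since the slide image is not reproduced here, the corresponding objective from the Xgboost paper, after the second-order Taylor expansion, can be written as

Obj^{(t)} = \sum_{i=1}^{n} l\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) + \mathrm{const}
          \approx \sum_{i=1}^{n} \Big[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t) + \mathrm{const},

where g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)}) and h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)}) are the first and second derivatives of the loss at the previous round's prediction.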

3. Principle

For the objective function given above, we can simplify it further.

(1) Define the complexity of the tree

We refine the definition of the tree f a little, splitting it into a structure part q and a leaf-weight part w. The figure below gives a concrete example: the structure function q maps an input to the index of a leaf, and w gives the leaf score corresponding to each index.

This definition of complexity includes the number of leaves in the tree and the squared L2 norm of the scores on the leaves. Of course, this is not the only possible definition, but a tree defined this way generally works well. An example of the complexity calculation is given in the following figure.

Note: the boxed terms control this part of the final model formula and correspond to the model parameters lambda and gamma.
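Written out (as in the paper), the complexity of a tree f with T leaves and leaf weights w_1, ..., w_T is

\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2,

and the \gamma and \lambda here are exactly the gamma and lambda parameters mentioned in the note above.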

Under this new definition, we can rewrite the objective function, where I_j is defined as the set of samples assigned to leaf j, and g and h are the first and second derivatives.

This objective contains T independent univariate quadratic functions. We can define

The final formula can be reduced to

Setting the derivative to zero, we obtain

Then, substituting the optimal solution back, we get:
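Since the formula images from the original slides are not shown here, the standard derivation from the Xgboost paper fills in these steps. With G_j = \sum_{i \in I_j} g_i and H_j = \sum_{i \in I_j} h_i, the objective becomes

Obj^{(t)} \approx \sum_{j=1}^{T} \Big[ G_j w_j + \tfrac{1}{2} (H_j + \lambda) w_j^2 \Big] + \gamma T.

Setting the derivative with respect to each w_j to zero gives

w_j^* = -\frac{G_j}{H_j + \lambda},

and substituting back,

Obj^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T.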

(2) Example of the scoring function calculation

Obj represents how much we can reduce the objective at most when we fix the structure of a tree. We can call it the structure score.

(3) Splitting nodes

Two methods for splitting nodes are given in the paper.

(1) Greedy method:

Each step tries to add a split to an existing leaf.

For each expansion, we want to enumerate all possible split schemes. How do we enumerate all the splits efficiently? Suppose we enumerate all conditions of the form x < a; for a particular split point a we want to compute the sums of the derivatives on the left and on the right.

We can see that, for all a, a single left-to-right scan is enough to accumulate the gradient sums GL and GR. We can then use the gain formula below to compute the score of each split scheme.
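For reference (the formula image from the slides is not reproduced), the gain of a split in the paper's notation is

Gain = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma,

i.e. the score of the left child plus the score of the right child, minus the score of leaving the node unsplit, minus the cost \gamma of the extra leaf.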

Looking at this objective, the second noteworthy point is that introducing a split does not necessarily make things better, because we pay a penalty for introducing a new leaf. Optimizing this objective therefore corresponds to tree pruning: when the gain brought by a split is smaller than a threshold, we can cut that split off. As you can see, once we derive our objective formally, strategies such as score calculation and pruning arise naturally, instead of being heuristic hacks.

Here are the algorithms from the paper.
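As a complement (the algorithm screenshots are not reproduced here), the following is a minimal numpy sketch of exact greedy split finding on a single feature; the function and variable names are mine, not from the xgboost code base.

import numpy as np

def exact_greedy_split(x, grad, hess, lam=1.0, gamma=0.0):
    # x: feature values of the samples at this node; grad/hess: g_i and h_i
    order = np.argsort(x)
    g, h = grad[order], hess[order]
    G, H = g.sum(), h.sum()
    GL = HL = 0.0
    best_gain, best_value = 0.0, None
    for i in range(len(x) - 1):
        GL += g[i]
        HL += h[i]
        # only consider splits between two distinct feature values
        if x[order[i]] == x[order[i + 1]]:
            continue
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain = gain
            best_value = 0.5 * (x[order[i]] + x[order[i + 1]])
    return best_gain, best_value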

(2) Approximate algorithm:

When the data is too large, the exact enumeration above cannot be computed directly; instead, candidate split points are proposed from quantiles of each feature and the gradient statistics are aggregated per bucket.
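A minimal sketch of this idea, again with names of my own choosing (the real implementation uses a weighted quantile sketch with the hessians as weights, which is simplified to plain quantiles here):

import numpy as np

def approximate_split(x, grad, hess, n_bins=32, lam=1.0, gamma=0.0):
    # propose candidate split points from quantiles of the feature
    qs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    candidates = np.unique(np.quantile(x, qs))
    # bucket k holds the samples with candidates[k-1] <= x < candidates[k]
    bucket = np.searchsorted(candidates, x, side='right')
    Gb = np.bincount(bucket, weights=grad, minlength=len(candidates) + 1)
    Hb = np.bincount(bucket, weights=hess, minlength=len(candidates) + 1)
    G, H = Gb.sum(), Hb.sum()
    # left statistics when splitting at candidates[k] (left = samples with x < candidates[k])
    GL, HL = np.cumsum(Gb)[:-1], np.cumsum(Hb)[:-1]
    GR, HR = G - GL, H - HL
    gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                  - G**2 / (H + lam)) - gamma
    k = int(np.argmax(gain))
    return gain[k], candidates[k]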

4. Custom loss function (specify Grad, Hess)

(1) Loss function

(2) Grad and Hess derivation
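The derivation image is not reproduced here, but for the binary logistic loss used in the official code below it goes as follows. With raw margin score \hat{y} and p = \sigma(\hat{y}) = 1/(1 + e^{-\hat{y}}), the loss is

l(y, \hat{y}) = -\big[ y \ln p + (1 - y) \ln (1 - p) \big].

Using dp/d\hat{y} = p(1 - p),

grad = \frac{\partial l}{\partial \hat{y}} = p - y, \qquad hess = \frac{\partial^2 l}{\partial \hat{y}^2} = p(1 - p),

which is exactly what logregobj returns in the code.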

(3) Official code

#!/usr/bin/python
import numpy as np
import xgboost as xgb

###
# advanced: customized loss function
#
print('start running example to use customized objective function')

dtrain = xgb.DMatrix('../data/agaricus.txt.train')
dtest = xgb.DMatrix('../data/agaricus.txt.test')

# note: for customized objective function, we leave objective as default
# note: what we are getting is margin value in prediction
# you must know what you are doing
param = {'max_depth': 2, 'eta': 1, 'silent': 1}
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 2

# user-defined objective function: given the prediction, return the gradient
# and second-order gradient; this is log-likelihood loss
def logregobj(preds, dtrain):
    labels = dtrain.get_label()
    preds = 1.0 / (1.0 + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1.0 - preds)
    return grad, hess

# user-defined evaluation function: return a pair (metric_name, result)
# NOTE: when you use a customized loss function, the default prediction value
# is the margin, which may make built-in evaluation metrics not work properly;
# for example, with logistic loss the predictions are scores before the logistic
# transformation, while the built-in error metric assumes input after the
# transformation, so you may need to write a customized evaluation function
def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    # since preds are margins (before logistic transformation, cutoff at 0)
    return 'error', float(sum(labels != (preds > 0.0))) / len(labels)

# training with a customized objective; we could also do step-by-step training,
# simply look at xgboost.py's implementation of train
bst = xgb.train(param, dtrain, num_round, watchlist, logregobj, evalerror)
5. Xgboost parameter tuning

Since Xgboost has many parameters, here are three approaches to tuning them:

(1) GridSearch (a minimal sketch is given after this list)

(2) hyperopt

(3) An article by a foreign author; it is quite practical and recommended reading (address).
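A minimal grid-search sketch, assuming xgboost's scikit-learn wrapper and scikit-learn itself are available; the data set and parameter grid values are purely illustrative:

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# illustrative grid, not tuned for any particular data set
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.05, 0.1, 0.3],
    'n_estimators': [100, 300],
}
search = GridSearchCV(
    xgb.XGBClassifier(objective='binary:logistic'),
    param_grid,
    scoring='roc_auc',
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)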

6. Engineering implementation optimization

(1) Column blocks and parallelization

(2) Cache-aware access: a thread pre-fetches data from non-contiguous memory into a contiguous buffer, and the main thread accumulates gradient statistics in that contiguous buffer.

(3) System tricks: block pre-fetching; using multiple disks to parallelize disk operations; LZ4 compression (popular in recent years for its outstanding performance); loop unrolling; OpenMP.

7. Source code walkthrough

Many thanks to the author of [4] for their selfless sharing.

Personally I use Source Insight to read the code. Because some Xgboost files use the .cc suffix, which Source Insight does not recognize by default, they can be renamed with the following command:

find ./ -name "*.cc" | awk -F "." '{print $2}' | xargs -i -t mv ./{}.cc ./{}.cpp

In fact, after walking through Xgboost's source code, you can see the following main flow:

cli_main.cc:
main()
  -> CLIRunTask()
       -> CLITrain()
            -> DMatrix::Load()
            -> learner = Learner::Create()
            -> learner->Configure()
            -> learner->InitModel()
            -> for (i = 0; i < param.num_round; ++i)
                 -> learner->UpdateOneIter()
            -> learner->Save()

learner.cc:
Create()
  -> new LearnerImpl()
Configure()
InitModel()
  -> LazyInitModel()
       -> obj_ = ObjFunction::Create()
            -> objective.cc: Create()
                 -> SoftmaxMultiClassObj (multiclass_obj.cc) /
                    LambdaRankObj (rank_obj.cc) /
                    RegLossObj (regression_obj.cc) /
                    PoissonRegression (regression_obj.cc)
       -> gbm_ = GradientBooster::Create()
            -> gbm.cc: Create()
                 -> GBTree (gbtree.cc) / GBLinear (gblinear.cc)
       -> obj_->Configure()
       -> gbm_->Configure()
UpdateOneIter()
  -> PredictRaw()
  -> obj_->GetGradient()
  -> gbm_->DoBoost()

gbtree.cc:
Configure()
  -> for (up in updaters)
       -> up->Init()
DoBoost()
  -> BoostNewTrees()
       -> new_tree = new RegTree()
       -> for (up in updaters)
            -> up->Update(new_tree)

tree_updater.cc:
Create()
  -> ColMaker / DistColMaker (updater_colmaker.cc) /
     SketchMaker (updater_skmaker.cc) /
     TreeRefresher (updater_refresh.cc) /
     TreePruner (updater_prune.cc) /
     HistMaker / CQHistMaker / GlobalProposalHistMaker / QuantileHistMaker (updater_histmaker.cc) /
     TreeSyncher (updater_sync.cc)

As the main flow above shows, the Xgboost implementation breaks the algorithm down into modules. The important parts are:

I. ObjFunction: corresponds to the different loss functions; the first and second order derivatives can be computed from it.
II. GradientBooster: manages the models produced by the boosting method; note that the booster here can be either a linear booster model or a tree booster model.
III. Updater: used for tree construction; depending on the concrete construction strategy, there are several kinds of updaters. For example, for performance Xgboost provides not only single-machine multi-threaded acceleration but also multi-machine distributed acceleration. Several updater implementations for different parallel tree-construction strategies are provided, including:
i). Inter-feature exact parallelism (exact parallelism at the feature level)
ii). Inter-feature approximate parallelism (approximate parallelism at the feature level, computed over feature bins, which reduces the overhead of enumerating all candidate split points)
iii). Intra-feature parallelism (parallelism within a feature)

In addition, to avoid overfitting, an updater for pruning the tree (TreePruner) is provided, as well as an updater for communicating node model parameters in a distributed setting (TreeSyncher). With this design, the main operations of tree construction can be connected as a chain of updaters, which is consistent and clean, and is an application of the decorator design pattern [4].

In the Xgboost implementation, the most important part is tree construction, and the corresponding code is mainly the updater implementations. So we take the updater implementations as the starting point of the introduction.

Taking ColMaker (the single-machine, inter-feature-parallel implementation of the exact tree-construction strategy) as an example, its construction flow is as follows:

updater_colmaker.cc:
ColMaker::Update()
  -> Builder builder;
  -> builder.Update()
       -> InitData()
       -> InitNewNode()  // for each tree node that can still be split (initially there is
                         // only one leaf node, the root), compute statistics such as gain/weight
       -> for (depth = 0; depth < maximum depth of the tree; ++depth)
            -> FindSplit()
                 -> for (each feature)  // parallelized through OpenMP to obtain
                                        // inter-feature parallelism
                      -> UpdateSolution()
                           -> EnumerateSplit()     // each execution thread handles one feature
                                                   // and selects that feature's optimal split point
                           -> ParallelFindSplit()  // multiple execution threads handle one feature
                                                   // together and select its optimal split point:
                                                   // each thread sums the statistics (grad/hess) of
                                                   // the samples assigned to it; the statistics of
                                                   // all threads are aggregated; the boundary feature
                                                   // value of each thread's sample set is evaluated as
                                                   // a candidate split point; the feature values within
                                                   // each thread's sample set are enumerated as candidate
                                                   // split points, and the best one is selected
                      -> SyncBestSolution()
                           // UpdateSolution()/ParallelFindSplit() above find, for every leaf node
                           // to be expanded, the optimal split point in each feature dimension;
                           // for example, for leaf node A, OpenMP thread 1 finds the optimal split
                           // point of feature F1 and OpenMP thread 2 finds the optimal split point
                           // of feature F2, so a global sync is needed to find the optimal split
                           // point of leaf node A
                      -> create child nodes for the leaf nodes that need to be split
            -> ResetPosition()
                 // missing values (i.e. the default direction) and non-missing values
                 // (i.e. non-default) are handled separately
            -> UpdateQueueExpand()
                 // replace qexpand_ with the leaf nodes to be expanded, as the starting
                 // point for the next round of splitting
            -> InitNewNode()  // compute statistics for the tree nodes that can be split
8. Simple use of Xgboost in Python and R

Task: binary classification with imbalanced samples (scale_pos_weight can alleviate this to some extent).

"Python"

R

9. Introduction to the more important parameters in Xgboost

(1) objective [default=reg:linear]: defines the learning task and the corresponding learning objective. The available objective functions are:
"reg:linear" – linear regression.
"reg:logistic" – logistic regression.
"binary:logistic" – logistic regression for binary classification; the output is a probability.
"binary:logitraw" – logistic regression for binary classification; the output is the raw score wTx.
"count:poisson" – Poisson regression for counting problems; the output is a Poisson distribution. In Poisson regression, the default value of max_delta_step is 0.7 (used to safeguard optimization).
"multi:softmax" – uses the softmax objective for multi-class classification; the parameter num_class (number of classes) must be set.
"multi:softprob" – same as softmax, but outputs an ndata * nclass vector, which can be reshaped into a matrix with ndata rows and nclass columns; each row gives the probabilities of that sample belonging to each class.
"rank:pairwise" – set Xgboost to do a ranking task by minimizing the pairwise loss.

(2) eval_metric: the evaluation metric; the choices are listed below:
"rmse": root mean square error
"logloss": negative log-likelihood
"error": binary classification error rate, calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation regards instances with a prediction value larger than 0.5 as positive and the others as negative.
"merror": multiclass classification error rate, calculated as #(wrong cases)/#(all cases).
"mlogloss": multiclass logloss
"auc": area under the curve, for ranking evaluation.
"ndcg": normalized discounted cumulative gain
"map": mean average precision
"ndcg@n", "map@n": n can be assigned as an integer to cut off the top positions in the lists for evaluation.
"ndcg-", "map-", "ndcg@n-", "map@n-": in Xgboost, NDCG and MAP evaluate the score of a list without any positive samples as 1. By adding "-" to the evaluation metric, Xgboost evaluates these scores as 0, to be consistent under some conditions.

(3) lambda [default=0]: penalty coefficient of the L2 regularization term.

(4) alpha [default=0]: penalty coefficient of the L1 regularization term.

(5) lambda_bias: L2 regularization on the bias, with default value 0 (there is no L1 on the bias, because it is not important).

(6) eta [default=0.3]: the shrinkage step size used in the update, to prevent overfitting. After each boosting step the algorithm gets the weights of the new features directly, and eta shrinks these feature weights to make the boosting process more conservative. The default value is 0.3; the value range is [0,1].

(7) max_depth [default=6]: the maximum depth of a tree. The default value is 6; the value range is [1,∞].

(8) min_child_weight [default=1]: the minimum sum of instance weights in a child node. If the sum of instance weights in a leaf node is smaller than min_child_weight, the splitting process stops. In a linear regression task, this parameter simply corresponds to the minimum number of instances needed in each node. The larger the value, the more conservative the algorithm. The value range is [0,∞].

10. References

(1) Xgboost guide and practice
(2) Xgboost
(3) Custom objective function
(4) What are the differences between GBDT and Xgboost in machine learning algorithms?
(5) Xgboost: Reliable Large-scale Tree Boosting System
(6) Xgboost: A Scalable Tree Boosting System
