Interpreting the XGBoost algorithm and its output in Python


Contents: 1. Description of the problem; 2. Dataset; 3. Training set and test set; 4. XGBoost modeling (4.1 Model initialization settings, 4.2 Modeling and forecasting, 4.3 Visual output: 4.3.1 Scores, 4.3.2 Leaf nodes, 4.3.3 Feature contributions); References

1. Description of the problem

Recently I did some machine learning tasks with the xgboost algorithm in a Python environment and used its built-in functions to visualize the resulting trees, but the meaning of the leaf values was unclear. I also noticed that when scoring the test set with XGBoost's built-in predict, many samples received exactly the same output score. How can this be explained? After going through related questions on Stack Overflow and issue answers on GitHub, I reached a preliminary understanding of the problem, which I share here.

2. Dataset

The classic iris data is used for illustration. Since a binary classification problem is wanted, only the first 100 rows of data are used (these contain exactly the samples with labels 0 and 1).

from sklearn import datasets

iris = datasets.load_iris()
data = iris.data[:100]
print data.shape
# (100L, 4L)
# 100 samples in total, each with 4 features

label = iris.target[:100]
print label
# Exactly the samples with label 0 and label 1:
# [0 0 0 ... 0 1 1 1 ... 1]  (50 zeros followed by 50 ones)
 
3. Training set and test set
from sklearn.cross_validation import train_test_split  # in newer scikit-learn: sklearn.model_selection

train_x, test_x, train_y, test_y = train_test_split(data, label, random_state=0)
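Since the later discussion depends on the size of the test split, note that with the default split ratio (0.25 in this version of scikit-learn) there are 75 training and 25 test samples; a quick check (a small sketch, not in the original) confirms this:

print(train_x.shape)
# expected: (75, 4)
print(test_x.shape)
# expected: (25, 4)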
4. XGBoost modeling
4.1 Model initialization settings
import xgboost as xgb

dtrain = xgb.DMatrix(train_x, label=train_y)
dtest = xgb.DMatrix(test_x)

params = {'booster': 'gbtree',
          'objective': 'binary:logistic',
          'eval_metric': 'auc',
          'max_depth': 4,
          'lambda': 10,
          'subsample': 0.75,
          'colsample_bytree': 0.75,
          'min_child_weight': 2,
          'eta': 0.025,
          'seed': 0,
          'nthread': 8,
          'silent': 1}

watchlist = [(dtrain, 'train')]
4.2 Modeling and forecasting
bst = xgb.train(params, dtrain, num_boost_round=100, evals=watchlist)

ypred = bst.predict(dtest)

# Set a threshold and output some evaluation metrics
y_pred = (ypred >= 0.5) * 1

from sklearn import metrics
print 'AUC: %.4f' % metrics.roc_auc_score(test_y, ypred)
print 'ACC: %.4f' % metrics.accuracy_score(test_y, y_pred)
print 'Recall: %.4f' % metrics.recall_score(test_y, y_pred)
print 'F1-score: %.4f' % metrics.f1_score(test_y, y_pred)
print 'Precision: %.4f' % metrics.precision_score(test_y, y_pred)
metrics.confusion_matrix(test_y, y_pred)

Out[23]:

AUC: 1.0000
ACC: 1.0000
Recall: 1.0000
F1-score: 1.0000
Precision: 1.0000
array([[13,  0],
       [ 0, 12]], dtype=int64)

Yeah, a perfect model with perfect predictions!

4.3 Visual output

# There are three ways to get prediction output
? bst.predict
Signature: bst.predict(data, output_margin=False, ntree_limit=0, pred_leaf=False, pred_contribs=False, approx_contribs=False)

pred_leaf : bool
    When this option is on, the output will be a matrix of (nsample, ntrees)
    with each record indicating the predicted leaf index of each sample in each tree.
    Note that the leaf index of a tree is unique per tree, so you may find leaf 1
    in both tree 1 and tree 0.

pred_contribs : bool
    When this option is on, the output will be a matrix of (nsample, nfeats + 1)
    with each record indicating the feature contributions (SHAP values) for that
    prediction. The sum of all feature contributions is equal to the prediction.
    Note that the bias is added as the final column, on top of the regular features.
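As a small sketch (using the bst and dtest objects defined above), the three kinds of output can be requested like this:

# Default scores, per-tree leaf indices, and per-feature contributions (SHAP values)
scores   = bst.predict(dtest)
leaves   = bst.predict(dtest, pred_leaf=True)
contribs = bst.predict(dtest, pred_contribs=True)
print(scores.shape)    # (nsample,)
print(leaves.shape)    # (nsample, ntrees)
print(contribs.shape)  # (nsample, nfeats + 1)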
4.3.1 Scores

The default output is the predicted score. There is not much to say about it, so straight to the code.

ypred = bst.predict(dtest)
ypred

Out[32]:

array([ 0.20081411,  0.80391562,  0.20081411,  0.80391562,  0.80391562,
        0.80391562,  0.20081411,  0.80391562,  0.80391562,  0.80391562,
        0.80391562,  0.80391562,  0.80391562,  0.20081411,  0.20081411,
        0.20081411,  0.20081411,  0.20081411,  0.20081411,  0.20081411,
        0.20081411,  0.80391562,  0.20081411,  0.80391562,  0.20081411], dtype=float32)

Here you can see the first problem raised at the beginning of the article: why do so many samples get exactly the same score? Before answering, let us look at the other two kinds of output.
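In fact, a quick check (a sketch, assuming numpy is imported as np) shows that only two distinct scores appear among the 25 test predictions:

import numpy as np
print(np.unique(ypred))
# expected: [ 0.20081411  0.80391562]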

4.3.2 The leaf node each sample belongs to

When pred_leaf=True is set, the output is the leaf node that each sample falls into in every tree.

ypred_leaf = bst.predict(dtest, pred_leaf=True)
ypred_leaf

Out[33]:

array([[1, 1, 1, ..., 1, 1, 1],
       [2, 2, 2, ..., 2, 2, 2],
       [1, 1, 1, ..., 1, 1, 1],
       ...,
       [1, 1, 1, ..., 1, 1, 1],
       [2, 2, 2, ..., 2, 2, 2],
       [1, 1, 1, ..., 1, 1, 1]])

The output has dimensions [number of samples, number of trees]. Training used num_boost_round=100 and the test set holds 25 samples, so ypred_leaf has shape (25, 100).
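A quick shape check (sketch) confirms this:

print(ypred_leaf.shape)
# expected: (25, 100)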

The first row means that, in every one of the 100 trees, the first sample lands in leaf node 1 (leaf indices are counted per tree).

How can you see the score of each leaf node in each tree? There are two ways: visualize the trees, or dump the model directly.

xgb.to_graphviz(bst, num_trees=0)
# Visualize how the first tree was built

# Dump the boosting iterations of the model directly
bst.dump_model("model.txt")
booster[0]:
0:[f3<0.75] yes=1,no=2,missing=1
    1:leaf=-0.019697
    2:leaf=0.0214286
booster[1]:
0:[f2<2.35] yes=1,no=2,missing=1
    1:leaf=-0.0212184
    2:leaf=0.0212
booster[2]:
0:[f2<2.35] yes=1,no=2,missing=1
    1:leaf=-0.0197404
    2:leaf=0.0197235
booster[3]: ...

The command above writes out the model's boosting iterations, and you can see that each tree has only two leaf nodes (the trees are very simple). Summing the value of the leaf node a sample lands in over all trees, and then applying the corresponding link function, gives that sample's predicted value.

Take the first sample as an example: it belongs to leaf 1 in every tree, so accumulating those leaf values gives the first number below. Similarly, the second sample belongs to leaf 2 in every tree, and accumulating those values gives the second number.

leaf1   -1.381214
leaf2    1.410950
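This accumulation can also be reproduced programmatically. Below is a minimal sketch (not from the original article) that parses the leaf values out of the dumped trees with bst.get_dump() and sums them along the leaf indices from ypred_leaf; the regular expression and variable names are my own:

import re

# Collect leaf values per tree: leaf_values[t][leaf_index] = leaf value of tree t
leaf_values = []
for tree_text in bst.get_dump():
    values = {}
    for m in re.finditer(r'(\d+):leaf=([-\d.e]+)', tree_text):
        values[int(m.group(1))] = float(m.group(2))
    leaf_values.append(values)

# Accumulated value for the first two test samples
margin0 = sum(leaf_values[t][leaf] for t, leaf in enumerate(ypred_leaf[0]))
margin1 = sum(leaf_values[t][leaf] for t, leaf in enumerate(ypred_leaf[1]))
print(margin0)  # expected to be close to -1.381214
print(margin1)  # expected to be close to  1.410950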

At the very beginning we set 'objective': 'binary:logistic' for the model, so the accumulated value is converted into the actual score with the logistic function:


f(x) = 1 / (1 + exp(-x))

import numpy as np

1 / float(1 + np.exp(1.38121416))
Out[24]: 0.20081407112186503

1 / float(1 + np.exp(-1.410950))
Out[25]: 0.8039157403338895

This matches the scores returned by ypred = bst.predict(dtest).
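There is also a built-in shortcut worth mentioning (a sketch, relying on the output_margin parameter listed in the signature quoted earlier): predicting with output_margin=True returns the raw accumulated leaf values, and pushing them through the logistic function reproduces the default scores:

import numpy as np

margin = bst.predict(dtest, output_margin=True)  # raw accumulated values before the logistic transform
prob = 1.0 / (1.0 + np.exp(-margin))             # f(x) = 1 / (1 + exp(-x))
print(margin[:2])  # expected to be close to [-1.381214, 1.410950]
print(prob[:2])    # expected to be close to [0.2008..., 0.8039...]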

4.3.3 Feature contributions

Next, let us look at the remaining kind of output: the contribution of each feature to the score.

ypred_contribs = bst.predict(dtest, pred_contribs=True)
ypred_contribs

Out[37]:

array([[ 0.        ,  0.        , -1.01448286, -0.41277751,  0.04604663],
       [ 0.        ,  0.        ,  0.96967536,  0.39522746,  0.04604663],
       [ 0.        ,  0.        , -1.01448286, -0.41277751,  0.04604663],
       [ 0.        ,  0.        ,  0.96967536,  0.39522746,  0.04604663],
       [ 0.        ,  0.        ,  0.96967536,  0.39522746,  0.04604663],
       [ 0.        ,  0.        ,  0.96967536,  0.39522746,  0.04604663],
       [ 0.        ,  0.        , -1.01448286, -0.41277751,  0.04604663],
       [ 0.        ,  0.        ,  0.96967536,  0.39522746,  0.04604663],
       [ 0.        ,  0.        ,  0.96967536,  0.39522746,  0.04604663],
       [ 0.        ,  0.        ,  0.96967536,  0.39522746,  0.04604663],
       [ 0.        ,  0.        ,  0.96967536,  0.39522746,  0.04604663],
       [ 0.        ,  0.        ,  0.96967536,  0.39522746,  0.04604663],
       [ 0.        ,  0.        ,  0.96967536,  0.39522746,  0.04604663],
       [ 0.        ,  0.        , -1.01448286, -0.41277751,  0.04604663],
       [ 0.        ,  0.        , -1.01448286, -0.41277751,  0.04604663],
       [ 0.        ,  0.        , -1.01448286, -0.41277751,  0.04604663],
       [ 0.        ,  0.        , -1.01448286, -0.41277751,  0.04604663],
       [ 0.        ,  0.        , -1.01448286, -0.41277751,  0.04604663],
       [ 0.        ,  0.        , -1.01448286, -0.41277751,  0.04604663],
       [ 0.        ,  0.        , -1.01448286, -0.41277751,  0.04604663],
       [ 0.        ,  0.        , -1.01448286, -0.41277751,  0.04604663],
       [ 0.        ,  0.        ,  0.96967536,  0.39522746,  0.04604663],
       [ 0.        ,  0.        , -1.01448286, -0.41277751,  0.04604663],
       [ 0.        ,  0.        ,  0.96967536,  0.39522746,  0.04604663],
       [ 0.        ,  0.        , -1.01448286, -0.41277751,  0.04604663]], dtype=float32)

The output ypred_contribs has shape (25, 5). From the docstring quoted earlier, the last column is the bias, and the preceding four columns are the contribution of each feature to the final score; you can see that the first two features contribute nothing.
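The zero columns are consistent with the split-based feature importance: a quick look at bst.get_fscore() (a sketch; the counts in the comment are placeholders, not from the original run) should list only f2 and f3, since f0 and f1 are never used for a split:

print(bst.get_fscore())
# e.g. {'f2': ..., 'f3': ...} -- only the two informative features appear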

How does this output relate to the final score? The principle is the same; take the same two samples as before.

score_a = sum(ypred_contribs[0])
print score_a
# -1.38121373579

score_b = sum(ypred_contribs[1])
print score_b
# 1.41094945744

These are the same accumulated values as before, and converting them with the logistic function gives the same scores. This also answers the opening question: every tree here makes a single split (on f2 or f3), so each sample ends up in one of only two leaf patterns, and therefore one of only two possible scores.

That covers a simple use of the xgboost algorithm in Python and the outputs encountered along the way: the predicted scores, the tree nodes each sample falls into, and the contribution of each feature to each sample's score, together with how the three relate to one another. Knowledge gained, happily!

This is an original article by the blogger; please do not reproduce it without the owner's permission.

If reproduced, please indicate the source.

References:
http://xgboost.readthedocs.io/en/latest/python/python_api.html
https://www.kaggle.com/general/20322
