Implementation of the XGBoost Algorithm and Interpretation of Its Output on the Python Platform

Contents: 1. Description of the problem / 2. Data set / 3. Training set and test set / 4. Xgboost Modeling / 4.1 Model initialization settings / 4.2 Modeling and forecasting / 4.3 Visual output / 4.3.1 Score / 4.3.2 Leaf node each sample belongs to / 4.3.3 Feature importance / References

1. Description of the problem
Recently I worked on a number of machine learning tasks with the xgboost algorithm in a Python environment. I used the built-in functions to visualize the trees, but the meaning of the leaf values was far from obvious. At the same time, when scoring the test set with Xgboost's built-in predict, I found that many samples received exactly the same output score. How can this be explained? After going through the related questions on Stack Overflow and the issue answers found on GitHub, I now have a preliminary understanding of the problem, which I would like to share.

2. Data set

The classic iris data set is used for illustration. Since the example is a binary classification problem, only the first 100 rows of data (the rows whose label is 0 or 1) are used.
from sklearn import datasets

iris = datasets.load_iris()
data = iris.data[:100]
print data.shape
# (100L, 4L)
# 100 samples in total, each with 4 features
label = iris.target[:100]
print label
# exactly the samples whose label is 0 or 1
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
3. Training set and test set
from sklearn.cross_validation import train_test_split
# (in newer versions of scikit-learn this lives in sklearn.model_selection)

train_x, test_x, train_y, test_y = train_test_split(data, label, random_state=0)
4. Xgboost Modeling
4.1 Model initialization Settings
import xgboost as xgb

dtrain = xgb.DMatrix(train_x, label=train_y)
dtest = xgb.DMatrix(test_x)
params = {'booster': 'gbtree',
          'objective': 'binary:logistic',
          'eval_metric': 'auc',
          'max_depth': 4,
          'lambda': 10,
          'subsample': 0.75,
          'colsample_bytree': 0.75,
          'min_child_weight': 2,
          'eta': 0.025,
          'seed': 0,
          'nthread': 8,
          'silent': 1}
watchlist = [(dtrain, 'train')]
4.2 Modeling and forecasting
bst = xgb.train(params, dtrain, num_boost_round=100, evals=watchlist)
ypred = bst.predict(dtest)

# set a threshold and output some evaluation metrics
y_pred = (ypred >= 0.5) * 1

from sklearn import metrics
print 'AUC: %.4f' % metrics.roc_auc_score(test_y, ypred)
print 'ACC: %.4f' % metrics.accuracy_score(test_y, y_pred)
print 'Recall: %.4f' % metrics.recall_score(test_y, y_pred)
print 'F1-score: %.4f' % metrics.f1_score(test_y, y_pred)
print 'Precision: %.4f' % metrics.precision_score(test_y, y_pred)
metrics.confusion_matrix(test_y, y_pred)
Out[23]:
AUC: 1.0000
ACC: 1.0000
Recall: 1.0000
F1-score: 1.0000
Precision: 1.0000
array([[13,  0],
       [ 0, 12]], dtype=int64)
A perfect model and a perfect prediction!

4.3 Visual output
# there are three kinds of prediction output
? bst.predict

Signature: bst.predict(data, output_margin=False, ntree_limit=0, pred_leaf=False,
                       pred_contribs=False, approx_contribs=False)

pred_leaf : bool
    When this option is on, the output will be a matrix of (nsample, ntrees)
    with each record indicating the predicted leaf index of each sample in each tree.
    Note that the leaf index of a tree is unique per tree, so you may find leaf 1
    in both tree 1 and tree 0.
pred_contribs : bool
    When this option is on, the output will be a matrix of (nsample, nfeats+1)
    with each record indicating the feature contributions (SHAP values) for that
    prediction. The sum of all feature contributions is equal to the prediction.
    Note that the bias is added as the final column, on top of the regular features.
4.3.1 Score

The default output is the score. There is not much to explain, so straight to the code.
ypred = bst.predict(dtest)
ypred

Out[32]:
array([ 0.20081411, 0.80391562, 0.20081411, 0.80391562, 0.80391562,
        0.80391562, 0.20081411, 0.80391562, 0.80391562, 0.80391562,
        0.80391562, 0.80391562, 0.80391562, 0.20081411, 0.20081411,
        0.20081411, 0.20081411, 0.20081411, 0.20081411, 0.20081411,
        0.20081411, 0.80391562, 0.20081411, 0.80391562, 0.20081411], dtype=float32)
Here you can already see the first problem raised at the beginning of the article: why is the score of every sample one of the same two values? Before answering that, let us look at the other two kinds of output.

4.3.2 Leaf node each sample belongs to

When pred_leaf=True is set, the output is the leaf node that each sample falls into in every tree.
ypred_leaf = bst.predict(dtest, pred_leaf=True)
ypred_leaf

Out[33]:
array([[1, 1, 1, ..., 1, 1, 1],
       [2, 2, 2, ..., 2, 2, 2],
       [1, 1, 1, ..., 1, 1, 1],
       ...,
       [1, 1, 1, ..., 1, 1, 1],
       [2, 2, 2, ..., 2, 2, 2],
       [1, 1, 1, ..., 1, 1, 1]])
The dimensions of the output are [number of samples, number of trees]. We trained with num_boost_round=100 trees and the test set has 25 samples, so ypred_leaf has shape [25, 100].
The first row means that, in every one of the 100 trees, the first sample falls into leaf node 1 (the index is relative to each individual tree).
How can we inspect each tree and the score of its leaf nodes? There are two ways: visualize a tree, or dump the model directly.
xgb.to_graphviz(bst, num_trees=0)
# visualize how the first tree was built

# or dump the model's boosting process directly to a text file
bst.dump_model("model.txt")
booster[0]:
0:[f3<0.75] yes=1,no=2,missing=1
    1:leaf=-0.019697
    2:leaf=0.0214286
booster[1]:
0:[f2<2.35] yes=1,no=2,missing=1
    1:leaf=-0.0212184
    2:leaf=0.0212
booster[2]:
0:[f2<2.35] yes=1,no=2,missing=1
    1:leaf=-0.0197404
    2:leaf=0.0197235
booster[3]:
...
The dump above shows the boosting process of the model, and you can see that each tree has only two leaf nodes (the trees are quite simple). If we sum, over all the trees, the value of the leaf node a sample falls into, and then apply the appropriate link function, we get the predicted score of that sample; a short sketch after the two sums below reproduces this accumulation.
Taking the first sample as an example, it falls into leaf node 1 in every tree, so accumulating those leaf values gives the first number below.
Similarly, the second sample falls into leaf node 2 in every tree, so accumulating gives the second number.
leaf1   -1.381214
leaf2    1.410950
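The accumulation can also be reproduced programmatically. The sketch below is my own helper, not part of the original run: it parses the leaf values out of bst.get_dump() and sums, for the first test sample, the value of the leaf it falls into in every tree.

import re

# sketch: rebuild the accumulated leaf value of the first test sample,
# assuming bst and ypred_leaf from the steps above
tree_leaf_values = []
for tree in bst.get_dump():                  # one text dump per boosting round
    values = {}
    for line in tree.splitlines():
        m = re.search(r'(\d+):leaf=([-\d.eE]+)', line)
        if m:
            values[int(m.group(1))] = float(m.group(2))
    tree_leaf_values.append(values)

margin_0 = sum(tree_leaf_values[i][int(leaf)]
               for i, leaf in enumerate(ypred_leaf[0]))
print(margin_0)   # expected to be close to -1.381214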
At the very beginning, when the Xgboost model was set up, we chose 'objective': 'binary:logistic', so the accumulated value is converted into the actual score with the logistic function:

f(x) = 1 / (1 + exp(-x))
import numpy as np

1 / float(1 + np.exp(1.38121416))
# Out[24]: 0.20081407112186503
1 / float(1 + np.exp(-1.410950))
# Out[25]: 0.8039157403338895
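The same relationship can be checked in one step with output_margin=True, which (see the predict signature quoted earlier) returns the raw accumulated value before the logistic transformation. A small sketch, again reusing the objects built above:

# sketch: output_margin=True gives the raw sum of leaf values; pushing it through
# the logistic function should reproduce the default scores of bst.predict(dtest)
margin = bst.predict(dtest, output_margin=True)
print(margin[:2])                          # roughly [-1.3812  1.4110]
print(1.0 / (1.0 + np.exp(-margin[:2])))   # roughly [ 0.2008  0.8039]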
This matches the scores returned by ypred = bst.predict(dtest).

4.3.3 Feature importance

Next, let us look at the third kind of output: the contribution of each feature to the score.
ypred_contribs = bst.predict(dtest, pred_contribs=True)
ypred_contribs

Out[37]:
array([[ 0.        ,  0.        , -1.01448286, -0.41277751,  0.04604663],
       [ 0.        ,  0.        ,  0.96967536,  0.39522746,  0.04604663],
       [ 0.        ,  0.        , -1.01448286, -0.41277751,  0.04604663],
       [ 0.        ,  0.        ,  0.96967536,  0.39522746,  0.04604663],
       [ 0.        ,  0.        ,  0.96967536,  0.39522746,  0.04604663],
       [ 0.        ,  0.        ,  0.96967536,  0.39522746,  0.04604663],
       [ 0.        ,  0.        , -1.01448286, -0.41277751,  0.04604663],
       [ 0.        ,  0.        ,  0.96967536,  0.39522746,  0.04604663],
       [ 0.        ,  0.        ,  0.96967536,  0.39522746,  0.04604663],
       [ 0.        ,  0.        ,  0.96967536,  0.39522746,  0.04604663],
       [ 0.        ,  0.        ,  0.96967536,  0.39522746,  0.04604663],
       [ 0.        ,  0.        ,  0.96967536,  0.39522746,  0.04604663],
       [ 0.        ,  0.        ,  0.96967536,  0.39522746,  0.04604663],
       [ 0.        ,  0.        , -1.01448286, -0.41277751,  0.04604663],
       [ 0.        ,  0.        , -1.01448286, -0.41277751,  0.04604663],
       [ 0.        ,  0.        , -1.01448286, -0.41277751,  0.04604663],
       [ 0.        ,  0.        , -1.01448286, -0.41277751,  0.04604663],
       [ 0.        ,  0.        , -1.01448286, -0.41277751,  0.04604663],
       [ 0.        ,  0.        , -1.01448286, -0.41277751,  0.04604663],
       [ 0.        ,  0.        , -1.01448286, -0.41277751,  0.04604663],
       [ 0.        ,  0.        , -1.01448286, -0.41277751,  0.04604663],
       [ 0.        ,  0.        ,  0.96967536,  0.39522746,  0.04604663],
       [ 0.        ,  0.        , -1.01448286, -0.41277751,  0.04604663],
       [ 0.        ,  0.        ,  0.96967536,  0.39522746,  0.04604663],
       [ 0.        ,  0.        , -1.01448286, -0.41277751,  0.04604663]], dtype=float32)
The output ypred_contribs has dimensions [25, 5]. From the documentation quoted earlier, the last column is the bias and the first four columns are each feature's contribution to the final score; you can also see that the first two features contribute nothing at all.
How does this output relate to the final score? The principle is exactly the same; take the previous two samples as examples again.
score_a = sum(ypred_contribs[0])
print score_a
# -1.38121373579
score_b = sum(ypred_contribs[1])
print score_b
# 1.41094945744
These are the same accumulated values as before, and the same logistic conversion applies; a quick check over all samples is sketched below.
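As a final consistency check, this sketch (reusing ypred, ypred_contribs and numpy from above) verifies that, for every test sample, the row sum of the contributions pushed through the logistic function reproduces the predicted score.

# sketch: per-sample SHAP contributions (plus the bias column) sum to the raw
# margin, and the logistic transform of that sum matches the default score
recovered = 1.0 / (1.0 + np.exp(-ypred_contribs.sum(axis=1)))
print(np.allclose(recovered, ypred, atol=1e-5))   # expected: True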
That concludes this simple walk-through of the xgboost algorithm in Python and of the outputs met along the way: the predicted score, the leaf node each sample falls into in each tree, the contribution of each feature to each sample's score, and the relationship among the three. A little more knowledge accumulated!
References
http://xgboost.readthedocs.io/en/latest/python/python_api.html
https://www.kaggle.com/general/20322