XGBoost plotting API and GBDT combined-feature practice
A few words up front:
I have recently been digging into tree models and wanted to write some of it up. Just last night I saw echoes share his MachineLearningTrick repository on GitHub, so I jumped on board to learn from it. This batch of XGBoost material is solid (some content has yet to be shared) and well worth a look. I focused mainly on his encapsulated function that turns XGBoost leaf-node positions into new features. I had read related posts before, such as this article by Bryan: http://blog.csdn.net/bryan__/article/details/51769118, but I had never actually put the idea into practice. This post builds on Bryan's blog and echoes' code to explain and extend the practice of GBDT combined features, and also explores XGBoost's plotting API. It is written for learning purposes.
Official API documentation:
http://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn

1. The principle of constructing combined features with GBDT
Bryan's blog post on constructing new features with a GBDT model explains the idea of GBDT combined features well:
The idea of the paper is simple: train a GBDT model on the existing features, use the trained model to construct new features, and finally add those new features to the original ones and train a second model. Each constructed feature takes a 0/1 value, and each element of the new feature vector corresponds to one leaf node of one tree in the GBDT model. When a sample passes through a tree and lands on one of its leaves, that leaf's element in the new feature vector is set to 1, while the elements for the tree's other leaves are 0. The length of the new feature vector therefore equals the total number of leaf nodes across all trees in the GBDT model.
A concrete example: suppose GBDT has learned the two trees in the figure below, where the first tree has 3 leaf nodes and the second has 2. For an input sample x that falls on the second leaf of the first tree and the first leaf of the second tree, the new feature vector produced by GBDT is [0, 1, 0, 1, 0]: the first three positions correspond to the 3 leaves of the first tree, and the last two positions to the 2 leaves of the second tree.
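To make that encoding concrete, here is a minimal sketch of the mapping (my own illustration, not code from the paper or the original post): two trees with 3 and 2 leaves, and a sample landing on leaf 1 of the first tree and leaf 0 of the second (0-based indices).

import numpy as np

n_leaves = [3, 2]      # leaves per tree: the first tree has 3, the second 2
leaf_index = [1, 0]    # 0-based leaf the sample lands on in each tree

new_feature = np.zeros(sum(n_leaves))
offset = 0
for leaves, idx in zip(n_leaves, leaf_index):
    new_feature[offset + idx] = 1.0   # one-hot the leaf within this tree's block
    offset += leaves

print(new_feature)  # [0. 1. 0. 1. 0.]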
The key point in practice is how to obtain, for every sample, the leaf node it lands on in each tree of the trained model. I remembered that you can set pred_leaf=True to get each sample's leaf index on every tree, so I opened the XGBoost official documentation to look up the API:
It turns out this parameter belongs to predict(): after training on the original features, calling new_feature = bst.predict(d_test, pred_leaf=True) on the training and test data returns an (nsample, ntrees) matrix whose entries are the index of the leaf each sample lands on in each tree. Knowing this, I read echoes' code carefully and found that he did not use this approach, as follows:
He uses an apply() method instead, which puzzled me at first, because no such method appears in the XGBoost official API. So I checked the sklearn GBDT API, and sure enough there is an apply() method that returns leaf indices:
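For reference, here is a quick check of sklearn's apply() (my own sketch, not echoes' code; the GradientBoostingClassifier and its parameters are just for illustration):

from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_hastie_10_2(random_state=0)
gbdt = GradientBoostingClassifier(n_estimators=30, random_state=0).fit(X, y)

# apply() returns the leaf index of every sample in every tree;
# for sklearn's GBDT the result has shape (n_samples, n_estimators, n_classes)
leaves = gbdt.apply(X)
print(leaves.shape)  # (12000, 30, 1)

XGBoost's scikit-learn wrapper (XGBClassifier) exposes the same method, returning an (n_samples, n_trees) matrix of leaf indices, which is what the code below relies on.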
The two code paths differ because XGBoost has both its own native interface and a scikit-learn interface. At this point, the way GBDT (XGBoost) constructs combined features is basically clear, so next I practice it with both interfaces.

2. Constructing combined features with GBDT in practice
Off we go ~
(1). Package imports and data preparation
from sklearn.model_selection import train_test_split
from pandas import DataFrame
from sklearn import metrics
from sklearn.datasets import make_hastie_10_2
from xgboost.sklearn import XGBClassifier
import xgboost as xgb

# Prepare the data. y is originally in {-1, 1}; XGBoost's binary:logistic
# objective expects labels in {0, 1}, so map -1 to 0.
X, y = make_hastie_10_2(random_state=0)
X = DataFrame(X)
y = DataFrame(y)
y.columns = {"label"}
label = {-1: 0, 1: 1}
y.label = y.label.map(label)

# Split the data set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
y_train.head()
      label
843       1
9450      0
7766      1
9802      1
8555      1
(2). Defining XGBoost's two interfaces
# XGBoost's native interface
params = {
    'eta': 0.3,
    'max_depth': 3,
    'min_child_weight': 1,
    'gamma': 0.3,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'booster': 'gbtree',
    'objective': 'binary:logistic',
    'nthread': 12,
    'scale_pos_weight': 1,
    'lambda': 1,
    'seed': 27,
    'silent': 0,
    'eval_metric': 'auc'
}
d_train = xgb.DMatrix(X_train, label=y_train)
d_valid = xgb.DMatrix(X_test, label=y_test)
d_test = xgb.DMatrix(X_test)
watchlist = [(d_train, 'train'), (d_valid, 'valid')]

# sklearn interface
clf = XGBClassifier(
    n_estimators=30,          # 30 trees
    learning_rate=0.3,
    max_depth=3,
    min_child_weight=1,
    gamma=0.3,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    nthread=12,
    scale_pos_weight=1,
    reg_lambda=1,
    seed=27)

# 30 boosting rounds for the native interface, matching n_estimators above
model_bst = xgb.train(params, d_train, 30, watchlist, early_stopping_rounds=500, verbose_eval=10)
model_sklearn = clf.fit(X_train, y_train)

y_bst = model_bst.predict(d_test)
y_sklearn = clf.predict_proba(X_test)[:, 1]
Print ("Xgboost_ comes with interface AUC score:%f"% Metrics.roc_auc_score (y_test, Y_bst))
print ("Xgboost_sklearn Interface AUC score :%f "% Metrics.roc_auc_score (y_test, Y_sklearn))
(3). Generating the two sets of new features
Print ("Original train Size:", X_train.shape)
print ("Original Test Size:", X_test.shape)
# #XGBoost自带接口生成的新特征
train_new_ Feature= model_bst.predict (D_train, pred_leaf=true)
test_new_feature= model_bst.predict (d_test, Pred_leaf=True )
Train_new_feature1 = DataFrame (train_new_feature)
test_new_feature1 = DataFrame (test_new_feature)
Print ("new feature set (with interface):", Train_new_feature1.shape)
print ("New test set (with interface):", Test_new_feature1.shape)
# New features generated by the Sklearn interface
train_new_feature= clf.apply (x_train) #每个样本在每颗树叶子节点的索引值
test_new_feature= clf.apply (x_ Test)
Train_new_feature2 = DataFrame (train_new_feature)
test_new_feature2 = DataFrame (test_new_feature)
Print ("new feature set (Sklearn interface):", Train_new_feature2.shape)
print ("New test set (Sklearn interface):", Test_new_feature2.shape)
train_new_feature1.head()
(Output: the first rows of train_new_feature1, a DataFrame with one column per tree, columns 0, 1, 2, ... holding the leaf index each sample lands on.)
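The recipe from section 1 has one step left: one-hot encode the leaf indices and feed them, together with the original features, to a second model. A hedged sketch of that step follows, using the variables defined above plus sklearn's OneHotEncoder and LogisticRegression (the choice of LR and its settings are my assumption, following the usual GBDT+LR pairing).

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

# One-hot encode the leaf indices from the sklearn-interface model
enc = OneHotEncoder(handle_unknown='ignore')
train_leaves = enc.fit_transform(train_new_feature2).toarray()
test_leaves = enc.transform(test_new_feature2).toarray()

# Stack the encoded leaf features onto the original features
X_train_ext = np.hstack([X_train.values, train_leaves])
X_test_ext = np.hstack([X_test.values, test_leaves])

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_ext, y_train.label)
print("LR on original + leaf features AUC: %f"
      % metrics.roc_auc_score(y_test, lr.predict_proba(X_test_ext)[:, 1]))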