XGBoost Introduction and Practice (Hands-on Tuning)
Preface
The previous posts in this series covered the underlying theory; now it is time to run a model on real data. The data used in this article comes from Kaggle, which anyone learning machine learning will know: several of the older competitions remain open for practice, which makes them well suited to beginners, and there are plenty of solution write-ups and discussions from experienced users to help you get started. This time the data comes from the "Digit Recognizer" competition, which classifies handwritten digits using the famous MNIST dataset: each sample corresponds to a 28x28 pixel image, and each pixel is one feature. The advantage of this dataset is that no feature engineering is needed, so it is convenient for getting hands-on with the model.
For XGBoost installation, see: http://blog.csdn.net/sb19931201/article/details/52236020
Further reading: XGBoost plotting API and GBDT feature combination in practice
Dataset
1. Data introduction: the competition uses the MNIST dataset, which is widely used in the machine-learning community:
The data for this competition were taken from the MNIST dataset. The MNIST ("Modified National Institute of Standards and Technology") dataset is a classic within the machine learning community that has been extensively studied. More detail about the dataset, including machine learning algorithms that have been tried on it and their levels of success, can be found at http://yann.lecun.com/exdb/mnist/index.html.
2. Training set (42,000 samples in total): as shown in the figure below, the first column is the label, followed by 28x28 = 784 pixel features, each with a value in the range 0-255.
3. Test set (28,000 samples in total): this is the data we need to predict. It has no label column; we train a model on train.csv and then predict the class (a digit label from 0 to 9) for each of the 28,000 rows of test.csv.
4. Submission format: two columns, an ImageId column and a column with the predicted label. (A quick way to sanity-check these files with pandas is sketched right after this list.)
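Before modelling, it can help to sanity-check these files. Below is a minimal sketch (assuming the same Digit_recognizer/ file layout used in the training script later in this post) that prints the shapes and the label distribution with pandas:

import pandas as pd

# load the competition files (paths assumed to match the training script below)
train = pd.read_csv("Digit_recognizer/train.csv")
tests = pd.read_csv("Digit_recognizer/test.csv")

print(train.shape)                     # expected (42000, 785): label column + 784 pixel columns
print(tests.shape)                     # expected (28000, 784): pixel columns only, no label
print(train['label'].value_counts())   # how many samples of each digit 0-9
print(train.iloc[:, 1:].values.max())  # pixel values lie in the range 0-255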
XGBoost model tuning and training (Python)
1. Import the libraries and read the data
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.cross_validation import train_test_split

# record the program start time
import time
start_time = time.time()

# read in the data
train = pd.read_csv("Digit_recognizer/train.csv")
tests = pd.read_csv("Digit_recognizer/test.csv")
2. Split the dataset
# use train_test_split from sklearn.cross_validation to split the training data;
# the training/validation ratio here is 7:3 and can be adjusted as needed
train_xy, val = train_test_split(train, test_size=0.3, random_state=1)
y = train_xy.label
X = train_xy.drop(['label'], axis=1)
val_y = val.label
val_x = val.drop(['label'], axis=1)
# build the xgb DMatrix objects
xgb_val = xgb.DMatrix(val_x, label=val_y)
xgb_train = xgb.DMatrix(X, label=y)
xgb_test = xgb.DMatrix(tests)
3. The XGBoost model
params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',  # multi-class classification
    'num_class': 10,          # number of classes, used together with multi:softmax
    'gamma': 0.1,             # controls post-pruning; the larger, the more conservative, typically around 0.1 or 0.2
    'max_depth': 12,          # depth of each tree; the deeper, the easier it is to overfit
    'lambda': 2,              # L2 regularization term on the weights; the larger, the less the model overfits
    'subsample': 0.7,         # row subsampling of the training samples
    'colsample_bytree': 0.7,  # column subsampling when building each tree
    'min_child_weight': 3,    # minimum sum of the Hessian h in a leaf (default 1); for an imbalanced 0-1
                              # problem with h near 0.01, min_child_weight = 1 means a leaf needs at least
                              # about 1/0.01 = 100 samples. This parameter strongly affects the result:
                              # the smaller it is, the easier the model overfits.
    'silent': 0,              # 1 suppresses the running log; 0 is recommended
    'eta': 0.007,             # learning rate
    'seed': 1000,
    'nthread': 7,             # number of CPU threads
    # 'eval_metric': 'auc'
}
plst = list(params.items())
num_rounds = 5000  # number of boosting iterations
watchlist = [(xgb_train, 'train'), (xgb_val, 'val')]

# train the model and save it
# with a large number of iterations, early_stopping_rounds stops training once the
# validation metric has not improved for the given number of rounds
model = xgb.train(plst, xgb_train, num_rounds, watchlist, early_stopping_rounds=100)
model.save_model('./model/xgb.model')  # save the trained model
print "best_ntree_limit:", model.best_ntree_limit
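Since the booster is saved to ./model/xgb.model above, here is a minimal sketch (not part of the original script) of how the saved model could be reloaded later for prediction. Note that best_ntree_limit is a Python-side attribute set by early stopping and, to my knowledge, is not preserved in the binary model file, so record it separately if you need it:

import xgboost as xgb

# reload the booster written by model.save_model() above
bst = xgb.Booster()
bst.load_model('./model/xgb.model')

# best_ntree_limit is not stored in the model file, so here we simply
# predict with all saved trees (or pass a recorded ntree_limit explicitly)
preds_reloaded = bst.predict(xgb_test)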
4. Predict and save the results
preds = model.predict(xgb_test, ntree_limit=model.best_ntree_limit)
np.savetxt('xgb_submission.csv', np.c_[range(1, len(tests) + 1), preds],
           delimiter=',', header='ImageId,Label', comments='', fmt='%d')

# print the total running time
cost_time = time.time() - start_time
print "xgboost success!", '\n', "cost time:", cost_time, "(s)"
5. Variable Information
Evaluation of the prediction results
Upload the predicted xgb_submission.csv file to Kaggle and see the system score.
Because the number of iterations and the tree depth are both set fairly high, the program is still training. After this run finishes I will reduce these two parameters and see how the running time and the prediction accuracy change.
I will leave a placeholder here for now and post the results of the runs with different settings below.
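As a rough sketch of how such a comparison could be scripted (the candidate values for max_depth and the number of rounds below are illustrative assumptions, not the settings actually used for the versions listed next; it reuses params, xgb_train, xgb_val, watchlist and val_y from the code above):

# Sketch: retrain with a few smaller max_depth / num_boost_round settings and
# record the running time and validation error for each.
import time
import xgboost as xgb

for depth, n_rounds in [(6, 1000), (8, 2000), (12, 5000)]:
    p = dict(params, max_depth=depth)   # copy the params dict defined above
    t0 = time.time()
    m = xgb.train(list(p.items()), xgb_train, n_rounds, watchlist,
                  early_stopping_rounds=100)
    val_pred = m.predict(xgb_val, ntree_limit=m.best_ntree_limit)
    val_err = (val_pred != val_y.values).mean()
    print("max_depth=%d, rounds=%d: val error=%.4f, time=%.0f s"
          % (depth, n_rounds, val_err, time.time() - t0))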
Version 1 — running time: the program crashed partway through, after roughly 2500 s
Parameters:
Results:
Version 2 — running time: 2275 s
Parameters:
Results:
(Result screenshots)
Attached: the complete code
# coding=utf-8
"""
Created on 2016/09/17
"""
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.cross_validation import train_test_split
# from xgboost.sklearn import XGBClassifier
# from sklearn import cross_validation, metrics   # additional sklearn functions
# from sklearn.grid_search import GridSearchCV    # performing grid search
#
# import matplotlib.pylab as plt
# from matplotlib.pylab import rcParams

# record the program start time
import time
start_time = time.time()

# read in the data
train = pd.read_csv("digit_recognizer/train.csv")
tests = pd.read_csv("digit_recognizer/test.csv")

params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',  # multi-class classification
    'num_class': 10,          # number of classes, used together with multi:softmax
    'gamma': 0.1,             # controls post-pruning; the larger, the more conservative, typically around 0.1 or 0.2
    'max_depth': 12,          # depth of each tree; the deeper, the easier it is to overfit
    'lambda': 2,              # L2 regularization term on the weights; the larger, the less the model overfits
    'subsample': 0.7,         # row subsampling of the training samples
    'colsample_bytree': 0.7,  # column subsampling when building each tree
    'min_child_weight': 3,    # minimum sum of the Hessian h in a leaf (default 1); for an imbalanced 0-1
                              # problem with h near 0.01, min_child_weight = 1 means a leaf needs at least
                              # about 100 samples. This parameter strongly affects the result:
                              # the smaller it is, the easier the model overfits.
    'silent': 0,              # 1 suppresses the running log; 0 is recommended
    'eta': 0.007,             # learning rate
    'seed': 1000,
    'nthread': 7,             # number of CPU threads
    # 'eval_metric': 'auc'
}
plst = list(params.items())
num_rounds = 5000  # number of boosting iterations

train_xy, val = train_test_split(train, test_size=0.3, random_state=1)
# random_state has a big influence on the validation score
y = train_xy.label
X = train_xy.drop(['label'], axis=1)
val_y = val.label
val_x = val.drop(['label'], axis=1)

xgb_val = xgb.DMatrix(val_x, label=val_y)
xgb_train = xgb.DMatrix(X, label=y)
xgb_test = xgb.DMatrix(tests)

watchlist = [(xgb_train, 'train'), (xgb_val, 'val')]

# train the model
# with a large number of iterations, early_stopping_rounds stops training once the
# validation metric has not improved for the given number of rounds
model = xgb.train(plst, xgb_train, num_rounds, watchlist, early_stopping_rounds=100)

model.save_model('./model/xgb.model')  # save the trained model
print "best_ntree_limit:", model.best_ntree_limit

print "running model.predict"
preds = model.predict(xgb_test, ntree_limit=model.best_ntree_limit)

np.savetxt('xgb_submission.csv', np.c_[range(1, len(tests) + 1), preds],
           delimiter=',', header='ImageId,Label', comments='', fmt='%d')

# print the total running time
cost_time = time.time() - start_time
print "xgboost success!", '\n', "cost time:", cost_time, "(s)"