This post continues my previous article on scikit-learn feature selection, XGBoost regression prediction, and model optimization, and optimizes the model built there, so please read that article before this one.
The earlier work was mostly about feature selection; here I want to share some small experiences with tuning XGBoost parameters. I have seen a lot of related content online, mostly translated from the same English blog, but there are plenty of pitfalls: many articles have incomplete steps, which makes them very confusing for newcomers. Since I am also a novice and stepped into many of these pits myself, I hope this post can help you avoid them. Now, let's get to the point.
First of all, fortunately, scikit-learn provides a function to help us tune parameters more conveniently:
sklearn.model_selection.GridSearchCV
Interpretation of the common parameters:
estimator: the model to tune. For a competition entry using XGBoost, this is the model you built, e.g. model = xgb.XGBRegressor(**other_params)
param_grid: a dict (or list of dicts) of the parameter values to optimize, e.g. cv_params = {'n_estimators': [550, 575, 600, 650, 675]}
scoring: the evaluation criterion. The default is None, in which case the estimator's own score function is used. You can also pass a string such as scoring='roc_auc' (different models call for different criteria) or a callable whose signature is scorer(estimator, X, y). The available scoring strings can be found at the reference below.
Reference: http://scikit-learn.org/stable/modules/model_evaluation.html
In this walkthrough I use the r2 scoring function; of course, you can choose another one according to your actual needs.
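If none of the built-in scoring strings fit your needs, scikit-learn also lets you wrap any metric into a scorer with make_scorer. A minimal sketch (my own illustration, not part of the tuning code below):

from sklearn.metrics import make_scorer, mean_absolute_error

# greater_is_better=False because a smaller MAE is better; scikit-learn negates
# the metric so that GridSearchCV can always maximize the score
mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)
# then pass it in as: GridSearchCV(estimator=model, param_grid=cv_params, scoring=mae_scorer, cv=5)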
When you first start tuning, you typically initialize the parameters with some reasonable values:
learning_rate: 0.1
n_estimators: 500
max_depth: 5
min_child_weight: 1
subsample: 0.8
colsample_bytree: 0.8
gamma: 0
reg_alpha: 0
reg_lambda: 1
Link: a list of commonly used xgboost parameters
You can set the initial values according to your own situation; the above are just empirical defaults.
The parameters are usually tuned in the following order:
1. The best number of boosting iterations: n_estimators
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

if __name__ == '__main__':
    trainFilePath = 'dataset/soccer/train.csv'
    testFilePath = 'dataset/soccer/test.csv'
    data = pd.read_csv(trainFilePath)
    X_train, y_train = featureSet(data)      # feature engineering from the previous post
    X_test = loadTestData(testFilePath)
    cv_params = {'n_estimators': [400, 500, 600, 700, 800]}
    other_params = {'learning_rate': 0.1, 'n_estimators': 500, 'max_depth': 5, 'min_child_weight': 1, 'seed': 0,
                    'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0, 'reg_alpha': 0, 'reg_lambda': 1}
    model = xgb.XGBRegressor(**other_params)
    optimized_GBM = GridSearchCV(estimator=model, param_grid=cv_params, scoring='r2', cv=5, verbose=1, n_jobs=4)
    optimized_GBM.fit(X_train, y_train)
    evalute_result = optimized_GBM.grid_scores_   # old scikit-learn attribute; see the note below
    print('Results of each iteration round: {0}'.format(evalute_result))
    print('Best parameter values: {0}'.format(optimized_GBM.best_params_))
    print('Best model score: {0}'.format(optimized_GBM.best_score_))
Before going on, a reminder about a key point in the code above:
In model = xgb.XGBRegressor(**other_params), the two asterisks must not be omitted. Many people may not notice this, and since many online tutorials were presumably copied straight from someone else without ever being run, they simply write model = xgb.XGBRegressor(other_params). The tragedy is that running that directly reports the following error:
xgboost.core.XGBoostError: b"Invalid Parameter format for max_depth expect int but value ...
If you don't believe it, see this link: xgboost issue
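The reason is plain Python rather than XGBoost: ** unpacks a dict into keyword arguments, while passing the dict without ** hands the whole dict to the first positional parameter. A minimal illustration (my own example, nothing to do with the soccer data):

def make_model(max_depth=5, learning_rate=0.1):
    print(max_depth, learning_rate)

params = {'max_depth': 4, 'learning_rate': 0.05}
make_model(**params)   # unpacked: prints "4 0.05"
make_model(params)     # the whole dict lands in max_depth, which is then no longer an int

That is exactly why XGBoost complains that max_depth should be an int.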
The above is a lesson learned in blood: if you never run your code yourself, you never know what bugs will appear.
The results after the run are:
[Parallel(n_jobs=4)]: Done 25 out of 25 | elapsed: 1.5min finished
Results of each iteration round: [mean: 0.94051, std: 0.01244, params: {'n_estimators': 400}, mean: 0.94057, std: 0.01244, params: {'n_estimators': 500}, mean: 0.94061, std: 0.01230, params: {'n_estimators': 600}, mean: 0.94060, std: 0.01223, params: {'n_estimators': 700}, mean: 0.94058, std: 0.01231, params: {'n_estimators': 800}]
Best parameter values: {'n_estimators': 600}
Best model score: 0.9406056804545407
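A quick note on grid_scores_: it only exists in old versions of scikit-learn and was later removed in favor of cv_results_. If the line above fails for you, the following sketch (my addition, not part of the original code) prints the equivalent information on current versions:

means = optimized_GBM.cv_results_['mean_test_score']
stds = optimized_GBM.cv_results_['std_test_score']
params = optimized_GBM.cv_results_['params']
for mean, std, param in zip(means, stds, params):
    # the same mean/std/params triples that grid_scores_ used to report
    print('mean: {0:.5f}, std: {1:.5f}, params: {2}'.format(mean, std, param))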
The output shows that the optimal number of iterations is 600. But we should not take that as the final result, because the step size of this grid was too large, so I tested another set with a smaller step size:
cv_params = {'n_estimators': [550, 575, 600, 650, 675]}
other_params = {'learning_rate': 0.1, 'n_estimators': 500, 'max_depth': 5, 'min_child_weight': 1, 'seed': 0,
                'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0, 'reg_alpha': 0, 'reg_lambda': 1}
The results after the run are:
[Parallel(n_jobs=4)]: Done 25 out of 25 | elapsed: 1.5min finished
Results of each iteration round: [mean: 0.94065, std: 0.01237, params: {'n_estimators': 550}, mean: 0.94064, std: 0.01234, params: {'n_estimators': 575}, mean: 0.94061, std: 0.01230, params: {'n_estimators': 600}, mean: 0.94060, std: 0.01226, params: {'n_estimators': 650}, mean: 0.94060, std: 0.01224, params: {'n_estimators': 675}]
Best parameter values: {'n_estimators': 550}
Best model score: 0.9406545392685364
Sure enough, the best number of iterations changed to 550. You might ask, "Should we keep shrinking the step size for even finer testing?" I think that depends on your situation: if you want higher precision, a smaller step size gives a more accurate result, and you can keep refining; I did not go further here.
2. The next parameters to tune are min_child_weight and max_depth:
Note: each time one group of parameters has been tuned, the corresponding entries in other_params should be updated to their optimal values (a small helper that automates this is sketched after this step's results).
cv_params = {'max_depth': [3, 4, 5, 6, 7, 8, 9, 10], 'min_child_weight': [1, 2, 3, 4, 5, 6]}
other_params = {'learning_rate': 0.1, 'n_estimators': 550, 'max_depth': 5, 'min_child_weight': 1, 'seed': 0,
                'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0, 'reg_alpha': 0, 'reg_lambda': 1}
The results after the run are:
[Parallel(n_jobs=4)]: Done ... tasks | elapsed: 1.7min
[Parallel(n_jobs=4)]: Done ... tasks | elapsed: 12.3min
[Parallel(n_jobs=4)]: Done 240 out of 240 | elapsed: 17.2min finished
Results of each iteration round: [mean: 0.93967, std: 0.01334, params: {'min_child_weight': 1, 'max_depth': 3}, mean: 0.93826, std: 0.01202, params: {'min_child_weight': 2, 'max_depth': 3}, mean: 0.93739, std: 0.01265, params: {'min_child_weight': 3, 'max_depth': 3}, mean: 0.93827, std: 0.01285, params: {'min_child_weight': 4, 'max_depth': 3}, mean: 0.93680, std: 0.01219, params: {'min_child_weight': 5, 'max_depth': 3}, mean: 0.93640, std: 0.01231, params: {'min_child_weight': 6, 'max_depth': 3}, mean: 0.94277, std: 0.01395, params: {'min_child_weight': 1, 'max_depth': 4}, mean: 0.94261, std: 0.01173, params: {'min_child_weight': 2, 'max_depth': 4}, mean: 0.94276, std: 0.01329 ...]
Best parameter values: {'min_child_weight': 5, 'max_depth': 4}
Best model score: 0.94369522247392
From the output, the best parameter values are {'min_child_weight': 5, 'max_depth': 4}. (I omitted part of the code's output above because it was too long; the same applies below.)
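Since every step below repeats the same pattern (build a model from other_params, grid-search one or two parameters, then write the winners back), here is a small helper I find convenient; this is my own sketch, not code from the original post, and it assumes the same GridSearchCV settings used throughout:

def tune(cv_params, other_params, X_train, y_train):
    # one tuning round: search cv_params, then fold the best values back into other_params
    model = xgb.XGBRegressor(**other_params)
    gs = GridSearchCV(estimator=model, param_grid=cv_params, scoring='r2', cv=5, verbose=1, n_jobs=4)
    gs.fit(X_train, y_train)
    other_params.update(gs.best_params_)   # the next round then starts from the optimum
    print('Best parameter values: {0}'.format(gs.best_params_))
    print('Best model score: {0}'.format(gs.best_score_))
    return other_params

With it, each of the following steps becomes a single call, e.g. other_params = tune({'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]}, other_params, X_train, y_train).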
3. Next we start tuning gamma:
cv_params = {'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]}
other_params = {'learning_rate': 0.1, 'n_estimators': 550, 'max_depth': 4, 'min_child_weight': 5, 'seed': 0,
                'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0, 'reg_alpha': 0, 'reg_lambda': 1}
The results after the run are:
[Parallel(n_jobs=4)]: Done 30 out of 30 | elapsed: 1.5min finished
Results of each iteration round: [mean: 0.94370, std: 0.01010, params: {'gamma': 0.1}, mean: 0.94370, std: 0.01010, params: {'gamma': 0.2}, mean: 0.94370, std: 0.01010, params: {'gamma': 0.3}, mean: 0.94370, std: 0.01010, params: {'gamma': 0.4}, mean: 0.94370, std: 0.01010, params: {'gamma': 0.5}, mean: 0.94370, std: 0.01010, params: {'gamma': 0.6}]
Best parameter values: {'gamma': 0.1}
Best model score: 0.94369522247392
From the output, the best value is {'gamma': 0.1}. (Notice that all six candidates scored exactly the same here, so gamma barely matters in this range and GridSearchCV simply returns the first of the tied values.)
4. Then come subsample and colsample_bytree:
cv_params = {'subsample': [0.6, 0.7, 0.8, 0.9], 'colsample_bytree': [0.6, 0.7, 0.8, 0.9]}
other_params = {'learning_rate': 0.1, 'n_estimators': 550, 'max_depth': 4, 'min_child_weight': 5, 'seed': 0,
                'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0.1, 'reg_alpha': 0, 'reg_lambda': 1}
The run shows that the best parameter values are {'subsample': 0.7, 'colsample_bytree': 0.7}
5. Next up are reg_alpha and reg_lambda:
cv_params = {'reg_alpha': [0.05, 0.1, 1, 2, 3], 'reg_lambda': [0.05, 0.1, 1, 2, 3]}
other_params = {'learning_rate': 0.1, 'n_estimators': 550, 'max_depth': 4, 'min_child_weight': 5, 'seed': 0,
                'subsample': 0.7, 'colsample_bytree': 0.7, 'gamma': 0.1, 'reg_alpha': 0, 'reg_lambda': 1}
The results after the run are:
[Parallel(n_jobs=4)]: Done ... tasks | elapsed: 2.0min
[Parallel(n_jobs=4)]: Done 125 out of 125 | elapsed: 5.6min finished
Results of each iteration round: [mean: 0.94169, std: 0.00997, params: {'reg_alpha': 0.01, 'reg_lambda': 0.01}, mean: 0.94112, std: 0.01086, params: {'reg_alpha': 0.01, 'reg_lambda': 0.05}, mean: 0.94153, std: 0.01093, params: {'reg_alpha': 0.01, 'reg_lambda': 0.1}, mean: 0.94400, std: 0.01090, params: {'reg_alpha': 0.01, 'reg_lambda': 1}, mean: 0.93820, std: 0.01177, params: {'reg_alpha': 0.01, 'reg_lambda': ...}, mean: 0.94194, std: 0.00936, params: {'reg_alpha': 0.05, 'reg_lambda': 0.01}, mean: 0.94136, std: 0.01122, params: {'reg_alpha': 0.05, 'reg_lambda': 0.05}, mean: 0.94164, std: 0.01120 ...]
Best parameter values: {'reg_alpha': 1, 'reg_lambda': 1}
Best model score: 0.9441561344357595
From the output, the best values are {'reg_alpha': 1, 'reg_lambda': 1}.
6. Last comes learning_rate; generally at this stage you test smaller learning rates:
cv_params = {'learning_rate': [0.01, 0.05, 0.07, 0.1, 0.2]}
other_params = {'learning_rate': 0.1, 'n_estimators': 550, 'max_depth': 4, 'min_child_weight': 5, 'seed': 0,
                'subsample': 0.7, 'colsample_bytree': 0.7, 'gamma': 0.1, 'reg_alpha': 1, 'reg_lambda': 1}
The results after the run are:
[Parallel(n_jobs=4)]: Done 25 out of 25 | elapsed: 1.1min finished
Results of each iteration round: [mean: 0.93675, std: 0.01080, params: {'learning_rate': 0.01}, mean: 0.94229, std: 0.01138, params: {'learning_rate': 0.05}, mean: 0.94110, std: 0.01066, params: {'learning_rate': 0.07}, mean: 0.94416, std: 0.01037, params: {'learning_rate': 0.1}, mean: 0.93985, std: 0.01109, params: {'learning_rate': 0.2}]
Best parameter values: {'learning_rate': 0.1}
Best model score: 0.9441561344357595
From the output, the best value is {'learning_rate': 0.1}.
We can clearly see that, as the parameters were tuned, the best model score kept improving, which confirms from another angle that tuning does play a role. However, we can also see that the best score did not improve by much. A reminder: this score is computed by the scoring function set earlier, namely:
optimized_GBM = GridSearchCV(estimator=model, param_grid=cv_params, scoring='r2', cv=5, verbose=1, n_jobs=4)
that is, the scoring='r2' part. In real situations, we may need several different scoring functions to judge the quality of a model.
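If you are on a newer scikit-learn (0.19 or later), GridSearchCV can even evaluate several metrics in a single search; a minimal sketch under that assumption (my addition, not from the original code):

scoring = {'r2': 'r2', 'neg_mae': 'neg_mean_absolute_error'}
optimized_GBM = GridSearchCV(estimator=model, param_grid=cv_params,
                             scoring=scoring, refit='r2',   # refit names the metric that picks best_params_
                             cv=5, verbose=1, n_jobs=4)
optimized_GBM.fit(X_train, y_train)
print(optimized_GBM.cv_results_['mean_test_neg_mae'])   # per-candidate scores for the second metric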
Finally, we can obtain the predictions by training the model with the best parameter combination:
def trainandTest(X_train, y_train, X_test):
    # XGBoost training; the parameters below are the best combination found by the tuning above
    model = xgb.XGBRegressor(learning_rate=0.1, n_estimators=550, max_depth=4, min_child_weight=5, seed=0,
                             subsample=0.7, colsample_bytree=0.7, gamma=0.1, reg_alpha=1, reg_lambda=1)
    model.fit(X_train, y_train)

    # Predict on the test set
    ans = model.predict(X_test)
    ans_len = len(ans)
    id_list = np.arange(10441, 17441)
    data_arr = []
    for row in range(0, ans_len):
        data_arr.append([int(id_list[row]), ans[row]])
    np_data = np.array(data_arr)

    # Write the submission file
    pd_data = pd.DataFrame(np_data, columns=['id', 'y'])
    # print(pd_data)
    pd_data.to_csv('submit.csv', index=False)

    # Show important features
    # plot_importance(model)
    # plt.show()
Well, the tuning process basically ends here. As I mentioned above, tuning the parameters does help the model's accuracy to some extent, but the effect is limited. The most important gains come from data cleaning, feature selection, feature fusion, model ensembling and so on.
The complete code can be downloaded from my GitHub. (One disclaimer: my code quality is not great; just take the ideas from it.)
For more practical content, welcome to my GitChat: