http://geek.csdn.net/news/detail/201207
XGBoost: eXtreme Gradient Boosting
Project address: https://github.com/dmlc/xgboost
XGBoost was originally developed by Tianqi Chen (http://homes.cs.washington.edu/~tqchen/) as an extensible, portable, distributed gradient boosting (GBDT, GBRT, or GBM) library that can be installed and used from C++, Python, R, Julia, Java, Scala, and Hadoop. Many contributors now develop and maintain it.
The algorithm XGBoost implements is the gradient boosting decision tree, which can be used for both classification and regression problems.
So what is gradient boosting?
Gradient boosting is one kind of boosting method. Boosting combines weak learners f_i(x) into a strong learner F(x).
So there are three elements of boosting:
A loss function to be optimized:
For example, cross entropy for classification problems and mean squared error for regression problems.
A weak learner to make predictions:
such as decision trees.
An additive model:
Several weak learners are added together to form a strong learner, so that the objective loss function is minimized.
Gradient boosting tries to correct the residual errors of all previous weak learners by adding a new weak learner; the learners added together make the final prediction, which is more accurate than any individual one. It is called "gradient" boosting because gradient descent is used to minimize the loss as new models are added.
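To make this concrete, here is a minimal sketch of the generic gradient boosting update (the standard textbook formulation, not anything specific to XGBoost): start from a constant model, then in each round fit a weak learner to the negative gradient of the loss (the pseudo-residuals) and add it with a shrinkage factor \eta:

F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)
r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}}
F_m(x) = F_{m-1}(x) + \eta \, f_m(x), \quad \text{where } f_m \text{ is a weak learner fit to the pseudo-residuals } r_{im}

For the squared error loss the pseudo-residuals are proportional to the ordinary residuals y_i - F_{m-1}(x_i), which is why the method is often described as "fitting the residuals".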
An early and influential realization of the boosting idea was AdaBoost (Adaptive Boosting).
AdaBoost combines a number of weak classifiers by weighted voting, giving classifiers with lower error a larger weight in the vote. At the same time, the sample distribution is adjusted in each round so that samples that were misclassified receive more attention.
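For reference, the usual AdaBoost update (standard textbook form, not quoted from this article) weights each classifier by its error and re-weights the samples every round:

\epsilon_m = \frac{\sum_i w_i \, \mathbf{1}[y_i \neq f_m(x_i)]}{\sum_i w_i}, \qquad \alpha_m = \tfrac{1}{2} \ln \frac{1 - \epsilon_m}{\epsilon_m}
w_i \leftarrow w_i \exp\big(\alpha_m \, \mathbf{1}[y_i \neq f_m(x_i)]\big), \qquad F(x) = \mathrm{sign}\Big(\sum_m \alpha_m f_m(x)\Big)

A classifier with low error gets a large \alpha_m, and misclassified samples have their weights increased, matching the description above.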
Why use XGBoost?
As mentioned, XGBoost is an implementation of gradient boosting decision trees, but gradient boosting implementations are generally slow, because trees are constructed one at a time and added to the model sequentially.
XGBoost is characterized by fast computation and good model performance, and these two points are exactly the goals of the project.
It is fast because of the following design choices:
Parallelization: all CPU cores can be used to build trees in parallel during training (see the sketch after this list).
Distributed computing: a cluster of machines can be used to train very large models.
Out-of-core computing: datasets too large to fit in memory can still be processed.
Cache optimization of data structures and algorithms: the hardware is used more effectively.
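For example, parallel tree construction can be requested directly from the scikit-learn style wrapper; a minimal sketch (in recent versions the parameter is n_jobs, while older versions call it nthread):

from xgboost import XGBClassifier

# use all available CPU cores when building trees
model = XGBClassifier(n_jobs=-1)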
The figure below compares XGBoost with other gradient boosting and bagged decision tree implementations; it is faster than the baseline configurations in R, Python, Spark, and H2O.
Another advantage is that its models perform very well on prediction problems. Below are a few post-competition interviews with Kaggle winners, where you can see how well XGBoost works in practice:
Vlad Sandulescu, Mihai Chiru, 1st place of the KDD Cup competition. Link to the arXiv paper.
Marios Michailidis, Mathias Müller and HJ van Veen, 1st place of the Dato Truly Native? competition. Link to the Kaggle interview.
Vlad Mironov, Alexander Guschin, 1st place of the CERN LHCb experiment Flavour of Physics competition. Link to the Kaggle interview.
How to apply?
First, let us use XGBoost on a simple binary classification problem. The dataset below is used to predict whether a patient will develop diabetes within 5 years; the first 8 columns are the input variables, and the last column is the label, 0 or 1.
Data Description:
https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
Download the dataset and save it as a "pima-indians-diabetes.csv" file:
https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data
1. Basic applications
Import XGBoost and the other packages:
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
Separating variables and labels
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
X = dataset[:, 0:8]
Y = dataset[:, 8]
Split the data into a training set and a test set: the test set is used for prediction, and the training set is used to learn the model.
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
XGBoost ships with a packaged classifier and regressor, so you can model directly with XGBClassifier. This is the XGBClassifier documentation:
http://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn
model = XGBClassifier()
model.fit(X_train, y_train)
XGBoost outputs, for each sample, the probability of belonging to the first class, which needs to be converted to 0 or 1 with round.
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
Compute the accuracy (77.95% here):
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
2. Monitor model performance
XGBoost can evaluate the model's performance on a test set while it is being trained, and print the score at each step. To do this, simply change
model = XGBClassifier()
model.fit(X_train, y_train)
into:
model = XGBClassifier()
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss", eval_set=eval_set, verbose=True)
Then the logloss is printed after each tree is added:
[..]  validation_0-logloss:0.487867
[..]  validation_0-logloss:0.487297
[..]  validation_0-logloss:0.487562
and it prints where early stopping occurred:
Stopping. Best iteration:
[..]  validation_0-logloss:0.487297
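The fitted model also keeps the evaluation history and the best round, which you can read back in code; a minimal sketch, assuming the attributes exposed by the scikit-learn style wrapper after early stopping:

# per-round logloss values on the evaluation set named validation_0
history = model.evals_result()['validation_0']['logloss']
print("rounds evaluated: %d" % len(history))
# round with the best (lowest) logloss found by early stopping
print("best iteration: %d" % model.best_iteration)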
3. Output feature importance
Another advantage of gradient boosting is that a trained model can report feature importances,
which lets you know which variables to keep and which can be discarded.
The following two imports are needed:
from xgboost import plot_importance
from matplotlib import pyplot
Compared with the previous code, just add two lines after fit to plot the feature importances:
model.fit(X, Y)
plot_importance(model)
pyplot.show()
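If you prefer raw numbers to a plot, the fitted classifier also exposes feature_importances_; a minimal sketch that simply prints the score of each input column:

# one importance score per column of X, in the same order as the columns
for i, score in enumerate(model.feature_importances_):
    print("feature %d: %.4f" % (i, score))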
4. Parameter adjustment
How should the parameters be tuned? Below are commonly used value ranges for three key parameters. You can start within these ranges, plot learning curves, and then adjust the parameters to find the best model: learning_rate = 0.1 or smaller (the smaller it is, the more weak learners you need); tree depth = 2-8; subsample = 30%-80% of the training set.
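These rules of thumb translate directly into constructor arguments; a minimal sketch, where the concrete values are just one illustrative choice inside the ranges above:

model = XGBClassifier(
    learning_rate=0.1,  # 0.1 or smaller; smaller values need more weak learners
    max_depth=4,        # tree depth in the 2-8 range
    subsample=0.8,      # sample 80% of the training rows for each tree
    n_estimators=100    # number of weak learners; raise this if learning_rate is lowered
)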
Next, we use GridSearchCV to tune the parameters more conveniently.
Parameter combinations that can be tuned include:
The number and depth of the trees (n_estimators and max_depth).
The learning rate and the number of trees (learning_rate and n_estimators).
The row and column subsampling rates (subsample, colsample_bytree and colsample_bylevel).
The following takes the learning rate as an example.
First import these two classes:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
Set the candidate values learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]. Compared with the original code, add the grid search lines after creating the model:
model = XGBClassifier()
learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
param_grid = dict(learning_rate=learning_rate)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, Y)
In the end, the best learning rate found is 0.1:
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
Best: -0.483013 using {'learning_rate': 0.1}
We can also print out the corresponding score for each learning rate using the following code:
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
-0.689650 (0.000242) with: {'learning_rate': 0.0001}
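The same pattern extends to the parameter combinations listed earlier, for example tuning learning_rate and n_estimators together; a minimal sketch, where the candidate values are only illustrative:

param_grid = dict(
    learning_rate=[0.01, 0.1, 0.2],
    n_estimators=[50, 100, 200]
)
grid_search = GridSearchCV(XGBClassifier(), param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, Y)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))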