Original: http://blog.csdn.net/zc02051126/article/details/46771793
Using Xgboost in Python
The following is an introduction to Xgboost's Python module, covering:
* Compiling and importing the Python module
* Data interface
* Parameter setting
* Training the model
* Early stopping
* Prediction
A walk-through Python example using the UCI Mushroom dataset is also provided.
Installation
First install the C++ version of Xgboost, then go to the wrappers folder under the root directory of the source tree and run the following script to install the Python module:
python setup.py install
After the installation is complete, import the Xgboost Python module as follows
import xgboost as xgb
Data interface
Xgboost can load text data in LIBSVM format, two-dimensional numpy arrays, and Xgboost binary cache files. The loaded data is stored in a DMatrix object.
- When loading data from a LIBSVM text file or a binary cache file, you can use the following method:
dtrain = xgb.DMatrix('train.svm.txt')
dtest = xgb.DMatrix('test.svm.buffer')
- When loading a numpy array into a DMatrix object, you can use the following method:
import numpy as np
data = np.random.rand(5, 10)  # 5 entities, each contains 10 features
label = np.random.randint(2, size=5)  # binary target
dtrain = xgb.DMatrix(data, label=label)
- When converting scipy.sparse data into a DMatrix, you can use the following method:
import scipy.sparse
csr = scipy.sparse.csr_matrix((dat, (row, col)))  # dat, row, col are existing data/index arrays
dtrain = xgb.DMatrix(csr)
- A DMatrix can be saved as an Xgboost binary file, which speeds up loading the next time, using the following method:
dtrain = xgb.DMatrix('train.svm.txt')
dtrain.save_binary('train.buffer')
- Missing values in a DMatrix can be handled in the following way:
dtrain = xgb.DMatrix(data, label=label, missing=-999.0)
- When you need to set weights for the samples, you can use the following method:
w = np.random.rand(5, 1)
dtrain = xgb.DMatrix(data, label=label, missing=-999.0, weight=w)
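Putting the operations above together, here is a minimal sketch of building, weighting, and caching a DMatrix; the file name and the -999.0 missing-value marker are only example values:
import numpy as np
import xgboost as xgb

data = np.random.rand(5, 10)          # 5 entities, 10 features each
label = np.random.randint(2, size=5)  # binary target
weight = np.random.rand(5)            # one weight per entity
# -999.0 is an arbitrary marker for missing entries in this sketch
dtrain = xgb.DMatrix(data, label=label, missing=-999.0, weight=weight)
dtrain.save_binary('train.buffer')    # reload later with xgb.DMatrix('train.buffer')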
Parameter settings
Xgboost stores its parameters in key-value format, for example:
* Booster (base learner) parameters
param = {'bst:max_depth': 2, 'bst:eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
param['nthread'] = 4
plst = list(param.items())
plst += [('eval_metric', 'auc')]  # Multiple evals can be handled in this way
plst += [('eval_metric', 'ams@0')]
- You can also define a validation set to watch the performance of the algorithm during training:
evallist = [(dtest, 'eval'), (dtrain, 'train')]
Training model
With the parameter list and the data, you can train the model.
* Training
num_round = 10
bst = xgb.train(plst, dtrain, num_round, evallist)
- Save the Model
You can save the model after training is complete, and you can also inspect the model's internal structure (see the dump below).
bst.save_model('0001.model')
- Dump Model and Feature Map
You can dump the model to a text file and review what the model has learned:
# dump model
bst.dump_model('dump.raw.txt')
# dump model with feature map
bst.dump_model('dump.raw.txt', 'featmap.txt')
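The feature map file is not created automatically. Below is a minimal sketch for writing one, assuming the whitespace-separated format used in the Xgboost mushroom demo (feature index, feature name, feature type, where the type is 'q' for quantitative, 'i' for indicator, or 'int' for integer); the feature names here are hypothetical:
# hypothetical feature names; replace them with the real names of your columns
feature_names = ['cap-shape=bell', 'cap-shape=conical', 'odor=almond']
with open('featmap.txt', 'w') as f:
    for i, name in enumerate(feature_names):
        # all features in this sketch are 0/1 indicators, hence type 'i'
        f.write('{0}\t{1}\ti\n'.format(i, name))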
- Load model
The model can be loaded in the following way:
bst = xgb.Booster({'nthread': 4})  # init model
bst.load_model('model.bin')  # load model
Early stopping
If you have a validation set, you can use early stopping to find the optimal number of boosting iterations. Early stopping requires at least one set in evals. If there is more than one, it will use the last.
train(..., evals=evals, early_stopping_rounds=10)
The model will train until the validation score stops improving. Validation error needs to decrease at least every early_stopping_rounds rounds to continue training.
If early stopping occurs, the model will have two additional fields: bst.best_score and bst.best_iteration. Note that train() returns the model from the last iteration, not the best one.
This works with both metrics to be minimized (RMSE, log loss, etc.) and metrics to be maximized (MAP, NDCG, AUC).
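A minimal sketch of early stopping, assuming the param, dtrain, and dtest objects from the sections above (watchlist is just an example name):
# early stopping watches the last entry of evals, here the test set
watchlist = [(dtrain, 'train'), (dtest, 'eval')]
bst = xgb.train(param, dtrain, num_boost_round=1000,
                evals=watchlist, early_stopping_rounds=10)
print(bst.best_score, bst.best_iteration)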
Prediction
After training or loading a model and preparing the test data, you can run prediction.
data = np.random.rand(7, 10)  # 7 entities, each contains 10 features
dtest = xgb.DMatrix(data, missing=-999.0)
ypred = bst.predict(dtest)
If early stopping is enabled during training, you can predict with the best iteration.
ypred = bst.predict(dtest, ntree_limit=bst.best_iteration)
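Putting everything together, here is a minimal end-to-end sketch using the UCI Mushroom data; the agaricus.txt.train and agaricus.txt.test paths are assumptions, pointing at the LIBSVM files shipped in the demo/data folder of the Xgboost source tree:
import xgboost as xgb

# paths are examples; adjust them to wherever the demo data lives
dtrain = xgb.DMatrix('agaricus.txt.train')
dtest = xgb.DMatrix('agaricus.txt.test')

param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}
watchlist = [(dtrain, 'train'), (dtest, 'eval')]

bst = xgb.train(param, dtrain, num_boost_round=10, evals=watchlist)
ypred = bst.predict(dtest)        # predicted probabilities for the test set
bst.save_model('mushroom.model')  # example file name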