Xgboost: Using Xgboost in Python


Original: http://blog.csdn.net/zc02051126/article/details/46771793

Using Xgboost in Python

This article introduces XGBoost's Python module. It covers:
* Compiling and importing the Python module
* Data interface
* Parameter settings
* Training the model
* Early stopping
* Prediction

A walk-through Python example on the UCI Mushroom dataset is provided at the end of the article.

Installation

First build the C++ version of XGBoost, then go into the wrappers folder under the root directory of the source tree and run the following script to install the Python module:

python setup.py install

After the installation is complete, import the XGBoost Python module as follows:

import xgboost as xgb

Data interface

XGBoost can load text data in LibSVM format, two-dimensional NumPy arrays, and XGBoost binary cache files. Loaded data is stored in a DMatrix object.

    • To load a LibSVM text file or an XGBoost binary cache file, use:

dtrain = xgb.DMatrix('train.svm.txt')
dtest = xgb.DMatrix('test.svm.buffer')
    • To load a NumPy array into a DMatrix, use:

import numpy as np

data = np.random.rand(5, 10)  # 5 entities, each contains 10 features
label = np.random.randint(2, size=5)  # binary target
dtrain = xgb.DMatrix(data, label=label)
    • To convert scipy.sparse data into a DMatrix, use:

import scipy.sparse

csr = scipy.sparse.csr_matrix((dat, (row, col)))
dtrain = xgb.DMatrix(csr)
    • A DMatrix can be saved in the XGBoost binary format, which speeds up loading the next time:

dtrain = xgb.DMatrix('train.svm.txt')
dtrain.save_binary('train.buffer')
    • You can mark missing values when constructing a DMatrix:

dtrain = xgb.DMatrix(data, label=label, missing=-999.0)
    • When you need to set weights for the samples, use the following (a combined sketch follows this list):

w = np.random.rand(5, 1)
dtrain = xgb.DMatrix(data, label=label, missing=-999.0, weight=w)
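
Putting the bullets above together, here is a minimal sketch that builds a weighted DMatrix with a missing-value marker, caches it in binary form, and reloads it. The file name train.buffer and the -999.0 marker are arbitrary choices for illustration:

import numpy as np
import xgboost as xgb

data = np.random.rand(5, 10)          # 5 entities, 10 features each
label = np.random.randint(2, size=5)  # binary target
w = np.random.rand(5)                 # one weight per entity

# entries equal to -999.0 in data would be treated as missing
dtrain = xgb.DMatrix(data, label=label, missing=-999.0, weight=w)

# cache to the binary format and load it back
dtrain.save_binary('train.buffer')
dtrain2 = xgb.DMatrix('train.buffer')
print(dtrain2.num_row(), dtrain2.num_col())  # 5 10
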
Parameter settings

XGBoost stores its parameters as key-value pairs, e.g.:
* Booster (base learner) parameters

param = {'bst:max_depth': 2, 'bst:eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
param['nthread'] = 4
plst = list(param.items())
plst += [('eval_metric', 'auc')]  # Multiple evals can be handled in this way
plst += [('eval_metric', 'ams@0')]
    • You can also define a validation set to watch the performance of the algorithm during training:

evallist = [(dtest, 'eval'), (dtrain, 'train')]

Training the model

With the parameter list and the data, you can train the model.
* Training

num_round = 10
bst = xgb.train(plst, dtrain, num_round, evallist)
    • Save the model
      You can save the model after training completes:

bst.save_model('0001.model')
    • Dump model and feature map
      You can also dump the model to a text file and inspect its structure:

# dump model
bst.dump_model('dump.raw.txt')
# dump model with feature map
bst.dump_model('dump.raw.txt', 'featmap.txt')
    • Load model
      A saved model can be loaded as follows:

bst = xgb.Booster({'nthread': 4})  # init model
bst.load_model('model.bin')  # load model
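
As a quick sanity check, a saved and reloaded model should produce identical predictions. The sketch below uses synthetic data and arbitrary file names; note that newer XGBoost releases prefer a .json extension for saved models:

import numpy as np
import xgboost as xgb

# train a small model on synthetic data (illustration only)
X = np.random.rand(100, 10)
y = np.random.randint(2, size=100)
dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=5)

bst.save_model('0001.model')  # newer versions prefer '0001.json'

bst2 = xgb.Booster({'nthread': 4})
bst2.load_model('0001.model')

# the reloaded model should predict exactly the same values
assert np.allclose(bst.predict(dtrain), bst2.predict(dtrain))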


Early stopping

If you have a validation set, you can use early stopping to find the optimal number of boosting iterations. Early stopping requires at least one set in evals; if there is more than one, the last one is used.

train(..., evals=evals, early_stopping_rounds=10)

The model will train until the validation score stops improving: the validation error needs to decrease at least once every early_stopping_rounds rounds for training to continue.

If early stopping occurs, the model will have two additional fields: bst.best_score and bst.best_iteration. Note that train() returns the model from the last iteration, not the best one.

This works with both metrics to minimize (RMSE, log loss, etc.) and to maximize (MAP, NDCG, AUC).
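
For concreteness, here is a minimal early-stopping sketch with synthetic data; the split sizes, parameters, and round counts are arbitrary illustrations:

import numpy as np
import xgboost as xgb

# synthetic train/validation split (illustration only)
X = np.random.rand(200, 10)
y = np.random.randint(2, size=200)
dtrain = xgb.DMatrix(X[:150], label=y[:150])
dvalid = xgb.DMatrix(X[150:], label=y[150:])

param = {'objective': 'binary:logistic', 'eval_metric': 'logloss'}
evallist = [(dvalid, 'eval')]  # early stopping watches the last entry

bst = xgb.train(param, dtrain, num_boost_round=100,
                evals=evallist, early_stopping_rounds=10)

print(bst.best_score, bst.best_iteration)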


Prediction

After training or loading a model and preparing the data, you can make predictions:

data = np.random.rand(7, 10)  # 7 entities, each contains 10 features
dtest = xgb.DMatrix(data, missing=-999.0)
ypred = bst.predict(dtest)

If early stopping was enabled during training, you can predict with the best iteration:

ypred = bst.predict(dtest, ntree_limit=bst.best_iteration)
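
Finally, tying the sections together, here is a sketch of the end-to-end walkthrough on the UCI Mushroom (agaricus) data that ships with the XGBoost source tree. The demo/data paths assume you run it from a checkout of the repository, and newer releases may require a '?format=libsvm' suffix on the file names:

import xgboost as xgb

# paths assume the demo data shipped with the xgboost source tree
dtrain = xgb.DMatrix('demo/data/agaricus.txt.train')
dtest = xgb.DMatrix('demo/data/agaricus.txt.test')

param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}
evallist = [(dtest, 'eval'), (dtrain, 'train')]

num_round = 10
bst = xgb.train(param, dtrain, num_round, evallist)

# classify with a 0.5 threshold and report the error rate
preds = bst.predict(dtest)
labels = dtest.get_label()
err = sum(int(p > 0.5) != l for p, l in zip(preds, labels)) / float(len(preds))
print('error=%f' % err)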

