Installation and use of xgboost in Python environment


XGBoost is a large-scale parallel boosted tree library, and one of the fastest and best open-source boosted tree toolkits available, reportedly more than 10 times faster than common implementations. In data science, many Kaggle players use it for data mining competitions, including several winning solutions. At industrial scale, the distributed version of XGBoost is highly portable, supports running on platforms such as YARN, MPI, and Sun Grid Engine, and retains the various optimizations of the single-machine parallel version, making it a good solution for industrial-scale problems.

This article mainly introduces the installation and use of xgboost in the Python environment.

First build the C++ version of XGBoost, then go to the wrappers folder under the root directory of the source tree and run the following script to install the Python module:

python setup.py install

Download URL: https://github.com/dmlc/xgboost (installation in a Windows environment requires compiling first)
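
After installation, a quick import check confirms that the module is available (a minimal sketch; the __version__ attribute is assumed to exist in your build):

        import xgboost as xgb
        print(xgb.__version__)  # assumes the installed build exposes __version__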

How to use:

1. Data import

XGBoost loads data into its own DMatrix format, for example from LibSVM-style text files.

The import method is:

        import xgboost as xgb

        dtrain = xgb.DMatrix('train.txt')
        dtest = xgb.DMatrix('test.txt')
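
A DMatrix can also be built directly from in-memory arrays. A minimal sketch using NumPy (the shapes and labels below are purely illustrative):

        import numpy as np
        import xgboost as xgb

        # illustrative random data: 100 samples, 10 features, binary labels
        data = np.random.rand(100, 10)
        labels = np.random.randint(2, size=100)
        dtrain = xgb.DMatrix(data, label=labels)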

2. Parameter Settings

        param = {'booster': 'gbtree', 'max_depth': 10, 'eta': 0.3, 'silent': 1,
                 'num_class': 2, 'objective': 'multi:softprob'}
        watchlist = [(dtest, 'test'), (dtrain, 'train')]

Set and tune the parameters, and define the validation sets to watch during training.

Parameter explanation:

Parameters for Tree Booster
  • eta [default=0.3]
    • Step size shrinkage used in each update to prevent overfitting. After each boosting step, the weights of new features can be obtained directly; eta shrinks these feature weights to make the boosting process more conservative. Default is 0.3.
    • Range: [0,1]
  • gamma [default=0]
    • Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be.
    • Range: [0,∞]
  • max_depth [default=6]
    • The maximum depth of a tree. Default is 6.
    • Range: [1,∞]
  • min_child_weight [default=1]
    • The minimum sum of instance weights needed in a child node. If the tree partition step produces a leaf node whose sum of instance weights is less than min_child_weight, the splitting process stops. In a linear regression model, this simply corresponds to the minimum number of instances needed in each node. The larger min_child_weight is, the more conservative the algorithm will be.
    • Range: [0,∞]
  • max_delta_step [default=0]
    • Maximum delta step we allow each tree's weight estimation to be. If the value is set to 0, there is no constraint. If it is set to a positive value, it can help make the update step more conservative. Usually this parameter is not needed, but it may help in logistic regression when the classes are extremely imbalanced. Setting it to a value of 1-10 may help control the update.
    • Range: [0,∞]
  • subsample [default=1]
    • The fraction of the training instances used to grow each tree. Setting it to 0.5 means that XGBoost randomly samples 50% of the training data to grow trees, which helps prevent overfitting.
    • Range: (0,1]
  • colsample_bytree [default=1]
    • The fraction of features (columns) sampled when constructing each tree. Default is 1.
    • Range: (0,1]
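
Putting the tree booster parameters together, an illustrative configuration might look like the following (the values are examples, not recommendations):

        # illustrative tree booster settings
        param = {'booster': 'gbtree',
                 'eta': 0.1,               # smaller step size, more conservative updates
                 'gamma': 1.0,             # minimum loss reduction required to split
                 'max_depth': 6,
                 'min_child_weight': 1,
                 'subsample': 0.8,         # row subsampling against overfitting
                 'colsample_bytree': 0.8}  # column subsampling per tree
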
Parameters for Linear Booster
    • lambda [default=0]
      • L2 regularization penalty coefficient.
    • alpha [default=0]
      • L1 regularization penalty coefficient.
    • lambda_bias
      • L2 regularization on the bias term. Default is 0. (There is no L1 regularization on the bias, because the bias is not important when using L1.)
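
For comparison, a minimal illustrative linear booster configuration:

        # illustrative linear booster settings
        param = {'booster': 'gblinear',
                 'lambda': 0.1,  # L2 penalty
                 'alpha': 0.1}   # L1 penalty
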
Task Parameters
  • objective [default=reg:linear]
    • Defines the learning task and the corresponding learning objective. The optional objective functions are as follows:
    • "reg:linear" – linear regression.
    • "reg:logistic" – logistic regression.
    • "binary:logistic" – logistic regression for binary classification; the output is a probability.
    • "binary:logitraw" – logistic regression for binary classification; the output is the raw score wTx before the logistic transformation.
    • "count:poisson" – Poisson regression for count data; the output is the mean of the Poisson distribution.
    • In Poisson regression, the default value of max_delta_step is 0.7. (used to safeguard optimization)
    • "multi:softmax" – makes XGBoost use the softmax objective to handle multi-class classification; the parameter num_class (number of classes) must also be set.
    • "multi:softprob" – same as softmax, but the output is a vector of ndata * nclass values, which can be reshaped into a matrix of ndata rows and nclass columns. Each row gives the probability of a sample belonging to each class.
    • "rank:pairwise" – sets XGBoost to do a ranking task by minimizing the pairwise loss.
  • base_score [default=0.5]
    • The initial prediction score for all instances; a global bias.
  • eval_metric [default according to objective]
    • The evaluation metric for validation data; a default metric is assigned according to the objective (RMSE for regression, error for classification, mean average precision for ranking).
    • Users can add multiple evaluation metrics. Python users should pass the metrics as a list of parameter pairs rather than as a dict, so that later 'eval_metric' entries do not override earlier ones (see the sketch after this list).
    • The choices are listed below:
    • "rmse": root mean square error
    • "logloss": negative log-likelihood
    • "error": binary classification error rate, calculated as #(wrong cases)/#(all cases). The evaluation treats instances with a prediction value larger than 0.5 as positive instances, and the others as negative instances.
    • "merror": multiclass classification error rate, calculated as #(wrong cases)/#(all cases).
    • "mlogloss": multiclass logloss
    • "auc": area under the curve, for ranking evaluation.
    • "ndcg": normalized discounted cumulative gain
    • "map": mean average precision
    • "ndcg@n", "map@n": n can be assigned as an integer to cut off the top positions in the lists for evaluation.
    • "ndcg-", "map-", "ndcg@n-", "map@n-": in XGBoost, NDCG and MAP evaluate the score of a list without any positive samples as 1. By adding "-" to the evaluation metric, XGBoost will evaluate these scores as 0, to be consistent under some conditions when training repeatedly.
  • seed [default=0]
    • Random number seed. Default is 0.
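
As noted above, multiple evaluation metrics can be supplied from Python by passing the parameters as a list of pairs instead of a dict. A minimal sketch:

        # a list of pairs keeps repeated 'eval_metric' keys instead of overwriting them
        plst = list(param.items())
        plst += [('eval_metric', 'mlogloss'), ('eval_metric', 'merror')]
        bst = xgb.train(plst, dtrain, num_round, watchlist)
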
Console Parameters

The following parameters are only used in the console version of XGBoost:
* use_buffer [default=1]
- Whether to create a binary cache file for the input; the cache file can speed up computation. Default is 1.
* num_round
- The number of boosting iterations.
* data
- The path of the training data.
* test:data
- The path of the test data.
* save_period [default=0]
- The period for saving the model. For example, save_period=10 means that XGBoost saves the model every 10 iterations; setting it to 0 means no intermediate models are saved during training (the final model is still saved at the end).
* task [default=train] options: train, pred, eval, dump
- train: train a model.
- pred: make predictions on test data.
- eval: evaluate by metrics defined with eval[name]=filename.
- dump: save the learned model in text format.
* model_in [default=NULL]
- The path of the input model, used in the test, eval, and dump tasks; if defined in training, XGBoost will continue training from the input model.
* model_out [default=NULL]
- The path for saving the model after training finishes. If not defined, output names look like 0003.model, where 0003 is the model from the third round of training.
* model_dir [default=models]
- The output directory for saved models.
* fmap
- Feature map, used when dumping a model.
* name_dump [default=dump.txt]
- Name of the model dump file.
* name_pred [default=pred.txt]
- Name of the prediction results file.
* pred_margin [default=0]
- Output the untransformed margin value rather than the converted probability.
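
These console parameters are usually collected in a configuration file passed to the xgboost executable. A minimal sketch modeled on the mushroom demo shipped with the source tree (file names are illustrative):

        # illustrative training configuration, e.g. mushroom.conf
        booster = gbtree
        objective = binary:logistic
        eta = 1.0
        max_depth = 3
        num_round = 2
        save_period = 0
        data = "agaricus.txt.train"
        eval[test] = "agaricus.txt.test"
        test:data = "agaricus.txt.test"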

3. Model Training

        bst = xgb.train(param, dtrain, num_round, watchlist)
        prob = bst.predict(dtest)     # for multi:softprob, an ndata x nclass probability matrix
        preds = prob.argmax(axis=1)   # convert probabilities to predicted class labels

Train the model and predict

        from sklearn import metrics

        print(metrics.accuracy_score(labels, preds))
        print(metrics.precision_score(labels, preds))
        print(metrics.recall_score(labels, preds))

Output the evaluation metrics; labels here are the true class labels of the test set.
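
The training call also supports early stopping on the watchlist. A minimal sketch (the round counts are illustrative; early stopping monitors the last entry in evals, so the validation set is placed last):

        bst = xgb.train(param, dtrain, num_boost_round=100,
                        evals=[(dtrain, 'train'), (dtest, 'test')],
                        early_stopping_rounds=10)  # stop if 'test' shows no improvement for 10 rounds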

4. Model Save and load

Save the Model

        bst.save_model('0001.model')

Load model

        bst = xgb.Booster()           # init model
        bst.load_model('0001.model')  # load model data
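
Besides the binary format, the learned model can also be dumped to a human-readable text file. A minimal sketch (the feature map argument is optional, and the file names are illustrative):

        bst.dump_model('dump.raw.txt', 'featmap.txt')  # text dump, optionally with a feature map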
