Installation and use of XGBoost in the Python environment


XGBoost is a large-scale, parallel boosted tree tool. It is currently one of the fastest and best open-source boosted tree toolkits, roughly ten times faster than common implementations. In data science, a large number of Kaggle players use it for data mining competitions, including the winning solutions of more than two Kaggle contests. At industrial scale, the distributed version of XGBoost is widely portable, supports running on platforms such as YARN, MPI, and Sun Grid Engine, and retains the optimizations of the parallel single-machine version, making it a good solution for industrial-scale problems.

This article mainly introduces the installation and use of XGBoost in the Python environment.

First build the C++ version of XGBoost, then go to the wrappers folder under the root directory of the source tree and run the following script to install the Python module:

python setup.py install

Download URL: https://github.com/dmlc/xgboost (in a Windows environment, the library needs to be compiled first).
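Once the module is installed, a quick import check confirms that the build is visible to Python. A minimal sketch (the __version__ attribute assumes a release that exposes it):

    import xgboost as xgb

    # if the install succeeded, this import works and the version prints
    print(xgb.__version__)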

How to use:

1. Data import

XGBoost reads data into its own DMatrix format, shown here with text files as an example.

The import method is:

        import xgboost as xgb

        dtrain = xgb.DMatrix('train.txt')
        dtest = xgb.DMatrix('test.txt')
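A DMatrix can also be built from in-memory arrays rather than files. A minimal sketch with synthetic NumPy data (the arrays here are illustrative, not from the article):

        import numpy as np
        import xgboost as xgb

        # hypothetical data: 100 samples, 10 features, binary labels
        X = np.random.rand(100, 10)
        y = np.random.randint(2, size=100)
        dtrain = xgb.DMatrix(X, label=y)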

2. Parameter Settings

        param = {'booster': 'gbtree', 'max_depth': 10, 'eta': 0.3, 'silent': 1,
                 'num_class': 2, 'objective': 'multi:softprob'}
        watchlist = [(dtest, 'test'), (dtrain, 'train')]

Set the parameters (tuning them as needed), and define the validation data sets to watch during training.

Parameter explanation:

Parameters for Tree Booster (a combined sketch follows this list)
  • eta [default=0.3]
    • The shrinkage step used in the update, to prevent overfitting. After each boosting step, the weights of new features can be obtained directly; eta shrinks these feature weights to make the boosting process more conservative. The default value is 0.3.
    • The value range is: [0,1]
  • gamma [default=0]
    • Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger it is, the more conservative the algorithm will be.
    • Range: [0,∞]
  • max_depth [default=6]
    • The maximum depth of a tree. The default value is 6.
    • The value range is: [1,∞]
  • min_child_weight [default=1]
    • The minimum sum of instance weights in a child node. If a partitioning step produces a leaf node whose instance-weight sum is less than min_child_weight, the splitting process stops. In a linear regression task, this simply corresponds to the minimum number of instances each node must contain. The larger the value, the more conservative the algorithm.
    • The value range is: [0,∞]
  • max_delta_step [default=0]
    • Maximum delta step we allow each tree's weight estimation to be. If the value is set to 0, there is no constraint. If it is set to a positive value, it can help make the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when the classes are extremely imbalanced. Setting it to a value of 1-10 might help control the update.
    • The value range is: [0,∞]
  • subsample [default=1]
    • The fraction of the whole training set sampled to grow each tree. Setting it to 0.5 means that XGBoost randomly draws 50% of the training instances to build each tree, which helps prevent overfitting.
    • The value range is: (0,1]
  • colsample_bytree [default=1]
    • The fraction of features (columns) sampled when constructing each tree. The default value is 1.
    • Range of values: (0,1]
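To make these knobs concrete, here is a minimal sketch of a tree-booster configuration combining the parameters above (the specific values are illustrative, not recommendations):

        param = {
            'booster': 'gbtree',
            'eta': 0.1,               # smaller shrinkage step -> more conservative updates
            'gamma': 1.0,             # minimum loss reduction required to split
            'max_depth': 6,           # limit tree depth
            'min_child_weight': 3,    # minimum instance-weight sum per child
            'subsample': 0.8,         # sample 80% of rows per tree
            'colsample_bytree': 0.8,  # sample 80% of columns per tree
            'num_class': 2,
            'objective': 'multi:softprob',
        }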
Parameters for Linear Booster (a short sketch follows this list)
    • lambda [default=0]
      • The L2 regularization penalty coefficient.
    • alpha [default=0]
      • The L1 regularization penalty coefficient.
    • lambda_bias
      • The L2 regularization on the bias term. The default value is 0 (there is no L1 regularization on the bias, because the bias is not important there).
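These penalties apply when the linear booster is selected. A minimal sketch (values illustrative):

        # linear booster with L1/L2 regularization instead of trees
        param_linear = {'booster': 'gblinear', 'lambda': 1.0, 'alpha': 0.5,
                        'objective': 'reg:linear'}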
Task Parameters
  • objective [default=reg:linear]
    • Defines the learning task and the corresponding learning objective. The optional objective functions are as follows:
    • "reg:linear" – linear regression.
    • "reg:logistic" – logistic regression.
    • "binary:logistic" – logistic regression for binary classification; the output is a probability.
    • "binary:logitraw" – logistic regression for binary classification; the output is the raw score before the logistic transformation, wᵀx.
    • "count:poisson" – Poisson regression for count data; the output is the mean of the Poisson distribution.
    • In Poisson regression, the default value of max_delta_step is 0.7 (used to safeguard optimization).
    • "multi:softmax" – tells XGBoost to use the softmax objective to handle multi-class classification; the parameter num_class (number of classes) must also be set.
    • "multi:softprob" – the same as softmax, except that the output is a vector of ndata * nclass values, which can be reshaped into a matrix of ndata rows and nclass columns. Each row gives the probability that a sample belongs to each class.
    • "rank:pairwise" – sets XGBoost to do a ranking task by minimizing the pairwise loss.
  • base_score [default=0.5]
    • The initial prediction score of all instances; a global bias.
  • eval_metric [default according to objective]
    • The evaluation metric for the validation data. Each objective has a default metric (rmse for regression, error for classification, mean average precision for ranking).
    • Users can add multiple evaluation metrics. Python users should pass the metrics as a list of parameter pairs rather than a map, so that a later 'eval_metric' does not override an earlier one (see the sketch after this list).
    • The choices are listed below:
    • "rmse": root mean square error
    • "logloss": negative log-likelihood
    • "error": binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation regards instances with a prediction value larger than 0.5 as positive instances, and the others as negative instances.
    • "merror": multi-class classification error rate. It is calculated as #(wrong cases)/#(all cases).
    • "mlogloss": multi-class logloss
    • "auc": area under the curve, for ranking evaluation.
    • "ndcg": normalized discounted cumulative gain
    • "map": mean average precision
    • "ndcg@n", "map@n": n can be assigned as an integer to cut off the top positions in the list for evaluation.
    • "ndcg-", "map-", "ndcg@n-", "map@n-": in XGBoost, ndcg and map evaluate the score of a list without any positive samples as 1. By adding "-" to the evaluation metric name, XGBoost evaluates these scores as 0, to be consistent under some conditions.
  • seed [default=0]
    • The random number seed. The default value is 0.
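A minimal sketch of the list-of-pairs form mentioned above, so that several evaluation metrics can be tracked at once (the metric choices are illustrative):

        # passing parameters as (key, value) pairs keeps every 'eval_metric' entry
        num_round = 10   # number of boosting rounds (illustrative)
        plst = list(param.items())
        plst += [('eval_metric', 'mlogloss'), ('eval_metric', 'merror')]
        bst = xgb.train(plst, dtrain, num_round, watchlist)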
Console Parameters

The following parameters are used only in the console version of XGBoost (a sketch of a configuration file follows this list):
* use_buffer [default=1]
- Whether to create a binary cache file for the input; the cache file can speed up computation. The default value is 1.
* num_round
- The number of boosting iterations.
* data
- The path of the training data.
* test:data
- The path of the test data.
* save_period [default=0]
- Save the model every save_period iterations. For example, save_period=10 means XGBoost saves the intermediate result every 10 iterations, while a setting of 0 means no intermediate models are saved during training.
* task [default=train] options: train, pred, eval, dump
- train: train the model.
- pred: make predictions on the test data.
- eval: evaluate using the metrics defined by eval[name]=filename.
- dump: save the learned model in text format.
* model_in [default=NULL]
- The path of the input model, used in test, eval, and dump. If it is specified for training, XGBoost loads the model and continues training from it.
* model_out [default=NULL]
- The path where the model is saved when training completes. If it is not specified, XGBoost outputs a file such as 0003.model, where 0003 indicates the model from the third round of training.
* model_dir [default=models]
- The directory where output models are saved.
* fmap
- The feature map, used when dumping the model.
* name_dump [default=dump.txt]
- The name of the model dump file.
* name_pred [default=pred.txt]
- The name of the prediction results file.
* pred_margin [default=0]
- Output the prediction margin instead of the transformed probability.
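For orientation, here is a minimal sketch of a console configuration file using the parameters above, modeled on the style of the demo configs shipped with XGBoost (file names and values are illustrative):

        # train.conf - illustrative console configuration
        booster = gbtree
        objective = binary:logistic
        num_round = 10
        save_period = 0
        data = "train.txt"
        test:data = "test.txt"
        model_out = "final.model"

It would then be run as `xgboost train.conf` with the compiled console binary.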

3. Model Training

        num_round = 10   # number of boosting iterations
        bst = xgb.train(param, dtrain, num_round, watchlist)
        preds = bst.predict(dtest)

Train the model and make predictions on the test set.

        from sklearn import metrics

        labels = dtest.get_label()           # true labels from the test DMatrix
        pred_labels = preds.argmax(axis=1)   # softprob output -> class labels
        print(metrics.accuracy_score(labels, pred_labels))
        print(metrics.precision_score(labels, pred_labels))
        print(metrics.recall_score(labels, pred_labels))

Print the evaluation metrics.

4. Model Save and load

Save the Model

        bst.save_model('0001.model')

Load model

        bst = xgb.Booster()            # init model
        bst.load_model('0001.model')   # load model data
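As a related sketch, the trained booster can also be dumped to a readable text format from Python, mirroring the console dump task (the path is illustrative; a feature-map file can be passed as an optional second argument):

        # dump the tree structure as text
        bst.dump_model('dump.txt')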
