Installing and Using XGBoost in Python

Last updated: 2017-11-01 Source: Internet

Uploaded by: User

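XGBoost is a tool for large-scale parallel boosted trees. At the time of writing it is regarded as one of the fastest open-source boosted tree toolkits and is reported to be more than 10 times faster than commonly used toolkits. In data science, many Kaggle competitors use it for data mining competitions, including the winning solutions of more than two Kaggle competitions. At industrial scale, the distributed version of XGBoost is highly portable: it runs on platforms such as YARN, MPI, and Sun Grid Engine while keeping the optimizations of the single-machine parallel version, which makes it well suited to industrial-scale problems.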

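This article covers how to install and use XGBoost in a Python environment.

First build the C++ version of XGBoost, then go to the wrappers folder under the root of the source tree and run the following command to install the Python module (later releases keep the Python package in the python-package folder and can also be installed with pip install xgboost):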

python setup.py install

Download: https://github.com/dmlc/xgboost (on Windows, the source must be compiled before installation)

Usage:

1. Loading data

Data format example

Load the data as follows:

        import xgboost as xgb

        dtrain = xgb.DMatrix('train.txt')  # training data
        dtest = xgb.DMatrix('test.txt')    # test data
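As a rough illustration of what such a text file may look like, assuming it uses the LibSVM format that `xgb.DMatrix` can read from text (one sample per line, `label index:value index:value ...`), the sketch below writes a tiny file and parses it back in plain Python. The file name `toy_train.txt`, the sample values, and the helper `parse_libsvm_line` are made up for illustration and are not part of XGBoost.

```python
import os
import tempfile

# Three toy rows: "<label> <feature_index>:<value> ..." (sparse: absent indices are missing)
SAMPLE = "1 0:1.5 2:0.3\n0 1:2.0\n1 0:0.7 1:0.1 2:4.2\n"


def parse_libsvm_line(line):
    """Split one LibSVM-style line into (label, {feature_index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


# Write the toy file to a temporary directory (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "toy_train.txt")
with open(path, "w") as f:
    f.write(SAMPLE)

# Read it back and parse every non-empty line.
with open(path) as f:
    rows = [parse_libsvm_line(line) for line in f if line.strip()]

for label, features in rows:
    print(label, features)
```

With xgboost installed, a file like this could then be passed to `xgb.DMatrix(path)`; recent releases may expect the format to be spelled out, for example `path + "?format=libsvm"`.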

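2. Setting parameters

        param = {'booster': 'gbtree', 'max_depth': 10, 'eta': 0.3, 'silent': 1, 'num_class': 2, 'objective': 'multi:softprob'}
        watchlist = [(dtest, 'test'), (dtrain, 'train')]  # datasets evaluated during training

Set and tune the parameters, and specify the validation datasets.

Parameter descriptions: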

Parameter for Tree Booster

eta [default=0.3]
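- Step size shrinkage used in the update to prevent overfitting. After each boosting step, the weights of the new features can be obtained directly; eta shrinks these weights to make the boosting process more conservative.
- range: [0,1]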
gamma [default=0]
- minimum loss reduction required to make a further partition on a leaf node of the tree. the larger, the more conservative the algorithm will be.
- range: [0,∞]
max_depth [default=6]
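- Maximum depth of a tree.
- range: [1,∞]
min_child_weight [default=1]
- Minimum sum of instance weight needed in a child node. If a split would produce a leaf whose sum of instance weight is less than min_child_weight, the tree stops splitting at that node. In linear regression mode, this simply corresponds to the minimum number of instances needed in each node. The larger the value, the more conservative the algorithm will be.
- range: [0,∞]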
max_delta_step [default=0]
- Maximum delta step we allow each tree’s weight estimation to be. If the value is set to 0, it means there is no constraint. If it is set to a positive value, it can help making the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when class is extremely imbalanced. Set it to value of 1-10 might help control the update
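- range: [0,∞]
subsample [default=1]
- Fraction of the training samples used to build each tree. Setting it to 0.5 means that XGBoost randomly draws 50% of the samples from the whole training set to grow trees, which helps prevent overfitting.
- range: (0,1]
colsample_bytree [default=1]
- Fraction of features (columns) sampled when constructing each tree.
- range: (0,1]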
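To make the roles of eta, gamma, and min_child_weight more concrete, here is a minimal sketch of how a single candidate split could be scored with the standard second-order gain formula used in gradient tree boosting. It is a simplified illustration under stated assumptions, not XGBoost's actual implementation: the function names `leaf_weight`, `score`, and `evaluate_split`, the L2 term `reg_lambda`, and the gradient and hessian sums are all hypothetical.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0, eta=0.3):
    """Leaf value -G / (H + lambda), shrunk by the step size eta."""
    return -eta * grad_sum / (hess_sum + reg_lambda)


def score(grad_sum, hess_sum, reg_lambda):
    """Structure score G^2 / (H + lambda) of one node."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)


def evaluate_split(g_left, h_left, g_right, h_right,
                   gamma=0.0, min_child_weight=1.0, reg_lambda=1.0):
    """Return the gain of a candidate split, or None if it is rejected."""
    # min_child_weight: each child must keep enough total hessian (instance weight).
    if h_left < min_child_weight or h_right < min_child_weight:
        return None
    gain = 0.5 * (score(g_left, h_left, reg_lambda)
                  + score(g_right, h_right, reg_lambda)
                  - score(g_left + g_right, h_left + h_right, reg_lambda)) - gamma
    # gamma: the loss reduction must exceed this threshold for the split to be kept.
    return gain if gain > 0 else None


# Made-up gradient/hessian sums for the left and right child.
print(evaluate_split(-4.0, 3.0, 5.0, 4.0))                        # split accepted
print(evaluate_split(-4.0, 3.0, 5.0, 4.0, gamma=5.0))             # rejected by gamma
print(evaluate_split(-4.0, 3.0, 5.0, 4.0, min_child_weight=3.5))  # rejected by min_child_weight
print(leaf_weight(-4.0, 3.0, eta=0.3), leaf_weight(-4.0, 3.0, eta=1.0))
```

Raising gamma or min_child_weight rejects more splits, and lowering eta shrinks each leaf's contribution, which is why all three make training more conservative.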

Parameter for Linear Booster

lambda [default=0]
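- L2 regularization penalty coefficient on the weights
alpha [default=0]
- L1 regularization penalty coefficient on the weights
lambda_bias
- L2 regularization on the bias; the default is 0 (there is no L1 regularization on the bias because it is not important)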

Task Parameters

objective [ default=reg:linear ]
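- Defines the learning task and the corresponding learning objective. The available objective functions are:
- “reg:linear” – linear regression.
- “reg:logistic” – logistic regression.
- “binary:logistic” – logistic regression for binary classification; the output is a probability.
- “binary:logitraw” – logistic regression for binary classification; the output is the raw score wTx before the logistic transformation.
- “count:poisson” – Poisson regression for count data; the output is the mean of the Poisson distribution.
- In Poisson regression, max_delta_step defaults to 0.7 (used to safeguard optimization).
- “multi:softmax” – makes XGBoost handle multiclass classification with the softmax objective; the parameter num_class (number of classes) must also be set.
- “multi:softprob” – same as softmax, but the output is a vector of ndata * nclass values, which can be reshaped into a matrix with ndata rows and nclass columns. Each row contains the predicted probability of the sample belonging to each class.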
- “rank:pairwise” –set XGBoost to do ranking task by minimizing the pairwise loss
base_score [ default=0.5 ]
- the initial prediction score of all instances, global bias
eval_metric [ default according to objective ]
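- Evaluation metric for the validation data. A default metric is assigned according to the objective (rmse for regression, error for classification, mean average precision for ranking).
- Users can add multiple evaluation metrics. Python users should pass the parameters as a list of pairs rather than a map, so that a later ’eval_metric’ does not override an earlier one.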
- The choices are listed below:
- “rmse”: root mean square error
- “logloss”: negative log-likelihood
- “error”: Binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
- “merror”: Multiclass classification error rate. It is calculated as #(wrong cases)/#(all cases).
- “mlogloss”: Multiclass logloss
- “auc”: Area under the curve for ranking evaluation.
- “ndcg”:Normalized Discounted Cumulative Gain
- “map”:Mean average precision
- “ndcg@n”, “map@n”: n can be assigned as an integer to cut off the top positions in the lists for evaluation.
- “ndcg-”, “map-”, “ndcg@n-”, “map@n-”: In XGBoost, NDCG and MAP will evaluate the score of a list without any positive samples as 1. By adding “-” in the evaluation metric XGBoost will evaluate these score as 0 to be consistent under some conditions.
seed [ default=0 ]
- Random number seed.
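The note on eval_metric above can be illustrated in plain Python: a dict holds only one value per key, so a second 'eval_metric' entry replaces the first, whereas a list of (name, value) pairs keeps both. The parameter values and variable names below are illustrative only.

```python
param = {"objective": "binary:logistic", "eta": 0.3}

# Setting the same key twice in a dict keeps only the last value.
as_dict = dict(param)
as_dict["eval_metric"] = "auc"
as_dict["eval_metric"] = "logloss"
print(as_dict["eval_metric"])  # prints: logloss

# A list of pairs preserves every metric, in the order given.
as_pairs = list(param.items()) + [("eval_metric", "auc"), ("eval_metric", "logloss")]
metrics_requested = [value for name, value in as_pairs if name == "eval_metric"]
print(metrics_requested)  # prints: ['auc', 'logloss']
```

In the XGBoost versions this article targets, a list like `as_pairs` is expected to be accepted by `xgb.train` in place of the parameter dict; check the documentation of the installed version before relying on it.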

Console Parameters

The following parameters are only used in the console version of xgboost
* use_buffer [ default=1 ]
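- Whether to create a binary buffer file for the input; the buffer file speeds up computation.
* num_round
- Number of boosting rounds.
* data
- Path to the training data
* test:data
- Path to the test data
* save_period [default=0]
- The period at which the model is saved. For example, save_period=10 means XGBoost saves the model every 10 rounds; setting it to 0 means no intermediate model is saved during training.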
* task [default=train] options: train, pred, eval, dump
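- train: train a model
- pred: make predictions on the test data
- eval: evaluate the metrics on the data specified by eval[name]=filename
- dump: dump the learned model into text format
* model_in [default=NULL]
- Path to the input model; needed for the test, eval, and dump tasks. If it is specified in training, XGBoost continues training from the input model.
* model_out [default=NULL]
- Path where the model is saved after training finishes. If not specified, the output is named like 0003.model, where 0003 is the number of boosting rounds.
* model_dir [default=models]
- Directory in which the output models are saved.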
* fmap
- feature map, used for dump model
* name_dump [default=dump.txt]
- name of model dump file
* name_pred [default=pred.txt]
- Name of the prediction output file
* pred_margin [default=0]
- Output the untransformed margin instead of the transformed probability
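As a sketch of how the console parameters fit together, the snippet below generates a small configuration file. It assumes the console version reads plain `name = value` lines from a file passed as its first argument, in the style of the demo configurations shipped with the XGBoost source; the file name `toy.conf`, the data file names, and all values are placeholders.

```python
import os
import tempfile

# (name, value) pairs; string values carry their own quotes in the file.
settings = [
    ("booster", "gbtree"),
    ("objective", "binary:logistic"),
    ("eta", 0.3),
    ("max_depth", 6),
    ("num_round", 10),
    ("save_period", 0),
    ("data", '"train.txt"'),
    ("eval[test]", '"test.txt"'),
    ("test:data", '"test.txt"'),
]

config_text = "\n".join("{} = {}".format(name, value) for name, value in settings) + "\n"

# Write the configuration to a temporary location (hypothetical file name).
conf_path = os.path.join(tempfile.mkdtemp(), "toy.conf")
with open(conf_path, "w") as f:
    f.write(config_text)

print(config_text)
```

Under the same assumption, the console binary would then be started with the path to this file, and individual settings such as task=pred or model_in could be overridden on the command line.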

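3. Training the model

        num_round = 10  # number of boosting rounds (illustrative value; tune for your data)
        bst = xgb.train(param, dtrain, num_round, evals=watchlist)
        preds = bst.predict(dtest)  # per-class probabilities under multi:softprob

Train the model and make predictions.

        from sklearn import metrics

        labels = dtest.get_label()  # true labels stored in the DMatrix
        pred_labels = preds.reshape(len(labels), -1).argmax(axis=1)  # most probable class
        print(metrics.accuracy_score(labels, pred_labels))
        print(metrics.precision_score(labels, pred_labels))
        print(metrics.recall_score(labels, pred_labels))

Print the evaluation metrics.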
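With the multi:softprob objective, the prediction may come back as a flat vector of length ndata * nclass, which has to be reshaped before it can be compared with the labels. The following sketch uses made-up probabilities in place of `bst.predict(dtest)` so that it runs without a trained model; the arrays `flat_probs` and `labels` are illustrative data only, and the metrics are computed by hand with NumPy rather than with a metrics library.

```python
import numpy as np

nclass = 2
# Stand-in for bst.predict(dtest): 4 samples x 2 classes, flattened row by row.
flat_probs = np.array([0.9, 0.1, 0.2, 0.8, 0.6, 0.4, 0.3, 0.7])
labels = np.array([0, 1, 1, 1])

probs = flat_probs.reshape(-1, nclass)  # ndata rows, nclass columns
pred_labels = probs.argmax(axis=1)      # most probable class for each row

# Accuracy, plus precision and recall for class 1.
accuracy = float((pred_labels == labels).mean())
true_pos = int(((pred_labels == 1) & (labels == 1)).sum())
precision = true_pos / int((pred_labels == 1).sum())
recall = true_pos / int((labels == 1).sum())

print(pred_labels, accuracy, precision, recall)
```

Skipping the argmax step and passing raw probabilities to an accuracy function is a common source of errors with this objective.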

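4. Saving and loading the model

Save the model

        bst.save_model('0001.model')

Load the model

        bst = xgb.Booster()  # create an empty booster to load into
        bst.load_model('0001.model')  # load the saved model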
