XGBoost Distributed Deployment Tutorial
XGBoost is an excellent open-source tool for gradient boosting. Thanks to a number of numerical and non-numerical optimizations (see "XGBoost: A Scalable Tree Boosting System"), it is remarkably fast. In my tests, a GBDT (gradient boosting decision tree) model that took Spark 10 hours to train on a given volume of data took XGBoost only 10 minutes, using half of the cluster resources. For various reasons, I spent one month deploying XGBoost in a Hadoop environment, during which I raised many questions in the XGBoost issue tracker and found bugs for the author. I am writing this deployment tutorial for peers who need it.
Note: This tutorial covers deploying XGBoost when the cluster's GCC version is below 4.8 and the native libhdfs object code is not available, so it should cover the overwhelming majority of problems encountered during deployment. Because the latest code has not been tested, this tutorial uses a specific version of the code. It places all the files XGBoost depends on at run time into the xgboost-packages directory, so redeploying only requires running scp -r xgboost-packages to the ${HOME} directory.
Get a specific version of XGBoost
Git clone dmlc-core, rabit, and xgboost from GitHub:
git clone --recursive https://github.com/dmlc/xgboost
Enter the xgboost directory and check out version 76c320e9f0db7cf4aed73593ddcb4e0be0673810:
git checkout 76c320e9f0db7cf4aed73593ddcb4e0be0673810
Enter the dmlc-core directory and check out version 706f4d477a48fc75cb46b226ea007fbac862f9c2:
git checkout 706f4d477a48fc75cb46b226ea007fbac862f9c2
Enter the rabit directory and check out version 112d866dc92354304c0891500374fe40cdf13a50:
git checkout 112d866dc92354304c0891500374fe40cdf13a50
Create the xgboost-packages directory in ${HOME} and copy xgboost into it:
mkdir ${HOME}/xgboost-packages
cp -r xgboost ${HOME}/xgboost-packages/
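Since the whole tutorial depends on the three pinned commits, it can help to confirm that each checkout landed where expected. The following is a minimal sketch; check_pin is a hypothetical helper name, not part of XGBoost or git, and it only compares two hash strings that the caller supplies.

```shell
# check_pin NAME EXPECTED ACTUAL
# Hypothetical helper: reports whether a repository's current commit
# (ACTUAL) matches the commit hash pinned in this tutorial (EXPECTED).
check_pin() {
    local name expected actual
    name=$1; expected=$2; actual=$3
    if [ "$actual" = "$expected" ]; then
        echo "$name: OK"
    else
        echo "$name: expected $expected, got $actual"
        return 1
    fi
}
```

One possible use, run from the directory that contains the xgboost checkout: check_pin xgboost 76c320e9f0db7cf4aed73593ddcb4e0be0673810 "$(cd xgboost && git rev-parse HEAD)". The same call can be repeated for dmlc-core and rabit with their hashes.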
Installing build dependencies
Installing gcc-4.8.2
Download the GCC source package and unpack it:
tar -jxvf gcc-4.8.2.tar.bz2
Download the dependent libraries required for compilation:
cd gcc-4.8.2
./contrib/download_prerequisites
# Create a directory to hold the build output
cd ..
Create the build output directory:
mkdir gcc-build-4.8.2
Enter this directory and run the following command to generate the Makefile (GCC will be installed into the ${HOME} directory):
cd gcc-build-4.8.2
../gcc-4.8.2/configure --enable-checking=release --enable-languages=c,c++ --disable-multilib --prefix=${HOME}
Compile:
make -j21
Install:
make install
Modify the PATH variable to switch the default GCC version:
PATH=$HOME/bin:$PATH
cp -r ~/lib64 ~/xgboost-packages
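The GCC build above is only needed when the cluster's compiler is older than 4.8, so a quick version comparison before starting may save time. This is a minimal sketch; version_lt and needs_gcc_build are hypothetical helper names, and the comparison assumes purely numeric dotted versions such as the output of gcc -dumpversion.

```shell
# version_lt A B: succeeds if dotted numeric version A is strictly lower than B.
# Missing components are treated as 0, so 4.8 and 4.8.0 compare as equal.
version_lt() {
    local left right x y
    left=$1; right=$2
    while [ -n "$left" ] || [ -n "$right" ]; do
        x=${left%%.*}; y=${right%%.*}
        x=${x:-0}; y=${y:-0}
        if [ "$x" -lt "$y" ]; then return 0; fi
        if [ "$x" -gt "$y" ]; then return 1; fi
        # Drop the component that was just compared.
        case $left in *.*) left=${left#*.} ;; *) left= ;; esac
        case $right in *.*) right=${right#*.} ;; *) right= ;; esac
    done
    return 1
}

# needs_gcc_build VERSION: succeeds if VERSION is below the 4.8 threshold
# that this tutorial names as the reason for building GCC from source.
needs_gcc_build() {
    version_lt "$1" 4.8
}
```

For example, needs_gcc_build "$(gcc -dumpversion)" && echo "build gcc-4.8.2 first" prints the reminder only on clusters with an older compiler.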
Installing CMake
Download cmake-3.5.2 and install it:
tar -zxf cmake-3.5.2.tar.gz
cd cmake-3.5.2
./bootstrap --prefix=${HOME}
gmake
make -j21
make install
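Both GCC and CMake are installed under the ${HOME} prefix here, so the new binaries only take effect if $HOME/bin is searched before the system directories. A minimal sketch of that check follows; path_starts_with is a hypothetical helper name and it inspects only the first entry of a PATH-style string.

```shell
# path_starts_with DIR PATH_VALUE
# Succeeds if DIR is the first entry of the colon-separated PATH_VALUE.
# A single trailing slash on either side is ignored.
path_starts_with() {
    local dir first
    dir=${1%/}
    first=${2%%:*}
    [ "${first%/}" = "$dir" ]
}
```

A possible use: path_starts_with "$HOME/bin" "$PATH" || echo "put \$HOME/bin first on PATH".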
Download and compile libhdfs
Download hadoop-common-cdh5-2.6.0_5.5.0 and compile it:
unzip hadoop-common-cdh5-2.6.0_5.5.0.zip
cd hadoop-common-cdh5-2.6.0_5.5.0/hadoop-hdfs-project/hadoop-hdfs/src
cmake -DGENERATED_JAVAH=/opt/jdk1.8.0_60 -DJAVA_HOME=/opt/jdk1.8.0_60
make
# Copy the compiled library files into xgboost-packages
cp -r target/usr/local/lib ${HOME}/xgboost-packages/libhdfs
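The test script later in this tutorial ships libhdfs.so.0.0.0 from the xgboost-packages/libhdfs directory, so it is worth confirming that the copy step produced that file. A minimal sketch, where has_libhdfs is a hypothetical helper name:

```shell
# has_libhdfs DIR
# Succeeds if DIR contains the libhdfs shared object that the
# test script in this tutorial passes to the YARN job.
has_libhdfs() {
    [ -f "$1/libhdfs.so.0.0.0" ]
}
```

For example: has_libhdfs "${HOME}/xgboost-packages/libhdfs" || echo "libhdfs.so.0.0.0 is missing".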
Installing XGBoost
Compile:
cd ${HOME}/xgboost-packages/xgboost
cp make/config.mk ./
# Edit config.mk to enable the HDFS configuration
# whether to use HDFS support during compile
USE_HDFS = 1
HADOOP_HOME = /usr/lib/hadoop
HDFS_LIB_PATH = $(HOME)/xgboost-packages/libhdfs
# Compile
make -j22
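When redeploying on several clusters, the config.mk edits above can be scripted instead of made by hand. This is a minimal sketch under stated assumptions: set_make_var is a hypothetical helper, not part of XGBoost, and it assumes each setting sits on its own uncommented line of the form NAME = value.

```shell
# set_make_var FILE NAME VALUE
# Rewrites every uncommented "NAME = ..." line in FILE to "NAME = VALUE",
# or appends the assignment if no such line exists.
set_make_var() {
    local file key value tmp
    file=$1; key=$2; value=$3
    tmp=$(mktemp) || return 1
    awk -v k="$key" -v v="$value" '
        $0 ~ ("^[ \t]*" k "[ \t]*=") { print k " = " v; found = 1; next }
        { print }
        END { if (!found) print k " = " v }
    ' "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Write back through the original file so its permissions are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

A possible use after copying make/config.mk: set_make_var config.mk USE_HDFS 1, followed by set_make_var config.mk HDFS_LIB_PATH '$(HOME)/xgboost-packages/libhdfs'. The single quotes keep $(HOME) as a make variable rather than letting the shell expand it.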
Modify part of the code (not needed if the environment's default Python version is 2.7 or later)
# Change the first line of dmlc_yarn.py
#!/usr/bin/python2.7
# Change the first line of run_hdfs_prog.py
#!/usr/bin/python2.7
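The two shebang changes above can also be applied with a small script, which is convenient when the checkout is recreated. A minimal sketch; set_shebang is a hypothetical helper name, and it deliberately refuses to touch a file whose first line is not already a shebang.

```shell
# set_shebang FILE INTERPRETER
# Replaces the first line of FILE with "#!INTERPRETER".
# Fails without changing FILE if the first line is not a shebang.
set_shebang() {
    local file interpreter tmp first
    file=$1; interpreter=$2
    IFS= read -r first < "$file" || [ -n "$first" ] || return 1
    case $first in
        '#!'*) ;;
        *) return 1 ;;
    esac
    tmp=$(mktemp) || return 1
    { printf '#!%s\n' "$interpreter"; tail -n +2 "$file"; } > "$tmp"
    # Write back through the original file so it stays executable.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

For example: set_shebang dmlc_yarn.py /usr/bin/python2.7, run from the directory that holds the script, and likewise for run_hdfs_prog.py.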
Test
# Add necessary parameters
cd ${HOME}/xgboost-packages/xgboost/demo/distributed-training
echo -e "booster = gbtree\nobjective = binary:logistic\nsave_period = 0\neval_train = 1" > mushroom.hadoop.conf
# Test code run_yarn.sh
#!/bin/bash
if [ "$#" -lt 2 ];
then
    echo "Usage: <nworkers> <nthreads>"
    exit -1
fi
# put the local training file to HDFS
DATA_DIR="/user/`whoami`/xgboost-dist-test"
#hadoop fs -test -d ${DATA_DIR} && hadoop fs -rm -r ${DATA_DIR}
#hadoop fs -mkdir ${DATA_DIR}
#hadoop fs -put ../data/agaricus.txt.train ${DATA_DIR}
#hadoop fs -put ../data/agaricus.txt.test ${DATA_DIR}
# necessary env
export LD_LIBRARY_PATH=${HOME}/xgboost-packages/lib64:$JAVA_HOME/jre/lib/amd64/server:/${HOME}/xgboost-packages/libhdfs:$LD_LIBRARY_PATH
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=/usr/lib/hadoop-hdfs
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-yarn
export HADOOP_YARN_HOME=$HADOOP_MAPRED_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
# running rabit, pass address in HDFS
../../dmlc-core/tracker/dmlc_yarn.py -n $1 --vcores $2 \
    --ship-libcxx ${HOME}/xgboost-packages/lib64 \
    -q root.machineLearning \
    -f ${HOME}/xgboost-packages/libhdfs/libhdfs.so.0.0.0 \
    ../../xgboost mushroom.hadoop.conf nthread=$2 \
    data=hdfs://ss-hadoop${DATA_DIR}/agaricus.txt.train \
    eval[test]=hdfs://ss-hadoop${DATA_DIR}/agaricus.txt.test \
    eta=1.0 \
    max_depth=3 \
    num_round=3 \
    model_out=hdfs://ss-hadoop/tmp/mushroom.final.model
# get the final model file
hadoop fs -get /tmp/mushroom.final.model final.model
# use dmlc-core/yarn/run_hdfs_prog.py to set up the appropriate env
# output prediction task=pred
#../../xgboost.dmlc mushroom.hadoop.conf task=pred model_in=final.model test:data=../data/agaricus.txt.test
#../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=pred model_in=final.model test:data=../data/agaricus.txt.test
# print the boosters of final.model in dump.raw.txt
#../../xgboost.dmlc mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
#../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
# use the feature map in printing for better visualization
#../../xgboost.dmlc mushroom.hadoop.conf task=dump model_in=final.model fmap=../data/featmap.txt name_dump=dump.nice.txt
../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=dump model_in=final.model fmap=../data/featmap.txt name_dump=dump.nice.txt
cat dump.nice.txt
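The test script relies on several exported environment variables, and a job submitted with one of them missing tends to fail late and with an unhelpful message. The sketch below checks them up front; missing_env is a hypothetical helper name, and it relies on printenv, so it only sees variables that have actually been exported.

```shell
# missing_env NAME...
# Prints the names of variables that are unset, empty, or not exported,
# separated by spaces, and fails if there is at least one.
missing_env() {
    local name missing
    missing=
    for name in "$@"; do
        if [ -z "$(printenv "$name")" ]; then
            missing="${missing:+$missing }$name"
        fi
    done
    if [ -z "$missing" ]; then
        return 0
    fi
    echo "$missing"
    return 1
}
```

A possible use just before the dmlc_yarn.py call: missing_env JAVA_HOME HADOOP_HOME HADOOP_HDFS_HOME HADOOP_YARN_HOME HADOOP_CONF_DIR || exit 1.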
Run results
References
https://github.com/dmlc/xgboost
https://github.com/dmlc/xgboost/issues/854
https://github.com/dmlc/xgboost/issues/856
https://github.com/dmlc/xgboost/issues/861
https://github.com/dmlc/xgboost/issues/866
https://github.com/dmlc/xgboost/issues/869
https://github.com/dmlc/xgboost/issues/1150
http://arxiv.org/pdf/1603.02754v1.pdf