XGBoost Distributed Deployment Tutorial


XGBoost is an excellent open-source tool for gradient boosting learning. Thanks to a series of numerical and non-numerical optimizations (see XGBoost: A Scalable Tree Boosting System), its speed is striking: in one test, a GBDT (gradient boosting decision tree) model over a dataset that took Spark 10 hours to train took XGBoost only 10 minutes using half of the cluster's resources. For various reasons, I spent a month deploying XGBoost in a Hadoop environment, during which I raised many questions on the XGBoost issue tracker and reported bugs to the authors. This tutorial writes up the deployment steps for others with the same need.
Note: this tutorial deploys XGBoost on a cluster where gcc is older than 4.8 and the libhdfs native object code is unavailable, which covers the overwhelming majority of problems encountered during deployment. Because the current head of the code is untested, this tutorial pins specific versions. All files XGBoost depends on are collected into an xgboost-packages directory, so redeploying to another machine only requires copying xgboost-packages with scp into that machine's ${HOME} directory, as sketched below.
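For example, a minimal redeploy might look like this (user@new-node is a hypothetical target host; substitute your own):

  # copy the self-contained dependency directory to a new machine's ${HOME}
  scp -r ${HOME}/xgboost-packages user@new-node:~/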
Get a specific version of XGBoost

Clone dmlc-core, rabit, and xgboost from GitHub:

  git clone --recursive https://github.com/dmlc/xgboost

Enter the xgboost directory and check out the pinned version:

  git checkout 76c320e9f0db7cf4aed73593ddcb4e0be0673810

Enter the dmlc-core directory and check out the pinned version:

  git checkout 706f4d477a48fc75cb46b226ea007fbac862f9c2

Enter the rabit directory and check out the pinned version:

  git checkout 112d866dc92354304c0891500374fe40cdf13a50

Create an xgboost-packages directory in ${HOME} and copy xgboost into it:

  mkdir xgboost-packages
  cp -r xgboost xgboost-packages/
Installing build dependencies

Installing gcc-4.8.2

Download the GCC source package and unpack it:

  tar -jxvf gcc-4.8.2.tar.bz2

Download the libraries required for the build:

  cd gcc-4.8.2
  ./contrib/download_prerequisites
  # return to the parent directory
  cd ..
Create a separate directory to hold the build output:

  mkdir gcc-build-4.8.2

Enter it and generate the Makefile (GCC will be installed into the ${HOME} directory):

  cd gcc-build-4.8.2
  ../gcc-4.8.2/configure --enable-checking=release --enable-languages=c,c++ --disable-multilib --prefix=${HOME}
Compile:

  make -j21

Install:

  make install

Switch the default GCC by putting the new binaries first on PATH, and copy the new runtime libraries into xgboost-packages:

  PATH=$HOME/bin:$PATH
  cp -r ~/lib64 ~/xgboost-packages
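A quick sanity check that the shell now picks up the freshly built compiler (assuming the configure step above used --prefix=${HOME}):

  # the new gcc lives in $HOME/bin and should report 4.8.2
  which gcc
  gcc --version
  # persist the PATH change across logins if desired
  echo 'export PATH=$HOME/bin:$PATH' >> ~/.bashrc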
Installing CMake

Download cmake-3.5.2 and install it:

  tar -zxf cmake-3.5.2.tar.gz
  cd cmake-3.5.2
  ./bootstrap --prefix=${HOME}
  gmake
  make -j21
  make install
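As with GCC, you can verify the result (assuming $HOME/bin precedes the system directories on PATH):

  # should report version 3.5.2
  cmake --version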
Download and compile libhdfs

Download hadoop-common-cdh5-2.6.0_5.5.0 and compile it:

  unzip hadoop-common-cdh5-2.6.0_5.5.0.zip
  cd hadoop-common-cdh5-2.6.0_5.5.0/hadoop-hdfs-project/hadoop-hdfs/src
  cmake -DGENERATED_JAVAH=/opt/jdk1.8.0_60 -DJAVA_HOME=/opt/jdk1.8.0_60
  make
  # copy the compiled target files into xgboost-packages
  cp -r target/usr/local/lib ${HOME}/xgboost-packages/libhdfs
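Before moving on, it is worth confirming the native library actually landed in xgboost-packages (the run_yarn.sh script below expects libhdfs.so.0.0.0 there):

  # expect libhdfs.so, libhdfs.so.0.0.0 and related files
  ls -l ${HOME}/xgboost-packages/libhdfs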
Installing XGBoost

Prepare the build configuration:

  cd ${HOME}/xgboost-packages/xgboost
  cp make/config.mk ./

Edit config.mk to enable the HDFS configuration:

  # whether to use HDFS support during compile
  USE_HDFS = 1
  HADOOP_HOME = /usr/lib/hadoop
  HDFS_LIB_PATH = $(HOME)/xgboost-packages/libhdfs

Compile:

  make -j22
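If you prefer to script the edit, a sketch along these lines works; the defaults in your checkout may differ (some of these lines may be commented out), so adjust the patterns if a substitution does not match:

  # enable HDFS support in config.mk non-interactively
  sed -i 's|^USE_HDFS *=.*|USE_HDFS = 1|' config.mk
  sed -i 's|^HADOOP_HOME *=.*|HADOOP_HOME = /usr/lib/hadoop|' config.mk
  sed -i 's|^HDFS_LIB_PATH *=.*|HDFS_LIB_PATH = $(HOME)/xgboost-packages/libhdfs|' config.mk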
Modify part of the code (no change is needed if the environment's default python is already version 2.7 or newer):

  # change the first line of dmlc_yarn.py to
  #!/usr/bin/python2.7
  # change the first line of run_hdfs_prog.py to
  #!/usr/bin/python2.7
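The same change as a one-liner, run from the xgboost repository root (the script paths match the ones run_yarn.sh uses below):

  # rewrite the shebang of both tracker scripts to force python2.7
  sed -i '1s|.*|#!/usr/bin/python2.7|' \
      dmlc-core/tracker/dmlc_yarn.py \
      dmlc-core/yarn/run_hdfs_prog.py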
Test
  cd ${HOME}/xgboost-packages/xgboost/demo/distributed-training
  # add the necessary parameters
  echo -e "booster = gbtree\nobjective = binary:logistic\nsave_period = 0\neval_train = 1" > mushroom.hadoop.conf

Test script run_yarn.sh:

  #!/bin/bash
  if [ "$#" -lt 2 ]; then
      echo "Usage: <nworkers> <nthreads>"
      exit -1
  fi

  # put the local training files onto HDFS
  DATA_DIR="/user/`whoami`/xgboost-dist-test"
  #hadoop fs -test -d ${DATA_DIR} && hadoop fs -rm -r ${DATA_DIR}
  #hadoop fs -mkdir ${DATA_DIR}
  #hadoop fs -put ../data/agaricus.txt.train ${DATA_DIR}
  #hadoop fs -put ../data/agaricus.txt.test ${DATA_DIR}

  # necessary env
  export LD_LIBRARY_PATH=${HOME}/xgboost-packages/lib64:$JAVA_HOME/jre/lib/amd64/server:${HOME}/xgboost-packages/libhdfs:$LD_LIBRARY_PATH
  export HADOOP_HOME=/usr/lib/hadoop
  export HADOOP_COMMON_HOME=$HADOOP_HOME
  export HADOOP_HDFS_HOME=/usr/lib/hadoop-hdfs
  export HADOOP_MAPRED_HOME=/usr/lib/hadoop-yarn
  export HADOOP_YARN_HOME=$HADOOP_MAPRED_HOME
  export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

  # run rabit, passing addresses on HDFS
  ../../dmlc-core/tracker/dmlc_yarn.py -n $1 --vcores $2 \
      --ship-libcxx ${HOME}/xgboost-packages/lib64 \
      -q root.machinelearning \
      -f ${HOME}/xgboost-packages/libhdfs/libhdfs.so.0.0.0 \
      ../../xgboost mushroom.hadoop.conf nthread=$2 \
      data=hdfs://ss-hadoop${DATA_DIR}/agaricus.txt.train \
      eval[test]=hdfs://ss-hadoop${DATA_DIR}/agaricus.txt.test \
      eta=1.0 \
      max_depth=3 \
      num_round=3 \
      model_out=hdfs://ss-hadoop/tmp/mushroom.final.model

  # get the final model file
  hadoop fs -get /tmp/mushroom.final.model final.model

  # use dmlc-core/yarn/run_hdfs_prog.py to set up the appropriate env
  # output prediction, task=pred
  #../../xgboost.dmlc mushroom.hadoop.conf task=pred model_in=final.model test:data=../data/agaricus.txt.test
  ../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=pred model_in=final.model test:data=../data/agaricus.txt.test

  # print the boosters of final.model into dump.raw.txt
  #../../xgboost.dmlc mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
  ../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt

  # use the feature map when dumping for better visualization
  #../../xgboost.dmlc mushroom.hadoop.conf task=dump model_in=final.model fmap=../data/featmap.txt name_dump=dump.nice.txt
  ../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=dump model_in=final.model fmap=../data/featmap.txt name_dump=dump.nice.txt

  cat dump.nice.txt
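A hypothetical invocation, requesting 4 workers with 4 threads each (both numbers are placeholders; tune them to your queue's capacity):

  chmod +x run_yarn.sh
  ./run_yarn.sh 4 4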
Run results
References

  https://github.com/dmlc/xgboost
  https://github.com/dmlc/xgboost/issues/854
  https://github.com/dmlc/xgboost/issues/856
  https://github.com/dmlc/xgboost/issues/861
  https://github.com/dmlc/xgboost/issues/866
  https://github.com/dmlc/xgboost/issues/869
  https://github.com/dmlc/xgboost/issues/1150
  http://arxiv.org/pdf/1603.02754v1.pdf
