Deep learning multi-machine multi-GPU solution: Purine


Please do not reprint without permission. Original by zhxfl: http://www.cnblogs.com/zhxfl/p/5287644.html

Contents:

I. Introduction

II. Environment configuration

III. Running the demo

IV. Hardware configuration recommendations

V. Other

I. Introduction

Multi-machine, multi-GPU clusters have become mainstream for deep learning. Compared with Caffe and MXNet, two more actively maintained open-source projects, Purine is better suited to university students as reading material: its code is more concise, the author's C++ skills are strong, and the ideas it adopts are valuable and instructive. Purine is no longer maintained, so it is not suitable for enterprise users, but it still has considerable academic value. It is especially suitable for students researching software architecture, deep learning, or GPU cluster performance optimization for machine learning.

II. Environment configuration

2.1 C++11 support

Purine uses the C++11 standard, so install gcc-4.8 and g++-4.8 on Ubuntu with the commands below. To avoid conflicts between multiple versions, you can uninstall the other versions so that you do not have to manage and switch compiler versions. I chose gcc-4.8 here.

sudo apt-get --yes --force-yes remove gcc-4.6 g++-4.6 gcc-4.7 g++-4.7 gcc-4.9 g++-4.9
sudo apt-get update
sudo apt-get --yes --force-yes install gcc-4.8 g++-4.8
sudo apt-get --yes --force-yes install gfortran
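After installing, it can help to confirm which compiler version is actually picked up. The following is a minimal sketch, not part of Purine: `version_ge` is a hypothetical helper name, and the 4.8 threshold is simply the version chosen above.

```shell
#!/bin/sh
# version_ge A B: succeed when dotted version A is >= B (numeric fields only).
# Hypothetical helper for illustration; not part of Purine or its build.
version_ge() {
    a=$1
    b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        # Take the leading field of each version; a missing field counts as 0.
        fa=${a%%.*}
        fb=${b%%.*}
        [ -n "$fa" ] || fa=0
        [ -n "$fb" ] || fb=0
        [ "$fa" -gt "$fb" ] && return 0
        [ "$fa" -lt "$fb" ] && return 1
        # Drop the field that was just compared.
        case $a in *.*) a=${a#*.} ;; *) a= ;; esac
        case $b in *.*) b=${b#*.} ;; *) b= ;; esac
    done
    return 0
}

# Report on the g++ found in PATH, if any (4.8 is the version assumed above).
if command -v g++ >/dev/null 2>&1; then
    found=$(g++ -dumpversion)
    if version_ge "$found" 4.8; then
        echo "g++ $found: OK"
    else
        echo "g++ $found: older than 4.8"
    fi
else
    echo "g++ not found in PATH"
fi
```

The comparison is done field by field rather than as a string, so that 4.10 is correctly treated as newer than 4.8.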

2.2 Installing CUDA and cuDNN

I am using CUDA 7.0 here; the versions currently known to be supported are 6.5, 7.0, and 7.5. Note that if a different version of CUDA is already installed, you need to uninstall it before installing the new one to avoid conflicts.
Here is how to uninstall CUDA:

sudo perl /usr/local/cuda/bin/uninstall_cuda_7.5.pl

After uninstalling the original version of CUDA, you also need to uninstall the NVIDIA driver installed through Ubuntu, because the CUDA installation package ships with its own driver.

sudo apt-get --yes --force-yes remove nvidia*

Because the desktop environment holds the graphics driver, the CUDA installer cannot install its driver while the desktop is running. Shut down the desktop before running the CUDA installer, which can be downloaded from the official website: https://developer.nvidia.com/cuda-downloads

sudo /etc/init.d/lightdm stop
sudo sh ./cuda_7.5.18_linux.run

Next, configure cuDNN. cuDNN is NVIDIA's officially maintained deep learning acceleration library; it is often the fastest option available, so Purine also relies on it for performance. Downloading cuDNN requires registering and applying first, which is a bit of a hassle; the address is https://developer.nvidia.com/cudnn. I use the cuDNN release for CUDA 7.0: unzip it, then copy the files to the corresponding places in the CUDA installation directory.

sudo tar -xvf ./cudnn-7.0-linux-x64-v3.0-prod.tgz
sudo mv ./cuda/include/* /usr/local/cuda/include/
sudo mv ./cuda/lib64/* /usr/local/cuda/lib64/
sudo rm -rf cuda/
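A quick way to confirm that the copy landed where the build will look is to check for the header and the library. This is only a sketch: `check_cudnn` is a hypothetical helper name, and /usr/local/cuda is assumed to be the CUDA installation directory, as in the commands above.

```shell
#!/bin/sh
# check_cudnn CUDA_ROOT: report whether the cuDNN header and library are in place.
# Hypothetical helper for illustration; it only looks at file names.
check_cudnn() {
    root=$1
    rc=0
    if [ ! -f "$root/include/cudnn.h" ]; then
        echo "missing: $root/include/cudnn.h"
        rc=1
    fi
    # The library file name carries a version suffix, so match by prefix.
    if ! ls "$root"/lib64/libcudnn* >/dev/null 2>&1; then
        echo "missing: $root/lib64/libcudnn*"
        rc=1
    fi
    return $rc
}

# Assumes the default install location used in the commands above.
if check_cudnn /usr/local/cuda; then
    echo "cuDNN files found"
fi
```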

2.3 Installing libuv

libuv is an asynchronous I/O library written in C. It makes it easy to implement logic such as events, thread synchronization, thread waiting, and thread pools. libuv is open source and can be compiled from source: https://github.com/libuv/libuv. I am using the 1.x version.

Install the libraries that libuv depends on first:

sudo apt-get install aptitude
sudo aptitude install libtool automake autoconf autogen

Unzip the downloaded package, then compile and install it:

echo "Install libuv"
rm -rf ./libuv-1.x/
sudo unzip libuv-1.x.zip
cd ./libuv-1.x/
sudo sh autogen.sh
sudo ./configure
sudo make -j4
sudo make install
cd ../
rm -rf ./libuv-1.x/

2.4 Installing CMake

CMake is a cross-platform tool for organizing builds and their dependencies. You can download the source and compile and install it. Version 3.3.2 or later is recommended, partly because that is the version I have verified, and partly because CMake's CUDA support arrived late; avoid versions below 3.3.2 to save yourself wasted effort. CMake: https://cmake.org/

echo "CMake install"
rm -rf ./cmake-3.3.2
sudo tar -xvf ./cmake-3.3.2.tar.gz
cd ./cmake-3.3.2/
./configure
make -j4
sudo make install
cd ../
rm -rf ./cmake-3.3.2

2.5 Installing OpenCV

OpenCV is an image processing library that both Caffe and Purine depend on. Version 3.0 or later is recommended (http://opencv.org/downloads.html); I am using version 3.0 here, compiled from source. It is also recommended to turn off WITH_CUDA and WITH_IPP, because neither is actually used in the Purine project. The commands are as follows:

echo "Install OpenCV"
sudo apt-get remove libopencv-dev
rm -rf opencv-3.0.0/
sudo unzip ./opencv-3.0.0.zip
cd ./opencv-3.0.0/
cmake -D WITH_CUDA=OFF -D WITH_IPP=OFF .
make -j4
sudo make install
cd ../
rm -rf opencv-3.0.0/

2.6 Installing MPI

MPI is a parallel programming interface built mainly on the SPMD (single program, multiple data) model: on a cluster, every node runs the same logic but processes a different data stream. Purine only supports MPICH, so the version I used is MPICH-3.2b4, available at: https://www.mpich.org/2015/07/25/mpich-3-2b4-released/

echo "Install MPICH"
rm -rf ./mpich-3.2b4/
tar -xvf ./mpich-3.2b4.tar.gz
cd ./mpich-3.2b4/
./configure
make -j4
sudo make install
cd ../
rm -rf ./mpich-3.2b4/
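To make the SPMD idea concrete, here is a small sketch in which every process runs the same script but works on a different slice of the data. It is only an illustration and not how Purine itself divides work: `shard_range` is a hypothetical helper, and PMI_RANK and PMI_SIZE are assumed to be the environment variables the MPICH launcher exports to each process. The figure 50000 is the size of the CIFAR-10 training set.

```shell
#!/bin/sh
# shard_range RANK SIZE TOTAL: print "START END" (END exclusive) for this rank's
# contiguous slice of TOTAL samples; the first ranks absorb any remainder.
# Hypothetical helper for illustration; not part of Purine or MPICH.
shard_range() {
    r=$1
    n=$2
    total=$3
    base=$((total / n))
    extra=$((total % n))
    if [ "$r" -lt "$extra" ]; then
        start=$((r * (base + 1)))
        end=$((start + base + 1))
    else
        start=$((r * base + extra))
        end=$((start + base))
    fi
    echo "$start $end"
}

# PMI_RANK / PMI_SIZE are assumed to be set by the MPICH launcher; the defaults
# let the script also run on its own as a single process.
rank=${PMI_RANK:-0}
size=${PMI_SIZE:-1}
echo "rank $rank of $size handles samples: $(shard_range "$rank" "$size" 50000)"
```

Every process executes identical code; only the rank differs, and the rank alone decides which data each process touches.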

2.7 Other dependencies

# sudo apt-get --yes --force-yes install libprotobuf-dev libleveldb-dev libsnappy-dev libboost-all-dev libhdf5-serial-dev libgflags-dev libgoogle-glog-dev liblmdb-dev protobuf-compiler
sudo apt-get --yes --force-yes install libprotobuf-dev libgflags-dev libgoogle-glog-dev liblmdb-dev protobuf-compiler libboost-dev
sudo apt-get --yes --force-yes install libatlas-base-dev libatlas3gf-base
sudo apt-get install libatlas-base-dev libatlas3gf-base

III. Running the demo

3.1 Compiling

First pull the code from git. The original author's repository is at https://github.com/purine/purine2, but that project does not compile as published: bugs in its CMakeLists.txt cause the build to fail. You can instead use my fork, https://github.com/zhxfl/purine2, in which CMakeLists.txt has been changed so that it compiles properly. I have since made many improvements there, but they are all incremental maintenance and do not alter the author's original intent, so you can use the code in my branch with confidence.

git clone https://github.com/zhxfl/purine2

After the code has been downloaded, compile it with the commands below. If the "cmake ." step does not go through, you can contact me at "zhxfl##mail.ustc.edu.cn" (replace "##" with "@"); some dependency may be missing from my description above.

cd purine2/
cmake .
make -j2

3.2 Building the demo database

The CIFAR-10 dataset is used for this run.

~/tmp/purine2/data/cifar-/get_cifar10.sh

Then go to the Purine root directory and run cifar10_data to build the database, as shown below. Because the code uses relative paths, it must be run from the purine2/ directory. After it finishes, two database directories, cifar-10-train-lmdb and cifar-10-test-lmdb, are generated in the CIFAR-10 directory under purine2/data/.

./test/cifar10_data

3.3 Single machine, multiple GPUs

Before running the demo, you need to configure two files. First, add the parallel_config file:

0 0 64
0 1 64

The three fields on each line are: machine number, GPU index, and the batch_size that GPU processes per iteration.
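Because a malformed line here is easy to miss, a small format check can be useful before launching. The sketch below checks only the three-integer layout described above; `validate_parallel_config` is a hypothetical helper and says nothing about how Purine itself parses the file.

```shell
#!/bin/sh
# validate_parallel_config FILE: succeed only if every non-empty line has exactly
# three non-negative integer fields (machine number, GPU index, batch size).
# Hypothetical helper for illustration; not part of Purine.
validate_parallel_config() {
    awk '
        NF == 0 { next }    # skip blank lines
        NF != 3 { bad = 1; print "line " NR ": expected 3 fields"; next }
        {
            for (i = 1; i <= 3; i++)
                if ($i !~ /^[0-9]+$/) {
                    bad = 1
                    print "line " NR ": not an integer: " $i
                }
        }
        END { exit bad + 0 }
    ' "$1"
}

# Example: the two-GPU, single-machine configuration shown above.
cfg=$(mktemp)
printf '0 0 64\n0 1 64\n' > "$cfg"
validate_parallel_config "$cfg" && echo "parallel_config looks well-formed"
rm -f "$cfg"
```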

Next, configure the hostfile file: fill in the IP address of the machine, one address per line. Then use the following command to start the Purine run, which starts two processes:

1 -hostfile hostfile ./test/nin_cifar10

3.4 Multiple machines, multiple GPUs

1. For multi-machine, multi-GPU runs, set up passwordless SSH authentication between the machines.

2. After confirming that passwordless login works between the machines, configure the hosts file on each machine to give every machine a name. For example, suppose we have two nodes, A and B, where A has two GPUs. The parallel_config file is then configured as follows:

0 0 -
0 1 -
1 0 -

hostfile:

A
B

Copy the two configuration files to the purine directory on each machine. Every machine must have a purine directory with identical executables, and because we do not use a distributed file system, every machine must also have its own copy of the data.
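The copy step can be scripted from the hostfile itself. The sketch below only prints the scp commands rather than running them, so they can be reviewed first. `print_sync_commands` is a hypothetical helper, and the destination path ~/purine2 is an assumption; substitute wherever the purine directory lives on your machines.

```shell
#!/bin/sh
# print_sync_commands HOSTFILE DEST: for each host listed (one per line), print
# the scp command that would copy both config files into DEST on that host.
# Hypothetical helper for illustration; it prints commands and copies nothing.
print_sync_commands() {
    hostlist=$1
    dest=$2
    while IFS= read -r host || [ -n "$host" ]; do
        [ -n "$host" ] || continue    # skip blank lines
        echo "scp parallel_config hostfile ${host}:${dest}/"
    done < "$hostlist"
}

# Example with the two nodes A and B; '~/purine2' is an assumed location.
hosts=$(mktemp)
printf 'A\nB\n' > "$hosts"
print_sync_commands "$hosts" '~/purine2'
rm -f "$hosts"
```

Piping the output to sh would perform the copies, which relies on the passwordless SSH login from step 1.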

Next, execute the following command:

2 -hostfile hostfile ./test/nin_cifar10

In addition, all logs are stored in the system's /tmp/ directory.

IV. Hardware configuration recommendations

If it is for academic research rather than commercialization, a cost-effective hardware solution is recommended:

1. Graphics cards: two Titan X cards; the GTX 980 Ti also works.

2. Motherboard: choose one that can take several graphics cards. These generally cost a few thousand yuan, for example the "Gigabyte LGA2011-3 GA-X99".

Let the supplier match a chassis and power supply for you; the whole build can be kept to roughly 20,000 to 30,000 yuan.

V. Other

If you are not studying cluster performance and only want to run experiments, or even want to develop and do research on Windows, you can consider our lab's other open-source projects, CUDA-CNN and CUDA-MCDNN. Most of our attempts at modifying algorithms are made on these two projects.

If you have other questions, you can contact me: zhxfl##mail.ustc.edu.cn (replace "##" with "@")

