Caffe* Training on Multi-node Distributed-memory Systems Based on the Intel® Xeon® Processor E5 Product Family

Deep Neural Network (DNN) training is a computationally intensive task that can take days or weeks to complete on a modern computing platform. In a recent article on single-node Caffe* scoring and training on the Intel® Xeon® processor E5 product family, we demonstrated a 10x performance improvement in the Caffe* framework with the AlexNet topology and reduced single-node training time to 5 days. Intel continues to deliver on the machine learning vision outlined in Pradeep Dubey's blog, and in this technology preview we show how to reduce Caffe training time from days to hours in a multi-node, distributed-memory environment.

This article describes a preview package with limited functionality that is not intended for production use. The features discussed here are now available in Intel® MKL 2017 and in the Intel® fork of Caffe.

Caffe is a deep learning framework developed by the Berkeley Vision and Learning Center (BVLC) and is one of the most popular community frameworks for image recognition. Caffe is often used as a benchmark together with AlexNet*, an image recognition neural network topology, and ImageNet*, a database of labeled images.

The Caffe framework does not support multi-node, distributed-memory systems out of the box, and adapting it to run on distributed-memory systems requires extensive changes. We use the Intel® MPI Library to implement strong scaling of the synchronous minibatch stochastic gradient descent (SGD) algorithm: the computation for a single iteration is spread across multiple nodes, so that the multi-node, multi-threaded parallel implementation is equivalent to the single-node, single-threaded serial implementation.
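
To make the data-parallel case concrete, below is a minimal sketch of synchronous minibatch SGD over MPI. It uses mpi4py and NumPy with a toy least-squares model as a stand-in for a real network; the sizes, seeds, and script name are illustrative assumptions and are not taken from the preview package.

    # Illustrative sketch only (not the preview package): synchronous minibatch
    # SGD with data parallelism. Each rank computes a gradient on its own shard
    # of the minibatch; an allreduce sums the shards so that every rank applies
    # the same averaged update and the weight replicas stay in sync.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, nranks = comm.Get_rank(), comm.Get_size()

    rng = np.random.default_rng(rank)      # each rank holds a different data shard
    X = rng.standard_normal((64, 10))      # this rank's share of the minibatch
    y = X @ np.arange(10.0)                # synthetic targets
    w = np.zeros(10)                       # model weights, replicated on every rank
    lr = 0.01

    for step in range(100):
        grad_local = X.T @ (X @ w - y) / len(y)        # local forward/backward pass
        grad_global = np.empty_like(grad_local)
        comm.Allreduce(grad_local, grad_global, op=MPI.SUM)
        w -= lr * grad_global / nranks                 # identical update on all ranks

    if rank == 0:
        print("final weights:", w)

Launched with, for example, mpirun -n 4 python sync_sgd_sketch.py (the script name is hypothetical), every rank ends with identical weights; the averaged update is what keeps the multi-node run equivalent to a single-node run over the combined minibatch.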

We use three approaches to scale the computation: data parallelism, model parallelism, and hybrid parallelism. In data parallelism, each node holds a full copy of the model and works on its own share of the minibatch, so the nodes communicate weights and weight gradients. Model parallelism instead partitions the model, or weights, across the nodes, so that each node owns a part of the weights and processes all data points in a minibatch; this requires communicating activations and activation gradients rather than weights and weight gradients. Hybrid parallelism combines the two approaches.
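
As a rough illustration of the model-parallel communication pattern (again a sketch with mpi4py and NumPy, using hypothetical layer sizes rather than anything from the preview package), each rank below owns only a column slice of one fully connected layer's weights, applies it to the full minibatch, and then an allgather assembles the complete activation that the next layer would consume; a backward pass would exchange activation gradients in the same fashion.

    # Illustrative sketch only: model parallelism for one fully connected layer.
    # Every rank sees the whole minibatch but owns just a slice of the weights,
    # so activations (not weight gradients) are what cross the network.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, nranks = comm.Get_rank(), comm.Get_size()

    batch, in_dim, cols = 32, 16, 8            # cols = output columns per rank
    data_rng = np.random.default_rng(0)        # same minibatch on every rank
    x = data_rng.standard_normal((batch, in_dim))
    w_rng = np.random.default_rng(100 + rank)  # each rank owns different weights
    w_shard = w_rng.standard_normal((in_dim, cols))

    act_local = x @ w_shard                    # this rank's slice of the activations
    act_slices = np.empty((nranks, batch, cols))
    comm.Allgather(act_local, act_slices)      # exchange activation slices
    act_full = np.concatenate(act_slices, axis=1)   # (batch, nranks*cols) everywhere

    if rank == 0:
        print("assembled activation shape:", act_full.shape)

The contrast with the data-parallel sketch above is what travels between nodes: weight gradients there, activations and activation gradients here.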

Using this more advanced distributed parallelism, we trained AlexNet on the full ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC-2012) data set in only 5 hours on a 64-node cluster of systems based on the Intel® Xeon® processor E5 product family, reaching 80% top-5 accuracy on the data set.

Getting Started

While we work to incorporate the new features described in this article into future versions of the Intel® Math Kernel Library (Intel® MKL) and the Intel® Data Analytics Acceleration Library (Intel® DAAL), you can use the technology preview package that accompanies this article to reproduce the performance results demonstrated here, and even to train AlexNet on your own data sets. The preview includes both single-node and multi-node implementations. Note that the current implementation is limited to the AlexNet topology and may not work with other popular DNN topologies.

The package supports the AlexNet topology and adds the 'intel_alexnet' and 'mpi_intel_alexnet' models. These are the same as 'bvlc_alexnet' with the addition of two new layers, 'intelpack' and 'intelunpack', as well as optimized convolution, pooling, and normalization layers, plus MPI-based implementations of all layers. We also changed the validation parameters to improve vectorization performance: the validation minibatch size was increased from 50 to 256 and the number of test iterations was reduced from 1,000 to 200, so that the number of images used in a validation run stays roughly the same (50 × 1,000 = 50,000 versus 256 × 200 = 51,200). The package contains the 'intel_alexnet' and 'mpi_intel_alexnet' model files in the following folders:

    • models/intel_alexnet/deploy.prototxt
    • models/intel_alexnet/solver.prototxt
    • models/intel_alexnet/train_val.prototxt
    • models/mpi_intel_alexnet/deploy.prototxt
    • models/mpi_intel_alexnet/solver.prototxt
    • models/mpi_intel_alexnet/train_val.prototxt
    • models/mpi_intel_alexnet/train_val_shared_db.prototxt
    • models/mpi_intel_alexnet/train_val_split_db.prototxt

Both the 'intel_alexnet' and 'mpi_intel_alexnet' models are ready for you to train and test with the ILSVRC-2012 data set.

To get started with the package, make sure that all of the regular Caffe dependencies and the Intel software tools listed in the System Requirements and Limitations section are installed on your system.

Running on a single node
    1. Unpack the package.
    2. Specify the paths to the database, snapshot location, and image mean file in the following 'intel_alexnet' model files:
      • models/intel_alexnet/deploy.prototxt
      • models/intel_alexnet/solver.prototxt
      • models/intel_alexnet/train_val.prototxt
    3. Set up the run-time environment for the software tools listed in the System Requirements and Limitations section.
    4. Add the path to ./build/lib/libcaffe.so to the LD_LIBRARY_PATH environment variable.
    5. Set the threading environment as follows:
      $> export OMP_NUM_THREADS=<N_processors * N_cores>
      $> export KMP_AFFINITY=compact,granularity=fine

Note: OMP_NUM_THREADS must be an even number greater than or equal to 2. For example, a two-socket system with 18 physical cores per socket would use OMP_NUM_THREADS=36.

    6. Use this command to run timing on a single node:
      $> ./build/tools/caffe time \
      -iterations <number of iterations> \
      --model=models/intel_alexnet/train_val.prototxt
    7. Use this command to run training on a single node:
      $> ./build/tools/caffe train \
      --solver=models/intel_alexnet/solver.prototxt
Running on a cluster
    1. Unpack the package.
    2. Set up the run-time environment for the software tools listed in the System Requirements and Limitations section.
    3. Add the path to ./build-mpi/lib/libcaffe.so to the LD_LIBRARY_PATH environment variable.
    4. Set the NP environment variable to the number of nodes to use, as follows:

$> export NP=<number-of-mpi-ranks>

Note: Best performance is achieved when using one MPI rank per node.

    5. Create a node file named x${NP}.hosts in the application root directory. For example, on IBM* Platform LSF*, run the following command:

$> cat $PBS_NODEFILE > x${NP}.hosts

    6. Specify the paths to the database, snapshot location, and image mean file in the following 'mpi_intel_alexnet' model files:
      • models/mpi_intel_alexnet/deploy.prototxt
      • models/mpi_intel_alexnet/solver.prototxt
      • models/mpi_intel_alexnet/train_val_shared_db.prototxt

Note: On some system configurations, the performance of the shared-disk system can become a bottleneck. In that case, pre-distributing the image database to the compute nodes is recommended for best performance. Refer to the README file in the package for instructions.

    7. Set the threading environment as follows:

$> export OMP_NUM_THREADS=<N_processors * N_cores>
$> export KMP_AFFINITY=compact,granularity=fine

Note: OMP_NUM_THREADS must be an even number greater than or equal to 2.

    8. Use this command to run timing:
      $> mpirun -nodefile x${NP}.hosts -n $NP -ppn 1 -prepend-rank \
      ./build-mpi/tools/caffe time \
      -iterations <number of iterations> \
      --model=models/mpi_intel_alexnet/train_val.prototxt

    9. Use this command to run training:
      $> mpirun -nodefile x${NP}.hosts -n $NP -ppn 1 -prepend-rank \
      ./build-mpi/tools/caffe train \
      --solver=models/mpi_intel_alexnet/solver.prototxt

System Requirements and Limitations

The preview package has the same software dependencies as the unoptimized (stock) Caffe:

    • boost* 1.53.0
    • opencv* 2.4.9
    • protobuf* 3.0.0-beta1
    • glog* 0.3.4
    • gflags* 2.1.2
    • lmdb* 0.9.16
    • leveldb* 1.18
    • hdf5* 1.8.15
    • Red Hat Enterprise Linux* 6.5 or later

Intel software tools:

    • Intel® MKL version 11.3 or later
    • Intel® MPI Library 5.0

Hardware compatibility:

    • Fourth-generation Intel® Core™ processors (codenamed Haswell)

This software has been validated only with the AlexNet topology and may not work with other configurations.

Support

If you have questions or suggestions about this preview package, please contact: [email protected].
