Large-Scale Distributed Deep Learning on Hadoop Clusters


This article is reproduced from: http://www.csdn.net/article/2015-10-01/2825840


Abstract: Deep learning on Hadoop is an innovative approach to deep learning. It not only matches the effectiveness of dedicated deep learning clusters, but also offers unique advantages, discussed below under enhancing the Hadoop cluster, distributed deep learning, performance testing, and open source resources.

Objective

Over the past 10 years, Yahoo has continuously invested in building and scaling Apache Hadoop clusters, which by now comprise more than 40,000 servers and 600PB of data distributed over 19 clusters. As introduced at the 2015 Hadoop Summit, we developed scalable machine learning algorithms on these clusters for classification, ranking, and computing word embeddings. Hadoop clusters have become the preferred platform for Yahoo's large-scale machine learning.

Deep Learning (DL) is a critical technology for many Yahoo products. At the 2015 RE.WORK Deep Learning Summit, the Yahoo Flickr team (Simon Osindero and Pierre Garrigues) explained how deep learning is applied to scene detection, object recognition, and computational aesthetics. Deep learning helps Flickr automatically tag users' photos, making it easy for Flickr end users to organize and find photos.

We recently moved this technology onto our Hadoop clusters so that Yahoo products can benefit from deep learning more directly. The main advantages of deep learning on Hadoop are:

  • Deep learning can be performed directly on the Hadoop clusters where the data is already stored, avoiding unnecessary data transfer between Hadoop clusters and separate deep learning clusters.

  • Deep learning can be defined as first-class steps in Apache Oozie workflows, using Hadoop for data processing and Spark pipelines for machine learning.

  • YARN works well for deep learning: multiple deep learning experiments can run concurrently on a single cluster, making this approach far more cost-effective than traditional methods. In the past, some of our project teams scheduled GPU resources by hand in a notepad, which was painful and worked for only a few users.
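To illustrate the Oozie point above, here is a minimal sketch of what such a workflow definition might look like. This is not from Yahoo's codebase: it assumes an Oozie release that includes the Spark action (uri:oozie:spark-action:0.1), and all names, paths, and elided details (...) are hypothetical.

```xml
<workflow-app name="dl-pipeline" xmlns="uri:oozie:workflow:0.4">
  <start to="prepare-data"/>

  <!-- Step 1: data preprocessing as an ordinary Hadoop action -->
  <action name="prepare-data">
    <map-reduce>
      ...
    </map-reduce>
    <ok to="train-model"/>
    <error to="fail"/>
  </action>

  <!-- Step 2: deep learning as a first-class Spark action -->
  <action name="train-model">
    <spark xmlns="uri:oozie:spark-action:0.1">
      <master>yarn</master>
      <mode>cluster</mode>
      <name>CaffeOnSpark</name>
      <class>com.yahoo.ml.CaffeOnSpark</class>
      <jar>caffe-on-spark-1.0-jar-with-dependencies.jar</jar>
      <arg>-conf</arg><arg>solver.prototxt</arg>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail"><message>DL pipeline failed</message></kill>
  <end name="end"/>
</workflow-app>
```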

Deep learning on Hadoop is an innovative approach to deep learning. Existing approaches in the industry require a dedicated cluster, whereas deep learning on Hadoop not only achieves the same effectiveness as a dedicated cluster but also offers several additional advantages.

Enhanced Hadoop Cluster

To support deep learning, we added GPU nodes to our Hadoop clusters. Each node has four NVIDIA Tesla K80 cards, each carrying two GK210 GPUs. These nodes have roughly 10 times the processing power of the traditional CPU nodes in our Hadoop clusters.

Each GPU node in the Hadoop cluster has two independent network interfaces: Ethernet and InfiniBand. Ethernet serves as the primary interface for external communication, while InfiniBand provides more than 10 times faster data transfer between GPUs and supports direct access to GPU memory via RDMA.

By leveraging YARN's recently introduced node label feature (YARN-796), we can declare in a job whether containers should be launched on CPU or GPU nodes. Containers on GPU nodes can use InfiniBand to exchange data at very high speed.

Distributed Deep Learning: Caffe-on-Spark
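As a sketch (not the exact submission we use), later Spark releases expose YARN node labels through the spark.yarn.executor.nodeLabelExpression property, so a job could pin its executors to GPU nodes carrying a hypothetical "gpu" label:

```
# Hypothetical example: run executors only on nodes carrying the "gpu" label.
# Requires a YARN cluster with node labels (YARN-796) configured and a
# Spark version that supports the nodeLabelExpression property.
spark-submit --master yarn --deploy-mode cluster \
    --conf spark.yarn.executor.nodeLabelExpression=gpu \
    --class com.yahoo.ml.CaffeOnSpark caffe-on-spark-1.0-jar-with-dependencies.jar
```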

To support deep learning on these enhanced Hadoop clusters, we developed a complete distributed computing solution based on two open source software libraries: Apache Spark and Caffe. A deep learning job can be submitted to the cluster's GPU nodes with a command line like the following:

spark-submit --master yarn --deploy-mode cluster
    --files solver.prototxt,net.prototxt
    --num-executors <# of EXECUTORS>
    --archives caffe_on_grid.tgz
    --conf spark.executorEnv.LD_LIBRARY_PATH="./caffe_on_grid.tgz/lib64"
    --class com.yahoo.ml.CaffeOnSpark caffe-on-spark-1.0-jar-with-dependencies.jar
        -devices <# of GPUs per EXECUTOR>
        -conf solver.prototxt
        -input hdfs://<training file>
        -model hdfs://<model file>

In the command line above, users can specify the number of Spark executors to launch (--num-executors), the number of GPUs allocated per executor (-devices), the HDFS path where the training data is stored, and the HDFS path where the model should be saved. Users specify the Caffe algorithm and the deep network topology with standard Caffe configuration files (e.g., solver.prototxt and net.prototxt).
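For reference, a minimal Caffe solver.prototxt has roughly the shape sketched below. The values are illustrative assumptions, not the configuration Yahoo used:

```
net: "net.prototxt"        # network topology definition
base_lr: 0.01              # initial learning rate
momentum: 0.9
weight_decay: 0.0005
lr_policy: "step"
stepsize: 100000
max_iter: 450000           # total training iterations
snapshot: 10000            # checkpoint frequency
solver_mode: GPU
```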

As shown in the figure above, Spark on YARN launches a number of executors. Each executor is assigned a partition of the HDFS-based training data and launches multiple Caffe-based training threads. Each training thread runs on a dedicated GPU. After processing a mini-batch of training examples with the backpropagation algorithm, the training threads exchange the gradients of the model parameters. These gradients are exchanged in MPI allreduce fashion among the GPUs of multiple servers. We upgraded Caffe to support multiple GPUs on a single server and to synchronize the DL models via the RDMA protocol.
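The gradient exchange can be sketched in a few lines of Python. This is a toy simulation of allreduce semantics, not Yahoo's implementation: each "training thread" ends up holding the element-wise sum of all local gradients, which is what keeps every model replica in sync.

```python
def allreduce(gradients_per_thread):
    """Toy MPI_Allreduce: every participant receives the element-wise
    sum of all participants' gradient vectors."""
    length = len(gradients_per_thread[0])
    reduced = [sum(g[i] for g in gradients_per_thread) for i in range(length)]
    # Each thread gets its own copy of the same reduced vector.
    return [list(reduced) for _ in gradients_per_thread]

# Four hypothetical GPU training threads, each with local gradients
# for a two-parameter model (values chosen to be exact in binary).
local_grads = [[0.5, 0.25],
               [0.25, 0.5],
               [0.5, 0.5],
               [0.25, 0.25]]

synced = allreduce(local_grads)
# Every thread now holds the same summed gradient: [1.5, 1.5]
```

In a real deployment the reduction runs over InfiniBand/RDMA between servers rather than over in-process lists, but the invariant is the same: all replicas apply identical updates.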

Caffe-on-Spark lets us combine the strengths of Caffe and Spark and apply them to large-scale deep learning tasks, which become as easy to operate as any other Spark application. The many GPUs in a cluster are used to train models from HDFS-resident datasets.

Performance Testing

Caffe-on-Spark supports deep learning on (a) multiple GPUs and (b) multiple machines. To demonstrate the advantages of our approach, we ran performance benchmarks on the ImageNet 2012 dataset.

First, we trained AlexNet within a single Spark executor using 1, 2, 4, and 8 GPUs. As the figure below shows, training time shrinks as the number of GPUs increases: with 4 GPUs, we reached 50% accuracy in 15/43 ≈ 35% of the time needed by a single GPU. The batch size for all of these runs was 256. Going from 4 to 8 GPUs brought no significant improvement, because the amount of data processed by each GPU was too small to fully exploit the hardware.
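As a quick sanity check of the 15/43 figure above (the absolute numbers are read off the chart's time axis, not measured here):

```python
# With 4 GPUs, reaching 50% accuracy took 15 time units versus 43 for 1 GPU.
single_gpu_time = 43.0
four_gpu_time = 15.0

time_fraction = four_gpu_time / single_gpu_time  # ~0.349, i.e. about 35%
speedup = single_gpu_time / four_gpu_time        # ~2.87x faster
print(round(time_fraction * 100), round(speedup, 2))  # prints: 35 2.87
```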

We then ran a distributed benchmark with GoogLeNet, which is deeper than AlexNet and uses more convolutions, and therefore requires more computation. In each round, we assigned each GPU a batch of size 32, so the effective batch size is 32n when n GPUs participate. Our distributed algorithm is designed to produce models that reach the same accuracy as training on a single GPU. Trained on 4 servers (4x8 GPUs), GoogLeNet exceeded 80% top-5 accuracy (20% error) within 10 hours, whereas training on 1 GPU reached only 60% top-5 accuracy (40% error) after 40 hours.


GoogLeNet training scales with the number of GPUs. To reach 60% top-5 accuracy (40% error), 8 GPUs ran 680% faster than a single GPU. The table below shows the speedups for reaching 70% and 80% top-5 accuracy. The speedup could be increased further by carefully tuning the batch size (rather than fixing it at 32n).

Open Source Resources

In keeping with Yahoo's commitment to open source, we have contributed part of our code to github.com/BVLC/caffe:

#2114 ... Allow Caffe to use multiple GPUs on a single machine

#1148 ... Support data transfer between machines via the RDMA protocol

#2386 ... Improve Caffe's data pipeline and prefetching

#2395 ... Add timing information

#2402 ... Make Caffe's IO dependencies optional

#2397 ... Refactor Caffe's solver code

In follow-up articles over the next few weeks, we will share the design and implementation details of Caffe-on-Spark. If there is enough interest from the community, we may open source our implementation. Please let us know what you think at bigdata@yahoo-inc.com.

Summary

This article describes our early work integrating the Apache Hadoop ecosystem with deep learning on the same heterogeneous (GPU+CPU) cluster. The early performance comparisons are encouraging, and we plan to invest further effort in Hadoop, Spark, and Caffe to make deep learning more effective on our clusters. We look forward to collaborating with the open source community in these areas.
