How Yahoo implements large-scale distributed deep learning on Hadoop Clusters

Source: Internet
Author: User

Over the past decade, Yahoo has invested heavily in building and expanding its Apache Hadoop clusters. Yahoo currently runs 19 Hadoop clusters comprising more than 40,000 servers and petabyte-scale storage. Its engineers developed large-scale machine learning algorithms on these clusters and made Hadoop Yahoo's preferred platform for large-scale machine learning. Recently, Cyprien Noel, Jun Shi, and Andy Feng from Yahoo's Big ML team wrote an article describing how Yahoo built large-scale distributed deep learning on Hadoop clusters.

Deep learning (DL) is a capability that many Yahoo products depend on. For example, Flickr's scene detection, object recognition, and computational aesthetics features all rely on it. To bring deep learning to more products, the team recently added DL support to the Hadoop clusters. Running deep learning on Hadoop has the following benefits:

  • Deep learning runs directly on the Hadoop cluster, which avoids moving data between the Hadoop cluster and a separate deep learning cluster;
  • Like Hadoop data processing and Spark machine learning pipelines, deep learning can be defined as a step in an Apache Oozie workflow;
  • YARN works well with deep learning. Multiple deep learning experiments can run simultaneously on a single cluster, which makes deep learning much more efficient than traditional approaches.

DL on Hadoop is a new approach to deep learning. To implement it, Yahoo did two main things:

  • Enhance the Hadoop cluster: they added GPU nodes to the Hadoop cluster. Each node has four Nvidia Tesla K80 cards, and each card has two GK210 GPUs. These nodes provide 10 times the processing power of conventional commodity CPU nodes. Each GPU node has two separate network interfaces, Ethernet and InfiniBand. Ethernet serves as the external communication interface. InfiniBand is 10 times faster; it connects the GPU nodes in the cluster and supports direct access to GPU memory over RDMA. With YARN's recently added node label feature, a job can specify whether its containers run on CPU nodes or GPU nodes.
  • Create Caffe-on-Spark: this is a distributed, integrated solution they built on the open source libraries Apache Spark and Caffe. With it, you can submit deep learning jobs to the GPU node cluster using a few simple commands. You can specify the number of Spark executor processes to start, the number of GPUs allocated to each executor, the location of the training data on HDFS, and the path where the model is stored. Standard Caffe configuration files specify the Caffe solver and the deep network topology. Spark on YARN starts the specified number of executors; each executor is assigned a partition of the HDFS training data and starts multiple Caffe-based training threads.
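The job parameters listed above can be pictured as arguments to a single spark-submit invocation. The following Python sketch only assembles such a command line; it is an illustration under stated assumptions, not Yahoo's actual interface. The spark-submit options (--master, --deploy-mode, --num-executors, --files, --conf, --class) and the spark.yarn.executor.nodeLabelExpression setting are standard Spark on YARN options, but the class name, the jar name, the "gpu" label, and the application arguments after the jar (-devices, -conf, -input, -model) are hypothetical placeholders chosen for this example.

```python
import shlex


def build_submit_command(num_executors, gpus_per_executor, train_data, model_path,
                         solver="solver.prototxt", net="net.prototxt",
                         gpu_label="gpu"):
    """Assemble a spark-submit argument list for a hypothetical Caffe-on-Spark job."""
    if num_executors < 1 or gpus_per_executor < 1:
        raise ValueError("executor and GPU counts must be positive")
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        # Number of Spark executor processes to start.
        "--num-executors", str(num_executors),
        # Ship the standard Caffe solver and network definitions with the job.
        "--files", f"{solver},{net}",
        # Ask YARN to place executors on nodes carrying the (assumed) GPU label.
        "--conf", f"spark.yarn.executor.nodeLabelExpression={gpu_label}",
        # Hypothetical entry point and jar; the real names are not given in the article.
        "--class", "example.CaffeOnSparkJob",
        "caffe-on-spark-job.jar",
        # Hypothetical application arguments mirroring the parameters in the text.
        "-devices", str(gpus_per_executor),
        "-conf", solver,
        "-input", train_data,
        "-model", model_path,
    ]


cmd = build_submit_command(
    num_executors=4,
    gpus_per_executor=2,
    train_data="hdfs:///path/to/train_data",
    model_path="hdfs:///path/to/model",
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each option and its value separate, so paths containing spaces are quoted correctly when the command is printed or passed to a process launcher.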

After this work was completed, they benchmarked the approach in two tests. On the ImageNet 2012 dataset, training with four GPUs reached 50% accuracy in only 35% of the time needed with one GPU. With the GoogLeNet model, training with 8 GPUs reached 60% top-5 accuracy 6.8 times faster than with 1 GPU.
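To put the two figures on a common footing, the short Python sketch below converts them into speedup and per-GPU scaling efficiency. The helper names are made up for this example, and efficiency is defined here simply as speedup divided by GPU count. The two results come from different tests, so they are not directly comparable with each other.

```python
def speedup_from_time_fraction(time_fraction):
    """Speedup implied by finishing in a given fraction of the baseline time."""
    return 1.0 / time_fraction


def parallel_efficiency(speedup, num_gpus):
    """Share of ideal linear scaling achieved (1.0 means perfect scaling)."""
    return speedup / num_gpus


# First test: 4 GPUs need 35% of the single-GPU time to reach 50% accuracy.
speedup_4 = speedup_from_time_fraction(0.35)
efficiency_4 = parallel_efficiency(speedup_4, 4)

# Second test: 8 GPUs reach 60% top-5 accuracy 6.8 times faster than 1 GPU.
speedup_8 = 6.8
efficiency_8 = parallel_efficiency(speedup_8, 8)

print(f"4 GPUs: {speedup_4:.2f}x speedup, {efficiency_4:.0%} efficiency")
print(f"8 GPUs: {speedup_8:.2f}x speedup, {efficiency_8:.0%} efficiency")
```

Taking 35% of the time corresponds to roughly a 2.9x speedup on four GPUs, while 6.8x on eight GPUs corresponds to about 85% of ideal linear scaling.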

This shows that their methods are effective. To make distributed deep learning on Hadoop clusters more efficient, they plan to continue to invest in Hadoop, Spark, and Caffe.

Yahoo has published some of its code on GitHub. Interested readers can learn more.
