On March 29, 2017, during the 2017 YunQi Computing Conference held in Shenzhen, Alibaba Cloud’s Chief Science Officer Dr Jingren Zhou officially launched the updated version of its machine learning platform “PAI 2.0”, intending to drastically reduce the technical threshold and development cost for AI.
In recent years, we have witnessed a renewed surge in AI, sweeping across application scenarios such as city management, traffic management, industrial manufacturing, healthcare, and law enforcement. At the conference, Dr Jingren Zhou said, “In the past year, we helped our customers develop several high-importance AI applications. AI has become a technology within our reach, helping people solve real-world problems. This is the purpose of PAI.”
PAI (short for Platform of Artificial Intelligence) was China’s first machine learning platform. Launched by Alibaba Cloud in 2015, it not only reduced the cost of storage and compute through high-performance cloud computing, but also lowered the technical threshold through built-in tools and algorithms. PAI 2.0 is a complete upgrade of PAI 1.0 and represents Alibaba Cloud’s advancement in building core AI technologies.
The main features of Alibaba Cloud’s PAI 2.0 include:
Full compatibility of deep learning frameworks
Ever since the deep learning model AlexNet beat the second-place entry by 10 percentage points at the ImageNet competition in 2012, deep learning has seen a tremendous increase in application in the fields of voice, image, and text recognition and generation. Of course, none of this would have been possible without advances in model construction, hardware computing capacity, and optimization techniques, along with the huge accumulation of data. As a result, both industry and academia have released several open source tools and frameworks for building deep learning models, such as Caffe, Theano, Torch, MXNet, TensorFlow, Chainer, and CNTK. Of these, TensorFlow, Caffe, and MXNet are the most mainstream open source deep learning frameworks in the world: TensorFlow has the most open source algorithms and models, Caffe is a classic, easy-to-use framework in the image field, and MXNet has superior distributed performance.
By using complex model designs to analyze large amounts of data, deep learning can achieve far superior results compared to classic shallow machine learning models. However, this type of modeling strategy places a higher demand on fundamental training tools. In this sense, today’s open source tools fall short to varying degrees in terms of performance, GPU memory support, and ecosystem completeness. They are also generally not very user-friendly for ordinary users.
PAI 2.0’s programming interface offers complete compatibility with the three main deep learning frameworks: TensorFlow, Caffe, and MXNet. All the user needs to do is upload locally written code to the cloud, where it can be executed.
For infrastructure compute resources, PAI 2.0 provides a powerful cloud-based mix of compute resources, including CPUs, GPUs, and FPGAs. For GPUs, PAI 2.0 can provide flexible multi-card scheduling.
With the help of these frameworks and PAI 2.0’s powerful compute resources, users can easily send compute tasks to the corresponding distributed compute clusters and perform deep learning model training and prediction.
A rich and innovative algorithm library
PAI 2.0 provides over 100 different algorithm packages, covering the most common scenarios such as classification, regression, and clustering. It also provides business-specific algorithms for mainstream applications such as text analysis, contextual analysis, and recommendations.
“All our algorithms stem from use cases internal to the Alibaba Group, so they have all been tested in petabyte scale of data and complex application scenarios, guaranteeing their maturity and stability,” Alibaba Cloud CSO Jingren Zhou said.
Support for even larger scales of training data
PAI 2.0 added algorithms based on the parameter server architecture. Not only can it process data in parallel, it can also split models, separating large models into multiple partitions, where each parameter server only stores one partition, and all parameter servers come together to construct the complete model.
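The model-splitting idea can be illustrated with a minimal sketch. This is not PAI's implementation: the class names, the modulo-based key placement, and the plain gradient-descent update are all assumptions made for illustration.

```python
from collections import defaultdict


class ParameterServer:
    """Holds one partition (shard) of the model parameters."""

    def __init__(self):
        self.weights = defaultdict(float)

    def pull(self, keys):
        return {k: self.weights[k] for k in keys}

    def push(self, grads, lr):
        # Apply a plain gradient-descent step to the parameters this shard owns.
        for k, g in grads.items():
            self.weights[k] -= lr * g


class ShardedModel:
    """Routes each parameter key to the single server that owns it."""

    def __init__(self, num_servers):
        self.servers = [ParameterServer() for _ in range(num_servers)]

    def _owner(self, key):
        # Assumed placement rule: integer key modulo the number of servers.
        return self.servers[key % len(self.servers)]

    def pull(self, keys):
        out = {}
        for k in keys:
            out.update(self._owner(k).pull([k]))
        return out

    def push(self, grads, lr=0.1):
        for k, g in grads.items():
            self._owner(k).push({k: g}, lr)


model = ShardedModel(num_servers=3)
# Two workers process different data and push gradients for the features they saw.
model.push({0: 1.0, 4: -2.0})
model.push({4: 1.0, 5: 0.5})
print(model.pull([0, 4, 5]))
```

No single server ever holds the whole model here; the complete parameter set exists only as the union of the shards.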
Another innovation in PAI 2.0 is its failure re-attempt feature. In distributed systems, where hundreds or thousands of nodes run together, it is common to see a few node failures. Without a failure re-attempt mechanism, the overall task may fail and have to be completely resubmitted to the cluster scheduler. The Parameter Server algorithm can support a hundred billion features, a trillion models and samples, and petabytes of training data. It is especially useful for the e-commerce and advertising industries, where recommendation use cases involve enormous amounts of data.
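The failure re-attempt idea can be sketched as a retry wrapper around each sub-task, so that one failing node does not force the whole job to be resubmitted. The names `NodeFailure`, `run_with_retries`, and `make_flaky_task` are hypothetical, and the failures are simulated; the source does not describe how PAI implements this.

```python
class NodeFailure(Exception):
    """Simulated failure of the node running a sub-task."""


def run_with_retries(task, max_attempts=3):
    """Re-run only the failed sub-task; return (result, attempts_used)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except NodeFailure:
            if attempt == max_attempts:
                raise  # give up only after the last allowed attempt


def make_flaky_task(value, failures_before_success):
    """Build a task that fails a fixed number of times, then succeeds."""
    state = {"remaining": failures_before_success}

    def task():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise NodeFailure("simulated node failure")
        return value * value

    return task


tasks = [make_flaky_task(v, f) for v, f in [(1, 0), (2, 2), (3, 1)]]
results = [run_with_retries(t) for t in tasks]
print(results)
```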
User-friendly application interface
In terms of the application interface, PAI does not expose complex formulas or code logic. All the user sees are various types of packaged algorithms. PAI 2.0 integrates numerous visualization tools to make the deep learning black box more transparent: every step has a visual monitoring screen.
Dr Jingren Zhou performed a live demo of building an experiment, in which all it takes to start training an AI model is to configure the input data source and the output. The workflow can be created by dragging and dropping packages, drastically increasing the efficiency of creating test models. The visual interface greatly helps users understand the nature of their problems and the effects of deep learning.
Today, large-scale deep learning optimization is still an emerging technical field. It is multi-disciplinary by nature, encompassing fields such as distributed computing, operating systems, computer architecture, numerical optimization, machine learning modelling, and compiler technologies. Based on its focus, an optimization strategy can be further categorized as compute optimization, GPU memory optimization, communication optimization, performance forecast modelling, or software-hardware collaboration optimization. The new PAI platform’s optimization efforts are focused on the following four directions.
GPU memory optimization
The main focus in memory optimization is the optimization of GPU memory. In deep learning applications, the nature of the compute tasks means GPUs are almost always chosen as the compute device. GPUs are classic high-throughput heterogeneous compute devices, and their hardware design constraints make memory an extremely scarce resource. The GPU currently offered on the PAI platform, the M40, has only 12 GB of memory. Complex models, such as a 151-layer ResNet, can easily reach or even exceed this limit.
Figure 1: 36 layer ResNet model example
PAI carried out a series of GPU memory optimizations to lessen the system load when constructing models, giving users a broader design space for model size. By applying a task-specific GPU memory allocator and automated model-partitioning framework support, PAI greatly relaxed the GPU memory restriction on model creation tasks. More specifically, the automated model-partitioning framework can estimate a model’s GPU memory cost based on its network features and then automatically partition the model, allowing for model parallelism. While partitioning, the framework also considers the communication cost that partitioning introduces, using heuristic methods to optimize the trade-off between supporting large models and compute efficiency.
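As a rough illustration of estimating memory and then partitioning, the sketch below uses an assumed per-layer memory estimate (weights, gradients, and stored activations) and a simple greedy split into contiguous groups of layers. Both functions are hypothetical, and unlike the framework described above, this sketch ignores the communication cost that each cut introduces.

```python
def estimate_layer_bytes(num_params, activations_per_sample, batch_size,
                         bytes_per_value=4):
    """Assumed estimate: weights + gradients + activations kept for backprop."""
    return bytes_per_value * (2 * num_params + activations_per_sample * batch_size)


def partition_layers(layer_bytes, device_capacity):
    """Greedily pack consecutive layers onto devices without exceeding capacity."""
    partitions, current, used = [], [], 0
    for idx, size in enumerate(layer_bytes):
        if size > device_capacity:
            raise ValueError(f"layer {idx} does not fit on a single device")
        if used + size > device_capacity:
            # Close the current partition and start a new one on the next device.
            partitions.append(current)
            current, used = [], 0
        current.append(idx)
        used += size
    if current:
        partitions.append(current)
    return partitions


# Illustrative layer descriptions: (parameter count, activations per sample).
layers = [(1_000_000, 50_000), (2_000_000, 20_000),
          (4_000_000, 10_000), (500_000, 1_000)]
sizes = [estimate_layer_bytes(p, a, batch_size=32) for p, a in layers]
print(partition_layers(sizes, device_capacity=40_000_000))
```

Each inner list is the set of layer indices placed on one device; a production heuristic would also weigh the size of the activations crossing each cut.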
Communication optimization
A perennial topic in large-scale deep learning, and in large-scale machine learning generally, is how to accelerate training tasks through distributed computing. Because machine learning training is repeatedly iterative, the classic map-reduce method of parallel data processing is no longer suitable for this scenario. For deep learning training tasks, where each training step processes only a small batch of samples, this problem is even more severe.
According to Amdahl’s Law, the degree to which a compute task’s performance can be improved is determined by the fraction of the overall execution time taken up by the improvable part. But distributing deep learning training tasks often introduces additional communication costs, which reduce that fraction and thus limit the performance improvement that distribution can deliver in the first place.
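A small worked example makes the effect concrete. `amdahl_speedup` is the standard form of Amdahl's Law; `speedup_with_comm` adds an assumed communication term, expressed as a fixed fraction of the original single-device run time, which is a simplification chosen here for illustration.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Classic Amdahl's Law: speedup = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


def speedup_with_comm(parallel_fraction, workers, comm_fraction):
    """Same model plus communication time, as a fraction of the original run time.

    Treating communication as a constant extra fraction is an assumption.
    """
    return 1.0 / ((1.0 - parallel_fraction)
                  + parallel_fraction / workers
                  + comm_fraction)


for n in (2, 8, 32):
    ideal = amdahl_speedup(0.95, n)
    real = speedup_with_comm(0.95, n, comm_fraction=0.10)
    print(f"{n:>2} workers: ideal {ideal:.2f}x, with communication {real:.2f}x")
```

With 95% of the work parallelizable, adding communication equal to 10% of the original run time cuts the 8-worker speedup from roughly 5.9x to roughly 3.7x, which is why reducing communication cost matters so much.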
Through numerous optimization strategies such as pipeline communication, late-multiply, hybrid-parallelism, and heuristic-based model averaging, the PAI platform reduced the communication costs of distributed training to varying degrees. As a result, it achieved a better convergence acceleration ratio on both public and in-house models.
In pipeline communication, the data waiting to be communicated (models and gradients) is split into individual data blocks that flow across numerous compute nodes. This breaks the bandwidth restriction of individual network cards and, to a certain extent, holds the communication cost to constant time complexity.
Figure 2: Pipeline communication
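Why splitting the payload into blocks helps can be seen from a simple cost model. The sketch assumes a linear chain of nodes with equal link bandwidth and no per-message latency; the topology and both function names are assumptions, since the source does not specify how PAI arranges the nodes.

```python
def store_and_forward_time(message_bytes, num_nodes, bandwidth):
    """Each node receives the whole message before forwarding it to the next."""
    hops = num_nodes - 1
    return hops * message_bytes / bandwidth


def pipelined_time(message_bytes, num_nodes, bandwidth, num_blocks):
    """The message is cut into blocks that stream through the chain of nodes."""
    hops = num_nodes - 1
    block_time = message_bytes / num_blocks / bandwidth
    # The first block needs `hops` steps to reach the last node; every later
    # block arrives one step behind the previous one.
    return (hops + num_blocks - 1) * block_time


size, bandwidth = 1_000_000_000, 1_000_000_000  # 1 GB over 1 GB/s links
for nodes in (4, 8, 16):
    naive = store_and_forward_time(size, nodes, bandwidth)
    piped = pipelined_time(size, nodes, bandwidth, num_blocks=100)
    print(f"{nodes:>2} nodes: whole message {naive:.2f}s, pipelined {piped:.2f}s")
```

In this model the unpipelined time grows linearly with the number of nodes, while the pipelined time stays close to the time needed to send the message over a single link, which is the sense in which the cost approaches a constant.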
In late-multiply, PAI exploits the characteristics of fully connected layers, namely a small amount of computation and a large model size, to optimize how gradients are reduced between nodes. The traditional distributed logic, where multiple workers calculate local gradients and exchange them with each other, is replaced by a more compact logic, where multiple workers calculate the back-propagation gradients and activations of the upper and lower layers of the fully connected layer. This yields significant performance increases if the fully connected layer contains a large enough number of hidden neurons.
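One possible reading of late-multiply is sketched below: for a fully connected layer, the weight gradient is the product of the input activations and the back-propagated deltas, so workers can exchange those two small matrices and multiply afterwards instead of exchanging the full gradient matrix. This interpretation, the helper names, and the toy numbers are assumptions, not details confirmed by the source.

```python
def matmul_t(a, d):
    """Return a^T @ d, where a is (batch x n_in) and d is (batch x n_out)."""
    n_in, n_out = len(a[0]), len(d[0])
    out = [[0.0] * n_out for _ in range(n_in)]
    for row_a, row_d in zip(a, d):
        for i, x in enumerate(row_a):
            for j, y in enumerate(row_d):
                out[i][j] += x * y
    return out


def add(m1, m2):
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


# Per-worker inputs to the layer and deltas flowing back from the layer above.
acts = {"w0": [[1.0, 2.0]], "w1": [[3.0, 4.0]]}
deltas = {"w0": [[0.5, -1.0, 2.0]], "w1": [[1.0, 0.0, -1.0]]}

# Traditional: multiply locally, then exchange full n_in x n_out gradients.
grad_traditional = add(matmul_t(acts["w0"], deltas["w0"]),
                       matmul_t(acts["w1"], deltas["w1"]))

# Late-multiply: exchange activations and deltas first, multiply afterwards.
grad_late = matmul_t(acts["w0"] + acts["w1"], deltas["w0"] + deltas["w1"])


def values_exchanged(batch, n_in, n_out):
    """Values each worker sends under the two schemes."""
    return {"gradient": n_in * n_out, "late_multiply": batch * (n_in + n_out)}


print(grad_traditional == grad_late)
print(values_exchanged(batch=64, n_in=4096, n_out=4096))
```

Both orders give the same gradient, but for a 4096 x 4096 layer with a batch of 64 the late-multiply exchange is 32 times smaller, and the gap widens as the layer gets wider.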
Hybrid-parallelism is a hybrid strategy incorporating both data parallelism and model parallelism, chosen according to the characteristics of different parts of the model network. For compute-intensive parts the system opts for data parallelism, and for communication-intensive parts it opts for model parallelism. The end result is a good balance between faster distributed computation and lower communication costs. The following image shows the application of this optimization strategy to the AlexNet model in TensorFlow.
Figure 3: AlexNet with hybrid-parallelism
Figure 4: AlexNet model demonstration
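The per-layer choice in hybrid-parallelism can be sketched with a simple heuristic: compare how much computation a layer performs with how many parameters it would have to synchronize. The function name, the threshold, and the layer figures below are illustrative assumptions rather than measured AlexNet values or PAI's actual rule.

```python
def choose_parallelism(layers, threshold):
    """Pick a strategy per layer from its compute-to-parameter ratio.

    layers: iterable of (name, flops_per_sample, num_params).
    """
    plan = {}
    for name, flops, params in layers:
        # High ratio: lots of compute per synchronized weight, so replicate
        # the layer and split the data. Low ratio: syncing weights dominates,
        # so split the layer itself across devices.
        intensity = flops / params
        plan[name] = "data" if intensity >= threshold else "model"
    return plan


# Rough, made-up magnitudes for a convolutional and a fully connected layer.
example_layers = [
    ("conv", 200_000_000, 300_000),
    ("fc", 75_000_000, 37_000_000),
]
print(choose_parallelism(example_layers, threshold=100))
```

Convolutional layers reuse a small set of weights many times, so they fall on the data-parallel side, while large fully connected layers fall on the model-parallel side.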
Performance forecast model
Users tasked with building models are often concerned only with the most efficient way to complete their model training task, not with how many GPUs are used or which distribution strategy is executed. But due to the complexity of today’s deep learning training tools and tasks, leaky abstractions often force these users to care about how many GPU cards and CPU cores, which communication medium, and which distribution strategy are needed to complete their training tasks effectively.
Hence, the performance forecast model was created to free these users from the specific details of their training tasks. In short, once a user provides a model architecture and the expected resource and time costs, the PAI platform uses modeling plus heuristic strategies to forecast how many hardware resources are needed and which distribution strategy to adopt in order to meet the user’s expectations.
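A toy version of such a forecast is sketched below. It assumes a deliberately simple cost model in which compute time divides evenly across GPUs while communication overhead grows linearly with each GPU added, then picks the smallest GPU count that meets the deadline. The functions, the cost model, and the default overhead are hypothetical, and the sketch covers only the resource count, not the choice of distribution strategy.

```python
def predicted_hours(single_gpu_hours, gpus, comm_overhead_per_gpu):
    """Assumed model: compute shrinks as 1/gpus, communication grows linearly."""
    return single_gpu_hours / gpus + comm_overhead_per_gpu * (gpus - 1)


def plan_resources(single_gpu_hours, deadline_hours, max_gpus,
                   comm_overhead_per_gpu=0.05):
    """Return (gpus, predicted_hours) for the smallest count meeting the deadline."""
    for gpus in range(1, max_gpus + 1):
        hours = predicted_hours(single_gpu_hours, gpus, comm_overhead_per_gpu)
        if hours <= deadline_hours:
            return gpus, hours
    return None  # no configuration within max_gpus meets the deadline


print(plan_resources(single_gpu_hours=24.0, deadline_hours=6.0, max_gpus=16))
print(plan_resources(single_gpu_hours=24.0, deadline_hours=0.1, max_gpus=4))
```

The user states only the goal (a deadline and a resource ceiling) and the planner works out the configuration, which is the division of labour the paragraph above describes.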
Software-hardware interaction optimization
The three aforementioned optimization strategies focus on a task’s offline training process. But deep learning’s specific business scenarios also involve an online application process. Because deep learning models are typically complex, their resource consumption, compute performance, and dynamic model update costs mean that applying them online is no easy task. Online application on the PAI platform has not only been optimized at the software level, it has also ventured into optimizing software-hardware interaction. Currently, the Alibaba Cloud team is working on FPGA-based software-hardware interaction optimization for online inference. PAI’s approach differs from others in the industry. We abstracted this problem into a domain-specific custom hardware compiler optimization problem. Through this abstraction, we can create a general solution for multiple problems, thereby satisfying the requirements of a multitude of models and scenarios.
In his last segment, Dr Jingren Zhou demonstrated PAI 2.0’s related use cases and widely adopted applications.