In January this year, Jeff Dean, then head of Google Brain and now Google's head of AI, and David Patterson, the computer-architecture giant and 2017 Turing Award winner (the award had not yet been announced at the time), jointly published the article "A New Golden Age in Computer Architecture: Empowering the Machine-Learning Revolution." The article points out that machine learning algorithms are revolutionizing the way humanity tackles its challenges, so it is not hard to imagine that in the near future, both data centers and edge devices will carry hardware dedicated to machine learning computation. What should such hardware look like? The authors list a number of key issues that hardware designers need to take into account, issues that are also instructive for deep learning researchers.
On the other hand, with Moore's law gradually failing and deep learning's bottomless appetite for compute, deep learning researchers are asking the reverse questions: can algorithms be adjusted to better fit the computing hardware? Can machine learning itself help optimize system configuration? At the recent first "Tsinghua-Google AI Symposium," Jeff Dean talked about what kind of model we want to design next, and Azalia Mirhoseini, a researcher at Google Brain, gave a keynote on optimizing systems with reinforcement learning.
Google AI Director Jeff Dean
Taken together, this work shows how Google Brain researchers think about hardware and software, at a time when the number of deep learning papers on arXiv is growing faster than Moore's law, in order to achieve the best possible system performance and efficiency.
Machine learning "Beyond Moore's Law"
In the "Golden Age" paper, Jeff and David introduced a lot of design-specific hardware ideas, taking the two-generation machine learning ASIC developed by Google (TPUv1 for accelerated reasoning and TPUv2 for accelerated training) as an example. Hardware design to look at at least 5 years after the model: now start an ASIC design, it can be put into use after 2 years, and a dedicated hardware needs to be able to maintain at least 3 years of competitiveness is valuable. So under the premise of the design of deep-learning hardware to consider what issues? In the article, the author lists six such key points, according to the order of "pure architecture related" to "Pure machine learning Algorithm", respectively: training, batch size, sparsity and embedding, parameter quantization and refinement, neural network with soft memory mechanism and meta-learning.
Training
The first-generation TPU, which Google began designing in 2013, was built for inference rather than training, and in several respects designing a hardware architecture for training is harder:
- First, training requires roughly three times as much computation as inference.
- Second, because all activation values must be stored for backpropagation, training requires far more memory than inference (see the rough estimate after this list).
- Finally, training scales far less well than inference, because it involves many expensive sequential computations.
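A back-of-the-envelope illustration of the first two points (the layer sizes below are made up for illustration, not taken from any real model): inference can discard activations layer by layer, while training keeps every layer's activations for backpropagation, and the backward pass roughly doubles the forward-pass arithmetic.

    # Illustrative sizes only: a 50-layer fully connected model.
    batch, layers, width, bytes_per_float = 64, 50, 4096, 4

    peak_act_inference = batch * width * bytes_per_float            # one layer's activations at a time
    stored_act_training = layers * batch * width * bytes_per_float  # all layers kept for backprop

    forward_flops = 2 * batch * layers * width * width   # multiply-adds of the forward pass
    training_flops = 3 * forward_flops                   # forward + backward (~2x forward)

    print(f"activations held during inference: {peak_act_inference / 2**20:.1f} MiB")
    print(f"activations held during training:  {stored_act_training / 2**20:.1f} MiB")
    print(f"training FLOPs are ~{training_flops // forward_flops}x inference FLOPs")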
However, designing an ASIC for training is worthwhile, because researchers' time is precious and their patience limited: if an experiment takes 30 days or more, most people will probably give up on exploring that direction.
The second-generation TPU was developed for training. In several talks, Jeff mentioned successful internal applications of TPUv2, including using a quarter of a pod to speed up training of Google's search ranking model by 14.2x and training of an image-processing model by 9.8x.
Internal applications of TPUv2 at Google
Moreover, TPUv2 scales almost linearly: 64 TPUv2 devices form a TPU pod, delivering up to 11.5 petaflops of compute. Training ResNet-50 to 76% accuracy takes 1,402 minutes on a single TPUv2, versus 45 minutes on half a pod (32 TPUv2s), a 31.2x speedup.
TPUv2 scales almost linearly
Cloud TPUs are not cheap, but time is life and life is money. In addition, if you are doing machine learning research and are committed to open-sourcing your work, Google provides researchers with 1,000 free TPUs through the TensorFlow Research Cloud.
Batch size
Is bigger better, or smaller? This remains a controversial question in research.
Intuitively, the ideal choice is stochastic gradient descent with momentum at a minibatch size of 1, which gives the largest accuracy gain per unit of computation. Choosing a batch size larger than 1 is essentially equivalent to adding an extra dimension to the input (Jeff: my head hurts just thinking about it). However, on the hardware currently used for training, take the GPU as an example: its basic unit of program execution, the warp, contains 32 threads, so if your batch size is not an integer multiple of 32, efficiency suffers. That is why current models typically use batch sizes of 32 or 64 (see the toy calculation below).
Batch size and computational efficiency
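As a toy illustration of the warp argument (assuming, purely for illustration, that one sample maps to one thread):

    import math

    # If a kernel maps one sample to one thread, a warp of 32 threads is
    # fully busy only when the batch size is a multiple of 32.
    def warp_utilization(batch_size, warp_size=32):
        warps_launched = math.ceil(batch_size / warp_size)
        return batch_size / (warps_launched * warp_size)

    for bs in (1, 24, 32, 48, 64, 100):
        print(f"batch size {bs:3d} -> warp utilization {warp_utilization(bs):.0%}")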
However, since 2017 there have been promising studies showing that convolutional neural networks for images can be trained efficiently with batch sizes of 8,192 or even 32,768 (the learning-rate recipe behind these results is sketched after the figure source below).
Validation error stays roughly flat until the batch size rises to around 8k
Source: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (arXiv:1706.02677)
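That paper pairs the large batch with a linearly scaled learning rate and a warmup period. A minimal sketch of the rule, taking the commonly cited reference values of a base learning rate of 0.1 at batch size 256 and a 5-epoch warmup as assumptions:

    # Linear scaling rule with warmup (reference values are assumptions, not
    # necessarily the paper's exact hyperparameters).
    def learning_rate(epoch, batch_size, base_lr=0.1, base_batch=256, warmup_epochs=5):
        scaled_lr = base_lr * batch_size / base_batch      # linear scaling rule
        if epoch < warmup_epochs:                           # ramp up from the base rate
            return base_lr + (scaled_lr - base_lr) * epoch / warmup_epochs
        return scaled_lr

    for epoch in (0, 2, 5, 30):
        print(epoch, round(learning_rate(epoch, batch_size=8192), 3))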
It is worth mentioning that Yann LeCun is strongly opposed to increasing batch sizes. In April this year he shared a paper arguing for small batches, commenting: "Training with huge minibatches is bad for your health... Choosing a batch size of 32 rather than 1 only means our hardware is poor."
Incidentally, that 8192-batch paper is work from Facebook, with an author list that includes Ross Girshick, Yangqing Jia, and Kaiming He...
Sparsity and embeddings
"We want a bigger model, but we want each sample to only activate a small part of it. "Another trend that Jeff has repeatedly mentioned.
"What kind of model do we want?" 」
Big models are good, because a huge number of parameters means the model can memorize characteristics of every corner of the dataset. But if the entire model has to be activated for every single data point, that means enormous computational cost. The ideal state is therefore a huge model made up of many independent parts with different specializations; when data comes in, the model activates only the few parts that the task requires, leaving the rest idle. This property is also called "coarse-grained sparsity."
Coarse-grained sparsity
Source: Exploring the Regularity of Sparse Structure in Convolutional Neural Networks (arXiv:1705.08922)
In an ICLR 2017 paper, Google proposed a structure called the mixture-of-experts (MoE) layer. Each "expert" corresponds to a small bundle of parameters in the neural network, but is easier to train than the parameters of an ordinary network, and a single MoE layer can consist of more than 2,000 experts.
The structure of the MoE layer
Source: Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (arXiv:1701.06538)
During training, in addition to the model parameters, the model also learns how to do routing, that is, how to choose which "experts" to activate for a given example (a toy sketch of this kind of gated routing follows the source line below). In the language task, the model learned to choose experts according to context: expert 381 is good at talking about scientific research, expert 752 is good at "leadership," and if speed is involved, expert 2004 gets the call.
Examples of what the learned routing picks up
Source: Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (arXiv:1701.06538)
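A minimal toy sketch of top-k gated routing (sizes and gating details are illustrative and much simpler than the paper's layer): a gating network scores every expert, only the best-scoring experts run, and their outputs are combined with softmax weights.

    import numpy as np

    rng = np.random.default_rng(0)
    n_experts, d_in, d_out, top_k = 8, 16, 16, 2
    gate_w = rng.normal(size=(d_in, n_experts))
    experts = rng.normal(size=(n_experts, d_in, d_out))   # one tiny "expert" per slice

    def moe_forward(x):
        scores = x @ gate_w                                # one score per expert
        chosen = np.argsort(scores)[-top_k:]               # indices of the top-k experts
        weights = np.exp(scores[chosen] - scores[chosen].max())
        weights /= weights.sum()                           # softmax over the chosen experts
        # Only the chosen experts do any computation; the rest stay idle.
        outputs = np.stack([x @ experts[i] for i in chosen])
        return (weights[:, None] * outputs).sum(axis=0)

    y = moe_forward(rng.normal(size=d_in))
    print(y.shape)   # (16,)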
In the English-French translation task, the model is 35 times larger than the previous state-of-the-art model, GNMT, yet it can be trained in one-sixth the time with fewer GPUs.
Compared with MoE, an even more widely deployed case of dynamic routing is the embedding mechanism. Whether mapping words from a tens-of-thousands-dimensional one-hot vector down to a few-hundred-dimensional word embedding, or giving each YouTube video a few-thousand-dimensional representation that captures its relationship to other videos, the need is the same: for each example, randomly read a tiny amount of data (tens or hundreds of bytes, less than 1 KB) from a huge data structure (possibly hundreds of gigabytes).
Existing hardware architectures offer few efficient solutions for this kind of dynamically routed, sparse read.
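A minimal sketch of the access pattern (the vocabulary and dimensions below are illustrative; production embedding tables are orders of magnitude larger):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, embed_dim = 50_000, 256
    # ~50 MB here; hundreds of GB in large production systems.
    table = rng.normal(size=(vocab_size, embed_dim)).astype(np.float32)

    token_ids = np.array([17, 4211, 30_002])   # one short example
    vectors = table[token_ids]                  # gather: 3 rows * 256 floats = 3 KB read
    print(vectors.shape, vectors.nbytes, "bytes read of", table.nbytes, "bytes total")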
Quantization and distillation
What sparsity and embeddings have in common is that they keep the "big model" and focus on how to pinpoint a "small part" of it. Quantization and distillation instead pursue the "small model" directly.
Parameter quantization is another name for low-precision computation.
Common practice today is to use floating-point numbers during training and fixed-point numbers during inference. In the case of the TPU, for example, all inference is expressed with only 8-bit fixed-point numbers. The implementation principle is that, after training is complete, one uses the maximum and minimum values of each layer's parameters and activations to find the minimum number of bits needed for that layer's integer part, and spends the remainder of the 8 bits on the fractional part (a small sketch appears after the source below). Empirical studies show that reducing precision from 32 bits to 8 bits only slightly affects the performance of GoogLeNet and VGG-16, but dropping further to 6 bits hurts the models significantly.
Effect of inference-time quantization on accuracy
Source: Going Deeper with Embedded FPGA Platform for Convolutional Neural Network
Cadlab.cs.ucla.edu/~jaywang/papers/fpga16-cnn.pdf
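A small sketch of the min/max-driven fixed-point scheme described above (the layer values are randomly generated, and the bit allocation is a simplified assumption rather than the TPU's exact implementation):

    import numpy as np

    def quantize_fixed_point(x, total_bits=8):
        max_abs = np.abs(x).max()
        int_bits = max(0, int(np.ceil(np.log2(max_abs + 1e-12))))   # bits for the integer part
        frac_bits = total_bits - 1 - int_bits                       # 1 bit reserved for sign
        scale = 2.0 ** frac_bits
        q = np.clip(np.round(x * scale), -2**(total_bits - 1), 2**(total_bits - 1) - 1)
        return q / scale                                             # dequantized values

    layer_activations = np.random.default_rng(0).normal(scale=2.0, size=1000)
    recovered = quantize_fixed_point(layer_activations)
    print("max abs error:", np.abs(layer_activations - recovered).max())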
The article mentions that only a few studies have looked at using low-precision computation during the training phase, and most results are still limited to small datasets such as MNIST and CIFAR-10. Low-precision training is nonetheless attracting more attention. At ICLR 2018, Baidu and NVIDIA proposed a "mixed-precision training" method that uses FP16 for the forward and backward computations and FP32 for the weight updates. On multiple large datasets and tasks, classification on ImageNet, detection on PASCAL VOC 2007, translation on WMT15, and others, it matches the accuracy achieved with FP32 while cutting compute and nearly halving storage requirements. NVIDIA now provides SDK examples for training with mixed precision.
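The core pattern, FP16 for the forward/backward math and an FP32 "master" copy of the weights for the update, can be sketched in a few lines (a single linear layer with a made-up squared-error loss; the loss-scaling trick used in practice is omitted here):

    import numpy as np

    rng = np.random.default_rng(0)
    master_w = rng.normal(size=(8, 4)).astype(np.float32)   # FP32 master weights
    x = rng.normal(size=(16, 8)).astype(np.float16)
    target = rng.normal(size=(16, 4)).astype(np.float16)
    lr = 1e-2

    for step in range(3):
        w16 = master_w.astype(np.float16)                    # cast weights to FP16
        y = x @ w16                                          # forward pass in FP16
        grad_y = (y - target) / len(x)                       # backward pass in FP16
        grad_w = x.T @ grad_y
        master_w -= lr * grad_w.astype(np.float32)           # weight update in FP32
        print(step, float(((y - target) ** 2).mean()))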
Distillation was proposed by Hinton at NIPS 2014. The idea is to first let a complex model learn the classification problem, then treat the soft class probabilities produced by its final softmax layer as knowledge, and train a simple model to predict those soft targets. The simple model (with fewer layers and fewer neurons per layer) can then reach accuracy comparable to the complex model. Distillation makes people wonder whether small models could be trained directly. Small models and large models place quite different demands on specialized hardware, so the direction in which models develop is also an important factor shaping the direction of hardware development.
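A minimal sketch of the distillation target (the logits below are made up for illustration): the teacher's logits are softened with a temperature T, and the student is trained to match the resulting soft class probabilities rather than only the hard label.

    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    teacher_logits = np.array([9.0, 6.0, 1.0])     # confident, but not one-hot
    T = 4.0                                         # temperature > 1 softens the distribution
    soft_targets = softmax(teacher_logits / T)

    student_logits = np.array([4.0, 3.5, 0.5])
    student_probs = softmax(student_logits / T)

    # Cross-entropy between the soft targets and the student's softened prediction.
    distill_loss = -(soft_targets * np.log(student_probs)).sum()
    print(soft_targets.round(3), round(float(distill_loss), 3))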
Neural networks with soft memory
This section highlights deep learning techniques, such as the attention mechanism, that place special demands on memory and memory access. A conventional memory mechanism reads only a single value from the table where the data is stored; a soft memory mechanism, of which attention is the representative example, instead takes a weighted average over all the values in the table (a small sketch follows).
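A minimal sketch of "soft" memory access via attention (the sizes are illustrative): instead of fetching one row of the table, every row is read and combined with softmax weights, which is exactly why the access pattern is expensive.

    import numpy as np

    rng = np.random.default_rng(0)
    n_slots, d = 10, 4
    keys = rng.normal(size=(n_slots, d))
    values = rng.normal(size=(n_slots, d))
    query = rng.normal(size=d)

    scores = keys @ query                      # similarity of the query to every key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over all memory slots
    soft_read = weights @ values               # weighted average of *all* values

    hard_read = values[np.argmax(scores)]      # a conventional "hard" lookup, for contrast
    print(soft_read.round(3), hard_read.round(3))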
Compared with accelerating particular operations, the deep learning ASICs already available or in late-stage development emphasize optimizing data flow and storage. Remi El-Ouazzane, the former Movidius CEO, has described the design philosophy of the Vision Processing Unit (VPU): almost every architectural decision in the VPU serves the same goal, optimizing data flow. In today's on-device deep learning computation, moving data costs ten or more times as much energy as computing on it, so the only way to maximize performance and minimize power consumption is to increase data locality and reduce the number of external memory accesses. The logic behind Intel's Nervana NNP, which is dedicated to accelerating training, is the same.
The same logic applies to FPGAs. Their large number of pins, and logic units whose datapath can be customized to the algorithm, mean that data does not have to be repeatedly shuttled in and out of off-chip memory as on a GPU; in the ideal case, the data flows in and out only once and the algorithm is complete.
Meta-learning (learning to learn, L2L)
In contrast to the "progress" of machine learning, deep learning is the process of selecting a manually selected fixed feature extraction process into a machine-selectable, training feature extraction. The researchers only need to select a series of basic model structures and parameters, which can be taken over by the machine for feature extraction and distribution fitting.
In the five areas above, whatever the model structure or trick, making those decisions is still human work. In the meta-learning view, this human decision-making is further replaced by large amounts of computation and automated experiments run by machines.
Among automatic machine learning techniques, Google has chosen reinforcement learning, with the model's accuracy treated as the "reward signal." In the ICLR 2017 best paper "Neural Architecture Search with Reinforcement Learning," Google researchers searched for the best CNN structures for CIFAR-10 and the best LSTM/RNN cells for the PTB dataset.
A common LSTM cell and a cell found by architecture search
Source: Neural Architecture Search with Reinforcement Learning (arXiv:1611.01578)
In fact, it is not just model structure: the "meta-learning with reinforcement learning" approach extends to every aspect of deep learning, choosing the input preprocessing pipeline, choosing the activation function, choosing the optimization and update strategy, and choosing the hardware configuration (a toy sketch of the underlying idea follows).
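A toy sketch of the basic loop (the design choices and their rewards below are entirely made up): a controller keeps a distribution over discrete choices, samples one, observes a reward such as accuracy or speed, and nudges the distribution toward choices that scored well, a simple REINFORCE update.

    import numpy as np

    rng = np.random.default_rng(0)
    choices = ["relu", "tanh", "swish", "sigmoid"]
    true_reward = np.array([0.91, 0.85, 0.93, 0.80])     # pretend "accuracy" per choice
    logits = np.zeros(len(choices))
    lr, baseline = 0.5, 0.0

    for step in range(200):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        i = rng.choice(len(choices), p=probs)             # sample a design choice
        reward = true_reward[i] + rng.normal(scale=0.02)  # noisy evaluation of that choice
        baseline = 0.9 * baseline + 0.1 * reward          # moving-average baseline
        grad = -probs
        grad[i] += 1.0                                     # d log p(i) / d logits
        logits += lr * (reward - baseline) * grad          # REINFORCE update

    print(choices[int(np.argmax(logits))], probs.round(2))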
This time, Google Brain researcher Azalia's talk took optimizing hardware configuration as its theme. Traditional device placement relies on greedy heuristics and requires engineers to understand every aspect of the hardware, from compute to bandwidth. Even so, as models grow larger and more devices are used, the resulting placements become harder and harder to generalize.
Google Brain researcher Azalia Mirhoseini
It therefore becomes attractive to use the expected convergence time of a given placement as the reward signal for assigning each operation to a device. The algorithm learns placements that defy human intuition, yet run 27.8% faster than expert-designed placements and save nearly 65 hours.
Hardware placements found by meta-learning and their effect
Source: Device Placement Optimization with Reinforcement Learning (arXiv:1706.04972)
Meta-learning points to a path where large-scale computing resources are used more effectively while "saving" the labor of machine learning experts. It also lays the groundwork for rapid software-hardware co-design at a time when deep learning algorithms and computing devices are both iterating quickly.
Putting all of these visions together, what form will the next stage of deep learning take?
In his speech, Jeff summed up the following:
- A much larger model that is only sparsely activated.
- A single model that solves multiple tasks.
- A model that dynamically learns new paths through the large model and keeps adding new ones.
- Hardware specialized for machine learning computation.
- Machine learning models configured efficiently on that hardware.
Purple modules are new nodes added for new tasks; the bold red lines represent new paths learned to solve the new tasks
Does your research help achieve one of these goals? Would it benefit from such a model?
Whatever the answer, one thing is certain: on today's road toward general AI, researchers, engineers, and architecture designers are all indispensable.
Thinking about deep learning from the perspective of computer architecture