A Quick Overview of Current Deep Neural Network Model Compression and Acceleration Methods



"This paper presents a comprehensive overview of the depth of neural network compression methods, mainly divided into parameter pruning and sharing, low rank decomposition, migration/compression convolution filter and knowledge refining, this paper on the performance of each type of methods, related applications, advantages and shortcomings of the original analysis. ”


Large-scale neural networks contain huge numbers of layers and nodes, so reducing the memory and computation they require is very important, especially for real-time applications such as online learning and incremental learning. In addition, the recent popularity of smart wearable devices has given researchers the opportunity to deploy deep learning applications on portable devices with limited resources (memory, CPU, energy, bandwidth, and so on). Efficient deep learning methods can have a significant impact on distributed systems, embedded devices, and FPGAs for artificial intelligence. A typical example is ResNet-50 [5]: with 50 convolutional layers, it requires more than 95MB of storage and numerous floating-point multiplications to process each image. After pruning some redundant weights, the network can save about 75% of its parameters and 50% of its computation time. For devices such as mobile phones and FPGAs that have only megabytes of resources, compressing models with these methods is especially important.

Achieving this goal requires a combination of disciplines, including but not limited to machine learning, optimization, computer architecture, data compression, indexing, and hardware design. In this paper, we review recent work on compressing and accelerating deep neural networks, a topic that has attracted wide attention in the deep learning community and made great progress in recent years.

We divide these methods into four categories: parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation. Methods based on parameter pruning and sharing explore the redundancy in model parameters and try to remove redundant and unimportant parameters. Methods based on low-rank factorization use matrix/tensor decomposition to estimate the most informative parameters of deep CNNs. Methods based on transferred/compact convolutional filters design convolutional filters with special structure to reduce storage and computational complexity. Knowledge distillation, in turn, learns a distilled model: it trains a more compact neural network to reproduce the output of a larger network.

Table 1 briefly summarizes the four classes of methods. In general, parameter pruning and sharing, low-rank factorization, and knowledge distillation can be applied to both the fully connected layers and the convolutional layers of a DNN and achieve competitive performance. In contrast, methods using transferred/compact filters apply only to fully convolutional neural networks. Low-rank factorization and transferred/compact filter methods provide an end-to-end pipeline and are easy to implement directly in CPU/GPU environments. Parameter pruning and sharing, on the other hand, rely on different techniques such as vector quantization, binary coding, and sparsity constraints, and usually require several processing steps to reach the final goal.

Table 1. Different model compression methods.

As for training protocols, models based on parameter pruning/sharing and low-rank factorization can be extracted from pre-trained models or trained from scratch, which makes them more flexible and efficient. Transferred/compact filters and knowledge distillation models, by contrast, only support training from scratch. These methods are designed independently and complement each other. For example, transferred layers and parameter pruning/sharing can be used together, and model quantization/binarization can be combined with low-rank factorization to achieve further speedup. Below, each class of methods is introduced in detail, including its characteristics, advantages, and drawbacks.


Parameter pruning and sharing

Depending on how the redundancy is reduced (information redundancy or redundancy in the parameter space), these techniques can be further divided into three categories: quantization and binarization, pruning and sharing, and structured matrices.


A. Quantization and binarization

Network quantization compresses the original network by reducing the number of bits required to represent each weight. Gong et al. [6] and Wu et al. [7] apply k-means scalar quantization to the parameter values. Vanhoucke et al. [8] show that 8-bit quantization of the parameters can achieve significant speedup with minimal loss of accuracy. The study in [9] uses a 16-bit fixed-point representation in CNN training with stochastic rounding, which significantly reduces memory usage and floating-point operations with almost no loss in classification accuracy.
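
As an illustration of scalar quantization, the sketch below clusters a layer's weights with k-means and replaces each weight by its cluster centroid, so only the cluster indices and a small codebook need to be stored. It is a minimal sketch rather than the exact procedure of [6] or [7]; the layer shape, the number of clusters, and the helper names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_quantize(weights, n_clusters=256):
    """Cluster the weights and return (codebook, indices).

    Storing 8-bit indices plus a 256-entry float codebook replaces the
    original 32-bit floats, roughly a 4x reduction in size.
    """
    flat = weights.reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=4, random_state=0).fit(flat)
    codebook = km.cluster_centers_.astype(np.float32).ravel()
    indices = km.labels_.astype(np.uint8)          # 8 bits per weight
    return codebook, indices

def dequantize(codebook, indices, shape):
    """Rebuild an approximate weight tensor from the codebook."""
    return codebook[indices].reshape(shape)

# Usage on a random "fully connected layer" weight matrix (assumed shape).
W = np.random.randn(512, 256).astype(np.float32)
codebook, idx = kmeans_quantize(W, n_clusters=256)
W_hat = dequantize(codebook, idx, W.shape)
print("mean abs error:", np.abs(W - W_hat).mean())
```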


The approach proposed in [10] first prunes unimportant connections and retrains the sparsely connected network. The weights of the remaining connections are then quantized using weight sharing, and finally Huffman coding is applied to the quantized weights and the codebook to further increase the compression rate. As shown in Figure 1, this method first learns the connectivity through normal network training, then prunes the connections with small weights, and finally retrains the network to learn the final weights of the remaining sparse connections.
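
The pruning stage of this kind of pipeline is usually implemented as magnitude-based thresholding followed by masked retraining. The sketch below, written with PyTorch, zeroes out the weights with the smallest magnitudes and keeps a binary mask so that gradient updates do not revive pruned connections; the sparsity level and layer size are illustrative assumptions, not values from [10].

```python
import torch
import torch.nn as nn

def magnitude_prune(layer: nn.Linear, sparsity: float = 0.75):
    """Zero the smallest-magnitude weights and return a binary mask."""
    with torch.no_grad():
        w = layer.weight
        k = int(sparsity * w.numel())
        threshold = w.abs().flatten().kthvalue(k).values
        mask = (w.abs() > threshold).float()
        w.mul_(mask)                          # prune in place
    return mask

# During retraining, re-apply the mask after each optimizer step so that
# pruned connections stay at zero.
layer = nn.Linear(256, 128)
mask = magnitude_prune(layer, sparsity=0.75)
optimizer = torch.optim.SGD(layer.parameters(), lr=0.01)

x, target = torch.randn(32, 256), torch.randn(32, 128)
loss = nn.functional.mse_loss(layer(x), target)
loss.backward()
optimizer.step()
with torch.no_grad():
    layer.weight.mul_(mask)                   # keep the sparsity pattern
```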


Drawbacks: the accuracy of such binary networks drops significantly when dealing with large CNNs such as GoogLeNet. Another drawback is that existing binarization methods are based on simple matrix approximation and ignore the effect of binarization on the accuracy loss.

Figure 1. The three-stage compression pipeline of [10]: pruning, quantization, and Huffman coding. Pruning reduces the number of weights that need to be encoded, while quantization and Huffman coding reduce the number of bits used to encode each weight. The reported compression rate includes the metadata of the sparse representation. The compression scheme incurs no loss of accuracy.


B. Pruning and sharing

Network pruning and sharing have been used to reduce network complexity and to address overfitting. An early pruning approach is biased weight decay; later, the Optimal Brain Damage and Optimal Brain Surgeon methods reduced the number of connections based on the Hessian of the loss function. Their work showed that such pruning is more accurate than magnitude-based pruning (such as the weight decay method).
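
To make the Hessian-based idea concrete, the sketch below computes Optimal Brain Damage-style saliencies s_i = 0.5 * H_ii * w_i^2 for each weight, with the diagonal of the Hessian approximated by the empirical Fisher (average of squared gradients). This approximation is a common stand-in, not the exact Levenberg-Marquardt approximation of the original paper, and the model, data, and function names are assumptions for illustration.

```python
import torch
import torch.nn as nn

def obd_saliency(model: nn.Module, loss_fn, data_loader, n_batches=10):
    """Approximate OBD saliencies s_i = 0.5 * H_ii * w_i^2 per parameter."""
    fisher = [torch.zeros_like(p) for p in model.parameters()]
    for i, (x, y) in enumerate(data_loader):
        if i >= n_batches:
            break
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for f, p in zip(fisher, model.parameters()):
            f += p.grad.detach() ** 2          # empirical Fisher accumulation
    return [0.5 * (f / n_batches) * p.detach() ** 2
            for f, p in zip(fisher, model.parameters())]

# Toy usage: connections with the smallest saliency are pruning candidates.
model = nn.Sequential(nn.Linear(20, 50), nn.ReLU(), nn.Linear(50, 2))
data = [(torch.randn(8, 20), torch.randint(0, 2, (8,))) for _ in range(10)]
saliencies = obd_saliency(model, nn.CrossEntropyLoss(), data)
```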


Drawbacks: pruning and sharing methods have some potential problems. First, pruning with L1 or L2 regularization requires more iterations to converge. In addition, all pruning methods require the sensitivity of each layer to be set manually, i.e. careful tuning of hyper-parameters, which can be cumbersome in some applications.


C. Design of structured matrices

If an m × n matrix can be described with far fewer than m × n parameters, it is a structured matrix. Such structure usually not only reduces memory consumption but also accelerates inference and training through fast matrix-vector multiplication and gradient computation.
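
A classic example of a structured matrix is a circulant matrix, which is fully described by its first column and whose matrix-vector product can be computed in O(n log n) via the FFT. The sketch below is a minimal NumPy illustration of this idea, not a reproduction of any specific structured-matrix layer from the survey.

```python
import numpy as np
from scipy.linalg import circulant

n = 8
c = np.random.randn(n)          # n parameters describe an n x n matrix
C = circulant(c)                # dense n x n circulant matrix
x = np.random.randn(n)

# Dense multiplication: O(n^2) time, n^2 stored parameters.
y_dense = C @ x

# FFT-based multiplication: O(n log n) time, only n stored parameters.
y_fft = np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

print(np.allclose(y_dense, y_fft))   # True
```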


Low-rank factorization and sparsity

A typical CNN convolution kernel is a 4D tensor, and there may be a significant amount of redundancy in this tensor. Ideas based on tensor decomposition are therefore a promising way to remove that redundancy. The fully connected layer can be viewed as a 2D matrix, so low-rank factorization applies there as well.
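
For the 2D case, a fully connected layer weight matrix W of size m × n can be approximated with a truncated SVD, replacing one large matrix multiplication with two thin ones. The sketch below shows the idea with NumPy; the layer size and rank are illustrative assumptions.

```python
import numpy as np

m, n, rank = 1024, 512, 64
W = np.random.randn(m, n).astype(np.float32)   # original FC weight
x = np.random.randn(n).astype(np.float32)

# Truncated SVD: W ~= U_k @ diag(s_k) @ Vt_k
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * s[:rank]        # m x k
B = Vt[:rank, :]                  # k x n

# Parameters: m*n = 524288  vs  k*(m+n) = 98304 (about 5.3x fewer).
y_full = W @ x
y_lowrank = A @ (B @ x)
print("relative error:", np.linalg.norm(y_full - y_lowrank) / np.linalg.norm(y_full))
```

For a random matrix like this the approximation error is large; trained weight matrices tend to have much faster-decaying singular values, which is what makes the approach practical.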


The approximation is done layer by layer: once a layer has been replaced by its low-rank filter approximation, its parameters are fixed, and the preceding layers are fine-tuned based on a reconstruction error criterion. This is the typical low-rank approach for compressing 2D convolutional layers, as shown in Figure 2.

Figure 2. Low-rank approximation for CNN model compression. Left: the original convolutional layer. Right: the convolutional layer constrained to rank K.

Table 2. Performance comparison of low-rank models and their baseline models on the ILSVRC-2012 dataset.


Drawbacks: low-rank methods are well suited to model compression and acceleration, and they complement recent advances in deep learning such as dropout, rectified units, and maxout. However, low-rank methods are not easy to implement, since they involve computationally expensive decomposition operations. Another problem is that current approaches perform the low-rank approximation layer by layer and therefore cannot carry out global parameter compression, which matters because different layers hold different information. Finally, the decomposition requires extensive retraining to reach convergence.


Transferred/compact convolutional filters

Compressing CNN models with transferred convolutional layers is inspired by the work in [42], which introduces equivariant group theory. Let x be the input, Φ(·) a network or layer, and T(·), T′(·) transforms. Equivariance can be defined as:

T′(Φ(x)) = Φ(T(x))

That is, applying the transform T(·) to the input x and then passing it through the network or layer Φ(·) gives the same result as first mapping x through the network and then transforming the resulting representation with T′(·).


According to this theory, it is reasonable to apply transforms to the layers or filters Φ(·) in order to compress the whole network model.
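
As a concrete illustration, one family of such methods builds a larger filter bank by applying simple transforms (such as flips or rotations) to a small set of learned base filters, so that only the base filters need to be stored and trained. The sketch below shows this idea in PyTorch; it is a generic illustration of transform-based filter sharing, not the exact construction of any particular paper, and the class name and channel counts are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransferredConv2d(nn.Module):
    """Conv layer whose output channels are generated from a small set of
    base filters plus their horizontal flips and 180-degree rotations."""

    def __init__(self, in_ch, base_filters, kernel_size=3):
        super().__init__()
        self.base = nn.Parameter(
            torch.randn(base_filters, in_ch, kernel_size, kernel_size) * 0.1)

    def forward(self, x):
        flipped = torch.flip(self.base, dims=[3])       # horizontal flip
        rotated = torch.flip(self.base, dims=[2, 3])    # 180-degree rotation
        weight = torch.cat([self.base, flipped, rotated], dim=0)
        return F.conv2d(x, weight, padding=1)

# Three times as many output channels as stored base filters.
layer = TransferredConv2d(in_ch=16, base_filters=8)
out = layer(torch.randn(1, 16, 32, 32))
print(out.shape)   # torch.Size([1, 24, 32, 32])
```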

Table 3. Performance comparison of different transferred-convolutional-filter methods on the CIFAR-10 and CIFAR-100 datasets.

Drawbacks: methods that apply transfer information to convolutional filters still need to resolve several issues. First, their performance is competitive with wide/flat architectures (such as VGGNet) but not with narrower/more specialized architectures such as GoogLeNet and ResNet. Second, the transfer assumption is sometimes too strong to guide the learning, making the results unstable on some datasets.


Knowledge distillation

As far as we know, Caruana et al. [49] were the first to propose using knowledge transfer (KT) to compress models. They trained a compressed model on pseudo-data labeled by an ensemble of strong classifiers, reproducing the output of the original large network; however, their work is limited to shallow models. The idea was recently extended in [50] as knowledge distillation (KD), which compresses a deep and wide network into a shallower one that mimics the function learned by the complex model. The basic idea of KD is to distill the knowledge of a large teacher model into a smaller student model by having the student learn the class distribution output by the teacher through a softened softmax.


The work in [51] introduced a KD compression framework that eases the training of deep networks by following a student-teacher paradigm, in which the student is penalized according to a softened version of the teacher's output. The framework compresses an ensemble of deep networks (the teacher) into a student network of similar depth. To do so, the student is trained to predict both the teacher's output and the true class labels. Despite its simplicity, KD achieves promising results in various image classification tasks.
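
The student-teacher objective is commonly written as a weighted sum of the usual cross-entropy on the true labels and a term matching the student's softened softmax to the teacher's softened softmax at temperature T. The sketch below is a minimal PyTorch version of this generic KD loss; the temperature, weighting, and model sizes are illustrative assumptions rather than the settings of [51].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Weighted sum of soft-target matching and hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean") * (T * T)      # rescale gradient magnitude
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: a large frozen teacher and a small student on random data.
teacher = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
student = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 10))
x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))

with torch.no_grad():
    t_logits = teacher(x)
loss = kd_loss(student(x), t_logits, y)
loss.backward()
```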


Drawbacks: KD-based approaches can make deep models shallower and significantly reduce computational cost. However, there are some drawbacks. For example, KD can only be applied to classification tasks with a softmax loss function, which limits its use. Another drawback is that the model assumptions are sometimes too strict for the performance to be competitive with other approaches.

Table 4. Baseline models used in different representative studies of model compression.


Discussion and challenges

Compression and acceleration of deep models is still at an early stage, and several challenges remain.

Most current state-of-the-art methods build on well-designed CNN models, which limits the freedom to change the configuration (for example, the network architecture and hyper-parameters). Handling more complex tasks will require more reliable model compression methods.

Pruning is an efficient way to compress and accelerate CNNs. Most current pruning techniques are designed to remove connections between neurons. Pruning channels, on the other hand, directly reduces the width of the feature maps and shrinks the model. This is effective but also challenging, because removing channels can dramatically change the input of the following layer, and it is important to determine how to handle this.

As mentioned earlier, structured-matrix and transferred-convolutional-filter methods impose human prior knowledge on the model, which has a significant impact on performance and stability. It is important to study how to control the impact of such imposed priors.

Knowledge distillation (KD) has many advantages, such as accelerating the model directly without requiring specific hardware or implementations. Developing KD-based approaches and exploring how to improve their performance is still worth pursuing.

Hardware constraints on various small platforms (for example, mobile devices, robots, and self-driving cars) remain a major obstacle to deploying deep CNNs. How to make full use of the limited available computational resources and how to design dedicated compression methods for these platforms remain open challenges.


Paper: A Survey of Model Compression and Acceleration for Deep Neural Networks


Paper link: https://arxiv.org/abs/1710.09282

Deep convolutional neural networks (CNNs) have proven very accurate in many visual recognition tasks. However, current deep CNN models are expensive in both computation and memory, which makes them difficult to deploy on devices and in low-latency scenarios. A natural solution is therefore to compress and accelerate deep CNNs without significantly reducing classification accuracy, and this field has developed rapidly in recent years. In this paper, we survey recent advanced techniques for compressing and accelerating CNN models. These techniques can be roughly grouped into four categories: parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation. Methods for parameter pruning and sharing are described in detail at the beginning of the paper, and the other methods are introduced afterwards. For each class of methods we analyze performance, related applications, advantages, and drawbacks. We then introduce several other recent successful methods, such as dynamic networks and stochastic depth networks. After that, we review the evaluation matrix, the main datasets used to evaluate model performance, and recent benchmarking efforts. Finally, we summarize and discuss the remaining challenges and possible directions for future work.
