Wang, Min, Baoyuan Liu, and Hassan Foroosh. "Factorized Convolutional Neural Networks." arXiv preprint (2016).
This paper focuses on optimizing the convolution layers of deep networks. The proposed method has three notable features:
- It can be trained directly: there is no need to train an original model first and then compress it with sparsification, quantization, and the like.
- It preserves the input and output interface of the convolution layer, so it can easily replace layers in already-designed networks.
- It is simple to implement: it can be built by combining classic convolution layers.
A classification network designed with this method matches the accuracy of GoogLeNet, ResNet-18, and VGG-16 with a model size of only 2.8M; its multiplication count is 470\times 10^9, only 65% of AlexNet's.

Standard convolution layer
Let's first review the convolution process. A standard convolution places the kernel (orange) over a region of the input I (left), multiplies element-wise, and sums to produce one pixel (blue) of the output O (right).
The kernel size on a single channel is k^2 (i.e., k\times k), and the numbers of input and output channels are m and n respectively.
In currently popular networks, the main role of the convolution layer is to extract features, so it usually keeps the image size unchanged; shrinking the image is generally left to the pooling layers. For simplicity of notation, the input and output are both taken to be of size h\times w.
The number of multiplications required to compute one output pixel is:

k^2 \times m

The total number of multiplications is:

k^2 \times m \times n \times h \times w
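As a sanity check, the count above can be computed directly. A minimal sketch; the layer sizes in the example are made up for illustration:

```python
def standard_conv_mults(k, m, n, h, w):
    """Multiplications in a standard convolution layer:
    each of the n * h * w output pixels needs k^2 * m multiplies."""
    return (k * k * m) * n * h * w

# Hypothetical layer: 3x3 kernel, 64 -> 128 channels, 56x56 feature map.
print(standard_conv_mults(k=3, m=64, n=128, h=56, w=56))
```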
m and n reflect how richly features are extracted, so their values are large, often in the hundreds; in contrast, k is generally around 1–5 and rarely exceeds 7, with high-level features realized by stacking multiple small-kernel convolution layers.

Optimization of the convolution layer
This paper introduces three optimized variants of the convolution layer.

Using bases
Given kernel size k^2 and input/output channel counts m and n, this method splits the convolution into two steps.
In the first step, each input channel is processed separately: every channel is convolved with b single-channel k\times k kernels, so the m-channel input becomes m\times b channels. Each channel of this intermediate result is called a basis.
In the second step, the channels are merged using kernels of size 1 (a 1\times 1 convolution).
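The two steps can be sketched with a naive NumPy implementation. This is an illustrative sketch, not the paper's code; the function name, the "same"-padding choice, and the tensor layouts are my own assumptions:

```python
import numpy as np

def factorized_conv(x, bases_kernels, combine_weights):
    """Two-step factorized convolution (naive sketch, 'same' padding).

    x:               input of shape (m, h, w)
    bases_kernels:   shape (m, b, k, k) -- b single-channel k x k kernels per input channel
    combine_weights: shape (n, m*b)     -- 1x1 convolution merging the bases
    """
    m, h, w = x.shape
    _, b, k, _ = bases_kernels.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))

    # Step 1: per-channel convolution -> m*b intermediate "basis" channels.
    bases = np.empty((m * b, h, w))
    for c in range(m):
        for j in range(b):
            ker = bases_kernels[c, j]
            for i in range(h):
                for t in range(w):
                    bases[c * b + j, i, t] = np.sum(xp[c, i:i + k, t:t + k] * ker)

    # Step 2: a 1x1 convolution merges the m*b channels into n output channels.
    out = np.tensordot(combine_weights, bases, axes=([1], [0]))  # (n, h, w)
    return out

# Tiny example: 2 input channels, b=2 bases each, 1 output channel, 3x3 kernels.
x = np.random.randn(2, 5, 5)
y = factorized_conv(x, np.random.randn(2, 2, 3, 3), np.random.randn(1, 4))
print(y.shape)  # (1, 5, 5)
```

Step 1 is a grouped (depthwise-style) convolution with b outputs per group; step 2 is an ordinary 1x1 convolution, which is why the layer can be assembled from classic building blocks.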
The number of multiplications is:

k^2 \times b \times m \times h \times w + b \times m \times n \times h \times w = (k^2 + n) \times b \times m \times h \times w
As a fraction of the multiplications required by a traditional convolution, this is:

\frac{(k^2+n)bmhw}{k^2 mnhw} = \frac{k^2 b + nb}{k^2 n}
Note that the dominant term here is n: since n \gg k^2, the ratio is approximately b / k^2, so as long as b < k^2 the factorized layer needs fewer multiplications.
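To make the saving concrete, the ratio can be evaluated directly. A small sketch; the channel counts in the example are chosen arbitrarily for illustration:

```python
def factorized_to_standard_ratio(k, n, b):
    """(k^2*b + n*b) / (k^2*n): fraction of multiplications kept by the
    factorized layer relative to a standard k x k convolution."""
    return (k * k * b + n * b) / (k * k * n)

# With n large the ratio approaches b / k^2, so b < k^2 means a saving.
print(factorized_to_standard_ratio(k=3, n=256, b=4))
```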