"Thesis translation" Mobilenets:efficient convolutional neural Networks for Mobile Vision applications

Paper link: https://arxiv.org/pdf/1704.04861.pdf
(Translator's note: the abstract and related-work discussion are only briefly summarized here.)

1. Introduction
The paper presents an efficient network architecture and two hyper-parameters for building very small, low-latency (fast) models that can easily match the design requirements of mobile and embedded vision applications. The two simple global hyper-parameters let the model trade off between speed and accuracy. MobileNets are built primarily from depthwise separable convolutions (initially introduced in an earlier paper and subsequently used in Inception models to reduce the computation in the first few layers). Flattened networks build a network out of fully factorized convolutions and showed the potential of extremely factorized networks. A different way to obtain a small network is to shrink, factorize or compress a pretrained network; compression methods in the literature include product quantization, hashing, pruning, vector quantization and Huffman coding. In addition, various factorizations have been proposed to speed up pretrained networks. Another method for training small networks is distillation, which uses a larger network to teach a smaller network; it is complementary to our approach and is covered in some of the use cases in Section 4. Another emerging approach is low-bit networks.

2. MobileNet Architecture
2.1 Depthwise Separable Convolution
The MobileNet model is built on depthwise separable convolutions, a form of factorized convolution that decomposes a standard convolution into a depthwise convolution and a 1x1 convolution called a pointwise convolution. A standard convolution both filters the inputs (with its kernels) and combines them into a new set of outputs (output channels) in a single step. A depthwise separable convolution splits this into two layers, one for filtering and one for combining; this factorization drastically reduces computation and model size. Figure 2(a) shows a standard convolution, and Figures 2(b) and 2(c) show how it is factorized into a depthwise separable convolution.
A standard convolutional layer takes a D_F x D_F x M feature map F as input and produces a D_G x D_G x N feature map G. (The paper writes D_F for the output size, but it should arguably be D_G, since the output width is not necessarily equal to the input width; it depends on stride and padding. The paper's following sentence does use D_G.) Here D_F is the spatial width and height of the square input feature map, M is the number of input channels (input depth), D_G is the spatial width and height of the square output feature map, and N is the number of output channels (output depth).
Assuming stride one and padding, the output feature map of a standard convolution is computed as:

G_{k,l,n} = Σ_{i,j,m} K_{i,j,m,n} × F_{k+i-1, l+j-1, m}

The computational cost of a standard convolution is:

D_K × D_K × M × N × D_F × D_F

so the cost depends multiplicatively on the number of input channels M, the number of output channels N, the kernel size D_K x D_K and the feature map size D_F x D_F. A standard convolution has to both filter features (based on the convolution kernels) and combine them in order to produce a new representation. These filtering and combining steps can be split into two separate steps, and the computation reduced substantially, via factorized convolutions known as depthwise separable convolutions.
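To make the factorization concrete, here is a rough sketch of the two kinds of layer written as tf.keras blocks. This is an illustrative sketch rather than the authors' released code; the kernel size, stride and example channel counts are assumptions, and (as described in the next paragraph) each of the two factorized layers is followed by batch normalization and ReLU.

```python
import tensorflow as tf
from tensorflow.keras import layers

def standard_conv_block(x, out_channels, stride=1):
    # Standard convolution: filters and combines channels in a single 3x3 convolution.
    x = layers.Conv2D(out_channels, 3, strides=stride, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def depthwise_separable_block(x, out_channels, stride=1):
    # Depthwise convolution: one 3x3 filter per input channel (the filtering step).
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # Pointwise 1x1 convolution: linear combination of the depthwise outputs (the combining step).
    x = layers.Conv2D(out_channels, 1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

# Example: both blocks map a 14x14x512 feature map to 14x14x512,
# but with very different weight counts (~2.36M vs. ~0.27M).
inputs = tf.keras.Input(shape=(14, 14, 512))
standard = tf.keras.Model(inputs, standard_conv_block(inputs, 512))
separable = tf.keras.Model(inputs, depthwise_separable_block(inputs, 512))
print(standard.count_params(), separable.count_params())
```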
A depthwise separable convolution consists of two layers: a depthwise convolution and a pointwise convolution. The depthwise convolution applies a single filter to each input channel, and the pointwise convolution (a 1x1 convolution) builds a linear combination of the outputs of the depthwise layer. MobileNets use batch normalization and ReLU non-linearities after both of these layers.
The depthwise convolution with one filter per input channel can be written as:

Ĝ_{k,l,m} = Σ_{i,j} K̂_{i,j,m} × F_{k+i-1, l+j-1, m}

where K̂ is the depthwise convolution kernel of size D_K x D_K x M; the m-th filter of K̂ is applied to the m-th channel of F to produce the m-th channel of the filtered output feature map Ĝ.
The computational cost of the depthwise convolution is:

D_K × D_K × M × D_F × D_F

and the cost of the full depthwise separable convolution is:

D_K × D_K × M × D_F × D_F + M × N × D_F × D_F

i.e. the sum of the depthwise and 1x1 pointwise convolution costs. By expressing convolution as a two-step process of filtering and combining, we get a reduction in computation of:

(D_K × D_K × M × D_F × D_F + M × N × D_F × D_F) / (D_K × D_K × M × N × D_F × D_F) = 1/N + 1/D_K²

MobileNet uses 3x3 depthwise separable convolutions, which need 8 to 9 times less computation than standard convolutions at only a small reduction in accuracy.

2.2 Network Structure and Training
The MobileNet architecture is defined in Table 1. All layers are followed by batch norm and ReLU, with the exception of the final fully connected layer, which has no non-linearity and feeds into a softmax layer for classification. Figure 3 contrasts a layer built from a regular convolution followed by batch norm and ReLU with the factorized layer: a depthwise convolution and a 1x1 pointwise convolution, each followed by batch norm and ReLU. Downsampling is handled with strided convolution in the depthwise convolutions as well as in the first layer. A final average pooling reduces the spatial resolution to 1 before the fully connected layer (counting depthwise and pointwise convolutions as separate layers, MobileNet has 28 layers).
Almost all of MobileNet's computation is in 1x1 convolutions, which can be implemented with highly optimized general matrix multiply (GEMM) functions. Convolutions are usually implemented via GEMM, but they first require a memory reordering called im2col to map them onto GEMM. A 1x1 convolution does not need this reordering and can be implemented directly with GEMM, one of the most heavily optimized numerical linear algebra routines (a short sketch of this point follows at the end of this section). MobileNet spends 95% of its computation time in 1x1 convolutions, which also hold 75% of the parameters, as shown in Table 2; nearly all of the remaining parameters are in the fully connected layer.
MobileNet models were trained in TensorFlow using RMSprop with asynchronous gradient descent, similar to Inception V3. However, contrary to training large models, we use less regularization and data augmentation, because small models have less trouble with overfitting. When training MobileNets we do not use side heads or label smoothing, and we additionally reduce the amount of image distortion by limiting the size of the small crops. It also proved important to put very little or no weight decay (L2 regularization) on the depthwise filters, since they have so few parameters. For the ImageNet benchmarks in the next section, all models, regardless of size, were trained with the same training parameters.
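On the GEMM point above: the claim that a 1x1 convolution needs no im2col reordering can be checked with a small NumPy sketch. This is an illustration of the idea rather than code from the paper; the feature map and channel sizes are arbitrary example values. A pointwise convolution over an H x W x M feature map is exactly one (H·W) x M by M x N matrix multiplication.

```python
import numpy as np

H, W, M, N = 14, 14, 512, 256    # example sizes (arbitrary)
F = np.random.randn(H, W, M)     # input feature map
K = np.random.randn(M, N)        # 1x1 convolution kernel, i.e. an M x N weight matrix

# Pointwise (1x1) convolution written as an explicit loop over spatial positions.
G_loop = np.empty((H, W, N))
for i in range(H):
    for j in range(W):
        G_loop[i, j] = F[i, j] @ K

# The same operation as a single GEMM: flatten the spatial dimensions, multiply, reshape back.
# No im2col is needed because the "patches" of a 1x1 convolution are just individual pixels.
G_gemm = (F.reshape(H * W, M) @ K).reshape(H, W, N)

print(np.allclose(G_loop, G_gemm))  # True
```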
2.3 Width Multiplier: Thinner Models
Although the base MobileNet architecture is already small and low-latency, many specific use cases or applications require the model to be even smaller and faster. To build these smaller and less computationally expensive models we introduce a very simple parameter α called the width multiplier. Its role is to thin the network uniformly at each layer. For a given layer and width multiplier α, the number of input channels M becomes αM and the number of output channels N becomes αN. The computational cost of a depthwise separable convolution with width multiplier α is:

D_K × D_K × αM × D_F × D_F + αM × αN × D_F × D_F

where α ∈ (0, 1], with typical settings of 1, 0.75, 0.5 and 0.25. α = 1 is the baseline MobileNet and α < 1 gives thinner MobileNets. The width multiplier reduces computational cost and the number of parameters roughly quadratically, by about α². It can be applied to any model structure to define a new, smaller model with a reasonable accuracy, latency and size trade-off, although the new reduced structure needs to be trained from scratch.

2.4 Resolution Multiplier: Reduced Representation
The second hyper-parameter for reducing the computational cost of a neural network is the resolution multiplier ρ. We apply it to the input image, and the internal representation of every layer is subsequently reduced by the same multiplier. In practice we set ρ implicitly by setting the input resolution. We can now express the computational cost of the core layers of the network, depthwise separable convolutions with width multiplier α and resolution multiplier ρ, as:

D_K × D_K × αM × ρD_F × ρD_F + αM × αN × ρD_F × ρD_F

where ρ ∈ (0, 1], typically set implicitly so that the input resolution is 224, 192, 160 or 128. ρ = 1 is the baseline MobileNet and ρ < 1 gives reduced-computation MobileNets. The resolution multiplier reduces computational cost by about ρ².
As an example we can look at a typical layer in MobileNet and see how depthwise separable convolutions, the width multiplier and the resolution multiplier reduce cost and parameters. Table 3 shows the computation (Mult-Adds) and number of parameters as the architecture-shrinking methods are applied in turn to a convolutional layer. The first row gives the Mult-Adds and parameters of a full standard convolutional layer with a 14x14x512 input feature map and a 3x3x512x512 kernel K (the short calculation after Section 3.1 below reproduces these numbers from the formulas). In the next section we look in detail at the trade-offs between resources and accuracy.

3. Experiments
In this section we first study the effect of depthwise convolutions, as well as the choice of shrinking the network by reducing its width rather than its number of layers. We then show the trade-offs of reducing the network with the two hyper-parameters, the width multiplier and the resolution multiplier, and compare the results to a number of popular models. Finally, we discuss applying MobileNets to several different applications.

3.1 Model Choices
First we compare a MobileNet built with depthwise separable convolutions against a model built with full convolutions. In Table 4 we see that on ImageNet, using depthwise separable convolutions rather than full convolutions only reduces accuracy by 1% while saving enormously on Mult-Adds and parameters. We then compare thinner models (using the width multiplier) with shallower models (using fewer layers). To make MobileNet shallower, the 5 layers of separable filters with feature size 14x14x512 in Table 1 were removed. Table 5 shows that at similar computation and number of parameters, making MobileNets thinner is about 3% better than making them shallower.
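Returning to the layer example of Section 2.4, the Mult-Adds and parameter counts of a Table 3-style comparison follow directly from the cost formulas above. The sketch below is my own back-of-the-envelope calculation for the 14x14x512 feature map with a 3x3x512x512 kernel; the particular α and ρ values in the last two lines are example settings, not necessarily the exact rows of Table 3.

```python
# Mult-Adds and parameter counts for one layer, using the formulas from Sections 2.1, 2.3 and 2.4.
D_F, M, N, D_K = 14, 512, 512, 3

def standard_cost(alpha=1.0, rho=1.0):
    m, n, d = alpha * M, alpha * N, rho * D_F
    return D_K * D_K * m * n * d * d, D_K * D_K * m * n          # (Mult-Adds, parameters)

def separable_cost(alpha=1.0, rho=1.0):
    m, n, d = alpha * M, alpha * N, rho * D_F
    mult_adds = D_K * D_K * m * d * d + m * n * d * d            # depthwise + pointwise
    params = D_K * D_K * m + m * n
    return mult_adds, params

print(standard_cost())                          # ~462e6 Mult-Adds, ~2.36e6 parameters
print(separable_cost())                         # ~52.3e6 Mult-Adds, ~0.27e6 parameters (the 8-9x reduction)
print(separable_cost(alpha=0.75))               # width multiplier: roughly alpha^2 fewer Mult-Adds
print(separable_cost(alpha=0.75, rho=128/224))  # plus resolution multiplier: roughly rho^2 fewer again
```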
3.2 Model Shrinking Hyperparameters
Table 6 shows the accuracy, computation and size trade-offs of shrinking the MobileNet architecture with the width multiplier α. Accuracy drops off smoothly until the architecture is made too small at α = 0.25. Table 7 shows the accuracy, computation and size trade-offs for different resolution multipliers, i.e. reduced input resolutions; accuracy again drops off smoothly as the resolution decreases.
Figure 4 shows the trade-off between ImageNet accuracy and computation for the 16 models made from the cross product of width multiplier α ∈ {1, 0.75, 0.5, 0.25} and resolutions {224, 192, 160, 128}. The results are log-linear, with a jump when the models get very small at α = 0.25. Figure 5 shows the trade-off between ImageNet accuracy and number of parameters for the same 16 models.
Table 8 compares the full MobileNet to the original GoogLeNet and VGG16. MobileNet is nearly as accurate as VGG16 while being 32 times smaller and 27 times less compute-intensive. It is more accurate than GoogLeNet while being smaller and more than 2.5 times less compute-intensive. Table 9 compares a reduced MobileNet with width multiplier α = 0.5 and reduced resolution 160x160. The reduced MobileNet is 4% better than AlexNet while being 45 times smaller and using 9.4 times less computation; it is also 4% better than SqueezeNet at about the same size and with 22 times less computation.

3.3 Fine-Grained Recognition
We train MobileNet for fine-grained recognition on the Stanford Dogs dataset. We extend the approach of [18] and collect an even larger but noisy training set from the web. We use the noisy web data to pretrain a fine-grained dog recognition model and then fine-tune it on the Stanford Dogs training set. Results on the Stanford Dogs test set are shown in Table 10: MobileNet can almost achieve the state-of-the-art results at greatly reduced computation and size.

3.4 Large Scale Geolocalization
PlaNet [35] casts the task of determining where on Earth a photo was taken as a classification problem. The approach divides the Earth into a grid of geographic cells that serve as the target classes and trains a convolutional neural network on millions of geo-tagged photos. We retrain PlaNet using the MobileNet architecture on the same data. Whereas the full PlaNet model, based on the Inception V3 architecture, has 52 million parameters and 5.74 billion Mult-Adds, the MobileNet model has only 13 million parameters (the usual 3 million for the body and 10 million for the final layer) and 0.58 million Mult-Adds. As Table 11 shows, the MobileNet version delivers only slightly degraded performance compared to PlaNet despite being much more compact, and it still outperforms Im2GPS by a large margin.

3.5 Face Attributes
Another use case for MobileNet is compressing large systems with unknown or esoteric training procedures. In a face attribute classification task we demonstrate a synergistic relationship between MobileNet and distillation, a knowledge transfer technique for deep networks. We seek to reduce a large face attribute classifier with 75 million parameters and 1.6 billion Mult-Adds. The classifier is trained on a multi-attribute dataset similar to YFCC100M [32]. We distill a face attribute classifier using the MobileNet architecture; a rough sketch of such a distillation step is shown below.
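The distillation idea described in the next paragraph is to train the small network to match the large model's outputs rather than the ground-truth labels. The following is a minimal sketch under my own assumptions (a generic `teacher` and `student` Keras model, a plain mean-squared-error loss on the outputs, and an unlabeled image stream); the paper does not spell out its exact distillation setup.

```python
import tensorflow as tf

def distillation_step(teacher, student, optimizer, images):
    # The teacher's predictions act as soft targets; no ground-truth labels are used,
    # so any pool of unlabeled images can serve as training data.
    soft_targets = teacher(images, training=False)
    with tf.GradientTape() as tape:
        preds = student(images, training=True)
        loss = tf.reduce_mean(tf.square(preds - soft_targets))   # match the teacher's outputs
    grads = tape.gradient(loss, student.trainable_variables)
    optimizer.apply_gradients(zip(grads, student.trainable_variables))
    return loss

# Hypothetical usage, with `teacher` as the large attribute classifier and `student` a MobileNet:
# optimizer = tf.keras.optimizers.RMSprop(learning_rate=1e-3)
# for images in unlabeled_dataset:
#     distillation_step(teacher, student, optimizer, images)
```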
Distillation works by training the classifier to emulate the outputs of a larger model rather than the ground-truth labels, which makes it possible to train from large (and potentially unlimited) unlabeled datasets. Combining the scalability of distillation training with the parsimonious parameterization of MobileNet, the end system not only requires no regularization (e.g. weight decay and early stopping), but also shows enhanced performance. It is evident from Table 12 that the MobileNet-based classifier is resilient to aggressive model shrinking: it achieves a mean average precision (mean AP) across attributes similar to the large model while consuming only 1% of the Mult-Adds.

3.6 Object Detection
MobileNet can also be deployed as an effective base network in modern object detection systems. We report results for MobileNet trained for object detection on the COCO dataset, based on the recent work that won the COCO challenge. In Table 13, MobileNet is compared to VGG and Inception V2 under both the Faster-RCNN and SSD frameworks. In our experiments, SSD is evaluated with a 300 input resolution (SSD 300) and Faster-RCNN is compared at both 300 and 600 input resolutions (Faster-RCNN 300, Faster-RCNN 600). The Faster-RCNN models evaluate 300 RPN proposal boxes per image. The models are trained on the COCO train+val set excluding 8,000 minival images and evaluated on minival. For both frameworks, MobileNet achieves results comparable to the other networks with only a fraction of the computational complexity and model size.

3.7 Face Embeddings
FaceNet is a state-of-the-art face recognition model; it builds face embeddings based on the triplet loss. To build a mobile FaceNet model we use distillation, training by minimizing the squared difference between the outputs of FaceNet and MobileNet on the training data. Results for very small MobileNet models are shown in Table 14.

4. Conclusion
We proposed a new model architecture called MobileNets based on depthwise separable convolutions. We discussed some of the important design decisions leading to an efficient model, and then showed how to build smaller and faster MobileNets using the width multiplier and the resolution multiplier, trading off a reasonable amount of accuracy for reduced size and latency. We then compared different MobileNets to popular models, demonstrating superior size, speed and accuracy characteristics. We concluded by demonstrating MobileNet's effectiveness when applied to a wide variety of tasks. To further help the exploration of MobileNets, we plan to release models in TensorFlow.

Reference: https://arxiv.org/pdf/1704.04861.pdf

"Thesis translation" Mobilenets:efficient convolutional neural Networks for Mobile Vision applications

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.