Summarize the Recent Development of CNN Models (I): ResNet [1, 2], Wide ResNet [3], ResNeXt [4], DenseNet [5], DPN [9], NASNet [10], SENet [11], Capsules [12]


Source: https://zhuanlan.zhihu.com/p/30746099, by Yu June, from the column "Computer Vision and Deep Learning".

1. Preface

It has been a while since I last updated this column. A recent project brought me into contact with PyTorch, and it felt like opening the door to a new world of deep learning. In my spare time I used PyTorch to train the recent state-of-the-art CNN models for image classification, which are summarized in this article as follows:

    1. ResNet [1, 2]
    2. Wide ResNet [3]
    3. ResNeXt [4]
    4. DenseNet [5]
    5. DPN [9]
    6. NASNet [10]
    7. SENet [11]
    8. Capsules [12]

This article reproduces the results of the above papers on the CIFAR datasets (both CIFAR-10 and CIFAR-100), except for [9]; the code has been placed on GitHub:

junyuseu/pytorch-cifar-models (github.com)

This article mainly introduces the first four structures.

2. Analysis and Reproduction Results

2.1 ResNet

ResNet is one of the most important structures in the recent development of CNNs: many later insights are improvements built on top of ResNet, and a number of papers are devoted to analyzing why the residual structure works. The success of ResNet is due first to its simple and effective structure, and second to its wide applicability. A simple residual block is shown below:

Figure: a residual block

This unit can be expressed by the following formula:

    y_l = h(x_l) + F(x_l, W_l),    x_{l+1} = f(y_l)

In most residual blocks of ResNet, h(x_l) = x_l, i.e. the shortcut is an identity mapping; only a small number of blocks need dimension matching, where a 1x1 convolution layer is used to increase the dimension, and f is the ReLU function applied after the addition.
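As a concrete illustration, a basic residual block of this kind can be written in PyTorch roughly as follows (a minimal sketch, not the exact code from the repository above; channel sizes and names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BasicResidualBlock(nn.Module):
    """x_{l+1} = f(h(x_l) + F(x_l)): conv-BN-ReLU-conv-BN plus a (possibly projected) shortcut."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # h(x): identity in most blocks, a 1x1 convolution only when dimensions must match.
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))   # first half of F(x)
        out = self.bn2(self.conv2(out))         # second half of F(x)
        out = out + self.shortcut(x)            # h(x) + F(x)
        return F.relu(out)                      # f = ReLU after the addition


# Example with a CIFAR-sized input
x = torch.randn(1, 16, 32, 32)
block = BasicResidualBlock(16, 32, stride=2)
print(block(x).shape)  # torch.Size([1, 32, 16, 16])
```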

Let ∂loss/∂x_{l+1} denote the gradient of the loss arriving from the deeper layer. Treating f as an identity mapping for simplicity, so that x_{l+1} = x_l + F(x_l, W_l), the rules of backpropagation give:

    ∂loss/∂x_l = ∂loss/∂x_{l+1} · (1 + ∂F(x_l, W_l)/∂x_l)

Because of the constant term 1, the gradient does not vanish as it propagates through the layers, which explains the effectiveness of residual learning to some extent.

Reproducing the CIFAR-10 results of ResNet reported in [1] gives the following table:

Table: reproduced ResNet results on CIFAR-10

Compared with the original paper, the reproduced results are actually better than those reported. I also tried the 1202-layer network; probably because I did not follow the learning-rate schedule in the original paper, it failed to converge, and to save GPU resources I did not run it to the end. This suggests that ResNet still has difficulties at extreme depth; in practice, however, such deep structures are rarely used.

To address this problem, the pre-activation (preact) residual structure is proposed in [2], as shown below:

Figure: (a) an ordinary residual block, (b) a pre-activation residual block

As the name implies, pre-activation means applying BN and the activation function (ReLU) before the convolution layers. As above, we can write this structure as a formula:

    x_{l+1} = x_l + F(x_l, W_l)

where both the shortcut h and the after-addition function f are identity mappings, and F now denotes the BN-ReLU-conv sequence.

This form is particularly elegant: applying the formula recursively, for any deeper unit L and any shallower unit l we have

    x_L = x_l + Σ_{i=l}^{L-1} F(x_i, W_i)

This formula has some very nice properties:

i) The feature of any deeper unit can be represented as the feature of any shallower unit plus a sum of residual functions;

ii) The feature of any deeper unit is the sum of the outputs of all preceding residual functions (plus the input x_0).

Denoting the loss function by ε, backpropagation gives:

    ∂ε/∂x_l = ∂ε/∂x_L · ∂x_L/∂x_l = ∂ε/∂x_L · (1 + ∂/∂x_l Σ_{i=l}^{L-1} F(x_i, W_i))

If we ignore the few layers in PreAct-ResNet that are used to increase the dimension, this formula shows that the gradient flowing through the whole network never vanishes, no matter how deep the network is, because the term ∂ε/∂x_L is propagated directly to every shallower unit.

This is also demonstrated by the following experimental results:

Table: reproduced PreAct-ResNet results on the CIFAR datasets

Except for the 1001-layer network, the above results are all better than those in the original paper. From the table we can conclude:

1. The pre-activation unit is more effective than the ordinary residual unit for extremely deep networks;

2. Even for a network of about 1000 layers, using the same hyper-parameter settings, the pre-activation network still converges well.
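For comparison with the ordinary residual block shown earlier, here is a minimal sketch of a pre-activation block (illustrative only, not the exact code from the repository): BN and ReLU are moved before each convolution and no ReLU is applied after the addition, so the shortcut path stays a pure identity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PreActBlock(nn.Module):
    """Pre-activation block: x_{l+1} = x_l + F(x_l), with F = BN-ReLU-conv-BN-ReLU-conv."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        # Projection shortcut only where the dimensions change.
        self.shortcut = None
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                      stride=stride, bias=False)

    def forward(self, x):
        out = F.relu(self.bn1(x))                               # BN + ReLU before the conv
        shortcut = self.shortcut(out) if self.shortcut is not None else x
        out = self.conv1(out)
        out = self.conv2(F.relu(self.bn2(out)))
        return out + shortcut                                   # no ReLU after the addition


x = torch.randn(1, 16, 32, 32)
print(PreActBlock(16, 16)(x).shape)  # torch.Size([1, 16, 32, 32])
```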

2.2 Wide ResNet

Figure: a wide residual block

ResNet shows that a network can achieve better performance by increasing depth; this work studies the influence of width on network performance. First, what is width? For a convolutional layer, width refers to the number of output channels; for example, the first convolutional layer of ResNet-50 has parameters (64, 3, 7, 7), so its width (output dimension) is 64. For a whole network, width refers to the output dimensions of all parameterized layers. For ease of study, the width of the network is usually controlled by a widening factor k, as shown in the following table:

Table: Wide ResNet network structure on the CIFAR datasets

The experiments in [3] show that network performance can also be improved by increasing the width; even a 14-layer wide residual network can achieve better performance than a 1001-layer residual network. At the same time, thanks to the parallel computing characteristics of GPUs, WRN (short for Wide ResNet) trains much more efficiently than a ResNet with a comparable number of parameters. The reproduced results are as follows:

Table: reproduced Wide ResNet results on the CIFAR datasets

The results are better than the results in the paper.
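To make the role of the widening factor k concrete, here is a minimal sketch (assuming a pre-activation basic block as in WRN; the class and variable names are illustrative): k simply multiplies the channel widths of the three CIFAR stages, with optional dropout between the two convolutions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WideBasicBlock(nn.Module):
    """A pre-activation basic block whose channel count is scaled by the widening factor k."""

    def __init__(self, in_channels, out_channels, stride=1, dropout=0.0):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, stride=1, padding=1, bias=False)
        self.dropout = nn.Dropout(dropout)
        self.shortcut = None
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False)

    def forward(self, x):
        out = F.relu(self.bn1(x))
        shortcut = self.shortcut(out) if self.shortcut is not None else x
        out = self.dropout(self.conv1(out))
        out = self.conv2(F.relu(self.bn2(out)))
        return out + shortcut


# The widening factor k only changes the channel widths of the three CIFAR stages:
k = 10
widths = [16, 16 * k, 32 * k, 64 * k]   # stem width, then 160/320/640 for a WRN-*-10
block = WideBasicBlock(widths[0], widths[1])
print(block(torch.randn(1, 16, 32, 32)).shape)  # torch.Size([1, 160, 32, 32])
```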

2.3 ResNeXt

ResNeXt is another strong piece of work from Kaiming He's group. [4] shows that, besides depth and width, "cardinality" is also an important factor affecting network performance. What is cardinality? See the figure below:

Figure: left, a residual block; right, a ResNeXt block (cardinality = 32). Each layer is denoted by (# input channels, filter size, # output channels).

ResNeXt is essentially a multi-branch convolutional neural network; multi-branch networks first appeared in Google's Inception structure.

In the paper, cardinality is defined as the size of the set of transformations. This definition may be hard to grasp at first, so let us first look at group convolution.

Group convolution can be traced back as early as AlexNet [6]. Krizhevsky used group convolution in order to distribute the model across two GPUs for training, so in AlexNet the number of groups is 2. Many recent papers, including Xception [7], MobileNet, and this ResNeXt, are applications of group convolution. In Xception the number of groups equals the number of input channels, which is also known as depthwise convolution. Xception and MobileNet both use depthwise separable convolution, which is simply a depthwise convolution followed by a pointwise convolution (i.e. a convolution with 1x1 kernels).
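As a small illustration of that last point (a sketch, not Xception or MobileNet code), a depthwise separable convolution in PyTorch is just a grouped convolution with groups equal to the number of input channels, followed by a 1x1 pointwise convolution:

```python
import torch
import torch.nn as nn

in_ch, out_ch = 32, 64
depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1,
                      groups=in_ch, bias=False)                 # one filter per input channel
pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)  # 1x1 conv mixes channels

x = torch.randn(1, in_ch, 32, 32)
y = pointwise(depthwise(x))
print(y.shape)  # torch.Size([1, 64, 32, 32])
```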

Having understood group convolution, let us return to the concept of cardinality in ResNeXt. It turns out that cardinality is simply the number of groups in the grouped convolution; depthwise convolution is then a special case of ResNeXt.
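In PyTorch terms, cardinality maps directly onto the groups argument of nn.Conv2d. Below is a minimal sketch of a ResNeXt-style bottleneck built around a grouped 3x3 convolution (widths and names are illustrative, not the exact code from the repository):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResNeXtBottleneck(nn.Module):
    """1x1 reduce -> grouped 3x3 (groups = cardinality) -> 1x1 expand, plus a shortcut."""

    def __init__(self, in_channels, out_channels, cardinality=32, bottleneck_width=4, stride=1):
        super().__init__()
        inner = cardinality * bottleneck_width          # e.g. 32 * 4 = 128
        self.conv1 = nn.Conv2d(in_channels, inner, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(inner)
        self.conv2 = nn.Conv2d(inner, inner, 3, stride=stride, padding=1,
                               groups=cardinality, bias=False)  # the grouped convolution
        self.bn2 = nn.BatchNorm2d(inner)
        self.conv3 = nn.Conv2d(inner, out_channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return F.relu(out + self.shortcut(x))


x = torch.randn(1, 256, 32, 32)
print(ResNeXtBottleneck(256, 256)(x).shape)  # torch.Size([1, 256, 32, 32])
```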

Figure: three equivalent forms of the ResNeXt block

In the original paper, the authors implement all three architectures and verify their equivalence. In the reproduction we therefore only implement form (c), because it is the easiest to build using group convolution. The reproduced results are as follows:

Table: reproduced ResNeXt results on the CIFAR datasets

The results on CIFAR-10 are slightly worse than those in the paper, while the results on CIFAR-100 are better than those in the paper and give the lowest error rate in this article (17.11%).

2.4 DenseNet

DenseNet won the CVPR 2017 best paper award. Although DenseNet is less influential than ResNet, it also presents a meaningful insight. Its biggest advantage is that it further optimizes the gradient flow. After ResNet, [8] pointed out that during ResNet training the main source of gradient is the shortcut branch (which also confirms our earlier derivation of gradient propagation through the residual block). We all know how important it is, when training a CNN, to keep the gradient flowing during backpropagation and to prevent gradients from exploding or vanishing. Since shortcuts are so effective, why not add more of them? This is the core idea of DenseNet: add a shortcut from every layer to each of the preceding layers, so that any two layers can "communicate" directly.

In the implementation, channel-wise concatenation (concat) is used to realize these connections between layers.
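Here is a minimal sketch of that idea (illustrative, not the exact DenseNet code): each layer produces growth_rate new feature maps, and its input is the channel-wise concatenation of the block input and all previous layers' outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseLayer(nn.Module):
    """BN-ReLU-3x3 conv producing `growth_rate` new feature maps."""

    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        return self.conv(F.relu(self.bn(x)))


class DenseBlock(nn.Module):
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            [DenseLayer(in_channels + i * growth_rate, growth_rate) for i in range(num_layers)]
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Each layer sees the concatenation of the input and all previous outputs.
            new_features = layer(torch.cat(features, dim=1))
            features.append(new_features)
        return torch.cat(features, dim=1)


x = torch.randn(1, 24, 32, 32)
block = DenseBlock(in_channels=24, growth_rate=12, num_layers=4)
print(block(x).shape)  # torch.Size([1, 72, 32, 32]), i.e. 24 + 4 * 12 channels
```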

The overall structure of DenseNet follows the paper; for specific hyper-parameter settings, refer to the paper and the code implementation. The reproduced results are as follows:

Table: reproduced DenseNet results on the CIFAR datasets

The reproduced results basically match (or exceed) the results in the paper, and the final results essentially reach the current state of the art on the CIFAR datasets.

My personal feeling is that DenseNet is not as popular because its performance on the ImageNet dataset is not as strong as that of other models of the same order of magnitude, such as ResNeXt and SENet.

3. Summary

From ResNet to WRN to ResNeXt, the influence of depth, width, and cardinality on CNN models has been verified. From ResNet to PreAct-ResNet to DenseNet, better results have been obtained by continually optimizing the gradient flow.

In this article, the experimental results of the above four papers on the CIFAR datasets were reproduced in PyTorch. The results are consistent with, and in some cases better than, the original papers: a 3.41% error rate on CIFAR-10 and a 17.11% error rate on CIFAR-100.

References

[1] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

[2] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.

[3] S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, 2016.

[4] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.

[5] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.

[6] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

[7] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In arXiv, 2016.

[8] A. Veit, M. Wilber, and S. Belongie. Residual networks behave like ensembles of relatively shallow networks. In NIPS, 2016.

[9] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. Dual path networks. In NIPS, 2017.

[10] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In arXiv, 2017.

[11] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In arXiv, 2017.

[12] S. Sabour, N. Frosst, and G. Hinton. Dynamic routing between capsules. In NIPS, 2017.
