Going Deeper

1. Background
Before 2006, machine learning theory was, it is fair to say, the world of the SVM (support vector machine). With its solid theoretical foundation, elegant model, and pleasant algorithmic properties, the SVM captured the hearts of countless researchers.
It is said that Yann LeCun, one of the "big three" of deep learning, once had an intense and entertaining argument with Vapnik, the father of the SVM, over SVMs versus neural networks. In the end the two could not agree and each went his own way, and their followers gradually formed two rival "schools".
In that argument, LeCun acknowledged that the SVM is an excellent general-purpose classifier, but that it is in essence only a two-layer model; he dismissed the kernel method as "glorified template matching". Vapnik countered that the SVM offers very clear control over generalization, something neural networks lack. LeCun shot back: "Compared with the ability to compute highly complex functions with limited computational resources, generalization control comes second. In image recognition, shifts, scaling, rotation, lighting conditions, and background clutter make kernel functions built on raw pixel features extremely inefficient, whereas these are a piece of cake for deep architectures such as convolutional networks."
In 2006, Hinton and other leading researchers published papers that overcame several major obstacles to training deep neural networks, and the spring of deep networks arrived:
- The vanishing gradient problem
- The problem of automatically learning features
- The problem of local minima (partially solved)
After 2006, however, although deep networks drew more attention in academia, they stirred barely a ripple in industry. Then, in 2012, Hinton and his student Alex entered ILSVRC 2012 with an 8-layer CNN and won the championship, beating the runner-up by about 11 percentage points. From then on, deep learning shot to fame.
2. Evolution
Since 2012, network depth has increased year after year. Below is a chart of the number of layers of the winning ILSVRC networks over the years:
In 2014, VGG and GoogLeNet reached 19 and 22 layers respectively, and accuracy also rose to an unprecedented level. In 2015, Highway Networks reported that networks as deep as 900 layers could converge, and Microsoft Research launched ResNet, which pushed the depth to 152 layers while improving accuracy and demonstrated convergence for 1200+ layers. In 2016, the number of layers that could be trained effectively was raised to 1001.
We can now say with confidence that the number of layers is no longer the main obstacle, because we have at least one training technique that can conquer 1000+ layers.
3. Is depth really necessary?
At the end of 2013, Lei Jimmy Ba of the University of Toronto and Rich Caruana of Microsoft published a simple but thought-provoking paper: "Do deep nets really need to be deep?"
In this paper the authors use a "model compression" method to mimic a deep model with a shallow network. The result: a shallow model with only a single hidden layer can match the performance of the deep model. From this the authors suggest that better training algorithms for shallow networks may still be waiting to be discovered.
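As a rough sketch of the mimic-training idea (the architectures, dimensions, and hyperparameters here are illustrative assumptions, not the setup of the paper): the shallow student is trained to regress the logits produced by the deep teacher, rather than the hard labels.

```python
import torch
import torch.nn as nn

# Assumed toy dimensions -- not the configuration used in the paper.
num_features, num_classes, hidden = 784, 10, 4096

teacher = nn.Sequential(            # stands in for an already-trained deep model
    nn.Linear(num_features, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, num_classes),
)
student = nn.Sequential(            # a single hidden layer, made wide instead of deep
    nn.Linear(num_features, hidden), nn.ReLU(),
    nn.Linear(hidden, num_classes),
)

optimizer = torch.optim.SGD(student.parameters(), lr=1e-3, momentum=0.9)
mse = nn.MSELoss()

def mimic_step(x):
    """One step of mimic training: match the teacher's logits, not hard labels."""
    with torch.no_grad():
        target_logits = teacher(x)          # soft targets from the deep model
    loss = mse(student(x), target_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

loss = mimic_step(torch.randn(32, num_features))   # dummy batch for illustration
```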
There is another subtext to this result, however: many of the layers in today's deep networks may simply be unnecessary. Rather than racking our brains over how to train ever more layers, we might do better to think about how to design networks and algorithms that use every layer more efficiently.
Another dimension worth questioning is width. Each layer does not necessarily have to be as wide as possible; at the very least we can trim useless width and spend our computational resources more effectively.
GoogLeNet vs. VGG
Although the two authors of the paper above showed admirably careful reflection, almost everyone else still threw themselves without hesitation into the rolling torrent of "going deeper".
GoogLeNet contributed new techniques along both dimensions, depth and width. For depth, two auxiliary softmax classifiers are added in the middle of the network, which strengthens the back-propagated gradient and lets lower-level features already do some of the classification, thereby speeding up convergence. For width, convolution kernels of different sizes are used in parallel; the 1x1 convolution in particular was a very creative touch. (A simplified sketch of such a block follows.)
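To make the width idea concrete, here is a minimal, simplified Inception-style block; the branch layout and channel counts are illustrative assumptions rather than the exact GoogLeNet module. The 1x1 convolutions serve both as an independent branch and as cheap bottlenecks in front of the larger kernels.

```python
import torch
import torch.nn as nn

class MiniInception(nn.Module):
    """Simplified Inception-style block: parallel kernels of different sizes,
    with 1x1 convolutions used both as a branch and as cheap bottlenecks."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 32, kernel_size=1)        # 1x1 branch
        self.branch3 = nn.Sequential(                              # 1x1 bottleneck, then 3x3
            nn.Conv2d(in_ch, 32, kernel_size=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
        )
        self.branch5 = nn.Sequential(                              # 1x1 bottleneck, then 5x5
            nn.Conv2d(in_ch, 16, kernel_size=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, padding=2),
        )

    def forward(self, x):
        # Concatenate the branches along the channel dimension.
        return torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)

y = MiniInception(64)(torch.randn(1, 64, 28, 28))   # -> shape (1, 128, 28, 28)
```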
VGG, apart from stacking convolution kernels of various sizes, has few "tricks". Because of this plainness, it trains slowly and has a large number of parameters, roughly 7 times as many as GoogLeNet. Still, it is worth noting that although VGG itself introduced little that was new, later researchers built on it to develop Batch Normalization, PReLU, and other techniques, pushing the field further forward (a VGG-style stack with these additions is sketched below).
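As a rough illustration of how those later additions sit on top of a plain convolution stack (not the exact VGG configuration; the channel counts are assumptions):

```python
import torch
import torch.nn as nn

def vgg_stage(in_ch, out_ch, num_convs):
    """One VGG-style stage: a stack of 3x3 convolutions followed by pooling,
    here augmented with BatchNorm and PReLU instead of plain ReLU."""
    layers = []
    for i in range(num_convs):
        layers += [
            nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),   # Batch Normalization (Ioffe & Szegedy, 2015)
            nn.PReLU(out_ch),         # PReLU (He et al., 2015)
        ]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

stage = vgg_stage(64, 128, num_convs=2)
y = stage(torch.randn(1, 64, 56, 56))   # -> shape (1, 128, 28, 28)
```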
- GoogLeNet: "Going Deeper with Convolutions", 2014.09
- VGG: "Very Deep Convolutional Networks for Large-Scale Image Recognition"
- Batch Normalization: "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", ICML 2015, S. Ioffe & C. Szegedy
Breakthrough: The Underdogs' Counterattack
A more striking breakthrough came from Switzerland, a country not especially prominent in machine learning. Three authors submitted a paper to ICML 2015 that sounded the horn for even deeper networks: the now-famous "Highway Networks".
In this paper the authors show that, with the network they designed, they can train neural networks up to 900 layers deep (even if this by itself brought no gain, it proves "it can be done"!). They also present the convergence behavior of highway networks up to 100 layers deep:
According to the authors, inspired by the LSTM, they extended the original layer transformation

y = H(x, W_H)

to

y = H(x, W_H) · T(x, W_T) + x · (1 − T(x, W_T))
Here H is the usual nonlinear transformation of the layer, and T is a "transform gate" computed from an affine transformation followed by a sigmoid.
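A minimal sketch of a single fully connected highway layer under these definitions; the layer size, the ReLU inside H, and the negative gate-bias initialization are assumptions in the spirit of the paper rather than its exact setup.

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """Minimal fully connected highway layer: y = H(x)*T(x) + x*(1 - T(x))."""
    def __init__(self, dim):
        super().__init__()
        self.H = nn.Linear(dim, dim)   # the usual transformation
        self.T = nn.Linear(dim, dim)   # the transform gate
        # Bias the gate toward "carry" at the start of training.
        nn.init.constant_(self.T.bias, -2.0)

    def forward(self, x):
        t = torch.sigmoid(self.T(x))                      # gate in (0, 1)
        return torch.relu(self.H(x)) * t + x * (1.0 - t)

x = torch.randn(8, 64)
y = HighwayLayer(64)(x)   # same shape as x, so layers can be stacked arbitrarily deep
```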
But why does this work? Where exactly did it "touch the foot of God"? Unfortunately, the paper offers little analysis of real value. Still, one should be content; it is already good enough, isn't it?
ResNet: A Dance with Dragons
The story took place in December 2015.
The underdogs' paper changed the landscape. People finally realized that simply piling layers onto a plain network does not gain you much more, and experimental data supporting this assertion kept appearing:
Jian Sun's team at Microsoft Research released a deep learning system for image recognition called the residual network (ResNet). They pushed the depth to 152 layers in one stroke and achieved excellent results on the mainstream datasets; they also trained a network of over 1200 layers and compared it against the 152-layer one. See:
Kaiming He, Xiangyu Zhang, Shaoqing Ren & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.
Let's take a look at what this so-called ResNet looks like.
An ordinary network is built from unit blocks like this:
The unit block of ResNet looks like this:
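As a textual stand-in for the two unit diagrams, here is a minimal PyTorch-style sketch (the two-convolution body and channel count are illustrative assumptions): a residual block computes the same stack of layers as a plain block, then adds the input back before the final activation.

```python
import torch
import torch.nn as nn

class PlainBlock(nn.Module):
    """Ordinary unit: just a stack of layers, y = F(x)."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return torch.relu(self.body(x))

class ResidualBlock(PlainBlock):
    """ResNet unit: same body, plus an identity shortcut, y = F(x) + x."""
    def forward(self, x):
        return torch.relu(self.body(x) + x)

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)   # torch.Size([1, 64, 32, 32])
```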
For comparison, here are the full architecture diagrams (they are really long; a glance at the top is enough to get the idea):
The result? Very good overall.
A few months later, in the paper "Identity Mappings in Deep Residual Networks", the same group gave a more detailed and deeper theoretical analysis of ResNet, along with an even stronger result: a 1000-layer network can further improve accuracy.
At the same time, the theoretical analysis led them to an effective improvement, which in turn raised the accuracy of a 200-layer network on the ImageNet dataset. The result was independently verified by Facebook and shown to hold.
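The concrete improvement in that follow-up paper is the "pre-activation" residual unit, in which BatchNorm and ReLU are moved before each convolution so that the shortcut path remains a pure identity. A minimal sketch (the channel count and two-convolution body are illustrative assumptions):

```python
import torch
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    """Pre-activation residual unit: BN and ReLU come before each convolution,
    and nothing is applied after the addition, keeping the shortcut an identity."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(ch), nn.ReLU(), nn.Conv2d(ch, ch, 3, padding=1),
            nn.BatchNorm2d(ch), nn.ReLU(), nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)   # pure identity shortcut

y = PreActResidualBlock(64)(torch.randn(1, 64, 32, 32))
```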
4. Spiral
Materialist philosophy says that things develop in a spiral.
Reincarnation
March and April of 2016 seemed destined to be eventful. After two more years of polishing, Caruana and his colleagues released their study again: "Do Deep Convolutional Nets Really Need to Be Deep (or Even Convolutional)?"
Their conclusions this time are:
- The original conclusion still holds: with fewer parameters, accuracy suffers.
- Convolutional layers are necessary, and one or two of them are not enough; at least three are needed.
- Deep networks still contain a great deal of waste and could be made lighter and smaller.
It may seem that everything is back where it started, but it is not: we now know that convolutional layers are important, even necessary. Our understanding has, after all, climbed a little higher.
Dawn
For a few months, ResNet seemed to reign supreme. Its lack of theoretical analysis, however, left room for others to dig further. In the spring of 2016, three researchers at the University of Chicago published another remarkable paper: "FractalNet: Ultra-Deep Neural Networks without Residuals".
This paper draws a surprising conclusion about ResNet: "FractalNet demonstrates that path length is fundamental for training ultra-deep neural networks; residuals are incidental."
At the same time, it explains why both achieve good accuracy and can be trained to great depth: "Key is the shared characteristic of FractalNet and ResNet: large nominal network depth, but effectively shorter paths for gradient propagation during training."
Since ResNet's structure is only one special case of this principle, there must be other designs that achieve the same effect, and indeed the example given in the paper demonstrates exactly that.
Fractal networks, as a framework, are really beautiful in theory.
For training, the authors propose the drop-path method.
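As a rough illustration of the idea (a simplified sketch, not the paper's exact local/global sampling scheme): drop-path randomly disables whole branches of a join during training and averages the survivors, which forces individual paths of different lengths to work on their own.

```python
import torch

def drop_path_join(branch_outputs, drop_prob=0.15, training=True):
    """Join several parallel branch outputs, randomly dropping whole branches
    during training (always keeping at least one) and averaging the survivors."""
    if not training:
        return torch.stack(branch_outputs).mean(dim=0)
    keep = [b for b in branch_outputs if torch.rand(1).item() > drop_prob]
    if not keep:                                   # never drop every path
        keep = [branch_outputs[torch.randint(len(branch_outputs), (1,)).item()]]
    return torch.stack(keep).mean(dim=0)

a, b = torch.randn(4, 64, 8, 8), torch.randn(4, 64, 8, 8)
y = drop_path_join([a, b])   # same shape as each branch output
```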
5. Epilogue
Neural networks have gone from shallow to deep, to deeper, to ultra-deep. Next, perhaps, we will have to find ways to make deep networks shallow again.
This, too, is a spiral, and also a cycle of rebirth.
First read the book until it grows thick, then read it until it grows thin; that is precisely the idea.