CNNs began with LeNet in the 1990s, then went quiet for roughly a decade in the early 2000s, until AlexNet sparked a second spring in 2012. From ZFNet to VGG, GoogLeNet to ResNet, and the more recent DenseNet, networks have grown ever deeper and architectures ever more complex, and the techniques for preventing gradients from vanishing during backpropagation have become increasingly ingenious.
- LeNet
- AlexNet
- ZFNet
- VGG
- GoogLeNet
- ResNet
- DenseNet
1, LeNet (1998)
Highlights: defined the basic components of CNNs; the progenitor of CNNs.
LeNet is a convolutional neural network presented by Yann LeCun in 1998 to solve the visual task of handwritten digit recognition. Since then, the most basic structure of a CNN has been established: convolutional layers, pooling layers, and fully connected layers. The LeNet used in deep learning frameworks today is a simplified LeNet-5 (the 5 stands for 5 layers), slightly different from the original LeNet; for example, the activation function is changed to the now-common ReLU.
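To make the structure concrete, here is a minimal sketch of the simplified LeNet-5 described above, assuming PyTorch; the layer sizes follow the common 32x32 grayscale-input variant and ReLU replaces the original activations.

```python
import torch
import torch.nn as nn

# Simplified LeNet-5: conv / pool / fully connected layers with ReLU.
lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),    # 32x32 -> 28x28
    nn.ReLU(),
    nn.MaxPool2d(2),                   # 28x28 -> 14x14
    nn.Conv2d(6, 16, kernel_size=5),   # 14x14 -> 10x10
    nn.ReLU(),
    nn.MaxPool2d(2),                   # 10x10 -> 5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
    nn.Linear(120, 84), nn.ReLU(),
    nn.Linear(84, 10),                 # 10 digit classes
)

print(lenet5(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```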
2, AlexNet (2012)
AlexNet won the 2012 ImageNet competition with an absolute margin of more than 10.9% over the runner-up. This made deep learning and convolutional neural networks famous, and deep learning research has sprung up ever since; the appearance of AlexNet marked the return of the convolutional neural network as king.
Highlights: deeper network, data augmentation, ReLU, dropout, LRN
AlexNet uses the following training techniques:
(1) Data augmentation to increase the model's generalization ability: horizontal flipping, random cropping, and illumination (color) transformations, as sketched below.
AlexNet classifies 1000 categories; the input is a 256x256 three-channel color image. To enhance the generalization ability of the model and avoid overfitting, the authors randomly crop the original 256x256 images into 3x224x224 patches, which are fed into the network for training.
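A sketch of this kind of augmentation pipeline, assuming torchvision; ColorJitter stands in for the paper's PCA-based color augmentation and the parameter values are illustrative.

```python
import torchvision.transforms as T

# AlexNet-style augmentation: random 224x224 crops from 256x256 images,
# horizontal flips, and a simple illumination/color jitter.
train_transform = T.Compose([
    T.RandomCrop(224),
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    T.ToTensor(),
])
```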
(2) Using ReLU instead of sigmoid speeds up SGD convergence. The authors verified that ReLU outperforms sigmoid in deeper networks and that it largely avoids the gradient-vanishing (dispersion) problem sigmoid suffers from in deep networks.
(3) Dropout: the principle is similar to ensemble methods in shallow learning. The method lets neurons in the fully connected layers (the model introduces dropout in the first two fully connected layers) lose their activity with a certain probability (for example, 0.5); an inactive neuron no longer participates in forward or backward propagation, so roughly half of the neurons effectively stop functioning on each pass. At test time, the output of every neuron is multiplied by 0.5. Dropout effectively alleviates overfitting of the model.
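A minimal sketch of this classic (non-inverted) dropout rule, assuming NumPy; the function name and interface are hypothetical.

```python
import numpy as np

def dropout_forward(x, p_drop=0.5, train=True):
    """Classic dropout: zero units with probability p_drop at training time,
    and scale all outputs by (1 - p_drop) at test time, matching the
    "multiply outputs by 0.5" rule described above."""
    if train:
        mask = (np.random.rand(*x.shape) >= p_drop).astype(x.dtype)
        return x * mask
    return x * (1.0 - p_drop)
```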
(4) Local response normalization (LRN): the basic idea is that, given a block of activations such as 13x13x256, LRN picks one spatial position, takes the 256 values across all channels at that position, and normalizes them. The motivation is that we may not need too many highly activated neurons at each position of this 13x13 feature map.
The LRN layers appear after the 1st and 2nd convolutional layers; max-pooling layers appear after the two LRN layers and after the last convolutional layer. A ReLU activation is applied after each of these 8 weight layers.
However, many researchers later found that LRN does not help much, and it is rarely used when training networks today.
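For reference, the conv -> ReLU -> LRN -> max-pool pattern after AlexNet's first convolution might look like the sketch below, assuming PyTorch; the LRN hyperparameters follow values commonly quoted for AlexNet (n=5, alpha=1e-4, beta=0.75, k=2) and are illustrative here.

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),                 # conv1
    nn.ReLU(inplace=True),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0), # LRN
    nn.MaxPool2d(kernel_size=3, stride=2),                      # overlapping pool
)

print(block(torch.randn(1, 3, 227, 227)).shape)  # torch.Size([1, 96, 27, 27])
```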
3, ZFNet (2013)
ZFNet was the champion of the 2013 ImageNet classification task. Its network structure is not fundamentally new; it only tunes the hyperparameters, yet its performance improves considerably over AlexNet. ZFNet merely shrinks AlexNet's first convolution kernel from 11 to 7 and its stride from 4 to 2, and changes the 3rd, 4th, and 5th convolutional layers to 384, 384, and 256 channels. That year's ImageNet was a relatively calm one, and the fame of its champion ZFNet is not as great as that of the other classic architectures.
4, VGG-Nets (2014)
VGG-Nets, proposed by the Visual Geometry Group (VGG) at the University of Oxford, were the base networks of the first- and second-place entries in the 2014 ImageNet localization and classification tasks. VGG can be seen as a deepened version of AlexNet: both are built from conv layers followed by FC layers. At the time it seemed to be a very deep network, with well over ten weight layers, as the paper title suggests ("Very Deep Convolutional Networks for Large-Scale Image Recognition"); of course, by today's standards VGG is not really a very deep network.
The 16 in VGG-16 refers to the total number of conv + FC layers being 16; it does not count the max-pooling layers!
To deal with issues such as weight initialization, VGG uses a pre-training approach, often seen in classic neural networks: first train some smaller networks, and once that part of the network is stable, gradually deepen it on that basis.
The structure of VGG-16 is very regular, and it is much deeper than AlexNet. It contains several conv->conv->max_pool blocks; the convolutions in VGG are "same" convolutions, i.e. the output of a convolution keeps the spatial size of its input, and downsampling is done entirely by max pooling.
The VGG network ends with 3 fully connected layers. The number of filters (output channels of the convolutions) starts at 64 and doubles after each pooling: 64, 128, 256, 512. VGG's notable contribution is the use of small-sized filters together with a regular convolution-pooling pattern.
The first layers of the VGG network take up most of the memory, while the fully connected layers at the top hold most of the parameters.
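One conv->conv->max_pool stage of this pattern might look like the following sketch, assuming PyTorch; `vgg_block` is a hypothetical helper, and the channel progression mirrors the 64, 128, 256, 512 doubling described above.

```python
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs=2):
    """conv -> ... -> conv -> max_pool, VGG style: 3x3 'same' convolutions
    (padding=1) keep the spatial size, the 2x2 max pool halves it."""
    layers = []
    for _ in range(n_convs):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
        in_ch = out_ch
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

x = torch.randn(1, 3, 224, 224)
for out_ch, n in [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]:
    x = vgg_block(x.shape[1], out_ch, n)(x)
print(x.shape)  # torch.Size([1, 512, 7, 7]) -> flattened and fed to 3 FC layers
```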
Highlights:
(1) Small 1x1 and 3x3 convolution kernels are used instead of large kernels: two stacked 3x3 convolutions have the same receptive field as one 5x5 convolution.
Advantages of the 3x3 kernel: (a) a stack of 3x3 conv layers has more non-linearity than a single large-kernel conv layer, which makes the decision function more discriminative; (b) it has fewer parameters: assuming the input and output of the conv layers both have C channels, three stacked 3x3 conv layers have 3x(3x3xCxC) = 27C^2 parameters, while a single 7x7 conv layer has 49C^2; so three 3x3 filters can be viewed as a decomposition of one 7x7 filter (with extra non-linearities inserted between the intermediate layers).
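The parameter arithmetic above, written out for a concrete channel count (C = 256 is an illustrative choice, biases ignored):

```python
# Three stacked 3x3 conv layers vs. one 7x7 conv layer, C -> C channels.
C = 256
three_3x3 = 3 * (3 * 3 * C * C)   # 27 * C^2
one_7x7 = 7 * 7 * C * C           # 49 * C^2
print(three_3x3, one_7x7, three_3x3 / one_7x7)  # 1769472 3211264 ~0.55
```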
Advantage of the 1x1 kernel: without changing the input/output spatial dimensions, it applies a linear transformation across channels followed by a ReLU, which increases the network's non-linear expressive power.
(2) A deeper network: essentially a deepened version of AlexNet.
5, GoogLeNet (2014)
GoogLeNet defeated VGG-Nets to win the 2014 ImageNet classification task, so its strength is obviously considerable. Unlike AlexNet and VGG-Nets, which improve performance simply by deepening the network, GoogLeNet takes another path: while deepening the network (22 layers), it also innovates on the network structure by introducing the Inception module to replace the traditional operation of plain convolution + activation (an idea first proposed by Network in Network). GoogLeNet pushed research on convolutional neural networks to a new height.
Highlights:
(1) Introduces the Inception module and uses 1x1 convolutions for dimensionality reduction, solving the problem of excessive computation;
(2) Adds auxiliary loss units at intermediate layers; the losses computed in the middle of the network are combined with the final loss to update the network;
(3) Replaces the fully connected layers at the end with simple global average pooling, which has far fewer parameters; although the network is deep, it has only about 1/12 the parameters of AlexNet;
The convolutions in the Inception module all use stride 1, and zero padding is used to keep the feature maps the same size. Each convolution layer is immediately followed by a ReLU. Before the output there is a concatenate layer, which literally means "placed side by side": the 4 groups of feature maps, of different types but the same size, are stacked along the channel dimension to form a new set of feature maps.
The Inception module does two main things: 1. It extracts feature maps from the input along 4 paths: a 3x3 pooling path and convolutions at three different scales (1x1, 3x3, and 5x5). 2. To reduce the amount of computation, and at the same time to pass information through fewer connections and obtain sparser features, it uses 1x1 convolutions for dimensionality reduction.
1x1 convolutions for dimensionality reduction: for example, a 1x1 convolution first reduces 256 channels to 64, the convolution on each Inception branch then operates on 64 channels, and a final 1x1 convolution maps 64 channels back to 256.
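A sketch of such a four-branch Inception module, assuming PyTorch; the class name and the per-branch channel counts are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class InceptionSketch(nn.Module):
    """Four parallel branches (1x1, 1x1->3x3, 1x1->5x5, 3x3 pool->1x1),
    all stride 1 with padding so the spatial size is preserved, then
    concatenated along the channel dimension."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c3_red, c3, 3, padding=1), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c5_red, c5, 5, padding=2), nn.ReLU(inplace=True))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # "concatenate": stack the four same-sized feature maps channel-wise
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

m = InceptionSketch(256, 64, 96, 128, 16, 32, 32)
print(m(torch.randn(1, 256, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```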
(1) It greatly reduces the number of multiply-accumulate (MAC) operations;
(2) A 1x1 convolution has only 1/9 of the parameters of a 3x3 convolution.
The GoogLeNet architecture has 3 loss units; this design helps the network converge. The auxiliary loss units added at intermediate layers are meant to make the low-level features discriminative when computing the loss, so that the network can be trained better. In the paper, the losses of the two auxiliary loss units are multiplied by 0.3 and then added to the final loss to form the overall loss function used to train the network.
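A sketch of how the three losses might be combined during training, assuming PyTorch; `total_loss` is a hypothetical helper and the logits are placeholders for the three classifier heads.

```python
import torch.nn.functional as F

def total_loss(main_logits, aux1_logits, aux2_logits, targets):
    """Final loss plus the two auxiliary losses weighted by 0.3."""
    main = F.cross_entropy(main_logits, targets)
    aux1 = F.cross_entropy(aux1_logits, targets)
    aux2 = F.cross_entropy(aux2_logits, targets)
    return main + 0.3 * (aux1 + aux2)
```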
Another bright spot of GoogLeNet worth mentioning is that the fully connected layers at the end are replaced with simple global average pooling, which leaves far fewer parameters; in AlexNet, the last 3 fully connected layers account for almost 90% of the total parameters. Using a network that is large in both width and depth allows GoogLeNet to remove the fully connected layers without hurting accuracy; it achieves 93.3% (top-5) accuracy on ImageNet and is faster than VGG.
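A sketch of a global-average-pooling classification head of this kind, assuming PyTorch; the 1024-channel input and the 1000-way classifier are illustrative.

```python
import torch
import torch.nn as nn

# Each channel's feature map is averaged to a single number, then one small
# linear layer does the classification, replacing the heavy FC stack.
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # (N, 1024, 7, 7) -> (N, 1024, 1, 1)
    nn.Flatten(),              # -> (N, 1024)
    nn.Linear(1024, 1000),     # single, lightweight classifier
)
print(head(torch.randn(2, 1024, 7, 7)).shape)  # torch.Size([2, 1000])
```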
6, ResNet (2015)
In 2015, Kaiming He unveiled ResNet, which swept the competition in ILSVRC and COCO and won the championships.
Highlights:
(1) The network is extremely deep, exceeding one hundred layers, and does not use dropout;
(2) Residual units are introduced to solve the degradation problem; each unit contains two convolutions and can in principle fit an arbitrary function;
(3) The heavy fully connected layers are removed; only a single 1000-way FC layer is kept for classification.
As seen above, accuracy should increase as the network gets deeper (paying attention to overfitting, of course). But one problem with increasing depth concerns the signal for parameter updates: since gradients propagate from the back of the network to the front, the gradients of the earlier layers become very small when the network is deep, which means those layers essentially stop learning. This is the gradient-vanishing problem. The second problem with deep networks is training: a deeper network means a larger parameter space and a harder optimization problem, so simply increasing depth can actually yield higher training error. The deep network converges, but then begins to degrade: adding layers leads to larger error. A 56-layer network performs worse than a 20-layer one, and not because of overfitting (the training error is still high); this is the vexing degradation problem. The residual modules designed in ResNet allow us to train much deeper networks.
The authors argue that learning the residual F(x) = H(x) - x is easier than learning H(x) directly. Intuitively, we only have to learn the difference between input and output: the absolute quantity becomes a relative one (how much the output changes with respect to the input), which is much easier to optimize.
Two forms of the residual module are shown. The one on the left is the regular residual module, composed of two 3x3 convolutions; as the network deepens, however, this structure becomes less effective in practice. To solve this, the "bottleneck residual block" on the right works better. It is formed by three convolution layers of sizes 1x1, 3x3, and 1x1, where the 1x1 convolutions reduce and then restore the channel dimension, so the 3x3 convolution operates on a relatively low-dimensional input; this improves computational efficiency while still providing rich feature combinations.
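A sketch of such a bottleneck residual block, assuming PyTorch; the class name, the batch-norm placement (post-activation variant), and the 256/64 channel counts are illustrative.

```python
import torch
import torch.nn as nn

class BottleneckSketch(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 restore, added back to the input:
    y = F(x) + x, i.e. the block learns the residual F(x) = H(x) - x."""
    def __init__(self, channels=256, bottleneck=64):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, bottleneck, 1, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, bottleneck, 3, padding=1, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)   # identity shortcut

print(BottleneckSketch()(torch.randn(1, 256, 56, 56)).shape)  # (1, 256, 56, 56)
```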
ResNet has many improved variants:
(1) As above, if the shortcut and the function applied after the addition are both identity mappings, then during both forward and backward propagation the signal can pass directly from one unit to any other, and training becomes simpler;
(2) Increasing the network width suggests that the residual unit, not the depth, is the key to the performance gain; however, increasing the width raises memory consumption and lowers computational efficiency;
7, DenseNet
DenseNet (Dense Convolutional Network) is mainly compared with ResNet and Inception networks. It borrows ideas from them but is a brand-new structure; the architecture is not complicated, yet it is very effective, surpassing ResNet across the board on the CIFAR benchmarks. It can be said that DenseNet absorbs the most essential part of ResNet and does more innovative work on top of it, improving network performance further.
Highlights:
(1) Dense connections: alleviate the gradient-vanishing problem, strengthen feature propagation, encourage feature reuse, and greatly reduce the number of parameters.
DenseNet is a convolutional neural network with dense connections. In this network there is a direct connection between any two layers: the input of each layer is the concatenation of the outputs of all preceding layers, and the feature maps learned by that layer are passed directly to all subsequent layers as input. DenseNet is built from dense blocks; inside a block, the structure of each layer is basically the same as the bottleneck in ResNet: BN-ReLU-Conv(1x1)-BN-ReLU-Conv(3x3). A DenseNet is composed of multiple such dense blocks. The layers between dense blocks are called transition layers and consist of BN -> Conv(1x1) -> AveragePooling(2x2).
Does dense connectivity bring redundancy? No! The first impression of dense connections is that they greatly increase the number of parameters and the amount of computation of the network. In fact, DenseNet is more efficient than other networks; the key is the reduced amount of computation in each layer and the reuse of features. DenseNet lets the input of the l-th layer directly affect all subsequent layers; its output is x_l = H_l([x_0, x_1, ..., x_{l-1}]), where [x_0, x_1, ..., x_{l-1}] concatenates the preceding feature maps along the channel dimension. Since each layer receives the outputs of all preceding layers, a few feature maps per layer are enough, which is why DenseNet has far fewer parameters than other models. The dense connections are equivalent to connecting every layer directly to the input and to the loss, which reduces gradient vanishing and makes deeper networks unproblematic.
To be clear, dense connectivity exists only within a dense block; there are no dense connections between different dense blocks, as shown in the figure.
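A sketch of the connectivity pattern inside one dense block, assuming PyTorch; the class and function names, the growth rate, and the building of layers on the fly are purely illustrative (a real implementation would register the layers as module parameters).

```python
import torch
import torch.nn as nn

class DenseLayerSketch(nn.Module):
    """One DenseNet layer: BN-ReLU-Conv(1x1)-BN-ReLU-Conv(3x3),
    producing growth_rate new feature maps."""
    def __init__(self, in_ch, growth_rate=32, bottleneck=4):
        super().__init__()
        mid = bottleneck * growth_rate
        self.f = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, growth_rate, 3, padding=1, bias=False),
        )

    def forward(self, x):
        return self.f(x)

def dense_block(x, n_layers=4, growth_rate=32):
    # Every layer sees the concatenation of all previous outputs:
    # x_l = H_l([x_0, ..., x_{l-1}])
    features = [x]
    for _ in range(n_layers):
        layer = DenseLayerSketch(sum(f.shape[1] for f in features), growth_rate)
        features.append(layer(torch.cat(features, dim=1)))
    return torch.cat(features, dim=1)

print(dense_block(torch.randn(1, 64, 32, 32)).shape)  # (1, 192, 32, 32): 64 + 4*32
```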
There is no such thing as a free lunch, and networks are no exception. Better convergence at the same depth naturally comes at an extra cost, and one of DenseNet's costs is its terrible memory footprint.
Summary:
(1) Inception-v4: the highest accuracy; combines ResNet and Inception;
(2) GoogLeNet: high efficiency; relatively little memory and few parameters;
(3) VGG: acceptable accuracy, but low efficiency, high memory consumption, and many parameters;
(4) AlexNet: little computation, but heavy memory use and low accuracy;
(5) ResNet: efficiency depends on the specific model; accuracy is high;
(6) VGG, GoogLeNet, and ResNet are used the most, with ResNet usually the best default choice;
(7) When designing networks, pay attention to depth and to cross-layer connections, and enhance gradient flow.