Deep Learning - Classic Convolutional Neural Network Architectures (LeNet-5, AlexNet, ZFNet, VGG-16, GoogLeNet, ResNet)

A summary of the classic CNN (convolutional neural network) architectures.

The figures below are adapted from this blog: http://blog.csdn.net/cyh_24/article/details/51440344

II. LeNet-5 Network

    • Input size: 32*32
    • Convolutional layers: 2
    • Subsampling (pooling) layers: 2
    • Fully connected layers: 2
    • Output layer: 1, with 10 categories (probabilities for the digits 0-9)

LeNet-5 is trained on grayscale images; the input size is 32*32*1. Excluding the input layer, it has 7 layers in total, each containing trainable parameters (connection weights). Note: each layer has multiple feature maps, each feature map extracts one feature of the input through a convolution filter, and each feature map contains multiple neurons.

1. The C1 layer is a convolutional layer (the convolution operation enhances the features of the original signal and reduces noise).

The first layer uses six 5*5 filters with stride s = 1 and padding = 0. That is, it produces 6 feature maps; each neuron in a feature map is connected to a 5*5 neighborhood of the input, and the resulting output size is 28*28*6. C1 has 156 trainable parameters (each filter has 5*5 = 25 weights plus one bias, and there are 6 filters, so (5*5+1)*6 = 156 parameters), for a total of 156*(28*28) = 122,304 connections.

2. The S2 layer is a subsampling layer (average pooling). Using the local correlation of images, subsampling reduces the amount of data to process while retaining useful information, which reduces the number of training parameters and the degree of overfitting of the model.

The second layer uses a 2*2 filter with stride s = 2 and padding = 0. Each unit in a feature map is connected to a 2*2 neighborhood of the corresponding feature map in C1; there are six 14*14 feature maps, and the output size is 14*14*6. The pooling layer has only the hyperparameters F and S and no parameters to learn.

3. The C3 layer is a convolutional layer

The third layer uses sixteen 5*5 filters with stride s = 1 and padding = 0. That is, it produces 16 feature maps; each neuron in a feature map is connected to a 5*5 neighborhood (across all 6 input maps) of S2, and the resulting output size is 10*10*16. With this full connectivity, C3 has (5*5*6+1)*16 = 2,416 trainable parameters (each filter has 5*5*6 = 150 weights plus one bias, and there are 16 filters), matching the table below.

4. The S4 layer is a subsampling layer (average pooling)

The fourth layer uses a 2*2 filter with stride s = 2 and padding = 0. Each unit in a feature map is connected to a 2*2 neighborhood of the corresponding feature map in C3; there are sixteen 5*5 feature maps, and the output size is 5*5*16. There are no parameters to learn.

5. The F5 layer is a fully connected layer

It has 120 units. Each unit is fully connected to all 400 units of the S4 layer (5*5*16 = 400 after flattening). The F5 layer has 120*(400+1) = 48,120 trainable parameters.

Like a classical neural network layer, F5 computes the dot product between its input vector and its weight vector, plus a bias.

6. The F6 layer is a fully connected layer

It has 84 units. Each unit is fully connected to all 120 units of the F5 layer. The F6 layer has 84*(120+1) = 10,164 trainable parameters.

Like a classical neural network layer, F6 computes the dot product between its input vector and its weight vector, plus a bias.

7. Output Layer

The output layer consists of Euclidean radial basis function (RBF) units, one unit per class, each with 84 inputs.

In other words, each RBF output unit computes the Euclidean distance between its input vector and its parameter vector. The farther the input is from the parameter vector, the larger the RBF output.

In probabilistic terms, the RBF output can be interpreted as the negative log-likelihood of a Gaussian distribution in the configuration space of the F6 layer.

Given a loss function, training should drive the configuration of F6 close enough to the RBF parameter vector corresponding to the pattern's expected class.

Summary:

As the network gets deeper, the width and height of the feature maps shrink while the number of channels increases. An arrangement of one or more convolutional layers followed by a pooling layer, repeated and then finished with fully connected layers, is still very common today.

Layer | Activation Shape | Activation Size | Parameters (W, b)
Input | (32,32,1) | 1024 | 0
CONV1 (f=5, s=1) | (28,28,6) | 4704 | (5*5+1)*6 = 156
POOL1 | (14,14,6) | 1176 | 0
CONV2 (f=5, s=1) | (10,10,16) | 1600 | (5*5*6+1)*16 = 2416
POOL2 | (5,5,16) | 400 | 0
FC3 | (120,1) | 120 | 120*(400+1) = 48120
FC4 | (84,1) | 84 | 84*(120+1) = 10164
Softmax | (10,1) | 10 | 10*(84+1) = 850
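
As a concrete reference for the shapes and parameter counts in this table, here is a minimal PyTorch sketch of the same LeNet-5-style layout. It follows the table rather than the original 1998 paper exactly: ReLU activations and a plain linear output layer stand in for the original tanh/RBF units, which is an assumption of this sketch.

```python
import torch
import torch.nn as nn

# Minimal LeNet-5-style network matching the table above:
# 32x32x1 -> C1 28x28x6 -> S2 14x14x6 -> C3 10x10x16 -> S4 5x5x16 -> FC 120 -> FC 84 -> 10
class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=0),   # C1: (5*5+1)*6 = 156 params
            nn.ReLU(),
            nn.AvgPool2d(kernel_size=2, stride=2),                 # S2: no trainable params
            nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0),  # C3: (5*5*6+1)*16 = 2416 params
            nn.ReLU(),
            nn.AvgPool2d(kernel_size=2, stride=2),                 # S4: no trainable params
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(5 * 5 * 16, 120),  # F5: 120*(400+1) = 48120 params
            nn.ReLU(),
            nn.Linear(120, 84),          # F6: 84*(120+1) = 10164 params
            nn.ReLU(),
            nn.Linear(84, num_classes),  # output: 10*(84+1) = 850 params
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5()
x = torch.randn(1, 1, 32, 32)
print(model(x).shape)                              # torch.Size([1, 10])
print(sum(p.numel() for p in model.parameters()))  # 61706 = 156+2416+48120+10164+850
```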

III. AlexNet Network

The AlexNet network has 5 convolutional layers, 3 pooling layers, and 3 fully connected layers (including the output layer).

The structure of a convolutional neural network is not a simple combination of individual layers; it is built from "modules", and within a module the arrangement of the layers matters. For example, the AlexNet structure diagram is composed of eight modules.

1. AlexNet - modules one and two

Structure type: convolution-activation function (ReLU)-downsampling (pooling)-normalization

These two modules form the front part of the CNN and make up one computational unit, the standard pattern for a convolution stage: from a macro point of view, a convolutional layer followed by a downsampling layer, repeated in a loop, with suitable functions inserted in between to control the range of values so that the subsequent iterations can be computed.

2. AlexNet - modules three and four

Modules three and four are also two identical convolution stages; the difference is that they have no downsampling (pooling) layer. The reason is related to the input size: the feature maps here are already relatively small, so no downsampling is applied.

3. AlexNet - module five

Module five is again a convolution-plus-pooling stage, similar to modules one and two. Its output is actually a small 6*6 feature map. (In general a design can go all the way down to a 1*1 map; since ImageNet images are large, 6*6 is normal.) The original 227*227-pixel input shrinks to 6*6 mainly because of the downsampling (pooling) layers; of course the convolutional layers also shrink the image, so it gets smaller layer by layer.

4. Modules six, seven, and eight

Modules six and seven are the fully connected layers. A fully connected layer has the same structure as an artificial neural network: there are a huge number of nodes and a huge number of connections, which is why a dropout layer is introduced to randomly deactivate a portion of the nodes.

Module eight produces the output and uses softmax for classification. There are as many outputs as there are classes, and each node holds the probability of belonging to that class.

AlexNet Summary:

    • Input size: 227*227*3
    • Convolutional layers: 5
    • Downsampling (pooling) layers: 3
    • Fully connected layers: 2
    • Output layer: 1, with 1000 categories
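
Putting the eight modules together, here is a rough PyTorch sketch of an AlexNet-style network. The channel counts (96, 256, 384, 384, 256) and the 4096-unit fully connected layers follow the original paper; local response normalization is omitted here, so treat this as an illustrative layout rather than the exact original model.

```python
import torch
import torch.nn as nn

# AlexNet-style sketch: 5 conv layers, 3 max-pool layers, 2 hidden FC layers + output.
class AlexNet(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),    # 227 -> 55
            nn.MaxPool2d(kernel_size=3, stride=2),                    # 55 -> 27   (module 1)
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),  # 27 -> 27
            nn.MaxPool2d(kernel_size=3, stride=2),                    # 27 -> 13   (module 2)
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(), # module 3 (no pooling)
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(), # module 4 (no pooling)
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(), # module 5
            nn.MaxPool2d(kernel_size=3, stride=2),                    # 13 -> 6
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),  # module 6
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),         # module 7
            nn.Linear(4096, num_classes),                               # module 8 (softmax in the loss)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

x = torch.randn(1, 3, 227, 227)
print(AlexNet()(x).shape)  # torch.Size([1, 1000])
```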

IV. ZFNet

V. VGG-16 Network

VGGNet is a deep convolutional neural network developed by the Visual Geometry Group at Oxford University together with researchers at Google DeepMind.

VGGNet explores the relationship between the depth of a convolutional neural network and its performance. By repeatedly stacking small 3*3 convolution kernels and 2*2 max pooling layers, VGGNet successfully builds convolutional neural networks 16 to 19 layers deep. Compared with previous state-of-the-art network structures, VGGNet's error rate drops sharply. The VGGNet paper uses small 3*3 convolution kernels and 2*2 max pooling kernels and improves performance by deepening the network structure.

The VGG-16 and VGG-19 structures are as follows:

Summary:

(1) The "16" in VGG-16 means there are 16 layers with parameters; the total number of parameters is about 138 million.

(2) The VGG-16 network structure is very regular, without many hyperparameters, and focuses on building a simple network: a few convolutional layers are always followed by a pooling layer that compresses the image size. That is, it uses only small 3*3 convolution kernels and 2*2 max pooling layers.

Convolutional layers: conv = 3*3 filters, s = 1, same padding.

Pooling layers: max pool = 2*2, s = 2.

(3) Advantage: it simplifies the structure of the convolutional neural network. Disadvantage: the number of parameters to train is very large.

(4) As the network deepens, the width and height of the feature maps decrease according to a regular pattern, roughly halving after each pooling layer, while the number of channels doubles.
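
To make this regular structure concrete, here is a minimal PyTorch sketch of the VGG-16 configuration (13 convolutional layers plus 3 fully connected layers, using only 3*3 "same" convolutions and 2*2 max pooling). The 224*224*3 input size and channel plan follow the paper.

```python
import torch
import torch.nn as nn

# VGG-16 convolutional stack: numbers are output channels; 'M' marks a 2x2 max-pool layer.
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']

def make_vgg16(num_classes=1000):
    layers, in_ch = [], 3
    for v in VGG16_CFG:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))   # halves width and height
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1), nn.ReLU()]
            in_ch = v
    layers += [nn.Flatten(),
               nn.Linear(512 * 7 * 7, 4096), nn.ReLU(),
               nn.Linear(4096, 4096), nn.ReLU(),
               nn.Linear(4096, num_classes)]
    return nn.Sequential(*layers)

model = make_vgg16()
print(model(torch.randn(1, 3, 224, 224)).shape)           # torch.Size([1, 1000])
print(sum(p.numel() for p in model.parameters()) / 1e6)   # ~138 million parameters
```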

VI. Inception Network (Google) -- GoogLeNet

(i) Overview

The safest way to obtain a high-quality model is to increase its depth (number of layers) or its width (number of kernels or neurons per layer), but this straightforward design approach has the following drawbacks:

1. Too many parameters: if the training set is limited, the model easily overfits;

2. The larger the network, the greater the computational complexity, and the harder it is to apply;

3. The deeper the network, the harder it is for gradients to propagate backward, and the harder the model is to optimize.

The fundamental way to address these drawbacks is to turn fully connected layers, and even ordinary convolutions, into sparse connections. To break network symmetry and improve learning ability, traditional networks used random sparse connections. However, computer hardware and software are very inefficient on non-uniform sparse data, which is why fully connected layers were brought back in AlexNet, to better exploit parallel computation. The question, then, is whether there is a way to keep the sparsity of the network structure while still exploiting the high computational performance of dense matrices.

(ii) Introduction to the Inception Module

The main idea of the Inception architecture is to find out how to approximate the optimal local sparse structure using readily available dense components.

A few notes:

1. Using convolution kernels of different sizes means receptive fields of different sizes; concatenating the results at the end means fusing features at different scales;

2. The kernel sizes are 1*1, 3*3, and 5*5, mainly for easy alignment. With the convolution stride set to stride = 1 and padding set to 0, 1, and 2 respectively, "same" convolutions yield features with identical spatial dimensions, which can then be concatenated directly;

3. The paper notes that many results have shown pooling to be very effective, so pooling is embedded inside the Inception module as well;

4. The deeper into the network, the more abstract the features and the larger the receptive field each feature involves, so the proportion of 3x3 and 5x5 convolutions increases with depth.

Inception: instead of manually deciding which filter type to use in a convolutional layer, or whether to add a convolutional or a pooling layer, the network decides these choices itself. You add all the candidate options to the network, concatenate their outputs, and the network learns which parameters it needs.

Drawback of the naive Inception module: computational cost. Using 5x5 convolution kernels still results in a huge amount of computation, roughly 120 million multiplications.

To reduce the computational cost, 1x1 convolutions are used for dimensionality reduction, as follows:

A 1x1 convolution is placed in front of the 3x3 and 5x5 filters and after the max pooling, and the branch outputs are then concatenated along the channel (depth) axis.

The final output size is 28*28*256, and the number of convolution parameters is reduced by roughly a factor of four, giving the final version of the Inception module:
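
A minimal PyTorch sketch of such an Inception module with 1x1 reductions is shown below. The branch channel counts (64 + 128 + 32 + 32 = 256 on a 28*28*192 input) are chosen to match the 28*28*256 example above; other stages of GoogLeNet use different counts.

```python
import torch
import torch.nn as nn

# Inception module with 1x1 dimensionality reduction; the four branch outputs
# all keep the same spatial size and are concatenated along the channel axis.
class InceptionModule(nn.Module):
    def __init__(self, in_ch, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU())
        self.branch3 = nn.Sequential(nn.Conv2d(in_ch, c3_reduce, 1), nn.ReLU(),
                                     nn.Conv2d(c3_reduce, c3, 3, padding=1), nn.ReLU())
        self.branch5 = nn.Sequential(nn.Conv2d(in_ch, c5_reduce, 1), nn.ReLU(),
                                     nn.Conv2d(c5_reduce, c5, 5, padding=2), nn.ReLU())
        self.branch_pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                         nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU())

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

# 28*28*192 input -> 28*28*(64+128+32+32) = 28*28*256 output, as in the text above.
block = InceptionModule(192, c1=64, c3_reduce=96, c3=128, c5_reduce=16, c5=32, pool_proj=32)
print(block(torch.randn(1, 192, 28, 28)).shape)   # torch.Size([1, 256, 28, 28])
```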

(iii) Introduction to GoogLeNet

1. GoogLeNet - Inception V1 Structure

The main ideas of GoogLeNet revolve around these two points:

(1) Depth: the network goes deeper, to 22 layers in the paper. To avoid the gradient vanishing problem mentioned above, GoogLeNet cleverly adds two auxiliary losses at different depths so that gradients do not vanish on the way back.

(2) Width: kernels of several sizes are added, 1x1, 3x3, and 5x5, plus direct max pooling. But if these were simply applied to the feature map, the concatenated feature map would be very thick. To avoid this, the Inception module in GoogLeNet adds 1x1 convolutions before the 3x3 and 5x5 convolutions and after the max pooling to reduce the feature map thickness.

A few notes:
(1) GoogLeNet clearly adopts a modular Inception structure (nine Inception modules), 22 layers in total, which makes it easy to add and modify layers;

(2) The network replaces the final fully connected layer with average pooling, an idea taken from NIN (Network in Network); the parameter count is only 1/12 of AlexNet's while performance is better, and it turns out this improves top-1 accuracy by 0.6%. In practice a fully connected layer is still added at the end, mainly to make fine-tuning convenient;

(3) Although the fully connected layers are removed, dropout is still used in the network;

(4) To avoid vanishing gradients, the network adds two auxiliary softmax classifiers to conduct the gradient forward. The paper says the losses of these two auxiliary classifiers should be weighted by a decay factor, but the Caffe model does not apply any decay. The two extra softmax branches are removed at test time.

(5) The GoogLeNet version described above uses what is now called the Inception V1 structure.

2. Inception V2 Structure

A large convolution kernel brings a larger receptive field, but also more parameters; for example, a 5x5 kernel has 25/9 ≈ 2.78 times as many parameters as a 3x3 kernel.

For this reason, the authors propose that a small network of two consecutive 3x3 convolutional layers (stride = 1) can replace a single 5x5 convolutional layer. This is the Inception V2 structure: the receptive field is kept the same while the number of parameters is reduced, as illustrated below.
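
A small PyTorch sketch of this factorization: two stacked 3x3 convolutions produce the same output size (and receptive field) as one 5x5 convolution, but with fewer weights. The 128-channel width is just an illustrative assumption.

```python
import torch
import torch.nn as nn

in_ch, out_ch = 128, 128

# One 5x5 convolution vs. two stacked 3x3 convolutions with the same receptive field.
conv5x5 = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)
two_conv3x3 = nn.Sequential(
    nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
)

def num_params(m):
    return sum(p.numel() for p in m.parameters())

x = torch.randn(1, in_ch, 17, 17)
print(conv5x5(x).shape, two_conv3x3(x).shape)          # same 17x17 output size
print(num_params(conv5x5), num_params(two_conv3x3))    # 409728 vs 295168
print(num_params(conv5x5) / num_params(two_conv3x3))   # ~1.39x (25 weights vs 2*9 per position)
```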

3. Inception V3 Structure

A large convolution kernel can be replaced by a series of 3x3 kernels; can the 3x3 kernel itself be decomposed even further?

The paper considers n x 1 convolution kernels, shown replacing the 3x3 convolution:

Therefore, any n x n convolution can be replaced by a 1 x n convolution followed by an n x 1 convolution. In practice, the authors found that this factorization does not work well in the early layers of the network; it works better on medium-sized feature maps (for an m x m feature map, m between 12 and 20 is recommended).

An n x 1 factorization is used to replace the large convolution kernel, with n = 7 to handle 17x17 feature maps. This structure is formally used in GoogLeNet V2.
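
Below is a brief PyTorch sketch of this spatial factorization with n = 7 on a 17x17 feature map; the 768-channel width is an illustrative assumption.

```python
import torch
import torch.nn as nn

ch = 768

# A 7x7 convolution replaced by a 1x7 convolution followed by a 7x1 convolution.
factorized_7x7 = nn.Sequential(
    nn.Conv2d(ch, ch, kernel_size=(1, 7), padding=(0, 3)), nn.ReLU(),
    nn.Conv2d(ch, ch, kernel_size=(7, 1), padding=(3, 0)), nn.ReLU(),
)
full_7x7 = nn.Conv2d(ch, ch, kernel_size=7, padding=3)

x = torch.randn(1, ch, 17, 17)
print(factorized_7x7(x).shape)   # torch.Size([1, 768, 17, 17]) -- same spatial size
params = lambda m: sum(p.numel() for p in m.parameters())
print(params(full_7x7), params(factorized_7x7))   # the factorized version has far fewer weights
```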

4. Inception V4 structure: it combines Inception with the residual network, ResNet.

Reference links: http://blog.csdn.net/stdcoutzyx/article/details/51052847

http://blog.csdn.net/shuzfan/article/details/50738394#googlenet-inception-v2

VII. Residual Neural Network -- ResNet

(i) Overview

The depth of a deep learning network has a great impact on the final classification and recognition performance, so the natural idea is to design the network as deep as possible. But in fact this does not hold: when a regular stacked network (a plain network) becomes very deep, its performance gets worse and worse. One reason is that the deeper the network, the more pronounced the vanishing gradient phenomenon, so training no longer works well. Since shallower networks cannot obviously improve recognition performance either, the problem becomes how to deepen the network while solving the vanishing gradient problem.

(ii) Residual Module -- Residual Block

By stacking y = x layers (identity mappings) on top of a shallow network, the network can be made deeper without degradation. This suggests that multilayer nonlinear networks have difficulty approximating identity mappings. However, avoiding degradation is not the goal; we want networks with better performance.

ResNet learns the residual function F(x) = H(x) - x; if F(x) = 0, we get exactly the identity mapping mentioned above. In fact, ResNet is the special case of "shortcut connections" in which the shortcut is the identity map, which introduces no extra parameters and no extra computational complexity.

If the optimization objective is to approximate an identity mapping rather than a zero mapping, it is easier to learn a small perturbation of the identity map than to learn the mapping function from scratch.

The residual function generally has smaller response fluctuations, which suggests that the identity mapping is a reasonable preconditioning.

Residual Module Summary:

Very deep networks are hard to train because of the vanishing and exploding gradient problems. With a skip connection, the activation of one layer can be fed quickly to another, much deeper layer. Skip connections make it possible to build the residual network ResNet and train much deeper networks; a ResNet is built up from residual modules.

Consider a two-layer neural network: starting from the activation a[l] at layer l, one activation step gives a[l+1], and another gives a[l+2], via the following formulas:

z[l+1] = W[l+1] a[l] + b[l+1], a[l+1] = g(z[l+1])
z[l+2] = W[l+2] a[l+1] + b[l+2], a[l+2] = g(z[l+2])

A residual block adds a[l] to this path, i.e. a[l+2] = g(z[l+2] + a[l]): the residual network passes a[l] directly to a deeper layer and adds it before the ReLU nonlinearity, so the information in a[l] reaches the deeper layers directly. Using residual blocks we can train deeper networks; a ResNet is built by stacking many such residual blocks into one deep neural network.
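
Here is a minimal PyTorch sketch of such a two-layer residual block for the case where input and output shapes match, so the shortcut is a pure identity (when shapes change, ResNet uses a 1x1 projection on the shortcut instead); batch normalization is included as in the paper.

```python
import torch
import torch.nn as nn

# Two-layer residual block: a[l] is added to z[l+2] before the final ReLU,
# i.e. a[l+2] = g(z[l+2] + a[l]).
class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, a_l):
        out = self.relu(self.bn1(self.conv1(a_l)))   # a[l+1] = g(z[l+1])
        out = self.bn2(self.conv2(out))              # z[l+2]
        return self.relu(out + a_l)                  # a[l+2] = g(z[l+2] + a[l])

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)   # torch.Size([1, 64, 56, 56])
```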

(iii) Residual Network -- ResNet

A residual network is formed by connecting residual blocks in series (five of them in the figure). If you train a plain network without residual connections using gradient descent, you will find that as the network deepens, the training error first decreases and then increases. For a residual network, the training error keeps getting smaller as the number of layers increases. This approach lets us reach much deeper layers, helps with the vanishing and exploding gradient problems, and lets us train much deeper networks while still maintaining good performance.

An example of why residual networks perform well:

Suppose there is a very large neural network with input x and output activation a[l]. Add two extra layers to this network so that the final output is a[l+2]; these two layers can be regarded as one residual block. Assume ReLU activations are used throughout the network, so all activations are greater than or equal to 0.

For a large network, adding such a residual block in the middle or at the end of the neural network does not hurt its performance.

The main reason the residual network works is that it is very easy for these extra layers to learn the identity function. Because the residual blocks can learn the identity function so easily, you can be sure that network performance is not hurt, and in many cases it even improves.

After building the models, an obvious degradation phenomenon is observed on the plain networks, while the 34-layer ResNet performs better than the 18-layer one, and ResNet converges much faster than the plain network.

In practice, considering the computational cost, the residual block is optimized: the two 3x3 convolutional layers are replaced with a 1x1 + 3x3 + 1x1 stack. In this new bottleneck structure, a 1x1 convolution first reduces the dimensionality, the middle 3x3 convolution operates in the reduced dimension, and another 1x1 convolution restores it, maintaining accuracy while reducing computation.

This is equivalent to reducing the number of parameters for the same number of layers, so the model can be made deeper. The authors therefore propose ResNets with 50, 101, and 152 layers, which not only show no degradation problem but greatly reduce the error rate while keeping computational complexity very low.
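
A short PyTorch sketch of this bottleneck block follows; the 256 -> 64 -> 256 channel plan corresponds to the first stage of ResNet-50, and other stages use wider versions.

```python
import torch
import torch.nn as nn

# Bottleneck residual block (ResNet-50/101/152): a 1x1 convolution reduces the channel
# count, a 3x3 convolution works in the reduced dimension, and a final 1x1 convolution
# restores it; the shortcut is added before the last ReLU.
class Bottleneck(nn.Module):
    def __init__(self, channels=256, reduced=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1, bias=False),            # reduce
            nn.BatchNorm2d(reduced), nn.ReLU(),
            nn.Conv2d(reduced, reduced, kernel_size=3, padding=1, bias=False),  # 3x3 in low dim
            nn.BatchNorm2d(reduced), nn.ReLU(),
            nn.Conv2d(reduced, channels, kernel_size=1, bias=False),            # restore
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + x)

x = torch.randn(1, 256, 56, 56)
print(Bottleneck()(x).shape)   # torch.Size([1, 256, 56, 56])
```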

At this point ResNet's error rate was far ahead of the other networks, but the authors did not stop there: they then built an extreme 1202-layer network. Even at that depth, optimization was not difficult, but overfitting appeared, which is quite normal; the authors said further improvements would be made to the 1202-layer model.
