I. Introduction
VGGNet, a deep convolutional neural network developed by the Visual Geometry Group (VGG) at the University of Oxford together with researchers at Google DeepMind, took second place in the ILSVRC 2014 classification task, bringing the top-5 error rate down to 7.3%. Its main contribution is demonstrating that network depth is a key factor in achieving good performance. The dominant architectures today include ResNet (152 to 1000+ layers), GoogLeNet (22 layers), and VGGNet (19 layers); most later models are improvements built on these, adding new optimization algorithms, multi-model ensembling, and so on. To this day, VGGNet is still often used to extract image features.
II. Structure of VGGNet
Figure 1: VGG16 structure diagram
The input is a 224×224 RGB image. During preprocessing, the mean of each of the three channels is computed and subtracted from every pixel (centering the data this way means fewer iterations and faster convergence).
The image is passed through a stack of convolutional layers that use very small 3×3 kernels; some layers also use 1×1 kernels.
The convolution stride is set to 1 pixel, and the padding of the 3×3 convolutional layers is set to 1 pixel. Pooling is max pooling, applied in 5 layers after certain convolutional layers, with a 2×2 window and a stride of 2.
The convolutional layers are followed by three fully-connected (FC) layers. The first two FC layers have 4096 channels each, and the third has 1000 channels for classification. All network configurations share the same fully-connected layer setup.
The fully-connected layers are followed by a softmax layer used for classification.
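The architecture described above can be sketched as follows. This is not the authors' code, just a walk-through of the standard VGG16 configuration that tracks how the feature-map size shrinks from 224×224 down to the 7×7×512 map that feeds the FC layers ('M' marks a 2×2, stride-2 max pool):

```python
# Standard VGG16 channel configuration; 'M' = 2x2 max pool with stride 2.
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']

def feature_map_sizes(cfg, size=224):
    """A 3x3 conv with stride 1 and padding 1 preserves spatial size;
    each max pool halves it."""
    sizes = []
    for v in cfg:
        if v == 'M':
            size //= 2
        sizes.append((v, size))
    return sizes

sizes = feature_map_sizes(VGG16_CFG)
print(sizes[-1])  # ('M', 7): the final 7x7x512 map that enters the FC layers
```

The five pooling layers halve 224 five times (224 → 112 → 56 → 28 → 14 → 7), which is why the first FC layer sees a 7×7×512 input.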
All hidden layers use ReLU as the activation function. VGGNet does not use local response normalization (LRN, a scheme originally intended to enhance the generalization ability of a network): it did not improve performance on the ILSVRC dataset, but it did increase memory consumption and computation time.
III. Discussion
1. The 3×3 kernel is chosen because 3×3 is the smallest size that still captures a pixel's 8-neighborhood (left/right, up/down, and the four diagonals).
2. The 1×1 convolutions add nonlinearity without changing the spatial dimensions of the input and output: they linearly transform the input channels and are then passed through a ReLU, increasing the nonlinearity of the decision function.
3. A stack of two 3×3 convolutions has the same receptive field as one 5×5 convolution, and a stack of three 3×3 convolutions has the same receptive field as one 7×7 convolution. With the receptive field unchanged, the deeper stack of smaller kernels introduces more nonlinearity (a ReLU after every layer instead of just one), making the decision function more discriminative, and it also uses fewer parameters.
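The parameter saving is easy to verify with a little arithmetic. This sketch compares weight counts (biases ignored) for a stack of three 3×3 convolutions versus a single 7×7 convolution with the same receptive field, assuming C input and C output channels throughout:

```python
def conv_params(kernel, c_in, c_out):
    """Weight count of one conv layer: kernel * kernel * c_in * c_out."""
    return kernel * kernel * c_in * c_out

C = 512
three_3x3 = 3 * conv_params(3, C, C)  # receptive field 7x7, three ReLUs
one_7x7 = conv_params(7, C, C)        # receptive field 7x7, one ReLU
print(three_3x3, one_7x7)  # 7077888 vs 12845056
```

In general the ratio is 27C² : 49C², so the 3×3 stack needs about 45% fewer weights regardless of the channel count.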
4. Each VGG network has 3 FC layers, 5 max-pooling layers, and 1 softmax layer.
5. Dropout layers are used between the FC layers to prevent overfitting. For example:
The diagram on the left shows a fully-connected layer; the one on the right shows the same layer after applying dropout.
A typical neural network is trained by propagating the input forward through the network and then propagating the error backward. Dropout modifies this process by randomly deleting some hidden-layer units on each pass. The steps are:
(1) Randomly delete some hidden neurons from the network, keeping the input and output neurons unchanged;
(2) Propagate the input forward through the modified network, then propagate the error backward through the same modified network;
(3) For the next batch of training samples, repeat from step (1).
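The per-layer effect of the steps above can be sketched in a few lines. This is a minimal illustration of "inverted" dropout on one layer's activations (plain Python lists; real frameworks implement this for you), where surviving units are scaled by 1/(1−p) so the expected activation is unchanged and nothing needs rescaling at test time:

```python
import random

def dropout(activations, p_drop, training=True):
    """Zero each unit with probability p_drop during training;
    scale survivors by 1/(1 - p_drop) to preserve the expected value."""
    if not training or p_drop == 0.0:
        return list(activations)  # dropout is disabled at test time
    keep = 1.0 - p_drop
    return [a / keep if random.random() < keep else 0.0
            for a in activations]

random.seed(0)
out = dropout([1.0, 2.0, 3.0, 4.0], p_drop=0.5)
print(out)  # roughly half the units zeroed; survivors doubled
```

Each training batch sees a different random mask, which is exactly the repeated "delete, forward, backward" loop described in steps (1)–(3).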
Dropout can effectively prevent overfitting, because:
(1) It achieves a kind of voting (ensemble) effect. Each batch effectively trains a different randomly thinned sub-network; the individual sub-networks may overfit in different ways, but because they share one loss function and one set of weights, training optimizes all of them simultaneously and averages their behavior, which is more effective at preventing overfitting.
(2) It reduces complex co-adaptation between neurons. Randomly deleting hidden neurons gives the fully-connected network a certain sparsity, which weakens the synergistic coupling of different features. In other words, some features may depend on fixed combinations of hidden nodes acting jointly; dropout forces features to remain useful in the presence of many different random subsets of other neurons, which increases the robustness of the network.
6. The most commonly used configurations are VGG16 (13 conv layers + 3 FC layers) and VGG19 (16 conv layers + 3 FC layers). Note that the layer count includes only the conv and FC layers, not the max-pooling or softmax layers.
IV. Training
Training uses a multi-scale scheme: the original image is rescaled to different sizes S, a 224×224 patch is randomly cropped from it, and the crop is randomly flipped horizontally and given random RGB color shifts. This augmentation greatly increases the amount of training data and is very effective at preventing overfitting.
When cropping, the shortest side of the rescaled image should not be too small: if it were close to 224, a 224×224 crop would cover almost the entire image, so different random crops would be nearly identical and the augmentation would be meaningless. It should not be too large either, or each crop would contain only a small part of the object, which is also undesirable.
To address this cropping problem, two approaches to choosing S are used:
(1) Fix the shortest side at 256;
(2) Randomly sample S from the range [256, 512], so that each training image is rescaled to a different size. This method, called scale jittering, serves as a form of training-set augmentation.
V. Testing
At test time, the fully-connected layers are replaced by equivalent convolutional layers, because:
The only difference between a convolutional layer and a fully-connected layer is that the neurons of the convolutional layer are locally connected to the input and share weights across spatial positions within the same channel. Since both compute the same kind of dot product, a fully-connected layer can be converted into a convolutional layer simply by setting the kernel size equal to the spatial size of its input. For example, if the input is 7×7×512 and the first FC layer has 4096 outputs, it can be treated as a convolutional layer with a 7×7 kernel, stride 1, and a 1×1×4096 output. The advantage is that the input image size is no longer restricted: applying the resulting fully-convolutional network to a larger image is equivalent to evaluating the original network with a sliding window, but the shared computation of the convolutional form greatly reduces the amount of work, achieving the same goal far more efficiently.
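The shape arithmetic behind this conversion is worth making explicit (sizes assumed from the example above). On a 224×224 input the conv stack leaves a 7×7×512 map, so the 7×7 "FC" convolution yields exactly one output position; on a larger test image it yields a grid of outputs, one per 224×224 window:

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Output spatial size of a convolution along one dimension."""
    return (size + 2 * pad - kernel) // stride + 1

# 7x7x512 input, 7x7 kernel -> 1x1x4096: identical to the original FC layer.
print(conv_out(7, 7))  # 1
# A hypothetical larger test image giving a 9x9x512 map -> a 3x3x4096 output:
# nine 4096-dim descriptors, one per sliding 224x224 window, in a single pass.
print(conv_out(9, 7))  # 3
```

The class-score maps produced this way are then spatially averaged to obtain a single prediction for the whole image.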
VI. Issues
1. Although VGGNet reduces the parameter count of each convolution, its total parameter space is actually larger than AlexNet's, and most of the parameters come from the first fully-connected layer, which consumes considerable computing resources. The later Network-in-Network (NIN) work found that replacing these fully-connected layers with global average pooling has little effect on performance while significantly reducing the number of parameters.
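A quick count shows why the first FC layer dominates. This sketch assumes the standard 7×7×512 feature map feeding 4096 units, and uses the commonly quoted ~138M total for VGG16 as a rough point of comparison:

```python
# Weights in VGG16's first fully-connected layer: every input activation
# connects to every output unit, with no weight sharing.
fc1_params = 7 * 7 * 512 * 4096
total_vgg16 = 138_000_000  # approximate published total, for scale only
print(fc1_params)                       # 102760448 (~102.8M weights)
print(round(fc1_params / total_vgg16, 2))  # roughly 3/4 of all parameters
```

Replacing this layer with global average pooling (as in NIN) removes these ~100M weights at a stroke, which is why the substitution shrinks the model so dramatically.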
2. The VGG models (mainly configurations D and E) have very many parameters compared with other methods, so training a VGG model usually takes longer; however, the publicly available pre-trained models make them easy to use.
Reference sources:
Very Deep Convolutional Networks for Large-Scale Image Recognition
VGG Net learning notes