Very Deep Convolutional Networks for Large-Scale Image Recognition
1. Major contributions
- This paper explores how the performance of a CNN changes as the number of layers increases while the number of parameters stays roughly constant. (A thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers.)
2. Predecessors' improvements
Relative to the framework of the original paper, ImageNet Classification with Deep Convolutional Neural Networks [2], the main improvements were:
Literature [3]: utilised a smaller receptive window size and a smaller stride for the first convolutional layer.
Literature [4]: trained and tested the networks densely over the whole image and over multiple scales.
3. CNN Network Architecture
To measure the improvement brought by increased convnet depth in a fair setting, all convnet layer configurations are designed using the same principles, following [1].
- The input to the CNN is a 224x224x3 image.
- The only preprocessing before input is subtracting the mean RGB value (computed over the training set).
- The convolution kernels are almost all 3x3, with a stride of 1.
- The additional 1x1 kernels can be viewed as linear transformations of the input channels (followed by a non-linearity).
- There are five max-pooling layers; the pooling window is 2x2 and the stride is 2.
- All hidden layers use the rectification non-linearity (ReLU).
- Local Response Normalisation (LRN) is not needed: it does not improve performance but adds computational and memory cost, increasing computation time.
- The last layer is a soft-max layer. (A minimal sketch of these building blocks follows this list.)
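The bullets above map directly onto stacked conv/pool blocks. Below is a minimal sketch of one such block, assuming PyTorch (the paper's own implementation is a modified C++ Caffe, so this is illustrative only); the channel and layer counts are example values.

```python
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch, num_convs):
    """Stack `num_convs` 3x3/stride-1 convolutions (padding 1 keeps the
    spatial size), each followed by ReLU, then halve the resolution
    with 2x2 max-pooling of stride 2."""
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
        in_ch = out_ch
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

x = torch.randn(1, 3, 224, 224)   # one 224x224 RGB image
y = vgg_block(3, 64, 2)(x)        # -> shape (1, 64, 112, 112)
```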
4. CNN Configurations
- The configurations of the convolutional networks are shown in Table 1 and are named A–E. They range from 11 weight layers in network A (8 conv. and 3 FC layers) to 19 weight layers in network E (16 conv. and 3 FC layers).
- The number of channels (width) of the convolutional layers starts at 64 and doubles after each max-pooling layer, up to 512.
5. Discussion
- Why are 3x3 filters used throughout the paper? Because a stack of two consecutive 3x3 filters has an effective receptive field equivalent to a single 5x5 filter, and a stack of three 3x3 filters is equivalent to a single 7x7 filter.
- So why not simply use 5x5 or 7x7 filters? Take 7x7 as an example:
First, three layers are more discriminative than one: three non-linear rectification layers are incorporated instead of a single one, which makes the decision function more discriminative.
Second, assuming the same number of channels C for input and output, a stack of three 3x3 layers has 3×(3×3)×C×C = 27C² parameters, while a single 7x7 layer has 7×7×C×C = 49C² parameters; the stack therefore greatly reduces the parameter count. (A quick numeric check follows at the end of this section.)
- The 1x1 convolution kernels increase the non-linearity of the decision function without affecting the receptive field. Such kernels were previously used in the Network in Network [5] architecture.
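To make the arithmetic above concrete, here is a quick check in plain Python (illustrative only, not from the paper):

```python
# Quick numeric check: n stacked 3x3/stride-1 convolutions have the
# same receptive field as one (2n+1)x(2n+1) convolution, but fewer
# parameters.
def receptive_field(n):
    """Receptive field of n stacked 3x3/stride-1 conv layers."""
    return 2 * n + 1  # each extra layer adds 2 pixels

def params_3x3_stack(n, c):
    return n * 3 * 3 * c * c  # n stacked 3x3 layers, C in/out channels

def params_single(k, c):
    return k * k * c * c      # one KxK layer, C in/out channels

C = 512
assert receptive_field(2) == 5 and receptive_field(3) == 7
print(params_3x3_stack(3, C))  # 27*C*C =  7,077,888
print(params_single(7, C))     # 49*C*C = 12,845,056
```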
6. Training
- Apart from sampling training crops at multiple scales, the experiments basically follow the settings of [2]. The batch size is 256, momentum is 0.9, the weight-decay (L2 regularisation) coefficient is 5×10⁻⁴, dropout for the first two fully-connected layers is set to 0.5, and the learning rate is initialised to 10⁻² and divided by 10 each time the validation accuracy stops improving (this happened three times). Learning was stopped after 370K iterations (74 epochs). (See the optimiser sketch below.)
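A minimal sketch of these optimisation settings, assuming PyTorch (the paper trained with Caffe; `model` here is a placeholder, not the real network):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # placeholder for a real VGG-style network
optimizer = torch.optim.SGD(model.parameters(),
                            lr=1e-2,           # initial learning rate 10^-2
                            momentum=0.9,
                            weight_decay=5e-4)  # L2 regularisation factor
# Divide the learning rate by 10 whenever validation accuracy plateaus;
# call scheduler.step(val_accuracy) once per epoch.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.1)
```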
- The networks in this paper converge in fewer epochs than the original network [2] because of:
a) the implicit regularisation imposed by greater depth and smaller conv. filter sizes;
b) the pre-initialisation of certain layers.
- Weight initialisation: first train the shallow network A to obtain its weight parameters. Then, when training a deeper configuration, initialise the first four convolutional layers and the last three fully-connected layers with the parameters of A; the intermediate layers are still initialised randomly. The learning rate of the pre-initialised layers is not decreased, allowing them to change during learning. For random initialisation, weights are sampled from a normal distribution with zero mean and 10⁻² variance; the biases are set to 0. (See the initialisation sketch below.)
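A minimal sketch of that random initialisation rule, assuming PyTorch (`init_weights` is a hypothetical helper name; variance 10⁻² corresponds to std 0.1):

```python
import math
import torch.nn as nn

def init_weights(m):
    # Random initialisation for layers not copied from network A:
    # weights ~ N(0, variance 1e-2), i.e. std = 0.1; biases = 0.
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(m.weight, mean=0.0, std=math.sqrt(1e-2))
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# model.apply(init_weights)  # applied to every module of the network
```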
- For the fixed 224x224 input, the original image is first rescaled isotropically so that its short side S is no less than 224, and then a random 224x224 window is cropped. For further data augmentation, random horizontal flips and random RGB colour shifts are also applied.
- Multi-scale training: objects in images appear at varying scales, and multi-scale training helps the network recognise objects across this range. There are two ways to do it (a sketch of the second follows this list):
a) Train separate classifiers at different fixed scales. The parameter S is the short-side length after rescaling the original image. The paper trains two classifiers with S=256 and S=384; the S=384 classifier is initialised with the S=256 parameters and uses a smaller initial learning rate of 10⁻³.
b) Alternatively, train a single classifier with scale jittering: each time an image is sampled, it is rescaled with a short side S drawn at random from [S_min, S_max]; the paper uses the interval [256, 384], and the network parameters are initialised with the S=384 weights.
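A minimal sketch of the scale-jittered cropping in b), assuming torchvision (the RGB colour shift is omitted here; `jittered_crop` is an illustrative helper, not the paper's code):

```python
import random
from torchvision import transforms

def jittered_crop(s_min=256, s_max=384, crop=224):
    """Build the augmentation pipeline for one draw of the training
    scale S; call again (e.g. per image or per epoch) to re-sample S."""
    s = random.randint(s_min, s_max)        # random short-side scale S
    return transforms.Compose([
        transforms.Resize(s),               # rescale: short side -> S
        transforms.RandomCrop(crop),        # random 224x224 window
        transforms.RandomHorizontalFlip(),  # random flip augmentation
        transforms.ToTensor(),
    ])
```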
7. Testing
- First, isotropic rescaling: the short side is rescaled to a length Q greater than 224. Q plays the same role as S, but S is the training-set scale and Q the test-set scale. Q need not equal S; in fact, for each S, testing at several values of Q and averaging the results improves performance.
- The test data is then evaluated in the manner of reference [4]:
a) The fully-connected layers are converted to convolutional layers: the first FC layer becomes a 7x7 convolution, and the last two become 1x1 convolutions.
b) The resulting fully-convolutional net is applied to the whole image by convolving the filters of each layer with the full-size input. The output is a class score map with the number of channels equal to the number of classes, and a variable spatial resolution that depends on the input image size.
c) Finally, the class score map is spatially averaged (sum-pooled) to obtain a fixed-size vector of class scores for the image. (A small sketch of the FC-to-conv conversion follows below.)
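A minimal sketch of this dense-evaluation trick, assuming PyTorch with illustrative shapes (512x7x7 features into a 4096-unit FC layer, as in VGG-sized nets):

```python
import torch
import torch.nn as nn

# Fold the first FC layer (512*7*7 -> 4096) into a 7x7 convolution.
fc = nn.Linear(512 * 7 * 7, 4096)
conv = nn.Conv2d(512, 4096, kernel_size=7)
conv.weight.data.copy_(fc.weight.data.view(4096, 512, 7, 7))
conv.bias.data.copy_(fc.bias.data)

# The convolutional form accepts inputs larger than the training size
# and produces a spatial score map instead of a single vector.
features = torch.randn(1, 512, 9, 9)   # e.g. from a larger test image
score_map = conv(features)             # -> (1, 4096, 3, 3)
pooled = score_map.mean(dim=(2, 3))    # spatial average -> (1, 4096)
```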
8. Implementation
- Implemented using the C++ Caffe toolbox (with significant modifications).
- Supports multi-GPU training on a single system.
- Multi-GPU training splits each batch into several GPU sub-batches, computes the gradient of each sub-batch on its GPU, and averages the sub-batch gradients to obtain the gradient of the whole batch (see the sketch after this list).
- Many methods for accelerating training are proposed in reference [7] of the paper; the experimental results here show that a 4-GPU system achieves a 3.75x speed-up.
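The paper's multi-GPU scheme is implemented inside its modified Caffe; as a rough modern equivalent, here is a sketch using PyTorch's built-in data parallelism (an assumption for illustration, not the paper's code):

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 10)  # placeholder for a real network
if torch.cuda.device_count() > 1:
    # Each forward pass splits the batch across the visible GPUs,
    # runs the replicas in parallel, and accumulates the replicas'
    # gradients back on the default device, which is equivalent to
    # computing the gradient of the whole batch.
    model = nn.DataParallel(model.cuda())
```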
9. Experiments
9.1 Configuration Comparison
Using the CNN structures in Figure 1, multi-scale training is evaluated on the C/D/E configurations; note that the test set in this group of experiments uses only a single scale. The results are shown in the following:
9.2 Multi-scale Comparison
The test set is evaluated at multiple scales. Since too large a difference between training and testing scales hurts performance, the test scales Q are kept within ±32 of S. When training uses an interval of scales, the test scales are the minimum, median, and maximum of that interval.
9.3 Convnet Fusion
The model fusion method is to average the soft-max posterior probability estimates of several models. Fusing the two best models from Figures 3 and 4 achieves a better result, while fusing all seven models performs worse.
10. References
[1]. Simonyan, K. and Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556, 2014.
[2]. Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, pp. 1106–1114, 2012.
[3]. Zeiler, M. D. and Fergus, R. Visualizing and Understanding Convolutional Networks. CoRR, abs/1311.2901, 2013. Published in Proc. ECCV, 2014.
[4]. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. In Proc. ICLR, 2014.
[5]. Lin, M., Chen, Q., and Yan, S. Network in Network. In Proc. ICLR, 2014.