Very Deep Convolutional Networks for Large-Scale Image Recognition
For reprints please credit: http://blog.csdn.net/stdcoutzyx/article/details/39736509
This paper [1] came out in September of this year, so it is quite new; the viewpoints it presents, especially its summary, are a great guide for tuning convolutional neural network parameters. This post will not explain the building blocks of convolutional neural networks (CNNs); impatient readers can consult Google or Baidu.
What follows are my notes on the paper. I have tried to extract its key points; where the notes fall short, please read the original.
1. Main contribution
- The effect of a CNN improves as the number of layers increases, while the total number of parameters stays basically unchanged.
- The method in the paper won second place in the ILSVRC-2014 classification competition.
- ILSVRC: ImageNet Large-Scale Visual Recognition Challenge
2. CNN Improvement
After paper [2] appeared, many improvements to the CNN architecture were proposed, for example:
- Use smaller receptive window size and smaller stride of the first convolutional layer.
- Training and testing the networks densely over the whole image and over multiple scales.
3. CNN Configuration Principles
- The input to the CNN is a fixed-size 224x224x3 image.
- The only preprocessing is subtracting the mean RGB value, computed on the training set, from each pixel.
- 1x1 kernels can be viewed as linear transformations of the input channels.
- Use small 3x3 convolution kernels.
- Max-pooling is performed over 2x2 pixel windows with stride 2.
- Every layer except the final fully connected classification layer is followed by the rectification non-linearity (ReLU).
- There is no need to add Local Response Normalization (LRN): it does not improve accuracy but adds memory and computation cost, increasing training time.
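As a concrete illustration of these principles, here is a minimal sketch of one VGG-style stage. PyTorch is my choice here, not the paper's (the paper uses Caffe), and `vgg_stage` is an illustrative helper name:

```python
import torch
import torch.nn as nn

# One VGG-style stage following the principles above: 3x3 convolutions with
# padding 1 (spatial size preserved), ReLU after every conv, and a 2x2
# max-pool with stride 2 that halves the resolution.
def vgg_stage(in_channels: int, out_channels: int, num_convs: int) -> nn.Sequential:
    layers = []
    for i in range(num_convs):
        layers.append(nn.Conv2d(in_channels if i == 0 else out_channels,
                                out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU(inplace=True))
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

x = torch.randn(1, 3, 224, 224)          # input: one 224x224 RGB image
stage1 = vgg_stage(3, 64, num_convs=2)   # first stage: width 64
print(stage1(x).shape)                   # torch.Size([1, 64, 112, 112])
```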
4. CNN Configuration
- The number of channels (width) of the convolutional layers starts at 64 and doubles after each max-pooling layer until it reaches 512.
- 3x3 filters are used throughout the whole net, because a stack of two 3x3 conv layers (without spatial pooling in between) has an effective receptive field of 5x5, a stack of three 3x3 conv layers has an effective receptive field of 7x7, and so on.
- Why use three 3x3 layers instead of a single 7x7 layer?
- First, three layers are more discriminative than one, since each layer adds a non-linearity.
- Second, assuming the same number of channels C, three 3x3 layers have 3x(3x3)xCxC = 27xCxC parameters, whereas a single 7x7 layer has 7x7xCxC = 49xCxC. The stack therefore greatly reduces the number of parameters (see the quick check after this list).
- Using 1x1 convolution kernels increases the non-linearity of the decision function without affecting the receptive fields. Such kernels are used in the "Network in Network" architecture, which can be found in reference 12 of the paper.
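A quick check of the receptive-field and parameter arithmetic above, in plain Python (the helper names are mine, not the paper's):

```python
# Effective receptive field of a stack of k x k convs (stride 1, no pooling):
# each extra layer adds (k - 1) pixels on top of the first layer's k.
def receptive_field(kernel_size: int, num_layers: int) -> int:
    return kernel_size + (kernel_size - 1) * (num_layers - 1)

# Weight count for a stack of conv layers with C channels in and out:
# each layer has k*k*C*C weights.
def stack_params(kernel_size: int, num_layers: int, channels: int) -> int:
    return num_layers * kernel_size * kernel_size * channels * channels

C = 256
print(receptive_field(3, 2))   # 5  -> two 3x3 layers see 5x5
print(receptive_field(3, 3))   # 7  -> three 3x3 layers see 7x7
print(stack_params(3, 3, C))   # 27*C*C = 1,769,472
print(stack_params(7, 1, C))   # 49*C*C = 3,211,264
```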
5. Training
- Apart from using multiple scales, the experiments in paper [1] basically follow the settings of paper [2]: batch size is 256, momentum is 0.9, the weight-decay (regularization) factor is 5x10^-4, dropout for the first two fully connected layers is set to 0.5, and the learning rate is initialized to 10^-2 and divided by 10 whenever the validation accuracy stops improving (three times in total). Training stopped after 370K iterations (74 epochs).
- The paper conjectures that this network converges more easily than the original network [2], for two reasons:
- Implicit regularization imposed by greater depth and smaller conv filter sizes.
- Pre-initialisation of certain layers. First the shallow network A is trained and its parameters saved; when training a deeper network such as E, the corresponding layers are initialized with the parameters obtained from A, and the new layers are randomly initialized. Note that the learning rate is not decreased for the pre-initialised layers.
- To obtain the 224x224 input, the original image is isotropically rescaled so that its short side is no smaller than 224, and a 224x224 window is then cropped at random. For further data augmentation, random horizontal flipping and random RGB colour shifts are also applied.
- Multi-scale training: objects in images vary in scale, and training at multiple scales helps the network recognize them. There are two ways to do multi-scale training (see the sketch after this list).
- One is to train multiple classifiers at different fixed scales, with the parameter S denoting the length to which the short side of the original image is rescaled. The paper trains two classifiers with S=256 and S=384; the S=384 classifier is initialized with the S=256 parameters and trained with a smaller learning rate of 10^-3.
- The other is to train a single classifier where every input image is rescaled anew, with its short side S randomly sampled from [S_min, S_max]; the paper uses the interval [256, 384]. The network is initialized with the parameters of the S=384 model.
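A minimal sketch of the training settings and the scale jittering described above, assuming PyTorch and torchvision (again, the paper itself uses Caffe); `jittered_crop` and the stand-in model are illustrative:

```python
import random
import torch
import torchvision.transforms as T

def jittered_crop(img):
    # Scale jittering: draw a fresh scale S from [256, 384] for every image,
    # rescale so the short side equals S, then take a random 224x224 crop
    # with a random horizontal flip.
    s = random.randint(256, 384)
    return T.Compose([
        T.Resize(s),                # isotropic rescale: short side -> S
        T.RandomCrop(224),
        T.RandomHorizontalFlip(),
        T.ToTensor(),
    ])(img)

# Optimizer settings from the paper: SGD, momentum 0.9, weight decay 5e-4,
# learning rate 1e-2, divided by 10 when validation accuracy plateaus.
model = torch.nn.Linear(8, 2)       # stand-in for the full network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.1)  # step with the validation accuracy
```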
6. Testing
The test uses the following steps:
- The image's short side is rescaled to length Q, which is no smaller than 224. Q has the same meaning as S, except that S is a training-set parameter and Q a test-set parameter. Q does not have to equal S; on the contrary, for each S, testing with multiple values of Q and averaging the results gives better accuracy.
- The rescaled test images are then processed densely, in the manner of reference 16 of the paper:
- The fully connected layers are converted into convolutional layers: the first fully connected layer becomes a 7x7 convolution and the last two become 1x1 convolutions (see the sketch after this list).
- The resulting fully convolutional net is applied to the whole image by convolving the filters of each layer with the full-size input. The resulting output feature map is a class score map with the number of channels equal to the number of classes, and a variable spatial resolution that depends on the input image size.
- Finally, the class score map is spatially averaged (sum-pooled) to obtain a fixed-size vector of class scores for the image.
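A minimal sketch of the FC-to-conv conversion, assuming PyTorch; the channel counts here are shrunk from VGG's real sizes (512 input channels, 4096 outputs) to keep the demo light:

```python
import torch
import torch.nn as nn

# Dense evaluation: the feature map entering VGG's first FC layer is 512x7x7,
# so that FC layer is equivalent to a 7x7 convolution with the same weights
# (and the later FC layers to 1x1 convolutions).
C, K, OUT = 8, 7, 16            # real VGG values: C=512, K=7, OUT=4096
fc = nn.Linear(C * K * K, OUT)
conv = nn.Conv2d(C, OUT, kernel_size=K)
conv.weight.data = fc.weight.data.view(OUT, C, K, K)  # reuse the FC weights
conv.bias.data = fc.bias.data

x = torch.randn(1, C, K, K)     # where a 224x224 input ends up (7x7 map)
print(torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-4))  # True

# On a larger test image the conv form produces a spatial class score map,
# which is then spatially averaged into a fixed-size score vector.
big = torch.randn(1, C, 9, 9)
print(conv(big).shape)          # torch.Size([1, 16, 3, 3])
```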
7. Implementation
- Implemented using the C++ Caffe toolbox.
- Supports multi-GPU training on a single system.
- Multi-GPU training splits each batch into several GPU-batches, computes the gradient of each sub-batch on its own GPU, and averages the sub-batch gradients to obtain the gradient of the whole batch (see the sketch below).
- Many more training-acceleration methods are proposed in reference [9] of the paper; the experiments here show that a 4-GPU system gives a 3.75x speedup.
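A minimal sketch of this synchronous gradient averaging, run on CPU and assuming PyTorch; real multi-GPU training would place each replica on its own device (or simply use torch.nn.DataParallel):

```python
import copy
import torch
import torch.nn as nn

# Split a batch into sub-batches, compute each sub-batch gradient on its own
# replica ("GPU"), then average the gradients to get the whole-batch gradient.
model = nn.Linear(4, 2)                              # stand-in network
replicas = [copy.deepcopy(model) for _ in range(2)]  # one replica per GPU

batch, target = torch.randn(8, 4), torch.randn(8, 2)
loss_fn = nn.MSELoss()

grads = []
for rep, xb, yb in zip(replicas, batch.chunk(2), target.chunk(2)):
    loss_fn(rep(xb), yb).backward()                  # sub-batch gradient
    grads.append([p.grad for p in rep.parameters()])

# Average the sub-batch gradients and apply one SGD step to the main model.
lr = 0.01
with torch.no_grad():
    for p, *gs in zip(model.parameters(), *grads):
        p -= lr * torch.stack(gs).mean(dim=0)
```

Because the sub-batches are equal-sized and the loss uses mean reduction, the averaged gradient equals the gradient of the full batch.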
8. References
[1]. Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
[2]. Krizhevsky A, Sutskever I, Hinton G E. ImageNet Classification with Deep Convolutional Neural Networks[C]//Advances in Neural Information Processing Systems. 2012: 1097-1105.