Very Deep Convolutional Networks for Large-Scale Image Recognition (paper notes)

Source: Internet
Author: User
Tags: scale, image

This paper comes from the VGG (Visual Geometry Group) at Oxford, led by Professor Andrew Zisserman; their convolutional neural network achieved an excellent score in the ILSVRC competition, and this article goes through the network in detail.

What does the paper mainly do? Using convolutional neural networks with small convolution kernels and small strides, it explores how the depth of the network affects the recognition rate.

The overall structure of the network

The input to the network is a 224*224 RGB image. This is followed by convolution layers whose kernels are 3*3, the smallest size that can still preserve a notion of left/right, up/down, and centre, with a stride of 1 pixel; occasionally there is a 1*1 kernel, which amounts to adding an extra non-linear transformation. Then come the pooling layers: 2*2 windows with a stride of 2 pixels, using max-pooling. After that come three fully connected layers: the first has 4,096 units, the second 4,096 units, and the third 1,000 units corresponding to the 1,000 classes, so the outputs of these 1,000 units give the classification. Finally there is a softmax layer, which is really there to compute the cost function of the network for the error. That's it, that's the structure of the whole network. To study the effect of depth, the configurations A through E (plus an A-LRN variant) were explored, as follows (the bold entries mark the layers added relative to the shallower configuration):
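The shape bookkeeping above can be traced in a few lines. This is my own sketch, not the authors' code, assuming every 3*3 conv uses stride 1 with padding 1 (so it keeps the spatial size) and every 2*2 max-pool uses stride 2; the block layout is configuration D:

```python
# Trace spatial size and channel count through VGG configuration D.
# Assumes 3x3 convs with stride 1 / padding 1 and 2x2 max-pools with stride 2.
def trace_vgg_d(size=224):
    # (number of 3x3 conv layers, output channels) per block
    blocks = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
    shapes = []
    for n_convs, channels in blocks:
        # the 3x3 stride-1 pad-1 convs leave the spatial size unchanged;
        # only the max-pool after each block halves it
        size = size // 2
        shapes.append((size, channels))
    return shapes

print(trace_vgg_d())  # [(112, 64), (56, 128), (28, 256), (14, 512), (7, 512)]
```

The final 7*7*512 volume is exactly what the first 4,096-unit fully connected layer consumes.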

In the table, conv3-512, for example, denotes a convolution layer with 3*3 kernels and 512 feature maps. The specific configuration of the max-pool layers is not written out.

Also, in the configurations above, each time the network passes through a max-pooling layer, the number of feature maps doubles (until it reaches 512); without max-pooling, the number of feature maps stays the same. The purpose of this is to make it convenient to increase the depth of the network.

The network uses the smallest convolution kernel, 3*3. In fact, a stack of two 3*3 convolution layers is equivalent to a single 5*5 convolution layer (their receptive fields are the same), and a stack of three 3*3 layers is equivalent to a single 7*7 layer. So the paper's approach is to reduce the width of the convolution kernels while increasing the depth of the network. One benefit of doing so is a reduction in the number of parameters. For example, with C feature maps in and out, three 3*3 convolution layers have 3*(3*3*C*C) = 27C^2 weights, while one 7*7 convolution layer has 7*7*C*C = 49C^2. That is quite a lot fewer. Since the parameters are reduced (and three non-linearities are interleaved instead of one), the stack acts as a kind of implicit regularization on the network.
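Both claims are easy to check numerically. A small sketch (biases ignored, all layers assumed to have C input and C output channels):

```python
# Receptive field of k stacked 3x3 stride-1 conv layers, and the weight
# counts compared in the text above (biases ignored, C channels in and out).
def receptive_field(num_3x3_layers):
    rf = 1
    for _ in range(num_3x3_layers):
        rf += 2  # each 3x3 stride-1 layer widens the field by 2 pixels
    return rf

def params_stacked_3x3(k, C):
    return k * (3 * 3 * C * C)       # k layers of 3x3, C -> C channels

def params_single(ksize, C):
    return ksize * ksize * C * C     # one ksize x ksize layer, C -> C

print(receptive_field(2))            # 5  (two 3x3 layers ~ one 5x5)
print(receptive_field(3))            # 7  (three 3x3 layers ~ one 7x7)
C = 256
print(params_stacked_3x3(3, C))      # 27*C*C = 1769472
print(params_single(7, C))           # 49*C*C = 3211264
```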

Some details about the training of the network:

The network is trained with the usual mini-batch gradient descent, with a batch size of 256. The weight updates use the common momentum method, with a momentum value of 0.9, and L2 regularization is used for the weight decay of the network, with a penalty factor of 5*10^-4. The dropout regularization mechanism is applied to the first two of the final fully connected layers, with a dropout ratio of 0.5. The learning rate is set roughly as follows: the initial value is 0.01; when training hits a plateau (which can be taken to mean the validation accuracy no longer improves, though you could of course watch something else, such as the loss value), the learning rate is divided by 10, giving 0.001, and then the same again, down to 0.0001.

The initialization of the weights is an important problem: for deep networks, the final recognition rate is particularly sensitive to the initial values of the weights, while relatively shallow networks can get by with random initialization. So the approach is to randomly initialize (with a Gaussian distribution of mean 0 and variance 0.01) the rather shallow network A and train it to completion; then, when training the other, deeper networks, the weights of network A are used to initialize the corresponding layers, and the remaining layers are initialized randomly. During this training, the initial learning rate is also kept at 0.01 so that those layers can still learn properly.
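A toy sketch of this transfer initialization, with made-up layer names and shapes purely for illustration (the real procedure copies specific conv and fully connected layers of network A):

```python
import numpy as np

# Shallow net A is drawn from N(0, 0.01); a deeper net reuses A's layers
# and randomly initialises only the layers A does not have.
rng = np.random.default_rng(0)

def gaussian_init(shape, std=0.1):   # variance 0.01 -> std 0.1
    return rng.normal(0.0, std, size=shape)

net_a = {"conv1": gaussian_init((64, 3, 3, 3)),
         "fc8": gaussian_init((1000, 4096))}

# deeper net: copy the shared layers from A, random-init the new one
deep = {name: w.copy() for name, w in net_a.items()}
deep["conv_new"] = gaussian_init((128, 64, 3, 3))

print(np.array_equal(deep["conv1"], net_a["conv1"]))  # True
```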

The choice of input images for the network:

I found this part of the paper quite important: the network itself is not especially novel, and a lot of the recognition-rate improvement comes from this area. Honestly, though, this section still felt a bit vague to me on a first reading.

Network training: for the training images, the picture is rescaled so that its shortest side (the images are presumably not square) equals the training scale S, where S is 256, or 384, or a random value in [256, 512]. Then a 224*224 region is randomly cropped from the rescaled image for training, with the crops further augmented at each iteration by random flips and colour shifts. To speed things up, once the S = 256 network has finished training, the S = 384 network is initialized with the S = 256 weights and retrained with the learning rate lowered to 0.001; likewise, the multi-scale network with S in [256, 512] is initialized with the S = 384 weights and then retrained with the learning rate at 0.001.
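The rescale-then-crop sampling can be sketched as below. This is my own illustration of the geometry only; the real pipeline also does the flip and colour augmentation, which I omit here:

```python
import numpy as np

rng = np.random.default_rng(0)

def rescale_shorter_side(h, w, s):
    # keep the aspect ratio; the shorter side becomes s
    if h < w:
        return s, int(round(w * s / h))
    return int(round(h * s / w)), s

def random_crop_coords(h, w, crop=224):
    # top-left corner of a random crop that stays inside the image
    top = int(rng.integers(0, h - crop + 1))
    left = int(rng.integers(0, w - crop + 1))
    return top, left

h, w = rescale_shorter_side(480, 640, 256)   # -> (256, 341)
top, left = random_crop_coords(h, w)
print((h, w), (top, left))
```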

Testing the network: at test time, the image is likewise handled at multiple scales, i.e. different sizes. Then there are two approaches. The first is dense evaluation, which uses a fully-convolutional network; you can refer to my earlier notes on the paper "OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks" for the details. The second is to take multiple crops (multi-crop) of the picture and send each through the network for testing, which is actually quite expensive computationally. Anyway, the paper says the two are complementary. Why? 1) With dense evaluation, the sampling stride of the window is 2*2*2*2*2 = 32 pixels (there are five stride-2 max-pooling layers in total), whereas with multi-crop we can control the step size. 2) A small difference between dense evaluation and multi-crop is the boundary condition: with multi-crop, the padding at the edge of a crop is zeros, while with dense evaluation the "padding" is the actual neighbouring pixel values, which is easy to understand.

What do the results show? First compare A and A-LRN: local response normalization has no real effect and only increases the amount of computation, so the deeper networks that follow drop it. Now compare network B and network C: the difference is that C adds convolution layers with 1*1 kernels, which amounts to introducing extra non-linearity, and as a result the error rate drops (so that actually works, which still feels a bit like stirring things blindly to me). Comparing D and C, the difference is that D replaces C's 1*1 kernels with 3*3 kernels, and the result improves again; in the paper's own words, it is important to capture spatial context by using conv. filters with non-trivial receptive fields. One sentence in the paper I think is very important: the experiments show that once the depth of the network reaches 19 layers, the error rate saturates and becomes hard to push down further.

Multi-scale evaluation runs each image at three scales and averages the three resulting outputs, with results as shown:

Anyway, the result is that the performance of the network improves. An improvement obtained this way seems reasonable to me, though honestly there is not much technical depth to it.

Testing the network, part 2: the comparison of multi-crop evaluation against dense evaluation. I did not get much out of it, and one thing puzzled me: how do you turn the network's fully connected layers into convolution layers, and what happens to the weights? I'll keep reading the follow-up papers to find the answer.
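For what it's worth, the standard trick (it comes from the OverFeat line of work; the paper itself does not spell it out) is that no retraining is needed: the fully connected weight matrix is simply reshaped into convolution kernels the size of the incoming feature map. A toy check of the equivalence, with small dimensions (4 units over a 2*2*3 input standing in for 4,096 units over 7*7*512):

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, U = 3, 2, 2, 4                    # channels, height, width, units
x = rng.normal(size=(C, H, W))
fc_weight = rng.normal(size=(U, C * H * W))

# fully connected layer: flatten the activation, then matrix-multiply
fc_out = fc_weight @ x.reshape(-1)

# same weights viewed as U kernels of shape (C, H, W); a "valid" convolution
# whose kernel is as large as the input is just one dot product per kernel
kernels = fc_weight.reshape(U, C, H, W)
conv_out = np.array([np.sum(k * x) for k in kernels])

print(np.allclose(fc_out, conv_out))  # True
```

On a larger input, the same kernels slide over the feature map, which is exactly what makes dense evaluation over whole images possible.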

The implementation of the network:

Network training is based on Caffe, which has of course been modified, mainly to add multi-GPU acceleration. The method used in the paper is to split a mini-batch into several parts, let each GPU compute the gradient values for its part, average the gradients across the GPUs, and then update the weights. In the end they used 4 GPUs.
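This data-parallel scheme works because, for a loss that is a mean over samples, the average of the per-shard gradients equals the full-batch gradient. A toy demonstration with a linear least-squares loss (my own illustration, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)            # current weights
X = rng.normal(size=(256, 3))     # one mini-batch of 256 samples
y = rng.normal(size=256)

def grad(Xb, yb, w):
    # gradient of the mean squared error 0.5 * mean((Xb @ w - yb)^2)
    return Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)                                     # single "GPU"
shards = [grad(X[i::4], y[i::4], w) for i in range(4)]   # four equal shards
print(np.allclose(np.mean(shards, axis=0), full))        # True
```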

On NVIDIA Titan-series GPUs, training a network takes 2-3 weeks. If I ran it on my laptop with its single 620M graphics card, it would probably take a year, ha...

Image Classification Experiment:

A quick note on what the top-1 error and top-5 error we so often see actually are. Top-1 error is the plain misclassification rate: the class with the largest probability is taken as the output. Top-5 error, which is what the official ILSVRC evaluation uses, counts a sample as correct as long as the true class appears among the five classes with the highest probabilities, and as wrong otherwise. I never really understood this before; now I finally do. Also, for the ILSVRC classification task, in general only the training set and validation set labels are available; if you want to test, you have to submit to the official evaluation server, which prevents cheating. Remember the 2015 contest: a certain Internet search giant's team used at least 30 accounts to submit at least 200 times to the test server between November 28, 2014 and May 13, 2015, which exceeded the contest limit of two submissions per week. They ended up banned from the contest!
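The two metrics are one function with different k. A small sketch of the definitions above:

```python
import numpy as np

# top-k error: a sample is wrong unless its label is among the k classes
# with the highest predicted probabilities (k=1 and k=5 in ILSVRC).
def topk_error(probs, labels, k):
    topk = np.argsort(probs, axis=1)[:, -k:]      # indices of the k largest
    hits = [label in row for row, label in zip(topk, labels)]
    return 1.0 - np.mean(hits)

probs = np.array([[0.10, 0.60, 0.30],    # label 1 -> top-1 correct
                  [0.40, 0.25, 0.35],    # label 2 -> top-1 wrong, top-2 hit
                  [0.20, 0.70, 0.10]])   # label 0 -> top-1 wrong, top-2 hit
labels = [1, 2, 0]
print(topk_error(probs, labels, 1))      # 2 of 3 wrong -> ~0.667
print(topk_error(probs, labels, 2))      # all labels in the top 2 -> 0.0
```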

So-called network fusion (ConvNet Fusion, which should be the right translation):

What is this approach? Very simple: the outputs of multiple network models are averaged, and the average is used for classification. Everyone in the ILSVRC contest uses it. Why? Because doing this almost always improves the recognition rate a little. As for the principle, it is probably that the different network models are somewhat complementary (complementarity). It also suggests that a single network is unstable; I still have not found the essence of it.
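Mechanically the fusion is just an element-wise average of the per-class outputs, followed by an argmax. A toy sketch with two made-up models:

```python
import numpy as np

# average the per-class (softmax) outputs of several models, then classify
def fuse(prob_list):
    return np.mean(prob_list, axis=0)

model_a = np.array([[0.30, 0.45, 0.25]])   # alone, predicts class 1
model_b = np.array([[0.50, 0.20, 0.30]])   # alone, predicts class 0
fused = fuse([model_a, model_b])
print(fused, fused.argmax())               # class 0 wins after averaging
```

Note that the two models disagree, and the fused prediction follows whichever class accumulates the most probability mass overall.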

Finally, Appendix B of the paper shows the generalization of the network: the trained network is used as a feature extractor on other data sets.

To tell the truth, how do I feel after reading this paper? That it is nothing but trial and error. Does artificial intelligence really have a future developing like this? Everything rests on mathematical models and algorithms, plus all kinds of big-data statistics and probability. Getting a computer to have ideas of its own still seems very far away to me...
