1. Overview
Although deep convolutional networks (CNNs) have achieved remarkable results in image recognition, there is still no good explanation of why they work so well, and no principled strategy for improving their architectures. Using deconvolution-based visualization, the authors identify several problems with AlexNet and make improvements on top of it, so that the modified network achieves better results than AlexNet. At the same time, the authors use an "ablation" (occlusion) study to analyze how different regions of an image affect the classification; in plain terms, parts of the image are blocked out and the change in the network's behavior is observed.
- Deconvolutional networks (deconvnet)
A deconvolutional network can be viewed as the reverse of the convolution, rectification, and pooling operations that make up a convolutional network. To analyze a convolutional network with a deconvnet, each layer of the deconvnet is attached to the corresponding layer of the convolutional network. As shown in the figure, the right half is the convolution (forward) path and the left half is the deconvolution path. On the convolution side, the filters F are first convolved with the pooled maps of the previous layer to produce the feature maps, which then pass through ReLU rectification (rectified linear units) and max pooling. The deconvolution path starts with max unpooling and successively produces the unpooled maps, the rectified unpooled maps, and finally the reconstruction.
The main components of the deconvolution network are the following (a short code sketch combining them is given after the list):
- Max unpooling: max pooling is not invertible, so the authors' trick is to record the position of each maximum value ("switches") during pooling. During unpooling, the activation is placed back at the recorded position and all other positions are set to 0. (For details of the unpooling process, see HJIMCE's blog: http://blog.csdn.net/hjimce/article/details/50544370)
- Rectification: ReLU activations are non-negative, so the reconstructed signal at each layer must also be non-negative. The "un-rectification" therefore simply applies ReLU again, i.e. it is identical to the forward activation.
- Transposed filtering: in the forward pass, the network convolves the previous layer's feature maps with learned filters to obtain the current layer's feature maps. Deconvolution inverts this step: the current layer's feature maps are convolved with transposed (flipped) versions of the same filters to approximately reconstruct the previous layer's features.
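Below is a minimal sketch of one such layer reversal, assuming PyTorch (this is not the authors' original code, and the filter sizes and channel counts are illustrative): unpool with the recorded switches, re-apply ReLU, then convolve with the transposed filters.

```python
import torch
import torch.nn.functional as F

# Forward (convnet) side of one layer: conv -> ReLU -> max-pool.
x = torch.randn(1, 3, 32, 32)              # previous layer's pooled maps
weight = torch.randn(16, 3, 5, 5)          # 16 hypothetical 3x5x5 filters

conv_out = F.conv2d(x, weight, stride=1, padding=2)
relu_out = F.relu(conv_out)
pooled, switches = F.max_pool2d(relu_out, kernel_size=2, stride=2,
                                return_indices=True)   # record max positions

# Deconvnet side: invert the three operations in reverse order.
unpooled = F.max_unpool2d(pooled, switches, kernel_size=2, stride=2)
rectified = F.relu(unpooled)               # "un-rectification" is ReLU again
# conv_transpose2d with the forward filters applies the transposed (flipped)
# filters, approximately mapping the features back toward the input space.
reconstruction = F.conv_transpose2d(rectified, weight, stride=1, padding=2)

print(reconstruction.shape)                # torch.Size([1, 3, 32, 32])
```

Passing the forward weights to `conv_transpose2d` computes the adjoint of the forward convolution, which is exactly the "convolution with the transposed filters" described above.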
2. Visualization results
- What features have been learned?
The results show that different layers learn different kinds of features. Layer 1 and Layer 2 learn mostly low-level image features such as edges and colors; Layer 3 learns more complex patterns such as mesh textures; Layer 4 learns higher-level features such as dog heads, bird legs, and concentric rings; and Layer 5 learns even more discriminative key features.
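These per-feature visualizations are obtained by keeping only a single strong activation in a chosen feature map, zeroing out everything else in that layer, and projecting it back to pixel space through the deconvnet. A rough sketch of that procedure, assuming PyTorch and a toy two-layer convnet built from the same building blocks as above (not the authors' code; the channel index and filter sizes are illustrative):

```python
import torch
import torch.nn.functional as F

def forward_layer(x, weight):
    """conv -> ReLU -> max-pool, returning the switches needed for unpooling."""
    a = F.relu(F.conv2d(x, weight, stride=1, padding=2))
    pooled, switches = F.max_pool2d(a, 2, 2, return_indices=True)
    return pooled, switches

def backward_layer(y, switches, weight):
    """One deconvnet layer: unpool -> ReLU -> transposed filtering."""
    return F.conv_transpose2d(F.relu(F.max_unpool2d(y, switches, 2, 2)),
                              weight, stride=1, padding=2)

image = torch.randn(1, 3, 64, 64)
w1 = torch.randn(16, 3, 5, 5)      # hypothetical layer-1 filters
w2 = torch.randn(32, 16, 5, 5)     # hypothetical layer-2 filters

f1, s1 = forward_layer(image, w1)
f2, s2 = forward_layer(f1, w2)

# Keep only the strongest activation of one chosen layer-2 feature map.
channel = 7
masked = torch.zeros_like(f2)
idx = f2[0, channel].argmax()
masked[0, channel].view(-1)[idx] = f2[0, channel].view(-1)[idx]

# Project it back to pixel space; the result highlights the input pattern
# that excited this feature.
recon = backward_layer(masked, s2, w2)
recon = backward_layer(recon, s1, w1)
print(recon.shape)                 # torch.Size([1, 3, 64, 64])
```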
- How do the feature layers evolve during training?
The figure shows how each feature layer evolves as training proceeds; the sub-plots correspond to [1, 2, 5, 10, 20, 30, 40, 64] training epochs. The lower feature layers are learned quickly and stabilize early. Higher layers such as Layer 5 only acquire discriminative features after around 30 epochs, which indicates that the higher layers need substantially more training before they converge to useful features.
- How can visualization be used to improve network performance?
The authors visualized every feature layer of the original AlexNet and found that its first-layer filters are dominated by extremely high- and low-frequency information, while mid-frequency image structure is poorly captured. The second-layer visualization also shows aliasing artifacts caused by the large stride (4) of the first-layer convolution. The authors therefore modified AlexNet by shrinking the first-layer filters from 11x11 to 7x7 and reducing the stride from 4 to 2. The improved model achieves a lower classification error on ImageNet 2012 than AlexNet.
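As a sketch of this change only (assuming torchvision's AlexNet definition, which differs in details from the original model; this is not the authors' released network), the first layer can be swapped out as follows:

```python
import torch
import torch.nn as nn
from torchvision.models import alexnet   # torchvision >= 0.13 assumed

model = alexnet(weights=None)             # untrained AlexNet topology
# Original first layer: Conv2d(3, 64, kernel_size=11, stride=4, padding=2).
# Shrink the filters to 7x7 and the stride to 2 to reduce aliasing artifacts
# and capture mid-frequency structure.
model.features[0] = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=2)

# torchvision's AlexNet uses an adaptive average pool before the classifier,
# so the modified network still runs end to end.
out = model(torch.randn(1, 3, 224, 224))
print(out.shape)                          # torch.Size([1, 1000])
```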
- Ablation analysis (occlusion experiments)
The authors ran the occlusion analysis on three images and found that when key parts of the object are covered, the activations of the corresponding feature maps drop significantly (second column). Occluding the key parts also makes the network far more likely to assign the image to the wrong class, whereas occluding some background regions does not (fifth column).
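A minimal sketch of this kind of occlusion analysis, assuming PyTorch; the classifier, gray value, patch size, stride, and class index are illustrative choices rather than the paper's exact settings:

```python
import torch
import torch.nn.functional as F
from torchvision.models import alexnet

model = alexnet(weights=None).eval()      # any trained classifier would do
image = torch.randn(1, 3, 224, 224)       # stand-in for a preprocessed image
true_class = 207                          # hypothetical ImageNet label
patch, stride = 56, 28                    # size and step of the gray square

heatmap = []
for top in range(0, 224 - patch + 1, stride):
    row = []
    for left in range(0, 224 - patch + 1, stride):
        occluded = image.clone()
        occluded[:, :, top:top + patch, left:left + patch] = 0.5   # gray square
        with torch.no_grad():
            prob = F.softmax(model(occluded), dim=1)[0, true_class].item()
        row.append(prob)      # low values mark regions the class depends on
    heatmap.append(row)

print(len(heatmap), len(heatmap[0]))      # grid of class probabilities
```

Plotting the resulting grid as a heatmap gives the kind of sensitivity map shown in the paper's occlusion figures.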
This paper is one of the most important works on CNN visualization, and much later research builds directly on its results. It is essential reading for anyone studying the visualization of CNNs.
Resources:
1. http://blog.csdn.net/hjimce/article/details/50544370
2. Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In ECCV, 2014.