"Network in Network" paper notes
1. Overview
There are two key points in this paper:
Use of 1x1 convolutions
The paper replaces the traditional convolution layer with an MLPConv layer, which is a convolution followed by a small MLP (multilayer perceptron). The convolution alone is linear, while the MLP adds non-linearity, which gives a higher level of abstraction and better generalization. Viewed across channels (across feature maps), MLPConv is equivalent to a convolution layer followed by 1x1 convolution layers, so the MLPConv layer is also called the CCCP layer (cascaded cross channel parametric pooling).
No fully connected (FC) layer at the end of the network
The paper proposes replacing the final fully connected layers with global average pooling, because FC layers hold a large share of the parameters and overfit easily. In practice, the FC layers are removed and the last MLPConv layer is followed by an average pooling layer that spans the whole feature map (a rough parameter count is sketched after this overview).
These two points matter because they greatly reduce the number of parameters while still yielding good results. The smaller parameter count not only allows the network to be made deeper (too many parameters inflate the model size, GPU memory becomes the limit on the number of layers, and with it the model's generalization ability), but also shortens training time.
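To get a feel for the savings from global average pooling, here is a back-of-the-envelope count in plain Python. The feature-map sizes come from the table in section 3; the fully connected head is a hypothetical alternative for comparison, not something from the paper.

```python
# Parameters of the classification head (weights only, biases ignored).
# Feature sizes taken from the table in section 3: cccp7 outputs 1024 maps of 6x6.
# The fully connected head below is hypothetical, for comparison only.
feat_channels, feat_h, feat_w, num_classes = 1024, 6, 6, 1000

# NIN head: 1x1 conv (cccp8-1000) followed by global average pooling (GAP has no weights).
nin_head = 1 * 1 * feat_channels * num_classes            # 1,024,000 weights

# Hypothetical FC head on the flattened 1024x6x6 feature volume.
fc_head = feat_channels * feat_h * feat_w * num_classes   # 36,864,000 weights

print(nin_head, fc_head, fc_head // nin_head)             # the FC head is ~36x larger
```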
2. Network Structure
[Figure: the traditional convolution layer, the single-channel MLPConv layer, and the cross-channel MLPConv layer (CCCP layer)]
As the figure shows, MLPConv = convolution + MLP (a 2-layer MLP in the figure).
In the Caffe implementation, MLPConv = convolution + 1x1 convolution + 1x1 convolution (the 2-layer MLP), as sketched below.
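As a concrete illustration of the convolution + 1x1 + 1x1 pattern, here is a minimal PyTorch sketch of one such block. The channel counts and kernel sizes follow the conv2/cccp3/cccp4 stage of the Caffe model listed in section 3; the ReLU placement is assumed, and this is an illustration, not the author's prototxt.

```python
import torch
import torch.nn as nn

# One MLPConv block: an ordinary convolution followed by two 1x1 convolutions.
# Channel counts / kernel sizes mirror conv2, cccp3, cccp4 from the table in section 3.
mlpconv2 = nn.Sequential(
    nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2),  # conv2
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, kernel_size=1),                      # cccp3 (1x1 conv)
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, kernel_size=1),                      # cccp4 (1x1 conv)
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 96, 27, 27)    # shape of pool1's output in the table
print(mlpconv2(x).shape)          # torch.Size([1, 256, 27, 27])
```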
3. Caffe implementation
The complete network structure of the original paper (3 MLPConv layers)
The 4-layer network in Caffe (ImageNet model)
Description:
1. Boxes are network layers, ellipses are blobs.
2. The yellow pool4 is an average pooling layer.
The layer parameters of the Caffe network are as follows (crop size = 224):
| Layer | Channels | Filter Size | Filter Stride | Padding Size | Input Size |
|-------|----------|-------------|---------------|--------------|------------|
| conv1 | 96 | 11 | 4 | - | 224x224 |
| cccp1 | 96 | 1 | 1 | - | 54x54 |
| cccp2 | 96 | 1 | 1 | - | 54x54 |
| pool1 | 96 | 3 | 2 | - | 54x54 |
| conv2 | 256 | 5 | 1 | 2 | 27x27 |
| cccp3 | 256 | 1 | 1 | - | 27x27 |
| cccp4 | 256 | 1 | 1 | - | 27x27 |
| pool2 | 256 | 3 | 2 | - | 27x27 |
| conv3 | 384 | 3 | 1 | 1 | 13x13 |
| cccp5 | 384 | 1 | 1 | - | 13x13 |
| cccp6 | 384 | 1 | 1 | - | 13x13 |
| pool3 | 384 | 3 | 2 | - | 13x13 |
| conv4-1024 | 1024 | 3 | 1 | 1 | 6x6 |
| cccp7-1024 | 1024 | 1 | 1 | - | 6x6 |
| cccp8-1000 | 1000 | 1 | 1 | - | 6x6 |
| pool4-ave | 1000 | 6 | 1 | - | 6x6 |
| accuracy | 1000 | - | - | - | 1x1 |
- For crop size = 227, the input sizes change as 227, 55, 27, 13, 6, 1 (see the sketch below).
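A small Python sketch of how these spatial sizes come about, assuming Caffe's convention of flooring convolution output sizes and ceiling pooling output sizes:

```python
import math

def conv_out(size, kernel, stride=1, pad=0):
    return (size + 2 * pad - kernel) // stride + 1            # Caffe floors conv output sizes

def pool_out(size, kernel, stride=1, pad=0):
    return math.ceil((size + 2 * pad - kernel) / stride) + 1  # Caffe ceils pooling output sizes

for crop in (224, 227):
    sizes = [crop]
    s = conv_out(crop, 11, 4); sizes.append(s)   # conv1 (cccp1/cccp2 are 1x1, size unchanged)
    s = pool_out(s, 3, 2);     sizes.append(s)   # pool1
    s = conv_out(s, 5, 1, 2)                     # conv2 keeps the size (pad 2)
    s = pool_out(s, 3, 2);     sizes.append(s)   # pool2
    s = conv_out(s, 3, 1, 1)                     # conv3 keeps the size (pad 1)
    s = pool_out(s, 3, 2);     sizes.append(s)   # pool3
    s = conv_out(s, 3, 1, 1)                     # conv4-1024 keeps the size (pad 1)
    s = pool_out(s, 6, 1);     sizes.append(s)   # pool4-ave (global average pooling)
    print(crop, sizes)   # 224 [224, 54, 27, 13, 6, 1]; 227 [227, 55, 27, 13, 6, 1]
```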
4. The role of 1x1 convolution
The following is excerpted from http://www.caffecn.cn/?/question/136
Q: Why do so many networks use 1x1 convolution kernels, and what do they actually do? Also, I have always felt that a 1x1 convolution kernel simply rescales the input, because a 1x1 kernel has only a single parameter, and sliding it over the input is equivalent to multiplying the input data by a scalar. I am not sure whether that understanding is correct.
Answer 1:
For a convolution between a single-channel feature map and a single kernel, your understanding is correct. However, convolution in a CNN is mostly a multi-channel operation between feature maps and multi-channel kernels (the multi-channel input feature maps are convolved with a set of kernels and summed to produce one output feature map). With 1x1 kernels, this operation amounts to a linear combination of multiple feature maps, which allows changing the number of channels. Placed after an ordinary convolution layer and followed by an activation function, it implements the Network-in-Network structure. (This content is authorized only for the CaffeCN community (caffecn.cn); please cite the source when reprinting.)
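A minimal numpy sketch of this point, showing that a 1x1 convolution over a multi-channel feature map is just a per-pixel linear combination of the channels (shapes are arbitrary, chosen only for illustration):

```python
import numpy as np

C_in, C_out, H, W = 192, 64, 8, 8
x = np.random.randn(C_in, H, W)            # multi-channel input feature map
w = np.random.randn(C_out, C_in)           # one 1x1 kernel per output feature map

# 1x1 "convolution": a matrix multiply over the channel axis at every spatial position.
y = np.tensordot(w, x, axes=([1], [0]))    # shape (C_out, H, W)

# The same computation written pixel by pixel.
y_ref = np.empty((C_out, H, W))
for i in range(H):
    for j in range(W):
        y_ref[:, i, j] = w @ x[:, i, j]

print(np.allclose(y, y_ref))               # True
```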
Answer 2:
Let me offer my own understanding. I think 1x1 convolutions play two roles:
1. Enable cross-channel interaction and information integration
2. Reduce and increase the number of channels (dimensionality reduction and expansion)
Explained in detail below:
1. This point was explained very clearly by Sun Linjun above. The 1x1 convolution layer (arguably) first drew wide attention with the NIN structure. The idea of the paper's author is to replace the traditional linear convolution kernel with an MLP, thereby improving the network's expressive power. Interpreted from the angle of cross-channel pooling, the proposed MLP is equivalent to appending CCCP layers to a traditional convolution kernel, which realizes a linear combination of multiple feature maps and thus information integration across channels. Since a CCCP layer is equivalent to a 1x1 convolution, a closer look at the Caffe implementation of NIN shows that each traditional convolution layer is followed by two CCCP layers (that is, two 1x1 convolution layers).
2. Dimensionality reduction and expansion (arguably) drew attention with GoogLeNet. For each Inception module, the original design is on the left of the figure, and the design on the right adds 1x1 convolutions for dimensionality reduction. Although the convolution kernels on the left are fairly small, when the numbers of input and output channels are both large, their product makes the kernel parameter count very large; the 1x1 convolutions added on the right reduce the number of input channels, so both the kernel parameter count and the computational cost drop. Take the GoogLeNet 3a module as an example: the input feature map is 28x28x192, and in the 3a module the 1x1 convolution has 64 channels, the 3x3 convolution 128 channels, and the 5x5 convolution 32 channels. With the structure on the left, the kernel parameter count is 1x1x192x64 + 3x3x192x128 + 5x5x192x32; on the right, 1x1 convolution layers with 96 and 16 channels are inserted before the 3x3 and 5x5 convolutions respectively, so the count becomes 1x1x192x64 + (1x1x192x96 + 3x3x96x128) + (1x1x192x16 + 5x5x16x32), roughly 40% of the original (the arithmetic is spelled out below). In addition, a 1x1 convolution layer is placed after the parallel pooling branch, which reduces the number of output feature maps: on the left, pooling leaves the number of feature maps unchanged, so concatenating it with the convolution branches expands the output to 416 feature maps, and if every module did this the network's output would keep growing. On the right, a 1x1 convolution with 32 channels is appended after the pooling, which brings the number of output feature maps down to 256. GoogLeNet uses 1x1 convolutions for dimensionality reduction and obtains a much more compact network: although it has 22 layers, its parameter count is only about one-twelfth that of the 8-layer AlexNet (of course, a large part of that comes from removing the fully connected layers).
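The parameter arithmetic from the 3a example above, spelled out in plain Python (weights only, biases ignored):

```python
# GoogLeNet inception-3a example quoted above: 192 input channels,
# branch widths 64 (1x1), 128 (3x3), 32 (5x5); 1x1 reductions of 96 and 16.
cin = 192

naive   = 1*1*cin*64 + 3*3*cin*128 + 5*5*cin*32          # 387,072 weights
reduced = (1*1*cin*64
           + (1*1*cin*96 + 3*3*96*128)                   # 1x1 reduce, then 3x3
           + (1*1*cin*16 + 5*5*16*32))                   # 1x1 reduce, then 5x5

print(naive, reduced, round(reduced / naive, 2))         # 387072 157184 0.41
```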
The recently popular MSRA ResNet also uses 1x1 convolutions, placed both before and after a 3x3 convolution layer: the first reduces the dimensionality and the second restores it, so the 3x3 layer sees a reduced number of input and output channels and the parameter count drops further, as in the bottleneck structure shown in the figure. (Otherwise I cannot imagine how a 152-layer network could be run at all.)
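A rough count of this bottleneck idea, with channel numbers assumed only for illustration (256 reduced to 64 for the 3x3 convolution, then restored to 256; these numbers are not from this article):

```python
# Hypothetical bottleneck: 1x1 reduce -> 3x3 -> 1x1 restore,
# compared against a single 3x3 convolution on the full channel count.
c_full, c_mid = 256, 64                                              # assumed channel counts
bottleneck = 1*1*c_full*c_mid + 3*3*c_mid*c_mid + 1*1*c_mid*c_full   # 69,632 weights
plain_3x3  = 3*3*c_full*c_full                                       # 589,824 weights
print(bottleneck, plain_3x3)
```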
[1]. https://gist.github.com/mavenlin/d802a5849de39225bcc6
[2]. http://www.caffecn.cn/?/question/136