Table of Contents: Part I: Source; Part II: Applications; Part III: Role (dimensionality reduction, dimensionality increase, cross-channel interaction, increased non-linearity); Part IV: Understanding the 1x1 convolution from the perspective of fully connected layers
I. Source: [1312.4400] Network in Network (following an ordinary convolution layer with 1x1 convolutions, each with its own activation function, implements the Network in Network structure).
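As a rough sketch of that structure (the channel numbers and kernel size here are illustrative, not the paper's exact configuration), an mlpconv-style block is simply an ordinary convolution followed by 1x1 convolutions, each with a non-linear activation:

```python
import torch.nn as nn

# Sketch of an mlpconv-style block: a normal convolution followed by
# 1x1 convolutions, each with its own non-linear activation.
mlpconv = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(96, 96, kernel_size=1), nn.ReLU(),  # cross-channel mixing, step 1
    nn.Conv2d(96, 96, kernel_size=1), nn.ReLU(),  # cross-channel mixing, step 2
)
```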
II. Applications: the Inception module in GoogLeNet and the residual module in ResNet
III. Role:
1. Dimensionality reduction (reducing parameters)
Example 1: the Inception 3a module in GoogLeNet
The input feature map is 28x28x192.
The 1x1 convolution branch has 64 output channels.
The 3x3 convolution branch has 128 output channels.
The 5x5 convolution branch has 32 output channels.
Left figure, convolution kernel parameters: 192x(1x1x64) + 192x(3x3x128) + 192x(5x5x32) = 387072
The right figure adds 1x1 convolution layers with 96 and 16 channels respectively before the 3x3 and 5x5 convolution layers, so the convolution kernel parameters become:
192x(1x1x64) + (192x1x1x96 + 96x3x3x128) + (192x1x1x16 + 16x5x5x32) = 157184
At the same time, adding a 1x1 convolution layer after the parallel pooling branch reduces the number of output feature maps (feature map size refers to W and H, over which the shared weights slide; feature map number is the number of channels).
Left figure, number of output feature maps: 64 + 128 + 32 + 192 (the pooling branch leaves the channel count unchanged) = 416 (if every module were like this, the network's output would keep growing)
Right figure, number of output feature maps: 64 + 128 + 32 + 32 (the pooling branch is followed by a 1x1 convolution with 32 channels) = 256
By using 1x1 convolutions for dimensionality reduction, GoogLeNet has a much more compact structure: although it has 22 layers, its parameter count is only about one twelfth that of the 8-layer AlexNet (a large part of the reason, of course, is the removal of the fully connected layers).
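As a sanity check on the arithmetic above, here is a minimal Python sketch that recomputes the kernel parameter counts for the two Inception 3a variants (weights only, biases ignored; the channel numbers are the ones listed above):

```python
def conv_params(in_ch, k, out_ch):
    """Weights in a k x k convolution from in_ch to out_ch channels (no bias)."""
    return in_ch * k * k * out_ch

# Left figure: every branch sees all 192 input channels directly.
naive = (conv_params(192, 1, 64)      # 1x1 branch
         + conv_params(192, 3, 128)   # 3x3 branch
         + conv_params(192, 5, 32))   # 5x5 branch

# Right figure: 1x1 reductions to 96 and 16 channels before the 3x3 and 5x5.
reduced = (conv_params(192, 1, 64)
           + conv_params(192, 1, 96) + conv_params(96, 3, 128)
           + conv_params(192, 1, 16) + conv_params(16, 5, 32))

print(naive, reduced)  # 387072 157184
```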
Example 2: the residual module in ResNet
Suppose the feature map from the previous layer is w*h*256, and the final output again has 256 feature maps.
Left figure, number of multiplications: w*h*256*3*3*256 = 589824*w*h
Right figure, number of multiplications: w*h*256*1*1*64 + w*h*64*3*3*64 + w*h*64*1*1*256 = 69632*w*h; the left side costs about 8.5 times as much as the right. (The 1x1 convolutions achieve dimensionality reduction and cut the parameter count by the same factor.)
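A minimal PyTorch sketch of the two variants (bias-free convolutions; the ReLU and batch-norm layers of the real block are omitted), comparing their parameter counts:

```python
import torch.nn as nn

# Left figure: a plain 3x3 convolution, 256 -> 256 channels.
plain = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)

# Right figure: bottleneck -- 1x1 reduce to 64, 3x3 at 64, 1x1 expand back to 256.
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1, bias=False),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(64, 256, kernel_size=1, bias=False),
)

n_plain = sum(p.numel() for p in plain.parameters())       # 589824
n_bneck = sum(p.numel() for p in bottleneck.parameters())  # 69632
print(n_plain / n_bneck)                                   # ~8.47
```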
2. Dimensionality increase (widening the network channels with the fewest parameters)
Example: in the bottleneck above there is a 1*1 convolution not only at the input but also at the output. The 3*3 convolution has 64 output channels; simply adding a 1*1 convolution with 256 filters needs only 64*256 = 16384 parameters to widen the network from 64 channels to four times that, 256 channels.
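A small sketch of just that expansion layer (bias omitted), showing its weight count:

```python
import torch.nn as nn

expand = nn.Conv2d(64, 256, kernel_size=1, bias=False)  # widen 64 -> 256 channels
print(expand.weight.shape)    # torch.Size([256, 64, 1, 1])
print(expand.weight.numel())  # 16384 = 64 * 256
```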
3. Cross-channel information interaction (channel transform)
Example: with a 1*1 convolution kernel, the dimension-reducing and dimension-increasing operations are really linear combinations of information across channels. Appending a 1*1 convolution with 28 channels to a 3*3 convolution with 64 channels is equivalent to a 3*3 convolution with 28 channels: the original 64 channels are linearly recombined across channels into 28 channels, and this is the information interaction between channels.
Note: the linear combination happens only along the channel dimension; W and H are still covered by the shared-weight sliding window.
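A minimal PyTorch sketch verifying that a 1*1 convolution really is just a per-pixel linear combination over channels, i.e. a 28x64 matrix applied at every spatial position (tensor sizes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 28, 28)                    # N, C, H, W
mix = nn.Conv2d(64, 28, kernel_size=1, bias=False)

out_conv = mix(x)

# The same thing done explicitly: multiply the 64-channel vector at every
# pixel by the (28, 64) weight matrix.
w = mix.weight.view(28, 64)                       # drop the 1x1 spatial dims
out_matmul = torch.einsum('oc,nchw->nohw', w, x)

print(torch.allclose(out_conv, out_matmul, atol=1e-5))  # True
```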
4. Increased non-linearity
A 1*1 convolution, together with the non-linear activation function that follows it, adds non-linearity while keeping the feature map scale unchanged (i.e. no loss of resolution), allowing the network to be made deeper.
Note: each filter produces one feature map after convolution; different filters (different weights and biases) produce different feature maps after convolution, extract different features, and correspond to different specialized neurons.
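A tiny sketch of this pattern: stacking 1*1 convolutions with ReLU deepens the network and adds non-linearity without changing the spatial resolution (the channel numbers here are illustrative):

```python
import torch
import torch.nn as nn

deepen = nn.Sequential(
    nn.Conv2d(192, 96, kernel_size=1), nn.ReLU(),
    nn.Conv2d(96, 96, kernel_size=1), nn.ReLU(),
)

x = torch.randn(1, 192, 28, 28)
print(deepen(x).shape)  # torch.Size([1, 96, 28, 28]) -- H and W unchanged
```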
IV. Understanding the 1*1 convolution kernel from the perspective of fully connected layers
It can be viewed as a fully connected layer.
The 6 neurons on the left are a1-a6; after the fully connected layer they become the 5 neurons b1-b5 on the right.
The 6 neurons on the left correspond to an input feature map with channels = 6.
The 5 neurons on the right correspond to the new features produced by a 1*1 convolution with channels = 5.
A w*h*6 input on the left can thus be fully connected, position by position, via 1*1 convolution kernels with 5 output channels.
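This equivalence can be checked directly: a 1*1 convolution from 6 to 5 channels and a fully connected layer with the same weights give identical outputs at every pixel (a sketch with bias disabled and an arbitrary 4x4 spatial size):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(6, 5, kernel_size=1, bias=False)
fc = nn.Linear(6, 5, bias=False)
fc.weight.data = conv.weight.data.view(5, 6)        # share the same weights

x = torch.randn(1, 6, 4, 4)                          # a w*h*6 input (here 4x4x6)

out_conv = conv(x)                                   # (1, 5, 4, 4)
# Apply the fully connected layer independently at each spatial position.
out_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

print(torch.allclose(out_conv, out_fc, atol=1e-5))   # True
```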
"In Convolutional Nets, there is no such thing as 'fully-connected layers'. There are only convolution layers with 1x1 convolution kernels and a full connection table." -- Yann LeCun
References: One by One [1 x 1] Convolution - counter-intuitively useful / What is the effect of the 1x1 convolution kernel
Understanding of the 1*1 convolution kernel / How to understand the 1*1 convolution in convolutional neural networks