Understanding depthwise separable convolution, group convolution, dilated convolution, and transposed convolution (deconvolution)

Source: Internet
Author: User

Reference:
https://zhuanlan.zhihu.com/p/28749411
https://zhuanlan.zhihu.com/p/28186857
https://blog.yani.io/filter-group-tutorial/
https://www.zhihu.com/question/54149221
http://blog.csdn.net/guvcolie/article/details/77884530?locationNum=10&fps=1
http://blog.csdn.net/zizi7/article/details/77369945
https://github.com/vdumoulin/conv_arithmetic
https://www.zhihu.com/question/43609045/answer/130868981

1. Depthwise separable convolution

In a separable convolution, the convolution operation is split into multiple steps. The variant commonly used in neural networks is the depthwise separable convolution.
For example, suppose there is a 3x3 convolution layer with 16 input channels and 32 output channels.
The standard approach convolves the input with 32 kernels of size 3x3x16, so each kernel needs 3x3x16 parameters and produces a single output channel: each of a kernel's 16 slices convolves the corresponding input channel, and the 16 results are summed position-wise into one channel. The 32 kernels therefore need (3x3x16)x32 = 4,608 parameters in total.

1.1 The difference between standard convolution and depthwise separable convolution

The following figure explains depthwise separable convolution:

Each input channel is convolved with its own single-channel filter to produce one output channel, and the channel information is then fused. For comparison, the standard convolution process can be represented by the following diagram:
1.2 The process of depthwise separable convolution

Depthwise separable convolution proceeds in two steps. ① Depthwise: convolve the 16-channel input with 16 single-channel 3x3 kernels, one kernel per input channel, producing 16 feature maps; unlike standard convolution, the 16 maps are not summed. ② Pointwise: convolve the 16 feature maps with 32 kernels of size 1x1x16, fusing the information across the 16 channels (1x1 convolutions mix information between channels). The whole process therefore uses 3x3x16 + (1x1x16)x32 = 656 parameters.
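The two parameter counts above can be checked with a short sketch (plain Python; the function names are illustrative, and biases are ignored as in the text):

```python
def standard_conv_params(k, c_in, c_out):
    # c_out kernels, each of shape k x k x c_in
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    depthwise = k * k * c_in          # one k x k filter per input channel
    pointwise = 1 * 1 * c_in * c_out  # c_out 1x1 filters spanning c_in channels
    return depthwise + pointwise

print(standard_conv_params(3, 16, 32))        # 4608
print(depthwise_separable_params(3, 16, 32))  # 656
```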

1.3 Advantages of depthwise separable convolution

As the numbers show, depthwise separable convolution needs far fewer parameters than ordinary convolution. More importantly, ordinary convolution handles the channels and the spatial region in one entangled step, whereas depthwise separable convolution deals with the spatial region first and the channels afterwards, realizing a separation of channels and regions.

2 Group convolution

Group convolution first appeared in AlexNet. Because hardware resources were limited at the time, the convolutions for training AlexNet could not all be processed on the same GPU, so the authors split the feature maps across multiple GPUs and fused the results of the GPUs at the end.

2.1 What is group convolution

Before explaining group convolution, let us use a figure to understand the ordinary convolution operation.

As the figure shows, ordinary convolution operates on the input as a whole: the input is H1 x W1 x C1, each kernel has spatial size h1 x w1 (and depth C1), there are C2 of them, and the output is H2 x W2 x C2. Here we assume the spatial resolution is unchanged. The point is that the whole computation happens in one pass, which places higher demands on memory capacity.
Group convolution does not need that many parameters. First, a figure to get an intuitive feel for the process; for the same problem as above, group convolution looks like this:

As the figure shows, the input data is split into 2 groups (the number of groups is g). Note that the split is along the depth only, i.e. every C1/g channels form one group. Because each group's input changes, the kernels must change accordingly: the depth of each group's kernels becomes C1/g, the spatial size of the kernels stays the same, and each group now has C2/g kernels rather than the original C2. Each group's kernels convolve only that group's input data; the group outputs are then combined by concatenation, so the final output still has C2 channels. In other words, once the number of groups g is fixed, we run g identical convolution processes in parallel: in each group the input is H1 x W1 x C1/g, each kernel is h1 x w1 x C1/g, there are C2/g of them, and the output is H2 x W2 x C2/g.

2.2 A concrete example of group convolution

A concrete example shows how much group convolution reduces the parameters. Suppose the input has 256 channels, the output has 256 channels, and the kernel size is 3x3. Without group convolution the parameter count is 256x3x3x256. With 8 groups, each group has 32 input channels and 32 output channels, so the parameter count is 8x32x3x3x32, one eighth of the original. The feature maps output by the groups are finally combined by concatenation.
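This one-eighth ratio can be verified with a small sketch (plain Python; the function name is illustrative, biases ignored):

```python
def conv_params(k, c_in, c_out, groups=1):
    # Each of the `groups` groups maps c_in/groups channels
    # to c_out/groups channels with k x k kernels.
    assert c_in % groups == 0 and c_out % groups == 0
    return groups * (c_in // groups) * k * k * (c_out // groups)

full = conv_params(3, 256, 256)               # 256*3*3*256 = 589824
grouped = conv_params(3, 256, 256, groups=8)  # 8*32*3*3*32 = 73728
print(full // grouped)                        # 8
```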
Krizhevsky argued that group convolution can increase the (block-)diagonal correlation between filters, and that it reduces the number of training parameters and makes the network less prone to overfitting; the effect is similar to regularization.

3 Dilated convolution (dilated/atrous convolution)

Dilated convolution arose in image semantic segmentation as a way to avoid the loss of information caused by reducing image resolution through downsampling. By inserting holes to enlarge the receptive field, a 3x3 kernel can have the receptive field of a 5x5 kernel (dilation rate = 2) or larger with the same number of parameters and the same amount of computation, removing the need for downsampling. Dilated convolutions, also known as atrous convolutions, introduce a new hyper-parameter to the convolution layer called the dilation rate, which defines the spacing between the kernel taps as they are applied to the data. In other words, compared with standard convolution, the only change is this dilation rate, i.e. the number of intervals between kernel points; an ordinary convolution has a dilation rate of 1.

The concept of dilated convolution


(a) corresponds to a 3x3 1-dilated convolution, identical to an ordinary convolution. (b) corresponds to a 3x3 2-dilated convolution: the actual kernel is still 3x3, but there is a hole of 1 between taps; the empty positions are filled with 0 before convolving (this is shown in the figure further below). (c) is a 4-dilated convolution.
When such layers are stacked with dilation rate 2^i at the i-th layer (i = 0, 1, 2, ...), the receptive field after the i-th layer follows F = (2^(i+2) - 1) x (2^(i+2) - 1).
For example, in figure (a), dilation = 1 and F = 3x3; in figure (b), dilation = 2 and F = 7x7; in figure (c), dilation = 4 and F = 15x15.
When dilation = 2, the concrete operation is to fill the empty positions with 0 as in the figure below, and then convolve directly.
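Assuming the exponential stacking scheme of figures (a)-(c), i.e. successive 3x3 layers with dilation rates 1, 2, 4 and stride 1, the receptive fields 3, 7, 15 can be reproduced with a short sketch (each layer with dilation d widens the receptive field by (k - 1) * d):

```python
def stacked_receptive_field(kernel=3, dilations=(1, 2, 4)):
    # Receptive field after each layer in a stride-1 stack of
    # dilated convolutions; starts from a single pixel (rf = 1).
    rf = 1
    fields = []
    for d in dilations:
        rf += (kernel - 1) * d
        fields.append(rf)
    return fields

print(stacked_receptive_field())  # [3, 7, 15]
```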
The dynamic process of dilated convolution

Visualizing the process of dilated convolution on a two-dimensional image:

The image above shows a 3x3 kernel with a dilation rate of 2. Its receptive field is the same as that of a single 5x5 kernel, yet it needs only 9 parameters. You can think of it as a 5x5 kernel from which every second row and column has been deleted.
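The "fill the holes with 0, then convolve directly" view can be sketched as follows (plain Python; the function name is illustrative). A 3x3 kernel with rate 2 expands into an equivalent 5x5 kernel that still has only 9 non-zero taps:

```python
def dilate_kernel(kernel, rate):
    # Insert (rate - 1) zeros between the taps of a square kernel.
    k = len(kernel)
    size = k + (k - 1) * (rate - 1)
    out = [[0.0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            out[i * rate][j * rate] = kernel[i][j]
    return out

k3 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
k5 = dilate_kernel(k3, rate=2)
print(len(k5))                                          # 5
print(sum(1 for row in k5 for v in row if v != 0))      # 9
```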
Under the same computational budget, dilated convolution provides a larger receptive field. It is often used in real-time image segmentation, and is worth considering whenever the network needs a large receptive field but the computational resources do not allow more or larger kernels.

Exponential growth of the receptive field with dilated convolution

Take the standard kernel case first: convolve twice in succession with a 3x3 kernel, obtaining 1 feature point at the 3rd layer. How many feature points of the 1st layer does that 3rd-layer point cover?
Layer 3:

Layer 2:

Layer 1:

A 5x5 region of the first layer shrinks to one point after two 3x3 standard convolutions. That is, in terms of receptive field, two stacked 3x3 convolutions are equivalent to one 5x5 convolution; setting the figure aside, a 5x5 kernel can indeed be replaced by two successive 3x3 kernels.
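The standard-convolution case can be checked with a one-liner-style sketch: with stride 1, each k x k layer adds (k - 1) to the receptive field.

```python
def receptive_field(num_layers, kernel=3):
    # Receptive field of a stride-1 stack of standard k x k convolutions.
    rf = 1
    for _ in range(num_layers):
        rf += kernel - 1
    return rf

print(receptive_field(2))  # 5: two 3x3 layers see a 5x5 input region
```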
Now consider a 3x3 dilated convolution kernel with dilation = 2.
A point at layer 3:

Layer 2:


As the figures show, a 13x13 region of the first layer becomes one point after two 3x3 dilated convolutions. In terms of receptive field, two layers of 3x3 dilated convolution are equivalent to one 13x13 convolution.

Transposed convolution

Transposed convolutions are also known as deconvolutions or fractionally strided convolutions. The concept of deconvolution first appeared in Zeiler's 2010 paper "Deconvolutional networks".

The difference between transposed convolution and deconvolution

So what is deconvolution? Literally, it is the inverse process of convolution. True deconvolution does exist, but it is rarely used in deep learning. Transposed convolution, although often called deconvolution, is not a real deconvolution: in the mathematical sense, deconvolution fully recovers the input signal from the output, whereas transposed convolution only restores the shape, not the values. So, at least numerically, transposed convolution does not invert the convolution operation. Transposed convolution resembles real deconvolution only in that both produce the same spatial resolution; the name deconvolution is therefore inappropriate, since it does not match the mathematical concept.

Dynamic graph of transposed convolution


A two-dimensional transposed convolution with a 3x3 kernel, stride 2, and no padding.
Note that the padding and stride here are still the values specified for the forward convolution; they do not change.

Example

The above only explains the purpose of transposed convolution in theory; it does not show how the input is rebuilt from the convolution output. Here is an example to get a feel for it.
Take a 3x3 input and reshape it into a row vector a of shape 1x9. Let b (which can be understood as the filter) be a 9x4 Toeplitz matrix. Then a*b = c has shape 1x4, and reshaping c gives 2x2. So, by convolving with b, we changed the input from shape 3x3 to shape 2x2. Now reverse direction: take the convolution result as input, so a is 2x2, reshaped to 1x4; transpose b to get 4x9; then a*b = c is 1x9, which we treat as the pre-convolution input (although the values deviate), and reshape to 3x3. Through the transpose of b, the "deconvolution", we went from the convolution result of shape 2x2 back to shape 3x3, reconstructing the resolution.
In other words: the forward pass takes feature map a = [3,3] through a filter b = [2,2] with padding = 0, stride = 1 and outputs [2,2]; the deconvolution (transposed convolution) takes feature map a = [2,2] through the same filter and outputs [3,3], with padding = 0 and stride = 1 unchanged. So how is the [2,2] kernel (filter) turned into the [4,9] or [9,4] matrix? Through the Toeplitz matrix.
What exactly a Toeplitz matrix is will not be covered here for reasons of space. But even without knowing that matrix, the concrete working of transposed convolution should now be clear.
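The whole example can be sketched in plain Python (the shapes follow the text: 3x3 input, 2x2 filter, valid convolution with stride 1; `conv_matrix` and `matvec` are illustrative helper names). Each column of B places the 2x2 kernel at one output position, so a*B is the convolution and c*B^T restores only the shape:

```python
def conv_matrix(kernel, in_size, k):
    # Build the (in_size^2) x (out^2) matrix B for a valid, stride-1
    # convolution of a flattened in_size x in_size input.
    out = in_size - k + 1
    B = [[0.0] * (out * out) for _ in range(in_size * in_size)]
    for oi in range(out):
        for oj in range(out):
            col = oi * out + oj
            for ki in range(k):
                for kj in range(k):
                    B[(oi + ki) * in_size + (oj + kj)][col] = kernel[ki][kj]
    return B

def matvec(a, B):
    # Row vector (1 x m) times matrix (m x n) -> row vector (1 x n).
    return [sum(a[i] * B[i][j] for i in range(len(a))) for j in range(len(B[0]))]

kernel = [[1.0, 2.0], [3.0, 4.0]]
B = conv_matrix(kernel, in_size=3, k=2)               # 9 x 4
a = [float(v) for v in range(9)]                      # 3x3 input flattened to 1x9
c = matvec(a, B)                                      # 1x4 -> reshape 2x2: the convolution
Bt = [[B[i][j] for i in range(9)] for j in range(4)]  # transpose: 4 x 9
back = matvec(c, Bt)                                  # 1x9 -> reshape 3x3: shape restored, values not
print(len(c), len(back))  # 4 9
```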
