Convolutional networks and their variants (deconvolution, dilated convolution, causal convolution, graph convolution)


Today we look at the basic convolutional network and some of its variants.

First, let's introduce the basic convolutional network.

The convolution process can be understood well through the classic animation on the slide. The large blue matrix in the figure is our input, the small yellow matrix is the convolution kernel (kernel, filter), and the small matrix below is the convolution output, usually called the feature map.

From the animation we can clearly see that convolution is really just a weighted sum.

At the same time, the animation shows that the output dimension is smaller than the input dimension. If we need the output dimension to equal the input dimension, the input has to be padded.

Now let's look at the different convolutions produced by different padding choices.

First, consider the convolution without padding (valid).

From this static picture on the slide we can see that the output dimension o, the kernel dimension k, and the input dimension i satisfy o = i - k + 1 (Equation 1). Here we assume both the input and the kernel are square, so for a 4*4 input and a 3*3 kernel we can simply write i = 4 and k = 3.

When the output dimension is the same as the input dimension, the padding is called same padding, also known as half padding. Likewise, for the padded convolution shown in the picture on the slide, the output dimension and the input dimension satisfy o = i - k + 2p + 1 (Equation 2), where p is the amount of padding; we again assume the same amount of padding is used along the height and the width.

When we set the padding to p = ⌊k/2⌋ and substitute into Equation 2, we find that whenever k is odd (k = 2n + 1) the output dimension equals the input dimension.
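
As a quick sanity check of Equations 1 and 2, here is a small Python sketch (the helper names are mine, not from the slides):

    def valid_output_size(i, k):
        """Equation 1: no padding, stride 1."""
        return i - k + 1

    def padded_output_size(i, k, p):
        """Equation 2: p rows/columns of padding on each side, stride 1."""
        return i - k + 2 * p + 1

    # 4*4 input, 3*3 kernel, no padding -> 2*2 output
    print(valid_output_size(4, 3))              # 2
    # same (half) padding: p = k // 2 keeps the size when k is odd
    print(padded_output_size(4, 3, 3 // 2))     # 4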

Causal convolution (causal conv) is useful when we care about the order of the input data.

Causal convolution was originally presented together with WaveNet. WaveNet is a generative model mainly used to generate audio (for example, music). WaveNet uses convolution to learn from the input data (audio) before time t in order to predict the output at time t+1. That is, the probability of the output sequence x factorizes as shown in Equation 3. From Equation 3 and the animation on the slide, we can see that the output at time t depends only on the inputs at times 1, ..., t-1, and not on the inputs at time t+1 or later. This is very different from the idea behind BiLSTM.
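
Equation 3 itself is only shown on the slide; for reference, the autoregressive factorization that WaveNet uses is usually written as

    p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})

so every factor conditions only on past samples.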

When your model has this requirement, you can use causal convolution.

In implementation, 1D causal convolution is mainly realized through padding, while 2D causal convolution is mainly implemented by masking the filter map.
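
The original gives no code; a minimal sketch of the 1D padding trick, assuming PyTorch, could look like this:

    import torch
    import torch.nn.functional as F

    # Causal 1D convolution: pad only on the left (the past), so the output
    # at time t never sees inputs at time t+1 or later.
    x = torch.randn(1, 1, 10)              # (batch, channels, time)
    kernel = torch.randn(1, 1, 3)          # kernel size k = 3

    pad_left = kernel.shape[-1] - 1        # k - 1 zeros in front of the sequence
    x_padded = F.pad(x, (pad_left, 0))     # (left, right) padding of the time axis
    y = F.conv1d(x_padded, kernel)         # causal output, same length as x

    print(x.shape, y.shape)                # both torch.Size([1, 1, 10])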

Here are some simple experiments I ran on the ChemDNER dataset, using an input window size of 11. Clearly, a CNN without careful tuning is significantly weaker than a BiLSTM, while the valid-form CNN is better than the padded CNN. This may be because the padding introduces noise that interferes with the model.

The figure on the slide shows the process of using a CNN for NLP tasks. It can be seen that a multichannel CNN, with different kernels and pooling, is commonly used to learn features of the input.

Next, let's look at another convolution: dilated convolution. Dilated convolution is currently rarely used in NLP; it is mainly used for images.

Dilated convolution was presented at ICLR 2016. Its main purpose is to increase the receptive field (each output is determined by the inputs inside its receptive field) without increasing the number of parameters or the complexity of the model. The effect can be seen in the figure: the blue rectangle marks the receptive field and the small red dots mark the kernel. In figure (a) the kernel is 3*3, the receptive field is 3*3 and dilation = 1; in figure (b) the kernel is 3*3 but the receptive field is 7*7 and dilation = 2; in figure (c) the kernel is 3*3 but the receptive field is 15*15 and dilation = 4. As the dilation (expansion factor) grows, the receptive field grows with it.

Below, we use 1D data to look at dilation in detail. As can be seen, when dilation = 2 each output "sees" 3 inputs (although 2 - 1 = 1 of them is skipped), and when dilation = 4 it "sees" 5 inputs (4 - 1 = 3 of them are skipped).

From the above analysis, dilation looks very similar to stride. But can dilation be equated with stride?
The answer is no. We can think of dilation as a pattern for thinning out the kernel, and stride is just a special case of dilation. Depending on the task, we can design different sparsity patterns; for example, the dilation along the width does not have to equal the dilation along the height.
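
To make the difference concrete, here is a small sketch (again assuming PyTorch) that applies the same size-2 kernel with dilation and with stride:

    import torch
    import torch.nn.functional as F

    x = torch.arange(16, dtype=torch.float32).view(1, 1, 16)
    w = torch.ones(1, 1, 2)                 # kernel size 2

    # Dilation thins out the kernel: each output covers a span of
    # dilation * (k - 1) + 1 inputs, skipping the inputs in between.
    y_dilated = F.conv1d(x, w, dilation=4)
    # Stride keeps the dense kernel but skips output positions instead.
    y_strided = F.conv1d(x, w, stride=4)

    print(y_dilated.shape)                  # torch.Size([1, 1, 12])
    print(y_strided.shape)                  # torch.Size([1, 1, 4])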

These are the results given in the paper. The task is scene segmentation; the dilated model is the 4th column and the ground truth is the 5th column. It can be seen that the dilated results are better than those of the other models.

The same experiment was done on the ChemDNER dataset. You can see that the dilated convolution performs about the same as a normal CNN, but performance drops when the dilation becomes larger.

Because dilation sparsifies the kernel, it may not be suitable for NER tasks; however, it may work better for document classification and relation extraction.

Now let's look at deconvolution.

The paper I'm going to talk about was published at CVPR 2010.

In mathematics, deconvolution is the reverse of convolution: if h = f * g, then in deconvolution we only know h and want to recover f and g. In convolution as used in CNNs, by contrast, we know g (the input) and modify f through forward and backward propagation to get the best h.

At present, deconvolution is already used in practice in signal processing, image processing, and so on.

In deep learning, it is mainly used in the following three ways:

  • Unsupervised learning: reconstructing images

  • CNN visualization: mapping a feature map from conv space back to pixel space, to see which image patterns a particular feature map is sensitive to

  • Upsampling

Deconvolution is also called transposed convolution (transposed conv).

We can write the convolution in the figure as a matrix multiplication y = Cx, where C is the first matrix in the diagram, x is the second, and y is the third. Applying the transpose of C then gives the deconvolution.

As a consequence, the normal conv uses C in the forward pass and C^T in the backward pass; the deconv does the opposite.
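
Here is a minimal 1D NumPy sketch of this matrix view (the slide's example is 2D; the numbers below are made up):

    import numpy as np

    k = np.array([1.0, 2.0, 3.0])           # kernel of size 3
    x = np.array([4.0, 5.0, 6.0, 7.0])      # input of size 4 -> valid output of size 2

    # Each row of C is the kernel slid to one output position.
    C = np.array([
        [k[0], k[1], k[2], 0.0],
        [0.0,  k[0], k[1], k[2]],
    ])

    y = C @ x                                # convolution as a matrix product, shape (2,)
    x_up = C.T @ y                           # transposed convolution, back to shape (4,)

    print(np.allclose(y, np.convolve(x, k[::-1], mode="valid")))   # True
    print(x_up.shape)                                              # (4,)

Note that C^T only restores the shape of x, not its values; this is why transposed convolution is an upsampling operation rather than a true inverse.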

Below, we introduce the application of deconvolution to image reconstruction. The task is mainly to extract features of the image. The figure shows the results provided by the paper; you can see the effect is good.

The loss used in this task is a typical reconstruction error plus an L1 regularizer.

The deconvolution formula used is as follows: y represents the input being reconstructed, z represents the feature maps under conv (which are the result of our task), and f represents the kernels.
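
The formula itself is only on the slide; a hedged reconstruction, following the deconvolutional-networks paper, is a cost of the form

    C(y) = \frac{\lambda}{2} \left\| \sum_{k} z_k * f_k - y \right\|_2^2 + \sum_{k} \| z_k \|_1

where \sum_k z_k * f_k is the reconstruction of the input y from the feature maps z and the kernels f.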

The next figure shows the approximate flow of the system. In the figure, f denotes deconv, f^T denotes conv, p denotes pooling, and u denotes unpooling. R denotes the composition of the f, u, f operations, and R^T likewise.

In the usual pipeline (for example, when using a CNN for image classification), we first pass the image pixels through the f^T operation to get layer z1, then through the p operation to get the output of the first layer. After the second layer's f^T and p operations, we obtain the second layer's output. We can then use this second-layer output for the task at hand, for example image classification.

In deconvolution, however, we need to reverse this process. Suppose our model has already been trained. Starting from the second-layer output (again assuming we already have it), the u operation followed by the f operation gives the first-layer output; repeating the same operations once more yields the reconstructed input. Since we assumed the model is well trained, the reconstructed input differs very little from the original input.

The figure shows the 3D pooling used in the paper. The difference from the familiar 2D pooling is that 3D pooling first performs 2D pooling within each feature map and then pools again across different feature maps, which is why it is 3D.

Unpooling is the reverse process of 3D pooling.

After pooling and unpooling, you can see that the result is sparse, which is exactly what the task demands.

Having described the overall process, let's look at how the model is trained. Since neither f nor z is known in deconvolution, we need to fix f to optimize z and fix z to optimize f. Specifically, in this task the filters f of the deconvolution and the feature maps z of the image are unknown; we only know the original image y, and we need to obtain the feature maps z, that is, the outlines shown on the earlier slides.

For ease of understanding, running a well-trained model is called inference, and training the model is called learning. Inference corresponds to fixing f and optimizing z. Learning first runs the inference procedure, then fixes z and optimizes f.

The inference process is divided into three steps. First, in the gradient step, back-propagation gives the gradient of the loss with respect to z, and z is updated with this gradient. Then, in the shrinkage step, the function shown in the figure is used to sparsify z, where beta is a parameter. Finally, in the pooling step, the p operation gives the output of the final (second) layer. In general, during inference we repeat these three steps until the final loss is small enough.
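
Here is a hedged Python sketch of the gradient and shrinkage steps (soft-thresholding is assumed for the shrinkage function, and the pooling step is left out):

    import numpy as np

    def shrink(z, beta):
        """Soft-threshold: push small entries of z towards zero (sparsity)."""
        return np.sign(z) * np.maximum(np.abs(z) - beta, 0.0)

    def inference_step(z, y, conv, conv_transpose, lr=0.04, beta=0.01):
        # Gradient step: back-propagate the reconstruction error to z and update z.
        residual = conv(z) - y                  # reconstruction minus the input y
        z = z - lr * conv_transpose(residual)   # gradient of 0.5 * ||conv(z) - y||^2
        # Shrinkage step: sparsify the feature maps z.
        z = shrink(z, beta)
        # (The pooling step of the paper is omitted in this sketch.)
        return z

    # Toy usage, reusing the matrix form of conv from the earlier sketch.
    C = np.array([[1.0, 2.0, 3.0, 0.0],
                  [0.0, 1.0, 2.0, 3.0]])
    y = np.array([32.0, 38.0])
    z = np.zeros(4)
    for _ in range(200):
        z = inference_step(z, y, conv=lambda v: C @ v, conv_transpose=lambda r: C.T @ r)
    print(np.round(C @ z, 2), "~", y)           # the reconstruction approaches y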

In the learning process, we first run inference and then use the CG (conjugate gradient) algorithm to optimize f with z fixed. These two steps are repeated until the final loss is small enough, at which point the model is trained.

The locally connected layer (locally-connected layers) is an extension of conv. In conv, all the weights w are shared; in a locally connected layer, the parameters are not shared. In other words, a locally connected layer also computes a weighted sum over a window, just like conv, but the weights w are different for each input position.

Compared with conv, because w is not shared, a locally connected layer can learn more complex features, but it is also easier to overfit.
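
A minimal 1D sketch of the difference, assuming NumPy and made-up sizes:

    import numpy as np

    x = np.random.randn(8)                   # input of length 8
    k = 3                                     # window size
    n_out = len(x) - k + 1                    # valid output length

    w_shared = np.random.randn(k)             # conv: one kernel reused at every position
    w_local = np.random.randn(n_out, k)       # locally connected: one kernel per position

    y_conv = np.array([x[i:i + k] @ w_shared for i in range(n_out)])
    y_local = np.array([x[i:i + k] @ w_local[i] for i in range(n_out)])

    print(y_conv.shape, y_local.shape)        # (6,) (6,)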

The same experiment was done on ChemDNER. It can be seen that the deconvolution (transposed convolution) is slightly better than CNN-same.

Next, let's look at graph convolution (graph conv).

We will not introduce the theory of graph conv first; instead, we first describe how it is used.

First, we have a graph G = (V, E), where X represents the features of the vertex set V and A represents the structural information of the graph, usually given as the adjacency matrix. In one layer of graph conv, the previous layer's output H^l, the matrix A, and this layer's weights W are taken as input, and after some function mapping f we get this layer's output: H^{l+1} = f(H^l, A).

Below, we assume the function f(H, A) = σ(AHW) is used. Because A is the adjacency matrix of the graph, an entry is non-zero only if the current node is connected to the other node. So AHW means that every node sums its neighbors' previous-layer outputs multiplied by W; in other words, through σ(AHW) we only see the local connections of the current node. This is very similar to the local connectivity of conv, which is one way to understand graph conv. At the same time, when we stack multiple graph-conv layers, H2 uses the values of H1; H1 uses the information of the current node's 1-hop neighbors, so H2 uses the information of the current node's 1-hop and 2-hop neighbors.
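
A minimal NumPy sketch of one such layer, f(H, A) = σ(AHW), on a made-up 4-node graph:

    import numpy as np

    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)   # adjacency matrix of a 4-node graph
    H = np.eye(4)                                # initial node features (one-hot)
    W = np.random.randn(4, 2)                    # layer weights, mapping to 2 dimensions

    relu = lambda m: np.maximum(m, 0.0)          # sigma
    H1 = relu(A @ H @ W)                         # each node sums its neighbors' features

    print(H1.shape)                              # (4, 2)

Note that with a plain A a node's own features are dropped from the sum, which is one motivation for the Â = A + I trick described below.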

Using a more complete version of this function on the karate club dataset, with randomly initialized W and a 3-layer graph conv, taking the final H3 as output gives the results shown. You can see that even without training, the vector distances between nodes are already reasonable (points of the same color stay close together).

In the figure, Â = A + I, and D̂ represents the diagonal degree matrix of Â.
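
Written out, the renormalized propagation rule these symbols describe is

    H^{(l+1)} = \sigma\left( \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^{(l)} W^{(l)} \right), \qquad \hat{A} = A + I

which is the more complete version of σ(AHW) used above: adding I keeps each node's own features, and the D̂ factors normalize by node degree.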

Theoretically, we can explain graph conv in two ways.

  • In spectral graph theory, convolution on a graph can be expressed as a matrix product. Applying the Chebyshev polynomial and further approximations to Equation 4, we obtain Equation 5, and Equation 5 is basically the same as the function σ we just used (a hedged reconstruction of these equations follows this list).

  • The W-L (Weisfeiler-Lehman) algorithm tells us that we can use the neighbors of the current node to represent it.
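
Equations 4 and 5 are only shown on the slides; a hedged reconstruction, following the standard spectral derivation of graph conv, is

    g_\theta \star x = U \, g_\theta(\Lambda) \, U^\top x                          (spectral convolution)
    g_\theta \star x \approx \theta \left( I_N + D^{-1/2} A D^{-1/2} \right) x     (first-order approximation)

where U and Λ come from the eigendecomposition of the graph Laplacian; after the Â = A + I renormalization above, the approximation becomes exactly the D̂^{-1/2} Â D̂^{-1/2} H W layer used earlier.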

Here's a summary.

The summary is shown in the figure. Among the variants, dilated convolution may get better results in text categorization and relation extraction.
