Some notes on upsampling and deconvolution

Source: Internet
Author: User
Tags: advantage

Original link: http://www.2cto.com/kf/201609/545237.html



Preface

(A product of ranting.) A few days ago I gave a report on image semantic segmentation, and I want to share some thoughts on the papers I have been reading recently. So today I am summing it up into an article to make discussion easier. This article only presents some of the classic structures that I consider good; after all, there are quite a lot of structural approaches.

Introduction

Image semantic segmentation, in short, means classifying each pixel of a given image.

From the image point of view, we need to segment the actual scene image into a segmentation map like the one below:


Different colors represent different categories.

After I read "a lot" of papers (blush) and looked at the Pascal VOC leaderboard, I found that from the point deep learning was introduced to this task (with FCN) until now, a common framework has roughly settled, namely: original image → FCN → CRF/MRF → segmentation map. Here FCN stands for fully convolutional network, CRF for conditional random field, and MRF for Markov random field.

The front end uses an FCN for coarse feature extraction, and the back end uses a CRF/MRF to refine the front end's output, finally yielding the segmentation map.

Next, I will summarize the two parts, front end and back end.

Front End

Why we need FCN

A classification network usually ends with several fully connected layers, which flatten the original two-dimensional matrix (the image) into one dimension and thereby lose the spatial information; the final output after training is a scalar, the classification result.

The output of image semantic segmentation must be a segmentation map, which is at least two-dimensional, whatever its size. So we need to discard the fully connected layers and replace them with fully convolutional layers, and this gives a fully convolutional network. For the precise definition, please see the paper Fully Convolutional Networks for Semantic Segmentation.

Front-end structures

FCN

The FCN here refers specifically to the structure proposed in the Fully Convolutional Networks for Semantic Segmentation paper, not to fully convolutional networks in the general sense.

The authors' FCN mainly uses three techniques: convolutionalization (convolutional), upsampling (upsample), and the skip structure (skip layer).

Convolutionalization

Convolutionalization means taking an ordinary classification network, such as VGG16 or ResNet50/101, discarding its fully connected layers, and replacing them with the corresponding convolution layers. As in the figure below:
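To make this concrete, here is a minimal NumPy sketch (with small stand-in shapes, not the real VGG dimensions) of why a fully connected layer can be swapped for a convolution: the FC weights, reshaped into a kernel that covers the whole feature map, compute exactly the same output.

    import numpy as np

    # A fully connected layer applied to a flattened HxWxC feature map computes
    # exactly what an HxW convolution with the same (reshaped) weights computes.
    # (Real VGG16 maps a 7x7x512 feature map to 4096 units; small shapes here.)
    H, W, C, D = 7, 7, 32, 128
    feat = np.random.randn(H, W, C)        # final conv feature map
    W_fc = np.random.randn(H * W * C, D)   # fully connected weights

    fc_out = feat.reshape(-1) @ W_fc       # ordinary FC forward pass

    W_conv = W_fc.reshape(H, W, C, D)      # same weights viewed as a conv kernel
    conv_out = np.tensordot(feat, W_conv, axes=([0, 1, 2], [0, 1, 2]))

    print(np.allclose(fc_out, conv_out))   # True: identical outputs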
Upsampling

The upsampling here is deconvolution. Of course, different frameworks use different names for this operation: Caffe and Keras call it deconvolution, while TensorFlow calls it conv_transpose. The CS231n course considers the name conv_transpose more appropriate.

As is well known, ordinary pooling (why "ordinary" will become clear later) shrinks the image; for example, after VGG16's five poolings the image is reduced 32 times (a 224x224 input becomes 7x7). To obtain a segmentation map as large as the original image, we need to upsample/deconvolve.

Deconvolution is similar to convolution; both are multiply-and-add operations. The difference is that convolution is many-to-one while deconvolution is one-to-many. The forward and backward passes of a deconvolution are simply the reversed passes of a convolution, so neither optimization nor backpropagation poses any problem. Illustrated as follows:
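As a rough illustration of the one-to-many behavior, here is a 1-D transposed-convolution sketch in NumPy (toy inputs): each input value is scattered into the output through the kernel, and overlapping contributions add up.

    import numpy as np

    # Transposed convolution scatters each input value into the output through
    # the kernel -- one-to-many, the reverse of how convolution gathers many
    # input values into one output.
    def conv_transpose_1d(x, kernel, stride):
        out = np.zeros(stride * (len(x) - 1) + len(kernel))
        for i, v in enumerate(x):
            out[i * stride : i * stride + len(kernel)] += v * kernel
        return out

    x = np.array([1.0, 2.0, 3.0])
    k = np.array([1.0, 1.0, 1.0])
    print(conv_transpose_1d(x, k, stride=2))  # [1. 1. 3. 2. 5. 3. 3.]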

However, although the paper says the deconvolution is learnable, the authors' actual code does not let it learn, probably because of this one-to-many relationship. The code is as follows:

    layer {
      name: "upscore"
      type: "Deconvolution"
      bottom: "score_fr"
      top: "upscore"
      param { lr_mult: 0 }
      convolution_param {
        num_output: 21
        bias_term: false
        kernel_size: 64
        stride: 32
      }
    }

You can see that lr_mult is set to 0, so the layer's weights are frozen.
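For context, the released FCN code initializes such frozen deconvolution layers with bilinear interpolation weights, so that they perform plain bilinear upsampling. A NumPy sketch of the usual bilinear-kernel construction (sized to match the kernel_size: 64 above) might look like this:

    import numpy as np

    # The standard bilinear filter often used to initialize a fixed
    # (lr_mult: 0) deconvolution layer: weights fall off linearly
    # from the kernel center.
    def bilinear_kernel(size):
        factor = (size + 1) // 2
        center = factor - 1 if size % 2 == 1 else factor - 0.5
        og = np.ogrid[:size, :size]
        return ((1 - abs(og[0] - center) / factor)
                * (1 - abs(og[1] - center) / factor))

    k = bilinear_kernel(64)
    print(k.shape)  # (64, 64), peaking near the center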

Skip Structure

(This odd name is my own translation; it is usually called a skip connection.) The purpose of this structure is to refine the result: directly upsampling the output of the full convolution gives a very coarse result, so the authors also upsample the outputs of different pooling layers and combine them to refine the output. The specific structure is as follows:

The results of the different upsampling structures are compared below:

Of course, you can also upsample and fuse the outputs of pool1 and pool2. However, the authors report that this yields little further improvement.
This was the first such structure, and also the pioneering work applying deep learning to image semantic segmentation, which is why it won the CVPR 2015 best paper award. Still, some parts of it are rather crude, as the later comparison will make clear.

SegNet/DeconvNet

I summarize these structures here because I find them more elegant, not because their results are necessarily better.

SegNet

DeconvNet

Such a symmetric structure has the flavor of an autoencoder: first encode, then decode. It mainly uses deconvolution and unpooling. Namely:

Deconvolution is as described above. Unpooling is implemented by remembering, during pooling, the position of each maximum; during unpooling, each value is put back at its original position and the other positions are filled with 0.
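A minimal NumPy sketch of this remember-the-argmax unpooling (assuming 2x2 non-overlapping max pooling; toy input):

    import numpy as np

    # Max pooling that records where each maximum came from, and unpooling
    # that puts each value back in place, filling the rest with zeros.
    def max_pool_with_indices(x):
        h, w = x.shape
        pooled = np.zeros((h // 2, w // 2))
        idx = np.zeros((h // 2, w // 2, 2), dtype=int)
        for i in range(0, h, 2):
            for j in range(0, w, 2):
                window = x[i:i+2, j:j+2]
                di, dj = np.unravel_index(window.argmax(), (2, 2))
                pooled[i // 2, j // 2] = window[di, dj]
                idx[i // 2, j // 2] = (i + di, j + dj)
        return pooled, idx

    def max_unpool(pooled, idx, out_shape):
        out = np.zeros(out_shape)
        for i in range(pooled.shape[0]):
            for j in range(pooled.shape[1]):
                out[tuple(idx[i, j])] = pooled[i, j]
        return out

    x = np.arange(16, dtype=float).reshape(4, 4)
    p, idx = max_pool_with_indices(x)
    print(max_unpool(p, idx, x.shape))  # maxima restored in place, zeros elsewhere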

DeepLab

Next comes a very mature and elegant structure; many of today's improvements are based on this network.

First, let us point out a crude aspect of the first structure, FCN: to keep the output from becoming too small, the FCN authors directly pad the original image with 100 pixels in the first layer, which, as you can imagine, introduces noise.

So how can we keep the output from becoming too small without resorting to something like the 100-pixel padding? Some might say: just remove some pooling layers. That is theoretically possible, but it directly changes the structure that was already available, and, most importantly, the previous structure's parameters could no longer be used for fine-tuning. DeepLab therefore uses a very elegant approach here: change the stride of pooling to 1 and add a padding of 1. This way pooling no longer shrinks the image, yet it still retains its feature-aggregating character.

But it does not end there. Because the pooling layers changed, the receptive fields of the subsequent convolutions change correspondingly, so fine-tuning is again impossible. DeepLab therefore proposes a new kind of convolution, the DeepLab convolution: atrous convolution. Namely:

The receptive field changes as follows:

Here (a) is the result of ordinary pooling and (b) is the result of the "elegant" pooling. Imagine applying an ordinary convolution with kernel size 3 on (a); the corresponding receptive field is 7. Performing the same operation on (b), the receptive field becomes 5: it has shrunk. But if we instead use an atrous convolution with hole size 1 on (b), the receptive field remains 7. Atrous convolution thus keeps the receptive field after pooling unchanged, so fine-tuning is possible, while also making the output finer. Namely:
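The following 1-D NumPy sketch (toy inputs) illustrates the idea: inserting holes between kernel taps widens the span the kernel covers, enlarging the receptive field without extra parameters or loss of resolution.

    import numpy as np

    # An atrous (dilated) convolution samples the input with gaps: a size-3
    # kernel with hole size 1 (dilation 2) covers a span of 5 input positions
    # instead of 3.
    def atrous_conv_1d(x, kernel, dilation):
        k = len(kernel)
        span = dilation * (k - 1) + 1          # input span covered by the kernel
        out = np.empty(len(x) - span + 1)
        for i in range(len(out)):
            out[i] = sum(kernel[j] * x[i + j * dilation] for j in range(k))
        return out

    x = np.arange(10, dtype=float)
    k = np.array([1.0, 1.0, 1.0])
    print(atrous_conv_1d(x, k, dilation=1))  # ordinary convolution, span 3
    print(atrous_conv_1d(x, k, dilation=2))  # hole size 1, span 5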

Summary

Three structures have been presented here: FCN, SegNet/DeconvNet, and DeepLab. Of course there are other structural approaches, for example ones using RNNs, as well as weakly-supervised methods, which have more practical significance, and so on.

Back End

Finally, the back end. Several random-field methods will be discussed here, involving some mathematics. My understanding of them is not particularly deep, so criticism is welcome.

Fully connected conditional random field (DenseCRF)

Each pixel $i$ has a category label $x_i$ and a corresponding observation $y_i$. Taking each pixel as a node and the relationship between pixels as edges constitutes a conditional random field, and we infer the category label $x_i$ of pixel $i$ through the observed variable $y_i$. The conditional random field looks as follows:

The conditional random field obeys a Gibbs distribution (here $I$ is the global observation, i.e. the image):

$$P(X = x \mid I) = \frac{1}{Z(I)} \exp(-E(x \mid I))$$

where $E(x \mid I)$ is the energy function. For simplicity, the global observation $I$ is omitted below:

$$E(x) = \sum_i \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j)$$
The unary potential $\psi_u(x_i)$ comes from the output of the front-end FCN. The pairwise (binary) potential is:

$$\psi_p(x_i, x_j) = \mu(x_i, x_j) \sum_{m=1}^{M} \omega^{(m)} k_G^{(m)}(f_i, f_j)$$

The pairwise potential describes the relationship between pixels. It encourages similar pixels to be assigned the same label and pixels that differ greatly to be assigned different labels, where the "distance" is defined in terms of the color values and the actual relative distance. The CRF can thus make the image split at boundaries as much as possible. What makes the fully connected CRF different is that its pairwise potential describes the relationship between each pixel and all other pixels, hence the name "fully connected".
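As a sketch of what this pairwise term computes, here is the usual two-kernel form (appearance plus smoothness Gaussians, as in the DenseCRF paper) for a single pixel pair, in NumPy with made-up weights and bandwidths:

    import numpy as np

    # Pairwise potential for one pixel pair: a Potts compatibility mu times
    # a weighted sum of Gaussian kernels over positions p and colors c.
    # Weights w1, w2 and bandwidths theta_* are hypothetical.
    def pairwise_potential(xi, xj, pi, pj, ci, cj,
                           w1=1.0, w2=1.0,
                           theta_alpha=60.0, theta_beta=10.0, theta_gamma=3.0):
        mu = 1.0 if xi != xj else 0.0  # Potts model: only differing labels pay
        appearance = np.exp(-np.sum((pi - pj) ** 2) / (2 * theta_alpha ** 2)
                            - np.sum((ci - cj) ** 2) / (2 * theta_beta ** 2))
        smoothness = np.exp(-np.sum((pi - pj) ** 2) / (2 * theta_gamma ** 2))
        return mu * (w1 * appearance + w2 * smoothness)

    # Nearby pixels with similar colors but different labels incur a large penalty.
    print(pairwise_potential(0, 1, np.array([5., 5.]), np.array([6., 5.]),
                             np.array([200., 30., 30.]), np.array([198., 32., 29.])))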

Now look at this pile of formulas... Computing them directly is rather troublesome (I find it troublesome too), so the mean-field approximation is generally used. The mean-field approximation is itself another pile of formulas, which I will not give here (I suspect you are not too keen to see them); readers who want to understand it should read the paper directly.

CRFasRNN

The first use of DenseCRF simply attached it after the FCN output, which is rather crude. In deep learning we pursue end-to-end systems, so the CRFasRNN paper truly integrates DenseCRF into the FCN. It also uses the mean-field approximation: since each decomposed step consists of multiplications, additions, and other ordinary operations (see the paper for the exact formulas), each step can easily be described as a convolution-like layer. These layers can be embedded in the neural network, with no problem for forward and backward propagation. The authors furthermore iterate this module; different numbers of iterations give different degrees of refinement (generally fewer than 10 iterations are used), which is why the paper describes the module "as RNN". The optimized results are as follows:
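To show that each mean-field step really is just ordinary multiply/add network operations, here is a toy NumPy sketch of the unrolled loop (random stand-ins for the FCN unary scores and the Gaussian pairwise weights; not the paper's exact filtering):

    import numpy as np

    # Iterated mean-field update as used conceptually in CRFasRNN: message
    # passing, a label-compatibility transform, and renormalization, unrolled
    # for a fixed number of iterations like an RNN.
    def softmax(z, axis=-1):
        e = np.exp(z - z.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def mean_field(unary, kernel, compat, n_iter=5):
        # unary: (N, L) scores from the FCN; kernel: (N, N) pairwise weights
        q = softmax(unary)
        for _ in range(n_iter):
            msg = kernel @ q                 # message passing from all other pixels
            pairwise = msg @ compat          # label compatibility transform
            q = softmax(unary - pairwise)    # local update + normalization
        return q

    N, L = 6, 3
    rng = np.random.default_rng(0)
    unary = rng.normal(size=(N, L))
    kernel = np.exp(-rng.random((N, N)))     # stand-in for the Gaussian kernels
    np.fill_diagonal(kernel, 0.0)
    compat = 1.0 - np.eye(L)                 # Potts compatibility
    print(mean_field(unary, kernel, compat).round(3))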
Markov Random Field (MRF)

The MRF used in Deep Parsing Network is defined much like the CRF above, except that the authors modify the pairwise potential:

$$\Psi(y_i^u, y_j^v) = \sum_{k=1}^{K} \lambda_k u_k(i, u, j, v) \sum_{\forall z \in \mathcal{N}_j} d(j, z) p_z^v$$

The authors introduce $\lambda_k$ as a label context: $u_k$ only captures the frequency with which two labels co-occur, while $\lambda_k$ can penalize certain situations; for example, a person may be next to a table, but is unlikely to be under it. So it becomes possible to learn the probabilities of different situations. The original distance $d(i,j)$ only defines the relationship between two pixels; here the authors add a triple penalty, bringing in the pixels $z$ near $j$, so that a ternary relationship describes a more sufficient local context. The specific structure is as follows:

The advantages of this structure are: the mean field is constructed as a CNN; it supports joint training; and inference can be done in one pass instead of iteratively.

Gaussian conditional random field (G-CRF)

This structure uses CNNs to learn the unary potential and the pairwise potential separately. It is the kind of structure I like better:

Its energy function differs from the previous ones:

$$E(x) = \frac{1}{2} x^{T}(A + \lambda I)x - Bx$$

When $(A + \lambda I)$ is symmetric positive definite, minimizing $E(x)$ is equivalent to solving the linear system:

$$(A + \lambda I)x = B$$
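A minimal NumPy sketch of this one-shot inference (random stand-ins for the CNN-predicted $A$ and $B$):

    import numpy as np

    # With (A + lambda*I) symmetric positive definite, minimizing the quadratic
    # energy E(x) = 0.5 x^T (A + lambda I) x - B x reduces to one linear solve.
    rng = np.random.default_rng(0)
    n, lam = 5, 1.0
    M = rng.normal(size=(n, n))
    A = M @ M.T                      # symmetric PSD stand-in for the pairwise term
    B = rng.normal(size=n)           # stand-in for the unary term

    x = np.linalg.solve(A + lam * np.eye(n), B)   # exact global minimizer

    # Check: the gradient (A + lambda I) x - B vanishes at the solution.
    print(np.allclose((A + lam * np.eye(n)) @ x, B))  # True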

The advantage of G-CRF is that the quadratic energy has a clear global optimum, and the solution is simple and easy to understand.

Reflections

- FCN is more like a technique; it keeps progressing as the base networks (such as VGG and ResNet) improve.
- Deep learning plus probabilistic graphical models (PGMs) is a trend: DL does the feature extraction, while a PGM can explain, with sound mathematical theory, the essential relationships between things.
- Making the probabilistic graphical model itself a network is another trend: since a PGM is usually not easy to plug into a DL model, turning the PGM into a network lets the PGM's parameters be learned by the model itself and yields an end-to-end system.
