Paper Notes "Fully convolutional Networks for Semantic Segmentation"

"Fully convolutional Networks for Semantic segmentation", CVPR best paper,pixel level, Fully supervised.

The main idea is to turn a CNN into an FCN: the network takes an image as input and directly outputs a dense prediction, i.e., the class of every pixel, giving an end-to-end method for semantic image segmentation.

Suppose we already have a trained CNN model. The first step is to reinterpret its fully connected layers as convolution layers whose kernel size equals the size of the incoming feature map, i.e., each fully connected layer is treated as a convolution over the entire input map. The fully connected layers then become 4096 convolution kernels of size 6*6, 4096 kernels of size 1*1, and 1000 kernels of size 1*1, as illustrated below.
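As a minimal sketch of this "convolutionalization" step (assuming an AlexNet-like trunk whose final pooled feature map is 256 x 6 x 6; the channel counts and variable names here are illustrative, not taken from the note):

```python
import torch
import torch.nn as nn

# Assume a backbone whose final pooled feature map is 256 x 6 x 6 (AlexNet-like).
# The three fully connected layers fc6 (4096), fc7 (4096), fc8 (1000) are
# reinterpreted as convolutions: the first kernel covers the whole 6x6 map,
# the others are 1x1 convolutions.
classifier_as_conv = nn.Sequential(
    nn.Conv2d(256, 4096, kernel_size=6),   # fc6 -> 4096 kernels of size 6*6
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 4096, kernel_size=1),  # fc7 -> 4096 kernels of size 1*1
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 1000, kernel_size=1),  # fc8 -> 1000 kernels of size 1*1
)

# Weights of an existing fully connected layer can simply be reshaped, e.g.:
# conv6.weight.data = fc6.weight.data.view(4096, 256, 6, 6)
features = torch.randn(1, 256, 6, 6)   # output of the convolutional trunk
scores = classifier_as_conv(features)  # shape: (1, 1000, 1, 1)
print(scores.shape)
```

On a larger input, the same layers produce a coarse spatial grid of class scores instead of a single score vector, which is what makes dense prediction possible.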


Next, the outputs of these 1000 1*1 kernels are upsampled, giving 1000 outputs at the original image size (e.g., 32*32); merging these outputs produces the heatmap shown.

Dense prediction is obtained through upsampling; the authors considered three schemes:

1. shift-and-stitch: let f be the downsampling factor between the original image and the FCN output. For each (non-overlapping) f*f region of the original image, shift the input x pixels to the right and y pixels down, for every (x, y) with 0 <= x, y < f. Under each shift, the single output produced for this f*f region corresponds to a different pixel of the region, so the f^2 shifted passes give f^2 outputs, one per pixel, which yields a dense prediction.

2. filter rarefaction: enlarge ("rarefy") the filter that follows a subsampling layer of the CNN, obtaining a new filter

f'(i, j) = f(i/s, j/s) if s divides both i and j, and f'(i, j) = 0 otherwise,

where s is the stride of the subsampling layer. The stride is then set to 1, so the subsampling no longer shrinks the feature map and a dense prediction can be obtained.

Neither of these two methods was used by the authors, mainly because each one is a trade-off:

For the second method, the subsampling is weakened, so the filters can see finer information, but their receptive fields become relatively small and global information may be lost; it also adds computation to the convolution layers.

For the first method, although the receptive fields do not shrink, because the original image is fed to the network as f*f regions, the filters cannot perceive finer-scale information.

3. deconvolution: the upsampling here can be viewed as a deconvolution (transposed convolution); its parameters, like the other CNN parameters, are learned by backpropagation while training the FCN model.
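A minimal sketch of this third scheme, upsampling by a learnable transposed convolution initialized as bilinear interpolation (the initialization helper and the factor-32 setting below are illustrative assumptions, not code from the paper):

```python
import torch
import torch.nn as nn

def bilinear_kernel(channels, kernel_size):
    """Build a bilinear-interpolation kernel, a common initialization for
    upsampling (deconvolution) layers; the weights are then refined by
    backpropagation together with the rest of the FCN."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = torch.arange(kernel_size, dtype=torch.float32)
    filt = 1 - (og - center).abs() / factor
    kernel = filt[:, None] * filt[None, :]
    weight = torch.zeros(channels, channels, kernel_size, kernel_size)
    for c in range(channels):
        weight[c, c] = kernel
    return weight

num_classes = 21  # e.g. PASCAL VOC classes + background (illustrative)
# Upsample the coarse class scores by a factor of 32 in one learnable layer.
upsample = nn.ConvTranspose2d(num_classes, num_classes,
                              kernel_size=64, stride=32, padding=16, bias=False)
upsample.weight.data.copy_(bilinear_kernel(num_classes, 64))

coarse = torch.randn(1, num_classes, 10, 10)  # coarse score map
dense = upsample(coarse)                      # (1, 21, 320, 320)
print(dense.shape)
```

Because the kernel is learnable, training can refine it beyond plain bilinear interpolation.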


The above describes how the dense prediction is obtained from the CNN. The authors found in experiments that the resulting segmentation is rather coarse, so they considered bringing in more detail from earlier layers, i.e., fusing the output of the next-to-last layer with the final output, which is in fact an element-wise addition:
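A minimal sketch of such a fusion step (FCN-16s style: the score computed from an intermediate pooling layer is added to a 2x-upsampled final score; the layer names and tensor shapes below are my own illustrative assumptions):

```python
import torch
import torch.nn as nn

num_classes = 21

# Hypothetical tensors from a VGG-like trunk on a 512x512 input:
pool4 = torch.randn(1, 512, 32, 32)                 # stride-16 feature map
final_score = torch.randn(1, num_classes, 16, 16)   # stride-32 score map

# 1x1 convolution turns pool4 into per-class scores at stride 16.
score_pool4 = nn.Conv2d(512, num_classes, kernel_size=1)(pool4)

# Upsample the coarser final score by 2x so the two maps align spatially.
up2 = nn.ConvTranspose2d(num_classes, num_classes,
                         kernel_size=4, stride=2, padding=1, bias=False)
fused = score_pool4 + up2(final_score)   # element-wise addition (the "fusion")

# One more upsampling (by 16) would bring `fused` back to input resolution.
print(fused.shape)   # (1, 21, 32, 32)
```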


This fusion gives the results in the second and third rows, and experiments show they are finer and more accurate. Continuing the layer-by-layer fusion beyond the third row makes the results worse, so the authors stopped there. The corresponding results can be seen in the first three rows of the figure:


The advantages of this approach are:

1. It trains an end-to-end FCN model, using the strong learning ability of convolutional neural networks to obtain more accurate results; previous CNN-based approaches had to do extra processing on the input or output to get the final result.

2. It directly reuses existing CNN networks such as AlexNet, VGG16 or GoogLeNet, only adding upsampling at the end; parameter learning still relies on the usual backpropagation of the CNN itself: "Whole image training is effective and efficient."

3. The input image size is not restricted, and the images in the dataset do not all have to be the same size; the final upsampling simply scales back by the subsampling factor, so the output is a dense prediction map of the same size as the original image (a small sketch of this follows).
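A quick illustration of this point, using a toy fully convolutional network (a made-up stand-in, not the paper's architecture): because every layer is convolutional, the same weights run on inputs of different sizes, and only the spatial size of the output changes.

```python
import torch
import torch.nn as nn

num_classes = 21

# A toy fully convolutional net: conv trunk with overall stride 4,
# a 1x1 scoring layer, then a 4x transposed-convolution upsampling.
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(32, num_classes, kernel_size=1),
    nn.ConvTranspose2d(num_classes, num_classes,
                       kernel_size=8, stride=4, padding=2, bias=False),
)

for size in [(128, 128), (200, 320)]:
    x = torch.randn(1, 3, *size)
    y = net(x)
    print(size, "->", tuple(y.shape[-2:]))  # output spatial size matches the input
```

(For simplicity the example input sizes are multiples of the network's total stride of 4, so the output size exactly matches the input.)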

Sample outputs from the experiments shown at the end of the paper are as follows:


Intuitively, compared with the ground truth, this method tends to lose smaller targets and local details, such as the car in the first picture and the crowd of spectators in the second; this is probably where there is still room for improvement.

Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission.

Paper Notes "Fully convolutional Networks for Semantic Segmentation"

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.