"Paper Information"
"Fully Convolutional Networks for Semantic Segmentation"
CVPR 2015, Best Paper Honorable Mention
Reference Link:
http://blog.csdn.net/tangwei2014
http://blog.csdn.net/u010025211/article/details/51209504
Overview & Key contributions
This paper presents an end-to-end method for semantic segmentation, referred to as FCN.
As shown, the segmentation ground truth is used directly as the supervision signal: an end-to-end network is trained to make pixelwise predictions and output the label map directly.
(The note author's own analogy: compare the RPN in Faster R-CNN, with fc -> region proposal replaced by a label map, followed by Fast R-CNN for fine-tuning.)
"Introduction to the Method"
The main idea is to convert a CNN into an FCN: an image goes in and a dense prediction comes out directly, that is, the class of every pixel, which gives an end-to-end method for semantic image segmentation.
Start from an existing CNN model. Each fully connected layer is first reinterpreted as a convolution layer whose kernel size equals the size of its input feature map, so the fully connected part of the network becomes a convolution over the whole input map. The fully connected layers become 4,096 kernels of size 6*6, 4,096 kernels of size 1*1, and 1,000 kernels of size 1*1, as shown:
Next, the 1,000 1*1 outputs are upsampled to give 1,000 maps at the original size (for example 32*32), and these maps are merged to produce the heatmap shown.
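The reinterpretation of a fully connected layer as a convolution can be checked on a toy example. The following is a minimal sketch in plain Python, assuming a channel-major flattening order; the sizes (2 channels, 3*3 kernel, 4 outputs) and the helper names `fc_forward`, `conv_valid`, and `fc_to_conv` are made up for illustration and are not the real network's.

```python
import random

def fc_forward(weights, x_flat):
    """Fully connected layer: weights is [out][in], x_flat is a flat list."""
    return [sum(w * v for w, v in zip(row, x_flat)) for row in weights]

def conv_valid(kernels, fmap):
    """Stride-1 'valid' convolution (cross-correlation).
    kernels: [out][c][kh][kw], fmap: [c][h][w] -> [out][h-kh+1][w-kw+1]."""
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    kh, kw = len(kernels[0][0]), len(kernels[0][0][0])
    return [[[sum(k[ch][a][b] * fmap[ch][i + a][j + b]
                  for ch in range(c) for a in range(kh) for b in range(kw))
              for j in range(w - kw + 1)]
             for i in range(h - kh + 1)]
            for k in kernels]

def fc_to_conv(weights, c, kh, kw):
    """Reshape FC weights [out][c*kh*kw] into kernels [out][c][kh][kw]."""
    return [[[[row[(ch * kh + a) * kw + b] for b in range(kw)]
              for a in range(kh)] for ch in range(c)] for row in weights]

random.seed(0)
C, K, OUT = 2, 3, 4  # toy sizes (hypothetical)
fc_w = [[random.uniform(-1, 1) for _ in range(C * K * K)] for _ in range(OUT)]
kernels = fc_to_conv(fc_w, C, K, K)

# Input of the original size: the convolution gives a 1x1 map equal to the FC output.
small = [[[random.uniform(-1, 1) for _ in range(K)] for _ in range(K)] for _ in range(C)]
flat = [v for ch in small for row in ch for v in row]
fc_out = fc_forward(fc_w, flat)
conv_out = conv_valid(kernels, small)

# Larger input: the same kernels slide over it and give a coarse score map.
large = [[[random.uniform(-1, 1) for _ in range(K + 2)] for _ in range(K + 2)] for _ in range(C)]
coarse = conv_valid(kernels, large)
print(len(coarse), len(coarse[0]), len(coarse[0][0]))  # 4 3 3
```

On the original-size input the two layers agree value for value; on the larger input the converted layer produces one score per kernel position, which is the coarse map that is later upsampled.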
"Detail Record"
Dense prediction
Dense prediction is obtained through upsampling. The authors study three approaches:
1. Shift-and-stitch: let f be the downsampling factor between the original image and the FCN output. For each non-overlapping f*f region of the original image, "shift the input x pixels to the right and y pixels down for every (x, y), 0 <= x, y < f." Each shifted copy yields one output for the region, corresponding to the pixel at the center of the receptive field, so f^2 outputs are obtained for each f*f region. Every pixel then has its own output, which makes the prediction dense.
2. Filter rarefaction: set the stride of the subsampling layer to 1 and enlarge the filter of the following convolution layer by inserting zeros, which gives a new filter:
f'[i][j] = f[i/s][j/s] if s divides both i and j, and 0 otherwise,
where s is the original stride of the subsampling layer. Because the new stride is 1, subsampling no longer shrinks the image, and the final output is a dense prediction.
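Filter rarefaction amounts to spreading the filter taps s positions apart and filling the gaps with zeros, following the rule f'[i][j] = f[i/s][j/s] when s divides both i and j, and 0 otherwise. A minimal sketch in plain Python, assuming a small 2D filter stored as nested lists (the function name `rarefy_filter` is hypothetical):

```python
def rarefy_filter(f, s):
    """Enlarge a 2D filter by zero insertion:
    f'[i][j] = f[i // s][j // s] if s divides both i and j, else 0."""
    k_h, k_w = len(f), len(f[0])
    out_h, out_w = s * (k_h - 1) + 1, s * (k_w - 1) + 1
    return [[f[i // s][j // s] if i % s == 0 and j % s == 0 else 0
             for j in range(out_w)]
            for i in range(out_h)]

f = [[1, 2],
     [3, 4]]
for row in rarefy_filter(f, 2):
    print(row)
# [1, 0, 2]
# [0, 0, 0]
# [3, 0, 4]
```

The enlarged filter keeps the same number of nonzero weights but covers a wider window, which is why it can be applied at stride 1 without changing what each weight looks at.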
The authors use neither of these methods, because both involve a trade-off:
For the second method, the downsampling is weakened so that the filters see finer detail, but the receptive fields become relatively small, global information may be lost, and the convolution layers need more computation.
For the first method, the receptive fields do not shrink, but the original image is split into f*f regions before being fed to the network, so the filters cannot access finer-scale information.
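To make the shift-and-stitch idea concrete, here is a small sketch in plain Python. It assumes a toy "network" that is just f*f max pooling with stride f (a stand-in, not the FCN itself), zero padding at the bottom and right, and an image whose sides are divisible by f. The shift is implemented by moving the crop window, which differs from the paper's shift direction only by convention.

```python
def pool_stride_f(img, f):
    """Toy 'network' with output stride f: f x f max pooling, stride f."""
    h, w = len(img), len(img[0])
    return [[max(img[i + a][j + b] for a in range(f) for b in range(f))
             for j in range(0, w - f + 1, f)]
            for i in range(0, h - f + 1, f)]

def shift_and_stitch(img, f):
    """Run the coarse network on f*f shifted copies and interlace the outputs."""
    h, w = len(img), len(img[0])
    # Zero-pad bottom and right so every shifted copy keeps the full size.
    padded = [row + [0] * (f - 1) for row in img]
    padded += [[0] * (w + f - 1) for _ in range(f - 1)]
    dense = [[None] * w for _ in range(h)]
    for y in range(f):
        for x in range(f):
            shifted = [row[x:x + w] for row in padded[y:y + h]]
            coarse = pool_stride_f(shifted, f)
            # Output (i, j) of shift (x, y) belongs to pixel (i*f + y, j*f + x).
            for i, row in enumerate(coarse):
                for j, v in enumerate(row):
                    dense[i * f + y][j * f + x] = v
    return dense

img = [[1, 5, 2, 0],
       [3, 4, 8, 1],
       [0, 2, 6, 7],
       [9, 1, 3, 2]]
dense = shift_and_stitch(img, 2)
print(dense[0])  # [5, 8, 8, 1]
```

The network runs f^2 times, and the stitched result has one output per input pixel, the same as running the pooling at stride 1 on the padded image.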
Key methods:
Deconvolution layer -> pixelwise prediction -> parameters learned by backpropagation. The layer is implemented by swapping the forward and backward passes of convolution.
3. Upsampling as deconvolution: the upsampling operation can be treated as a deconvolution (transposed convolution). Its parameters, like the CNN parameters, are learned by backpropagation while the FCN model is trained.
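The statement that deconvolution is convolution with the forward and backward passes swapped can be illustrated in one dimension. The sketch below, in plain Python, assumes a 3-tap kernel and stride 2 chosen purely for illustration; `conv1d` and `conv1d_transpose` are hypothetical helper names, not any library's API.

```python
def conv1d(x, k, stride):
    """Strided 'valid' 1D convolution (cross-correlation): the downsampling pass."""
    n_out = (len(x) - len(k)) // stride + 1
    return [sum(k[a] * x[i * stride + a] for a in range(len(k)))
            for i in range(n_out)]

def conv1d_transpose(y, k, stride):
    """Transposed convolution: each input value scatters a scaled copy of k
    into the output, 'stride' positions apart. This is the backward pass of
    conv1d with respect to its input, used here as a forward pass."""
    out = [0.0] * ((len(y) - 1) * stride + len(k))
    for i, v in enumerate(y):
        for a, w in enumerate(k):
            out[i * stride + a] += v * w
    return out

k = [0.5, 1.0, 0.5]          # these particular taps give linear interpolation
y = [1.0, 2.0, 3.0]
up = conv1d_transpose(y, k, stride=2)
print(up)  # [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 1.5]

# Adjoint check: <conv1d(x), y> equals <x, conv1d_transpose(y)>.
x = [1.0, 0.0, 2.0, -1.0, 3.0, 0.5, 4.0]
lhs = sum(a * b for a, b in zip(conv1d(x, k, 2), y))
rhs = sum(a * b for a, b in zip(x, up))
print(lhs, rhs)  # 16.5 16.5
```

Because the kernel weights are ordinary parameters, they can be updated by backpropagation like any other convolution weights, which is what lets the upsampling be learned.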
Fusion prediction
The steps above give a dense prediction from the CNN output, but the authors found in experiments that the resulting segmentation is rather coarse. They therefore bring in detail from earlier layers: the output of the penultimate layer is fused with the final output, in practice by element-wise summation:
This produces the second and third rows, and experiments show that the results are finer and more accurate. Fusing layers further down than the third row makes the results worse, so the authors stop there. The corresponding results are shown in the first three rows:
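The fusion step is an upsample followed by an element-wise sum. A minimal sketch in plain Python for a single class channel, assuming nearest-neighbour 2x upsampling as a stand-in for the learned upsampling layer; the function names and the toy score values are made up for illustration.

```python
def upsample2x_nearest(score):
    """2x upsampling by pixel repetition (stand-in for the learned layer)."""
    out = []
    for row in score:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def fuse(coarse, fine):
    """Upsample the coarse score map 2x and add it element-wise to the finer one."""
    up = upsample2x_nearest(coarse)
    if len(up) != len(fine) or len(up[0]) != len(fine[0]):
        raise ValueError("score maps must match after upsampling")
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(up, fine)]

coarse = [[10, 20]]              # scores from the deepest layer (toy values)
fine = [[1, 2, 3, 4],
        [5, 6, 7, 8]]            # scores predicted from an earlier, finer layer
print(fuse(coarse, fine))        # [[11, 12, 23, 24], [15, 16, 27, 28]]
```

The fused map has the resolution of the finer layer; repeating the same step with a still earlier layer gives the next, finer variant.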
Questions & Solutions
1. How is pixelwise prediction done?
A traditional network subsamples, so its output is smaller than its input. Pixelwise prediction requires the output to keep the input size.
Workaround:
(1) Convert the final fully connected layers of a traditional network, such as AlexNet or VGG, into convolution layers.
For example, the first fully connected layer in VGG16 has a 25088x4096 weight matrix, which is reinterpreted as 4096 convolution kernels of size 512x7x7 (25088 = 512*7*7). If the convolution is then applied to a larger input image (lower half of the figure), the layer that originally output a single 4096-D feature vector (upper half) outputs a coarse feature map instead.
The advantage is that a network already trained with supervised pre-training can be reused. Unlike existing methods that train from scratch, only fine-tuning is needed, so training is efficient.
(2) Add an in-network upsampling layer.
The intermediate feature map is upsampled bilinearly by a deconvolution layer, which is implemented by swapping the forward and backward passes of convolution.
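Bilinear upsampling can be expressed as a deconvolution with a fixed kernel, which is a common way to initialize such a layer. The sketch below builds that kernel in plain Python; the centering rule follows the widely used bilinear-filter recipe, and the function name `bilinear_kernel` is an assumption for illustration, not taken from the notes.

```python
def bilinear_kernel(size):
    """2D bilinear interpolation kernel of the given side length.
    For upsampling by a factor u, a typical choice is size = 2*u - u % 2."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    w1d = [1 - abs(i - center) / factor for i in range(size)]
    # The 2D kernel is the outer product of the 1D weights.
    return [[a * b for b in w1d] for a in w1d]

for row in bilinear_kernel(4):   # kernel for 2x upsampling
    print(row)
# [0.0625, 0.1875, 0.1875, 0.0625]
# [0.1875, 0.5625, 0.5625, 0.1875]
# [0.1875, 0.5625, 0.5625, 0.1875]
# [0.0625, 0.1875, 0.1875, 0.0625]
```

Used as the weights of a stride-2 deconvolution, this kernel reproduces bilinear interpolation; leaving the weights trainable lets the network move away from it during fine-tuning.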
2. How to refine and get better results?
The upsampling stride is 32: for a 3x500x500 input the output is 544x544. The edges are very poor, because the large stride limits the scale of detail in the upsampled output.
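The 544x544 figure can be reproduced with a little size arithmetic. The sketch below is plain Python and rests on several assumptions that are not stated in these notes: a VGG16-style layout with a padding of 100 on the first 3x3 convolution, size-preserving 3x3 convolutions elsewhere, five 2x2 poolings that round up, a 7x7 convolution replacing the first fully connected layer, and a final deconvolution with kernel 64 and stride 32.

```python
import math

def conv_out(n, kernel, pad=0, stride=1):
    """Output size of a convolution along one dimension."""
    return (n + 2 * pad - kernel) // stride + 1

def pool_out(n, kernel=2, stride=2):
    """Output size of pooling that rounds up (assumed behaviour)."""
    return math.ceil((n - kernel) / stride) + 1

def deconv_out(n, kernel, stride):
    """Output size of a deconvolution (transposed convolution)."""
    return (n - 1) * stride + kernel

def fcn32s_output_size(n, first_pad=100):
    n = conv_out(n, 3, pad=first_pad)   # first conv with the large padding
    for _ in range(5):                  # other 3x3 convs keep the size
        n = pool_out(n)
    n = conv_out(n, 7)                  # first FC layer as a 7x7 convolution
    return deconv_out(n, kernel=64, stride=32)

print(fcn32s_output_size(500))  # 544
```

Under these assumptions the coarse score map for a 500x500 input is only 16x16, so each coarse prediction is stretched over a 32-pixel step, which is consistent with the rough edges described above. The 544x544 output would then be cropped back to 500x500.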
Workaround:
Use skip layers: upsample with a smaller stride at a shallower layer, fuse the resulting fine layer with the coarse layer obtained from the deeper layers, and then upsample to get the output.
This approach takes both local and global information into account, which is what the paper calls "combining what and where", and it gives a good performance gain: mean IU rises from 59.4 for FCN-32s to 62.4 for FCN-16s and 62.7 for FCN-8s. The effect is clearly visible.
3. Training Details
Initialize from a trained AlexNet, VGG16, or GoogLeNet model and fine-tune on top of it, updating all layers.
Train on whole images, without patchwise sampling. Experiments show that whole-image training is already effective and efficient.
Initialize the class-score convolution layer to all zeros. Random initialization gives no advantage in performance or convergence.
"Experimental Design"
1. Compare three well-performing CNNs (AlexNet, VGG16, GoogLeNet) and choose VGG16.
2. Compare FCN-32s-fixed, FCN-32s, FCN-16s, and FCN-8s, showing that FCN-8s is the best dense-prediction combination.
3. Compare FCN-8s with the state of the art (R-CNN, SDS): FCN-8s is the best.
4. Compare FCN-16s with some existing work: FCN-16s is the best.
5. FCN-32s and FCN-16s outperform the state of the art on RGB-D and HHA image datasets.
"Summary"
Advantages
1. It trains an end-to-end FCN model and uses the strong learning ability of convolutional neural networks to get more accurate results. Previous CNN-based approaches had to pre-process the input or post-process the output to get the final result.
2. It reuses existing CNNs such as AlexNet, VGG16, and GoogLeNet directly, adding only upsampling at the end. Parameters are still learned with the CNN's own backpropagation, and "whole image training is effective and efficient."
3. It places no limit on input image size and does not require all images in the dataset to be the same size. The final upsampling simply scales back by the factor the original image was subsampled by, so the output is a dense prediction map the same size as the original image.
Defects
The sample outputs shown in the conclusion section of the paper are as follows:
It is easy to see that the method tends to lose detail compared with the ground truth, for example the car in the first picture and the audience in the second picture, so there is still room for improvement.
Results
Of course it's state-of-the-art.
See for yourself:
RCNN Study Notes (8): Fully Convolutional Networks for Semantic Segmentation (FCN)