Today let's look at a classic semantic segmentation network: FCN, whose full name is Fully Convolutional Networks for Semantic Segmentation. The original English paper is at: https://people.eecs.berkeley.edu/~jonlong/long_shelhamer_fcn.pdf
The three authors: Jonathan Long, Evan Shelhamer, and Trevor Darrell.
Below is a link to an excellent FCN blog post by an expert online; reading it, I deeply felt the gap between myself and the experts, but I still have to grit my teeth and get through the paper. I'm posting it here so we can learn together: 47205839
Getting to the point: the authors say at the outset that they propose "fully convolutional" networks, whose defining property is that they take input of any size and produce correspondingly-sized output, with efficient inference and learning. They adapt AlexNet, GoogLeNet, and VGG into fully convolutional networks and fine-tune them by transfer learning. The authors then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer, which allows accurate and detailed segmentation. (Aside: a CNN's strength is that its multi-layer structure can learn features automatically, and at multiple levels: shallow convolutional layers have small receptive fields and learn features of local regions, while deeper convolutional layers have larger receptive fields and learn more abstract features. Those abstract features are less sensitive to the size, position, and orientation of objects, which helps recognition performance; see the receptive-field arithmetic sketched below.)
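As a concrete illustration of that aside, here is the standard receptive-field recurrence (textbook arithmetic, not something from the paper): with $r_0 = 1$,

$$r_l = r_{l-1} + (k_l - 1)\prod_{i=1}^{l-1} s_i,$$

where $k_l$ is the kernel size of layer $l$ and $s_i$ the stride of layer $i$. For example, two stacked 3×3, stride-1 convolutions give $r_2 = 1 + 2 + 2 = 5$, i.e. a 5×5 receptive field.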
The authors argue that convolutional networks have not only improved whole-image classification but also made great progress on structured output tasks with localization (bounding-box object detection, part and keypoint prediction, local correspondence, etc.). FCN does a lot of pioneering work: it is the first to train fully convolutional networks end-to-end for pixelwise prediction and from supervised pre-training, adapting existing networks into fully convolutional form so they produce dense output for input of any size. The authors point out that both learning and inference are performed on whole images at a time by feedforward computation and backpropagation. In-network upsampling layers enable pixelwise prediction, and learning still works in networks with subsampled pooling. The authors note that patchwise training is common, but it lacks the efficiency of fully convolutional training. FCN needs no pre-processing such as region proposals, nor post-hoc refinement by random fields or local classifiers, so it is quite lightweight. The "dense prediction" the authors keep mentioning in the paper is, as far as I can tell, essentially per-pixel prediction. Semantic segmentation faces an inherent tension between location and semantics: global information resolves what, while local information resolves where. Deep feature hierarchies encode location and semantics in a nonlinear local-to-global pyramid, and the skip architecture defined earlier lets the model make better use of this feature spectrum.
The authors propose using networks pre-trained for image classification as supervised pre-training, then fine-tuning them fully convolutionally, so that the whole input image and its ground truth can be learned from efficiently. They combine features across layers to define a nonlinear local-to-global representation that is tuned end-to-end. From the perspective of CNN visualization, the receptive field is the region of the input image that a node on an output feature map responds to. The basic components of the network, convolution, pooling, and activation functions, operate on local input regions and depend only on relative spatial coordinates. A general deep network computes a general nonlinear function, but a network whose layers all have this local form computes a nonlinear filter, which the authors call a deep filter or fully convolutional network. The loss function part is a bit hard to follow. The authors say that a real-valued loss function composed with an FCN defines a task; if that loss is a sum over the spatial dimensions of the final layer (the formula is given below), then its gradient is the sum of the gradients of its spatial components. So stochastic gradient descent on the loss computed over whole images is the same as stochastic gradient descent on the per-pixel loss, taking all of the final-layer receptive fields as a minibatch. When these receptive fields overlap significantly, layer-by-layer feedforward computation and backpropagation over the whole image are much more efficient than computing patch by patch.
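The formula referred to above is, as given in the paper,

$$\ell(\mathbf{x};\theta) = \sum_{ij} \ell'(\mathbf{x}_{ij};\theta),$$

where $\mathbf{x}_{ij}$ is the spatial cell at location $(i, j)$ of the network's final layer.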
The overall process is: first adapt the classification network into a fully convolutional network, which produces a coarse output map; then, since we want pixel-level prediction, this coarse output must be mapped back to the pixels. Before this transformation, networks like AlexNet, VGG, and LeNet take fixed-size input and produce non-spatial output; the culprit behind this limitation is the fully connected layer, which has fixed dimensions and throws away spatial coordinates. The authors' clever move is to view fully connected layers as convolutions whose kernels cover their entire input region. After this transformation, the network accepts input of any size.
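A minimal PyTorch sketch of this "convolutionalization" for VGG16's first two fully connected layers (my own illustration under assumed torchvision names, not the authors' original Caffe code):

```python
import torch
import torch.nn as nn
import torchvision

# Turn VGG16's fully connected layers into convolutions so the
# network accepts inputs of any size (illustrative sketch only).
vgg = torchvision.models.vgg16(weights=None)  # torchvision >= 0.13 API

# fc6: Linear(512*7*7 -> 4096) becomes a 7x7 convolution with 4096 outputs.
fc6 = vgg.classifier[0]
conv6 = nn.Conv2d(512, 4096, kernel_size=7)
conv6.weight.data.copy_(fc6.weight.data.view(4096, 512, 7, 7))
conv6.bias.data.copy_(fc6.bias.data)

# fc7: Linear(4096 -> 4096) becomes a 1x1 convolution.
fc7 = vgg.classifier[3]
conv7 = nn.Conv2d(4096, 4096, kernel_size=1)
conv7.weight.data.copy_(fc7.weight.data.view(4096, 4096, 1, 1))
conv7.bias.data.copy_(fc7.bias.data)

# The result outputs a spatial score map whose size follows the
# input size, instead of one fixed-length vector.
fcn_body = nn.Sequential(vgg.features, conv6, nn.ReLU(inplace=True),
                         conv7, nn.ReLU(inplace=True))
x = torch.randn(1, 3, 512, 512)  # any input size works now
print(fcn_body(x).shape)         # torch.Size([1, 4096, 10, 10])
```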
The score map produced by the fully convolutional network is equivalent to evaluating the original classification network on particular input patches, but the computation over the overlapping patch regions is shared instead of repeated. In FCN, although the output map scales with an input image of any size, the output resolution is typically reduced by subsampling. Classification networks subsample to keep filters small and computational requirements feasible, so the fully convolutional version of such a network produces output that is coarsened, with a pixel stride equal to the product of the strides of its subsampling layers. For example, VGG16 has five stride-2 poolings, so its output map has a pixel stride of 2^5 = 32: a 512×512 input yields only a 16×16 score map.
When I first hit the shift-and-stitch part I didn't understand it; going back to the blog linked above, I learned that shift-and-stitch is one of the authors' schemes for obtaining dense prediction. The authors note that dense predictions can be obtained by stitching together the outputs computed from shifted versions of the input image: if the downsampling factor of the output map is f, shift the input x pixels to the right and y pixels down, once for every (x, y) with 0 ≤ x, y < f. Clearly, after this series of shifts (each shift appears to be one pixel at a time), it amounts to processing f*f inputs (note that the computation grows f*f-fold, but in exchange we get a prediction for every single pixel), and interlacing the outputs makes the predictions correspond to the pixels at the centers of their receptive fields. Next, the authors also consider a layer (convolution or pooling) with input stride s followed by a convolution layer with filter weights f_ij (eliding the irrelevant feature dimensions); setting the lower layer's input stride to 1 upsamples its output by a factor of s. However, convolving the original filter with the upsampled output does not reproduce shift-and-stitch, because the original filter only sees a reduced portion of its (now upsampled) input. To reproduce the trick, the authors rarefy (enlarge) the filter as follows:
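$$f'_{ij} = \begin{cases} f_{i/s,\; j/s} & \text{if } s \text{ divides both } i \text{ and } j,\\ 0 & \text{otherwise,} \end{cases}$$

with i and j indexing the enlarged filter (this is the rarefaction rule as stated in the paper).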
Repeating this filter enlargement layer by layer until all subsampling is removed reproduces the full effect of shift-and-stitch. Simply decreasing the subsampling within the network is one tradeoff: the filters see finer information, but they have smaller receptive fields and computation takes longer. Shift-and-stitch is another tradeoff: the output becomes denser without shrinking the filters' receptive fields, but the filters are prohibited from accessing information at a finer scale than their original design. A naive code sketch of the trick follows.
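Here is a deliberately naive sketch of shift-and-stitch, assuming a hypothetical `net` whose output map is exactly f times smaller than its input; the precise alignment convention between shifts and output offsets is glossed over:

```python
import torch
import torch.nn.functional as F

def shift_and_stitch(net, image, f):
    """Dense prediction via shift-and-stitch (illustrative sketch).

    net:   a model whose output is downsampled by an integer factor f
    image: tensor of shape (1, C, H, W), with H and W divisible by f
    Returns a stitched prediction map of shape (1, K, H, W).
    """
    _, _, H, W = image.shape
    out = None
    for dy in range(f):          # shift down by dy pixels
        for dx in range(f):      # shift right by dx pixels
            shifted = F.pad(image, (dx, 0, dy, 0))[:, :, :H, :W]
            coarse = net(shifted)            # shape (1, K, H//f, W//f)
            if out is None:
                out = coarse.new_zeros(1, coarse.shape[1], H, W)
            # interlace: this shift fills the pixels at offset (dy, dx)
            out[:, :, dy::f, dx::f] = coarse
    return out
```

This makes the f*f cost explicit: the network runs once per shift.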
In the end, the authors use neither of the two techniques above in their model; instead they use upsampling, building on the idea of interpolation. The authors explain that upsampling by a factor of f can be seen as convolution with a fractional input stride of 1/f. So as long as f is an integer, upsampling can be implemented as backwards convolution (deconvolution) with an output stride of f, which simply reverses the forward and backward passes of an ordinary convolution. In-network upsampling can therefore be learned end-to-end by backpropagation from the pixelwise loss. One nice feature of the deconvolution filter is that it need not be fixed (e.g., to bilinear interpolation) but can be learned.
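A sketch of this upsampling as a learnable deconvolution layer, initialized to bilinear interpolation (a common FCN-style recipe; my illustration, not the authors' code):

```python
import torch
import torch.nn as nn

def bilinear_kernel(channels, kernel_size):
    """Weights that make ConvTranspose2d start out as bilinear upsampling."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = torch.arange(kernel_size, dtype=torch.float32)
    filt = 1 - torch.abs(og - center) / factor
    kernel_2d = filt[:, None] * filt[None, :]
    weight = torch.zeros(channels, channels, kernel_size, kernel_size)
    for c in range(channels):
        weight[c, c] = kernel_2d      # one bilinear filter per channel
    return weight

# Upsampling by factor f = 2 as a stride-2 "backwards convolution".
f = 2
up = nn.ConvTranspose2d(21, 21, kernel_size=2 * f, stride=f,
                        padding=f // 2, bias=False)
up.weight.data.copy_(bilinear_kernel(21, 2 * f))
# Because this is an ordinary layer, the kernel does not have to stay
# fixed: it can be learned by backpropagation, as the paper notes.
x = torch.randn(1, 21, 16, 16)
print(up(x).shape)                    # torch.Size([1, 21, 32, 32])
```

(The 21 channels here are just the PASCAL VOC class count used in the paper.)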
The authors adapt the ILSVRC classifiers into FCNs and augment them for dense prediction with in-network upsampling and a pixelwise loss, then train by fine-tuning. As mentioned at the beginning, they add a skip architecture that is learned end-to-end to refine the semantics and spatial precision of the output (finer-scale predictions need fewer layers, so combining fine, shallow layers with coarse, deep layers lets the model make local predictions that respect the global structure).
The network structure the authors define combines predictions from the final layer with predictions from earlier pooling layers, giving the FCN-32s, FCN-16s, and FCN-8s variants.
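A rough PyTorch sketch of the FCN-16s skip combination (my own illustration under assumed layer names and matching spatial sizes; the real network also crops feature maps to align offsets):

```python
import torch
import torch.nn as nn

class FCN16sHead(nn.Module):
    """Sketch of the FCN-16s skip head (illustrative only).

    Expects `pool4` features (stride 16, 512 channels) and `conv7`
    features (stride 32, 4096 channels) from a VGG-style backbone.
    """
    def __init__(self, num_classes=21):
        super().__init__()
        self.score_conv7 = nn.Conv2d(4096, num_classes, 1)  # coarse scores
        self.score_pool4 = nn.Conv2d(512, num_classes, 1)   # finer scores
        self.up2 = nn.ConvTranspose2d(num_classes, num_classes, 4,
                                      stride=2, padding=1, bias=False)
        self.up16 = nn.ConvTranspose2d(num_classes, num_classes, 32,
                                       stride=16, padding=8, bias=False)

    def forward(self, pool4, conv7):
        coarse = self.up2(self.score_conv7(conv7))   # stride 32 -> 16
        fused = coarse + self.score_pool4(pool4)     # fuse shallow detail
        return self.up16(fused)                      # stride 16 -> 1
```

FCN-8s extends this one step further by also fusing pool3 before the final upsampling.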
This concludes the FCN theory, and a detailed description of the network structure will appear in the code.
FCN: Fully Convolutional Networks for Semantic Segmentation