Paper study: Fully Convolutional Networks for Semantic Segmentation

Published in 2015, "Fully Convolutional Networks for Semantic Segmentation" (FCN) is an important paper in the field of semantic image segmentation.

1 CNN and FCN

Typically, a CNN attaches several fully connected layers after the convolutional layers, mapping the feature maps produced by the convolutions into a fixed-length feature vector. The classic CNN architectures, represented by AlexNet, are suited to image-level classification and regression tasks, because they ultimately expect a single numerical description (a probability) of the entire input image; for example, the AlexNet ImageNet model outputs a 1000-dimensional vector representing the probability that the input image belongs to each class (after softmax normalization).

For example: feed a cat image into AlexNet and you get a 1000-dimensional output vector indicating the probability that the input image belongs to each class; among these, the probability for the class "tabby cat" is the highest.
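As a concrete illustration, here is a minimal sketch of this image-level classification, assuming a recent torchvision is installed and that "cat.jpg" is a local image file (a hypothetical path):

```python
# Image-level classification with AlexNet: one 1000-way probability
# vector for the whole image, not a per-pixel prediction.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)  # (1, 3, 224, 224)
with torch.no_grad():
    probs = torch.softmax(model(img), dim=1)  # (1, 1000), softmax-normalized
print(probs.argmax(dim=1))  # index of the most likely class, e.g. "tabby cat"
```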

FCN performs pixel-level classification of images, which solves the problem of semantic segmentation. Unlike the classic CNN, which uses fully connected layers after the convolutional layers to classify a fixed-length feature vector (fully connected layer + softmax output), the FCN can accept input images of arbitrary size and uses a deconvolution layer to upsample the feature map of the last convolutional layer, restoring it to the same size as the input image. This makes it possible to produce one prediction for every pixel while preserving the spatial information of the original input image; the pixels of the upsampled feature map are then classified one by one.

Finally, the softmax classification loss is computed pixel by pixel, which corresponds to one training sample per pixel. The figure shows the structure of the fully convolutional network (FCN) used for semantic segmentation:

To put it simply, the difference between FCN and CNN is that the last fully connected layers of the CNN are replaced by convolutional layers, and the output is an image that has been labeled pixel by pixel.

The power of CNNs lies in their multilayer structure, which automatically learns features at multiple levels: shallow convolutional layers have small receptive fields and learn features of local regions, while deeper convolutional layers have larger receptive fields and learn more abstract features. These abstract features are less sensitive to the size, position, and orientation of objects, which helps improve recognition performance. For a CNN classification network:

These abstract features are useful for classification and can be used to determine what kinds of objects an image contains. At the same time, however, because some object details are lost, they cannot give the exact outline of an object or say which object each pixel belongs to, so accurate segmentation becomes very difficult.

The traditional CNN-based segmentation method:

To classify a pixel, an image patch around that pixel is used as the input to the CNN for training and prediction (a sketch of this patchwise loop follows the list of drawbacks below).

There are several drawbacks to this approach:

One is that the storage overhead is very high. For example, if the image patch used for each pixel is 15x15 and the window slides pixel by pixel, each window position must be fed to the CNN for classification, so the required storage grows sharply with the number and size of the sliding windows.

The second is that the computation is inefficient. Adjacent patches are largely duplicates of one another, so computing the convolutions patch by patch repeats most of the work.

The third is that the patch size limits the receptive field. The patch is usually much smaller than the whole image, so only local features can be extracted, which limits classification performance.
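Here is a minimal sketch of that patchwise approach, assuming `classify_patch` is some already-trained CNN classifier (hypothetical). It makes the redundancy visible: neighboring 15x15 crops overlap almost entirely, yet each gets its own full forward pass.

```python
# Naive patchwise segmentation: one CNN evaluation per pixel.
import numpy as np

def patchwise_segment(image: np.ndarray, classify_patch, patch: int = 15):
    r = patch // 2
    # Pad so every pixel, including border pixels, has a full patch around it.
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="reflect")
    h, w = image.shape[:2]
    labels = np.zeros((h, w), dtype=np.int64)
    for y in range(h):
        for x in range(w):
            crop = padded[y:y + patch, x:x + patch]  # 15x15 block around (y, x)
            labels[y, x] = classify_patch(crop)      # one full forward pass per pixel
    return labels
```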

The fully convolutional network (FCN) recovers the category each pixel belongs to from the abstract features; that is, it extends classification from the image level to the pixel level.

Fully connected layers and convolutional layers

The only difference between a fully connected layer and a convolutional layer is that the neurons in a convolutional layer are connected only to a local region of the input, and neurons in the same convolutional column share parameters. However, in both types of layers the neurons compute dot products, so their functional form is the same, and the two can be converted into each other.

(1) For any convolutional layer, there exists a fully connected layer that implements the same forward propagation. Its weight matrix is a huge matrix that is zero everywhere except in certain blocks, and within many of these blocks the elements are equal (due to weight sharing).

(2) Conversely, any fully connected layer can be converted into a convolution layer.

For example, take a fully connected layer with K=4096 whose input volume has size 7x7x512. This fully connected layer can be equivalently regarded as a convolutional layer with F=7, P=0, S=1, K=4096.

In other words, the filter size is set to match the size of the input volume.

Because only a single depth column covers the input volume as the filter slides, the output becomes 1x1x4096, which is the same result as using the original fully connected layer.

Converting fully connected layers into convolutional layers:

Of the two transformations, converting fully connected layers into convolutional layers is the more useful in practice. Suppose the input to a convolutional network is a 224x224x3 image, and a series of convolutional and downsampling layers turns it into an activation volume of size 7x7x512. AlexNet then uses two fully connected layers of size 4096, and a final fully connected layer with 1000 neurons to compute the class scores. We can convert each of these 3 fully connected layers into a convolutional layer:

(1) For the first fully connected layer, whose input region is [7x7x512], set the filter size to F=7, so that the output volume is [1x1x4096].

(2) For the second fully connected layer, set the filter size to F=1, so that the output volume is [1x1x4096].

(3) Similarly, for the last fully connected layer, set F=1; the final output is [1x1x1000].

In practice, each of these transformations requires reshaping the weight matrix W of the fully connected layer into the filters of a convolutional layer. What is the benefit of such a transformation? It is more efficient in the following situation: sliding the convolutional network over a larger input image to obtain multiple outputs, all in a single forward pass.
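A minimal sketch of this weight reshaping, assuming PyTorch: the fully connected layer expects a flattened 7x7x512 input, and the equivalent convolution has F=7, P=0, S=1, K=4096 with the same weights viewed as filters.

```python
# "Convolutionalizing" a fully connected layer and checking equivalence.
import torch
import torch.nn as nn

fc = nn.Linear(7 * 7 * 512, 4096)
conv = nn.Conv2d(512, 4096, kernel_size=7, stride=1, padding=0)

# Reshape the FC weight matrix (4096, 7*7*512) into conv filters (4096, 512, 7, 7).
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(4096, 512, 7, 7))
    conv.bias.copy_(fc.bias)

x = torch.randn(1, 512, 7, 7)
out_fc = fc(x.flatten(1))      # shape (1, 4096)
out_conv = conv(x)             # shape (1, 4096, 1, 1): a single depth column
print(torch.allclose(out_fc, out_conv.flatten(1), atol=1e-5))  # True
```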

For example:

Suppose we want to slide a 224x224 window with a stride of 32 over a 384x384 image, feeding each stop of the window into the convolutional network and finally obtaining class scores at 6x6 locations. Converting the fully connected layers into convolutional layers makes this easy.

If a 224x224 input image yields a [7x7x512] volume after the convolutional and downsampling layers, then a 384x384 image passed through the same convolutional and downsampling layers directly yields a [12x12x512] volume. Passing this through the 3 convolutional layers converted from the 3 fully connected layers above produces an output of [6x6x1000] (since (12 - 7)/1 + 1 = 6). This result is exactly the 6x6 grid of class scores from the window at each of its stops:

When facing a 384x384 image, letting the original convolutional network (including the fully connected layers) independently evaluate 224x224 blocks of the image at a 32-pixel stride gives the same result as running the network whose fully connected layers have been converted into convolutional layers once over the whole image.
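To make the arithmetic concrete, here is a sketch of the converted head applied densely, assuming the three FC layers have been turned into convolutions as above (the name `head` and the random feature map are illustrative placeholders):

```python
# Dense sliding-window evaluation via the convolutionalized head.
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.Conv2d(512, 4096, kernel_size=7),   # converted FC6: (12-7)/1+1 = 6
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 4096, kernel_size=1),  # converted FC7
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 1000, kernel_size=1),  # converted FC8
)

features_384 = torch.randn(1, 512, 12, 12)  # pool5 output for a 384x384 input
scores = head(features_384)
print(scores.shape)  # torch.Size([1, 1000, 6, 6]): one 1000-way score per window stop
```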

As shown, FCN converts the fully connected layers of the traditional CNN into convolutional layers; for the CNN above, FCN converts the last three fully connected layers into three convolutional layers. In the traditional CNN structure, the first 5 layers are convolutional layers, the 6th and 7th layers are each a one-dimensional vector of length 4096, and the 8th layer is a one-dimensional vector of length 1000, corresponding to the probabilities of 1000 different classes. FCN represents these 3 layers as convolutional layers whose kernel sizes (channels, width, height) are (4096, 1, 1), (4096, 1, 1), and (1000, 1, 1) respectively. The numbers look unchanged, but convolution and full connection are different concepts and computations: the layers reuse the weights and biases that the CNN trained before, except that each weight and bias now belongs to a convolution kernel with its own local extent. Because all the layers in the network are convolutional, it is called a fully convolutional network.

The figure shows the fully convolutional network; what differs from the CNN figure are the size subscripts attached to the images. The CNN input image is uniformly resized to 227x227; after the first pooling layer the map is 55x55, after the second pooling layer 27x27, and after the fifth pooling layer 13x13. The FCN input image is of size HxW; after the first pooling layer it becomes 1/4 of the original size, after the second 1/8, after the fifth 1/16, and after the eighth 1/32 (errata: in the actual code the first layer gives 1/2, and so on).

After multiple convolutions and poolings, the resulting feature maps become smaller and smaller and lower in resolution. When the map reaches H/32 x W/32, the smallest layer, the resulting map is called the heatmap. The heatmap is our most important high-dimensional feature map. After obtaining this high-dimensional heatmap, the most important and final step is to upsample it, enlarging it back to the size of the original image.

The final output is 1000 heatmaps, upsampled to the size of the original image, so that a class can be predicted for each pixel of the final semantic segmentation. There is a small trick here: at each pixel position, take the maximum numerical description (probability) across the 1000 maps, pixel by pixel, as that pixel's class. This yields a fully classified image, such as the dog and cat example on the right.
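A minimal sketch of this final prediction step, assuming a coarse heatmap of shape (1, num_classes, H/32, W/32); bilinear upsampling here stands in for the paper's learned deconvolution, and the per-pixel argmax implements the "take the maximum across maps" trick:

```python
# Upsample the coarse heatmap and pick the most likely class per pixel.
import torch
import torch.nn.functional as F

num_classes, H, W = 1000, 224, 224        # sizes matching the text above
heatmap = torch.randn(1, num_classes, H // 32, W // 32)

upsampled = F.interpolate(heatmap, size=(H, W), mode="bilinear", align_corners=False)
labels = upsampled.argmax(dim=1)          # shape (1, H, W): one class index per pixel
```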

Upsampling

Performing one forward pass with the converted convolutional network is much more efficient than running the original network before conversion over all 36 locations, because the 36 computations share their intermediate results instead of being repeated. This technique is often used in practice to get better results: an image is typically resized larger, the converted convolutional network is used to evaluate class scores at many different spatial positions, and the average of these scores is taken.

Finally, what if we want to use a sliding window with a stride of less than 32? This can be solved with multiple forward passes. For example, to use a window with a stride of 16: first propagate the original image forward through the converted convolutional network; then translate the original image by 16 pixels along the width, along the height, and along both width and height, and pass each translated image through the network as well.
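A sketch of this shifted-input idea, assuming `net` is the converted fully convolutional network with an output stride of 32 (hypothetical); interleaving the four coarse score grids yields an effective stride of 16:

```python
# Four shifted forward passes for an effective output stride of 16.
import torch
import torch.nn.functional as F

def stride16_scores(net, image):
    outputs = []
    for dy in (0, 16):
        for dx in (0, 16):
            # Translate the input by (dy, dx) pixels; pad on the top/left and
            # crop on the bottom/right so the size stays constant.
            shifted = F.pad(image, (dx, 0, dy, 0))[..., :image.shape[-2], :image.shape[-1]]
            outputs.append(net(shifted))
    return outputs  # interleave these grids to form the stride-16 score map
```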

As shown, as the image is processed through the network into a smaller and smaller map, its features become more salient, like the colors in the figure. Of course, the last layer is not actually a 1-pixel image but a map of size H/32 x W/32 of the original image; it is drawn as a single pixel here for simplicity.

As shown, the original image goes through the first convolution conv1 and pool1, after which it is reduced to 1/2 of its original size; after the second convolution conv2 and pool2 it is reduced to 1/4; the third convolution conv3 and pool3 reduce it to 1/8, at which point the pool3 feature map is retained; the fourth convolution conv4 and pool4 reduce it to 1/16, and the pool4 feature map is retained; finally, the fifth convolution conv5 and pool5 reduce it to 1/32. The fully connected layers of the original CNN are then turned into the convolution operations conv6 and conv7; the number of channels in the feature map changes, but its size remains 1/32 of the original, at which point the map is no longer called a feature map but a heatmap.

Now we have a 1/32-size heatmap, a 1/16-size feature map, and a 1/8-size feature map. Upsampling the 1/32-size heatmap alone would restore an image that carries only the features of the conv5 kernels, and this is not accurate enough to recover the finer details of the image. So we work backward: the features retained from pool4 are used, via deconvolution, to supplement details in the once-upsampled map (equivalent to an interpolation step), and then the features retained from pool3 supplement details in the map after it is upsampled again, finally completing the restoration of the whole image.
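A minimal sketch of this skip-connection refinement, assuming the score maps at 1/32, 1/16, and 1/8 resolution have already been produced from pool5, pool4, and pool3 by 1x1 convolutions (the names and shapes here are illustrative placeholders, with C=21 classes as in PASCAL VOC):

```python
# FCN-8s-style fusion: upsample 2x, add skip features, repeat, then 8x up.
import torch
import torch.nn as nn

C = 21  # number of classes

up2_a = nn.ConvTranspose2d(C, C, kernel_size=4, stride=2, padding=1)   # 1/32 -> 1/16
up2_b = nn.ConvTranspose2d(C, C, kernel_size=4, stride=2, padding=1)   # 1/16 -> 1/8
up8   = nn.ConvTranspose2d(C, C, kernel_size=16, stride=8, padding=4)  # 1/8  -> 1/1

score32 = torch.randn(1, C, 7, 7)     # scores from the 1/32 heatmap (224x224 input)
score16 = torch.randn(1, C, 14, 14)   # scores from the retained pool4 features
score8  = torch.randn(1, C, 28, 28)   # scores from the retained pool3 features

fused16 = up2_a(score32) + score16    # add pool4 details at 1/16 resolution
fused8  = up2_b(fused16) + score8     # add pool3 details at 1/8 resolution
full    = up8(fused8)                 # final 8x upsampling to input resolution
print(full.shape)                     # torch.Size([1, 21, 224, 224])
```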

Disadvantages of FCN:

(1) The results are still not fine-grained. Upsampling by 8x is much better than by 32x, but the upsampled result is still rather blurry and smooth, and insensitive to details in the image.

(2) Pixels are classified individually, without fully considering the relationships between pixels. The spatial regularization step used in the usual pixel-based segmentation methods is skipped, so the result lacks spatial consistency.

Thanks to the original post: https://www.cnblogs.com/gujianhan/p/6030639.html
