Interpretation of the Semantic Segmentation Paper: Pyramid Scene Parsing Network (PSPNet)


Pyramid Scene Parsing Network

Published in: CVPR 2017 (IEEE Conference on Computer Vision and Pattern Recognition)

Original paper: PSPNet

Code: PSPNet-GitHub (Keras / TensorFlow)

Example results:

Abstract

The pyramid pooling module presented in this paper aggregates contextual information from different regions and improves the network's ability to capture global information. Experiments show that this kind of prior representation (that is, the PSP structure) is effective and achieves good results on multiple datasets.

Introduction

The difficulty of scene parsing is closely related to the labels in the scene. Most state-of-the-art scene parsing frameworks are based on FCN, but FCN has several problems:

- Mismatched relationship: contextual relationships are important for understanding complex scenes. For example, in the first row of the figure above, the large object on the water is most likely a "boat" rather than a "car", even though the two look similar; FCN lacks the ability to reason from context.
- Confusion categories: many labels are related, and the relationship between them can be exploited. In the second row of the figure, part of a skyscraper is identified as "building"; the prediction should be one class or the other, not both. The relationship between categories can compensate for this.
- Inconspicuous classes: the model may ignore small objects, while very large objects can exceed FCN's receptive field, leading to discontinuous predictions. In the third row above, the pillow and the quilt have similar texture and are recognized as one object. To improve the segmentation of inconspicuous things, attention should be paid to small-region objects.

Summarizing these cases, many of FCN's problems come from not handling the relationship between the scene and global information effectively. This paper proposes PSPNet, a deep network that captures the global scene context and fuses local and global cues through appropriate global features. The paper also proposes an effective optimization strategy based on a deeply supervised loss, and the method performs excellently on many datasets.

The main contributions of this paper are as follows:

- A pyramid scene parsing network is proposed that embeds hard-to-parse scene context features into an FCN-based prediction framework.
- An effective optimization strategy based on a deeply supervised loss for ResNet is proposed.
- A practical system for scene parsing and semantic segmentation is built, with implementation details included.

Related Work

Driven by deep neural networks, scene parsing and semantic segmentation have made great progress, with work such as FCN and ENet. Many deep convolutional networks use dilated (atrous) convolution and coarse-to-fine structures to enlarge the receptive field of high-level features. Based on this prior work, the baseline chosen here is an FCN with a dilated network.

Most semantic segmentation models build on two ideas. On the one hand, multi-scale feature fusion: high-level features carry strong semantic information, while low-level features contain more detail. On the other hand, structured prediction: for example, a CRF (conditional random field) is used as a back-end to refine the segmentation result.

In order to make full use of global feature-level prior knowledge for understanding different scenes, the PSP module proposed in this paper aggregates information from different regions to obtain the global context.

Architecture

Pyramid Pooling Module

As mentioned above, a major contribution of this paper is the PSP module.

In general, the receptive field of a CNN can be roughly regarded as how much context it uses. The paper points out that many networks do not make sufficient use of global information, so their results suffer. A common remedy is global average pooling, but on some datasets it can lose spatial relationships and cause ambiguity. Pyramid pooling, by contrast, produces features at different levels that are finally flattened and fed into an FC layer for classification; this removes the fixed-input-size constraint of CNN image classification and reduces the loss of information between different regions.

This paper presents a hierarchical global prior containing information at different scales from different sub-regions, called the pyramid pooling module.

The module fuses features at 4 different pyramid scales. The first row (red) is the coarsest level: global pooling, which produces a single-bin output. The following three rows are pooled features at different scales. To keep the weight of the global feature, if the pyramid has N levels, a 1x1 convolution is applied after each level to reduce the number of channels of that level to 1/N of the original. The low-dimensional maps are then upsampled to the original feature-map size by bilinear interpolation and finally concatenated.

The pooling kernel size at each pyramid level can be configured and is related to the input fed into the pyramid. The paper uses 4 levels, with kernel sizes of 1x1, 2x2, 3x3, and 6x6 respectively.
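
The paper's released code is in Caffe, but as a rough illustration a minimal PyTorch-style sketch of such a pyramid pooling module could look like the following. The class name, the use of average pooling per bin, and the BatchNorm + ReLU after the 1x1 convolution are assumptions for illustration, not the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPoolingModule(nn.Module):
    """Illustrative sketch of a PSP-style pyramid pooling module."""

    def __init__(self, in_channels, bin_sizes=(1, 2, 3, 6)):
        super().__init__()
        # One branch per pyramid level: pool to an SxS grid, then a 1x1 conv
        # that reduces the channels of that level to in_channels / N.
        out_channels = in_channels // len(bin_sizes)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(s),
                nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
            )
            for s in bin_sizes
        ])

    def forward(self, x):
        h, w = x.shape[2:]
        # Upsample each pooled level back to the input feature-map size by
        # bilinear interpolation and concatenate with the original feature map.
        levels = [x]
        for branch in self.branches:
            levels.append(
                F.interpolate(branch(x), size=(h, w),
                              mode="bilinear", align_corners=True)
            )
        return torch.cat(levels, dim=1)
```

With a 2048-channel input (e.g. the last ResNet feature map) and 4 levels, each branch outputs 512 channels, so the concatenated output has 2048 + 4 x 512 = 4096 channels.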

Overall Architecture

On the basis of the PSP module, PSPNet's overall architecture is as follows:

A pre-trained model (ResNet101) with the dilated convolution strategy is used to extract the feature map; the extracted feature map is 1/8 of the input image size. This feature map goes through the pyramid pooling module to obtain the fused feature with global information, which, after upsampling, is concatenated with the feature map from before pooling. A final convolution layer then produces the output.
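
Continuing the sketch above, the overall flow could be wired up roughly as follows. The 512 intermediate channels, the dropout, and the 150-class output (the ADE20K class count) are illustrative assumptions, not a faithful reproduction of the released model; the dilated ResNet101 backbone that produces the 1/8-resolution feature map is omitted:

```python
import torch.nn as nn
import torch.nn.functional as F

class PSPHead(nn.Module):
    """Illustrative PSPNet-style head: PPM -> 3x3 conv -> classifier -> upsample."""

    def __init__(self, in_channels=2048, num_classes=150):
        super().__init__()
        self.ppm = PyramidPoolingModule(in_channels)  # sketch from the previous section
        self.fuse = nn.Sequential(
            nn.Conv2d(in_channels * 2, 512, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            nn.Dropout2d(0.1),
        )
        self.classifier = nn.Conv2d(512, num_classes, kernel_size=1)

    def forward(self, feat, input_size):
        # feat: backbone feature map at 1/8 of the input resolution
        x = self.fuse(self.ppm(feat))
        logits = self.classifier(x)
        # Bilinear upsampling back to the original image size.
        return F.interpolate(logits, size=input_size,
                             mode="bilinear", align_corners=True)
```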

PSPNet itself provides a prior for the global context (that is, the structure of the pyramid pooling module), and the subsequent experiments verify the validity of this structure.

A Deeply Supervised Network Based on ResNet

The paper uses a rather "metaphysical" method to train the base network, shown in the following figure:

The improvement is made on top of ResNet101: in addition to the main softmax classification loss at the end, an auxiliary loss is added in the fourth stage. The two losses are backpropagated together with different weights and jointly optimize the parameters. Subsequent experiments show that this helps the network converge quickly.
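
A minimal sketch of how the two losses might be combined during training is shown below. The weight of 0.4 for the auxiliary branch appears in the training details later; the function itself and the `ignore_index` value are illustrative assumptions:

```python
import torch.nn.functional as F

def pspnet_loss(main_logits, aux_logits, target, aux_weight=0.4):
    """Combine the main softmax loss with the auxiliary loss (illustrative)."""
    main_loss = F.cross_entropy(main_logits, target, ignore_index=255)
    aux_loss = F.cross_entropy(aux_logits, target, ignore_index=255)
    # Both losses are backpropagated together; the auxiliary one is down-weighted.
    return main_loss + aux_weight * aux_loss
```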

Experiment

Experiments are done on three datasets: ImageNet Scene Parsing Challenge 2016, PASCAL VOC 2012, and Cityscapes.

Training Details:

- Learning rate: the "poly" policy is used, i.e. $lr = lr_{base} \times (1 - \frac{iter}{max_{iter}})^{power}$, with $lr_{base} = 0.01$ and $power = 0.9$; momentum is set to 0.9 and weight decay to 0.0001 (see the sketch after this list).
- Number of iterations: ImageNet 150K, PASCAL VOC 30K, Cityscapes 90K.
- Data augmentation: random flipping, random scaling between 0.5 and 2, random rotation between -10 and 10 degrees, and random Gaussian blur.
- Batch size: the batch size matters a lot; it is set to 16 (which uses a lot of memory).
- Auxiliary branch: the auxiliary loss weight is set to 0.4.
- Platform: Caffe.
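
As referenced in the learning-rate item above, a small helper sketch of the "poly" schedule could look like this; the formula and constants come from the paper, while the function itself is just an illustration:

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """'poly' policy: lr = base_lr * (1 - iter / max_iter) ** power."""
    return base_lr * (1 - cur_iter / max_iter) ** power

# With the paper's settings (base_lr = 0.01, power = 0.9) on Cityscapes (90K iterations):
# poly_lr(0.01, 0, 90000)      -> 0.01
# poly_lr(0.01, 45000, 90000)  -> ~0.0054
# poly_lr(0.01, 90000, 90000)  -> 0.0
```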
ImageNet Scene Parsing Challenge 2016

The performance of ResNet50 is tested under different settings to find a better training configuration:

- ResNet50-Baseline: ResNet50-based FCN structure with dilated convolution (the baseline)
- ResNet50+B1+MAX: with a single 1x1-bin max pooling
- ResNet50+B1+AVE: with a single 1x1-bin average pooling
- ResNet50+B1236+MAX: max pooling with 1x1, 2x2, 3x3 and 6x6 bins
- ResNet50+B1236+AVE: average pooling with 1x1, 2x2, 3x3 and 6x6 bins
- ResNet50+B1236+MAX+DR: max pooling with 1x1, 2x2, 3x3 and 6x6 bins, followed by channel dimension reduction after pooling
- ResNet50+B1236+AVE+DR (best): average pooling with 1x1, 2x2, 3x3 and 6x6 bins, followed by channel dimension reduction after pooling
