PSPNet
Pyramid Scene Parsing Network
Published in: CVPR 2017 (IEEE Conference on Computer Vision and Pattern Recognition)
Original paper: PSPNet
Code: PSPNet-GitHub Keras TensorFlow
Example results:
Abstract
The pyramid pooling module presented in this paper aggregates contextual information from different regions, improving the network's ability to capture global information. Experiments show that this prior representation (i.e., the PSP structure) is effective and achieves good results on multiple datasets.
Introduction
The difficulty of scene parsing is closely tied to the diversity of scene labels. Most state-of-the-art scene parsing frameworks are based on FCN, but FCN has several problems:
Mismatched relationship: contextual matching is important for understanding complex scenes. For example, in the first row of the figure above, the large object on the water is more likely a "boat" than a "car", even though "boat" and "car" look similar; FCN lacks the ability to infer from context.
Confusion categories: many labels are related, and such errors can be corrected using the relationships between categories. In the second row of the figure, part of a skyscraper is identified as a building; it should be one or the other, not both.
Inconspicuous classes: a model may ignore small objects, while very large objects can exceed FCN's receptive field, leading to discontinuous predictions. In the third row, the pillow has the same material as the quilt and is recognized together with it. To improve the segmentation of inconspicuous things, attention should be paid to small-area objects.
Summing up these cases, many of FCN's failures come from not effectively handling the relationships within a scene and its global information. This paper proposes PSPNet, a deep network that captures the global scene prior and fuses local and global information together with appropriate global features. The paper also proposes an optimization strategy with a moderate deep supervision loss, which performs excellently on many datasets.
The main contributions of this paper are: a pyramid scene parsing network that embeds hard-to-parse scene context features into an FCN-based prediction framework; an effective optimization strategy based on a deeply supervised loss for ResNet; and a practical system for scene parsing and semantic segmentation, with implementation details included.
Related Work
Driven by deep neural networks, scene parsing and semantic segmentation have made great progress, as in FCN, ENet, and other work. Many deep convolutional networks use dilated convolution (atrous convolution) and coarse-to-fine structures to enlarge the receptive field of high-level features. Building on previous work, the chosen baseline is an FCN with a dilated network.
Most semantic segmentation models build on two ideas. One is multi-scale feature fusion: high-level features carry strong semantic information, while low-level features contain more detail. The other is structured prediction, for example using a CRF (conditional random field) as a back end to refine the segmentation result.
To make full use of global feature-level prior knowledge for understanding different scenes, the PSP module proposed in this paper aggregates different regions to obtain global context.
Architecture
Pyramid Pooling Module
As mentioned above, a major contribution of this paper is the PSP module.
In general, a CNN's receptive field can be roughly taken as the amount of context it uses. The paper points out that many networks do not make sufficient use of global information, so their results suffer. A common remedy is global average pooling, but on some datasets it can lose spatial relationships and cause ambiguity. Pyramid pooling instead produces features at different levels, which are finally flattened and fed into an FC layer for classification. This removes the fixed-input-size constraint of CNN image classification and reduces the information loss between different regions.
This paper presents a hierarchical global prior that contains information at different scales across different sub-regions, called the pyramid pooling module.
The module fuses features at 4 pyramid scales. The first row (red) is the coarsest: global pooling produces a single-bin output. The following three rows pool at progressively finer scales. To preserve the weight of the global feature, if the pyramid has N levels, a 1x1 convolution is applied after each level to reduce its channel count to 1/N of the original. Each level is then upsampled back to the original size by bilinear interpolation, and finally all levels are concatenated.
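The pool-reduce-upsample-concat pipeline described above can be sketched in NumPy. This is a shape-level sketch only: the random matrix stands in for the learned 1x1 convolution, and nearest-neighbour upsampling replaces the bilinear interpolation the paper uses.

```python
import numpy as np

def adaptive_avg_pool(x, bins):
    """Average-pool an (H, W, C) feature map into a (bins, bins, C) grid."""
    H, W, C = x.shape
    out = np.zeros((bins, bins, C))
    for i in range(bins):
        for j in range(bins):
            h0, h1 = (i * H) // bins, -(-(i + 1) * H // bins)   # ceil division
            w0, w1 = (j * W) // bins, -(-(j + 1) * W // bins)
            out[i, j] = x[h0:h1, w0:w1].mean(axis=(0, 1))
    return out

def pyramid_pooling(x, bin_sizes=(1, 2, 3, 6), seed=0):
    """Concatenate the input map with N pooled, reduced, upsampled branches."""
    H, W, C = x.shape
    n = len(bin_sizes)
    rng = np.random.default_rng(seed)
    branches = [x]
    for b in bin_sizes:
        pooled = adaptive_avg_pool(x, b)                    # (b, b, C)
        w = rng.standard_normal((C, C // n)) / np.sqrt(C)   # stand-in for a learned 1x1 conv
        reduced = pooled @ w                                # channels reduced to C/N
        kh, kw = -(-H // b), -(-W // b)                     # nearest-neighbour upsample
        up = reduced.repeat(kh, axis=0)[:H]                 # (paper uses bilinear)
        up = up.repeat(kw, axis=1)[:, :W]
        branches.append(up)
    return np.concatenate(branches, axis=-1)                # (H, W, C + N * C/N) = (H, W, 2C)
```

Because each of the N branches carries C/N channels, the concatenated output has exactly twice the input channels, matching the module diagram.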
The pooling kernel size at each pyramid level is configurable and depends on the input fed into the pyramid. The paper uses 4 levels with kernel sizes of 1x1, 2x2, 3x3, and 6x6, respectively.
Overall Architecture
On the basis of the PSP module, PSPNet's overall architecture is as follows:
A pre-trained model (ResNet101) with the dilated convolution strategy extracts the feature map; the extracted feature map is 1/8 the size of the input image. The feature map passes through the pyramid pooling module to obtain the fused features carrying global information, which are upsampled and concatenated with the pre-pooling feature map. A final convolution layer then produces the output.
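The forward pass above can be summarized as a shape walk-through. A sketch under assumptions: `channels=2048` is the usual ResNet101 output width and the 480x480 input in the usage note is an arbitrary example, neither of which is stated in this summary.

```python
def pspnet_shapes(h, w, channels=2048, levels=4):
    """Track tensor shapes through the PSPNet forward pass."""
    # dilated ResNet101 backbone keeps the feature map at 1/8 resolution
    fh, fw = h // 8, w // 8
    # pyramid pooling: concat the backbone map with `levels` branches,
    # each reduced to channels // levels -> the fused map has 2x channels
    fused_c = channels + levels * (channels // levels)
    # a final conv produces per-class scores, upsampled back to input size
    return {"backbone": (fh, fw, channels),
            "fused": (fh, fw, fused_c),
            "output": (h, w)}
```

For a hypothetical 480x480 input this gives a 60x60x2048 backbone map, a 60x60x4096 fused map, and a 480x480 prediction.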
PSPNet itself provides a prior on the global context (namely the structure of the pyramid pooling module), and the subsequent experiments verify the validity of this structure.
A Deeply Supervised Network Based on ResNet
The paper uses a rather "black-magic" trick to train the base network, shown in the figure below:
The network improves on ResNet101: in addition to the final softmax classification loss, an auxiliary loss is added after the fourth stage. The two losses are back-propagated together with different weights to jointly optimize the parameters. Subsequent experiments show that this helps the network converge quickly.
Experiment
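The two-loss setup can be sketched as follows (a minimal single-prediction sketch; the helper names are mine, and the 0.4 weight is the value given in the training table below):

```python
import math

def softmax_ce(logits, label):
    """Cross-entropy of one prediction from raw logits (log-sum-exp form)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]

def deeply_supervised_loss(main_logits, aux_logits, label, aux_weight=0.4):
    """Final softmax loss plus the stage-4 auxiliary loss, weighted by 0.4."""
    return softmax_ce(main_logits, label) + aux_weight * softmax_ce(aux_logits, label)
```

Both terms produce gradients during training; only the main branch is used at test time.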
The experiments were conducted on three datasets: the ImageNet Scene Parsing Challenge 2016, PASCAL VOC 2012, and Cityscapes.
Training Details:
| Item | Setting |
| --- | --- |
| Learning rate | "poly" policy, i.e., lr = lr_base * (1 - iter/max_iter)^power, with lr_base = 0.01 and power = 0.9; momentum is set to 0.9 and weight decay to 0.0001 |
| Iterations | ImageNet: 150K; PASCAL VOC: 30K; Cityscapes: 90K |
| Data augmentation | random flip, scaling from 0.5 to 2, rotation between -10 and 10 degrees, random Gaussian blur |
| Batch size | batch size matters a lot; batch = 16 is used (very memory-hungry!) |
| Training branch network | auxiliary loss weight set to 0.4 |
| Platform | Caffe |
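The "poly" learning-rate policy from the training settings is a one-liner:

```python
def poly_lr(iteration, max_iter, base_lr=0.01, power=0.9):
    """'poly' schedule from the paper: lr = base_lr * (1 - iter/max_iter)^power."""
    return base_lr * (1.0 - iteration / max_iter) ** power
```

The rate starts at `base_lr` and decays smoothly to 0 at `max_iter`; with power = 0.9 the curve is close to linear but slightly concave.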
ImageNet Scene Parsing Challenge 2016
ResNet's performance is tested under different configurations to find a better training model:
ResNet50-Baseline: a dilated-convolution FCN based on ResNet50
ResNet50+B1+max: baseline plus max pooling with a single 1x1 bin
ResNet50+B1+ave: baseline plus average pooling with a single 1x1 bin
ResNet50+B1236+max: max pooling with 1x1, 2x2, 3x3, and 6x6 bins
ResNet50+B1236+ave: average pooling with 1x1, 2x2, 3x3, and 6x6 bins
ResNet50+B1236+max+DR: max pooling with 1x1, 2x2, 3x3, and 6x6 bins, with channel dimension reduction after pooling
ResNet50+B1236+ave+DR (best): average pooling with 1x1, 2x2, 3x3, and 6x6 bins, with channel dimension reduction after pooling