Original paper (English): https://arxiv.org/abs/1511.00561
SegNet is another classic image segmentation network. As the paper's title indicates, it is a deep, fully convolutional encoder-decoder architecture designed for image segmentation. Its core consists of an encoder network and a corresponding decoder network, followed by a pixel-wise classification layer. The encoder network is topologically identical to the 13 convolutional layers of VGG16. The encoder produces low-resolution feature maps, and the decoder's role is to map these coarse features back to full input resolution for pixel-wise classification; the upsampled features must therefore be useful for precise boundary localization. The authors emphasize that SegNet's novelty lies in the way the decoder upsamples its low-resolution input feature maps: the decoder performs non-linear upsampling using the pooling indices computed during the max-pooling step of the corresponding encoder. This removes the need to learn the upsampling. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps.

Segmentation results from earlier deep architectures are coarse mainly because max pooling and subsampling reduce feature-map resolution; for small objects to be delineated, the boundary information extracted from the image must be retained. SegNet is trained end to end with SGD, so all the network weights are adjusted jointly. Because the encoder matches the 13 convolutional layers of VGG16 but VGG's fully connected layers are cut away, SegNet is much smaller and easier to train than competing networks. The key component of SegNet is the decoder network, whose decoders correspond one-to-one with the encoders; as mentioned above, each decoder uses the max-pooling indices from its encoder to perform non-linear upsampling of its input feature maps, an idea inspired by an architecture designed for unsupervised feature learning. Reusing the max-pooling indices during decoding has several practical advantages: it improves boundary delineation; it reduces the number of parameters that must be trained end to end, since the pooling information of the corresponding layer is shared rather than learned; and this form of upsampling can be incorporated into any encoder-decoder architecture. Recent deep frameworks for segmentation share the same encoder network, VGG16, with its huge number of trainable parameters; they differ mainly in the training and inference of their decoder networks.
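To make the pooling-index trick concrete, here is a minimal sketch in PyTorch (used purely for illustration; the names and sizes are hypothetical and this is not the authors' code). Max pooling can return the argmax locations, unpooling scatters values back to those locations to produce a sparse map, and a trainable convolution then densifies it:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
densify = nn.Conv2d(64, 64, kernel_size=3, padding=1)  # trainable filter bank

x = torch.randn(1, 64, 32, 32)    # a 64-channel encoder feature map
pooled, indices = pool(x)         # downsample and remember the argmax locations
sparse = unpool(pooled, indices)  # non-linear upsampling: nothing to learn here
dense = densify(sparse)           # convolution turns the sparse map dense

print(pooled.shape, sparse.shape, dense.shape)
# torch.Size([1, 64, 16, 16]) torch.Size([1, 64, 32, 32]) torch.Size([1, 64, 32, 32])
```

Note that the unpooling step itself has no learnable parameters; only the densifying convolution is trained.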
The authors focus on comparing SegNet's decoding technique with that of FCN, paying close attention to the practical trade-offs involved in designing segmentation architectures. They note that recent deep frameworks for segmentation share the same kind of encoder network, such as VGG16, but differ in the training and inference of their decoder networks. Because these networks have far too many parameters to train easily, end-to-end training is difficult, which leads to supporting aids: multi-stage training, appending the network to a pre-trained architecture such as FCN, using region proposals to assist inference, disjoint training of classification and segmentation networks, and pre-training or full training with additional data.
The authors also summarize the history of segmentation, which is worth recording here. Before deep neural networks appeared, the best methods relied on hand-designed features to classify each pixel independently. Classically, a patch centered on a pixel was fed into a classifier such as a random forest or boosting, which predicted the class probabilities of the central pixel.
These noisy per-pixel predictions were then smoothed with pairwise or higher-order CRFs to improve accuracy. As mentioned, the unaries are computed by classifying patches; features based on appearance, shape, or SfM (structure from motion, an offline algorithm for 3D reconstruction from an unordered collection of images) have been explored for road scene understanding tests.
A more recent approach aims to produce high-quality unary potentials by predicting the labels of all the pixels in a patch rather than only the central pixel (I take "unary" here to mean the unary term of a CRF). Although this improves the results of random-forest-based unaries, thin-structured classes are classified less accurately. Another approach advocates combining hand-designed features with spatio-temporal super-pixels to obtain higher accuracy. The best-performing method on the CamVid test addresses the frequency imbalance among labels by combining object-detection outputs with classifier predictions in a CRF framework. On indoor RGB-D datasets, the authors cite methods that use RGB-SIFT, depth-SIFT, and pixel locations as inputs to a neural network classifier, followed by CRF smoothing, later improved with a richer feature set including LBP and region segmentation. More recent work infers class segmentation and support relationships jointly from a combination of RGB and depth cues. Another approach focuses on real-time joint reconstruction and semantic segmentation, with random forests as the classifier (random forests really are powerful); boundary detection and hierarchical grouping have also been used before category segmentation. The common feature of all these methods is the use of hand-designed features to classify RGB or RGB-D images.
Applying the deepest feature layer of a classification network to segmentation, with the image resized to match, yields blocky classification results. Another approach uses recurrent neural networks to merge several low-resolution predictions into a full-resolution prediction map of the input image, but its ability to delineate boundaries is poor.
Newer frameworks built for segmentation learn to decode, that is, to map low-resolution representations back to pixel-wise predictions. The encoder network, which produces the low-resolution representation, is the VGG16 classification network with its 13 convolutional layers and 3 fully connected layers; the decoder networks differ across architectures, each producing multi-dimensional features for every pixel for the subsequent classification.
Each decoder in FCN learns to upsample its input feature maps and combines them with the corresponding encoder feature map to form the input to the next decoder. The resulting architecture has a very large number of trainable parameters in the encoder network but very few in the decoders, and at this overall scale the network is hard to train end to end. The authors therefore describe its stage-wise training process: each decoder in the decoding network is added one by one to an already-trained network, and growth stops once no further improvement in performance is observed. In practice growth stopped after three decoders, so the high-resolution feature maps were ignored, which can lose edge information. Beyond this training issue, reusing the encoder feature maps in the decoder makes the network memory-intensive at test time, since those maps must be kept around.
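For contrast with SegNet's index-based upsampling, here is a hypothetical sketch of one FCN-style decoder step (PyTorch again; the layer names and channel sizes are mine, not FCN's actual configuration). The upsampling itself is learned, and the whole corresponding encoder feature map is fused in by addition:

```python
import torch
import torch.nn as nn

upsample = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)  # learned 2x upsampling
adapt = nn.Conv2d(128, 128, kernel_size=1)                        # adapt the encoder map

decoder_in = torch.randn(1, 256, 16, 16)   # coarse feature map from the layer below
encoder_map = torch.randn(1, 128, 32, 32)  # full feature map kept from the encoder

fused = upsample(decoder_in) + adapt(encoder_map)  # FCN keeps whole maps, not indices
print(fused.shape)  # torch.Size([1, 128, 32, 32])
```

The point of the contrast: FCN must store (and learn to process) entire encoder feature maps, while SegNet stores only the argmax indices.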
FCN has also been fine-tuned with CRF-RNN, a recurrent layer that mimics the sharp boundary delineation of a CRF while reusing the FCN feature representation. This brings a significant improvement over FCN-8, although the improvement shrinks as training data increase. The FCN+CRF-RNN combination shows its superiority under joint training, albeit at the cost of more complex training and inference; meanwhile, the performance of the deconvolutional network alone is clearly better than FCN's. This raises the question of whether the perceived advantage of CRF-RNN will diminish as feed-forward segmentation engines get better. In any case, a CRF-RNN can be appended to any segmentation architecture, including SegNet. Multi-scale deep architectures are also popular. They come in two flavors: those that feed the input image at several scales into a shared deep feature-extraction network, and those that combine feature maps from different layers of a single deep architecture. The common idea is to extract features at multiple scales to provide both local and global context, while the feature maps from the early encoding layers retain more high-frequency detail and therefore give sharper class boundaries. Some of these networks are hard to train, so multi-stage training and data augmentation are used, and inference through multiple convolutional pathways for feature extraction is expensive. Others append a CRF to the multi-scale network and train jointly. The deconvolutional network and its semi-supervised variant, the decoupled network, use the max locations of the encoder feature maps to perform non-linear upsampling in the decoder network.
The architecture the authors build on is an encoder-decoder network: the encoder produces feature maps through convolution, a non-linearity, max pooling, and subsampling. For each sample, the max-location indices computed during pooling are stored and handed to the decoder. The decoder upsamples its feature maps using those stored indices and convolves the upsampled maps with a trainable filter bank to reconstruct the input image. This structure was used for unsupervised pre-training; learning hierarchical features from small input patches was later extended to hierarchical encoders operating on full images.
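As a toy illustration of that unsupervised setup (assumed PyTorch, with placeholder layer sizes), the encoder pools while remembering the argmax indices, the decoder unpools with them, and the training target is the input image itself:

```python
import torch
import torch.nn as nn

class TinyReconstructionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)
        self.unpool = nn.MaxUnpool2d(2, stride=2)
        self.dec = nn.Conv2d(16, 3, kernel_size=3, padding=1)

    def forward(self, x):
        h, idx = self.pool(torch.relu(self.enc(x)))
        return self.dec(self.unpool(h, idx))  # reconstruct the input image

net = TinyReconstructionNet()
x = torch.randn(4, 3, 32, 32)
loss = nn.functional.mse_loss(net(x), x)  # unsupervised: the target is the input itself
loss.backward()
```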
Network structure of SegNet
SegNet has an encoder network and a corresponding decoder network, followed by a final pixel-wise classification layer. The encoder network consists of 13 convolutional layers corresponding to the first 13 convolutional layers of VGG16, which was designed for object classification; the authors can therefore initialize training from weights trained for classification on a large dataset. SegNet discards the fully connected layers (which substantially reduces the number of network parameters) in order to retain higher-resolution feature maps at the deepest encoder output. Each encoder in the network performs convolutions to produce a set of feature maps, which are batch-normalized and then passed through a ReLU. A max-pooling operation with a 2x2 window and stride 2 follows, subsampling the result by a factor of 2. Max pooling achieves invariance to small spatial shifts in the input image, and subsampling gives each pixel of the feature map a large input-image context. Several layers of max pooling and subsampling achieve more translation invariance and thus more robust classification, but at a corresponding loss of spatial resolution in the feature maps. Such an increasingly lossy representation of boundary detail is not beneficial for segmentation, where boundaries matter. It is therefore necessary to capture and store the boundary information in the encoder feature maps before subsampling. Since it is impractical to store all the feature maps, SegNet stores only the locations of the maximum value in each pooling window of each encoder feature map. The appropriate decoder in the decoder network then upsamples its input feature maps using the max-pooling indices stored for the corresponding encoder feature maps, producing sparse feature maps. A sketch of the decoding stages follows.
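Here is a sketch of one encoder stage and its matching decoder stage along the lines just described (PyTorch, with placeholder channel sizes; an illustration of the scheme, not the authors' implementation):

```python
import torch
import torch.nn as nn

class EncoderStage(nn.Module):
    """Conv -> batch norm -> ReLU -> 2x2 max pool (stride 2), keeping argmax indices."""
    def __init__(self, in_c, out_c):
        super().__init__()
        self.conv = nn.Conv2d(in_c, out_c, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_c)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)

    def forward(self, x):
        x = torch.relu(self.bn(self.conv(x)))
        return self.pool(x)  # (pooled map, indices to hand to the matching decoder)

class DecoderStage(nn.Module):
    """Unpool with the stored indices (sparse map), then conv + batch norm to densify."""
    def __init__(self, in_c, out_c):
        super().__init__()
        self.unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
        self.conv = nn.Conv2d(in_c, out_c, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_c)

    def forward(self, x, indices):
        return torch.relu(self.bn(self.conv(self.unpool(x, indices))))

enc, dec = EncoderStage(3, 64), DecoderStage(64, 64)
x = torch.randn(1, 3, 64, 64)
pooled, idx = enc(x)
print(dec(pooled, idx).shape)  # torch.Size([1, 64, 64, 64])
```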
The sparse decoded feature maps are then convolved with a trainable filter bank to produce dense feature maps, followed by batch normalization. One detail runs against intuition: although the encoder input has 3 channels (RGB), the decoder corresponding to the first encoder produces a multi-channel feature map. This differs from the other decoders in the network, which produce feature maps with the same size and number of channels as their encoder inputs. The high-dimensional feature output of the final decoder is fed into a trainable softmax classifier, which classifies each pixel independently. DeconvNet and U-Net have structures similar to SegNet's, but DeconvNet retains the fully connected layers and therefore has many more parameters, making end-to-end training difficult. U-Net does not use pooling indices; instead, each decoder's feature map is obtained by deconvolution and combined with the entire corresponding encoder feature map, which is transferred across the network. U-Net also lacks VGG's conv5 and max-pool5 layers, whereas SegNet uses all of the pre-trained convolutional weights from VGG.
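The final classification step might look like the following sketch (PyTorch; the class count and feature sizes are purely illustrative). A 1x1 convolution maps the last decoder's features to one channel per class, and the cross-entropy loss applies the softmax pixel by pixel:

```python
import torch
import torch.nn as nn

K = 12                                        # number of classes (illustrative)
classifier = nn.Conv2d(64, K, kernel_size=1)  # per-pixel class scores from decoder features
criterion = nn.CrossEntropyLoss()             # applies softmax over the K channels

features = torch.randn(2, 64, 360, 480)       # output of the last decoder (sizes illustrative)
labels = torch.randint(0, K, (2, 360, 480))   # ground-truth class per pixel

logits = classifier(features)                 # shape (2, K, 360, 480)
loss = criterion(logits, labels)              # per-pixel softmax classification loss
pred = logits.argmax(dim=1)                   # predicted segmentation, shape (2, 360, 480)
```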
The authors later describe a much smaller variant, SegNet-Basic, with only four encoders and four decoders: no biases are used after the convolutions, and there is no ReLU non-linearity in the decoder network. The kernel size of all encoder and decoder layers is fixed at 7x7, so that more contextual information can be gathered from the feature maps.
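Putting those specifications together, a compact, hypothetical sketch of a SegNet-Basic-like model (PyTorch; the 64-channel width and the class count are placeholders) might be:

```python
import torch
import torch.nn as nn

class SegNetBasic(nn.Module):
    """Sketch of SegNet-Basic: 4 encoders, 4 decoders, 7x7 kernels without biases,
    batch norm throughout, and no ReLU on the decoder side."""
    def __init__(self, in_c=3, num_classes=12, width=64):
        super().__init__()
        def block(i, o):  # 7x7 convolution without bias, followed by batch norm
            return nn.Sequential(nn.Conv2d(i, o, 7, padding=3, bias=False),
                                 nn.BatchNorm2d(o))
        self.encs = nn.ModuleList([block(in_c if i == 0 else width, width) for i in range(4)])
        self.decs = nn.ModuleList([block(width, width) for _ in range(4)])
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)
        self.unpool = nn.MaxUnpool2d(2, stride=2)
        self.classifier = nn.Conv2d(width, num_classes, kernel_size=1)

    def forward(self, x):
        stack = []  # pooling indices, popped in reverse order by the decoders
        for enc in self.encs:
            x, idx = self.pool(torch.relu(enc(x)))
            stack.append(idx)
        for dec in self.decs:
            x = dec(self.unpool(x, stack.pop()))  # note: no ReLU on the decoder side
        return self.classifier(x)

out = SegNetBasic()(torch.randn(1, 3, 64, 64))
print(out.shape)  # torch.Size([1, 12, 64, 64])
```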
Please refer to the original paper for specific experiments and analysis.
SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation