Learning Hierarchical Features for Scene Labeling
Introduction:
Full-scene labeling is also called scene parsing.
The key is to extract feature vectors with a ConvNet.
1. The difficulty of scene parsing is that a single process has to combine detection, segmentation, and multi-label recognition.
2. There are two problems: one is to produce a good representation of visual information, and the other is to use contextual information to keep the interpretation of the image consistent.
3. Main method of the article: convolutional neural networks (the paper cites LeCun's work on text recognition).
The convolutional network is fed raw image pixels and is trained with supervised learning to classify them.
Each layer includes a filter bank module, a nonlinearity, and a spatial pooling module.
4. Multiscale ConvNet: classifying a pixel at full resolution relies on short-range information and also on long-range information.
5. Typical process: 1) Use a graph-based method to generate segmentation hypotheses. 2) Extract features from the candidate segments (CS; understood here as the part under test). 3) Train a CRF or another graphical model to produce labels for the candidate segments (classification).
The improved approach is to label pixels using large contextual windows, which reduces the need for post-processing methods.
The scene parsing system is divided into two parts:
1. Multiscale convolutional networks
g(I) applies a Laplacian pyramid transform to the input image, producing versions at different scales from which features are extracted. The disadvantage is that region boundaries are not delineated accurately. Superpixels are used to recover the natural contours of the image.
2. Graph-based classification
Superpixels: simple and effective, but suboptimal.
CRF over superpixels: effective at avoiding local aberrations (such as a person in the sky), but of little additional use, because the multiscale feature representation already takes scene-level relationships into account.
Multilevel cut with class purity criterion: detailed below.
Related works:
Learning Object-Class Segmentation with Convolutional Neural Networks, 2012
Deep Convolutional Networks for Scene Parsing, 2009
Multiscale feature extraction for scene parsing
1. Scale-invariant, scene-level feature extraction
Good internal representations are hierarchical.
Pixels -> edglets -> motifs -> parts -> objects -> scenes
The input and output of each stage are feature maps.
Three-stage ConvNet:
The first two stages each have a bank of filters (f1, f2, f3 and F1, F2, F3). The third stage has only a bank of filters, f.
The filters are convolution kernels learned during training. Each filter bank is shift-invariant as an operator (more precisely, shift-equivariant): however the input is shifted, the output shifts in the same way.
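The "input moves, output moves" property can be checked numerically. The following is a minimal sketch in one dimension, assuming circular (wrap-around) boundaries so that the equality is exact; the helper names `circular_filter` and `shift` are made up for this illustration and do not come from the paper.

```python
def circular_filter(signal, kernel):
    """Slide `kernel` over `signal`, wrapping around at the ends."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


def shift(signal, s):
    """Circularly shift a list to the right by s positions (0 < s < len(signal))."""
    return signal[-s:] + signal[:-s]


signal = [0, 1, 4, 2, 0, 0, 3, 1]
kernel = [1, -2, 1]  # hypothetical filter; a trained kernel would be used in practice

# Filtering then shifting gives the same result as shifting then filtering.
filtered_then_shifted = shift(circular_filter(signal, kernel), 2)
shifted_then_filtered = circular_filter(shift(signal, 2), kernel)
assert filtered_then_shifted == shifted_then_filtered
```

With zero padding instead of wrap-around, the same property holds away from the image borders.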
Scene-level understanding requires the system to model complex interactions at the scale of the full image.
A multiscale pyramid is used; its aim is to produce versions of the image at different sizes. Options include the Gaussian pyramid, the Laplacian pyramid, and the steerable pyramid. Here a Laplacian pyramid is applied to the input image to obtain images of different sizes, and preprocessing gives local neighborhoods zero mean and unit variance. f_s can be seen as a sequence of linear transformations interleaved with nonlinear transformations (such as the sigmoid function; this paper uses tanh).
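The pyramid and the normalization step can be sketched as follows. This is a simplified illustration, not the paper's exact preprocessing: it assumes a grayscale image whose sides are divisible by 2^(levels-1), replaces the usual Gaussian blur with 2x2 block averaging, and uses a square k x k window for the local statistics. All function names are chosen for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def downsample(img):
    """Halve the resolution by averaging 2x2 blocks (stand-in for blur + subsample)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def upsample(img):
    """Double the resolution by repeating each pixel (nearest neighbour)."""
    return img.repeat(2, axis=0).repeat(2, axis=1)


def laplacian_pyramid(img, levels=3):
    """Return [detail_0, ..., detail_{levels-2}, low_pass], finest scale first."""
    pyramid, current = [], img.astype(float)
    for _ in range(levels - 1):
        smaller = downsample(current)
        pyramid.append(current - upsample(smaller))  # band-pass detail at this scale
        current = smaller
    pyramid.append(current)  # coarsest low-pass image
    return pyramid


def local_normalize(img, k=3, eps=1e-8):
    """Subtract the mean and divide by the std of each pixel's k x k neighbourhood."""
    pad = k // 2
    padded = np.pad(img, pad, mode="reflect")
    windows = sliding_window_view(padded, (k, k))  # shape (H, W, k, k)
    mean = windows.mean(axis=(2, 3))
    std = windows.std(axis=(2, 3))
    return (img - mean) / (std + eps)


image = np.arange(64, dtype=float).reshape(8, 8)
scales = [local_normalize(level) for level in laplacian_pyramid(image, levels=3)]
```

Each entry of `scales` would then be fed to the network at its own resolution.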
f_s has L stages in total (these are the "layers" above; a layer here means a stage). The paper writes out f_s and the hidden units of stage l explicitly.
The filters W_l and biases b_l are collected in \theta_s. pool is a function that considers a neighborhood of activations and produces one activation value per neighborhood. See (http://blog.csdn.net/zhoubl668/article/details/24801103). This yields f, a map of feature vectors, in which u(.) is the upsampling function.
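A forward pass consistent with this description can be sketched as below, assuming each stage except the last computes H_l = pool(tanh(W_l H_{l-1} + b_l)) and the last stage applies a filter bank only. Max pooling over 2x2 blocks, "valid" borders, and the bias on the last stage are assumptions of this sketch, and the names `conv_bank`, `max_pool`, and `f_s` are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv_bank(x, W, b):
    """Apply a filter bank. x: (C_in, H, W); W: (C_out, C_in, k, k); b: (C_out,)."""
    k = W.shape[-1]
    patches = sliding_window_view(x, (k, k), axis=(1, 2))  # (C_in, H-k+1, W-k+1, k, k)
    return np.einsum("chwij,ocij->ohw", patches, W) + b[:, None, None]


def max_pool(x, size=2):
    """Keep the largest activation in each size x size neighbourhood."""
    c, h, w = x.shape
    x = x[:, : h - h % size, : w - w % size]  # crop so the blocks tile exactly
    return x.reshape(c, h // size, size, w // size, size).max(axis=(2, 4))


def f_s(x, params):
    """params is a list of (W_l, b_l); all stages but the last use tanh and pooling."""
    h = x
    for W, b in params[:-1]:
        h = max_pool(np.tanh(conv_bank(h, W, b)))
    W, b = params[-1]
    return conv_bank(h, W, b)  # last stage: filter bank only


rng = np.random.default_rng(0)
params = [
    (rng.normal(size=(4, 3, 3, 3)), np.zeros(4)),
    (rng.normal(size=(5, 4, 3, 3)), np.zeros(5)),
    (rng.normal(size=(6, 5, 3, 3)), np.zeros(6)),
]
features = f_s(rng.normal(size=(3, 32, 32)), params)  # shape (6, 4, 4)
```

The spatial size shrinks at every stage (32 -> 30 -> 15 -> 13 -> 6 -> 4 here), which is why coarser outputs need upsampling before they can be combined per pixel.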
The weights (presumably the filters W_l) are shared across the f_s. The benefit is that the network is forced to learn scale-invariant features (when the image is shrunk, the feature maps after convolution should shrink correspondingly), and the chance of overfitting is reduced.
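Weight sharing and the upsampling function u(.) can be illustrated with a toy network. In this sketch a 1x1 filter bank followed by tanh stands in for the real f_s, the same weight matrix `W` is applied at every scale, and the coarser maps are upsampled by nearest neighbour and stacked with the finest one. The names and the nearest-neighbour choice are assumptions of the example.

```python
import numpy as np


def shared_net(x, W):
    """Toy stand-in for f_s: a 1x1 filter bank plus tanh. x: (C, H, W); W: (D, C)."""
    return np.tanh(np.einsum("dc,chw->dhw", W, x))


def upsample_to(feat, shape):
    """u(.): nearest-neighbour upsampling of (D, h, w) features to spatial `shape`."""
    rows = np.arange(shape[0]) * feat.shape[1] // shape[0]
    cols = np.arange(shape[1]) * feat.shape[2] // shape[1]
    return feat[:, rows[:, None], cols[None, :]]


def multiscale_features(pyramid, W):
    """Run the same weights on every scale and concatenate along the channel axis."""
    target = pyramid[0].shape[1:]  # spatial size of the finest scale
    maps = [upsample_to(shared_net(x, W), target) for x in pyramid]
    return np.concatenate(maps, axis=0)


rng = np.random.default_rng(0)
pyramid = [rng.normal(size=(3, 8, 8)), rng.normal(size=(3, 4, 4)), rng.normal(size=(3, 2, 2))]
W = rng.normal(size=(5, 3))  # one set of weights for all three scales
F = multiscale_features(pyramid, W)  # shape (15, 8, 8): 5 features from each scale
```

Because only one `W` exists, adding scales does not add parameters, which is the overfitting argument made above.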
Learning discriminative scale-invariant features
Ideally, a linear classifier should produce the correct class for each pixel based on the vector f_i. \theta is trained toward this goal using a multiclass cross-entropy loss. c_i is the normalized prediction for pixel i, normalized with the softmax function.
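The loss can be sketched in a few lines. The example assumes the per-pixel features are stacked into a matrix `F` with one row per pixel and that the linear classifier is a weight matrix `w` with one column per class; summing over pixels rather than averaging is a choice made here and only rescales the loss.

```python
import numpy as np


def softmax(scores):
    """Normalize each row of scores into a probability distribution."""
    shifted = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def cross_entropy(F, w, labels):
    """F: (N, D) pixel features; w: (D, A) classifier weights; labels: (N,) class ids."""
    probs = softmax(F @ w)  # normalized prediction for every pixel
    picked = probs[np.arange(len(labels)), labels]  # probability of the true class
    return -np.log(picked).sum()


rng = np.random.default_rng(0)
F = rng.normal(size=(3, 6))
labels = np.array([0, 2, 1])
loss_untrained = cross_entropy(F, np.zeros((6, 4)), labels)  # equals 3 * ln(4)
```

With all-zero weights every class gets probability 1/4, so the loss is ln(4) per pixel; training lowers it by raising the probability of the correct class.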
Scene labeling strategies
The simplest way is to take the argmax of the prediction for pixel i as its label.
Superpixels
Using a two-layer network lets the system capture nonlinear relationships between features at different scales. W1 and W2 are trainable parameters, and d_{k,a} is the average, over all pixels in superpixel k, of the per-pixel prediction for class a.
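Both labeling strategies fit in one short sketch: a two-layer network produces a class distribution per pixel, the distributions are averaged inside each superpixel, and the argmax of the average labels every pixel of that superpixel. The bias `b1`, the flat pixel indexing, and the assumption that superpixel ids run from 0 to K-1 with no gaps are choices made for this example.

```python
import numpy as np


def two_layer_probs(F, W1, b1, W2):
    """Per-pixel class distributions from a two-layer network. F: (N, D)."""
    hidden = np.tanh(F @ W1 + b1)
    scores = hidden @ W2
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def label_superpixels(probs, segments):
    """probs: (N, A) distributions; segments: (N,) superpixel id of each pixel."""
    n_seg = segments.max() + 1
    sums = np.zeros((n_seg, probs.shape[1]))
    np.add.at(sums, segments, probs)  # accumulate distributions per superpixel
    counts = np.bincount(segments, minlength=n_seg)[:, None]
    d = sums / counts  # d[k, a]: average prediction for class a in superpixel k
    return d.argmax(axis=1)[segments], d


probs = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.3, 0.7]])
segments = np.array([0, 0, 1, 1])
pixel_labels = probs.argmax(axis=1)  # simplest strategy: per-pixel argmax
smoothed_labels, d = label_superpixels(probs, segments)
```

In this toy input the second pixel would be labeled 1 on its own, but its superpixel's average favors class 0, so the superpixel strategy relabels it.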
Conditional random fields (still to be studied; apparently quite different from an MRF)
Traditionally a CRF is applied on top of the superpixels, because labeling by superpixels alone involves no global understanding of the scene. Although the ConvNet can already express scene-level relationships, the CRF reinforces this property. The strategy is to associate the image with a graph and define an energy function whose optimum gives the desired labeling.
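To make the idea of an energy over a graph concrete, here is a generic sketch, not the paper's potentials: each node (a superpixel, say) pays a unary cost for its label, and each edge pays a penalty `gamma` when its two endpoints disagree (a Potts model). Minimization uses iterated conditional modes (ICM), a simple local search swapped in for illustration; graph cuts or another solver would normally do this job.

```python
def energy(labels, unary, edges, gamma):
    """Sum of unary costs plus gamma for every edge whose endpoints disagree."""
    total = sum(unary[i][label] for i, label in enumerate(labels))
    total += gamma * sum(1 for i, j in edges if labels[i] != labels[j])
    return total


def icm(unary, edges, gamma, sweeps=10):
    """Greedy minimization: repeatedly give each node its locally cheapest label."""
    n, n_classes = len(unary), len(unary[0])
    labels = [min(range(n_classes), key=lambda a: unary[i][a]) for i in range(n)]
    neighbours = {i: [] for i in range(n)}
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)
    for _ in range(sweeps):
        changed = False
        for i in range(n):
            def local_cost(a):
                disagreements = sum(1 for j in neighbours[i] if labels[j] != a)
                return unary[i][a] + gamma * disagreements
            best = min(range(n_classes), key=local_cost)
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels


# Three nodes in a chain; the middle one weakly prefers class 1 on its own.
unary = [[0.1, 2.0], [1.0, 0.9], [0.1, 2.0]]
edges = [(0, 1), (1, 2)]
smoothed = icm(unary, edges, gamma=1.0)
```

The middle node is the "person in the sky" case in miniature: its own evidence is nearly tied, so agreement with both neighbours decides its label. ICM only finds a local minimum in general.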
The procedure is not explained ...
Parameter-free multilevel parsing
Both methods above rely on an arbitrary segmentation of the image, which may decompose it into segments that are too small or too large. The paper proposes a method that analyzes a family of segmentations and automatically discovers the best observation level for each pixel. A special case of such a family is a segmentation tree, in which the segments are organized hierarchically. The method is not limited to segmentation trees; it can also be used with arbitrary sets of neighborhoods (?).
A superpixel segmentation or a hierarchical image segmentation (the two most classic references) is used directly to obtain the components C_k (k = 1 ... K), where the C_k are all the connected components of the lattice defined on the image. S_k is the cost associated with each component. For every pixel i, we want the index k*(i) (in 1 ... K) of the component that best explains pixel i, i.e. the one with minimum cost.
Since there are too many components C_k, only a subset of them can be considered. For example, the value next to a circle is S_{k*(i)} and the circle marks C_{k*(i)}. To find the optimal cover of the set C, it is enough to pick the component C_k with the smallest cost along each branch. Another approach uses different merging thresholds in the superpixel segmentation algorithm; finding the optimal cover then is ...
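For a segmentation tree, the rule "smallest cost along each branch" can be sketched directly. The tree is assumed to be given as a `parent` dictionary (the root maps to `None`) and the costs S_k as a `cost` dictionary; both representations and the node names are hypothetical, chosen for this example.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the node on its path to the root with minimal cost."""
    best = {}
    for leaf in leaves:
        node, k_star = leaf, leaf
        while node is not None:
            if cost[node] < cost[k_star]:
                k_star = node  # a cheaper component that also contains this leaf
            node = parent[node]
        best[leaf] = k_star
    return best


# Root R has children A and B; the leaves are the finest segments.
parent = {"R": None, "A": "R", "B": "R", "a1": "A", "a2": "A", "b1": "B", "b2": "B"}
cost = {"R": 0.5, "A": 0.2, "B": 0.6, "a1": 0.4, "a2": 0.1, "b1": 0.3, "b2": 0.7}
cover = optimal_cover(parent, cost, ["a1", "a2", "b1", "b2"])
```

Here `a1` is best explained by its parent `A`, `a2` and `b1` by themselves, and `b2` by the root, so different pixels end up observed at different levels of the tree. The selected components may overlap.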
Producing the confidence costs (given C_k, how to find S_k)
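These notes do not spell out how S_k is computed. One common way to express class purity is the entropy of the class distribution predicted for a component: a component dominated by a single class gets a low cost. The sketch below assumes that entropy-based choice and assumes the distribution for each component is already available; `purity_cost` is a name made up here.

```python
import math


def purity_cost(distribution):
    """Entropy of a class distribution; 0 means the component is perfectly pure."""
    return -sum(p * math.log(p) for p in distribution if p > 0)


pure = purity_cost([1.0, 0.0, 0.0])    # single class: cost 0
mixed = purity_cost([0.5, 0.5, 0.0])   # two classes tied: cost ln(2)
uniform = purity_cost([0.25] * 4)      # no information: cost ln(4)
```

Under this choice, the minimum-cost component for a pixel is the one whose prediction is most confidently a single class.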
(2016.4.17) Literature summary: Learning Hierarchical Features for Scene Labeling