Reading some papers about DL
- Reading some papers about DL
- SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
- Encoder Network
- Decoder Network
- Training
- Analysis
- Do ConvNets Learn Correspondence?
- Ideas
- Method 1
- Method 2
- Personal Thoughts
- FCN
- Semantic Segmentation vs. Spatial Segmentation
- Fully Convolutional Networks for Semantic Segmentation
- What do receptive fields mean?
- Stride
- How to reach dense prediction
- Scheme one: stitching together coarse outputs by shifted inputs
- Scheme two: decrease the stride, e.g. set the sampling step to 1
- Scheme three: upsampling, with parameters obtained by learning
- Combining what and where
- Some Tricks
- OverFeat
- Ideas
- Model Design and Training
- Multi-scale classification (multi-view voting at each location and at multiple scales)
- Classification
- Localization
- Reference articles
SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
- The encoder network is topologically identical to the convolutional layers in the VGG16 network (VGG16 with the fully connected layers removed)
- Stores the max-pooling indices of the feature maps and uses them in its decoder network to achieve good performance: during downsampling, the location of the max value in each feature map is recorded, so that (non-linear) upsampling can recover the data as faithfully as possible
- Training: end-to-end, with stochastic gradient descent
The key learning module is an encoder-decoder network. An encoder consists of convolution with a filter bank, an element-wise tanh non-linearity, and max-pooling with sub-sampling to obtain the feature maps. For each sample, the indices of the max locations computed during pooling are stored and passed to the decoder. The decoder upsamples the feature maps by using the stored pooled indices, then convolves these upsampled maps with a trainable decoder filter bank to reconstruct the input image.
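A minimal PyTorch sketch of this module, with illustrative layer sizes rather than the paper's exact architecture: the encoder pools with return_indices=True, and the decoder uses MaxUnpool2d to put the pooled values back at the stored max locations before a trainable convolution densifies the sparse maps.

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Toy one-stage encoder-decoder in the SegNet style (illustrative sizes)."""
    def __init__(self, in_ch=3, mid_ch=16, num_classes=11):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)  # 2*2, non-overlapping
        self.unpool = nn.MaxUnpool2d(2, stride=2)                   # sparse upsampling
        self.dec = nn.Conv2d(mid_ch, num_classes, 3, padding=1)     # trainable decoder filters

    def forward(self, x):
        f = self.enc(x)
        p, idx = self.pool(f)                           # store the max locations
        u = self.unpool(p, idx, output_size=f.size())   # values go back where the maxima were
        return self.dec(u)                              # densify the sparse maps

logits = TinySegNet()(torch.randn(1, 3, 64, 64))        # -> (1, 11, 64, 64)
```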
Encoder Network
- The first convolutional layers of the VGG16 network; the fully connected layer parameters are dropped
- Batch normalization is used
- ReLU is applied
- Max-pooling with a 2*2 window and stride 2 (non-overlapping windows)
- Only the max-pooling indices are stored
- i.e. the locations of the maximum feature value in each pooling window
Decoder Network
- By contrast, the FCN decoder model requires storing entire encoder feature maps during inference, which costs far more memory than SegNet's max-pooling indices
Training
- Use stochastic gradient descent (SGD)
- Fixed learning rate of 0.1
- Mini-batches of 12 images
- Cross-entropy loss
- Median frequency balancing
- Label images must be single channel, with each pixel labelled with its class
- The weight assigned to a class in the loss function is the ratio of the median of the class frequencies (computed on the entire training set) to that class's frequency. This implies that larger classes in the training set get a weight smaller than 1, and the smallest classes get the highest weights (see the sketch below).
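A worked sketch of median frequency balancing, assuming class_freqs holds each class's pixel frequency over the training set:

```python
import numpy as np

def median_freq_weights(class_freqs):
    """weight[c] = median(freqs) / freq[c]: classes more frequent than the
    median are down-weighted (< 1), rarer classes are up-weighted (> 1)."""
    class_freqs = np.asarray(class_freqs, dtype=np.float64)
    return np.median(class_freqs) / class_freqs

# a dominant class (0.6) gets a small weight, a rare one (0.01) a large weight
print(median_freq_weights([0.6, 0.25, 0.05, 0.01]))
```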
Analysis
- G: global accuracy, the percentage of pixels correctly classified in the dataset
- C: class average accuracy, the mean of the predictive accuracy over all classes
- IoU: mean intersection over union (a helper computing all three is sketched below)
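A small helper (my own sketch, not the paper's code) computing the three metrics from a confusion matrix:

```python
import numpy as np

def segmentation_metrics(conf):
    """conf[i, j] counts pixels of true class i predicted as class j
    (assumes every class appears at least once in the ground truth)."""
    conf = conf.astype(np.float64)
    tp = np.diag(conf)
    global_acc = tp.sum() / conf.sum()                       # G
    class_acc = (tp / conf.sum(axis=1)).mean()               # C
    iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)    # per-class intersection/union
    return global_acc, class_acc, iou.mean()                 # G, C, mean IoU
```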
Personal thoughts:
- Could a CNN-trained decoder be used to upscale low-resolution images?
- Reduce the model size: focus on shrinking the fully connected layer parameters
Do ConvNets Learn Correspondence?
Ideas
Visualize the spatial characteristics of convnet features to verify that they correspond to spatial locations in the input image. *Are the modern convnets that excel at classification and detection also able to find precise correspondences between object parts?*
- Even when large pooling regions and whole-image training are used, convnets can still correspond to fine local features of the image (intuitively, one might expect a large receptive field to pool that information away)
Method 1
The paper provides a novel visual investigation of convnet features: using the features to reconstruct the original image with a simple top-k nearest neighbors and averaging (see the sketch after this list).
Objective: to verify whether, as the layers deepen and the receptive fields of convnet features grow larger and larger, the features can still accurately correspond to local spatial structure in the original image.
Conclusion: although the receptive fields of convnet features grow larger and larger, they can still carry local information at a finer scale.
- The yellow box in the upper left corner is the input patch, and the black frame outside the yellow box is the growing receptive field.
- In the right column, input patches are chosen uniformly at random from conv3-sized neighborhoods: not a center crop, but a random crop within the receptive field.
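A hedged NumPy sketch of the top-k reconstruction; all names and shapes are illustrative, and the paper extracts its features from specific convnet layers:

```python
import numpy as np

def knn_reconstruct(query_feats, ref_feats, ref_patches, k=5):
    """Reconstruct each query patch as the average of the k reference patches
    whose convnet features are closest in (squared) Euclidean distance."""
    d = ((query_feats[:, None, :] - ref_feats[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d, axis=1)[:, :k]        # k nearest references per query
    return ref_patches[idx].mean(axis=1)      # average their image patches

# toy shapes: 10 queries, 1000 references, 128-D features, 21*21 RGB patches
q, r = np.random.rand(10, 128), np.random.rand(1000, 128)
p = np.random.rand(1000, 21, 21, 3)
recon = knn_reconstruct(q, r, p)              # -> (10, 21, 21, 3)
```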
Method 2
Instead of fusing the top-k cropped patches, average the entire receptive fields of the neighbors.
Personal Thoughts
- Could the spatial correspondence between feature maps and the image be used for image alignment or image stitching?
- How could a CNN be used for keypoint prediction?
FCN
Semantic Segmentation vs. Spatial Segmentation
Semantic segmentation is the segmentation of objects with different semantics, and so involves an element of recognition: boundaries are delineated not by simple cues such as texture gradients, but by semantic category.
-"Global information resolves what while local information resolves where"
Fully Convolutional Networks for Semantic Segmentation
- The image segmentation problem can be regarded as an end-to-end, pixels-to-pixels, spatially dense prediction task.
- Transform the fully connected layers into convolutions whose kernels cover the entire feature map ("these fully connected layers can also be viewed as convolutions with kernels that cover their entire input regions"); see the sketch after this list.
- Upsample with bilinear interpolation, and learn the interpolation parameters.
- Use a "skip architecture" to fuse global information (deep, coarse layers: semantics) with local information (shallow, fine layers: appearance) to improve segmentation.
What do receptive fields mean?
"Locations in higher layers correspond to the locations in the image they are path-connected to, which are called their receptive fields."
- The range of the input image that a deep neuron "sees". Intuitively, because of the local connectivity of the convolution operation, deep neurons have larger receptive fields than shallow neurons.
Stride
The sampling step of the convolution: if the convolution window is w and stride = n, the receptive fields of two adjacent neurons in the next layer overlap by w - n.
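A tiny helper (my own sketch, assuming each layer is described by a (kernel, stride) pair) that makes the receptive-field growth concrete:

```python
def receptive_field(layers):
    """Each layer grows the receptive field by (kernel - 1) * jump, where
    jump is the product of the strides of all preceding layers."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# three 3*3 convs with two stride-2 poolings in between
print(receptive_field([(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]))  # -> 18
```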
How to reach dense prediction? Scheme one: stitching together coarse outputs (by shifted inputs)
- By shifting the input image, get coarse outputs (note that for an FCN with no fully connected layers, the outputs keep a spatial correspondence with image coordinates)
- Stitching these outputs together yields a dense prediction; see the toy sketch after this list
- If the output is downsampled by a factor of f, the input is shifted right by x and down by y pixels for every (x, y) with 0 <= x, y < f, so f*f forward passes are required
- Although this approach reaches a dense prediction without changing the filters, it cannot obtain finer-grained information ("but the filters are prohibited from accessing information at a finer scale than their original design")
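A toy NumPy sketch of shift-and-stitch, with plain factor-f subsampling standing in for the coarse network:

```python
import numpy as np

def coarse_net(x, f):
    return x[::f, ::f]   # stand-in for a network whose total stride is f

def shift_and_stitch(x, f):
    """Run the coarse net on every (dy, dx) shift of the input and interleave
    the f*f coarse outputs into one dense prediction."""
    h, w = x.shape
    dense = np.zeros_like(x)
    for dy in range(f):
        for dx in range(f):
            out = coarse_net(x[dy:, dx:], f)
            dense[dy::f, dx::f] = out[:(h - dy + f - 1) // f, :(w - dx + f - 1) // f]
    return dense

x = np.arange(36.0).reshape(6, 6)
assert np.allclose(shift_and_stitch(x, 2), x)   # dense output stitched exactly
```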
Scheme two: reduce the stride, e.g. set the sampling step to 1
- This also achieves dense prediction, but the output differs from shift-and-stitch: the receptive fields of later-layer neurons become smaller, the extracted features are different, and computation takes longer.
- Another strategy reduces the stride but keeps each filter seeing only its original inputs, treating the newly exposed inputs as zeros; the filters of the following layer are enlarged accordingly.
- The goal is to preserve the receptive fields, but this clearly increases the burden on the later layers.
Scheme three: upsampling, with parameters obtained through learning
- Use simple bilinear interpolation for upsampling, but obtain the interpolation parameters by learning
- End-to-end backpropagation lets the parameters be learned quickly; fast and effective
- In effect this learns the convolution (interpolation) kernel parameters; after learning, is the kernel still bilinear? (See the sketch below.)
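A hedged PyTorch sketch: a stride-2 transposed convolution initialized to an exact bilinear kernel (the 21 channels and kernel size 4 are illustrative choices), which backprop is then free to move away from bilinear:

```python
import torch
import torch.nn as nn

def bilinear_kernel(channels, k):
    """Per-channel separable triangular filter, i.e. bilinear interpolation."""
    f = (k + 1) // 2
    c = f - 1 if k % 2 == 1 else f - 0.5
    og = torch.arange(k).float()
    filt = 1 - (og - c).abs() / f                  # 1-D triangle filter
    w = torch.zeros(channels, channels, k, k)
    for i in range(channels):
        w[i, i] = filt[:, None] * filt[None, :]    # each channel upsamples itself
    return w

up = nn.ConvTranspose2d(21, 21, kernel_size=4, stride=2, padding=1, bias=False)
up.weight.data.copy_(bilinear_kernel(21, 4))       # start bilinear, then learn
print(up(torch.randn(1, 21, 16, 16)).shape)        # -> (1, 21, 32, 32)
```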
Combining what and where
- The shallow feature maps carry more detailed, local information
- The deep feature maps contain more semantic, global information
- The paper matches output strides when fusing: the deep layer's stride is twice the shallow layer's, so the deep result is upsampled 2x and then merged with the shallow layer (sketched below)
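A minimal sketch of the fusion step; the 21-class score maps and the pool4/conv7 names in the comments are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

shallow = torch.randn(1, 21, 32, 32)   # e.g. scores predicted from a pool4-like layer
deep = torch.randn(1, 21, 16, 16)      # e.g. scores predicted from a conv7-like layer

# upsample the coarse, semantic scores 2x and sum with the fine, local scores
fused = shallow + F.interpolate(deep, scale_factor=2, mode="bilinear",
                                align_corners=False)
print(fused.shape)                      # -> (1, 21, 32, 32)
```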
Some Tricks
- A minibatch size of 20 images and fixed learning rates
- Dropout
- Found class balancing unnecessary (about three-fourths of the pixels are background)
- Augment the training data by randomly mirroring and "jittering" the images by translating them up to 32 pixels
OverFeat
Ideas
An integrated framework is presented that uses a convnet for recognition, localization and detection together: one network (the same framework, a shared feature-learning base) serves multiple purposes, and the tasks benefit each other.
- "Training a convolutional network to simultaneously classify, locate and detect objects in images can boost the classification accuracy and the detection and localization accuracy of all tasks."
Explains how a convnet can be used for localization and detection:
- Use multiscale and sliding windows to predict object boundaries
- Accumulate bounding boxes to increase detection confidence; this avoids training on background samples and improves accuracy
Improving recognition/classification accuracy (mainly because of differences within the class):
- Sliding windows at different scales
- Avoid incomplete objects in the viewing windows (which would hurt localization and detection)
- The network not only learns the category distribution of each sliding window, but also predicts the location and size of the viewing window that contains the object
- Accumulate the confidence of these categories over different positions and sizes ("accumulate the evidence for each category at each location and size")
Mechanism for image segmentation
- Compute the label of the center pixel of the viewing window
- Pros: the semantic boundaries between objects can be determined (objects need not be clearly demarcated)
- Cons: requires dense pixel-level estimation
Model Design and Training
- Downsampling: each image is downsampled so that the smallest dimension is 256 pixels
- Extract 5 random crops (and their horizontal flips) of size 221*221 pixels
- Present these to the network in mini-batches of size 128
- Initialize the network randomly with N(0, 0.01); the learning rate is 0.05 and is successively decreased by a factor of 0.5 after (30, 50, 60, 70, 80) epochs
- Dropout with a rate of 0.5 is employed on the fully connected layers in the classifier
- Rectification ("ReLU") non-linearities
- No contrast normalization is used
- Pooling regions are non-overlapping
While the sliding window approach is computationally prohibitive for certain types of model, it is inherently efficient in the case of convnets. Why?
- Compared with feeding cropped patches one by one, feeding the entire image avoids recomputing the convolutions over the regions where patches overlap
- In fact, the sliding window achieves the same effect as patch cropping; the two are equivalent
Multi-scale classification (multi-view voting: at each location and at multiple scales)
Apply the final subsampling operation at every offset: resolution augmentation
- The last pooling layer (subsampling ratio 3) is applied at each offset, i.e. {0, 1, 2}; this produces pooled feature maps at different offsets, whose results are then interleaved and fed into the classifier
Details:
For a single image, at a given scale, we start with the unpooled layer 5 feature maps
- Each of the unpooled maps undergoes a 3*3 max pooling operation (non-overlapping regions), repeated 3*3 times for pixel offsets of {0, 1, 2}, producing a set of pooled feature maps
- The classifier (layers 6, 7, 8) has a fixed input size of 5*5 and produces a C-dimensional output vector for each location within the pooled maps; it is applied in sliding-window fashion to the pooled maps
- The output maps for different offset combinations are reshaped into a single 3D output map (see the sketch after this list)
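A hedged sketch of the offset-pooling step; the 256*17*17 layer-5 map size is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

def offset_pool(feat, k=3):
    """3*3 non-overlapping max pooling repeated once per (dx, dy) offset in
    {0, 1, 2}^2, so the 9 pooled maps together lose no resolution to the
    subsampling-ratio-3 pooling. feat: (C, H, W) unpooled feature map."""
    pooled = {}
    for dy in range(k):
        for dx in range(k):
            shifted = feat[:, dy:, dx:]
            h = shifted.shape[1] // k * k
            w = shifted.shape[2] // k * k
            pooled[(dx, dy)] = F.max_pool2d(shifted[None, :, :h, :w], k)[0]
    return pooled   # each map then feeds the 5*5 sliding-window classifier

maps = offset_pool(torch.randn(256, 17, 17))
print(maps[(0, 0)].shape)   # -> torch.Size([256, 5, 5])
```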
Classification
- Take the spatial max for each class, in each scale and flip
- Average the resulting C-dimensional vectors from the different scales and flips
- Take the top-n elements of the mean class vector, depending on the model version (sketched below)
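A small NumPy sketch of this voting procedure, with illustrative shapes:

```python
import numpy as np

def classify(score_maps, n=5):
    """score_maps: per-scale/per-flip class score maps of shape (C, H, W).
    Spatial max per class, then average over views, then top-n classes."""
    vecs = [m.reshape(m.shape[0], -1).max(axis=1) for m in score_maps]
    mean_vec = np.mean(vecs, axis=0)
    return np.argsort(mean_vec)[::-1][:n]

# two views (e.g. two scales) over 1000 classes
preds = classify([np.random.rand(1000, 5, 5), np.random.rand(1000, 7, 7)])
```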
Localization
- Combine the regression predictions together with the classification results
- Since these share the same feature extraction layers, only the final regression layers need to be recomputed after computing the classification network
- Feed in the full image, slide the classification layers over it, and obtain, for each position, a score for the object categories it contains
Reference articles:
SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
Do ConvNets Learn Correspondence?
Fully Convolutional Networks for Semantic Segmentation
OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks