Papers about DL

Last Update:2016-03-26 Source: Internet

Author: User

Reading some papers about DL
- SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
  - Encoder Network
  - Decoder Network
  - Training
  - Analysis
    - Personal Thoughts
- Do Convnets Learn Correspondence?
  - Ideas
  - Method 1
  - Method 2
  - Personal Thoughts
- FCN
  - Semantic Segmentation vs. Spatial Segmentation
  - Fully Convolutional Networks for Semantic Segmentation
  - What does receptive field mean?
  - Stride
  - How to reach dense prediction
    - Scheme one: stitching together coarse outputs (by shifted inputs)
    - Scheme two: reduce the stride, e.g. set the sampling step to 1
    - Scheme three: upsampling, with parameters obtained by learning
    - Combining what and where
    - Some Tricks
- OverFeat
  - Ideas
  - Model Design and Training
  - Multi-scale classification (multi-view voting at each location and at multiple scales)
  - Classification
  - Localization
- Reference articles

SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

The encoder network is topologically identical to the convolutional layers of the VGG16 network, that is, VGG with the fully connected layers removed.
The max-pooling indices of the feature maps are stored and used in the decoder network to achieve good performance: the positions of the maxima in each feature map are recorded during downsampling, and the (non-linear) upsampling uses them to recover as much of the data as possible.
Training is end-to-end, with stochastic gradient descent.

The key learning module is an encoder-decoder network. An encoder consists of convolution with a filter bank, element-wise tanh non-linearity, max-pooling and sub-sampling to obtain the feature maps. For each sample, the indices of the max locations computed during pooling are stored and passed to the decoder. The decoder upsamples the feature maps by using the stored pooling indices. It convolves this upsampled map using a trainable decoder filter bank to reconstruct the input image.

Encoder Network

The convolutional layers of the VGG16 network are used; the parameters of the fully connected layers are discarded
Batch normalization is used
ReLU is applied
Max-pooling with a 2*2 window and stride 2 (non-overlapping windows)
Only the max-pooling indices are stored
- That is, the location of the maximum feature value in each pooling window
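
The index-storing pooling described above can be sketched in plain Python. This is a minimal illustration on a single 2D feature map, not the SegNet implementation; the function names `max_pool_with_indices` and `unpool` are made up for this example. The second function shows how a decoder could place each pooled value back at its recorded position and leave every other position at zero.

```python
def max_pool_with_indices(fmap, size=2, stride=2):
    """Max-pool a 2D list and record the (row, col) of every maximum."""
    rows, cols = len(fmap), len(fmap[0])
    pooled, indices = [], []
    for r in range(0, rows - size + 1, stride):
        pooled_row, index_row = [], []
        for c in range(0, cols - size + 1, stride):
            # Collect (value, position) pairs for the current window.
            window = [(fmap[r + dr][c + dc], (r + dr, c + dc))
                      for dr in range(size) for dc in range(size)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def unpool(pooled, indices, shape):
    """Sparse upsampling: put each pooled value back at its stored location."""
    rows, cols = shape
    out = [[0] * cols for _ in range(rows)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 0, 3, 1],
]
pooled, indices = max_pool_with_indices(feature_map)
restored = unpool(pooled, indices, (4, 4))
print(pooled)    # [[4, 5], [6, 7]]
print(restored)  # [[0, 0, 0, 0], [4, 0, 0, 5], [0, 0, 7, 0], [6, 0, 0, 0]]
```

The restored map is sparse; in the architecture described here, a trainable decoder filter bank is then convolved over it to produce dense feature maps.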

Decoder Network

Each encoder layer has a corresponding decoder layer, so the decoder network has as many layers as the encoder
The final decoder output is fed to a multi-class soft-max classifier to produce class probabilities for each pixel independently
Batch normalization
Each decoder filter has the same number of channels as the number of upsampled feature maps. A smaller variant is one where the decoder filters are single channel, i.e. they only convolve their corresponding upsampled feature map.

By contrast, the FCN decoder model requires storing the encoder feature maps during inference

Training

Use stochastic gradient descent (SGD)
Fixed learning rate of 0.1
Mini-batches of images
Cross-entropy loss
Label images must be single channel, with each pixel labelled with its class
Median frequency balancing
- The weight assigned to a class in the loss function is the ratio of the median of the class frequencies, computed on the entire training set, to the frequency of that class. This implies that larger classes in the training set have a weight smaller than 1, and the smallest classes have the highest weights.
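
As a rough sketch of median frequency balancing, the weights can be computed from per-class pixel counts. The class names and counts below are invented for illustration, and the class frequency is simplified to "pixels of this class divided by all pixels in the training set"; an actual implementation may define the frequency differently.

```python
from statistics import median


def median_frequency_weights(pixel_counts):
    """Return weight = median(frequencies) / frequency for every class."""
    total = sum(pixel_counts.values())
    freqs = {cls: n / total for cls, n in pixel_counts.items()}
    med = median(freqs.values())
    return {cls: med / f for cls, f in freqs.items()}


# Hypothetical pixel counts over a whole training set.
counts = {"road": 600, "car": 300, "sign": 100}
weights = median_frequency_weights(counts)
for cls, w in weights.items():
    print(cls, round(w, 3))
```

With these counts the largest class ("road") gets a weight below 1, the median class gets a weight of 1, and the rarest class ("sign") gets the largest weight, which matches the description above.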

Analysis

G: global accuracy, the percentage of pixels correctly classified in the dataset
C: class average accuracy, the mean of the predictive accuracy over all classes
IU: mean intersection over union
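
The three measures can all be derived from a confusion matrix. The sketch below assumes `confusion[i][j]` counts the pixels of true class i predicted as class j and that every class occurs at least once; the function name and the example matrix are illustrative only.

```python
def segmentation_metrics(confusion):
    """Return (global accuracy, class average accuracy, mean IU)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    class_accs, ious = [], []
    for i in range(n):
        true_i = sum(confusion[i])                        # pixels labelled i
        pred_i = sum(confusion[j][i] for j in range(n))   # pixels predicted i
        tp = confusion[i][i]
        class_accs.append(tp / true_i)
        ious.append(tp / (true_i + pred_i - tp))          # intersection / union
    return correct / total, sum(class_accs) / n, sum(ious) / n


confusion = [
    [8, 2],   # true class 0: 8 correct, 2 predicted as class 1
    [1, 9],   # true class 1: 1 predicted as class 0, 9 correct
]
g, c, iu = segmentation_metrics(confusion)
print(round(g, 4), round(c, 4), round(iu, 4))
```

G is dominated by the large classes, while C and IU give every class equal weight, so the three numbers can diverge noticeably on unbalanced datasets.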

Personal Thoughts

Is it possible to use a CNN to train a decoder that upscales low-resolution images?
Reduce the network model size, focusing on cutting the fully connected layer parameters

Do Convnets Learn Correspondence?

Ideas

Visualize the spatial properties of convnet features to verify that they correspond to spatial locations in the input image. *Are the modern convnets that excel at classification and detection also able to find precise correspondences between object parts?*
- Even with large pooling regions and training on whole images, convnet features can still correspond to fine local details of the image (intuitively, one would expect the large receptive fields to pool this information away)

Method 1

The paper provides a novel visual investigation to examine convnet features: the convnet features are used to reconstruct the original image (by averaging simple top-k nearest neighbors)
Objective: to verify whether convnet features, whose receptive fields grow larger as the layers get deeper, can still correspond accurately to the local spatial structure of the original image

Conclusion: although the receptive fields of convnet features grow larger and larger, the features can still carry local information at a finer scale

The yellow box in the upper left corner is the input patch, and the black boxes outside the yellow box are the increasingly large receptive fields.
In the right column, the input patches are chosen uniformly at random from conv3-sized neighborhoods: not a center crop, but a random crop within the receptive field.

Method 2

Instead of averaging only the patches cropped from the top-k neighbors, average the entire receptive fields of the neighbors

Personal Thoughts

Can the spatial correspondence between the feature maps and the image be used for image alignment or image stitching?
How can a CNN be used for keypoint prediction?

FCN

Semantic Segmentation vs. Spatial Segmentation

Semantic segmentation separates objects with different semantics, so it involves a degree of recognition: boundaries are drawn not from simple cues such as texture or gradients, but from the semantic categories of the regions
- "Global information resolves what while local information resolves where"

Fully Convolutional Networks for Semantic Segmentation

Image segmentation can be regarded as an end-to-end, pixels-to-pixels, spatially dense prediction task.
The fully connected layers are converted into convolutions whose kernels span the entire feature map (these fully connected layers can also be viewed as convolutions with kernels that cover their entire input regions)
Upsampling starts from bilinear interpolation, and the interpolation parameters are obtained by learning
A "skip architecture" fuses global information (deep, coarse layers; semantic) with local information (shallow, fine layers; appearance) to improve the segmentation result
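
The point about viewing a fully connected layer as a convolution can be checked on a toy example. The sketch below uses a single output unit, invented weights and plain Python lists; `fully_connected` and `conv2d_valid` are illustrative helpers, not functions from any framework.

```python
def fully_connected(patch, weights, bias):
    """One fully connected unit on a flattened patch."""
    flat_x = [v for row in patch for v in row]
    flat_w = [v for row in weights for v in row]
    return sum(x * w for x, w in zip(flat_x, flat_w)) + bias


def conv2d_valid(image, kernel, bias):
    """Stride-1 'valid' convolution (cross-correlation, as in most CNN code)."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)) + bias)
        out.append(row)
    return out


weights, bias = [[1, 0], [2, -1]], 1
patch = [[1, 2], [3, 4]]
image = [[1, 2, 0], [3, 4, 1], [0, 2, 5]]   # the patch is its top-left corner

fc_out = fully_connected(patch, weights, bias)
same_size = conv2d_valid(patch, weights, bias)
larger = conv2d_valid(image, weights, bias)
print(fc_out, same_size, larger)  # 4 [[4]] [[4, 10], [2, 4]]
```

On an input of the original size the convolution yields a 1*1 map equal to the fully connected output; on a larger input the same weights yield a spatial map of outputs, which is what makes dense prediction on whole images possible.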

What does receptive field mean?

Locations in higher layers correspond to the locations in the image they are path-connected to, which are called their receptive fields.
- The receptive field is the region of the input image that a deep neuron can "see". Intuitively, because convolution is locally connected, deep neurons have larger receptive fields than shallow neurons.
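
How the receptive field grows with depth can be worked out layer by layer. The sketch below uses the usual recurrence (the field grows by (kernel - 1) times the accumulated stride of the earlier layers); the layer list is a made-up example, not one of the networks discussed here.

```python
def receptive_fields(layers):
    """layers: list of (kernel_size, stride). Returns the field size after each layer."""
    rf, jump = 1, 1          # jump = distance in input pixels between adjacent units
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes


# conv 3x3, conv 3x3, pool 2x2 with stride 2, conv 3x3
layers = [(3, 1), (3, 1), (2, 2), (3, 1)]
sizes = receptive_fields(layers)
print(sizes)  # [3, 5, 6, 10]
```

The last convolution adds 4 input pixels instead of 2 because it follows a stride-2 pooling layer, which is why deeper units cover a much larger part of the image.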

Stride

The sampling step. If the convolution window has width w and the stride is n, the windows of two adjacent neurons in the convolutional network overlap by w - n

How to reach dense prediction?

Scheme one: stitching together coarse outputs (by shifted inputs)

By shifting the input image, a set of coarse outputs is obtained (note that an FCN has no fully connected layers, so the outputs correspond to spatial coordinates in the image)
Stitching these outputs together gives a dense prediction
- If the output is downsampled by a factor of f, the input is shifted x pixels to the right and y pixels down for every (x, y) with 0 <= x, y < f, so f*f inputs have to be processed in total
Although this approach reaches dense prediction without modifying the network, it cannot capture finer-grained information (the paper's explanation: the filters are prohibited from accessing information at a finer scale than their original design)
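
A one-dimensional toy version of shift-and-stitch is sketched below. The "network" is just non-overlapping max pooling that downsamples by f, the shift is implemented by slicing the zero-padded input, and all names are invented for the example; it assumes the signal length is a multiple of f.

```python
def coarse_net(signal, f):
    """Toy network: non-overlapping max pooling, downsampling by a factor f."""
    return [max(signal[i:i + f]) for i in range(0, len(signal) - f + 1, f)]


def shift_and_stitch(signal, f):
    """Run the coarse net on f shifted copies and interlace the outputs."""
    padded = signal + [0] * (f - 1)          # zero padding so every shift fits
    outputs = [coarse_net(padded[s:], f) for s in range(f)]
    dense = []
    for i in range(len(outputs[0])):
        for s in range(f):                   # output i of shift s sits at i*f + s
            if i < len(outputs[s]):
                dense.append(outputs[s][i])
    return dense


signal = [1, 5, 2, 4, 3, 0]
dense = shift_and_stitch(signal, 2)
# Reference: the same pooling window applied at every position (stride 1).
padded = signal + [0]
stride_one = [max(padded[i:i + 2]) for i in range(len(signal))]
print(dense)       # [5, 5, 4, 4, 3, 0]
print(stride_one)  # [5, 5, 4, 4, 3, 0]
```

The stitched result has one value per input position and matches the stride-1 evaluation, but each value still comes from the original coarse filter, which illustrates why no finer-scale information is gained.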

Scheme two: reduce the stride, e.g. set the sampling step to 1

This also achieves dense prediction, but the output differs from shift-and-stitch: the receptive fields of the neurons in later layers become smaller, so the extracted features are different, and the computation takes longer.
Another strategy is to reduce the stride while keeping the original filter weights unchanged and filling the extra positions with zeros, so that every convolution kernel after this layer doubles in size
- The goal is to preserve the receptive field, but this undoubtedly increases the cost of the later layers

Scheme three: upsampling, with parameters obtained by learning

Start from simple bilinear interpolation, and obtain the upsampling interpolation parameters by learning
- End-to-end backpropagation lets the parameters be learned quickly and effectively
- What is actually learned are the parameters of a convolution (interpolation) kernel; is the result then still bilinear?
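
To make the "interpolation as a convolution kernel" idea concrete, here is a one-dimensional sketch: a bilinear kernel for 2x upsampling and a strided transposed convolution that applies it. The helper names are invented, and the kernel formula is the commonly used bilinear initialisation, given here as an assumption rather than as the paper's code. In a learned upsampling layer these weights would only be the starting point that backpropagation then updates.

```python
def bilinear_kernel_1d(size):
    """Weights of a 1D linear-interpolation kernel with `size` taps."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    return [1 - abs(i - center) / factor for i in range(size)]


def transposed_conv_1d(signal, kernel, stride):
    """Each input sample adds a scaled copy of the kernel to the output."""
    out = [0.0] * ((len(signal) - 1) * stride + len(kernel))
    for i, x in enumerate(signal):
        for k, w in enumerate(kernel):
            out[i * stride + k] += x * w
    return out


kernel = bilinear_kernel_1d(4)               # 4 taps for 2x upsampling
upsampled = transposed_conv_1d([0, 4, 8], kernel, stride=2)
print(kernel)     # [0.25, 0.75, 0.75, 0.25]
print(upsampled)  # [0.0, 0.0, 1.0, 3.0, 5.0, 7.0, 6.0, 2.0]
```

The interior values 1, 3, 5, 7 lie on the straight line through the inputs 0, 4, 8; the values at the right edge fall off because the signal is implicitly zero outside its range. Once training changes the kernel weights, the layer is no longer strictly bilinear, which is the question raised in the last bullet.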

Combining what and where

The shallow feature maps carry more detailed, local information
The deep feature maps carry more semantic, global information
The paper combines them by controlling the output stride and the upsampling factor: the shallow layer has half the stride (twice the resolution) of the deep layer, so when fusing, the deep output is upsampled by a factor of 2 and then merged with the shallow output.
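
The fusion step can be sketched on tiny score maps. For brevity this example uses nearest-neighbor upsampling and element-wise addition; the upsampling in the paper is learned (initialised as bilinear), so treat the helper names and the upsampling method here as simplifying assumptions.

```python
def upsample2x_nearest(score_map):
    """Repeat every value twice along both axes."""
    out = []
    for row in score_map:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse(deep, shallow):
    """Upsample the coarse (deep) map by 2 and add the fine (shallow) map."""
    up = upsample2x_nearest(deep)
    return [[a + b for a, b in zip(up_row, sh_row)]
            for up_row, sh_row in zip(up, shallow)]


deep = [[1, 2],
        [3, 4]]                    # coarse, semantic scores
shallow = [[0, 1, 0, 1],
           [1, 0, 1, 0],
           [0, 1, 0, 1],
           [1, 0, 1, 0]]           # fine, appearance scores at twice the resolution
fused = fuse(deep, shallow)
print(fused)  # [[1, 2, 2, 3], [2, 1, 3, 2], [3, 4, 4, 5], [4, 3, 5, 4]]
```

The fused map keeps the coarse layout of the deep scores while the shallow scores modulate it at the finer resolution.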

Some Tricks

A fixed minibatch size and fixed learning rates
Dropout
Class balancing was found unnecessary (about three-fourths of the data is background)
Augmenting the training data by randomly mirroring and "jittering" the images, i.e. translating them by a limited number of pixels

OverFeat

Ideas

An integrated framework is presented that uses a convnet for recognition, localization and detection together: one network (the same framework, with a shared feature-learning base) serves several purposes, and the tasks benefit from each other
- Training a convolutional network to simultaneously classify, locate and detect objects in images can boost the classification accuracy and the detection and localization accuracy of all tasks.
Explains how a convnet can be used for localization and detection
- Use multiple scales and sliding windows to predict object boundaries
- Accumulate bounding boxes to increase detection confidence; this avoids training on background samples, so accuracy improves
Improved recognition and classification accuracy (mainly because of the differences within a class)
- Sliding windows at different scales
- Avoid view windows that contain only part of an object (these lead to poor localization and detection)
- The trained network not only learns the category distribution for each sliding window, but also predicts the location and size of the view window that contains an object
- Accumulate the evidence for each category at each location and size
How image segmentation is realized
- Compute the label of the center pixel of the view window
- Benefits: the semantic boundaries between objects can be determined (the objects do not need to be clearly delineated)
- Cons: requires dense pixel-level estimation

Model Design and Training

Downsampling: each image is downsampled so that its smallest dimension is a fixed number of pixels
Extract 5 random crops (and their horizontal flips) of size 221*221 pixels
Present these to the network in mini-batches
Initialize the network weights randomly with N(0, 0.01); the learning rate starts at 0.05 and is successively decreased by a factor of 0.5 after set numbers of epochs
Dropout with a rate of 0.5 is employed on the fully connected layers in the classifier
Rectification ("ReLU") non-linearity
No contrast normalization is used
Pooling regions are non-overlapping
While the sliding window approach is computationally prohibitive for certain types of model, it is inherently efficient in the case of convnets. Why?
- Instead of cutting the image into overlapping patches and feeding each patch separately, the entire image is fed as input, so the computation for the overlapping areas is shared rather than repeated
- In effect, the sliding window achieves the same result as processing patch by patch; the two are equivalent

Multi-scale classification (multi-view voting at each location and at multiple scales)

Apply the last subsampling operation at every offset (resolution augmentation)
- For the last pooling layer (subsampling ratio 3), pooling is applied at each offset, i.e. (0, 1, 2), which gives one pooled feature map per offset; the results are then interleaved and passed on to the classifier
Details:
For a single image, at a given scale, we start with the unpooled layer 5 feature maps
Each of the unpooled maps undergoes a 3*3 max pooling operation (non-overlapping regions), repeated 3*3 times for pixel offsets of {0,1,2}, which produces a set of pooled feature maps
The classifier (layers 6, 7, 8) has a fixed input size of 5*5 and produces a C-dimensional output vector for each location within the pooled maps. The classifier is applied in sliding-window fashion to the pooled maps
The output maps for different offset combinations are reshaped into a single 3D output map
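
The offset pooling step can be illustrated in one dimension. In this sketch the classifier is left out and the pooled maps are interleaved directly, which is a simplification of the procedure above; the function names and the input values are invented.

```python
def pool_with_offset(values, offset, size=3):
    """Non-overlapping max pooling that starts `offset` samples into the input."""
    shifted = values[offset:]
    return [max(shifted[i:i + size])
            for i in range(0, len(shifted) - size + 1, size)]


def interleave(maps):
    """Merge the per-offset maps so neighbouring outputs differ by one input step."""
    length = min(len(m) for m in maps)
    return [m[i] for i in range(length) for m in maps]


unpooled = [1, 4, 2, 8, 5, 7, 3, 9]
pooled_maps = [pool_with_offset(unpooled, offset) for offset in (0, 1, 2)]
combined = interleave(pooled_maps)
print(pooled_maps)  # [[4, 8], [8, 7], [8, 9]]
print(combined)     # [4, 8, 8, 8, 7, 9]
```

A single pooling pass would give only 2 outputs for these 8 inputs; using all three offsets gives 6, one for every window position, which is the resolution gain this step is after.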

Classification

Taking the spatial max for each class in each scale and flip
Averaging the resulting c-dimensional vectors from different scales and flips
Taking the top-n elements from the mean class vector, according to the different model version
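
The three steps above amount to a small reduction over the score maps. In the sketch below each "view" stands for one scale or flip and holds one class-score vector per spatial location; the scores, the number of classes and the function name are made up for illustration.

```python
def classify(views, n):
    """views: list of views, each a list of per-location class-score vectors."""
    num_classes = len(views[0][0])
    # Step 1: spatial max for each class within each view.
    per_view = [[max(location[c] for location in view)
                 for c in range(num_classes)] for view in views]
    # Step 2: average the resulting vectors over views.
    mean = [sum(v[c] for v in per_view) / len(per_view)
            for c in range(num_classes)]
    # Step 3: indices of the n highest mean scores.
    ranked = sorted(range(num_classes), key=lambda c: mean[c], reverse=True)
    return ranked[:n], mean


views = [
    [[0.1, 0.7, 0.2], [0.3, 0.4, 0.3]],   # view 1: two locations, three classes
    [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3]],   # view 2
]
top, mean = classify(views, n=2)
print(top)  # [1, 0]
```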

Localization

Combine the regression predictions together, along with the classification results
Since these share the same feature extraction layers, only the final regression layers need to be recomputed after computing the classification network
- Feed in the full image and slide the classification layers over it to obtain, at each position, a score for each object category.

Reference articles:

SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
Do Convnets Learn Correspondence?
Fully Convolutional Networks for Semantic Segmentation
OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.