Overview: CNN History (to be continued)


Contents

    • I. Basic knowledge
    • II. Early attempts
      • 1. Neocognitron, 1980
      • 2. LeCun, 1989
        • A. Overview
        • B. Feature Maps & Weight Sharing
        • C. Network Design
        • D. Experiments
      • 3. LeNet, 1998
    • III. Historic breakthrough: AlexNet, 2012
      • 1. Historic significance
      • 2. The difficulty
      • 3. Choosing CNN
      • 4. Contributions
      • 5. Network Design
        • A. ReLU
        • B. Training on multiple GPUs
        • C. Local Response Normalization
        • D. Overlapping Pooling
      • 6. Suppressing overfitting
        • A. Data Augmentation
        • B. Dropout
      • 7. Discussion
    • IV. Network deepening
      • 1. VGG Net, 2014
      • 2. MSRA-Net, 2015
I. Basic knowledge

Before surveying CNN history, make sure you can answer the following questions:

Basic knowledge of machine learning:

    1. What is "full connectivity"?
    2. What are the respective functions of forward propagation and reverse propagation? Why do I need to reverse-propagate?
    3. What is a perception machine? What is the fatal flaw?
    4. What are the common activation functions? What features are "activated"? (Non-linear, most directly solves or problems)
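
A hedged illustration of point 4 (the weights below are hand-picked for illustration, not taken from any paper discussed here): a single linear perceptron cannot represent XOR, but one hidden layer with a ReLU nonlinearity can.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Hand-picked (hypothetical) weights: XOR(x1, x2) = relu(x1 + x2) - 2 * relu(x1 + x2 - 1)
W1 = np.array([[1.0, 1.0],    # hidden unit 1 computes x1 + x2
               [1.0, 1.0]])   # hidden unit 2 computes x1 + x2 (shifted by its bias)
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])    # output = h1 - 2 * h2

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = relu(W1 @ np.array(x, dtype=float) + b1)
    print(x, "->", int(w2 @ h))   # prints 0, 1, 1, 0
```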

The basics of deep learning:

    1. What are the common loss functions? It is best to know how each is derived (for example, from maximum likelihood).
    2. What are the common optimization methods? What are their pros and cons?
    3. What is the simplest way to suppress overfitting?
    4. What is generalization ability?
    5. What is the purpose of dropout? How is it implemented? Is it applied at test time?
    6. What is gradient vanishing? What are the solutions? (RNN→LSTM; sigmoid/tanh→ReLU; batch normalization; the auxiliary losses in GoogLeNet; etc.)
    7. What is the purpose of batch normalization? How is it implemented? (See the sketch after this list.)
    8. What is gradient explosion? What are the solutions?
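
For question 7, a minimal sketch of batch normalization at training time (simplified: the running statistics used at test time and the learning of gamma/beta are omitted; the names and shapes are illustrative):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Normalize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta               # learnable scale and shift

x = np.random.randn(8, 4) * 3.0 + 2.0
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))  # roughly 0 and 1 per feature
```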

CNN Basics:

    1. How is discrete convolution implemented?
    2. Suppose a single-channel 28x28 image is convolved with 3 filters of size 5x5 and stride 1. What is the size of the resulting convolution layer? (3x24x24; see the sketch after this list.)
    3. Compared with a fully connected network, which two innovations make a CNN easier to train and able to learn higher-level features? (Sparse connections and parameter sharing.)
    4. What is the receptive field? Is it true that the deeper the layer, the larger the receptive field of its units?
    5. What is transposed convolution? (See my blog and its reference links.)
    6. What is pooling? What properties do parameter sharing and max pooling each give a CNN?
      (Translational equivariance and translational invariance, respectively. If we pool over the outputs of separately parameterized convolutions, other invariances, such as rotation invariance, can also be introduced.)
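
A quick check of question 2 (a sketch; with no padding, the output size is (input − filter) / stride + 1):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)   # one single-channel 28x28 image (batch size 1)
conv = nn.Conv2d(in_channels=1, out_channels=3, kernel_size=5, stride=1)
print(conv(x).shape)            # torch.Size([1, 3, 24, 24]), since (28 - 5) / 1 + 1 = 24
```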

The outline below follows the Chinese Academy of Sciences slides "Recent Progress and Practical Tips for CNNs";
the concrete content is the author's own summary after reading the papers.

Ii. Early attempts

The evolutionary lineage of CNNs:

1. Neocognitron, 1980

In 1959, Hubel and Wiesel described the functional division of the visual cortex:

Inspired by this, Kunihiko Fukushima proposed the Neocognitron in 1980, which can be used for pattern recognition tasks such as handwritten digit recognition.

Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position

Innovation: Neocognitron uses two types of cells, simple cells and complex cells, arranged to work in a cascade.
The former extract features directly; the latter tolerate distortions of those features, such as translation, which gives Neocognitron translational invariance.

2. LeCun, 1989

A. Overview

Yann LeCun was the first to apply CNNs in computer vision (to the handwritten zip code recognition problem).

Backpropagation Applied to Handwritten Zip Code Recognition

We know that introducing prior knowledge, i.e. introducing constraints, can improve a network's generalization ability.
So how should we introduce it?
LeCun argued that the most basic principle for introducing priors is: while keeping the computational capacity unchanged, reduce redundant parameters as far as possible.
The interpretation of this principle is that it reduces entropy and reduces the VC dimension, and therefore improves generalization.

Personal understanding: for the same problem, having more parameters means either that the problem is more complex or that prior knowledge is insufficient, so the hypothesis space is large.
If the computational capacity is guaranteed not to drop, minimizing the number of parameters is equivalent to introducing as much prior knowledge as possible and shrinking the hypothesis space.

Now the question is: how do we actually do that?
LeCun's answer: by constructing a special network architecture.

B. Feature Maps & Weight Sharing

The most significant differences from previous recognition work are:

    • The network is end-to-end: it uses low-level information directly rather than starting from hand-crafted feature vectors.
    • All connections in the network are adaptive, whereas in earlier work the connections of the first few layers were chosen by hand.

Specific design:

First, the network inherits the classic feature-extraction scheme:
extract local features, combine them, and then form higher-order features.
The advantages of this scheme were already demonstrated in earlier work.

A target object has many distinct characteristics, so we should:
use a set of feature detectors to detect these characteristics.

The same feature may appear at any location, so:
we can use the weight-sharing scheme proposed by Rumelhart et al. in 1986 to detect that feature at every location at a small computational cost.

Combining the two points above, the first hidden layer should consist of several feature maps; parameters are shared within each plane, and each plane represents one feature.
Because the exact position of a feature is less important, this layer can be smaller than the input layer. A rough comparison of parameter counts is sketched below.
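
A rough, hypothetical comparison to make the point concrete: a feature map shares one small kernel across the whole plane, so its parameter count is tiny next to a fully connected layer producing an output of the same size.

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)  # 6 feature maps, shared 5x5 kernels
fc = nn.Linear(28 * 28, 6 * 24 * 24)                            # fully connected layer with the same output size

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(conv))  # 6 * (5*5 + 1) = 156 parameters
print(count(fc))    # (28*28 + 1) * 3456 = 2,712,960 parameters
```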

C. Network Design

D. Experiments
    1. The nonlinear activation is a scaled hyperbolic tangent (a symmetric function), which converges faster; however, when the input to the function is very large or very small, learning becomes very slow. (A sketch of this activation follows the list.)
    2. The loss function is MSE.
    3. The output units are sigmoidal, which keeps the loss from becoming too large (and the gradients along with it).
    4. The target outputs are placed in the quasi-linear region of the sigmoid, so that gradients are not taken from the flat, saturated part of the sigmoid.
    5. Weights are randomly initialized in a small range, with the same intent.
    6. Stochastic gradient descent is used, which converges faster.
    7. A quasi-Newton method adjusts the learning rate, making the descent more reliable.
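
A sketch of the scaled hyperbolic tangent from point 1. The constants 1.7159 and 2/3 are the ones commonly quoted for LeCun's networks; they are an assumption here, not stated in the text above. They make f(±1) ≈ ±1, which keeps targets of ±1 in the quasi-linear region (point 4).

```python
import numpy as np

A, S = 1.7159, 2.0 / 3.0          # commonly quoted constants for LeCun's scaled tanh (assumed)

def scaled_tanh(a):
    return A * np.tanh(S * a)

print(scaled_tanh(1.0))           # ~= 1.0, so a target of +1 sits in the quasi-linear region
print(scaled_tanh(-1.0))          # ~= -1.0
print(scaled_tanh(10.0))          # ~= 1.716, the saturated region where learning is slow
```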

The final result: state of the art in digit recognition at the time, of course~

3. LeNet, 1998

This is a heavily cited review paper that summarizes the handwritten digit recognition methods of its time.

Gradient-Based Learning Applied to Document Recognition

The paper mentions many interesting pieces of history:

In the past, feature extraction came first and could even be entirely hand-crafted.
The reason is that earlier classification tasks were relatively simple and the gaps between categories large, so the hypothesis space was relatively small.
The accuracy of a classifier therefore depended largely on how well the low-dimensional feature space was fitted, which made feature extraction especially important.

However, with better computer performance, growing databases, and ever more powerful deep learning techniques, designers no longer need to craft feature extractors carefully; they can rely more on real data and construct sufficiently complex hypotheses.

The review, which is 46 pages long, focuses on CNN's classic architecture: LeNet-5.

LeNet-5 has 7 layers:

    1. The input layer is larger than any single character, to ensure that the distinctive parts of each character can appear at the center of the receptive fields of the highest-level (seventh-layer) feature detectors.
    2. The middle five layers are: convolution layer → subsampling layer → convolution layer → subsampling layer → convolution layer.
    3. The first convolutional layer uses 6 filters, so it produces feature maps with 6 channels.
    4. The second convolutional layer rises to 16 channels. Each of these channels is connected to a different subset of the previous 6 channels (see the connection table in the paper); the purpose is to break symmetry and force each channel to learn different (ideally complementary) features.
    5. In the fully connected layers, features are combined by inner products followed by nonlinear activations.
    6. Finally, the output layer: the 10 digit classes correspond to 10 output units, each computing the Euclidean distance between the output vector and that class's reference vector.
      The reference parameters are either +1 or -1, mainly because each element of the output vector is squashed by a sigmoid whose range is [-1, +1].
      For example, the distance between [1, -1, ..., -1] and [1, -1, ..., -1] is 0, while the (squared) distance between [-1, 1, -1, ..., -1] and [1, -1, ..., -1] is 8.

Statistically, if the true probability model is Gaussian, then for linear regression, maximizing the log-likelihood with respect to W is equivalent to minimizing the MSE.
The output layer is a linear regression problem, and a Gaussian distribution is the simplest prior we can assume. See the book Deep Learning for details.

So why not simply classify with a one-hot code over the digits 0-9?
First, the reasons above show that distributed codes are well suited to distance-based measurement;
second, experiments show that when there are many categories, one-hot codes work poorly, because driving most outputs to 0 while only one output is nonzero is very difficult;
third, if the input is not a character at all, a one-hot code makes it harder to reject the input.
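
Putting the description above together, here is a minimal PyTorch sketch of LeNet-5. It is an approximation for illustration only: average pooling stands in for the original trainable subsampling, the partial connection table of the second convolution is ignored, and the Euclidean/RBF output layer is replaced by a plain linear layer.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),     # C1: 6 feature maps, 28x28
            nn.AvgPool2d(2),                               # S2: subsample to 14x14
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),    # C3: 16 feature maps, 10x10
            nn.AvgPool2d(2),                               # S4: subsample to 5x5
            nn.Conv2d(16, 120, kernel_size=5), nn.Tanh(),  # C5: 120 feature maps, 1x1
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84), nn.Tanh(),                 # F6
            nn.Linear(84, 10),                             # output: 10 digit classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)           # torch.Size([1, 10])
```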

III. Historic breakthrough: AlexNet, 2012

1. Historic significance

AlexNet competed in the ImageNet Large Scale Visual Recognition Challenge in 2012.
The network achieved a top-5 error of 15.3%, more than 10.8 percentage points lower than that of the runner-up.
As a result, AlexNet ignited the craze for deep learning, which is why we call it a historic breakthrough.

ImageNet Classification with Deep Convolutional Neural Networks

2. The difficulty

The most direct problem AlexNet solves is ImageNet image classification: a training set of 1.2 million high-resolution images, 1000 categories, and a network with 60 million parameters and 650,000 neurons. The complexity of the problem is easy to imagine.
Training such a large network is obviously difficult.

The beginning of the paper also illustrates the complexity of the problem: even a dataset as large as ImageNet cannot fully specify such a complex problem.

In other words, the hypothesis space still has a large unknown region.

Therefore, we need a lot of prior knowledge to compensate for data that is not in the training set.

3. Choosing CNN

CNNs are an ideal model because they make strong and largely correct assumptions about natural images (and hence test images), namely stationarity of statistics (translational invariance) and locality of pixel dependencies.

Most importantly, it has a relatively low computational cost.

4. Contributions

The main contributions of this paper are:

    1. The proposed network structure achieved amazing results in the image classification competition.
    2. A highly optimized GPU implementation of 2D convolution was written and made public. It sits close to the hardware, however, so beginners are advised to start from Caffe.
    3. Many measures were taken to suppress overfitting: although the dataset is large, there are many parameters, and overfitting is still severe.
    4. The network contains only 5 convolutional layers, which account for less than 5% of the parameters, yet they are indispensable (performance degrades without them); the key finding is that depth matters.
    5. The current work is limited only by the hardware of the time; there is plenty of room for improvement.

5. Network Design

Ranked by importance, the network has the following innovations:

A. ReLU

The tanh and sigmoid activation functions used previously have saturation zones.
Replacing them with the non-saturating ReLU makes convergence several times faster than tanh!

Experiment: the same 4-layer convolutional network is trained until it reaches a 25% training error rate on CIFAR-10, comparing ReLU (solid line in the paper's figure) with tanh (dashed line).
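
A small sketch of why saturation matters for learning speed (illustrative values only): the gradient of tanh nearly vanishes for large inputs, while ReLU keeps a gradient of exactly 1 for any positive input.

```python
import numpy as np

x = np.array([-5.0, -1.0, 0.5, 5.0])

tanh_grad = 1.0 - np.tanh(x) ** 2        # ~0.0002 at |x| = 5: the saturation zone
relu_grad = (x > 0).astype(float)        # exactly 1 for every positive input

print(tanh_grad.round(4))  # [0.0002 0.42   0.7864 0.0002]
print(relu_grad)           # [0. 0. 1. 1.]
```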

B. Training on multiple GPUs

Using 2 GPUs together most directly speeds up training.
The authors also tried a single-GPU version with the same number of parameters; it was slightly slower than the two-GPU network.

Second, the two GPUs actually communicate with each other at certain layers. This interaction improves accuracy by more than 1.2% compared with having them work independently.

Finally, GPUs at the time were not powerful enough, with only 3 GB of video memory each.
This collaborative approach can break a large task into smaller ones.

C. Local Response Normalization

Real neurons exhibit a lateral inhibition effect:
simply put, within a small region, if one neuron is activated, nearby neurons are relatively suppressed.

In other words, this is a mechanism that promotes local competition among neurons, though the biological analogy is admittedly a bit far-fetched.

Inspired by this, AlexNet applies Local Response Normalization after the nonlinear activation of certain layers. The formula is

\( b_{x,y}^i = a_{x,y}^i \Big/ \left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left( a_{x,y}^j \right)^2 \right)^{\beta} \)

Here \( a_{x,y}^i \) is the original ReLU output at position \( (x, y) \) of the \( i \)-th kernel, and \( N \) is the total number of kernels.
The inhibition occurs between different kernels: the sum runs over the \( n \) channels around the current one, with the kernel index clipped to the upper limit \( N-1 \) and the lower limit \( 0 \).
Clearly, if the neighboring \( a_{x,y}^j \) are large while \( a_{x,y}^i \) is small, then \( b_{x,y}^i \) becomes even smaller after the division; in the opposite case the effect is minor.
This widens the gap between the "rich" and the "poor".
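
A sketch of LRN in PyTorch. The hyperparameters below (n = 5, k = 2, α = 1e-4, β = 0.75) are the values reported in the AlexNet paper; the input shape matches AlexNet's first convolutional output and is used here only for illustration.

```python
import torch
import torch.nn as nn

x = torch.relu(torch.randn(1, 96, 55, 55))                        # ReLU outputs of a layer with 96 kernels
lrn = nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0)  # normalize across 5 neighboring channels
print(lrn(x).shape)                                               # unchanged shape: torch.Size([1, 96, 55, 55])
```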

The authors report that this trick improves accuracy by more than 1.2%.

But according to feedback from experienced practitioners, this trick is basically ineffective.

D. Overlapping Pooling

Experiments show that overlapping pooling suppresses overfitting somewhat better, improving accuracy by about 0.4% and 0.3% (top-1 and top-5, respectively).
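
A sketch of the difference: ordinary pooling uses a stride equal to the window size, while AlexNet's overlapping pooling uses a 3x3 window with stride 2, so neighboring windows overlap (the output size here happens to be the same).

```python
import torch
import torch.nn as nn

x = torch.randn(1, 96, 55, 55)
non_overlapping = nn.MaxPool2d(kernel_size=2, stride=2)   # windows do not overlap
overlapping = nn.MaxPool2d(kernel_size=3, stride=2)       # stride < kernel size, so windows overlap

print(non_overlapping(x).shape)   # torch.Size([1, 96, 27, 27])
print(overlapping(x).shape)       # torch.Size([1, 96, 27, 27])
```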

6. Suppressing overfitting

A. Data Augmentation

The simplest way to suppress overfitting is label-preserving transformations.
Simply put, apply transformations to the image that do not change the identity of the object, thereby enlarging the dataset.

The network uses simple transformations that require no storage and run on the CPU, so they do not interfere with the GPU that is busy computing the previous batch.

    1. Mirror (horizontal flip) transformations;
    2. Transformations of image intensity and color.

The second point, specifically:

    • First, collect the values of the three RGB channels;
    • apply principal component analysis to these RGB values to obtain the principal components;
    • then add to the image a linear combination of the three components with random coefficients (see the sketch after this list).
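
A hedged sketch of this PCA color augmentation (often called "fancy PCA"). The 0.1 standard deviation follows the AlexNet paper's description, but the function and array names are an illustrative re-implementation, not the authors' code.

```python
import numpy as np

def pca_color_augment(image, all_pixels, sigma=0.1):
    """image: (H, W, 3) float array; all_pixels: (N, 3) RGB values gathered from the training set."""
    cov = np.cov(all_pixels, rowvar=False)           # 3x3 covariance of RGB values
    eigvals, eigvecs = np.linalg.eigh(cov)           # principal components of RGB space
    alphas = np.random.normal(0.0, sigma, size=3)    # random coefficients, drawn once per image
    shift = eigvecs @ (alphas * eigvals)             # linear combination of the components
    return image + shift                             # add the same RGB shift to every pixel

train_pixels = np.random.rand(10000, 3)              # stand-in for real training-set pixels
img = np.random.rand(224, 224, 3)
print(pca_color_augment(img, train_pixels).shape)    # (224, 224, 3)
```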

Personally, I suspect the principal component analysis also reduces the amount of data involved~

The above scheme increases the accuracy by at least 1% while suppressing overfitting.

B. Dropout

If several different models cooperate on the prediction, the generalization error is reduced effectively.
The problem is that training multiple models is computationally expensive.

Dropout offers a new idea: let these models share the same weights, while differing in which neurons produce output.
In this way we effectively obtain many models.

Specifically, each hidden neuron's output is zeroed with 50% probability.
Thus, on each backpropagation pass, the loss is computed by a different model.

In general, dropout breaks up dependencies between neurons, forcing the network to learn more robust connections.

It is used only in the fully connected layers, because that is where most of the connections are.
Dropout (the random zeroing) is not applied during the testing phase; a sketch of the train/test behavior follows below.

Dropout prolongs convergence, but it effectively suppresses overfitting.
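
A minimal sketch of the behavior described above. The 50% zeroing probability and the test-time halving of activations follow the AlexNet paper's description; modern "inverted dropout" instead rescales at training time, but the effect is equivalent in expectation.

```python
import numpy as np

def dropout(x, p=0.5, training=True):
    """AlexNet-style dropout: zero each activation with probability p at training time,
    and scale activations by (1 - p) at test time so the expected input stays the same."""
    if training:
        mask = (np.random.rand(*x.shape) >= p).astype(x.dtype)
        return x * mask              # a different sub-model on every forward/backward pass
    return x * (1.0 - p)             # test time: no zeroing, just rescaling

h = np.ones(8)
print(dropout(h, training=True))     # roughly half the activations are zeroed
print(dropout(h, training=False))    # all 0.5: deterministic at test time
```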

7. Discussion

The authors found that removing any middle layer of the network reduces overall performance by about 2%, so depth is very important.

To simplify computation, the authors did not use unsupervised pre-training,
although pre-training might improve performance without requiring more data.

Finally, a deep network can be used for video to take advantage of temporal correlation.

IV. Network deepening

1. VGG Net, 2014

The depth of VGG was unprecedented, with more than ten convolutional layers plus 3 fully connected layers.
VGG won the ImageNet 2014 localization task and was runner-up in classification, and its generalization ability is very good.

Very Deep Convolutional Networks for Large-Scale Image Recognition

Important contribution: exploring the trade-off between depth and filter size, and finding that depth is far more important than filter size.
To study this, the filter size is fixed at 3x3 while the network depth is progressively increased; a small calculation after this paragraph shows why 3x3 is an attractive fixed choice.
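
A quick sketch of the argument from the VGG paper: two stacked 3x3 convolutions cover the same 5x5 receptive field as one 5x5 convolution, but with fewer parameters and an extra nonlinearity. The channel count below is illustrative.

```python
import torch.nn as nn

C = 64  # an illustrative channel count

one_5x5 = nn.Conv2d(C, C, kernel_size=5, padding=2)
two_3x3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(one_5x5))   # 25 * C^2 + C      = 102,464
print(count(two_3x3))   # 2 * (9 * C^2 + C) =  73,856
```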

To ensure a fair comparison, VGG Net inherits AlexNet's configuration (ReLU, etc.) but discards LRN, considering that trick not only ineffective but also wasteful.

The following table lists several of the experimental configurations, with depth increasing from left to right for comparison:

As the table below shows, deeper networks perform better, and multi-scale training improves the results further:

The following table shows the comparison of test results for various networks at that time:

2. MSRA-Net, 2015

Convolutional Neural Networks at Constrained Time Cost

Controlling time cost is important, especially for online real-time tasks, multi-user requests, processing on mobile devices, and so on.
The focus of this paper is time cost control. The author notes:

Most of the recent advanced CNNs are more time-consuming than Krizhevsky et al.'s original architecture in both training and testing.
The increased computational cost can be attributed to the increased width (numbers of filters), depth (number of layers), smaller strides, and their combinations.

In the experiments, time cost is a fixed budget. So when network depth increases, the number or size of the filters must be reduced.
In other words, the experiments in this paper follow a trade-off principle.

The experiments in this paper have obtained the following important conclusions:

    1. Network depth is the core factor for accuracy; it matters more than the other parameters and should be favored in the trade-off.
      This conclusion is not trivial: other papers simply stack more layers, whereas this paper makes the trade-off under a fixed time cost.
    2. When network depth is increased too far, the network degrades, even without any trade-off.
      This phenomenon foreshadows Kaiming He's innovation in ResNet.

The most direct result: using only 40% of AlexNet's complexity and running 20% faster on the GPU, the author achieves classification accuracy 4.2% higher than AlexNet.

