"Convolutional neural Networks-evolutionary history" from Lenet to Alexnet


This blog is "convolutional neural network-evolutionary history" of the first part of "from Lenet to Alexnet"

If you want to reprint, please attach this article link: http://blog.csdn.net/cyh_24/article/details/51440344

For more related posts, see: http://blog.csdn.net/cyh_24

This series of posts is an expanded introduction to Dr. Melody's report on recent advances in, and practical tips for, CNNs.

It mainly discusses the development of CNNs and, drawing on Dr. Melody's ideas, gives a more detailed introduction to that development, organized around the history of CNNs:

As shown in the history of CNN structural evolution summarized by Dr. Melody, the starting point is the neocognitron model, in which the convolution structure had already appeared, and the classic LeNet was born in 1998. Later, however, CNNs were overshadowed by methods built on hand-crafted features combined with classifiers such as SVMs. With the advent of ReLU and dropout, and the historic opportunity presented by GPUs and big data, CNNs achieved a historic breakthrough in 2012: AlexNet.

The evolution path of CNN can be summarized in the following directions:

  • From LeNet to AlexNet
  • Evolutionary path I: deepening the network structure
  • Evolutionary path II: enhancing the convolution function
  • Evolutionary path III: from classification to detection
  • Evolutionary path IV: new functional modules

This series of posts will explain the most representative CNN model structures along these four paths of CNN development.

The Beginning of Everything (LeNet)

Below is the widely circulated LeNet network structure: small but perfectly formed. Convolutional layers, pooling layers, and fully connected layers are the basic components of modern CNNs.

    • Input size: 32*32
    • Convolutional layers: 3
    • Down-sampling (pooling) layers: 2
    • Fully connected layer: 1
    • Output: 10 categories (probabilities of the digits 0-9)

Because LeNet can be said to be the beginning of CNNs, here is a brief introduction to the purpose and meaning of each of its components.

Input (32*32)

The input image size is 32*32, which is larger than the largest character in the MNIST database (28*28). The purpose is that potentially salient features, such as stroke endpoints and corners, can appear at the center of the receptive field of the highest-level feature detectors.

C1, C3, C5 (convolutional layers)

The convolution kernel slides over the two-dimensional plane; at each position, every element of the kernel is multiplied by the pixel at the corresponding position of the image, and the products are summed. As the kernel keeps moving, we obtain a new image that consists entirely of these sums of products, one per kernel position.

The effect of two-dimensional convolution on an image is:
the output value at each pixel is the weighted sum of that pixel's neighborhood (the size of the neighborhood is the kernel size). A minimal sketch of this computation follows.
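
As a concrete illustration of this weighted-neighborhood sum, here is a minimal NumPy sketch (the function name and example kernels are my own, not from the original post; like most CNN libraries it does not flip the kernel, i.e. it computes cross-correlation):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 2-D 'valid' convolution: slide the kernel over the image and,
    at each position, sum the elementwise products of kernel and patch."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = image[y:y + kh, x:x + kw]
            out[y, x] = np.sum(patch * kernel)
    return out

# A 3*3 averaging kernel smooths (reduces noise), while an edge kernel
# enhances intensity changes -- two of the effects described above.
image = np.random.rand(8, 8)
smooth_kernel = np.ones((3, 3)) / 9.0
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]], dtype=float)
print(conv2d_valid(image, smooth_kernel).shape)  # (6, 6)
print(conv2d_valid(image, edge_kernel).shape)    # (6, 6)
```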

An important property of the convolution operation is that it can enhance the features of the original signal and reduce noise.

Different convolution kernels extract different features from an image (there are interactive demos of this online); applying different kernels yields different feature maps.

The C1 layer in detail: C1 is a convolutional layer with 6 convolution kernels (extracting 6 local features), each of size 5*5, and it outputs 6 feature maps of size 28*28. C1 has 156 trainable parameters (each filter has 5*5 = 25 weights plus one bias; with 6 filters in total, that is (5*5 + 1) * 6 = 156 parameters) and 156 * (28*28) = 122,304 connections.
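
To check these numbers, here is a small sketch using PyTorch (my own choice of framework; the original LeNet was of course not implemented this way) that builds a layer with C1's shape and counts its parameters:

```python
import torch
import torch.nn as nn

# C1 as described above: 1 input channel, 6 kernels of size 5*5, stride 1, no padding.
c1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)

print(sum(p.numel() for p in c1.parameters()))   # 156 = (5*5 + 1) * 6

x = torch.zeros(1, 1, 32, 32)                    # one 32*32 input image
print(c1(x).shape)                               # torch.Size([1, 6, 28, 28])

# Connections: each of the 6*28*28 output units uses 5*5 weights plus 1 bias.
print((5 * 5 + 1) * 6 * 28 * 28)                 # 122304
```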

S2, S4 (pooling layers)

S2 and S4 are down-sampling (pooling) layers; their purpose is to reduce the number of trainable parameters and the degree of overfitting. Pooling/sampling is usually done in one of the following two ways:

    1. max-pooling: select the maximum value in the pooling window as the sampled value;
    2. mean-pooling: average all the values in the pooling window and take that average as the sampled value.

The S2 layer consists of 6 feature maps of size 14*14; each element of a map is connected to a 2*2 neighborhood in the previous layer, so each S2 feature map covers 1/4 the area of the corresponding C1 feature map. A sketch of both pooling modes follows.
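
Here is a minimal NumPy sketch of both pooling modes (the helper name is mine; note that the original LeNet subsampling also multiplies the window sum by a trainable coefficient and adds a bias, which is omitted here):

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling over size*size windows, as in LeNet's S2/S4."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            window = feature_map[y * size:(y + 1) * size,
                                 x * size:(x + 1) * size]
            out[y, x] = window.max() if mode == "max" else window.mean()
    return out

c1_map = np.random.rand(28, 28)               # one C1 feature map
s2_map = pool2d(c1_map, size=2, mode="mean")
print(s2_map.shape)                           # (14, 14): 1/4 of the C1 map area
```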

F6 (fully connected layer)

F6 is a fully connected layer, similar to a layer in an MLP, with 84 neurons (why this number? it is tied to the design of the output layer, whose parameter vectors are stylized 7*12 = 84-pixel character bitmaps). The 84 neurons are fully connected to the 120 units of the C5 layer, so the number of trainable parameters is (120 + 1) * 84 = 10,164.
Like a classical neural network, the F6 layer computes the dot product between its input vector and a weight vector, adds a bias, and passes the result through a sigmoid function to produce the state of unit i.
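
A quick sketch of that computation (the weights here are random placeholders; the original LeNet actually uses a scaled tanh squashing function, but the sigmoid follows the description above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W = rng.standard_normal((84, 120)) * 0.01   # 84 units, each connected to C5's 120 outputs
b = np.zeros(84)                            # plus 84 biases -> (120 + 1) * 84 = 10164 parameters

c5_output = rng.standard_normal(120)
f6_state = sigmoid(W @ c5_output + b)       # dot product + bias, then the squashing function
print(f6_state.shape)                       # (84,)
print(W.size + b.size)                      # 10164
```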

Output (output layer)

The output layer consists of Euclidean radial basis function (RBF) units, one unit per class, each with 84 inputs.
In other words, each output RBF unit computes the Euclidean distance between its input vector and its parameter vector: the farther the input is from the parameter vector, the larger the RBF output. In probabilistic terms, the RBF output can be understood as the negative log-likelihood of a Gaussian distribution in the configuration space of F6. Given the loss function, training should bring the configuration of F6 close enough to the RBF parameter vector that corresponds to the pattern's expected class.
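
A minimal sketch of such an output layer (the prototype values here are random placeholders; in LeNet they are fixed -1/+1 bitmaps of stylized characters):

```python
import numpy as np

def rbf_output(f6_state, prototypes):
    """Output unit i returns the squared Euclidean distance between the
    84-dimensional F6 state and the parameter (prototype) vector of class i."""
    return np.sum((prototypes - f6_state) ** 2, axis=1)

rng = np.random.default_rng(0)
prototypes = rng.choice([-1.0, 1.0], size=(10, 84))  # one 84-d vector per digit class
f6_state = rng.standard_normal(84)

y = rbf_output(f6_state, prototypes)
print(y.shape)              # (10,)
print(int(np.argmin(y)))    # predicted class = the closest prototype (smallest distance)
```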

The Return of the King (AlexNet)

AlexNet can be said to be a historic network structure; before AlexNet, deep learning had been silent for a long time. In the watershed year of 2012, AlexNet won the ImageNet image classification competition with a top-5 error rate about 10 percentage points lower than the previous year's winner, far ahead of that year's runner-up.

AlexNet succeeded, and deep learning returned to the historical stage, because of:

  1. A non-linear activation function: ReLU
  2. Methods to prevent overfitting: dropout, data augmentation
  3. Big-data training: millions of ImageNet images
  4. Others: GPU implementation, use of the LRN normalization layer

Here is a brief look at some of AlexNet's details:

Data Augmentation

There is a view that neural networks are fed by data: increasing the training data improves the algorithm's accuracy because it helps avoid overfitting, and by avoiding overfitting you can also afford a larger network structure. When training data is limited, transformations can be applied to the existing training set to generate new data and enlarge the training set.

Some of the simplest and most common ways to transform image data are:

  1. Randomly crop smaller images (224*224) from the original image (256*256). "Translation transform, crop"
  2. Flip the image horizontally. "Reflection transform, flip"
  3. Add random illumination or color perturbations to the image. "Lighting and color transform, color jittering"

AlexNet handled data augmentation well during training (a sketch of these transforms follows the list below):

    • Random crop: at training time, each 256*256 image is randomly cropped to 224*224 and horizontal flipping is allowed, which multiplies the number of samples by roughly ((256-224)^2) * 2 = 2048.
    • At test time, five crops are taken (upper left, upper right, lower left, lower right, center) and each is also flipped, giving 10 crops in total, and the predictions are averaged. The authors note that without random cropping, large networks suffer from substantial overfitting.
    • PCA is performed on the RGB pixel values, and a Gaussian perturbation (mean 0, standard deviation 0.1) is applied along the principal components. This reduces the error rate by another 1%.
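
Here is a rough NumPy sketch of these transforms (function names and the toy image are mine; the PCA jitter in particular is a simplified reading of the paper's description):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop_and_flip(img, crop=224):
    """Training-time augmentation: random 224*224 crop from a 256*256 image,
    plus a random horizontal flip -> roughly (256-224)^2 * 2 = 2048 variants."""
    h, w, _ = img.shape
    top = rng.integers(0, h - crop)
    left = rng.integers(0, w - crop)
    patch = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]                  # horizontal flip
    return patch

def pca_color_jitter(img, std=0.1):
    """Color jittering: add noise along the principal components of the RGB values."""
    flat = img.reshape(-1, 3).astype(float)
    flat -= flat.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(flat, rowvar=False))
    alphas = rng.normal(0.0, std, size=3)       # Gaussian perturbation N(0, 0.1)
    shift = eigvecs @ (alphas * eigvals)        # one RGB offset added to every pixel
    return img.astype(float) + shift

img = rng.integers(0, 256, size=(256, 256, 3)).astype(np.uint8)
aug = pca_color_jitter(random_crop_and_flip(img))
print(aug.shape)   # (224, 224, 3)
```
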
ReLU activation function

Sigmoid is a commonly used non-linear activation function that "squashes" a real-valued input into the range (0, 1). In particular, very large negative inputs map to values near 0, and very large positive inputs map to values near 1.
But it has some fatal drawbacks:

  • Sigmoids saturate and kill gradients. Sigmoid has a fatal drawback: when the input is very large or very small, the function saturates and the gradient at those neurons approaches 0. During backpropagation the gradient must be multiplied by the derivative of the sigmoid at each layer, so if the initial values are poorly scaled the gradient shrinks layer by layer and the network becomes difficult to train.
  • The output of sigmoid is not zero-mean. This is undesirable because it causes neurons in the next layer to receive inputs that are not zero-mean.
    One consequence is that if all the inputs to a neuron are positive (e.g. elementwise x > 0 in f = w^T x + b), then the gradients computed for w will all have the same sign.
    Of course, if you train with mini-batches, different samples in a batch may contribute gradients of different signs, so the problem is mitigated. The non-zero-mean issue is therefore much less serious than the killed-gradients problem above, although it still has some bad effects.

The mathematical expression of ReLU is:
f(x) = max(0, x)

Clearly, for inputs < 0 the output is 0, and for inputs > 0 the output equals the input.

Alex replaced sigmoid with ReLU and found that SGD converged much faster with ReLU than with sigmoid/tanh.

This is mainly because ReLU is piecewise linear and non-saturating (its derivative is 1 for positive inputs): compared with sigmoid/tanh, which require relatively expensive operations to compute the activation, ReLU only needs a threshold. A small comparison is sketched below.
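
A tiny NumPy comparison of the two gradients (illustrative only) makes the saturation point concrete:

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)            # at most 0.25; nearly 0 for large |x| (saturation)

def relu_grad(x):
    return (x > 0).astype(float)    # exactly 1 for positive inputs, so no saturation there

x = np.array([-10.0, -1.0, 0.5, 10.0])
print(sigmoid_grad(x))   # [~4.5e-05  0.1966  0.2350  ~4.5e-05]
print(relu_grad(x))      # [0. 0. 1. 1.]
```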

For more information on activation functions, please see my other article on activation functions.

Dropout

Combining the predictions of many different pre-trained models is a very successful way to reduce test error (ensembling). But because training each model takes several days, this is too expensive for large neural networks.

However, AlexNet uses a very effective form of model combination that only takes about twice the training time of a single model. The technique, called dropout, sets the output of each hidden-layer neuron to zero with probability 0.5. Neurons "dropped out" in this way take part in neither the forward pass nor backpropagation.

So each time a sample is presented, the neural network effectively tries a new architecture, but all of these architectures share weights. Because a neuron cannot rely on the presence of particular other neurons, this technique reduces complex co-adaptations between neurons.

Because of this, the network is forced to learn more robust features that remain useful in combination with many different random subsets of the other neurons. At test time, we simply multiply every neuron's output by 0.5, which is a reasonable approximation to the geometric mean of the predictive distributions produced by the exponentially many dropout networks. A sketch of this scheme follows.
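
A minimal NumPy sketch of this train/test scheme (as described above; note that modern frameworks usually use "inverted" dropout, which scales at training time instead):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, p_drop=0.5, train=True):
    """Training: zero each hidden unit's output with probability 0.5.
    Test: keep all units and multiply their outputs by 0.5."""
    if train:
        mask = rng.random(activations.shape) >= p_drop   # keep with probability 0.5
        return activations * mask                        # dropped units contribute nothing forward or backward
    return activations * (1.0 - p_drop)                  # test-time scaling

h = rng.standard_normal(8)
print(dropout_forward(h, train=True))    # roughly half the units are zeroed
print(dropout_forward(h, train=False))   # all units kept, scaled by 0.5
```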

Multi-GPU Training

A single GTX 580 GPU has only 3 GB of memory, which limits the maximum size of the network that can be trained on it, so they distributed the network across two GPUs.
GPUs of that generation are particularly well suited to cross-GPU parallelization because they can read from and write to each other's memory directly, without going through host memory.

The parallelization scheme they used places half of the kernels (or neurons) on each GPU, with one extra trick: the GPUs communicate only at certain layers.

For example, the kernels of layer 3 take input from all the kernel maps of layer 2, but the kernels of layer 4 take input only from those layer-3 kernel maps that reside on the same GPU. A sketch of this connectivity pattern follows.
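
In modern single-GPU code this connectivity is usually emulated with grouped convolutions; here is a PyTorch sketch under that assumption (channel counts follow the AlexNet paper, but this is not the original two-GPU implementation):

```python
import torch
import torch.nn as nn

# Layer 2 -> layer 3: the GPUs communicate, so every kernel sees all 256 input maps.
conv3 = nn.Conv2d(in_channels=256, out_channels=384, kernel_size=3, padding=1, groups=1)

# Layer 3 -> layer 4: groups=2 splits the 384 input maps into two halves of 192
# that never mix, like kernels that only see the maps on their own GPU.
conv4 = nn.Conv2d(in_channels=384, out_channels=384, kernel_size=3, padding=1, groups=2)

x = torch.zeros(1, 256, 13, 13)
print(conv4(conv3(x)).shape)                       # torch.Size([1, 384, 13, 13])

# The grouped layer also has roughly half the weights of an ungrouped one:
print(sum(p.numel() for p in conv3.parameters()))  # 384*256*3*3 + 384
print(sum(p.numel() for p in conv4.parameters()))  # 384*(384//2)*3*3 + 384
```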

Local Response Normalization

In a nutshell: essentially, this layer is also designed to prevent saturation of the activation function.

My personal understanding of the principle is that normalization keeps the input of the activation function near the middle of the "bowl" (away from saturation), thereby obtaining larger derivatives.

So, functionally speaking, LRN seems redundant with ReLU.

However, the authors report that, judging from the test results, the LRN operation improves the network's generalization ability and reduces the error rate by about 1%. A sketch of the operation follows.
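
For reference, here is a NumPy sketch of the cross-channel LRN formula from the AlexNet paper, using the paper's hyperparameters (k=2, n=5, alpha=1e-4, beta=0.75):

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """b[i] = a[i] / (k + alpha * sum_{j near i} a[j]^2) ** beta,
    where the sum runs over n neighboring channels at the same spatial position."""
    channels = a.shape[0]                        # a has shape (channels, height, width)
    b = np.empty_like(a)
    for i in range(channels):
        lo, hi = max(0, i - n // 2), min(channels, i + n // 2 + 1)
        denom = (k + alpha * np.sum(a[lo:hi] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

a = np.random.rand(96, 55, 55)                   # e.g. activations after AlexNet's first conv layer
print(local_response_norm(a).shape)              # (96, 55, 55)
```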

AlexNet's advantages: a larger network (5 convolutional layers + 3 fully connected layers + 1 softmax layer), measures against overfitting (dropout, data augmentation, LRN), and multi-GPU accelerated computation.

"Convolutional neural Networks-evolutionary history" from Lenet to Alexnet

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.