AlexNet (CNN)


Original paper: ImageNet Classification with Deep Convolutional Neural Networks

I. Limitations of LeNet

For a long time, LeNet achieved the best results of its era, albeit on small-scale problems such as handwritten digit recognition, but it never became a major success. The main reason is that LeNet performed poorly on large-scale images, such as understanding the content of natural photographs, so it did not receive enough attention in the computer vision community. This is exactly where AlexNet comes in; beyond scaling up the network, AlexNet also introduced GPUs to boost computing power.

II. AlexNet Structure

1. ReLU

The commonly used activation functions, sigmoid and tanh, saturate when x is very large or very small, whereas ReLU, max(0, x), is a non-saturating nonlinearity. During training, non-saturating functions converge faster than saturating ones. Moreover, this piecewise-linear function retains nonlinear expressive power while, thanks to its linear positive part, avoiding the vanishing-gradient problem caused by saturation: with tanh and sigmoid, the error signal shrinks as it is propagated backward, so even though the error at the top layers is large, the gradients reaching the lower layers are tiny, the weights of a deep network barely update, and training gets stuck in poor local optima. These properties of ReLU allow us to train deeper networks.
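To make the saturation point concrete, here is a small NumPy sketch (my own illustration, not from the paper) comparing the gradients of sigmoid, tanh, and ReLU for inputs of large magnitude:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grad_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # -> 0 as |x| grows (saturates)

def grad_tanh(x):
    return 1.0 - np.tanh(x) ** 2  # -> 0 as |x| grows (saturates)

def grad_relu(x):
    return (x > 0).astype(float)  # stays 1 for any positive x

x = np.array([-10.0, -1.0, 0.5, 10.0])
print("sigmoid grad:", grad_sigmoid(x))  # ~[4.5e-05, 0.20, 0.24, 4.5e-05]
print("tanh grad:   ", grad_tanh(x))     # ~[8.2e-09, 0.42, 0.79, 8.2e-09]
print("relu grad:   ", grad_relu(x))     # [0, 0, 1, 1]
```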

From a biological point of view, ReLU satisfies three major characteristics of biological neuron activation: ① unilateral inhibition; ② a relatively wide excitation boundary; ③ sparse activation (only about 1~4% of the brain's neurons are active at the same time; each neuron responds selectively to only a small portion of its input signals and deliberately shields a large number of them, which improves the precision of learning and extracts sparse features more quickly).

By contrast, softplus, proposed in the same period, has the first two characteristics but is slightly inferior in sparse activation.

On the idea of sparsity:

1. Information disentanglement: a clear goal of deep learning is to extract the key factors of variation from the data. Raw data, especially natural data, usually comes with highly dense, entangled features. The reason is that these feature dimensions are interrelated: a small change in one key factor can perturb a whole cluster of features, a bit like the butterfly effect. Traditional machine learning methods built on mathematical principles have a fatal weakness in disentangling such correlated features. If, however, the complex relationships between features can be untangled and converted into sparse features, the features become robust (irrelevant noise is removed).

2. Linear separability: sparse features are more likely to be linearly separable, or at least depend less on nonlinear mapping mechanisms. Because sparse features are (automatically) mapped into a high-dimensional feature space, from the standpoint of manifold learning (see denoising autoencoders) they move onto a relatively clean low-dimensional manifold surface. Linear separability also fits naturally sparse text data, which can be separated quite well even without any hidden layers.

3. Dense distribution but sparse representation: densely entangled features carry the richest information; in terms of potential, they are often many times more effective than features carried by a few local points. Sparse features are precisely what is disentangled from this densely entangled region, and therefore have great potential value.


Potential problems:

How reasonable is it to force sparsity by setting activations to zero?

Admittedly, sparsity has many advantages. However, excessively forced sparsity reduces the effective capacity of the model: too many features are masked out, and the model can no longer learn effective features. The paper experiments with the degree of sparsity and finds that the ideal proportion of forced zeros is 70%~85%; beyond 85%, network capacity becomes a problem and the error rate rises sharply. Compared with the roughly 95% sparsity at which the brain operates, there is still a big gap between today's computational neural networks and biological ones. Fortunately, ReLU only zeroes negative values, so the sparsity it introduces can be adjusted by training and changes dynamically: as long as gradient descent pushes the network toward lower error, it automatically regulates the sparsity ratio and keeps a reasonable number of non-zero values along the active paths.
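As a rough illustration of what this sparsity ratio means (my own sketch, not an experiment from the paper), the fraction of activations that ReLU forces to zero can be measured directly:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sparsity(activations):
    """Fraction of units that are exactly zero after ReLU."""
    return float(np.mean(activations == 0.0))

# Pre-activations centered at zero give ~50% sparsity; a negative shift
# (hypothetical here) pushes the ratio toward the 70%~85% range cited above.
pre_act = np.random.randn(1000, 4096) - 0.8
print("sparsity: %.2f" % sparsity(relu(pre_act)))   # ~0.79
```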

Reference: Deep Sparse Rectifier Neural Networks

2. Local Response Normalization (LRN)

Motivation for local normalization: in neurobiology there is a concept called lateral inhibition, which refers to activated neurons suppressing their neighbors. Normalization serves exactly this "suppression" purpose; local response normalization borrows the idea of lateral inhibition to implement local suppression, which is especially useful when we are using ReLU.

Benefits: it improves generalization, acts as a smoothing operation, and raises the recognition rate by 1~2%.

The LRN layer mimics the lateral inhibition mechanism of biological nervous systems and creates competition among the activities of local neurons, making relatively large responses even larger and improving the generalization ability of the model.

The normalized response is computed as

b^i_{x,y} = a^i_{x,y} / ( k + α * Σ_{j = max(0, i−n/2)}^{min(N−1, i+n/2)} (a^j_{x,y})^2 )^β

Here k, n, α, β are hyperparameters, generally set to k=2, n=5, α=1e-4, β=0.75.

In this formula, a^i_{x,y} is the ReLU output of the i-th kernel at position (x, y), n is the number of adjacent kernel maps summed over at the same spatial position, and N is the total number of kernels in the layer.
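Below is a small NumPy sketch of this formula (my own illustration, not code from the paper), applied to a single image's activations laid out as (kernels, height, width):

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Apply the LRN formula above to activations of shape (N_kernels, H, W)."""
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

# Example: 96 ReLU feature maps of size 55x55 (the first AlexNet conv layer output)
a = np.maximum(0.0, np.random.randn(96, 55, 55))
print(local_response_norm(a).shape)   # (96, 55, 55)
```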

Reference: What is the Local Response Normalization in Convolutional Neural Networks?

Later work found LRN controversial: it provides essentially no benefit in practice; see Very Deep Convolutional Networks for Large-Scale Image Recognition.

3. Overlapping pooling

Usually we use ordinary pooling: if the pooling window size is size*size, then stride = size, so adjacent pooling windows do not overlap.

The overlapping pooling used here instead sets stride < size, so adjacent windows overlap, hence the name overlapping pooling.
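As a concrete illustration, AlexNet itself pools with window size 3 and stride 2; a short PyTorch sketch (my own, on a hypothetical 55x55 feature map) contrasts the two settings:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 96, 55, 55)                             # hypothetical conv1 output

non_overlapping = nn.MaxPool2d(kernel_size=2, stride=2)    # stride == size
overlapping     = nn.MaxPool2d(kernel_size=3, stride=2)    # stride < size (AlexNet)

print(non_overlapping(x).shape)   # torch.Size([1, 96, 27, 27])
print(overlapping(x).shape)       # torch.Size([1, 96, 27, 27]), windows share pixels
```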


In addition, there is a pooling variant called spatial pyramid pooling (SPP). Reference: Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition.

Spatial pyramid pooling transforms the convolutional features of an image of any scale into a representation of the same dimension. This not only lets a CNN handle images of arbitrary scale, but also avoids the information loss caused by cropping and warping operations, which is very significant.

An ordinary CNN requires a fixed input image size because the fully connected layers need a fixed input dimension, while the convolution operations place no limit on image scale. The authors therefore propose spatial pyramid pooling: run the convolutions on the image first, then transform the resulting features into a fixed dimension before feeding them to the fully connected layers. This extends the CNN to images of any size.

The idea of spatial pyramid pooling comes from the spatial pyramid model, which turns a single pooling operation into pooling at multiple scales. Applying pooling windows of different sizes to the convolutional features yields 1x1, 2x2, and 4x4 pooled outputs; since conv5 has 256 filters, this gives one 256-dimensional feature, four 256-dimensional features, and sixteen 256-dimensional features. These 21 256-dimensional features are then concatenated and fed into the fully connected layer, so images of different sizes are converted into features of the same dimension.

To obtain pooling results of the same size for different images, the pooled window size and stride must be computed dynamically from the size of the image. Suppose the conv5 output is a*a and an n*n pooled result is needed; the window size (sizeX) and stride are then derived from a and n. Take a conv5 output of 13*13 as an example.

Question: if the conv5 output is 14*14, then [pool1*1] with sizeX=stride=14 and [pool2*2] with sizeX=stride=7 cause no problem; however, for [pool4*4] with sizeX=5 and stride=4, the features in the last row and last column never fall into any pooling window.

SPP is a multi-scale pooling that captures multi-scale information in an image; by adding SPP to a CNN, the CNN can handle inputs of any size, which makes the model more flexible.
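A minimal PyTorch sketch of such an SPP layer (my own illustration; it follows the window/stride rule from the SPP paper, window = ceil(side/n) and stride = floor(side/n), with the 1x1 + 2x2 + 4x4 pyramid described above):

```python
import math
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(features, levels=(1, 2, 4)):
    """Pool conv features of shape (batch, C, H, W) into a fixed-length vector.

    For each pyramid level n, window = ceil(side/n) and stride = floor(side/n);
    the pooled maps are flattened and concatenated into (batch, C * sum(n*n)).
    """
    batch, channels, height, width = features.shape
    outputs = []
    for n in levels:
        win_h, win_w = math.ceil(height / n), math.ceil(width / n)
        str_h, str_w = math.floor(height / n), math.floor(width / n)
        pooled = F.max_pool2d(features, kernel_size=(win_h, win_w),
                              stride=(str_h, str_w))
        outputs.append(pooled.reshape(batch, -1))
    return torch.cat(outputs, dim=1)

# conv5 outputs of two different spatial sizes map to the same 21*256 = 5376 dims
print(spatial_pyramid_pool(torch.randn(1, 256, 13, 13)).shape)  # torch.Size([1, 5376])
print(spatial_pyramid_pool(torch.randn(1, 256, 18, 10)).shape)  # torch.Size([1, 5376])
```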


4. Dropout

Combining the predictions of many different models is a very successful way to reduce test error, but it appears too expensive for a large neural network that already takes several days to train. There is, however, a very efficient version of model combination that only costs about a factor of two during training. The recently introduced technique called "dropout" sets the output of each hidden neuron to zero with probability 0.5. Neurons that are "dropped out" in this way participate neither in the forward pass nor in back-propagation. So every time an input is presented, the neural network samples a different architecture, but all these architectures share weights. Because a neuron cannot rely on the presence of particular other neurons, this technique reduces complex co-adaptations among neurons: each neuron is forced to learn more robust features that are useful in combination with many different random subsets of the other neurons. At test time, we simply multiply the outputs of all neurons by 0.5, which is a reasonable approximation to the geometric mean of the predictive distributions produced by the dropout networks.
Dropout is used in the first two fully connected layers. Without dropout, the network exhibits substantial overfitting. Dropout roughly doubles the number of iterations required to converge.
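A minimal NumPy sketch of this scheme (my own illustration, following the train-time zeroing and test-time 0.5 scaling described above, rather than the "inverted dropout" used in most modern frameworks):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    """Zero each hidden unit with probability p during training;
    scale all outputs by (1 - p) at test time instead."""
    if training:
        keep_mask = rng.random(activations.shape) >= p   # keep with prob 1 - p
        return activations * keep_mask
    return activations * (1.0 - p)

h = rng.standard_normal(4096)                 # hypothetical fc-layer activations
train_out = dropout(h, training=True)
test_out = dropout(h, training=False)
print("fraction zeroed during training:", np.mean(train_out == 0.0))      # ~0.5
print("test output is the input scaled by 0.5:", np.allclose(test_out, 0.5 * h))
```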

Understanding: in each iteration, half of the nodes are selected to form the network, and forward and backward propagation run only over the connections among the selected nodes, in contrast to the fully connected form where every node participates.

Why does it help prevent overfitting? A simple explanation: training with dropout is equivalent to training many neural networks that each contain only half of the hidden units (call each one a "half network"). Each such half network can produce a classification result; some of these results are correct and some are wrong. As training progresses, the majority of the half networks give the correct classification, so the few wrong classifications no longer have a large impact on the final result.

5. Data Augmentation

For image data, the first and most common way to reduce overfitting is to enlarge the dataset using label-preserving transformations. Two different methods are used here; both require very little computation, so the transformed images do not need to be stored on disk and can be generated in memory.

(1) Image translation and horizontal flipping

(2) Altering the RGB pixel values
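A rough NumPy sketch of both methods (my own illustration; `eigvals` and `eigvecs` stand for an assumed, precomputed PCA of the RGB pixel values over the training set, which is how the paper alters RGB intensities):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop_and_flip(image, crop=224):
    """Method (1): random translation (crop) plus horizontal reflection."""
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    return patch[:, ::-1] if rng.random() < 0.5 else patch

def pca_color_jitter(image, eigvals, eigvecs, sigma=0.1):
    """Method (2): shift RGB values along the principal components of the
    training set's pixel covariance (eigvals/eigvecs assumed precomputed)."""
    alphas = rng.normal(0.0, sigma, size=3)
    shift = eigvecs @ (alphas * eigvals)        # 3-vector added to every pixel
    return image + shift

img = rng.random((256, 256, 3))                 # hypothetical training image
print(random_crop_and_flip(img).shape)          # (224, 224, 3)
```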

6. Training on Multiple GPUs

III. Overall Structure

The network contains eight weighted layers: the first five are convolutional and the remaining three are fully connected. The output of the last fully connected layer is fed to a 1000-way softmax, which produces a distribution over the 1000 class labels. The network maximizes the multinomial logistic regression objective, which is equivalent to maximizing the average log-probability of the correct label under the predicted distribution across training samples.

The kernels of the second, fourth, and fifth convolutional layers are connected only to those kernel maps of the previous layer that reside on the same GPU (see Figure 2). The kernels of the third convolutional layer are connected to all kernel maps of the second layer. The neurons in the fully connected layers are connected to all neurons in the previous layer. Response normalization layers follow the first and second convolutional layers. Max-pooling layers follow the response normalization layers as well as the fifth convolutional layer. The ReLU nonlinearity is applied to the output of every convolutional and fully connected layer.
The first convolutional layer filters the 224x224x3 input image with 96 kernels of size 11x11x3 at a stride of 4 pixels, which is the distance between the receptive-field centers of neighboring neurons in a kernel map. The second convolutional layer takes the (response-normalized and pooled) output of the first convolutional layer as input and filters it with 256 kernels of size 5x5x48 (note: 48 corresponds to the 48 maps produced on one GPU in the first layer; all of those maps are convolved to produce one map of the second layer). The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers. The third convolutional layer has 384 kernels of size 3x3x256 connected to the (normalized, pooled) output of the second convolutional layer. The fourth convolutional layer has 384 kernels of size 3x3x192, and the fifth convolutional layer has 256 kernels of size 3x3x192. Each of the fully connected layers has 4,096 neurons.
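A hedged PyTorch sketch of this architecture (a single-GPU approximation: the two-GPU kernel split is ignored, and the padding values are the common choices used to reproduce the 55/27/13 feature-map sizes, not something stated in the text above):

```python
import torch
import torch.nn as nn

alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(),      # conv1
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    nn.MaxPool2d(kernel_size=3, stride=2),                                 # overlapping pool
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),               # conv2
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),              # conv3
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),              # conv4
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),              # conv5
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),              # fc6
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),                     # fc7
    nn.Linear(4096, 1000),                                                 # fc8 -> softmax
)

print(alexnet(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 1000])
```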

Each layer of operations, dimensions, and parameters

