The basic structure of AlexNet
AlexNet is composed of 5 convolutional layers and 3 fully connected layers, for a total of 8 weight layers (pooling layers are not counted because they have no parameters). A ReLU activation follows every convolutional and fully connected layer, a local response normalization layer follows the first and second convolutional layers, and max pooling is applied to the outputs of the first, second, and fifth convolutional layers.
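As a rough sketch, the layer arrangement described above can be written in PyTorch. This is a hypothetical single-GPU variant (the original network was split across two GPUs); the class name and padding choices are my own, while the kernel counts and layer ordering follow the paper.

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    """Single-GPU sketch of the AlexNet layer ordering described above."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),   # conv1
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),                   # overlapping pooling
            nn.Conv2d(96, 256, kernel_size=5, padding=2),            # conv2
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1),           # conv3
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),           # conv4
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),           # conv5
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),                            # fc6
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),                                    # fc7
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),                             # fc8
        )

    def forward(self, x):                 # x: (batch, 3, 224, 224)
        x = self.features(x)              # -> (batch, 256, 6, 6)
        x = torch.flatten(x, 1)
        return self.classifier(x)
```

With the padding shown, a 224x224 input produces the 6x6x256 feature map that feeds the first fully connected layer.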
ReLU
In the AlexNet architecture, the traditional S-shaped activation functions are discarded in favor of the rectified linear unit (ReLU). The traditional S-shaped activations are the sigmoid, f(x) = 1/(1 + e^(-x)), and the hyperbolic tangent, f(x) = tanh(x).
The sigmoid function saturates: when the input is very large or very small, the output curve is nearly flat and the gradient is almost 0, which causes the vanishing gradient problem. Its output mean is also not 0, which can cause a bias shift, since each layer receives the non-zero-mean output of the previous layer as its input. Moreover, its output range is (0, 1) and contains no negative values, so some useful information may be lost.
As the graph of the tanh function shows, its output range is [-1, 1], its output mean is 0, and it does carry negative information; however, because it also saturates, it likewise suffers from the vanishing gradient problem.
ReLU has the form f(x) = max(0, x). When the input is positive, the output is the input itself and the derivative with respect to x is the constant 1, which avoids the vanishing gradient problem. When the input is less than 0, the output is 0, which introduces sparsity and can speed up training. However, since its output mean is greater than 0, the bias-shift phenomenon still occurs, and because a neuron with negative input outputs 0, its gradient is 0 and the corresponding weights cannot be updated.
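A minimal NumPy sketch (the helper names are my own) that illustrates the saturation argument above by comparing the derivatives of sigmoid, tanh, and ReLU at a large input:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # -> 0 when |x| is large (saturation)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2    # -> 0 when |x| is large (saturation)

def d_relu(x):
    return np.where(x > 0, 1.0, 0.0)  # constant 1 for positive inputs

x = 10.0
print(d_sigmoid(x), d_tanh(x), d_relu(x))
# roughly 4.5e-05, 8.2e-09, 1.0: the saturating activations pass almost no gradient
```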
Overlapping max pooling
In AlexNet the pooling region is 3x3 with a stride of 2, i.e. overlapping pooling, and the paper claims this works better than the traditional non-overlapping scheme with a 2x2 region and stride 2, reporting an improvement of about 0.3%. However, in my experiments on other datasets, overlapping pooling performed worse than non-overlapping pooling: with overlapping pooling, the maximum of the next pooling region is often a copy of the largest element of the previous region, which amounts to creating feature redundancy.
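A short PyTorch sketch contrasting the two pooling settings discussed above; the input shape is illustrative (e.g. a conv1 output). Note that for this size both settings happen to produce the same output resolution, so the difference really is only whether neighboring regions overlap:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 96, 55, 55)                        # e.g. a conv1 feature map

overlap = nn.MaxPool2d(kernel_size=3, stride=2)       # AlexNet: regions overlap
non_overlap = nn.MaxPool2d(kernel_size=2, stride=2)   # traditional: no overlap

print(overlap(x).shape)      # torch.Size([1, 96, 27, 27])
print(non_overlap(x).shape)  # torch.Size([1, 96, 27, 27])
```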
Local response normalization
Local response normalization is defined as

b^i_{x,y} = a^i_{x,y} / ( k + α · Σ_{j=max(0, i-n/2)}^{min(N-1, i+n/2)} (a^j_{x,y})^2 )^β

where a^i_{x,y} is the element of the feature map obtained by applying the i-th convolution kernel at position (x, y) and then passing it through the ReLU nonlinearity, and b^i_{x,y} is the neuron obtained after local response normalization. N is the total number of convolution kernels in the layer, and n is the number of adjacent feature maps included in the sum. k, n, α, and β are hyperparameters (not learned by the network); AlexNet sets k = 2, n = 5, α = 10^-4, β = 0.75. The summation runs over the same position (x, y) in n adjacent feature maps, which introduces a competition mechanism between neurons: strong features end up with larger element values.
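A minimal NumPy implementation of the formula above, written with an explicit loop for clarity (the function name is my own); it makes the cross-channel sum at each spatial position concrete:

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """a: array of shape (N_channels, H, W) holding ReLU outputs.
    Returns the response-normalized array b of the same shape."""
    N, H, W = a.shape
    b = np.empty_like(a)
    for i in range(N):
        lo = max(0, i - n // 2)
        hi = min(N - 1, i + n // 2)
        # sum of squares over the same spatial position in n adjacent feature maps
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

a = np.maximum(0, np.random.randn(96, 27, 27))  # ReLU'd conv outputs
b = local_response_norm(a)
print(b.shape)  # (96, 27, 27)
```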
Reducing overfitting
1. Data augmentation
The simplest and most common way to reduce overfitting on image data is to artificially enlarge the dataset while preserving the label information. For example, if the original image is 256x256, one can randomly crop 224x224 patches from it for training and also apply horizontal flips, translations, and similar operations to expand the dataset.
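A sketch of this augmentation with torchvision; the exact pipeline is my own choice rather than the paper's code:

```python
from torchvision import transforms

# Random 224x224 crops and horizontal flips from 256x256 images,
# mirroring the augmentation scheme described above.
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])
```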
2. Dropout
Dropout is a regularization method that randomly sets the outputs of hidden-layer neurons to 0 with drop probability p, usually p = 0.5. Dropout forces each neuron to learn useful features on its own rather than relying on particular other neurons. It also acts like weight decay: when a neuron's output is 0, the weights on its connections contribute nothing, so the effective network has fewer parameters and lower complexity, which helps prevent overfitting. Finally, it is equivalent to a form of model averaging: if the network has n hidden neurons and p = 0.5, then roughly 2^n sub-networks are being trained simultaneously, and averaging over them improves generalization.
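A minimal NumPy sketch of dropout; I use the common "inverted" scaling convention (rescale surviving activations at training time), which differs slightly from the paper's test-time scaling but has the same expected effect:

```python
import numpy as np

def dropout(h, p=0.5, training=True):
    """Inverted dropout: zero each hidden activation with probability p,
    then rescale the survivors so the expected activation is unchanged."""
    if not training:
        return h
    mask = (np.random.rand(*h.shape) >= p).astype(h.dtype)
    return h * mask / (1.0 - p)

h = np.random.randn(4, 4096).astype(np.float32)  # a batch of fc-layer activations
print(dropout(h).shape)  # (4, 4096), with about half the entries zeroed
```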
Network structure analysis
Usually a pooling layer follows each convolutional layer, but AlexNet applies max pooling only after the first, second, and last (fifth) convolutional layers. In the lower layers of the network, the feature maps are generally large, so pooling there mainly serves to shrink them, reducing the downstream parameter count and helping to prevent overfitting. In the higher layers, the extracted features are high-level ones, perhaps an object part such as a puppy's eyes, and since the puppy can appear anywhere in the image, the purpose of max pooling there is to provide translation invariance. As for why AlexNet uses only 5 convolutional layers, the paper mentions that depth is important, although going deeper can also bring degradation problems; removing a convolutional layer likewise degrades the network's representation.