Datasets:
LabelMe: consists of hundreds of thousands of fully segmented images.
ImageNet: consists of over 15 million labeled high-resolution images in over 22,000 categories.
The dataset used in this paper is ImageNet.
A few more words:
ImageNet contains over 15 million labeled high-resolution images in roughly 22,000 categories. The images were collected from the Internet and labeled by people using Amazon's Mechanical Turk crowdsourcing tool.
The ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), held annually since 2010, is part of the Pascal Visual Object Challenge. ILSVRC uses only a subset of ImageNet: 1000 classes with roughly 1000 images each, giving about 1.2 million training images, 50,000 validation images, and 150,000 test images of varying resolution. ILSVRC-2010 is the only edition for which the test-set labels are available. Two error rates are usually reported, top-1 and top-5; the top-5 error rate is the fraction of test images for which the correct label is not among the five labels the model considers most probable.
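To make the metric concrete, here is a minimal NumPy sketch (not from the paper; `top_k_error` and the toy data are illustrative) of how a top-k error rate can be computed from a matrix of class scores:

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    """scores: (N, 1000) class scores; labels: (N,) true class indices."""
    top_k = np.argsort(scores, axis=1)[:, -k:]      # indices of the k highest-scoring classes
    hit = (top_k == labels[:, None]).any(axis=1)    # True if the correct label is among them
    return 1.0 - hit.mean()                         # fraction of test images that miss

# toy example: random scores for 4 "images" over 1000 classes
rng = np.random.default_rng(0)
scores = rng.standard_normal((4, 1000))
labels = rng.integers(0, 1000, size=4)
print(top_k_error(scores, labels, k=1), top_k_error(scores, labels, k=5))
```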
Now to the main topic:
1. Network structure:
This paper is mainly about image classification. Given a picture, a person can quickly tell what it shows, such as a cat or a chair, but for a computer, which only sees a pile of numbers, classifying images is a hard problem. To do this well, the paper uses the model structure known as AlexNet.
Generally speaking:
AlexNet has 8 layers in total, with more than 60M parameters.
The first five layers are convolutional layers.
The last three layers are fully connected layers.
The last layer is a 1000-way softmax, and the objective function is multinomial logistic regression.
This is the original figure cut from the paper. The network was trained on two GTX 580 GPUs with 3 GB of memory each; that 3 GB limit meant the whole model could not fit on a single GPU, so the architecture diagram is drawn in two halves. From the diagram we can see that the kernels of the 2nd, 4th and 5th layers are connected only to the kernels of the previous layer that reside on the same GPU, while the kernels of the 3rd layer and of the fully connected layers are connected to all kernels of the previous layer.
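As a rough orientation before the layer-by-layer walkthrough, here is a minimal single-GPU PyTorch sketch of the layer layout described above. It is an approximation, not the authors' code: the paddings are the values commonly derived from the paper, `groups=2` on layers 2, 4 and 5 stands in for the two-GPU split, and the 227 x 227 input convention is used.

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4),              # 227 -> 55
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),                   # 55 -> 27
            nn.Conv2d(96, 256, kernel_size=5, padding=2, groups=2),  # 27 -> 27
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),                   # 27 -> 13
            nn.Conv2d(256, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1, groups=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1, groups=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                   # 13 -> 6
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),   # 9216 -> 4096
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),   # 1000-way output, fed to softmax
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

# roughly 60M parameters, matching the figure quoted above
print(sum(p.numel() for p in AlexNetSketch().parameters()))
```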
Hierarchical structure:
(1) The input image is 224 x 224 x 3: the width and height are 224 pixels, and the three RGB color channels give the factor of 3. (It is generally agreed that the 224 stated in the paper does not quite work out; the input should be 227.)
(2) 96 filters of size 11 x 11 x 3 are used. With a stride of 4, convolving the input image gives a 55 x 55 x 96 feature map. The numbers come from (227 - 11)/4 + 1 = 55, and 96 is the number of filters.
(3) After the ReLU activation function comes the pooling operation, max-pooling with a filter size of 3 x 3 and a stride of 2, whose output is 27 x 27 x 96: (55 - 3)/2 + 1 = 27, and the depth stays at the original 96.
(4)
- LRN (Local Response Normalization)
- The remaining four convolutional layers follow a similar process.
The input to this layer is 6 x 6 x 256 (= 9216). A fully connected layer is essentially a matrix operation that performs a spatial mapping: treat the input as a column vector x of dimension 9216 (i.e. a 9216 x 1 matrix) and multiply it by the parameter matrix W, which here has size 4096 x 9216; the output of this fully connected layer is y = Wx, a vector of dimension 4096 x 1.
The output of the 8th layer is a 1000 x 1 matrix, i.e. a column vector of dimension 1000, corresponding to the 1000 labels of the softmax regression.
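The size arithmetic used in steps (2) and (3) can be checked with a small helper (output = (input + 2*padding - kernel)/stride + 1); note that the paddings assumed here for the later convolutional layers are the usual values inferred from the paper, not stated in the text above.

```python
def out_size(n, k, s, p=0):
    """Spatial output size of a convolution/pooling: (n + 2p - k) // s + 1."""
    return (n + 2 * p - k) // s + 1

n = 227
n = out_size(n, 11, 4)       # conv1: (227 - 11)/4 + 1 = 55
n = out_size(n, 3, 2)        # pool1: (55 - 3)/2 + 1 = 27
n = out_size(n, 5, 1, p=2)   # conv2: padding keeps the size at 27
n = out_size(n, 3, 2)        # pool2: (27 - 3)/2 + 1 = 13
n = out_size(n, 3, 1, p=1)   # conv3/conv4/conv5: stay at 13
n = out_size(n, 3, 2)        # pool5: (13 - 3)/2 + 1 = 6
print(n, 6 * 6 * 256)        # 6 and 9216, the input size of the first fully connected layer
```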
2. Innovations in the network structure
- ReLU (Rectified Linear Unit)
f(x) = max(0, x) is a non-saturating nonlinear activation function. Compared with the two saturating nonlinear activation functions f(x) = tanh(x) and the sigmoid f(x) = (1 + e^(-x))^(-1), SGD converges much faster with ReLU.
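A quick numerical illustration (my own sketch, not from the paper) of why the saturating activations slow training down: for inputs of large magnitude the tanh and sigmoid derivatives are nearly zero, while the ReLU derivative stays at 1 for any positive input.

```python
import numpy as np

def relu(x):    return np.maximum(0.0, x)
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

x = np.array([-5.0, -1.0, 0.5, 5.0])
print(relu(x), np.tanh(x), sigmoid(x))          # the three activations

# gradient magnitudes seen by backpropagation
print((x > 0).astype(float))                    # ReLU':    0 or 1
print(1.0 - np.tanh(x) ** 2)                    # tanh':    ~0 at x = +/-5
print(sigmoid(x) * (1.0 - sigmoid(x)))          # sigmoid': ~0 at x = +/-5
```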
- Training on multiple GPUs
A few more words: with the parallel scheme, half of the kernels are placed on each of the two GPUs. A small trick is used here: the GPUs communicate only in certain layers. For example, the kernels of the 3rd layer take input from all kernel maps of the 2nd layer, but the kernels of the 4th layer take input only from those kernel maps of the 3rd layer that reside on the same GPU.
- LRN (Local Response Normalization)
Essentially, this layer is also designed to prevent saturation of the activation function.
According to the author of this blog (http://blog.csdn.net/cyh_24/article/details/51440344): the normalization pushes the inputs of the activation function toward the middle of the "bowl" (away from the saturated regions), which yields larger derivatives. Functionally, this overlaps with what ReLU already does.
In the paper, the authors mention that this can improve the generalization ability of the network.
Specifically, LRN normalizes the activation a^i_{x,y} produced by applying kernel i at position (x, y) as

b^i_{x,y} = a^i_{x,y} / ( k + α · Σ_{j = max(0, i−n/2)}^{min(N−1, i+n/2)} (a^j_{x,y})^2 )^β

where N is the total number of kernels in the layer, n is the number of adjacent kernel maps summed over, and k, n, α, β are constant hyper-parameters (the paper uses k = 2, n = 5, α = 10^(-4), β = 0.75).
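A direct NumPy transcription of this formula, as a sketch (the function name and the (kernels, height, width) layout are my own choices; the hyper-parameter defaults are the paper's values):

```python
import numpy as np

def local_response_norm(a, k=2.0, alpha=1e-4, beta=0.75, n=5):
    """a: activations of shape (N_kernels, H, W); returns the LRN-normalized b."""
    N = a.shape[0]                       # total number of kernels in the layer
    b = np.empty_like(a)
    for i in range(N):
        lo = max(0, i - n // 2)
        hi = min(N - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

a = np.random.rand(96, 55, 55).astype(np.float32)   # e.g. the output of the first conv layer
print(local_response_norm(a).shape)                  # (96, 55, 55)
```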
- Overlapping pooling
A pooling layer can be viewed as a grid of pooling units spaced s pixels apart, each summarizing a neighborhood of size z x z centered at the location of the pooling unit.
If s = z, this is traditional local pooling; if s < z, it is overlapping pooling.
In this paper, s = 2 and z = 3 are used throughout the network. Compared with the non-overlapping scheme (s = 2, z = 2), this reduces the top-1 and top-5 error rates by 0.4% and 0.3% respectively.
Non-overlapping and overlapping pooling produce outputs of the same dimensions here (see the check below). Also, models that use overlapping pooling are slightly harder to overfit during training.
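The claim about equal output dimensions is easy to verify with the same size formula as before, applied to the feature-map sizes that occur in this network:

```python
def pool_out(n, z, s):
    """Output size of pooling with window z and stride s: (n - z) // s + 1."""
    return (n - z) // s + 1

for n in (55, 27, 13):
    # overlapping (z=3, s=2) vs non-overlapping (z=2, s=2) pooling
    print(n, "->", pool_out(n, 3, 2), "vs", pool_out(n, 2, 2))
# 55 -> 27 vs 27, 27 -> 13 vs 13, 13 -> 6 vs 6
```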
3. Avoiding overfitting
- Data augmentation
One view is that a neural network is trained on large amounts of data: increasing the training data improves accuracy because it helps avoid overfitting, and once overfitting is under control the network can be made larger. When data is limited, the training set can be enlarged by generating new data from the existing data through simple transformations.
The simplest and most common ways to distort image data are:
1. Randomly crop 224 x 224 patches from the original 256 x 256 image (translation transform, crop)
2. Flip the image horizontally (reflection transform, flip)
3. Add some random illumination and color noise to the image (lighting and color transform, color jittering)
AlexNet's data augmentation:
1. Translation and reflection transforms: during training, a 224 x 224 patch is randomly cropped from each 256 x 256 picture, and horizontal flipping is also applied, which multiplies the number of samples by (256 - 224)^2 * 2 = 2048. At test time, five crops are taken (top-left, top-right, bottom-left, bottom-right, center), each of them is also flipped, giving 10 crops in total, and the predictions are averaged.
2. Changing the intensities of the RGB channels of training images: PCA is performed on the set of RGB pixel values over the whole ImageNet training set. To each training image, multiples of the found principal components are added, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean 0 and standard deviation 0.1. That is, to each RGB pixel I_xy = [I_xy^R, I_xy^G, I_xy^B]^T we add the quantity [p_1, p_2, p_3][α_1 λ_1, α_2 λ_2, α_3 λ_3]^T, where p_i and λ_i are the i-th eigenvector and eigenvalue of the 3 x 3 covariance matrix of RGB pixel values, and α_i is the aforementioned random variable. Each α_i is drawn only once for all pixels of a particular training image, until that image is used for training again, at which point it is re-drawn. This scheme captures an important property of natural images: object identity is invariant to changes in the intensity and color of the illumination. (A code sketch of both augmentations follows below.)
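Here is a minimal NumPy sketch of both training-time augmentations. It is illustrative only: `random_crop_flip` and `pca_color_jitter` are made-up names, and the eigenvectors/eigenvalues are computed from a single random image just to keep the snippet self-contained, whereas the paper computes them once over the whole ImageNet training set.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop_flip(img, crop=224):
    """Translation + reflection: random 224x224 patch, flipped horizontally half the time."""
    h, w = img.shape[:2]
    y = rng.integers(0, h - crop + 1)    # random top-left corner of the patch
    x = rng.integers(0, w - crop + 1)
    patch = img[y:y + crop, x:x + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]           # horizontal reflection
    return patch

def pca_color_jitter(img, evecs, evals, sigma=0.1):
    """Add [p1 p2 p3][a1*l1, a2*l2, a3*l3]^T to every pixel, a_i ~ N(0, sigma^2)."""
    alpha = rng.normal(0.0, sigma, size=3)
    delta = evecs @ (alpha * evals)      # one 3-vector, shared by all pixels of the image
    return img + delta

img = rng.random((256, 256, 3)).astype(np.float32)   # stand-in for a 256x256 training image
cov = np.cov(img.reshape(-1, 3), rowvar=False)       # 3x3 covariance of RGB values
evals, evecs = np.linalg.eigh(cov)
print(random_crop_flip(img).shape, pca_color_jitter(img, evecs, evals).shape)
```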
- Dropout
Dropout sets the output of each hidden-layer neuron to zero with probability 0.5. Neurons that are "dropped out" in this way take part in neither forward propagation nor back-propagation. So every time an input is presented, the neural network samples a different architecture, but all of these architectures share weights. Because a neuron cannot rely on the presence of particular other neurons, this technique reduces complex co-adaptations between neurons; each neuron is therefore forced to learn more robust features that are useful in combination with many different random subsets of the other neurons. The first two fully connected layers use dropout. Without dropout, the network exhibits substantial overfitting. Dropout roughly doubles the number of iterations required for convergence.
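A NumPy sketch of this behaviour (training-time zeroing with probability 0.5, and the test-time scaling by 0.5 used in the paper); the function name is my own:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5, train=True):
    if train:
        mask = rng.random(h.shape) >= p   # keep each hidden unit with probability 1 - p
        return h * mask                   # dropped units output exactly zero
    return h * (1.0 - p)                  # at test time all units are used, outputs scaled by 0.5

h = rng.standard_normal(8)
print(dropout(h, train=True))
print(dropout(h, train=False))
```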
4. Image preprocessing
① Size normalization
All pictures are scaled to 256 x 256. As for why they are not normalized directly to 224 (or 227), see the dataset-expansion operation described above.
② Subtracting the pixel mean
From each pixel of every picture, the mean value of that pixel over all training-set pictures is subtracted.
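A small NumPy/PIL sketch of these two preprocessing steps; the random `images` list stands in for the decoded training pictures, which would normally come from disk.

```python
import numpy as np
from PIL import Image

rng = np.random.default_rng(0)
# stand-in for raw training pictures of varying size
images = [rng.integers(0, 256, size=(300, 400, 3), dtype=np.uint8) for _ in range(4)]

def resize_256(img):
    """Size normalization: rescale a picture to 256 x 256."""
    return np.asarray(Image.fromarray(img).resize((256, 256)), dtype=np.float32)

train = np.stack([resize_256(img) for img in images])
mean_image = train.mean(axis=0)     # per-pixel mean over all training pictures
train = train - mean_image          # subtract it from every picture
print(train.shape, abs(train.mean()) < 1e-3)
```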
Reference blog:
1.http://blog.csdn.net/teeyohuang/article/details/75069166
2.http://blog.csdn.net/cyh_24/article/details/51440344
Paper notes: AlexNet ("ImageNet Classification with Deep Convolutional Neural Networks")