Caffe Study: AlexNet's Algorithm Chapter

In machine learning, one of the problems we usually consider is how to "generalize", that is, how to approximate the global distribution as closely as possible with a finite set of samples or a finite structure. This requires effort on both the samples and the structural model.

In a typical training task, one of the key issues to consider is whether the data distribution is reasonable: first, dataset coverage, that is, whether the data set covers the sample space; and second, ensuring as far as possible that it matches the real data distribution (note that the true distribution is unknown and can only be approximated through some priors). Only such data is valid. Of course, these measures merely increase the probability of obtaining the correct solution; there is no guarantee that the correct solution will actually be obtained. When you do not know whether the training set you have drawn is consistent with the actual distribution, you need to sample several times and compute on each data set. The same holds for classifiers: a single classifier often cannot accurately describe the decision boundary, so we combine the results of several. From a methodological point of view, any single view of a phenomenon is often local and therefore generalizes wrongly; if many "partial" views can be ensembled into something closer to the "whole", then the overall distribution is approximated with higher probability. This idea shows up in many places, such as cross-validation, the classic RANSAC, random trees (forests), AdaBoost, and other methods.

Below I look at some of the techniques used in AlexNet from these two aspects, data and model. The main reference is Alex Krizhevsky's 2012 NIPS paper, ImageNet Classification with Deep Convolutional Neural Networks.

1. Processing of data:

So far, no one has observed the size of the data set pushing a deep learning algorithm to its theoretical limit, which means that data sets have not yet reached that critical point; adding more data therefore brings only benefits and no harm.

In Alex's paper, two methods are used to augment the image data.

A. Increasing the training samples: the data set is enlarged through image transformations. First, from each input image (size 256*256), 224*224 patches are randomly extracted, and horizontal reflections of them are also taken. Each patch differs from the original image by at most 32 pixels, so the main subject should still be included in the training patch; this is equivalent to enriching the training data along the position dimension. The horizontal reflection mirrors the image about its vertical axis, enriching the data in the opposite orientation. The data set grows by a factor of 2048 (32 * 32 crop positions * 2 for the mirror); the direct result is reduced overfitting, which in turn lowers the complexity constraints on the network design. A rough sketch of this augmentation is given below.
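As a rough illustration (not Caffe's actual implementation), the training-time crop-and-mirror augmentation might be sketched in NumPy as follows; the HxWxC array layout and the helper name are my own assumptions.

```python
import numpy as np

def random_crop_and_flip(img, crop_size=224, rng=np.random):
    """Randomly crop a crop_size x crop_size patch from an HxWxC image
    and mirror it horizontally with probability 0.5 (a sketch of the
    training-time augmentation described above)."""
    h, w, _ = img.shape
    top = rng.randint(0, h - crop_size + 1)   # 0..32 for a 256x256 input
    left = rng.randint(0, w - crop_size + 1)
    patch = img[top:top + crop_size, left:left + crop_size, :]
    if rng.rand() < 0.5:
        patch = patch[:, ::-1, :]             # horizontal reflection
    return patch

# Example: a dummy 256x256 RGB image gives a 224x224 augmented patch.
dummy = np.zeros((256, 256, 3), dtype=np.uint8)
print(random_crop_and_flip(dummy).shape)      # (224, 224, 3)
```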

In the test phase, the four corner patches and the central patch of each test sample are taken (5 patches in total) and then mirrored, giving 10 inputs to the network; the 10 softmax outputs are averaged to give the final output.
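A minimal sketch of this ten-crop averaging, assuming a hypothetical predict_softmax(patch) callable that stands in for the network's forward pass:

```python
import numpy as np

def ten_crop_predict(img, predict_softmax, crop=224):
    """Average softmax outputs over 4 corner crops + the center crop and
    their horizontal mirrors (10 patches in total)."""
    h, w, _ = img.shape
    offsets = [(0, 0), (0, w - crop), (h - crop, 0),
               (h - crop, w - crop), ((h - crop) // 2, (w - crop) // 2)]
    probs = []
    for top, left in offsets:
        patch = img[top:top + crop, left:left + crop, :]
        probs.append(predict_softmax(patch))
        probs.append(predict_softmax(patch[:, ::-1, :]))  # mirrored patch
    return np.mean(probs, axis=0)  # averaged class probabilities
```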

B. Using PCA to augment the training data: a PCA transformation is performed on the RGB values of each image, which also serves a de-noising function. To preserve the diversity of the images, a random scale factor is applied along each eigen-direction and regenerated each round, so that the salient features of the same image vary within a certain range; this reduces the probability of overfitting. A rough sketch follows.
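The following is a rough sketch of this PCA-based colour augmentation, not the paper's exact procedure: for simplicity the principal components are computed per image here, whereas the paper computes them once over the ImageNet training set and draws the random factors from a Gaussian with standard deviation 0.1; variable names are my own.

```python
import numpy as np

def pca_color_augment(img, sigma=0.1, rng=np.random):
    """Add multiples of the principal components of the RGB pixel values,
    scaled by eigenvalue * random factor (sketch of the PCA augmentation)."""
    pixels = img.reshape(-1, 3).astype(np.float64) / 255.0
    pixels -= pixels.mean(axis=0)
    cov = np.cov(pixels, rowvar=False)            # 3x3 RGB covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    alphas = rng.normal(0.0, sigma, size=3)       # regenerated each round
    shift = eigvecs @ (alphas * eigvals)          # RGB offset added to every pixel
    out = img.astype(np.float64) / 255.0 + shift
    return np.clip(out * 255.0, 0, 255).astype(np.uint8)
```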

Whether the strategies above are really necessary still deserves a question mark in my view. For A, with fewer samples you could instead put the effort into the structure design and perhaps achieve the same effect. For B, does deep learning really need extra enhancement applied to the images? If so, one might just as well first extract some traditional hand-crafted features and then apply deep learning. I think the key reason is that deep learning has not yet produced real proven rules, so whatever strategy you use has some justification, but who can guarantee you are not just casting a wide net?

2. Model structure:

In the model design, AlexNet performs Local Response Normalization, and a dropout strategy is adopted for node selection.

A. Local Response Normalization

The formula is as follows, where a is the activation of each neuron, n is the number of adjacent kernel maps at the same position included in the sum, N is the total number of kernels, and k, alpha, beta are preset hyper-parameters; here k = 2, n = 5, alpha = 1e-4, beta = 0.75.

$$ b^{i}_{x,y} = a^{i}_{x,y} \Big/ \Big( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \big(a^{j}_{x,y}\big)^{2} \Big)^{\beta} $$

As can be seen from this formula, the original activation a is divided by a normalization term (the denominator) to produce the new activation b, which is equivalent to smoothing the activations at the same position (x, y) across neighbouring maps. As for why k, alpha and beta are set to exactly these values, the paper does not really explain.

This smoothing can probably increase the recognition rate by 1%-2%.
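To make the formula concrete, here is a minimal NumPy sketch of this cross-map normalization (not Caffe's optimized implementation); the (num_maps, height, width) activation layout is my own assumption.

```python
import numpy as np

def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Cross-map LRN: b[i] = a[i] / (k + alpha * sum of squared activations
    over the n neighbouring maps around map i) ** beta."""
    N = a.shape[0]                       # total number of kernel maps
    b = np.empty_like(a, dtype=np.float64)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        denom = k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)
        b[i] = a[i] / denom ** beta
    return b

# Example: 96 feature maps of size 55x55, as after AlexNet's first conv layer.
acts = np.random.rand(96, 55, 55)
print(local_response_norm(acts).shape)   # (96, 55, 55)
```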

B. Dropout strategy

Jointly predicting with multiple models is a basic way to reduce test error, but training several models separately increases the overall training cost; after all, it takes a long time to train even a single network, even when computing resources are sufficient. Our goal is to reduce the overall computation time without hurting accuracy.

Hinton therefore proposed the dropout strategy. The strategy is very simple: the output of each hidden unit in the dropout layers is set to 0 with probability 50%, so that it no longer plays any role in the forward or backward pass. Each input thus effectively sees a different network structure, but the weights are shared among them. The resulting parameters have to work under many different network structures, which is to say the generalization ability of the system improves.

This strategy is used in the first two fully-connected layers of AlexNet.
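A minimal sketch of this dropout, following the description above (zero each hidden output with probability 0.5 during training, and multiply the outputs by 0.5 at test time); the function name and array layout are my own.

```python
import numpy as np

def dropout(x, p=0.5, train=True, rng=np.random):
    """During training, each hidden output is zeroed with probability p;
    at test time all units are kept and the outputs are multiplied by
    (1 - p) to compensate (p = 0.5 as described above)."""
    if train:
        mask = rng.rand(*x.shape) >= p     # keep each unit with probability 1 - p
        return x * mask                    # dropped units contribute nothing
    return x * (1.0 - p)

# Example: apply to the output of a fully-connected layer.
fc_out = np.random.rand(4096)
print(dropout(fc_out).shape)               # (4096,)
```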

3. Parameters of the optimization algorithm

The paper uses the SGD algorithm; the basic parameter settings were covered in my earlier summary of optimization algorithms. Here are a few personal experiences from running it.

A. The batch size in the original paper is 256, presumably the result of Alex's tuning. The machine I actually use is comparatively weak (8 GB of RAM and 4 GB of video memory), so I had to reduce the batch size to 64 to avoid out-of-memory errors. This in turn requires adjusting the other parameters to keep training convergent. The reason is that a small batch means, in the sense introduced at the start of this article, that the sample coverage is too low; this produces many local minima, and under the joint effect of step size and direction the training oscillates and fails to converge.

B. For this reason the learning rate was adjusted to 0.02, which is equivalent to increasing the step size; this avoids the oscillation to some extent and makes it possible to step over local minima and walk towards larger extreme points.

C. The bias of each layer was changed from 1 to 0.1, which to some extent limits the size of the activations; this limits the impact of an overly large error and so avoids excessive changes in the direction of the iteration.

D. After B and C, the system finally converges, but the negative result is that the overall convergence rate is slow, so the maximum number of iterations also has to be increased; I changed it from 450,000 to 700,000.

E. Throughout the run there were several plateaus, at roughly 200,000 and 400,000 iterations, so the learning rate should be deliberately reduced as the iterations approach a plateau; currently it is reduced to 1/10 of its value every 100,000 iterations (a sketch of this schedule is given after item F). Tuning the parameters took 5 days, and the final run took 15 days.

F. Regarding the tuning strategy: the settings above were made according to some fairly simple understanding, and without a reasonable explanation tuning becomes very low-level work. Fortunately I have found a few papers on tuning, mainly on the theory of the optimization algorithms, and I will come back and test again after studying them.
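The schedule mentioned in E, dividing the learning rate by 10 every 100,000 iterations, corresponds to a simple step policy (similar in spirit to Caffe's "step" learning-rate policy). A small sketch with the numbers used above:

```python
def step_lr(iteration, base_lr=0.02, gamma=0.1, stepsize=100_000):
    """Learning rate reduced to 1/10 of its value every `stepsize` iterations,
    starting from the base_lr of 0.02 used in this experiment."""
    return base_lr * gamma ** (iteration // stepsize)

# Example values around the plateau points mentioned above.
for it in (0, 200_000, 400_000, 700_000):
    print(it, step_lr(it))   # roughly 0.02, 2e-4, 2e-6, 2e-9
```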
