Understanding Dropout

Original address: http://blog.csdn.net/stdcoutzyx/article/details/49022443
Note: the images are hosted on GitHub; if they do not load, you may need a proxy to view them. Please credit http://blog.csdn.net/stdcoutzyx/article/details/49022443 when reposting.
To get straight to the point: dropout means that, while training a deep learning network, neural network units are temporarily dropped from the network with a certain probability. Note the word "temporarily": with stochastic gradient descent, since units are dropped at random, every mini-batch effectively trains a different network.
Dropout is a powerful weapon in CNNs for preventing overfitting and improving results, but opinions differ on why it works. I have read two representative papers, one for each of two different viewpoints, and share them here.
Combination Faction
The first paper in the references presents Hinton's viewpoint. I will not rehash Hinton's standing in deep learning; on that standing alone, this faction is something like the Wudang or Shaolin of the field. (The faction names are my own invention, so please don't laugh.)
Viewpoint
This paper starts from the problems of neural networks and explains step by step why dropout is effective. Large neural networks have two drawbacks:
- Time-consuming
- Easy to overfit
These two shortcomings cling to deep learning like two heavy burdens, one on each side, and they reinforce each other. Overfitting is a common problem in machine learning, and an overfit model is largely useless. To combat overfitting, the usual remedy is an ensemble: train several models and combine them. But then time becomes the big problem, since both training and testing multiple models are expensive. In short, the two issues form a near deadlock.
Dropout neatly resolves this dilemma: each dropout pass is equivalent to sampling a thinner sub-network from the original network, as shown in the figure below:
Thus, a neural network with n units, once dropout is applied, can be viewed as a collection of 2^n possible thinned models, while the number of parameters to train stays unchanged. This defuses the time-consumption problem.
Motivation theory
Although dropout approximates the classification performance of an ensemble, in practice it is still run on a single neural network, and only one set of model parameters is trained. So what is it actually doing? This has to be analyzed starting from the motivation. The authors of the paper draw a very nice analogy for the motivation behind dropout:
In nature, most large animals reproduce sexually, meaning that offspring inherit half of their genes from each parent. Intuitively, asexual reproduction seems more reasonable, because it can preserve large blocks of excellent genes intact, whereas sexual reproduction randomly splits genes apart and recombines them, destroying the joint adaptability of large gene blocks.
Yet natural selection did not favor asexual reproduction; it chose sexual reproduction, survival of the fittest. Let us start from the hypothesis that the power of genes lies in their ability to mix, not in any individual gene; both sexual and asexual reproduction are judged under this hypothesis. To see why sexual reproduction is the stronger strategy, first consider a little bit of probability.
For example, suppose a terrorist attack could be carried out in two ways:
- Concentrate 50 people, have them cooperate with precise division of labor, and stage one large attack.
- Split the 50 people into 10 groups of 5, have them act independently in different places, and count the operation a success if any single group succeeds.
Which plan has the higher probability of success? Clearly the latter, because it turns one large operation into guerrilla warfare.
By analogy, sexual reproduction not only passes good genes down, it also reduces the joint adaptability between genes, breaking the complex joint adaptability of large gene blocks into the joint adaptability of smaller gene segments.
Dropout achieves the same effect: it forces each neural unit to work together with randomly chosen other units, which weakens the joint adaptability (co-adaptation) between neuron nodes and strengthens generalization.
Personal addition: plants and microorganisms mostly reproduce asexually because their living environments change very little, so they do not need a strong ability to adapt to new environments; preserving large blocks of good genes is enough to suit the current environment. Higher animals are different: they must be ready to adapt to new environments at any time, so breaking gene co-adaptation into smaller pieces improves their odds of survival.
Changes in the model brought by dropout
To obtain this ensemble-like behavior, both training and prediction of the neural network change once dropout is added.
Training phase
Inevitably, each unit of the network gets an extra probabilistic step during training.
The corresponding formulas change as follows (reproduced below):
- The network without dropout.
- The network with dropout.
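The formula images from the original post may not display; as a rough reminder, following the notation of [1] (layer index l, unit index i, activation function f, retention probability p), the feed-forward equations are:

```latex
% Standard feed-forward pass (no dropout), unit i in layer l+1:
z_i^{(l+1)} = \mathbf{w}_i^{(l+1)} \mathbf{y}^{(l)} + b_i^{(l+1)}, \qquad
y_i^{(l+1)} = f\!\left(z_i^{(l+1)}\right)

% With dropout: sample a Bernoulli mask r and thin the previous layer's output:
r_j^{(l)} \sim \mathrm{Bernoulli}(p), \qquad
\tilde{\mathbf{y}}^{(l)} = \mathbf{r}^{(l)} \ast \mathbf{y}^{(l)}, \qquad
z_i^{(l+1)} = \mathbf{w}_i^{(l+1)} \tilde{\mathbf{y}}^{(l)} + b_i^{(l+1)}, \qquad
y_i^{(l+1)} = f\!\left(z_i^{(l+1)}\right)
```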
Test phase
At prediction time, the weights of each unit are pre-multiplied by p (the retention probability).
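A minimal NumPy sketch of this train/test asymmetry, assuming a single fully connected ReLU layer (the layer sizes, the ReLU choice, and the helper names are mine, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_layer_train(x, W, b, p):
    """Forward pass of one layer during training.

    Each input unit is kept with probability p (Bernoulli mask), so every
    mini-batch effectively trains a different thinned network.
    """
    mask = rng.binomial(1, p, size=x.shape)      # r ~ Bernoulli(p)
    x_thinned = x * mask                         # drop units
    return np.maximum(0.0, x_thinned @ W + b)    # ReLU activation

def dropout_layer_test(x, W, b, p):
    """Forward pass at test time: no sampling, weights scaled by p.

    Scaling the weights by p makes the expected input to each unit match
    what it saw during training, approximating the average of all 2^n
    thinned networks.
    """
    return np.maximum(0.0, x @ (p * W) + b)

# toy usage: 4 samples, 10 inputs, 5 hidden units;
# p = 0.5 is typical for hidden layers, ~0.8 for the input layer (see below)
x = rng.standard_normal((4, 10))
W = rng.standard_normal((10, 5)) * 0.1
b = np.zeros(5)
h_train = dropout_layer_train(x, W, b, p=0.5)
h_test = dropout_layer_test(x, W, b, p=0.5)
```

Many modern implementations use the equivalent "inverted dropout" trick (divide the kept activations by p during training), so that no weight scaling is needed at test time; the averaging effect is the same.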
Other technical points in the paper
Methods to prevent overfitting:
- Early stopping (stop when performance on the validation set starts to degrade)
- L1 and L2 weight regularization
- Soft weight sharing
- Dropout
Selection of dropout rate
- Cross-validation shows that a dropout rate of 0.5 for hidden units works best, because p = 0.5 generates the largest number of distinct random network structures.
- Dropout can also be used as a way of adding noise by applying it directly to the input layer. There, the retention probability is set closer to 1 (e.g. 0.8) so that the inputs are not perturbed too much.
Training process
- Putting a spherical limit on the weight vectors w (max-norm regularization) is very useful for dropout training (a sketch is given after this list).
- The sphere radius c is a hyperparameter that needs tuning; the validation set can be used for this.
- Although dropout is strong on its own, combining dropout with max-norm regularization, a large decaying learning rate, and high momentum works even better; for example, max-norm regularization prevents the parameters from blowing up under a large learning rate.
- Pretraining can also help dropout training; when using dropout, multiply all pretrained parameters by 1/p.
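A rough sketch of the max-norm constraint described above, assuming each hidden unit's incoming weights form one column of `W` and `c` is the tunable radius:

```python
import numpy as np

def max_norm_constrain(W, c):
    """Project each unit's incoming weight vector back onto the L2 ball of radius c.

    Applied after every gradient update: if ||w|| <= c the weights are left alone,
    otherwise they are rescaled to lie on the sphere of radius c.
    """
    norms = np.linalg.norm(W, axis=0, keepdims=True)        # one norm per unit (column)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))   # shrink only if norm > c
    return W * scale

# usage after an SGD step (learning_rate, grad_W, and c = 3.0 are placeholders):
# W -= learning_rate * grad_W
# W = max_norm_constrain(W, c=3.0)
```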
Some experimental conclusions
The experimental section of the paper is very rich, with plenty of evaluation data.
- Maxout neural networks are another related method; combined with dropout they give the best results on CIFAR-10.
- For text classification, dropout brings limited improvement; the analysis suggests that the Reuters-RCV1 corpus is large enough that overfitting is not the model's main problem.
- Dropout compared with other standard regularizers:
  - L2 weight decay
  - lasso
  - KL-sparsity
  - max-norm regularization
  - dropout
- Feature learning
  - In standard neural networks, correlations between nodes let them cooperate to fix up noise in other nodes, but these collaborations do not generalize to unseen data, hence overfitting; dropout breaks these correlations. On autoencoders, dropout learns more meaningful features (though this is only shown intuitively, not quantified).
  - The hidden representations produced with dropout are sparse.
  - This holds both when the number of hidden nodes is kept fixed and the dropout rate is varied, and when the expected number of active hidden nodes is kept fixed and the number of hidden nodes is varied.
- When the dataset is small, dropout does not help much; when the dataset is large, dropout works well.
- Model-averaged prediction (a sketch of both schemes follows this list)
  - Use weight scaling to approximate the model average in a single forward pass.
  - Or use a Monte Carlo method: for each sample, draw K thinned networks according to dropout, predict with each, and average; the larger K is, the better the effect.
- Multiplicative Gaussian noise
  - Dropout can use multiplicative Gaussian noise instead of Bernoulli masks.
- A disadvantage of dropout is that training takes 2-3 times longer than training a network without dropout.
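A sketch of the two model-averaging prediction schemes mentioned in the list above (Monte Carlo averaging over K sampled thinned networks versus a single weight-scaled pass); the two-layer toy network, its sizes, and the helper names are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, W1, b1, W2, b2, keep_mask=None, p=1.0):
    """One hidden ReLU layer plus a softmax output.

    If keep_mask is given, the hidden units are dropped (training-style pass);
    otherwise the hidden-to-output weights are scaled by p (test-style pass).
    """
    h = np.maximum(0.0, x @ W1 + b1)
    if keep_mask is not None:
        logits = (h * keep_mask) @ W2 + b2
    else:
        logits = h @ (p * W2) + b2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mc_average_predict(x, params, p=0.5, K=50):
    """Monte Carlo averaging: sample K thinned networks and average their outputs.
    The larger K is, the closer this gets to the true model average."""
    W1, b1, W2, b2 = params
    probs = np.zeros((x.shape[0], b2.shape[0]))
    for _ in range(K):
        mask = rng.binomial(1, p, size=(x.shape[0], b1.shape[0]))
        probs += forward(x, W1, b1, W2, b2, keep_mask=mask)
    return probs / K

def weight_scaled_predict(x, params, p=0.5):
    """Weight scaling: a single deterministic pass with outgoing weights times p."""
    W1, b1, W2, b2 = params
    return forward(x, W1, b1, W2, b2, p=p)

# toy usage: 5 samples, 10 inputs, 8 hidden units, 3 classes
params = (rng.standard_normal((10, 8)) * 0.1, np.zeros(8),
          rng.standard_normal((8, 3)) * 0.1, np.zeros(3))
x = rng.standard_normal((5, 10))
print(mc_average_predict(x, params))
print(weight_scaled_predict(x, params))
```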
Points that need further study
- Dropout RBM
- Marginalizing dropout
  Specifically, marginalizing turns the stochasticity of dropout into a deterministic operation; for logistic regression, for instance, dropout becomes equivalent to adding a regularization term (a sketch of this equivalence in the simpler linear-regression case is given after this list).
- Bayesian neural networks are particularly useful for sparse data such as medical diagnosis, genetics, drug discovery, and other computational biology applications.
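As a sketch of the marginalization idea in the simpler linear-regression case (a squared-error loss on a single example with input-level dropout, retention probability p, and mask entries r_i ~ Bernoulli(p)), taking the expectation over the mask gives a data-dependent regularization term:

```latex
\mathbb{E}_{\mathbf{r}}\!\left[\left(y - \mathbf{w}^{\top}(\mathbf{r} \ast \mathbf{x})\right)^{2}\right]
  = \left(y - p\,\mathbf{w}^{\top}\mathbf{x}\right)^{2}
  + p(1-p)\sum_{i} w_i^{2} x_i^{2}
```

The first term is the ordinary squared error with the weights effectively scaled by p; the second acts like a ridge penalty weighted by each feature's magnitude.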
Noise Faction
The second paper in the reference is also very powerful.
Viewpoint
The viewpoint is clear: training each dropped-out network is equivalent to doing data augmentation, because one can always find an input sample that makes the original, un-dropped network produce the same result as the dropped-out units. For example, suppose dropout at some layer yields the activations (1.5, 0, 2.5, 0, 1, 2, 0), where the zeros are the dropped units; then one can always find an input sample for which the original network produces exactly this output. In this way, every dropout pass is effectively adding a training sample.
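A toy construction of "finding such a sample" (only an illustration under made-up layer sizes and weights, not the procedure used in [2]):

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up single hidden layer standing in for "the original network":
# h(x) = relu(W x + b), with 12 inputs and 7 hidden units.
W = rng.standard_normal((7, 12)) * 0.3
b = np.zeros(7)

def hidden(x):
    return np.maximum(0.0, W @ x + b)

# Hidden activations after dropout on some real input; the zeros are the
# dropped units (the example vector used in the text above).
target = np.array([1.5, 0, 2.5, 0, 1, 2, 0])

# Because the layer is under-determined (12 inputs, 7 units), we can find an
# input whose *un-dropped* activations equal the dropped-out ones exactly:
# solve W x + b = target (min-norm solution via the pseudo-inverse), noting
# that relu(target) == target since every entry is >= 0.
x_star = np.linalg.pinv(W) @ (target - b)

print(np.round(hidden(x_star), 2))   # matches the target activations
```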
Sparsity
Knowledge Point A
First, let's look at a knowledge point:
When the data points belonging to a particular class are distributed along a linear manifold, or sub-space, of the input space, it is enough to learn a single set of features which can span the entire manifold. But when the data is distributed along a highly non-linear and discontinuous manifold, the best way to represent such a distribution is to learn features which can explicitly represent small local regions of the input space, effectively "tiling" the space to define non-linear decision boundaries.
The general meaning: when the data lies in a linear space, learning one set of features that spans the entire space is enough; but when the data is distributed over a non-linear, discontinuous space, it is better to learn features that represent local regions of the space.
Knowledge Point B
Suppose the data consists of m different non-contiguous clusters and we are given K training samples. A good feature representation is one in which, after the inputs of each cluster are mapped to features, the overlap between clusters is minimal. Let A_i denote the set of feature dimensions that are active in the representation of cluster i, and measure overlap as the Jaccard similarity between A_i and A_j for two different clusters (a small sketch of this measure follows the list). Then:
- When K is large enough, minimal overlap can be learned even if the sets A_i are large.
- When K is small and m is large, the way to achieve minimal overlap is to shrink the size of each A_i, i.e. to make the representation sparse.
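To make the overlap measure concrete, a tiny sketch (the activation matrices, the activity threshold, and the ~30% sparsity level are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def active_dims(features, threshold=0.0):
    """A_i: the set of feature dimensions that are active (on average) for one cluster."""
    return set(np.flatnonzero(features.mean(axis=0) > threshold))

def overlap(a_i, a_j):
    """Jaccard similarity between two clusters' active-dimension sets."""
    return len(a_i & a_j) / max(len(a_i | a_j), 1)

# hypothetical sparse feature representations of two clusters
# (rows = samples, columns = 16 feature dimensions, roughly 30% active)
cluster_i = np.maximum(0.0, rng.standard_normal((20, 16))) * rng.binomial(1, 0.3, size=16)
cluster_j = np.maximum(0.0, rng.standard_normal((20, 16))) * rng.binomial(1, 0.3, size=16)
print(overlap(active_dims(cluster_i), active_dims(cluster_j)))
```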
The explanation above may be a bit too technical and reads awkwardly. The main idea is this: to tell different classes apart, we need to learn highly discriminative features. With enough data, there is no overfitting to worry about; but when the data is scarce, sparsity can be used to increase the discriminative power of the features.
This leads to the interesting hypothesis: using dropout is equivalent to creating more local clusters. With the same amount of data but more clusters, keeping the clusters distinguishable requires greater sparsity.
To verify this hypothesis, the paper also ran an experiment, illustrated below:
The experiment used simulated data: 15,000 points on a circle, with the circle divided into several arcs; the points on one arc belong to the same class, and there are 10 classes in total, so different arcs may belong to the same class. By changing the arc length, one can make more or fewer arcs share the same class.
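An approximate reconstruction of that simulated dataset as described; the number of arcs, the uniform sampling, and the random arc-to-class assignment are my guesses from the description, not the paper's exact setup:

```python
import numpy as np

def make_arc_dataset(n_points=15000, n_arcs=40, n_classes=10, seed=0):
    """Points on the unit circle, split into arcs; each arc is assigned one of
    n_classes labels, so several arcs share a class. Fewer, longer arcs mean
    fewer 'local clusters' per class."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n_points)
    x = np.stack([np.cos(theta), np.sin(theta)], axis=1)      # 2-D coordinates
    arc_id = (theta / (2.0 * np.pi) * n_arcs).astype(int)      # which arc each point is on
    arc_to_class = rng.integers(0, n_classes, size=n_arcs)     # arcs -> classes (many-to-one)
    y = arc_to_class[arc_id]
    return x, y

x, y = make_arc_dataset()
```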
The experimental conclusion: as the arcs get longer, the number of clusters falls and the sparsity decreases, consistent with the hypothesis.
Personal view: this hypothesis not only explains why dropout produces sparsity, it also suggests that dropout exposes more local clusters, and by knowledge point A it is this exposure of local clusters that lets dropout prevent overfitting, with sparsity being merely its outward symptom.
Other technical knowledge points in the paper
References
[1] Srivastava N, Hinton G, Krizhevsky A, et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 2014, 15(1): 1929-1958.
[2] Dropout as data augmentation. http://arxiv.org/abs/1506.08700