Google researcher Ilya Sutskever: 13 tips for successfully training LDNNs
Abstract: This article was written by Ilya Sutskever (Google researcher, student of deep learning pioneer Geoffrey Hinton, and co-founder of DNNResearch). It lays out insights and practical advice on deep learning: why deep learning is powerful, how powerful it is, and the tricks of training deep neural networks.
[Editor's note] This article by Ilya Sutskever (Google researcher, student of Geoffrey Hinton, co-founder of DNNResearch) was written at the invitation of Yisong Yue and gives a comprehensive account of insights and practical advice on deep learning. It was translated by Programmer magazine with Yisong Yue's authorization and published in the magazine's 2015 electronic edition.
In recent years deep learning has taken off, and it is expected to be the next wave of technological innovation. In areas such as speech recognition and image recognition, and in language-related applications such as machine translation, deep learning has achieved excellent results.
Why is that? What is the essence of deep learning? (In the discussion below, large deep neural networks will be abbreviated LDNN.) How do today's LDNNs differ from those of the past? Finally, you may ask how to train an LDNN. The popular answer is "hard, hard, hard," and LDNN training is often cloaked in a mantle of "black magic": experience, experience, experience! Experience is indeed indispensable, but its role should not be overstated, and many open-source neural network packages have helped beginners find their way in (for example Caffe, cuda-convnet, Torch, and Theano).
Why deep learning is possible
First, as the proverb says, a workman who wants to do his job well must first sharpen his tools. Without a sufficiently powerful model, even the best learning algorithm can accomplish nothing.
Second, the model must be trainable. Otherwise, in an era of rapid change, an untrainable model is an Achilles' heel.
Fortunately, LDNNs are both powerful and trainable.
LDNNs: what makes them powerful?
When I talk about LDNNs, I usually mean neural networks of 10 to 20 layers (that is what existing algorithms can handle). Here are a few examples of what makes LDNNs powerful.
Traditional statistical models learn simple patterns or clusters. LDNNs, by contrast, learn to compute; although only a modest number of sequential computational steps is available, each step is a massively parallel operation. This is the important watershed between LDNNs and other statistical models.
More specifically: it is well known that any algorithm can be implemented by a suitably deep circuit (for example, by implementing each time step of the algorithm as one layer). Furthermore, the deeper the circuit, the more expensive (in runtime) the algorithms it can implement. If we view a neural network as a circuit, then a deeper neural network can perform more sequential algorithmic operations. In short: depth = power.
Note: it is important to understand that a single neuron in a neural network can compute the conjunction (AND) or the disjunction (OR) of its inputs, simply by being given appropriate values for its connection weights.
Surprisingly, neural networks are actually more efficient than Boolean circuits. Going further: to solve a given problem, a fairly shallow DNN needs far fewer resources than a Boolean circuit. For example, a DNN with two hidden layers and an appropriate number of units can sort N N-bit numbers. I was quite surprised when I learned this conclusion, so I created a small neural network and trained it to sort 10 6-bit numbers, and the results agreed closely with the claim! Sorting N N-bit numbers with a Boolean circuit is not possible under the same conditions.
DNNs are more efficient than Boolean circuits because neurons perform threshold operations, which small Boolean circuits cannot do.
Finally, although human neurons are slow, they accomplish many complex tasks in a short time. Specifically, common wisdom holds that a human neuron fires fewer than 100 times per second. So if a person can solve a problem within 0.1 seconds, our neurons have time to fire only about 10 times. It follows that a large 10-layer neural network can, in principle, accomplish any task that a human finishes in 0.1 seconds.
The popular belief is that human neurons are far more powerful than artificial ones, but the opposite may be true. It is still too early to judge.
Interestingly, humans can often solve extremely complex cognitive problems in about 0.1 seconds. For example, we can rapidly recognize and respond to what we see, whether it is an expression, a face, or spoken language. In fact, if there is even one person in the world who can solve such a problem quickly, that is enough to persuade us that an LDNN can do the same thing, provided we feed it the right data.
Could a smaller neural network suffice? Possibly. The human brain's neuron count certainly cannot keep growing, because the brain itself cannot keep growing! And if human neurons are noisy, so that a single artificial neuron can do the work of several human neurons, then the number of neurons a DNN needs in order to match the human brain shrinks considerably.
The arguments above mainly establish that, for many problems, there exists an LDNN connection configuration that essentially solves the problem. Crucially, the number of units required is far from astronomical, so it is possible to train and obtain a high-performance network on existing hardware. This last point is very important, and we pursue it in depth below.
We know that many machine learning methods are consistent: given enough data, they can solve any problem. But consistency often requires exponentially large amounts of data. For example, the nearest-neighbor algorithm can solve any problem once it has seen every possibility, and the support vector machine is similar, provided you supply enough training samples. The same applies to a neural network with a single hidden layer: if every possible training sample has its own dedicated neuron, and that neuron does not fire for any other case, then we can learn and represent every possible function from inputs to outputs. Everything reduces to data, but such an approach is infeasible with limited resources.
This is where LDNNs differ from their predecessors: we use large, but not exponentially huge, LDNNs to solve real-world problems. If a person can solve a problem in a fraction of a second, that is already evidence that a network of quite modest size has the needed capacity.
One thing I must admit: it remains unproven in general that a DNN can solve a given problem, although in practice LDNNs routinely handle such problems with feasible amounts of data.
That is my point. For a problem such as visual object recognition, all we need to do is train a giant convolutional neural network of, say, 50 layers. Surely a neural network at that level is enough to rival the human visual system, isn't it? So finding the right weights is what makes the difference between success and failure.
Learning
What is learning? Learning is finding weight values that make the neural network perform as well as possible on the training data. In other words, we want to distill the information in the labeled data into the parameters of the neural network.
The success of deep learning rests on a fortunate fact: well-tuned and well-initialized stochastic gradient descent (SGD) can train LDNNs well. This is significant because the training error of a neural network, viewed as a function of its weights, is highly non-convex. In non-convex optimization the outcome is supposed to be unpredictable; only convex is good, non-convex is bad. Yet SGD manages to make a difference. Training a neural network is NP-hard; in fact, finding the best set of weights for a network with just 3 hidden units is NP-hard. But SGD solves the problem in practice. This is the foothold of deep learning.
We can say with fair confidence that successful LDNN training relies on the data having "simple" correlations, from which the LDNN bootstraps itself toward the "complex" correlations in the data. I have an experiment that bears on this: training a neural network to solve the parity problem is hard. I managed to succeed at 25 bits, even 29 bits, but never beyond 30 (perhaps someone can, but I really have not). Now, we know the parity problem is very unstable and lacks any linear correlation: each individual input bit is uncorrelated with the output. This is a problem for neural networks, because a neural network is highly linear at initialization (does that mean I should use larger initial weights? The question of initial weights is discussed later). So my hypothesis (shared by many experts) is that when neural networks begin to learn, they latch onto the highly correlated aspects of input and output; then, as hidden units are recruited to detect these, the network becomes able to handle more complex correlations. I imagine a whole spectrum of correlations, from simple to complex, with the network jumping from one correlation to a more complex one: the true portrait of an opportunistic climber, isn't it?
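The lack of linear correlation in parity is easy to verify numerically. Below is a small sketch, assuming NumPy is available, that enumerates all 8-bit inputs and checks that every individual input bit is uncorrelated with the parity output:

```python
import numpy as np

# Enumerate all 2^8 binary inputs of length 8 together with their parity labels.
n_bits = 8
X = np.array([[(i >> b) & 1 for b in range(n_bits)] for i in range(2 ** n_bits)])
y = X.sum(axis=1) % 2  # parity: 1 iff an odd number of bits are set

# Correlation of each individual input bit with the parity output.
corrs = [float(np.corrcoef(X[:, b], y)[0, 1]) for b in range(n_bits)]
print(corrs)  # every entry is ~0: no single bit carries any linear signal
```

Over the full input distribution, fixing any one bit leaves the parity uniformly distributed, which is why every correlation vanishes.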
Generalization
While there is a shortage of substantive things to say about neural network optimization (aside from its non-convexity and the tired topic of local minima), the discussion of generalization is much more interesting and concrete.
For example: Valiant published a famous paper, "A Theory of the Learnable," in 1984, in which he simply proved that, given a finite number of functions, say N, every training error will be close to every test error once you have more than log N training cases, up to a small constant factor. Clearly, if every training error is close to its test error, overfitting is basically impossible (overfitting occurs when the gap between training error and test error is too large). I have seen similar conclusions in Vapnik's book. The theorem is easy to prove, so there is not much more to say here.
But this simple result has great significance for any implementation of a neural network. Suppose a neural network has n parameters, each a 32-bit float. Then the network is specified by 32n bits, so there are at most 2^(32n) distinct networks of this kind, and probably far fewer. In other words, once we have more than 32n training samples, we will not overfit much. That is good: in theory we can simply count the parameters. Better still, if we believe each weight really needs only 4 bits and the rest are noise, then the number of training samples needed is a small constant factor of 4n rather than 32n.
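As a quick sanity check on the arithmetic, here is a minimal sketch of the counting argument; the function name samples_needed is made up for illustration, and the small constant factor from the log N bound is ignored:

```python
import math

def samples_needed(n_params: int, bits_per_param: int) -> int:
    """Bit-counting bound: with n parameters at b bits each there are at
    most 2^(b*n) distinct networks, so roughly log2 of that count (times a
    small constant, ignored here) training cases rule out heavy overfitting."""
    n_functions = 2 ** (bits_per_param * n_params)  # size of the hypothesis class
    return round(math.log2(n_functions))            # log N

print(samples_needed(1000, 32))  # 32000, i.e. 32n samples
print(samples_needed(1000, 4))   # 4000, i.e. only 4n if 4 bits per weight matter
```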
Conclusion
If we want to solve a problem with an LDNN, we need to give it enough parameters. So we need a high-quality labeled training set, large enough to contain the information needed to specify all of the network's connections. Once the training set is in hand, we should run SGD on it until the network solves the problem. This will work if the network is large and deep.
What has changed since the 1980s?
In the past, people believed that neural networks could solve "any problem." So why did they not succeed in the end? There are several reasons.
- Computers were very slow, so the neural networks of the past were tiny, and tiny networks perform poorly. In other words, small neural networks are not powerful.
- Datasets were small. Even if an LDNN could somehow have been trained, there were no large datasets with enough information to constrain its enormous number of parameters. So failure was inevitable.
- Nobody knew how to train deep networks. Depth matters: today's best object recognition configurations have 20 to 25 successive convolutional layers, while a 2-layer neural network is doomed to be inefficient at object recognition. Yet in the past it was taken for granted that SGD could not train deep networks, since that seemed too good to be true.
The development of science is fascinating, especially in hindsight: it turns out that training deep neural networks was a trivial matter all along.
Practical advice
Well, maybe you are tempted by now. LDNNs represent the future; don't you want to train one? Rumor has it that LDNNs are arcane and advanced. Is that true? It may once have been, but since then many communities have made the effort, and if you keep the following in mind, training a neural network is not too difficult. Below is a summary of community knowledge. It is important, so please read it carefully.
Get the data: make sure you have a high-quality input/output dataset that is large enough, representative, and relatively cleanly labeled. Without such a dataset it is hard to succeed.
Preprocessing: it is important to center the data, that is, to make each dimension have mean 0 and variance 1. Sometimes, when an input dimension varies over orders of magnitude, it is better to use log(1 + x) of that dimension. Essentially, it is important to find a faithful encoding in which 0 is a natural, trustworthy value for each dimension. Doing so makes learning work much better. The reason lies in the weight-update formula: Δw_ij ∝ x_i · ∂L/∂y_j (where w are the weights from layer x to layer y, and L is the loss function). If the mean of x is large (say, 100), the weight updates will be very large and correlated with one another, which makes learning poor and slow. Keeping the mean at 0 and the variance small is a key success factor.
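A minimal sketch of this preprocessing in NumPy; the helper name preprocess and the choice of which columns get log(1 + x) are illustrative:

```python
import numpy as np

def preprocess(X: np.ndarray, log_scale_cols=()) -> np.ndarray:
    """Center each column to mean 0 and scale it to unit variance.

    Columns whose values span orders of magnitude can first be passed
    through log(1 + x), as suggested above."""
    X = X.astype(float).copy()
    for c in log_scale_cols:
        X[:, c] = np.log1p(X[:, c])  # log(1 + x) for heavy-tailed dimensions
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0              # guard against constant columns
    return (X - mean) / std

X = np.array([[100.0, 1.0], [102.0, 1000.0], [98.0, 10.0]])
Z = preprocess(X, log_scale_cols=[1])
print(Z.mean(axis=0))  # ~[0, 0]
print(Z.std(axis=0))   # ~[1, 1]
```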
Minibatches: on today's computers it is inefficient to process only one training sample at a time. It is far more efficient to work on a minibatch of, say, 128 examples, because the throughput gain is enormous. Minibatches of size 1 can also work, and they may even reduce overfitting, but they are likely to be outperformed by larger batches. Do not use overly large minibatches either, because they tend to make training inefficient and can overfit. So my advice is: choose the minibatch size that suits your hardware; that will usually be the most efficient.
Gradient normalization: divide the gradient by the minibatch size. This is a good idea because if you double (or halve) the batch, you do not need to change the learning rate (not by much, anyway).
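As a sketch of why this helps, the toy SGD step below (for a linear model with squared loss; all names are illustrative) divides the gradient by the batch size, so duplicating the batch leaves the update unchanged:

```python
import numpy as np

def sgd_step(w, X_batch, y_batch, lr=0.1):
    """One SGD step on squared loss for a linear model. The gradient is
    divided by the batch size, making the step batch-size invariant."""
    err = X_batch @ w - y_batch             # residuals, shape (B,)
    grad = X_batch.T @ err / len(y_batch)   # mean gradient over the batch
    return w - lr * grad

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))
y = rng.standard_normal(8)
w0 = np.zeros(3)
w1 = sgd_step(w0, X, y)
w2 = sgd_step(w0, np.vstack([X, X]), np.hstack([y, y]))
print(np.allclose(w1, w2))  # True: doubling the batch needs no LR change
```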
Learning rate schedule: start with a normal-sized learning rate (LR) and shrink it toward the end.
A typical LR value is 0.1. Surprisingly, 0.1 is a good learning rate for a large fraction of neural network problems. Learning rates usually err on the side of being smaller rather than larger.
Use a validation set (a subset of the training set that is not trained on) to decide when to lower the learning rate and when to stop training (for example, when the error on the validation set starts to increase).
A practical recipe for the LR schedule: if training stalls on the validation set, divide the LR by 2 (or 5) and continue. Eventually the LR will become very small, and at that point it is time to stop training. This ensures you do not keep fitting (and overfitting) the training data while validation performance suffers. Lowering the LR is important, and controlling it via a validation set is good practice.
But most important of all, pay attention to the learning rate. Some researchers, such as Alex Krizhevsky, monitor the ratio between the norm of the update and the norm of the weights. This ratio should be about 10⁻³. If it is much smaller, learning will be very slow; if it is much larger, learning will be unstable or will fail outright.
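This monitoring rule can be sketched as follows; the helper name update_ratio is made up, and the numbers are a contrived example chosen so the ratio lands near the healthy regime:

```python
import numpy as np

def update_ratio(w, update):
    """Ratio of update norm to weight norm; healthy values sit near 1e-3."""
    return np.linalg.norm(update) / np.linalg.norm(w)

rng = np.random.default_rng(0)
w = rng.standard_normal(1000)           # current weights
grad = 1e-2 * rng.standard_normal(1000) # a typical gradient (contrived scale)
lr = 0.1
r = update_ratio(w, lr * grad)
print(f"{r:.1e}")  # if far above ~1e-3 lower the LR; if far below, raise it
```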
Weight initialization: pay attention to how the weights are randomly initialized at the start of learning.
If you want to be lazy, try 0.02 * randn(num_params). A value in this range works well on a surprising number of problems. Smaller (or larger) values are, of course, also worth trying.
If that does not work well (say, for an unconventional and/or very deep architecture), then initialize each weight matrix with init_scale / sqrt(layer_width) * randn, where init_scale is set to 0.1 or 1, or some similar value.
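The two initializations described above can be sketched as follows, assuming NumPy; the helper names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_lazy(shape):
    """The lazy default suggested above: 0.02 * randn."""
    return 0.02 * rng.standard_normal(shape)

def init_scaled(fan_in, fan_out, init_scale=1.0):
    """The fallback for deep/unusual nets: init_scale / sqrt(layer_width) * randn
    per weight matrix, with init_scale around 0.1 or 1."""
    return (init_scale / np.sqrt(fan_in)) * rng.standard_normal((fan_in, fan_out))

W = init_scaled(1024, 512)
print(W.std())  # close to 1/sqrt(1024) ≈ 0.031
```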
Random initialization is supremely important for deep and recurrent networks. Handled badly, the network will look as though it is learning nothing at all. But we know that once the conditions are right, the network will learn.
An interesting story: for years, researchers believed that SGD could not train deep neural networks from random initialization. Every attempt failed. Embarrassingly, they failed because they initialized with "small random weights"; small values work very well for shallow networks but fail badly on deep ones. When the network is deep, many weight matrices are multiplied together, and the bad effects are amplified.
For a shallow network, however, SGD with small random weights solves the problem just fine.
So pay attention to initialization. Try several different initializations; the effort will be rewarded. If the network refuses to work at all (i.e., never gets off the ground), improving the random initialization is the right place to push.
If you are training RNNs or LSTMs, use a hard constraint on the norm of the gradient (remembering that the gradient has already been divided by the batch size). A constraint like 15 or 5 worked very well in my own experiments. Divide the gradient by the batch size, then check whether its norm exceeds 15 (or 5); if it does, rescale it down to 15 (or 5). This one trick plays a huge role in training RNNs and LSTMs. Without it, exploding gradients cause learning to fail, forcing you in the end into tiny, useless learning rates like 1e-6.
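The clipping rule above can be sketched as follows; the helper name clip_gradient is illustrative:

```python
import numpy as np

def clip_gradient(grad, batch_size, max_norm=15.0):
    """Divide the gradient by the batch size, then rescale it whenever its
    norm exceeds max_norm (hard constraints like 15 or 5, per the text)."""
    g = grad / batch_size
    norm = np.linalg.norm(g)
    if norm > max_norm:
        g = g * (max_norm / norm)  # shrink back onto the norm ball
    return g

g = clip_gradient(np.full(100, 50.0), batch_size=1, max_norm=15.0)
print(np.linalg.norm(g))  # ≈ 15.0: the exploding gradient is tamed
```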
Numerical gradient checking: if you are not using Theano or Torch, you will probably be implementing the gradients yourself. It is easy to make mistakes when implementing gradients, so a numerical gradient check is essential. Doing it will give you confidence in your code. Tuning the hyperparameters (such as learning rate and initialization) is very valuable, so spend your effort where it counts.
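A minimal central-difference checker, here validating the hand-written gradient of f(w) = ||w||²; all names are illustrative:

```python
import numpy as np

def numerical_grad(f, w, eps=1e-6):
    """Central-difference approximation of the gradient of scalar f at w."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

f = lambda w: float(w @ w)   # loss: squared norm
analytic = lambda w: 2 * w   # its hand-derived gradient
w = np.array([1.0, -2.0, 3.0])
print(np.max(np.abs(analytic(w) - numerical_grad(f, w))))  # tiny (~1e-9 or less)
```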
If you are using LSTMs and want to train them on problems with long-range dependencies, initialize the biases of the LSTM forget gates to large values. By default the forget gate is a sigmoid of all its inputs, so with small weights it sits around 0.5, which is adequate for only some problems. This is a common pitfall of LSTM initialization.
Data augmentation: be creative and increase the number of training examples algorithmically. For images, translate and rotate them; for audio, mix the clean speech with all types of noise. Data augmentation is an art (unless you are dealing with images), and it requires some common sense.
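A sketch of that bias initialization, assuming a NumPy LSTM whose gate biases are stored as one vector in (input, forget, cell, output) order; that layout, and the helper name, are assumptions you should adapt to your own implementation:

```python
import numpy as np

def init_lstm_biases(hidden_size, forget_bias=1.0):
    """Zero biases for the four LSTM gate blocks, with the forget-gate slice
    set to a larger value so the cell state survives early training.

    Gate ordering (input, forget, cell, output) is an assumption here."""
    b = np.zeros(4 * hidden_size)
    b[hidden_size:2 * hidden_size] = forget_bias  # forget-gate slice only
    return b

b = init_lstm_biases(4, forget_bias=1.0)
print(b)  # zeros everywhere except indices 4..7, which hold 1.0
```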
Dropout: dropout provides an easy way to improve performance. Remember to tune the dropout rate, and at test time do not forget to turn dropout off and multiply the weights by (1 - dropout rate). Also, be sure to train the network for longer. Unlike ordinary training, where the validation error often starts increasing after prolonged training, dropout networks keep getting better and better with time, so patience is the key.
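The train/test asymmetry can be sketched as follows, assuming NumPy; this is the classic formulation described above, where weights are scaled at test time (not the "inverted" variant that scales during training), and the helper names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(h, drop_rate=0.5):
    """Training time: zero each hidden activation with probability drop_rate."""
    mask = rng.random(h.shape) >= drop_rate
    return h * mask

def dropout_test_weights(w, drop_rate=0.5):
    """Test time: dropout is off and the weights are multiplied by the keep
    probability (1 - drop_rate), as the text describes."""
    return w * (1.0 - drop_rate)

h = np.ones(100_000)
print(dropout_train(h).mean())           # ~0.5: about half the units dropped
print(dropout_test_weights(np.ones(3)))  # [0.5 0.5 0.5]
```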
Ensembling: train 10 neural networks, then average their predictions. This approach is simple, but it gives a direct and measurable performance boost. Some people are puzzled about why averaging is so effective. An example may help: suppose two classifiers both have an error rate of 70%; wherever one of them is right and confident, the averaged prediction is pulled closer to the correct answer. The effect is even stronger for well-calibrated networks: when such a network is confident it is usually correct, and when it is incorrect it is not confident.
The 13 points above cover everything about training LDNNs successfully; I hope I have not missed anything.
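Prediction averaging itself is a one-liner; the sketch below (contrived probabilities, illustrative names) shows the ensemble staying confident where the models agree and hedging where they conflict:

```python
import numpy as np

def ensemble_predict(prob_list):
    """Average the predicted class probabilities of several networks."""
    return np.mean(prob_list, axis=0)

# Two imperfect models over 2 examples x 2 classes.
p1 = np.array([[0.9, 0.1], [0.4, 0.6]])
p2 = np.array([[0.7, 0.3], [0.6, 0.4]])
avg = ensemble_predict([p1, p2])
print(avg)  # ≈ [[0.8, 0.2], [0.5, 0.5]]: confident agreement, hedged conflict
```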
The final summary is as follows:
- LDNNs are very powerful;
- With a high-performance computer, LDNNs can be trained;
- With a very high-quality dataset, we can find the best LDNN for the task;
- LDNNs can solve the problem, or at least help solve it.
In closing
What will the future be like? Predicting the future is obviously hard, but in general, models that can perform even more computation will probably do very well. The Neural Turing Machine has taken a very important step in this direction. Other problems, including unsupervised learning, are, as of January 8, 2015, still just the tip of the iceberg for me. Using unsupervised learning to learn complex data is a worthy attempt. The road is long, and we still have work to do.
Original link: A Brief Overview of Deep Learning (translated by Wu Kun, proofread by Gaobo, edited by Zhou Jianding)
"Preview" The First China AI Congress (CCAI 2015) will be held in July 26-27th in Beijing Friendship Hotel. Machine learning and pattern recognition, big data opportunities and challenges, artificial intelligence and cognitive science, intelligent robotics four subject experts gathered. AI Product Library will be synchronized online, appointment consultation: qq:1192936057. Welcome attention.
This article was translated and compiled by CSDN and may not be reproduced without permission. For reprint inquiries, contact market#csdn.net (replace # with @).