Paper Reading: Snapshot Ensembles


Introduction

1. Characteristics of stochastic gradient descent

Stochastic Gradient Descent (SGD), the mainstream optimization method in deep learning, has the following advantage:

    • The ability to avoid and escape spurious saddle points and local minima

This paper argues that these local minima also contain useful information that can help improve the model.

2. Significance of local minima

Neural network optimization generally does not converge to the global minimum but to a local minimum. These local minima are not all equally good. To distinguish the good from the bad, it is generally believed that:

    • Local minima lying in a flat basin correspond to models that generalize better; these are considered the better local minima.

3. SGD and local minima

During optimization, SGD tends to avoid steep local minima because:

    • The gradient is computed on a mini-batch and is therefore noisy (imprecise).
    • When the learning rate is relatively large, steps taken along this imprecise gradient will not settle into a steep local minimum.

This is an advantage of SGD during optimization: it avoids converging into the basins of steep local minima.

However, when the learning rate is relatively small, SGD tends to converge to the nearest local minimum.

These two distinct behaviors of SGD are used at different stages of training (a sketch of such a schedule follows this list):

    • In the initial stage, a large learning rate is used to move quickly toward the region of a flat local minimum.
    • Once training stops improving, the learning rate is reduced and the search converges to the final local minimum.
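
As a rough sketch of this conventional two-stage behavior (not the snapshot method itself; the base rate, milestone epochs, and decay factor below are assumed values for illustration):

```python
def step_decay_lr(epoch, base_lr=0.1, milestones=(150, 225), gamma=0.1):
    """Conventional staged schedule: keep a large learning rate early so SGD
    moves quickly toward flat regions, then shrink it at fixed milestones so
    the search settles into the nearest (hopefully flat) local minimum."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# Large steps for most of training, then two sharp reductions.
print(step_decay_lr(10))    # 0.1
print(step_decay_lr(160))   # ~0.01
print(step_decay_lr(230))   # ~0.001
```
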
4. Model Training and local minima

The number of local minima grows exponentially with the number of parameters, so local minima in a neural network are extremely numerous. The same model, with a different initialization or a different ordering of the training batches, will converge to a different local minimum and therefore behave differently.

In practice, the final overall errors at different local minima are similar, but models that converge to different local minima make different errors on individual predictions. Ensembling (voting, averaging) exploits these differences between models and tends to improve the final predictions, which is why multi-model ensembles are widely used in competitions.
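
A minimal sketch of the averaging form of ensembling, assuming each model outputs class probabilities (the arrays below are toy values, not results from the paper):

```python
import numpy as np

def ensemble_average(prob_list):
    """Average class-probability predictions from several models and return
    the final predicted labels.

    prob_list: list of arrays of shape (num_samples, num_classes), e.g. the
    softmax outputs of models that converged to different local minima.
    """
    avg = np.mean(np.stack(prob_list, axis=0), axis=0)  # (num_samples, num_classes)
    return avg.argmax(axis=1)

# Toy example: three "models", two samples, three classes.
p1 = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]])
p2 = np.array([[0.5, 0.4, 0.1], [0.2, 0.3, 0.5]])
p3 = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
print(ensemble_average([p1, p2, p3]))  # -> [0 1]
```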

5. Ensemble and neural networks

Because neural network training is time-consuming, multi-model ensembles are less widely used in deep learning than in traditional machine learning. Each base model in the ensemble is trained separately, and training even a single model is often expensive, so the cost of buying performance this way is quite high.

This paper proposes a method that requires no additional training cost: it obtains several models from a single training run and ensembles these models to produce the final model.

Principle

1. Summary

When training a neural network with SGD, the method can both converge to and escape local minima. Within a single training run, the model is made to converge \(M\) times to different local minima; at each convergence the current weights, which could already serve as a final model, are saved as a snapshot. A large learning rate is then used to escape the current local minimum, as sketched below.
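
A minimal sketch of this save-then-escape loop, assuming a PyTorch-style `model` and `optimizer`; the helper `train_one_cycle` and the hyperparameter values are assumptions for illustration, not the paper's released code:

```python
import copy

def snapshot_training(model, optimizer, train_one_cycle, num_cycles=6, alpha_0=0.1):
    """Collect several snapshot models from a single training run.

    train_one_cycle(model, optimizer) is assumed to run one annealing cycle:
    starting from the current weights, it decays the learning rate from
    alpha_0 toward ~0 so the model converges into a local minimum.
    """
    snapshots = []
    for cycle in range(num_cycles):
        # Start each cycle with a large learning rate; after the first cycle
        # this is what kicks the model out of the previous local minimum.
        for group in optimizer.param_groups:
            group["lr"] = alpha_0
        train_one_cycle(model, optimizer)
        # The model has (approximately) converged: save a snapshot of it.
        snapshots.append(copy.deepcopy(model.state_dict()))
    return snapshots
```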

The learning rate is controlled with a cosine function, which behaves as follows:

    • At the start of each cycle, the learning rate is abruptly raised back to its initial value.
    • Within each cycle, the learning rate then decreases rapidly (anneals) toward zero.

Training in this way is like taking several snapshots along the optimization path, hence the name Snapshot Ensembling. (The right half of the figure in the paper illustrates the method.)

2. Implicit and explicit ensemble of neural networks

Dropout and its variants are an implicit ensemble technique: during training, some nodes in the hidden layers are randomly dropped, the set of dropped nodes differs from one training step to the next, and all nodes are used at prediction time.

Thus, training with dropout implicitly creates a vast number of weight-sharing models by randomly removing hidden-layer nodes, and these models are implicitly ensembled at prediction time.
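
As a small illustration (assuming PyTorch; not code from the paper), dropout randomly zeroes units only in training mode, while every node is used at prediction time:

```python
import torch
import torch.nn as nn

layer = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

layer.train()       # training mode: roughly half of the units are dropped
print(layer(x))     # some entries are 0, the rest are scaled by 1/(1-p) = 2

layer.eval()        # evaluation mode: dropout is disabled
print(layer(x))     # all ones -- every node contributes to the prediction
```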

The Snapshot Ensemble presented in this paper, by contrast, explicitly combines multiple models with non-shared weights to achieve the improvement.

3. Details

In short, Snapshot Ensembling visits a number of local minima during a single training (optimization) run before the final convergence, saves a snapshot of the model at each local minimum, uses all the saved models at prediction time, and averages their outputs to obtain the final result.

These snapshot points are not chosen at random; we want each snapshot to:

    • have as low an error as possible
    • make mistakes on different samples than the other snapshots, to ensure diversity among the models

This requires some special handling during optimization.

Observing a standard optimization trajectory, the error on the development (validation) set usually drops sharply only after the learning rate has been lowered, which under a conventional learning-rate schedule happens only after many epochs.

However, the learning rate can be lowered much earlier and training continued, with little effect on the final error; this greatly improves training efficiency and lets the model reach a local minimum after only a few epochs.

The paper therefore uses cyclic cosine annealing: the learning rate is lowered very early so that training reaches the first local minimum, yielding the first model. The learning rate is then raised again, which perturbs the model and pushes it out of that local minimum, and these steps are repeated until the specified number of models has been obtained.

For the learning-rate schedule, the paper uses the following function:

\[\alpha(t) = f\big(\operatorname{mod}(t-1, \lceil T/M \rceil)\big)\]

where \(t\) is the iteration number (the mini-batch index), \(T\) is the total number of iterations (batches), \(f\) is a monotonically decreasing function, and \(M\) is the number of cycles, i.e. the number of final models. In other words, the whole training process is divided into \(M\) cycles, each of which starts with a large learning rate that is then annealed down to a small one. The large learning rate \(\alpha = f(0)\) gives the model enough energy to escape the current local minimum, while the small learning rate \(\alpha = f(\lceil T/M \rceil)\) lets the model converge to a good local minimum.

The paper uses the following shifted cosine function:

\[\alpha(t) = \frac{\alpha_0}{2}\left(\cos\!\left(\frac{\pi\,\operatorname{mod}(t-1, \lceil T/M \rceil)}{\lceil T/M \rceil}\right) + 1\right)\]

Here \(\alpha_0\) is the initial learning rate; at the end of each cycle \(\alpha = f(\lceil T/M \rceil) \approx 0\), which ensures that the minimum learning rate is small enough. The learning rate is updated at every mini-batch rather than once per epoch. (The paper's figures show the learning rate and loss over the whole training run.)
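
A direct implementation of this schedule, as a sketch (the values of \(T\), \(M\), and \(\alpha_0\) in the example are illustrative, not taken from the paper):

```python
import math

def snapshot_lr(t, T, M, alpha_0):
    """Shifted cosine annealing schedule from the Snapshot Ensembles paper.

    t:       current iteration (mini-batch index), counted from 1 to T
    T:       total number of iterations (batches)
    M:       number of cycles, i.e. the number of snapshot models
    alpha_0: learning rate at the start of each cycle
    """
    cycle_len = math.ceil(T / M)
    pos = (t - 1) % cycle_len       # position within the current cycle
    return alpha_0 / 2 * (math.cos(math.pi * pos / cycle_len) + 1)

# Example: T = 60000 iterations, M = 6 cycles, alpha_0 = 0.1
print(snapshot_lr(1, 60000, 6, 0.1))       # 0.1 -- start of the first cycle
print(snapshot_lr(10000, 60000, 6, 0.1))   # ~0  -- end of the first cycle
print(snapshot_lr(10001, 60000, 6, 0.1))   # 0.1 -- restart of the second cycle
```

A similar cyclic cosine schedule is available in PyTorch as `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`.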
