1804.03235 - Large Scale Distributed Neural Network Training Through Online Distillation

Existing approaches to distributed model training
    • Distributed SGD
      • Synchronous (parallel) SGD: each step of large-scale training is gated by the slowest machine
      • Asynchronous SGD: stale gradients can push the weight updates in unpredictable directions
    • Training multiple models in parallel: separate clusters train different models that are then combined into an ensemble, but this increases inference cost
    • Distillation: a complicated, multi-step process
      • Choosing the student's training dataset:
        • Unlabeled data
        • The original (raw) training data
        • Held-out data
Collaborative distillation (codistillation)
    1. Using the same architecture for all the models;
    2. Using the same dataset to train all the models; and
    3. Using the distillation loss during training, before any model has fully converged.

Characteristics
- Even if the teacher and student use exactly the same model architecture and settings, there is still a useful boost as long as their predictions differ enough.
- That is, the benefit appears even before the models have converged.
- It also works to drop the teacher/student distinction entirely and have the models teach each other.
- Distilling against an out-of-sync (stale) copy of the other model is also fine.

The algorithm is easy to understand, and the steps don't look very complicated.
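
To make this concrete, here is a minimal two-model sketch of a codistillation step, assuming a PyTorch setup. It is not the paper's code: `model_a`, `model_b`, the optimizers, and the weight `alpha` on the distillation term are illustrative names and choices of my own.

```python
import torch
import torch.nn.functional as F

def codistillation_step(model_a, model_b, opt_a, opt_b, x, y, alpha=0.5):
    """One two-way codistillation step: each model fits the labels and is
    also pulled toward the other model's (detached) predictions."""
    logits_a = model_a(x)
    logits_b = model_b(x)

    # Teacher targets are the peer's predictions, treated as constants.
    probs_a = F.softmax(logits_a, dim=-1).detach()
    probs_b = F.softmax(logits_b, dim=-1).detach()

    # Supervised loss plus a divergence from the peer's predictions
    # (KL here; cross-entropy against the peer differs only by a constant).
    loss_a = F.cross_entropy(logits_a, y) + alpha * F.kl_div(
        F.log_softmax(logits_a, dim=-1), probs_b, reduction="batchmean")
    loss_b = F.cross_entropy(logits_b, y) + alpha * F.kl_div(
        F.log_softmax(logits_b, dim=-1), probs_a, reduction="batchmean")

    opt_a.zero_grad(); opt_b.zero_grad()
    loss_a.backward()
    loss_b.backward()
    opt_a.step(); opt_b.step()
    return loss_a.item(), loss_b.item()
```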

Why it is acceptable to distill against out-of-date (stale) model weights:

    1. Every weight update changes the gradients, but as training progresses towards convergence, weight updates should substantially change the predictions only on a small subset of the training data;
    2. Weights (and gradients) are not statistically identifiable: different copies of the weights might have arbitrary scaling differences, permuted hidden units, or otherwise rotated or transformed hidden-layer feature spaces, so averaging gradients does not make sense unless the models are extremely similar;
    3. Sufficiently out-of-sync copies of the weights will have completely arbitrary differences that change the meaning of individual directions in feature space, differences that cannot be detected by measuring the loss on the training set;
    4. In contrast, the output units have a clear and consistent meaning enforced by the loss function and the training data.
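
Point 2 is easy to see with a toy example: permuting the hidden units of a one-hidden-layer network leaves its outputs unchanged, so two weight vectors can represent exactly the same function while their average represents something else entirely. Predictions, by contrast, can always be averaged. A small NumPy illustration of my own (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny one-hidden-layer network: y = W2 @ tanh(W1 @ x)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

def forward(W1, W2, x):
    return W2 @ np.tanh(W1 @ x)

# Permute the hidden units: rows of W1 and columns of W2 move together.
perm = np.array([2, 0, 3, 1])
W1p, W2p = W1[perm, :], W2[:, perm]

# The two weight sets define exactly the same function...
assert np.allclose(forward(W1, W2, x), forward(W1p, W2p, x))

# ...but averaging the weights gives a different (broken) function,
# while averaging the outputs is trivially the shared prediction.
W1_avg, W2_avg = (W1 + W1p) / 2, (W2 + W2p) / 2
print("averaged-weights output:", forward(W1_avg, W2_avg, x))
print("original output:        ", forward(W1, W2, x))
```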

So the point here seems to be the benefit of randomness, i.e. the diversity between replicas?

A practical and instructive framework design:

    1. Each worker trains an independent copy of the model on a locally available subset of the training data.
    2. Occasionally, workers checkpoint their parameters.
    3. Once this happens, other workers can load the freshest available checkpoints into memory and perform codistillation against them.
       Within each worker group, ordinary distributed SGD can still be used on a smaller cluster.
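
A rough sketch of what one worker's loop might look like under this design, assuming a PyTorch setup with a shared filesystem for checkpoints. The directory layout, the checkpoint interval, and all of the helper names here are my own assumptions for illustration, not the paper's actual system:

```python
import glob
import os
import torch
import torch.nn.functional as F

CKPT_DIR = "/shared/codistill_ckpts"      # hypothetical shared checkpoint directory
WORKER_ID = int(os.environ.get("WORKER_ID", "0"))
CKPT_EVERY = 1000                         # steps between checkpoints (assumed)

def save_checkpoint(model, step):
    torch.save(model.state_dict(),
               os.path.join(CKPT_DIR, f"worker{WORKER_ID}_step{step}.pt"))

def load_freshest_peer(peer_model):
    """Load the most recent checkpoint written by any *other* worker, if one exists."""
    ckpts = [p for p in glob.glob(os.path.join(CKPT_DIR, "worker*_step*.pt"))
             if not os.path.basename(p).startswith(f"worker{WORKER_ID}_")]
    if not ckpts:
        return False
    freshest = max(ckpts, key=os.path.getmtime)
    peer_model.load_state_dict(torch.load(freshest, map_location="cpu"))
    return True

def train(model, peer_model, optimizer, data_iter, alpha=0.5, steps=100_000):
    peer_model.eval()
    have_peer = False
    for step in range(steps):
        x, y = next(data_iter)                     # local shard of the training data
        logits = model(x)
        loss = F.cross_entropy(logits, y)
        if have_peer:                              # distill toward the stale peer copy
            with torch.no_grad():
                peer_probs = F.softmax(peer_model(x), dim=-1)
            loss = loss + alpha * F.kl_div(
                F.log_softmax(logits, dim=-1), peer_probs, reduction="batchmean")
        optimizer.zero_grad(); loss.backward(); optimizer.step()

        if step % CKPT_EVERY == 0:
            save_checkpoint(model, step)           # publish our weights occasionally
            have_peer = load_freshest_peer(peer_model) or have_peer
```

Note that the only cross-group communication in this sketch is the occasional checkpoint read and write; everything else is purely local.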

The paper also points out that, compared with constantly exchanging gradients or weights, workers only need to load a checkpoint occasionally, and each model cluster otherwise runs completely independently of the others. This does remove some of the usual communication problems.
But what happens if one model breaks down and never converges at all?

Also, the framework is not as simple as it sounds: managing multiple models and their checkpoints is not a trivial matter.

Experimental conclusion

About 20 TB of training data, large and varied.

The paper finds that more machines do not always mean a better final model: 32-128 workers seems to be the sweet spot; beyond that, convergence speed and model quality stop improving and sometimes even degrade.

In Figure 2a, the two-model ensemble performs best, followed by codistillation; the worst are the unigram smoothing (0.9) and label smoothing (0.99) baselines, which perform about the same as plain training, since they amount to just adding random noise.
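
For context on those baselines: both smoothing variants mix the one-hot label with a fixed, input-independent prior (uniform for label smoothing, the token unigram distribution for unigram smoothing), whereas the codistillation target is the peer model's input-dependent prediction. A small sketch of my own; the exact weight convention in the paper may differ from what is assumed here:

```python
import numpy as np

def one_hot(label, num_classes):
    t = np.zeros(num_classes)
    t[label] = 1.0
    return t

def smoothed_target(label, num_classes, prior, weight):
    """Mix the one-hot label with a fixed prior; `weight` is the mass
    placed on the prior (the paper's convention may differ)."""
    return (1.0 - weight) * one_hot(label, num_classes) + weight * prior

num_classes = 5
uniform = np.full(num_classes, 1.0 / num_classes)
unigram = np.array([0.4, 0.3, 0.15, 0.1, 0.05])    # made-up token frequencies

print(smoothed_target(2, num_classes, uniform, weight=0.99))  # label smoothing
print(smoothed_target(2, num_classes, unigram, weight=0.9))   # unigram smoothing
# A codistillation target would instead be softmax(peer_logits(x)): it changes
# with the input, which is why it carries information these fixed priors cannot.
```
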
In addition, Figure 2b compares codistillation where both models see the same data against codistillation where each model sees its own random subset of the data; the experiment finds that using different data actually gives better performance.
The ImageNet experiments (Figure 3) show results similar to those in Figure 2a.
Figure 4 shows that it is not necessary to use the very latest copy of the other model: codistilling against occasionally loaded checkpoints does not significantly reduce training efficiency.

Models that are still under-fitted are useful as teachers, but overfitted models may be less valuable for distillation.
Codistillation converges faster and is more efficient than two-step (offline) distillation.
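
For contrast, here is a minimal sketch of the conventional two-step (offline) distillation that codistillation streamlines: first train the teacher to convergence, then train the student against the frozen teacher. The names, the temperature `T`, and the mixing weight `alpha` are illustrative choices of mine, not the paper's:

```python
import torch
import torch.nn.functional as F

def train_teacher(teacher, optimizer, data_iter, steps):
    """Phase 1: train the teacher to convergence on the labels alone."""
    for _ in range(steps):
        x, y = next(data_iter)
        loss = F.cross_entropy(teacher(x), y)
        optimizer.zero_grad(); loss.backward(); optimizer.step()

def train_student(student, teacher, optimizer, data_iter, steps, alpha=0.5, T=2.0):
    """Phase 2: train the student against the frozen teacher's soft targets."""
    teacher.eval()
    for _ in range(steps):
        x, y = next(data_iter)
        with torch.no_grad():
            soft_targets = F.softmax(teacher(x) / T, dim=-1)
        logits = student(x)
        loss = (1 - alpha) * F.cross_entropy(logits, y) + alpha * (T * T) * F.kl_div(
            F.log_softmax(logits / T, dim=-1), soft_targets, reduction="batchmean")
        optimizer.zero_grad(); loss.backward(); optimizer.step()
```

Codistillation collapses these two serial phases into one, which is where the wall-clock savings come from.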

Section 3.5 addresses a problem that comes up often in practice: because of initialization, training hyperparameters, and so on, two training runs can produce models with very different outputs. For example, one run of a classification model may be accurate on certain classes while the next run is not accurate on those same classes. Model averaging or distillation can effectively mitigate this problem.

Summary

Blah blah blah.
The experiments only tried two models codistilling with each other; topologies with more than two models would be worth trying.

A paper that is worth reading.
