1503.02531-distilling the knowledge in a neural network.md

Source: Internet
Author: User

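The softmax that feeds the cross-entropy loss also has a temperature, defined as follows:

$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

where $T$ is the temperature and $z_i$ are the logits. Normally $T$ is 1. If it is raised, for example to $T = 2$ (first result, compared with $T = 1$ in the second):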

    In [6]: np.exp(np.array([1,2,3,4])/2)/np.sum(np.exp(np.array([1,2,3,4])/2))
    Out[6]: array([0.10153632, 0.1674051 , 0.27600434, 0.45505423])

    In [7]: mx.nd.softmax(mx.nd.array([1,2,3,4]))
    Out[7]:
    [0.0320586  0.08714432 0.23688284 0.6439143 ]
    <NDArray 4 @cpu(0)>

That is, in the paper's words:

Using a higher value for T produces a softer probability distribution over classes.
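The softening effect can be checked numerically. The following is a minimal sketch, not code from the paper: the helper names `softmax_with_temperature` and `entropy` are chosen here for illustration, and the logits are the same toy values used in the session above. Entropy is used as one possible measure of how "soft" a distribution is.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax of a 1-D logit vector divided by T; T=1 is the ordinary softmax."""
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max()  # shift for numerical stability; the result is unchanged
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Shannon entropy in nats (assumes all probabilities are > 0)."""
    p = np.asarray(p, dtype=np.float64)
    return float(-np.sum(p * np.log(p)))

logits = [1, 2, 3, 4]
for T in (1.0, 2.0, 10.0):
    q = softmax_with_temperature(logits, T)
    # Higher T spreads the probability mass more evenly, so entropy rises.
    print(f"T={T}: q={np.round(q, 4)}, entropy={entropy(q):.4f}")
```

As $T$ grows, the distribution approaches uniform, so the entropy approaches its maximum, $\log 4$ for four classes.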

With a higher temperature, the output distribution has higher entropy: it is more disordered, and the probability mass no longer points consistently at a single class. This inconsistency is itself a kind of information, one that can describe more of the structure in the data. The large model is heavily regularized toward the labels during training, which makes the entropy of its final output lower. So the paper proposes:

Our more general solution, called "distillation", is to raise the temperature of the final softmax until the cumbersome model produces a suitably soft set of targets. We then use the same high temperature when training the small model to match these soft targets. We show later that matching the logits of the cumbersome model is actually a special case of distillation.

Does that mean the large model is forced to use a high temperature while it is being trained? That feels like it would make things worse.

No. The large model is trained normally. A high T is applied only when its logits are used to produce the soft targets. The small model is also trained with the same high T, but at validation time it uses T = 1.

In the simplest form of distillation, knowledge is transferred to the distilled model by training it on a transfer set and using a soft target distribution for each case in the transfer set that is produced by using the cumbersome model with a high temperature in its softmax. The same high temperature is used when training the distilled model, but after it has been trained it uses a temperature of 1.
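Switching the temperature back to 1 after training does not change which class is predicted, only how confident the output looks. A small sketch of this point follows, with made-up logits and a hypothetical `softmax` helper:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Row-wise softmax of logits / T."""
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Made-up student logits for two examples and three classes.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.3, 0.2]])

probs_T1 = softmax(logits, T=1.0)  # temperature used at validation time
probs_T5 = softmax(logits, T=5.0)  # a high temperature such as might be used in training

# Dividing by T > 0 preserves the ordering of the logits, so argmax is unchanged.
pred_T1 = probs_T1.argmax(axis=-1)
pred_T5 = probs_T5.argmax(axis=-1)
print(pred_T1, pred_T5)
```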

You can train on both the soft labels and the true labels of the dataset. Because the soft-label term is computed at a different T, its loss needs to be multiplied by $T^2$.
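The paper's stated reason for the $T^2$ factor is that the gradients produced by the soft targets scale as $1/T^2$. A minimal NumPy sketch of the combined objective follows. The weighting parameter `alpha`, the default `T=4.0`, and all function names are assumptions made for illustration; the random logits are placeholders, not data from the paper.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Row-wise softmax of logits / T."""
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(target_probs, pred_probs, eps=1e-12):
    """Batch mean of -sum_i target_i * log(pred_i)."""
    return float(np.mean(-np.sum(target_probs * np.log(pred_probs + eps), axis=-1)))

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Weighted sum of a soft-target term (at temperature T) and a hard-label term (at T=1).

    alpha is a hypothetical mixing weight, not a value taken from the paper.
    """
    student_logits = np.asarray(student_logits, dtype=np.float64)
    soft_targets = softmax(teacher_logits, T)            # teacher at high T
    soft_loss = cross_entropy(soft_targets, softmax(student_logits, T))
    one_hot = np.eye(student_logits.shape[-1])[labels]   # true labels
    hard_loss = cross_entropy(one_hot, softmax(student_logits, 1.0))
    # T**2 compensates for the 1/T**2 scaling of the soft-target gradients.
    return alpha * (T ** 2) * soft_loss + (1.0 - alpha) * hard_loss

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(8, 5))  # placeholder teacher outputs
student_logits = rng.normal(size=(8, 5))  # placeholder student outputs
labels = rng.integers(0, 5, size=8)
loss = distillation_loss(student_logits, teacher_logits, labels)
print(loss)
```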

The advantage of soft targets is that they carry more information per training case, so the small model can be trained with less data.

A single model distilled from several large models may perform better than a combination of multiple models.

How are multiple models distilled? The output of each model serves as a target for the final distilled model, and the losses against these targets are added together. It is a form of multi-task learning.
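The summed-loss reading described above can be sketched as follows. This is an illustration of the note's description, not necessarily the paper's exact procedure; the function name `multi_teacher_soft_loss`, the temperature, and the random logits are all assumptions.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Row-wise softmax of logits / T."""
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_teacher_soft_loss(student_logits, teacher_logits_list, T=4.0):
    """Sum of cross-entropies between each teacher's soft targets and the student."""
    log_q = np.log(softmax(student_logits, T) + 1e-12)  # student log-probabilities at T
    total = 0.0
    for teacher_logits in teacher_logits_list:
        p = softmax(teacher_logits, T)                   # this teacher's soft targets
        total += float(np.mean(-np.sum(p * log_q, axis=-1)))
    return total

rng = np.random.default_rng(1)
student_logits = rng.normal(size=(6, 4))                         # placeholder student outputs
teacher_logits_list = [rng.normal(size=(6, 4)) for _ in range(3)]  # three placeholder teachers
loss = multi_teacher_soft_loss(student_logits, teacher_logits_list)
print(loss)
```

Because cross-entropy is linear in the target distribution, summing the losses over K teachers gives the same value as K times the loss against the arithmetic mean of the teachers' soft targets.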

A confusion matrix can be used to explore which classes a model most easily mistakes for one another.
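A minimal sketch of that idea in NumPy, using made-up labels and predictions; the helper names `confusion_matrix` and `most_confused_pairs` are chosen here for illustration:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """cm[i, j] counts examples whose true class is i and predicted class is j."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    # np.add.at accumulates correctly even when an (i, j) pair repeats.
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def most_confused_pairs(cm, top_k=3):
    """Return up to top_k (true, predicted, count) triples with the most errors."""
    off = cm.copy()
    np.fill_diagonal(off, 0)  # ignore correct predictions
    flat = np.argsort(off, axis=None)[::-1][:top_k]
    rows, cols = np.unravel_index(flat, off.shape)
    return [(int(r), int(c), int(off[r, c])) for r, c in zip(rows, cols) if off[r, c] > 0]

# Made-up example: class 2 is often mistaken for class 1.
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
pairs = most_confused_pairs(cm)
print(cm)
print(pairs)
```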

Unless I am reading it wrong, the paper ends with a discussion of training multiple specialist models, but does not say how to combine these models back into one large model. This may be a problem.
