# 1503.02531-distilling the knowledge in a neural network.md

Source: Internet
Author: User

The softmax used in the ordinary cross-entropy loss also has a temperature, defined as follows:

$$
q_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}
$$

where $T$ is the temperature, normally set to 1. Raising it softens the output:

```python
In [6]: np.exp(np.array([1,2,3,4])/2)/np.sum(np.exp(np.array([1,2,3,4])/2))
Out[6]: array([0.10153632, 0.1674051 , 0.27600434, 0.45505423])

In [7]: mx.nd.softmax(mx.nd.array([1,2,3,4]))
Out[7]:
[0.0320586  0.08714432 0.23688284 0.6439143 ]
<NDArray 4 @cpu(0)>
```

That is, as the paper puts it:

> Using a higher value for T produces a softer probability distribution over classes.

At a higher temperature the output distribution has higher entropy: the probability mass is spread more evenly across classes, and that spread is itself a kind of information, information that describes more of the structure in the data (for example, which wrong classes the model considers plausible). Normal training of the large model acts as a kind of forced regularization that drives the entropy of its final output down. So:
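The entropy claim can be checked directly with a minimal numpy sketch (the logits `[1, 2, 3, 4]` are just the illustrative values used above):

```python
import numpy as np

def softmax_t(logits, T=1.0):
    """Softmax with temperature T: q_i = exp(z_i/T) / sum_j exp(z_j/T)."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                       # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Shannon entropy in nats."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

logits = [1.0, 2.0, 3.0, 4.0]
print(entropy(softmax_t(logits, T=1)))   # peaky distribution, lower entropy
print(entropy(softmax_t(logits, T=2)))   # softer distribution, higher entropy
```

Raising `T` always spreads the probability mass out, so the entropy grows monotonically with temperature.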

> Our more general solution, called "distillation", is to raise the temperature of the final softmax until the cumbersome model produces a suitably soft set of targets. We then use the same high temperature when training the small model to match these soft targets. We show later that matching the logits of the cumbersome model is actually a special case of distillation.

Does that mean you force a high temperature while training the big model? Intuitively that seems like it would make things worse.

No: the large model is trained normally. The high T is applied only when its logits are used to produce soft targets; the small model is then trained at the same high T, but at validation time T = 1 is used.
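This pipeline can be sketched in a few lines of numpy. The logit values below are made up for illustration, and the loss here is the plain cross-entropy of the student against the teacher's soft targets:

```python
import numpy as np

def softmax_t(z, T):
    """Temperature softmax, stable against overflow."""
    z = np.asarray(z, dtype=float) / T
    z -= z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logits: a trained teacher and a not-yet-trained student.
teacher_logits = np.array([2.0, 0.5, 0.1, -1.0])
student_logits = np.array([0.3, 0.2, 0.1, 0.0])

T = 4.0                                      # same high T for both at training time
soft_targets = softmax_t(teacher_logits, T)  # teacher produces soft targets
student_probs = softmax_t(student_logits, T)

# Student is trained to match the soft targets.
distill_loss = -np.sum(soft_targets * np.log(student_probs))

# At validation/inference time the student switches back to T = 1.
predictions = softmax_t(student_logits, T=1.0)
```

Only the temperature changes between training and inference; the student's weights and logits are the same.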

> In the simplest form of distillation, knowledge is transferred to the distilled model by training it on a transfer set and using a soft target distribution for each case in the transfer set that is produced by using the cumbersome model with a high temperature in its softmax. The same high temperature is used when training the distilled model, but after it has been trained it uses a temperature of 1.

You can train on both the soft labels and the dataset's hard labels, but when the soft-label term uses a temperature T ≠ 1 you need to multiply its loss by $T^2$, so that the gradient magnitudes of the two terms stay balanced.

The advantage of soft targets is that they carry more information per example, so the small model can be trained with less data.

A model distilled from multiple large models may even perform better than the ensemble of those models.

How are multiple models distilled? The outputs of the multiple models serve as the targets of the final distilled model, and the per-target losses are summed; it is effectively multi-task learning.
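Summing one loss term per teacher can be sketched as follows; the three teachers' logits are hypothetical placeholders:

```python
import numpy as np

def softmax_t(z, T):
    z = np.asarray(z, dtype=float) / T
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

T = 4.0
# Hypothetical logits from three separately trained teachers for one example.
teacher_logits = [np.array([2.0, 0.5, -1.0]),
                  np.array([1.5, 1.0, -0.5]),
                  np.array([2.5, 0.0, -1.5])]
student_logits = np.array([0.2, 0.1, 0.0])

student_probs = softmax_t(student_logits, T)

# One cross-entropy term per teacher, summed: multi-task style distillation.
total_loss = sum(
    -np.sum(softmax_t(t, T) * np.log(student_probs)) for t in teacher_logits
)
```

An equivalent and common alternative is to average the teachers' soft targets first and distill against that single averaged distribution.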

A confusion matrix can be used to explore which classes a model most easily confuses with one another.
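For reference, a confusion matrix is straightforward to build by hand; the labels below are a toy example:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] = number of examples with true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
# Large off-diagonal entries cm[i, j] show class i being mistaken for
# class j, which is how the paper groups classes for specialist models.
```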

On reflection, something seems off: the paper ends with a discussion of training multiple specialist models, but it never explains how to combine those specialists back into one large model. That may be a problem.
