# 1503.02531-distilling the knowledge in a neural network.md

Source: Internet
Author: User

The softmax used in the ordinary cross-entropy loss also has a temperature, defined as follows:

$$
q_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}
$$

where $T$ is the temperature, normally set to 1. Raising it softens the output:

```python
In [6]: np.exp(np.array([1,2,3,4])/2)/np.sum(np.exp(np.array([1,2,3,4])/2))
Out[6]: array([0.10153632, 0.1674051 , 0.27600434, 0.45505423])

In [7]: mx.nd.softmax(mx.nd.array([1,2,3,4]))
Out[7]:
[0.0320586  0.08714432 0.23688284 0.6439143 ]
<NDArray 4 @cpu(0)>
```

That is, as the paper puts it:

> Using a higher value for T produces a softer probability distribution over classes.

At a higher temperature the output distribution has higher entropy: the probability mass is spread more evenly across classes, and that spread is itself a kind of information, information that describes more of the structure in the data (for example, which wrong classes the model considers plausible). Normal training of the large model acts as a kind of forced regularization that drives the entropy of its final output down. So:
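The entropy claim can be checked directly with a minimal numpy sketch (the logits `[1, 2, 3, 4]` are just the illustrative values used above):

```python
import numpy as np

def softmax_t(logits, T=1.0):
    """Softmax with temperature T: q_i = exp(z_i/T) / sum_j exp(z_j/T)."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                       # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Shannon entropy in nats."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

logits = [1.0, 2.0, 3.0, 4.0]
print(entropy(softmax_t(logits, T=1)))   # peaky distribution, lower entropy
print(entropy(softmax_t(logits, T=2)))   # softer distribution, higher entropy
```

Raising `T` always spreads the probability mass out, so the entropy grows monotonically with temperature.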

> Our more general solution, called "distillation", is to raise the temperature of the final softmax until the cumbersome model produces a suitably soft set of targets. We then use the same high temperature when training the small model to match these soft targets. We show later that matching the logits of the cumbersome model is actually a special case of distillation.

Does that mean you force a high temperature while training the big model? Intuitively that seems like it would make things worse.

No: the large model is trained normally. The high T is applied only when its logits are used to produce soft targets; the small model is then trained at the same high T, but at validation time T = 1 is used.
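This pipeline can be sketched in a few lines of numpy. The logit values below are made up for illustration, and the loss here is the plain cross-entropy of the student against the teacher's soft targets:

```python
import numpy as np

def softmax_t(z, T):
    """Temperature softmax, stable against overflow."""
    z = np.asarray(z, dtype=float) / T
    z -= z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logits: a trained teacher and a not-yet-trained student.
teacher_logits = np.array([2.0, 0.5, 0.1, -1.0])
student_logits = np.array([0.3, 0.2, 0.1, 0.0])

T = 4.0                                      # same high T for both at training time
soft_targets = softmax_t(teacher_logits, T)  # teacher produces soft targets
student_probs = softmax_t(student_logits, T)

# Student is trained to match the soft targets.
distill_loss = -np.sum(soft_targets * np.log(student_probs))

# At validation/inference time the student switches back to T = 1.
predictions = softmax_t(student_logits, T=1.0)
```

Only the temperature changes between training and inference; the student's weights and logits are the same.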

> In the simplest form of distillation, knowledge is transferred to the distilled model by training it on a transfer set and using a soft target distribution for each case in the transfer set that is produced by using the cumbersome model with a high temperature in its softmax. The same high temperature is used when training the distilled model, but after it has been trained it uses a temperature of 1.

You can train on both the soft labels and the dataset's hard labels, but when the soft-label term uses a temperature T ≠ 1 you need to multiply its loss by $T^2$, so that the gradient magnitudes of the two terms stay balanced.

The advantage of soft targets is that they carry more information per example, so the small model can be trained with less data.

A model distilled from multiple large models may even perform better than the ensemble of those models.

How are multiple models distilled? The outputs of the multiple models serve as the targets of the final distilled model, and the per-target losses are summed; it is effectively multi-task learning.
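Summing one loss term per teacher can be sketched as follows; the three teachers' logits are hypothetical placeholders:

```python
import numpy as np

def softmax_t(z, T):
    z = np.asarray(z, dtype=float) / T
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

T = 4.0
# Hypothetical logits from three separately trained teachers for one example.
teacher_logits = [np.array([2.0, 0.5, -1.0]),
                  np.array([1.5, 1.0, -0.5]),
                  np.array([2.5, 0.0, -1.5])]
student_logits = np.array([0.2, 0.1, 0.0])

student_probs = softmax_t(student_logits, T)

# One cross-entropy term per teacher, summed: multi-task style distillation.
total_loss = sum(
    -np.sum(softmax_t(t, T) * np.log(student_probs)) for t in teacher_logits
)
```

An equivalent and common alternative is to average the teachers' soft targets first and distill against that single averaged distribution.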

A confusion matrix can be used to explore which classes a model most easily confuses with one another.
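For reference, a confusion matrix is straightforward to build by hand; the labels below are a toy example:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] = number of examples with true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
# Large off-diagonal entries cm[i, j] show class i being mistaken for
# class j, which is how the paper groups classes for specialist models.
```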

On reflection, something seems off: the paper ends with a discussion of training multiple specialist models, but it never explains how to combine those specialists back into one large model. That may be a problem.
