No Shadow Random Thoughts
Title: Looking at Softmax from the Perspective of Mathematical Analysis
Date: January 2016.
Source: http://www.zhaokv.com/2016/01/softmax.html
Disclaimer: Copyrighted. To reprint, please contact the author and credit the source.
Softmax is one of the most common output functions in machine learning. There is plenty of material online about what it is and how to use it, but little of it explains the rationale behind it. This article first gives a brief introduction to Softmax and then focuses on the mathematical analysis of the principles behind it.
Classification is one of the most important problems in supervised learning: given an input ${\bf x}$, it tries to predict the probability of the corresponding label $y$. Softmax is one of the most important tools for computing label probabilities:
${\bf p}={\rm softmax}({\bf a}) \leftrightarrow p_i=\frac{\exp(a_i)}{\sum_j\exp(a_j)}$
where $a_i$ is the model's output for the $i$-th class. Below is a simple argument that, using maximum log-likelihood (i.e. the negative log-likelihood loss) together with gradient descent, $p_i$ approximates the true probability of the $i$-th class. The negative log-likelihood loss is $L_{NLL}({\bf p},y)=-\log p_y$; differentiating it with respect to ${\bf a}$ gives:
$\frac{\partial}{\partial a_k}L_{NLL}({\bf p},y) =\frac{\partial}{\partial a_k}(-\log p_y) =\frac{\partial}{\partial a_k}\left(-a_y+\log\sum_j e^{a_j}\right)$
$=-{\bf 1}_{y=k}+\frac{e^{a_k}}{\sum_j e^{a_j}}=p_k-{\bf 1}_{y=k}$
i.e. $\frac{\partial}{\partial {\bf a}}L_{NLL}({\bf p},y) = {\bf p}-{\bf e}_y$, where ${\bf e}_y=[0,\cdots,0,1,0,\cdots,0]$ is the vector that is 1 at position $y$ and 0 everywhere else. Identical inputs ${\bf x}$ produce the same ${\bf a}$, so as more and more samples participate in gradient descent, $p_i$ approaches the true probability of the $i$-th class, i.e. ${\bf p}=\mathbb{E}[{\bf e}_{y}|{\bf x}]$: at convergence $\lim\limits_{n\to\infty}\frac{1}{n}\sum\limits_{i=1}^n({\bf p}-{\bf e}_y^{(i)})=0$, and $\lim\limits_{n\to\infty}\frac{1}{n}\sum\limits_{i=1}^n{\bf e}_y^{(i)}$ is exactly the true probability.
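To make the formulas above concrete, here is a minimal NumPy sketch (not part of the original post; the function names `softmax`, `nll_loss`, `nll_grad` and the example scores are my own) that computes ${\rm softmax}({\bf a})$, the loss $-\log p_y$, and the gradient ${\bf p}-{\bf e}_y$:

```python
import numpy as np

def softmax(a):
    # p_i = exp(a_i) / sum_j exp(a_j), exactly as defined above
    e = np.exp(a)
    return e / e.sum()

def nll_loss(p, y):
    # L_NLL(p, y) = -log p_y
    return -np.log(p[y])

def nll_grad(a, y):
    # dL_NLL/da = p - e_y, as derived above
    p = softmax(a)
    grad = p.copy()
    grad[y] -= 1.0
    return grad

a = np.array([2.0, 1.0, -1.0])   # hypothetical scores for 3 classes; true class y = 0
p = softmax(a)
print(p, p.sum())                # probabilities summing to 1
print(nll_loss(p, 0))            # -log p_0
print(nll_grad(a, 0))            # equals p - e_0
```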
In terms of convergence speed, negative log-likelihood and gradient descent are a perfect match for Softmax. For a sample with input ${\bf x}$ whose true class is $i$, the model's $j$-th output ($j\neq i$) satisfies $\frac{\partial}{\partial a_j}L_{NLL}({\bf p},y)=p_j$. If $p_j\approx 0$ (the model considers class $j$ unlikely, which matches reality), the gradient is close to 0 and the correction is tiny; if $p_j\approx 1$ (the model confidently predicts class $j$, contradicting reality), the gradient is close to 1 and the correction is large. Likewise, the model's $i$-th output satisfies $\frac{\partial}{\partial a_i}L_{NLL}({\bf p},y)=p_i-1$. If $p_i\approx 0$ (the model considers class $i$ unlikely, the opposite of reality), the gradient magnitude is close to 1 and the correction is large; if $p_i\approx 1$ (the model confidently predicts class $i$, matching reality), the gradient is close to 0 and the correction is tiny. In summary, with negative log-likelihood as the loss on Softmax, gradient descent behaves ideally: wrong predictions receive large corrections and correct predictions receive small ones.
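As a quick numeric illustration of these correction sizes (my own example scores, not from the original post), compare a confidently correct prediction with a confidently wrong one:

```python
import numpy as np

def softmax(a):
    e = np.exp(a)
    return e / e.sum()

y = 0  # true class

# Confidently correct: p_y ~ 1, so |dL/da_y| = 1 - p_y ~ 0 and dL/da_j = p_j ~ 0
p = softmax(np.array([5.0, 0.0, 0.0]))
print(1 - p[y], p[1])            # both tiny: almost no correction

# Confidently wrong: p_y ~ 0, so |dL/da_y| ~ 1; the wrongly favored class has p_j ~ 1
p = softmax(np.array([-5.0, 5.0, 0.0]))
print(1 - p[y], p[1])            # both close to 1: large correction
```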
Of course, people have tried other loss functions with Softmax, most famously least squares. The result is that the two do not match, because with least squares the gradient can be vanishingly small even when the prediction is completely wrong. Let ${\bf y}={\bf e}_i$ (note that ${\bf y}$ here is boldface), and take the derivative of the least-squares loss $L_2({\bf p}({\bf a}),{\bf y})=||{\bf p}({\bf a})-{\bf y}||^2$ with respect to $a_i$ (assuming $i$ is the correct class):
$\frac{\partial}{\partial a_i}L_2({\bf p}({\bf a}),{\bf y}) =\frac{\partial L_2({\bf p}({\bf a}),{\bf y})}{\partial {\bf p}({\bf a})}\cdot\frac{\partial {\bf p}({\bf a})}{\partial a_i}$
$=\sum_{j\neq i}2(p_j-{\bf y}_j)\,p_j(0-p_i)+2(p_i-{\bf y}_i)\,p_i(1-p_i)$
If the model's prediction for the correct class $i$ is $p_i\approx 0$ (strongly inconsistent with reality), then clearly $\frac{\partial}{\partial a_i}L_2({\bf p}({\bf a}),{\bf y})\approx 0$, since every term contains a factor of $p_i$. In other words, gradient descent barely corrects the model at all, which is why least squares is a poor fit for gradient descent on Softmax.
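The following small check (again my own example numbers, not from the post) makes the mismatch visible: for a confidently wrong prediction, the least-squares gradient on the correct class's score is essentially zero, while the negative log-likelihood gradient is close to $-1$:

```python
import numpy as np

def softmax(a):
    e = np.exp(a)
    return e / e.sum()

def l2_grad(a, y_onehot):
    # Chain rule through the softmax Jacobian J[j, k] = p_j * (1{j=k} - p_k)
    p = softmax(a)
    J = np.diag(p) - np.outer(p, p)
    return J.T @ (2 * (p - y_onehot))

a = np.array([-10.0, 10.0, 0.0])     # model is confidently wrong; the true class is 0
y = np.array([1.0, 0.0, 0.0])        # y = e_0
p = softmax(a)
print(p[0])              # ~ 0, even though class 0 is correct
print(l2_grad(a, y)[0])  # ~ 0: least squares barely corrects a_0
print((p - y)[0])        # ~ -1: the NLL gradient would push a_0 up strongly
```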
PS: Softmax also has the important property of translation invariance, i.e. ${\rm softmax}({\bf a})={\rm softmax}({\bf a}+b)$, because $\frac{\exp(a_j+b)}{\sum_k\exp(a_k+b)}=\frac{\exp(a_j)}{\sum_k\exp(a_k)}$. Thanks to translation invariance, the model only needs to learn the relative sizes of the elements of ${\bf a}$, not their absolute sizes. In addition, we can use ${\rm softmax}({\bf a})={\rm softmax}({\bf a}-\max_i a_i)$ to effectively reduce numerical error.
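A short sketch of both points in the PS (NumPy, with my own example values): translation invariance, and why subtracting the maximum avoids overflow:

```python
import numpy as np

def softmax(a):
    e = np.exp(a)
    return e / e.sum()

a = np.array([1.0, 2.0, 3.0])
print(np.allclose(softmax(a), softmax(a + 100.0)))  # True: softmax(a) == softmax(a + b)

big = np.array([1000.0, 1001.0, 1002.0])
print(softmax(big))                    # exp(1000) overflows to inf, result is nan
print(softmax(big - np.max(big)))      # same probabilities, computed without overflow
```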
In summary: first, Softmax really does represent a probability, and with more and more samples, maximum log-likelihood and gradient descent make it approach the true probability arbitrarily closely; second, the combination of Softmax and log-likelihood gives well-behaved correction sizes under gradient descent; finally, because of translation invariance, we only need to care about the relative sizes of the model's outputs across classes, not their absolute sizes.