A change in a weight \(w^{(l)}_{ij}\) of the neural network affects the next layer, propagates forward to the output layer, and ultimately changes the cost function.
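This chain of effects can be seen numerically: perturbing a single weight by a small amount changes the cost, and the ratio of the two changes approximates the partial derivative that the derivation below computes in closed form. Here is a minimal sketch, assuming a 2-3-1 sigmoid network with a squared-error cost (all names in it are illustrative, not from the original text):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cost(W1, b1, W2, b2, x, y):
    a2 = sigmoid(b1 + W1 @ x)        # hidden-layer activations
    a3 = sigmoid(b2 + W2 @ a2)       # output-layer activations
    return 0.5 * np.sum((y - a3) ** 2)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)
x, y = np.array([0.5, -0.2]), np.array([1.0])

eps = 1e-4
J0 = cost(W1, b1, W2, b2, x, y)
W1[0, 0] += eps                      # nudge one weight in the first layer
J1 = cost(W1, b1, W2, b2, x, y)
print((J1 - J0) / eps)               # finite-difference estimate of dJ/dw^{(1)}_{11}
```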
\(\color{red}{\text{Symbols used in the derivation}}\)
| Symbol | Description |
| --- | --- |
| \(n_l\) | Number of layers in the network |
| \(y_j\) | Class label corresponding to output unit \(j\) |
| \(s_l\) | Number of neurons in layer \(l\) (excluding the bias unit) |
| \(g(x)\) | Activation function |
| \(w^{(l)}_{ij}\) | Weight connecting unit \(j\) of layer \(l\) to unit \(i\) of layer \(l+1\) |
| \(b^{(l)}_i\) | Bias of unit \(i\) in layer \(l+1\) |
| \(z^{(l)}_i\) | Weighted input (including the bias) to unit \(i\) of layer \(l\) |
| \(a^{(l)}_i\) | Activation (output value) of unit \(i\) of layer \(l\) |
| \(\delta^{(l)}_i\) | Local gradient (residual) of unit \(i\) of layer \(l\) |
\(\color{red}{\text{Basic formulas}}\)
\[
\begin{align*}
z^{(l)}_i &= b^{(l-1)}_i + \sum_{j=1}^{s_{l-1}} w^{(l-1)}_{ij}\, a^{(l-1)}_j \tag{1}\\
g(x) &= \frac{1}{1 + e^{-x}} \tag{2}\\
a^{(l)}_i &= g\big(z^{(l)}_i\big) \tag{3}\\
J(\theta) &= \frac{1}{2}\sum_{j=1}^{s_{n_l}} \big(y_j - a^{(n_l)}_j\big)^2 \tag{4}\\
\delta^{(l)}_i &= \frac{\partial J(\theta)}{\partial z^{(l)}_i} \tag{5}
\end{align*}
\]

For the output layer \(n_l\), the residual follows directly from (3), (4) and (5):

\[
\begin{align*}
\delta^{(n_l)}_i &= \frac{\partial J(\theta)}{\partial z^{(n_l)}_i}\\
&= \frac{1}{2}\,\frac{\partial}{\partial z^{(n_l)}_i}\sum_{j=1}^{s_{n_l}}\big(y_j - a^{(n_l)}_j\big)^2\\
&= \frac{1}{2}\,\frac{\partial}{\partial z^{(n_l)}_i}\sum_{j=1}^{s_{n_l}}\big(y_j - g(z^{(n_l)}_j)\big)^2\\
&= \frac{1}{2}\,\frac{\partial}{\partial z^{(n_l)}_i}\big(y_i - g(z^{(n_l)}_i)\big)^2\\
&= -\big(y_i - a^{(n_l)}_i\big)\, g'\big(z^{(n_l)}_i\big)
\end{align*}
\]

For a hidden layer \(l\), the residual is obtained from the residuals of layer \(l+1\) via the chain rule:

\[
\begin{align*}
\delta^{(l)}_i &= \frac{\partial J(\theta)}{\partial z^{(l)}_i}\\
&= \sum_{j=1}^{s_{l+1}} \frac{\partial J(\theta)}{\partial z^{(l+1)}_j}\,\frac{\partial z^{(l+1)}_j}{\partial z^{(l)}_i}\\
&= \sum_{j=1}^{s_{l+1}} \delta^{(l+1)}_j\,\frac{\partial z^{(l+1)}_j}{\partial z^{(l)}_i}\\
&= \sum_{j=1}^{s_{l+1}} \delta^{(l+1)}_j\,\frac{\partial}{\partial z^{(l)}_i}\left(b^{(l)}_j + \sum_{k=1}^{s_l} w^{(l)}_{jk}\, a^{(l)}_k\right)\\
&= \sum_{j=1}^{s_{l+1}} \delta^{(l+1)}_j\,\frac{\partial}{\partial z^{(l)}_i}\left(b^{(l)}_j + \sum_{k=1}^{s_l} w^{(l)}_{jk}\, g\big(z^{(l)}_k\big)\right)\\
&= \sum_{j=1}^{s_{l+1}} \delta^{(l+1)}_j\,\frac{\partial}{\partial z^{(l)}_i}\left(w^{(l)}_{ji}\, g\big(z^{(l)}_i\big)\right)\\
&= \sum_{j=1}^{s_{l+1}} \delta^{(l+1)}_j\, w^{(l)}_{ji}\, g'\big(z^{(l)}_i\big)\\
&= g'\big(z^{(l)}_i\big)\sum_{j=1}^{s_{l+1}} \delta^{(l+1)}_j\, w^{(l)}_{ji}
\end{align*}
\]

Finally, the gradients with respect to the weights and biases:

\[
\begin{align*}
\frac{\partial J(\theta)}{\partial w^{(l)}_{ij}} &= \frac{\partial J(\theta)}{\partial z^{(l+1)}_i}\,\frac{\partial z^{(l+1)}_i}{\partial w^{(l)}_{ij}}\\
&= \delta^{(l+1)}_i\,\frac{\partial}{\partial w^{(l)}_{ij}}\left(b^{(l)}_i + \sum_{k=1}^{s_l} w^{(l)}_{ik}\, a^{(l)}_k\right)\\
&= \delta^{(l+1)}_i\, a^{(l)}_j\\
\frac{\partial J(\theta)}{\partial b^{(l)}_i} &= \delta^{(l+1)}_i
\end{align*}
\]
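The recursion above maps directly onto code. Below is a minimal NumPy sketch (not from the original text; the names `forward_backward`, `weights`, `biases` are illustrative) for a fully connected network with sigmoid activations and squared-error cost, computing the residuals \(\delta^{(l)}\) and the gradients for every weight and bias in one backward pass:

```python
import numpy as np

def g(x):
    # Formula (2): sigmoid activation
    return 1.0 / (1.0 + np.exp(-x))

def g_prime(x):
    # Derivative of the sigmoid: g'(x) = g(x) * (1 - g(x))
    s = g(x)
    return s * (1.0 - s)

def forward_backward(weights, biases, x, y):
    """weights[l] has shape (s_{l+1}, s_l); biases[l] has shape (s_{l+1},)."""
    # Forward pass: formulas (1) and (3)
    activations = [x]            # a^{(1)} = x
    zs = []                      # z^{(l)} for l = 2 .. n_l
    a = x
    for W, b in zip(weights, biases):
        z = b + W @ a
        zs.append(z)
        a = g(z)
        activations.append(a)

    # Output-layer residual: delta^{(n_l)} = -(y - a^{(n_l)}) * g'(z^{(n_l)})
    delta = -(y - activations[-1]) * g_prime(zs[-1])
    grads_W = [None] * len(weights)
    grads_b = [None] * len(biases)

    # Backward pass
    for l in range(len(weights) - 1, -1, -1):
        # dJ/dw^{(l)}_{ij} = delta^{(l+1)}_i * a^{(l)}_j ; dJ/db^{(l)}_i = delta^{(l+1)}_i
        grads_W[l] = np.outer(delta, activations[l])
        grads_b[l] = delta
        if l > 0:
            # delta^{(l)} = g'(z^{(l)}) * (W^{(l)})^T delta^{(l+1)}
            delta = g_prime(zs[l - 1]) * (weights[l].T @ delta)

    cost = 0.5 * np.sum((y - activations[-1]) ** 2)   # Formula (4)
    return cost, grads_W, grads_b
```

The gradients returned by such a sketch can be verified against the finite-difference estimate shown at the top of this section.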
Derivation of the backpropagation algorithm