This article draws heavily on David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams, "Learning representations by back-propagating errors", Nature, 323: 533-536, 1986.
In modern neural networks, the most widely used training algorithm is backpropagation (BP). Although BP converges slowly, tends to get stuck in local minima, and has other shortcomings, its ease of use and accuracy are hard to match with other algorithms.
In this article, $w_{ji}$ denotes the weight of the connection between $unit_{i}$ in the previous layer and $unit_{j}$ in the next layer.
In an MLP, the input $x_{j}$ of a neuron $unit_{j}$ in the next layer is computed as follows (ignoring the bias):
$x_{j} = \sum_{i} y_{i} w_{ji}$
That is, the input of $unit_{j}$ is the sum of the outputs $y_{i}$ of all neurons in the previous layer, each weighted by the corresponding connection weight.
The output of $unit_{j}$ is calculated as follows:
$y_{j} = \frac{1}{1+e^{-x_{j}}}$
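The forward computation of a single unit can be sketched in a few lines of NumPy. This is only an illustration; the function names (`sigmoid`, `forward_unit`) and the array shapes are my own assumptions, not anything from the paper.

```python
import numpy as np

def sigmoid(x):
    """Logistic activation: y = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def forward_unit(y_prev, w_j):
    """Forward pass for a single unit_j (bias ignored, as in the text).

    y_prev : outputs y_i of all units in the previous layer, shape (n_i,)
    w_j    : weights w_ji connecting each unit_i to unit_j, shape (n_i,)
    """
    x_j = np.dot(w_j, y_prev)   # x_j = sum_i y_i * w_ji
    y_j = sigmoid(x_j)          # y_j = 1 / (1 + exp(-x_j))
    return x_j, y_j
```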
For supervised training, we have the desired output $d$ and the actual output $y$, and the error $E$ can be defined as:
$E = \frac{1}{2}\sum_{c} \sum_{j} (y_{j,c}-d_{j,c})^2$
where $c$ indexes the training cases and $j$ indexes the output units.
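In code, with the actual and desired outputs for all training cases stacked into two arrays (a layout I am assuming here purely for illustration), the error is:

```python
def total_error(Y, D):
    """E = 1/2 * sum_c sum_j (y_jc - d_jc)^2.

    Y, D : arrays of shape (n_cases, n_outputs) holding the actual and
           desired outputs for every training case c.
    """
    return 0.5 * np.sum((Y - D) ** 2)
```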
To obtain $\partial E/\partial w_{ji}$, we first compute $\partial E/\partial y_{j}$ (the reason will become clear shortly):
$\partial E/\partial y_{j} = y_{j}-d_{j}$
By the chain rule:
$\partial E/\partial x_{j} = \partial E/\partial y_{j} \cdot dy_{j}/dx_{j}$, and from the input-output relation above,
$dy_{j}/dx_{j} = \left(\frac{1}{1+e^{-x_{j}}}\right)' = \frac{e^{-x_{j}}}{(1+e^{-x_{j}})^{2}} = y_{j}(1-y_{j})$, so:
$\partial E/\partial x_{j} = \partial E/\partial y_{j} \cdot y_{j}(1-y_{j})$
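As a sketch (element-wise on NumPy arrays, so it works for a whole layer at once), backing the gradient through the sigmoid is a one-liner:

```python
def grad_through_sigmoid(dE_dy, y):
    """dE/dx_j = dE/dy_j * y_j * (1 - y_j), applied element-wise."""
    return dE_dy * y * (1.0 - y)
```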
At this point we have the partial derivative of the error $E$ with respect to the input $x_{j}$ of $layer_{j}$, but training adjusts the weights (and biases), so we need an expression for the partial derivative of $E$ with respect to $w_{ji}$.
Also by the chain rule:
$\partial E/\partial w_{ji} = \partial E/\partial x_{j} \cdot \partial x_{j}/\partial w_{ji}$, and from the relation between this layer's input and its weights,
$x_{j} = \sum_{i} y_{i} w_{ji}$, we obtain $\partial x_{j}/\partial w_{ji} = y_{i}$, i.e.:
$\partial E/\partial w_{ji} = \partial E/\partial x_{j} \cdot y_{i}$. Putting the pieces together:
$\partial E/\partial w_{ji} = (y_{j}-d_{j}) \cdot y_{j}(1-y_{j}) \cdot y_{i}$
Here $y_{i}$ is the output of $unit_{i}$ (after its nonlinear transformation) and $y_{j}$ is the output of $layer_{j}$.
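A hypothetical helper for this step, assuming `dE_dx_j` is the scalar gradient at the input of $unit_{j}$ and `y_prev` holds the previous layer's outputs:

```python
def grad_weights(dE_dx_j, y_prev):
    """dE/dw_ji = dE/dx_j * y_i for every weight feeding unit_j.

    dE_dx_j : scalar, error gradient at the input of unit_j
    y_prev  : outputs y_i of the previous layer, shape (n_i,)
    Returns an array of shape (n_i,), one gradient per weight w_ji.
    """
    return dE_dx_j * y_prev
```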
Again by the chain rule, for a neuron $unit_{i}$ in the previous layer we can obtain the gradient of the error with respect to its output:
$\partial E/\partial y_{i} = \sum_{j} \partial E/\partial x_{j} \cdot \partial x_{j}/\partial y_{i} = \sum_{j} \partial E/\partial x_{j} \cdot w_{ji}$, where the sum runs over all units $j$ in the next layer that $unit_{i}$ connects to. This quantity plays the same role for $layer_{i}$ that $\partial E/\partial y_{j}$ played for $layer_{j}$, so the whole procedure can be repeated layer by layer backwards through the network.
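If the weights of a layer are stored as a matrix `W` with `W[j, i] = w_ji` (my own convention for this sketch), this backward step is a single matrix-vector product:

```python
def grad_prev_outputs(dE_dx, W):
    """dE/dy_i = sum_j dE/dx_j * w_ji, for all units i at once.

    dE_dx : gradients at the inputs of the next layer, shape (n_j,)
    W     : weight matrix with W[j, i] = w_ji, shape (n_j, n_i)
    Returns an array of shape (n_i,).
    """
    return W.T @ dE_dx
```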
So far, using the formulas above, given the desired outputs $d_{j}$ and the outputs $y_{i}$ of each layer, we can compute the gradient of the error with respect to the weights of every layer, working backwards from the output, and then adjust the weights. The update rule is:
$\Delta w = -\epsilon \, \partial E/\partial w$
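Putting the pieces together, here is a minimal sketch (my own illustration, not the paper's code) of one gradient-descent step for a two-layer sigmoid MLP without biases, following exactly the formulas derived above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(W1, W2, y0, d, eps=0.5):
    """One weight update for a 2-layer sigmoid MLP (no biases).

    W1  : hidden-layer weights,  W1[j, i] = w_ji, shape (n_hidden, n_in)
    W2  : output-layer weights,  W2[k, j] = w_kj, shape (n_out, n_hidden)
    y0  : input vector (the "outputs" of layer 0), shape (n_in,)
    d   : desired output, shape (n_out,)
    eps : learning rate epsilon in  delta_w = -eps * dE/dw
    """
    # Forward pass
    x1 = W1 @ y0
    y1 = sigmoid(x1)                      # hidden-layer outputs
    x2 = W2 @ y1
    y2 = sigmoid(x2)                      # output-layer outputs

    # Backward pass
    dE_dx2 = (y2 - d) * y2 * (1.0 - y2)   # dE/dx at the output layer
    dE_dW2 = np.outer(dE_dx2, y1)         # dE/dw_kj = dE/dx_k * y_j
    dE_dy1 = W2.T @ dE_dx2                # dE/dy_j = sum_k dE/dx_k * w_kj
    dE_dx1 = dE_dy1 * y1 * (1.0 - y1)     # back through the hidden sigmoid
    dE_dW1 = np.outer(dE_dx1, y0)         # dE/dw_ji = dE/dx_j * y_i

    # Weight update: delta_w = -eps * dE/dw
    W1 -= eps * dE_dW1
    W2 -= eps * dE_dW2
    return 0.5 * np.sum((y2 - d) ** 2)    # error E for this training case
```

Calling `train_step` repeatedly with small random initial weights on a toy input-output pair should steadily reduce the returned error.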