DL4NLP -- Neural Networks (1): Organizing the Derivation Steps of the Backpropagation (BP) Algorithm for Feedforward Neural Networks


Here I organize the derivation steps of the BP (backpropagation) algorithm from [1], as a memo. Reference [1] derives everything directly in matrix-differential notation, which keeps the whole process very concise. A big advantage of the matrix form is that it is very convenient to implement in code.

That said, the scalar derivation has its own advantages; for example, it makes clear exactly which quantities affect a given weight.

Notation conventions:

$L$: the number of layers of the neural network (the input layer is not counted).

$n^l$: the number of neurons in layer $l$ (bias neurons are not counted).

$W^{(l)}\in\mathbb R^{n^l\times n^{l-1}}$: the weight matrix from layer $l-1$ to layer $l$, where $W_{ij}^{(l)}$ is the weight from the $j$-th neuron of layer $l-1$ to the $i$-th neuron of layer $l$.

$\textbf b^{(l)}\in\mathbb R^{n^l}$: the bias of layer $l$.

$\textbf z^{(l)}\in\mathbb R^{n^l}$: the input (pre-activation) of each neuron in layer $l$.

$f_l(\cdot)$: the activation function of layer $l$. For classification tasks, the last layer is a softmax.

$\textbf a^{(l)}\in\mathbb R^{n^l}$: the output (activation) of each neuron in layer $l$. For the input layer (layer 0), $\textbf a^{(0)}=\textbf x$.

The corresponding illustration is shown here (image source: [1]).

Now the derivation can begin. First comes forward propagation, which computes the network output:

$$\textbf z^{(l)}=W^{(l)}\textbf a^{(l-1)}+\textbf b^{(l)}$$

$$\textbf a^{(l)}=f_l(\textbf z^{(l)})$$

$$\textbf a^{(L)}=f(\textbf x;W,\textbf b)$$
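As a side note, a minimal NumPy sketch of this forward pass (not code from [1]; the lists `Ws`, `bs`, `fs` of per-layer weights, biases, and activation functions are hypothetical names):

```python
import numpy as np

def forward(x, Ws, bs, fs):
    """Forward pass: z^(l) = W^(l) a^(l-1) + b^(l), a^(l) = f_l(z^(l)).

    Ws, bs, fs are hypothetical lists holding the weight matrix, bias vector,
    and element-wise activation function of layers 1..L; x is the input a^(0).
    Returns the pre-activations z^(l) and activations a^(l) (activations[0] = x),
    which will be needed later for backpropagation.
    """
    a, zs, activations = x, [], [x]
    for W, b, f in zip(Ws, bs, fs):
        z = W @ a + b          # z^(l) = W^(l) a^(l-1) + b^(l)
        a = f(z)               # a^(l) = f_l(z^(l))
        zs.append(z)
        activations.append(a)
    return zs, activations
```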

The whole network can be viewed as a single function $f$; it was proved in 1989 that a network with a single hidden layer can approximate any continuous function (the universal approximation theorem).

Since it completes a mapping from input to output, the network can be regarded not only as a classifier but also as a feature extractor: feeding the network's output into a classifier is equivalent to mapping the original feature $\textbf x$ into a new feature $\textbf a^{(L)}$.

If the network is used as a classifier, then the number of neurons in the last layer $n^L$ should equal the number of classes $C$, and $f_L(\cdot)=\text{softmax}(\cdot)$ normalizes the output into a probability distribution:

$$\hat{\textbf y}=\textbf a^{(L)}=\text{softmax}(\textbf z^{(L)})$$

Given a sample $(\textbf x,\textbf y)$, where $\textbf y\in\mathbb R^C$ is a one-hot vector, the cross-entropy loss of the network on this sample is

$$\mathcal L(\textbf y,f(\textbf x;W,\textbf b))=-\textbf y^{\top}\ln\hat{\textbf y}=-\textbf y^{\top}\ln\text{softmax}(\textbf z)=-\textbf y^{\top}\ln\frac{\exp(\textbf z)}{\textbf 1^{\top}\exp(\textbf z)}$$
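A small NumPy sketch of this loss for a single sample (the max-subtraction inside the softmax is a standard numerical-stability trick, not part of the formula above; the numbers are made up):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; equivalent to exp(z) / (1^T exp(z)).
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y, z):
    """Loss -y^T ln softmax(z) for a one-hot label y and pre-activation z."""
    y_hat = softmax(z)
    return -float(y @ np.log(y_hat))

# Hypothetical example with C = 3 classes:
y = np.array([0.0, 1.0, 0.0])   # one-hot label
z = np.array([1.0, 2.0, 0.5])   # z^(L)
print(cross_entropy(y, z))
```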

With $N$ training samples, the empirical risk is

$$R=\frac1N\sum_{i=1}^N\mathcal L(\textbf y_i,f(\textbf x_i;W,\textbf b))$$

If the Frobenius norm of the weight matrices is used as the regularization term,

$$\|W\|_F=\biggl(\sum_{l=1}^{L}\sum_{i=1}^{n^l}\sum_{j=1}^{n^{l-1}}\bigl(W_{ij}^{(l)}\bigr)^2\biggr)^{\frac12}$$

Then the structural risk is

$$R=\frac1N\sum_{i=1}^N\mathcal L(\textbf y_i,f(\textbf x_i;W,\textbf b))+\frac12\lambda\|W\|_F^2$$

In general, the biases are not regularized.

We now consider minimizing the structural risk, using gradient descent to update the weight matrices and biases. It suffices to compute the gradient of the loss on a single sample, $\frac{\partial\mathcal L(\textbf y,f(\textbf x;W,\textbf b))}{\partial W^{(l)}}$, because the gradient over multiple samples is simply the sum of the per-sample gradients:

$$W^{(l)}\leftarrow W^{(l)}-\alpha\frac{\partial R}{\partial W^{(l)}}$$

$$\frac{\partial R}{\partial W^{(l)}}=\frac1N\sum_{i=1}^N\frac{\partial\mathcal L(\textbf y_i,f(\textbf x_i;W,\textbf b))}{\partial W^{(l)}}+\lambda W^{(l)}$$

The bias update is analogous and is not written out here.
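A rough sketch of this update step, assuming hypothetical lists holding the per-sample gradients already summed over the $N$ samples (the bias update is included for completeness, without regularization):

```python
import numpy as np

def sgd_step(Ws, bs, grad_Ws, grad_bs, alpha, lam, N):
    """One gradient-descent step on the structural risk.

    grad_Ws[l], grad_bs[l] are assumed to hold the per-sample loss gradients
    summed over the N samples; lam is the regularization strength lambda.
    """
    for l in range(len(Ws)):
        # dR/dW^(l) = (1/N) * sum_i dL_i/dW^(l) + lambda * W^(l)
        Ws[l] -= alpha * (grad_Ws[l] / N + lam * Ws[l])
        # Biases are not regularized.
        bs[l] -= alpha * (grad_bs[l] / N)
```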

Four Important Formulas

Well, after all that preamble, we finally reach the key point: solving for $\dfrac{\partial\mathcal L(\textbf y,f(\textbf x;W,\textbf b))}{\partial W^{(l)}}$ and $\dfrac{\partial\mathcal L(\textbf y,f(\textbf x;W,\textbf b))}{\partial\textbf b^{(l)}}$. For brevity, these are written simply as $\dfrac{\partial\mathcal L}{\partial W^{(l)}}$ and $\dfrac{\partial\mathcal L}{\partial\textbf b^{(l)}}$.

All symbols follow [1], as does the subsequent derivation.

First, some formulas to review:

$$\frac{\partial A^{\top}\textbf x}{\partial\textbf x}=\frac{\partial\textbf x^{\top}A}{\partial\textbf x}=A$$

$$\frac{\partial\textbf y^{\top}\textbf z}{\partial\textbf x}=\frac{\partial\textbf y}{\partial\textbf x}\textbf z+\frac{\partial\textbf z}{\partial\textbf x}\textbf y$$

$$\frac{\partial\textbf y^{\top}A\textbf z}{\partial\textbf x}=\frac{\partial\textbf y}{\partial\textbf x}A\textbf z+\frac{\partial\textbf z}{\partial\textbf x}A^{\top}\textbf y$$

$$\frac{\partial y\textbf z}{\partial\textbf x}=\frac{\partial y}{\partial\textbf x}\textbf z^{\top}+y\frac{\partial\textbf z}{\partial\textbf x}$$

$$\frac{\partial\,\text{tr}(AB)}{\partial A}=B^{\top}\qquad\frac{\partial\,\text{tr}(AB)}{\partial A^{\top}}=B$$

$$\frac{\partial f(A)}{\partial A^{\top}}=\Bigl(\frac{\partial f(A)}{\partial A}\Bigr)^{\top}$$

Then there is the chain rule:

$$\frac{\partial\textbf z}{\partial\textbf x}=\frac{\partial\textbf y}{\partial\textbf x}\frac{\partial\textbf z}{\partial\textbf y}$$

$$\frac{\partial z}{\partial x_{ij}}=\Bigl(\frac{\partial z}{\partial\textbf y}\Bigr)^{\top}\frac{\partial\textbf y}{\partial x_{ij}}$$

$$\frac{\partial z}{\partial x_{ij}}=\text{tr}\biggl(\Bigl(\frac{\partial z}{\partial Y}\Bigr)^{\top}\frac{\partial Y}{\partial x_{ij}}\biggr)$$

and, for an element-wise function $f$ (with derivative $f'$), its gradient:

$$\frac{\partial f(\textbf x)}{\partial\textbf x}=\text{diag}(f'(\textbf x))$$

It should be noted that the denominator layout is used throughout: the derivative of a $p$-dimensional column vector with respect to a $q$-dimensional column vector is a $q\times p$ matrix.

Refer to [3] for more on matrix derivatives.

After all this preparation, we can now derive $\dfrac{\partial\mathcal L}{\partial W^{(l)}}$ and $\dfrac{\partial\mathcal L}{\partial\textbf b^{(l)}}$.

By the chain rule,

$$\frac{\partial\mathcal L}{\partial W_{ij}^{(l)}}=\Bigl(\frac{\partial\mathcal L}{\partial\textbf z^{(l)}}\Bigr)^{\top}\frac{\partial\textbf z^{(l)}}{\partial W_{ij}^{(l)}}$$

Now define the error term $\delta^{(l)}=\dfrac{\partial\mathcal L}{\partial\textbf z^{(l)}}\in\mathbb R^{n^l}$, which characterizes how sensitive the neurons in layer $l$ are to the error. Next compute $\dfrac{\partial\textbf z^{(l)}}{\partial W_{ij}^{(l)}}$:

$$\frac{\partial\textbf z^{(l)}}{\partial W_{ij}^{(l)}}=\frac{\partial W^{(l)}\textbf a^{(l-1)}}{\partial W_{ij}^{(l)}}=(0,\ldots,a_j^{(l-1)},\ldots,0)^{\top}$$

Only the $i$-th entry of this vector is nonzero. It follows that

$$\frac{\partial\mathcal L}{\partial W_{ij}^{(l)}}=\delta_i^{(l)}a_j^{(l-1)}$$

From this equation we can see that if the output $a_j^{(l-1)}$ of a neuron is small, then the weight $W_{ij}^{(l)}$ connecting it to the next layer will be updated slowly.

In matrix form, this becomes:

$$\frac{\partial\mathcal L}{\partial W^{(l)}}=\delta^{(l)}\bigl(\textbf a^{(l-1)}\bigr)^{\top}$$

$$\frac{\partial\mathcal L}{\partial\textbf b^{(l)}}=\delta^{(l)}$$
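A tiny toy check (with made-up numbers) that the outer-product form reproduces the scalar formula $\dfrac{\partial\mathcal L}{\partial W_{ij}^{(l)}}=\delta_i^{(l)}a_j^{(l-1)}$ entry by entry:

```python
import numpy as np

# Hypothetical toy sizes: n^(l-1) = 3, n^l = 2.
delta = np.array([0.5, -1.0])        # delta^(l)
a_prev = np.array([0.2, 0.0, 0.7])   # a^(l-1)

grad_W = np.outer(delta, a_prev)     # dL/dW^(l) = delta^(l) (a^(l-1))^T
grad_b = delta                       # dL/db^(l) = delta^(l)

# Entry (i, j) equals delta_i^(l) * a_j^(l-1), matching the scalar formula.
assert np.isclose(grad_W[0, 2], delta[0] * a_prev[2])
print(grad_W)
```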

The next question is how to compute the error term $\delta^{(l)}$. Two cases need to be considered: the output layer and the hidden layers.

For the output layer,

$$\begin{aligned}\delta^{(L)}&=\frac{\partial\mathcal L}{\partial\textbf z^{(L)}}\\&=\frac{\partial\textbf a^{(L)}}{\partial\textbf z^{(L)}}\frac{\partial\mathcal L}{\partial\textbf a^{(L)}}\\&=\text{diag}\bigl(f_L'(\textbf z^{(L)})\bigr)\frac{\partial\mathcal L}{\partial\textbf a^{(L)}}\\&=f_L'(\textbf z^{(L)})\odot\frac{\partial\mathcal L}{\partial\textbf a^{(L)}}\end{aligned}$$

For the hidden layer,

$$\begin{aligned}\delta^{(l)}&=\frac{\partial\mathcal L}{\partial\textbf z^{(l)}}\\&=\frac{\partial\textbf a^{(l)}}{\partial\textbf z^{(l)}}\frac{\partial\textbf z^{(l+1)}}{\partial\textbf a^{(l)}}\frac{\partial\mathcal L}{\partial\textbf z^{(l+1)}}\\&=\text{diag}\bigl(f_l'(\textbf z^{(l)})\bigr)\bigl(W^{(l+1)}\bigr)^{\top}\delta^{(l+1)}\\&=f_l'(\textbf z^{(l)})\odot\Bigl(\bigl(W^{(l+1)}\bigr)^{\top}\delta^{(l+1)}\Bigr)\end{aligned}$$

As this equation shows, the error term can be propagated backward toward the input layer, which is why the algorithm is called error backpropagation.

In other words, the activation function is required to have a derivative. Two commonly used identities are:

$$\sigma'(\cdot)=\sigma(\cdot)\odot(1-\sigma(\cdot))$$

$$\text{softmax}'(\cdot)=\text{softmax}(\cdot)\odot(1-\text{softmax}(\cdot))$$

where $\sigma(\cdot)$ denotes the logistic function.
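A quick finite-difference check of the logistic identity (purely illustrative, not from [1]):

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

# Check sigma'(x) = sigma(x) * (1 - sigma(x)) numerically at a few points.
x = np.array([-2.0, 0.0, 1.5])
eps = 1e-6
numeric = (sigma(x + eps) - sigma(x - eps)) / (2 * eps)   # central difference
analytic = sigma(x) * (1 - sigma(x))
print(np.allclose(numeric, analytic, atol=1e-8))          # True
```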

So the BP algorithm boils down to four key equations:

$$\delta^{(L)}=f_L'(\textbf z^{(L)})\odot\frac{\partial\mathcal L}{\partial\textbf a^{(L)}}$$

$$\delta^{(l)}=f_l'(\textbf z^{(l)})\odot\Bigl(\bigl(W^{(l+1)}\bigr)^{\top}\delta^{(l+1)}\Bigr)$$

$$\frac{\partial\mathcal L}{\partial W^{(l)}}=\delta^{(l)}\bigl(\textbf a^{(l-1)}\bigr)^{\top}$$

$$\frac{\partial\mathcal L}{\partial\textbf b^{(l)}}=\delta^{(l)}$$
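Putting the four equations together, here is a minimal NumPy sketch of the backward pass (all names are hypothetical; it assumes a softmax output with cross-entropy loss, so the output error term is $\hat{\textbf y}-\textbf y$ as derived in the next section, and it pairs with the forward-pass sketch given earlier):

```python
import numpy as np

def backward(y, zs, activations, Ws, fprimes):
    """Backward pass using the four key equations above.

    zs and activations come from the forward pass (activations[0] is the input),
    fprimes holds the element-wise derivatives f_l' of the activation functions,
    and y is the one-hot label. All names are hypothetical.
    """
    L = len(Ws)
    grad_Ws, grad_bs = [None] * L, [None] * L

    # Output layer: for softmax + cross-entropy, delta^(L) = y_hat - y
    # (derived in the practical example below).
    delta = activations[-1] - y

    for l in range(L - 1, -1, -1):
        # dL/dW = delta (a of the previous layer)^T ;  dL/db = delta
        grad_Ws[l] = np.outer(delta, activations[l])
        grad_bs[l] = delta
        if l > 0:
            # delta^(l) = f_l'(z^(l)) ⊙ ((W^(l+1))^T delta^(l+1))
            delta = fprimes[l - 1](zs[l - 1]) * (Ws[l].T @ delta)
    return grad_Ws, grad_bs
```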

I have also worked through the derivation in scalar form; the process is as follows...

Whether in scalar form or matrix form, the key is one thing: the chain rule.

A Practical Example

The four formulas above only give the gradients of the backpropagation algorithm in a formal way. Below we work out, for a classification problem, i.e. $f_L(\cdot)=\text{softmax}(\cdot)$ with the cross-entropy loss $\mathcal L(\textbf y,f(\textbf x;W,\textbf b))=-\textbf y^{\top}\ln\hat{\textbf y}$, the concrete form of $\delta^{(L)}=\dfrac{\partial\mathcal L}{\partial\textbf z^{(L)}}$.

In fact the answer is already clear: this is exactly the softmax regression setting, and $\delta^{(L)}$ is an intermediate result in deriving the gradient of softmax regression. Below is a complete derivation based on the method given in [2]. For brevity, $\textbf z^{(L)}$ is written as $\textbf z$:

Article [2] deals with derivatives of the following form: given a matrix $X$ and a scalar-valued function $f(X)$, find $\dfrac{\partial f}{\partial X}$. A typical example is the derivative of the loss with respect to a weight matrix.

In single-variable calculus, $\text{d}f=f'(x)\,\text{d}x$; in multivariable calculus, $\text{d}f=\sum_i\dfrac{\partial f}{\partial x_i}\text{d}x_i=\Bigl(\dfrac{\partial f}{\partial\textbf x}\Bigr)^{\top}\text{d}\textbf x$. This establishes the relation between the matrix derivative and the differential:

$$\text{d}f=\sum_{i,j}\frac{\partial f}{\partial X_{ij}}\,\text{d}X_{ij}=\text{tr}\biggl(\Bigl(\frac{\partial f}{\partial X}\Bigr)^{\top}\text{d}X\biggr)$$

The second equality holds because $\text{tr}(A^{\top}B)=\sum_{i,j}A_{ij}B_{ij}$ for two matrices of the same size. The solution procedure is therefore: first find an expression for the differential $\text{d}f$, then take the trace (since the trace of a scalar equals the scalar itself), compare $\text{tr}(\text{d}f)$ with $\text{tr}\bigl((\frac{\partial f}{\partial X})^{\top}\text{d}X\bigr)$, and finally "dig out" $\dfrac{\partial f}{\partial X}$.

In this way, the problem of computing a gradient is transformed into one of computing a differential. There are a number of rules and tricks for computing differentials; they will be introduced as they appear in the derivation.
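Before moving on, a quick numerical check of the identity $\text{tr}(A^{\top}B)=\sum_{i,j}A_{ij}B_{ij}$ used above (random matrices, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((3, 4)), rng.standard_normal((3, 4))

# tr(A^T B) equals the sum of element-wise products for same-shaped matrices.
lhs = np.trace(A.T @ B)
rhs = np.sum(A * B)
print(np.isclose(lhs, rhs))   # True
```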

First, obtain $\text{d}\mathcal L$.

$$\begin{aligned}\mathcal L&=-\textbf y^{\top}\ln\frac{\exp(\textbf z)}{\textbf 1^{\top}\exp(\textbf z)}\\&=-\textbf y^{\top}\Biggl(\textbf z-\ln\begin{pmatrix}\textbf 1^{\top}\exp(\textbf z)\\\textbf 1^{\top}\exp(\textbf z)\\\vdots\\\textbf 1^{\top}\exp(\textbf z)\end{pmatrix}\Biggr)\qquad\textbf 1^{\top}\exp(\textbf z)\ \text{is a scalar}\\&=\ln\bigl(\textbf 1^{\top}\exp(\textbf z)\bigr)-\textbf y^{\top}\textbf z\end{aligned}$$

(The last step uses the fact that $\textbf y$ is one-hot, so $\textbf y^{\top}\textbf 1=1$.)

According to the rules $\text{d}(g(X))=g'(X)\odot\text{d}X$ and $\text{d}(XY)=(\text{d}X)Y+X(\text{d}Y)$, we have

$$\text{d}\bigl(\ln(\textbf 1^{\top}\exp(\textbf z))\bigr)=\frac{1}{\textbf 1^{\top}\exp(\textbf z)}\,\text{d}\bigl(\textbf 1^{\top}\exp(\textbf z)\bigr)$$

$$\text{d}\bigl(\textbf 1^{\top}\exp(\textbf z)\bigr)=\textbf 1^{\top}\text{d}\bigl(\exp(\textbf z)\bigr)=\textbf 1^{\top}\bigl(\exp(\textbf z)\odot\text{d}\textbf z\bigr)$$

So

$$\text{d}\mathcal L=\frac{\textbf 1^{\top}\bigl(\exp(\textbf z)\odot\text{d}\textbf z\bigr)}{\textbf 1^{\top}\exp(\textbf z)}-\textbf y^{\top}\text{d}\textbf z$$

Now we can take the trace. Using the identity $\text{tr}\bigl(A^{\top}(B\odot C)\bigr)=\text{tr}\bigl((A\odot B)^{\top}C\bigr)=\sum_{i,j}A_{ij}B_{ij}C_{ij}$, we obtain

$$\begin{aligned}\text{d}\mathcal L&=\text{tr}\Bigl(\frac{(\textbf 1\odot\exp(\textbf z))^{\top}\text{d}\textbf z}{\textbf 1^{\top}\exp(\textbf z)}\Bigr)-\text{tr}\bigl(\textbf y^{\top}\text{d}\textbf z\bigr)\\&=\text{tr}\biggl(\Bigl(\frac{\bigl(\exp(\textbf z)\bigr)^{\top}}{\textbf 1^{\top}\exp(\textbf z)}-\textbf y^{\top}\Bigr)\text{d}\textbf z\biggr)\\&=\text{tr}\bigl((\hat{\textbf y}-\textbf y)^{\top}\text{d}\textbf z\bigr)\\&=\text{tr}\biggl(\Bigl(\frac{\partial\mathcal L}{\partial\textbf z}\Bigr)^{\top}\text{d}\textbf z\biggr)\end{aligned}$$

This gives the form of $\delta^{(L)}$:

$$\delta^{(L)}=\frac{\partial\mathcal L}{\partial\textbf z^{(L)}}=\hat{\textbf y}-\textbf y$$

From this it is not difficult to see why $\delta^{(l)}$ is called the error term.
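As a final sanity check (a made-up numeric example, not part of [1] or [2]), the result $\delta^{(L)}=\hat{\textbf y}-\textbf y$ can be compared against a finite-difference estimate of $\dfrac{\partial\mathcal L}{\partial\textbf z}$:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(y, z):
    return -float(y @ np.log(softmax(z)))

# Hypothetical example: compare the analytic delta^(L) = y_hat - y
# against a central-difference estimate of dL/dz.
y = np.array([0.0, 0.0, 1.0])
z = np.array([0.3, -1.2, 0.8])

analytic = softmax(z) - y
eps = 1e-6
numeric = np.array([
    (loss(y, z + eps * e) - loss(y, z - eps * e)) / (2 * eps)
    for e in np.eye(len(z))
])
print(np.allclose(analytic, numeric, atol=1e-8))   # True
```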

Resources:

[1] Lecture notes on Neural Networks and Deep Learning

[2] "Matrix Derivation" (Part 1)

[3] Matrix calculus
