Machine Learning Public Course Notes (5): Neural Networks -- Learning


This chapter may be the least clear part of Andrew Ng's course. Why? It focuses on the backpropagation (BP) algorithm, and Ng spends more than half the time on how to compute the error term $\delta$, how to compute the $\delta$ matrix, and how to implement backpropagation in MATLAB, but the most critical questions -- why compute these things at all, and what these quantities actually represent -- are barely explained, and no mathematical derivation or worked example is given. So instead of following the lecture order, after consulting a number of references I will start from the gradient derivation of a simple neural network to explain the basic principle of the backpropagation algorithm and the actual meaning of each symbol, and then present the concrete BP computation steps as given in the course. This order should make the material easier to understand.

The backpropagation (BP) algorithm for a simple neural network

1. Review of the forward propagation (FP) algorithm

The FP algorithm is quite simple: put plainly, starting from the values of the neurons in the previous layer, take a weighted sum and then apply the sigmoid function to obtain the values of the neurons in the next layer. Written mathematically:

$$a^{(1)}=x$$

$$z^{(2)}=\theta^{(1)}a^{(1)}$$

$$a^{(2)}=g(z^{(2)})$$

$$z^{(3)}=\theta^{(2)}a^{(2)}$$

$$a^{(3)}=g(z^{(3)})$$

$$z^{(4)}=\theta^{(3)}a^{(3)}$$

$$a^{(4)}=g(z^{(4)})$$
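The FP equations above can be sketched in a few lines of pure Python. This is an illustration, not code from the course: the 2-2-1 network shape and the weight values below are made up, and bias units are omitted for brevity.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(theta_list, x):
    """Forward propagation: a^(1) = x, then z^(l+1) = theta^(l) a^(l),
    a^(l+1) = g(z^(l+1)). Each matrix is stored as a list of rows."""
    a = x
    for theta in theta_list:
        z = [sum(w * ai for w, ai in zip(row, a)) for row in theta]  # weighted sum
        a = [sigmoid(zj) for zj in z]                                # sigmoid
    return a

# Tiny made-up network: two inputs, one hidden layer of two units, one output
theta1 = [[0.1, 0.2], [0.3, 0.4]]
theta2 = [[0.5, 0.6]]
print(forward([theta1, theta2], [1.0, 0.0]))  # a single output in (0, 1)
```

Each loop iteration applies exactly one line of the FP equations: a weighted sum followed by the sigmoid.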

2. Review of the neural network cost function (excluding the regularization term)

$$J(\theta) =-\frac{1}{m}\left[\sum\limits_{i=1}^{m}\sum\limits_{k=1}^{K}y^{(i)}_{k}\log\left(h_\theta(x^{(i)})\right)_k + (1- y^{(i)}_k)\log\left(1-\left(h_\theta(x^{(i)})\right)_k\right)\right]$$
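As a concrete illustration, this cost can be computed directly from the predictions and one-hot labels. A minimal Python sketch (the function name and example values are mine, not from the notes):

```python
import math

def cross_entropy_cost(H, Y):
    """J(theta) without regularization: H and Y are m x K lists holding the
    predictions h_theta(x^(i))_k and the one-hot labels y^(i)_k."""
    m = len(H)
    total = 0.0
    for h_row, y_row in zip(H, Y):
        for h, y in zip(h_row, y_row):
            total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total / m

# One example (m = 1) with K = 2 classes; prediction 0.9 for the true class
print(cross_entropy_cost([[0.9, 0.1]], [[1, 0]]))  # -2*ln(0.9) ~ 0.2107
```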

3. BP derivation process of a simple neural network

What problem does the BP algorithm solve? We already have the cost function $J(\theta)$; next we want to use gradient descent (or another advanced optimization algorithm) to minimize $J(\theta)$ and obtain the trained parameters $\theta$. The key issue is that the optimization algorithm needs two inputs: the cost function $J(\theta)$ itself and its gradient $\frac{\partial J(\theta)}{\partial \theta}$. The BP algorithm is precisely the answer to how that gradient is computed.

Let us start with a simple example of how to compute the gradient of the cost function mathematically. Consider the following simple neural network (for convenience, the forward propagation (FP) computation is annotated in the figure). The network has three layers of neurons, corresponding to two weight matrices $\theta^{(1)}$ and $\theta^{(2)}$. To obtain the gradient we only need to compute two partial derivatives: $\frac{\partial J(\theta)}{\partial\theta^{(1)}}$ and $\frac{\partial J(\theta)}{\partial\theta^{(2)}}$.

First, we compute the partial derivative with respect to the second weight matrix, $\frac{\partial}{\partial\theta^{(2)}}J(\theta)$. We need to establish the connection between $J(\theta)$ and $\theta^{(2)}$: it is easy to see that the value of $J(\theta)$ depends on $h_\theta(x)$, while $h_\theta(x)=a^{(3)}$; $a^{(3)}$ in turn is obtained by applying the sigmoid to $z^{(3)}$; and finally $z^{(3)}=\theta^{(2)}a^{(2)}$. The chain of dependencies is therefore $J(\theta) \to a^{(3)} \to z^{(3)} \to \theta^{(2)}$.

By the chain rule, we can first take the derivative of $J(\theta)$ with respect to $z^{(3)}$ and multiply it by the derivative of $z^{(3)}$ with respect to $\theta^{(2)}$, i.e.

$$\frac{\partial}{\partial\theta^{(2)}}J(\theta) = \frac{\partial J(\theta)}{\partial z^{(3)}} \frac{\partial z^{(3)}}{\partial \theta^{(2)}}$$

From $z^{(3)}=\theta^{(2)}a^{(2)}$ it is not hard to see that $\frac{\partial z^{(3)}}{\partial \theta^{(2)}}=(a^{(2)})^T$. Denoting $\frac{\partial J(\theta)}{\partial z^{(3)}}=\delta^{(3)}$, the equation above can be rewritten as

$$\frac{\partial}{\partial\theta^{(2)}}J(\theta)=\delta^{(3)}(a^{(2)})^T$$

It remains only to compute $\delta^{(3)}$. From the previous chapter we already know $g'(z)=g(z)(1-g(z))$ and $h_\theta(x)=a^{(3)}=g(z^{(3)})$. Ignoring the leading $\frac{1}{m}\sum\limits_{i=1}^{m}$ (we derive for a single example here; the sum over examples can be accumulated at the end), we get

$$\begin{aligned}\delta^{(3)}&=\frac{\partial J(\theta)}{\partial z^{(3)}}\\&= -y\frac{1}{g(z^{(3)})}g'(z^{(3)})-(1-y)\frac{1}{1-g(z^{(3)})}\left[1-g(z^{(3)})\right]'\\&=-y\left(1-g(z^{(3)})\right)+(1-y)g(z^{(3)})\\&=-y+g(z^{(3)})\\&=a^{(3)}-y\end{aligned}$$

So far we have obtained the partial derivative of $J(\theta)$ with respect to $\theta^{(2)}$:

$$\frac{\partial J(\theta)}{\partial\theta^{(2)}}=\delta^{(3)}(a^{(2)})^T$$

$$\delta^{(3)}=a^{(3)}-y$$
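The result $\delta^{(3)}=a^{(3)}-y$ can be sanity-checked numerically for a single scalar output. A small Python sketch (the values of $z$ and $y$ below are arbitrary):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(z, y):
    # J for one output unit: -[y log g(z) + (1 - y) log(1 - g(z))]
    a = sigmoid(z)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

z, y, eps = 0.7, 1.0, 1e-6
numeric = (cost(z + eps, y) - cost(z - eps, y)) / (2 * eps)  # dJ/dz, numerically
analytic = sigmoid(z) - y                                    # delta = a - y
print(numeric, analytic)
```

The two numbers agree to many decimal places, confirming the derivation above.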

Next we need the derivative of $J(\theta)$ with respect to $\theta^{(1)}$. The chain of dependencies is $J(\theta) \to a^{(3)} \to z^{(3)} \to a^{(2)} \to z^{(2)} \to \theta^{(1)}$.

By the chain rule, we have

$$\frac{\partial J(\theta)}{\partial \theta^{(1)}} = \frac{\partial J(\theta)}{\partial z^{(3)}} \frac{\partial z^{(3)}}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial \theta^{(1)}}$$

We compute the three factors on the right-hand side separately:

$$\frac{\partial J(\theta)}{\partial z^{(3)}}=\delta^{(3)}$$

$$\frac{\partial z^{(3)}}{\partial a^{(2)}}=(\theta^{(2)})^T$$

$$\frac{\partial a^{(2)}}{\partial \theta^{(1)}}=\frac{\partial a^{(2)}}{\partial z^{(2)}} \frac{\partial z^{(2)}}{\partial \theta^{(1)}}=g'(z^{(2)})(a^{(1)})^T$$

Substituting these back in, we get

$$\frac{\partial J(\theta)}{\partial \theta^{(1)}}=\left[(\theta^{(2)})^T \delta^{(3)} .* g'(z^{(2)})\right](a^{(1)})^T$$

Defining $\delta^{(2)}=(\theta^{(2)})^T\delta^{(3)} .* g'(z^{(2)})$, the equation above can be rewritten as

$$\frac{\partial J(\theta)}{\partial \theta^{(1)}}=\delta^{(2)}(a^{(1)})^T$$

$$\delta^{(2)}=(\theta^{(2)})^T\delta^{(3)} .* g'(z^{(2)})$$

Putting the above results together, we obtain the partial derivatives of $J(\theta)$ with respect to the two weight matrices:

$$\delta^{(3)}=a^{(3)}-y$$

$$\frac{\partial J(\theta)}{\partial\theta^{(2)}}=\delta^{(3)}(a^{(2)})^T$$

$$\delta^{(2)}=(\theta^{(2)})^T\delta^{(3)} .* g'(z^{(2)})$$

$$\frac{\partial J(\theta)}{\partial \theta^{(1)}}=\delta^{(2)}(a^{(1)})^T$$

Looking at the four equations above, we find:

    • The partial derivative with respect to a layer's weight matrix is the product of the next layer's error vector $\delta^{(l+1)}$ and the current layer's activation vector $a^{(l)}$
    • The current layer's error vector $\delta^{(l)}$ can be obtained from the product of the next layer's error vector $\delta^{(l+1)}$ and the weight matrix $\theta^{(l)}$

Therefore, the error vectors can be computed layer by layer from back to front (this is the origin of the name backward propagation), and the partial derivative of the cost function with respect to each layer's weight matrix is then obtained by a simple multiplication. It is now finally clear why the error vectors are computed and why there is a recursive relation between them. Although the network here is very simple and the derivation is not fully rigorous, this small example is enough to understand the principle of the backpropagation algorithm.

The rigorous backpropagation algorithm (computing the gradient)

Suppose we have $m$ training examples and an $L$-layer neural network, and this time we include the regularization term, i.e.

$$J(\theta) =-\frac{1}{m}\left[\sum\limits_{i=1}^{m}\sum\limits_{k=1}^{K}y^{(i)}_{k}\log\left(h_\theta(x^{(i)})\right)_k + (1- y^{(i)}_k)\log\left(1-\left(h_\theta(x^{(i)})\right)_k\right)\right] + \frac{\lambda}{2m}\sum\limits_{l=1}^{L-1}\sum\limits_{i=1}^{s_l} \sum\limits_{j=1}^{s_{l+1}}\left(\theta_{ji}^{(l)}\right)^2$$

Initialization: set $\Delta^{(l)}_{ij}=0$ (interpreted as the accumulator for the partial derivatives of the layer-$l$ weight matrix)

For i = 1:m

    • Set $a^{(1)}=x^{(i)}$
    • Use the forward propagation algorithm (FP) to compute the activations $a^{(l)}$ of every layer, $l=2,3,\ldots,L$
    • Compute the error vector of the last layer, $\delta^{(L)}=a^{(L)}-y^{(i)}$, then use the backpropagation algorithm (BP) to compute the error vectors from back to front, $\delta^{(L-1)}, \delta^{(L-2)}, \ldots, \delta^{(2)}$, via the formula $\delta^{(l)}=(\theta^{(l)})^T\delta^{(l+1)} .* g'(z^{(l)})$
    • Accumulate $\Delta^{(l)}=\Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T$

End // for

Compute the gradient:

$$D_{ij}^{(l)}=\frac{1}{m}\Delta^{(l)}_{ij}, \quad j=0$$

$$D_{ij}^{(l)}=\frac{1}{m}\Delta^{(l)}_{ij}+\frac{\lambda}{m}\theta_{ij}^{(l)}, \quad j\neq 0$$

$$\frac{\partial J(\theta)}{\partial \theta^{(l)}}=D^{(l)}$$
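The per-example part of the loop above can be sketched in pure Python. This is a hedged illustration, not the course's MATLAB code: bias units and the regularization term are omitted, and the 2-2-1 network and its weights are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    """M v for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matTvec(M, v):
    """M^T v."""
    return [sum(M[i][j] * v[i] for i in range(len(M))) for j in range(len(M[0]))]

def outer(u, v):
    """u v^T."""
    return [[ui * vj for vj in v] for ui in u]

def backprop(thetas, x, y):
    """Gradient contribution of one training example, following the loop above
    (bias units and regularization omitted to keep the sketch short)."""
    # FP: store the activation a^(l) of every layer
    activations = [x]
    for theta in thetas:
        z = matvec(theta, activations[-1])
        activations.append([sigmoid(zj) for zj in z])
    # BP: start from delta^(L) = a^(L) - y
    delta = [a - t for a, t in zip(activations[-1], y)]
    grads = [None] * len(thetas)
    for l in range(len(thetas) - 1, -1, -1):
        grads[l] = outer(delta, activations[l])            # delta^(l+1) (a^(l))^T
        if l > 0:
            gprime = [a * (1 - a) for a in activations[l]]  # g'(z) = g(z)(1 - g(z))
            delta = [b * gp for b, gp in zip(matTvec(thetas[l], delta), gprime)]
    return grads

# Tiny made-up 2-2-1 network, one training example:
thetas = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6]]]
grads = backprop(thetas, [1.0, 0.5], [1.0])
```

For the full algorithm, these per-example gradients would be accumulated over i = 1:m and divided by m, with the regularization term added for non-bias weights.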

Tips for using BP in practice

1. Unrolling the parameters into a vector

For a four-layer network, the three weight matrices $\theta^{(1)}, \theta^{(2)}, \theta^{(3)}$ are unrolled into a single parameter vector; the MATLAB code is as follows:

thetaVec = [Theta1(:); Theta2(:); Theta3(:)];
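For readers not using MATLAB, the same unroll/reshape round trip can be sketched in Python. Note that MATLAB's `(:)` flattens column-major while this sketch flattens row-major; either order works as long as unrolling and reshaping agree. The function names are my own.

```python
def unroll(matrices):
    """Flatten a list of weight matrices into one flat parameter list
    (the row-major analogue of thetaVec = [Theta1(:); Theta2(:); ...])."""
    return [w for M in matrices for row in M for w in row]

def reshape_back(vec, shapes):
    """Rebuild the matrices from the flat vector, given their (rows, cols) shapes."""
    out, pos = [], 0
    for rows, cols in shapes:
        out.append([vec[pos + r * cols: pos + (r + 1) * cols] for r in range(rows)])
        pos += rows * cols
    return out

theta1 = [[1.0, 2.0], [3.0, 4.0]]
theta2 = [[5.0, 6.0]]
theta_vec = unroll([theta1, theta2])  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

Advanced optimizers typically expect a single flat vector, which is why the unroll step is needed at all.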
2. Gradient Check

To make sure the gradient computation is correct, a numerical approximation can be used as a check. By the definition of the derivative,

$$\frac{dJ(\theta)}{d\theta} \approx \frac{J(\theta + \epsilon)-J(\theta-\epsilon)}{2\epsilon}$$

The MATLAB code is as follows:

for i = 1:n
    thetaPlus = theta;
    thetaPlus(i) = thetaPlus(i) + epsilon;
    thetaMinus = theta;
    thetaMinus(i) = thetaMinus(i) - epsilon;
    gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * epsilon);
end

Finally, check whether gradApprox is approximately equal to the gradient computed by BP. Note that because the numerical gradient is expensive to compute, remember to disable the gradient-check code once the check has passed.
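The same two-sided numerical gradient can be sketched in Python (the quadratic test function below is my own example, not from the notes):

```python
def grad_approx(J, theta, epsilon=1e-4):
    """Two-sided numerical gradient of J at theta (theta is a flat list),
    mirroring the MATLAB loop above."""
    grad = []
    for i in range(len(theta)):
        theta_plus = list(theta)
        theta_plus[i] += epsilon
        theta_minus = list(theta)
        theta_minus[i] -= epsilon
        grad.append((J(theta_plus) - J(theta_minus)) / (2 * epsilon))
    return grad

# Example: J(theta) = theta0^2 + 3*theta1, whose true gradient is [2*theta0, 3]
J = lambda t: t[0] ** 2 + 3 * t[1]
print(grad_approx(J, [1.0, 5.0]))  # ~ [2.0, 3.0]
```

The two-sided difference has error O(epsilon^2), which is why it is preferred over the one-sided (J(theta + epsilon) - J(theta)) / epsilon form.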

3. Random initialization

The initial weight matrices should be initialized so as to break symmetry (symmetry breaking); avoid initializing with an all-zero matrix. Instead, use small random numbers, e.g. $\theta^{(l)}_{ij} \in [-\epsilon, +\epsilon]$.
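A minimal Python sketch of this random initialization (the value epsilon = 0.12 is a common heuristic, not a number from the notes):

```python
import random

def rand_init(rows, cols, epsilon=0.12):
    """Initialize a rows x cols weight matrix uniformly in [-epsilon, +epsilon]
    to break symmetry; an all-zero init would make every hidden unit identical."""
    return [[random.uniform(-epsilon, epsilon) for _ in range(cols)]
            for _ in range(rows)]

Theta1 = rand_init(2, 3)
```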

How to train a neural network
    1. Randomly initialize the weight matrices
    2. Use the forward propagation algorithm (FP) to compute the model prediction $h_\theta(x)$
    3. Compute the cost function $J(\theta)$
    4. Use the backpropagation algorithm (BP) to compute the gradient $\frac{\partial J(\theta)}{\partial \theta^{(l)}}$
    5. Use a numerical gradient check (gradient checking) to verify correctness, then disable the check
    6. Use gradient descent (or another optimization algorithm) to obtain the optimal parameters $\theta$
Attached: A short back-propagation instructional video

Reference documents

[1] Andrew Ng, Coursera Machine Learning public course, week five.

[2] Derivation of Backpropagation. http://web.cs.swarthmore.edu/~meeden/cs81/s10/BackPropDeriv.pdf

[3] Wikipedia: Backpropagation. https://en.wikipedia.org/wiki/Backpropagation

[4] How the backpropagation algorithm works. http://neuralnetworksanddeeplearning.com/chap2.html

[5] Neural networks and the backpropagation algorithm (derivation). http://www.mamicode.com/info-detail-671452.html

