This series of articles is the study notes of "Machine Learning" by Prof. Andrew Ng, Stanford University. This article is the notes of Week 5, Neural Networks: Learning, and covers the cost function and the backpropagation algorithm.
Cost Function and Backpropagation
Neural networks are one of the most powerful learning algorithms we have today. In this and the next few sections, we're going to start talking about a learning algorithm for fitting the parameters of a neural network given a training set. As with the discussion of other learning algorithms, we're going to begin by talking about the cost function for fitting the parameters of the network.
1. Cost Function
I'm going to focus on the application of neural networks to classification problems. Suppose we have a network like the one shown in the picture, and suppose we have a training set of m training examples, i.e. pairs (x^(i), y^(i)).
L = total number of layers in the network; for this example, L = 4.
s_l = number of units (not counting the bias unit) in layer l; here s1 = 3, s2 = 5, s4 = s_L = 4.
Binary Classification

The first case is binary classification, where the label y is either 0 or 1. In this case we have one output unit. The neural network in the picture has four output units, but for binary classification we would have only one output unit, which computes h_Θ(x), and the output h_Θ(x) is a real number.

y = 0 or 1
Multi-class Classification (K classes)

Here the network has K output units, and the label y is a K-dimensional vector with exactly one entry equal to 1.
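For example, with K = 4 classes the label vectors are one-hot:

$$y \in \left\{ \begin{bmatrix}1\\0\\0\\0\end{bmatrix}, \begin{bmatrix}0\\1\\0\\0\end{bmatrix}, \begin{bmatrix}0\\0\\1\\0\end{bmatrix}, \begin{bmatrix}0\\0\\0\\1\end{bmatrix} \right\}, \qquad h_\Theta(x) \in \mathbb{R}^{4}$$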
Cost function
Logistic regression
The cost function we use for the neural network is going to be a generalization of the one we used for logistic regression. For logistic regression we minimized the cost function J(θ): minus 1/m times a sum over the training examples, plus an extra regularization term which is a sum from j = 1 through n, because we do not regularize the bias term θ0.
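Written out, the regularized logistic regression cost function referred to here is:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big) \Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$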
Neural Network

For a neural network, our cost function is going to be a generalization of this. Instead of having just one output unit, we may instead have K of them. Our network now outputs vectors in R^K, where K might be equal to 1 if we have a binary classification problem. I'm going to use the notation (h_Θ(x))_i to denote the ith output: h_Θ(x) is a K-dimensional vector, and the subscript i just selects the ith element of the vector output by the neural network. Our cost function J(Θ) is now going to be the following.
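Written out in full:

$$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\Big[\, y_k^{(i)}\log\big(h_\Theta(x^{(i)})\big)_k + (1-y_k^{(i)})\log\Big(1-\big(h_\Theta(x^{(i)})\big)_k\Big) \Big] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\big(\Theta_{ji}^{(l)}\big)^2$$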
It is -1/m times a sum of a term similar to the one we have for logistic regression, except that we also sum over k from 1 through K. This summation is a sum over the K output units: if the final layer of the neural network has four output units, this is a sum from k = 1 through 4 of the logistic regression cost, taken over each of the four output units in turn. The second term is the regularization term, similar to what we had for logistic regression. It looks complicated, but all it's doing is summing over the terms Θ_ji^(l) for all values of j, i and l, except that we don't sum over the terms corresponding to the bias values, just as in logistic regression.
2. Backpropagation Algorithm
In the previous section, we talked about the cost function for the neural network. In this section, let's start to talk about an algorithm for minimizing that cost function: the backpropagation algorithm.
Gradient Computation
Here's the cost function that we wrote down in the previous section. What we'd like to do is find parameters Θ that minimize J(Θ), in order to use either gradient descent or one of the advanced optimization algorithms.
Need code to compute: J(Θ) and the partial derivatives ∂J(Θ)/∂Θ_ij^(l).
What we need to do, therefore, is write code that takes as input the parameters Θ and computes J(Θ) and these partial derivative terms. Remember that the parameters of the neural network are these Θ_ij^(l), which are real numbers, so these are the partial derivative terms we need to compute. To compute the cost function J(Θ) we just use the formula above, so most of this section focuses on how to compute the partial derivative terms.
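As a rough, hypothetical skeleton (not the course's Octave code), this is the interface such code needs to expose so that an off-the-shelf optimizer such as SciPy's minimize can use it; the names nn_cost_function, X, Y, lam and initial_theta are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import minimize

def nn_cost_function(theta_flat, X, Y, lam):
    """Return (J, grad) for the unrolled parameter vector theta_flat.

    The bodies would be filled in using the cost formula above and the
    backpropagation algorithm described below; grad is the vector of
    partial derivatives dJ/dTheta_ij^(l), unrolled to match theta_flat."""
    J = ...      # compute the cost J(Theta)
    grad = ...   # compute the partial derivative terms
    return J, grad

# A gradient-based optimizer can then minimize J(Theta), e.g.:
# result = minimize(nn_cost_function, initial_theta, args=(X, Y, lam),
#                   jac=True, method="L-BFGS-B")
```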
Given one training example (x, y)

Let's start with the case where we have only one training example, so our entire training set consists of the single pair (x, y).
Let's step through the sequence of calculations we would do with this one training example. The first thing we do is apply forward propagation in order to compute what the hypothesis actually outputs given the input.
Forward Propagation
The vectorized implementation of forward propagation allows us to compute the activation values of all the neurons in our neural network.
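As a minimal sketch (my own illustration, not code from the lecture), vectorized forward propagation for the four-layer network above might look like this in Python/NumPy; Theta1, Theta2 and Theta3 are assumed weight matrices that each include a first column for the bias unit, and x is a column vector of features:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, Theta1, Theta2, Theta3):
    """Vectorized forward propagation through a 4-layer network.

    x is a column vector of input features; each Theta matrix maps one
    layer to the next and includes a first column for the bias unit."""
    a1 = np.vstack(([[1.0]], x))            # a(1): input plus bias unit
    z2 = Theta1 @ a1
    a2 = np.vstack(([[1.0]], sigmoid(z2)))  # a(2): hidden layer plus bias unit
    z3 = Theta2 @ a2
    a3 = np.vstack(([[1.0]], sigmoid(z3)))  # a(3): hidden layer plus bias unit
    z4 = Theta3 @ a3
    a4 = sigmoid(z4)                        # a(4) = h_Theta(x), the output
    return a1, z2, a2, z3, a3, z4, a4
```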
Gradient Computation: Backpropagation Algorithm
Next, in order to compute the derivatives, we're going to use an algorithm called backpropagation. The intuition of the backpropagation algorithm is that, for each node, we compute a term δ_j^(l) that is going to somehow represent the error of node j in layer l.
Intuition:
For each output unit (layer L = 4): δ_j^(4) = a_j^(4) - y_j.
If you think of δ, a and y as vectors, you can also come up with a vectorized implementation, which is just δ^(4) = a^(4) - y, where each of δ^(4), a^(4) and y is a vector whose dimension equals the number of output units in our network.
What we do next is compute the delta terms for the earlier layers of our network. The formula for computing δ^(3) is δ^(3) = (Θ^(3))^T δ^(4) .* g'(z^(3)), where .* is the element-wise multiplication operation that we know from MATLAB and g'(z^(3)) = a^(3) .* (1 - a^(3)) is the derivative of the activation function evaluated at z^(3). Similarly, δ^(2) = (Θ^(2))^T δ^(3) .* g'(z^(2)); there is no δ^(1) term, because the first layer is the input layer and has no error associated with it.
Backpropagation Algorithm
Training set: {(x^(1), y^(1)), …, (x^(m), y^(m))}

Set Δ_ij^(l) = 0 for all l, i, j. For each training example i = 1 to m: set a^(1) = x^(i), perform forward propagation to compute a^(l) for l = 2, 3, …, L, compute δ^(L) = a^(L) - y^(i), compute δ^(L-1), δ^(L-2), …, δ^(2), and accumulate Δ^(l) := Δ^(l) + δ^(l+1) (a^(l))^T. Finally, set D_ij^(l) = (1/m) Δ_ij^(l) + (λ/m) Θ_ij^(l) for j ≠ 0 and D_ij^(l) = (1/m) Δ_ij^(l) for j = 0. It can then be shown that ∂J(Θ)/∂Θ_ij^(l) = D_ij^(l).
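Here is a rough Python/NumPy sketch of this algorithm for the four-layer network, reusing the hypothetical sigmoid and forward_propagate helpers from the earlier sketch; X holds one example per row, Y one label vector per row, and lam stands for λ. The derivative g'(z^(l)) is computed as a^(l) .* (1 - a^(l)):

```python
import numpy as np

def backprop(X, Y, Theta1, Theta2, Theta3, lam):
    """One pass of backpropagation over the training set, returning the
    partial derivatives D^(l) of the regularized cost J(Theta)."""
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)
    Delta2 = np.zeros_like(Theta2)
    Delta3 = np.zeros_like(Theta3)

    for i in range(m):
        x = X[i].reshape(-1, 1)
        y = Y[i].reshape(-1, 1)

        # Forward propagation (hypothetical helper from the sketch above)
        a1, z2, a2, z3, a3, z4, a4 = forward_propagate(x, Theta1, Theta2, Theta3)

        # Backpropagate the "error" terms, right to left
        delta4 = a4 - y
        delta3 = (Theta3.T @ delta4) * (a3 * (1 - a3))
        delta3 = delta3[1:]                      # discard the bias-unit delta
        delta2 = (Theta2.T @ delta3) * (a2 * (1 - a2))
        delta2 = delta2[1:]                      # discard the bias-unit delta

        # Accumulate the gradients
        Delta3 += delta4 @ a3.T
        Delta2 += delta3 @ a2.T
        Delta1 += delta2 @ a1.T

    # Average and add regularization (bias columns are not regularized)
    D1, D2, D3 = Delta1 / m, Delta2 / m, Delta3 / m
    D1[:, 1:] += (lam / m) * Theta1[:, 1:]
    D2[:, 1:] += (lam / m) * Theta2[:, 1:]
    D3[:, 1:] += (lam / m) * Theta3[:, 1:]
    return D1, D2, D3
```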
3. Backpropagation Intuition
Backpropagation may unfortunately be a less mathematically clean, or less mathematically simple, algorithm compared to linear regression or logistic regression. I've actually used backpropagation pretty successfully for many years, and even today I sometimes feel I don't have a very good sense of just what it's doing, or intuition about what backpropagation is doing. For those of you doing the programming exercises, they will at least mechanically step you through the different steps of implementing backprop, so you'll be able to get it to work for yourself. What I want to do in this section is look a little bit more at the mechanical steps of backpropagation, and try to give a little more intuition about what those mechanical steps are doing, to hopefully convince you that it is at least a reasonable algorithm.
Forward Propagation

In order to better understand backpropagation, let's take another closer look at what forward propagation is doing. Here's a neural network with two input units (not counting the bias unit), two hidden units in this layer, two hidden units in the next layer, and then, finally, one output unit. Again, these counts do not include the bias units on top.
In order to illustrate forward propagation, I'm going to draw this network a little bit differently; in particular, I'm going to draw the nodes as very fat ellipses so that I can write text inside them. When performing forward propagation, we take some particular example, say (x^(i), y^(i)), and feed x^(i) into the input layer.
When we forward propagate to the first hidden layer, what we do is compute z1^(2) and z2^(2), the weighted sums of the inputs from the input units, and then apply the sigmoid (logistic) activation function to the z values; this gives the activation values a1^(2) and a2^(2). We then forward propagate again to get z1^(3), apply the activation function to get a1^(3), and continue similarly until we get z1^(4); applying the activation function gives a1^(4), the final output value of the neural network. Concretely, to compute a value such as z1^(3), we take z1^(3) = Θ10^(2)·1 + Θ11^(2)·a1^(2) + Θ12^(2)·a2^(2).
What is backpropagation doing?
What backpropagation is doing is a process very similar to forward propagation, except that instead of the computations flowing from the left to the right of the network, the computations flow from the right to the left, using a very similar computation. Recall the cost function of the neural network given above.
Focusing on a single example (x^(i), y^(i)), the case of one output unit (K = 1), and ignoring regularization (λ = 0), the cost function can be written as follows.
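Written out, the per-example cost being referred to is:

$$\text{cost}(i) = y^{(i)}\log h_\Theta(x^{(i)}) + \big(1-y^{(i)}\big)\log\big(1-h_\Theta(x^{(i)})\big)$$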
What that cost function does is play a role similar to the squared error. Rather than looking at this complicated expression, if you want you can think of cost(i) as being approximately the squared difference between what the neural network outputs and the actual value, i.e. cost(i) ≈ (h_Θ(x^(i)) - y^(i))^2.
I.e., how well is the network doing on example i?
More formally, what the delta terms actually are is this: they are the partial derivatives of the cost function with respect to z_j^(l), the weighted sums of inputs that we compute as the z terms. Concretely, the cost function is a function of the label y and of the value h_Θ(x) output by the neural network. If we could go inside the neural network and just change those z_j^(l) values a little bit, that would affect the values the neural network outputs, and that would end up changing the cost function.
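In symbols (for layers l ≥ 2, ignoring regularization):

$$\delta_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}}\,\text{cost}(i)$$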
We don't compute the bias term

By the way, so far I've been writing the delta values only for the hidden units, excluding the bias units. Depending on how you define the backpropagation algorithm, or depending on how you implement it, you may end up implementing something that computes delta values for the bias units as well. The bias units always output the value of plus one, they are just what they are, and there's no way for us to change that value. So, depending on your implementation of backprop, the way I usually implement it, I do end up computing these delta values, but we just discard them; we don't use them.