Source: Michael Nielsen's "Neural Networks and Deep Learning". Click "read the original" at the end of this post to view the original English text.
Translator for this section: Li Shengyu, master's student at HIT-SCIR
Disclaimer: To reprint, please contact [email protected]; reproduction without authorization is not permitted.
Using neural networks to recognize handwritten digits
How the backpropagation algorithm works
Warm up: a fast matrix-based method for computing the output of a neural network
Two assumptions about the cost function
The Hadamard product
The four fundamental equations behind backpropagation
Proof of the four fundamental equations (optional reading)
The backpropagation algorithm
The code for backpropagation
Why the backpropagation algorithm is efficient
Backpropagation: the big picture
Improving the way neural networks learn
A visual proof that neural networks can compute any function
Why deep neural networks are hard to train
Deep learning
Hint: this section contains a lot of code, so it is best read at a computer.
Having understood the theory behind the backpropagation algorithm, we can now understand the code used in the previous chapter to implement it. Recall the update_mini_batch and backprop methods of the Network class from chapter 1. The code can be seen as a direct translation of the algorithm described above. In particular, the update_mini_batch method updates the Network's weights and biases by computing the gradient for the current mini-batch of training examples.
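For reference, this is the mini-batch gradient descent update rule from chapter 1 that update_mini_batch implements, where m is the size of the mini-batch, \eta is the learning rate, and the sums run over the training examples X_j in the mini-batch:

\[
w_k \rightarrow w_k' = w_k - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k},
\qquad
b_l \rightarrow b_l' = b_l - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l}.
\]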
class Network(object):
...
    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The "mini_batch" is a list of tuples "(x, y)", and "eta"
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]
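To make the calling convention concrete, here is a minimal usage sketch (my own, not from the book). It assumes the chapter 1 file network.py is importable, uses the 784-30-10 architecture from the MNIST example, and substitutes random placeholder data for real training pairs:

import numpy as np
import network  # the chapter 1 code (network.py)

# A 784-30-10 network, as in the MNIST example of chapter 1.
net = network.Network([784, 30, 10])

# Ten placeholder (input, target) column-vector pairs, standing in for
# real MNIST training data loaded via mnist_loader.
mini_batch = [(np.random.randn(784, 1), np.random.randn(10, 1))
              for _ in range(10)]

# One gradient descent step on this mini-batch, with learning rate eta = 3.0.
net.update_mini_batch(mini_batch, 3.0)

In the book's SGD method, a call like this happens once for every mini-batch in every epoch of training.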
Most of the work is done by the line delta_nabla_b, delta_nabla_w = self.backprop(x, y), which uses the backprop method to compute the partial derivatives ∂C_x/∂b and ∂C_x/∂w of the cost with respect to the biases and weights. The backprop method follows the algorithm described in the last section closely, with one difference: we use a slightly different scheme to index the layers. The change takes advantage of Python's negative list indices, which index a list from the end. For example, l[-3] is the third-to-last entry of the list l. The code for backprop is shown below, together with a few helper functions used to compute the sigmoid function, its derivative, and the derivative of the cost function. You should be able to follow the code; if you run into difficulty, you may find it helpful to consult the description of the code in chapter 1.
class Network(object):
...
    def backprop(self, x, y):
        """Return a tuple "(nabla_b, nabla_w)" representing the
        gradient for the cost function C_x.  "nabla_b" and
        "nabla_w" are layer-by-layer lists of numpy arrays, similar
        to "self.biases" and "self.weights"."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

...

    def cost_derivative(self, output_activations, y):
        """Return the vector of partial derivatives \partial C_x /
        \partial a for the output activations."""
        return (output_activations-y)

def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
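As a quick sanity check (my own sketch, not part of network.py), you can call backprop on a single placeholder example and confirm that it returns one gradient array per layer, with the same shapes as self.biases and self.weights, and that negative indices count back from the output layer:

import numpy as np
import network  # the chapter 1 code (network.py)

net = network.Network([784, 30, 10])
x = np.random.randn(784, 1)   # placeholder input
y = np.random.randn(10, 1)    # placeholder target
nabla_b, nabla_w = net.backprop(x, y)

# One gradient array per layer, matching the bias/weight shapes.
assert [nb.shape for nb in nabla_b] == [b.shape for b in net.biases]
assert [nw.shape for nw in nabla_w] == [w.shape for w in net.weights]

# Negative indices count from the output layer backwards:
# nabla_w[-1].shape == (10, 30)   weights into the output layer
# nabla_w[-2].shape == (30, 784)  weights into the hidden layer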
Problem
In our implementation of stochastic gradient descent we loop over the training examples in a mini-batch one at a time. It is possible to modify the backpropagation algorithm so that it computes the gradients for all the training examples in a mini-batch simultaneously. The idea is to pass in a matrix (rather than a vector) at the input, whose columns are the input vectors in the mini-batch. Forward propagation then proceeds by multiplying by the weight matrices, adding a suitable bias matrix, and applying the sigmoid function everywhere; the backward pass is computed in a similar fashion. Explicitly write out this approach to backpropagation, and modify network.py so that it uses this fully matrix-based method. The advantage of this approach is that it takes better advantage of modern linear algebra libraries and can run considerably faster than looping over the examples. (On my laptop, for example, it is roughly twice as fast on an MNIST classification problem like the one discussed in the previous chapter.) In practice, all serious libraries for backpropagation use this fully matrix-based approach or some variant of it.
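For concreteness, here is one possible sketch of such a fully matrix-based backprop (my own, not the book's solution; the method name backprop_matrix and the variables X, Y are invented for illustration). X and Y hold one mini-batch example per column, numpy broadcasting of the bias vectors plays the role of the "bias matrix", and each matrix product processes the whole batch at once:

class Network(object):
    ...
    def backprop_matrix(self, X, Y):
        """Hypothetical matrix-based variant of backprop.  "X" and "Y"
        contain one training example per column.  Returns the gradients
        (nabla_b, nabla_w) summed over the whole mini-batch."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward: every column of "activation" is one example
        activation = X
        activations = [X]
        zs = []
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation) + b   # b broadcasts across the columns
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass: "delta" holds one error column per example
        delta = self.cost_derivative(activations[-1], Y) * sigmoid_prime(zs[-1])
        # summing delta's columns gives the bias gradient summed over the
        # batch; the matrix product already sums the weight gradients
        nabla_b[-1] = delta.sum(axis=1, keepdims=True)
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        for l in xrange(2, self.num_layers):
            sp = sigmoid_prime(zs[-l])
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta.sum(axis=1, keepdims=True)
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

A matching update method would first stack the mini-batch, for example X = np.column_stack([x for x, y in mini_batch]) and Y = np.column_stack([y for x, y in mini_batch]), call backprop_matrix once, and then apply the same (eta/len(mini_batch)) update as before.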
In the next section we will cover "Why the backpropagation algorithm is efficient", so stay tuned!
"Hit Scir" public number
Editorial office: Guo Jiang, Li Jiaqi, Xu June, Li Zhongyang, Hulin Lin
Editor of the issue: Li Zhongyang
Article 16 of the "Neural Networks and Deep Learning" series: The code for backpropagation