This series of articles contains my learning notes on the UFLDL Tutorial.
Neural Networks
For a supervised learning problem, the training samples take the form (x(i), y(i)). Using a neural network we can build a complex nonlinear hypothesis h(x(i)) to fit the labels y(i). Let's first look at the mechanism of a single neuron:
Each neuron is a computational unit: given inputs x1, x2, x3 (plus an intercept term), its output is h(x) = f(w1·x1 + w2·x2 + w3·x3 + b),
where f(·) is an activation function. The usual choice is the sigmoid (S) function: f(z) = 1 / (1 + e^(−z)).
The shape of the S function is as follows. It has the nice property that its derivative is very easy to compute: f′(z) = f(z)(1 − f(z)):
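As a quick check of this identity (not part of the original notes), here is a minimal Python sketch comparing f′(z) = f(z)(1 − f(z)) against a centered finite difference:

```python
import math

def sigmoid(z):
    """The logistic (S) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative via the identity f'(z) = f(z) * (1 - f(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Compare against a centered finite difference at an arbitrary point.
eps, z = 1e-6, 0.5
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(abs(sigmoid_prime(z) - numeric) < 1e-9)  # True: the identity holds
```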
Another common activation function is the hyperbolic tangent, tanh: f(z) = tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z)).
Its derivative is f′(z) = 1 − (f(z))²:
For the softmax activation, the derivative of each output with respect to its own input has the same form: f′(z) = f(z) − f(z)².
The value of the S function lies in [0, 1], while the value of the tanh function lies in [−1, 1].
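The two ranges, and the tanh identity f′(z) = 1 − f(z)², can be probed the same way; a small sketch:

```python
import math

def tanh_prime(z):
    """Derivative of tanh via the identity f'(z) = 1 - f(z)**2."""
    return 1.0 - math.tanh(z) ** 2

# Range check: sigmoid stays inside (0, 1), tanh inside (-1, 1).
for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    assert 0.0 < 1.0 / (1.0 + math.exp(-z)) < 1.0
    assert -1.0 < math.tanh(z) < 1.0

# The identity matches a centered finite difference.
eps = 1e-6
numeric = (math.tanh(1.0 + eps) - math.tanh(1.0 - eps)) / (2 * eps)
print(abs(tanh_prime(1.0) - numeric) < 1e-9)  # True
```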
Neural Network Model
A neural network is formed by connecting many single neurons like the one above into a complex network:
This three-layer neural network consists of an input layer (3 input units), a hidden layer (3 computation units), and an output layer (one output unit). We write all of the parameters as (W, b) = (W^(1), b^(1), W^(2), b^(2)). W^(l)_ij denotes the weight of the connection from neuron j in layer l to neuron i in layer l+1, and likewise b^(l)_i denotes the bias of neuron i in layer l+1. Note that the number of bias terms b equals the total number of neurons minus the number of neurons in the input layer (the input layer needs no bias), and the number of weights w equals the number of connections in the network. The final output is computed as follows:
A more compact way of writing this is as matrix-vector products; storing the parameters in matrix form also speeds up the computation. Here z^(l) denotes the weighted inputs of all neurons in layer l, and a^(l) the outputs (activations) of that layer:
We call the above procedure for computing the final output forward propagation (feedforward). For the input layer we can write a^(1) = x. Given the activations a^(l) of layer l, the activations of layer l+1 are computed as: z^(l+1) = W^(l) a^(l) + b^(l), a^(l+1) = f(z^(l+1)).
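A minimal forward-propagation sketch in Python (NumPy), assuming a hypothetical 3-3-1 network like the one in the figure; the weights below are random placeholders, not trained values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, weights, biases):
    """a(1) = x; then z(l+1) = W(l) a(l) + b(l) and a(l+1) = f(z(l+1))."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

# A 3-3-1 network like the one in the text, with random placeholder parameters.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 3)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal(3), rng.standard_normal(1)]
output = feedforward(np.array([1.0, 0.5, -0.5]), weights, biases)
print(output.shape)  # (1,): a single output unit, its value in (0, 1)
```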
Backpropagation Algorithm

Suppose we now have a fixed training set {(x(1), y(1)), (x(2), y(2)), …, (x(m), y(m))} of m samples, and we use batch gradient descent to train our network. For a single training sample (x, y), the loss function is: J(W, b; x, y) = (1/2) ||h_W,b(x) − y||².

Introducing regularization (a weight decay term) to keep the coefficients as small as possible, the total loss function becomes: J(W, b) = [(1/m) Σ_i J(W, b; x(i), y(i))] + (λ/2) Σ_l Σ_i Σ_j (W^(l)_ji)².

It is important to note that weight decay is not applied to the bias terms b, since applying it to b makes no significant difference to the network we end up with. The weight decay used here is actually a variant of Bayesian regularization; details can be found in the Stanford CS229 (machine learning) lecture videos. The loss function above

is often used in both classification and regression problems. In classification, y takes the values 0 or 1, which matches the range [0, 1] of our S function. If the tanh function is used instead, since its range is [−1, 1], we let −1 represent class 0 and +1 represent class 1, with 0 as the cutoff value. For regression problems we first scale our outputs proportionally into [0, 1], and then scale the final predictions back up.
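The regularized cost might be sketched as follows; `predict` here is a hypothetical forward-pass function supplied by the caller, and the decay term deliberately sums over the weight matrices only:

```python
import numpy as np

def total_cost(weights, biases, xs, ys, lam, predict):
    """Average squared-error term plus the weight-decay term.

    J(W, b) = (1/m) * sum_i 0.5 * ||h(x(i)) - y(i)||^2
              + (lam / 2) * sum of all squared weights
    The decay term skips the biases, as discussed above."""
    m = len(xs)
    data = sum(0.5 * np.sum((predict(x, weights, biases) - y) ** 2)
               for x, y in zip(xs, ys)) / m
    decay = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)
    return data + decay
```

With λ = 0 and perfect predictions the cost is zero; a positive λ adds a penalty proportional to the squared weights.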

Our goal is to minimize the loss function J(W, b). We first randomly initialize W and b; the initial parameters must not all be identical, or the symmetry between hidden units is never broken. Because gradient descent can converge to a local optimum, it also helps to run several trials with different random initializations. After computing the loss over all samples, W and b are updated as follows, where α is the learning rate: W^(l)_ij := W^(l)_ij − α ∂J(W, b)/∂W^(l)_ij, b^(l)_i := b^(l)_i − α ∂J(W, b)/∂b^(l)_i.

For this update rule, the key step is computing the partial derivatives, and we use the backpropagation algorithm to do so. Backpropagation helps explain how changes to the network's weights and biases change the cost function; in the end, it computes the partial derivatives ∂C/∂w^l_jk and ∂C/∂b^l_j. To compute these partial derivatives we first introduce an intermediate quantity δ^l_j, which we call the error of neuron j in layer l. We compute the gradient produced by each sample separately, and finally average the gradients over all samples:
The backpropagation derivation proceeds as follows:
First use forward propagation to compute the activations of layers 2, 3, …, up through the output layer.
Compute the error of each output unit in the output layer (applying the chain rule with a = f(z)): δ^(n_l)_i = ∂C/∂z^(n_l)_i = −(y_i − a^(n_l)_i) · f′(z^(n_l)_i).
For each hidden layer in between, compute the error of every neuron. The bracketed sum can be understood as follows: for each layer-(l+1) neuron connected to this layer-l neuron, take its error multiplied by the connecting weight; the sum of these terms, times f′, is this neuron's error: δ^(l)_i = (Σ_j W^(l)_ji δ^(l+1)_j) · f′(z^(l)_i).
The partial derivatives are then given by: ∂C/∂W^(l)_ij = a^(l)_j δ^(l+1)_i, ∂C/∂b^(l)_i = δ^(l+1)_i.
The pseudocode for training the neural network is as follows (the λW below is the weight-decay term):
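A hedged Python sketch of one batch-gradient-descent step (forward pass, output and hidden errors, averaged gradients, weight-decay term λW applied to the weights only); it follows the equations above, but the function names are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, weights, biases):
    """Gradients of the single-sample cost 0.5*||a - y||^2 via backpropagation."""
    # Forward pass, keeping every activation.
    a, activations = x, [x]
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
        activations.append(a)
    # Output-layer error: delta = -(y - a) * f'(z), with f'(z) = a*(1 - a).
    delta = (activations[-1] - y) * activations[-1] * (1 - activations[-1])
    grads_W = [None] * len(weights)
    grads_b = [None] * len(biases)
    grads_W[-1] = np.outer(delta, activations[-2])
    grads_b[-1] = delta
    # Propagate the error backwards through the hidden layers.
    for l in range(len(weights) - 2, -1, -1):
        fp = activations[l + 1] * (1 - activations[l + 1])  # sigmoid'(z)
        delta = (weights[l + 1].T @ delta) * fp
        grads_W[l] = np.outer(delta, activations[l])
        grads_b[l] = delta
    return grads_W, grads_b

def train_step(batch, weights, biases, alpha, lam):
    """One batch-gradient-descent update; weight decay lam skips the biases."""
    m = len(batch)
    acc_W = [np.zeros_like(W) for W in weights]
    acc_b = [np.zeros_like(b) for b in biases]
    for x, y in batch:
        gW, gb = backprop(x, y, weights, biases)
        for l in range(len(weights)):
            acc_W[l] += gW[l]
            acc_b[l] += gb[l]
    for l in range(len(weights)):
        weights[l] -= alpha * (acc_W[l] / m + lam * weights[l])
        biases[l] -= alpha * (acc_b[l] / m)
```

Repeated calls to `train_step` on a small batch should steadily lower the batch loss.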
The book Neural Networks and Deep Learning explains backpropagation this way:
For neuron j in layer l, if z^l_j changes to z^l_j + Δz^l_j, then the final overall loss changes by (∂C/∂z^l_j) · Δz^l_j. The goal of backpropagation is to find a Δz^l_j that makes the final loss smaller. If the value of ∂C/∂z^l_j is large (whether positive or negative), we can choose a Δz^l_j of the opposite sign to reduce the loss. If the value of ∂C/∂z^l_j is close to 0, then any change Δz^l_j has a negligible effect on the loss, indicating that this neuron is already close to optimal for the training. Heuristically, then, we can treat ∂C/∂z^l_j as a way of measuring the error of a neuron.
Inspired by the above, we can define the error of neuron j in layer l as: δ^l_j ≡ ∂C/∂z^l_j.
The equation for the output-layer error, which follows from the derivative chain rule ∂C/∂z = (∂C/∂a) · (∂a/∂z): δ^L_j = (∂C/∂a^L_j) · σ′(z^L_j).
The first term on the right-hand side measures how fast the cost changes with the output of neuron j. If C does not depend much on a particular neuron j, then δ^L_j will be small, which is what we want. The second term on the right-hand side measures how fast the activation function σ changes at z^L_j. With a quadratic cost function, ∂C/∂a^L_j = (a_j − y_j) is easy to compute.
Next we use the error δ^(l+1) of the next layer to express the error δ^l of the current layer; since the output layer's error is computed first, the other layers are computed backwards, propagating the error toward the front: δ^l = ((w^(l+1))ᵀ δ^(l+1)) ⊙ σ′(z^l).
where (w^(l+1))ᵀ is the transpose of the layer-(l+1) weight matrix w^(l+1). Suppose we know the error δ^(l+1) of layer l+1. Applying the transposed weight matrix (w^(l+1))ᵀ can be thought of intuitively as moving the error backwards along the network, giving us a measure of the error at the output of layer l; we then take the Hadamard product (element-wise multiplication) ⊙ σ′(z^l). This passes the error back through the activation function of layer l and gives the error vector δ^l of the weighted inputs in layer l.
The rate of change of the cost function with respect to any bias in the network: ∂C/∂b^l_j = δ^l_j.
That is, the error δ^l_j and the partial derivative ∂C/∂b^l_j are exactly equal for the same neuron.
The rate of change of the cost function with respect to any weight: ∂C/∂w^l_jk = a^(l−1)_k δ^l_j.
This tells us how to compute the partial derivative ∂C/∂w^l_jk in terms of δ^l and a^(l−1), both of which we already know how to calculate.
Written with less index clutter: ∂C/∂w = a_in · δ_out,
where a_in is the activation of the neuron feeding into the weight w, and δ_out is the error of the neuron that the weight w feeds into. When the input activation a_in is small, a_in ≈ 0, the gradient ∂C/∂w also tends to be very small. In this case we say the weight learns slowly: it will not change much during gradient descent. In other words, weights out of low-activation neurons learn very slowly.
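This slow-learning effect is easy to see numerically. A toy single-neuron sketch (quadratic cost, sigmoid activation; the specific numbers are arbitrary):

```python
import math

def grad_w(a_in, w, b, y):
    """dC/dw = a_in * delta_out for one sigmoid neuron, cost C = 0.5*(a - y)**2."""
    a = 1.0 / (1.0 + math.exp(-(w * a_in + b)))
    delta_out = (a - y) * a * (1.0 - a)  # output error of the neuron
    return a_in * delta_out

# With a near-zero input activation the gradient all but vanishes,
# so gradient descent barely moves this weight: it "learns slowly".
big = abs(grad_w(0.9, 2.0, 0.0, 1.0))
small = abs(grad_w(0.001, 2.0, 0.0, 1.0))
print(small < big)  # True
```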
When σ(z^L_j) is approximately 0 or 1, the σ function becomes very flat, so σ′(z^L_j) ≈ 0. Thus if an output neuron is at either a low activation (≈ 0) or a high activation (≈ 1), the weights into the final layer learn slowly. In this case we often say the output neuron is saturated and weight learning has stalled (or is very slow). In summary: a weight learns slowly if its input neuron has a low activation, or if its output neuron is saturated (activation too high or too low).
A proof of the second equation
The backpropagation process gives us a method for computing the gradient of the cost function:
Given a mini-batch of m samples, one step of the gradient-descent learning algorithm is applied based on that mini-batch:
A small change to a weight into neuron j of layer l causes a cascade of changes in activation values:
First, the change in w^l_jk causes a change Δa^l_j in the activation of neuron j in layer l.
The change in a^l_j then changes all of the activations in the next layer. We focus on how one of them is affected, say a^(l+1)_q:
This change Δa^(l+1)_q in turn propagates to the activations of the layer after it. In fact, we can imagine a path all the way from w^l_jk to C: each change in an activation value causes a change in the activations of the next layer, and finally a change in the cost at the output layer. If the sequence of activations along the path is a^l_j, a^(l+1)_q, …, a^(L−1)_n, a^L_m, then the resulting expression is:
We use this formula to compute the rate of change of C with respect to a weight in the network. It tells us that every edge between two neurons carries a rate-of-change factor: the partial derivative of one neuron's activation with respect to the previous neuron's activation. The factor from the weight itself to the first activation is ∂a^l_j/∂w^l_jk. The rate-of-change factor of a whole path is the product of the factors along it, and the total rate of change ∂C/∂w^l_jk is the sum of these factors over all possible paths from the initial weight to the final cost. For a particular path, the process looks like this:
Gradient Checking and Advanced Optimization
In this section we describe a numerical method for checking whether your derivative-computing code is correct; this gradient-check process will greatly increase your confidence in the code's correctness.
Suppose we want to minimize J(θ) using gradient descent: θ := θ − α (d/dθ) J(θ).
Now suppose we have a function g(θ) that is supposed to equal this derivative; how can we confirm that g(θ) is correct? Recall the definition of the derivative: dJ(θ)/dθ = lim_{ε→0} [J(θ + ε) − J(θ − ε)] / (2ε).
We can verify that the function g is correct by comparing the finite-difference estimate above with g(θ). Usually ε is set to a very small number, such as 10^−4. With ε = 10^−4, you will usually find that the two quantities agree to at least 4 significant digits (and often more).
Now consider the case where θ is a vector rather than a scalar, and define: θ^(i+) = θ + ε·e_i and θ^(i−) = θ − ε·e_i, where e_i is the i-th unit basis vector.
Here θ^(i+) is identical to θ except that its i-th component is increased by ε (and θ^(i−) has it decreased by ε).
We can then verify the correctness of each component g_i(θ) by checking, for every i, that: g_i(θ) ≈ [J(θ^(i+)) − J(θ^(i−))] / (2ε).
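A small Python sketch of this component-wise check, using a cost whose gradient is known in closed form (J(θ) = Σθ², so g(θ) = 2θ) as a stand-in for a backpropagation gradient:

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    """Estimate component i as (J(theta + eps*e_i) - J(theta - eps*e_i)) / (2*eps)."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (J(theta + e) - J(theta - e)) / (2.0 * eps)
    return grad

# Stand-in for the backprop gradient: J(theta) = sum(theta^2) has g(theta) = 2*theta.
theta = np.array([1.0, -2.0, 3.0])
analytic = 2.0 * theta
numeric = numerical_gradient(lambda t: np.sum(t ** 2), theta)
print(np.max(np.abs(analytic - numeric)) < 1e-6)  # True: g passes the check
```

In practice `J` would be the network's cost as a function of the unrolled parameter vector, and `analytic` would come from backpropagation.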
When using backpropagation to solve a neural network, a correct implementation yields the derivatives below, and we can use the method above to verify that they are correct:
Sparse Autoencoder