Visual Machine Learning Reading Notes: BP Learning

The back-propagation (BP) algorithm is a supervised learning algorithm and an important method of artificial neural network learning; it is most often used to train feedforward multilayer perceptron networks.

1. The Principle of BP Learning

1.1 Feedforward Neural Networks

A feedforward network is one in which information can enter only through the input layer and is then passed forward layer by layer until it reaches the output layer; no loops exist in the network. The feedforward neural network is the typical layered structure among neural networks, and depending on the neuron transfer functions, the number of layers, the number of basic units, and the weight-adjustment scheme, feedforward networks with different functional characteristics can be formed. A feedforward neural network consists of an input layer, one or more middle (hidden) layers, and an output layer. The input layer performs no computation; it only represents the values of the elements of the input vector. The nodes of the hidden and output layers are all computing nodes, each applying a transfer function, written $f(x)$, which is usually nonlinear; commonly used examples are the sigmoid and tanh functions:

$$f(x) = \frac{1}{1 + e^{-x}} \quad \text{(sigmoid)}, \qquad f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \quad \text{(tanh)}$$

As shown in Figure 1, the graphs of the sigmoid and tanh functions are both S-shaped; the main difference is the range of function values: the sigmoid function has range (0, 1), while the tanh function has range (-1, 1).
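As a minimal sketch (our own Python, not from the book), the two transfer functions can be written with numpy:

```python
import numpy as np

def sigmoid(x):
    # S-shaped curve with range (0, 1), centered at x = 0.
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # S-shaped curve with range (-1, 1); numpy provides it directly.
    return np.tanh(x)

x = np.linspace(-5, 5, 11)
print(sigmoid(x))  # all values in (0, 1)
print(tanh(x))     # all values in (-1, 1)
```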

As shown in Figure 2, the nodes of adjacent layers are fully connected, and the connection strength is represented by weight parameters. The specific form of the output layer is determined by the task to be performed; in a multi-class classification problem, for example, each dimension of the output layer can represent the predicted confidence for the corresponding category. In a feedforward network, the number of layers and the node-connection weights together determine a specific network instance. Once the structure of a neural network is fixed, the training goal is to learn the connection weights between network nodes from the training samples, so that the neural network achieves good test performance when performing the same task on an unseen test data set.

1.2 Problem Formalization

Taking the common classification task as an example, the learning problem of a feedforward network can be formalized as follows. The dimensions and ranges of the network's inputs and outputs are determined by the practical application problem. As shown in Figure 2, the network input is $x = [x_1, x_2, \ldots, x_{n_i}]^T$, i.e. the input features of the problem are $n_i$-dimensional. In general, the input features should be normalized to a fixed interval, which improves the convergence speed of the network's training process. Assuming each sample belongs to one of $m$ possible classes, the network output $y$ can be represented as an $n_o$-dimensional class vector (with $n_o = m$), i.e. $y = [y_1, y_2, \ldots, y_{n_o}]^T$, where each dimension of the vector represents the confidence that the sample belongs to the corresponding category. The function of the whole network can be written $y = f(x)$, which corresponds to one forward computation of the network: predicting the class information of a sample from its input feature vector. Once the concrete structure of the feedforward network is fixed, that is, the number of layers, the number of nodes per layer, and the connections between layers, the training goal of the feedforward network is to determine the weights of the connections between different nodes: the value of every connection weight $w_{ij}^{(l)}$ in the network, where $w_{ij}^{(l)}$ denotes the weight between neuron $j$ of layer $l$ and neuron $i$ of layer $l+1$.

To determine all the connection weights of a network, the most common approach is to learn these weight parameters from a collection of samples called the training set. A training set containing $N$ samples can be represented as $X = \{x_n, t_n\}_{n=1}^{N}$, where $x_n$ is the $n_i$-dimensional input feature vector of the $n$-th sample and $t_n$ is its $n_o$-dimensional class vector, in which the dimension with value 1 marks the sample's category label. Define a loss function $L(\theta; x_n, t_n)$ that computes the loss of a particular set of model parameters on one sample, where $\theta$ denotes the set of all connection weights and offsets, i.e. $\theta = \{w_{ij}^{(l)}, b_{j}^{(l)}\}$; here $w_{ij}^{(l)}$ is the weight between node $j$ of layer $l$ and node $i$ of layer $l+1$, and $b_{j}^{(l)}$ is the offset of node $j$ of layer $l+1$.

To obtain the optimal values of the connection weights, the following optimization problem must be solved:

$$\min_{\theta} \; L(\theta) = \sum_{n=1}^{N} L(\theta; x_n, t_n)$$

The goal of this optimization problem is to find the specific parameter values that minimize the total loss of the loss function over the training set. Depending on the specific definition of the loss function, the solution process will differ; when the derivative of the loss function exists, the BP learning algorithm gives a general procedure for solving this optimization problem.

1.3 Algorithm Derivation

The BP learning algorithm is based on the very mature gradient descent algorithm from optimization theory. Starting from an initial value, it searches in the negative gradient direction of the objective function for a point whose function value is smaller than the current one, then repeats the process from the new point, until it finds the minimum value of the objective function within a local region and the corresponding position. The BP learning algorithm is a specific application of gradient descent to learning the parameters of a feedforward neural network: by using gradient descent to minimize $L(\theta)$, the value of $\theta$ corresponding to the minimum of $L(\theta)$ can be found, thus obtaining all the connection weights of the feedforward network.

The basic idea of training a feedforward network with the BP learning algorithm is as follows. First, the actual output for a training sample's input features is computed by a forward pass; this output is compared with the expected value to obtain an error signal. The connection strengths between the neurons of the network's layers are then adjusted, from back to front, according to this error signal. The forward computation is then repeated: the error shrinks, the new output is compared with the expected value, a new and smaller error signal is obtained, and the connection strengths are adjusted again from back to front according to it. This forward pass and backward pass are repeated continuously until the error meets the requirements.

The BP algorithm is derived below for the most common choice of loss function, the sum of squared errors.

The loss function $L(\theta; x_n, t_n)$ can then be expressed as $L(\theta; x_n, t_n) = \frac{1}{2}\|t_n - y_n\|^2$, where $y_n$ denotes the network's output when the input of the feedforward network is $x_n$. Accordingly, the target loss function $L(\theta)$ over the entire training set can be expressed as:

$$L(\theta) = \sum_{n=1}^{N} \frac{1}{2}\|t_n - y_n\|^2$$

The gradient of the objective function with respect to the parameters $\theta$ is:

$$\nabla L(\theta) = \frac{\partial L(\theta)}{\partial \theta} = \sum_{n=1}^{N} \frac{\partial L(\theta; x_n, t_n)}{\partial \theta}$$

Assuming the initial value of the parameters is $\theta_0$, gradient descent updates the parameters continuously as $\theta_{k+1} = \theta_k + \Delta\theta_k = \theta_k - \eta \nabla L(\theta_k)$, where $\Delta\theta_k = -\eta \nabla L(\theta_k)$ is the update amount of the $k$-th iteration and $\eta \in (0, 1)$ is the learning rate.
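A minimal sketch of this update loop in Python follows; `grad_L`, standing for whatever routine computes $\nabla L(\theta_k)$, is a hypothetical placeholder, and the quadratic example is our own:

```python
import numpy as np

def gradient_descent(theta0, grad_L, eta=0.1, num_iters=100):
    # theta0: initial parameter vector; grad_L: function returning the
    # gradient of the total loss at a given theta; eta: learning rate.
    theta = theta0.copy()
    for k in range(num_iters):
        delta_theta = -eta * grad_L(theta)  # update amount of iteration k
        theta = theta + delta_theta
    return theta

# Example: minimize L(theta) = ||theta||^2, whose gradient is 2 * theta.
theta_star = gradient_descent(np.array([3.0, -2.0]), lambda t: 2 * t)
print(theta_star)  # close to [0, 0]
```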

Based on the characteristics of feedforward networks, the BP learning algorithm skillfully uses the chain rule of differentiation to update the parameters layer by layer from back to front. First, the output of a training sample on the current network is computed by the forward pass; the output value is then compared with the expected output according to the network's loss function, yielding the network error; next, the network parameters are updated by propagating this error from back to front; and the process is repeated with the updated parameters until the training process converges.

2. Algorithm Improvements

The classical back-propagation algorithm has many shortcomings, and researchers have proposed many improvements concerning the learning rate, the training samples, the loss function, the connection scheme, and so on.

2.1 Improving the Learning Rate

Improving the learning rate is the most studied topic in back-propagation research; it raises the convergence rate of training and shortens training time. By adopting specific strategies during training, the learning rate $\eta$ of the network is changed appropriately so that the training process converges faster and more stably.

2.1.1 Adding a Momentum Term

During training, the loss function of a feedforward network often oscillates, which can keep the training process from converging. To reduce this problem, the oscillating loss curve can be smoothed to speed up training, and this smoothing effect can be achieved by designing a low-pass filter. When updating the network parameters, the update amount is computed as $\Delta\theta_k = \alpha \Delta\theta_{k-1} - (1 - \alpha)\eta \nabla L(\theta_k)$, where $\Delta\theta_{k-1}$ is the momentum term, representing the previously accumulated inertia of parameter adjustment, and $\alpha \in (0, 1)$ is the momentum factor.
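A sketch of this momentum update in Python, assuming illustrative values for $\alpha$ and $\eta$:

```python
import numpy as np

def momentum_step(theta, prev_delta, grad, eta=0.1, alpha=0.9):
    # prev_delta: the previous update (momentum term, accumulated inertia);
    # alpha in (0, 1): momentum factor; eta: learning rate.
    delta = alpha * prev_delta - (1 - alpha) * eta * grad
    return theta + delta, delta

# Usage: minimize ||theta||^2, whose gradient is 2 * theta.
theta, delta = np.array([3.0, -2.0]), np.zeros(2)
for _ in range(200):
    theta, delta = momentum_step(theta, delta, 2 * theta)
print(theta)  # close to [0, 0]
```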

2.1.2 Variable Learning Rate

The training error surface of a feedforward network can be very complex, and its shape varies across different regions of the parameter space. To accelerate the training process, the learning rate can be adjusted according to the shape of the error surface during learning, improving the convergence speed.

The difficulty of this method lies in deciding when and by how much to change the learning rate. The simplest strategy is to choose a relatively large learning rate at the beginning of training and to reduce it continuously as training proceeds; the underlying assumption is that the parameters are far from the optimum at the start of training and very close to it in the later stages.
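One simple realization of this strategy is an inverse-time decay of the learning rate (the schedule and the decay constant below are our own illustration, not prescribed by the book):

```python
def decayed_learning_rate(eta0, k, decay=0.01):
    # Start with a relatively large learning rate eta0 and shrink it
    # smoothly as the iteration count k grows.
    return eta0 / (1.0 + decay * k)

for k in (0, 100, 1000, 10000):
    print(k, decayed_learning_rate(0.5, k))
```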

2.1.3 Adding a Steepness Factor

The training error surface of a feedforward network may also contain flat regions, in which the parameter adjustments are very small and the training process is very slow. Observing the input-output characteristics of the nonlinear functions used by the neurons, it is not hard to see that when the input value is far from the center, the nonlinear function enters a saturated region: even large changes in the input produce only tiny changes in the output. The inputs in these saturated regions correspond to the flat regions of the training error surface. A steepness factor can therefore be introduced into these nonlinear functions to adjust their saturation regions, thereby reshaping the training loss function and helping the parameter adjustment escape the saturated regions.

For the sigmoid function, the steepness factor (denoted $\lambda$) can be introduced as follows: $s(x) = \dfrac{1}{1 + e^{-x/\lambda}}$
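A small sketch showing the effect of the steepness factor: for the same input, a larger $\lambda$ keeps the sigmoid out of its saturated region (values chosen purely for illustration):

```python
import numpy as np

def sigmoid_steep(x, lam=1.0):
    # lam > 1 stretches the sigmoid horizontally, widening the region in
    # which the derivative is non-negligible and postponing saturation.
    return 1.0 / (1.0 + np.exp(-x / lam))

print(sigmoid_steep(5.0))           # ~0.993: near-saturated for lam = 1
print(sigmoid_steep(5.0, lam=5.0))  # ~0.731: still in the sensitive region
```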

2.1.4 Using Numerical Optimization Techniques

To improve the convergence speed and stability of neural network training, numerical optimization algorithms can also replace the gradient descent step of the BP algorithm when training a feedforward network, such as Newton's method, the conjugate gradient method, and the Levenberg-Marquardt algorithm.

2.2 Improving the Training Samples

Training samples are very important for machine learning algorithms, and especially so for the back-propagation algorithm. Besides ensuring that the collected training samples are representative and correctly labeled, their number matters: because a feedforward network typically contains a large number of trainable parameters, a number of training samples commensurate with the number of network parameters is required to train a usable feedforward network. The number of training samples should be at least 10 times the number of network parameters, and the more training samples there are, the better the trained feedforward network model. Beyond ensuring the quality and quantity of the training samples, common ways to improve them include the following.

2.2.1 Designing Suitable Training Samples

The originally collected data can be used directly, or appropriate features can be extracted from it to form new training samples.

2.2.2 Properly Normalizing the Training Samples

Determine the sample dimensionality, normalize the input range to a specified interval, and set appropriate sample outputs.

2.2.3 Appropriately Perturbing the Training Samples

Perturbing existing samples generates new training samples, increasing the sample count and improving robustness to noise.
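A minimal sketch of such perturbation, assuming small Gaussian noise (the noise scale and copy count are illustrative):

```python
import numpy as np

def perturb_samples(X, copies=5, sigma=0.01, rng=None):
    # X: array of shape (N, n_i). Returns the original samples plus
    # `copies` noisy duplicates of each; labels are repeated by the caller.
    rng = rng or np.random.default_rng(0)
    noisy = [X + rng.normal(0.0, sigma, X.shape) for _ in range(copies)]
    return np.vstack([X] + noisy)

X = np.array([[0.2, 0.4], [0.6, 0.8]])
print(perturb_samples(X).shape)  # (12, 2): 2 originals + 5 * 2 perturbed
```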

2.2.4 Improving the Order and Number of Samples Used

Batch mode can be used, where all training samples are consumed at once; online mode, where only one training sample is used at a time; or block (mini-batch) mode, where part of the samples is used each time. Introducing appropriate randomness into the block selection can improve the convergence speed and stability of model training.
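The three modes can be captured by a single mini-batch iterator; a sketch (the names and shapes are our own choices):

```python
import numpy as np

def minibatches(X, T, batch_size, rng=None):
    # batch_size = N gives batch mode, 1 gives online mode, and anything
    # in between gives block (mini-batch) mode; shuffling supplies the
    # randomness that helps convergence.
    rng = rng or np.random.default_rng(0)
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], T[batch]

X = np.arange(10.0).reshape(5, 2); T = np.arange(5.0)
for xb, tb in minibatches(X, T, batch_size=2):
    print(xb.shape, tb.shape)  # blocks of 2 samples (last may be smaller)
```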

2.3 Improving the Loss Function

The sum of squared training errors is the simplest kind of loss function, and it has several potential problems. For example, it places no limit on the size of the training parameters: the loss can reach a very low value even when the parameters become very large or very small, but the resulting model is very unstable. The common remedy is to add a regularization term to the loss function that penalizes the parameter size, with a regularization factor controlling the importance of that term in the model; training with the regularized sum of squared errors yields a more stable model.
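A sketch of such a regularized sum-of-squares loss; `forward` is a hypothetical function computing the network output for one sample, and the L2 penalty with factor `reg` plays the role of the regularization term and factor above:

```python
import numpy as np

def regularized_loss(theta, X, T, forward, reg=1e-3):
    # Sum-of-squares training error plus an L2 penalty on the weights;
    # theta is a list of weight arrays, reg the regularization factor.
    errors = sum(0.5 * np.sum((t - forward(theta, x)) ** 2)
                 for x, t in zip(X, T))
    penalty = reg * sum(np.sum(w ** 2) for w in theta)
    return errors + penalty

# Tiny usage with a linear one-layer "network" as the forward function.
W = np.array([[0.5, -0.2]])
X = [np.array([1.0, 2.0]), np.array([0.0, 1.0])]
T = [np.array([0.1]), np.array([-0.2])]
print(regularized_loss([W], X, T, lambda th, x: th[0] @ x))
```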

In practical applications, an appropriate loss function can be designed for the specific problem; as long as the loss function is differentiable with respect to the training parameters, the back-propagation algorithm can be used to train the feedforward network.

Besides improving the form of the overall loss function of the feedforward network, the nonlinear functions of the neurons themselves can be improved: beyond the sigmoid and tanh functions, other forms of nonlinear function can be tried for the specific application problem, which may yield faster convergence and better model performance.
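For instance, the ReLU function, one widely used alternative that is not part of the derivation above, avoids saturation for positive inputs:

```python
import numpy as np

def relu(x):
    # max(0, x): zero gradient for x < 0, constant gradient 1 for x > 0.
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 3.]
```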

2.4 Improving the Connection Scheme

A fully connected feedforward network contains a very large number of connection parameters; training such a network not only requires a large number of training samples but is also very difficult and time-consuming. In practical applications, it is necessary to improve the connection scheme of the feedforward network and design network models that can describe the complex relationship between input and output with few connections, greatly reducing the number of training samples needed as well as the complexity of the model and the difficulty of training.

2.4.1 Local Connection Policy

Restricting each neuron to connect only to its neighboring neurons is a straightforward policy, based on the assumption that only neighboring nodes have a significant influence on a given node.

2.4.2 Weight Sharing Policy

On top of reducing the number of connections through local connectivity, the weight sharing policy further reduces the number of free parameters by constraining different neurons to use the same connection weights.
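A one-dimensional convolution is the classic combination of the local connection and weight sharing policies; a minimal sketch (the kernel and input are illustrative):

```python
import numpy as np

def conv1d(x, w, b=0.0):
    # Every output position reuses the same weights w (weight sharing)
    # and looks only at a local window of x (local connection).
    k = len(w)
    return np.array([x[i:i + k] @ w + b for i in range(len(x) - k + 1)])

print(conv1d(np.arange(6.0), np.array([1.0, -1.0])))  # [-1. -1. -1. -1. -1.]
```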

2.4.3 Sparse Connection Constraints

Sparse connection constraints restrict each neuron to connect to only a small number of other neurons; they have demonstrated superior results on many application problems.

3. Simulation Experiment

The main procedure for learning a feedforward neural network with the back-propagation algorithm is as follows:

Algorithm 1: Back-propagation algorithm

1. Compute the inputs $z^{(l)}$ and outputs $a^{(l)}$ of all hidden layers and the output layer by the forward pass;

2. For the output layer (layer $n_l$), compute $\Delta b^{(n_l-1)}$ and $\Delta W^{(n_l-1)}$:

$$\delta^{(n_l)} = -(t_n - a^{(n_l)}) \cdot f'(z^{(n_l)}), \qquad \Delta W^{(n_l-1)} = \delta^{(n_l)} \big(a^{(n_l-1)}\big)^T, \qquad \Delta b^{(n_l-1)} = \delta^{(n_l)}$$

3. For the hidden layers $l = n_l - 1, n_l - 2, \ldots, 2$, compute $\Delta b^{(l-1)}$ and $\Delta W^{(l-1)}$:

$$\delta^{(l)} = \big((W^{(l)})^T \delta^{(l+1)}\big) \cdot f'(z^{(l)}), \qquad \Delta W^{(l-1)} = \delta^{(l)} \big(a^{(l-1)}\big)^T, \qquad \Delta b^{(l-1)} = \delta^{(l)}$$

Algorithm 2: Feedforward neural network learning

Input: training sample set $\{x_n, t_n\}_{n=1}^{N}$, a feedforward neural network, and learning rate $\eta$;

Initialization: set the connection weights and offsets of all layers to small random values;

While not converged

Sample a training pair $(x_n, t_n)$ from the training set;

Use the back-propagation algorithm (Algorithm 1) to compute all $\Delta W^{(l)}$ and $\Delta b^{(l)}$, and update the parameters: $W^{(l)} \leftarrow W^{(l)} - \eta\, \Delta W^{(l)}$, $b^{(l)} \leftarrow b^{(l)} - \eta\, \Delta b^{(l)}$;

End while

Output: the network parameters $\{w_{ij}^{(l)}, b_{j}^{(l)}\}$.

To make the presentation more concise, the algorithm uses vectors and matrices to represent the computation, where the symbol "·" denotes the element-wise product between vectors; that is, if $a = b \cdot c$, then $a_i = b_i c_i$.

The definition of a nonlinear function is generalized to vector inputs, i.e. $f([z_1, z_2, \ldots, z_n]^T) = [f(z_1), f(z_2), \ldots, f(z_n)]^T$.

In the algorithm, the derivative of the sigmoid function is easy to derive: $f'(x) = f(x)\,(1 - f(x))$.

To apply the back-propagation algorithm to learning a feedforward neural network, the algorithm must be applied iteratively over the whole training set during training, continually updating the network parameters until convergence.
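Putting Algorithms 1 and 2 together, the following is a minimal runnable sketch of BP training for a one-hidden-layer network on the XOR problem; the architecture, hyperparameters, and the XOR data are our own illustrative choices, not from the book:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # f'(z) = f(z)(1 - f(z))

def train_bp(X, T, n_hidden=8, eta=0.5, epochs=5000, rng=None):
    # X: (N, n_i) inputs; T: (N, n_o) target class vectors. Online mode:
    # one sample per parameter update, as in Algorithm 2.
    rng = rng or np.random.default_rng(0)
    n_i, n_o = X.shape[1], T.shape[1]
    # Initialization: small random weights, zero offsets.
    W1 = rng.normal(0.0, 0.5, (n_hidden, n_i)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.5, (n_o, n_hidden)); b2 = np.zeros(n_o)
    for _ in range(epochs):
        for n in rng.permutation(len(X)):          # sample (x_n, t_n)
            x, t = X[n], T[n]
            # Forward pass: inputs z and outputs a of each layer.
            z1 = W1 @ x + b1; a1 = sigmoid(z1)
            z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
            # Backward pass: output-layer error, then hidden-layer error
            # via the chain rule ("*" is the element-wise product).
            d2 = -(t - a2) * sigmoid_prime(z2)
            d1 = (W2.T @ d2) * sigmoid_prime(z1)
            # Gradient-descent update of all weights and offsets.
            W2 -= eta * np.outer(d2, a1); b2 -= eta * d2
            W1 -= eta * np.outer(d1, x);  b1 -= eta * d1
    return W1, b1, W2, b2

# Example: learn XOR, the classic problem a single-layer network cannot solve.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W1, b1, W2, b2 = train_bp(X, T)
pred = sigmoid(W2 @ sigmoid(W1 @ X.T + b1[:, None]) + b2[:, None])
print(np.round(pred.T, 2))  # approximately [[0.] [1.] [1.] [0.]]
```

Online mode is used here (one sample per update); switching to batch or block mode would only change how the gradients are accumulated before each update.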

4. Characteristics of the Algorithm

The BP learning algorithm is the best-known algorithm for training feedforward neural networks. It rests on the mature gradient descent algorithm from optimization theory, ended the era of multi-layer networks lacking a training algorithm, has a strong mathematical foundation, and can in theory approximate arbitrarily complex functions, making it a very important learning algorithm.

In concrete practice, however, the back-propagation algorithm has many shortcomings.

(1) Local minima. Because the functions represented by feedforward networks are highly complex nonlinear functions, the back-propagation algorithm, as an algorithm for optimizing such functions, inevitably suffers from local minima: depending on the initial value chosen, the optimization process is likely to converge to a local minimum. The more complex the nonlinear function to be optimized, the more local minima it has, and the more likely the optimization algorithm is to converge to one of them.

(2) Long training time. When the back-propagation algorithm is used to train a feedforward network with many hidden layers, the optimization process is very slow and requires many iterations. The main reason is that back-propagation spreads the training error gradient from back to front: when the network has many hidden layers, the back-propagated gradients become very small, so the network parameters change very little in each iteration and the whole training process takes a long time.

(3) Instability. The back-propagation algorithm is unstable when training feedforward networks. Because the initial values, the learning-rate setting, and the order in which training samples are used differ between runs, the performance of the trained network can vary greatly from one run to another. Training a well-performing neural network therefore often requires trying different settings, and reproducing a training run requires recording not only the parameters but also the random initial weights generated at the start of training. In addition, the instability of feedforward networks is reflected in the fact that essentially the same network can have countless different sets of connection weights: for example, doubling all the connection weights of one layer of a feedforward network while halving those of another layer yields a network that is essentially the same as the original, yet their connection parameters differ greatly.

(4) Lack of unified, complete theoretical guidance. The design of the structure of a feedforward network trained by back-propagation lacks unified and complete theoretical guidance: the choice of the neurons' nonlinear functions, the number of hidden layers, the number of nodes per hidden layer, the connection weights, and so on must be tried and chosen in practical applications, and in many cases can only be set through extensive experimental experience.
