The third lecture of Professor Geoffrey Hinton's Neural Networks for Machine Learning course mainly covers linear and logistic neurons and the backpropagation algorithm. The following is a tidied set of notes.
Learning the weights of a linear neuron
This section introduces the learning algorithm for linear neurons. A linear neuron is much like a perceptron, but with a key difference: in perceptron learning, the weight vector always gets closer to a good set of weights, whereas in the linear neuron, the output always gets closer to the target output. In the perceptron, every weight update moves the weight vector closer to each "generously feasible" weight vector, and this is exactly what prevents perceptron learning from being extended to more complex networks, since the average of two good weight vectors can be a bad one. Therefore, in multilayer neural networks we can use neither the perceptron learning procedure nor a similar argument to prove that learning works.
In multilayer neural networks, we judge whether learning is improving by checking whether the actual outputs are getting closer to the target outputs. This criterion still applies to non-convex problems, but it does not hold for perceptron learning. The simplest example is a linear neuron with squared error, also known as a linear filter. As shown, y is the neuron's estimate of the desired output, w is the weight vector, and x is the input vector; the goal of learning is to minimize the total error summed over all training samples.
It is straightforward to write down one equation per training sample and solve for the optimal weight vector analytically. That is the standard engineering approach, so why not use it? First, we want a method that a real neural network could plausibly use; second, we want a method that generalizes to multilayer, nonlinear networks. The analytic solution relies on the problem being linear with a squared error measure, whereas the iterative method generalizes far more readily to more complex models.
Here is a simple example to illustrate the iterative approach. Suppose you go to the cafeteria every day and order fish, chips, and ketchup for lunch. You buy the same things every day, but in different quantities, and the cashier only tells you the total price for that day. After a few days, you should be able to infer the price of each item. You first guess a set of prices at random, then iteratively revise them until they account for the observed totals. The calculation is given below:
For a given lunch, the portion of each item and the total cost are known; the red numbers in the figure are the true unit prices, which are unknown and are what we want to recover.
We first randomly guess a set of prices (50, 50, 50) and then compute the estimated price of the meal; as shown, the residual error is 350. The delta rule is then used to adjust the weight vector, where ε is the learning rate. With ε = 1/35, the weight changes are +20, +50, +30, giving a new weight vector (70, 100, 80).
The delta rule is given: Δwᵢ = ε · xᵢ · (t − y), where t is the target output and y the current output.
In fact, this is essentially the iterative least-squares update we learned in Andrew Ng's course. The weight vector obtained by iteration may never be perfect, but it should be a solution that makes the error small enough: if the learning rate is small enough and training runs long enough, the resulting weights get as close as we like to the optimal solution.
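The cafeteria example above can be sketched in a few lines of code. This is a minimal illustration of the delta rule, not the lecture's own code; the portions, total price, and learning rate follow the numbers quoted above (2 fish, 5 chips, 3 ketchup; total 850; ε = 1/35).

```python
# A minimal sketch of the delta rule on the cafeteria example:
# portions (x) and the total price (t) are observed, and the
# per-dish prices (the weights w) are what we want to recover.

def delta_rule_step(w, x, t, eps):
    """One delta-rule update: delta w_i = eps * x_i * (t - y)."""
    y = sum(wi * xi for wi, xi in zip(w, x))   # predicted total price
    w_new = [wi + eps * xi * (t - y) for wi, xi in zip(w, x)]
    return w_new, y

x = [2, 5, 3]      # portions of fish, chips, ketchup
t = 850            # observed total price
w = [50, 50, 50]   # initial random guess

w, y = delta_rule_step(w, x, t, eps=1/35)
# The first prediction is 500, so the residual is 850 - 500 = 350;
# the weight changes are +20, +50, +30, giving (70, 100, 80).
```

Repeating the step over many lunches with different portion vectors drives the guessed prices toward values consistent with every observed total.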
The connection between the online delta rule and perceptron learning is shown below.
The error surface for a linear neuron
In this section we look at the error surface to understand how a linear neuron learns. The error surface for a neuron with two weights is shown (in fact, two cross-sections are given: the vertical cross-sections are parabolas, the horizontal cross-sections are ellipses), and the whole surface is bowl-shaped. For multilayer or nonlinear neural networks, the error surface is more complex.
The convergence behavior of online learning and batch learning is also shown; more about the two can be found in Machine Learning Week 10 notes: Large-scale machine learning, so they are not expanded here.
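The contrast between the two learning regimes can be sketched as follows. This is an illustrative implementation, not the lecture's code, and the tiny training set is made up: batch learning accumulates the gradient over all cases before one update (moving perpendicular to the error contours), while online learning updates after every case (zig-zagging across them).

```python
# A sketch of batch vs. online (per-case) learning for a linear
# neuron with squared error, using a small invented training set.

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def batch_epoch(w, data, eps):
    # Batch: sum the gradient over all cases, then update once.
    grad = [0.0] * len(w)
    for x, t in data:
        err = t - predict(w, x)
        for i, xi in enumerate(x):
            grad[i] += xi * err
    return [wi + eps * g for wi, g in zip(w, grad)]

def online_epoch(w, data, eps):
    # Online: update immediately after each training case.
    for x, t in data:
        err = t - predict(w, x)
        w = [wi + eps * xi * err for wi, xi in zip(w, x)]
    return w
```

On a consistent (noise-free) data set both variants converge to the same weights; with noise, online learning with a fixed rate keeps hopping around the minimum.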
Learning the weights of a logistic output neuron
This section introduces logistic neurons. Only the slides are reproduced here without further elaboration, because a logistic neuron is essentially logistic regression.
The backpropagation algorithm
This section introduces the backpropagation algorithm, the key algorithm for training multilayer neural networks. Now that we understand logistic regression, let's look at how to learn the weights of the hidden units. Networks with hidden units are much more powerful, but designing their features by hand is too hard, and we would like a procedure that replaces manual feature design. Hand-designing features usually means guessing some features, measuring the error rate, and revising the features, over and over; we want the machine to carry out this loop itself.
One idea is to randomly perturb a weight and check whether the perturbation improves performance; if so, keep it. This can be seen as a form of reinforcement learning. But perturbing weights one at a time is very inefficient and wastes a great deal of work; backpropagation is much better.
We could instead perturb all the weights simultaneously and then measure whether the change improves performance, but this requires a lot of work to track the effect on every training sample. A better idea is to randomly perturb the hidden units: once we know what a hidden unit's activity should do on a given training sample, we can compute how to change the weights.
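The naive "perturb one weight, keep it if the error drops" procedure described above can be sketched as follows. The training data and step size here are invented for illustration; the point is that every single trial requires a full pass over the data, which is why backpropagation is so much more efficient.

```python
import random

# A sketch of learning by random weight perturbation: try a small
# random change to one weight and keep it only if total error drops.

def total_error(w, data):
    return sum((t - sum(wi * xi for wi, xi in zip(w, x))) ** 2
               for x, t in data)

def perturb_step(w, data, step=0.1, rng=random):
    i = rng.randrange(len(w))             # pick one weight at random
    delta = rng.choice([-step, step])     # random small perturbation
    trial = list(w)
    trial[i] += delta
    # Accept the perturbation only if it reduces the total error,
    # which costs a full evaluation pass over the data every time.
    return trial if total_error(trial, data) < total_error(w, data) else w
```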
The idea behind backpropagation is that we do not need to specify what the hidden units should do; instead, we can compute how fast the error changes as a hidden activity changes, and thereby obtain the error derivatives for all the hidden units simultaneously.
The following diagram gives a brief introduction to backpropagation, which is an application of the chain rule. To learn more about backpropagation, see the expert answers to: how to interpret the backpropagation algorithm intuitively?
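The chain-rule computation can be made concrete on the smallest possible network. This sketch (with made-up numbers, not the lecture's example) uses one input, one logistic hidden unit, and one linear output with squared error, and propagates the error derivative from the output back to both weights.

```python
import math

# Backpropagation by hand on a tiny network:
#   z = w1*x,  h = logistic(z),  y = w2*h,  E = (t - y)^2 / 2.

def forward_backward(x, t, w1, w2):
    # Forward pass.
    z = w1 * x
    h = 1.0 / (1.0 + math.exp(-z))   # hidden activity
    y = w2 * h                       # linear output
    E = 0.5 * (t - y) ** 2
    # Backward pass: chain rule from the error down to each weight.
    dE_dy = y - t
    dE_dw2 = dE_dy * h               # since dy/dw2 = h
    dE_dh = dE_dy * w2               # since dy/dh  = w2
    dE_dz = dE_dh * h * (1.0 - h)    # logistic derivative dh/dz = h(1-h)
    dE_dw1 = dE_dz * x               # since dz/dw1 = x
    return E, dE_dw1, dE_dw2
```

A useful sanity check is to compare these derivatives against finite differences of E, which agrees to several decimal places.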
Using the derivatives computed by backpropagation
Figuring out how to obtain the error derivatives of all the weights in a multilayer network is the key to efficient neural network learning. But many questions remain before we have a complete learning procedure: for example, how often should the weights be updated, and how do we prevent overfitting?
For a single training sample, backpropagation is an efficient method for computing the derivative of the error with respect to each weight, but it is not a complete learning algorithm; the remaining details, such as optimization and generalization, will be expanded in the sixth and seventh lectures, and only a brief overview is given here.
The following is a brief description of the optimization issue; for online vs. batch learning, see Machine Learning Week 10 notes: Large-scale machine learning.
The training data contains information about the regularities of the input-to-output mapping, but it also contains the following two kinds of noise:
- The target values may be unreliable (usually only a minor worry).
- There is sampling error: there will be accidental regularities just because of the particular training cases that were chosen.
When we fit a model, we cannot tell which regularities are real and which are caused by sampling error, so the model fits both kinds. If the model is flexible enough to fit the sampling error really well, that is a disaster. A simple example of overfitting is given.
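A comparable toy example can be written in code. The data here is invented (not from the slides): the true relation is y = 2x, observed with a deterministic "noise" pattern standing in for sampling error. A model that simply memorises the training points fits that noise perfectly and reaches zero training error, while a least-squares line does not, yet the line is the better model of the real regularity.

```python
# Overfitting in miniature: memorising the training set vs.
# fitting a least-squares line to it.

train = [(i / 10, 2 * (i / 10) + (0.3 if i % 2 == 0 else -0.3))
         for i in range(10)]                                   # trend + "noise"
test = [(i / 10 + 0.05, 2 * (i / 10 + 0.05)) for i in range(10)]  # noise-free

def fit_line(data):
    # Closed-form least squares for y = a*x + b.
    n = len(data)
    sx = sum(x for x, _ in data)
    sy = sum(y for _, y in data)
    sxx = sum(x * x for x, _ in data)
    sxy = sum(x * y for x, y in data)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return lambda x: a * x + b

def memoriser(data):
    # 1-nearest-neighbour: predict the y of the closest training x.
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]

def mse(model, data):
    return sum((y - model(x)) ** 2 for x, y in data) / len(data)

line, memo = fit_line(train), memoriser(train)
# memo has exactly zero training error (it fits the sampling noise),
# but it makes larger errors than the line on fresh test points.
```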
Several methods to avoid overfitting are listed below, which will be expanded in the seventh lecture.
- Weight-decay
- Weight-sharing
- Early stopping
- Model averaging
- Bayesian fitting of neural nets
- Dropout
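The first item in the list, weight decay, can be sketched as a small change to the gradient update. This is a generic illustration with made-up hyperparameters, not the lecture's formulation: a penalty (λ/2)Σwᵢ² is added to the squared error, so every gradient step also shrinks each weight toward zero.

```python
# One gradient step with weight decay for a linear neuron:
# cost  E = (t - y)^2 / 2  +  (lam / 2) * sum(w_i^2),
# so the update gains an extra  -eps * lam * w_i  shrinkage term.

def decay_step(w, x, t, eps, lam):
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + eps * (xi * (t - y) - lam * wi)
            for wi, xi in zip(w, x)]
```

With lam = 0 this reduces to the plain delta rule; with lam > 0, weights that the data does not support are steadily pulled toward zero, discouraging the model from fitting sampling error.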