Neural Networks for Machine Learning: Lecture 10 Notes


This blog has migrated to Marcovaldo's blog (http://marcovaldong.github.io/).

The tenth lecture of Professor Geoffrey Hinton's course, Neural Networks for Machine Learning, describes how to combine models and also introduces the full Bayesian approach from a practical point of view.

Why it helps to combine models

In this section, we discuss why it helps to combine many models when making predictions. Using multiple models lets us strike a good compromise between fitting the true regularities in the data and fitting the sampling error of any particular training set.

We already know that when training data is limited, models tend to overfit; if we average the predictions of many different models, we can reduce that overfitting. For regression, a model with too little capacity has high bias, while a model with too much capacity fits the sampling error of the training set and so has high variance. By combining many different models we can achieve a better bias-variance trade-off.

Let's look at how an individual model compares with an average of models. On any single test case, some individual model will normally predict better than the combined model, but different individual predictors win on different cases. If the individual predictors disagree with one another strongly, then on average the combined predictor is clearly better than a typical individual one. So we should try to make the individual predictors differ as much as possible, making very different errors while each remaining reasonably accurate.

Now let's look at the mathematics behind combining networks. As the following figure shows, $\bar{y} = \langle y_i \rangle_i$ is the average over the different models of their predictions for the same input, where $\langle \cdot \rangle_i$ denotes averaging over the models, and $\langle (t - y_i)^2 \rangle_i$ is the mean squared error between the target output $t$ and the individual predictions $y_i$. Expanding the square, the cross term averages to zero (because $\langle y_i - \bar{y} \rangle_i = 0$) and can simply be dropped, leaving

$$\langle (t - y_i)^2 \rangle_i = (t - \bar{y})^2 + \langle (y_i - \bar{y})^2 \rangle_i.$$

So the mean squared error of the individual predictions equals the squared error of the combined prediction $\bar{y}$ plus a non-negative term, the spread of the individual predictions around their average. This means the combined prediction is expected to be at least as good as a typical individual prediction.
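We can check this decomposition numerically. Below is a minimal numpy sketch of my own (not code from the lecture):

```python
# Verify numerically that the mean squared error of the individual predictors
# equals the squared error of their average plus the spread of the predictors
# around that average.
import numpy as np

rng = np.random.default_rng(0)

t = 1.0                                  # target output
y = t + rng.normal(0.0, 0.5, size=10)    # predictions of 10 hypothetical models

y_bar = y.mean()                         # combined (averaged) prediction

lhs = np.mean((t - y) ** 2)                          # <(t - y_i)^2>_i
rhs = (t - y_bar) ** 2 + np.mean((y - y_bar) ** 2)   # (t - y_bar)^2 + <(y_i - y_bar)^2>_i

print(lhs, rhs)                          # the two quantities agree
assert np.isclose(lhs, rhs)
```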

The following figure is a schematic of how the individual predictions differ from the target output: red marks a bad prediction, whose distance from the target $t$ is much larger than the distance between $\bar{y}$ and $t$; green marks a good prediction, whose distance from $t$ is smaller than the distance between $\bar{y}$ and $t$. Because we use squared error, the bad predictions dominate the average error.

Next we do a small calculation: assume that $\bar{y}$ is equidistant from the good predictor and the bad predictor, which gives the equation shown in the figure above. But this relation does not always hold, and the main reason it holds here is that we are using squared error. If we switch to a different error measure, the relation is not necessarily true. The following figure shows an example.
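To make this concrete, here is a small worked example of my own (not taken from the slides). Suppose the two predictions lie symmetrically about the combined prediction, $y_{\text{good}} = \bar{y} - \epsilon$ and $y_{\text{bad}} = \bar{y} + \epsilon$. With squared error,

$$\tfrac{1}{2}\left[(t - \bar{y} + \epsilon)^2 + (t - \bar{y} - \epsilon)^2\right] = (t - \bar{y})^2 + \epsilon^2,$$

so the combined prediction beats the average individual squared error by exactly the spread term $\epsilon^2$. With absolute error, if both predictions lie on the same side of the target, the average individual error is $\tfrac{1}{2}\left(|t - \bar{y} + \epsilon| + |t - \bar{y} - \epsilon|\right) = |t - \bar{y}|$, which is exactly the error of the combined prediction, so combining gives no gain at all.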

The following figure shows several different ways of making the predictors differ from one another:

The following figure shows ways to get different models by training on different subsets of the training data; a small sketch of this idea follows.
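One standard way to do this is bagging: train the same kind of model on different bootstrap resamples of the training data and average their predictions. Below is a minimal numpy sketch of my own (not code from the lecture); the polynomial model and all the constants are purely illustrative:

```python
# Bagging sketch: fit the same high-degree polynomial on different bootstrap
# samples of a noisy 1-D regression problem, then average the predictions.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: noisy samples of a sine curve.
x_train = np.sort(rng.uniform(-3, 3, size=30))
y_train = np.sin(x_train) + rng.normal(0.0, 0.3, size=x_train.shape)
x_test = np.linspace(-3, 3, 200)

n_models, degree = 25, 7
predictions = []
for _ in range(n_models):
    # Each model sees a different bootstrap resample of the training set.
    idx = rng.integers(0, len(x_train), size=len(x_train))
    coeffs = np.polyfit(x_train[idx], y_train[idx], deg=degree)
    predictions.append(np.polyval(coeffs, x_test))

y_bagged = np.mean(predictions, axis=0)   # combined (averaged) prediction

# Compare one model and the averaged model against the noise-free function.
single_err = np.mean((np.sin(x_test) - predictions[0]) ** 2)
bagged_err = np.mean((np.sin(x_test) - y_bagged) ** 2)
print(f"single model MSE: {single_err:.3f}, bagged MSE: {bagged_err:.3f}")
```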

Mixtures of Experts

This section introduces the mixture of experts model. The idea is to train multiple neural networks (multiple "experts"), with each network specializing on a different part of the dataset. The assumption is that the dataset may come from several different regimes, that is, different parts of the data are generated in different ways, and data from different regimes can look quite different. So we want a dedicated neural network for each regime, and the model also has a managing (gating) neural net that decides which expert should handle a given input.

On small datasets this model may not perform very well, but as the dataset grows, its performance improves markedly. More importantly, a single model is often good at handling one part of the data and bad at handling another part (where it makes most of its errors), and a mixture of experts addresses this problem well: each neural network in the system, i.e., each expert, has a region of the data on which it does well, better than any of the other experts.

The following figure shows a comparison between very local models and a single fully global model.

The mixture of experts is a good compromise between a single global model and many purely local models, but the key problem we now face is how to partition the dataset into different parts. The following figure shows two ways of partitioning a dataset. If we cluster according to the input-output mapping, the data in the figure split into two groups, one belonging to the red parabola and one belonging to the green parabola. If we cluster based on the input alone, the figure is split into two classes by the blue line. The point of partitioning the training data here is that the input-output relation within each cluster can then be fit well by a local model.

Let's first introduce a loss function that makes the models cooperate. It uses the averaging idea described in the previous section, so that the jointly trained models, taken together, perform better than models trained individually.

The following figure shows why averaging models during training makes them cooperate. On the right of the figure is the average of the predictions of every model except model $i$ for a given input, in the middle is the target output $t$, and on the left is $y_i$, the prediction of model $i$. When $y_i$ is included in the overall average, the result moves a little closer to $t$, so $y_i$ provides a small correction. To pull the overall average still closer to $t$, $y_i$ would have to move further to the left, that is, further away from the target.

But what we really want is for $y_i$ itself to get closer and closer to the target $t$, and that is what makes the models specialize. The following figure shows a loss function that encourages specialization. The loss is an expectation over which model gets used, $E = \sum_i p_i\,(t - y_i)^2$, where $p_i$ is the probability of using model $i$ for this input.

This second type of loss function, the one that encourages specialization, is the one used in the mixture-of-experts system, shown in the figure below. The system contains a softmax gating network: for a given input it outputs a reliability (a weight) $p_i$ for each expert. The final loss of the system is the reliability-weighted squared difference between each expert's output and the target, $E = \sum_i p_i\,(t - y_i)^2$.
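Below is a minimal numpy sketch of my own showing this loss being computed (not code from the course; the experts and the gating network are reduced to single linear maps purely for illustration):

```python
# Mixture-of-experts forward pass and specialization loss on a single example.
import numpy as np

rng = np.random.default_rng(0)

n_experts, dim = 4, 3
W_experts = rng.normal(size=(n_experts, dim))   # one linear "expert" per row
W_gate = rng.normal(size=(n_experts, dim))      # gating (manager) network

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def moe_loss(x, t):
    y = W_experts @ x                # each expert's prediction y_i
    p = softmax(W_gate @ x)          # gating probabilities p_i for this input
    loss = np.sum(p * (t - y) ** 2)  # E = sum_i p_i (t - y_i)^2
    return loss, y, p

x = rng.normal(size=dim)             # a single input vector
t = 1.0                              # its target output
loss, y, p = moe_loss(x, t)
print("expert predictions:", y)
print("gating probabilities:", p)
print("mixture-of-experts loss:", loss)
```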

The following figure shows the two partial derivatives of this loss. The gradient given by the first derivative tells an expert how to correct its prediction: if the reliability $p_i$ that the gating network assigns to the expert is small, the gradient the expert receives is small, so it is barely disturbed by cases it is not responsible for.
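For reference, both derivatives can be derived directly from the loss (a short derivation sketch; $x_i$ denotes the gating network's logit for expert $i$, so that $p_i = e^{x_i} / \sum_j e^{x_j}$):

$$\frac{\partial E}{\partial y_i} = -2\,p_i\,(t - y_i), \qquad
\frac{\partial E}{\partial x_i} = \sum_j (t - y_j)^2\,p_j(\delta_{ij} - p_i) = p_i\left[(t - y_i)^2 - E\right].$$

So raising $x_i$ increases $p_i$ exactly when expert $i$'s squared error is below the current mixture loss $E$; in other words, the manager learns to route each case to the experts that handle it best.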
