Study notes for Lesson 10 of Stanford University's Machine Learning course, "Advice for Applying Machine Learning". The lesson consists of seven parts:

1) Deciding what to try next

2) Evaluating a hypothesis

3) Model selection and training/validation/test sets

4) Diagnosing bias vs. variance

5) Regularization and bias/variance

6) Learning curves

7) Deciding what to try next (revisited)

The following is a detailed explanation of each part.

**1) Deciding what to try next**

Debugging a learning algorithm:

Suppose you have implemented regularized linear regression to predict house prices.
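Assuming the standard form of regularized linear regression (the symbols m for the number of training examples, n for the number of features, and h_θ for the hypothesis are the usual notation and are assumptions here), the cost function being minimized would be:

$$
J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]
$$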

However, when you test it on a new batch of houses, you find that its predictions are unacceptably inaccurate. What should you try next? Here are some options. We cannot justify them yet, but by the end of this lesson we will know the basis for choosing among them.

- Get more training examples

- Try a smaller set of features

- Try getting additional features

- Try adding polynomial features (combinations of existing features)

- Try decreasing λ

- Try increasing λ

A machine learning diagnostic is a test you can run to gain insight into what is and is not working in a learning algorithm, and to get guidance on how best to improve its performance.

Diagnostics take some time to implement, but doing so is usually a very good use of your time.

**2) Evaluating a hypothesis**

Suppose we have a hypothesis, defined over a given set of features, that fits the training data very well.

However, it predicts poorly on new data that is not in the training set; in other words, it fails to generalize. How should we evaluate such a hypothesis?

First, we split the dataset into two parts: one part (for example, 70%) serves as the training set, and the rest (for example, 30%) serves as the test set.
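As a minimal sketch of this split (the function name `train_test_split`, the fixed seed, and the toy house data are assumptions for illustration, not part of the lecture), the examples can be shuffled first so that any ordering in the data does not leak into the split:

```python
import random

def train_test_split(examples, train_fraction=0.7, seed=0):
    """Shuffle a copy of `examples` and split it into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # avoid an order-dependent split
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical data: (house size, price) pairs.
data = [(size, 100 + 2 * size) for size in range(10)]
train_set, test_set = train_test_split(data)
print(len(train_set), len(test_set))  # 7 3
```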

For linear regression:

- Learn the parameters θ by minimizing the training set error J(θ);

- Compute the test set error.
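Assuming the usual squared-error form (the notation m_test for the number of test examples is an assumption here), the test set error for linear regression would be:

$$
J_{test}(\theta) = \frac{1}{2m_{test}}\sum_{i=1}^{m_{test}}\left(h_\theta\left(x_{test}^{(i)}\right) - y_{test}^{(i)}\right)^2
$$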

For logistic regression, the procedure is similar:

- First, learn the parameters θ from the training set;

- Compute the test set error;

- Alternatively, use the misclassification error (also called the 0/1 error).
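Both test metrics for logistic regression can be sketched as follows. This is an illustration under stated assumptions: the test error is taken to be the average cross-entropy, predictions are thresholded at 0.5, and the parameters `theta` and the tiny test set are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, x):
    """h_theta(x) for logistic regression; x excludes the bias term."""
    return sigmoid(theta[0] + sum(t * xi for t, xi in zip(theta[1:], x)))

def logistic_test_error(theta, X_test, y_test):
    """Average cross-entropy (log loss) over the test set."""
    total = 0.0
    for x, y in zip(X_test, y_test):
        h = predict_proba(theta, x)
        total += -(y * math.log(h) + (1 - y) * math.log(1 - h))
    return total / len(y_test)

def misclassification_error(theta, X_test, y_test, threshold=0.5):
    """Fraction of test examples whose thresholded prediction is wrong (0/1 error)."""
    wrong = sum(
        1 for x, y in zip(X_test, y_test)
        if (predict_proba(theta, x) >= threshold) != (y == 1)
    )
    return wrong / len(y_test)

# Hypothetical parameters and a one-feature test set.
theta = [-1.0, 2.0]
X_test = [[0.0], [1.0], [2.0], [0.2]]
y_test = [0, 1, 1, 1]
print(misclassification_error(theta, X_test, y_test))  # 0.25 (only the last example is wrong)
print(logistic_test_error(theta, X_test, y_test))
```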

**3) Model selection and training/validation/test sets**

First, let's revisit the overfitting example above:

Once the parameters θ0, θ1, ..., θ4 have been fit to a particular dataset (the training set), the error computed on that same data (the training error J(θ)) is likely to be lower than the actual generalization error.

We therefore need to consider the model selection problem. As an example, consider choosing among polynomial regression models of degree 1 through 10.
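Writing d for the polynomial degree (a notation assumed here for convenience), the ten candidate hypotheses would be:

$$
\begin{aligned}
d = 1:&\quad h_\theta(x) = \theta_0 + \theta_1 x \\
d = 2:&\quad h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 \\
&\quad\ \vdots \\
d = 10:&\quad h_\theta(x) = \theta_0 + \theta_1 x + \cdots + \theta_{10} x^{10}
\end{aligned}
$$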

How do we select a model?

One approach is to fit the parameters of each model on the training set, compute each model's test set error, and choose the polynomial regression model with the smallest test set error. For example, suppose the fifth-order model (d = 5) is chosen.

How well does this model generalize? The test set error J_test(θ^(5)) would seem to measure its generalization ability, but is that estimate accurate?

We used the test set to select a parameter (the polynomial degree d) and then used the same test set to evaluate the hypothesis, so the estimate is optimistically biased toward the test set.

This is indeed a problem, so we introduce a third set, the cross-validation set, and use it to select the model instead of the test set. The test set is then reserved for evaluating the chosen hypothesis.

A typical way to partition the original dataset is 60% training set, 20% cross-validation set, and 20% test set.

With these three sets, we can define an error for each.
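Assuming the same squared-error form for all three sets (with m, m_cv, and m_test denoting the sizes of the training, cross-validation, and test sets), the errors would be:

$$
\begin{aligned}
J_{train}(\theta) &= \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta\left(x^{(i)}\right) - y^{(i)}\right)^2 \\
J_{cv}(\theta) &= \frac{1}{2m_{cv}}\sum_{i=1}^{m_{cv}}\left(h_\theta\left(x_{cv}^{(i)}\right) - y_{cv}^{(i)}\right)^2 \\
J_{test}(\theta) &= \frac{1}{2m_{test}}\sum_{i=1}^{m_{test}}\left(h_\theta\left(x_{test}^{(i)}\right) - y_{test}^{(i)}\right)^2
\end{aligned}
$$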

In practice, we learn the parameters of each model on the training set, compute each model's error on the cross-validation set, select the model with the smallest cross-validation error, and finally estimate that model's generalization error on the test set.
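The procedure can be sketched in plain Python as follows. Everything concrete here is an assumption for illustration: the noisy quadratic data, the helper names `fit_poly` and `error`, and the restriction to degrees 1 through 5 (the normal equations used for fitting become numerically fragile at high degrees).

```python
import random

def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations; returns theta_0..theta_d."""
    n = degree + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    for col in range(n):  # Gaussian elimination with partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    theta = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        theta[i] = (b[i] - sum(A[i][j] * theta[j] for j in range(i + 1, n))) / A[i][i]
    return theta

def error(theta, xs, ys):
    """Average squared error: 1/(2m) * sum((h(x) - y)^2)."""
    preds = [sum(t * x ** k for k, t in enumerate(theta)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / (2 * len(ys))

# Hypothetical data: a noisy quadratic, split 60/20/20.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(50)]
ys = [1 + 2 * x - 3 * x ** 2 + rng.gauss(0, 0.2) for x in xs]
x_train, y_train = xs[:30], ys[:30]
x_cv, y_cv = xs[30:40], ys[30:40]
x_test, y_test = xs[40:], ys[40:]

# Fit each candidate degree on the training set, score it on the CV set.
models, cv_errors = {}, {}
for d in range(1, 6):
    models[d] = fit_poly(x_train, y_train, d)
    cv_errors[d] = error(models[d], x_cv, y_cv)

best_d = min(cv_errors, key=cv_errors.get)
# Only now touch the test set, once, to estimate the generalization error.
print("chosen degree:", best_d)
print("J_test:", error(models[best_d], x_test, y_test))
```

Because the test set plays no part in choosing the degree, its error remains a fair estimate of how the selected model generalizes.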

**4) Diagnosing bias vs. variance**

First, let's look at examples of bias and variance. These are the same examples used in the regularization lesson, now labeled as bias or variance problems:

A) High bias (underfitting)

B) High variance (overfitting)

C) Just right

Next, we compute the training error and the cross-validation error of the three models.

We find that:

When the polynomial degree is d = 1 (high bias, underfitting), both the training error and the cross-validation error are large;

When d = 4 (high variance, overfitting), the training error is very small (the fit is very good), but the cross-validation error is large;

When d = 2 (just right), the training error is moderate, lying between the two cases above, and the cross-validation error is at its lowest.

Plotted against the polynomial degree d, the training error decreases steadily as d grows, while the cross-validation error first falls and then rises again, forming a U shape.

With this in mind, we can diagnose a bias or variance problem. If your learning algorithm is not performing as well as you hoped, how can you tell which problem it has? Compute the training error and the cross-validation error. If they fall in the left-hand region of the plot (small d), where both errors are high, it is a bias (underfitting) problem. If they fall in the right-hand region (large d), where the training error is low but the cross-validation error is much higher, it is a variance (overfitting) problem.

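Finally, we can summarize the bias and variance problems as follows: with high bias (underfitting), the training error is high and the cross-validation error is about equally high; with high variance (overfitting), the training error is low and the cross-validation error is much higher than the training error.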
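This rule of thumb can be written down as a small helper. It is only a sketch: the function name `diagnose` and both thresholds (`acceptable_error`, `gap_ratio`) are hypothetical choices that depend on the problem, not values given in the lecture.

```python
def diagnose(j_train, j_cv, acceptable_error, gap_ratio=2.0):
    """Rough bias/variance diagnosis from training and cross-validation error.

    `acceptable_error` and `gap_ratio` are hypothetical, user-chosen thresholds.
    """
    if j_train > acceptable_error:
        # The model cannot even fit the training data well: underfitting.
        return "high bias"
    if j_cv > gap_ratio * j_train and j_cv > acceptable_error:
        # The model fits the training data but fails to generalize: overfitting.
        return "high variance"
    return "ok"

print(diagnose(j_train=0.40, j_cv=0.45, acceptable_error=0.10))  # high bias
print(diagnose(j_train=0.01, j_cv=0.50, acceptable_error=0.10))  # high variance
print(diagnose(j_train=0.05, j_cv=0.06, acceptable_error=0.10))  # ok
```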

**5) Regularization and bias/variance**

Regularization is a very effective remedy for overfitting, so we now consider how regularization relates to bias and variance. Consider a regularized linear regression model.

If the regularization parameter λ is too large, for example λ = 10000 in an extreme case, all parameters except θ0 will be approximately 0, so the hypothesis is nearly a constant. This is underfitting, or high bias.

If λ is too small, with λ = 0 as the extreme case, the linear regression model is effectively not regularized, and overfitting (high variance) is very likely.

If λ takes a suitable value between the two, we get a good fit.

So how do we select the regularization parameter λ?

We again split the dataset into three parts: a training set, a cross-validation set, and a test set. For a given regularized model, such as the one in the example above, we try a sequence of λ values from small to large. For each value we learn the model parameters on the training set and compute the error on the cross-validation set. We then select the λ with the smallest cross-validation error and evaluate the resulting hypothesis on the test set.
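A sketch of this selection loop is shown below. The assumptions are: a degree-4 polynomial model, a doubling grid of λ values, made-up noisy data, θ0 left unpenalized, and training, cross-validation, and test errors all measured without the regularization term. The helper names are hypothetical.

```python
import random

def fit_regularized_poly(xs, ys, degree, lam):
    """Solve (X^T X + lam * diag(0, 1, ..., 1)) theta = X^T y; theta_0 is not penalized."""
    n = degree + 1
    A = [[sum(x ** (i + j) for x in xs) + (lam if i == j and i > 0 else 0.0)
          for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    for col in range(n):  # Gaussian elimination with partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    theta = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        theta[i] = (b[i] - sum(A[i][j] * theta[j] for j in range(i + 1, n))) / A[i][i]
    return theta

def error(theta, xs, ys):
    """Unregularized average squared error: 1/(2m) * sum((h(x) - y)^2)."""
    preds = [sum(t * x ** k for k, t in enumerate(theta)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / (2 * len(ys))

rng = random.Random(1)
def sample(n):
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [1 + 2 * x - 3 * x ** 2 + rng.gauss(0, 0.3) for x in xs]
    return xs, ys

x_train, y_train = sample(12)  # few training examples, so overfitting is possible
x_cv, y_cv = sample(20)
x_test, y_test = sample(20)

lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]
models, train_errors, cv_errors = {}, {}, {}
for lam in lambdas:
    models[lam] = fit_regularized_poly(x_train, y_train, 4, lam)
    train_errors[lam] = error(models[lam], x_train, y_train)
    cv_errors[lam] = error(models[lam], x_cv, y_cv)

best_lam = min(cv_errors, key=cv_errors.get)
print("chosen lambda:", best_lam)
print("J_test:", error(models[best_lam], x_test, y_test))
```

Printing `train_errors` alongside `cv_errors` shows the training error growing as λ increases, which is the high-bias end of the trade-off.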

Bias and variance can be viewed as a function of the regularization parameter λ. As in the previous part, we can plot the training error and the cross-validation error against λ to find a suitable range for λ.

**6) Learning curves**

This part introduces learning curves, which plot the training error and the cross-validation error against the number of training examples.

Consider how the number of training examples affects the fit, taking quadratic polynomial regression as an example. With only one training example, the model fits the point exactly, so the training error is 0, but the cross-validation error may be large. With two examples, the model can still fit them easily: the training error is still very small, and the cross-validation error may be somewhat smaller. As the number of examples grows, the model can no longer fit every point, so the training error increases, but generalization improves, so the cross-validation error decreases.

In other words, as the number of training examples m increases, the training error rises and the cross-validation error falls.

Next, we use learning curves to examine high bias and high variance. For a high-bias (underfitting) model, adding training examples does not fix the poor fit, and the learning curve reflects this.

We find that if a learning algorithm has high bias, both its training error and its cross-validation error level off at a high value once the number of training examples passes a certain point, and they barely change as more examples are added. Therefore, getting more training examples does not help much with a high-bias (underfitting) problem.
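To illustrate the plateau, the sketch below fits a straight line (deliberately too simple) to made-up quadratic data and records both errors for growing m. The data, the helper names, and the chosen values of m are assumptions for illustration.

```python
import random

def fit_line(xs, ys):
    """Closed-form simple linear regression: returns (theta0, theta1)."""
    m = len(xs)
    x_mean, y_mean = sum(xs) / m, sum(ys) / m
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    theta1 = sxy / sxx if sxx > 0 else 0.0  # a single point gets a flat line
    return y_mean - theta1 * x_mean, theta1

def error(theta, xs, ys):
    """Average squared error: 1/(2m) * sum((h(x) - y)^2)."""
    theta0, theta1 = theta
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

rng = random.Random(0)
def sample(n):
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [x ** 2 + rng.gauss(0, 0.05) for x in xs]  # quadratic target: a line underfits
    return xs, ys

x_train, y_train = sample(100)
x_cv, y_cv = sample(50)

curve = []
for m in (1, 2, 5, 10, 20, 50, 100):
    theta = fit_line(x_train[:m], y_train[:m])
    j_train = error(theta, x_train[:m], y_train[:m])  # error on the m examples used
    j_cv = error(theta, x_cv, y_cv)                   # error on the full CV set
    curve.append((m, j_train, j_cv))
    print(f"m={m:3d}  J_train={j_train:.4f}  J_cv={j_cv:.4f}")
```

With one or two examples the line fits exactly, so the training error is essentially zero. As m grows, the training error settles at a clearly nonzero value, which is the high-bias signature described above.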

Now consider the high-variance (overfitting) case.

As the number of training examples increases, the model generalizes better, and the learning curve of a high-variance model reflects this.

We find that if a learning algorithm has high variance, there is a large gap between its training error and its cross-validation error at a given number of training examples, but the gap narrows as more examples are added. Therefore, getting more training examples is one remedy for a high-variance (overfitting) problem.

**7) Deciding what to try next (revisited)**

Having covered bias and variance, let's return to the question posed at the beginning of this lesson.

Suppose you have implemented regularized linear regression to predict house prices, but when you test it on a new batch of houses the predictions are unacceptably inaccurate. What should you try next? Each of the options below addresses either high bias or high variance, so first use the diagnostics from the preceding parts to find out which problem your algorithm has. For each option, consider whether it fixes high bias or high variance. Think about it for a minute before looking at the answers:

- Get more training examples

- Try a smaller set of features

- Try getting additional features

- Try adding polynomial features (combinations of existing features)

- Try decreasing λ

- Try increasing λ

Answers:

- Get more training examples: fixes high variance

- Try a smaller set of features: fixes high variance

- Try getting additional features: fixes high bias

- Try adding polynomial features: fixes high bias

- Try decreasing λ: fixes high bias

- Try increasing λ: fixes high variance

Finally, consider neural networks and overfitting:

A "small" neural network has few parameters and is prone to underfitting. It is computationally cheap.

A "large" neural network has many parameters and is prone to overfitting. It is computationally expensive. Overfitting in a large neural network can be addressed with regularization (λ).

Source: http://52opencourse.com/217/coursera%E5%85%AC%E5%BC%80%E8%AF%BE%E7%AC%94%E8%AE%B0-%E6%96%AF%E5%9D%A6%E7%A6%8F%E5%A4%A7%E5%AD%A6%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0%E7%AC%AC%E5%8D%81%E8%AF%BE-%E5%BA%94%E7%94%A8%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0%E7%9A%84%E5%BB%BA%E8%AE%AE-advice-for-applying-machine-learning