10.1 Decide what to do next
10.2 Evaluating a hypothesis
10.3 Model selection and cross-validation sets
10.4 Diagnosing bias and variance
10.5 Regularization and bias/variance
10.6 Learning curves
10.7 Decide what to do next
10.1 Decide what to do next
So far, we've introduced many different learning algorithms, and if you've been following along with these videos, you may not realize it, but you've already become fairly expert in many advanced machine learning techniques.
However, there is still a big gap between people who merely understand machine learning and people who have truly mastered how to apply these learning algorithms efficiently and powerfully. Those who haven't may not fully understand how to deploy these algorithms, and so waste time on unpromising attempts. What I want to do is make sure you know how to choose the most appropriate path when designing a machine learning system. So, in this video and the next few, I'll give you some practical tips and guidance on how to make that choice. Specifically, the question I'll focus on is: if you're developing a machine learning system, or trying to improve its performance, how do you decide what to try next?

To explain this, let's return to the example of predicting house prices. Suppose you have implemented regularized linear regression by minimizing the cost function J, and, after obtaining your learned parameters, you test the hypothesis on a new set of house samples, only to find that its predictions have unacceptably large errors. Your problem is now to improve the algorithm: what should you do next?

There are actually a number of things you could try. One is to get more training samples; perhaps you could run a phone survey or a door-to-door survey to collect more house-sale data. Unfortunately, I've seen many people spend enormous amounts of time collecting more training samples, reasoning that with twice or even ten times as much training data the problem would surely be solved. But sometimes getting more training data simply doesn't help, and in the next few videos we'll see why.
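For reference, the cost function J mentioned here is the regularized linear regression objective from earlier chapters; the formula did not survive in these notes, so this is a reconstruction of the standard form:

$$J(\theta)=\frac{1}{2m}\left[\sum_{i=1}^{m}\big(h_\theta(x^{(i)})-y^{(i)}\big)^2+\lambda\sum_{j=1}^{n}\theta_j^2\right]$$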
We'll also learn how to avoid wasting too much time collecting more training data when it won't actually help. Another thing you might try is a smaller set of features: if you have a long list of features like x1, x2, x3, and so on, you could take the time to carefully select a small subset of them to prevent overfitting. Conversely, you may need more features; perhaps the current feature set isn't informative enough.
You might then want to collect more data in the sense of more features per example, which can again grow into a large project, such as a telephone survey to gather more details about each house, or a land survey to get more information about each lot, and so on; this too is a complex undertaking. Once again, we would really like to know whether it will work before we spend a lot of time on it. We could also try adding polynomial features, such as x1 squared, x2 squared, or the product of x1 and x2, and we could spend a long time weighing that option; or we could consider decreasing or increasing the regularization parameter λ.

Any of the methods on this list can expand into a project of six months or longer. Unfortunately, the criterion most people use to choose among them is gut feeling; that is, most people pick one of these methods more or less at random. Someone says, "Oh, let's go get more data," and then spends six months collecting it; someone else says, "Well, let's find more features in these houses' data." I'm sorry to say that more than once I've seen people spend at least six months on one of these randomly chosen paths, only to discover, six months or more later, that it led nowhere.

Fortunately, there are some simple techniques that let you do more with less: they can rule out at least half of the items on the list and leave only the genuinely promising ones. Using them can easily eliminate many of your options and save you a great deal of time you would otherwise waste, finally achieving the goal of improving your machine learning system's performance. So suppose we have trained a linear regression model to predict prices, and when we use it on unseen data we find a large error. What can we do next?
1. Get more training examples; this is often effective but costly, so since the following methods may also work, consider them first
2. Try a smaller set of features
3. Try getting additional features
4. Try adding polynomial features
5. Try decreasing the regularization parameter λ
6. Try increasing the regularization parameter λ
Rather than picking one of these methods at random to improve our algorithm, we can use machine learning diagnostics to find out which of them are actually likely to help.
In the next two videos, I'll first describe how to evaluate the performance of a machine learning algorithm, and then in a few subsequent videos, I'll start talking about these methods, which are also known as "machine learning diagnostics."
"Diagnostics" means: This is a test method that you can use to gain insight into the usefulness of an algorithm. This can also tell you that it is meaningful to try to improve the effectiveness of an algorithm. In this series of videos we will introduce specific diagnostic methods, but I would like to explain in advance that the implementation and implementation of these diagnostic methods will take some time, and sometimes it really takes a lot of time to understand and implement, but it is really to spend time on the edge, because these methods allow you to develop learning algorithms, Save a few months, so in the next few lessons, I'll show you how to evaluate your learning algorithm. After this, I will introduce some diagnostic methods, hoping to make you more aware. In the next attempt, how to choose a more meaningful approach.
10.2 Evaluating a hypothesis
In this video I want to show you how to evaluate a hypothesis function. In later lessons we will build on this to discuss how to avoid overfitting and underfitting.
When we fit the parameters of a learning algorithm, we choose the parameters that minimize the training error. Some people think that obtaining a very small training error must be a good thing, but we already know that a very small training error alone does not mean the hypothesis is a good one; we have seen examples of overfit hypothesis functions that fail to generalize to new data.
So, how do you determine whether a hypothesis function is overfit? For a simple example we could plot the hypothesis function h(x) and inspect its shape, but in the general case of more than one feature, and especially with many features, observing the hypothesis by plotting it becomes difficult or impossible.
Therefore, we need another way to evaluate a hypothesis for overfitting. To check whether the algorithm overfits, we divide the data into a training set and a test set, usually using 70% of the data as the training set and the remaining 30% as the test set. It is important that both the training set and the test set contain all types of data, so we usually "shuffle" the data before splitting it.
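A minimal sketch of this shuffle-and-split step, assuming the examples are stored in NumPy arrays X and y (the function and variable names here are illustrative, not from the course):

```python
import numpy as np

def train_test_split(X, y, train_ratio=0.7, seed=0):
    """Shuffle the examples, then split them 70% / 30% into a
    training set and a test set."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[0])  # shuffle so both sets contain all types of data
    split = int(X.shape[0] * train_ratio)
    train_idx, test_idx = perm[:split], perm[split:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```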
After the model has learned its parameters on the training set, we apply it to the test set. There are two ways to compute the error:
1. For a linear regression model, we compute the value of the cost function J on the test data.
2. For a logistic regression model, besides computing the cost function on the test data, we can also compute the misclassification error: for each test example, we record whether the prediction is wrong, and then average those results over the test set.
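The original notes show these error measures as formulas that did not survive extraction; reconstructed in the standard form used by the course, the linear regression test error is

$$J_{test}(\theta)=\frac{1}{2m_{test}}\sum_{i=1}^{m_{test}}\big(h_\theta(x_{test}^{(i)})-y_{test}^{(i)}\big)^2$$

and for logistic regression the per-example misclassification error is

$$err\big(h_\theta(x),y\big)=\begin{cases}1 & \text{if } h_\theta(x)\ge 0.5 \text{ and } y=0,\ \text{or } h_\theta(x)<0.5 \text{ and } y=1\\[2pt] 0 & \text{otherwise}\end{cases}$$

with the test error given by the average $\frac{1}{m_{test}}\sum_{i=1}^{m_{test}} err\big(h_\theta(x_{test}^{(i)}),y_{test}^{(i)}\big)$.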
10.3 Model selection and cross-validation sets
Let's say we are choosing among polynomial models of 10 different degrees:
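The candidate models appear as a formula in the original notes; reconstructed, they range from a straight line up to a degree-10 polynomial:

$$\begin{aligned} d=1:&\quad h_\theta(x)=\theta_0+\theta_1x\\ d=2:&\quad h_\theta(x)=\theta_0+\theta_1x+\theta_2x^2\\ &\ \ \vdots\\ d=10:&\quad h_\theta(x)=\theta_0+\theta_1x+\cdots+\theta_{10}x^{10} \end{aligned}$$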
Obviously, a higher-degree polynomial can fit our training data more closely, but fitting the training data well does not mean the model will generalize; we should choose the model that generalizes best.
We need to use a cross-validation set to help select the model.
That is: use 60% of the data as the training set, 20% as the cross-validation set, and 20% as the test set.
The model selection method is:
1. Train 10 models using the training set
2. Compute the cross-validation error (the value of the cost function) for each of the 10 models on the cross-validation set
3. Select the model with the lowest cross-validation error
4. Use the model selected in step 3 to compute the generalization error (the value of the cost function) on the test set, as sketched below.
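A minimal sketch of these four steps for the one-feature house-price setting, assuming NumPy arrays and a plain least-squares fit (all helper names are illustrative):

```python
import numpy as np

def poly_features(x, d):
    """Map a 1-D input array to the columns [x, x^2, ..., x^d]."""
    return np.column_stack([x ** p for p in range(1, d + 1)])

def fit_linear(X, y):
    """Ordinary least-squares fit with an intercept term."""
    A = np.column_stack([np.ones(len(X)), X])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta

def cost(theta, X, y):
    """Squared-error cost J, used for all three data sets."""
    A = np.column_stack([np.ones(len(X)), X])
    err = A @ theta - y
    return (err @ err) / (2 * len(y))

def select_degree(x_tr, y_tr, x_cv, y_cv, x_te, y_te, max_d=10):
    # Steps 1-2: train one model per degree, score each on the CV set.
    cv_errors = []
    for d in range(1, max_d + 1):
        theta = fit_linear(poly_features(x_tr, d), y_tr)
        cv_errors.append(cost(theta, poly_features(x_cv, d), y_cv))
    # Step 3: keep the degree with the lowest cross-validation error.
    best_d = int(np.argmin(cv_errors)) + 1
    # Step 4: report the generalization error on the held-out test set.
    theta = fit_linear(poly_features(x_tr, best_d), y_tr)
    return best_d, cost(theta, poly_features(x_te, best_d), y_te)
```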
10.4 Diagnosing bias and variance
When you run a learning algorithm and its performance is disappointing, it is almost always for one of two reasons: either the bias is high or the variance is high. In other words, the problem is either underfitting or overfitting. Knowing which of the two is occurring, or whether both are, matters a great deal, because it is a very effective guide to the most promising ways of improving the algorithm. In this video I'd like to look more deeply at the question of bias and variance, so that you understand them better and can determine, when evaluating a learning algorithm, whether it suffers from a bias problem or a variance problem; this is essential to figuring out how to improve it. High bias corresponds to underfitting, and high variance corresponds to overfitting.
To help the analysis, we usually plot the training-set error and the cross-validation-set error against the polynomial degree d on the same chart:
For the training set: when d is small, the model underfits and the error is large; as d increases, the fit improves and the error decreases.

For the cross-validation set: when d is small, the model underfits and the error is large; as d grows, the error first decreases and then increases, and the turning point is where the model begins to overfit the training data.
If our cross-validation error is large, how can we tell whether the problem is bias or variance? From the chart above, we know:
Training error and cross-validation error are both high and close to each other: high bias / underfitting

Cross-validation error is much larger than training error: high variance / overfitting
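Written as a rule of thumb in the notation of the cost functions above (a reconstruction; the original notes show this as a chart):

$$\text{High bias / underfitting:}\quad J_{train}(\theta)\ \text{large},\qquad J_{cv}(\theta)\approx J_{train}(\theta)$$

$$\text{High variance / overfitting:}\quad J_{train}(\theta)\ \text{small},\qquad J_{cv}(\theta)\gg J_{train}(\theta)$$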
10.5 Regularization and bias/variance
When training a model, we usually apply some form of regularization to prevent overfitting.
But we may regularize too much or too little; that is, choosing the value of λ requires the same care as choosing the degree of the polynomial model in the previous section.
We choose a series of λ values to try, usually values between 0 and 10 that roughly double at each step (for example, the 12 values 0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10).
We also divide the data into training sets, cross-validation sets, and test sets.
The method for selecting λ is:
1. Use the training set to train 12 models, one for each degree of regularization
2. Compute the cross-validation error for each of the 12 models on the cross-validation set
3. Select the model with the lowest cross-validation error
4. Use the model chosen in step 3 to compute the generalization error on the test set (a code sketch of this procedure follows below)

We can also plot the training-set and cross-validation-set errors against λ on one chart:
• When λ is small, the training error is small (overfitting) and the cross-validation error is large
• As λ increases, the training error keeps increasing (toward underfitting), while the cross-validation error first decreases and then increases
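The sketch promised above: a minimal illustration of the λ search, again assuming NumPy arrays and a regularized normal-equation fit (helper names are illustrative; note that the errors used for comparison are computed without the λ term):

```python
import numpy as np

LAMBDAS = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10]

def fit_ridge(X, y, lam):
    """Regularized least squares via the normal equation;
    the intercept is conventionally not penalized."""
    A = np.column_stack([np.ones(len(X)), X])
    R = lam * np.eye(A.shape[1])
    R[0, 0] = 0.0
    return np.linalg.solve(A.T @ A + R, A.T @ y)

def unregularized_cost(theta, X, y):
    """Errors are compared WITHOUT the lambda term, for every data set."""
    A = np.column_stack([np.ones(len(X)), X])
    err = A @ theta - y
    return (err @ err) / (2 * len(y))

def select_lambda(X_tr, y_tr, X_cv, y_cv):
    # Steps 1-2: train one model per lambda, score each on the CV set.
    thetas = [fit_ridge(X_tr, y_tr, lam) for lam in LAMBDAS]
    cv_err = [unregularized_cost(t, X_cv, y_cv) for t in thetas]
    # Step 3: keep the lambda with the lowest cross-validation error.
    best = int(np.argmin(cv_err))
    return LAMBDAS[best], thetas[best]
```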
10.6 Learning curves
Learning curves are a good tool; I often use them to determine whether a learning algorithm has a bias problem or a variance problem. The learning curve is a useful sanity check for a learning algorithm: a chart that plots the training error and the cross-validation error as functions of the number of training examples, m.

That is, if we have 100 examples, we start by training on just 1 and then gradually train on more. The idea is that when the training set is small, the trained model can fit the few training examples perfectly, but it will not fit the cross-validation or test data well.
How to use the learning curve to identify high bias / underfitting: as an example, suppose we try to fit the data with a straight line. We can see that no matter how much training data we add, the error stays large and scarcely changes:

In other words, in the high-bias / underfitting case, adding more data to the training set is not likely to help.
How to use the learning curve to identify high variance / overfitting: suppose we use a very high-degree polynomial model with very little regularization.

We can see that when the cross-validation error is much larger than the training error, adding more data to the training set improves the model.

In other words, in the high-variance / overfitting case, adding more data to the training set is likely to improve the algorithm.
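A minimal sketch of how a learning curve is computed, assuming a linear model and the same style of illustrative NumPy helpers as before; plotting the two returned lists against m gives the curves discussed above:

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least-squares fit with an intercept term."""
    A = np.column_stack([np.ones(len(X)), X])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta

def cost(theta, X, y):
    """Squared-error cost J."""
    A = np.column_stack([np.ones(len(X)), X])
    err = A @ theta - y
    return (err @ err) / (2 * len(y))

def learning_curve(X_tr, y_tr, X_cv, y_cv):
    """For m = 1..len(training set): train on the first m examples,
    record the training error on those m examples and the CV error
    on the full cross-validation set."""
    train_err, cv_err = [], []
    for m in range(1, len(y_tr) + 1):
        theta = fit_linear(X_tr[:m], y_tr[:m])
        train_err.append(cost(theta, X_tr[:m], y_tr[:m]))
        cv_err.append(cost(theta, X_cv, y_cv))
    return train_err, cv_err
```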
Personal summary: as the number of features increases, the model fits the training set more closely and the training error falls, but overfitting becomes more likely and the variance grows; decreasing the regularization has the same effect.
10.7 Decide what to do next
We have now seen how to evaluate a learning algorithm, and we have discussed model selection, bias, and variance. So how do these rules help us determine which methods are likely to improve a learning algorithm, and which would be futile?
Let's go back to the original example and look for the answer there; this is what we have been building toward. Revisiting the six candidate next steps presented in 10.1, let's see under which circumstances each should be chosen:
1. Get more training examples: fixes high variance
2. Try a smaller set of features: fixes high variance
3. Try getting additional features: fixes high bias
4. Try adding polynomial features: fixes high bias
5. Try decreasing the regularization parameter λ: fixes high bias
6. Try increasing the regularization parameter λ: fixes high variance
Bias and variance in neural networks:
Using a smaller neural network, which amounts to having fewer parameters, makes the model prone to high bias and underfitting, but the computational cost is low. Using a larger neural network, which amounts to having more parameters, makes the model prone to high variance and overfitting; the computational cost is higher, but the overfitting can usually be addressed through regularization, making the larger network adapt better to the data.
It is usually better to choose a larger neural network and use regularization to control it than to use a smaller neural network.
As for the number of hidden layers, we usually start with one layer and increase the count gradually. To make a better choice, we can split the data into training, cross-validation, and test sets, train neural networks with different numbers of hidden layers, and then select the network with the lowest cross-validation cost, as sketched below.
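A minimal sketch of that selection loop using scikit-learn's MLPRegressor; the course itself works in Octave, so this library choice, the candidate layer widths, and the alpha value are all illustrative assumptions:

```python
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

def select_architecture(X_tr, y_tr, X_cv, y_cv):
    """Train networks with 1, 2, and 3 hidden layers and keep the
    one with the lowest cross-validation error."""
    candidates = [(25,), (25, 25), (25, 25, 25)]  # hidden layer sizes to try
    best_model, best_err = None, float("inf")
    for layers in candidates:
        model = MLPRegressor(hidden_layer_sizes=layers,
                             alpha=0.01,        # L2 regularization strength
                             max_iter=2000,
                             random_state=0)
        model.fit(X_tr, y_tr)
        err = mean_squared_error(y_cv, model.predict(X_cv))
        if err < best_err:
            best_model, best_err = model, err
    return best_model
```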
Okay, that concludes our introduction to the bias and variance problem and to the learning-curve method for diagnosing it. When improving a learning algorithm's performance, you can use these tools to determine which avenues are likely to help and which are likely to be meaningless. If you understand the material in these videos and know how to apply it, then you can use machine learning to solve practical problems effectively, just like most machine learning practitioners in Silicon Valley, whose daily work is applying these learning algorithms to many real problems. I hope the techniques covered in these sections, such as bias, variance, and learning curves, really do help you apply machine learning more efficiently and make it work well.
Stanford Machine Learning, Lesson 10: Advice for Applying Machine Learning