Drawing a learning curve is useful, for example, when you want to check whether your learning algorithm is working correctly, or when you want to improve its performance. The learning curve is also a good tool for diagnosing whether a learning algorithm suffers from a bias problem, a variance problem, or both.
To draw a learning curve, plot the average squared error on the training set (J_train) and the average squared error on the cross-validation set (J_cv) as functions of m, the number of training examples. m is normally a fixed number; for example, m = 100 means 100 training examples in total. But to draw the curve we deliberately restrict m to smaller values, say 10, 20, 30, or 40 examples, and plot the training error and the cross-validation error at each of those sizes.

Let's see what this curve looks like. Suppose there is only one training example, that is, m = 1, as shown in the first figure, and suppose we fit the model with a quadratic function. Since there is only one training example, the fit is obviously perfect: the error of fitting one example with a quadratic function is bound to be 0. With two training examples, the quadratic function still fits them well; even with regularization the fit is good, and without regularization it is essentially perfect. With three training examples, a quadratic function still fits nicely. That is, when m equals 1, 2, or 3, the training error on those examples will be 0, assuming no regularization is used; with regularization, the error is slightly greater than 0.

If the training set is large, you have to artificially limit how much of it is used. For example, set m to 3 and train on only those three examples. When measuring the training error for m = 3, we look only at the error of predicting those three examples, the same three examples used to fit the model, even if we actually have 100 training examples available; all other examples are simply ignored during training.

To summarize: when the training set size m is very small, the training error will be very small, because with a small training set it is easy to fit the data well, even perfectly. Now look at m = 4: the quadratic function still seems to fit the data nicely. At m = 5 the fit is a little worse but still passable. And as the training set grows larger and larger, it becomes harder and harder for the quadratic function to fit every example well.
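For reference, the two quantities being plotted can be written out as follows, using the course's usual notation (h_θ is the hypothesis, m the number of training examples used, and m_cv the size of the cross-validation set):

```latex
J_{\mathrm{train}}(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
\qquad
J_{\mathrm{cv}}(\theta) = \frac{1}{2m_{\mathrm{cv}}} \sum_{i=1}^{m_{\mathrm{cv}}} \left( h_\theta(x_{\mathrm{cv}}^{(i)}) - y_{\mathrm{cv}}^{(i)} \right)^2
```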
In fact, as the training set size grows, the average training error gradually increases. So if you plot this curve, you will find that the training error (the average error of the hypothesis's predictions) increases with m. To restate the intuition: when the training set is very small, every example can easily be fit well, so the training error is very small; conversely, as m grows, it becomes harder to fit every example well, so the training error grows larger and larger. What about the cross-validation error? The cross-validation error is the error when predicting on completely unseen cross-validation data. When the training set is very small, the hypothesis generalizes poorly, meaning it does not adapt well to new examples, so it is not a good hypothesis. Only with a larger training set can we obtain a hypothesis that fits the data better. Therefore, both the cross-validation error and the test error decrease as the training set size m increases, because the more data you use, the better the generalization performance, that is, the stronger the ability to adapt to new examples. So, if you plot J_train and J_cv, you should get curves like the following.
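As a minimal sketch of this plotting procedure (the helper name, the synthetic data, and the use of scikit-learn are illustrative assumptions, not the course's own code): for each size m, fit on the first m training examples, measure the training error on those m examples, and measure the cross-validation error on the full cross-validation set.

```python
# Minimal learning-curve sketch. Requires numpy, matplotlib, scikit-learn.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def learning_curve(model, X_train, y_train, X_cv, y_cv, sizes):
    """For each size m, fit on the first m training examples; record the
    training error on those m examples and the CV error on the full CV set."""
    j_train, j_cv = [], []
    for m in sizes:
        model.fit(X_train[:m], y_train[:m])
        j_train.append(mean_squared_error(y_train[:m], model.predict(X_train[:m])))
        j_cv.append(mean_squared_error(y_cv, model.predict(X_cv)))
    return j_train, j_cv

# Synthetic example: roughly linear data, fit with linear regression.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(120, 1))
y = 2.0 * X.ravel() + rng.normal(scale=1.0, size=120)
X_train, y_train, X_cv, y_cv = X[:80], y[:80], X[80:], y[80:]

sizes = list(range(1, 81, 5))
j_train, j_cv = learning_curve(LinearRegression(), X_train, y_train, X_cv, y_cv, sizes)
plt.plot(sizes, j_train, label="J_train")
plt.plot(sizes, j_cv, label="J_cv")
plt.xlabel("training set size m")
plt.ylabel("average squared error")
plt.legend()
plt.show()
```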
Now let's see what these learning curves look like in a high-bias or high-variance situation. Suppose our hypothesis suffers from high bias; to make the problem clearer, here is a simple example of fitting the data with a straight line (see figure). Clearly, a straight line does not fit this data well.
Now let's think about what happens if we increase the training set size. It is not hard to see that we will still obtain a similar straight line, and a straight line simply cannot fit this data well, no matter how much data we add.
So, if you plot the cross-validation error, it should look like the blue curve. At the far left, the training set size is very small, for example only one example, so performance is certainly poor. As you increase the number of training examples, once you reach a certain size you will have found the straight line that best fits the data, and even if you keep increasing the training set size, even if you keep increasing m, you will basically get the same straight line. Therefore, the cross-validation error, or the test error, quickly flattens out and stops changing: once the training set size reaches or exceeds that particular value, the cross-validation error and the test error become roughly constant, and you have the line that best fits the data.

What about the training error? Again, the training error is very small at first, and in the high-bias case you will find that the training error gradually increases and approaches the cross-validation error, because you have so few parameters: when m is large, there is so much data that the predictions on the training set and the cross-validation set become very close. That is the rough shape of the learning curves when your learning algorithm suffers from high bias.

Finally, note that the hallmark of the high-bias problem is that both the cross-validation error and the training error are large: you will end up with large values of both J_cv and J_train. This leads to a very interesting conclusion: if a learning algorithm has high bias, then as we add more training examples, that is, as we move right along the horizontal axis in this plot, we find that the cross-validation error does not drop significantly; it flattens out. Therefore, if the learning algorithm suffers from high bias, using more training data will not do much to improve performance. As the two figures on the right show, with only five training examples we find this line fit, and after adding more training examples we still get almost the same line. So if the learning algorithm has high bias, giving it more training data will not help: the cross-validation error or the test error will hardly decrease. This is why it is valuable to be able to see that an algorithm is in a high-bias situation: it saves you from wasting time collecting more training data, because no amount of additional data would help.
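As a hedged illustration of this high-bias pattern, the short sketch below (reusing the hypothetical learning_curve helper, rng, and LinearRegression import from the earlier sketch) fits a straight line to clearly quadratic data; both error curves should flatten out close together at a high value, so extra data barely moves J_cv.

```python
# High-bias sketch: a straight line fit to quadratic data.
# Expect J_train and J_cv to plateau near each other at a high error value.
X = rng.uniform(-3, 3, size=(120, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.3, size=120)  # quadratic target
X_train, y_train, X_cv, y_cv = X[:80], y[:80], X[80:], y[80:]

sizes = list(range(2, 81, 5))
j_train, j_cv = learning_curve(LinearRegression(), X_train, y_train, X_cv, y_cv, sizes)
print("final gap J_cv - J_train:", j_cv[-1] - j_train[-1])  # small gap, both errors large
```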
Let's now look at what the learning curves look like when the learning algorithm suffers from high variance. First, consider the training error. Suppose the training set size is very small, for example only five training examples as shown, and we fit with a very high-order polynomial, say a 100th-degree polynomial (of course no one would actually use this; it is just a demonstration). Suppose also that we use a very small value of lambda, perhaps not zero but small enough. Then obviously we will fit this data very, very well, so the hypothesis overfits the data. Thus, when the training set size is very small, the training error J_train will be very small.
As the training set size increases, the hypothesis may still fit the data more or less well, but it clearly becomes harder and more laborious to fit all the data well (as shown).
So, as the training set size increases, we find that J_train increases with it, because the more training examples we have, the harder it is to fit the training data perfectly; but overall the training error remains quite small. What about the cross-validation error? In the high-variance case, where the hypothesis overfits the data, the cross-validation error remains large even when we choose a reasonably large number of training examples, so the cross-validation error looks roughly like the purple curve in the figure. The most obvious feature of a high-variance algorithm, then, is a large gap between the training error and the cross-validation error. The plot also suggests what happens if we increase the number of training examples, that is, extend the curves to the right: the two curves, the blue one and the red one, approach each other. As we extend to the right, the training error is likely to keep increasing gradually, while the cross-validation error keeps decreasing. Of course, what we care most about is the cross-validation error or the test error. So from this picture we can predict that if we keep adding training examples and extend the curves to the right, the cross-validation error will gradually decrease. Therefore, in the high-variance case, using more training data is an effective way to improve the algorithm's performance. This also shows why it is very meaningful to know that your algorithm suffers from high variance: it tells you whether it is worth spending time collecting more training data.
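For contrast, here is a similar hedged sketch of the high-variance case: a high-degree polynomial with a very small regularization strength, with scikit-learn's Ridge alpha standing in for lambda (again a hypothetical setup, reusing the learning_curve helper and the quadratic data from the high-bias sketch).

```python
# High-variance sketch: a degree-15 polynomial with tiny regularization
# (Ridge alpha plays the role of lambda). Expect J_train to stay small
# while J_cv is much larger, with the gap narrowing slowly as m grows.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

model = make_pipeline(PolynomialFeatures(degree=15), Ridge(alpha=1e-6))
sizes = list(range(20, 81, 10))
j_train, j_cv = learning_curve(model, X_train, y_train, X_cv, y_cv, sizes)
print("final gap J_cv - J_train:", j_cv[-1] - j_train[-1])  # large gap signals high variance
```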
The learning curves described above are fairly idealized. For a real learning algorithm, drawing the learning curves will give broadly similar results, although sometimes the curves will show a bit of noise or interference. In general, though, drawing learning curves like these can genuinely help you see whether a learning algorithm suffers from high bias, high variance, or both. So when we set out to improve the performance of a learning algorithm, one of the things we usually do is draw these learning curves.
Stanford University open course, Machine Learning: Advice for Applying Machine Learning | Learning Curves (improving a learning algorithm: the relationship between high bias, high variance, and learning curves)