Course introduction:
After reviewing the VC analysis, this section turns to another theory for understanding generalization: bias and variance. The learning curve is then used to compare the VC analysis with the bias-variance tradeoff.
Course outline: 1. The bias-variance tradeoff 2. The learning curve
1. The bias-variance tradeoff

In the previous lesson on the VC dimension we obtained a bound on the out-of-sample error: E_out ≤ E_in + Ω. That formula bounds E_out; now let us analyze E_out from a different angle, by splitting it into two parts:

1. How well H can approximate f (the error between f and the hypothesis in H that is closest to f).
2. How well we can actually find that hypothesis within H (the error between the g we actually find and the best hypothesis in H).

The error is measured with the squared error. These two goals usually conflict. For some h to approximate f better, H must be more complex and contain more hypotheses, so that H is more likely to come close to f or even contain it (in which case the best hypothesis is f itself). But as H grows more complex, the g we find is more likely to end up far from that best hypothesis: searching a larger set is harder, and when the data are poor, g may land far away. Bias and variance describe exactly these two effects and let us look for a balance point, so that E_out stays close to E_in and E_out itself is small enough.

Because the g we obtain depends on the particular dataset, write g^(D)(x) for the value at x of the hypothesis learned from a specific dataset D. Then

E_out(g^(D)) = E_x[(g^(D)(x) − f(x))^2].

To study the general behavior, independent of any specific D, we remove D by taking the expectation over D:

E_D[E_out(g^(D))] = E_D[E_x[(g^(D)(x) − f(x))^2]] = E_x[E_D[(g^(D)(x) − f(x))^2]].

Now work on E_D[(g^(D)(x) − f(x))^2]. Define ḡ(x) = E_D[g^(D)(x)], the expectation of the learned hypothesis over datasets; we can think of ḡ as (roughly) the hypothesis in H closest to f. Expanding,

E_D[(g^(D)(x) − f(x))^2]
= E_D[(g^(D)(x) − ḡ(x) + ḡ(x) − f(x))^2]
= E_D[(g^(D)(x) − ḡ(x))^2] + (ḡ(x) − f(x))^2 + E_D[2(g^(D)(x) − ḡ(x))(ḡ(x) − f(x))]
= E_D[(g^(D)(x) − ḡ(x))^2] + (ḡ(x) − f(x))^2.
Why does the last (cross) term vanish? Because

E_D[2(g^(D)(x) − ḡ(x))(ḡ(x) − f(x))]
= 2(ḡ(x) − f(x)) E_D[g^(D)(x) − ḡ(x)]
= 2(ḡ(x) − f(x)) (E_D[g^(D)(x)] − E_D[ḡ(x)])
= 2(ḡ(x) − f(x)) (E_D[g^(D)(x)] − ḡ(x))
= 2(ḡ(x) − f(x)) (ḡ(x) − ḡ(x))
= 0.

Here E_D[(g^(D)(x) − ḡ(x))^2] is the variance and (ḡ(x) − f(x))^2 is the bias. In other words, the variance is the average distance between the hypotheses we actually find and the mean hypothesis ḡ, while the bias is the distance between that best hypothesis and the true target f.

For a simple hypothesis set such as H = {h(x) = ax}, the variance is usually small but the bias is large. Conversely, for a very complex model such as H = {h(x) = a x^100 + b x^99 + … + z x}, the variance grows while the bias becomes smaller. We want to find a balance point between the two, so we should choose the model according to the amount of data. With only a small amount of data we should pick a simpler model, otherwise the variance may be large; with a large amount of data we can choose a more complex model, because plenty of data drives the variance down while the complex model keeps the bias small.

2. The learning curve

The learning curve shows how E_out and E_in change with the number of samples N. Consider the learning curves of two models: the left plot corresponds to a simple model and the right one to a complex model. Clearly, when the model is simple, both E_out and E_in start out large, but as N increases the gap between E_out and E_in becomes very small. Conversely, when the model is very complex, E_in can even be 0 at the beginning (recall the VC picture: when N is below the VC dimension we can separate the sample perfectly without any error), yet for small N the out-of-sample error E_out is very large, because so few points are not enough to support such a complicated model. As N increases, the gap between E_out and E_in stays larger than for the simple model. Still, our goal is a small E_out, and once we have a large amount of data the E_out of the complex model is smaller than that of the simple model, so with enough data we should prefer the complex model. A minimal simulation of these two learning-curve regimes is sketched below.
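As a quick illustrative check of this picture, here is a small Python sketch (my own, not from the lecture): it estimates the average E_in and E_out over many datasets for a simple model (a line) and a complex model (a degree-10 polynomial) on an assumed noisy sinusoidal target. The target function, noise level, and polynomial degrees are arbitrary choices made only for illustration.

import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # Assumed target function, chosen only for illustration.
    return np.sin(np.pi * x)

def learning_curve_point(N, degree, trials=500, noise=0.1, n_test=200):
    """Average E_in and E_out of least-squares polynomial fits of the given degree."""
    e_in, e_out = [], []
    x_test = np.linspace(-1, 1, n_test)
    y_test = target(x_test)                       # E_out is measured against the noiseless target f
    for _ in range(trials):
        x = rng.uniform(-1, 1, N)
        y = target(x) + noise * rng.standard_normal(N)
        d = min(degree, N - 1)                    # guard so the fit is never underdetermined
        coef = np.polyfit(x, y, d)
        e_in.append(np.mean((np.polyval(coef, x) - y) ** 2))
        e_out.append(np.mean((np.polyval(coef, x_test) - y_test) ** 2))
    return np.mean(e_in), np.mean(e_out)

for N in (12, 20, 50, 100, 200):
    s_in, s_out = learning_curve_point(N, degree=1)
    c_in, c_out = learning_curve_point(N, degree=10)
    print(f"N={N:3d}  simple:  E_in={s_in:.3f} E_out={s_out:.3f}   "
          f"complex: E_in={c_in:.3f} E_out={c_out:.3f}")

For small N the complex model typically drives E_in toward 0 while its E_out is much larger than the simple model's; as N grows its E_out drops below the simple model's, which is the crossover described above.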
The following compares the VC analysis and the bias-variance decomposition on the learning curve. In the VC analysis, the blue area represents the in-sample error E_in, and the pink area represents Ω (or rather, Ω is at least as large as the pink area). In the bias-variance decomposition, the blue area represents the bias, and the pink area minus the blue area represents the variance. Why does the blue area represent the bias, that is, why does the bias not change with N? By definition, the bias is the distance between f and the hypothesis closest to f, the best result achievable over different datasets of the same size. A dataset with 10 points may well give a better approximation than a dataset with 2 points, but once we average over many datasets, the resulting mean hypotheses are all close to each other and close to (the best approximation of) f, so the bias shows up as a horizontal line parallel to the N axis. A small sketch that estimates ḡ, the bias, and the variance in exactly this way is given below.
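To make these definitions concrete, here is a minimal sketch (again my own, with an assumed target and hypothesis set) that estimates ḡ(x) by averaging the hypotheses learned from many datasets, and from that the bias and the variance, for several dataset sizes N. The bias stays roughly flat as N grows while the variance shrinks, and bias + variance matches the directly measured expected error, which is the decomposition derived above.

import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # Assumed target function, for illustration only.
    return np.sin(np.pi * x)

def bias_variance(N, degree, trials=2000, n_grid=200):
    """Estimate bias, variance, and expected error of degree-`degree` fits from N-point datasets."""
    x_grid = np.linspace(-1, 1, n_grid)
    preds = np.empty((trials, n_grid))
    for t in range(trials):
        x = rng.uniform(-1, 1, N)
        y = f(x)                                  # noiseless samples of the target
        coef = np.polyfit(x, y, min(degree, N - 1))
        preds[t] = np.polyval(coef, x_grid)
    g_bar = preds.mean(axis=0)                    # estimate of g_bar(x) = E_D[g^(D)(x)]
    bias = np.mean((g_bar - f(x_grid)) ** 2)      # E_x[(g_bar(x) - f(x))^2]
    variance = np.mean((preds - g_bar) ** 2)      # E_x[E_D[(g^(D)(x) - g_bar(x))^2]]
    expected = np.mean((preds - f(x_grid)) ** 2)  # E_x[E_D[(g^(D)(x) - f(x))^2]]
    return bias, variance, expected

for N in (5, 10, 20, 40, 80):
    b, v, e = bias_variance(N, degree=1)
    print(f"N={N:2d}  bias={b:.4f}  variance={v:.4f}  "
          f"bias+variance={b+v:.4f}  expected error={e:.4f}")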
The following is an example of a learning curve, for a linear model: the data come from a linear target with noise added, y = wᵀx + noise, and are fitted with linear regression.
Why add noise? The noise is the interference in the data: the learner only sees noisy samples, and the point is to test how well the model learned by the machine approximates the underlying linear part y = wᵀx. When the learned model is exactly y = wᵀx, that is the best possible case. From this we get the following conclusion:
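The conclusion, restating the lecture's result as I remember it (with σ² the variance of the noise): for linear regression the expected errors are

E_D[E_in] = σ² (1 − (d + 1)/N),
E_D[E_out] ≈ σ² (1 + (d + 1)/N),

so the expected generalization error is about 2σ² (d + 1)/N.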
Here d + 1 is the number of degrees of freedom of the linear model, which is similar to the VC dimension; it plays essentially the same role. As for why the term is (d + 1)/N, the lecturer did not explain it and I am not clear about it either... I do not feel very sure about this part...
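Even without a derivation, the expressions can be checked numerically. The following sketch (my own; the dimension d, noise level σ, and trial counts are arbitrary choices) runs least-squares linear regression on data from y = wᵀx + noise and compares the averaged E_in and E_out with σ²(1 − (d+1)/N) and σ²(1 + (d+1)/N); the agreement improves as N grows.

import numpy as np

rng = np.random.default_rng(2)

d, sigma = 5, 0.5                       # assumed input dimension and noise standard deviation
w_true = rng.standard_normal(d + 1)     # true weights, including the constant (bias) coordinate

def avg_errors(N, trials=3000, n_test=1000):
    """Average E_in and E_out of least-squares fits to y = w^T x + noise."""
    e_in = e_out = 0.0
    for _ in range(trials):
        X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])
        y = X @ w_true + sigma * rng.standard_normal(N)
        w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        X_test = np.hstack([np.ones((n_test, 1)), rng.standard_normal((n_test, d))])
        y_test = X_test @ w_true + sigma * rng.standard_normal(n_test)
        e_in += np.mean((X @ w_hat - y) ** 2)
        e_out += np.mean((X_test @ w_hat - y_test) ** 2)
    return e_in / trials, e_out / trials

for N in (20, 40, 80, 160):
    ein, eout = avg_errors(N)
    pred_in = sigma**2 * (1 - (d + 1) / N)
    pred_out = sigma**2 * (1 + (d + 1) / N)
    print(f"N={N:3d}  E_in={ein:.4f} (predicted {pred_in:.4f})   "
          f"E_out={eout:.4f} (predicted {pred_out:.4f})")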
Why do we need to analyze bias and variance at all? The purpose is to give a guideline for striking a balance among the hypothesis set H, the dataset D, and the learning algorithm.
California Institute of Technology open course: Machine Learning and Data Mining, Bias-Variance Tradeoff (Lecture 8)