Course introduction:

This article introduces the definition of the VC dimension, an important measure of the learning capacity of a hypothesis set in statistical learning theory. Examples show that the VC dimension of a hypothesis set is the maximum number of points it can shatter. The later sections discuss how to interpret and apply the VC dimension and point out that it reflects the learning capacity of the hypothesis set: the larger the VC dimension, the more complex the learning machine.

Course outline:

1. Definition

2. VC dimension of perceptrons

3. Interpretation of the VC dimension

4. Generalization bounds

1. Definition

d_VC(H) = the most points H can shatter. That is, the VC dimension of a hypothesis set H is the largest number of points that H can shatter. To shatter a set of points means that H can produce every possible labeling of those points. For binary classification, shattering N points means producing all 2^N possible dichotomies.

If we recall the concept of the break point k, it is easy to see that d_VC(H) = k - 1, where k is the smallest break point. (For convenience, d is sometimes written in place of d_VC(H).)
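The definition can be checked by brute force on small hypothesis sets. The sketch below is only an illustration: it uses positive rays and positive intervals on the real line, and the function names are hypothetical. It relies on the fact that for these one-dimensional models only the ordering of distinct points matters, so testing the points 0, 1, ..., n - 1 is representative.

```python
def positive_ray_dichotomies(points):
    """Labelings that h(x) = +1 if x > a else -1 can produce on the points."""
    xs = sorted(points)
    # One threshold below all points, one between each neighbouring pair, one above.
    thresholds = [xs[0] - 1] + [(p + q) / 2 for p, q in zip(xs, xs[1:])] + [xs[-1] + 1]
    return {tuple(1 if x > a else -1 for x in points) for a in thresholds}


def positive_interval_dichotomies(points):
    """Labelings that h(x) = +1 if lo < x < hi else -1 can produce on the points."""
    xs = sorted(points)
    cuts = [xs[0] - 1] + [(p + q) / 2 for p, q in zip(xs, xs[1:])] + [xs[-1] + 1]
    return {tuple(1 if lo < x < hi else -1 for x in points)
            for lo in cuts for hi in cuts if lo <= hi}


def is_shattered(points, dichotomies_fn):
    """True if every one of the 2^n labelings of the points is produced."""
    return len(dichotomies_fn(points)) == 2 ** len(points)


def vc_dimension_on_line(dichotomies_fn, max_n=6):
    """Largest n <= max_n such that n distinct points on the line are shattered."""
    d = 0
    for n in range(1, max_n + 1):
        if is_shattered(list(range(n)), dichotomies_fn):
            d = n
    return d


ray_dvc = vc_dimension_on_line(positive_ray_dichotomies)
interval_dvc = vc_dimension_on_line(positive_interval_dichotomies)
print("positive rays:", ray_dvc, "break point:", ray_dvc + 1)
print("positive intervals:", interval_dvc, "break point:", interval_dvc + 1)
```

In both cases the smallest break point is one more than the largest number of points that can be shattered, which is the relation d_VC(H) = k - 1.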

Relationship between the VC dimension and learning:

If d_VC(H) is finite, then the final hypothesis g ∈ H will generalize (proven theoretically in Lesson 6).

Note: generalization in machine learning refers to the ability to apply the rules learned from the sample to data outside the sample. It is measured by the gap between E_in and E_out.

The preceding statement has the following properties:

1. It is independent of the learning algorithm. Whichever algorithm is used, the resulting g generalizes.

2. It is independent of the input distribution. Because the analysis accounts for every possible case, the statement holds for any distribution.

3. It is independent of the target function. We do not care about the target function at all. We only care about the sample data and the test data.

4. g depends only on the sample data and the hypothesis set H. The hypothesis set determines the range of candidates for g, and the sample data determines which g is chosen.

2. VC dimension of perceptrons

Formula: d_VC(H) = d + 1.

H is the perceptron model, and d is the dimension of the perceptron's input space. (For binary classification.)

Proof:

To prove d_VC = d + 1, we first prove d_VC >= d + 1 and then prove d_VC <= d + 1.

1) Prove d_VC >= d + 1:

By definition, it suffices to exhibit one set of d + 1 points that the perceptron can shatter. Let y be any of the 2^(d+1) possible labelings of the d + 1 points. We need to find a weight vector w such that sign(Xw) = y, and it is enough to satisfy Xw = y. If X is invertible, then w = X^(-1) y. So we only need to find one data set whose matrix X is invertible. Such an X can be built from d + 1 points, each a (d + 1)-dimensional vector that includes the constant coordinate 1. Its matrix is invertible, which completes this part of the proof.
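The argument above can be checked numerically. The sketch below assumes one standard choice of X (the original matrix is not reproduced here): the first point is the origin and the remaining d points are the unit vectors, each prefixed with the constant coordinate 1. For this particular X the solution of Xw = y can be written down directly, so no matrix library is needed. The function names are illustrative.

```python
from itertools import product


def sign(v):
    return 1 if v > 0 else -1


def build_points(d):
    """d + 1 points, each of dimension d + 1 with a leading constant 1.

    Row 0 is (1, 0, ..., 0); row i is (1, e_i) for i = 1..d.
    """
    rows = [[1] + [0] * d]
    for i in range(d):
        row = [1] + [0] * d
        row[i + 1] = 1
        rows.append(row)
    return rows


def solve_weights(y):
    """Solve X w = y for the X above: w_0 = y_0 and w_i = y_i - y_0."""
    return [y[0]] + [yi - y[0] for yi in y[1:]]


def perceptron_shatters(d):
    """Check that every labeling of the d + 1 points is produced by some w."""
    X = build_points(d)
    for y in product([-1, 1], repeat=d + 1):
        w = solve_weights(y)
        outputs = tuple(sign(sum(wi * xi for wi, xi in zip(w, x))) for x in X)
        if outputs != y:
            return False
    return True


for d in range(1, 6):
    print(d, perceptron_shatters(d))
```

Because Xw equals y exactly, taking the sign recovers every labeling, which is the content of d_VC >= d + 1.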

2) Prove d_VC <= d + 1:

To prove d_VC <= d + 1, we need to show that no set of d + 2 points can be shattered.

That is, for any d + 2 points, we only need to find one labeling that the perceptron cannot produce.

As in part 1), we form the matrix X, but now X has d + 2 vectors, each still of dimension d + 1. From linear algebra we know that when the number of vectors exceeds the dimension, the vectors are linearly dependent. So there is a vector x_j such that x_j = Σ_{i≠j} a_i x_i. Since the first coordinate of every vector is 1, comparing first coordinates gives Σ_{i≠j} a_i = 1, so the a_i are not all 0. Now consider the following labeling:

y_i = sign(a_i) for every i with a_i ≠ 0 (points with a_i = 0 can be ignored), and y_j = -1. The perceptron cannot generate this labeling.

Multiply both sides of the preceding equation by w^T to get w^T x_j = Σ_{i≠j} a_i w^T x_i. If y_i = sign(w^T x_i) and y_i = sign(a_i), then sign(w^T x_i) * sign(a_i) > 0, so a_i w^T x_i > 0 for every i with a_i ≠ 0. The right side of the equation is therefore greater than 0, so w^T x_j > 0 and y_j = sign(w^T x_j) = +1. There is no way to produce the labeling with y_j = -1, which proves d_VC <= d + 1.

In summary, d_VC = d + 1, and the proof is complete.
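A small numerical illustration of part 2), for the case d = 1 with three hand-picked points on the line. The points, the coefficients a_i, and the random search over w are assumptions made for this sketch. The search is not a proof: it only shows that the labeling built in the proof is never hit, while a labeling that is allowed is found easily.

```python
import random


def sign(v):
    return 1 if v > 0 else -1


# d = 1: three points on the line at 0, 1, 2, each with a leading constant 1.
X = [(1, 0), (1, 1), (1, 2)]
j = 2                  # x_j is written in terms of the other points
a = {0: -1, 1: 2}      # x_2 = -1 * x_0 + 2 * x_1

# Check the linear dependence x_j = sum_i a_i x_i coordinate by coordinate.
combo = tuple(sum(a[i] * X[i][k] for i in a) for k in range(2))
assert combo == X[j]

# Labeling from the proof: y_i = sign(a_i) for i != j, and y_j = -1.
target = [sign(a[0]), sign(a[1]), -1]


def realized_by_some_w(labels, trials=100_000, seed=0):
    """Random search for a weight vector w that produces the given labels."""
    rng = random.Random(seed)
    for _ in range(trials):
        w = (rng.uniform(-5, 5), rng.uniform(-5, 5))
        if [sign(w[0] * x[0] + w[1] * x[1]) for x in X] == labels:
            return True
    return False


found_forbidden = realized_by_some_w(target)
found_allowed = realized_by_some_w([-1, 1, 1])
print("forbidden labeling", target, "found:", found_forbidden)
print("allowed labeling [-1, 1, 1] found:", found_allowed)
```

Here the sum of the a_i is 1, as required by the constant first coordinate, and the forbidden labeling is the familiar (-1, +1, -1) pattern that a threshold on the line cannot produce.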

For the perceptron, the VC dimension equals the number of parameters (w_0, w_1, ..., w_d). The larger the VC dimension, the more parameters there are and the more complex the model is. Does this conclusion apply to other models? This is what we discuss next.

3. Interpretation of the VC dimension

This section addresses two questions:

1. What is the significance of the VC dimension?

2. How can the VC dimension guide machine learning?

3.1 Significance of the VC dimension

For a model, the more parameters it has, the more degrees of freedom it has. As described in the previous section, the larger the dimension of the perceptron, the more parameters it has and the more complex it is, that is, the more degrees of freedom it has. Can we determine a model's degrees of freedom from its number of parameters?

Let's take a look at several examples:

1) Positive rays: d_VC = 1, 1 parameter, 1 degree of freedom

2) Positive intervals: d_VC = 2, 2 parameters, 2 degrees of freedom

At first glance, it seems that the number of parameters determines the degrees of freedom of a model. However, there is a counterexample:

Consider a model made up of four simple linear models connected one after another, so that the output of each one is the input of the next. Each simple linear model has two parameters, so the whole model has 8 parameters. Does that mean the model has 8 degrees of freedom? No, because there is redundancy. Except for the first simple model, the input of every other model is restricted to either +1 or -1. Therefore, the number of parameters does not represent the degrees of freedom. With the VC dimension, however, we do not need to know how the model is built. We only care about the number of points it can shatter. The VC dimension is therefore a better measure of a model's degrees of freedom, and this is its significance: it represents the effective degrees of freedom of a model and is not affected by the model's internal structure.
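The redundancy can be illustrated with a small brute-force search. The sketch below assumes the four simple models are one-dimensional perceptrons chained in series, each computing sign(w0 + w1 * input), and it searches only a coarse grid of parameter values. Both the architecture and the grid are assumptions for illustration, so the result is a finite check rather than a proof.

```python
from itertools import product


def sign(v):
    return 1 if v >= 0 else -1


def chain_output(params, x):
    """Feed x through the units in sequence; each unit computes sign(w0 + w1 * input)."""
    value = x
    for w0, w1 in params:
        value = sign(w0 + w1 * value)
    return value


def chain_dichotomies(points, n_units=4, grid=(-1.5, -0.5, 0.5, 1.5)):
    """Labelings the chain produces on the points over a grid of parameter values."""
    unit_choices = list(product(grid, repeat=2))           # (w0, w1) for one unit
    found = set()
    for params in product(unit_choices, repeat=n_units):   # 16**4 = 65536 settings
        found.add(tuple(chain_output(params, x) for x in points))
    return found


two = chain_dichotomies([-1, 1])
three = chain_dichotomies([-1, 0, 1])
print("labelings of 2 points:", len(two), "of", 2 ** 2)
print("labelings of 3 points:", len(three), "of", 2 ** 3)
```

On this grid the 8-parameter chain produces every labeling of two points but not every labeling of three, so it behaves like a model with far fewer than 8 degrees of freedom.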

3.2 How can the VC dimension guide machine learning?

If we know the VC dimension, how much data do we need to learn well?

Consider the VC inequality:

P[|E_in(g) - E_out(g)| > ε] <= 4 m_H(2N) e^(-ε²N/8)

Given ε and δ (where δ is the value of the right side of the inequality), what we really want to know is how many points N are needed to satisfy it. To make the inequality easier to study, we simplify its right side as follows:

y = N^d * e^(-N), where d is the VC dimension. This expression has the same form as the right side of the inequality, so studying it lets us predict the behavior of the original expression while keeping the analysis simple.

Suppose we require y to be at most some small value (which is what we want). Plotting y against N for different values of d, a horizontal line marks the target value: we want y to be less than or equal to it, and the corresponding x coordinate is the required number of points N.

Because the peak of the function grows rapidly as d increases, we plot y on a logarithmic scale to observe the curves better.

The first curve has d = 5, and each subsequent curve increases d by 5. From the plot we can see that the required N is roughly proportional to d.

We observe the following: the larger the VC dimension, the larger the required N, and the relationship is approximately linear. This observation has no mathematical proof and comes from practice, but it is a very useful guide.
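The observation can be reproduced with a few lines of code. The sketch below works with ln(y) = d * ln(N) - N to avoid overflow and finds, for each d, the smallest N past the peak at which y drops below a target. The target value 1e-3 and the function name are illustrative assumptions, not values taken from the lecture.

```python
import math


def required_n(d, log_target):
    """Smallest integer N >= d with d * ln(N) - N <= log_target.

    For N >= d the quantity d * ln(N) - N is non-increasing, so the first N
    that satisfies the condition is the smallest one past the peak.
    """
    n = max(d, 1)
    while d * math.log(n) - n > log_target:
        n += 1
    return n


log_target = math.log(1e-3)   # illustrative target for y (an assumption)
rows = [(d, required_n(d, log_target)) for d in range(5, 35, 5)]
for d, n in rows:
    print(f"d = {d:2d}  N = {n:3d}  N/d = {n / d:.2f}")
```

For this target the ratio N/d stays in a fairly narrow band as d grows, which matches the roughly linear relationship described above.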

Given ε and δ, we can only get an approximate ratio of N to d_VC rather than the exact one. However, in most cases the approximate proportionality holds. From a wide range of observations, we have the following rule of thumb:

N >= 10 * d_VC. (This holds for most problems, most models, and most data sets.)

4. Generalization bounds

Start from the VC inequality, let δ denote its right side, and solve for ε in terms of δ:

P[|E_in(g) - E_out(g)| > ε] <= 4 m_H(2N) e^(-ε²N/8) = δ

So we have:

ε = sqrt((8/N) ln(4 m_H(2N) / δ))

So far we have studied the probability of the bad event. Now let's look at the probability of the good event:

P[|E_in(g) - E_out(g)| <= ε] >= 1 - δ

Writing Ω(N, H, δ) in place of ε gives the good event we want: with probability at least 1 - δ, |E_out - E_in| <= Ω(N, H, δ)

Ω depends on N, δ, and H (through its VC dimension): it decreases as N grows, increases as δ shrinks, and increases with the VC dimension of H.
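A minimal sketch of how Ω behaves, assuming the VC-inequality form Ω = sqrt((8/N) ln(4 m_H(2N) / δ)) and replacing the growth function with the polynomial bound m_H(2N) <= (2N)^d_VC + 1. The function name and the sample arguments are illustrative.

```python
import math


def omega(n, d_vc, delta):
    """Omega(N, H, delta) = sqrt((8 / N) * ln(4 * m_H(2N) / delta)).

    Assumption: m_H(2N) is replaced by its polynomial bound (2N)**d_vc + 1.
    """
    growth_bound = (2 * n) ** d_vc + 1
    # Split the logarithm so that very large integers never become floats.
    log_term = math.log(4) + math.log(growth_bound) - math.log(delta)
    return math.sqrt(8.0 / n * log_term)


print("N = 1000,  d_VC = 3,  delta = 0.05:", round(omega(1000, 3, 0.05), 3))
print("N = 10000, d_VC = 3,  delta = 0.05:", round(omega(10000, 3, 0.05), 3))
print("N = 1000,  d_VC = 10, delta = 0.05:", round(omega(1000, 10, 0.05), 3))
print("N = 1000,  d_VC = 3,  delta = 0.01:", round(omega(1000, 3, 0.01), 3))
```

Comparing the printed values shows the three dependencies: more data tightens the bound, while a larger VC dimension or a smaller δ loosens it.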

Dropping the arguments of Ω for brevity, we have: |E_out - E_in| <= Ω.

In most cases E_out is larger than E_in: because we learn from the sample, the hypothesis we end up with is biased toward the sample. Exceptions are possible, of course. So we can simplify further by removing the absolute value:

E_out - E_in <= Ω

Rearranging:

E_out <= E_in + Ω

This leads to a very important technique in machine learning: regularization. E_out depends not only on E_in but also on Ω. When we enlarge the hypothesis set, E_in goes down but Ω goes up, so the two must be balanced to minimize E_in + Ω. This is the problem that regularization addresses.

Conclusion:

This lesson covered the VC dimension and its significance. Through the VC dimension, we can describe the degrees of freedom of a model and estimate the amount of data required for effective learning. In many cases the required amount of data is only a rule of thumb and cannot be obtained exactly, but it is still very helpful when analyzing a machine learning problem. Finally, the generalization bound was obtained by rearranging the original inequality. We conclude that a larger hypothesis set is not always better, because E_out is affected by both E_in and Ω.

Caltech Open Course: machine learning and Data Mining _ VC (Lesson 7)