Machine Learning Foundations Note 7 -- Why Can Machines Learn? (3)

Reprint: please credit the source: http://www.cnblogs.com/ymingjingr/p/4271742.html

Directory
Machine Learning Foundations Note 1 -- When Can Machine Learning Be Used (1)
Machine Learning Foundations Note 2 -- When Can Machine Learning Be Used (2)
Machine Learning Foundations Note 3 -- When Can Machine Learning Be Used (3) (revised)
Machine Learning Foundations Note 4 -- When Can Machine Learning Be Used (4)
Machine Learning Foundations Note 5 -- Why Can Machines Learn (1)
Machine Learning Foundations Note 6 -- Why Can Machines Learn (2)
Machine Learning Foundations Note 7 -- Why Can Machines Learn (3)
Machine Learning Foundations Note 8 -- Why Can Machines Learn (4)
Machine Learning Foundations Note 9 -- How Can Machines Learn (1)
Machine Learning Foundations Note 10 -- How Can Machines Learn (2)
Machine Learning Foundations Note 11 -- How Can Machines Learn (3)
Machine Learning Foundations Note 12 -- How Can Machines Learn (4)
Machine Learning Foundations Note 13 -- How Can Machines Learn Better (1)
Machine Learning Foundations Note 14 -- How Can Machines Learn Better (2)
Machine Learning Foundations Note 15 -- How Can Machines Learn Better (3)
Machine Learning Foundations Note 16 -- How Can Machines Learn Better (4)

Lecture 7: The VC Dimension

7.1 Definition of VC Dimension


A brief summary of the previous chapter: if a hypothesis space has a break point k, its growth function m_H(N) is certainly bounded above by the bounding function B(N, k); the bounding function equals a sum of combinations, B(N, k) = \sum_{i=0}^{k-1} \binom{N}{i}, whose highest-order term is easily seen to be N^{k-1}. Figures 7-1 a) and b) show, respectively, B(N, k) acting as an upper bound of the growth function, and N^{k-1} acting as an upper bound of B(N, k).

Figure 7-1 a) B(N, k) as an upper bound of m_H(N); b) N^{k-1} as an upper bound of B(N, k)

It can be seen that for N ≥ 2 and k ≥ 3 the relation B(N, k) ≤ N^{k-1} holds, which gives Equation 7-1.

m_H(N) \le B(N, k) = \sum_{i=0}^{k-1} \binom{N}{i} \le N^{k-1}    (Equation 7-1)

Combining Equation 7-1 with the VC bound obtained in the previous chapter yields Equation 7-2.

P\left[\exists h \in H \text{ s.t. } |E_{in}(h) - E_{out}(h)| > \epsilon\right] \le 4\, m_H(2N)\, e^{-\frac{1}{8}\epsilon^2 N} \le 4\, (2N)^{k-1}\, e^{-\frac{1}{8}\epsilon^2 N}    (Equation 7-2)

The meaning of this formula: when the number of input samples N is large, the VC bound certainly holds, and the growth-function factor on the right-hand side can be replaced by the polynomial form (2N)^{k-1} (note that the conditions here are N ≥ 2 and k ≥ 3; the reason N ≥ 2 is unproblematic is simple: the VC bound is only of interest when the sample size N is large, so that condition is automatically met), while for k < 3 there are other bounds that already suffice (for example, the positive-ray class mentioned in earlier chapters has its growth function constrained without needing a polynomial bound of this form).
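To make the polynomial-versus-exponential argument concrete, here is a minimal sketch (assuming, purely for illustration, a break point k = 4 and a tolerance ε = 0.1; the function name is mine) that evaluates the right-hand side of Equation 7-2 for increasing N: the polynomial factor grows, but the exponential factor eventually drives the whole bound toward 0.

```python
import math

def vc_bound_rhs(N, k, epsilon):
    """Right-hand side of Equation 7-2: 4 * (2N)^(k-1) * exp(-epsilon^2 * N / 8).

    Valid for N >= 2 and k >= 3, where (2N)^(k-1) upper-bounds the growth
    function m_H(2N).
    """
    return 4.0 * (2.0 * N) ** (k - 1) * math.exp(-(epsilon ** 2) * N / 8.0)

if __name__ == "__main__":
    k, epsilon = 4, 0.1   # hypothetical choices, e.g. the 2-D perceptron's break point
    for N in [100, 1_000, 10_000, 100_000]:
        print(f"N = {N:>7,d}  bound = {vc_bound_rhs(N, k, epsilon):.3e}")
```

The bound is vacuous (far above 1) for small N and only becomes meaningful for very large N, which foreshadows the sample-complexity discussion in Section 7.4.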

Now, the machine can learn when the following conditions are satisfied:

    1. the growth function of the hypothesis space has a break point k (a good hypothesis space H);
    2. the number of input data samples N is large enough (a good input data set D);

Conditions 1 and 2, combined through the VC bound, imply that with high probability E_out(g) is close to E_in(g).

    3. an algorithm A is able to find a hypothesis g with small enough E_in(g) (a good algorithm A);

Combined with the conclusion of 1 and 2, this gives E_out(g) ≈ E_in(g) ≈ 0, so learning is possible (and, of course, a bit of good luck also helps).

Now to the topic of this chapter: what the VC dimension (VC dimension) actually means.

Its definition is closely related to the break point: the VC dimension is the largest number of inputs that is not a break point.

The VC dimension is a property of the hypothesis space H: it is the maximum number of data samples that can be completely shattered, i.e. for which all dichotomies can be generated. Write d_VC(H) for the VC dimension. If a break point exists, d_VC equals the smallest break point minus 1, as shown in Equation 7-3; if no break point exists, the VC dimension is infinite.

d_{VC}(H) = \min\{k : k \text{ is a break point of } H\} - 1    (Equation 7-3)

If the number of input samples N is no larger than the VC dimension, then it is possible for an input data set D of size N to be completely shattered; note that this is not guaranteed for every data set, only that such a data set exists.

If a number k is larger than the VC dimension (k > d_VC), then k is necessarily a break point of the hypothesis space H.

Rewriting Equation 7-1 in terms of the VC dimension gives Equation 7-4.

m_H(N) \le N^{d_{VC}} \quad (N \ge 2,\ d_{VC} \ge 2)    (Equation 7-4)

For the hypothesis sets discussed in Chapter 5, replacing the break point with the VC dimension gives the relationship between the VC dimension and the growth function shown in Table 7-1.

Table 7-1 The relationship between VC dimension and growth function

Hypothesis set                    Growth function m_H(N)                      VC dimension d_VC
Positive rays                     N + 1                                       1
Convex set classification         2^N                                         infinite
Perceptron on the 2-D plane       2^N for N ≤ 3, less than 2^N for N > 3      3
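As a sanity check on the first row of Table 7-1, the following minimal sketch (my own construction, not from the note) enumerates the dichotomies that positive rays h(x) = sign(x - a) can generate on N distinct points: the count is N + 1, and all 2^N labelings appear only for N = 1, i.e. d_VC = 1.

```python
def positive_ray_dichotomies(points):
    """All labelings that positive rays h(x) = sign(x - a) generate on 1-D points."""
    xs = sorted(points)
    # Candidate thresholds: below all points, between neighbours, above all points.
    thresholds = [xs[0] - 1.0]
    thresholds += [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
    thresholds += [xs[-1] + 1.0]
    return {tuple(+1 if x > a else -1 for x in points) for a in thresholds}

def is_shattered(points):
    """True if every one of the 2^N labelings is generated on these points."""
    return len(positive_ray_dichotomies(points)) == 2 ** len(points)

if __name__ == "__main__":
    for N in range(1, 5):
        pts = [float(i) for i in range(N)]              # any N distinct points will do
        count = len(positive_ray_dichotomies(pts))      # expected: N + 1
        print(f"N = {N}: {count} dichotomies, shattered = {is_shattered(pts)}")
```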

This provides a definition of the good hypothesis space in condition 1 above: one with a finite VC dimension.

A finite VC dimension always guarantees that the hypothesis g that is found satisfies E_out(g) ≈ E_in(g), and this conclusion does not depend on:

    1. the algorithm A that is used: even if the E_in(g) it picks is large, the property above is still satisfied;
    2. the distribution P of the input data;
    3. the unknown target function f.

That is, the VC bound can handle any algorithm A, any data distribution P, and any target function f.

This property is illustrated by the flowchart shown in Figure 7-2, where the parts marked in gray do not affect the conclusion.

Figure 7-2 Flowchart of how the VC dimension guarantees that the machine can learn

7.2 VC Dimension of Perceptrons


The following two conditions guarantee that 2-dimensional linearly separable data can be learned.

    1. For linearly separable data, running the PLA algorithm long enough (the number of steps T large enough) finds a line that classifies every training sample correctly, so no training sample is misclassified, i.e. E_in(g) = 0 (a minimal sketch of this follows below);
    2. Under the premise that the training samples and the whole data set follow the same distribution P, the VC bound guarantees that when the number of training samples N is large enough, E_out(g) ≈ E_in(g).

Together, the two conditions give E_out(g) ≈ 0.
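A minimal PLA sketch matching condition 1 (assuming NumPy; the data-generating line and all names here are hypothetical, for illustration only): on linearly separable data it keeps correcting a misclassified sample until E_in reaches 0.

```python
import numpy as np

def pla(X, y, max_steps=10_000):
    """Perceptron learning algorithm on homogeneous inputs (first column = 1).
    Returns a weight vector w with sign(X w) == y when the data are linearly
    separable and max_steps is large enough."""
    w = np.zeros(X.shape[1])
    for _ in range(max_steps):
        mistakes = np.flatnonzero(np.sign(X @ w) != y)
        if mistakes.size == 0:      # E_in(w) = 0: every sample classified correctly
            return w
        i = mistakes[0]
        w = w + y[i] * X[i]         # correct the first mistake found
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Hypothetical linearly separable 2-D data generated by a target line.
    pts = rng.uniform(-1, 1, size=(100, 2))
    X = np.hstack([np.ones((100, 1)), pts])
    w_target = np.array([0.1, 1.0, -1.0])
    y = np.sign(X @ w_target)
    w = pla(X, y)
    print("in-sample errors:", int(np.sum(np.sign(X @ w) != y)))   # expect 0
```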

This section discusses whether PLA can handle data whose dimension is greater than two.

From the previous section: as long as d_VC is finite, the VC bound can be used to guarantee E_out(g) ≈ E_in(g). So the question becomes how the VC dimension of the perceptron can be expressed (as a finite number) when the data dimension is greater than two.

Two perceptron VC dimensions are already known. The VC dimension of the 1-dimensional perceptron is d_VC = 2; the VC dimension of the 2-dimensional perceptron is d_VC = 3.

Can we then conjecture that the VC dimension of the d-dimensional perceptron is d_VC = d + 1?

This is only a conjecture; the next step is to prove it. The idea of the proof is also the traditional one: proving an equality is split into two steps, proving "greater than or equal" and proving "less than or equal".

The idea for the "greater than or equal" direction: prove that there exists a data set of d + 1 samples that can be completely shattered. The idea for the "less than or equal" direction: prove that no data set of d + 2 samples can be completely shattered.

First prove d_VC ≥ d + 1. Since only existence needs to be shown, it suffices to construct one input sample set. Take each sample as a row vector: the first sample is the zero vector, the second sample has its first component equal to 1 and all other components 0, the third sample has its second component equal to 1 and all other components 0, and so on; the (d+1)-th sample has its d-th component equal to 1 and all other components 0, i.e. x_1 = (0, 0, ..., 0), x_2 = (1, 0, ..., 0), x_3 = (0, 1, ..., 0), ..., x_{d+1} = (0, 0, ..., 1). In the perceptron each sample is prefixed with a 0-th component whose value is 1 (the component multiplied by the threshold b), so the input matrix X is as shown in Equation 7-5.

X = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\ 1 & 0 & 1 & \cdots & 0 \\ \vdots & & & \ddots & \\ 1 & 0 & 0 & \cdots & 1 \end{bmatrix}    (Equation 7-5)

It is easy to prove that this matrix X is invertible: subtracting the first row from every other row yields a diagonal matrix, so X has full rank and is invertible.

What needs to be proved is that this X can be completely shattered; the focus is on the output label vector y. The input sample set X can be mapped to every one of its dichotomies as long as, for each label vector y, a corresponding weight vector w can be found.

A perceptron hypothesis can be written as sign(Xw). As long as a weight vector w makes Xw = y hold, it certainly meets the requirement sign(Xw) = y. With the input matrix of Equation 7-5, X is invertible, so for any dichotomy y the weight vector w = X^{-1} y satisfies Xw = y; that is, every dichotomy has a weight vector w corresponding to it, so X is completely shattered and d_VC ≥ d + 1.
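A minimal sketch of this construction (assuming NumPy is available; the helper names are mine): it builds the matrix X of Equation 7-5 for a chosen d and checks that for every one of the 2^(d+1) label vectors y, the weight vector w = X^{-1} y indeed gives sign(Xw) = y.

```python
from itertools import product
import numpy as np

def construction_matrix(d):
    """The (d+1) x (d+1) input matrix of Equation 7-5: a first column of 1s
    (the 0-th component) in front of the zero vector and the d unit vectors."""
    X = np.zeros((d + 1, d + 1))
    X[:, 0] = 1.0            # x_0 = 1 for every sample (the threshold component)
    X[1:, 1:] = np.eye(d)    # samples 2 .. d+1 are the standard basis vectors
    return X

def shatters(d):
    """Check that every dichotomy y is realized by the weight vector w = X^{-1} y."""
    X = construction_matrix(d)
    X_inv = np.linalg.inv(X)
    for y in product([-1.0, 1.0], repeat=d + 1):
        y = np.array(y)
        w = X_inv @ y                       # then X w = y exactly, so sign(X w) = y
        if not np.array_equal(np.sign(X @ w), y):
            return False
    return True

if __name__ == "__main__":
    for d in [2, 3, 4]:
        print(f"d = {d}: constructed d+1 samples shattered -> {shatters(d)}")
```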

Now prove d_VC ≤ d + 1. Proving this direction is not like the one above, where citing one special input data set was enough: here the claim must hold in all cases, so the proof is a bit more involved. Start with an example in 2-dimensional space as a starting point.

In a 2-dimensional space, input sets of size d + 2 = 4 need to be examined. Take the four input samples to be x_1 = (0, 0), x_2 = (1, 0), x_3 = (0, 1), x_4 = (1, 1); with the 0-th component 1 prepended, the input data set X is as shown in Equation 7-6.

X = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix}    (Equation 7-6)

It can be seen that when x_1 is labeled -1 and x_2, x_3 are labeled +1, it is impossible for x_4 to be labeled -1, as shown in Figure 7-3.

Figure 7-3 A labeling of the 2-dimensional data sample that cannot be generated as a dichotomy

How can this be expressed mathematically? First, from the four samples themselves, Equation 7-7 is guaranteed to hold.

x_4 = x_2 + x_3 - x_1    (Equation 7-7)

Multiplying both sides of this equation on the left by the weight vector w^T preserves the equality; but when the labels satisfy y_1 = -1, y_2 = +1 and y_3 = +1, the right-hand side, and hence the left-hand side w^T x_4, must be greater than 0, as shown in Equation 7-8.

w^T x_4 = w^T x_2 + w^T x_3 - w^T x_1 > 0    (Equation 7-8)

So sign(w^T x_4) is forced to be +1, and the label y_4 = -1 cannot be generated.

It is this linear dependence (linear dependence) among the samples that makes the complete shattering impossible.
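The same conclusion can be checked numerically. Here is a small sketch (assuming SciPy is available; the helper name is mine) that, for each of the 16 labelings of the four points in Equation 7-6, asks a linear program whether some weight vector w satisfies y_i (w·x_i) ≥ 1 for every sample. Only 14 labelings turn out to be realizable, and the two missing ones are exactly the dichotomy ruled out by the argument above and its mirror image.

```python
from itertools import product

import numpy as np
from scipy.optimize import linprog

# The four samples of Equation 7-6, with the 0-th component 1 prepended.
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])

def realizable(y):
    """True if some w satisfies y_i * (w . x_i) >= 1 for every sample, i.e.
    the labeling y is a dichotomy a 2-D perceptron can generate."""
    A_ub = -(y[:, None] * X)        # y_i * (x_i . w) >= 1  <=>  -(y_i x_i) . w <= -1
    b_ub = -np.ones(len(y))
    res = linprog(c=np.zeros(X.shape[1]), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * X.shape[1])
    return res.status == 0          # status 0: a feasible w was found

if __name__ == "__main__":
    labelings = [np.array(y, dtype=float) for y in product([-1, 1], repeat=4)]
    feasible = [y for y in labelings if realizable(y)]
    print(f"realizable dichotomies: {len(feasible)} / {len(labelings)}")  # expect 14 / 16
    for y in labelings:
        if not realizable(y):
            print("impossible labeling:", y)  # (-1, +1, +1, -1) and (+1, -1, -1, +1)
```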

What happens in the general, higher-dimensional case?

Now suppose there are d + 2 samples in a d-dimensional space; the input sample set is shown in Equation 7-9, where each row x_i is a sample in homogeneous form with d + 1 components (the 0-th component being 1).

X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_{d+1} \\ x_{d+2} \end{bmatrix}    (Equation 7-9)

Because there must be linear dependence among d + 2 vectors that each have only d + 1 components, Equation 7-10 can be obtained, where a_i denotes a coefficient; the coefficients may be positive, negative, or zero, but they are not all zero.

x_{d+2} = a_1 x_1 + a_2 x_2 + \cdots + a_{d+1} x_{d+1}    (Equation 7-10)

Here a proof by contradiction is used: suppose the dichotomy with y_i = sign(a_i) for i = 1, 2, ..., d+1 (labels for a_i = 0 chosen arbitrarily) and y_{d+2} = -1 can be generated by some weight vector w. Multiplying both sides of Equation 7-10 on the left by w^T gives Equation 7-11.

w^T x_{d+2} = a_1 w^T x_1 + a_2 w^T x_2 + \cdots + a_{d+1} w^T x_{d+1}    (Equation 7-11)

Because a_i and w^T x_i have the same sign for every i = 1, 2, ..., d+1 with a_i ≠ 0 (that is how the labels were chosen), every term a_i w^T x_i is non-negative and at least one is strictly positive, so the right-hand side, and hence w^T x_{d+2}, must be positive. This contradicts the assumed label y_{d+2} = -1, so the assumption fails: in any input data set of d + 2 samples there is always at least one dichotomy that cannot be generated, i.e. d_VC ≤ d + 1.
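A minimal sketch of this argument for an arbitrary data set (assuming NumPy; the function name is mine): given any d + 2 samples in homogeneous form, it extracts non-trivial dependence coefficients from the null space of the stacked matrix and returns the labeling that the dependence makes impossible to generate.

```python
import numpy as np

def impossible_labeling(X, tol=1e-9):
    """X has d+2 rows, each a sample in homogeneous form (d+1 components, the
    0-th being 1). Returns a labeling y in {-1, +1}^(d+2) that no perceptron
    sign(w . x) can generate, following the linear-dependence argument above."""
    # Non-trivial coefficients c with sum_i c_i * x_i = 0: d+2 vectors with only
    # d+1 components are always linearly dependent; take the last right singular
    # vector of X^T, which spans its null space.
    c = np.linalg.svd(X.T)[2][-1]
    j = int(np.argmax(np.abs(c)))     # express x_j in terms of the other samples
    a = -c / c[j]                     # x_j = sum_{i != j} a_i * x_i
    y = np.where(a > tol, 1.0, -1.0)  # label each x_i with the sign of a_i
    y[np.abs(a) <= tol] = 1.0         # zero coefficients: the label is arbitrary
    y[j] = -1.0                       # ...but the dependence forces x_j to be +1
    return y

if __name__ == "__main__":
    d = 3
    rng = np.random.default_rng(0)
    # d + 2 random samples with the constant 0-th component prepended.
    X = np.hstack([np.ones((d + 2, 1)), rng.standard_normal((d + 2, d))])
    print(f"a labeling no {d}-dimensional perceptron can generate:")
    print(impossible_labeling(X))
```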

With both the "greater than or equal" and the "less than or equal" directions proved, the initial conjecture d_VC = d + 1 is established.

7.3 Physical intuition of VC Dimension


The previous section connected the dimension of the data in the perceptron with the VC dimension, and also clarified what the VC dimension means. In the perceptron, the dimension of a data sample matches the dimension of the weight vector, and different weight vectors correspond to different hypothesis functions, so the parameters of the hypothesis function act as the degrees of freedom (degrees of freedom) of the hypothesis space. Measured by the number of hypotheses |H|, the degrees of freedom are infinite; but restricting attention to the perceptron's binary classification behavior allows the VC dimension to be used as the measure of the effective degrees of freedom.

A more concrete view of the relationship between the VC dimension and the hypothesis-space parameters is given by two examples studied earlier, as shown in Figure 7-4.

Figure 7-4 a) positive rays; b) positive intervals

In Figure 7-4 a), the positive-ray hypothesis space has d_VC = 1 and 1 parameter, namely the threshold; in Figure 7-4 b), the positive-interval hypothesis space has d_VC = 2 and 2 parameters, namely the left and right boundary points. Therefore, in most cases, the VC dimension is roughly equal to the number of free parameters of the hypothesis space.
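As a quick check of this rule of thumb, here is a minimal brute-force sketch (the hypothesis set and helper names are my own): it counts the labelings that positive intervals (+1 inside an interval, -1 outside) can generate, and finds that sets of 2 points are shattered while sets of 3 are not, so d_VC = 2, matching the 2 free parameters.

```python
from itertools import combinations

def positive_interval_dichotomies(points):
    """All labelings that positive intervals (+1 on (l, r), -1 elsewhere) generate."""
    xs = sorted(points)
    cuts = [xs[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    dichotomies = {tuple(-1 for _ in points)}               # the empty interval
    for l, r in combinations(cuts, 2):
        dichotomies.add(tuple(+1 if l < x < r else -1 for x in points))
    return dichotomies

if __name__ == "__main__":
    for N in range(1, 5):
        pts = [float(i) for i in range(N)]
        n = len(positive_interval_dichotomies(pts))
        print(f"N = {N}: {n} of {2 ** N} labelings generated")   # shattered only for N <= 2
```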

For the two questions raised in Section 5.1, the VC dimension can now be used in place of M to describe their relationship, as shown in Table 7-2.

Table 7-2 The relationship between the size of the VC dimension and the two questions

                                     when d_VC is small                         when d_VC is large
Question 1: E_out(g) ≈ E_in(g)?      satisfied; the probability of the bad      not satisfied; as d_VC grows, the
                                     event becomes small                        probability of the bad event grows
Question 2: E_in(g) small enough?    not satisfied; the number of hypotheses    satisfied; the number of hypotheses
                                     is small, the algorithm has few choices    grows, the algorithm has more choices,
                                     and may not find a hypothesis with E_in    and the chance of finding a hypothesis
                                     close to 0                                 with E_in close to 0 grows

A thought of my own here: some books call models with more parameters "complex models" with very little explanation, and the content of this section gives a good one, because in many cases the number of model parameters is roughly equal to the VC dimension. The more parameters, i.e. the more complex the model, the more likely the algorithm is to find the hypothesis g with the smallest E_in, but this needs the support of a large number of training samples, because only when the number of training samples N is large enough can a more complex model (more parameters, i.e. a larger VC dimension) still keep the probability of the bad event small, as in the right column of the table.

7.4 Interpreting VC Dimension


The VC bound is shown in Equation 7-12.

P\left[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,\right] \le 4\,(2N)^{d_{VC}}\, e^{-\frac{1}{8}\epsilon^2 N}    (Equation 7-12)

Denote the right-hand side of the inequality by δ. Then the probability of the good event |E_in(g) - E_out(g)| ≤ ε is at least 1 - δ, and the closeness ε can be expressed in terms of the quantities that appear in δ by solving δ = 4(2N)^{d_VC} e^{-ε²N/8} for ε, as shown in Equation 7-13.

\epsilon = \sqrt{\frac{8}{N}\ln\frac{4\,(2N)^{d_{VC}}}{\delta}}    (Equation 7-13)

This degree of closeness |E_out(g) - E_in(g)| is known as the generalization error (generalization error); by the formula above, with probability at least 1 - δ it is no more than \sqrt{\frac{8}{N}\ln\frac{4(2N)^{d_{VC}}}{\delta}}.

The range of E_out(g) can therefore be expressed as in Equation 7-14.

E_{in}(g) - \sqrt{\frac{8}{N}\ln\frac{4(2N)^{d_{VC}}}{\delta}} \le E_{out}(g) \le E_{in}(g) + \sqrt{\frac{8}{N}\ln\frac{4(2N)^{d_{VC}}}{\delta}}    (Equation 7-14)

The leftmost expression is generally of little concern; the emphasis is on the right-hand side, which gives the upper bound on the out-of-sample error E_out(g).

The square-root term is also written as a function Ω(N, H, δ), called the model complexity.
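A minimal sketch of these formulas using only the standard library (the function names are mine), computing the penalty Ω(N, H, δ) = sqrt((8/N) ln(4(2N)^{d_VC}/δ)) of Equation 7-13 and the resulting upper bound on E_out from Equation 7-14 for a hypothetical in-sample error.

```python
import math

def model_complexity(N, d_vc, delta):
    """Omega(N, H, delta): the square-root penalty term of Equation 7-14."""
    return math.sqrt(8.0 / N * math.log(4.0 * (2.0 * N) ** d_vc / delta))

def e_out_upper_bound(e_in, N, d_vc, delta):
    """Upper bound of Equation 7-14: E_out <= E_in + Omega(N, H, delta)."""
    return e_in + model_complexity(N, d_vc, delta)

if __name__ == "__main__":
    # Hypothetical setting: d_vc = 3 (2-D perceptron), E_in = 0.05, delta = 0.1.
    for N in [1_000, 10_000, 100_000]:
        omega = model_complexity(N, d_vc=3, delta=0.1)
        bound = e_out_upper_bound(0.05, N, d_vc=3, delta=0.1)
        print(f"N = {N:>7,d}  Omega = {omega:.3f}  E_out <= {bound:.3f}")
```

As N grows, the penalty Ω shrinks and the guaranteed upper bound on E_out moves closer to E_in.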

A graph shows what important information the VC dimension conveys, as in Figure 7-5.

Figure 7-5 The relationship between the errors and the VC dimension

The blue curve shows how E_in changes with the VC dimension, the red curve shows how the model complexity Ω changes, and the purple curve shows how E_out changes with the VC dimension; E_out can be expressed as the sum E_in + Ω.

That E_in decreases as d_VC increases is not hard to understand: a larger d_VC means a larger hypothesis space to choose from, so a smaller E_in may be found. That the model complexity Ω increases as d_VC increases is also not hard to understand, and can be read directly off its expression. Because E_out is the sum of the previous two, it first decreases and then increases, so there is a value d_VC* that minimizes E_out; making E_out smallest is the ultimate goal of learning, so finding this d_VC* looks very important.

Besides the model complexity, the VC dimension can also characterize the complexity of the sample (sample complexity).

Suppose the given requirement is ε = 0.1, δ = 0.1 and d_VC = 3; the question is how large the number of input samples N must be to satisfy these conditions. Substituting values of N of various orders of magnitude into the formula and comparing them gives Figure 7-6.

Figure 7-6 The value of the VC bound for sample sizes N of different orders of magnitude
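A small sketch that reproduces this calculation (assuming the requirement ε = 0.1, δ = 0.1 and d_VC = 3 stated above; the function name is mine): it scans for the smallest N at which the VC bound 4(2N)^{d_VC} e^{-ε²N/8} drops below δ.

```python
import math

def vc_bound(N, d_vc, epsilon):
    """The right-hand side of Equation 7-12: 4 * (2N)^d_vc * exp(-epsilon^2 * N / 8)."""
    return 4.0 * (2.0 * N) ** d_vc * math.exp(-(epsilon ** 2) * N / 8.0)

if __name__ == "__main__":
    d_vc, epsilon, delta = 3, 0.1, 0.1      # the requirement discussed in the text
    N = 100
    while vc_bound(N, d_vc, epsilon) > delta:
        N += 100                            # coarse scan in steps of 100
    print(f"the VC bound first drops below delta = {delta} near N = {N:,d}")
    # Prints roughly N = 29,300, i.e. on the order of 10,000 * d_vc.
```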

It can be concluded that only when N is roughly 29,000 can a model satisfying the requirement be trained in theory, which is a very large number, about 10,000 times d_VC; that is, the data complexity has a theoretical relationship with the VC dimension. In practical applications, however, the multiple needed is much smaller than this, roughly 10 times d_VC. The reason for this phenomenon is that the VC bound itself is a very loose bound. The looseness comes mainly from the following four sources:

    1. Hoeffding's inequality does not need to know the unknown E_out, so the VC bound can be used for any distribution and any target function;
    2. using the growth function m_H(N) in place of the true number of dichotomies is itself a loose upper bound, so the VC bound can be used for any data sample;
    3. using the polynomial N^{d_VC} as the upper bound of the growth function makes the bound looser still, so the VC bound can be used for any hypothesis space with the same VC dimension;
    4. the union bound (union bound) assumes the worst case, in which the algorithm always happens to pick a hypothesis whose bad event occurs, so the VC bound can be used with any algorithm.

In fact, it is hard to find a bound that keeps all four of these kinds of arbitrariness and is still tighter than the VC bound. The key to understanding the VC bound is not whether it is loose or tight, but the philosophical insight it gives us.
