Foundations of machine learning: Rademacher complexity and VC-dimension (2)
(1) Growth Function
Before introducing the growth function, let's introduce an example which will help you understand the growth function.
When the input space is $\mathbb{R}$, suppose the hypothesis set consists of threshold functions: a point $x$ is labeled positive when $x > v$. Figure 1 shows six such hypotheses.
Figure 1: Threshold function example
Obviously, this hypothesis set is infinite. However, it is easy to see that for a sample of size $m$, these hypotheses fall into $m+1$ classes, and all hypotheses within a class have the same empirical error on that sample. We can therefore pick one hypothesis from each class as a representative, so there are only $m+1$ effective hypotheses. This means the $|H|$ in the previous article can be replaced by $m+1$.
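As a quick numerical sanity check (a minimal sketch with made-up helper names and an arbitrary fixed sample, not part of the original argument), sweeping a dense grid of thresholds over a sample of $m = 6$ points produces exactly $m+1 = 7$ distinct labelings:

```python
import numpy as np

def threshold_labels(xs, v):
    """The threshold hypothesis h_v: label a point +1 if x > v, else -1."""
    return tuple(1 if x > v else -1 for x in xs)

xs = np.array([0.11, 0.25, 0.38, 0.52, 0.71, 0.90])   # a sample of m = 6 points

# Infinitely many thresholds v, but only m + 1 distinct labelings of this sample.
labelings = {threshold_labels(xs, v) for v in np.linspace(-0.5, 1.5, 2001)}
print(len(labelings))   # 7
```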
Unfortunately, this is a specific example, and the relationship between the effective hypotheses and the sample does not generalize easily. Looked at another way, though: can we find a way to partition an infinite hypothesis set into finitely many classes, that is, use a finite effective hypothesis set to stand in for the infinite one? This is exactly what the growth function does.
First, define the set of dichotomies:
$\Pi_H(S) = \{(h(x_1), \ldots, h(x_m)) : h \in H\}$
Each $h$ classifies the sample into one dichotomy; the set of all such results is $\Pi_H(S)$. Obviously $|\Pi_H(S)| \leq 2^m$.
Definition 2.3 (Growth Function): The growth function $\Pi_H: \mathbb{N} \rightarrow \mathbb{N}$ of a hypothesis set $H$ is defined as:
$\forall m \in \mathbb{N}, \quad \Pi_H(m) = \max_{S \in \mathcal{X}^m} |\Pi_H(S)|$
$\Pi_H(m)$ is the maximum number of distinct ways in which any $m$ points can be classified using hypotheses in $H$. The growth function provides another way to measure the complexity of the hypothesis set $H$; unlike the Rademacher complexity, it does not depend on the distribution that generates the samples.
First, let us introduce Hoeffding's lemma.
Hoeffding's lemma: Let $X$ be a random variable with $a \leq X \leq b$ and expectation $E[X] = 0$. Then, for all $t > 0$, the following inequality holds:
$E[\exp(tX)] \leq \exp\left(\frac{t^2(b-a)^2}{8}\right)$
Theorem 2.3 (Massart's lemma): Let $A \subset \mathbb{R}^m$ be a finite set and write $r = \max_{x \in A} \|x\|_2$. Then the following inequality holds:
$\mathop{E}_{\sigma}\left[\frac{1}{m} \sup_{x \in A} \sum_{i=1}^m \sigma_i x_i\right] \leq \frac{r\sqrt{2\log|A|}}{m}.$
Here, the $\sigma_i$ are independent uniform random variables taking values in $\{+1, -1\}$, and $x_1, \ldots, x_m$ are the components of the vector $x$.
Proof: For any $t > 0$, using Jensen's inequality we obtain:
\begin{align*}
\exp\left(t \mathop{E}_\sigma\left[\sup_{x \in A} \sum_{i=1}^m \sigma_i x_i\right]\right) &\leq \mathop{E}_\sigma\left[\exp\left(t \sup_{x \in A} \sum_{i=1}^m \sigma_i x_i\right)\right] \\
&= \mathop{E}_\sigma\left[\sup_{x \in A} \exp\left(\sum_{i=1}^m t\sigma_i x_i\right)\right] \\
&\leq \sum_{x \in A} \mathop{E}_\sigma\left[\exp\left(\sum_{i=1}^m t\sigma_i x_i\right)\right]
\end{align*}
By the independence of the $\sigma_i$, Hoeffding's lemma can be applied:
\begin{align*}
\exp\left(t \mathop{E}_\sigma\left[\sup_{x \in A} \sum_{i=1}^m \sigma_i x_i\right]\right) &\leq \sum_{x \in A} \prod_{i=1}^m \mathop{E}_{\sigma_i}\left[\exp(t\sigma_i x_i)\right] \\
&\leq \sum_{x \in A} \prod_{i=1}^m \exp\left[\frac{t^2(2x_i)^2}{8}\right] \\
&= \sum_{x \in A} \exp\left[\frac{t^2}{2} \sum_{i=1}^m x_i^2\right] \\
&\leq \sum_{x \in A} \exp\left[\frac{t^2 r^2}{2}\right] = |A|\, e^{\frac{t^2 r^2}{2}}
\end{align*}
Taking $\log$ on both sides and dividing by $t$, we get:
$\mathop{E}_{\sigma}\left[\sup_{x \in A} \sum_{i=1}^m \sigma_i x_i\right] \leq \frac{\log|A|}{t} + \frac{t r^2}{2}$
Choosing $t = \frac{\sqrt{2\log|A|}}{r} > 0$ minimizes the right-hand side, which gives:
$\mathop{E}_{\sigma}\left[\sup_{x \in A} \sum_{i=1}^m \sigma_i x_i\right] \leq r\sqrt{2\log|A|}$
Finally, dividing by $m$ yields the formula in the theorem. This completes the proof.
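Before moving on, here is a small Monte Carlo sketch of Massart's lemma (illustrative only; the finite set $A$, its size, and the number of trials are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_points, n_trials = 20, 50, 20_000

A = rng.normal(size=(n_points, m))            # a finite set A of vectors in R^m
r = np.linalg.norm(A, axis=1).max()           # r = max_{x in A} ||x||_2

sigma = rng.choice([-1.0, 1.0], size=(n_trials, m))   # Rademacher draws
# Empirical E_sigma[(1/m) sup_{x in A} sum_i sigma_i x_i] versus the bound.
lhs = np.max(sigma @ A.T, axis=1).mean() / m
rhs = r * np.sqrt(2 * np.log(n_points)) / m
print(f"empirical {lhs:.4f} <= bound {rhs:.4f}")
```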
With Massart's lemma in hand, the Rademacher complexity can be bounded in terms of the growth function.
Corollary 2.1: If $G$ is a family of functions taking values in $\{+1, -1\}$, the following inequality holds:
$\mathfrak{R}_m(G) \leq \sqrt{\frac{2\log \Pi_G(m)}{m}}.$
Proof: For a fixed sample $S = (x_1, \ldots, x_m)$, define $G_{|S}$ as
$G_{|S} = \{(g(x_1), \ldots, g(x_m))^T : g \in G\}$
Since $g(x)$ takes values in $\{-1, +1\}$, we have $\|u\|_2 \leq \sqrt{m}$ for all $u \in G_{|S}$. By the definition of $G_{|S}$, we see that $G_{|S} = \Pi_G(S)$, and therefore $|G_{|S}| \leq \Pi_G(m)$.
Applying Massart's lemma therefore gives:
\begin{align*}
\mathfrak{R}_m(G) &= \mathop{E}_S\left[\mathop{E}_\sigma\left[\sup_{u \in G_{|S}} \frac{1}{m} \sum_{i=1}^m \sigma_i u_i\right]\right] \\
&\leq \mathop{E}_S\left[\frac{\sqrt{m}\sqrt{2\log |G_{|S}|}}{m}\right] \\
&\leq \mathop{E}_S\left[\frac{\sqrt{m}\sqrt{2\log \Pi_G(m)}}{m}\right] \leq \sqrt{\frac{2\log \Pi_G(m)}{m}}
\end{align*}
This completes the proof.
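For example, for the threshold class at the beginning of this section we found $\Pi_G(m) = m+1$, so Corollary 2.1 gives $\mathfrak{R}_m(G) \leq \sqrt{2\log(m+1)/m}$. Below is a rough Monte Carlo estimate of the empirical Rademacher complexity on one sample (a sketch under an assumed setup, not part of the original text):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n_trials = 30, 5_000

xs = np.sort(rng.uniform(size=m))
# The m + 1 effective threshold hypotheses on this sample: one cut per gap.
cuts = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1.0]))
labelings = np.array([[1.0 if x > v else -1.0 for x in xs] for v in cuts])  # (m+1, m)

sigma = rng.choice([-1.0, 1.0], size=(n_trials, m))
emp_rad = np.max(sigma @ labelings.T, axis=1).mean() / m   # sup over the m+1 labelings
bound = np.sqrt(2 * np.log(m + 1) / m)
print(f"empirical ~ {emp_rad:.3f} <= bound {bound:.3f}")
```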
Combining this with Theorem 2.2 yields a generalization bound expressed via the growth function:
Corollary 2.2 (Growth function generalization bound): Let $H$ be a family of functions taking values in $\{-1, +1\}$. Then, for any $\delta > 0$, with probability at least $1-\delta$, the following inequality holds for all $h \in H$:
\begin{align}
\label{equ:9}
R(h) \leq \widehat{R}(h) + \sqrt{\frac{2\log \Pi_H(m)}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}
\end{align}
In addition, a growth-function bound can also be obtained without going through the Rademacher complexity, namely:
$\mathop{Pr}\left[\left|R(h) - \widehat{R}(h)\right| > \epsilon\right] \leq 4\,\Pi_H(2m)\exp\left(-\frac{m\epsilon^2}{8}\right)$
This inequality differs from formula \ref{equ:9} only by constants.
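For concreteness, here is a quick evaluation of this bound with assumed values (the threshold class, for which $\Pi_H(2m) = 2m + 1$):

```python
import numpy as np

m, eps = 10_000, 0.1
growth_2m = 2 * m + 1                           # Pi_H(2m) for the threshold class
bound = 4 * growth_2m * np.exp(-m * eps**2 / 8)
print(bound)   # about 0.30: with 10,000 samples, a deviation above 0.1 is unlikely
```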
(2) VC Dimension (VC-dimension)
Before introducing the VC dimension, we need to understand two concepts: dichotomy, which we already encountered when introducing the growth function, and shattering.
A "dichotomy" is the following: given a sample $S$ and a hypothesis $h$, the labeling of $S$ produced by $h$ is called a dichotomy. A hypothesis set $H$ can therefore generate several different dichotomies, and these dichotomies together form the dichotomy set $\Pi_H(S)$ of $H$ on the sample $S$.
"Shattering" is also defined with respect to a hypothesis set and a sample. We say that "a sample $S$ is shattered by the hypothesis set $H$" if, using only hypotheses from $H$, all possible dichotomies of $S$ can be realized, that is, $|\Pi_H(S)| = 2^m$.
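A brute-force way to check shattering for small samples is sketched below (a toy illustration: the finite grid of interval endpoints is an arbitrary choice that merely stands in for the full hypothesis class):

```python
from itertools import product

def is_shattered(points, hypotheses):
    """True if the hypotheses realize all 2^m labelings of the given points."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

# Interval hypotheses h_{a,b}(x) = +1 iff a <= x <= b, over a coarse grid of endpoints.
grid = [i / 10 for i in range(-10, 21)]
intervals = [lambda x, a=a, b=b: 1 if a <= x <= b else -1
             for a, b in product(grid, repeat=2) if a <= b]

print(is_shattered([0.2, 0.5], intervals))        # True: two points can be shattered
print(is_shattered([0.2, 0.5, 0.8], intervals))   # False: the labeling (+, -, +) is impossible
```

Running the same check over candidate samples of a given size is essentially how the VC-dimension examples listed below can be verified for small cases.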
Now we can define the VC dimension. The VC dimension of a hypothesis set $H$ is the size of the largest sample that can be shattered by $H$. The more precise definition is as follows:
Definition 2.4 (VC-Dimension): The VC dimension of a hypothesis set $H$ is the size of the largest sample set that can be shattered by $H$:
$\mathop{VCdim}(H) = \max\{m : \Pi_H(m) = 2^m\}.$
Note the following two points: if the hypothesis set $H$ has VC dimension $d$, this means
- there exists some sample of size $d$ that can be shattered (not that every sample of size $d$ can be shattered);
- no sample of size $d+1$ can be shattered (that is, every sample of size $d+1$ fails to be shattered).
Some examples (proofs can be found in my previous article: http://www.cnblogs.com/boostable/p/iage_VC_dimension.html):
- When $S$ is the plane and $H$ is the set of axis-aligned rectangles, $VCdim(H) = 4$.
- When $S$ is the real line ($x$-axis) and $H$ is the set of intervals, $VCdim(H) = 2$.
- When $S$ is the circle (points on the circumference) and $H$ is the set of convex sets, $VCdim(H) = \infty$.
- When $S$ is $k$-dimensional space and $H$ is the set of half-spaces, $VCdim(H) = k+1$.
Next we prove a theorem that relates the growth function to the VC dimension.
Theorem 2.4 (Sauer's Lemma): Let $H$ be a hypothesis set with $\mathop{VCdim}(H) = d$. Then, for all $m \in \mathbb{N}$, the following inequality holds:
$\Pi_H(m) \leq \sum_{i=0}^d \binom{m}{i}.$
Proof: First, note that $m$ takes the values $1, 2, \ldots$ and $d$ takes the values $0, 1, 2, \ldots$
- When $m = d$: $\Pi_H(m) = 2^d$, while
$\sum_{i=0}^d \binom{m}{i} = \sum_{i=0}^d \binom{d}{i} = 2^d$
so the inequality holds.
- When $m < d$: $\Pi_H(m) = 2^m$, while
$\sum_{i=0}^d \binom{m}{i} = \sum_{i=0}^m \binom{m}{i} + \sum_{i=m+1}^d \binom{m}{i} = 2^m$
(the second sum is zero because $\binom{m}{i} = 0$ for $i > m$), so the inequality holds.
- When $m > d$: the claim is proved by induction.
- When $d = 0$, the claim holds for all $m$, because $d = 0$ implies $\Pi_H(m) = 2^0 = 1$, while $\sum_{i=0}^d \binom{m}{i} = 1$.
- Assume the claim holds for $(m, d-1)$, that is, $\Pi_H(m) \leq \sum_{i=0}^{d-1} \binom{m}{i}$; then
$\sum_{i=0}^{d-1} \binom{m}{i} \leq \sum_{i=0}^{d-1} \binom{m}{i} + \binom{m}{d} = \sum_{i=0}^{d} \binom{m}{i}$
so the claim also holds for $(m, d)$.
See Figure 2. This completes the proof.
Figure 2: Sauer's lemma example
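As a worked example (not in the original text): for the interval class ($d = 2$), the dichotomies of $m$ distinct points are exactly the contiguous runs of positive labels plus the all-negative labeling, so $\Pi_H(m) = \binom{m+1}{2} + 1 = \frac{m(m+1)}{2} + 1$, which equals $\sum_{i=0}^{2}\binom{m}{i} = 1 + m + \frac{m(m-1)}{2}$; for intervals, Sauer's bound is therefore attained with equality.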
Corollary 2.3: Let $H$ be a hypothesis set with $\mathop{VCdim}(H) = d$. Then, for all $m \geq d$:
$\Pi_H(m) \leq \left(\frac{em}{d}\right)^d = O(m^d)$
Proof:
\begin{align*}
\Pi_H(m) &\leq \sum_{i=0}^d \binom{m}{i} \\
&\leq \sum_{i=0}^d \binom{m}{i} \left(\frac{m}{d}\right)^{d-i} \\
&\leq \sum_{i=0}^m \binom{m}{i} \left(\frac{m}{d}\right)^{d-i} \\
&= \left(\frac{m}{d}\right)^d \sum_{i=0}^m \binom{m}{i} \left(\frac{d}{m}\right)^i \\
&= \left(\frac{m}{d}\right)^d \left(1 + \frac{d}{m}\right)^m \leq \left(\frac{m}{d}\right)^d e^d
\end{align*}
The last inequality holds because $(1-x) \leq e^{-x}$ for all real $x$ (equivalently, $1 + x \leq e^x$). This completes the proof.
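A quick numerical comparison of the two sides of Corollary 2.3 (a sketch with arbitrary values of $m$ and $d$):

```python
from math import comb, e

d = 3
for m in (5, 10, 50, 200):
    sauer = sum(comb(m, i) for i in range(d + 1))   # Sauer's bound
    poly = (e * m / d) ** d                          # polynomial upper bound
    print(f"m={m:4d}  sum_i C(m,i) = {sauer:9d}  (em/d)^d = {poly:12.1f}")
```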
Therefore, we can replace $\Pi_H(m)$ in Corollary 2.2 with its upper bound $\left(\frac{em}{d}\right)^d$ and obtain the following corollary:
Corollary 2.4 (VC-dimension generalization bound): Let $H$ be a family of functions taking values in $\{+1, -1\}$ with VC dimension $d$. Then, for any $\delta > 0$, with probability at least $1-\delta$, the following inequality holds for all $h \in H$:
$R(h) \leq \widehat{R}(h) + \sqrt{\frac{2d\log\frac{em}{d}}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}$
This can be written more simply as
$R(h) \leq \widehat{R}(h) + O\left(\sqrt{\frac{\log(m/d)}{m/d}}\right)$
That is, the upper bound is governed by the ratio $m/d$: the larger $m$ is, the smaller the bound; the larger $d$ is, the more complex $H$ is and the larger the bound.
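To make the dependence on $m/d$ concrete, here is a small sketch evaluating the complexity and confidence terms of Corollary 2.4 for a few sample sizes ($d$ and $\delta$ are fixed to arbitrary illustrative values):

```python
from math import log, sqrt, e

def vc_bound_gap(m, d, delta):
    """The two additive terms of Corollary 2.4, i.e. an upper bound on R(h) - R_hat(h)."""
    return sqrt(2 * d * log(e * m / d) / m) + sqrt(log(1 / delta) / (2 * m))

d, delta = 10, 0.05
for m in (100, 1_000, 10_000, 100_000):
    print(f"m = {m:6d}   gap <= {vc_bound_gap(m, d, delta):.3f}")
```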