1. Hoeffding's Inequality
Imagine a jar containing a large number of small balls of two colors, {orange, green}. Randomly grab n balls from the jar. Let μ be the (unknown) fraction of orange balls in the jar, and ν the (known) fraction of orange balls in the sample. According to Hoeffding's inequality from probability theory, if n is large enough, ν is very likely to be close to μ.
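A quick way to see this is to simulate it. The sketch below (a minimal illustration; the values of μ, n, and ε are made up, not from the original text) draws many samples of n balls and compares the observed frequency of |ν − μ| > ε with the Hoeffding bound 2·exp(−2ε²n).

```python
import math
import random

def sample_deviation_rate(mu=0.4, n=500, eps=0.05, trials=20000, seed=0):
    """Estimate how often the sample fraction nu deviates from mu by more than eps,
    and compare it with the Hoeffding bound 2*exp(-2*eps**2*n).
    All parameter values here are illustrative."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        # Draw n balls; each is orange with probability mu.
        orange = sum(rng.random() < mu for _ in range(n))
        nu = orange / n
        if abs(nu - mu) > eps:
            bad += 1
    empirical = bad / trials
    bound = 2 * math.exp(-2 * eps ** 2 * n)
    return empirical, bound

if __name__ == "__main__":
    empirical, bound = sample_deviation_rate()
    print(f"empirical P[|nu - mu| > eps] ~ {empirical:.4f}")
    print(f"Hoeffding bound              = {bound:.4f}")
```

The empirical deviation rate comes out well below the bound, as expected: Hoeffding's inequality holds for any μ, so it is usually loose for a specific one.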
Similarly, in machine learning: if N is large enough, we can use the error rate of h on the data set D, i.e. the fraction of points in D where h(x) ≠ f(x), to infer the error rate of h on the whole input space χ. That is, if the sample is large enough, the fraction of errors a candidate function h makes on D is close to the fraction of errors it makes on all of χ. Write the error rate of h on D as E_in(h) (in-sample error) and the error rate on the whole input space as E_out(h) (out-of-sample error); Hoeffding's inequality then reads:

P[ |E_in(h) − E_out(h)| > ε ] ≤ 2·exp(−2ε²N)
By means of this formula, we can judge how well a candidate function h generalizes from its performance on D, and finally select the best h from the candidate set H as g, with g ≈ f.
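To get a feel for the scale of this bound, here is a quick plug-in example (N = 1000 and ε = 0.05 are made-up values, not from the original text). For a single fixed h:

P[ |E_in(h) − E_out(h)| > 0.05 ] ≤ 2·exp(−2 × 0.05² × 1000) = 2·e^(−5) ≈ 1.35%

So on a sample of 1000 points, a gap of more than 5% between E_in(h) and E_out(h) for that one h happens with probability at most about 1.35%.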
2. Real Machine Learning
To give an example: if 150 people each toss a fair coin 5 times, the probability that at least one person gets heads all 5 times is 1 − (31/32)^150 ≈ 99.15%. So if a low-probability event is repeated enough times, the probability that it happens at least once becomes very large.
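The arithmetic is easy to check directly (a minimal sketch using only the numbers from the example above):

```python
# Probability that one person gets 5 heads in 5 fair tosses.
p_all_heads = 1 / 2 ** 5             # = 1/32
# Probability that none of the 150 people gets 5 heads.
p_nobody = (1 - p_all_heads) ** 150  # = (31/32)**150
# Probability that at least one person gets 5 heads.
p_at_least_one = 1 - p_nobody
print(f"{p_at_least_one:.4f}")       # ~0.9915, i.e. about 99.15%
```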
Similarly, the following scenario is possible: the learning algorithm A diligently searches the candidate set H (which contains many h), suddenly finds some h_i that makes no mistakes on D, or only a few, and happily shouts: "I have found g, it is this h_i!" But in fact this h_i makes many mistakes on χ, i.e. E_in(h_i) and E_out(h_i) are far apart. For this h_i, D is a BAD sample. From χ we may draw many samples D_i, i = 1, 2, 3, ..., and for a given h some of them will be BAD samples: E_out(h) is large (h is far from f on χ), yet E_in(h) is small (h is correct on most points of D). This effect is easy to reproduce with a tiny simulation, as shown below.
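In this sketch (parameters are made up for illustration, analogous to the coin example), every candidate h labels points at random, so its true error E_out(h) is about 0.5; yet with thousands of candidates, one of them will usually fit a small D perfectly.

```python
import random

def best_in_sample_error(num_hypotheses=10000, sample_size=10, seed=1):
    """Illustrative sketch: the target f and every candidate h label points at
    random, so every h has E_out(h) ~ 0.5. Among many candidates, the one that
    happens to fit the small sample D best can still have E_in(h) = 0, which
    means D is a BAD sample for that h."""
    rng = random.Random(seed)
    # Labels of the target f on the sample D (all we need to compute E_in).
    f_on_D = [rng.randrange(2) for _ in range(sample_size)]

    best_ein = 1.0
    for _ in range(num_hypotheses):
        # A candidate h that guesses each label at random: E_out(h) = 0.5.
        errors = sum(rng.randrange(2) != y for y in f_on_D)
        best_ein = min(best_ein, errors / sample_size)

    print("E_out of every candidate ~ 0.5")
    print(f"smallest E_in among {num_hypotheses} candidates: {best_ein:.2f}")

if __name__ == "__main__":
    best_in_sample_error()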
For an arbitrary sample D and a given H:

D is BAD data for many h ⇔ there is no 'freedom of choice' by A ⇔ there exists some h such that E_out(h) and E_in(h) are far apart.
The following four propositions are equivalent over the entire candidate set H (with M elements):

- D is a BAD sample for H;
- D is a BAD sample for some h in H;
- the learning algorithm A cannot freely choose among the h in H;
- there exists some h such that E_in(h) is far from E_out(h).
According to the table above, training data sets such as D_1126 are relatively high quality: they are not a BAD sample for any h in H.
Given any D, the probability that it is a BAD sample for some h in H can be bounded by applying Hoeffding's inequality to each of the M candidates and taking the union bound:

P[D is BAD for H] = P[BAD for h_1 or BAD for h_2 or ... or BAD for h_M]
                  ≤ P[BAD for h_1] + P[BAD for h_2] + ... + P[BAD for h_M]
                  ≤ 2·exp(−2ε²N) + 2·exp(−2ε²N) + ... + 2·exp(−2ε²N)
                  = 2M·exp(−2ε²N)
The fewer candidate functions there are in H, and the larger the sample size N, the smaller the probability that D is a BAD sample. At an acceptable probability level, the learning algorithm A then only needs to pick the best-performing h on D as g. This argument requires that the number of hypotheses M in H be finite.
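A small sketch of how this bound trades off M against N (the values of M, N, and ε below are made up for illustration; a bound above 1 is vacuous):

```python
import math

def bad_sample_bound(M, N, eps):
    """Union bound on P[D is a BAD sample for some h in H]: 2*M*exp(-2*eps**2*N)."""
    return 2 * M * math.exp(-2 * eps ** 2 * N)

if __name__ == "__main__":
    eps = 0.1
    for M in (1, 100, 10000):
        for N in (100, 1000, 10000):
            print(f"M={M:>6}, N={N:>6}: bound = {bad_sample_bound(M, N, eps):.3g}")
```

For a fixed ε, growing M inflates the bound linearly, while growing N shrinks it exponentially, which is why a finite H and a large enough sample make learning feasible.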