The fourth lecture introduces the question of whether machine learning is feasible.
1. From the given data D it is feasible to find a hypothesis g close to the target f, for example with PLA. However, it is hard to say whether the g found this way can be used on inputs outside D.
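As a rough illustration of point 1, here is a minimal PLA sketch. The 2-D inputs, the bias column, and the hypothetical linear target x1 + x2 = 0.5 are assumptions made only for this demo: PLA drives the in-sample error to zero, but the data alone says nothing about errors outside D.

```python
import numpy as np

rng = np.random.default_rng(0)

def target_f(X):
    # Hypothetical target: sign of x1 + x2 - 0.5 (never shown to the learner).
    return np.sign(X[:, 1] + X[:, 2] - 0.5)

# Training data D: N points in 2-D, with a constant 1 in column 0 for the bias.
N = 100
X = np.column_stack([np.ones(N), rng.uniform(-1, 1, size=(N, 2))])
y = target_f(X)

# PLA: pick any misclassified point and correct it, until none remain.
w = np.zeros(3)
while True:
    mistakes = np.sign(X @ w) != y
    if not mistakes.any():
        break                          # E_in(g) = 0 on D
    i = np.flatnonzero(mistakes)[0]
    w += y[i] * X[i]                   # update: w <- w + y_n * x_n

# g = sign(w . x) fits D perfectly, but D alone cannot tell us how g behaves
# on inputs outside D -- that is the question Hoeffding's inequality addresses.
print("in-sample error of g:", np.mean(np.sign(X @ w) != y))
```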
2. Hoeffding's inequality answers whether g can be used on inputs beyond D:
(1) In probability theory, Hoeffding's inequality provides an upper bound on the probability that the sum of random variables deviates from its expected value.
(2) Think of the space of all possible inputs x as a jar, where each ball represents one input point x. For a chosen hypothesis h and the target f, paint x orange if h(x) ≠ f(x) and green if h(x) = f(x). The jar contains far too many balls to compute the fraction of orange balls directly, so draw N balls from the jar as a sample and use the orange fraction in the sample to estimate the orange fraction in the whole jar. Hoeffding's inequality says that when N is large enough, the sample's orange fraction is very likely to be close to the jar's orange fraction (see the simulation sketch after item (3)).
(3) For a given h, let E_in(h) be the error rate of h on the sample and E_out(h) its error rate over the entire input space. By Hoeffding's inequality, P[|E_in(h) − E_out(h)| > ε] ≤ 2·exp(−2ε²N). So E_out(h) never needs to be known: when E_in(h) ≈ E_out(h) and E_in(h) is very small, E_out(h) is also very small, and h is probably very close to f.
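A small simulation sketch of items (2) and (3), with assumed values for the orange fraction μ, the tolerance ε, and the sample size N: it estimates how often the sample fraction ν (standing in for E_in(h)) deviates from the jar fraction μ (standing in for E_out(h)) by more than ε, and compares that frequency with the Hoeffding bound 2·exp(−2ε²N).

```python
import numpy as np

# Jar analogy with assumed numbers: a fraction mu of the balls in the jar are
# orange (i.e. h(x) != f(x)); mu plays the role of E_out(h).
rng = np.random.default_rng(1)
mu, eps, N, trials = 0.4, 0.05, 1000, 100_000

# Draw `trials` samples of N balls each; nu, the orange fraction in a sample,
# plays the role of E_in(h).
nu = rng.binomial(N, mu, size=trials) / N

# Empirical frequency of the bad event |nu - mu| > eps, versus the
# Hoeffding bound 2 * exp(-2 * eps^2 * N).
empirical = np.mean(np.abs(nu - mu) > eps)
bound = 2 * np.exp(-2 * eps**2 * N)
print(f"P[|nu - mu| > {eps}] ~= {empirical:.4f}  <=  bound {bound:.4f}")
```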
3. The above gives a way to verify that one particular h is close to f, but that is still not learning. Real learning means making a choice from a set of hypotheses, not returning the same h every time. PLA, for example, learns different lines from different data rather than always producing the same line. An algorithm that always outputs the same h is probably useless and does not really learn.
4. When there are many hypotheses, imagine that each different h colors the balls in the jar in its own way:
It is tempting to choose the hypothesis h whose E_in is smallest, but a very small E_in may be pure coincidence. Example: if one coin is tossed 5 times, the probability of getting 5 heads is very small; but if 50 coins are each tossed 5 times, the probability that some coin comes up heads all 5 times is quite large (see the simulation sketch after this point). Hoeffding's inequality says that E_in and E_out differ very little when there is only a single h. Call it a BAD event when E_in and E_out differ greatly: a data set D is BAD for a hypothesis h if it makes |E_in(h) − E_out(h)| large. By Hoeffding's inequality, for a fixed h the probability of drawing BAD data is at most 2·exp(−2ε²N). A data set is considered BAD for the whole hypothesis set if it is BAD for at least one hypothesis in it.
From the above, for a hypothesis set of finite size M, the union bound gives P[D is BAD for the set] ≤ P[BAD for h_1] + … + P[BAD for h_M] ≤ 2M·exp(−2ε²N). So the probability that the data is BAD still has an upper bound, and as long as N is large enough, E_in is approximately equal to E_out for every hypothesis in the set. If algorithm A can then find a hypothesis with small E_in, we can say that the machine has learned something.
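A small simulation sketch of the coin example and the finite-hypothesis-set bound in point 4; the numbers for the coins (5 tosses, 50 coins) are from the notes, while the values of M, ε, and N in the last part are assumptions chosen only to show how the bound shrinks as N grows.

```python
import numpy as np

rng = np.random.default_rng(2)
trials = 20_000

# One fair coin, 5 tosses: P[5 heads] = (1/2)^5 ~ 0.03, a rare event.
one_coin = np.mean(rng.integers(0, 2, size=(trials, 5)).sum(axis=1) == 5)

# 50 fair coins, 5 tosses each: P[some coin shows 5 heads] = 1 - (31/32)^50 ~ 0.79.
# With many "hypotheses", a perfect-looking E_in is no longer surprising.
heads = rng.integers(0, 2, size=(trials, 50, 5)).sum(axis=2)
many_coins = np.mean((heads == 5).any(axis=1))

print(f"P[5 heads, one coin]        ~= {one_coin:.3f}")
print(f"P[5 heads, any of 50 coins] ~= {many_coins:.3f}")

# Union bound for a finite hypothesis set of size M:
# P[D is BAD for some h] <= 2 * M * exp(-2 * eps^2 * N), which -> 0 as N grows.
M, eps = 50, 0.1
for N in (100, 500, 1000):
    print(f"M={M}, eps={eps}, N={N}: bound = {2 * M * np.exp(-2 * eps**2 * N):.4g}")
```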