Coursera Machine Learning Foundations, Lecture 4: The Feasibility of Learning


This section discusses a fundamental question at the core of machine learning: the feasibility of learning. What measures whether a machine learning algorithm has actually learned something is not how the model behaves on the existing training data, but how it performs on data outside the training set (which we call test data). We call this performance the generalization ability, and our goal in machine learning is to look for models with high generalization ability. Some models work well on the training set, perhaps even with 100% accuracy, yet give poor results on the test set; such a model has poor generalization ability, and this phenomenon is called overfitting.

If we can get test data, we can use it to estimate the generalization ability of a model. The problem is that in many cases we have only one training set, and obtaining extra samples is very difficult; for example, a medical clinical diagnosis for a patient requires a great deal of human and financial resources. I will talk later about how to estimate the generalization ability of a model; in this section I would like to discuss a more interesting question: can we use the training error to estimate the generalization error?

Here we refer to the training error as the in-sample error, recorded as $E_{in}(h)=\frac{1}{N}\sum_{n=1}^{N}\mathbb{1}[h(x_n)\neq f(x_n)]$, where $N$ is the number of samples in the training set, $h(x)$ is the hypothesis, and $f(x)$ is the target function.

The error on data outside the training set is called the out-of-sample error, defined as $E_{out}(h)=\mathbb{E}_{x\sim P}[\mathbb{1}[h(x)\neq f(x)]]$.

Out-of-sample error is what we often call the expected loss.
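
To make the two definitions concrete, here is a minimal Python sketch. The target $f$, the hypothesis $h$, and the uniform input distribution are toy assumptions invented for illustration; in a real problem $f$ and $P$ are unknown, so $E_{out}$ could not be computed this way:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy target and hypothesis (assumed purely for illustration)
f = lambda x: np.sign(x)          # target function
h = lambda x: np.sign(x - 0.1)    # a hypothesis that sometimes disagrees with f

# training set: N points drawn i.i.d. from a uniform input distribution P
N = 100
x_train = rng.uniform(-1, 1, size=N)

# in-sample error: fraction of training points where h disagrees with f
e_in = np.mean(h(x_train) != f(x_train))

# E_out is an expectation over P; since we know f and P in this toy setup,
# we can approximate it with a large fresh i.i.d. sample
x_fresh = rng.uniform(-1, 1, size=1_000_000)
e_out = np.mean(h(x_fresh) != f(x_fresh))

print(f"E_in = {e_in:.4f}, E_out ~ {e_out:.4f}")
```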

In a machine learning problem, the target function $f$ and the distribution $P$ are generally unknown; that is, given only the training set, we cannot know $E_{out}$ (unless we add some assumptions). So how do we choose a model when there is no way to know the out-of-sample error? The easiest approach to think of is to choose the hypothesis whose training error is small. Intuitively this is easy to accept: a model that performs well on the training set should also perform well on data outside the training set. But is that really so?

Let's first take a look at an example, and then compare it with the learning problem.

Suppose we have a jar containing many orange and green marbles. If we want to know the proportion of orange marbles (recorded as $\mu$), how do we do that?

Of course, you will say, isn't this simple, just count them all? But the problem is, if the jar is big, say 10,000 marbles, would you really count them one by one? Obviously that is not feasible. Those of us who have studied statistics know there is another way: sampling. For example, draw 10 marbles, then use the proportion of orange marbles in this sample (recorded as $\nu$) as an estimate of the proportion of orange marbles in the whole jar. So does $\nu$ tell us anything useful about $\mu$?
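
A quick simulation of this sampling idea, with the jar size, the true fraction $\mu$, and the sample size all picked arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

mu = 0.6                        # true fraction of orange marbles (unknown in practice)
jar = rng.random(10_000) < mu   # True = orange, False = green

sample = rng.choice(jar, size=10, replace=False)   # draw 10 marbles
nu = sample.mean()              # orange fraction in the sample

print(f"mu (whole jar) = {jar.mean():.3f}, nu (sample) = {nu:.3f}")
```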

First of all, must $\nu$ equal $\mu$? Not necessarily, because it is possible that the marbles we draw are all green even though the jar contains orange ones. But we have great assurance that $\nu$ is very close to $\mu$. Mathematically, how close $\mu$ and $\nu$ are is governed by a well-known inequality, called Hoeffding's inequality:

$P[|\nu-\mu|>\epsilon]\le 2e^{-2\epsilon^2 N}$, where $\epsilon$ is the error tolerance and $N$ is the sample size. Hoeffding's inequality tells us the following fact: the larger the sample, the smaller the probability that $\nu$ and $\mu$ differ greatly. In other words, the statement "$\nu=\mu$" is probably right, because as the sample grows, the upper bound on the probability of a large deviation gets smaller; and it is approximately right, because we can shrink $\epsilon$ so that $\nu$ and $\mu$ are close. Mathematically, we call this property PAC (probably approximately correct). If $N$ is large, we can use $\nu$ to estimate $\mu$.
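
We can check the bound empirically by repeating the draw many times; a sketch, modeling draws with replacement, with $\mu$, $\epsilon$, and the sample sizes chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, eps, trials = 0.6, 0.1, 100_000

for N in (10, 100, 1000):
    # nu for each repeated draw: number of orange marbles / sample size
    nu = rng.binomial(N, mu, size=trials) / N
    p_bad = np.mean(np.abs(nu - mu) > eps)    # empirical P[|nu - mu| > eps]
    bound = 2 * np.exp(-2 * eps**2 * N)       # Hoeffding upper bound
    print(f"N={N:4d}: empirical {p_bad:.4f} <= bound {min(bound, 1):.4f}")
```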

After introducing Hoeffding's inequality, let's look at the relationship between this jar problem and the learning problem.

In the jar model, we do not know the proportion of orange marbles; the corresponding unknown in the learning problem is how close the hypothesis is to the target function over the whole input space. Each point $x\in\mathcal{X}$ in the input space corresponds to a marble in the jar: when $h(x)$ and $f(x)$ differ, we paint the marble orange; when they agree, we paint it green. The marbles drawn from the jar correspond to the training set $D$ in the learning problem, which is likewise sampled i.i.d. The goal in the jar model is to estimate the true proportion from the sample proportion; the goal in the learning problem is to estimate the out-of-sample error from the in-sample error.

With this analogy, we can draw a similar conclusion by applying the Hoeffding inequality:

For a fixed $h$, $P[|E_{in}(h)-E_{out}(h)|>\epsilon]\le 2e^{-2\epsilon^2 N}$: if $N$ is large, the probability that $E_{in}$ and $E_{out}$ differ greatly is very small, that is, the two are very close. As before, this bound holds for any $\epsilon$ and $N$ and requires no knowledge of $E_{out}$, so "$E_{in}=E_{out}$" is PAC. If $E_{in}$ is very small and $E_{in}\approx E_{out}$, then we have great assurance that $E_{out}$ is also very small, so that $h$ and $f$ are very similar (formally, "$h=f$" is PAC). Similarly, if $E_{in}$ is very large, then $E_{out}$ is probably also very large, and we say "$h\neq f$" is PAC. There is, however, an exceptional case: $E_{in}$ is very small while $E_{out}$ is very large, which is the overfitting we often speak of.
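
The same check can be run in the learning setting for a single fixed $h$, reusing the toy $f$, $h$, and input distribution from the earlier sketch (all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

f = lambda x: np.sign(x)          # toy target (assumed)
h = lambda x: np.sign(x - 0.1)    # one fixed hypothesis, chosen before seeing data

N, eps, trials = 200, 0.1, 50_000

# approximate E_out once with a large fresh sample (possible only in a toy setup)
x_big = rng.uniform(-1, 1, size=1_000_000)
e_out = np.mean(h(x_big) != f(x_big))

# draw many independent training sets of size N and compute E_in on each
x = rng.uniform(-1, 1, size=(trials, N))
e_in = np.mean(h(x) != f(x), axis=1)

p_bad = np.mean(np.abs(e_in - e_out) > eps)   # empirical deviation probability
bound = 2 * np.exp(-2 * eps**2 * N)           # Hoeffding upper bound
print(f"P[|E_in - E_out| > {eps}] ~ {p_bad:.5f} <= {bound:.4f}")
```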

PRML says that increasing the number of samples can reduce overfitting. I had never been able to understand the reason; now, having learned Hoeffding's inequality and the PAC framework, I understand a little. By increasing the number of samples $N$, we reduce the upper bound on the right-hand side of Hoeffding's inequality, increasing the probability that the in-sample error is close to the out-of-sample error, so that the estimate of the out-of-sample error obtained from the in-sample error becomes more accurate.
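
As a quick illustration of how fast the upper bound $2e^{-2\epsilon^2 N}$ decays with $N$ (with $\epsilon=0.05$ picked arbitrarily):

```python
import numpy as np

eps = 0.05
for N in (100, 1_000, 10_000, 100_000):
    bound = 2 * np.exp(-2 * eps**2 * N)   # upper bound on P[|E_in - E_out| > eps]
    print(f"N = {N:6d}: bound = {bound:.3g}")
```

Note that for small $N$ the bound exceeds 1 and says nothing; it only becomes informative once $N$ is large enough.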
