California Institute of Technology Open Course: Machine Learning and Data Mining _ Overfitting (Lecture 11)

Course Introduction

This lecture covers the problem of overfitting in machine learning. The professor points out that one of the things separating a professional from an amateur is how they deal with overfitting. From this lecture we learn that the fit to the sample data should not be as tight as possible, because the noise present in the data leads to overfitting. Finally, two methods for dealing with overfitting are introduced.

Course Outline

1. What is overfitting?

2. The role of noise in overfitting.

3. Deterministic noise.

4. Dealing with overfitting.


1. What is overfitting?

Let's first look at an example, shown in the figure below:

The sample data is two-dimensional, and there are only five sample points. Suppose the target function is the blue curve and the learned function is the red curve. To fit all five points exactly, a 4th-order polynomial is needed (five coefficients). The result: although Ein = 0, Eout is huge.
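To make this concrete, here is a minimal sketch of the same effect in Python; the simple linear target, the noise level, and the sample locations are my own assumptions, not the lecture's exact data:

```python
import numpy as np

rng = np.random.default_rng(0)

f = lambda x: 2 * x                                  # assumed simple target (the "blue line")
x_train = np.linspace(-1, 1, 5)
y_train = f(x_train) + 0.3 * rng.standard_normal(5)  # add stochastic noise

# Degree 4 gives 5 coefficients, enough to pass through all 5 points.
g = np.polynomial.Polynomial.fit(x_train, y_train, deg=4)

x_test = np.linspace(-1, 1, 1000)
e_in = np.mean((g(x_train) - y_train) ** 2)
e_out = np.mean((g(x_test) - f(x_test)) ** 2)
print(f"Ein  = {e_in:.2e}")   # ~0: the red curve interpolates every point
print(f"Eout = {e_out:.3f}")  # much larger: the curve swings wildly between points
```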

Let's look at another one:

As you can see, as the number of iterations increases, Ein keeps decreasing, but Eout first drops and then begins to rise.

Therefore, overfitting means fitting the data more than is warranted: the model matches the sample data more closely than it should.

You may ask: why isn't a tighter fit always better? Because of the noise that is inevitably present in the sample data. Once the fit passes a certain point, we are actually fitting the noise, which is certainly not what we want. Ideally, we would stop at the lowest point of the Eout curve.

However, because we do not know the true target function, we cannot measure Eout directly. We therefore need methods and theory to guide us to a good stopping point; the next two lectures cover exactly that. First, let's look at how noise affects learning.

2. The role of noise in overfitting.

There are two types of noise:

One is the noise introduced when the data are collected. It is random, so it is called stochastic noise.

The other comes from the mismatch between the complexity of the target and what the hypothesis set can express; we call it deterministic noise.

Stochastic noise is easy to understand, but why would noise be related to the hypothesis set?

Let's first look at the figure below. In Lecture 8 we discussed the trade-off between bias and variance. H2 denotes 2nd-order polynomials and H10 denotes 10th-order polynomials (corresponding to hypothesis sets of different complexity).

For the larger hypothesis set (more polynomial terms), Eout is very large at first; as the number of data points grows, Eout drops sharply, and eventually the larger hypothesis set reaches a smaller Eout than the smaller one. With a limited number of data points, however, enlarging the hypothesis set makes Eout decrease first and then increase. In that regime we say the complexity of the hypothesis set has caused overfitting (even though Ein keeps shrinking), so the complexity of the hypothesis set can itself be viewed as a source of noise.
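A small simulation can reproduce this crossover. The sketch below averages Eout for 2nd- and 10th-order fits at several sample sizes N; the random 10th-order target, the noise level, and the trial count are all assumptions chosen to make the effect visible:

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.polynomial.Polynomial(rng.standard_normal(11))  # assumed 10th-order target
x_test = np.linspace(-1, 1, 500)

def avg_eout(deg, n, trials=200, sigma=0.5):
    """Average out-of-sample error of a degree-`deg` fit on n noisy points."""
    errs = []
    for _ in range(trials):
        x = rng.uniform(-1, 1, n)
        y = f(x) + sigma * rng.standard_normal(n)
        g = np.polynomial.Polynomial.fit(x, y, deg)
        errs.append(np.mean((g(x_test) - f(x_test)) ** 2))
    return float(np.mean(errs))

for n in (15, 30, 100):
    print(f"N={n:3d}  H2: {avg_eout(2, n):10.3f}  H10: {avg_eout(10, n):10.3f}")
# With small N, H10's Eout is larger (overfitting); with enough data it
# drops below H2's, since the target itself is 10th order.
```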

To better understand these two kinds of noise, consider the experiment shown in the figure below, where color indicates the degree of overfitting: the redder the color, the more severe the overfitting.

The data generator is:

y(x) = f(x) + e(x)

where f(x) is the target function (a polynomial, normalized) and the second term e(x) is noise. From the (x, y) data points we learn a function g.

The overfit measure is:

overfit measure = Eout(g^a) - Eout(g^b) // a > b; in the lecture's experiment a = 10 and b = 2

This same measure is plotted against the stochastic noise level (sigma^2) and against the target complexity (the source of deterministic noise); a positive value means the higher-order fit g^a is overfitting.
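Here is a sketch of how such a measure could be estimated by simulation; the fixed target, the dataset size N, and the trial count are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
f = np.polynomial.Polynomial(rng.standard_normal(11))  # assumed fixed target
x_test = np.linspace(-1, 1, 500)

def overfit_measure(sigma, n=20, trials=300):
    """Average Eout(g^10) - Eout(g^2) over many noisy datasets of size n."""
    diffs = []
    for _ in range(trials):
        x = rng.uniform(-1, 1, n)
        y = f(x) + sigma * rng.standard_normal(n)      # y = f(x) + e(x)
        eout = {}
        for deg in (2, 10):
            g = np.polynomial.Polynomial.fit(x, y, deg)
            eout[deg] = np.mean((g(x_test) - f(x_test)) ** 2)
        diffs.append(eout[10] - eout[2])
    return float(np.mean(diffs))

for sigma in (0.0, 0.5, 1.0):
    print(f"sigma={sigma}  overfit measure={overfit_measure(sigma):.3f}")
# The measure grows with sigma: more stochastic noise, more overfitting.
```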

From the experiment we can draw the following conclusions: overfitting worsens as the stochastic noise level rises, worsens as the target complexity grows, and eases as the number of data points increases.


3. Deterministic noise.

Definition: the part of f that cannot be captured by H: f(x) - h*(x), where h* is the best hypothesis in H.
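Since deterministic noise does not depend on any sampled data, it can be computed directly from f and H. A minimal sketch, assuming a random 10th-order target and H = 2nd-order polynomials:

```python
import numpy as np

rng = np.random.default_rng(3)
f = np.polynomial.Polynomial(rng.standard_normal(11))  # assumed 10th-order target

# Fit the best 2nd-order hypothesis h* to f itself: dense, noise-free
# samples stand in for knowing f exactly.
x = np.linspace(-1, 1, 2000)
h_star = np.polynomial.Polynomial.fit(x, f(x), deg=2)

det_noise = f(x) - h_star(x)          # the part of f that H2 cannot capture
print("mean squared deterministic noise:", np.mean(det_noise ** 2))
# For a fixed x this value never changes: unlike stochastic noise, it is
# determined entirely by f and H.
```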

If you keep enlarging H to chase f with only limited data, the model starts fitting the noise and Eout eventually increases. So the best approach is not to try to match every data point: do what your data can support.

(One can't help thinking of the plots of martial arts novels: when your internal energy is not deep enough, do not force yourself to practice the advanced techniques, or your training will backfire.)


The biggest differences from stochastic noise:

1. It is determined by H.

2. For a given x, its value is fixed.


Its influence on overfitting: with finite data, H tries to fit this noise as well.

Now let's fold noise into the bias-variance decomposition:

We already know the noiseless decomposition:

E_D[Eout(g^(D))] = bias + var

The formula above is for the case without noise; when noise is present we have:

y(x) = f(x) + e(x) // e(x) denotes the noise

Replacing f(x) with y(x) and substituting into the decomposition above gives:

E_{D,e}[ (g^(D)(x) - y(x))^2 ] = var + bias + sigma^2

It can be proved that the cross terms are 0, so the left-hand side splits into exactly these three terms: var, bias (the deterministic noise), and sigma^2 (the stochastic noise). The last term is the gap between the target function and the actual output; because noise is unavoidable, it cannot be driven to 0.
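The decomposition can be checked numerically. In the sketch below (the sinusoidal target, linear hypothesis set, N, and sigma are all assumptions), the cross term vanishes because g_bar is the empirical average of the fitted hypotheses, so the three terms add up to the direct estimate:

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(np.pi * x)            # assumed target
sigma, n, trials = 0.3, 10, 2000
x_test = np.linspace(-1, 1, 200)

# Fit a line (deg=1) on many independent noisy datasets D.
preds = np.empty((trials, x_test.size))
for t in range(trials):
    x = rng.uniform(-1, 1, n)
    y = f(x) + sigma * rng.standard_normal(n)          # y(x) = f(x) + e(x)
    preds[t] = np.polynomial.Polynomial.fit(x, y, deg=1)(x_test)

g_bar = preds.mean(axis=0)                             # average hypothesis
bias = np.mean((g_bar - f(x_test)) ** 2)               # deterministic-noise term
var = np.mean((preds - g_bar) ** 2)                    # variance term
direct = np.mean((preds - f(x_test)) ** 2) + sigma**2  # expected error vs noisy y
print(f"bias={bias:.4f}  var={var:.4f}  sigma^2={sigma**2:.4f}")
print(f"bias+var+sigma^2={bias + var + sigma**2:.4f}  direct={direct:.4f}")
# The two lines agree: the cross term vanishes because g_bar is the
# average of the fitted hypotheses, exactly the step claimed above.
```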


4. Dealing with overfitting.

There are two methods:

1. Regularization: putting the brakes.

2. Validation: Checking the bottom line.

What is Regularization?

Add a constraint so that g cannot perfectly match all of the sample data.
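For instance, here is a sketch of one common form of regularization, weight decay (ridge regression), applied to the 5-point example from the beginning; the penalty strength lambda is an assumed value:

```python
import numpy as np

rng = np.random.default_rng(0)                 # same data as the first sketch
f = lambda x: 2 * x
x_train = np.linspace(-1, 1, 5)
y_train = f(x_train) + 0.3 * rng.standard_normal(5)

Z = np.vander(x_train, 5, increasing=True)     # features 1, x, ..., x^4
lam = 0.1                                      # assumed brake strength
# Ridge solution: w = (Z'Z + lam*I)^(-1) Z'y -- the penalty shrinks the weights
w = np.linalg.solve(Z.T @ Z + lam * np.eye(5), Z.T @ y_train)

x_test = np.linspace(-1, 1, 1000)
Z_test = np.vander(x_test, 5, increasing=True)
e_out = np.mean((Z_test @ w - f(x_test)) ** 2)
print("Eout with the brakes on:", e_out)       # far below the unregularized fit
```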

What is Validation?

Evaluate a series of learned hypotheses g on held-out data and output the one whose estimated Eout is smallest.
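A minimal sketch of a hold-out split (the split sizes, target, and candidate degrees are assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
f = lambda x: np.sin(np.pi * x)               # assumed target
x = rng.uniform(-1, 1, 60)
y = f(x) + 0.2 * rng.standard_normal(60)

x_tr, y_tr = x[:40], y[:40]                   # training split
x_val, y_val = x[40:], y[40:]                 # held-out split

best = None
for deg in range(1, 11):                      # candidate hypotheses g
    g = np.polynomial.Polynomial.fit(x_tr, y_tr, deg)
    e_val = np.mean((g(x_val) - y_val) ** 2)  # validation error, a proxy for Eout
    if best is None or e_val < best[0]:
        best = (e_val, deg)
print(f"chosen degree: {best[1]}  (validation error {best[0]:.4f})")
```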

Both methods will be explained over the next two lectures. The result of applying the first method to the opening example is shown below:

Course Summary:

When I studied data mining courses before, I was also told not to overfit, but the books never seemed to explain why overfitting is bad (and neither did the teacher). I didn't understand it at the time, yet I neither asked the teacher nor looked it up (-_-!). This lecture fills in that blank. Clearly my hunger for knowledge wasn't strong enough. Stay hungry, stay foolish!

