Chewing Through PRML Together - 1.1 Example: Polynomial Curve Fitting



@copyright For reprints, please credit the source: http://www.cnblogs.com/chxer/

Too bad: the formulas and pictures in my local copy were all lost when this was posted...

We begin with a simple regression problem and use this one example to string together a number of scattered knowledge points.

Recalling the mathematical notation from the front matter: a superscript T denotes the transpose of a matrix or vector, so that x^T is a row vector. Uppercase bold Roman letters, such as M, denote matrices. The notation (w1, ..., wM) denotes a row vector with M elements, while the corresponding column vector is written as w = (w1, ..., wM)^T.

@define We use x ≡ (x1, ..., xN)^T to represent our training set, and t ≡ (t1, ..., tN)^T to denote the corresponding target values.

The input data set x in this figure was generated by choosing values of x_n, for n = 1, ..., N, spaced uniformly in the range [0, 1], and the target data set t was obtained by first computing the corresponding values of the function sin(2πx) and then adding a small level of random noise.

In other words, these points come from sin(2πx) plus some noise.
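As a concrete illustration, here is a minimal Python sketch of how such a training set could be generated. The sample size N = 10 and the noise standard deviation 0.3 are my own assumptions (they match the usual PRML example), not values stated in this post.

    import numpy as np

    rng = np.random.default_rng(0)

    N = 10                            # assumed number of training points
    x = np.linspace(0.0, 1.0, N)      # inputs spaced uniformly in [0, 1]
    # targets: sin(2*pi*x) plus a small amount of Gaussian noise
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)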

So, given these points, what we need to do is work out the underlying pattern behind them: on the one hand, generalize statistically from the data; on the other hand, overcome the interference of the noise. In other words, when the next new x arrives, we should be able to predict its corresponding t.

To do this curve fitting, we usually use a polynomial function.

@define polynomial function
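The formula image did not survive the upload, so for reference, the polynomial from PRML (equation 1.1) is

    y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j

where M is the order of the polynomial and the coefficients w_0, ..., w_M are collectively denoted by the vector w.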

P.S. You really do have to be careful laying out documents with a lot of formulas.

Such functions are linear models, which are also covered in later chapters.

As learning proceeds, our coefficients are gradually refined and eventually determined.

So what counts as a "good" fitting result? We introduce an error function to measure our model.


@define Error function
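The formula image is missing here too; the sum-of-squares error from PRML (equation 1.2) is

    E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2

i.e. half the sum of the squared differences between the predictions y(x_n, w) and the targets t_n.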

It can be seen that the error function directly reflects the gap between the model and the training data. The error function is never negative, and it attains its minimum value of 0 if and only if the model fits the training data perfectly (that is, the curve passes exactly through every training data point). Our job is simply to minimize this error function.

The factor of one half in front is said to be there for convenience (presumably so that the 2 cancels when differentiating); the author does not explain it explicitly. If you know the details, please enlighten me...

As the figure shows, the error function has a geometrical interpretation: it sums the squares of the vertical displacements of each data point from the fitted curve.

We can minimize E(w) by choosing a suitable w.
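Because y(x, w) is linear in the coefficients, minimizing E(w) is an ordinary linear least-squares problem. Here is a minimal sketch of one way to do it in Python; the helper names fit_polynomial and predict are mine, and x, t are the arrays from the earlier sketch.

    import numpy as np

    def fit_polynomial(x, t, M):
        """Minimize the sum-of-squares error E(w) for an order-M polynomial."""
        # Design matrix with columns 1, x, x^2, ..., x^M
        Phi = np.vander(x, M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
        return w

    def predict(x, w):
        """Evaluate the fitted polynomial y(x, w) at the points x."""
        return np.vander(x, len(w), increasing=True) @ w

For example, fit_polynomial(x, t, 3) returns the four coefficients of the cubic fit.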

This brings in the problem of model comparison, or model selection.

Let's take a look at the fitting results in the figure for polynomial orders M = 0, 1, 3, and 9:

As can be seen, the constant and linear functions do a poor job of following the rise and fall of the sine function, while the order-9 polynomial passes exactly through every point, so that E(w) = 0, and yet the result is very bad; this phenomenon is called over-fitting.

At the same time, we see that the cubic (M = 3) polynomial does best: it basically captures the trend of the sine function.

We can use a test set to check whether a given value of M is good. We used the E(w) function during training, so we can use it on the test set as well.

To compare errors across data sets of different sizes, we usually use the more convenient root-mean-square (RMS) error:

@define (RMS) Error function
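The missing definition, from PRML (equation 1.3), is

    E_{\mathrm{RMS}} = \sqrt{2 E(\mathbf{w}^{\star}) / N}

where w* is the minimizing coefficient vector and N is the number of data points; dividing by N puts data sets of different sizes on an equal footing, and the square root puts the error on the same scale (and in the same units) as the target t.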

Take a look at the plot of the RMS error against M for the training and test sets:

As can be seen, values of M between 3 and 8 are all quite reasonable, while at M = 9 the RMS error is very bad; for prediction it is almost useless. You could say it exhibits wild oscillations...
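For the record, here is a minimal sketch of how the two error curves could be reproduced, reusing np, rng, x, t, fit_polynomial and predict from the sketches above; the test-set size of 100 points is my own assumption.

    # A fresh test set drawn from the same sin(2*pi*x) + noise source
    x_test = rng.uniform(0.0, 1.0, size=100)
    t_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.3, size=100)

    def rms_error(w, x, t):
        # E_RMS = sqrt(2 * E(w) / N), which reduces to the root mean squared residual
        residual = predict(x, w) - t
        return np.sqrt(np.mean(residual ** 2))

    for M in range(10):
        w = fit_polynomial(x, t, M)
        print(M, rms_error(w, x, t), rms_error(w, x_test, t_test))

The training error keeps shrinking as M grows, while the test error typically blows up around M = 9.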

But this is rather paradoxical: on the one hand, we said earlier that we want to make E(w) as small as possible; on the other hand, the "theoretically best" result with E(w) = 0 performs worst on the test set. Why?

Furthermore, we might suppose that the best predictor of new data would be the function sin(2πx) from which the data were generated (and we shall see later that this is indeed the case). We know that a power series expansion of the function sin(2πx) contains terms of all orders, so we might expect that results should improve monotonically as we increase M.

In other words, what we are trying to do is create the best predictor of new data. We would then hope to approach this best predictor ever more closely; in other words, we would like a monotonic process that keeps improving the model.

In our case, for example, as M keeps growing and reaches 9, we find that E_RMS on the test set starts to get worse, so we do not pick M = 9. In this way the choice narrows to 3-8. This idea basically guarantees that our model will not over-fit.

So why, in the end, does the model over-fit once M reaches a certain value?

The author explains it this way: let's take a look at the polynomial coefficients:
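The table image is missing, but you can inspect the coefficients yourself with the helpers from the earlier sketch (the values will differ from the book's, since the data here are my own simulated sample):

    for M in (0, 1, 3, 9):
        print(M, np.round(fit_polynomial(x, t, M), 2))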

As can be seen, as M increases the polynomial coefficients become more and more extreme, with values in the millions appearing. Evidently, as M increases and the coefficients grow, our polynomial becomes more and more distorted by noise. Correspondingly, when M is very small and the coefficients are very simple, the polynomial cannot capture the function at all. So, in general, choosing M takes some skill: it must not be so large as to cause over-fitting, nor so small as to cause under-fitting.

At the same time, for the same M, if the amount of data is large enough, the over-fitting is usually much less pronounced, as the following two plots show:

That is to say, we should choose a model capacity appropriate to the amount of data in order to get a good fit.

However, one rough heuristic that is sometimes advocated is that the number of data points should be no less than some multiple (say 5 or 10) of the number of adaptive parameters in the model. However, as we shall see in Chapter 3, the number of parameters is not necessarily the most appropriate measure of model complexity.

I have seen another explanation, in "Artificial Intelligence: Complex Problem Solving", which I also think is quite good: roughly, when your model is too powerful it tends to latch onto individual variations, while when the model capacity is low it tends to follow only the overall trend. Individual variations are always disturbed by noise, and the overall trend is never perfectly accurate. So either reduce the noise to shrink the individual uncertainty, or increase the amount of data to improve the accuracy of the trend. If you can do neither, pick a capacity that strikes a good balance between the two extremes.

That said, once we have learned the maximum likelihood method, over-fitting and under-fitting no longer seem such a problem, so there is no need to agonize over them; an intuitive understanding is pretty much enough.

So how can this problem be solved simply? In fact, from Table 1.1 we can see that a larger M first affects our coefficients, and the coefficients in turn affect our model; so if we can control the size of the coefficients, we control the model.

How to control them: the complexity of the model can be controlled directly by a simple penalty mechanism. When the coefficients get too big, we penalize them in E(w), so that the chance of that polynomial being chosen drops sharply. Since larger coefficients tend to indicate a less reasonable model, we simply add the sum of the squares of the coefficients onto E(w):
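The formula image is missing here as well; the regularized error function from PRML (equation 1.4) is

    \widetilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2, \qquad \lVert \mathbf{w} \rVert^2 \equiv \mathbf{w}^{\mathrm{T}} \mathbf{w} = w_0^2 + w_1^2 + \dots + w_M^2

where the coefficient λ governs the relative importance of the penalty term compared with the sum-of-squares error.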

This method is called weight decay in the neural-network (ANN) literature.

The parameter λ is used to measure the strength of the penalty:


It can be seen that even with M = 9, the case ln λ = -18 still performs very well.

Let's take a look at the polynomial's coefficients this time:

Indeed, the effect on the coefficients is very obvious. This penalty mechanism directly controls the complexity of the function.
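Here is a minimal sketch of such a regularized fit, under the same assumptions as the earlier sketches; the closed-form solution w = (λI + Φ^T Φ)^{-1} Φ^T t is the standard ridge-regression result rather than something derived in this post.

    # Assumes np, x, t and fit_polynomial from the earlier sketches
    def fit_polynomial_regularized(x, t, M, lam):
        """Minimize 0.5 * sum((y(x_n, w) - t_n)^2) + 0.5 * lam * ||w||^2."""
        Phi = np.vander(x, M + 1, increasing=True)
        A = lam * np.eye(M + 1) + Phi.T @ Phi
        return np.linalg.solve(A, Phi.T @ t)

    w_plain = fit_polynomial(x, t, 9)                          # unregularized order-9 fit
    w_reg = fit_polynomial_regularized(x, t, 9, np.exp(-18))   # ln(lambda) = -18, as above
    print(np.abs(w_plain).max(), np.abs(w_reg).max())          # the penalty shrinks the coefficients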

Let's look again at how things behave as λ varies:

As you can see, a model with a suitable, well-controlled λ predicts well. We can determine this parameter on a validation set.

This section has mainly described and solved the pattern-recognition problem from an intuitive point of view. The next post will cover the more principled approach, based on probability theory.

