PRML Chapter 1 Learning Summary: Least Squares Data Fitting and Regression


Written by goldenlock

Introduction:

This article summarizes the first chapter of PRML, combined with Andrew Moore's regression lecture slides, "Predicting real-valued outputs: an introduction to regression".

What is regression?

1. Linear Regression with a single parameter

Suppose we want to fit the sample points with a straight line through the origin, y = wx. Finding the value of the unknown parameter w that minimizes the overall fitting error is a least-squares fitting problem.

The objective is to minimize the sum of squared residuals, sum_i (w*x_i - y_i)^2.
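
As a small illustration (mine, not from the original article), here is a minimal NumPy sketch; the arrays x and y are made-up example values. For a line through the origin the least-squares solution has the closed form w = sum(x_i*y_i) / sum(x_i^2).

```python
import numpy as np

# Made-up sample points (x_i, y_i); any 1-D data would do.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Least squares for y = w*x (line through the origin):
# minimize sum_i (w*x_i - y_i)^2  =>  w = sum(x_i*y_i) / sum(x_i^2)
w = np.sum(x * y) / np.sum(x * x)

residual = np.sum((w * x - y) ** 2)
print(f"w = {w:.4f}, sum of squared errors = {residual:.4f}")
```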

2. Considering the above problem from the perspective of probability

That is, we still assume the model y = wx, but each pair (x_i, y_i) is generated by a probabilistic model: given x_i, the value y_i is drawn from a Gaussian distribution centered at w*x_i with unknown variance, and each sampled point is independent of the others.

As mentioned above, our goal is to estimate the parameter w from the actually observed values in the sample set. There are two ways to estimate w:

    • MLE (the maximum-likelihood method) determines the value of the parameter w that maximizes the probability of the observed sample set: argmax_w P(y_1, y_2, ..., y_n | x_1, x_2, ..., x_n, w). But isn't this a bit strange? Our real goal is the most probable w given the data, argmax_w P(w | x_1, x_2, ..., x_n, y_1, y_2, ..., y_n).

We can see that the optimization objective here is actually the same as that of the least-squares method (the small numeric check after this list illustrates it).

    • MAP (maximum a posteriori) uses Bayes' rule; it will be discussed later.
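
As a hedged illustration (mine, not from the article), the sketch below checks numerically that, for Gaussian noise, the maximum-likelihood w coincides with the least-squares w; the data and the noise level sigma are made-up assumptions.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])      # made-up inputs
y = np.array([2.1, 3.9, 6.2, 7.8])      # made-up observations
sigma = 0.5                              # assumed (hypothetical) noise std dev

def neg_log_likelihood(w):
    # Gaussian model: y_i ~ N(w*x_i, sigma^2), samples independent.
    return (0.5 * np.sum((y - w * x) ** 2) / sigma**2
            + len(x) * np.log(sigma * np.sqrt(2 * np.pi)))

# Maximum likelihood by a fine grid search over candidate w values.
grid = np.linspace(0.0, 4.0, 4001)
w_mle = grid[np.argmin([neg_log_likelihood(w) for w in grid])]

# Closed-form least-squares solution for comparison.
w_ls = np.sum(x * y) / np.sum(x * x)
print(f"w_mle ~= {w_mle:.4f}, w_ls = {w_ls:.4f}")   # the two agree (up to grid resolution)
```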

 

 

3. Polynomial curve fitting

The running example in chapter 1 of PRML is polynomial curve fitting.

Consider a polynomial curve of order M, which can be expressed in the following form:

y(x, w) = w_0 + w_1*x + w_2*x^2 + ... + w_M*x^M = sum_{j=0..M} w_j * x^j

The goal of curve fitting is to minimize the error function E(w) below (of course you may choose a different error function; this is only one of them):

E(w) = (1/2) * sum_{n=1..N} ( y(x_n, w) - t_n )^2

By the minimum we mean the curve whose total squared vertical distance to the sample points is smallest; because E(w) is quadratic in w, this minimization has a unique closed-form solution, which we denote w*.
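
A minimal sketch (mine, not from the article) of solving this least-squares problem with NumPy; the noisy sin(2*pi*x) data mirrors the example used in PRML chapter 1, and the order M is an assumed value.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 3                                  # sample size and polynomial order (assumed)
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)   # noisy observations

# Design matrix Phi with Phi[n, j] = x_n ** j, so y(x_n, w) = Phi[n] @ w.
Phi = np.vander(x, M + 1, increasing=True)

# Minimizing E(w) = 0.5 * sum_n (y(x_n, w) - t_n)^2 is a linear least-squares problem.
w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)

E = 0.5 * np.sum((Phi @ w_star - t) ** 2)
print("w* =", np.round(w_star, 3), " E(w*) =", round(E, 4))
```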

 

 

If we choose different values of the order M, we get different polynomial curves as fits; for example, the least-squares fitting results for M = 0, 1, 3 and 9 (as in PRML) are as follows:

 

We can see that for M = 9 the curve passes through the sampled observation points almost exactly, yet it deviates wildly in between and does not reflect the underlying function well. This is the well-known over-fitting problem.

The higher the order M, the more flexible the corresponding curve, and the better it can approximate the sampled points; after all, a higher-order polynomial contains (can express) all lower-order polynomials as special cases. One might therefore expect that the larger M is, the better the fit to the sample points. However, we can see that the larger M is, the more flexible the curve becomes and the more sensitive it is to noise.

Given what we said above, how do we judge whether over-fitting has occurred? What is our ultimate goal? The ultimate goal is: for new data, to give accurate predictions of the target value, i.e., to give accurate estimates for data not seen during training.

We can generate a separate test set, for example 100 data points. For each value of M we can then compute the error on the training set (training data) and on the test set (test data). It is often better to use the following root-mean-square (RMS) error function:

E_RMS = sqrt( 2 * E(w*) / N )

In this way we have a fair basis for comparison across different values of N, i.e., different data set sizes (and the RMS error is measured on the same scale as the target variable t).
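
A sketch (mine, with synthetic data in the spirit of the PRML example) comparing training and test RMS error across polynomial orders; the data generator, noise level and set sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = np.linspace(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)

x_train, t_train = make_data(10)      # small training set
x_test, t_test = make_data(100)       # larger independent test set

def rms_error(x, t, w):
    Phi = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((Phi @ w - t) ** 2))   # equals sqrt(2*E(w)/N)

for M in (0, 1, 3, 9):
    Phi = np.vander(x_train, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t_train, rcond=None)
    print(f"M={M}:  train E_RMS={rms_error(x_train, t_train, w):.3f}  "
          f"test E_RMS={rms_error(x_test, t_test, w):.3f}")
```

Typically the training error keeps decreasing as M grows, while the test error eventually increases: the signature of over-fitting.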

For the over-fitting problem, if we add more observation points, the problem is mitigated; for example, with M = 9:

The larger the data set, the more complex a model we can afford. One common heuristic is that the number of data points should be some multiple (for example, 5 or 10) of the number of parameters in order to obtain good results.

In Chapter 3 of PRML we will see that the number of parameters is not necessarily the best measure of model complexity.

At the same time, it is rather unsatisfying that we have to limit the number of model parameters according to the size of the available training set. It seems more natural to choose the complexity of the model according to the complexity of the problem being solved.

We will see that the least-squares method and the maximum-likelihood method are consistent (the earlier single-parameter linear regression example already provided a proof of this), whereas the Bayesian method can avoid over-fitting. From the Bayesian perspective it is perfectly feasible to use a model with many more parameters than data points; in fact, in a Bayesian model the effective number of parameters adapts automatically to the size of the data set.

From the least-squares point of view, one way to address over-fitting is to change the optimization objective by adding a regularization term that penalizes large values of ||w||:

E~(w) = (1/2) * sum_{n=1..N} ( y(x_n, w) - t_n )^2 + (lambda/2) * ||w||^2
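
A minimal sketch (mine) of the regularized fit; lambda is an assumed regularization strength, and the closed-form solution (Phi^T Phi + lambda*I) w = Phi^T t is the standard ridge / regularized least-squares normal equation.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=10)

M, lam = 9, np.exp(-18)                 # high-order polynomial, assumed lambda
Phi = np.vander(x, M + 1, increasing=True)

# Minimizing 0.5*||Phi w - t||^2 + 0.5*lam*||w||^2 gives
# (Phi^T Phi + lam*I) w = Phi^T t   (regularized least squares).
A = Phi.T @ Phi + lam * np.eye(M + 1)
w_reg = np.linalg.solve(A, Phi.T @ t)

print("||w_reg|| =", round(np.linalg.norm(w_reg), 3))   # stays moderate even for M = 9
```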

 

4. Bayesian Probability

Consider tossing a coin three times. If all three observed outcomes are tails, then from the maximum-likelihood point of view we would conclude that the probability of tails is 100%; if we have some prior knowledge, we would not reach this conclusion.

We have two boxes, red and blue. The red box contains two apples and six oranges, and the blue box contains three apples and one orange.

Assuming the probability of selecting the red box is 40% and the probability of selecting the blue box is 60%, the probability of drawing an apple is (2/(2+6))*0.4 + (3/(3+1))*0.6 = 0.1 + 0.45 = 0.55 = 11/20, and the probability of drawing an orange is 0.45 = 9/20.

Suppose we are told that a fruit has been drawn and it is an orange. Which box did it come from, and how likely is it that the box was red? Clearly the probability that it was the red box is no longer the prior 40%, P(B=r), but larger, because an orange is more likely to come from the red box: observing the orange increases the probability that the box was red, i.e., the posterior probability P(B=r | F=o). Note that if oranges were equally likely from the red and the blue box, that is P(F=o | B=r) = P(F=o), then the posterior would equal the prior, P(B=r | F=o) = P(B=r); in that case the fruit obtained would be independent of which box was chosen, P(B=r, F=o) = P(F=o) * P(B=r | F=o) = P(F=o) * P(B=r).
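
A quick numeric check (my own) of the posterior via Bayes' theorem for this example:

```python
# Prior probabilities of picking each box.
p_red, p_blue = 0.4, 0.6

# Likelihood of drawing an orange from each box.
p_orange_given_red = 6 / 8     # red box: 2 apples, 6 oranges
p_orange_given_blue = 1 / 4    # blue box: 3 apples, 1 orange

# Marginal probability of drawing an orange (sum rule + product rule).
p_orange = p_orange_given_red * p_red + p_orange_given_blue * p_blue   # = 0.45

# Bayes' theorem: posterior probability that the box was red given an orange.
p_red_given_orange = p_orange_given_red * p_red / p_orange             # = 2/3

print(f"P(orange) = {p_orange:.2f},  P(B=red | F=orange) = {p_red_given_orange:.3f}")
```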

Bayes' theorem lets us convert a prior probability into a posterior probability, with the conversion based on the information provided by the observed data.

We can also apply Bayes' theorem to the parameter w in curve fitting. Before observing the training data we have a prior distribution p(w) over w. The observed data can be written D = {t_1, ..., t_N}, so we have

p(w | D) = p(D | w) * p(w) / p(D)

Here p(D | w), the likelihood function, indicates how probable the observed data is for each particular setting of w, and p(D) is a normalization constant. In words: posterior is proportional to likelihood times prior.


5. Revisiting curve fitting from a probabilistic angle

The curve-fitting problem can be restated as follows. We have N input values X = (x_1, x_2, ..., x_N) and their corresponding target values T = (t_1, t_2, ..., t_N); the goal is to predict the target value t for a new input x. (If t took discrete values this would instead be a classification problem.) As with the linear fit at the beginning, we assume the data points are generated independently from a Gaussian distribution whose mean is y(x, w), the output of the model with parameters w at input x, and whose variance is 1/beta (beta is called the precision). So we have

p(t | x, w, beta) = N( t | y(x, w), beta^-1 )

and, since the points are drawn independently, the likelihood of the whole data set is

p(T | X, w, beta) = prod_{n=1..N} N( t_n | y(x_n, w), beta^-1 )

Maximizing this likelihood (or its logarithm) with respect to the curve parameters w is again exactly the same optimization as the least-squares method, and it determines w_ML. Having determined w_ML, we can then determine the precision:

1 / beta_ML = (1/N) * sum_{n=1..N} ( y(x_n, w_ML) - t_n )^2

So now, when a new x is given, we can predict the distribution of t as

p(t | x, w_ML, beta_ML) = N( t | y(x, w_ML), beta_ML^-1 )
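
A small sketch (mine, using the same synthetic-data assumptions as above) of this maximum-likelihood fit: w_ML comes from least squares, beta_ML from the mean squared residual, and together they give a Gaussian predictive distribution at a new input.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 10, 3
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

Phi = np.vander(x, M + 1, increasing=True)
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)        # same solution as least squares

# 1/beta_ML is the mean squared residual of the fitted curve.
beta_ml = 1.0 / np.mean((Phi @ w_ml - t) ** 2)

# Predictive distribution for a new input x_new: N(mean, 1/beta_ML).
x_new = 0.35
phi_new = np.vander([x_new], M + 1, increasing=True)[0]
mean, var = phi_new @ w_ml, 1.0 / beta_ml
print(f"p(t | x={x_new}) = N({mean:.3f}, {var:.4f})")
```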

Now let's be a bit more Bayesian. We assume a prior distribution over w; for simplicity we take it to be a zero-mean Gaussian governed by a precision hyperparameter alpha:

p(w | alpha) = N( w | 0, alpha^-1 * I )

Bayes' theorem then gives

p(w | X, T, alpha, beta) is proportional to p(T | X, w, beta) * p(w | alpha)

Taking the logarithm, maximizing this posterior with respect to w (the MAP estimate) becomes equivalent to minimizing

(beta/2) * sum_{n=1..N} ( y(x_n, w) - t_n )^2 + (alpha/2) * w^T w

This is exactly the regularized curve-fitting problem considered earlier: the regularization parameter here is lambda = alpha / beta (compare with the formula at the end of section 3).
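
A quick check (mine, with assumed hyperparameters alpha and beta) that the MAP solution coincides with the regularized least-squares solution when lambda = alpha/beta:

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 10, 9
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)
Phi = np.vander(x, M + 1, increasing=True)

alpha, beta = 5e-3, 11.1                # assumed prior / noise precisions
lam = alpha / beta                      # equivalent regularization parameter

# MAP: minimize (beta/2)*||Phi w - t||^2 + (alpha/2)*w^T w
w_map = np.linalg.solve(beta * Phi.T @ Phi + alpha * np.eye(M + 1), beta * Phi.T @ t)

# Regularized least squares with lambda = alpha/beta.
w_reg = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)

print("max |w_map - w_reg| =", np.max(np.abs(w_map - w_reg)))   # ~0, the two coincide
```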

6. Bayesian Curve Fitting

The approach above is still not fully Bayesian: although a prior was assumed, finding a point estimate of w cannot be called a complete Bayesian treatment. Below is the fully Bayesian solution, which consistently applies the sum and product rules of probability.

Which concept needs to change? In essence, our goal is not to find the most probable parameter w, but to find the most reliable prediction of t for a new x. So we have

p(t | x, X, T) = integral of p(t | x, w) * p(w | X, T) dw

Carrying out this integration (the original author marks the derivation as a TODO), we obtain the following result:

p(t | x, X, T) = N( t | m(x), s^2(x) )

where, writing phi(x) = (x^0, x^1, ..., x^M)^T,

m(x)   = beta * phi(x)^T * S * sum_{n=1..N} phi(x_n) * t_n
s^2(x) = beta^-1 + phi(x)^T * S * phi(x)
S^-1   = alpha * I + beta * sum_{n=1..N} phi(x_n) * phi(x_n)^T
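
A minimal implementation sketch (mine, reusing the assumed alpha, beta and synthetic data from above) of these predictive formulas:

```python
import numpy as np

rng = np.random.default_rng(5)
N, M = 10, 9
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

alpha, beta = 5e-3, 11.1                                    # assumed hyperparameters
phi = lambda u: np.array([u ** j for j in range(M + 1)])    # polynomial basis vector

# S^-1 = alpha*I + beta * sum_n phi(x_n) phi(x_n)^T
S_inv = alpha * np.eye(M + 1) + beta * sum(np.outer(phi(xn), phi(xn)) for xn in x)
S = np.linalg.inv(S_inv)

def predict(x_new):
    """Predictive mean m(x) and variance s^2(x) of the Bayesian curve fit."""
    p = phi(x_new)
    mean = beta * p @ S @ sum(phi(xn) * tn for xn, tn in zip(x, t))
    var = 1.0 / beta + p @ S @ p
    return mean, var

m, s2 = predict(0.35)
print(f"m(0.35) = {m:.3f},  s^2(0.35) = {s2:.4f}")
```

Note that the predictive variance s^2(x) depends on x: the Bayesian fit is less certain in regions far from the observed data.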

 

 

 
