Model variance and bias explained, part 1: overfitting

Source: Internet
Author: User

Original article: http://blog.csdn.net/vivihe0/article/details/33317041

Discussions of overfitting often mention the variance and bias of a model. This series uses polynomial fitting as an example to explain what a model's variance and bias are, and then illustrates the relationship between model complexity and predictive performance.

We generate the sample data points for polynomial fitting by computer. So that the fitted functions can be displayed on a two-dimensional plane, both the input and the output are one-dimensional. The data-generating function is:

y = sin(2 * pi * x)

Using this function, we take the 11 points (0, 0.1, ..., 0.9, 1) as the input samples x and compute the corresponding y values. We then add Gaussian noise with mean 0 and standard deviation 0.3 to y to obtain the target output samples t. We fit polynomials of four different orders to these 11 sample points and compare each fitted curve with the true function y = sin(2 * pi * x). This shows whether a fitted curve has captured the functional relationship hidden behind the 11 sample points.
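The sampling procedure just described can be sketched in a few lines. This is a Python/NumPy analogue rather than the article's MATLAB, and the seed value and the names `x_train`, `y_true`, and `t_train` are assumptions made for illustration.

```python
import numpy as np

# The seed is an arbitrary assumption so that repeated runs give the same noise
rng = np.random.default_rng(0)

# 11 equally spaced inputs: 0, 0.1, ..., 0.9, 1
x_train = np.linspace(0.0, 1.0, 11)

# Noise-free values of the generating function y = sin(2*pi*x)
y_true = np.sin(2 * np.pi * x_train)

# Target outputs: true values plus Gaussian noise with mean 0 and std 0.3
t_train = y_true + rng.normal(loc=0.0, scale=0.3, size=x_train.shape)

# Show how far each noisy target sits from the noise-free value
for x, y, t in zip(x_train, y_true, t_train):
    print(f"x={x:.1f}  y={y:+.3f}  t={t:+.3f}")
```

Printing y next to t makes the size of the noise visible: a standard deviation of 0.3 is large relative to the sine's amplitude of 1, which is what gives a flexible model something spurious to fit.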

Note: in real applications we do not know the true function that generated the data. Our goal there is to fit a curve from just the 11 sample points, without knowing the true function, and then predict the t value for an unseen x value. In this article, by contrast, we do know the true function.

The corresponding MATLAB code follows. We use MATLAB's built-in polynomial fitting functions polyfit and polyval directly. polyfit computes the least-squares coefficients of a polynomial of a specified order; polyval evaluates the polynomial with those coefficients at given input values. The MATLAB script below shows how they are used.
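To see what the fit-then-evaluate pair computes on its own, here is a small sketch using NumPy's `np.polyfit` and `np.polyval`, which play the same roles as the MATLAB functions. The quadratic used to generate the data is a hypothetical example chosen only because its coefficients are easy to check.

```python
import numpy as np

# Hypothetical noise-free data from t = 2x^2 - 3x + 1 (illustration only)
x = np.linspace(0.0, 1.0, 11)
t = 2 * x**2 - 3 * x + 1

# Least-squares fit of an order-2 polynomial; coefficients are returned
# highest power first, the same ordering MATLAB's polyfit uses
coeffs = np.polyfit(x, t, 2)

# Evaluate the fitted polynomial at a new input value
t_at_half = np.polyval(coeffs, 0.5)

print(coeffs)
print(t_at_half)
```

Because the data are noise-free and the model order matches the generating polynomial, the fit recovers the coefficients 2, -3, 1 up to rounding error. The interesting cases in this article are the ones where the order does not match and the data are noisy.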

    % Generate 11 data points for polynomial fitting
    % (normrnd requires the Statistics and Machine Learning Toolbox)
    xTrain = 0:0.1:1;
    tTrain = sin(2 * pi * xTrain) + normrnd(0, 0.3, 1, 11);
    % Fit four polynomials of different orders
    poly = cell(1, 4);
    P = [1, 2, 3, 10];
    for i = 1:4
        poly{i} = polyfit(xTrain, tTrain, P(i));
    end
    % Set the sample points used to draw the curves
    xGrid = 0:0.01:1;
    % Create a figure
    figure
    % Plot each polynomial fit in its own subplot
    for i = 1:4
        subplot(2, 2, i);
        plot(xGrid, sin(2 * pi * xGrid), 'b');        % true function
        hold on
        plot(xTrain, tTrain, 'o');                    % noisy sample points
        plot(xGrid, polyval(poly{i}, xGrid), 'r');    % fitted polynomial
        set(gca, 'ylim', [-2, 2]);
        title(sprintf('order: %d', P(i)), 'fontsize', 20);
    end


In the figure, the blue line is the true function that generated the data, the circles are the sample points, and the red line is the fitted polynomial curve.

With orders 1 and 2 the fit is poor: the fitted curve is far from the sine curve y = sin(2 * pi * x). With order 3 the fit is better. When we raise the order to 10, the polynomial fits the 11 data points perfectly; the fitted curve passes exactly through all 11 sample points. Yet that curve is very different from y = sin(2 * pi * x).

The first two subplots show underfitting. Because the order of the model is low, the model is not flexible enough and fails to extract the information contained in the data. The last subplot shows overfitting. Here the model is so flexible that it adapts to every random fluctuation in the data, treating the noise as if it were valuable information. Both cases need to be avoided. What we need is a compromise between the two: the fitted model should be neither too complex nor too simple.
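The visual comparison can also be put into numbers. The sketch below is a Python/NumPy analogue of the experiment, not the article's own code. For each order it reports two root-mean-square errors: one against the 11 noisy samples and one against the true sine curve on a fine grid. The seed, the helper `rmse`, and the variable names are assumptions made for illustration.

```python
import numpy as np

# Arbitrary seed (an assumption) so the noisy samples are repeatable
rng = np.random.default_rng(0)

# 11 noisy training samples of y = sin(2*pi*x)
x_train = np.linspace(0.0, 1.0, 11)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.3, x_train.shape)

# Fine grid on which the fitted curve is compared with the true function
x_grid = np.linspace(0.0, 1.0, 101)
y_grid = np.sin(2 * np.pi * x_grid)


def rmse(a, b):
    """Root-mean-square difference between two arrays of equal shape."""
    return float(np.sqrt(np.mean((a - b) ** 2)))


orders = [1, 2, 3, 10]
train_rmse = {}  # error against the noisy samples
true_rmse = {}   # error against the noise-free sine curve

for order in orders:
    coeffs = np.polyfit(x_train, t_train, order)
    train_rmse[order] = rmse(np.polyval(coeffs, x_train), t_train)
    true_rmse[order] = rmse(np.polyval(coeffs, x_grid), y_grid)
    print(f"order {order:2d}: "
          f"train RMSE = {train_rmse[order]:.4f}, "
          f"RMSE vs true function = {true_rmse[order]:.4f}")
```

The training error can only go down as the order rises, because each higher-order model contains the lower-order ones, and it reaches essentially zero at order 10, where the polynomial interpolates all 11 points. The error against the true function follows no such rule, and for the order-10 fit it is typically much larger than the training error. The exact values depend on the noise drawn.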

Of course, you may object that in real applications we do not know which function generated the data (that is, we cannot draw the blue line in the figure), so how can we tell whether a model is overfitting? That is the topic of the next installment.

The conclusion of this article: going too far is as bad as not going far enough.
