Original article: http://blog.csdn.net/vivihe0/article/details/33317041
When people talk about model overfitting, the terms "variance" and "bias" often come up. This series uses polynomial fitting as an example to explain what the variance and bias of a model are, and to illustrate the relationship between model complexity and prediction performance.
We use the computer to generate sample data points for polynomial fitting. To make the fitted functions easy to display on a two-dimensional plane, both the input and the output are one-dimensional. The data-generating function is:
y = sin(2 * pi * x)
Using this function, we take the 11 points (0, 0.1, ..., 0.9, 1) as the input sample x to generate y values, and then superimpose on y a normally distributed noise term with mean 0 and standard deviation 0.3 to produce the target output sample t. We use these 11 sample points to fit polynomials of four different orders, and compare each fitted curve with the true functional relationship y = sin(2 * pi * x). In this way we can see whether the fitted curves have recovered the functional relationship hidden behind the 11 sample points.
Note: In real applications we do not know the true functional relationship that generated the data. In that case, the goal is to fit a curve from the 11 sample points alone, without knowing the true function, and then to predict the value of t for a new value of x. In this article, however, we do know the true function that generates the data.
The following is the corresponding Matlab code. Here we directly use Matlab's polynomial fitting functions polyfit and polyval: polyfit computes the polynomial coefficients of the specified order from the data, and polyval evaluates the polynomial at given input values from those coefficients. Their usage is shown in the code below.
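For readers without Matlab, the same two-step workflow (fit coefficients, then evaluate the polynomial) can be sketched in Python with NumPy's np.polyfit and np.polyval, which mirror Matlab's polyfit and polyval; the variable names here are illustrative:

```python
import numpy as np

# 11 evenly spaced sample points and their noise-free targets
x = np.linspace(0, 1, 11)
y = np.sin(2 * np.pi * x)

# np.polyfit returns the coefficients of the fitted polynomial
# (highest order first); np.polyval evaluates it at given inputs
coeffs = np.polyfit(x, y, 3)
y_fit = np.polyval(coeffs, x)

# maximum deviation of the cubic fit at the sample points
max_err = np.max(np.abs(y_fit - y))
```

A degree-3 fit to noise-free sine samples already tracks the data quite closely, which matches the behavior of the third subplot in the Matlab figure below.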
% Generate 11 data points for polynomial fitting
xtrain = 0:0.1:1;
ttrain = sin(2*pi*xtrain) + normrnd(0, 0.3, 1, 11);

% Fit polynomials of four different orders
poly = cell(1, 4);
p = [1, 2, 3, 10];
for i = 1:4
    poly{i} = polyfit(xtrain, ttrain, p(i));
end

% Sample points for drawing the curves
xgrid = 0:0.01:1;

% Create a figure
figure

% Plot the four polynomial fits
for i = 1:4
    subplot(2, 2, i);
    plot(xgrid, sin(2*pi*xgrid), 'b');           % true function
    hold on
    plot(xtrain, ttrain, 'o');                   % noisy sample points
    plot(xgrid, polyval(poly{i}, xgrid), 'r');   % fitted polynomial
    set(gca, 'ylim', [-2, 2]);
    title(sprintf('order: %d', p(i)), 'fontsize', 20);
end
Note that in the figure, the blue line is the true function that generates the data points, the circles are the data points, and the red line is the fitted polynomial curve.
It can be seen that when the order is 1 or 2, the fit is poor: the fitted curve is far from the sine curve y = sin(2 * pi * x). When the order is 3, the fit is much better. When we increase the order to 10, the polynomial fits the 11 data points perfectly; the fitted curve passes exactly through all 11 sample points, yet it is very different from y = sin(2 * pi * x).
The first two subplots show underfitting: because the order of the model is low, the fitting model is not flexible enough, and the information contained in the data is not effectively extracted. The last subplot shows overfitting: the model is so flexible that it adapts to the random fluctuations in the data, treating the noise as if it were valuable information. Both situations must be avoided. What we need is a compromise between the two: the fitted model should be neither too complex nor too simple.
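The overfitting effect can also be checked numerically. The sketch below (in Python with NumPy as an illustration; the random seed is an arbitrary choice) fits a degree-10 and a degree-3 polynomial to 11 noisy samples. The degree-10 fit reproduces the training points almost exactly, while the degree-3 fit leaves visible residuals at the training points yet stays closer to the true sine curve:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice
x = np.linspace(0, 1, 11)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 11)

# Degree 10 can interpolate the 11 noisy points exactly;
# degree 3 is a smoother compromise
c10 = np.polyfit(x, t, 10)
c3 = np.polyfit(x, t, 3)

# Training residuals: near zero for degree 10, clearly nonzero for degree 3
train_err10 = np.max(np.abs(np.polyval(c10, x) - t))
train_err3 = np.max(np.abs(np.polyval(c3, x) - t))

# Deviation from the true function, measured on a fine grid
grid = np.linspace(0, 1, 101)
true_err10 = np.max(np.abs(np.polyval(c10, grid) - np.sin(2 * np.pi * grid)))
true_err3 = np.max(np.abs(np.polyval(c3, grid) - np.sin(2 * np.pi * grid)))
```

The small training error of the degree-10 fit is exactly what makes it misleading: it has memorized the noise, so its deviation from the true function between the sample points is larger than that of the simpler degree-3 fit.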
Of course, you may ask: in real applications we do not know which function generated the data (that is, we cannot draw the blue line in the figure), so how can we tell whether the model is overfitting? That question is left for the next article in this series.