Original article: http://blog.csdn.net/vivihe0/article/details/33319969
As discussed earlier, in practical applications we cannot know the true function that generated the data, so how can we evaluate the quality of a model? The goal of curve fitting is to make good predictions for new values of x, so to verify a model we need a test set that is independent of the training set used to fit it. That is, the samples in the test set must not have been seen by the model during training. The performance of a trained model on new samples is called its generalization performance.
We again use the function y = sin(2 * pi * x) to generate data, this time producing two datasets: a training set and a test set. The training set is generated as in the preceding example and contains 11 sample points. For the test set, we draw 100 input values x at random from the range 0 to 1 and then add noise in the same way as for the training set to obtain the target outputs t. We fit polynomials of order 1 to 10 to the 11 training samples and then evaluate each fitted model on the test set. This gives the error of each model on both the training set and the test set.
We measure the error with the mean squared error, that is, the sum of the squared residuals divided by the number of samples. Dividing by the sample count makes the errors comparable across sets of different sizes. The MATLAB code is as follows.
% normrnd and unifrnd require the Statistics and Machine Learning Toolbox
% Generate the 11 samples of the training set
xtrain = 0:0.1:1;
ttrain = sin(2 * pi * xtrain) + normrnd(0, 0.3, 1, 11);
% Generate the 100 samples of the test set
xtest = unifrnd(0, 1, 1, 100);
ttest = sin(2 * pi * xtest) + normrnd(0, 0.3, 1, 100);
% Fit polynomials of order 1 to 10 to the training set; the coefficients are stored in polycell
polycell = cell(10, 1);
for i = 1:10
    polycell{i} = polyfit(xtrain, ttrain, i);
end
% Mean squared error of each model on the training set and the test set
rmstrain = zeros(1, 10);
rmstest = zeros(1, 10);
for i = 1:10
    e = polyval(polycell{i}, xtrain) - ttrain;
    rmstrain(i) = e * e' / 11;
    e = polyval(polycell{i}, xtest) - ttest;
    rmstest(i) = e * e' / 100;
end
% Plot both error curves against the model order
plot(1:10, rmstrain, '-ob', 'linewidth', 3, 'markersize', 10)
hold on
plot(1:10, rmstest, '-or', 'linewidth', 3, 'markersize', 10)
legend({'training set', 'test set'}, 'fontsize', 15);
xlabel('model order number', 'fontsize', 15)
ylabel('mean squared error', 'fontsize', 15)
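The error computed in the second loop above, the residual row vector multiplied by its own transpose and divided by the sample count, is the mean squared error. Written out, with y(x, w) denoting the fitted polynomial with coefficients w and N the number of samples in the set (these symbols are introduced here for illustration and are not taken from the original article):

```latex
\mathrm{MSE} = \frac{1}{N} \sum_{n=1}^{N} \bigl( y(x_n, \mathbf{w}) - t_n \bigr)^2
```

Here N = 11 for the training set and N = 100 for the test set.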
As the plot shows, the mean squared error on the training set decreases steadily as the order increases, whereas the error on the test set reaches its minimum at order 3 and then gradually increases. By order 9 the training error is close to 0, and at order 10 the polynomial has 11 coefficients, so the fitted curve passes exactly through all 11 training points; nevertheless, the model performs poorly on the test set. In other words, when the model is too complex, a perfect fit on the training set does not guarantee good generalization performance. Because the noise is random, the exact values differ from run to run.
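The trend described above can be checked numerically. The following is a minimal sketch rather than the author's original script: it assumes base MATLAB only, so the noise and the test inputs come from `0.3 * randn(...)` and `rand(...)` instead of the toolbox functions, and the names `msetrain`, `msetest`, and `bestorder` are introduced here for illustration. The seed passed to `rng` is an arbitrary choice made only so that the run is repeatable.

```matlab
rng(0);                                   % arbitrary seed (assumption), for repeatable noise
xtrain = 0:0.1:1;                         % 11 equally spaced training inputs
ttrain = sin(2*pi*xtrain) + 0.3*randn(1, 11);
xtest  = rand(1, 100);                    % 100 uniform test inputs in [0, 1]
ttest  = sin(2*pi*xtest) + 0.3*randn(1, 100);

msetrain = zeros(1, 10);
msetest  = zeros(1, 10);
for i = 1:10
    p = polyfit(xtrain, ttrain, i);       % least-squares fit of order i
    msetrain(i) = mean((polyval(p, xtrain) - ttrain).^2);
    msetest(i)  = mean((polyval(p, xtest)  - ttest).^2);
end

[~, bestorder] = min(msetest);            % order with the lowest test error
fprintf('lowest test MSE at order %d\n', bestorder);
fprintf('training MSE at order 10: %.3g\n', msetrain(10));
```

The order reported as best depends on the noise draw, so it should not be read as a fixed property of the problem. The training error at order 10 should come out at the level of rounding error, because a polynomial with 11 coefficients can interpolate 11 distinct points. MATLAB may warn that the high-order fits are badly conditioned, which is expected here.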
We can also inspect the coefficients of the polynomial models of different orders. In the code above, the coefficients are stored in the cell array polycell, so the following command at the MATLAB command line displays the coefficients of the polynomial of order i.
polycell{i}
For example, enter:
polycell{10}
You can see that when the order of the polynomial is 10, the coefficients of the model are very large. With these large coefficients the curve passes exactly through all 11 sample points, but it fluctuates strongly around these sample points, as shown in the 4th subplot of Figure 1. That is, the higher the order of the polynomial model, the more flexible the model is, and the more readily the fitted curve adapts to random noise.
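To see how the coefficient size changes with the order without reading each vector by hand, one option is to record the largest absolute coefficient per order. This is only a sketch under stated assumptions: it regenerates its own training data with base MATLAB's `randn` and an arbitrary seed, and `maxcoef` is a name introduced here, not part of the original code.

```matlab
rng(0);                                   % arbitrary seed (assumption), for repeatable noise
xtrain = 0:0.1:1;                         % 11 equally spaced training inputs
ttrain = sin(2*pi*xtrain) + 0.3*randn(1, 11);

maxcoef = zeros(1, 10);                   % largest |coefficient| for each order
for i = 1:10
    maxcoef(i) = max(abs(polyfit(xtrain, ttrain, i)));
end

% fprintf walks the 2-by-10 matrix column by column: (order, value) pairs
fprintf('order %2d: largest |coefficient| = %g\n', [1:10; maxcoef]);
```

The exact figures depend on the noise, but the largest coefficient at order 10 should be orders of magnitude above that of the low-order fits, which matches the strong fluctuation of the interpolating curve.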