This article is published at http://www.cnblogs.com/kemaswill/. Contact: kemaswill@163.com.
The goal of machine learning is to learn a model with good generalization ability, that is, a model that performs well on new data after being trained on the training data. This involves two very important concepts in machine learning: underfitting and overfitting. If a model performs very well on the training data but poorly on new data, it is overfitting; if it performs poorly on both the training data and new data, it is underfitting, as shown in the figure below.
The blue crosses represent the training data, and the blue lines represent the learned models. The model on the left cannot properly describe the training data; it is too simple (under-fitting). The model in the middle describes the training data well. The model on the right fits the training data too closely: since the training data inevitably contains some random noise, fitting it perfectly means fitting that noise as well, so the model becomes overly complex and is very likely to perform poorly on new data (over-fitting).
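As a minimal illustration of this idea (my own sketch, not code from the article), the following Python snippet fits polynomials of different degrees to noisy samples of sin(2πx); the degree-1 fit underfits, while the degree-9 fit tracks the noise and does worse on the noise-free test grid:

```python
import numpy as np

rng = np.random.default_rng(0)

# 12 noisy training points drawn from the true function h(x) = sin(2*pi*x)
x_train = np.linspace(0, 1, 12)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=x_train.size)

# a dense grid for measuring error against the noise-free true function
x_test = np.linspace(0, 1, 200)
t_test = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, t_train, degree)          # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - t_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - t_test) ** 2)
    print(f"degree={degree}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```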
Bias-variance decomposition is a statistical view of model complexity. The details are as follows:
Suppose we have K datasets, each drawn independently from a distribution p(t, x) (t is the target variable to be predicted and x is the feature variable). For each dataset D we can train a model y(x; D) with the learning algorithm, so training on different datasets yields different models. The performance of the learning algorithm is measured by the average performance of the K models trained on these K datasets, that is:
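In PRML's notation, the quantity being averaged can be written as the squared error of the dataset-specific model against the true function, taken in expectation over datasets:

$$\mathbb{E}_D\!\left[\{y(\mathbf{x}; D) - h(\mathbf{x})\}^2\right]$$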
Here h(x) denotes the true function that generates the data, that is, t = h(x).
We can see that the error between the models learned by a given learning algorithm on multiple datasets and the true function h(x) consists of two parts: bias and variance. The bias describes the average error between the learned models and the true function, and the variance describes how much the models learned on individual datasets vary around their average (the original statement in PRML is that the variance "measures the extent to which the solutions for individual data sets vary around their average").
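In PRML's notation, this decomposition can be written as:

$$\mathbb{E}_D\!\left[\{y(\mathbf{x}; D) - h(\mathbf{x})\}^2\right] = \underbrace{\{\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\}^2}_{(\text{bias})^2} + \underbrace{\mathbb{E}_D\!\left[\{y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)]\}^2\right]}_{\text{variance}}$$

The first term (the squared bias) measures how far the average prediction is from the true function; the second term (the variance) measures how much the individual solutions scatter around that average.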
Therefore, there is a trade-off between bias and variance during learning. A flexible model (for example, a high-degree polynomial) tends to have low bias and high variance, while a more rigid model (such as linear regression) tends to have high bias and low variance. The two situations are illustrated below:
100 datasets are used for training; each dataset contains 25 points generated at random from h(x) = sin(2πx) (the green line in the right-hand figures). The parameter λ controls the flexibility (complexity) of the model: the larger λ is, the simpler (more rigid) the model, and the smaller λ is, the more complex (flexible) the model. We train multiple models (the red lines in the left-hand figures) and plot the average of these models (the red line in the right-hand figures). When λ is large (the top row), the average model is relatively simple (top right) and cannot fit the true function h(x) well, i.e., the bias is large, but the individual models are similar to one another, i.e., the variance is small (top left). When λ is small (the bottom row), the average model fits the true function h(x) very well, i.e., the bias is small (bottom right), but the individual models differ greatly from one another, i.e., the variance is large (bottom left).
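A minimal numerical sketch of this experiment (my own illustration, not code from the article): ridge-regularized polynomial regression stands in for PRML's Gaussian-basis model, and the squared bias and variance are computed as averages over the 100 fitted models. A large λ gives low variance but high bias; a small λ gives the opposite.

```python
import numpy as np

rng = np.random.default_rng(1)
L, N, degree = 100, 25, 9              # 100 datasets, 25 points each, degree-9 basis

def design(x):
    # polynomial basis phi(x) = [1, x, x^2, ..., x^degree]
    return np.vander(x, degree + 1, increasing=True)

x_grid = np.linspace(0, 1, 100)
h_grid = np.sin(2 * np.pi * x_grid)    # true function h(x)
Phi_grid = design(x_grid)

for lam in (1e1, 1e-3):                # large lambda = rigid model, small lambda = flexible model
    preds = np.empty((L, x_grid.size))
    for l in range(L):
        x = rng.uniform(0, 1, N)
        t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)
        Phi = design(x)
        # ridge solution: w = (Phi^T Phi + lam * I)^(-1) Phi^T t
        w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(degree + 1), Phi.T @ t)
        preds[l] = Phi_grid @ w
    avg = preds.mean(axis=0)                       # average of the 100 models
    bias2 = np.mean((avg - h_grid) ** 2)           # squared bias against h(x)
    variance = np.mean(preds.var(axis=0))          # spread of the models around their average
    print(f"lambda={lam:g}: bias^2={bias2:.4f}, variance={variance:.4f}")
```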
The Bagging method can effectively reduce the variance. Bagging is a resampling method: it resamples the training data K times to generate K new training sets, trains K models on these K new training sets, and then uses the average of the K models as the final model. Random Forest is a powerful algorithm based on Bagging.
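A minimal sketch of bagging for regression (my own illustration; the polynomial base model and the choices of K and degree are arbitrary, not from the article): each of the K models is fit on a bootstrap resample of the training data, and their predictions are averaged.

```python
import numpy as np

def bagging_predict(x_train, t_train, x_test, K=50, degree=9, seed=0):
    """Average the predictions of K models, each fit on a bootstrap resample of the training data."""
    rng = np.random.default_rng(seed)
    n = x_train.size
    preds = np.zeros((K, x_test.size))
    for k in range(K):
        idx = rng.integers(0, n, size=n)                 # bootstrap: draw n indices with replacement
        coeffs = np.polyfit(x_train[idx], t_train[idx], degree)
        preds[k] = np.polyval(coeffs, x_test)
    return preds.mean(axis=0)                            # aggregate by averaging the K models

# usage: bagged fit to 25 noisy samples of sin(2*pi*x)
rng = np.random.default_rng(2)
x_tr = np.linspace(0, 1, 25)
t_tr = np.sin(2 * np.pi * x_tr) + rng.normal(scale=0.3, size=25)
x_te = np.linspace(0, 1, 200)
y_bagged = bagging_predict(x_tr, t_tr, x_te)
```

Because the averaged prediction is less sensitive to the particular sample each model saw, the variance term of the error is reduced while the bias stays roughly the same.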
In addition to the learning algorithm and its parameters (such as λ), the datasets themselves also affect bias and variance. If the distribution of the training data differs from that of the new data, the bias increases; if the training dataset is too small, the variance increases.
Bias-variance decomposition is the statisticians' way of explaining model complexity, but it is of limited practical value (Bagging may be an exception~), because the decomposition assumes multiple datasets, while in reality there is only one training dataset; training on the whole dataset is better than splitting it into several fixed-size datasets, training on each, and then averaging the results.
References
[1]. Bishop. PRML (Pattern Recognition and Machine Learning). pp. 11-16
[2]. Understanding the Bias-Variance Decomposition.
[3]. Andrew Ng. CS229 Lecture Notes 1: Supervised Learning, Discriminative Algorithms
[4]. Machine Learning - Random Forest Algorithm Overview