1.4 Model Evaluation and Model Selection
Generalization ability: the ability of a learned model to predict unknown data.
Over-fitting: the selected model contains too many parameters, so that it fits the known (training) data well but predicts unknown data poorly.
Empirical risk minimization (ERM): solve for the model that minimizes the empirical risk, i.e. the average loss over the training data:

$$\min_{f \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i))$$
When the model is a conditional probability distribution and the loss function is the log loss, ERM is equivalent to maximum likelihood estimation (MLE).
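As a quick sanity check of this equivalence, here is a minimal sketch (the Bernoulli model and variable names are my own, for illustration): under the log loss, the empirical risk is exactly the negative average log-likelihood, so minimizing one maximizes the other.

```python
import numpy as np

# Bernoulli model p(y=1) = p with log loss L(y, p) = -log p(y): the
# empirical risk (1/N) * sum_i -log p(y_i) is the negative average
# log-likelihood, so ERM and MLE select the same parameter.
y = np.array([1, 0, 1, 1, 0, 1])

def empirical_risk(p):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

grid = np.linspace(0.01, 0.99, 99)
best_p = grid[np.argmin([empirical_risk(p) for p in grid])]
print(best_p, y.mean())  # both ~0.67: the ERM solution equals the MLE
```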
Structural risk minimization (SRM): when the sample size is small, ERM is prone to over-fitting; SRM was proposed to prevent over-fitting. SRM is equivalent to regularization: it adds to the empirical risk a regularizer (penalty term) that represents the complexity of the model:

$$\min_{f \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda J(f)$$
A good model must therefore make both the empirical risk and the model complexity small. When the model is a conditional probability distribution, the loss function is the log loss, and the model complexity is represented by the prior probability of the model, SRM is equivalent to maximum a posteriori estimation (MAP) in Bayesian estimation.
To make the test error small, a model of appropriate complexity must be selected. Two model selection methods are commonly used: regularization and cross-validation.
1.5 Regularization and Cross-Validation
Structural risk = empirical risk + regularization term
In this formula, the first term is the empirical risk and the second term is the regularization term.
Regularization term: the penalty term in the structural risk; common choices include the L1 norm or the L2 norm of the parameter vector.
The role of regularization is to select a model with both small empirical risk and low model complexity.
Regularization conforms to Occam's razor: among all models that explain the known data equally well, the simplest one is the best.
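A concrete instance of SRM is ridge regression: the empirical risk is the mean squared error and the regularization term is the squared L2 norm of the parameter vector. A minimal sketch (the function name and interface are my own, for illustration):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Minimize (1/N)*||y - X w||^2 + lam*||w||^2 (empirical risk plus
    an L2 penalty); setting the gradient to zero gives the closed form
    w = (X^T X + N*lam*I)^{-1} X^T y."""
    N, d = X.shape
    return np.linalg.solve(X.T @ X + N * lam * np.eye(d), X.T @ y)
```

With lam = 0 this reduces to ordinary least squares (pure ERM); increasing lam trades a larger empirical risk for a simpler, smaller-norm model.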
Datasets are often split into three parts: the training set, the validation set, and the test set, used respectively to train the model, to select the model, and to evaluate the final model. When data is insufficient, however, splitting it three ways wastes scarce data.
Cross-validation is therefore introduced; the common variants are:
Simple cross-validation: randomly split the dataset into two parts, a training set and a test set.
S-fold cross-validation: randomly partition the dataset into S disjoint subsets of the same size; train on S-1 subsets and test on the remaining one; repeat over all S choices of held-out subset and select the model with the smallest average test error (see the sketch after this list).
Leave-one-out cross-validation: used when data is scarce; it is the special case of S-fold cross-validation with S = N, where N is the sample size.
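A minimal sketch of S-fold cross-validation (assuming a hypothetical model object exposing fit and predict methods; the interface is illustrative, not any particular library's):

```python
import numpy as np

def s_fold_cv_error(make_model, X, y, S=5, seed=0):
    """Average 0-1 test error of a model over S cross-validation folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), S)  # S equal-size index sets
    errors = []
    for k in range(S):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(S) if j != k])
        model = make_model()                   # a fresh model for each fold
        model.fit(X[train_idx], y[train_idx])  # train on the other S-1 subsets
        errors.append(np.mean(model.predict(X[test_idx]) != y[test_idx]))
    return float(np.mean(errors))
```

Model selection then amounts to computing this average error for each candidate model and keeping the smallest; with S = len(X), the same code performs leave-one-out cross-validation.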
1.6 Generalization Ability
Generalization ability: the ability of the model learned by a method to predict unknown data.
Generalization error: the expected risk of the learned model.
Generalization error bound: it is a function of the sample size N (as N increases, the bound tends to 0) and a function of the hypothesis-space capacity (the larger the capacity, the harder the model is to learn and the larger the generalization error bound).
For binary classification over a finite hypothesis space containing d functions, with probability at least 1 - δ the following bound holds:

$$R(f) \le \hat{R}(f) + \varepsilon(d, N, \delta), \qquad \varepsilon(d, N, \delta) = \sqrt{\frac{1}{2N}\left(\log d + \log\frac{1}{\delta}\right)}$$

The first term, $\hat{R}(f)$, is the empirical error (training error).
The second term depends on the number of samples N: as N tends to infinity it tends to 0, so the expected error approaches the empirical error.
d is the number of functions in the hypothesis space: the larger d is, the harder learning becomes and the larger the generalization error bound.
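The bound is easy to evaluate numerically; the sketch below plugs arbitrary illustrative numbers into ε(d, N, δ) to show it shrinking with N and growing with d:

```python
import math

def generalization_bound(train_error, d, N, delta=0.05):
    """R(f) <= R_hat(f) + eps(d, N, delta) for a finite hypothesis space
    of d functions, holding with probability at least 1 - delta."""
    eps = math.sqrt((math.log(d) + math.log(1.0 / delta)) / (2.0 * N))
    return train_error + eps

print(generalization_bound(0.10, d=100, N=1_000))     # ~0.162
print(generalization_bound(0.10, d=100, N=100_000))   # ~0.106 (more data)
print(generalization_bound(0.10, d=10_000, N=1_000))  # ~0.178 (bigger space)
```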
1.7 Generative Model and Discriminative Model
The generative approach learns the joint probability distribution P(X, Y) from the data and then derives the conditional probability distribution P(Y|X) = P(X, Y) / P(X) as the prediction model; models learned this way are called generative models. Typical generative models are the naive Bayes model and the hidden Markov model (a toy sketch follows the list of advantages below).
Advantages:
Can recover the joint probability distribution P(X, Y)
Faster convergence: the learned model approaches the true model more quickly as the sample size grows
Can still be used when hidden (latent) variables are present
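A toy generative sketch (the data and names are invented for illustration): it estimates the joint distribution P(X, Y) by counting and then predicts through P(Y|X) = P(X, Y) / P(X), in the spirit of naive Bayes.

```python
from collections import Counter

# Toy (feature, label) pairs; a generative model estimates P(x, y) first.
data = [("sunny", "play"), ("sunny", "play"), ("rainy", "stay"),
        ("rainy", "stay"), ("sunny", "stay"), ("rainy", "play")]

joint = Counter(data)                     # counts estimate P(x, y)
marginal_x = Counter(x for x, _ in data)  # counts estimate P(x)
labels = {y for _, y in data}

def predict(x):
    # P(y | x) = P(x, y) / P(x); the common 1/N factors cancel out.
    posterior = {y: joint[(x, y)] / marginal_x[x] for y in labels}
    return max(posterior, key=posterior.get)

print(predict("sunny"))  # "play", since P(play | sunny) = 2/3
```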
The discriminative approach learns the decision function f(X) or the conditional probability distribution P(Y|X) directly from the data; models learned this way are called discriminative models. Typical discriminative models include the k-nearest-neighbor method, the perceptron, decision trees, logistic regression, the maximum entropy model, support vector machines, boosting methods, and conditional random fields.
Advantages:
Usually higher prediction accuracy
The data can be abstracted and features defined and used, which simplifies the learning problem
1.8 Classification Problems
When the output variable takes a finite number of discrete values, the prediction problem is a classification problem.
The learned classification model or classification decision function is called a classifier.
1.9 Tagging Problems
A generalization of the classification problem: the input is an observation sequence and the output is a sequence of tags.
A typical application is part-of-speech tagging: the input is a sequence of words and the output is a sequence of (word, part of speech) pairs.
1.10 Regression Problems
Regression: both the input and the output are continuous variables. Regression predicts the relationship between input and output variables, i.e. it selects a mapping function from input variables to output variables; it is equivalent to function fitting: choosing a function curve that fits the known data well and predicts unknown data well.
By the number of input variables, regression divides into simple (one-variable) regression and multiple regression; by model type, into linear regression and nonlinear regression.
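A minimal function-fitting sketch (the data points are made up for illustration): least squares selects the polynomial curve that best fits the known points, which is then used to predict an unseen input.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 9.2, 19.1, 32.8])  # roughly y = 2x^2 + 1, with noise

coeffs = np.polyfit(x, y, deg=2)  # least-squares fit of a degree-2 polynomial
model = np.poly1d(coeffs)
print(model(5.0))                 # prediction at an unseen input (about 51)
```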
This first chapter mainly introduces basic concepts, and it is necessary to understand them well.