1. Statistical Learning
Statistical learning is a discipline in which computers build probabilistic and statistical models from data and use those models to predict and analyze data; it is also called statistical machine learning. Statistical learning is data-driven, and it draws on many fields, including probability theory, statistics, information theory, computation theory, optimization theory, and computer science.
The object of statistical learning is data. It starts from data, extracts features, abstracts models, discovers knowledge in the data, and returns to the analysis and prediction of data. The basic assumption statistical learning makes about data is that similar data exhibit certain statistical regularity; this is the premise of statistical learning.
The purpose of statistical learning is to decide which model to learn and how to learn it.
A statistical learning method comprises the model's hypothesis space, the model selection criterion, and the model learning algorithm. Statistical learning proceeds in the following steps:
(1) obtain a finite training dataset;
(2) determine the hypothesis space containing all possible models, that is, the set of candidate models;
(3) determine the model selection criterion, that is, the learning strategy;
(4) determine the algorithm for solving for the optimal model, that is, the learning algorithm;
(5) select the optimal model by means of the learning method;
(6) use the learned optimal model to predict or analyze new data.
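To make these steps concrete, here is a minimal sketch in Python/NumPy; the synthetic data, the threshold-classifier hypothesis space, and the 0-1 loss strategy are illustrative assumptions, not prescriptions from the text:

```python
import numpy as np

# (1) A finite training dataset and some new data (synthetic for illustration).
rng = np.random.default_rng(0)
x_train = np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)])
y_train = np.concatenate([np.zeros(50), np.ones(50)])
x_new = np.array([-0.5, 1.4, 3.2])

# (2) Hypothesis space: threshold classifiers f_t(x) = 1 if x > t else 0.
thresholds = np.linspace(x_train.min(), x_train.max(), 200)

# (3) Strategy: empirical risk under the 0-1 loss (misclassification rate).
def empirical_risk(t):
    return np.mean((x_train > t).astype(int) != y_train)

# (4) + (5) Algorithm and model selection: search the hypothesis space
# for the threshold with the smallest empirical risk.
best_t = min(thresholds, key=empirical_risk)

# (6) Use the selected model to predict or analyze new data.
print("selected threshold:", best_t)
print("predictions:", (x_new > best_t).astype(int))
```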
2. Supervised Learning
Supervised learning learns a model from training data and uses it to predict on test data. The training set is usually represented as T = {(x1, y1), (x2, y2), ..., (xN, yN)}, where each pair consists of an input instance and its corresponding output.
Prediction tasks are given different names according to the types of the input and output variables: a prediction problem in which both input and output variables are continuous is called regression; one in which the output variable takes finitely many discrete values is called classification; one in which both input and output are sequences of variables is called labeling.
Supervised learning assumes that the random input and output variables X and Y follow a joint probability distribution P(X, Y), where P(X, Y) denotes the distribution function or the distribution density function. Statistical learning assumes that the data have certain statistical regularity, and the assumption that X and Y have a joint probability distribution is supervised learning's basic assumption about the data.
The supervised learning model can be either a probabilistic model or a non-probabilistic model, expressed as a conditional probability distribution P(Y|X) or a decision function Y = f(X), depending on the specific learning method.
Supervised learning is divided into two processes, learning and prediction, carried out by a learning system and a prediction system, respectively.
During learning, the learning system uses the given training dataset to obtain a model, represented as a conditional probability distribution P(Y|X) or a decision function Y = f(X). During prediction, the prediction system takes inputs from the given test sample set and, using the learned model, produces the corresponding predicted outputs.
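For illustration, a minimal sketch contrasting the two forms of model on a single new input (the weights are hypothetical values standing in for an already-learned model):

```python
import numpy as np

w, b = np.array([1.5, -2.0]), 0.3   # parameters assumed already learned
x = np.array([0.8, 0.1])            # a new input instance

# Probabilistic model: a conditional probability distribution P(Y | X),
# here a logistic model over the two classes {0, 1}.
p_y1 = 1.0 / (1.0 + np.exp(-(w @ x + b)))
print("P(Y=1 | x) =", p_y1, " P(Y=0 | x) =", 1 - p_y1)

# Non-probabilistic model: a decision function Y = f(X) that maps the
# input directly to a label, e.g. a thresholded linear function.
f = lambda x: 1 if w @ x + b > 0 else 0
print("f(x) =", f(x))
```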
3. The Three Elements of Statistical Learning
Statistical learning = model + strategy + algorithm
3.1 Model
In statistical learning, the first consideration is what kind of model to learn. In supervised learning, the model is the conditional probability distribution or decision function to be learned. A model represented by a decision function is a non-probabilistic model; a model represented by a conditional probability distribution is a probabilistic model.
3.2 Strategy
Given the hypothesis space of models, statistical learning next needs a criterion for learning, that is, for selecting the optimal model. Supervised learning is in fact an optimization problem over an empirical risk or structural risk function. The risk function measures the quality of the model's predictions in an average sense, while the loss function measures the quality of a single prediction.
Supervised learning selects a model f from the hypothesis space F as the decision function. For a given input X, f(X) gives the corresponding output; the predicted value f(X) may or may not agree with the true value Y. A loss function, written L(Y, f(X)), is used to measure the degree of prediction error. Common loss functions include the following:
(1) 0-1 loss: L(Y, f(X)) = 1 if Y ≠ f(X), and 0 if Y = f(X);
(2) quadratic (squared) loss: L(Y, f(X)) = (Y - f(X))^2;
(3) absolute loss: L(Y, f(X)) = |Y - f(X)|;
(4) logarithmic loss: L(Y, P(Y|X)) = -log P(Y|X).
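These losses are straightforward to express directly; here is a minimal sketch in Python/NumPy (the helper names are illustrative), including the empirical risk, i.e. the average loss over the training data:

```python
import numpy as np

def zero_one_loss(y, y_pred):
    # 0-1 loss: 1 if the prediction is wrong, 0 if it is correct.
    return np.where(y != y_pred, 1.0, 0.0)

def squared_loss(y, y_pred):
    # Quadratic (squared) loss, typical for regression.
    return (y - y_pred) ** 2

def absolute_loss(y, y_pred):
    # Absolute loss.
    return np.abs(y - y_pred)

def log_loss(p_y_given_x):
    # Logarithmic loss for a probabilistic model: -log P(Y | X),
    # where p_y_given_x is the probability the model assigns to the true label.
    return -np.log(p_y_given_x)

def empirical_risk(loss, y, y_pred):
    # Empirical risk: the average loss over the training set.
    return np.mean(loss(y, y_pred))
```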
3.3 Algorithm
The statistical learning problem thus reduces to the optimization problem above, and the statistical learning algorithm is the algorithm used to solve that optimization problem. If the optimization problem has an explicit analytical solution, it is relatively simple; usually, however, no analytical solution exists, and numerical methods must be used. Statistical learning can use existing optimization algorithms or develop dedicated ones.
4. Model Evaluation and Model Selection
Once the loss function is given, the training error and the test error of a model with respect to that loss function naturally become the criteria for evaluating a learning method.
The training error is the average loss of the learned model Y = f(X) over the training dataset: R_emp(f) = (1/N) Σ L(y_i, f(x_i)), where N is the number of training samples; the test error is the corresponding average loss over the test dataset.
Consider polynomial curve fitting with degrees M = 0, M = 1, M = 3, and M = 9 (in the book's figure, the green curve is the true model and the red curve is the fitted model).
The M = 0 and M = 1 models are too simple: they underfit and have large training error. The M = 9 model is too complex: it overfits, its training error is 0, but it generalizes poorly. The M = 3 model has moderate complexity, strong generalization ability, and the best performance.
The relationship between the training error, the test error, and the model complexity can be described as follows:
As model complexity increases, the training error decreases steadily and tends to 0, while the test error first decreases, reaches a minimum, and then increases again. Typical methods for model selection are regularization and cross-validation.
5. Regularization and Cross-Validation
A typical model selection method is regularization. The general form of regularization is to minimize, over the hypothesis space, the sum of the empirical risk and a regularization term: min over f of (1/N) Σ L(y_i, f(x_i)) + λ J(f), where J(f) measures the complexity of the model and λ ≥ 0 weighs the two terms.
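A minimal sketch reproducing this behaviour numerically (synthetic noisy samples of a sine curve; the data-generating choice is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 10)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 100)

for M in [0, 1, 3, 9]:
    coeffs = np.polyfit(x_train, y_train, M)            # least-squares fit of degree M
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"M={M}: training error {train_err:.4f}, test error {test_err:.4f}")

# M = 0, 1 underfit (large training error); M = 9 drives the training
# error to (nearly) 0 but generalizes poorly; M = 3 balances the two.
```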
Here the first term is the empirical risk and the second is the regularization term. The regularization term can take different forms; for example, it can be a norm of the model's parameter vector. In regression problems, where the loss function is the squared loss, the regularization term can be the L2 norm of the parameter vector, (λ/2)||w||^2, or alternatively the L1 norm, λ||w||_1.
A model with smaller empirical risk may be more complex, in which case the regularization term will be larger; regularization selects a model for which both the empirical risk and the model complexity are small.
Regularization conforms to the principle of Occam's razor: among all possible models, the best model is one that explains the known data well and is as simple as possible. From the viewpoint of Bayesian estimation, the regularization term corresponds to the prior probability of the model: a complex model can be assumed to have a smaller prior probability, and a simple model a larger one.
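As a concrete example of the L2 case, here is a minimal sketch of ridge regression solved in closed form (the synthetic data and the candidate values of the regularization weight are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))                  # design matrix
w_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = X @ w_true + rng.normal(0, 0.1, 50)

def ridge(X, y, lam):
    # Minimizes (1/N) * ||X w - y||^2 + lam * ||w||^2,
    # which has the closed-form solution below.
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * len(y) * np.eye(n_features), X.T @ y)

for lam in [0.0, 0.1, 10.0]:
    w = ridge(X, y, lam)
    print(f"lambda={lam:5.1f}  ||w|| = {np.linalg.norm(w):.3f}")

# Larger lambda shrinks the parameter vector, i.e. prefers simpler models.
# The L1-norm penalty has no closed form and is usually handled by
# iterative methods such as coordinate descent.
```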
Another method for model selection is cross-validation. If the data are plentiful, a simple way to select a model is to randomly divide the dataset into three parts: a training set, a validation set, and a test set; the training set is used to train the models, the validation set for model selection, and the test set for the final evaluation of the learning method. If the data are insufficient, cross-validation can be used to select the model. Common cross-validation methods include simple cross-validation, S-fold cross-validation, and leave-one-out cross-validation.
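A minimal sketch of S-fold cross-validation, here used to choose the regularization weight of ridge regression (the synthetic data, the candidate values, and S = 5 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 8))
y = X @ rng.normal(size=8) + rng.normal(0, 0.5, 60)

def ridge_fit(X, y, lam):
    # Closed-form minimizer of (1/N)||Xw - y||^2 + lam * ||w||^2.
    return np.linalg.solve(X.T @ X + lam * len(y) * np.eye(X.shape[1]), X.T @ y)

def s_fold_cv_error(X, y, lam, S=5):
    folds = np.array_split(np.arange(len(y)), S)
    errs = []
    for k in range(S):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(S) if j != k])
        w = ridge_fit(X[train], y[train], lam)             # train on S-1 folds
        errs.append(np.mean((X[val] @ w - y[val]) ** 2))   # validate on the held-out fold
    return np.mean(errs)                                   # average over the S folds

candidates = [0.0, 0.01, 0.1, 1.0, 10.0]
best = min(candidates, key=lambda lam: s_fold_cv_error(X, y, lam))
print("lambda selected by 5-fold cross-validation:", best)
```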
6. Generalization Ability
7. Generative Models and Discriminative Models
Discriminative Model
This kind of model directly models P(Y|X) and uses X to predict Y. During modeling there is no need to consider the joint probability distribution; the focus is only on how to optimize P(Y|X) so as to separate the data. In classification tasks, discriminative models generally outperform generative models. However, discriminative modeling is usually supervised and does not extend easily to the unsupervised setting.
Common discriminative models include:
Logistic regression
Linear discriminant analysis
Support vector machines
Boosting
Conditional random fields
Linear regression
Neural networks
Generative Model
This kind of model models the joint probability distribution P(X, Y) of the observations. Once the joint distribution is obtained, the conditional probability distribution can be derived via Bayes' formula. Generative models carry more information than discriminative models, and they lend themselves to incremental learning.
Common generative models include:
Gaussian mixture models and other types of mixture model
Hidden Markov model
Naive Bayes
AODE
Latent Dirichlet allocation
Restricted Boltzmann machine
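To make the contrast concrete, here is a minimal sketch on synthetic one-dimensional data (both models are simplified illustrations, not the book's code): a Gaussian naive-Bayes-style generative model that estimates P(Y) and P(X|Y) and applies Bayes' formula, versus a logistic-regression discriminative model that fits P(Y|X) directly.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-1, 1, 100), rng.normal(1, 1, 100)])
y = np.concatenate([np.zeros(100), np.ones(100)])

# Generative: estimate P(Y) and P(X | Y) as Gaussians, then apply Bayes' formula.
priors = np.array([np.mean(y == c) for c in (0, 1)])
means = np.array([x[y == c].mean() for c in (0, 1)])
stds = np.array([x[y == c].std() for c in (0, 1)])

def generative_posterior(x_new):
    lik = np.exp(-0.5 * ((x_new - means) / stds) ** 2) / stds  # p(x | y), up to a constant
    joint = priors * lik                                       # proportional to p(x, y)
    return joint[1] / joint.sum()                              # P(Y=1 | x) by Bayes' formula

# Discriminative: fit P(Y=1 | x) = sigmoid(w*x + b) directly by gradient descent on log loss.
w, b = 0.0, 0.0
for _ in range(2000):
    p = 1 / (1 + np.exp(-(w * x + b)))
    w -= 0.1 * np.mean((p - y) * x)
    b -= 0.1 * np.mean(p - y)

x_new = 0.5
print("generative     P(Y=1 | x=0.5):", generative_posterior(x_new))
print("discriminative P(Y=1 | x=0.5):", 1 / (1 + np.exp(-(w * x_new + b))))
```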
As can be seen from the above, the most important difference between discriminative and generative models lies in the training objective. A discriminative model optimizes the conditional probability distribution P(Y|X) so that X maps to Y as accurately as possible, making the classes more separable in classification; a generative model optimizes the joint probability distribution of the training data, from which the conditional distribution can then be obtained via Bayes' formula. A discriminative model can thus be derived from a generative model, but a generative model cannot be obtained from a discriminative model.
8. Classification, Labeling, and Regression
As mentioned above, a prediction problem in which both input and output variables are continuous is called regression; one in which the output variable takes finitely many discrete values is called classification; and one in which both input and output are sequences of variables is called labeling.
For binary classification problems, the common evaluation metrics are precision and recall. Usually the class of interest is taken as positive and the other class as negative. The classifier's predictions on the test dataset are either correct or incorrect, and the four possible outcomes are counted as follows:
TP -- the number of positive instances predicted as positive;
FN -- the number of positive instances predicted as negative;
FP -- the number of negative instances predicted as positive;
TN -- the number of negative instances predicted as negative.
The precision is then defined as P = TP / (TP + FP), and the recall as R = TP / (TP + FN).
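A minimal sketch of computing these two metrics from predicted and true binary labels (the example arrays are made up for illustration):

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # positives predicted as positive
fp = np.sum((y_pred == 1) & (y_true == 0))  # negatives predicted as positive
fn = np.sum((y_pred == 0) & (y_true == 1))  # positives predicted as negative

precision = tp / (tp + fp)   # P = TP / (TP + FP)
recall = tp / (tp + fn)      # R = TP / (TP + FN)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```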
Many statistical learning methods can be used for classification, including the k-nearest neighbor method, the perceptron, the naive Bayes method, decision trees, decision lists, logistic regression models, support vector machines, boosting methods, Bayesian networks, neural networks, Winnow, and so on.
The input to a labeling problem is an observation sequence, and the output is a label sequence. Labeling is widely used in information extraction, natural language processing, and other fields. For example, part-of-speech tagging in natural language processing is a typical labeling problem: given a sentence composed of words, each word is assigned a part of speech; that is, the part-of-speech tag sequence corresponding to the word sequence is predicted. Common statistical learning methods for labeling include the hidden Markov model and conditional random fields.
Regression learning is equivalent to function fitting: a function is chosen to fit the known data and to predict unknown data. Regression problems are divided into univariate regression and multiple regression according to the number of input variables, and into linear and nonlinear regression according to the type of relationship between the input and output variables, that is, the type of model. The most commonly used loss function in regression learning is the squared loss, in which case the regression problem can be solved by the well-known least squares method.
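As an illustration, here is a minimal least squares fit of a linear model under the squared loss (the data is synthetic; np.linalg.lstsq is one standard way to compute the solution):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 30)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, 30)

# Design matrix with a column of ones for the intercept.
A = np.column_stack([x, np.ones_like(x)])
# Least squares solution minimizing the sum of squared residuals ||A w - y||^2.
w, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
print(f"fitted slope {w[0]:.3f}, intercept {w[1]:.3f}")
```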
(Sina Weibo: @ quanliang _ machine learning)
These notes are based on Statistical Learning Methods by Li Hang.