Chapter 1. Introduction to Statistical Learning Methods

The main features of statistical learning are:
(1) Statistical learning is built on computers and networks; (2) it takes data as its object of study and is a data-driven discipline; (3) its purpose is to predict and analyze data; (4) it is method-centered: statistical learning methods construct models and use them to predict and analyze; (5) it is an interdisciplinary field built on probability theory, statistics, information theory, computational theory, optimization theory, and computer science, and it has gradually formed its own independent theoretical system and methodology.
The object of statistical learning is data. The purpose of statistical learning is to predict and analyze data, especially unknown new data. Categories: supervised learning (supervised learning), unsupervised learning (unsupervised learning), semi-supervised learning (semi-supervised learning), and reinforcement learning (reinforcement learning). The three elements of a statistical learning method are the model, the strategy, and the algorithm. The steps to carry out a statistical learning method are: (1) obtain a finite set of training data; (2) determine the hypothesis space containing all possible models, i.e., the set of candidate models; (3) determine the criterion for model selection, i.e., the learning strategy; (4) implement the algorithm for solving the optimal model, i.e., the learning algorithm; (5) select the optimal model by the learning method; (6) use the learned optimal model to predict or analyze new data.
Supervised learning (supervised learning): the feature vector of an input instance x is written x = (x^(1), x^(2), ..., x^(n))^T, and the training set is written T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}. A prediction problem in which both the input and output variables are continuous is called a regression problem; one in which the output variable is a finite discrete variable is called a classification problem; one in which both the input and output variables are variable sequences is called a tagging problem. Supervised learning assumes that the input and output random variables X and Y follow a joint probability distribution P(X, Y). A model for the supervised learning problem: the learning system uses a given training data set to learn (train) a model, expressed as a conditional probability distribution P̂(Y|X) or a decision function Y = f̂(X), which describes the mapping between the input and output random variables.
Three elements of a statistical learning method: method = model + strategy + algorithm.
Model: in supervised learning, the model is the conditional probability distribution or decision function to be learned. The hypothesis space (hypothesis space) of models contains all possible conditional probability distributions or decision functions. The hypothesis space can be defined as a collection of decision functions or of conditional probability distributions, i.e., a family of functions determined by a parameter vector:
F = {f | Y = f_θ(X), θ ∈ R^n} or F = {P | P_θ(Y|X), θ ∈ R^n}
Strategy
Loss function: a loss function (loss function) or cost function measures the degree of prediction error. The loss function is a non-negative real-valued function of f(X) and Y, written L(Y, f(X)). Common loss functions:
(1) 0-1 loss function (0-1 loss function): L(Y, f(X)) = 1 if Y ≠ f(X), else 0
(2) square loss function (quadratic loss function): L(Y, f(X)) = (Y − f(X))²
(3) absolute loss function (absolute loss function): L(Y, f(X)) = |Y − f(X)|
(4) logarithmic loss function (logarithmic loss function) or log-likelihood loss function (log-likelihood loss function): L(Y, P(Y|X)) = −log P(Y|X)
The expectation of the loss function, i.e., the model f(X)'s average loss in the sense of the joint distribution P(X, Y), is called the risk function (risk function) or expected loss (expected loss):
R_exp(f) = E_P[L(Y, f(X))] = ∫ L(y, f(x)) P(x, y) dx dy
The goal of learning is to choose the model with the smallest expected risk, but the joint probability distribution P(X, Y) is unknown. The average loss of model f(X) on the training data set is called the empirical risk (empirical risk) or empirical loss (empirical loss):
R_emp(f) = (1/N) Σ_{i=1..N} L(y_i, f(x_i))
The expected risk R_exp(f) is the model's expected loss with respect to the joint distribution; the empirical risk R_emp(f) is the model's average loss over the training sample set. By the law of large numbers, as the sample size N tends to infinity, the empirical risk tends to the expected risk, so a natural idea is to estimate the expected risk by the empirical risk. However, because the number of training samples is limited and often small in practice, this estimate is often not ideal, and the empirical risk has to be corrected to some extent. This leads to the two basic strategies of supervised learning: empirical risk minimization and structural risk minimization.
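To make these definitions concrete, here is a minimal NumPy sketch (my own illustration, not the book's code) of the common loss functions and the empirical risk; all function names are arbitrary:

```python
# Sketch: the common loss functions and the empirical risk
# R_emp(f) = (1/N) * sum_i L(y_i, f(x_i)).
import numpy as np

def zero_one_loss(y, y_pred):
    return (y != y_pred).astype(float)      # 0-1 loss

def squared_loss(y, y_pred):
    return (y - y_pred) ** 2                # quadratic loss

def absolute_loss(y, y_pred):
    return np.abs(y - y_pred)               # absolute loss

def log_loss(p_y_given_x):
    return -np.log(p_y_given_x)             # log-likelihood loss, takes P(Y|X)

def empirical_risk(loss, y, y_pred):
    return np.mean(loss(y, y_pred))         # average loss over the sample

y      = np.array([1.0, 0.0, 1.0, 1.0])
y_pred = np.array([1.0, 1.0, 1.0, 0.0])
print(empirical_risk(zero_one_loss, y, y_pred))  # 0.5
print(empirical_risk(squared_loss,  y, y_pred))  # 0.5
```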
Empirical risk minimization (empirical risk minimization, ERM) solves the optimization problem
min_{f∈F} (1/N) Σ_i L(y_i, f(x_i))
When the sample size is large enough, empirical risk minimization can guarantee a good learning effect. Example: maximum likelihood estimation (maximum likelihood estimation). However, when the sample size is very small, the learning effect of empirical risk minimization may be poor, producing the "over-fitting" phenomenon.
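As a toy illustration of the ERM–MLE connection (my own sketch, with an assumed Bernoulli model): minimizing the empirical risk under the log loss recovers the maximum likelihood estimate, which here is simply the sample mean:

```python
# Toy sketch: ERM with the log loss reduces to maximum likelihood estimation.
# Model: Y ~ Bernoulli(p). Empirical risk = -(1/N) * sum_i log P(y_i; p).
import numpy as np

y = np.array([1, 1, 0, 1, 0, 1])                 # observed samples

def empirical_log_risk(p):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

grid = np.linspace(0.01, 0.99, 99)               # crude grid search over p
p_erm = grid[np.argmin([empirical_log_risk(p) for p in grid])]
print(p_erm, y.mean())                           # both ~ 4/6 ≈ 0.667
```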
Structural risk minimization (structural risk minimization, SRM) is intended to prevent overfitting: a regularization term (regularizer) or penalty term (penalty term) representing the complexity of the model is added to the empirical risk. The structural risk is defined as
R_srm(f) = (1/N) Σ_i L(y_i, f(x_i)) + λ J(f)
where J(f) measures the complexity of the model and is a functional defined on the hypothesis space: small structural risk requires both small empirical risk and low model complexity. Example: maximum a posteriori probability estimation in Bayesian estimation (maximum posterior probability, MAP). The structural risk minimization strategy is
min_{f∈F} (1/N) Σ_i L(y_i, f(x_i)) + λ J(f)
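A minimal sketch of structural risk minimization, assuming a linear model and taking J(f) = ||w||² as the complexity term (this concrete choice, i.e., ridge regression, is my illustration, not the book's):

```python
# Structural risk for a linear model f(x) = x . w with J(f) = ||w||^2:
#   R_srm(w) = (1/N) * sum_i (y_i - x_i . w)^2 + lam * ||w||^2
# Setting the gradient to zero gives the penalized normal equations below.
import numpy as np

def srm_linear(X, y, lam):
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
print(srm_linear(X, y, lam=0.0))   # ERM: plain least squares
print(srm_linear(X, y, lam=1.0))   # SRM: coefficients shrunk toward 0
```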
Algorithm: the concrete computational method for learning the model. The statistical learning problem reduces to an optimization problem, and the algorithm of statistical learning becomes the algorithm for solving that optimization problem.
model evaluation and model selection
Training error and test error: assuming the learned model is Y = f̂(X), the training error is the model's average loss over the training data set:
R_emp(f̂) = (1/N) Σ_{i=1..N} L(y_i, f̂(x_i))
and the test error is the model's average loss over the test data set:
e_test = (1/N') Σ_{i=1..N'} L(y_i, f̂(x_i))
For example, when the loss function is the 0-1 loss, the test error becomes the error rate (error rate) on the test data set:
e_test = (1/N') Σ_i I(y_i ≠ f̂(x_i))
Correspondingly, the accuracy (accuracy) on the test data set is
r_test = (1/N') Σ_i I(y_i = f̂(x_i))
so that r_test + e_test = 1.
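A two-line sketch (mine) of the error rate and accuracy under the 0-1 loss; note that they sum to 1:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0])
error_rate = np.mean(y_true != y_pred)   # e_test
accuracy   = np.mean(y_true == y_pred)   # r_test; e_test + r_test == 1
print(error_rate, accuracy)              # 0.4 0.6
```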
Overfitting and model selection
Overfitting (over-fitting): if one blindly pursues improving predictive ability on the training data, the complexity of the chosen model tends to be much higher than that of the true model. This phenomenon is called overfitting. Overfitting means the model selected during learning contains so many parameters that it predicts the known data well but predicts unknown data poorly. Example: the polynomial fitting problem.
In polynomial function fitting, as the degree of the polynomial (the model complexity) increases, the training error keeps decreasing and tends to 0, but the test error does not: it first decreases and then increases with the degree. To prevent overfitting, model selection should choose a model of appropriate complexity, so as to minimize the test error. Methods of model selection: regularization and cross-validation.
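The behavior just described can be reproduced with a small sketch; the noisy sine data and the degree grid below are my own choices, not the book's example:

```python
# Polynomial fitting: training error keeps falling with the degree,
# test error falls and then rises again (overfitting).
import numpy as np

rng = np.random.default_rng(1)
def sample(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)

x_tr, y_tr = sample(10)
x_te, y_te = sample(100)
for deg in [1, 3, 9]:
    w = np.polyfit(x_tr, y_tr, deg)                  # least-squares fit
    tr = np.mean((np.polyval(w, x_tr) - y_tr) ** 2)  # training error
    te = np.mean((np.polyval(w, x_te) - y_te) ** 2)  # test error
    print(deg, round(tr, 4), round(te, 4))
# Typically deg=9 drives the training error near 0 while the test error blows up.
```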
Regularization is the implementation of the structural risk minimization strategy: a regularization term or penalty term is added to the empirical risk, giving the objective
min_{f∈F} (1/N) Σ_i L(y_i, f(x_i)) + λ J(f)
Cross-validation: use the data repeatedly by splitting the given data into parts and combining the parts into training sets and test sets, and select the model on the basis of repeated training and testing.
- Simple cross-validation
First, randomly divide the data into two parts, one as the training set and the other as the test set; train models on the training set under various conditions (for example, different numbers of parameters) to obtain different models, evaluate each model's test error on the test set, and select the model with the smallest test error.
- S-fold cross-validation (S-fold cross validation)
The procedure: first randomly partition the data into S disjoint subsets of the same size; then use the data of S − 1 subsets to train the model and the remaining subset to test it; repeat this process over the S possible choices; finally, select the model with the smallest average test error over the S evaluations. (A minimal sketch follows this list.)
- Leave-one-out cross-validation (leave-one-out cross validation)
The special case of S-fold cross-validation in which S = N, where N is the size of the given data set.
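A minimal S-fold cross-validation sketch in plain NumPy (function names and the polynomial-degree example are my own; leave-one-out is the special case s = len(X)):

```python
# S-fold cross-validation: split the data into S disjoint folds, train on
# S-1 of them, test on the remaining one, and average the S test errors.
import numpy as np

def s_fold_cv_error(X, y, fit, predict, s):
    idx = np.random.permutation(len(X))
    folds = np.array_split(idx, s)
    errs = []
    for k in range(s):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(s) if j != k])
        model = fit(X[train], y[train])
        errs.append(np.mean((predict(model, X[test]) - y[test]) ** 2))
    return np.mean(errs)   # average test error over the S evaluations

# Example: choose a polynomial degree by 5-fold CV (the degree grid is mine).
X = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * X) + 0.1 * np.random.randn(40)
for deg in [1, 3, 9]:
    err = s_fold_cv_error(X, y, lambda a, b: np.polyfit(a, b, deg),
                          np.polyval, s=5)
    print(deg, round(err, 4))
```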
Generalization ability: the generalization ability (generalization ability) of a learning method refers to the predictive ability of the model learned by that method on unknown data; it is an essential property of a learning method. If the learned model is f̂, the error of using this model to predict unknown data is the generalization error (generalization error):
R_exp(f̂) = E_P[L(Y, f̂(X))]
In fact, the generalization error is exactly the expected risk of the learned model.
Upper bound of the generalization error (generalization error bound). Specifically, two learning methods are compared by the size of the upper bounds of their generalization errors. The generalization error bound usually has the following properties: it is a function of the sample size, and as the sample size increases the bound tends to 0; it is a function of the hypothesis space capacity (capacity), and the larger the capacity, the harder the model is to learn and the larger the bound. A simple example is the generalization error bound for the binary classification problem: when the hypothesis space is a finite set F = {f_1, f_2, ..., f_d}, then for any f ∈ F, with probability at least 1 − δ,
R(f) ≤ R̂(f) + ε(d, N, δ), where ε(d, N, δ) = sqrt((1/(2N)) (log d + log(1/δ)))
Here R(f) is the expected risk and R̂(f) the empirical risk.
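The bound is easy to evaluate numerically; the following sketch (mine) just plugs numbers into ε(d, N, δ) to show that it shrinks as the sample size grows:

```python
# Generalization error bound for binary classification over a finite
# hypothesis space F = {f_1, ..., f_d}: with probability >= 1 - delta,
#   R(f) <= R_hat(f) + sqrt((log d + log(1/delta)) / (2N))
import math

def epsilon(d, n, delta):
    return math.sqrt((math.log(d) + math.log(1 / delta)) / (2 * n))

for n in [100, 1000, 10000]:       # the bound shrinks as n grows
    print(n, round(epsilon(d=10, n=n, delta=0.05), 4))
```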
Generative model and discriminative model: supervised learning methods can be divided into generative methods (generative approach) and discriminative methods (discriminative approach). The models they learn are called, respectively, generative models (generative model) and discriminative models (discriminative model).
Generative method: learn the joint probability distribution P(X, Y) from the data, then obtain the conditional probability distribution P(Y|X) as the prediction model, i.e., the generative model. Such methods are called generative because the model represents the generative relationship by which a given input X produces the output Y. Typical generative models: the naive Bayes method and the hidden Markov model.
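As a toy sketch of the generative route (my own miniature example, not the book's code): estimate the joint distribution P(X, Y) by counting, then derive P(Y|X) = P(X, Y) / P(X) for prediction:

```python
# Generative method in miniature: learn P(X, Y) by counting co-occurrences,
# then predict with P(Y|X) = P(X, Y) / P(X).
from collections import Counter

data = [("sunny", "play"), ("sunny", "play"), ("rainy", "stay"),
        ("rainy", "play"), ("sunny", "stay"), ("rainy", "stay")]
joint = Counter(data)                       # counts approximate P(X, Y)
n = len(data)

def p_y_given_x(x, y):
    p_xy = joint[(x, y)] / n
    p_x = sum(c for (xi, _), c in joint.items() if xi == x) / n
    return p_xy / p_x

print(p_y_given_x("sunny", "play"))         # 2/3
print(p_y_given_x("rainy", "play"))         # 1/3
```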
Discriminative method: learn the decision function f(X) or the conditional probability distribution P(Y|X) directly from the data as the prediction model, i.e., the discriminative model. Discriminative methods are concerned with what output Y should be predicted for a given input X. Typical discriminative models: the k-nearest-neighbor method, the perceptron, decision trees, the logistic regression model, maximum entropy models, support vector machines, boosting methods, and conditional random fields.
Features of generative methods: a generative method can recover the joint probability distribution P(X, Y), while a discriminative method cannot; the learning of generative methods converges faster, i.e., as the sample size grows, the learned model converges to the true model more quickly; when latent variables are present, generative methods can still be used, while discriminative methods cannot.
Features of discriminative methods: they learn the conditional probability P(Y|X) or the decision function f(X) directly and thus face the prediction task head-on, often achieving higher accuracy; and because P(Y|X) or f(X) is learned directly, the data can be abstracted to various degrees and features can be defined and used, which simplifies the learning problem.

Classification problems (classification): a classification problem involves two processes, learning and classification. Classification accuracy (accuracy) is defined as: for a given test data set, the ratio of the number of samples correctly classified by the classifier to the total number of samples, i.e., the accuracy on the test data set when the loss function is the 0-1 loss. For binary classification: TP, the number of positive instances predicted as positive; FN, positive instances predicted as negative; FP, negative instances predicted as positive; TN, negative instances predicted as negative. Precision is defined as P = TP / (TP + FP), recall as R = TP / (TP + FN), and the F1 value is the harmonic mean of precision and recall: 2/F1 = 1/P + 1/R. Applications: text classification, spam filtering, bank loan credit scoring, etc. (A small sketch of these metrics follows this section.)

Tagging problems (tagging): tagging is a generalization of the classification problem and is also a simple form of the more complex structured prediction (structure prediction) problem. The input of a tagging problem is an observation sequence, and the output is a sequence of tags or states. The goal of tagging is to learn a model that can give a tag sequence as a prediction for an observation sequence. The number of possible tags is finite, but the number of possible tag sequences grows exponentially with the sequence length. The learning system constructs a model from the training data set, expressed as a conditional probability distribution; the metrics for evaluating a tagging model are the same as those for evaluating a classification model. Statistical learning methods commonly used for tagging include hidden Markov models and conditional random fields. Applications: part-of-speech tagging, information extraction, etc.
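The classification metrics above take only a few lines; this sketch (mine) computes precision, recall, and F1 from the four confusion counts:

```python
# Precision, recall, and F1 from the four confusion-matrix counts.
def metrics(tp, fn, fp, tn):
    precision = tp / (tp + fp)            # P = TP / (TP + FP)
    recall    = tp / (tp + fn)            # R = TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, f1

print(metrics(tp=8, fn=2, fp=4, tn=6))    # roughly (0.667, 0.8, 0.727)
```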
Regression problems (regression): regression is used to predict the relationship between the input variables (independent variables) and the output variables (dependent variables), in particular how the value of the output variable changes when the value of the input variable changes. A regression model represents the mapping function from input variables to output variables. Learning a regression problem is equivalent to function fitting: choose a function curve that both fits the known data well and predicts unknown data well. Regression problems are divided into univariate regression and multivariate regression according to the number of input variables, and into linear regression and nonlinear regression according to the type of relationship between the input and output variables, i.e., the type of the model. The most commonly used loss function in regression learning is the square loss function, in which case the regression problem can be solved by the well-known method of least squares (least squares). Applications: stock price prediction, etc.
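Since the text names least squares, here is its closed-form solution for a linear model on synthetic data (a sketch under my own assumed setup):

```python
# Least squares: minimize the squared loss sum_i (y_i - x_i . w)^2;
# the closed-form solution is w = (X^T X)^{-1} X^T y.
import numpy as np

rng = np.random.default_rng(2)
X = np.c_[np.ones(30), rng.uniform(0, 1, 30)]    # column of ones = intercept
y = X @ np.array([0.5, 2.0]) + 0.1 * rng.normal(size=30)

w, *_ = np.linalg.lstsq(X, y, rcond=None)        # numerically stable solver
print(w)                                         # close to [0.5, 2.0]
```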
Classification problems: the output variable is a finite discrete variable; qualitative, e.g., predicting whether the weather will be sunny or rainy.
Regression problems: the input and output variables are both continuous variables; quantitative, e.g., predicting what the temperature or the price will be.
Tagging problems: both the input and output variables are variable sequences, as in the part-of-speech tagging example above.
Statistical Learning Methods, Hang Li --- Chapter 1: Introduction to Statistical Learning Methods